Machine Learning in just 30 Days

    Welcome to Day 3 of the 30 Days of Data Science Series! Today, we’re exploring Decision Trees, a versatile and interpretable algorithm used for both classification and regression tasks. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of decision trees in Python.


    1. What is a Decision Tree?

    A decision tree is a non-parametric supervised learning algorithm that models decisions and their possible consequences in a tree-like structure. It consists of:

    • Nodes: Internal nodes, each of which tests a feature (attribute).

    • Branches: Represent the outcomes of that test, that is, the decision rule or condition being followed.

    • Leaf Nodes: Represent the final output (class label for classification or continuous value for regression).

    Key Concepts:

    1. Splitting Criteria:

      • For classification, decision trees use:

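        • Gini Impurity: Measures the probability of misclassifying a randomly chosen element of a node if it were labeled at random according to the node’s class distribution. Here $p_i$ is the proportion of samples in class $i$ and $n$ is the number of classes.

          $\text{Gini} = 1 - \sum_{i=1}^{n} p_i^2$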

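        • Entropy (Information Gain): Entropy measures the amount of uncertainty or impurity in the data; information gain is the reduction in entropy that a split achieves. Here $p_i$ is the proportion of samples in class $i$ and $n$ is the number of classes.

          $\text{Entropy} = -\sum_{i=1}^{n} p_i \log_2(p_i)$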

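      • For regression, decision trees choose the split that minimizes the variance of the target within each child node, which is the same as minimizing the mean squared error around each node’s mean prediction.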

    2. Tree Depth: Controls the complexity of the tree. Deeper trees can lead to overfitting.

    3. Pruning: A technique to reduce the size of the tree by removing unnecessary branches to prevent overfitting.
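
    The two impurity formulas above can be checked by hand on a small node. The sketch below is a minimal standard-library illustration, not how scikit-learn computes them internally; the helper names (class_proportions, gini, entropy, information_gain) are hypothetical and chosen only for this example.

```python
from collections import Counter
import math


def class_proportions(labels):
    """Return the proportion p_i of each class present in labels."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 - sum(p_i ** 2)."""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: sum(-p_i * log2(p_i))."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    weighted = sum(len(child) / len(parent) * entropy(child) for child in children)
    return entropy(parent) - weighted


mixed = ['Yes', 'Yes', 'No', 'No']   # evenly mixed node: maximum impurity for two classes
pure = ['Yes', 'Yes', 'Yes', 'Yes']  # pure node: zero impurity

print(gini(mixed), entropy(mixed))   # 0.5 1.0
print(gini(pure), entropy(pure))     # 0.0 0.0
# A split that separates the two classes perfectly removes all of the parent's entropy.
print(information_gain(mixed, [['Yes', 'Yes'], ['No', 'No']]))  # 1.0
```

    A tree-building algorithm evaluates candidate splits in this spirit and keeps the one that lowers impurity the most.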


    2. When to Use Decision Trees?

    • Interpretability is important (decision trees are easy to visualize and explain).

    • The dataset has a mix of categorical and numerical features (in scikit-learn, categorical features must first be encoded as numbers, as in Step 3 below).

    • Non-linear relationships exist between features and the target.
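
    To make the point about non-linear relationships concrete, here is a small sketch that compares a straight-line fit with a regression tree on a quadratic target. The synthetic data and the choice of max_depth=4 are assumptions made for illustration; DecisionTreeRegressor and LinearRegression are the standard scikit-learn estimators.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noise-free data: the target is a quadratic function of one feature.
X = np.linspace(-3, 3, 61).reshape(-1, 1)
y = X.ravel() ** 2

linear = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# R^2 on the training data, used here only to compare how well each model can follow the curve.
linear_r2 = linear.score(X, y)
tree_r2 = tree.score(X, y)
print("Linear R^2:", round(linear_r2, 3))
print("Tree R^2:", round(tree_r2, 3))
```

    The straight line cannot follow the symmetric curve, while the tree approximates it with a handful of piecewise-constant segments.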


    3. Implementation in Python

    Let’s implement a decision tree for a classification problem using Python.

    Step 1: Import Libraries

    python
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, plot_tree
    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
    import matplotlib.pyplot as plt

    Step 2: Prepare the Data

    We’ll use a small dataset with the features Age, Income, and Student to predict whether a person buys a computer.

    python
    data = {
        'Age': [25, 45, 35, 50, 23, 37, 32, 28, 40, 27],
        'Income': ['High', 'High', 'High', 'Medium', 'Low', 'Low', 'Low', 'Medium', 'Low', 'Medium'],
        'Student': ['No', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
        'Buys_Computer': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes']
    }
    df = pd.DataFrame(data)

    Step 3: Convert Categorical Features to Numeric

    python
    df['Income'] = df['Income'].map({'Low': 1, 'Medium': 2, 'High': 3})
    df['Student'] = df['Student'].map({'No': 0, 'Yes': 1})
    df['Buys_Computer'] = df['Buys_Computer'].map({'No': 0, 'Yes': 1})
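
    One thing to watch with this kind of dictionary mapping: pandas Series.map returns NaN for any value that is missing from the dictionary, so a typo or an unexpected category silently becomes a missing value. The sketch below shows one way to check for that; the sample values, including the lowercase 'medium', are made up for illustration.

```python
import pandas as pd

# 'medium' (lowercase) is a deliberate typo that the mapping does not cover.
income = pd.Series(['High', 'Medium', 'Low', 'medium'])
mapped = income.map({'Low': 1, 'Medium': 2, 'High': 3})

# Values absent from the dictionary come back as NaN.
unmapped = income[mapped.isna()]
print(unmapped.tolist())  # ['medium']
```

    Running a check like this right after the mapping catches bad categories before they reach the model.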

    Step 4: Split Data into Features and Target

    python
    X = df[['Age', 'Income', 'Student']]  # Features
    y = df['Buys_Computer']               # Target

    Step 5: Train-Test Split

    python
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
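
    With only ten rows, a purely random split can leave the test set with a single class. As an optional variation (not part of the lesson’s own code), the sketch below passes stratify=y so that both splits keep roughly the original class balance. The data is the lesson’s dataset, written here already in encoded form.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# The lesson's dataset after encoding (Income: Low=1, Medium=2, High=3; Yes=1, No=0).
df = pd.DataFrame({
    'Age': [25, 45, 35, 50, 23, 37, 32, 28, 40, 27],
    'Income': [3, 3, 3, 2, 1, 1, 1, 2, 1, 2],
    'Student': [0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
    'Buys_Computer': [0, 0, 1, 1, 1, 0, 1, 0, 1, 1],
})
X = df[['Age', 'Income', 'Student']]
y = df['Buys_Computer']

# stratify=y asks for class proportions to be preserved in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print("Train class counts:", y_train.value_counts().to_dict())
print("Test class counts:", y_test.value_counts().to_dict())
```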

    Step 6: Train the Decision Tree Model

    python
    model = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
    model.fit(X_train, y_train)
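
    Once a tree is fitted, its feature_importances_ attribute gives a quick view of which features the splits relied on. The sketch below repeats the lesson’s setup in compact, already-encoded form so that it runs on its own; the printed values depend on the split and on the scikit-learn version, so none are shown here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# The lesson's dataset after encoding (Income: Low=1, Medium=2, High=3; Yes=1, No=0).
df = pd.DataFrame({
    'Age': [25, 45, 35, 50, 23, 37, 32, 28, 40, 27],
    'Income': [3, 3, 3, 2, 1, 1, 1, 2, 1, 2],
    'Student': [0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
    'Buys_Computer': [0, 0, 1, 1, 1, 0, 1, 0, 1, 1],
})
X = df[['Age', 'Income', 'Student']]
y = df['Buys_Computer']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Importances are normalized impurity reductions; they sum to 1 when the tree has at least one split.
for name, importance in zip(X.columns, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```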

    Step 7: Make Predictions

    python
    y_pred = model.predict(X_test)

    Step 8: Evaluate the Model

    Accuracy

    python
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)

    Output (illustrative; the test set holds only two samples, so the exact numbers depend on the split):

     
    Accuracy: 1.0

    Confusion Matrix

    python
    conf_matrix = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix:\n", conf_matrix)

    Output:

     
    Confusion Matrix:
     [[1 0]
     [0 1]]

    Classification Report

    python
    class_report = classification_report(y_test, y_pred)
    print("Classification Report:\n", class_report)

    Output:

     
    Classification Report:
                   precision    recall  f1-score   support
               0       1.00      1.00      1.00         1
               1       1.00      1.00      1.00         1
        accuracy                           1.00         2
       macro avg       1.00      1.00      1.00         2
    weighted avg       1.00      1.00      1.00         2
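
    A score computed on two test samples is a very noisy estimate. One common alternative is cross-validation, which rotates the held-out portion across the whole dataset. The sketch below uses StratifiedKFold with cross_val_score; the choice of three folds is an assumption made to suit this ten-row dataset, and the resulting scores are not shown because they depend on the folds.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# The lesson's dataset after encoding (Income: Low=1, Medium=2, High=3; Yes=1, No=0).
df = pd.DataFrame({
    'Age': [25, 45, 35, 50, 23, 37, 32, 28, 40, 27],
    'Income': [3, 3, 3, 2, 1, 1, 1, 2, 1, 2],
    'Student': [0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
    'Buys_Computer': [0, 0, 1, 1, 1, 0, 1, 0, 1, 1],
})
X = df[['Age', 'Income', 'Student']]
y = df['Buys_Computer']

# Three stratified folds: each fold keeps a mix of both classes.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
model = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)

scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```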

    Step 9: Visualize the Decision Tree

    python
    plt.figure(figsize=(12, 8))
    plot_tree(model, feature_names=['Age', 'Income', 'Student'], class_names=['No', 'Yes'], filled=True)
    plt.title('Decision Tree')
    plt.show()
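
    When a plot is inconvenient (for example in a terminal or a log file), scikit-learn’s export_text prints the same tree as indented if/else rules. The sketch below fits on all ten rows purely to keep the example short, which differs from the train/test workflow above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The lesson's dataset after encoding (Income: Low=1, Medium=2, High=3; Yes=1, No=0).
df = pd.DataFrame({
    'Age': [25, 45, 35, 50, 23, 37, 32, 28, 40, 27],
    'Income': [3, 3, 3, 2, 1, 1, 1, 2, 1, 2],
    'Student': [0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
    'Buys_Computer': [0, 0, 1, 1, 1, 0, 1, 0, 1, 1],
})
X = df[['Age', 'Income', 'Student']]
y = df['Buys_Computer']

# Fitted on the full dataset only for brevity.
model = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0).fit(X, y)

# Each line of the output is a split condition or a leaf with its predicted class.
rules = export_text(model, feature_names=['Age', 'Income', 'Student'])
print(rules)
```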

    4. Key Evaluation Metrics

    1. Accuracy: Percentage of correct predictions.

    2. Confusion Matrix:

      • True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN).

    3. Classification Report:

      • Precision: Ratio of correctly predicted positive observations to total predicted positives.

      • Recall: Ratio of correctly predicted positive observations to all actual positives.

      • F1-Score: Harmonic mean of precision and recall.

      • Support: Number of actual occurrences of each class.
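
    The definitions above can be verified by computing each metric from the four confusion-matrix counts. The labels in this sketch are made up for illustration (they are not the lesson’s predictions); the manual results are compared with scikit-learn’s own functions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels, chosen so that precision and recall differ.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]

# For binary labels 0/1, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)                            # correct positives / predicted positives
recall = tp / (tp + fn)                               # correct positives / actual positives
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two

print("TN, FP, FN, TP:", tn, fp, fn, tp)              # 3 2 1 4
print("Accuracy:", accuracy)
print("Precision:", precision, "| sklearn:", precision_score(y_true, y_pred))
print("Recall:", recall, "| sklearn:", recall_score(y_true, y_pred))
print("F1:", f1, "| sklearn:", f1_score(y_true, y_pred))
```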


    5. Key Takeaways

    • Decision trees are easy to interpret and visualize.

    • They can handle both categorical and numerical data, although scikit-learn requires categorical features to be encoded as numbers first.

    • Pruning and limiting tree depth are essential to prevent overfitting.
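
    The lesson’s code limits depth with max_depth; pruning can be tried in scikit-learn through cost-complexity pruning, controlled by the ccp_alpha parameter. The sketch below is one possible way to explore it, fitted on all ten rows for brevity: it asks the tree for its candidate alpha values and reports how many nodes survive at each one.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The lesson's dataset after encoding (Income: Low=1, Medium=2, High=3; Yes=1, No=0).
df = pd.DataFrame({
    'Age': [25, 45, 35, 50, 23, 37, 32, 28, 40, 27],
    'Income': [3, 3, 3, 2, 1, 1, 1, 2, 1, 2],
    'Student': [0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
    'Buys_Computer': [0, 0, 1, 1, 1, 0, 1, 0, 1, 1],
})
X = df[['Age', 'Income', 'Student']]
y = df['Buys_Computer']

# Candidate alphas at which the fully grown tree would lose another subtree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

node_counts = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from rounding
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    node_counts.append(pruned.tree_.node_count)
    print(f"ccp_alpha={alpha:.4f}  nodes={pruned.tree_.node_count}")
```

    Larger ccp_alpha values prune more aggressively; in practice the value is usually chosen by cross-validation rather than by inspecting node counts.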


    6. Practice Exercise

    1. Experiment with different criterion values (gini vs. entropy) and observe how they affect the tree.

    2. Modify the max_depth parameter and analyze its impact on model performance.

    3. Apply decision trees to a real-world dataset (e.g., Iris dataset) and evaluate the results.


    That’s it for Day 3! Tomorrow, we’ll dive into Random Forests, an ensemble method that builds on decision trees. Keep practicing, and feel free to ask questions in the comments! 🚀
