Machine Learning in just 30 Days
Data Science 30 Days Course easy to learn

    Welcome to Day 4 of the 30 Days of Data Science Series! Today, we’re diving into Random Forests, a powerful ensemble learning method that builds on decision trees to improve performance and reduce overfitting. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of random forests in Python.


    1. What is a Random Forest?

    Random Forest is an ensemble learning method that combines multiple decision trees to improve classification or regression performance. It works by:

    1. Building multiple decision trees, each on a bootstrap sample of the data, i.e. a random sample drawn with replacement (this is bagging, short for bootstrap aggregation).

    2. Selecting a random subset of features at each split in the tree.

    3. Aggregating the predictions from all trees (majority vote for classification, average for regression).
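
The three steps above can be sketched by hand with plain decision trees. This is a minimal illustration, not how scikit-learn implements its forest internally: the synthetic dataset, the number of trees, and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption, for illustration only).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a 0/1 majority vote can never tie
trees = []

for _ in range(n_trees):
    # Step 1: bootstrap sample, i.e. draw as many rows as the dataset has, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: max_features="sqrt" makes the tree consider only a random
    # subset of the features at each split.
    tree = DecisionTreeClassifier(
        max_features="sqrt",
        random_state=int(rng.integers(0, 1_000_000)),
    )
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 3: majority vote. Labels are 0/1, so the mean vote per sample is
# the fraction of trees that predicted class 1.
all_votes = np.array([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
forest_pred = (all_votes.mean(axis=0) > 0.5).astype(int)

print("Training accuracy of the hand-built ensemble:", (forest_pred == y).mean())
```

Note that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so its predictions can differ slightly from a strict majority vote like this one.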

    Key Advantages:

    • Reduced Overfitting: By averaging multiple trees, random forests reduce the risk of overfitting compared to individual decision trees.

    • Robustness: Less sensitive to noise and variability in the data.

    • Feature Importance: Provides insights into the importance of each feature.
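
To see the overfitting and robustness claims in practice, here is a small sketch that compares a single decision tree with a random forest using 5-fold cross-validation. The noisy synthetic dataset is an assumption chosen for the example, and the scores you get will depend on the data you use.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical). flip_y=0.1 assigns random labels to
# about 10% of the samples, which adds label noise.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Mean accuracy over 5 held-out folds for each model.
tree_scores = cross_val_score(tree, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)

print(f"Single tree  : {tree_scores.mean():.3f}")
print(f"Random forest: {forest_scores.mean():.3f}")
```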


    2. When to Use Random Forests?

    • High-dimensional datasets with many features.

    • Datasets with non-linear relationships between features and the target.

    • When you want a rough view of which features matter (feature importance is available), keeping in mind that a forest is harder to inspect than a single tree.


    3. Implementation in Python

    Let’s implement a random forest for a classification problem using Python.

    Step 1: Import Libraries

    python
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
    import matplotlib.pyplot as plt
    import seaborn as sns

    Step 2: Prepare the Data

    We’ll use a small toy dataset with features like Age, Cholesterol, and Max_Heart_Rate to predict whether a patient has heart disease.

    python
    data = {
        'Age': [29, 45, 50, 39, 48, 50, 55, 60, 62, 43],
        'Cholesterol': [220, 250, 230, 180, 240, 290, 310, 275, 300, 280],
        'Max_Heart_Rate': [180, 165, 170, 190, 155, 160, 150, 140, 130, 148],
        'Heart_Disease': [0, 1, 1, 0, 1, 1, 1, 1, 1, 0]
    }
    df = pd.DataFrame(data)

    Step 3: Split Data into Features and Target

    python
    X = df[['Age', 'Cholesterol', 'Max_Heart_Rate']]  # Features
    y = df['Heart_Disease']                           # Target

    Step 4: Train-Test Split

    python
    # Hold out 20% of the rows (2 of the 10) for testing; random_state makes the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    Step 5: Train the Random Forest Model

    python
    # Build 100 trees; random_state fixes the randomness so results are reproducible
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)

    Step 6: Make Predictions

    python
    y_pred = model.predict(X_test)

    Step 7: Evaluate the Model

    Accuracy

    python
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)

    Output (illustrative; with only two test samples, the exact result depends on how the split falls):

    Accuracy: 1.0

    Confusion Matrix

    python
    conf_matrix = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix:\n", conf_matrix)

    Output (illustrative):

    Confusion Matrix:
     [[1 0]
     [0 1]]

    Classification Report

    python
    class_report = classification_report(y_test, y_pred)
    print("Classification Report:\n", class_report)

    Output (illustrative):

    Classification Report:
                   precision    recall  f1-score   support
               0       1.00      1.00      1.00         1
               1       1.00      1.00      1.00         1
        accuracy                           1.00         2
       macro avg       1.00      1.00      1.00         2
    weighted avg       1.00      1.00      1.00         2
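
With only two test samples, a single train-test split says very little about how well the model generalizes. One common alternative is cross-validation, sketched below on the same toy data. The choice of 3 stratified folds is an assumption (the smallest class has only 3 samples, so 3 is the most folds stratification allows here), and scores on ten rows remain very noisy.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Same toy data as in Step 2.
df = pd.DataFrame({
    'Age': [29, 45, 50, 39, 48, 50, 55, 60, 62, 43],
    'Cholesterol': [220, 250, 230, 180, 240, 290, 310, 275, 300, 280],
    'Max_Heart_Rate': [180, 165, 170, 190, 155, 160, 150, 140, 130, 148],
    'Heart_Disease': [0, 1, 1, 0, 1, 1, 1, 1, 1, 0],
})
X = df[['Age', 'Cholesterol', 'Max_Heart_Rate']]
y = df['Heart_Disease']

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=cv
)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```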

    Step 8: Feature Importance

    Random forests provide a measure of feature importance. In scikit-learn, feature_importances_ is impurity-based: it reflects how much each feature reduces impurity across all trees, and the values are normalized to sum to 1.

    python
    feature_importances = pd.DataFrame(
        model.feature_importances_,
        index=X.columns,
        columns=['Importance'],
    ).sort_values('Importance', ascending=False)
    print("Feature Importances:\n", feature_importances)

    Output (illustrative; your exact values will likely differ):

    Feature Importances:
                   Importance
    Max_Heart_Rate     0.60
    Age                0.30
    Cholesterol        0.10
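
The feature_importances_ attribute is computed from the training data while the trees are built. A different technique, permutation importance, instead shuffles one feature at a time and measures how much the score drops. The sketch below uses scikit-learn's permutation_importance on the same toy data. Fitting and scoring on all ten rows is an assumption made for brevity, and the name pi_model is hypothetical; in practice you would score on held-out data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Same toy data as in Step 2.
df = pd.DataFrame({
    'Age': [29, 45, 50, 39, 48, 50, 55, 60, 62, 43],
    'Cholesterol': [220, 250, 230, 180, 240, 290, 310, 275, 300, 280],
    'Max_Heart_Rate': [180, 165, 170, 190, 155, 160, 150, 140, 130, 148],
    'Heart_Disease': [0, 1, 1, 0, 1, 1, 1, 1, 1, 0],
})
X = df[['Age', 'Cholesterol', 'Max_Heart_Rate']]
y = df['Heart_Disease']

# Separate model fitted on all rows (for brevity only).
pi_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Shuffle each feature 10 times and record the mean drop in accuracy.
result = permutation_importance(pi_model, X, y, n_repeats=10, random_state=0)
perm = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)

print(perm)
```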

    Step 9: Visualize Feature Importances

    python
    sns.barplot(x=feature_importances.index, y=feature_importances['Importance'])
    plt.title('Feature Importances')
    plt.xlabel('Feature')
    plt.ylabel('Importance')
    plt.show()

    4. Key Evaluation Metrics

    1. Accuracy: Percentage of correct predictions.

    2. Confusion Matrix:

      • True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN).

    3. Classification Report:

      • Precision: Ratio of correctly predicted positive observations to total predicted positives.

      • Recall: Ratio of correctly predicted positive observations to all actual positives.

      • F1-Score: Harmonic mean of precision and recall.

      • Support: Number of actual occurrences of each class.
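
To make these definitions concrete, the sketch below computes each metric by hand from the four confusion-matrix counts. The labels y_true and y_hat are made up for the example.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(tn, fp, fn, tp)  # 5 1 2 2
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.700 precision=0.667 recall=0.500 f1=0.571
```

These hand-computed values should match what precision_score, recall_score, and f1_score from sklearn.metrics return for the same labels.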


    5. Key Takeaways

    • Random forests combine multiple decision trees to improve performance and reduce overfitting.

    • They provide feature importance scores, which give a partial view of what drives the predictions.

    • They are robust to noise and work well with high-dimensional data.


    6. Practice Exercise

    1. Experiment with different values of n_estimators (number of trees) and observe how it affects the model’s performance.

    2. Apply random forests to a real-world dataset (e.g., Titanic dataset) and evaluate the results.

    3. Compare the performance of a single decision tree vs. a random forest on the same dataset.
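
As a possible starting point for the first exercise, here is a minimal sketch that sweeps n_estimators and reports cross-validated accuracy. The synthetic dataset and the list of values to try are assumptions; swap in your own data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data (hypothetical); replace with a real dataset for the exercise.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

results = {}
for n in [1, 10, 50, 200]:
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    # Mean accuracy over 5 cross-validation folds.
    results[n] = cross_val_score(forest, X, y, cv=5).mean()
    print(f"n_estimators={n:>3}: mean CV accuracy = {results[n]:.3f}")
```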




    That’s it for Day 4! Tomorrow, we’ll explore Gradient Boosting, another powerful ensemble method. Keep practicing, and feel free to ask questions in the comments! 🚀
