Machine Learning in just 30 Days
Data Science 30 Days Course easy to learn

    Welcome to Day 2 of the 30 Days of Data Science Series! Today, we’re diving deep into Logistic Regression, a fundamental algorithm for binary classification problems. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of logistic regression in Python.


    1. What is Logistic Regression?

    Logistic Regression is a statistical method used for binary classification, where the outcome is a categorical variable with two possible classes (e.g., 0 or 1, Yes or No, True or False). Unlike linear regression, which predicts continuous values, logistic regression predicts the probability of an event occurring.

    Key Concepts:

    • Sigmoid Function: Logistic regression uses the sigmoid function to map predicted values to probabilities between 0 and 1.

      $$\sigma(z) = \frac{1}{1 + e^{-z}}$$

      Here, z is the linear combination of input features and their weights:

      $$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$

    • Decision Boundary: By default, if the predicted probability is ≥ 0.5, the outcome is classified as class 1; otherwise, it is classified as class 0.
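To make the two concepts above concrete, here is a minimal NumPy sketch. The coefficients `beta_0` and `beta_1` and the feature values are made-up numbers chosen for illustration; they are not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, chosen only for illustration
beta_0, beta_1 = -4.5, 1.0

x = np.array([1.0, 4.5, 8.0])      # example feature values
z = beta_0 + beta_1 * x            # linear combination of feature and weight
p = sigmoid(z)                     # predicted probability of class 1
labels = (p >= 0.5).astype(int)    # apply the default 0.5 threshold

print(p)
print(labels)
```

Note that z = 0 gives a probability of exactly 0.5, so the decision boundary is the set of inputs where the linear combination equals zero.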


    2. When to Use Logistic Regression?

    • Binary classification problems (e.g., spam detection, disease prediction).

    • The relationship between the features and the log-odds of the target is approximately linear.

    • Interpretability is important (each coefficient gives the direction and size of a feature’s effect on the log-odds of class 1).
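The interpretability point can be checked numerically: because the model is linear in the log-odds, increasing a feature by one unit multiplies the odds of class 1 by e raised to that feature's coefficient. The sketch below uses hypothetical coefficients (`beta_0`, `beta_1`) rather than fitted ones.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, chosen only for illustration
beta_0, beta_1 = -4.0, 0.8

def odds(x):
    """Odds of class 1, p / (1 - p), at feature value x."""
    p = sigmoid(beta_0 + beta_1 * x)
    return p / (1.0 - p)

# Ratio of odds when the feature goes from 2 to 3 (a one-unit increase)
ratio = odds(3.0) / odds(2.0)

print(ratio)            # odds ratio computed from probabilities
print(np.exp(beta_1))   # the same value computed from the coefficient
```

The two printed numbers agree, which is why exp(coefficient) is commonly reported as an odds ratio.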


    3. Implementation in Python

    Let’s implement logistic regression using Python and evaluate its performance.

    Step 1: Import Libraries

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve
    import matplotlib.pyplot as plt

    Step 2: Prepare the Data

    We’ll use a simple dataset where:

    • Hours_Studied: Number of hours a student studied.

    • Passed: Whether the student passed (1) or failed (0).

    data = {
        'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'Passed': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    }
    df = pd.DataFrame(data)
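Before modeling, a quick look at the size and class balance of the data can be useful. This sketch repeats the dataset from the step above so that it runs on its own.

```python
import pandas as pd

data = {
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Passed': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)

print(df.shape)                       # (rows, columns)
print(df['Passed'].value_counts())    # number of students per class

# Range of study hours within each class
hours_by_class = df.groupby('Passed')['Hours_Studied'].agg(['min', 'max'])
print(hours_by_class)
```

In this toy dataset every student who studied 4 hours or fewer failed and every student who studied 5 hours or more passed, so the classes are perfectly separable. Real datasets are rarely this clean.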

    Step 3: Split Data into Features and Target

    X = df[['Hours_Studied']]  # Feature
    y = df['Passed']          # Target

    Step 4: Train-Test Split

    Split the data into training and testing sets:

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
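With only 10 rows, a purely random split can leave one class under-represented in a split. As an optional variant (not part of the lesson's own code), `train_test_split` accepts a `stratify` argument that keeps the class proportions roughly equal in both sets. The sketch rebuilds the data so it is self-contained.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Passed': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
})
X = df[['Hours_Studied']]
y = df['Passed']

# stratify=y asks for the same class mix in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.value_counts())
print(y_test.value_counts())
```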

    Step 5: Train the Logistic Regression Model

    model = LogisticRegression()
    model.fit(X_train, y_train)
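After fitting, the learned intercept and weight are available as `model.intercept_` and `model.coef_`. The sketch below plugs them into the sigmoid formula by hand and compares the result with `predict_proba`. To stay short it fits on all 10 rows instead of the training split, and the query value of 6 hours is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Passed': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
})
X = df[['Hours_Studied']]
y = df['Passed']

model = LogisticRegression()
model.fit(X, y)  # fitted on all rows here, only to keep the sketch short

beta_0 = model.intercept_[0]   # learned intercept
beta_1 = model.coef_[0, 0]     # learned weight for Hours_Studied
print("intercept:", beta_0, "weight:", beta_1)

# Reproduce the predicted probability for 6 hours of study by hand
hours = 6.0
z = beta_0 + beta_1 * hours
manual_prob = 1.0 / (1.0 + np.exp(-z))

query = pd.DataFrame({'Hours_Studied': [hours]})
sklearn_prob = model.predict_proba(query)[0, 1]
print(manual_prob, sklearn_prob)

# Hours at which the predicted probability equals 0.5
print("decision boundary:", -beta_0 / beta_1)
```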

    Step 6: Make Predictions

    y_pred = model.predict(X_test)  # Predicted class labels
    y_pred_prob = model.predict_proba(X_test)[:, 1]  # Predicted probabilities for class 1
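The class labels from `predict` correspond to thresholding the class-1 probability at 0.5. If a different trade-off is wanted, the probabilities can be thresholded manually. In this sketch the 0.8 cutoff is a hypothetical value, and the model is fitted on all rows to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Passed': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
})
X = df[['Hours_Studied']]
y = df['Passed']

model = LogisticRegression().fit(X, y)

probs = model.predict_proba(X)[:, 1]         # probability of class 1
default_pred = model.predict(X)              # labels at the default threshold
manual_pred = (probs > 0.5).astype(int)      # same labels, computed by hand
strict_pred = (probs >= 0.8).astype(int)     # hypothetical stricter cutoff

print(default_pred)
print(manual_pred)
print(strict_pred)
```

Raising the threshold can only turn predicted 1s into 0s, so it tends to trade recall for precision.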

    Step 7: Evaluate the Model

    Confusion Matrix

    conf_matrix = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix:\n", conf_matrix)

    Output:

    Confusion Matrix:
     [[1 0]
     [0 1]]
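scikit-learn lays out the matrix with actual classes as rows and predicted classes as columns, so for labels [0, 1] it reads [[TN, FP], [FN, TP]]. A short sketch of unpacking the four counts follows; the label lists are hypothetical and are not the lesson's test set.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 1, 0, 1, 0]

# labels=[0, 1] guarantees a 2x2 matrix even if a class is missing
matrix = confusion_matrix(y_true, y_hat, labels=[0, 1])
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print(tn, fp, fn, tp)  # 3 1 1 3
```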

    Classification Report

    class_report = classification_report(y_test, y_pred)
    print("Classification Report:\n", class_report)

    Output:

    Classification Report:
                   precision    recall  f1-score   support
               0       1.00      1.00      1.00         1
               1       1.00      1.00      1.00         1
        accuracy                           1.00         2
       macro avg       1.00      1.00      1.00         2
    weighted avg       1.00      1.00      1.00         2

    ROC-AUC Score

    roc_auc = roc_auc_score(y_test, y_pred_prob)
    print("ROC-AUC Score:", roc_auc)

    Output:

    ROC-AUC Score: 1.0
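One way to read this number: the AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below checks that against `roc_auc_score` on hypothetical labels and scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted scores, for illustration only
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(pairwise_auc)
print(roc_auc_score(y_true, scores))
```

With a single positive and a single negative example in the test set, as in the output above, there is only one pair to compare, so the score can only be 0, 0.5, or 1.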

    Step 8: Plot the ROC Curve

    fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
    plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend(loc="lower right")
    plt.show()

    4. Key Evaluation Metrics

    1. Confusion Matrix:

      • True Positives (TP): Actual positives correctly predicted as positive.

      • True Negatives (TN): Actual negatives correctly predicted as negative.

      • False Positives (FP): Actual negatives incorrectly predicted as positive.

      • False Negatives (FN): Actual positives incorrectly predicted as negative.

    2. Classification Report:

      • Precision: Ratio of correctly predicted positive observations to total predicted positives.

      • Recall: Ratio of correctly predicted positive observations to all actual positives.

      • F1-Score: Harmonic mean of precision and recall.

      • Support: Number of actual occurrences of each class.

    3. ROC-AUC:

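      • Measures the model’s ability to rank positive examples above negative ones across all classification thresholds.

      • An AUC of 0.5 corresponds to random guessing; values closer to 1 indicate better performance.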
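The report metrics defined above can be reproduced by hand from the confusion-matrix counts. This is a small sketch on hypothetical labels, compared against scikit-learn's `precision_score`, `recall_score`, and `f1_score`.

```python
import math
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 1, 0, 1, 1]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)   # 3 / 5
recall = tp / (tp + fn)      # 3 / 4
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
print(precision_score(y_true, y_hat),
      recall_score(y_true, y_hat),
      f1_score(y_true, y_hat))
```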


    5. Key Takeaways

    • Logistic regression is a simple yet powerful algorithm for binary classification.

    • It uses the sigmoid function to map predictions to probabilities.

    • Evaluation metrics like confusion matrix, classification report, and ROC-AUC are essential for assessing model performance.


    6. Practice Exercise

    1. Modify the dataset to include more features (e.g., attendance, previous scores) and retrain the model.

    2. Experiment with different test_size values in the train-test split and observe how they affect the model’s performance.

    3. Try using a real-world dataset (e.g., Titanic dataset) and apply logistic regression.
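As a possible starting point for exercise 2, the loop below refits the model for several `test_size` values and records the test accuracy. The particular values tried, the use of `stratify=y`, and `accuracy_score` as the metric are choices made for this sketch rather than part of the lesson.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Passed': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
})
X = df[['Hours_Studied']]
y = df['Passed']

results = {}
for test_size in (0.2, 0.3, 0.4, 0.5):
    # stratify=y keeps both classes present in the training set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0, stratify=y
    )
    model = LogisticRegression().fit(X_train, y_train)
    results[test_size] = accuracy_score(y_test, model.predict(X_test))
    print(f"test_size={test_size}: {len(X_test)} test rows, "
          f"accuracy={results[test_size]:.2f}")
```

With so few rows, each accuracy is computed on a handful of examples, so differences between settings should be read loosely.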


    That’s it for Day 2! Tomorrow, we’ll explore Multiclass Classification and how logistic regression can be extended to handle more than two classes. Keep practicing, and feel free to ask questions in the comments! 🚀
