Welcome to Day 2 of the 30 Days of Data Science Series! Today, we’re diving deep into Logistic Regression, a fundamental algorithm for binary classification problems. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of logistic regression in Python.
1. What is Logistic Regression?
Logistic Regression is a statistical method used for binary classification, where the outcome is a categorical variable with two possible classes (e.g., 0 or 1, Yes or No, True or False). Unlike linear regression, which predicts continuous values, logistic regression predicts the probability of an event occurring.
Key Concepts:
Sigmoid Function: Logistic regression uses the sigmoid function to map predicted values to probabilities between 0 and 1.
σ(z) = 1 / (1 + e^(-z))
Here, z is the linear combination of input features and their weights:
z = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₙxₙ
Decision Boundary: By default, if the predicted probability is ≥ 0.5, the outcome is classified as class 1; otherwise, it’s class 0 (a quick sketch of both ideas follows below).
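As a quick illustration, here’s a minimal sketch of the sigmoid function and the default 0.5 threshold; the z values are made-up linear scores, not part of the example that follows:

import numpy as np

def sigmoid(z):
    # Maps any real number to a probability between 0 and 1
    return 1 / (1 + np.exp(-z))

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])   # made-up linear scores (β0 + β1*x1 + ...)
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)          # default decision boundary at 0.5

print("Probabilities:", probs)
print("Predicted classes:", labels)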
2. When to Use Logistic Regression?
Binary classification problems (e.g., spam detection, disease prediction).
The relationship between the features and the log-odds of the target is approximately linear.
Interpretability is important (logistic regression provides coefficients that indicate how each feature influences the prediction; see the short sketch below).
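For instance, here’s a minimal sketch of how to read the fitted intercept and coefficient from scikit-learn’s LogisticRegression; the tiny dataset is made up purely for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])   # a single made-up feature
y = np.array([0, 0, 0, 1, 1, 1])               # made-up binary target

model = LogisticRegression()
model.fit(X, y)

print("Intercept (β0):", model.intercept_)
print("Coefficient (β1):", model.coef_)
# A positive coefficient means the predicted probability of class 1
# increases as the feature value grows.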
3. Implementation in Python
Let’s implement logistic regression using Python and evaluate its performance.
Step 1: Import Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
Step 2: Prepare the Data
We’ll use a simple dataset where:
Hours_Studied: Number of hours a student studied.
Passed: Whether the student passed (1) or failed (0).
data = {
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Passed': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
Step 3: Split Data into Features and Target
X = df[['Hours_Studied']]  # Feature
y = df['Passed']           # Target
Step 4: Train-Test Split
Split the data into training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Step 5: Train the Logistic Regression Model
model = LogisticRegression()
model.fit(X_train, y_train)
Step 6: Make Predictions
y_pred = model.predict(X_test)                   # Predicted class labels
y_pred_prob = model.predict_proba(X_test)[:, 1]  # Predicted probabilities for class 1
Step 7: Evaluate the Model
Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
Output:
Confusion Matrix:
 [[1 0]
 [0 1]]
Classification Report
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)
Output:
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2
ROC-AUC Score
roc_auc = roc_auc_score(y_test, y_pred_prob)
print("ROC-AUC Score:", roc_auc)
Output:
ROC-AUC Score: 1.0
Step 8: Plot the ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
4. Key Evaluation Metrics
Confusion Matrix:
True Positives (TP): Correctly predicted positive class.
True Negatives (TN): Correctly predicted negative class.
False Positives (FP): Incorrectly predicted positive class.
False Negatives (FN): Incorrectly predicted negative class.
Classification Report:
Precision: Ratio of correctly predicted positive observations to total predicted positives.
Recall: Ratio of correctly predicted positive observations to all actual positives.
F1-Score: Harmonic mean of precision and recall (computed by hand in the sketch after this list).
Support: Number of actual occurrences of each class.
ROC-AUC:
Measures the model’s ability to distinguish between classes.
AUC closer to 1 indicates better performance.
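To make these definitions concrete, here’s a minimal sketch that pulls TN, FP, FN, and TP out of the confusion matrix and computes precision, recall, and F1 by hand; the labels and predictions are made up for illustration:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]  # made-up ground truth
y_hat  = [0, 1, 1, 1, 0, 0, 1, 0]  # made-up predictions

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)                                 # TP / (TP + FP)
recall    = tp / (tp + fn)                                 # TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
# Note: roc_auc_score (Step 7 above) is computed from predicted probabilities, not class labels.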
5. Key Takeaways
Logistic regression is a simple yet powerful algorithm for binary classification.
It uses the sigmoid function to map predictions to probabilities.
Evaluation metrics like confusion matrix, classification report, and ROC-AUC are essential for assessing model performance.
6. Practice Exercise
Modify the dataset to include more features (e.g., attendance, previous scores) and retrain the model (a starting sketch follows this list).
Experiment with different test_size values in the train-test split and observe how it affects the model’s performance.
Try using a real-world dataset (e.g., the Titanic dataset) and apply logistic regression.
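As a starting point for the first exercise, here’s a hedged sketch; the Attendance and Previous_Score columns (and all their values) are invented purely for illustration:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Hypothetical multi-feature dataset -- the extra columns are made up for practice.
data = {
    'Hours_Studied':  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Attendance':     [60, 65, 55, 70, 80, 85, 75, 90, 95, 98],
    'Previous_Score': [40, 45, 50, 48, 60, 65, 70, 72, 80, 85],
    'Passed':         [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)

X = df[['Hours_Studied', 'Attendance', 'Previous_Score']]
y = df['Passed']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# max_iter is raised because the unscaled features may need more iterations to converge
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))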
That’s it for Day 2! Tomorrow, we’ll explore Multiclass Classification and how logistic regression can be extended to handle more than two classes. Keep practicing, and feel free to ask questions in the comments! 🚀