Welcome to Day 2 of the 30 Days of Data Science Series! Today, we’re diving deep into Logistic Regression, a fundamental algorithm for binary classification problems. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of logistic regression in Python.
1. What is Logistic Regression?
Logistic Regression is a statistical method used for binary classification, where the outcome is a categorical variable with two possible classes (e.g., 0 or 1, Yes or No, True or False). Unlike linear regression, which predicts continuous values, logistic regression predicts the probability of an event occurring.
Key Concepts:
Sigmoid Function: Logistic regression uses the sigmoid function to map predicted values to probabilities between 0 and 1.
σ(z) = 1 / (1 + e^(-z))
Here, z is the linear combination of input features and their weights:
z = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₙxₙ
Decision Boundary: By default, if the predicted probability is ≥ 0.5, the outcome is classified as class 1; otherwise, it’s class 0 (a quick sketch of both ideas follows below).
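As a quick illustration, here’s a minimal sketch of the sigmoid function and the default 0.5 threshold; the z values are made-up linear scores, not part of the example that follows:

import numpy as np

def sigmoid(z):
    # Maps any real number to a probability between 0 and 1
    return 1 / (1 + np.exp(-z))

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])   # made-up linear scores (β0 + β1*x1 + ...)
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)          # default decision boundary at 0.5

print("Probabilities:", probs)
print("Predicted classes:", labels)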
2. When to Use Logistic Regression?
Binary classification problems (e.g., spam detection, disease prediction).
The relationship between the features and the log-odds of the target is approximately linear.
Interpretability is important (logistic regression provides coefficients that indicate how each feature influences the prediction; see the short sketch below).
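For instance, here’s a minimal sketch of how to read the fitted intercept and coefficient from scikit-learn’s LogisticRegression; the tiny dataset is made up purely for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])   # a single made-up feature
y = np.array([0, 0, 0, 1, 1, 1])               # made-up binary target

model = LogisticRegression()
model.fit(X, y)

print("Intercept (β0):", model.intercept_)
print("Coefficient (β1):", model.coef_)
# A positive coefficient means the predicted probability of class 1
# increases as the feature value grows.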
3. Implementation in Python
Let’s implement logistic regression using Python and evaluate its performance.
Step 1: Import Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
Step 2: Prepare the Data
We’ll use a simple dataset where:
Hours_Studied: Number of hours a student studied.
Passed: Whether the student passed (1) or failed (0).
data = {
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Passed': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
Step 3: Split Data into Features and Target
X = df[['Hours_Studied']]  # Feature
y = df['Passed']           # Target
Step 4: Train-Test Split
Split the data into training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Step 5: Train the Logistic Regression Model
model = LogisticRegression()
model.fit(X_train, y_train)
Step 6: Make Predictions
y_pred = model.predict(X_test)                   # Predicted class labels
y_pred_prob = model.predict_proba(X_test)[:, 1]  # Predicted probabilities for class 1
Step 7: Evaluate the Model
Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
Output:
Confusion Matrix:
 [[1 0]
 [0 1]]
Classification Report
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)
Output:
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2
ROC-AUC Score
roc_auc = roc_auc_score(y_test, y_pred_prob)
print("ROC-AUC Score:", roc_auc)
Output:
ROC-AUC Score: 1.0
Step 8: Plot the ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
4. Key Evaluation Metrics
Confusion Matrix:
True Positives (TP): Correctly predicted positive class.
True Negatives (TN): Correctly predicted negative class.
False Positives (FP): Incorrectly predicted positive class.
False Negatives (FN): Incorrectly predicted negative class.
Classification Report:
Precision: Ratio of correctly predicted positive observations to total predicted positives.
Recall: Ratio of correctly predicted positive observations to all actual positives.
F1-Score: Harmonic mean of precision and recall (computed by hand in the sketch after this list).
Support: Number of actual occurrences of each class.
ROC-AUC:
Measures the model’s ability to distinguish between classes.
AUC closer to 1 indicates better performance.
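To make these definitions concrete, here’s a minimal sketch that pulls TN, FP, FN, and TP out of the confusion matrix and computes precision, recall, and F1 by hand; the labels and predictions are made up for illustration:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]  # made-up ground truth
y_hat  = [0, 1, 1, 1, 0, 0, 1, 0]  # made-up predictions

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)                                 # TP / (TP + FP)
recall    = tp / (tp + fn)                                 # TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
# Note: roc_auc_score (Step 7 above) is computed from predicted probabilities, not class labels.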
5. Key Takeaways
Logistic regression is a simple yet powerful algorithm for binary classification.
It uses the sigmoid function to map predictions to probabilities.
Evaluation metrics like confusion matrix, classification report, and ROC-AUC are essential for assessing model performance.
6. Practice Exercise
Modify the dataset to include more features (e.g., attendance, previous scores) and retrain the model (a starting sketch follows this list).
Experiment with different test_size values in the train-test split and observe how it affects the model’s performance.
Try using a real-world dataset (e.g., the Titanic dataset) and apply logistic regression.
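As a starting point for the first exercise, here’s a hedged sketch; the Attendance and Previous_Score columns (and all their values) are invented purely for illustration:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Hypothetical multi-feature dataset -- the extra columns are made up for practice.
data = {
    'Hours_Studied':  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Attendance':     [60, 65, 55, 70, 80, 85, 75, 90, 95, 98],
    'Previous_Score': [40, 45, 50, 48, 60, 65, 70, 72, 80, 85],
    'Passed':         [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)

X = df[['Hours_Studied', 'Attendance', 'Previous_Score']]
y = df['Passed']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# max_iter is raised because the unscaled features may need more iterations to converge
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))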
That’s it for Day 2! Tomorrow, we’ll explore Multiclass Classification and how logistic regression can be extended to handle more than two classes. Keep practicing, and feel free to ask questions in the comments! 🚀