Data Science 30 Days Course easy to learn

Welcome to Day 2 of the 30 Days of Data Science Series! Today, we’re diving deep into Logistic Regression, a fundamental algorithm for binary classification problems. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of logistic regression in Python.


1. What is Logistic Regression?

Logistic Regression is a statistical method used for binary classification, where the outcome is a categorical variable with two possible classes (e.g., 0 or 1, Yes or No, True or False). Unlike linear regression, which predicts continuous values, logistic regression predicts the probability of an event occurring.

Key Concepts:

  • Sigmoid Function: Logistic regression uses the sigmoid function to map predicted values to probabilities between 0 and 1.

    $\sigma(z) = \frac{1}{1 + e^{-z}}$

    Here, z is the linear combination of input features and their weights:

    $z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$

  • Decision Boundary: By default, if the predicted probability ≥0.5, the outcome is classified as class 1; otherwise, it’s class 0.
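The three concepts above can be sketched in a few lines of NumPy. The weights `beta_0` and `beta_1` below are hypothetical values picked only for illustration; they are not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, chosen for illustration only (not fitted).
beta_0 = -4.5  # intercept
beta_1 = 1.0   # weight of the single feature x_1

x_1 = np.array([1.0, 4.5, 8.0])
z = beta_0 + beta_1 * x_1                    # linear combination of inputs
probabilities = sigmoid(z)                   # predicted P(class 1)
labels = (probabilities >= 0.5).astype(int)  # default 0.5 decision threshold

print(probabilities)
print(labels)
```

Note that z = 0 maps to a probability of exactly 0.5, so the decision boundary sits where the linear combination changes sign.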


2. When to Use Logistic Regression?

  • Binary classification problems (e.g., spam detection, disease prediction).

  • The relationship between the features and the log-odds of the target is approximately linear (the predicted probability itself follows an S-shaped curve).

  • Interpretability is important (each coefficient gives the direction and size of a feature’s effect on the log-odds; magnitudes are only comparable across features that share a scale).
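To make the interpretability point concrete, here is a minimal sketch of how a coefficient can be read as an odds ratio. The values of `beta_0` and `beta_1` are hypothetical, and `odds` and `sigmoid` are small helper functions defined only for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def odds(p):
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1.0 - p)

# Hypothetical intercept and coefficient (not taken from a fitted model).
beta_0 = -2.0
beta_1 = 0.7

# A one-unit increase in the feature adds beta_1 to the log-odds,
# which multiplies the odds by exp(beta_1).
odds_ratio = np.exp(beta_1)

p_at_3 = sigmoid(beta_0 + beta_1 * 3)
p_at_4 = sigmoid(beta_0 + beta_1 * 4)
observed_ratio = odds(p_at_4) / odds(p_at_3)

print(odds_ratio, observed_ratio)  # the two values agree
```

The same ratio holds between any two feature values one unit apart, which is what "linear in the log-odds" means in practice.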


3. Implementation in Python

Let’s implement logistic regression using Python and evaluate its performance.

Step 1: Import Libraries

python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt

Step 2: Prepare the Data

We’ll use a simple dataset where:

  • Hours_Studied: Number of hours a student studied.

  • Passed: Whether the student passed (1) or failed (0).

python
data = {
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Passed': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
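Before modelling, it can help to confirm the shape of the data and how the two classes are balanced. This is an optional check, not part of the lesson's original steps; it rebuilds the same toy dataset so the snippet runs on its own.

```python
import pandas as pd

data = {
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Passed': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)

print(df.shape)                         # (rows, columns)
class_counts = df['Passed'].value_counts()
print(class_counts)                     # number of students per class
```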

Step 3: Split Data into Features and Target

python
X = df[['Hours_Studied']]  # Feature
y = df['Passed']          # Target

Step 4: Train-Test Split

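Split the data into training and testing sets. With 10 rows and test_size=0.2, the test set holds only 2 samples, so the scores in Step 7 are illustrative rather than reliable estimates: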

python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
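On small datasets, a random split can leave one class out of the test set. One possible safeguard is the `stratify` parameter of `train_test_split`, which keeps class proportions similar in both sets. This is an optional variation, not what the lesson's split above uses.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Passed': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
})
X = df[['Hours_Studied']]
y = df['Passed']

# stratify=y is an addition for illustration; the lesson's call omits it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
print(sorted(y_test.tolist()))  # classes present in the test set
```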

Step 5: Train the Logistic Regression Model

python
model = LogisticRegression()
model.fit(X_train, y_train)
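After fitting, the learned weights are available through the `coef_` and `intercept_` attributes. The sketch below repeats the earlier steps so it runs on its own, then derives the number of study hours at which the predicted probability crosses 0.5 (where z = 0). The variable names `beta_0`, `beta_1`, and `boundary_hours` are introduced here for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Passed': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
})
X = df[['Hours_Studied']]
y = df['Passed']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

beta_1 = model.coef_[0][0]    # weight for Hours_Studied
beta_0 = model.intercept_[0]  # intercept

# z = beta_0 + beta_1 * hours equals 0 at the decision boundary.
boundary_hours = -beta_0 / beta_1

print("Coefficient:", beta_1)
print("Intercept:", beta_0)
print("Hours at which P(pass) = 0.5:", boundary_hours)
```

A positive coefficient here means that more study hours raise the predicted probability of passing.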

Step 6: Make Predictions

python
y_pred = model.predict(X_test)  # Predicted class labels
y_pred_prob = model.predict_proba(X_test)[:, 1]  # Predicted probabilities for class 1
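The link between `predict_proba` and the sigmoid function can be checked directly: for a binary model, applying the sigmoid to `decision_function` (the value of z) should reproduce the class 1 probabilities. The sketch also shows how a different cutoff could be applied by hand; the `threshold` value of 0.7 is a hypothetical choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Passed': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
})
X = df[['Hours_Studied']]
y = df['Passed']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]  # P(class 1) from scikit-learn
z = model.decision_function(X_test)        # linear combination z
manual = 1.0 / (1.0 + np.exp(-z))          # sigmoid applied by hand

print(np.allclose(proba, manual))

# Hypothetical stricter cutoff: only predict "passed" above 0.7.
threshold = 0.7
custom_labels = (proba >= threshold).astype(int)
print(custom_labels)
```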

Step 7: Evaluate the Model

Confusion Matrix

python
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

Output:

 
Confusion Matrix:
 [[1 0]
 [0 1]]
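For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]], so the four counts can be unpacked with `ravel()`. The labels below are hypothetical and slightly larger than the lesson's test set, so that every cell is non-zero.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```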

Classification Report

python
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)

Output:

 
Classification Report:
               precision    recall  f1-score   support
           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1
    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2

ROC-AUC Score

python
roc_auc = roc_auc_score(y_test, y_pred_prob)
print("ROC-AUC Score:", roc_auc)

Output:

 
ROC-AUC Score: 1.0
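One way to build intuition for this number: AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counting as half. The sketch below checks that against `roc_auc_score` using hypothetical labels and scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score.
wins = (pos[:, None] > neg[None, :]).sum()
ties = (pos[:, None] == neg[None, :]).sum()
pairwise_auc = (wins + 0.5 * ties) / (len(pos) * len(neg))

print(pairwise_auc)
print(roc_auc_score(y_true, scores))
```

Here 8 of the 9 pairs are ranked correctly, so both calculations give 8/9.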

Step 8: Plot the ROC Curve

python
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
# ROC curve of the fitted model, with the AUC shown in the legend
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % roc_auc)
# Dashed diagonal: the expected curve for random guessing
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
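The "area" in the legend is literally the area under the plotted curve. As a rough check, `sklearn.metrics.auc` integrates the (fpr, tpr) points with the trapezoidal rule and should match `roc_auc_score`. The labels and scores below are hypothetical.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and scores, for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
scores = [0.05, 0.3, 0.6, 0.55, 0.9, 0.65, 0.2, 0.4]

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Each threshold yields one (FPR, TPR) point on the curve.
for t, f, r in zip(thresholds, fpr, tpr):
    print(f"threshold={t:.2f}  FPR={f:.2f}  TPR={r:.2f}")

area_from_curve = auc(fpr, tpr)              # trapezoidal area under the points
area_direct = roc_auc_score(y_true, scores)  # same quantity, computed directly
print(area_from_curve, area_direct)
```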

4. Key Evaluation Metrics

  1. Confusion Matrix:

    • True Positives (TP): Positive samples correctly predicted as positive.

    • True Negatives (TN): Negative samples correctly predicted as negative.

    • False Positives (FP): Negative samples incorrectly predicted as positive.

    • False Negatives (FN): Positive samples incorrectly predicted as negative.

  2. Classification Report:

    • Precision: Ratio of correctly predicted positive observations to total predicted positives.

    • Recall: Ratio of correctly predicted positive observations to all actual positives.

    • F1-Score: Harmonic mean of precision and recall, which stays low unless both are high.

    • Support: Number of actual occurrences of each class.

  3. ROC-AUC:

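    • Measures how well the predicted probabilities rank positive samples above negative ones, across all classification thresholds.

    • An AUC of 0.5 corresponds to random guessing; values closer to 1 indicate better performance.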
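The classification report metrics above can be computed by hand from the confusion matrix counts. The counts below are hypothetical; the sketch also shows why the F1-score (a harmonic mean) differs from a plain average of precision and recall.

```python
# Hypothetical confusion matrix counts, for illustration only.
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

arithmetic_mean = (precision + recall) / 2

print("Precision:", precision)
print("Recall:", recall)
print("F1:", f1)
print("Arithmetic mean (for comparison):", arithmetic_mean)
```

The harmonic mean is pulled toward the smaller of the two values, so a model cannot earn a high F1 by excelling at only one of precision or recall.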


5. Key Takeaways

  • Logistic regression is a simple yet powerful algorithm for binary classification.

  • It uses the sigmoid function to map predictions to probabilities.

  • Evaluation metrics like confusion matrix, classification report, and ROC-AUC are essential for assessing model performance.


6. Practice Exercise

  1. Modify the dataset to include more features (e.g., attendance, previous scores) and retrain the model.

  2. Experiment with different test_size values in the train-test split and observe how they affect the model’s performance.

  3. Try using a real-world dataset (e.g., Titanic dataset) and apply logistic regression.
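As a possible starting point for the first exercise, here is a minimal sketch with three features. The data is synthetic and generated at random: the columns `Attendance` and `Previous_Score`, the weights in the scoring formula, and the pass cutoff of 8.4 are all assumptions made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only; none of it comes from real students.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    'Hours_Studied': rng.uniform(1, 10, n),
    'Attendance': rng.uniform(50, 100, n),      # percent, hypothetical
    'Previous_Score': rng.uniform(40, 100, n),  # out of 100, hypothetical
})

# Made-up rule linking the features to the outcome, plus random noise.
score = (0.6 * df['Hours_Studied'] + 0.04 * df['Attendance']
         + 0.03 * df['Previous_Score'] + rng.normal(0, 1, n))
df['Passed'] = (score > 8.4).astype(int)

X = df[['Hours_Studied', 'Attendance', 'Previous_Score']]
y = df['Passed']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# max_iter is raised because the features are not scaled.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("Test accuracy:", accuracy)
print(dict(zip(X.columns, model.coef_[0])))  # one coefficient per feature
```

Because the three features are on different scales, their raw coefficients are not directly comparable; standardizing the features first would be one way to address that.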


That’s it for Day 2! Tomorrow, we’ll explore Multiclass Classification and how logistic regression can be extended to handle more than two classes. Keep practicing, and feel free to ask questions in the comments! 🚀
