# Python example demonstrating logistic regression with a dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
# Load dataset
data = load_iris()
X = data.data
y = data.target
# Since logistic regression is for binary classification, we will only use two classes
X = X[y != 2]
y = y[y != 2]
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Predictions
predictions = model.predict(X_test)
# Performance
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
Classification
Classification is a type of supervised learning where the goal is to predict the categorical class labels of new instances, based on past observations. It involves assigning a class label to input data, and the classification is binary if there are only two classes to predict, or multiclass if there are more than two classes.
Examples of Classification
- Email spam filter: Classifying emails as ‘Spam’ or ‘Not Spam’.
- Medical diagnosis: Determining if a patient has a disease or not based on their medical records.
- Credit scoring: Assessing if an applicant is a ‘high’ or ‘low’ credit risk.
Classification is used in various domains such as finance, healthcare, marketing, and more, making it a fundamental technique in the field of data science and machine learning.
C.1 Logistic Regression
Logistic Regression is a statistical method for predicting binary outcomes based on independent variables. It is a type of regression analysis that is suited for binary classification problems.
How Logistic Regression Works
- Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function.
- This logistic function is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.
- The function is defined as: \[\frac{1}{1 + e^{-(\beta_0 + \beta_1x)}}\] where $\beta_0$ is the intercept and $\beta_1$ is the coefficient of the independent variable $x$; a short numeric sketch follows this list.
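To make the mapping concrete, here is a minimal sketch (an illustration added here, not part of the original example) that evaluates the logistic function for a few inputs, assuming illustrative coefficients $\beta_0 = 0$ and $\beta_1 = 1$.
import numpy as np
def logistic(x, beta0=0.0, beta1=1.0):
    # Logistic (sigmoid) function: maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
for value in [-5, 0, 5]:
    # prints values near 0.007, 0.5, and 0.993
    print(value, logistic(value))
Note how large negative inputs are pushed toward 0 and large positive inputs toward 1, without ever reaching either limit.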
Use Cases for Logistic Regression
- Predicting the probability of a customer purchasing a product.
- Estimating the odds of a student being admitted to a college, based on their grades and test scores.
- Determining whether a transaction is fraudulent or not.
Technique and Methodology
The process of applying logistic regression typically involves several key steps:
- Data Collection: Gather the data that will be used for training the model.
- Data Preprocessing: Prepare the data for modeling by handling missing values, encoding categorical variables, normalizing data, etc.
- Feature Selection: Choose the most relevant features that will contribute to the model’s predictive power.
- Model Training: Use the training data to fit the logistic regression model. This involves finding the coefficients that minimize a loss function.
- Model Evaluation: Assess the model’s performance using a test set and metrics like accuracy, precision, recall, and the F1 score.
- Parameter Tuning: Adjust the model parameters to improve performance, if necessary.
- Model Deployment: Once the model is trained and evaluated, it can be deployed for making predictions on new data.
In the following Python example, we’ll go through some of these steps using the logistic regression model we previously trained.
# Example of data preprocessing steps
from sklearn.preprocessing import StandardScaler
# Standardizing the features (mean=0, std=1)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Re-training the logistic regression model with standardized data
model_scaled = LogisticRegression()
model_scaled.fit(X_train_scaled, y_train)
# Predictions with standardized data
predictions_scaled = model_scaled.predict(X_test_scaled)
# Performance with standardized data
print(confusion_matrix(y_test, predictions_scaled))
print(classification_report(y_test, predictions_scaled))
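The methodology above also lists parameter tuning. One way to approach it, shown here as an illustrative sketch (not part of the original walkthrough), is to use scikit-learn's GridSearchCV to search over the inverse regularization strength C on the standardized training data:
from sklearn.model_selection import GridSearchCV
# Illustrative grid over C (inverse of regularization strength)
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='f1')
grid_search.fit(X_train_scaled, y_train)
print('Best C:', grid_search.best_params_['C'])
print('Best cross-validated F1 score:', grid_search.best_score_)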
Measuring the Model Performance
To evaluate the performance of a logistic regression model, we use several metrics:
- Accuracy: The proportion of true results among the total number of cases examined.
- Precision: The proportion of true positive results divided by the number of all positive results.
- Recall (Sensitivity): The proportion of true positive results divided by the number of positives that should have been identified.
- F1 Score: The harmonic mean of precision and recall, giving both metrics equal weight.
These metrics can be derived from the confusion matrix, which is a table showing correct predictions and types of incorrect predictions.
In the Python example below, we will calculate these metrics for our logistic regression model.
# Calculating metrics from the confusion matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Accuracy
accuracy = accuracy_score(y_test, predictions)
print('Accuracy:', accuracy)
# Precision
precision = precision_score(y_test, predictions)
print('Precision:', precision)
# Recall
recall = recall_score(y_test, predictions)
print('Recall:', recall)
# F1 Score
f1 = f1_score(y_test, predictions)
print('F1 Score:', f1)
The ROC Curve
The Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. It is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
The area under the ROC curve (AUC) is a measure of the model’s ability to distinguish between the classes. An AUC of 0.5 suggests no discrimination (i.e., random chance), while an AUC of 1.0 indicates perfect discrimination.
In the following Python example, we will plot the ROC curve for our logistic regression model.
# Plotting the ROC curve
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
# Compute ROC curve and ROC area
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
roc_auc = auc(fpr, tpr)
# Plotting
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()
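As a quick cross-check (a small addition assuming the same model and test split as above), the AUC can also be computed directly from the predicted probabilities with scikit-learn's roc_auc_score:
from sklearn.metrics import roc_auc_score
# Should agree with the value obtained from auc(fpr, tpr) above
print('AUC:', roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))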
Detailed Explanation of Evaluation Metrics
In the context of classification problems, evaluation metrics are crucial for assessing the performance of a model. Here’s a detailed explanation of the four primary metrics:
Accuracy
Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total number of observations, i.e., a measure of how many classifications are correct. The formula for accuracy is:
\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions made}} \]
Precision
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. High precision relates to the low false positive rate. It is a measure of the quality of a positive prediction made by the model. The formula for precision is:
\[ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \]
Recall (Sensitivity)
Recall is the ratio of correctly predicted positive observations to all observations that are actually positive. It is a measure of the ability of a model to find all the relevant cases within a dataset. The formula for recall is:
\[ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]
F1 Score
The F1 Score, also known as the F-Score or F-Measure, is the harmonic mean of precision and recall, i.e., 2 × (precision × recall) / (precision + recall). It takes both false positives and false negatives into account and is a measure of the test's accuracy. The formula for the F1 score is:
\[ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
In the next Python example, we will calculate these metrics to better understand their interpretation.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Let's assume the following confusion matrix for a binary classifier
#
#                  Predicted
#                  No    Yes
# Actual   No      TN    FP
#          Yes     FN    TP

# True Positive (TP)
TP = 30

# True Negative (TN)
TN = 45

# False Positive (FP)
FP = 5

# False Negative (FN)
FN = 20

# Total number of predictions
total_predictions = TP + TN + FP + FN

# Calculating metrics
accuracy = (TP + TN) / total_predictions
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * (precision * recall) / (precision + recall)
print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {f1:.2f}')
Confusion Matrix Explanation
A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm. Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class, or vice versa. Here’s the structure of a confusion matrix:
| Actual \ Predicted | Negative (0) | Positive (1) |
|---|---|---|
| Negative (0) | TN | FP |
| Positive (1) | FN | TP |
- True Positive (TP): The cases in which the model correctly predicted the positive class.
- True Negative (TN): The cases in which the model correctly predicted the negative class.
- False Positive (FP): The cases in which the model incorrectly predicted the positive class (a “Type I error”).
- False Negative (FN): The cases in which the model incorrectly predicted the negative class (a “Type II error”).
The confusion matrix itself is not a performance measure as such, but almost all of the performance metrics are based on it.
When to Use Precision vs Recall
Precision is used when the cost of a false positive is high. For example, in email spam detection, a false positive means that a regular email is incorrectly classified as spam. The consequence is that an important email might be missed if it’s sent to the spam folder.
Recall is used when the cost of a false negative is high. For example, in fraud detection or disease screening, a false negative means that a fraudulent transaction or a disease is not identified. The consequence could be very serious, leading to financial loss or harm to health.
In the following Python example, we will create a confusion matrix for a hypothetical classifier and discuss the implications of precision and recall in a practical scenario.
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Hypothetical predictions and true labels
y_true = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0, 0, 1]
# Generating the confusion matrix
cm = confusion_matrix(y_true, y_pred)
# Plotting the confusion matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
Logistic Regression: Precision, Recall, and Confusion Matrix Practice Exercises
In this section, we will provide you with exercises to practice calculating precision, recall, and the confusion matrix in the context of logistic regression. These metrics are crucial for evaluating the performance of your classification model beyond simple accuracy.
Exercise 1: Calculating Precision and Recall
Given a logistic regression model that predicts whether an email is spam or not, you have the following classification results on the test set:
- True Positives (TP): 90
- False Positives (FP): 10
- True Negatives (TN): 50
- False Negatives (FN): 30
Calculate the precision and recall of the model.
Precision
Precision is the ratio of correctly predicted positive observations to the total predicted positives. It is a measure of a classifier’s exactness. High precision relates to a low false positive rate. Calculate the precision using the formula:
\[ Precision = \frac{TP}{TP + FP} \]
Recall
Recall (Sensitivity) is the ratio of correctly predicted positive observations to all observations in the actual class. It is a measure of a classifier’s completeness. High recall relates to a low false negative rate. Calculate the recall using the formula:
\[ Recall = \frac{TP}{TP + FN} \]
Write a Python function to calculate precision and recall, and then calculate these metrics using the given data.
Exercise 2: Confusion Matrix
Using the same data from Exercise 1, create a confusion matrix. A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm.
Create a Python function that takes in the values of TP, FP, TN, and FN and outputs a confusion matrix in a readable format.
Exercise 3: Real-world Scenario
Consider a logistic regression model that has been trained to detect fraud in credit card transactions. The model has produced the following results on a test dataset:
- True Positives (TP): 120
- False Positives (FP): 30
- True Negatives (TN): 900
- False Negatives (FN): 60
Calculate the precision, recall, and F1-score for the model. The F1-score is the harmonic mean of precision and recall and is a balance between the two. It is calculated using the formula:
\[ F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} \]
Write a Python function to calculate the F1-score and apply it to the given data.
def calculate_precision_recall(TP, FP, FN):
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    return precision, recall
# Given data
TP = 90
FP = 10
FN = 30

# Calculate precision and recall
precision, recall = calculate_precision_recall(TP, FP, FN)
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
def create_confusion_matrix(TP, FP, TN, FN):
    confusion_matrix = {
        'Predicted Positive': {'Actual Positive': TP, 'Actual Negative': FP},
        'Predicted Negative': {'Actual Positive': FN, 'Actual Negative': TN}
    }
    return confusion_matrix
# Given data
TP = 90
FP = 10
TN = 50
FN = 30

# Create confusion matrix
confusion_matrix = create_confusion_matrix(TP, FP, TN, FN)
for predicted, actual in confusion_matrix.items():
    print(f'{predicted}: {actual}')
def calculate_f1_score(TP, FP, TN, FN):
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1_score = 2 * (precision * recall) / (precision + recall)
    return f1_score
# Given data for a fraud detection model
TP = 120
FP = 30
TN = 900
FN = 60

# Calculate F1-score
f1_score = calculate_f1_score(TP, FP, TN, FN)
print(f'F1-Score: {f1_score:.2f}')
Exercise 4: Logistic Regression Implementation
In this exercise, you will implement a logistic regression model on a dataset other than the commonly used Iris dataset. You will use the Pima Indians Diabetes Database, which is a standard dataset used in machine learning for binary classification problems. The dataset contains various diagnostic measurements and a binary outcome indicating whether the patient has diabetes.
Your tasks are as follows:
- Load the dataset.
- Perform any necessary preprocessing, such as handling missing values, feature scaling, etc.
- Split the dataset into training and testing sets.
- Implement a logistic regression model.
- Train the model on the training set.
- Evaluate the model on the test set using accuracy, precision, recall, and the confusion matrix.
- Interpret the results.
You will need to write Python code to accomplish these tasks. You can use libraries such as pandas for data manipulation, scikit-learn for logistic regression and evaluation metrics, and matplotlib or seaborn for data visualization if needed.
!pip install -q pandas scikit-learn matplotlib seaborn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, names=column_names)
# Data preprocessing
# Here you would handle missing values, feature scaling, etc.
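# One illustrative preprocessing choice (an assumption added here, not the only valid approach):
# in this dataset, zeros in several physiological columns effectively stand in for missing values,
# so we replace them with the column median. In a stricter workflow, imputation and any feature
# scaling would be fit on the training split only, to avoid information leakage.
import numpy as np
cols_with_hidden_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
data[cols_with_hidden_missing] = data[cols_with_hidden_missing].replace(0, np.nan)
data[cols_with_hidden_missing] = data[cols_with_hidden_missing].fillna(data[cols_with_hidden_missing].median())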
# Split the dataset
X = data.drop('Outcome', axis=1)
y = data['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Implement the logistic regression model
model = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges on unscaled features
# Train the model
model.fit(X_train, y_train)
# Evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
# Results
print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print('Confusion Matrix:')
sns.heatmap(conf_matrix, annot=True, fmt='g')
plt.show()