Classification

Classification is a type of supervised learning in which the goal is to predict the categorical class label of a new instance based on past observations. The task is called binary classification when there are only two classes to predict, and multiclass classification when there are more than two.

Examples of Classification

  • Email spam filter: Classifying emails as ‘Spam’ or ‘Not Spam’.
  • Medical diagnosis: Determining if a patient has a disease or not based on their medical records.
  • Credit scoring: Assessing if an applicant is a ‘high’ or ‘low’ credit risk.

Classification is used in various domains such as finance, healthcare, marketing, and more, making it a fundamental technique in the field of data science and machine learning.
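
To make the binary/multiclass distinction concrete, here is a minimal sketch (an illustration, not part of the worked examples that follow) using scikit-learn's LogisticRegression, which accepts both kinds of label arrays through the same API; the iris dataset is used purely for demonstration.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Multiclass: all three iris species (labels 0, 1, and 2)
multiclass_model = LogisticRegression(max_iter=1000).fit(X, y)
print('Multiclass labels:', multiclass_model.classes_)  # [0 1 2]

# Binary: keep only two of the species (labels 0 and 1)
mask = y != 2
binary_model = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
print('Binary labels:', binary_model.classes_)  # [0 1]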

C.1 Logistic Regression

Logistic Regression is a statistical method for predicting binary outcomes from one or more independent variables. Despite its name, it is used for classification rather than for predicting continuous values: it models the probability that an observation belongs to a particular class, which makes it well suited to binary classification problems.

How Logistic Regression Works

  • Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function.
  • This logistic function is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.
  • The function is defined as: \[\frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}\] where $\beta_0$ is the intercept and $\beta_1$ is the coefficient of the independent variable $x$.
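
As a quick numerical illustration of this S-shaped curve, the sketch below (a standalone example assuming NumPy is available; the intercept $\beta_0 = 0$ and coefficient $\beta_1 = 1$ are arbitrary choices) evaluates the logistic function at a few inputs and shows that the outputs stay strictly between 0 and 1.

import numpy as np

def logistic(x, beta0=0.0, beta1=1.0):
    # 1 / (1 + exp(-(beta0 + beta1 * x)))
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

# Large negative inputs approach 0, large positive inputs approach 1,
# and the curve crosses 0.5 where beta0 + beta1 * x = 0.
for x in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(f'logistic({x}) = {logistic(x):.4f}')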

Use Cases for Logistic Regression

  • Predicting the probability of a customer purchasing a product.
  • Estimating the odds of a student being admitted to a college, based on their grades and test scores.
  • Determining whether a transaction is fraudulent or not.

# Python example demonstrating logistic regression with a dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Restrict the data to two classes so the problem is binary
X = X[y != 2]
y = y[y != 2]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions
predictions = model.predict(X_test)

# Performance
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

Technique and Methodology

The process of applying logistic regression typically involves several key steps:

  1. Data Collection: Gather the data that will be used for training the model.
  2. Data Preprocessing: Prepare the data for modeling by handling missing values, encoding categorical variables, normalizing data, etc.
  3. Feature Selection: Choose the most relevant features that will contribute to the model’s predictive power.
  4. Model Training: Use the training data to fit the logistic regression model. This involves finding the coefficients that minimize a loss function.
  5. Model Evaluation: Assess the model’s performance using a test set and metrics like accuracy, precision, recall, and the F1 score.
  6. Parameter Tuning: Adjust the model parameters to improve performance, if necessary.
  7. Model Deployment: Once the model is trained and evaluated, it can be deployed for making predictions on new data.

In the following Python example, we’ll go through some of these steps using the logistic regression model we previously trained.

# Example of data preprocessing steps
from sklearn.preprocessing import StandardScaler

# Standardizing the features (mean=0, std=1)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Re-training the logistic regression model with standardized data
model_scaled = LogisticRegression()
model_scaled.fit(X_train_scaled, y_train)

# Predictions with standardized data
predictions_scaled = model_scaled.predict(X_test_scaled)

# Performance with standardized data
print(confusion_matrix(y_test, predictions_scaled))
print(classification_report(y_test, predictions_scaled))
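
The example above covers steps 2, 4, and 5 of the list; step 6 (parameter tuning) can be sketched with scikit-learn's GridSearchCV, as below. The grid of values for the inverse regularization strength C is an arbitrary choice for illustration.

from sklearn.model_selection import GridSearchCV

# Search a small grid of inverse regularization strengths with
# 5-fold cross-validation on the scaled training data
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='f1')
grid.fit(X_train_scaled, y_train)

print('Best C:', grid.best_params_['C'])
print('Best cross-validated F1 score:', grid.best_score_)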

Measuring the Model Performance

To evaluate the performance of a logistic regression model, we use several metrics:

  • Accuracy: The proportion of correct predictions among the total number of cases examined.
  • Precision: The proportion of true positives among all instances predicted as positive.
  • Recall (Sensitivity): The proportion of true positives among all instances that are actually positive.
  • F1 Score: The harmonic mean of precision and recall, giving both metrics equal weight.

These metrics can be derived from the confusion matrix, which is a table showing correct predictions and types of incorrect predictions.

In the Python example below, we will calculate these metrics for our logistic regression model.

# Calculating metrics from the confusion matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Accuracy
accuracy = accuracy_score(y_test, predictions)
print('Accuracy:', accuracy)

# Precision
precision = precision_score(y_test, predictions)
print('Precision:', precision)

# Recall
recall = recall_score(y_test, predictions)
print('Recall:', recall)

# F1 Score
f1 = f1_score(y_test, predictions)
print('F1 Score:', f1)
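
As noted above, the same numbers can also be derived directly from the confusion matrix. A short sketch, relying on the fact that scikit-learn's binary confusion_matrix, flattened with ravel(), yields the counts in the order TN, FP, FN, TP:

# Unpack the binary confusion matrix into its four counts
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

print('Accuracy: ', (tp + tn) / (tp + tn + fp + fn))
print('Precision:', tp / (tp + fp))
print('Recall:   ', tp / (tp + fn))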

The ROC Curve

The Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. It is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.

The area under the ROC curve (AUC) is a measure of the model’s ability to distinguish between the classes. An AUC of 0.5 suggests no discrimination (i.e., random chance), while an AUC of 1.0 indicates perfect discrimination.

In the following Python example, we will plot the ROC curve for our logistic regression model.

# Plotting the ROC curve
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Compute ROC curve and ROC area
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
roc_auc = auc(fpr, tpr)

# Plotting
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()

Detailed Explanation of Evaluation Metrics

In the context of classification problems, evaluation metrics are crucial for assessing the performance of a model. Here’s a detailed explanation of the four primary metrics:

Accuracy

Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total number of observations, i.e., a measure of how many classifications are correct. The formula for accuracy is:

\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions made}} \]

Precision

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. High precision corresponds to a low false positive rate; it measures the quality of a positive prediction made by the model. The formula for precision is:

\[ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \]

Recall (Sensitivity)

Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. It measures the ability of a model to find all the relevant cases within a dataset. The formula for recall is:

\[ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]

F1 Score

The F1 Score, also known as the F-Score or F-Measure, is the harmonic mean of precision and recall, taking both false positives and false negatives into account, and so gives a single balanced measure of the test's accuracy. The formula for the F1 score is:

\[ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

In the next Python example, we will calculate these metrics to better understand their interpretation.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Let's assume the following confusion matrix for a binary classifier
# Confusion matrix
#               Predicted
#               No     Yes
# Actual No      TN      FP
#        Yes     FN      TP

# True Positive (TP)
TP = 30

# True Negative (TN)
TN = 45

# False Positive (FP)
FP = 5

# False Negative (FN)
FN = 20

# Total number of predictions
total_predictions = TP + TN + FP + FN

# Calculating metrics
accuracy = (TP + TN) / total_predictions
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * (precision * recall) / (precision + recall)

print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {f1:.2f}')

Confusion Matrix Explanation

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm. Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class, or vice versa. Here’s the structure of a confusion matrix:

                      Predicted Negative (0)   Predicted Positive (1)
Actual Negative (0)   TN                       FP
Actual Positive (1)   FN                       TP

  • True Positive (TP): The cases in which the model correctly predicted the positive class.
  • True Negative (TN): The cases in which the model correctly predicted the negative class.
  • False Positive (FP): The cases in which the model incorrectly predicted the positive class (a “Type I error”).
  • False Negative (FN): The cases in which the model incorrectly predicted the negative class (a “Type II error”).

The confusion matrix itself is not a performance measure as such, but almost all of the performance metrics are based on it.

When to Use Precision vs Recall

  • Precision is used when the cost of a false positive is high. For example, in email spam detection, a false positive means that a regular email is incorrectly classified as spam. The consequence is that an important email might be missed if it’s sent to the spam folder.

  • Recall is used when the cost of a false negative is high. For example, in fraud detection or disease screening, a false negative means that a fraudulent transaction or a disease is not identified. The consequence could be very serious, leading to financial loss or harm to health.

In the following Python example, we will create a confusion matrix for a hypothetical classifier and discuss the implications of precision and recall in a practical scenario.

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical predictions and true labels
y_true = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0, 0, 1]

# Generating the confusion matrix
cm = confusion_matrix(y_true, y_pred)

# Plotting the confusion matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
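
To tie this back to the trade-off above, the short follow-up sketch below computes precision and recall for the same hypothetical labels. With three true positives, one false positive, and one false negative, both metrics come out to 0.75; which one you would try to raise depends on whether false positives (as in spam filtering) or false negatives (as in fraud or disease screening) are costlier.

from sklearn.metrics import precision_score, recall_score

# Three true positives, one false positive, one false negative
print(f'Precision: {precision_score(y_true, y_pred):.2f}')  # 0.75
print(f'Recall: {recall_score(y_true, y_pred):.2f}')        # 0.75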

Logistic Regression: Precision, Recall, and Confusion Matrix Practice Exercises

In this section, we will provide you with exercises to practice calculating precision, recall, and the confusion matrix in the context of logistic regression. These metrics are crucial for evaluating the performance of your classification model beyond simple accuracy.

Exercise 1: Calculating Precision and Recall

Given a logistic regression model that predicts whether an email is spam or not, you have the following classification results on the test set:

  • True Positives (TP): 90
  • False Positives (FP): 10
  • True Negatives (TN): 50
  • False Negatives (FN): 30

Calculate the precision and recall of the model.

Precision

Precision is the ratio of correctly predicted positive observations to the total predicted positives. It is a measure of a classifier’s exactness. High precision relates to a low false positive rate. Calculate the precision using the formula:

\[ Precision = \frac{TP}{TP + FP} \]

Recall

Recall (Sensitivity) is the ratio of correctly predicted positive observations to all observations in the actual class. It is a measure of a classifier’s completeness. High recall relates to a low false negative rate. Calculate the recall using the formula:

\[ Recall = \frac{TP}{TP + FN} \]

Write a Python function to calculate precision and recall, and then calculate these metrics using the given data.

Exercise 2: Confusion Matrix

Using the same data from Exercise 1, create a confusion matrix. As described earlier, this table lets you see at a glance how the classifier's predictions line up with the true labels.

Create a Python function that takes in the values of TP, FP, TN, and FN and outputs a confusion matrix in a readable format.

Exercise 3: Real-world Scenario

Consider a logistic regression model that has been trained to detect fraud in credit card transactions. The model has produced the following results on a test dataset:

  • True Positives (TP): 120
  • False Positives (FP): 30
  • True Negatives (TN): 900
  • False Negatives (FN): 60

Calculate the precision, recall, and F1-score for the model. The F1-score is the harmonic mean of precision and recall and is a balance between the two. It is calculated using the formula:

\[ F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} \]

Write a Python function to calculate the F1-score and apply it to the given data.

Sample solutions to these exercises are shown below.

# Exercise 1: calculating precision and recall
def calculate_precision_recall(TP, FP, FN):
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    return precision, recall

# Given data
TP = 90
FP = 10
FN = 30

# Calculate precision and recall
precision, recall = calculate_precision_recall(TP, FP, FN)

print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')

# Exercise 2: building a readable confusion matrix
def create_confusion_matrix(TP, FP, TN, FN):
    confusion_matrix = {
        'Predicted Positive': {'Actual Positive': TP, 'Actual Negative': FP},
        'Predicted Negative': {'Actual Positive': FN, 'Actual Negative': TN}
    }
    return confusion_matrix

# Given data
TP = 90
FP = 10
TN = 50
FN = 30

# Create confusion matrix
confusion_matrix = create_confusion_matrix(TP, FP, TN, FN)

for predicted, actual in confusion_matrix.items():
    print(f'{predicted}: {actual}')
def calculate_f1_score(TP, FP, TN, FN):
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1_score = 2 * (precision * recall) / (precision + recall)
    return f1_score

# Given data for a fraud detection model
TP = 120
FP = 30
TN = 900
FN = 60

# Calculate F1-score
f1_score = calculate_f1_score(TP, FP, TN, FN)

print(f'F1-Score: {f1_score:.2f}')

Exercise 4: Logistic Regression Implementation

In this exercise, you will implement a logistic regression model on a dataset other than the commonly used Iris dataset. You will use the Pima Indians Diabetes Database, which is a standard dataset used in machine learning for binary classification problems. The dataset contains various diagnostic measurements and a binary outcome indicating whether the patient has diabetes.

Your tasks are as follows:

  1. Load the dataset.
  2. Perform any necessary preprocessing, such as handling missing values, feature scaling, etc.
  3. Split the dataset into training and testing sets.
  4. Implement a logistic regression model.
  5. Train the model on the training set.
  6. Evaluate the model on the test set using accuracy, precision, recall, and the confusion matrix.
  7. Interpret the results.

You will need to write Python code to accomplish these tasks. You can use libraries such as pandas for data manipulation, scikit-learn for logistic regression and evaluation metrics, and matplotlib or seaborn for data visualization if needed.

!pip install -q pandas scikit-learn matplotlib seaborn

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, names=column_names)

# Data preprocessing
# Here you would handle missing values (in this dataset, zeros in columns
# such as Glucose, BloodPressure, and BMI stand in for missing readings),
# apply feature scaling, etc.

# Split the dataset
X = data.drop('Outcome', axis=1)
y = data['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Implement the logistic regression model (max_iter is raised because the
# default of 100 iterations may not converge on these unscaled features)
model = LogisticRegression(max_iter=1000)

# Train the model
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Results
print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print('Confusion Matrix:')
sns.heatmap(conf_matrix, annot=True, fmt='g')
plt.show()