Machine Learning Models

In the realm of data science and artificial intelligence, machine learning models play a pivotal role. They are algorithms that learn patterns from data and then make predictions or decisions without being explicitly programmed to do so. In the business context, machine learning models can be used for a variety of tasks such as customer segmentation, sales forecasting, and even fraud detection.

What is an ML Model?

A machine learning model is a mathematical representation of a real-world process. Think of it as a mathematical equation that has been trained on historical data to make predictions about future data. For businesses, these models can be used to gain insights, make informed decisions, and automate tasks.

Fitting a Model

Fitting a model, also known as training a model, involves providing a machine learning algorithm with data and allowing it to learn the underlying patterns. This is done by feeding the model input data along with the corresponding outputs; the model then adjusts its internal parameters (weights) to reduce the error of its predictions. In a business scenario, this could mean using past sales data to predict future sales.
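
As a minimal sketch of this fit/predict workflow, the hypothetical example below trains a scikit-learn linear regression on a few months of made-up sales figures and then predicts the next month; all numbers are illustrative only.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly sales history: month number -> sales (in $)
months = np.array([[1], [2], [3], [4], [5], [6]])
sales = np.array([1000, 1150, 1320, 1400, 1580, 1700])

# Fitting (training): the model learns a slope and an intercept
# that minimize the error between its predictions and the data
model = LinearRegression()
model.fit(months, sales)

# Predicting: estimate sales for month 7
model.predict(np.array([[7]]))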

In the following sections, we will delve deeper into specific machine learning models and their applications in the business world.

KNN (K-Nearest Neighbors)

K-Nearest Neighbors (KNN) is a simple yet powerful supervised machine learning algorithm used for both classification and regression. It works by finding the ‘k’ training examples that are closest to a given input example and returning the most common label among them (for classification) or their average value (for regression).
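
To make the mechanics concrete, here is a small NumPy sketch of the idea behind KNN classification: measure the distance from a query point to every training example, take the ‘k’ nearest, and return the majority label. (In practice we use scikit-learn’s KNeighborsClassifier, as in the examples below; the data here is purely illustrative.)

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Euclidean distance from the query point to every training example
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training examples
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny illustrative dataset: two features, two classes
X_demo = np.array([[1, 1], [2, 1], [8, 9], [9, 8]])
y_demo = np.array([0, 0, 1, 1])
knn_predict(X_demo, y_demo, np.array([7, 8]))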

Business Application of KNN

Imagine a retail business that wants to segment its customers based on their purchase behavior. Using KNN, the business can classify a new customer into a particular segment by looking at the purchase behaviors of the ‘k’ most similar existing customers. This can help the business tailor its marketing strategies for different customer segments.

Let’s see a simple example in Python where we use KNN to classify customers into two segments, ‘High Value’ and ‘Low Value’, based on their annual spending and frequency of purchase.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

# Sample data: Annual Spending (in $) and Frequency of Purchase (number of times in a year)
X = np.array([[500, 2], [1500, 5], [3000, 8], [4000, 10], [750, 3], [400, 1], [4500, 12], [2000, 6]])
# Labels: 0 for 'Low Value' and 1 for 'High Value'
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Using KNN for classification with k=3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)

# Plotting the data points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis')
plt.xlabel('Annual Spending ($)')
plt.ylabel('Frequency of Purchase')
plt.title('Customer Segmentation using KNN')
plt.show()

predictions
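
As a quick usage example, continuing from the cell above, we can ask the trained classifier to segment a brand-new customer; the spending and frequency values below are made up.

# Hypothetical new customer: $2,800 annual spend, 7 purchases per year
new_customer = np.array([[2800, 7]])
segment = knn.predict(new_customer)[0]
print('High Value' if segment == 1 else 'Low Value')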

Polynomial Regression

Polynomial Regression is a type of regression analysis that models the relationship between a dependent variable and one or more independent variables by fitting a polynomial equation to the observed data. Unlike linear regression, which models the relationship with a straight line, polynomial regression models it with a curve.
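
Concretely, scikit-learn’s PolynomialFeatures simply augments each input with its powers, and an ordinary linear regression is then fit on the expanded features. A quick sketch of what the degree-2 expansion looks like:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2], [3]])
# Degree-2 expansion: each row [x] becomes [1, x, x^2]
PolynomialFeatures(degree=2).fit_transform(x)
# -> [[1., 2., 4.],
#     [1., 3., 9.]]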

Business Application of Polynomial Regression

Consider a business that wants to understand the relationship between advertising spend and sales. Initially, an increase in advertising might lead to a significant increase in sales, but beyond a certain point each additional dollar of advertising yields a diminishing return. In such cases, polynomial regression can capture this non-linear relationship more effectively than a linear model.

Let’s see a Python example where we use Polynomial Regression to model the relationship between advertising spend and sales.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Sample data: Advertising Spend (in $) and Sales (in $)
X_ad = np.array([[50], [100], [150], [200], [250], [300], [350], [400]])
y_sales = np.array([150, 220, 260, 275, 280, 285, 290, 295])

# Transforming our data for Polynomial Regression
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_ad)

# Fitting the Polynomial Regression model
poly_reg = LinearRegression()
poly_reg.fit(X_poly, y_sales)

# Predicting sales based on advertising spend
y_pred = poly_reg.predict(X_poly)

# Plotting the data points and the polynomial regression curve
plt.scatter(X_ad, y_sales, color='blue', label='Actual Sales')
plt.plot(X_ad, y_pred, color='red', label='Polynomial Regression')
plt.xlabel('Advertising Spend ($)')
plt.ylabel('Sales ($)')
plt.title('Advertising Spend vs Sales using Polynomial Regression')
plt.legend()
plt.show()
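
To use the fitted model for a new advertising budget, continuing from the cell above, remember to apply the same polynomial transformation first; the $275 budget below is hypothetical.

# Hypothetical new advertising budget of $275
new_spend = np.array([[275]])
poly_reg.predict(poly.transform(new_spend))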

Advanced KNN Example: Loan Default Prediction

In this scenario, a bank wants to predict the risk associated with a loan based on a customer’s credit score and annual income. The bank classifies the risk into three categories: ‘Low Risk’, ‘Medium Risk’, and ‘High Risk’.

Let’s use KNN to classify customers into these risk categories.

# Sample data: Credit Score and Annual Income (in $)
X_loan = np.array([[650, 40000], [700, 55000], [600, 30000], [720, 75000], [630, 25000], [680, 50000], [590, 20000], [750, 90000]])
# Labels: 0 for 'Low Risk', 1 for 'Medium Risk', and 2 for 'High Risk'
y_risk = np.array([0, 0, 1, 0, 2, 0, 2, 0])

# Splitting data into training and testing sets
X_loan_train, X_loan_test, y_risk_train, y_risk_test = train_test_split(X_loan, y_risk, test_size=0.2, random_state=42)

# Using KNN for classification with k=3
knn_loan = KNeighborsClassifier(n_neighbors=3)
knn_loan.fit(X_loan_train, y_risk_train)
risk_predictions = knn_loan.predict(X_loan_test)

# Plotting the data points
plt.scatter(X_loan[:, 0], X_loan[:, 1], c=y_risk, cmap='viridis')
plt.xlabel('Credit Score')
plt.ylabel('Annual Income ($)')
plt.title('Loan Default Risk Prediction using KNN')
plt.colorbar().set_label('Risk Level')
plt.show()

risk_predictions
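
One caveat worth flagging: KNN is distance-based, and here annual income (tens of thousands) dwarfs credit score (hundreds), so income effectively dominates the neighbor search. A common remedy is to standardize the features before measuring distances; a minimal sketch using scikit-learn’s StandardScaler in a pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale both features to comparable ranges before computing distances
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn_scaled.fit(X_loan_train, y_risk_train)
knn_scaled.predict(X_loan_test)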

Advanced Polynomial Regression Example: Salary Prediction

In this scenario, a company wants to understand the relationship between the years of experience and the salary of its employees. As employees gain more experience, the growth in their salary might not be linear, especially at higher experience levels where salary increments might start to plateau.

Let’s use Polynomial Regression to model this non-linear relationship between years of experience and salary.

# Sample data: Years of Experience and Salary (in $)
X_exp = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y_sal = np.array([40000, 45000, 50000, 60000, 75000, 85000, 95000, 105000, 110000, 115000])

# Transforming our data for Polynomial Regression of degree 2
poly_exp = PolynomialFeatures(degree=2)
X_exp_poly = poly_exp.fit_transform(X_exp)

# Fitting the Polynomial Regression model
poly_reg_exp = LinearRegression()
poly_reg_exp.fit(X_exp_poly, y_sal)

# Predicting salary based on years of experience
y_sal_pred = poly_reg_exp.predict(X_exp_poly)

# Plotting the data points and the polynomial regression curve
plt.scatter(X_exp, y_sal, color='blue', label='Actual Salary')
plt.plot(X_exp, y_sal_pred, color='red', label='Polynomial Regression')
plt.xlabel('Years of Experience')
plt.ylabel('Salary ($)')
plt.title('Years of Experience vs Salary using Polynomial Regression')
plt.legend()
plt.show()
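
As a usage example, continuing from the cell above, we can estimate the salary for an employee with, say, 7.5 years of experience. Note that predicting far beyond the observed 1–10 year range would be extrapolation, where a degree-2 fit can behave unexpectedly.

# Hypothetical employee with 7.5 years of experience
poly_reg_exp.predict(poly_exp.transform(np.array([[7.5]])))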

KNN Exercise: Customer Churn Prediction

In this exercise, you’ll be working with a fictional dataset representing a company’s customer data. Your task is to predict customer churn, i.e., the likelihood of a customer leaving the company’s services, based on their usage metrics and demographic information.

Dataset Description

  • Age: Age of the customer (numeric)
  • MonthlyCharge: Monthly charge for the services (numeric, in $)
  • CustomerServiceCalls: Number of calls made to customer service (numeric)
  • Churn: Whether the customer left the company within the last month (0 for No, 1 for Yes)

Objective

Using the KNN algorithm, classify customers into ‘Churn’ or ‘No Churn’ based on the given features.

Let’s get started!

# Sample dataset
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'MonthlyCharge': [50, 55, 60, 65, 70, 75, 80, 85, 90, 95],
    'CustomerServiceCalls': [1, 2, 1, 3, 2, 3, 1, 2, 3, 1],
    'Churn': [0, 0, 0, 1, 0, 1, 0, 1, 1, 0]
}

# Convert the dictionary to a DataFrame for better visualization
import pandas as pd
df = pd.DataFrame(data)
df

Data Preparation

Before applying the KNN algorithm, we need to prepare our data. This involves:

  1. Splitting the data into features (X) and target label (y). In our case, ‘Age’, ‘MonthlyCharge’, and ‘CustomerServiceCalls’ are our features, and ‘Churn’ is our target label.
  2. Splitting the dataset into training and testing sets. This allows us to train our model on one subset and test its performance on another unseen subset.

Let’s perform these steps.

from sklearn.model_selection import train_test_split

# Splitting the data into features (X) and target label (y)
X = df[['Age', 'MonthlyCharge', 'CustomerServiceCalls']]
y = df['Churn']

# Splitting the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.head(), y_train.head()

Applying the KNN Algorithm

Now that our data is prepared, we can apply the KNN algorithm. Here are the steps we’ll follow:

  1. Initialize the KNN Classifier: We’ll use the KNeighborsClassifier from sklearn to create our KNN model. The key parameter here is n_neighbors, which represents the number of neighbors to consider when making a prediction.
  2. Train the Classifier: We’ll use the fit method to train our KNN classifier on the training data.
  3. Make Predictions: Once trained, we can use the predict method to make predictions on new, unseen data.

Let’s go through each step with code and explanations.

from sklearn.neighbors import KNeighborsClassifier

# Step 1: Initialize the KNN Classifier
# We'll start with 3 neighbors for our classifier
knn = KNeighborsClassifier(n_neighbors=3)

# Step 2: Train the Classifier on the training data
knn.fit(X_train, y_train)

# Displaying the trained classifier
knn

Making Predictions with the KNN Classifier

Now that our KNN classifier is trained, we can use it to make predictions on new data. For this exercise, we’ll predict the churn for the test data and then compare these predictions to the actual churn values to evaluate the performance of our model.

To make predictions, we’ll use the predict method of our trained KNN classifier. This method takes in the features of the data we want to predict and returns the predicted labels.

Let’s make predictions on our test data and display the results.

# Step 3: Make Predictions on the test data
y_pred = knn.predict(X_test)

# Displaying the predicted churn values for the test data
y_pred

from sklearn.metrics import accuracy_score

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)

# Displaying the accuracy
accuracy
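
Keep in mind that this accuracy comes from only two test points, so it is a noisy estimate. A common way to choose n_neighbors is cross-validation; below is a minimal sketch on this toy dataset (the values of k tried are arbitrary).

from sklearn.model_selection import cross_val_score

# Compare a few values of k by mean cross-validated accuracy
for k in [1, 3, 5]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=3)
    print(k, scores.mean())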

Overfitting and Underfitting: Bias versus Variance

In machine learning, achieving the right balance between bias and variance is crucial for creating models that generalize well to new, unseen data. Let’s delve into these concepts:

  • Bias: Refers to the error introduced by approximating a real-world problem (which may be complex) by a too-simple model. High bias can cause the model to miss relevant relations between features and target outputs, leading to underfitting.

  • Variance: Refers to the error introduced by using a model that’s too complex. High variance can cause the model to fit the random noise in the training data, leading to overfitting.

Underfitting:

Occurs when a model is too simple to capture the underlying structure of the data. Such a model has high bias and low variance.

Overfitting:

Occurs when a model is too complex and fits the training data too closely, including its noise and outliers. Such a model has low bias and high variance.

The goal in machine learning is to achieve a balance between bias and variance, ensuring that the model is flexible enough to model the data’s structure but not so flexible that it fits the noise in the data.

Let’s visualize the concepts of overfitting and underfitting.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Generate sample data
np.random.seed(0)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

# Create polynomial regression models of different degrees
degrees = [1, 4, 15]
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = make_pipeline(polynomial_features, linear_regression)
    pipeline.fit(X, y)

    # Visualize the fitted model on a dense grid of inputs
    X_plot = np.linspace(0, 5, 100)
    plt.plot(X_plot, pipeline.predict(X_plot[:, np.newaxis]), label="Model")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 5))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree %d" % degrees[i])

plt.tight_layout()
plt.show()

The Cost Function

In machine learning, the cost function (or loss function) quantifies how well the model’s predictions match the actual values. In other words, it measures the error of the model. The goal during training is to minimize this error.

For linear regression, a common cost function is the Mean Squared Error (MSE), which calculates the average of the squared differences between the predicted and actual values.

The formula for MSE is:

\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

Where:

  • \(y_i\) is the actual value.
  • \(\hat{y}_i\) is the predicted value.
  • \(n\) is the number of observations.

Let’s calculate the MSE in Python for a sample set of actual and predicted values, using scikit-learn’s built-in function.

from sklearn.metrics import mean_squared_error
# Sample actual and predicted values
y_true = np.array([3, 2.5, 4, 5.6])
y_pred = np.array([2.8, 2.7, 3.8, 5.5])

# Calculate MSE
mse_value = mean_squared_error(y_true, y_pred)
mse_value
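
To connect the code back to the formula, the same value can be computed directly with NumPy: it is exactly the mean of the squared differences defined above.

# Manual MSE: the mean of squared differences, matching the formula
np.mean((y_true - y_pred) ** 2)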

The Training Error and The Test Error

In machine learning, it’s essential to evaluate how well a model performs. Two common metrics used for this purpose are the training error and the test error:

  • Training Error: This is the error (typically the MSE) of the model on the same data it was trained on. A low training error might indicate that the model fits the training data well, but it doesn’t necessarily mean the model will perform well on new, unseen data.

  • Test Error: This is the error of the model on a separate set of data that it hasn’t seen during training. It gives a better indication of how the model will perform in real-world scenarios. A model that performs well on the training data but poorly on the test data is likely overfitting.

In a business context, the training error can help in understanding how well the model fits historical data, while the test error can provide insights into how the model might perform on future data.

Let’s calculate the training and test errors for a sample linear regression model using a business-related dataset.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Generate sample business data: Sales vs. Advertising Spend
np.random.seed(0)
X = 2.5 * np.random.rand(100, 1)
y = 5 + 3 * X + np.random.randn(100, 1)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train a linear regression model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Calculate the training error
y_train_pred = regressor.predict(X_train)
training_error = mean_squared_error(y_train, y_train_pred)

# Calculate the test error
y_test_pred = regressor.predict(X_test)
test_error = mean_squared_error(y_test, y_test_pred)

training_error, test_error

The Cost Function (Revisited)

The cost function, as previously mentioned, quantifies the error between the model’s predictions and the actual values. For regression problems, the Mean Squared Error (MSE) is a commonly used cost function. It’s crucial to minimize this error during the training process to ensure the model makes accurate predictions.

In a business context, the cost function can be thought of as a measure of how far off our predictions are from the actual outcomes. For instance, if we’re predicting monthly sales, a high cost would indicate that our predictions are far from the actual sales figures, which could lead to incorrect business decisions.

Now, let’s move on to the next topic.

Bias versus Variance (Revisited)

Bias and variance are two fundamental concepts in understanding the performance of machine learning models. They represent two types of errors that can occur:

  • Bias: This is the error introduced by approximating a real-world problem, which might be complex, by a too-simple model. High bias can cause the model to miss the relevant relations between features and target outputs, leading to underfitting. In a business context, a high-bias model might consistently make the same type of error, such as consistently underestimating sales by a certain amount.

  • Variance: This is the error introduced by using a model that’s too complex. High variance can cause the model to fit the random noise in the training data, leading to overfitting. In a business context, a high-variance model might be very sensitive to small fluctuations in the training data, leading to erratic predictions.

The challenge in machine learning is to find the right trade-off between bias and variance. Ideally, we want a model with low bias and low variance, but in practice, there’s often a trade-off. Reducing bias might increase variance and vice versa.

In a business scenario, understanding bias and variance is crucial. A model with high bias might lead to consistent errors in decision-making, while a model with high variance might lead to unpredictable and erratic decisions.

To see these ideas in numbers, let’s re-run the training and test error computation with a different random seed; the gap between the training and test errors gives a rough sense of how well the model generalizes.

# Generate sample business data: Sales vs. Advertising Spend
np.random.seed(42)
X = 2.5 * np.random.rand(100, 1)
y = 5 + 3 * X + np.random.randn(100, 1)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Calculate the training error
y_train_pred = regressor.predict(X_train)
training_error = mean_squared_error(y_train, y_train_pred)

# Calculate the test error
y_test_pred = regressor.predict(X_test)
test_error = mean_squared_error(y_test, y_test_pred)

training_error, test_error

Exercise: Overfitting and Underfitting in Business Sales Prediction

Scenario: You are a data scientist at a retail company, and you are tasked with predicting monthly sales based on advertising spend. You decide to use polynomial regression. However, you want to ensure that your model neither overfits nor underfits the data.

Objective: Fit polynomial regression models of varying degrees to the sales data and visualize the results to understand the concepts of overfitting and underfitting.

Let’s start by visualizing the provided sales data.

# Generate sample business data: Sales vs. Advertising Spend
np.random.seed(0)
X = 2.5 * np.random.rand(100, 1)
y = 5 + 3 * X + np.random.randn(100, 1)

# Visualize the data
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', s=30)
plt.title('Sales vs. Advertising Spend')
plt.xlabel('Advertising Spend (in thousands)')
plt.ylabel('Sales (in thousands)')
plt.grid(True)
plt.show()

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Display the size of the training and test sets
len(X_train), len(X_test)

# Fit polynomial regression models of varying degrees and visualize the results
degrees = [1, 4, 15]
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = make_pipeline(polynomial_features, linear_regression)
    pipeline.fit(X_train, y_train)

    # Visualize the models
    X_range = np.linspace(0, 2.5, 100)
    plt.plot(X_range, pipeline.predict(X_range[:, np.newaxis]), label="Model")
    plt.scatter(X_train, y_train, edgecolor='b', s=20, label="Training Data")
    plt.scatter(X_test, y_test, edgecolor='r', s=20, label="Test Data")
    plt.xlabel("Advertising Spend (in thousands)")
    plt.ylabel("Sales (in thousands)")
    plt.xlim((0, 2.5))
    plt.ylim((0, 15))
    plt.legend(loc="best")
    plt.title("Degree %d" % degrees[i])

plt.tight_layout()
plt.show()
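
To quantify what the plots show, one option is to compute the training and test MSE for each degree. We would expect the degree-15 model to have the lowest training error but a worse test error (the exact numbers depend on the random split).

from sklearn.metrics import mean_squared_error

# Training and test MSE for each polynomial degree
for degree in degrees:
    model = make_pipeline(PolynomialFeatures(degree=degree, include_bias=False),
                          LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(degree, round(train_mse, 3), round(test_mse, 3))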

Exercise: Understanding the Cost Function in Business

Scenario: You are working for a retail company and are tasked with predicting monthly sales based on advertising spend. You’ve chosen a linear regression model for this task. To evaluate the performance of your model, you decide to compute the Mean Squared Error (MSE) as your cost function.

Objective: Calculate the MSE for your linear regression model using both the training and test data. Compare the results to understand the model’s performance.

Let’s begin by calculating the MSE for the training data.

# Refit the linear regression model on the current training split
# (the earlier regressor was trained on a different dataset)
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Calculate the MSE for the training data
y_train_pred = regressor.predict(X_train)
mse_train = mean_squared_error(y_train, y_train_pred)
mse_train

# Calculate the MSE for the test data
y_test_pred = regressor.predict(X_test)
mse_test = mean_squared_error(y_test, y_test_pred)
mse_test

Exercise: Evaluating Model Performance using Training and Test Errors

Scenario: You are a data scientist at a retail company. The marketing team wants to understand the performance of the sales prediction model before launching a new advertising campaign. They are particularly interested in knowing how well the model performs on historical data (training data) and how it might perform on future data (test data).

Objective: Calculate the training and test errors for the sales prediction model. Analyze the results to provide insights to the marketing team.

Let’s begin by calculating the training and test errors.

# Calculate the training error
training_error = mean_squared_error(y_train, y_train_pred)

# Calculate the test error
test_error = mean_squared_error(y_test, y_test_pred)

training_error, test_error
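
A quick, rough way to summarize the result for the marketing team: if the test error is close to the training error, the model generalizes well; a much larger test error would suggest overfitting. (The 1.5x threshold below is an arbitrary illustration, not a standard rule.)

# Rough diagnostic: flag a large gap between test and training error
if test_error > 1.5 * training_error:
    print("Test error is much higher than training error: possible overfitting.")
else:
    print("Training and test errors are comparable: the model generalizes reasonably well.")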