Machine Learning Models
In the realm of data science and artificial intelligence, machine learning models play a pivotal role. They are algorithms that learn patterns from data and then make predictions or decisions without being explicitly programmed to do so. In the business context, machine learning models can be used for a variety of tasks such as customer segmentation, sales forecasting, and even fraud detection.
What is an ML Model?
A machine learning model is a mathematical representation of a real-world process. Think of it as a mathematical equation that has been trained on historical data to make predictions about future data. For businesses, these models can be used to gain insights, make informed decisions, and automate tasks.
Fitting a Model
Fitting a model, also known as training a model, involves providing a machine learning algorithm with data and allowing it to learn the patterns. This is done by feeding the model input data and the corresponding output. The model then adjusts its weights based on the error of its predictions. In a business scenario, this could mean using past sales data to predict future sales.
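To make this concrete, here is a minimal sketch of what “adjusting weights based on the error” looks like for a single weight, using made-up numbers; real libraries automate this loop, and the data below is purely illustrative.

import numpy as np

# Hypothetical data, roughly y = 2x (e.g., past ad spend vs. past sales)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

w = 0.0                                # initial weight
for _ in range(100):                   # repeat: predict, measure error, adjust
    error = w * x - y                  # prediction error on the training data
    gradient = 2 * np.mean(error * x)  # direction that increases the error
    w -= 0.05 * gradient               # step the weight the opposite way
print(w)                               # converges near 2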
In the following sections, we will delve deeper into specific machine learning models and their applications in the business world.
KNN (K-Nearest Neighbors)
K-Nearest Neighbors (KNN) is a simple, yet powerful supervised machine learning algorithm used for classification and regression. It works by finding the ‘k’ training examples that are closest to a given input example and returning the most common output value among them.
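As a sketch of the idea (with made-up points, not a production implementation), the whole algorithm fits in a few lines: compute distances, take the k closest training points, and let them vote.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    distances = np.linalg.norm(X_train - x_new, axis=1)    # distance to each training point
    nearest = np.argsort(distances)[:k]                    # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority label among them

# Illustrative data: two clusters with labels 0 and 1
X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([5, 6])))     # -> 1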
Business Application of KNN
Imagine a retail business that wants to segment its customers based on their purchase behavior. Using KNN, the business can classify a new customer into a particular segment by looking at the purchase behaviors of the ‘k’ most similar existing customers. This can help the business tailor its marketing strategies for different customer segments.
Let’s see a simple example in Python where we use KNN to classify customers into two segments: ‘High Value’ and ‘Low Value’ based on their annual spending and frequency of purchase.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

# Sample data: Annual Spending (in $) and Frequency of Purchase (number of times in a year)
X = np.array([[500, 2], [1500, 5], [3000, 8], [4000, 10], [750, 3], [400, 1], [4500, 12], [2000, 6]])
# Labels: 0 for 'Low Value' and 1 for 'High Value'
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Using KNN for classification with k=3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)

# Plotting the data points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis')
plt.xlabel('Annual Spending ($)')
plt.ylabel('Frequency of Purchase')
plt.title('Customer Segmentation using KNN')
plt.show()

predictions
Polynomial Regression
Polynomial Regression is a type of regression analysis used to model the relationship between a dependent variable and one or more independent variables by fitting a polynomial equation to the observed data. Unlike linear regression, which models the relationship using a straight line, polynomial regression models it using a curve.
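The trick is that the curve is still fit with ordinary linear regression, just on expanded features. A quick illustration with scikit-learn’s PolynomialFeatures (illustrative values): each input x is expanded into [x, x²], and a linear model over those columns traces a curve in x.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1], [2], [3]])  # illustrative inputs
print(PolynomialFeatures(degree=2, include_bias=False).fit_transform(x))
# [[1. 1.]
#  [2. 4.]
#  [3. 9.]]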
Business Application of Polynomial Regression
Consider a business that wants to understand the relationship between advertising spend and sales. Initially, an increase in advertising might lead to a significant increase in sales, but after a certain point the effect of additional advertising tends to diminish. In such cases, a polynomial regression can capture the non-linear relationship between advertising spend and sales more effectively than a linear model.
Let’s see a Python example where we use Polynomial Regression to model the relationship between advertising spend and sales.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Sample data: Advertising Spend (in $) and Sales (in $)
X_ad = np.array([[50], [100], [150], [200], [250], [300], [350], [400]])
y_sales = np.array([150, 220, 260, 275, 280, 285, 290, 295])

# Transforming our data for Polynomial Regression
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_ad)

# Fitting the Polynomial Regression model
poly_reg = LinearRegression()
poly_reg.fit(X_poly, y_sales)

# Predicting sales based on advertising spend
y_pred = poly_reg.predict(X_poly)

# Plotting the data points and the polynomial regression curve
plt.scatter(X_ad, y_sales, color='blue', label='Actual Sales')
plt.plot(X_ad, y_pred, color='red', label='Polynomial Regression')
plt.xlabel('Advertising Spend ($)')
plt.ylabel('Sales ($)')
plt.title('Advertising Spend vs Sales using Polynomial Regression')
plt.legend()
plt.show()
Advanced KNN Example: Loan Default Prediction
In this scenario, a bank wants to predict the risk associated with a loan based on a customer’s credit score and annual income. The bank classifies the risk into three categories: ‘Low Risk’, ‘Medium Risk’, and ‘High Risk’.
Let’s use KNN to classify customers into these risk categories.
# Sample data: Credit Score and Annual Income (in $)
X_loan = np.array([[650, 40000], [700, 55000], [600, 30000], [720, 75000], [630, 25000], [680, 50000], [590, 20000], [750, 90000]])
# Labels: 0 for 'Low Risk', 1 for 'Medium Risk', and 2 for 'High Risk'
y_risk = np.array([0, 0, 1, 0, 2, 0, 2, 0])

# Splitting data into training and testing sets
X_loan_train, X_loan_test, y_risk_train, y_risk_test = train_test_split(X_loan, y_risk, test_size=0.2, random_state=42)

# Using KNN for classification with k=3
knn_loan = KNeighborsClassifier(n_neighbors=3)
knn_loan.fit(X_loan_train, y_risk_train)
risk_predictions = knn_loan.predict(X_loan_test)

# Plotting the data points
plt.scatter(X_loan[:, 0], X_loan[:, 1], c=y_risk, cmap='viridis')
plt.xlabel('Credit Score')
plt.ylabel('Annual Income ($)')
plt.title('Loan Default Risk Prediction using KNN')
plt.colorbar().set_label('Risk Level')
plt.show()

risk_predictions
Advanced Polynomial Regression Example: Salary Prediction
In this scenario, a company wants to understand the relationship between the years of experience and the salary of its employees. As employees gain more experience, the growth in their salary might not be linear, especially at higher experience levels where salary increments might start to plateau.
Let’s use Polynomial Regression to model this non-linear relationship between years of experience and salary.
# Sample data: Years of Experience and Salary (in $)
X_exp = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y_sal = np.array([40000, 45000, 50000, 60000, 75000, 85000, 95000, 105000, 110000, 115000])

# Transforming our data for Polynomial Regression of degree 2
poly_exp = PolynomialFeatures(degree=2)
X_exp_poly = poly_exp.fit_transform(X_exp)

# Fitting the Polynomial Regression model
poly_reg_exp = LinearRegression()
poly_reg_exp.fit(X_exp_poly, y_sal)

# Predicting salary based on years of experience
y_sal_pred = poly_reg_exp.predict(X_exp_poly)

# Plotting the data points and the polynomial regression curve
plt.scatter(X_exp, y_sal, color='blue', label='Actual Salary')
plt.plot(X_exp, y_sal_pred, color='red', label='Polynomial Regression')
plt.xlabel('Years of Experience')
plt.ylabel('Salary ($)')
plt.title('Years of Experience vs Salary using Polynomial Regression')
plt.legend()
plt.show()
KNN Exercise: Customer Churn Prediction
In this exercise, you’ll be working with a fictional dataset representing a company’s customer data. Your task is to predict customer churn, i.e., the likelihood of a customer leaving the company’s services, based on their usage metrics and demographic information.
Dataset Description
- Age: Age of the customer (numeric)
- MonthlyCharge: Monthly charge for the services (numeric, in $)
- CustomerServiceCalls: Number of calls made to customer service (numeric)
- Churn: Whether the customer left the company within the last month (0 for No, 1 for Yes)
Objective
Using the KNN algorithm, classify customers into ‘Churn’ or ‘No Churn’ based on the given features.
Let’s get started!
import pandas as pd

# Sample dataset
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'MonthlyCharge': [50, 55, 60, 65, 70, 75, 80, 85, 90, 95],
    'CustomerServiceCalls': [1, 2, 1, 3, 2, 3, 1, 2, 3, 1],
    'Churn': [0, 0, 0, 1, 0, 1, 0, 1, 1, 0]
}

# Convert the dictionary to a DataFrame for better visualization
df = pd.DataFrame(data)
df
Data Preparation
Before applying the KNN algorithm, we need to prepare our data. This involves:
- Splitting the data into features (X) and target label (y). In our case, ‘Age’, ‘MonthlyCharge’, and ‘CustomerServiceCalls’ are our features, and ‘Churn’ is our target label.
- Splitting the dataset into training and testing sets. This allows us to train our model on one subset and test its performance on another unseen subset.
Let’s perform these steps.
from sklearn.model_selection import train_test_split

# Splitting the data into features (X) and target label (y)
X = df[['Age', 'MonthlyCharge', 'CustomerServiceCalls']]
y = df['Churn']

# Splitting the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.head(), y_train.head()
Applying the KNN Algorithm
Now that our data is prepared, we can apply the KNN algorithm. Here are the steps we’ll follow:
- Initialize the KNN Classifier: We’ll use the KNeighborsClassifier from sklearn to create our KNN model. The key parameter here is n_neighbors, which represents the number of neighbors to consider when making a prediction.
- Train the Classifier: We’ll use the fit method to train our KNN classifier on the training data.
- Make Predictions: Once trained, we can use the predict method to make predictions on new, unseen data.
Let’s go through each step with code and explanations.
from sklearn.neighbors import KNeighborsClassifier

# Step 1: Initialize the KNN Classifier
# We'll start with 3 neighbors for our classifier
knn = KNeighborsClassifier(n_neighbors=3)

# Step 2: Train the Classifier on the training data
knn.fit(X_train, y_train)

# Displaying the trained classifier
knn
Making Predictions with the KNN Classifier
Now that our KNN classifier is trained, we can use it to make predictions on new data. For this exercise, we’ll predict the churn for the test data and then compare these predictions to the actual churn values to evaluate the performance of our model.
To make predictions, we’ll use the predict method of our trained KNN classifier. This method takes in the features of the data we want to predict and returns the predicted labels.
Let’s make predictions on our test data and display the results.
# Step 3: Make Predictions on the test data
y_pred = knn.predict(X_test)

# Displaying the predicted churn values for the test data
y_pred
from sklearn.metrics import accuracy_score

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)

# Displaying the accuracy
accuracy
Overfitting and Underfitting: Bias versus Variance
In machine learning, achieving the right balance between bias and variance is crucial for creating models that generalize well to new, unseen data. Let’s delve into these concepts:
Bias: Refers to the error introduced by approximating a real-world problem (which may be complex) by a too-simple model. High bias can cause the model to miss relevant relations between features and target outputs, leading to underfitting.
Variance: Refers to the error introduced by using a model that’s too complex. High variance can cause the model to fit the random noise in the training data, leading to overfitting.
Underfitting:
Occurs when a model is too simple to capture the underlying structure of the data. Such a model has high bias and low variance.
Overfitting:
Occurs when a model is too complex and fits the training data too closely, including its noise and outliers. Such a model has low bias and high variance.
The goal in machine learning is to achieve a balance between bias and variance, ensuring that the model is flexible enough to model the data’s structure but not so flexible that it fits the noise in the data.
Let’s visualize the concepts of overfitting and underfitting.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Generate sample data
np.random.seed(0)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

# Create polynomial regression models of different degrees
degrees = [1, 4, 15]
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = make_pipeline(polynomial_features, linear_regression)
    pipeline.fit(X, y)

    # Plot each fitted model against the noisy samples
    X_test = np.linspace(0, 5, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 5))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree %d" % degrees[i])

plt.tight_layout()
plt.show()
The Cost Function
In machine learning, the cost function (or loss function) quantifies how well the model’s predictions match the actual values. In other words, it measures the error of the model. The goal during training is to minimize this error.
For linear regression, a common cost function is the Mean Squared Error (MSE), which calculates the average of the squared differences between the predicted and actual values.
The formula for MSE is:
\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
Where:
- \(y_i\) is the actual value.
- \(\hat{y}_i\) is the predicted value.
- \(n\) is the number of observations.
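As a sanity check, the formula maps directly onto one line of NumPy; here is the same calculation done by hand (using the same illustrative values as the scikit-learn example below):

import numpy as np

y_actual = np.array([3.0, 2.5, 4.0, 5.6])  # illustrative actual values
y_hat = np.array([2.8, 2.7, 3.8, 5.5])     # illustrative predicted values
mse = np.mean((y_actual - y_hat) ** 2)     # average of squared differences
print(mse)                                  # 0.0325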
Let’s implement the MSE in Python and calculate it for a sample set of actual and predicted values.
from sklearn.metrics import mean_squared_error

# Sample actual and predicted values
y_true = np.array([3, 2.5, 4, 5.6])
y_pred = np.array([2.8, 2.7, 3.8, 5.5])

# Calculate MSE
mse_value = mean_squared_error(y_true, y_pred)
mse_value
The Training Error and The Test Error
In machine learning, it’s essential to evaluate how well a model performs. Two common metrics used for this purpose are the training error and the test error:
Training Error: This is the error (typically the MSE) of the model on the same data it was trained on. A low training error might indicate that the model fits the training data well, but it doesn’t necessarily mean the model will perform well on new, unseen data.
Test Error: This is the error of the model on a separate set of data that it hasn’t seen during training. It gives a better indication of how the model will perform in real-world scenarios. A model that performs well on the training data but poorly on the test data is likely overfitting.
In a business context, the training error can help in understanding how well the model fits historical data, while the test error can provide insights into how the model might perform on future data.
Let’s calculate the training and test errors for a sample linear regression model using a business-related dataset.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Generate sample business data: Sales vs. Advertising Spend
np.random.seed(0)
X = 2.5 * np.random.rand(100, 1)
y = 5 + 3 * X + np.random.randn(100, 1)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train a linear regression model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Calculate the training error
y_train_pred = regressor.predict(X_train)
training_error = mean_squared_error(y_train, y_train_pred)

# Calculate the test error
y_test_pred = regressor.predict(X_test)
test_error = mean_squared_error(y_test, y_test_pred)

training_error, test_error
The Cost Function (Revisited)
The cost function, as previously mentioned, quantifies the error between the model’s predictions and the actual values. For regression problems, the Mean Squared Error (MSE) is a commonly used cost function. It’s crucial to minimize this error during the training process to ensure the model makes accurate predictions.
In a business context, the cost function can be thought of as a measure of how far off our predictions are from the actual outcomes. For instance, if we’re predicting monthly sales, a high cost would indicate that our predictions are far from the actual sales figures, which could lead to incorrect business decisions.
Now, let’s move on to the next topic.
Bias versus Variance (Revisited)
Bias and variance are two fundamental concepts in understanding the performance of machine learning models. They represent two types of errors that can occur:
Bias: This is the error introduced by approximating a real-world problem, which might be complex, by a too-simple model. High bias can cause the model to miss the relevant relations between features and target outputs, leading to underfitting. In a business context, a high-bias model might consistently make the same type of error, such as consistently underestimating sales by a certain amount.
Variance: This is the error introduced by using a model that’s too complex. High variance can cause the model to fit the random noise in the training data, leading to overfitting. In a business context, a high-variance model might be very sensitive to small fluctuations in the training data, leading to erratic predictions.
The challenge in machine learning is to find the right trade-off between bias and variance. Ideally, we want a model with low bias and low variance, but in practice, there’s often a trade-off. Reducing bias might increase variance and vice versa.
In a business scenario, understanding bias and variance is crucial. A model with high bias might lead to consistent errors in decision-making, while a model with high variance might lead to unpredictable and erratic decisions.
Now, let’s move on to the next topic.
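As a rough numerical illustration of the trade-off (a sketch reusing the training and test split from the previous example; exact numbers will vary with the data), a very flexible model typically achieves a lower training error but a worse test error than a simple one:

# Sketch: compare a simple (degree 1) and a very flexible (degree 15) model.
# Assumes X_train, X_test, y_train, y_test from the earlier split.
for degree in [1, 15]:
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"Degree {degree}: training MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")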
# Generate sample business data: Sales vs. Advertising Spend
np.random.seed(42)
X = 2.5 * np.random.rand(100, 1)
y = 5 + 3 * X + np.random.randn(100, 1)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Calculate the training error
y_train_pred = regressor.predict(X_train)
training_error = mean_squared_error(y_train, y_train_pred)

# Calculate the test error
y_test_pred = regressor.predict(X_test)
test_error = mean_squared_error(y_test, y_test_pred)

training_error, test_error
Exercise: Overfitting and Underfitting in Business Sales Prediction
Scenario: You are a data scientist at a retail company, and you are tasked with predicting monthly sales based on advertising spend. You decide to use polynomial regression. However, you want to ensure that your model neither overfits nor underfits the data.
Objective: Fit polynomial regression models of varying degrees to the sales data and visualize the results to understand the concepts of overfitting and underfitting.
Let’s start by visualizing the provided sales data.
# Generate sample business data: Sales vs. Advertising Spend
np.random.seed(0)
X = 2.5 * np.random.rand(100, 1)
y = 5 + 3 * X + np.random.randn(100, 1)

# Visualize the data
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', s=30)
plt.title('Sales vs. Advertising Spend')
plt.xlabel('Advertising Spend (in thousands)')
plt.ylabel('Sales (in thousands)')
plt.grid(True)
plt.show()
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Display the size of the training and test sets
len(X_train), len(X_test)
# Fit polynomial regression models of varying degrees and visualize the results
degrees = [1, 4, 15]
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = make_pipeline(polynomial_features, linear_regression)
    pipeline.fit(X_train, y_train)

    # Visualize the models
    X_range = np.linspace(0, 2.5, 100)
    plt.plot(X_range, pipeline.predict(X_range[:, np.newaxis]), label="Model")
    plt.scatter(X_train, y_train, edgecolor='b', s=20, label="Training Data")
    plt.scatter(X_test, y_test, edgecolor='r', s=20, label="Test Data")
    plt.xlabel("Advertising Spend (in thousands)")
    plt.ylabel("Sales (in thousands)")
    plt.xlim((0, 2.5))
    plt.ylim((0, 15))
    plt.legend(loc="best")
    plt.title("Degree %d" % degrees[i])

plt.tight_layout()
plt.show()
Exercise: Understanding the Cost Function in Business
Scenario: You are working for a retail company and are tasked with predicting monthly sales based on advertising spend. You’ve chosen a linear regression model for this task. To evaluate the performance of your model, you decide to compute the Mean Squared Error (MSE) as your cost function.
Objective: Calculate the MSE for your linear regression model using both the training and test data. Compare the results to understand the model’s performance.
Let’s begin by calculating the MSE for the training data.
# Calculate the MSE for the training data
y_train_pred = regressor.predict(X_train)
mse_train = mean_squared_error(y_train, y_train_pred)
mse_train
# Calculate the MSE for the test data
y_test_pred = regressor.predict(X_test)
mse_test = mean_squared_error(y_test, y_test_pred)
mse_test
Exercise: Evaluating Model Performance using Training and Test Errors
Scenario: You are a data scientist at a retail company. The marketing team wants to understand the performance of the sales prediction model before launching a new advertising campaign. They are particularly interested in knowing how well the model performs on historical data (training data) and how it might perform on future data (test data).
Objective: Calculate the training and test errors for the sales prediction model. Analyze the results to provide insights to the marketing team.
Let’s begin by calculating the training and test errors.
# Calculate the training error
training_error = mean_squared_error(y_train, y_train_pred)

# Calculate the test error
test_error = mean_squared_error(y_test, y_test_pred)

training_error, test_error