Dealing with Missing Data in a Dataset

Identifying Missing Data

The first step in dealing with missing data is to identify its presence and location within your dataset. This can be done using various methods, depending on the tools and programming language you are using. In Python, for instance, you can use libraries like Pandas to easily find missing values.


import pandas as pd

# Assuming 'df' is your DataFrame
missing_values = df.isnull()
print(missing_values.sum())

This code will give you a count of missing values in each column of your DataFrame.
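
It is also useful to look at the share of missing values in each column, since the choice between dropping and imputing usually depends on how much data is missing. A small follow-up, assuming the same df:

# Percentage of missing values per column
print(df.isnull().mean() * 100)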

Handling Missing Data

Once you’ve identified the missing data, the next step is to decide how to handle it. The approach depends on the nature of your data and the amount of missing information. Here are some common strategies:

1. Removing Data

  • Drop rows with missing values: If the dataset is large and the number of rows with missing data is small, you might consider removing these rows.

# Drops every row that contains at least one missing value
df.dropna(inplace=True)

  • Drop columns with missing values: If a specific column has a significant number of missing values, it might be better to remove the entire column.

# Drops every column that contains at least one missing value
df.dropna(axis=1, inplace=True)
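
If dropping every column with any missing value is too aggressive, pandas' thresh parameter (the minimum number of non-missing values a column must have in order to be kept) gives finer control. A sketch that keeps only columns at least 50% populated:

# Keep only columns with at least 50% non-missing values
df.dropna(axis=1, thresh=int(0.5 * len(df)), inplace=True)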

2. Imputing Data

  • Mean/Median/Mode Imputation: Replace missing values with the mean, median, or mode of the column. This is useful for numerical data.

# For mean imputation (numerical columns)
df.fillna(df.mean(numeric_only=True), inplace=True)

# For median imputation (numerical columns)
df.fillna(df.median(numeric_only=True), inplace=True)

# For mode imputation (useful for categorical data)
df.fillna(df.mode().iloc[0], inplace=True)

  • Custom Imputation: Use domain knowledge or other algorithms to impute missing values (an illustrative sketch follows).
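
As one illustration of algorithmic imputation — a sketch only, assuming scikit-learn is available and that you want to impute the numeric columns of df — KNNImputer fills each missing value from the most similar rows:

from sklearn.impute import KNNImputer

# Impute numeric columns using the 3 nearest rows (based on the observed values)
numeric_cols = df.select_dtypes(include='number').columns
imputer = KNNImputer(n_neighbors=3)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])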

3. Using Algorithms that Support Missing Values

  • Some machine learning algorithms can handle missing values natively. For example, gradient-boosted tree implementations such as XGBoost, LightGBM, and scikit-learn's HistGradientBoosting estimators accept missing data without imputation (a brief sketch follows).
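
As a minimal sketch of this idea (assuming scikit-learn is installed), HistGradientBoostingClassifier can be fitted on a feature matrix that still contains NaN values:

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

# A tiny feature matrix with missing values (np.nan) left in place
X_nan = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [4.0, np.nan],
                  [5.0, 6.0]])
y_nan = np.array([0, 0, 1, 1])

# The estimator handles NaN natively; no imputation step is required
clf = HistGradientBoostingClassifier().fit(X_nan, y_nan)
print(clf.predict(X_nan))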

The following end-to-end example puts identification and handling together on a small dataset:

import pandas as pd

# Example dataset with missing values
data = {'Name': ['Alice', 'Bob', 'Charlie', None],
        'Age': [25, None, 30, 22],
        'Salary': [70000, 80000, None, 40000]}
df = pd.DataFrame(data)
print(df)

# Identifying missing values
print("Missing values in each column:")
print(df.isnull().sum())

# Handling missing data

# Option 1: Removing rows with missing values
df_dropped_rows = df.dropna()
print("DataFrame after dropping rows with missing values:")
print(df_dropped_rows)

# Option 2: Imputing missing values
# For simplicity, we use the mean for numerical columns and the mode for the categorical column
df_imputed = df.copy()
df_imputed['Age'] = df_imputed['Age'].fillna(df['Age'].mean())
df_imputed['Salary'] = df_imputed['Salary'].fillna(df['Salary'].mean())
df_imputed['Name'] = df_imputed['Name'].fillna(df['Name'].mode()[0])

print("\nDataFrame after imputing missing values:")
print(df_imputed)

Dealing with Outliers in a Dataset

Understanding and Handling Outliers

Outliers are data points that differ significantly from other observations. They can occur due to variability in the measurement or may indicate experimental errors. Handling outliers is crucial as they can lead to misleading representations and affect the results of data analysis.

Steps to Handle Outliers:

  1. Visualizing Data: Using plots like boxplots to identify outliers.

  2. Identifying Outliers: Determining which data points are considered outliers, commonly by flagging values more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.

  3. Removing Outliers: Deciding on a strategy to handle outliers, often by removing them.

We will demonstrate these steps using a dataset with more than 50 rows: we will inspect it with a boxplot, identify outliers using the IQR rule, and then remove them.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generating a dataset with more than 50 rows
np.random.seed(0)
data = np.random.normal(100, 20, 60)

# Introducing outliers
data = np.append(data, [300, 305])

# Creating a DataFrame
df_outliers = pd.DataFrame(data, columns=['Values'])

# Displaying the first few rows
print(df_outliers.head())

# Visualizing the data with a boxplot
plt.figure(figsize=(10, 6))
plt.boxplot(df_outliers['Values'])
plt.title('Boxplot of Values')
plt.ylabel('Value')
plt.show()

# This boxplot will help us identify the outliers visually.

# Identifying and removing outliers
Q1 = df_outliers['Values'].quantile(0.25)
Q3 = df_outliers['Values'].quantile(0.75)
IQR = Q3 - Q1

# Defining bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filtering out the outliers
df_filtered = df_outliers[(df_outliers['Values'] >= lower_bound) & (df_outliers['Values'] <= upper_bound)]

# Displaying the filtered DataFrame
print("DataFrame after removing outliers:")
print(df_filtered)
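
To confirm that the extreme values are gone, you can redraw the boxplot on the filtered data (a quick check using df_filtered from above):

plt.figure(figsize=(10, 6))
plt.boxplot(df_filtered['Values'])
plt.title('Boxplot of Values After Outlier Removal')
plt.ylabel('Value')
plt.show()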

Selecting Variables for a Classification Model

Selecting Variables for Classification Using the Iris Dataset

In this section, we will demonstrate how to select variables for a classification model using the Iris dataset. This dataset is a classic in machine learning, featuring measurements of iris flowers and their species. We will explore different techniques to determine which features are most important for classification.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Loading the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Displaying the first few rows of the dataset
print(X[:5, :])
print(y[:5])

# Feature Selection using SelectKBest and the Chi-Squared Test
# Selecting the top 2 features
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

# Displaying the selected features
print("Selected Features:")
print(X_selected[:5, :])

# The scores for each feature
print("Feature Scores:")
print(selector.scores_)
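
The raw scores are easier to interpret when mapped back to the feature names; a small follow-up using the selector fitted above:

import numpy as np

# Indices and names of the features kept by SelectKBest
selected_idx = selector.get_support(indices=True)
print("Selected feature names:", np.array(iris.feature_names)[selected_idx])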

Feature Selection Using Random Forest on the Wine Dataset

Another popular technique for feature selection is using a Random Forest classifier. This method is particularly useful for understanding feature importance in classification tasks. We will use the Wine dataset, another classic dataset in machine learning, to demonstrate this technique.

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Loading the Wine dataset
wine = load_wine()
X_wine = wine.data
y_wine = wine.target

# Displaying the first few rows of the dataset
print(X_wine[:5, :])
print(y_wine[:5])

# Feature Selection using Random Forest
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_wine, y_wine)

# Getting feature importances
importances = rf.feature_importances_

# Sorting the feature importances in descending order
indices = np.argsort(importances)[::-1]

# Displaying the feature importances
print("Feature ranking:")
for f in range(X_wine.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plotting feature importances
plt.figure(figsize=(12, 6))
plt.title("Feature Importances in the Wine Dataset")
plt.bar(range(X_wine.shape[1]), importances[indices], color="r", align="center")
# Label the bars with the feature names in the same (sorted) order as the bars
plt.xticks(range(X_wine.shape[1]), [wine.feature_names[i] for i in indices], rotation=45)
plt.xlim([-1, X_wine.shape[1]])
plt.ylabel('Importance')
plt.xlabel('Features')
plt.tight_layout()
plt.show()

# This plot will help us visually assess which features are most important.

Conclusion on Feature Selection for the Wine Dataset

Based on the feature importance rankings and the visual representation, we can conclude which features are most significant for the classification model. Generally, features with higher importance scores are more influential in predicting the target variable. In this case, we would select the top-ranking features as they have the highest impact on the model’s performance.
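
If you want to turn these rankings into an actual reduced feature matrix, one option is scikit-learn's SelectFromModel, which wraps the fitted forest; a minimal sketch, assuming the rf estimator trained above:

from sklearn.feature_selection import SelectFromModel

# Keep features whose importance exceeds the mean importance (the default threshold for tree models)
sfm = SelectFromModel(rf, prefit=True)
X_wine_selected = sfm.transform(X_wine)

print("Original number of features:", X_wine.shape[1])
print("Number of features kept:", X_wine_selected.shape[1])
print("Kept features:", [wine.feature_names[i] for i in sfm.get_support(indices=True)])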

Feature Selection Using Principal Component Analysis (PCA) on the Breast Cancer Dataset

Principal Component Analysis (PCA) is a dimensionality-reduction technique that can also support feature selection decisions. It transforms the data into a new set of variables, the principal components, which are orthogonal, uncorrelated, and ordered by the amount of variance they explain. We will use the Breast Cancer dataset to demonstrate this technique.

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
import pandas as pd

# Loading the Breast Cancer dataset
breast_cancer = load_breast_cancer()
X_cancer = breast_cancer.data
y_cancer = breast_cancer.target

# Creating a DataFrame for better visualization
df_cancer = pd.DataFrame(X_cancer, columns=breast_cancer.feature_names)
df_cancer['target'] = y_cancer

# Displaying the first few rows of the dataset
print(df_cancer.head())

# Applying PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_cancer)

# Creating a DataFrame for the PCA results
df_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df_pca['target'] = y_cancer

# Displaying the first few rows of the PCA results
print(df_pca.head())

# Visualizing the PCA results
import matplotlib.pyplot as plt
import seaborn as sns

# Plotting the PCA components
plt.figure(figsize=(10, 6))
sns.scatterplot(x='PC1', y='PC2', hue='target', data=df_pca, palette='Set1')
plt.title('PCA of Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Target')
plt.show()

# This plot will help us understand the distribution of the data in the new feature space created by PCA.

Conclusion on Feature Selection with PCA

After applying PCA to the Breast Cancer dataset, we can observe how the data is distributed across the principal components. PCA reduces the dimensionality of the dataset while retaining most of its variance. In this case, we transformed the data into two principal components. These components can be used for further analysis or in building classification models, as they capture most of the variance in the dataset.
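
To quantify how much information the two components retain, you can inspect their explained variance ratio. One caveat: PCA is sensitive to feature scales, so standardizing the data first often yields more balanced components. A short sketch, using the pca object fitted above:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Variance explained by the two components fitted on the raw data
print("Explained variance ratio (raw data):", pca.explained_variance_ratio_)

# Standardizing the features before PCA usually balances their influence
X_scaled = StandardScaler().fit_transform(X_cancer)
pca_scaled = PCA(n_components=2).fit(X_scaled)
print("Explained variance ratio (standardized data):", pca_scaled.explained_variance_ratio_)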

Student Exercise: Preprocessing and Model Selection with Cardio Dataset

Now, it’s your turn to apply the techniques we’ve discussed using the cardio.csv dataset. Your tasks are as follows:

  1. Handle Missing Data: Either remove missing values or impute them based on the techniques discussed.

  2. Eliminate Outliers: Identify and remove outliers from each column.

  3. Feature Selection: Choose the best variables for the model using the methods we’ve explored.

  4. Model Building and Comparison: Build models using KNN (K-Nearest Neighbors) and Logistic Regression, and compare their performance with the metrics obtained in the previous class (a starter sketch follows this list).
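
As a starting point only — the file path, the target column name ('cardio'), and any preprocessing are assumptions you will need to adapt to the actual dataset — a minimal sketch of the model-building and comparison step might look like this:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset (path and target column name are assumptions)
df_cardio = pd.read_csv('cardio.csv')
X = df_cardio.drop(columns=['cardio'])
y = df_cardio['cardio']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features (both KNN and Logistic Regression benefit from scaling)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit and compare the two models
for name, model in [('KNN', KNeighborsClassifier()),
                    ('Logistic Regression', LogisticRegression(max_iter=1000))]:
    model.fit(X_train_scaled, y_train)
    preds = model.predict(X_test_scaled)
    print(f"{name} accuracy: {accuracy_score(y_test, preds):.3f}")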

This exercise will help you solidify your understanding of data preprocessing and model selection. Good luck!