Class 6: Data Manipulation II

Discretization of Data

Discretization refers to the process of converting continuous variables into discrete ones. This can be particularly useful when we want to categorize continuous data into specific bins or intervals. In data analysis, it is sometimes more meaningful to work with categories or intervals rather than raw continuous values.

Continuous vs. Discrete Variables

  • Continuous Variable: A variable that can take any value within a given range. Examples include height, weight, and temperature.
  • Discrete Variable: A variable that can take specific values, often in the form of integers. Examples include the number of cars in a household or the number of students in a class.

The pandas library in Python provides a function called pd.cut() that helps in discretizing continuous variables. Let’s delve deeper into its parameters and see how it works.

Parameters of pd.cut()

  • x: The input array to be binned. Must be 1-dimensional.
  • bins: Defines the bin edges for the discretization. It can be an integer, sequence of scalars, or an IntervalIndex.
  • right: Indicates whether the bins include the rightmost edge or not. Default is True.
  • labels: Specifies the labels for the returned bins. Must be the same length as the resulting bins.
  • retbins: Whether to return the bin edges (True or False). Default is False.
  • precision: The precision at which to store and display the bin labels. Default is 3.
  • include_lowest: Whether the first interval should be left-inclusive or not. Default is False.

Now, let’s see some examples of how to use pd.cut().

import pandas as pd

# Example 1: Basic usage of pd.cut()
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
bins = [0, 2, 4, 6, 8, 10]
categories = pd.cut(data, bins)
categories
[(0, 2], (0, 2], (2, 4], (2, 4], (4, 6], (4, 6], (6, 8], (6, 8], (8, 10], (8, 10]]
Categories (5, interval[int64, right]): [(0, 2] < (2, 4] < (4, 6] < (6, 8] < (8, 10]]
# Example 2: Using labels parameter
bin_labels = ['low', 'medium', 'high', 'very high', 'extreme']
categories_with_labels = pd.cut(data, bins, labels=bin_labels)
categories_with_labels
['low', 'low', 'medium', 'medium', 'high', 'high', 'very high', 'very high', 'extreme', 'extreme']
Categories (5, object): ['low' < 'medium' < 'high' < 'very high' < 'extreme']
# Example 3: Using right parameter to exclude the rightmost edge
categories_excluding_right = pd.cut(data, bins, right=False)
categories_excluding_right
[[0.0, 2.0), [2.0, 4.0), [2.0, 4.0), [4.0, 6.0), [4.0, 6.0), [6.0, 8.0), [6.0, 8.0), [8.0, 10.0), [8.0, 10.0), NaN]
Categories (5, interval[int64, left]): [[0, 2) < [2, 4) < [4, 6) < [6, 8) < [8, 10)]
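The retbins and include_lowest parameters from the list above were not demonstrated. A short sketch of both, using the same data:

```python
import pandas as pd

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# retbins=True returns the computed bin edges alongside the categories,
# which is useful when pandas chooses the edges (integer `bins`)
categories, edges = pd.cut(data, bins=4, retbins=True)
print(edges)  # five edges defining four equal-width bins

# Without include_lowest, the minimum value falls outside (1, 4] and becomes NaN;
# include_lowest=True closes the first interval on the left so 1 is kept
categories_inclusive = pd.cut(data, bins=[1, 4, 7, 10], include_lowest=True)
print(categories_inclusive.categories)
```

Note that with include_lowest=True the first interval is displayed with a slightly lowered left edge (e.g. (0.999, 4.0]) so that the minimum value is included.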

Using np.where with Pandas DataFrames

The np.where function from the NumPy library is a versatile tool that allows for conditional logic based on array-like structures. In essence, it’s a vectorized version of the ternary expression x if condition else y.

When working with Pandas DataFrames, np.where can be particularly useful for creating or modifying columns based on certain conditions.

The basic syntax is as follows:

np.where(condition, value_if_true, value_if_false)

Let’s explore some examples to understand its usage with Pandas DataFrames.

import numpy as np

# Example 1: Basic usage of np.where with DataFrame
df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})
df1['C'] = np.where(df1['A'] > 3, 'high', 'low')
df1
# Example 2: Handling multiple conditions with np.select (np.where supports only two outcomes)
df2 = pd.DataFrame({'Score': [85, 90, 78, 88, 76, 95, 89]})
conditions = [df2['Score'] < 80, (df2['Score'] >= 80) & (df2['Score'] < 90), df2['Score'] >= 90]
choices = ['Fail', 'Pass', 'Excellent']
df2['Grade'] = np.select(conditions, choices, default='Unknown')
df2
# Example 3: Modifying an existing column based on a condition
df3 = pd.DataFrame({'Age': [25, 30, 35, 40, 45, 50, 55]})
df3['Category'] = np.where(df3['Age'] < 40, 'Young', 'Old')
df3
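Since np.where returns only one of two outcomes, handling three or more categories requires either np.select (as in Example 2) or nested calls. A sketch reproducing Example 2's grades with nested np.where:

```python
import numpy as np
import pandas as pd

# Same three-way grading as Example 2, written as nested np.where calls:
# the "false" branch of the outer call is itself another np.where
df_scores = pd.DataFrame({'Score': [85, 90, 78, 88, 76, 95, 89]})
df_scores['Grade'] = np.where(df_scores['Score'] >= 90, 'Excellent',
                              np.where(df_scores['Score'] >= 80, 'Pass', 'Fail'))
print(df_scores)
```

Nesting reads well for two or three conditions; beyond that, np.select with explicit condition and choice lists is usually clearer.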

One-Hot Encoding with pd.get_dummies()

One of the common challenges in data preprocessing, especially when dealing with categorical data, is how to convert these categories into a format that can be provided to machine learning algorithms. One popular way to convert categorical variables is by using one-hot encoding.

One-hot encoding converts categorical variables into a numeric indicator form that machine learning algorithms can work with. The pd.get_dummies() function is used for this purpose: it converts categorical variable(s) into dummy/indicator variables.

The basic idea is to create a new column for each category and assign a 1 or 0 (True/False) value to it. This avoids implying a spurious ordinal relationship between categories (e.g., treating 'Cherry' as greater than 'Apple').

Let’s explore some examples to understand its usage with Pandas DataFrames.

# Example 1: Basic usage of pd.get_dummies()
df1 = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Cherry', 'Apple', 'Banana']})
dummies1 = pd.get_dummies(df1, columns=['Fruit'], prefix='', prefix_sep='')
dummies1
# Example 2: Using prefix and prefix_sep parameters
df2 = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']})
dummies2 = pd.get_dummies(df2, columns=['Color'], prefix='Color', prefix_sep='_')
dummies2
# Example 3: Handling multiple categorical columns
df3 = pd.DataFrame({'Animal': ['Dog', 'Cat', 'Bird', 'Dog', 'Bird'],
                   'Size': ['Small', 'Medium', 'Small', 'Large', 'Medium']})
dummies3 = pd.get_dummies(df3, columns=['Animal', 'Size'], prefix=['Animal', 'Size'])
dummies3

Dropping a Dummy Variable (Avoiding the Dummy Variable Trap)

When we use one-hot encoding, we can sometimes fall into the ‘dummy variable trap’, which can distort the results of some machine learning models. The dummy variable trap is a scenario where variables are highly correlated, meaning one variable can be predicted from the others.

For instance, if we have three categories (A, B, and C) and we know the values of A and B dummy variables, we can easily infer the value of the C dummy variable. This is because if A=0 and B=0, then C must be 1. This multicollinearity can lead to issues in certain machine learning models.

To avoid the dummy variable trap, one common practice is to drop one of the dummy variables if we have n categories. By doing this, we will have n-1 dummy variables. This is often referred to as ‘dropping the first level’.

Let’s see an example to understand this better.

# Creating a sample DataFrame with three categories
df = pd.DataFrame({'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A']})

# Using pd.get_dummies() and dropping the first level
dummies = pd.get_dummies(df, columns=['Category'], drop_first=True)
dummies
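One practical note: recent pandas versions (2.0 and later) return boolean dummy columns by default. If 0/1 integers are preferred, for example for a modeling library that expects numeric input, the dtype parameter can be passed. A sketch using the same three-category DataFrame:

```python
import pandas as pd

df_cat = pd.DataFrame({'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A']})

# drop_first=True keeps n-1 columns (B and C); dtype=int yields 0/1 instead of True/False
dummies_int = pd.get_dummies(df_cat, columns=['Category'], drop_first=True, dtype=int)
print(dummies_int)
```

With ‘Category_A’ dropped, a row of all zeros unambiguously identifies category A, which is exactly how dropping the first level avoids the redundancy described above.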

String Manipulation

In data analysis, we often encounter columns that contain string (text) data. Pandas provides a set of string functions which make it easy to operate on string data. These functions are available via the .str accessor of a Series (e.g., df['col'].str.lower()).

Let’s explore some of the commonly used string functions:

  1. .str: Gives access to vectorized string methods on a Series.
  2. .lower(): Converts all characters in a string to lowercase.
  3. .strip(): Removes leading and trailing whitespace (spaces, tabs, and newlines).
# Example for .lower()
df1 = pd.DataFrame({'Names': ['ALICE', 'BOB', 'CHARLIE', 'DAVID']})
df1['Names_lower'] = df1['Names'].str.lower()
df1
# Example for .strip()
df2 = pd.DataFrame({'Data': ['  apple  ', ' banana ', ' cherry  ', ' date']})
df2['Stripped_Data'] = df2['Data'].str.strip()
df2
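Because each .str method returns a Series, the methods can be chained. A small sketch normalizing messy text by stripping whitespace and lowercasing in one expression:

```python
import pandas as pd

# Chaining .str methods: strip the padding first, then lowercase the result
df_text = pd.DataFrame({'Raw': ['  APPLE  ', ' Banana ', 'CHERRY  ']})
df_text['Clean'] = df_text['Raw'].str.strip().str.lower()
print(df_text['Clean'].tolist())  # ['apple', 'banana', 'cherry']
```

This kind of chain is a common first step when cleaning user-entered or scraped text before grouping or joining on it.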

Date Manipulation with Pandas

Handling dates and time is a common task in data analysis. Pandas provides a rich set of functions to work with date and time, which makes it easy to perform operations like parsing dates, extracting date components, and even doing arithmetic with dates.

One of the most commonly used functions for date manipulation in Pandas is pd.to_datetime(). This function is used to convert arguments to datetime format.

# Example: Converting a Series of strings to datetime format
df_dates = pd.DataFrame({'Date_String': ['2023-01-01', '2023-02-15', '2023-03-20', '2023-04-25']})
df_dates['Date'] = pd.to_datetime(df_dates['Date_String'])
df_dates

The pd.to_datetime() function has several parameters that allow for customization. Some of the commonly used parameters are:

  • format: This is used to specify the date format. For example, ‘%Y-%m-%d’ represents the format ‘YYYY-MM-DD’.
  • errors: This parameter specifies what should be returned if parsing fails. The default is ‘raise’, which means an exception is raised. Another option is ‘coerce’, which returns NaT for invalid parsing. (‘ignore’, which returned the original input, is deprecated in recent pandas versions.)

Once we have our dates in datetime format, we can easily extract various components from it. Let’s see how to extract the year, month, day of the week, and the name of the month from our datetime column.

# Extracting various components from the datetime column
df_dates['Year'] = df_dates['Date'].dt.year
df_dates['Month'] = df_dates['Date'].dt.month
df_dates['Day_of_Week'] = df_dates['Date'].dt.dayofweek
df_dates['Month_Name'] = df_dates['Date'].dt.strftime('%B')
df_dates

The format parameter in pd.to_datetime() allows us to specify the expected date format of our input. This can be particularly useful when dealing with dates that are in non-standard formats. By providing the correct format, we can ensure that our dates are parsed correctly.

Let’s look at some examples with different date formats:

# Example with different date formats
df_formats = pd.DataFrame({'Date_String': ['01-23-2023', '15/02/2023', '20.03.2023']})

# Parsing dates with specified formats; errors='coerce' turns strings that
# don't match the given format into NaT instead of raising an exception
df_formats['Date_1'] = pd.to_datetime(df_formats['Date_String'], format='%m-%d-%Y', errors='coerce')
df_formats['Date_2'] = pd.to_datetime(df_formats['Date_String'], format='%d/%m/%Y', errors='coerce')
df_formats['Date_3'] = pd.to_datetime(df_formats['Date_String'], format='%d.%m.%Y', errors='coerce')
df_formats
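The arithmetic with dates mentioned earlier also works directly once a column is in datetime format. A short sketch, assuming an illustrative two-row DataFrame, using pd.Timedelta to shift dates and subtraction to measure gaps:

```python
import pandas as pd

df_dt = pd.DataFrame({'Date': pd.to_datetime(['2023-01-01', '2023-02-15'])})

# Adding a Timedelta shifts every date by a fixed offset
df_dt['Week_Later'] = df_dt['Date'] + pd.Timedelta(days=7)

# Subtracting two datetime columns yields a timedelta; .dt.days extracts whole days
df_dt['Days_Since'] = (df_dt['Date'].max() - df_dt['Date']).dt.days
print(df_dt)
```

Subtraction produces a timedelta64 column, so the .dt accessor exposes components such as .days, mirroring the .dt.year / .dt.month extraction shown above.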

Exercises

Now that you’ve learned various data manipulation techniques with Pandas, it’s time to put your knowledge to the test! Below are exercises that cover the topics we’ve discussed. Each exercise provides a sample dataframe for you to work with.

Discretization of Data

  1. Age Groups:

Given the following dataset of customers with their ages, create age groups as ‘Young (0-18)’, ‘Adult (19-50)’, and ‘Senior (51 and above)’. How many customers fall into each category?

df_ages = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
                       'Age': [15, 25, 55, 60, 30]})
  2. Salary Brackets:

Given the following dataset of employee salaries, categorize them into ‘Low’, ‘Medium’, ‘High’ salary brackets. Define the salary ranges for each bracket yourself.

df_salaries = pd.DataFrame({'Employee': ['John', 'Doe', 'Smith', 'Jane', 'Emily'],
                           'Salary': [3000, 5000, 7000, 10000, 4000]})

np.where Usage

  1. Temperature Analysis:

Given the following dataset of daily temperatures, identify days with temperatures below freezing (0°C) and label them as ‘Cold’. Days with temperatures between 0°C and 20°C are ‘Moderate’, and days above 20°C are ‘Hot’.

df_temperatures = pd.DataFrame({'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
                               'Temperature': [-5, 10, 15, 25, 30]})
  2. Product Sales:

Given the following dataset of product sales, identify products with sales less than 100 as ‘Low Selling’, between 100 and 500 as ‘Moderate Selling’, and above 500 as ‘High Selling’.

df_sales = pd.DataFrame({'Product': ['A', 'B', 'C', 'D', 'E'],
                        'Sales': [50, 150, 300, 600, 800]})

One-Hot Encoding with pd.get_dummies

  1. Car Brands:

Given the following dataset of car sales, perform one-hot encoding on the ‘Brand’ column.

df_cars = pd.DataFrame({'Brand': ['Toyota', 'Honda', 'Ford', 'Toyota', 'Ford'],
                       'Sales': [200, 150, 300, 400, 250]})
  2. Student Majors:

Given the following dataset of students and their majors, perform one-hot encoding on the ‘Major’ column, but only keep columns for ‘Engineering’ and ‘Business’.

df_students = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
                           'Major': ['Engineering', 'Business', 'Arts', 'Business', 'Science']})

String Manipulation

  1. Name Formatting:

Given the following dataset of names, convert all names to lowercase.

df_names = pd.DataFrame({'Name': ['ALICE', 'bOB', 'ChArLiE', 'David', 'EVA']})
  2. Whitespace Removal:

Given the following dataset of product descriptions, remove any leading or trailing whitespace from the ‘Description’ column.

df_products = pd.DataFrame({'Product': ['A', 'B', 'C', 'D', 'E'],
                           'Description': ['  toy ', 'book  ', '  pen', ' notebook', ' pencil ']})

Date Manipulation

  1. Date Conversion:

Given the following dataset of events with dates in the format ‘dd-mm-yyyy’, convert the ‘Date’ column to a datetime format.

df_events = pd.DataFrame({'Event': ['Concert', 'Festival', 'Exhibition', 'Play', 'Conference'],
                         'Date': ['01-05-2022', '15-06-2022', '20-07-2022', '10-08-2022', '25-09-2022']})
  2. Extracting Date Components:

Given the following dataset of birthdays, extract the month and day of the week for each date.

df_birthdays = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
                            'Birthday': ['1995-05-01', '1987-06-15', '1992-07-20', '1985-08-10', '1990-09-25']})

Remember to practice regularly and apply these techniques to real-world datasets to reinforce your understanding. Happy coding!