In mathematics, a set is a collection of distinct objects, considered as an object in its own right. The operations associated with sets include:
Union: Represents all elements that are in either set. Denoted as \(A \cup B\).
Intersection: Represents all elements that are common to both sets. Denoted as \(A \cap B\).
Difference: Represents all elements that are in the first set but not in the second. Denoted as \(A - B\).
Complement: Represents all elements that are not in the given set. Denoted as \(\overline{A}\).
These operations form the basic foundation for set theory and are crucial when dealing with collections of data.
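Since the rest of this notebook uses Python, here is a quick sketch of these operations with Python's built-in set type (the sets A, B, and the universal set U are our own toy example):

# Two example sets and a universal set for the complement
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
U = {1, 2, 3, 4, 5, 6, 7, 8}

print(A | B)   # union: {1, 2, 3, 4, 5, 6}
print(A & B)   # intersection: {3, 4}
print(A - B)   # difference: {1, 2}
print(U - A)   # complement of A relative to U: {5, 6, 7, 8}

The method forms A.union(B), A.intersection(B), and A.difference(B) are equivalent to the operators.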
Pandas Merge Function
The merge function in pandas is used to combine two DataFrames based on common columns or indices. It’s similar to SQL JOIN operations. The key parameters include:
right: The DataFrame to merge with.
on: Column(s) that should be used to join the DataFrames. These columns should exist in both DataFrames.
left_on: Columns from the left DataFrame to use as keys.
right_on: Columns from the right DataFrame to use as keys.
how: The type of join to perform: 'inner' (the default), 'outer', 'left', or 'right'. The example below passes how='inner' explicitly.
Let’s create two example DataFrames to demonstrate the use of merge with right_on and left_on.
import pandas as pd

# Creating two example DataFrames
students = pd.DataFrame({'student_id': [101, 102, 103],
                         'name': ['Alice', 'Bob', 'Charlie']})
courses = pd.DataFrame({'registration_no': [101, 102, 104],
                        'course': ['Math', 'Physics', 'Chemistry']})

# Merging the DataFrames using merge with right_on and left_on
merged_df = students.merge(courses, left_on='student_id', right_on='registration_no', how='inner')
merged_df
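Only student IDs 101 and 102 appear in both DataFrames, so the inner merge keeps just those two rows: Alice paired with Math and Bob with Physics. Charlie and Chemistry have no match and are dropped.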
Outer Join vs Inner Join
When merging or joining DataFrames in pandas, it’s essential to understand the type of join you want to perform. The two most common types are outer join and inner join:
Outer Join: This type of join returns all rows from both DataFrames, matched up wherever the keys agree. For rows with no match in the other DataFrame, the columns coming from the missing side are filled with NaN.
Inner Join: This type of join returns only the rows whose keys appear in both DataFrames. A row with no match in the other DataFrame is excluded from the result.
To illustrate the difference between these two types of joins, let’s consider an example. Suppose we have a DataFrame of authors and their books, and another DataFrame of books and their prices. We want to merge these DataFrames based on the book titles. Let’s first create and display these DataFrames.
# Creating two example DataFrames
authors = pd.DataFrame({'book_title': ['The Great Gatsby', 'Moby Dick', 'Pride and Prejudice'],
                        'author': ['F. Scott Fitzgerald', 'Herman Melville', 'Jane Austen']})
prices = pd.DataFrame({'book_title': ['The Great Gatsby', 'Moby Dick', 'War and Peace'],
                       'price': [10, 15, 20]})
authors, prices
(            book_title               author
 0     The Great Gatsby  F. Scott Fitzgerald
 1            Moby Dick      Herman Melville
 2  Pride and Prejudice          Jane Austen,
           book_title  price
 0   The Great Gatsby     10
 1          Moby Dick     15
 2      War and Peace     20)
# Performing an outer join
outer_join_result = authors.merge(prices, on='book_title', how='outer')

# Performing an inner join
inner_join_result = authors.merge(prices, on='book_title', how='inner')
outer_join_result, inner_join_result
(            book_title               author  price
 0     The Great Gatsby  F. Scott Fitzgerald   10.0
 1            Moby Dick      Herman Melville   15.0
 2  Pride and Prejudice          Jane Austen    NaN
 3        War and Peace                  NaN   20.0,
          book_title               author  price
 0  The Great Gatsby  F. Scott Fitzgerald     10
 1         Moby Dick      Herman Melville     15)
Pivot Tables, Transposition, and Lambda Functions
In this section, we’ll delve into some advanced data manipulation techniques in pandas, including pivot tables, transposition, and the use of lambda functions.
Pivot Tables
A pivot table is a data summarization tool used in spreadsheets and other data visualization software. It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns. In pandas, the pivot_table function provides this functionality.
The primary parameters of the pivot_table function are:
values: Column(s) to aggregate.
index: Column(s) to use as the rows of the pivot table.
columns: Column(s) to use as the columns of the pivot table.
aggfunc: Aggregation function to use on the data. It can be 'sum', 'mean', etc., or a custom function.
Let’s look at two practical examples to understand the use of pivot tables.
# Creating two example DataFrames for pivot_table demonstration

# Example 1: Sales data
sales_data = pd.DataFrame({'Date': ['2022-01-01', '2022-01-01', '2022-01-02', '2022-01-02'],
                           'Product': ['A', 'B', 'A', 'B'],
                           'Sales': [100, 150, 110, 160]})

# Pivot table to get total sales for each product across dates
pivot_sales = sales_data.pivot_table(values='Sales', index='Date', columns='Product', aggfunc='sum')

# Example 2: Student grades
grades_data = pd.DataFrame({'Student': ['Alice', 'Bob', 'Alice', 'Bob'],
                            'Subject': ['Math', 'Math', 'History', 'History'],
                            'Grade': [85, 90, 78, 88]})

# Pivot table to get grades for each student across subjects
pivot_grades = grades_data.pivot_table(values='Grade', index='Student', columns='Subject')
pivot_sales, pivot_grades
(Product       A    B
 Date
 2022-01-01  100  150
 2022-01-02  110  160,
 Subject  History  Math
 Student
 Alice         78    85
 Bob           88    90)
Transposition
Transposition is a fundamental operation in linear algebra and data manipulation. In the context of DataFrames, transposition refers to switching the rows and columns with each other. In pandas, you can transpose a DataFrame using the transpose() method or its shorthand .T.
Transposing can be useful in various scenarios, such as when you want to change the orientation of your data for visualization or when you want to treat columns as rows for certain operations.
Let’s look at an example to understand the use of transposition.
# Example DataFrame for transposition
df_transpose = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                             'Age': [25, 30, 35],
                             'City': ['New York', 'London', 'Paris']})

# Transposing the DataFrame
transposed_df = df_transpose.transpose()
df_transpose, transposed_df
(      Name  Age      City
 0    Alice   25  New York
 1      Bob   30    London
 2  Charlie   35     Paris,
             0       1        2
 Name    Alice     Bob  Charlie
 Age        25      30       35
 City  New York  London    Paris)
Lambda Functions with Pivot Tables
Lambda functions are small, anonymous functions that can have any number of arguments, but can only have one expression. They are useful for performing simple operations without the need to define a full function. In the context of pivot_table, lambda functions can be used as custom aggregation functions.
For instance, you might want to aggregate data in a way that’s not directly supported by built-in functions. In such cases, a lambda function can be handy.
Let’s look at an example where we use a lambda function with pivot_table to calculate the range (difference between max and min) of values.
# Example DataFrame for lambda function with pivot_table
df_lambda = pd.DataFrame({'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
                          'Values': [10, 15, 20, 25, 30, 35]})

# Using pivot_table with a lambda function to calculate the range of values for each category
pivot_lambda = df_lambda.pivot_table(index='Category', values='Values',
                                     aggfunc=lambda x: x.max() - x.min())
pivot_lambda
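Here category A (values 10, 15, 30) yields a range of 20, while category B (values 20, 25, 35) yields a range of 15.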
Data Cleaning and Transformation
One of the essential steps in the data preprocessing pipeline is data cleaning and transformation. This step ensures that the data is of high quality and is ready for analysis. A common issue encountered during this phase is the presence of duplicate records.
Duplicates in a DataFrame
Duplicates refer to rows in a DataFrame that are identical to another row. These can arise due to various reasons such as data entry errors, merging datasets, or not handling data updates correctly. Duplicates can lead to incorrect analysis results, so it’s crucial to identify and handle them appropriately.
In pandas, you can use the drop_duplicates() method to remove duplicate rows and the duplicated() method to identify them. The subset parameter allows you to consider certain columns for identifying duplicates.
Let’s dive into an entertaining example to understand the use of these functions.
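The code cell that produced the output below was lost in export; a minimal reconstruction, consistent with the displayed DataFrames, might look like this:

# Superhero records with some duplicate entries
df_duplicates = pd.DataFrame({
    'Superhero': ['Spider-Man', 'Iron Man', 'Spider-Man', 'Thor', 'Iron Man', 'Hulk'],
    'Real Name': ['Peter Parker', 'Tony Stark', 'Peter Parker', 'Thor Odinson', 'Tony Stark', 'Bruce Banner'],
    'City': ['New York', 'New York', 'New York', 'Asgard', 'Los Angeles', 'New York']
})

# Rows whose 'Superhero' and 'Real Name' repeat an earlier row
duplicated_rows = df_duplicates[df_duplicates.duplicated(subset=['Superhero', 'Real Name'])]

# Keep only the first occurrence of each hero
df_cleaned = df_duplicates.drop_duplicates(subset=['Superhero', 'Real Name'])
df_duplicates, duplicated_rows, df_cleaned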
(    Superhero     Real Name         City
 0  Spider-Man  Peter Parker     New York
 1    Iron Man    Tony Stark     New York
 2  Spider-Man  Peter Parker     New York
 3        Thor  Thor Odinson       Asgard
 4    Iron Man    Tony Stark  Los Angeles
 5        Hulk  Bruce Banner     New York,
     Superhero     Real Name         City
 2  Spider-Man  Peter Parker     New York
 4    Iron Man    Tony Stark  Los Angeles,
     Superhero     Real Name      City
 0  Spider-Man  Peter Parker  New York
 1    Iron Man    Tony Stark  New York
 3        Thor  Thor Odinson    Asgard
 5        Hulk  Bruce Banner  New York)
Original DataFrame (Superhero Records)
This DataFrame contains records of superheroes, their real names, and cities. As you can observe, there are some duplicate entries in the data.
df_duplicates
Identified Duplicate Rows
Using the duplicated() function with the subset parameter, we can identify rows where both the ‘Superhero’ and ‘Real Name’ columns are duplicated. Here are the rows that have been identified as duplicates:
duplicated_rows
Cleaned DataFrame (After Removing Duplicates)
After identifying the duplicate rows, we can use the drop_duplicates() function with the subset parameter to remove these rows and obtain a cleaned DataFrame. Here’s the DataFrame after removing the duplicates:
df_cleaned
Missing Data
In the realm of data analysis, missing data, often represented as NaN (Not a Number), is a common occurrence. It can arise due to various reasons, such as data entry errors, unrecorded observations, or during data processing. Handling missing data is crucial as it can significantly impact the results of your analysis.
Identifying Missing Data
Before handling missing data, it’s essential to identify them in your dataset. In pandas, the isna() or isnull() methods can be used to detect missing values. These methods return a DataFrame of the same shape as the original, but with True where the data is missing and False where it’s not.
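The cell that defined the example DataFrame is missing from the export, so the sketch below creates a hypothetical df_missing (its column names and values are our assumption, chosen so the 'Age' column works with the fillna example further down) and applies isna() to flag the gaps:

import numpy as np

# Hypothetical example DataFrame with missing values (contents are our assumption)
df_missing = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                           'Age': [25, np.nan, 35, 40],
                           'City': ['New York', 'London', np.nan, 'Paris']})

# isna() returns True where a value is missing and False where it is present
df_missing.isna()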
Once you’ve identified missing data, you might decide to remove them from your dataset. Pandas provides two main methods for this:
dropna(): This method allows you to drop rows or columns containing missing data.
fillna(): Instead of dropping missing data, you can also replace them with a specific value or a computed value (like mean, median, etc.).
# Removing rows with missing data
df_no_missing_rows = df_missing.dropna()

# Removing columns with missing data
df_no_missing_columns = df_missing.dropna(axis=1)
df_no_missing_rows, df_no_missing_columns
Filling Missing Data
Instead of removing missing data, another approach is to fill or replace them. The fillna() method in pandas allows you to replace missing values with a specific value, forward fill, backward fill, or even a computed value like the mean or median of the column.
# Filling missing data with a specific value
df_filled_value = df_missing.fillna('Unknown')

# Filling missing data with the mean of the column
df_filled_mean = df_missing.copy()
df_filled_mean['Age'] = df_filled_mean['Age'].fillna(df_filled_mean['Age'].mean())
df_filled_value, df_filled_mean
Apply Function
The apply() function in pandas is a powerful tool that allows you to apply a function along the axis of a DataFrame (either rows or columns). It’s especially useful when you want to perform custom operations that are not readily available through built-in pandas functions.
Let’s explore some entertaining examples to understand the versatility of the apply() function.
# Example 2: Calculating the length of strings
df_phrases = pd.DataFrame({'Phrase': ['Hello World', 'Pandas is awesome',
                                      'Apply function rocks!', 'Data Science']})
df_phrases['Length'] = df_phrases['Phrase'].apply(len)
df_phrases
# Example 3: Categorizing based on age
df_age = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                       'Age': [15, 28, 35, 42]})

def age_category(age):
    if age < 18:
        return 'Minor'
    elif 18 <= age < 35:
        return 'Young Adult'
    else:
        return 'Adult'

df_age['Category'] = df_age['Age'].apply(age_category)
df_age
Differences between apply and applymap
Both apply and applymap are essential tools in pandas for element-wise operations, but they serve different purposes and are used in different contexts.
apply: This function is used on both Series and DataFrame. When used on a Series, it applies a function to each element. When used on a DataFrame, it can apply a function along the axis (either rows or columns).
applymap: This function is exclusive to DataFrames and is used to apply a function to each element of the DataFrame. (In pandas 2.1+, applymap is deprecated in favor of the equivalent DataFrame.map.)
Let’s delve into some examples to understand when to use one over the other.
# Example DataFrame
df_example = pd.DataFrame({'A': [1, 2, 3],
                           'B': [4, 5, 6],
                           'C': [7, 8, 9]})

# Using apply on a Series
series_squared = df_example['A'].apply(lambda x: x**2)

# Using apply on a DataFrame to get the sum of each column
column_sum = df_example.apply(sum, axis=0)

# Using applymap to square each element of the DataFrame
df_squared = df_example.applymap(lambda x: x**2)
series_squared
column_sum
df_squared
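For this small DataFrame the results are easy to verify by hand: series_squared holds 1, 4, 9; column_sum holds 6, 15, 24 for columns A, B, and C; and df_squared contains the square of every element.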
Replacing Data
In data manipulation, there are often scenarios where you need to replace certain values in your dataset. The replace() method in pandas is a versatile tool that allows you to replace values in a DataFrame or Series.
Let’s start with a basic example of the replace() method.
# Example DataFrame
df_replace = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Cherry', 'Apple'],
                           'Color': ['Red', 'Yellow', 'Red', 'Green']})

# Using replace to change 'Apple' to 'Mango'
df_replaced = df_replace.replace('Apple', 'Mango')
df_replaced
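Beyond single values, replace() also accepts dictionaries, which is convenient for several substitutions at once; a small sketch reusing df_replace (the replacement values here are our own):

# Replacing several values in one call with a dictionary
df_multi_replaced = df_replace.replace({'Apple': 'Mango', 'Red': 'Crimson'})

# A nested dictionary restricts a replacement to a single column
df_col_replaced = df_replace.replace({'Color': {'Red': 'Crimson'}})
df_multi_replaced, df_col_replaced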
Regular Expressions (Regex)
Regular expressions, often abbreviated as regex, are sequences of characters that define a search pattern. They are incredibly powerful for string matching and manipulation. In pandas, you can use regex with the replace() method to perform more complex replacements.
Let’s delve into the world of regex and explore some examples to understand its capabilities.
# Example DataFrame with phone numbers
df_phone = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                         'Phone': ['123-456-7890', '(123) 456-7890', '123.456.7890']})

# Using regex to standardize the phone number format
df_phone_standardized = df_phone.copy()
df_phone_standardized['Phone'] = df_phone_standardized['Phone'].replace(r'\D', '', regex=True)
df_phone_standardized['Phone'] = df_phone_standardized['Phone'].replace(r'(\d{3})(\d{3})(\d{4})', r'\1-\2-\3', regex=True)
df_phone
df_phone_standardized
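All three input formats collapse to the same standardized form, 123-456-7890, once the non-digits are stripped and the digits are regrouped.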
Deep Dive into Regular Expressions (Regex)
Regular expressions (regex) are a powerful tool for working with text data. They allow you to create search patterns using sequences of characters, which can be used for string matching and manipulation. The true power of regex lies in its flexibility and the wide range of patterns it can match.
Let’s break down some of the fundamental components of regex and explore various examples to understand its capabilities.
Basic Regex Patterns
. : Matches any character except a newline.
^ : Matches the start of a string.
$ : Matches the end of a string.
* : Matches 0 or more repetitions of the preceding character or group.
+ : Matches 1 or more repetitions of the preceding character or group.
? : Matches 0 or 1 repetition of the preceding character or group.
\d : Matches any decimal digit. Equivalent to [0-9].
\D : Matches any non-digit character.
\w : Matches any alphanumeric character or underscore. Equivalent to [a-zA-Z0-9_].
\W : Matches any non-alphanumeric character.
\s : Matches any whitespace character (spaces, tabs, line breaks).
\S : Matches any non-whitespace character.
Let’s see some of these patterns in action with examples.
import re

# Sample text (no trailing period, so the $ anchor can match the zip code)
text = 'My phone number is 123-456-7890 and my zip code is 98765'

# Extracting the phone number using regex
phone_pattern = r'\d{3}-\d{3}-\d{4}'
phone_match = re.search(phone_pattern, text)
phone_number = phone_match.group() if phone_match else None

# Extracting the zip code using regex
zip_pattern = r'\d{5}$'
zip_match = re.search(zip_pattern, text)
zip_code = zip_match.group() if zip_match else None
phone_number, zip_code
In the example above, we used the following regex patterns:
\d{3}-\d{3}-\d{4}: This pattern matches a phone number format where there are three digits, followed by a hyphen, another three digits, another hyphen, and finally four digits.
\d{5}$: This pattern matches five digits at the end of a string, which is a common format for zip codes in the US.
Now, let’s explore more advanced regex patterns and their applications.
Advanced Regex Patterns
Character Sets [...]: Matches any one of the characters inside the square brackets. For example, [aeiou] matches any vowel.
Negated Character Sets [^...]: Matches any character that is not inside the square brackets. For example, [^aeiou] matches any non-vowel character.
Quantifiers {m,n}: Matches between m and n repetitions of the preceding character or group. For example, a{2,4} matches 'aa', 'aaa', or 'aaaa'.
Non-capturing Groups (?:...): Groups multiple patterns together without creating a capture group.
Positive Lookahead (?=...): Asserts that what directly follows the current position in the string matches the pattern inside the lookahead, but doesn't consume any characters.
Negative Lookahead (?!...): Asserts that what directly follows the current position in the string does not match the pattern inside the lookahead.
Let’s see some of these advanced patterns in action with examples.
# Sample text for regex examples (double-quoted because it contains an apostrophe)
text_advanced = "The price is $100.00, but there's a discount of 10% if you pay within 5 days."

# Extracting all prices using regex (character sets)
price_pattern = r'\$[0-9]+\.[0-9]{2}'
prices = re.findall(price_pattern, text_advanced)

# Extracting words that don't start with a vowel (negated character sets)
non_vowel_pattern = r'\b[^aeiouAEIOU \d][a-zA-Z]*'
non_vowel_words = re.findall(non_vowel_pattern, text_advanced)

# Using positive lookahead to find 'discount' if it's followed by ' of 10%'
lookahead_pattern = r'discount(?= of 10%)'
lookahead_match = re.search(lookahead_pattern, text_advanced)
lookahead_word = lookahead_match.group() if lookahead_match else None
prices, non_vowel_words, lookahead_word
In the examples above, we utilized various advanced regex patterns:
Character Sets: The pattern \$[0-9]+\.[0-9]{2} matches dollar amounts. It looks for a dollar sign, followed by one or more digits, a period, and exactly two digits after the period.
Negated Character Sets: The pattern \b[^aeiouAEIOU \d][a-zA-Z]* matches words that don’t start with a vowel. It looks for word boundaries (\b), followed by any character that is not a vowel or a digit, and then any sequence of alphabetic characters.
Positive Lookahead: The pattern discount(?= of 10%) matches the word 'discount' only if it's directly followed by ' of 10%'. The positive lookahead (?=...) checks for the presence of a pattern without consuming any characters, allowing us to match based on what follows our main pattern.
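The list above also introduced negative lookahead, which the example does not exercise; here is a small sketch on the same text_advanced, matching runs of digits that are not followed by a percent sign:

# Negative lookahead: match runs of digits NOT followed by '%'
# The \b before the lookahead stops the engine from backtracking into '10' and matching just '1'
no_percent_pattern = r'\d+\b(?!%)'
re.findall(no_percent_pattern, text_advanced)   # ['100', '00', '5']; the '10' before '%' is skipped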
Regular expressions are a vast topic, and there’s a lot more to explore. The key is to practice and experiment with different patterns to become proficient.
Exercises
Now that you’ve learned about various data manipulation techniques in pandas and regular expressions, it’s time to test your knowledge! Below are 10 exercises that cover the topics discussed in this notebook. Try to solve each one to reinforce your understanding.
Merging DataFrames: Given two DataFrames, df1 with columns ['A', 'B'] and df2 with columns ['B', 'C'], merge them on column ‘B’.
Pivot Tables: Create a pivot table from a DataFrame df with columns ['Date', 'Product', 'Sales'] to show the total sales for each product over time.
Transpose: Transpose a DataFrame df and explain what happens to its indices and columns.
Lambda Functions: Use a lambda function to square all the values in a DataFrame column ‘X’.
Regex Phone Numbers: Extract all phone numbers from a text string. Consider phone numbers to be in the formats 123-456-7890, (123) 456-7890, and 123.456.7890.
Handling Duplicates: Identify and remove any duplicate rows in a DataFrame df based on columns ‘A’ and ‘B’.
Handling Missing Data: Replace all NaN values in a DataFrame df with the mean of the column they are in.
Apply Function: Apply a function that calculates the length of each string in the ‘Name’ column of a DataFrame df.
Replace with Regex: Replace all occurrences of the word ‘color’ (case-insensitive) in a text string with the word ‘hue’ using regex.
Advanced Regex: Extract all email addresses from a text string. Consider email addresses to be in the format name@domain.com.