Class 5: Data Manipulation I

Set Logic

In mathematics, a set is a collection of distinct objects, considered as an object in its own right. The operations associated with sets include:

  • Union: Represents all elements that are in either set. Denoted as \(A \cup B\).
  • Intersection: Represents all elements that are common to both sets. Denoted as \(A \cap B\).
  • Difference: Represents all elements that are in the first set but not in the second. Denoted as \(A - B\).
  • Complement: Represents all elements that are not in the given set. Denoted as \(\overline{A}\).

These operations form the foundation of set theory and are crucial when dealing with collections of data.
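
As a quick illustration, Python’s built-in set type supports these operations directly. Here is a minimal sketch with made-up sets; note that Python has no built-in complement, so we take it relative to an explicit universe U:

A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

print(A | B)  # union: {1, 2, 3, 4, 5, 6}
print(A & B)  # intersection: {3, 4}
print(A - B)  # difference: {1, 2}

# Complement of A relative to a chosen universe U
U = {1, 2, 3, 4, 5, 6, 7}
print(U - A)  # {5, 6, 7}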

Pandas Merge Function

The merge function in pandas is used to combine two DataFrames based on common columns or indices. It’s similar to SQL JOIN operations. The key parameters include:

  • right: The DataFrame to merge with.
  • how: The type of join to perform: 'inner' (the default), 'outer', 'left', or 'right'.
  • on: Column(s) to join on. These columns must exist in both DataFrames.
  • left_on: Column(s) from the left DataFrame to use as keys.
  • right_on: Column(s) from the right DataFrame to use as keys.

Let’s create two example DataFrames to demonstrate the use of merge with right_on and left_on.

import pandas as pd

# Creating two example DataFrames
students = pd.DataFrame({
    'student_id': [101, 102, 103],
    'name': ['Alice', 'Bob', 'Charlie']
})

courses = pd.DataFrame({
    'registration_no': [101, 102, 104],
    'course': ['Math', 'Physics', 'Chemistry']
})

# Merging the DataFrames using merge with right_on and left_on
merged_df = students.merge(courses, left_on='student_id', right_on='registration_no', how='inner')

merged_df
   student_id   name  registration_no   course
0         101  Alice              101     Math
1         102    Bob              102  Physics

Outer Join vs Inner Join

When merging or joining DataFrames in pandas, it’s essential to understand the type of join you want to perform. The two most common types are outer join and inner join:

  • Outer Join: Returns all rows from both DataFrames (the union of the keys). Where a row has no match in the other DataFrame, the missing side is filled with NaN.

  • Inner Join: Returns only the rows whose key appears in both DataFrames (the intersection of the keys); unmatched rows are dropped from the result.

To illustrate the difference between these two types of joins, let’s consider an example. Suppose we have a DataFrame of authors and their books, and another DataFrame of books and their prices. We want to merge these DataFrames based on the book titles. Let’s first create and display these DataFrames.

# Creating two example DataFrames
authors = pd.DataFrame({
    'book_title': ['The Great Gatsby', 'Moby Dick', 'Pride and Prejudice'],
    'author': ['F. Scott Fitzgerald', 'Herman Melville', 'Jane Austen']
})

prices = pd.DataFrame({
    'book_title': ['The Great Gatsby', 'Moby Dick', 'War and Peace'],
    'price': [10, 15, 20]
})

authors, prices
(            book_title               author
 0     The Great Gatsby  F. Scott Fitzgerald
 1            Moby Dick      Herman Melville
 2  Pride and Prejudice          Jane Austen,
          book_title  price
 0  The Great Gatsby     10
 1         Moby Dick     15
 2     War and Peace     20)
# Performing an outer join
outer_join_result = authors.merge(prices, on='book_title', how='outer')

# Performing an inner join
inner_join_result = authors.merge(prices, on='book_title', how='inner')

outer_join_result, inner_join_result
(            book_title               author  price
 0     The Great Gatsby  F. Scott Fitzgerald   10.0
 1            Moby Dick      Herman Melville   15.0
 2  Pride and Prejudice          Jane Austen    NaN
 3        War and Peace                  NaN   20.0,
          book_title               author  price
 0  The Great Gatsby  F. Scott Fitzgerald     10
 1         Moby Dick      Herman Melville     15)
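
The how parameter also accepts 'left' and 'right'. As a brief sketch using the same DataFrames (the result variable names here are illustrative):

# Left join: keep every row of `authors`; books without a price get NaN
left_join_result = authors.merge(prices, on='book_title', how='left')

# Right join: keep every row of `prices`; books without an author get NaN
right_join_result = authors.merge(prices, on='book_title', how='right')

left_join_result, right_join_result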

Pivot Tables, Transposition, and Lambda Functions

In this section, we’ll delve into some advanced data manipulation techniques in pandas, including pivot tables, transposition, and the use of lambda functions.

Pivot Tables

A pivot table is a data summarization tool used in spreadsheets and other data visualization software. It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns. In pandas, the pivot_table function provides this functionality.

The primary parameters of the pivot_table function are:

  • values: Column(s) to aggregate.
  • index: Column(s) to use as the rows of the pivot table.
  • columns: Column(s) to use as the columns of the pivot table.
  • aggfunc: Aggregation function to apply, such as 'sum', 'mean', or a custom function.

Let’s look at two practical examples to understand the use of pivot tables.

# Creating two example DataFrames for pivot_table demonstration

# Example 1: Sales data
sales_data = pd.DataFrame({
    'Date': ['2022-01-01', '2022-01-01', '2022-01-02', '2022-01-02'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 110, 160]
})

# Pivot table to get total sales for each product across dates
pivot_sales = sales_data.pivot_table(values='Sales', index='Date', columns='Product', aggfunc='sum')

# Example 2: Student grades
grades_data = pd.DataFrame({
    'Student': ['Alice', 'Bob', 'Alice', 'Bob'],
    'Subject': ['Math', 'Math', 'History', 'History'],
    'Grade': [85, 90, 78, 88]
})

# Pivot table to get grades for each student across subjects
pivot_grades = grades_data.pivot_table(values='Grade', index='Student', columns='Subject')

pivot_sales, pivot_grades
(Product       A    B
 Date                
 2022-01-01  100  150
 2022-01-02  110  160,
 Subject  History  Math
 Student               
 Alice         78    85
 Bob           88    90)

Transposition

Transposition is a fundamental operation in linear algebra and data manipulation. In the context of DataFrames, transposing swaps the rows and columns: row labels become column headers and vice versa. In pandas, you can transpose a DataFrame using the transpose() method or its shorthand .T.

Transposing can be useful in various scenarios, such as when you want to change the orientation of your data for visualization or when you want to treat columns as rows for certain operations.

Let’s look at an example to understand the use of transposition.

# Example DataFrame for transposition
df_transpose = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris']
})

# Transposing the DataFrame
transposed_df = df_transpose.transpose()

df_transpose, transposed_df
(      Name  Age      City
 0    Alice   25  New York
 1      Bob   30    London
 2  Charlie   35     Paris,
              0       1        2
 Name     Alice     Bob  Charlie
 Age         25      30       35
 City  New York  London    Paris)

Lambda Functions with Pivot Tables

Lambda functions are small, anonymous functions that can have any number of arguments, but can only have one expression. They are useful for performing simple operations without the need to define a full function. In the context of pivot_table, lambda functions can be used as custom aggregation functions.

For instance, you might want to aggregate data in a way that’s not directly supported by built-in functions. In such cases, a lambda function can be handy.

Let’s look at an example where we use a lambda function with pivot_table to calculate the range (difference between max and min) of values.

# Example DataFrame for lambda function with pivot_table
df_lambda = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Values': [10, 15, 20, 25, 30, 35]
})

# Using pivot_table with lambda function to calculate the range of values for each category
pivot_lambda = df_lambda.pivot_table(index='Category', values='Values', aggfunc=lambda x: x.max() - x.min())

pivot_lambda
          Values
Category        
A             20
B             15

Data Cleaning and Transformation

One of the essential steps in the data preprocessing pipeline is data cleaning and transformation. This step ensures that the data is of high quality and is ready for analysis. A common issue encountered during this phase is the presence of duplicate records.

Duplicates in a DataFrame

Duplicates refer to rows in a DataFrame that are identical to another row. These can arise due to various reasons such as data entry errors, merging datasets, or not handling data updates correctly. Duplicates can lead to incorrect analysis results, so it’s crucial to identify and handle them appropriately.

In pandas, you can use the drop_duplicates() method to remove duplicate rows and the duplicated() method to identify them. The subset parameter allows you to consider certain columns for identifying duplicates.

Let’s dive into an entertaining example to understand the use of these functions.

# Example DataFrame with duplicate records
df_duplicates = pd.DataFrame({
    'Superhero': ['Spider-Man', 'Iron Man', 'Spider-Man', 'Thor', 'Iron Man', 'Hulk'],
    'Real Name': ['Peter Parker', 'Tony Stark', 'Peter Parker', 'Thor Odinson', 'Tony Stark', 'Bruce Banner'],
    'City': ['New York', 'New York', 'New York', 'Asgard', 'Los Angeles', 'New York']
})

# Identifying duplicate rows
duplicated_rows = df_duplicates[df_duplicates.duplicated(subset=['Superhero', 'Real Name'])]

# Removing duplicate rows
df_cleaned = df_duplicates.drop_duplicates(subset=['Superhero', 'Real Name'])

df_duplicates, duplicated_rows, df_cleaned
(    Superhero     Real Name         City
 0  Spider-Man  Peter Parker     New York
 1    Iron Man    Tony Stark     New York
 2  Spider-Man  Peter Parker     New York
 3        Thor  Thor Odinson       Asgard
 4    Iron Man    Tony Stark  Los Angeles
 5        Hulk  Bruce Banner     New York,
     Superhero     Real Name         City
 2  Spider-Man  Peter Parker     New York
 4    Iron Man    Tony Stark  Los Angeles,
     Superhero     Real Name      City
 0  Spider-Man  Peter Parker  New York
 1    Iron Man    Tony Stark  New York
 3        Thor  Thor Odinson    Asgard
 5        Hulk  Bruce Banner  New York)

Original DataFrame (Superhero Records)

This DataFrame contains records of superheroes, their real names, and cities. As you can observe, there are some duplicate entries in the data.

df_duplicates

Identified Duplicate Rows

Using the duplicated() function with the subset parameter, we can identify rows where both the ‘Superhero’ and ‘Real Name’ columns are duplicated. Here are the rows that have been identified as duplicates:

duplicated_rows

Cleaned DataFrame (After Removing Duplicates)

After identifying the duplicate rows, we can use the drop_duplicates() function with the subset parameter to remove these rows and obtain a cleaned DataFrame. Here’s the DataFrame after removing the duplicates:

df_cleaned

Missing Data

In the realm of data analysis, missing data, often represented as NaN (Not a Number), is a common occurrence. It can arise due to various reasons, such as data entry errors, unrecorded observations, or during data processing. Handling missing data is crucial as it can significantly impact the results of your analysis.

Identifying Missing Data

Before handling missing data, it’s essential to identify them in your dataset. In pandas, the isna() or isnull() methods can be used to detect missing values. These methods return a DataFrame of the same shape as the original, but with True where the data is missing and False where it’s not.

import numpy as np

# Example DataFrame with missing data
df_missing = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, 40],
    'City': ['New York', 'Los Angeles', np.nan, 'Chicago']
})

# Identifying missing data
df_missing.isna()

Removing Missing Data

Once you’ve identified missing data, you might decide to remove them from your dataset. Pandas provides two complementary methods:

  • dropna(): Drops rows (or, with axis=1, columns) that contain missing data.
  • fillna(): Instead of dropping missing data, replaces them with a specific value or a computed value (like the mean or median); this is covered in the next subsection.
# Removing rows with missing data
df_no_missing_rows = df_missing.dropna()

# Removing columns with missing data
df_no_missing_columns = df_missing.dropna(axis=1)

df_no_missing_rows, df_no_missing_columns

Filling Missing Data

Instead of removing missing data, another approach is to fill or replace them. The fillna() method in pandas allows you to replace missing values with a specific value, forward fill, backward fill, or even a computed value like the mean or median of the column.

# Filling missing data with a specific value
df_filled_value = df_missing.fillna('Unknown')

# Filling missing data with mean of the column
df_filled_mean = df_missing.copy()
df_filled_mean['Age'] = df_filled_mean['Age'].fillna(df_filled_mean['Age'].mean())

df_filled_value, df_filled_mean
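
The forward and backward fills mentioned above are available as dedicated methods. A minimal sketch reusing the same df_missing (the result names are illustrative):

# Forward fill: propagate the last valid observation down each column
df_ffill = df_missing.ffill()

# Backward fill: pull the next valid observation up each column
df_bfill = df_missing.bfill()

df_ffill, df_bfill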

Apply Function

The apply() function in pandas is a powerful tool that allows you to apply a function along the axis of a DataFrame (either rows or columns). It’s especially useful when you want to perform custom operations that are not readily available through built-in pandas functions.

Let’s explore some entertaining examples to understand the versatility of the apply() function.

# Example 1: Capitalizing names
df_names = pd.DataFrame({
    'Name': ['alice', 'bob', 'charlie', 'david']
})

df_names['Capitalized Name'] = df_names['Name'].apply(lambda x: x.capitalize())
df_names
# Example 2: Calculating the length of strings
df_phrases = pd.DataFrame({
    'Phrase': ['Hello World', 'Pandas is awesome', 'Apply function rocks!', 'Data Science']
})

df_phrases['Length'] = df_phrases['Phrase'].apply(len)
df_phrases
# Example 3: Categorizing based on age
df_age = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [15, 28, 35, 42]
})

def age_category(age):
    if age < 18:
        return 'Minor'
    elif 18 <= age < 35:
        return 'Young Adult'
    else:
        return 'Adult'

df_age['Category'] = df_age['Age'].apply(age_category)
df_age

Differences between apply and applymap

Both apply and applymap are essential tools in pandas for element-wise operations, but they serve different purposes and are used in different contexts.

  • apply: This function is used on both Series and DataFrame. When used on a Series, it applies a function to each element. When used on a DataFrame, it can apply a function along the axis (either rows or columns).

  • applymap: This function is exclusive to DataFrames and applies a function to each element of the DataFrame. Note that pandas 2.1 renamed it to DataFrame.map; the applymap name is deprecated in recent versions.

Let’s delve into some examples to understand when to use one over the other.

# Example DataFrame
df_example = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

# Using apply on a Series
series_squared = df_example['A'].apply(lambda x: x**2)

# Using apply on a DataFrame to get the sum of each column
column_sum = df_example.apply(sum, axis=0)

# Using applymap to square each element of the DataFrame
df_squared = df_example.applymap(lambda x: x**2)
series_squared
0    1
1    4
2    9
Name: A, dtype: int64
column_sum
A     6
B    15
C    24
dtype: int64
df_squared
   A   B   C
0  1  16  49
1  4  25  64
2  9  36  81
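
Since pandas 2.1, applymap is deprecated in favor of the equivalent DataFrame.map; on recent versions the last step above would be written as:

# pandas 2.1+: DataFrame.map replaces the deprecated applymap
df_squared = df_example.map(lambda x: x**2)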

Replacing Data

In data manipulation, there are often scenarios where you need to replace certain values in your dataset. The replace() method in pandas is a versatile tool that allows you to replace values in a DataFrame or Series.

Let’s start with a basic example of the replace() method.

# Example DataFrame
df_replace = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Apple'],
    'Color': ['Red', 'Yellow', 'Red', 'Green']
})

# Using replace to change 'Apple' to 'Mango'
df_replaced = df_replace.replace('Apple', 'Mango')
df_replaced

Regular Expressions (Regex)

Regular expressions, often abbreviated as regex, are sequences of characters that define a search pattern. They are incredibly powerful for string matching and manipulation. In pandas, you can use regex with the replace() method to perform more complex replacements.

Let’s delve into the world of regex and explore some examples to understand its capabilities.

# Example DataFrame with phone numbers
df_phone = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Phone': ['123-456-7890', '(123) 456-7890', '123.456.7890']
})

# Using regex to standardize phone number format
df_phone_standardized = df_phone.copy()
df_phone_standardized['Phone'] = df_phone_standardized['Phone'].replace(r'\D', '', regex=True)
df_phone_standardized['Phone'] = df_phone_standardized['Phone'].replace(r'(\d{3})(\d{3})(\d{4})', r'\1-\2-\3', regex=True)
df_phone
      Name           Phone
0    Alice    123-456-7890
1      Bob  (123) 456-7890
2  Charlie    123.456.7890
df_phone_standardized
      Name         Phone
0    Alice  123-456-7890
1      Bob  123-456-7890
2  Charlie  123-456-7890

Deep Dive into Regular Expressions (Regex)

Regular expressions (regex) are a powerful tool for working with text data. They allow you to create search patterns using sequences of characters, which can be used for string matching and manipulation. The true power of regex lies in its flexibility and the wide range of patterns it can match.

Let’s break down some of the fundamental components of regex and explore various examples to understand its capabilities.

Basic Regex Patterns

  • . : Matches any character except a newline.
  • ^ : Matches the start of a string.
  • $ : Matches the end of a string.
  • * : Matches 0 or more repetitions of the preceding character or group.
  • + : Matches 1 or more repetitions of the preceding character or group.
  • ? : Matches 0 or 1 repetition of the preceding character or group.
  • \d : Matches any decimal digit. Equivalent to [0-9].
  • \D : Matches any non-digit character.
  • \w : Matches any alphanumeric character or underscore. Equivalent to [a-zA-Z0-9_].
  • \W : Matches any non-alphanumeric character.
  • \s : Matches any whitespace character (spaces, tabs, line breaks).
  • \S : Matches any non-whitespace character.
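
As a quick warm-up, here is an illustrative sketch (with a made-up string) showing a few of these tokens via Python’s re module:

import re

sample = 'Data science 101: clean, transform, analyze.'

print(re.findall(r'\w+', sample))  # ['Data', 'science', '101', 'clean', 'transform', 'analyze']
print(re.findall(r'\d+', sample))  # ['101']
print(bool(re.match(r'^Data', sample)))        # True: the string starts with 'Data'
print(bool(re.search(r'analyze\.$', sample)))  # True: 'analyze.' sits at the end (escaped dot)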

Let’s see some of these patterns in action with examples.

import re

# Sample text (it ends with the zip code, so the $ anchor below can match)
text = 'My phone number is 123-456-7890 and my zip code is 98765'

# Extracting phone number using regex
phone_pattern = r'\d{3}-\d{3}-\d{4}'
phone_match = re.search(phone_pattern, text)
phone_number = phone_match.group() if phone_match else None

# Extracting zip code using regex
zip_pattern = r'\d{5}$'
zip_match = re.search(zip_pattern, text)
zip_code = zip_match.group() if zip_match else None

phone_number, zip_code

In the example above, we used the following regex patterns:

  • \d{3}-\d{3}-\d{4}: This pattern matches a phone number format where there are three digits, followed by a hyphen, another three digits, another hyphen, and finally four digits.

  • \d{5}$: This pattern matches five digits at the end of a string, which is a common format for zip codes in the US.

Now, let’s explore more advanced regex patterns and their applications.

Advanced Regex Patterns

  • Character Sets [...]: Matches any one of the characters inside the square brackets. For example, [aeiou] matches any vowel.
  • Negated Character Sets [^...]: Matches any character that is not inside the square brackets. For example, [^aeiou] matches any non-vowel character.
  • Quantifiers {m,n}: Matches between m and n repetitions of the preceding character or group. For example, a{2,4} matches ‘aa’, ‘aaa’, or ‘aaaa’.
  • Non-capturing Groups (?:...): Groups multiple patterns together without creating a capture group.
  • Positive Lookahead (?=...): Asserts that what directly follows the current position in the string matches the pattern inside the lookahead, but doesn’t consume any characters.
  • Negative Lookahead (?!...): Asserts that what directly follows the current position in the string does not match the pattern inside the lookahead.
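
Before the worked example, here is a small sketch (with made-up strings, assuming re is imported) of the quantifier, non-capturing group, and negative lookahead patterns:

import re

print(re.findall(r'a{2,4}', 'a aa aaa aaaaa'))   # ['aa', 'aaa', 'aaaa']: greedy within each run
print(re.findall(r'(?:ab)+', 'ab abab ababab'))  # ['ab', 'abab', 'ababab']: whole matches, no capture group
print(re.findall(r'\b\d+\b(?!%)', '10% off on orders over 50 dollars'))  # ['50']: the '%'-suffixed number is skipped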

Let’s see some of these advanced patterns in action with examples.

# Sample text for regex examples
text_advanced = "The price is $100.00, but there's a discount of 10% if you pay within 5 days."

# Extracting all prices using regex (character sets)
price_pattern = r'\$[0-9]+\.[0-9]{2}'
prices = re.findall(price_pattern, text_advanced)

# Extracting words that don't start with a vowel (negated character sets)
non_vowel_pattern = r'\b[^aeiouAEIOU \d][a-zA-Z]*'
non_vowel_words = re.findall(non_vowel_pattern, text_advanced)

# Using positive lookahead to find 'discount' if it's followed by '10%'
lookahead_pattern = r'discount(?= of 10%)'
lookahead_match = re.search(lookahead_pattern, text_advanced)
lookahead_word = lookahead_match.group() if lookahead_match else None

prices, non_vowel_words, lookahead_word

In the examples above, we utilized various advanced regex patterns:

  • Character Sets: The pattern \$[0-9]+\.[0-9]{2} matches dollar amounts. It looks for a dollar sign, followed by one or more digits, a period, and exactly two digits after the period.

  • Negated Character Sets: The pattern \b[^aeiouAEIOU \d][a-zA-Z]* matches words that don’t start with a vowel. It looks for word boundaries (\b), followed by any character that is not a vowel or a digit, and then any sequence of alphabetic characters.

  • Positive Lookahead: The pattern discount(?= of 10%) matches the word ‘discount’ only if it’s directly followed by ’ of 10%’. The positive lookahead (?=...) checks for the presence of a pattern without consuming any characters, allowing us to match based on what follows our main pattern.

Regular expressions are a vast topic, and there’s a lot more to explore. The key is to practice and experiment with different patterns to become proficient.

Exercises

Now that you’ve learned about various data manipulation techniques in pandas and regular expressions, it’s time to test your knowledge! Below are 10 exercises that cover the topics discussed in this notebook. Try to solve each one to reinforce your understanding.

  1. Merging DataFrames: Given two DataFrames, df1 with columns ['A', 'B'] and df2 with columns ['B', 'C'], merge them on column ‘B’.

  2. Pivot Tables: Create a pivot table from a DataFrame df with columns ['Date', 'Product', 'Sales'] to show the total sales for each product over time.

  3. Transpose: Transpose a DataFrame df and explain what happens to its indices and columns.

  4. Lambda Functions: Use a lambda function to square all the values in a DataFrame column ‘X’.

  5. Regex Phone Numbers: Extract all phone numbers from a text string. Consider phone numbers to be in the formats 123-456-7890, (123) 456-7890, and 123.456.7890.

  6. Handling Duplicates: Identify and remove any duplicate rows in a DataFrame df based on columns ‘A’ and ‘B’.

  7. Handling Missing Data: Replace all NaN values in a DataFrame df with the mean of the column they are in.

  8. Apply Function: Apply a function that calculates the length of each string in the ‘Name’ column of a DataFrame df.

  9. Replace with Regex: Replace all occurrences of the word ‘color’ (case-insensitive) in a text string with the word ‘hue’ using regex.

  10. Advanced Regex: Extract all email addresses from a text string. Consider email addresses to be in the format name@domain.com.