Self-Learning Lab: Python Dataframes with Examples
Welcome to this self-learning lab on Python dataframes! In this lab, you will explore and practice key concepts related to dataframes using the pandas library. The lab is designed to be completed in approximately two hours. For each topic, we provide a brief overview, an example, and then an exercise for you to complete. The exercises are framed in a business context, so you can apply these concepts in real-world scenarios.

## Contents:

1. Introduction to Dataframes in Pandas
2. Data Transformation: Creating New Columns
3. Filtering and Indexing in Dataframes
4. The apply Function and Lambda Expressions
5. One Hot Encoding

Let's dive in!
1. Introduction to Dataframes in Pandas
A DataFrame is a 2-dimensional labeled data structure with columns that can be of different types. It is similar to a spreadsheet, a SQL table, or the `data.frame` in R. The DataFrame is one of the most commonly used pandas objects and is designed to handle a mix of numeric and non-numeric data.

### Example:

Let's create a simple dataframe with product sales data.
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values
data = {
    'Product Name': ['Laptop', 'Mouse', 'Keyboard'],
    'Units Sold': [50, 150, 80],
    'Price per Unit': [800, 20, 30]
}
df = pd.DataFrame(data)
print(df)
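To make "columns that can be of different types" concrete, here is a minimal sketch that rebuilds the same dataframe and inspects its shape, column types, and labels. The exact dtype names printed can vary with the pandas version and platform, so treat the output as illustrative.

```python
import pandas as pd

# Same sales data as in the example above
data = {
    'Product Name': ['Laptop', 'Mouse', 'Keyboard'],
    'Units Sold': [50, 150, 80],
    'Price per Unit': [800, 20, 30],
}
df = pd.DataFrame(data)

# shape is a (number of rows, number of columns) tuple
print(df.shape)

# Each column carries its own dtype: the text column and the
# integer columns are stored with different types
print(df.dtypes)

# Column labels and the default integer row index
print(list(df.columns))
print(list(df.index))
```

Checking `df.dtypes` early is a useful habit, because numeric columns that were accidentally read as text will not support arithmetic the way you expect.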
Exercise 1.1:
Imagine you are a data analyst at a retail company. You have been given sales data for the past month. The data includes the product name, the number of units sold, and the price per unit. Using the pandas library, create a dataframe with the following data:

| Product Name | Units Sold | Price per Unit |
|--------------|------------|----------------|
| Laptop       | 50         | 800            |
| Mouse        | 150        | 20             |
| Keyboard     | 80         | 30             |

Calculate the total sales for each product and add it as a new column to the dataframe.
2. Data Transformation: Creating New Columns
When working with data, you will often want to create new columns based on the existing columns in the dataframe. This can be done using various operations and functions.

### Example:

Given the sales data, let's calculate the total sales for each product.
# Multiply the two columns element by element to get sales per product
df['Total Sales'] = df['Units Sold'] * df['Price per Unit']
print(df)
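A new column can also combine an existing column with an aggregate of that column. The sketch below is one illustration: it derives each product's share of overall sales. The `Sales Share (%)` column name is hypothetical and not part of the lab's dataset.

```python
import pandas as pd

df = pd.DataFrame({
    'Product Name': ['Laptop', 'Mouse', 'Keyboard'],
    'Units Sold': [50, 150, 80],
    'Price per Unit': [800, 20, 30],
})

# Column-by-column arithmetic, as in the example above
df['Total Sales'] = df['Units Sold'] * df['Price per Unit']

# Divide each row's sales by the column total, convert to a percentage,
# and round to one decimal place ('Sales Share (%)' is a hypothetical name)
df['Sales Share (%)'] = (df['Total Sales'] / df['Total Sales'].sum() * 100).round(1)

print(df)
```

Operations like these are vectorized: they act on the whole column at once, so no explicit loop over rows is needed.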
Exercise 2.1:
Continuing with the retail company scenario, let's say you have been given additional data regarding the cost of each product. You are required to calculate the profit for each product. Add the following data to your dataframe:

| Product Name | Cost per Unit |
|--------------|---------------|
| Laptop       | 600           |
| Mouse        | 10            |
| Keyboard     | 20            |

Create a new column in the dataframe to calculate the profit for each product. Profit is calculated as `(Price per Unit - Cost per Unit) * Units Sold`.
3. Filtering and Indexing in Dataframes
Filtering selects the rows that satisfy a condition. Indexing selects specific rows and columns, by label or by position.

### Example:

From the sales data, let's extract the details of products that have sold more than 100 units.
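# Keep only the rows where the condition is True
high_selling_products = df[df['Units Sold'] > 100]
# Select two columns by passing a list of column names
print(high_selling_products[['Product Name', 'Units Sold']])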
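The example above covers filtering. As a rough sketch of the indexing side, the block below uses `loc` (label-based) and `iloc` (position-based) on the same data, and shows how two conditions can be combined. The variable names and the price threshold of 100 are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Product Name': ['Laptop', 'Mouse', 'Keyboard'],
    'Units Sold': [50, 150, 80],
    'Price per Unit': [800, 20, 30],
})

# loc: choose rows with a boolean condition and columns by name
cheap = df.loc[df['Price per Unit'] < 100, ['Product Name', 'Price per Unit']]

# iloc: choose rows and columns by integer position
# (here: the first two rows and the first two columns)
first_two = df.iloc[:2, :2]

# Combine conditions with & (and) or | (or);
# each condition must be wrapped in parentheses
popular_and_cheap = df[(df['Units Sold'] > 100) & (df['Price per Unit'] < 100)]

print(cheap)
print(first_two)
print(popular_and_cheap)
```

Using `loc` with both a row condition and a column list does the filtering and the column selection in a single step.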
Exercise 3.1:
From the sales data, extract the details of products that have sold more than 100 units. Display only the ‘Product Name’ and ‘Units Sold’ columns.

### Exercise 3.2:

From the sales data, extract the details of the product that has the highest profit. Display the ‘Product Name’ and ‘Profit’ columns.
4. The apply Function and Lambda Expressions
The `apply` function applies a function along an axis of the dataframe; called on a single column, it applies the function to each value in that column. Lambda expressions are small anonymous functions that are convenient to pass to functions like `apply`.

### Example:

Using the sales data, let's create a new column ‘Category’ that categorizes products as ‘High Selling’ if units sold are greater than 100, and ‘Low Selling’ otherwise.
# The lambda receives one 'Units Sold' value at a time and returns a label
df['Category'] = df['Units Sold'].apply(lambda x: 'High Selling' if x > 100 else 'Low Selling')
print(df[['Product Name', 'Category']])
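The example above calls `apply` on a single column, so the lambda receives one value at a time. As a minimal sketch of applying a function along an axis of the whole dataframe, the block below passes `axis=1` so that the function receives one row at a time and can read several columns. The `Revenue Tier` column, the `revenue_tier` helper, and the 5000 threshold are hypothetical and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Product Name': ['Laptop', 'Mouse', 'Keyboard'],
    'Units Sold': [50, 150, 80],
    'Price per Unit': [800, 20, 30],
})

def revenue_tier(row):
    """Label a row by its revenue (hypothetical threshold of 5000)."""
    revenue = row['Units Sold'] * row['Price per Unit']
    return 'High Revenue' if revenue > 5000 else 'Low Revenue'

# axis=1 passes each row to the function as a Series,
# so values can be looked up by column name
df['Revenue Tier'] = df.apply(revenue_tier, axis=1)

print(df[['Product Name', 'Revenue Tier']])
```

A named function is used here instead of a lambda because the logic spans two lines. For simple arithmetic, plain column operations are usually faster than a row-wise `apply`.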
Exercise 4.1:
Using the sales data, create a new column ‘Category’ that categorizes products as ‘High Selling’ if units sold are greater than 100, and ‘Low Selling’ otherwise. Use the `apply` function and a lambda expression to achieve this.

### Exercise 4.2:

Calculate the average profit per unit for each product and add it as a new column to the dataframe. Use the `apply` function to achieve this.
5. One Hot Encoding
One-hot encoding converts a categorical column into a set of indicator columns so that the data can be passed to machine learning algorithms, many of which expect numeric input. Each distinct value in the column becomes a new column, and each row gets a 1 (True) in the column that matches its value and a 0 (False) in the others.

### Example:

Using the sales data, let's one-hot encode the ‘Category’ column.
# Replace 'Category' with one indicator column per distinct category value
encoded_df = pd.get_dummies(df, columns=['Category'])
print(encoded_df)
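To see what the encoded result looks like, here is a small self-contained sketch. It builds the ‘Category’ column directly rather than deriving it, and passes `dtype=int` to request 0/1 integers. Without that argument, the indicator columns may come back as True/False, depending on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    'Product Name': ['Laptop', 'Mouse', 'Keyboard'],
    'Units Sold': [50, 150, 80],
    'Category': ['Low Selling', 'High Selling', 'Low Selling'],
})

# The original 'Category' column is dropped and replaced by one column
# per distinct value, named '<column>_<value>'
encoded_df = pd.get_dummies(df, columns=['Category'], dtype=int)

print(list(encoded_df.columns))
print(encoded_df)
```

Each row has exactly one 1 across the new `Category_...` columns, which is where the name "one-hot" comes from.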
Exercise 5.1:
Using the sales data, one-hot encode the ‘Category’ column. This will be useful when you want to use this data for machine learning algorithms.

### Exercise 5.2:

Imagine you have another column ‘Region’ with values ‘North’, ‘South’, ‘East’, and ‘West’. One-hot encode this column.