import pandas as pd
import numpy as np

# Creating fictitious sales data
np.random.seed(42)
data = {
    'Product ID': np.arange(1, 101),
    'Product Name': ['Product_' + str(i) for i in range(1, 101)],
    'Category': np.random.choice(['Electronics', 'Clothing', 'Groceries', 'Household'], 100),
    'Price': np.random.uniform(10, 500, 100).round(2),
    # Cast to float so that NaN values can be inserted below
    'Units Sold': np.random.randint(1, 100, 100).astype(float),
    'Date of Sale': pd.date_range(start='2022-01-01', periods=100, freq='D')
}

# Introducing some missing values
for _ in range(10):
    data['Price'][np.random.randint(0, 100)] = np.nan
    data['Units Sold'][np.random.randint(0, 100)] = np.nan

sales_df = pd.DataFrame(data)
sales_df.head()
DataFrames
A DataFrame is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns). It is one of the most commonly used data structures in data analysis and machine learning, especially when dealing with structured data.
Why are DataFrames important?
- Structured Representation: DataFrames provide a structured representation of data, making it easier to observe and analyze.
- Flexibility: They can hold different types of data (e.g., integer, string, float) in different columns.
- Functionality: Libraries like pandas in Python offer a wide range of functions to manipulate, filter, group, and transform data in DataFrames.
- Integration with Machine Learning: DataFrames are often used as input data structures for machine learning models. They can be easily split into training and testing sets, and their columns can be used as features for models.
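As a quick illustration of the last two points, here is a minimal sketch of a mixed-type DataFrame split into training and testing sets. The toy data and column names are made up, and the split uses plain pandas (in practice one might reach for scikit-learn's train_test_split instead):

```python
import pandas as pd

# A toy DataFrame holding heterogeneous column types
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],               # integers
    'label': ['a', 'b', 'a', 'b', 'a'],  # strings
    'score': [0.1, 0.9, 0.5, 0.7, 0.3],  # floats
})

# A simple 80/20 train/test split using only pandas
train_df = df.sample(frac=0.8, random_state=0)
test_df = df.drop(train_df.index)
print(len(train_df), len(test_df))  # 4 1
```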
Applications in Data Analysis and Machine Learning
- Data Cleaning: Handling missing values, filtering rows or columns, and transforming data.
- Data Exploration: Calculating statistics, visualizing data, and identifying patterns.
- Feature Engineering: Creating new features from existing ones to improve model performance.
- Model Training: Using DataFrames to train machine learning models.
Business Example
Imagine a retail company that wants to analyze its sales data. The data might include columns like ‘Product ID’, ‘Product Name’, ‘Category’, ‘Price’, ‘Units Sold’, and ‘Date of Sale’. With a DataFrame, the company can easily:
- Analyze sales trends over time.
- Identify the best-selling products.
- Calculate the total revenue for a specific time period.
- Group sales data by category or product.
- Predict future sales using machine learning models.
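The "group sales data by category" idea can be sketched in a few lines. This is a hypothetical three-row table using the column names described above, not the company's actual data:

```python
import pandas as pd

# Hypothetical mini sales table mirroring the columns described above
sales = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics'],
    'Price': [100.0, 20.0, 50.0],
    'Units Sold': [2, 5, 1],
})
sales['Revenue'] = sales['Price'] * sales['Units Sold']

# Total revenue per category
revenue_by_category = sales.groupby('Category')['Revenue'].sum()
print(revenue_by_category)  # Clothing 100.0, Electronics 250.0
```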
Example 1: Fictitious Data
Creating Fictitious Data
Data Cleaning
# Checking for missing values
missing_values = sales_df.isnull().sum()
missing_values

# Handling missing values by filling with the mean of the column
sales_df['Price'] = sales_df['Price'].fillna(sales_df['Price'].mean())
sales_df['Units Sold'] = sales_df['Units Sold'].fillna(sales_df['Units Sold'].mean())

# Verifying if there are any missing values left
sales_df.isnull().sum()
Data Exploration
# Getting a summary of the data
sales_summary = sales_df.describe()
sales_summary

import matplotlib.pyplot as plt

# Visualizing the distribution of 'Price'
plt.figure(figsize=(10, 6))
sales_df['Price'].hist(bins=30, color='skyblue', edgecolor='black')
plt.title('Distribution of Product Prices')
plt.xlabel('Price')
plt.ylabel('Number of Products')
plt.grid(False)
plt.show()
Feature Engineering
# Creating a new feature 'Revenue' which is 'Price' multiplied by 'Units Sold'
sales_df['Revenue'] = sales_df['Price'] * sales_df['Units Sold']

# Creating a feature 'Month of Sale' extracted from 'Date of Sale'
sales_df['Month of Sale'] = sales_df['Date of Sale'].dt.month

sales_df[['Product Name', 'Price', 'Units Sold', 'Revenue', 'Month of Sale']].head()
Example 2: Fortune Data
Reading a CSV File
# Importing necessary libraries
import pandas as pd

# Reading a sample CSV file related to business and economy
# For this example, I'll use a dataset about Fortune 500 companies
url = 'https://raw.githubusercontent.com/hizocar/datasets/main/fortune500.csv'
fortune_df = pd.read_csv(url, sep=',', on_bad_lines='skip')

# Displaying the first few rows of the dataset using .head()
fortune_df.head()
Column Descriptions
- Year: The year in which the company's data was recorded. Its type is int64, meaning it holds integer values.
- Name: The company's name. Its type is object, which generally indicates strings.
- Revenue: The company's revenue in millions. Its type is float64, meaning it holds decimal values.
Each column has a specific meaning and is essential for analyzing the Fortune 500 companies and their performance over the years.
# Checking the data types of each column
data_types = fortune_df.dtypes
data_types
Data Cleaning: Transforming the ‘Revenue’ Column
# Replacing '-' with 0
fortune_df['Revenue'] = fortune_df['Revenue'].replace('-', '0')

# Removing any remaining non-numeric characters and converting to float
fortune_df['Revenue'] = fortune_df['Revenue'].replace(r'[\$,]', '', regex=True).astype(float)

# Checking the data types of each column again
data_types_updated = fortune_df.dtypes
data_types_updated
# Standardizing column names to lowercase and replacing spaces with underscores
fortune_df.columns = fortune_df.columns.str.lower().str.replace(' ', '_')

# Displaying the updated column names
fortune_df.columns
Index(['year', 'name', 'revenue', 'rank'], dtype='object')
Data Exploration
# Checking for missing values in the dataframe
missing_values = fortune_df.isnull().sum()
missing_values
From the initial check for missing values, it appears that there are no missing values in any of the columns of the dataframe. This is great as it means we don’t have to perform any imputation or data filling for this dataset.
Next, let’s use the .describe() method to get a summary of the distribution of the numerical data.
# Using .describe() to get a summary of the numerical columns
data_summary = fortune_df.describe()
data_summary
The .describe() method provides a summary of the numerical columns in the dataframe. Here are some insights from the summary:
- Year: The data spans from 1955 to 2021.
- Revenue: The average revenue of the companies listed is approximately 8,488.47 million. The minimum revenue is 0 million (which might indicate missing or unreported data for some companies), and the maximum revenue is a staggering 559,151 million. The 25th percentile is 480.45 million, the median (50th percentile) is 2,184.2 million, and the 75th percentile is 7,829.15 million.
- Rank: The rank ranges from 1 to 500, which is expected for a list of the top 500 companies.
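The percentile figures quoted above come straight from .describe(); the same numbers can also be pulled individually with .quantile(). A small illustration on toy numbers (not the Fortune data):

```python
import pandas as pd

# Percentiles of a simple 1..10 series, matching what .describe() reports
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(s.quantile([0.25, 0.5, 0.75]))  # 3.25, 5.5, 7.75
```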
Next, let’s check the shape of the dataframe to understand its dimensions and then explore the unique values in the non-numerical columns.
# Checking the shape of the dataframe
data_shape = fortune_df.shape

# Checking unique values in the 'name' column
unique_companies = fortune_df['name'].nunique()

data_shape, unique_companies
((33500, 4), 2273)
Here are some additional insights from our exploration:
- The dataframe has 33,500 rows and 4 columns. This means we have data for 33,500 entries across the 4 columns.
- The name column, which represents the names of the companies, has 2,273 unique values. This indicates that many companies have appeared on the list multiple times over the years.
Given the size of the dataset and the number of unique companies, there’s a wealth of information to explore further, such as trends over time, the distribution of revenues among the top companies, and more.
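Since companies recur across years, .value_counts() is a quick way to see how often each name appears, alongside .nunique(). A sketch on toy data (the company names here are invented):

```python
import pandas as pd

# Hypothetical list of company names with repeats across years
names = pd.Series(['Acme', 'Acme', 'Globex', 'Acme', 'Globex', 'Initech'])

# How many times each company appears, and how many distinct companies exist
appearances = names.value_counts()
print(appearances)      # Acme 3, Globex 2, Initech 1
print(names.nunique())  # 3
```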
# Importing necessary libraries for data visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Setting the style for the plots
sns.set_style('whitegrid')

# Plotting the distribution of the 'year' column
plt.figure(figsize=(14, 6))
sns.countplot(x='year', data=fortune_df, palette='viridis')
plt.title('Distribution of Companies by Year')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Plotting the distribution of the 'revenue' column
plt.figure(figsize=(14, 6))
sns.histplot(fortune_df['revenue'], bins=50, color='blue', kde=True)
plt.title('Distribution of Revenue')
plt.xlabel('Revenue (in millions)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
Now, let’s explore the relationships between the numerical variables. One of the best ways to visualize relationships between numerical variables is by using a correlation heatmap. This will allow us to see if there are any strong linear relationships between the variables in our dataset.
# Calculating the correlation matrix (numeric columns only, since 'name' is text)
correlation_matrix = fortune_df.corr(numeric_only=True)

# Plotting the correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5, vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()
From the correlation heatmap, we can observe the following:
- Year and Rank: There’s a negative correlation between the year and the rank. This might indicate that over the years, the ranking criteria or the companies in the list have changed.
- Revenue and Rank: There’s also a negative correlation between revenue and rank. This is expected since a higher revenue would generally mean a higher rank (or a lower rank number) on the Fortune 500 list.
It’s important to note that correlation does not imply causation. While these variables might be correlated, it doesn’t mean that one causes the other. The heatmap provides a quick overview of potential relationships that might be worth exploring further.
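As a reminder of what each heatmap cell contains, .corr() computes the pairwise Pearson correlation, which is 1.0 for a perfect linear relationship. A toy example, unrelated to the Fortune data:

```python
import pandas as pd

# Two perfectly linearly related series: y = 2 * x
toy = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2, 4, 6, 8, 10]})
print(toy['x'].corr(toy['y']))  # 1.0
```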
# Grouping the data by company and year (averaging the numeric columns)
grouped_data = fortune_df.groupby(['name', 'year']).mean(numeric_only=True).reset_index()

# Plotting the correlation heatmap for the grouped data
correlation_matrix_grouped = grouped_data.corr(numeric_only=True)
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix_grouped, annot=True, cmap='coolwarm', linewidths=0.5, vmin=-1, vmax=1)
plt.title('Correlation Heatmap (Grouped Data)')
plt.tight_layout()
plt.show()
After grouping the data by company and year, the correlation heatmap for the grouped data shows similar insights as before. The negative correlations between year and rank and between revenue and rank remain consistent.
Next, let’s visualize the revenue evolution over the years for three companies. We’ll choose ‘Walmart’, ‘Exxon Mobil’, and ‘Apple’ as our sample companies for this analysis.
# Filtering data for the three chosen companies
companies = ['Walmart', 'Exxon Mobil', 'Apple']
filtered_data = grouped_data[grouped_data['name'].isin(companies)]

# Plotting the revenue evolution over the years for the three companies
plt.figure(figsize=(14, 7))
sns.lineplot(data=filtered_data, x='year', y='revenue', hue='name', palette='tab10', marker='o')
plt.title('Revenue Evolution Over the Years')
plt.xlabel('Year')
plt.ylabel('Revenue (in millions)')
plt.legend(title='Company')
plt.tight_layout()
plt.show()