Dataframes in Python

Open In Colab

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is generally the most commonly used pandas object. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects. It is generally the most commonly used pandas object.

DataFrames can be created in various ways, but for this example, we’ll create a DataFrame from a dictionary of pandas Series.

import pandas as pd

data = {
    'apples': pd.Series([3, 2, 0, 1]),
    'oranges': pd.Series([0, 3, 7, 2])
}

df = pd.DataFrame(data)

df
Unable to display output for mime type(s): application/vnd.dataresource+json

DataFrame Indexing

Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame. Indexes can be used to select specific rows and columns that you want to manipulate. They can also be used to modify the structure of the DataFrame itself, for example, by adding rows or columns.

Let’s explore some examples of how to work with DataFrame indexes.

# Set a column as the index
df_indexed = df.set_index('apples')
df_indexed
Unable to display output for mime type(s): application/vnd.dataresource+json
# Reset the index
df_reset = df_indexed.reset_index()
df_reset
Unable to display output for mime type(s): application/vnd.dataresource+json

Creating DataFrames

Pandas DataFrames can be created in various ways. Some of the most common methods are: from a list, from a dictionary, from a list of dictionaries, and from a NumPy array. Let’s explore examples of each.

# Creating a DataFrame from a list
list_data = [['Alex',10],['Bob',12],['Clarke',13]]
df_list = pd.DataFrame(list_data, columns=['Name','Age'])
df_list
Unable to display output for mime type(s): application/vnd.dataresource+json
# Creating a DataFrame from a dictionary
dict_data = {'Name':['Tom', 'Nick', 'John'], 'Age':[20, 21, 19]}
df_dict = pd.DataFrame(dict_data)
df_dict
Unable to display output for mime type(s): application/vnd.dataresource+json
# Creating a DataFrame from a list of dictionaries
list_dict_data = [{'a': 1, 'b': 2},{'a': 3, 'b': 4, 'c': 5}]
df_list_dict = pd.DataFrame(list_dict_data)
df_list_dict
Unable to display output for mime type(s): application/vnd.dataresource+json
# Creating a DataFrame from a NumPy array
import numpy as np

numpy_data = np.array([[1, 2], [3, 4]])
df_numpy = pd.DataFrame(numpy_data, columns=['Column1', 'Column2'])
df_numpy
Unable to display output for mime type(s): application/vnd.dataresource+json

DataFrame Operations and Methods

Pandas DataFrames offer a wide range of operations and methods that can be used to manipulate and analyze data. In this section, we’ll explore how to create new columns, how to create columns from other columns through operations, and how to combine DataFrames using the concatenate method.

# Creating a new column
df['bananas'] = pd.Series([1, 0, 2, 4])
df
Unable to display output for mime type(s): application/vnd.dataresource+json
# Creating a column from other columns
df['total_fruits'] = df['apples'] + df['oranges'] + df['bananas']
df
Unable to display output for mime type(s): application/vnd.dataresource+json
# Creating another DataFrame to concatenate
df2 = pd.DataFrame({'apples': [5, 3], 'oranges': [2, 4], 'bananas': [7, 6]}, index=[4, 5])

# Concatenating DataFrames
df_concat = pd.concat([df, df2])
df_concat
Unable to display output for mime type(s): application/vnd.dataresource+json

The ignore_index Parameter

When concatenating DataFrames, pandas provides an ignore_index parameter. If ignore_index is set to True, the resulting DataFrame will have a new integer index, ignoring the original indices of the concatenated DataFrames. Let’s see an example.

# Concatenating DataFrames with ignore_index
df_concat_ignore_index = pd.concat([df, df2], ignore_index=True)
df_concat_ignore_index
Unable to display output for mime type(s): application/vnd.dataresource+json

Exploring a DataFrame

Pandas provides several methods that are useful for quickly summarizing and gaining insights from your data. In this section, we’ll explore the value_counts, unique, nunique, and describe methods. Let’s first create a new DataFrame for these examples.

# Creating a new DataFrame
data = {
    'A': np.random.randint(1, 10, 20),
    'B': np.random.choice(['red', 'green', 'blue'], 20),
    'C': np.random.normal(0, 1, 20),
    'D': np.random.choice(['cat', 'dog', 'rabbit'], 20),
    'E': np.random.randint(1, 100, 20)
}
df_explore = pd.DataFrame(data)
df_explore
Unable to display output for mime type(s): application/vnd.dataresource+json
# value_counts method
df_explore['B'].value_counts()
Unable to display output for mime type(s): application/vnd.dataresource+json
# unique method
df_explore['D'].unique()
array(['rabbit', 'dog', 'cat'], dtype=object)
# nunique method
df_explore['A'].nunique()
9
# describe method
df_explore.describe()
Unable to display output for mime type(s): application/vnd.dataresource+json

Sorting and Ranking in a DataFrame

Pandas provides several methods for sorting and ranking data in a DataFrame. In this section, we’ll explore the sort_values, iloc, and loc methods. The sort_values method sorts a DataFrame by one or more columns, while iloc and loc are used for indexing and selecting data.

# sort_values method
df_explore_sorted = df_explore.sort_values(by='A')
df_explore_sorted.head()
Unable to display output for mime type(s): application/vnd.dataresource+json
# iloc method
df_explore_iloc = df_explore_sorted.iloc[0:5]
df_explore_iloc
Unable to display output for mime type(s): application/vnd.dataresource+json
# loc method
df_explore_loc = df_explore_sorted.loc[df_explore_sorted['B'] == 'red']
df_explore_loc.head()
Unable to display output for mime type(s): application/vnd.dataresource+json

Boolean Indexing

Boolean indexing is a powerful tool that allows you to select data that meets certain conditions. This can be done using comparison operators (>, <, ==) and logical operators (& for ‘and’, | for ‘or’). Let’s see some examples.

# Boolean indexing with '>'
df_explore[df_explore['A'] > 5]
Unable to display output for mime type(s): application/vnd.dataresource+json
# Boolean indexing with '<'
df_explore[df_explore['A'] < 5]
Unable to display output for mime type(s): application/vnd.dataresource+json
# Boolean indexing with '=='
df_explore[df_explore['B'] == 'red']
Unable to display output for mime type(s): application/vnd.dataresource+json
# Boolean indexing with '&' (and)
df_explore[(df_explore['A'] > 5) & (df_explore['B'] == 'red')]
Unable to display output for mime type(s): application/vnd.dataresource+json
# Boolean indexing with '|' (or)
df_explore[(df_explore['A'] < 5) | (df_explore['B'] == 'red')]
Unable to display output for mime type(s): application/vnd.dataresource+json

Exporting DataFrames

Pandas provides several methods to export a DataFrame to different file formats. This can be very useful when you want to save your data for later use or to share it with others. In this section, we’ll explore how to export a DataFrame to CSV, Excel, and JSON formats.

# Exporting to CSV
# df_explore.to_csv('df_explore.csv', index=False)
# Exporting to Excel
# df_explore.to_excel('df_explore.xlsx', index=False)
# Exporting to JSON
# df_explore.to_json('df_explore.json', orient='records')

Exercises

Now that we have learned about pandas DataFrames, let’s put our knowledge into practice with some exercises. These exercises cover all the topics we have discussed in this notebook and vary in difficulty from easy to hard.

Exercise 1 (Easy)

Create a DataFrame from a dictionary with keys ‘Name’, ‘Age’, and ‘City’. The ‘Name’ column should contain five different names, the ‘Age’ column should contain ages between 20 and 40, and the ‘City’ column should contain the names of five different cities.

Exercise 2 (Medium)

Given the DataFrame created in Exercise 1, perform the following operations:

  1. Set the ‘Name’ column as the index of the DataFrame.
  2. Reset the index of the DataFrame.
  3. Select only the rows where ‘Age’ is greater than 30.

Exercise 3 (Medium)

Create a DataFrame with 10 rows and 3 columns named ‘A’, ‘B’, and ‘C’. The ‘A’ column should contain random integers between 1 and 10, the ‘B’ column should contain random floats between 0 and 1, and the ‘C’ column should contain the string ‘random’ for all rows. Then, export this DataFrame to a CSV file named ‘random.csv’.

Exercise 4 (Hard)

Given the DataFrame created in Exercise 3, perform the following operations:

  1. Create a new column ‘D’ that is the result of ‘A’ divided by ‘B’.
  2. Replace all ‘inf’ values in column ‘D’ with ‘NaN’.
  3. Drop all rows with ‘NaN’ values.
  4. Reset the index of the DataFrame without adding a new ‘index’ column.

Exercise 5 (Hard)

Given a DataFrame with 100 rows and 5 columns named ‘A’, ‘B’, ‘C’, ‘D’, and ‘E’. All columns should contain random integers between 1 and 100. Perform the following operations:

  1. Select only the rows where ‘A’ is greater than 50 and ‘B’ is less than 50.
  2. From the resulting DataFrame, select only the ‘C’, ‘D’, and ‘E’ columns.
  3. Calculate the sum of the ‘C’, ‘D’, and ‘E’ columns for each row (you should end up with a Series).
  4. Find the row (index) with the highest sum.