This article walks through five ways to load data in Python. By the end, we will have cut the load time from about 50 seconds to under a tenth of a second, a speedup of roughly three orders of magnitude.


Experimental Setup

We need to load 10 Excel files, each with 20,000 rows and 25 columns, totaling about 70 MB.

First, we'll create the dummy data and set up our environment by importing the necessary libraries. Pickle and joblib appear here already and are covered in more detail later in the article.

import pandas as pd
import numpy as np
from joblib import Parallel, delayed
import time

# Function to generate dummy data and save in different formats
def save_dummy_data(file_number):
    # Generate random data
    values = np.random.uniform(size=(20000, 25))
    df = pd.DataFrame(values)
    
    # Save to CSV, Excel, and Pickle (to_excel requires an engine such as openpyxl)
    df.to_csv(f"Dummy_{file_number}.csv", index=False)
    df.to_excel(f"Dummy_{file_number}.xlsx", index=False)
    df.to_pickle(f"Dummy_{file_number}.pkl")

# Generate and save data in parallel
Parallel(n_jobs=-1)(delayed(save_dummy_data)(i) for i in range(10))


5 Ways to Load Data in Python


1. Loading Excel Files Using Pandas:

We start by loading the Excel files with Pandas: each file is read into its own DataFrame, and the DataFrames are then concatenated into a single one. This lets us combine data from multiple sources into one structure for analysis.

import pandas as pd
import time

# Measure the start time
start = time.time()

# Read and concatenate Excel files efficiently
file_paths = [f"Dummy_{i}.xlsx" for i in range(10)]
df_list = [pd.read_excel(file) for file in file_paths]
df = pd.concat(df_list, ignore_index=True)

# Measure the end time
end = time.time()

# Print the elapsed time
print("Excel:", end - start)

This is slow: loading the ten files takes about 50 seconds.


2. Opt for CSVs Instead of Excel Files:

The first big win is converting our data files from .xlsx to .csv before importing them: parsing plain text is far cheaper than unpacking an Excel workbook, which speeds up data loading considerably.
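If the data only exists as .xlsx, a one-off conversion is enough. Here is a minimal sketch (our setup script above already wrote the CSVs, so it is not needed in this example):

import pandas as pd

# One-off conversion: read each Excel file once and write it back
# out as CSV so that all later loads can take the faster path
for file_number in range(10):
    df = pd.read_excel(f"Dummy_{file_number}.xlsx")
    df.to_csv(f"Dummy_{file_number}.csv", index=False)

With the CSVs on disk, the timed load looks like this: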

import pandas as pd
import time

# Measure the start time
start = time.time()

# Read and concatenate CSV files efficiently
file_paths = [f"Dummy_{i}.csv" for i in range(10)]
df_list = [pd.read_csv(file) for file in file_paths]
df = pd.concat(df_list, ignore_index=True)

# Measure the end time
end = time.time()

# Print the elapsed time
print("CSV:", end - start)

Converting to CSV drops the load time to about 0.63 seconds, roughly 80 times faster than Excel, though the CSV files take up more disk space.
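To check the size trade-off on your own machine, a quick comparison with os.path.getsize over the files our setup script generated (the exact numbers depend on the data):

import os

# Compare the on-disk size of one file in each format; text CSVs
# of float data are often the largest of the three
for ext in ("xlsx", "csv", "pkl"):
    size_mb = os.path.getsize(f"Dummy_0.{ext}") / 1e6
    print(f"Dummy_0.{ext}: {size_mb:.1f} MB")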


3. Smarter Creation of Pandas DataFrames:

We can also improve how we build the DataFrame itself. Instead of growing a single DataFrame by appending to it inside the loop, which copies the accumulated data on every iteration, we load each file into its own DataFrame, collect them in a list, and concatenate once at the end.

import pandas as pd
import time

# Measure the start time
start = time.time()

# Read and concatenate CSV files efficiently
df_list = []
for file_number in range(10):
    temp = pd.read_csv(f"Dummy_{file_number}.csv")
    df_list.append(temp)
df = pd.concat(df_list, ignore_index=True)

# Measure the end time
end = time.time()

# Print the elapsed time
print("CSV2:", end - start)

The load time improves only slightly here (0.62 seconds versus 0.63), but the single final concat avoids repeated copying, which matters more as the DataFrames get larger.
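For contrast, here is a sketch of the pattern to avoid, growing one DataFrame inside the loop:

import pandas as pd
import time

start = time.time()

# Antipattern: concatenating inside the loop copies every row
# accumulated so far on each iteration, so the cost grows
# quadratically with the number of files
df = pd.DataFrame()
for file_number in range(10):
    temp = pd.read_csv(f"Dummy_{file_number}.csv")
    df = pd.concat([df, temp], ignore_index=True)

end = time.time()
print("CSV (repeated concat):", end - start)

With only ten files the difference is small, but it compounds quickly as the data grows.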


4. Parallelize CSV Imports with Joblib:

Next, we parallelize the imports. Since each file can be read independently, the joblib library makes it easy to spread the work across multiple Python processes.

import pandas as pd
from joblib import Parallel, delayed
import time

# Measure the start time
start = time.time()

# Function to read a CSV file
def read_csv_file(file_number):
    return pd.read_csv(f"Dummy_{file_number}.csv")

# Read CSV files in parallel
df_list = Parallel(n_jobs=-1, verbose=10)(
    delayed(read_csv_file)(file_number) for file_number in range(10)
)

# Concatenate the list of DataFrames
df = pd.concat(df_list, ignore_index=True)

# Measure the end time
end = time.time()

# Print the elapsed time
print("CSV//:", end - start)

This roughly halves the load time compared to the single-process version (0.34 seconds versus 0.62). More cores don't automatically mean better performance, though: each worker process has startup and data-transfer overhead.
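A simple way to find the sweet spot is to time the import with different worker counts. This sketch sweeps a few values of n_jobs:

import pandas as pd
from joblib import Parallel, delayed
import time

def read_csv_file(file_number):
    return pd.read_csv(f"Dummy_{file_number}.csv")

# With only 10 small files, process startup and result transfer can
# outweigh the benefit of extra cores, so measure before committing
for n_jobs in (1, 2, 4, -1):
    start = time.time()
    Parallel(n_jobs=n_jobs)(
        delayed(read_csv_file)(i) for i in range(10)
    )
    print(f"n_jobs={n_jobs}:", time.time() - start)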


5. Utilize Pickle Files:

Pickle is Python's native binary serialization format. Because loading it involves no text parsing, pickle files load much faster than .csv files. (One caveat worth knowing: only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.)

import pandas as pd
from joblib import Parallel, delayed
import time

# Measure the start time
start = time.time()

# Function to read a pickle file
def read_pickle_file(file_number):
    return pd.read_pickle(f"Dummy_{file_number}.pkl")

# Read pickle files in parallel
df_list = Parallel(n_jobs=-1, verbose=10)(
    delayed(read_pickle_file)(file_number) for file_number in range(10)
)

# Concatenate the list of DataFrames
df = pd.concat(df_list, ignore_index=True)

# Measure the end time
end = time.time()

# Print the elapsed time
print("Pickle//:", end - start)

The load time drops to about 0.07 seconds, roughly 80% faster than the parallel CSV version. The trade-offs: pickle files are not human-readable, and depending on the data they can take more storage space than compressed formats such as .xlsx (see the size check in section 2).


Loading Excel Files in Parallel:

As a bonus, parallel processing helps even when you have to work with Excel files directly: joblib can load them concurrently as well.

import pandas as pd
from joblib import Parallel, delayed
import time

# Measure the start time
start = time.time()

# Function to read an Excel file
def read_excel_file(file_number):
    return pd.read_excel(f"Dummy_{file_number}.xlsx")

# Read Excel files in parallel
df_list = Parallel(n_jobs=-1, verbose=10)(
    delayed(read_excel_file)(file_number) for file_number in range(10)
)

# Concatenate the list of DataFrames
df = pd.concat(df_list, ignore_index=True)

# Measure the end time
end = time.time()

# Print the elapsed time
print("Excel//:", end - start)

We reduced the Excel load time by about 70%, from 50 seconds to 13 seconds. If you will need the same data again, it is worth paying this cost once and saving each file as a pickle for fast future loads, as sketched below.
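A minimal sketch of that one-off conversion, assuming the same file layout as above:

import pandas as pd
from joblib import Parallel, delayed

# Pay the Excel parsing cost once, then save each DataFrame as a
# pickle so every future load takes the fast path from section 5
def excel_to_pickle(file_number):
    df = pd.read_excel(f"Dummy_{file_number}.xlsx")
    df.to_pickle(f"Dummy_{file_number}.pkl")

Parallel(n_jobs=-1)(delayed(excel_to_pickle)(i) for i in range(10))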



Summary

By changing how the data is stored and loaded, we progressively reduced the time needed to read the ten files:

  1. Excel Files: the baseline loading time was 50 seconds.
  2. CSV Files: the load time dropped to 0.63 seconds (about 80 times faster).
  3. Smarter CSV Loading: a further small refinement brought it to 0.62 seconds.
  4. Parallel CSV Loading: the time fell to 0.34 seconds (about 150 times faster).
  5. Parallel Pickle Loading: the time fell to 0.07 seconds (about 700 times faster, roughly three orders of magnitude).