Vectorize & Conquer

Basics first!

Let’s begin with Flynn’s taxonomy, which classifies computer architecture into four categories:

1. Single Instruction, Single Data (SISD)
2. Single Instruction, Multiple Data (SIMD)
3. Multiple Instruction, Single Data (MISD)
4. Multiple Instruction, Multiple Data (MIMD)

For simplicity, we’ll focus primarily on SIMD. In many modern CPUs, SIMD architecture allows a single instruction to be applied to multiple data points simultaneously.

Vectorization is a technique that optimizes algorithms to utilize SIMD instructions in processors more effectively.

Let’s talk numbers!

We’ll delve into the internals later. First, let’s examine the performance differences between native Python operations and vectorized operations, using Pandas and Numpy.

Let’s create a dataset of 1 million records

data = [{'a': i, 'b': i**2, 'c': i%3, 'd': i*2} for i in range(1000000)]

We’ll use this dataset for all the operations described below.

1. Data Transformation and Aggregation.

We’ll perform the following simple operations on this data:

* Increase values in 'a' by 1.
* Decrease values in 'b' by 2.
* Multiply values in 'c' by 3.
* Divide values in 'd' by 4.
* Sum all the values.
* Return the transformed data and the final sum value.
Native Python
import time

def python_data_transformation_aggregation(data: list) -> tuple:
    final_sum = 0
    for index in range(len(data)):
        data[index] = {
            "a": data[index]["a"] + 1,
            "b": data[index]["b"] - 2,
            "c": data[index]["c"] * 3,
            "d": data[index]["d"] / 4,
        }
        final_sum += sum(data[index].values())
    return data, final_sum

start_time = time.time()
python_data_transformation_aggregation(data)
print("Python Data Transformation Time: ", time.time() - start_time)

The above code took 3.8 seconds to complete.

Vectorized with Pandas
import pandas as pd

# Converting to pandas DataFrame
df = pd.DataFrame(data)

start_time = time.time()
result = (df['a'] + 1 + df['b'] - 2 + df['c'] * 3 + df['d'] / 4).sum()
print("Pandas Transformation and Aggregation Time: ", time.time() - start_time)

The pandas implementation, in contrast, took only 6 milliseconds.
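For comparison, the same transformation can be sketched in NumPy. This is a hypothetical equivalent that rebuilds the four fields directly as arrays rather than converting the list of dicts:

```python
import numpy as np

# Rebuild the dataset's four fields as arrays (mirrors the dict comprehension above).
n = 1_000_000
a = np.arange(n)
b = a ** 2
c = a % 3
d = a * 2

# Apply the same arithmetic as the Pandas version in one vectorized expression.
final_sum = ((a + 1) + (b - 2) + (c * 3) + (d / 4)).sum()
```

On typical hardware this runs in the same ballpark as the Pandas version, since both dispatch the arithmetic to compiled loops.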

2. Filtering and Grouping
Native Python
def python_complex_filtering_grouping(data):
    filtered = [d for d in data if d['a'] % 5 == 0 and d['b'] > 50]
    grouped = {}
    for item in filtered:
        key = item['c']
        if key in grouped:
            grouped[key].append(item)
        else:
            grouped[key] = [item]
    return grouped

start_time = time.time()
python_complex_filtering_grouping(data)
print("Python Complex Filtering Time: ", time.time() - start_time)

This process took a full 1 second to complete.

Vectorized with Pandas
start_time = time.time()
df_filtered = df[(df['a'] % 5 == 0) & (df['b'] > 50)]
grouped = df_filtered.groupby('c')
print("Pandas Complex Filtering and Grouping Time: ", time.time() - start_time)

The vectorized version took only 2 milliseconds.
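One caveat about that timing: `groupby()` by itself is lazy — it returns a `GroupBy` object and defers the actual work until you aggregate. A small sketch with made-up values (same column names as the dataset above):

```python
import pandas as pd

# Tiny illustrative frame (hypothetical values; same column names as above).
df = pd.DataFrame({'a': [0, 5, 10, 12],
                   'b': [60, 70, 80, 90],
                   'c': [0, 1, 1, 2]})

# Filtering is eager, but groupby() defers computation until an aggregation runs.
filtered = df[(df['a'] % 5 == 0) & (df['b'] > 50)]
group_sizes = filtered.groupby('c').size()  # aggregation triggers the grouping
```

So to compare like with like, include an aggregation such as `.size()` or `.sum()` when you time the grouped operation.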

A Peek Under the Hood!

  • Pandas and Numpy use C extensions to perform computationally intensive tasks. When you write a vectorized operation in Python using these libraries, you’re indirectly invoking these C functions.

  • These C functions are designed to manage memory and computation more efficiently than typical Python code. They interact with memory directly, avoid overheads such as per-element type checking and interpreted execution, and are compiled to machine code.

  • Numpy arrays ensure data is stored in contiguous blocks of memory, which reduces memory-access time compared to a Python list, where each element is a pointer to an object that may live anywhere on the heap.
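The contiguity point is easy to verify yourself — a quick sketch using NumPy's array flags:

```python
import numpy as np

# A NumPy array packs its elements into one contiguous buffer of fixed-size items.
arr = np.arange(10, dtype=np.int64)
is_contiguous = arr.flags['C_CONTIGUOUS']   # elements sit back to back in memory
bytes_per_item = arr.itemsize               # 8 bytes per int64, no per-element object header

# A Python list, by contrast, is an array of pointers to separately allocated objects.
lst = list(range(10))
```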

Things to remember!

  • If your code involves looping through data to perform mathematical or conditional operations, it’s a candidate for vectorization.

  • Tasks like normalization, standardization, or even filling missing values are prime for vectorization.
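For instance, standardization and missing-value filling each reduce to a single vectorized expression (a minimal sketch with made-up values):

```python
import numpy as np
import pandas as pd

# Z-score standardization: one expression, no explicit loop.
values = np.array([2.0, 4.0, 6.0, 8.0])
standardized = (values - values.mean()) / values.std()

# Filling missing values with the column mean, likewise vectorized.
s = pd.Series([1.0, None, 3.0])
filled = s.fillna(s.mean())  # mean() skips NaN by default
```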

  • Libraries like Numpy and Pandas are built with vectorization in mind. Familiarize yourself with their APIs, and prefer built-in functions over custom loops.

  • Sometimes, the part of your code you think is the slowest isn’t actually the bottleneck. Use profiling tools to identify where vectorization will have the most impact.
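A minimal sketch using the standard library's cProfile (the function name here is just an illustrative stand-in):

```python
import cProfile
import io
import pstats

def candidate_hotspot():
    # Stand-in for the loop you suspect is slow.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
result = candidate_hotspot()
profiler.disable()

# Print the five most expensive calls; optimize what the data shows, not what you guess.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
report = stream.getvalue()
```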

  • While vectorization speeds up computation, it can also consume more memory. Monitor your application’s memory usage, especially with very large datasets. Pandas and Numpy are notorious for creating intermediate DataFrames and arrays, so it’s often worth using the inplace flag where a function offers one, or assigning the result back to the original variable.

  • Don’t apply(). Don’t. The apply() API lets you run custom functions over a Pandas DataFrame. Looks convenient, right? The problem is that apply() is not a vectorized operation, and it is extremely slow compared to one. This Stack Overflow answer explains it in depth.
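The gap is easy to demonstrate: both lines below compute the same thing, but the first runs a Python-level function once per element (a small sketch):

```python
import pandas as pd

df = pd.DataFrame({'x': range(5)})

# apply(): a Python loop in disguise, one function call per value.
via_apply = df['x'].apply(lambda v: v * 2 + 1)

# The same computation as a single vectorized expression.
via_vector = df['x'] * 2 + 1
```

On a million rows, the vectorized form is typically one to two orders of magnitude faster.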

  • Limited to exposed functions. Not everything can be vectorized. When dealing with complex custom functions that can’t be vectorized, consider other optimization methods, such as parallel processing.

That’s it, folks. Thanks for reading till the end! Next time!