These are the design decisions that make the biggest impact. Learning these patterns can get your project off to a good start and spare you a major refactor later.

Do things concurrently

  • For concurrent network I/O, use an event loop.

  • For file I/O and heavy-compute machine-compiled libraries like pillow, hashlib, and cryptography, use multiple threads, optionally on an event loop.

  • For concurrent CPU-intensive Python, use a parallel compute framework or multiple processes.

Event Loops

Without special handling, your interpreter executes statements one at a time. What if one of those statements is to query a database? Your interpreter might just sit there and wait until the query has a response before executing its next Python statement.

If you have other code you could be running, this is where an event loop shines. If the DB query had been made using event loop networking, the event loop would start executing code in other tasks while your DB query task is paused waiting for a response. The other tasks the event loop could swap in might be starting their own DB queries, processing a previous query’s response, cranking out metrics, rendering graphics, or something completely different.
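A minimal sketch of that swap with asyncio. `query_db` here is a hypothetical stand-in that simulates a database query with asyncio.sleep; a real program would await an async database driver instead.

```python
import asyncio
import time

async def query_db(query: str) -> str:
    # Hypothetical stand-in for an async database driver call.
    # asyncio.sleep yields control to the event loop, just like
    # awaiting a real network response would.
    await asyncio.sleep(1)
    return f"rows for {query!r}"

async def main() -> list:
    # While each "query" is waiting, the loop runs the other tasks,
    # so three 1-second queries finish in about 1 second total.
    return await asyncio.gather(
        query_db("SELECT 1"),
        query_db("SELECT 2"),
        query_db("SELECT 3"),
    )

start = time.perf_counter()
results = asyncio.run(main())
print(results, f"{time.perf_counter() - start:.1f}s")
```

The total wall time stays near one second because the waits overlap; run the queries in a plain loop and the waits add up instead.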


Threads

Threads are another way of doing more than one thing at a time. They let you isolate operations that would normally block the entire interpreter, so the rest of your code can keep running in other threads.

The kinds of blocking operations that Python threads can run concurrently with each other are:

  • File input/output like opening a file and reading its content to a variable.

  • Network input/output like waiting for the response to an HTTP request.

  • Timers like time.sleep().
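A small sketch of that overlap, using time.sleep as a stand-in for a file or network wait:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_task(n: int) -> int:
    # time.sleep releases the interpreter lock while it waits, like
    # file and socket waits do, so other threads keep running.
    time.sleep(1)
    return n * n

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(blocking_task, range(4)))
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.1f}s")  # four 1-second waits finish in ~1s
```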

In special situations, that list extends to computationally heavy operations like data compression, cryptography, and media processing, but that’s not the norm in Python. Most Python interpreters take strict precautions to prevent objects from being acted on by more than one thread at a time, so the benefits of multithreading in Python are limited by this protection. Chances are, multiple threads will not take advantage of multiple CPU cores for computation unless you’re working with libraries that explicitly unlock the interpreter’s protections.

Some of those special situations, where Python threading leaves the interpreter unlocked so other threads can keep processing during the same activity:

  • Compression or decompression with the gzip or lzma standard libraries.

  • Hashing more than 4 kibibytes of data with the standard hashlib.

  • Some ssl standard library activities.

  • Image processing with the pillow library.

Given three random byte strings in CPython 3.9…

  • ThreadPoolExecutor(4).map(gzip.compress, contents) is 3x faster than [gzip.compress(doc) for doc in contents].

  • ThreadPoolExecutor(4).map(lzma.compress, contents) is 3x faster than [lzma.compress(doc) for doc in contents].

  • ThreadPoolExecutor(4).map(lambda content: sha256(content).digest(), contents) is 2.5x faster than [sha256(doc).digest() for doc in contents] on a CPU without SHA instructions.
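The comparisons above can be sketched as a runnable benchmark. The exact speedup depends on your machine, data sizes, and Python version, so this only shows the measurement structure, not the 3x figure:

```python
import gzip
import os
import time
from concurrent.futures import ThreadPoolExecutor

# Three random byte strings; 2 MiB each is enough for compression
# work to dominate the thread-pool overhead.
contents = [os.urandom(2 * 1024 * 1024) for _ in range(3)]

start = time.perf_counter()
serial = [gzip.compress(doc) for doc in contents]
serial_time = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(4) as pool:
    # gzip.compress unlocks the interpreter while it crunches bytes,
    # so the three compressions can run on separate CPU cores.
    threaded = list(pool.map(gzip.compress, contents))
threaded_time = time.perf_counter() - start

print(f"serial {serial_time:.2f}s, threaded {threaded_time:.2f}s")
```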

Threads are heavier on latency and system resources than event loop tasks, so you’ll want an event loop for network I/O. If you do decide to use threads, the Python standard library gives you several options:

  • Threads on the asyncio event loop. Great for working with threads alongside event-based activities.

  • concurrent.futures, specifically the ThreadPoolExecutor. A high-level abstraction with nice tools for splitting up work across code running in multiple threads.

  • The low-level threading library.
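As a sketch of the first option, asyncio.to_thread (Python 3.9+) hands a blocking call to a worker thread while the event loop keeps servicing other tasks. `blocking_read` is a hypothetical stand-in for a real file read:

```python
import asyncio
import time

def blocking_read() -> str:
    # Hypothetical slow file read; time.sleep blocks this worker
    # thread, but not the event loop.
    time.sleep(0.5)
    return "file contents"

async def main() -> list:
    # The blocking call runs in a thread while the loop stays free
    # for event-based tasks, like this sleep-based "heartbeat".
    return await asyncio.gather(
        asyncio.to_thread(blocking_read),
        asyncio.sleep(0.5, result="heartbeat"),
    )

print(asyncio.run(main()))  # ['file contents', 'heartbeat']
```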

Parallel Computation Frameworks

  • PyOpenCL and PyCUDA allow you to use C-like languages to write code snippets (kernels) to be run on your graphics devices (GPUs). A GPU can do simple math operations by the hundreds or thousands at the same time using its many compute cores. The hoisting of these kernels to your GPUs and your communication with the GPUs running these kernels are written in Python.

  • The Numba library can take a loop written in Python and compile it to machine code for your CPU or GPUs. It’s not the only tool that can do this, but it is particularly good at loops. Given a loop, especially one using numpy arrays, it can apply one of three means of parallelization:

    • Non-locking threads: taking advantage of multiple CPU cores.

    • SIMD Vectorization: taking advantage of your processor’s instructions that run on multiple values at once.

    • GPU acceleration: the Python code of a loop iteration can be compiled to a machine code kernel for an AMD or Nvidia graphics device. Because the code is written in Python, some may find it easier to use than PyOpenCL or PyCUDA.


multiprocessing is part of the Python standard library. It allows you to spawn multiple Python interpreter processes (OS processes) to run your code. It provides utilities for serving work from your main process to your sub-processes. Unlike traditional Python threads, the Python code running in your spawned sub-processes will not lock object operations on your main process or other subprocesses.

multiprocessing’s processes are much heavier than threads or event loop tasks. Spawning the sub-processes, transporting data to them, and synchronizing access to a shared work queue all take up compute time that could eat into the time you’re saving by computing in parallel.

Run long-running Python with PyPy

PyPy is a Python interpreter. It’s a special kind of interpreter that compiles Python code to machine code as it’s executed. If your program calls that same code again, there’s a good chance it will use the pre-compiled x86_64 or ARM64 code the next time around, and it will be very fast.

This is similar to the popular V8 JavaScript engine employed by Chrome, Chromium, and Node.js.

If speed is the top priority, don’t use Python

Python is one of the most powerful and readable languages. But if you’re building a project where runtime speed is the top priority, it’s just not the best fit. Consider Rust or C, depending on your dev team’s competencies and interests.