Micro-optimization

I get tunnel-vision and spend more time micro-optimizing my code than I should. It can drive my colleagues crazy! Don’t be like me; forget about micro-optimizing except in some special cases. You’ve got more important things to be working on.

When to micro-optimize:

  • Your code runs repeatedly and constantly (e.g. is already taking up 100% of the CPU you give it)

  • You pay by the execution millisecond. AWS Lambda is an example of this case.

The micro-optimizations listed below range from being helpful things you ought to do anyway to things that make your code less readable or portable.

Use class __slots__ if you’re constantly constructing objects

Modern Python classes construct powerful and flexible objects. You can create an object from a modern class, then attach attributes and methods the object didn’t have in its original class definition. To support this flexibility, Python manages a mutable mapping (the instance’s __dict__) of attribute names to values behind the scenes. Constructing and managing this mapping is slow.

If you find that your program constructs millions of objects of the same class and you never take advantage of the flexible attributes, you can disable the mutable attribute mapping and greatly speed up object construction. To disable it, tell Python all of your class’ data attributes up front using a __slots__ class variable.
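A minimal sketch, with a hypothetical class:

class Point:
    # With __slots__ declared, instances get fixed storage for exactly
    # these attributes and no per-instance __dict__. Assigning any
    # other attribute raises AttributeError.
    __slots__ = ('x', 'y')

    def __init__(self, x, y):
        self.x = x
        self.y = y

points = [Point(i, i * 2) for i in range(1_000_000)]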

Use operators instead of functions where possible

When two objects interact, there’s often both an operator and a function version of that interaction. Given how expensive the Python function stack is, use the operator version when possible. Behind the scenes, a function might still get called, but with the built-in types, that’s often a pure C function instead of a more expensive one on the Python stack.

  • existing_list += other_iterable is faster than existing_list.extend(other_iterable)

  • new_sequence = existing_sequence[:] is faster than new_sequence = existing_sequence.copy() for lists and bytearrays.

  • del my_dict[key] is faster than my_dict.pop(key) if you know an item exists and don’t need its value.

  • existing_dict |= other_dict is faster than existing_dict.update(other_dict)

  • existing_set |= other_set is faster than existing_set.update(other_set)

    • sets and frozensets actually have tons of operators. A look at the docs will help.

Note that some operators are stricter than their function equivalents. You can my_set.update(my_tuple), but you can’t my_set |= my_tuple; the operator demands a set on both sides.
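For example:

my_set = {1, 2}
my_set.update((3, 4))  # fine: update() accepts any iterable
my_set |= {5, 6}       # fine: |= accepts another set
my_set |= (7, 8)       # TypeError: the operator wants a set operand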

Use literals instead of functions where possible

  • f-strings are faster and more readable than str.format().

  • {1, 2, 3} is faster than set((1, 2, 3))

  • {'item1': 'this', 'item2': 'that'} is faster than dict(item1='this', item2='that')

Special cases with even more optimization:

  • Tuple literals that contain only literals themselves are evaluated once and stored as constants in the final bytecode, making them very fast. List literals don’t enjoy this same advantage.

  • While there is no frozenset literal, if you use a set literal for an “in” query, the set is compiled as a frozenset constant in the bytecode and gets the same speedup, as the bytecode below shows.
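You can check this with the dis module (output abbreviated; exact opcodes vary by cpython version):

import dis

dis.dis("x in {1, 2, 3}")
#   ...
#   LOAD_CONST     0 (frozenset({1, 2, 3}))
#   ...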

Use a faster JSON builder/parser

If you run your code on pypy, skip this section. pypy’s standard json.dumps()/json.loads() are very fast and are probably what you should use. For everyone else, using a 3rd-party JSON library can have significant performance benefits.

For this dict:

dict_to_dump = {
    'boolValue': True,
    'intNumberValue': 234,
    'floatNumberValue': 234.641114,
    'stringValue': 'Hello world!',
    'stringWithNonAsciiChars': 'こんにちは世界!',
    'nullValue': None
}

orjson for JSON bytes

Up until recently, the JSON specification was defined in terms of characters and didn’t specify how to transport it as bytes. Nowadays the JSON specification specifically recommends using UTF-8 for transporting JSON as bytes.

The built-in json module, ujson, and rapidjson all represent JSON as characters and leave it up to you to serialize/deserialize to/from bytes how you see fit. But if you already know you’ll be transporting your JSON as UTF-8 bytes, the orjson module is likely the fastest fit: its dumps() skips the character representation and gives you UTF-8 bytes directly, and its loads() accepts UTF-8 bytes.

For the dict above, orjson.dumps(dict_to_dump) is over 7x faster than json.dumps(dict_to_dump).encode('utf-8').
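If you want to check the numbers on your own interpreter, a quick sketch (timings vary by machine and library versions; orjson is a third-party install):

import json
import timeit

import orjson

n = 100_000
print(timeit.timeit(lambda: json.dumps(dict_to_dump).encode('utf-8'), number=n))
print(timeit.timeit(lambda: orjson.dumps(dict_to_dump), number=n))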

Use sets or frozensets for frequent “in” queries

This one’s easy. If all of the following are true, you should be using a set or frozenset instead of a list or tuple:

  • You are querying for existence often, e.g. if needle in haystack.

  • You aren’t constructing the collection or inserting into it often.
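For example, with a hypothetical lookup table built once at module level:

VALID_STATUSES = frozenset({'active', 'pending', 'archived'})

def is_valid(status):
    # Average O(1) membership test instead of a linear scan
    return status in VALID_STATUSES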

Avoid sets or frozensets for frequent inserts

When you’re doing more insertion than querying, sets are an expensive collection type to use. Inserting to an empty set is 4x slower than inserting into an empty list, and that gets worse the larger the set is.
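A rough way to see the gap on your own machine (the exact ratio varies by interpreter version and hardware):

import timeit

def build_set():
    s = set()
    for i in range(1000):
        s.add(i)

def build_list():
    l = []
    for i in range(1000):
        l.append(i)

print(timeit.timeit(build_set, number=10_000))
print(timeit.timeit(build_list, number=10_000))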

Avoid frequent object attribute lookup

Every time the Python interpreter encounters a dot in a reference, a lookup for that attribute is done on the parent object, whether the attribute is a simple value or an instance method. So the statement

my_var = my_object.my_attribute.my_sub_attribute

results in two object attribute lookups. This isn’t bad until you are wastefully looking up the same attributes over and over again, like so:

my_object.attr1.method1()
my_object.attr1.method2()
my_object.attr1.subattr = [30]
my_object.attr1.subattr.append(40)
my_object.attr1.subattr.clear()
my_var = my_object.attr1.subattr2

Which can instead be written as:

attr1 = my_object.attr1
attr1.method1()
attr1.method2()

subattr = attr1.subattr = [30]
subattr.append(40)
subattr.clear()

my_var = attr1.subattr2

In that toy example, it probably won’t make the code much faster, but when you see the same attribute lookup performed for every iteration of a for or while loop, you can imagine those lookups start to add up. Take this example:

even_numbers = []
for i in range(100):
    even_numbers.append(i * 2)

The method append of the list even_numbers has to be looked up by the Python interpreter every iteration of the loop. If it didn’t, it might not realize the append method was replaced by a different method in the previous iteration or by a different thread. If we save the append method to a variable we reference every time, it’s faster than referencing the list variable and looking up its append attribute every time:

even_numbers = []
append_even_number = even_numbers.append
for i in range(100):
    append_even_number(i * 2)

This runs about 1.25x faster than the example with .append() every iteration.

Don’t use ExitStack or AsyncExitStack for visual organization

The context manager is a nice syntax for ensuring completion or cleanup of tasks and resources. As more Python components adopt the pattern, chances are you’ll need to stack multiple context managers on top of each other, leading to an indentation nightmare:

# What you're trying to avoid
import gzip
import socket

with open('original.dat', 'rb') as original_file:
    # gzip stands in here for any line-writable archive
    with gzip.open('archive.gz', 'wb') as archive_file:
        with socket.create_connection(('127.0.0.1', 1337)) as connection:
            for line in original_file:
                archive_file.write(line)
                connection.sendall(line)

You might be inclined to use contextlib.ExitStack or contextlib.AsyncExitStack as a way to avoid multiple indents:

from contextlib import ExitStack

with ExitStack() as stack:
    original_file = stack.enter_context(open('original.dat', 'rb'))
    archive_file = stack.enter_context(gzip.open('archive.gz', 'wb'))
    connection = stack.enter_context(
        socket.create_connection(('127.0.0.1', 1337)))
    for line in original_file:
        archive_file.write(line)
        connection.sendall(line)

But there’s a faster, pure-syntax way using commas and parentheses:

with (
    open('original.dat', 'rb')
) as original_file, (
    gzip.open('archive.gz', 'wb')
) as archive_file, (
    socket.create_connection(('127.0.0.1', 1337))
) as connection:
    for line in original_file:
        archive_file.write(line)
        connection.sendall(line)

In toy examples with minimalist context managers, the commas and parentheses approach is over 3x faster than the ExitStack equivalent, or over 2x faster than the AsyncExitStack equivalent.

There are still some scenarios where you’d want to use an ExitStack or AsyncExitStack:

  • When you need a child function or task to add a context manager to the stack that outlives that child function or task.

  • When you need to mix async and non-async context managers.

  • When multiple context entrance/constructor functions are costly and you want to enter multiple contexts concurrently to make it faster (e.g. with ThreadPoolExecutor.map() or asyncio.gather())

Pre-compile frequently used structs and regular expressions

If you find yourself searching, matching, or splitting with the same regular expression repeatedly, you ought to pre-compile it, and assign the method you’ll use to a module-level variable.

import re
search_for_errors = re.compile('(?:ERROR|Exception)').search

def parse_log_line(line):
    # Called constantly in the application's lifespan
    return search_for_errors(line)

If you find yourself packing or unpacking data into and from bytes using the struct module repeatedly, you ought to pre-compile it, and assign the method you’ll use to a module-level variable.

from struct import Struct
long_to_bytes = Struct('>l').pack

with open('longs.bin', 'wb') as my_file:
    for i in range(10_000):
        my_file.write(long_to_bytes(i))

These changes are a little less impactful in recent versions of the cpython and pypy stdlib, as both re and struct now cache recent pattern compilations. However, the stored compilation combined with not needing to perform module attribute lookup still gives these patterns an advantage on newer Python interpreters.

Avoid runtime imports

If you’ve ever had two modules depending on each other’s assets while you were on a deadline, you might have felt inclined to import one of them at runtime (e.g. inside of a function):

module_a.py

def function_1():
    # This part will be slow, even if module_b is
    # already loaded in memory:
    from module_b import function_3
    print(function_3())

def function_2():
    return 'function_2_value'

module_b.py

from module_a import function_2

def function_3():
    return function_2()

Instead, import the other module itself at the top level, without pulling its specific assets out up front, and look them up as attributes at call time:

module_a.py

import module_b

def function_1():
    # This part does an attribute lookup
    # but is faster:
    print(module_b.function_3())

def function_2():
    return 'function_2_value'

module_b.py

import module_a

def function_3():
    return module_a.function_2()

When you do this, the import binds the module name right away, even if that module is still mid-initialization; its attributes are looked up later, at call time, by which point both modules have finished loading.

If you have the RAM, avoid generator comprehensions that get exhausted

Generator comprehensions seem like they should be very fast, but for many use-cases they are very slow! Docs suggesting they’re efficient are usually talking about memory efficiency, not compute efficiency.

''.join([ch.upper() for ch in my_string]) is faster than ''.join(ch.upper() for ch in my_string).

Generator comprehensions are faster in one scenario, though: when you stop iterating as soon as you find what you’re looking for, instead of processing until the iterator is exhausted.

So, depending on your data, any(ch.isupper() for ch in my_string) can be faster than any([ch.isupper() for ch in my_string]), because any returns on the first truthy value and the generator version skips the rest of the string.

Function of loop instead of loop of function (cpython)
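The idea: when a hot loop calls a small Python function once per item, every call pays the full Python function stack cost. Moving the loop inside the function pays that cost once. A minimal sketch with hypothetical names:

# One Python-level call per item:
def double(x):
    return x * 2

doubled = [double(i) for i in range(1_000_000)]

# One Python-level call total; the loop lives inside the function:
def double_all(items):
    return [x * 2 for x in items]

doubled = double_all(range(1_000_000))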

Faster set and frozenset constructors in older interpreters

This is an obscure one that is specific to cpython 3.7.x and lower including cpython 2.7.x.

If you need to create a large number of sets or frozensets, you can make a faster constructor than the built-in one.

For a constructor that creates sets from iterables, you start with the state of an empty set, and reuse its union function.

# At the module level, create our new constructor:
make_fs = frozenset().union

Then make_fs((1, 2, 3, 4, 5)) is about 1.1x faster than frozenset((1, 2, 3, 4, 5)) every time we do it.

For a constructor that creates a fresh empty set, start with the state of an empty set, and reuse its copy function.

# At the module level, create our new constructor:
make_empty_set = set().copy

Then make_empty_set() is about 1.5x faster than set() every time we do it.

There is no good use-case for constructing brand new empty frozensets. Because they’re immutable, one empty frozenset is as good as any other. Should you constantly need to reference an empty frozenset, just create one once and reuse it.

# At the module-level:
EMPTY_FROZENSET = frozenset()

def query_the_things(query_statement, filter_terms=EMPTY_FROZENSET):
    ...

As of cpython 3.8+, these tricks are moot; the built-in constructors are very fast.

The faster no-op and constant return function

This is really ugly and is cpython-specific, but if you need a function that always does nothing (i.e. returns None) or always returns the same value, and you want this to be as fast as possible, use a built-in C-based function instead of making one yourself with def or lambda.

Calling do_nothing() defined like this:

from itertools import repeat

do_nothing = repeat(None).__next__

is around 1.5x faster than if defined like this:

def do_nothing():
    pass

and 1.6x faster than if defined like this:

do_nothing = lambda: None

The same trick can be used to construct a function that always returns the same value really fast, just pass that value instead of None to repeat.
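For example:

from itertools import repeat

always_42 = repeat(42).__next__

assert always_42() == 42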

The function produced this way has limited use-cases, unfortunately, due to not being able to accept any arguments.

Don’t use queue.Queue, asyncio.Queue, or trio memory channels if you never wait

An anti-pattern I see is using your concurrency library’s queue mechanics when your code doesn’t need them. The benefit of using your concurrency library’s queue or channel mechanisms is that if you need to wait for items to be enqueued, you can wait without blocking the rest of your code (e.g. by doing this waiting in another thread, or by yielding control back to the event loop while you wait).

A good test of whether or not you’re wasting stack space and compute time on higher-level queueing is to look at how you’re retrieving items. If you are only retrieving items with asyncio.Queue.get_nowait(), queue.Queue.get_nowait(), or MemoryReceiveChannel.receive_nowait(), and only adding items with put_nowait() (or send_nowait() on trio’s send channel), you probably don’t need the higher-level Queue or Channel model.

If your code is simply loading items into a data structure and needs to process or retrieve them in insertion order later, use the lower-level collections.deque. It also supports capping the number of enqueued items for various caching or tailing use-cases. To use a collections.deque as a FIFO (first-in, first-out) queue, use deque.append(item) or deque += items for insertion and deque.popleft() for retrieval, as in the sketch below.
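A minimal sketch of deque-as-FIFO (the names here are hypothetical):

from collections import deque

fifo = deque(maxlen=100_000)  # maxlen is optional; it caps retained items

fifo.append('first')
fifo += ['second', 'third']

assert fifo.popleft() == 'first'  # items come out in insertion order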