.. role:: python(code)
   :language: python

.. default-role:: python

Micro-optimization
==================

I get tunnel-vision and spend more time micro-optimizing my code than I
should. It can drive my colleagues crazy! Don't be like me; forget about
micro-optimizing except in a few special cases. You've got more important
things to work on.

When to micro-optimize:

* Your code runs repeatedly and constantly (e.g. it is already taking up 100%
  of the CPU you give it).
* You pay by the execution millisecond. AWS Lambda is an example of this case.

The micro-optimizations listed below range from helpful things you ought to do
anyway to things that make your code less readable or portable.

Use class slots if you're constantly constructing objects
---------------------------------------------------------

Modern Python classes construct powerful and flexible objects. You can create
an object from a modern class, then attach attributes and methods the object
didn't have in its original class definition. To support this flexibility,
Python manages a mutable mapping of attributes to values and functions behind
the scenes. Constructing and managing this mapping is slow.

If you find that your program constructs millions of objects of the same class
and you never take advantage of the flexible attributes, you can disable the
mutable attribute mapping and greatly speed up object construction. To disable
it, tell Python all of your class' data attributes up front using a
:py:data:`~object.__slots__` class variable.

Use operators instead of functions where possible
-------------------------------------------------

When two objects interact, there's often both an operator and a function
version of that interaction. Given how expensive the Python function stack is,
use the operator version when possible. Behind the scenes, a function might
still get called, but with the built-in types, that's often a pure C function
instead of a more expensive one on the Python stack.
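For instance, here is a quick :py:mod:`timeit` comparison of the operator and
method forms of extending a list (a sketch only; `existing` and `other` are
made-up sample data, and the exact numbers vary by machine and interpreter):

```python
import timeit

existing = list(range(1000))
other = list(range(1000))

# Both statements build the same result; only the call overhead differs.
operator_time = timeit.timeit(
    'lst = existing[:]; lst += other', globals=globals(), number=10_000)
method_time = timeit.timeit(
    'lst = existing[:]; lst.extend(other)', globals=globals(), number=10_000)
print(f'+= took {operator_time:.3f}s, .extend() took {method_time:.3f}s')
```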
* `existing_list += other_iterable` is faster than
  `existing_list.extend(other_iterable)`.
* `new_sequence = existing_sequence[:]` is faster than
  `new_sequence = existing_sequence.copy()` for lists and bytearrays.
* `del my_dict[key]` is faster than `my_dict.pop(key)` if you know the item
  exists and don't need its value.
* `existing_dict |= other_dict` is faster than
  `existing_dict.update(other_dict)`.
* `existing_set |= other_set` is faster than `existing_set.update(other_set)`.
* Sets and frozensets actually have *tons* of operators. A look at the
  :py:class:`set` and :py:class:`frozenset` docs will help.

Note that some operators are stricter than their function equivalents. You can
`my_set.update(my_tuple)`, but you can't `my_set |= my_tuple`.

Use literals instead of functions where possible
------------------------------------------------

* f-strings are faster and more readable than :py:meth:`str.format`.
* `{1, 2, 3}` is faster than `set((1, 2, 3))`.
* :python:`{'item1': 'this', 'item2': 'that'}` is faster than
  `dict(item1='this', item2='that')`.

Special cases with even more optimization:

* Tuple literals that contain only literals themselves are evaluated at first
  execution and referred to as constants in the final bytecode, making them
  very fast. List literals don't enjoy this same advantage.
* While there is no frozenset literal, if you use a set literal for an "in"
  query, the set is compiled as a frozenset in the bytecode, and you get a
  frozenset's speed advantages.

In many cases, a function is getting called behind the scenes to fulfill the
operation, but when you use an operator, it's often a C function behind the
scenes rather than one on the more expensive Python function stack.

Use a faster JSON builder/parser
--------------------------------

If you run your code on pypy, skip this section. pypy's standard
:py:func:`json.dumps`/:py:func:`json.loads` are very fast and are probably
what you should use.
For everyone else, using a 3rd-party JSON library can have significant
performance benefits. For this dict:

.. code-block:: python

   dict_to_dump = {
       'boolValue': True,
       'intNumberValue': 234,
       'floatNumberValue': 234.641114,
       'stringValue': 'Hello world!',
       'stringWithNonAsciiChars': 'こんにちは世界!',
       'nullValue': None
   }

* :py:func:`rapidjson.dumps` is around 2x faster than the standard
  :py:func:`json.dumps`.
* :py:func:`ujson.dumps` is around 3.4x faster than :py:func:`json.dumps`.

orjson for JSON bytes
^^^^^^^^^^^^^^^^^^^^^

Until recently, the JSON specification was defined in terms of characters and
didn't specify how to transport it as bytes. Nowadays the JSON specification
specifically recommends UTF-8 for transporting JSON as bytes.

The built-in json module, ujson, and rapidjson all represent JSON as
characters and leave it up to you to serialize/deserialize it to/from bytes
however you see fit. But if you already know you're going to use UTF-8 bytes
for your JSON, the orjson module might be the best fit for you in speed. It
skips the character representation and directly gives you UTF-8 bytes from its
:py:func:`~orjson.dumps` function, and takes UTF-8 bytes in its
:py:func:`~orjson.loads` function. For the dict above,
`orjson.dumps(dict_to_dump)` *is over 7x faster than*
`json.dumps(dict_to_dump).encode('utf-8')`.

Use sets or frozensets for frequent "in" queries
------------------------------------------------

This one's easy. If all of the following are true, you should be using a set
or frozenset instead of a list or tuple:

* You are querying for existence often, e.g. `if needle in haystack`.
* You aren't constructing the collection or inserting into it often.

Avoid sets or frozensets for frequent inserts
---------------------------------------------

When you're doing more insertion than querying, sets are an expensive
collection type to use.
Inserting into an empty set is about 4x slower than inserting into an empty
list, and it gets worse the larger the set is.

Avoid frequent object attribute lookup
--------------------------------------

Every time the Python interpreter encounters a dot in a reference, a lookup
for that attribute is done on the parent object, whether the attribute is a
simple value or an instance method. So the statement

.. code-block:: python

   my_var = my_object.my_attribute.my_sub_attribute

results in two object attribute lookups. This isn't bad until you are
wastefully looking up the same attributes over and over again, like so:

.. code-block:: python

   my_object.attr1.method1()
   my_object.attr1.method2()
   my_object.attr1.sub_attr = [30]
   my_object.attr1.sub_attr.append(40)
   my_object.attr1.sub_attr.clear()
   my_var = my_object.attr1.sub_attr2

Which can instead be written as:

.. code-block:: python

   attr1 = my_object.attr1
   attr1.method1()
   attr1.method2()
   sub_attr = attr1.sub_attr = [30]
   sub_attr.append(40)
   sub_attr.clear()
   my_var = attr1.sub_attr2

In this toy example it probably won't make the code much faster, but when you
see the same attribute lookup performed in every iteration of a for or while
loop, you can imagine how those lookups start to add up. Take this example:

.. code-block:: python

   even_numbers = []
   for i in range(100):
       even_numbers.append(i * 2)

The `append` method of the list `even_numbers` has to be looked up by the
Python interpreter in every iteration of the loop. If it weren't, the
interpreter might not notice that the append method was replaced by a
different method in a previous iteration or by a different thread. If we save
the append method to a variable we reference every time, it's faster than
referencing the list variable and looking up its `append` attribute every
time:

.. code-block:: python

   even_numbers = []
   append_even_number = even_numbers.append
   for i in range(100):
       append_even_number(i * 2)

This runs about 1.25x faster than the example with `.append()` in every
iteration.
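You can check the difference on your own machine with :py:mod:`timeit` (a
sketch; the exact speedup will vary by interpreter version and hardware):

```python
import timeit

def with_lookup():
    even_numbers = []
    for i in range(100):
        even_numbers.append(i * 2)  # attribute lookup every iteration
    return even_numbers

def with_bound_method():
    even_numbers = []
    append_even_number = even_numbers.append  # one lookup, reused below
    for i in range(100):
        append_even_number(i * 2)
    return even_numbers

# Both build the same list; only the lookup cost differs.
print(timeit.timeit(with_lookup, number=50_000))
print(timeit.timeit(with_bound_method, number=50_000))
```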
Don't use ExitStack or AsyncExitStack for visual organization
-------------------------------------------------------------

The context manager is a nice syntax for ensuring completion or cleanup of
tasks and resources. As more Python components adopt the pattern, chances are
you'll need to stack multiple context managers on top of each other, leading
to an indentation nightmare:

.. code-block:: python

   # What you're trying to avoid
   import gzip
   import socket

   with open('original.dat', 'rb') as original_file:
       with gzip.open('archive.gz', 'wb') as archive_file:
           with socket.create_connection(('127.0.0.1', 1337)) as connection:
               for line in original_file:
                   archive_file.write(line)
                   connection.sendall(line)

You might be inclined to use :py:class:`contextlib.ExitStack` or
:py:class:`contextlib.AsyncExitStack` as a way to avoid multiple indents:

.. code-block:: python

   from contextlib import ExitStack

   with ExitStack() as stack:
       original_file = stack.enter_context(open('original.dat', 'rb'))
       archive_file = stack.enter_context(gzip.open('archive.gz', 'wb'))
       connection = stack.enter_context(
           socket.create_connection(('127.0.0.1', 1337)))
       for line in original_file:
           archive_file.write(line)
           connection.sendall(line)

But there's a faster way with pure syntax, commas and parentheses:

.. code-block:: python

   with (
       open('original.dat', 'rb')
   ) as original_file, (
       gzip.open('archive.gz', 'wb')
   ) as archive_file, (
       socket.create_connection(('127.0.0.1', 1337))
   ) as connection:
       for line in original_file:
           archive_file.write(line)
           connection.sendall(line)

In toy examples with minimalist context managers, the commas-and-parentheses
approach is over 3x faster than the ExitStack equivalent, and over 2x faster
than the AsyncExitStack equivalent.

There are still some scenarios where you'd want to use an
:py:class:`~contextlib.ExitStack` or :py:class:`~contextlib.AsyncExitStack`:

* When you need a child function or task to add a context manager to the
  stack that outlives that child function or task.
* When you need to mix async and non-async context managers.
* When multiple context entrance/constructor functions are costly and you
  want to enter multiple contexts concurrently to make it faster (e.g. with
  :py:meth:`ThreadPoolExecutor.map <concurrent.futures.Executor.map>` or
  :py:func:`asyncio.gather`).

Pre-compile frequently used structs and regular expressions
-----------------------------------------------------------

If you find yourself searching, matching, or splitting with the same regular
expression repeatedly, you ought to pre-compile it and assign the method
you'll use to a module-level variable.

.. code-block:: python

   import re

   search_for_errors = re.compile('(?:ERROR|Exception)').search

   def parse_log_line(line):
       # Called constantly in the application's lifespan
       match = search_for_errors(line)

If you find yourself packing or unpacking data into and from bytes with the
same struct format repeatedly, the same applies: pre-compile it and assign
the method you'll use to a module-level variable.

.. code-block:: python

   from struct import Struct

   long_to_bytes = Struct('>l').pack

   # my_file is a file object opened elsewhere in binary mode
   for i in range(10_000):
       my_file.write(long_to_bytes(i))

These changes are a little less impactful in recent versions of the cpython
and pypy stdlib, as both re and struct now cache recent pattern compilations.
However, the stored compilation combined with not needing to perform a module
attribute lookup still gives these patterns an advantage on newer Python
interpreters.

Avoid runtime imports
---------------------

If you have ever run into a situation where two modules depended on each
other's assets and you were on a deadline, you might have felt inclined to
have one of the modules import the other at runtime (e.g. inside of a
function):

.. code-block:: python
   :caption: module_a.py

   def function_1():
       # This part will be slow, even if module_b is
       # already loaded in memory:
       from module_b import function_3
       print(function_3())

   def function_2():
       return 'function_2_value'

.. code-block:: python
   :caption: module_b.py

   from module_a import function_2

   def function_3():
       return function_2()

Instead, on one side, import the other module without importing the specific
assets needed up front:

.. code-block:: python
   :caption: module_a.py

   import module_b

   def function_1():
       # This part does an attribute lookup
       # but is faster:
       print(module_b.function_3())

   def function_2():
       return 'function_2_value'

.. code-block:: python
   :caption: module_b.py

   import module_a

   def function_3():
       return module_a.function_2()

When you do this, the interpreter fills out its interpretation of the other
module after the name has been imported.

If you have the RAM, avoid generator comprehensions that get exhausted
----------------------------------------------------------------------

Generator comprehensions seem like they should be very fast, but for many
use-cases they are very slow! Docs suggesting they're efficient are usually
talking about memory efficiency, not compute efficiency.
`''.join([char.upper() for char in my_string])` is faster than
`''.join(char.upper() for char in my_string)`.

There is a scenario in which generator comprehensions are faster, though:
when you stop iterating after you've found what you're looking for instead
of processing until the comprehension iterator is exhausted. So, depending on
your data, `any(char.isupper() for char in my_string)` can be faster than
`any([char.isupper() for char in my_string])` because `any` returns on the
first truthy value.

Function of loop instead of loop of function (cpython)
------------------------------------------------------

Where you have the choice, call one C-implemented function over the whole
dataset rather than calling a function in every iteration of a Python-level
loop. Built-ins like `sum`, `map`, and `str.join` run their loops in C, so
`sum(numbers)` beats accumulating in a for loop, and `''.join(parts)` beats
repeated string concatenation in a loop.

Faster set and frozenset constructors in older interpreters
-----------------------------------------------------------

This is an obscure one that is specific to cpython 3.7.x and lower, including
cpython 2.7.x.
If you need to create a large number of :py:class:`set`\ s or
:py:class:`frozenset`\ s, you can make a faster constructor than the built-in
one.

For a constructor that creates sets from iterables, start with the state of an
empty set and reuse its union method.

.. code-block:: python

   # At the module level, create our new constructor:
   make_fs = frozenset().union

Then `make_fs((1, 2, 3, 4, 5))` is about 1.1x faster than
`frozenset((1, 2, 3, 4, 5))` every time we do it.

For a constructor that creates a fresh empty set, start with the state of an
empty set and reuse its copy method.

.. code-block:: python

   # At the module level, create our new constructor:
   make_empty_set = set().copy

Then `make_empty_set()` is about 1.5x faster than `set()` every time we do it.

There is no good use-case for constructing brand new empty frozensets.
Because they're immutable, one empty frozenset is as good as any other.
Should you constantly need to reference an empty frozenset, just create one
once and reuse it.

.. code-block:: python

   # At the module level:
   EMPTY_FROZENSET = frozenset()

   def query_the_things(query_statement, filter_terms=EMPTY_FROZENSET):
       ...

As of cpython 3.8+, these tricks are moot; the built-in constructors are very
fast.

The faster no-op and constant-return function
---------------------------------------------

This is really ugly and cpython-specific, but if you need a function that
always does nothing (i.e. returns ``None``) or always returns the same value,
and you want it to be as fast as possible, use a built-in C-based function
instead of making one yourself with def or lambda.

Calling `do_nothing()` defined like this:

.. code-block:: python

   from itertools import repeat

   do_nothing = repeat(None).__next__

is around 1.5x faster than if defined like this:

.. code-block:: python

   def do_nothing():
       pass

and 1.6x faster than if defined like this:

.. code-block:: python

   do_nothing = lambda: None

The same trick can be used to construct a function that always returns the
same value really fast; just pass that value instead of :py:obj:`None` to
repeat. The function produced this way has limited use-cases, unfortunately,
because it can't accept any arguments.

Don't use queue.Queue, asyncio.Queue, or trio.MemoryChannel if you never wait
-----------------------------------------------------------------------------

An anti-pattern I see is using your concurrency library's queue mechanics
when your code doesn't need them. The benefit of your concurrency library's
queue or channel mechanisms is that if you need to wait for items to be
enqueued, you can wait without blocking the rest of your code (e.g. by doing
that waiting in another thread, or by yielding control back to the event loop
while you wait).

A good test of whether you're wasting stack space and compute time on
higher-level queueing is to look at how you're retrieving items. If you only
retrieve items with :py:meth:`asyncio.Queue.get_nowait`,
:py:meth:`queue.Queue.get_nowait`, or
:py:meth:`MemoryReceiveChannel.receive_nowait()
<trio.MemoryReceiveChannel.receive_nowait>` and only add items with
`put_nowait()`, you probably don't need the higher-level Queue or Channel
model.

If your code simply loads items into a data structure and needs to process or
retrieve them in insertion order later, use the lower-level
:py:class:`collections.deque`. It also supports limiting the number of
enqueued items for various caching or tailing use-cases. To use a
:py:class:`collections.deque` like a FIFO (first-in, first-out) queue, just
use :py:meth:`deque.append(item) <collections.deque.append>` or
`deque += items` for insertion and
:py:meth:`deque.popleft() <collections.deque.popleft>` for retrieval.
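A minimal sketch of the deque-as-FIFO pattern (the names here are
illustrative):

```python
from collections import deque

# FIFO usage: append on one end, popleft from the other.
fifo = deque()
fifo.append('first')
fifo += ['second', 'third']  # operator insertion, per the operators section
assert fifo.popleft() == 'first'
assert fifo.popleft() == 'second'

# maxlen caps the queue for tailing/caching use-cases;
# old items fall off the far end as new ones are appended.
tail = deque(maxlen=3)
for i in range(10):
    tail.append(i)
assert list(tail) == [7, 8, 9]
```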