08 Oct 2021

When the Python Interpreter Is the Bottleneck

Consider the logistic map, a one-line recurrence, iterated 10 million times. The computation does no I/O and spends almost all its time executing a handful of floating-point operations in a Python loop. Rewriting that loop in C++ can produce a large speedup. Most Python performance problems are not this clean.

Performance-sensitive Python already delegates much of its work to compiled code. NumPy, image codecs, and compression libraries do not execute their inner loops as Python bytecode. Interpreter overhead matters when the hot computation remains in Python and no existing library can absorb it. In that case, cppimport can compile a small C++ extension as part of the import process. First, though, the profiler has to show that the loop is the problem.

Finding the hot spot

A profiler such as cProfile, py-spy, or line_profiler, plus small time.perf_counter() checks around suspected kernels, should show where the program spends its time. If it is waiting on I/O or running an already compiled library call, rewriting nearby Python code in C++ will not help.
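As a minimal sketch of that first measurement, the standard-library cProfile and pstats modules can rank functions by cumulative time. The names slow_kernel and main are hypothetical stand-ins for a real program, and the step count is arbitrary:

```python
import cProfile
import io
import pstats


def slow_kernel(steps, r=3.9, x=0.5):
    # Hypothetical hot loop: a scalar recurrence executed as Python bytecode.
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x


def main():
    # Hypothetical program: one expensive kernel plus some cheap work.
    slow_kernel(200_000)
    sorted(range(1000))


profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Collect the report in a string so it can be printed or inspected.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

If the kernel dominates the cumulative column, the loop is a candidate for the remedies below. If the time sits in I/O or in a compiled library call, it is not.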

The first remedy depends on the shape of the bottleneck:

Symptom	First tool to try
Algorithmic blowup	A better algorithm
Repeated subproblems	Memoization or dynamic programming
Array arithmetic in Python loops	NumPy or another vectorized library
Scalar Python hot loop	A just-in-time (JIT) compiler or a compiled extension

The algorithm, not the language

A familiar example is the naive recursive Fibonacci, slow because of its algorithm rather than its language:

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

This version makes $\Theta(\varphi^{\,n})$ calls because it repeatedly solves the same subproblems. C++ would make each call cheaper without removing the exponential growth. Reusing subproblem results collapses the evaluation to linear work, although the cost of each addition still grows with the size of the integer. A memoized recursive version (@lru_cache) expresses this directly but still recurses n deep, so it hits Python’s recursion limit for large n; a bottom-up loop avoids that:

def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
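The difference in call counts can be checked directly. The sketch below counts the calls made by the naive recursion and compares them with the cache misses of a memoized version. The names count_naive_calls and fib_memo are illustrative only:

```python
from functools import lru_cache


def count_naive_calls(n):
    """Return (fib(n), number of calls made) for the naive recursion."""
    calls = 0

    def fib(k):
        nonlocal calls
        calls += 1
        return k if k < 2 else fib(k - 1) + fib(k - 2)

    return fib(n), calls


@lru_cache(maxsize=None)
def fib_memo(n):
    # Each distinct argument is computed once; repeats are cache hits.
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)


value, calls = count_naive_calls(20)
print(value, calls)                                 # 6765 21891
print(fib_memo(20), fib_memo.cache_info().misses)   # 6765 21
```

The naive version needs 21,891 calls to reach fib(20), while the memoized version evaluates each of the 21 distinct arguments once.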

For sufficiently large n, changing the algorithm matters more than changing the language: the linear Python version can beat an exponential C++ version, as long as both implementations compute the same integer sequence rather than letting fixed-width overflow change the problem. Array arithmetic has a similar escape hatch: a Python loop pays interpreter overhead for every element, while a NumPy operation runs a compiled loop over typed array buffers, which are often contiguous.¹

Work that survives checks for a better algorithm, repeated work, and vectorization is a candidate for a native extension. Branch-heavy parsers, custom data-structure traversals, and small numeric recurrences can all have this shape, with no useful high-level shortcut. In such a loop, each Python arithmetic operation dispatches bytecode, checks types, and creates Python objects for its results; a C++ compiler working with double values does not pay those costs.
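The standard-library dis module makes the per-operation dispatch visible. This sketch disassembles a hypothetical step() function that holds one iteration of the recurrence. Exact opcode names vary between CPython versions, so the code filters on the shared BINARY_ prefix and does not expect a specific listing:

```python
import dis


def step(r, x):
    # One iteration of the logistic map, as the interpreter sees it.
    return r * x * (1.0 - x)


opnames = [ins.opname for ins in dis.get_instructions(step)]
binary_ops = [name for name in opnames if name.startswith("BINARY_")]

print(opnames)
print(f"{len(binary_ops)} arithmetic bytecodes per step")
```

Each of those arithmetic instructions goes through type checks and allocates a new float object, on top of the loads that feed it.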

Compiling on import

Once measurement points to the Python loop, a compiled kernel is one option. cppimport removes the separate build-and-install command. The extension still needs binding code and a small build configuration, but importing the module triggers compilation and loads the resulting shared library.

This example fits in one C++ file, although cppimport can also track headers and additional source files. The first-line marker // cppimport prevents the import hook from compiling unrelated C++ files, pybind11 converts the scalar arguments and return value, and the trailing Mako block configures the build. Keeping that block at the end preserves useful line numbers in compiler errors.

The full module goes in logistic_cpp.cpp, since the filename must match the module name. It iterates the logistic map,

$$x_{n+1} = r\,x_n(1 - x_n),$$

which exhibits chaotic behavior for some values of $r$:

// cppimport
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Same recurrence as logistic() in bench.py.
double logistic(long long steps, double r, double x) {
    for (long long i = 0; i < steps; ++i)
        x = r * x * (1.0 - x);
    return x;
}

PYBIND11_MODULE(logistic_cpp, m) {
    m.def("logistic", &logistic, py::call_guard<py::gil_scoped_release>());
}

/*
<%
# This flag assumes GCC or Clang.
cfg['extra_compile_args'] = ['-O3']
setup_pybind11(cfg)
%>
*/

The call_guard releases the Global Interpreter Lock while the function runs. This is safe because the loop does not touch Python objects. It is not responsible for the single-threaded speedup, but it allows separate calls to run on different cores from different Python threads when the C++ code is thread-safe.
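One way to use that property is a thread pool that runs independent trajectories side by side. The helper below is a sketch: run_trajectories and python_kernel are hypothetical names, and the pure-Python stand-in exists only so the snippet runs without the compiled module. With the stand-in the threads still take turns holding the GIL, so any parallel speedup depends on passing a kernel that releases it:

```python
from concurrent.futures import ThreadPoolExecutor


def run_trajectories(kernel, starts, steps, r, workers=4):
    """Call kernel(steps, r, x0) for each x0 on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(kernel, steps, r, x0) for x0 in starts]
        return [future.result() for future in futures]


def python_kernel(steps, r, x):
    # Stand-in with the same signature as the compiled logistic() function.
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x


starts = [0.1, 0.2, 0.3, 0.4]
print(run_trajectories(python_kernel, starts, 1_000, 3.9))
```

With the compiled module imported, passing logistic_cpp.logistic as kernel lets the four calls overlap on separate cores.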

cppimport is an ordinary package on PyPI, and installing it pulls in pybind11 and Mako. Beyond that, the machine needs a C++ compiler and the Python development headers. The -O3 flag in this example assumes GCC or Clang; another compiler may require a different optimization flag.

On the first import in a fresh Python process, cppimport builds the extension and caches the shared library. The build takes seconds rather than milliseconds, and a build failure aborts the import with the compiler’s diagnostic. On later runs cppimport compares checksums and rebuilds when the source or a declared dependency has changed.

Within one process, a second import returns the module already stored in sys.modules, so restart a long-running interpreter or notebook kernel after editing the C++ file. The explicit cppimport.imp("logistic_cpp") API performs the same build and load operation without requiring the first-line marker.
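The caching behavior is ordinary Python import machinery, not something specific to cppimport. A short sketch with a standard-library module, chosen only because it is always available, shows that a repeated import is a lookup in sys.modules:

```python
import sys

import json  # first import executes the module and stores it in sys.modules

cached = sys.modules["json"]

import json as json_again  # second import finds the cached entry

print(json_again is cached)  # True

# The explicit cppimport API mentioned above follows the same caching rule:
#     import cppimport
#     logistic_cpp = cppimport.imp("logistic_cpp")
```

An edited C++ file therefore has no effect on a process that has already loaded the old shared library.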

A workload without shortcuts

A single logistic-map trajectory has a loop-carried dependency: each step consumes the previous result. That prevents ordinary NumPy-style vectorization across iterations, although an array of independent trajectories could be vectorized.
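The independent-trajectory case might look like the following sketch, which assumes NumPy is installed. The time loop stays in Python, but each iteration updates every trajectory in one compiled pass. The function names are illustrative:

```python
import numpy as np


def logistic_many(steps, r, x0):
    """Advance many independent trajectories; x0 holds one starting point each."""
    x = np.asarray(x0, dtype=np.float64).copy()
    for _ in range(steps):        # loop-carried dependency: still sequential in time
        x = r * x * (1.0 - x)     # vectorized across trajectories
    return x


def logistic_one(steps, r, x):
    # Scalar reference for a single trajectory.
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x


starts = [0.1, 0.2, 0.3, 0.4]
print(logistic_many(10, 3.9, starts))
print([logistic_one(10, 3.9, x0) for x0 in starts])
```

This helps only when there are many trajectories to amortize the per-iteration overhead. A single trajectory gains nothing from it.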

Memoization offers no useful shortcut either. At $r = 3.9$, the recurrence is chaotic in real arithmetic. A floating-point implementation has finitely many possible states and must eventually repeat, but a cache would have to store and hash floating-point states while waiting for a repeat. That adds work and memory to avoid a handful of arithmetic operations.
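A small experiment illustrates the bookkeeping such a cache would need. The hypothetical helper steps_until_repeat stores every state in a set and reports the first step at which one recurs. The horizon of 100,000 steps is an arbitrary choice, and no particular result is claimed for the chaotic case:

```python
def steps_until_repeat(r, x, limit):
    """Return the first step at which a state recurs, or None within limit steps."""
    seen = {x}
    for step in range(1, limit + 1):
        x = r * x * (1.0 - x)
        if x in seen:
            return step
        seen.add(x)  # one hashed float stored per iteration
    return None


# Sanity check: for r = 2 the point x = 0.5 maps to itself, so it repeats at step 1.
print(steps_until_repeat(2.0, 0.5, 10))

# Chaotic case from the text; the set grows by one entry per step until a repeat.
print(steps_until_repeat(3.9, 0.5, 100_000))
```

Every iteration pays for a hash, a lookup, and a stored float in order to skip one multiplication chain.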

The benchmark

A short script times the same recurrence in Python and C++, with identical inputs and iteration count. The import happens before timing, so compilation and I/O stay out of the timed region. Save it as bench.py:

import math
import time

import cppimport.import_hook
import logistic_cpp  # compiles logistic_cpp.cpp on first import


def logistic(steps, r, x):  # same algorithm as logistic_cpp.logistic
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x


def logistic_cpp_one_step(steps, r, x):  # same kernel, but crosses the boundary every step
    for _ in range(steps):
        x = logistic_cpp.logistic(1, r, x)
    return x


STEPS, R, X0 = 10_000_000, 3.9, 0.5


if not math.isclose(
    logistic(10, R, X0),
    logistic_cpp.logistic(10, R, X0),
    rel_tol=1e-12,
    abs_tol=1e-12,
):
    raise SystemExit("python and C++ results diverge after 10 steps")


def measure(func, repeats=3):
    timings = []
    func(10_000, R, X0)  # untimed warm-up to discard first-call effects
    for _ in range(repeats):
        start = time.perf_counter()
        func(STEPS, R, X0)
        timings.append(time.perf_counter() - start)
    return min(timings)


py_time = measure(logistic)
cpp_time = measure(logistic_cpp.logistic)
step_time = measure(logistic_cpp_one_step)

print(f"python:       {py_time:.4f}s")
print(f"c++ batched:  {cpp_time:.4f}s")
print(f"c++ per step: {step_time:.4f}s")
print(f"batched c++ is {py_time / cpp_time:.0f}x faster")
print(f"per-step c++ is {step_time / py_time:.1f}x slower")

Place it next to logistic_cpp.cpp so the import hook can find the extension:

.
|-- bench.py
`-- logistic_cpp.cpp

Run it with python bench.py. The short closeness check catches basic implementation errors before the chaotic recurrence magnifies small floating-point differences. The script then reports the best of three runs, which reduces the effect of scheduler interruptions.

On one machine, with C++ compiled using -O3, a run produced:

python:       0.6404s
c++ batched:  0.0267s
c++ per step: 2.6830s
batched c++ is 24x faster
per-step c++ is 4.2x slower

Treat the result as a workload microbenchmark. It compares Python’s boxed, dynamically dispatched operations with optimized native floating-point operations on one system. The ratio can vary substantially across machines and Python builds.

The cost of each crossing

The ratio also depends on where the call boundary sits. Each crossing has a fixed cost: pybind11 resolves the call, converts the scalar arguments and the return value, and the call_guard releases and reacquires the GIL. The logistic_cpp_one_step() variant makes that cost visible by moving the loop back into Python and calling the C++ function one step at a time.

In the same run, that arrangement was 4.2 times slower than the pure Python loop. The fixed cost, a few hundred nanoseconds per call here, exceeds the single recurrence step each call performs. The same compiled kernel is tens of times faster than Python when the batched call runs the whole recurrence and slower than Python when each call runs one step.
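The per-call figure follows from the timings reported above. The arithmetic below is a rough estimate, not a measurement of pybind11 alone: it attributes everything a one-step call costs beyond the step itself to the boundary, which also includes the Python loop and the attribute lookup.

```python
import math

STEPS = 10_000_000
py_time, cpp_time, step_time = 0.6404, 0.0267, 2.6830  # seconds, from the run above

py_step_ns = py_time / STEPS * 1e9      # one step in the pure Python loop
cpp_step_ns = cpp_time / STEPS * 1e9    # one step inside the batched C++ loop
per_call_ns = step_time / STEPS * 1e9   # one call that performs a single step

# Estimated fixed cost of a call, under the assumption stated above.
overhead_ns = per_call_ns - cpp_step_ns

# Batch size k at which overhead + k * cpp_step matches k * py_step.
break_even = math.ceil(overhead_ns / (py_step_ns - cpp_step_ns))

print(f"python step:   {py_step_ns:.1f} ns")
print(f"c++ step:      {cpp_step_ns:.1f} ns")
print(f"one-step call: {per_call_ns:.1f} ns")
print(f"overhead:      {overhead_ns:.1f} ns")
print(f"break-even:    {break_even} steps per call")
```

On these numbers a call costs roughly 266 ns beyond its single step, and the compiled kernel starts to win at about five steps per call.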

Where it fits

Compiling at import time requires a compiler, Python headers, and pybind11 headers on the machine that runs the code. That works well for research code and internal tools in controlled environments. It can also be convenient in a notebook for a one-time build, although loading an edited extension reliably requires a kernel restart.

It is a poor default for a library distributed to arbitrary machines. Maintained packages should usually build wheels ahead of time with pybind11 and a conventional backend such as setuptools or scikit-build. Users then install a compatible binary instead of a compiler toolchain.

Two other costs remain. Debugging crosses a language boundary, and a fault in C++ can terminate the interpreter instead of raising a Python exception. Argument conversion also grows with the data. This example passes scalars, but pybind11’s automatic STL conversions copy containers such as lists and dictionaries into C++ equivalents, in both directions. Large numeric arguments are better passed as NumPy arrays, which can cross the boundary without copying.

cppimport is one point in a larger design space:

Tool	Best fit
Numba	Numeric loops that fit its supported Python and NumPy subset.²
Cython	Python-like code that benefits from added type information and compilation.
pybind11 with a wheel backend	Maintained packages that should install without a compiler on the user’s machine.
Rust with PyO3	Extension modules written in Rust rather than C++.
cppimport	C++ kernels that benefit from a short edit-compile-import loop.

cppimport is strongest when the kernel belongs in C++ anyway, because the project already has C++ code, needs a library outside a JIT compiler’s reach, or requires direct control over native data structures.

Keep the boundary small

Most Python performance problems are solved before any C++ is written: by the profiler pointing somewhere unexpected, by a better algorithm, or by a vectorized library call. For the loop that survives lower-friction remedies, cppimport reduces a C++ kernel to one source file and an import statement. What keeps the rewrite worthwhile is a small native surface, with enough work behind each call to pay for crossing the boundary.

References

¹ Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, and others. 2020. Array programming with NumPy. Nature 585, 7825 (2020), 357-362.
² Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. 2015. Numba: A LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, 2015. 1-6.

Boyang Yue