Chasing Performance: Deconstructing Python Bytecode, Emitted Machine Code, and F# Efficiency
This post was originally published on the Innova official blog.
Introduction.
While reading Anthony Shaw’s CPython Internals, initially to understand concurrency and parallelism from the perspective of the Python compiler, I got hooked. What started as a quick detour into threading and the GIL turned into a deep dive from Chapter 1.
Every time we write code, we’re essentially describing abstract instructions that a compiler (or interpreter) must translate into something a CPU can actually execute. But how exactly does that translation work? And how much does it really impact performance?
I love how Robert Nystrom visualizes this transformation from high-level code to low-level operations.
It made me wonder: what does my Python function look like under the hood? How many machine-level instructions does it emit? Can I optimize it to emit fewer? And how does it compare to a functional language like F#, which compiles to .NET Intermediate Language (IL)?
This article is my exploration of those questions.
Measuring machine code output.
The CPython compilation process, while abstracted from most users, goes through several important stages before your code is ever executed. There’s a lot happening under the hood, especially during the parsing phase, so I’ll skip the deep details and focus on the high-level flow:
- Parsing – The source code is parsed into an Abstract Syntax Tree (AST).
- Compilation – The AST is transformed into a Control Flow Graph or other intermediate representation, and then into bytecode.
- Assembly – The bytecode is packaged for interpretation by the CPython virtual machine.
- Execution – The Python interpreter walks through the bytecode and executes it instruction by instruction.
Each of these steps has corresponding modules and APIs in CPython you can actually interact with, like ast, compile(), and dis.
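As a quick illustration (the snippet and its names are my own, not from the book), these stages can be driven by hand from Python itself:

```python
import ast
import dis

source = "def square(x):\n    return x * x"

# Stage 1: parse the source into an AST
tree = ast.parse(source)
print(type(tree).__name__)  # Module

# Stages 2-3: compile the AST down to a code object holding bytecode
module_code = compile(tree, "<demo>", "exec")

# Stage 4 is the interpreter's job; dis lets us peek at what it will run.
# The function's code object sits inside the module's constants.
nested = [c for c in module_code.co_consts if hasattr(c, "co_code")][0]
dis.dis(nested)
```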
Illustration
We will use a 4th-degree polynomial to test the instructions generated.
\[f(x) = x^4 + 3x^3 - 2x^2 + 7x - 5\]

If I were just starting out in programming, I’d likely write the function as a direct translation of the math expression:
def polynomial(x):
    return x**4 + 3*x**3 - 2*x**2 + 7*x - 5
Disassembling this with dis.dis(polynomial) gives the following bytecode:
2 2 LOAD_FAST 0 (x)
4 LOAD_CONST 1 (4)
6 BINARY_OP 8 (**)
10 LOAD_CONST 2 (3)
12 LOAD_FAST 0 (x)
14 LOAD_CONST 2 (3)
16 BINARY_OP 8 (**)
20 BINARY_OP 5 (*)
24 BINARY_OP 0 (+)
28 LOAD_CONST 3 (2)
30 LOAD_FAST 0 (x)
32 LOAD_CONST 3 (2)
34 BINARY_OP 8 (**)
38 BINARY_OP 5 (*)
42 BINARY_OP 10 (-)
46 LOAD_CONST 4 (7)
48 LOAD_FAST 0 (x)
50 BINARY_OP 5 (*)
54 BINARY_OP 0 (+)
58 LOAD_CONST 5 (5)
60 BINARY_OP 10 (-)
64 RETURN_VALUE
This emits 22 instructions, each representing an operation that the interpreter must evaluate. For optimization, we can rewrite this function using Horner’s Method:
Horner’s Rule rewrites a polynomial in nested form: \(a_0 + a_1x + a_2x^2 + \dots + a_nx^n = a_0 + x(a_1 + x(a_2 + \dots + x(a_{n-1} + x\,a_n)\dots))\). This reduces the number of arithmetic operations, especially costly exponentiations, resulting in improved performance.
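To sanity-check the rewrite, here is a small generic Horner evaluator (my own helper, not something from the post’s benchmarks) that folds the coefficients from the highest degree down:

```python
def horner(coeffs, x):
    """Evaluate a polynomial; coeffs are ordered from highest to lowest degree."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result

# f(x) = x^4 + 3x^3 - 2x^2 + 7x - 5, evaluated at x = 2
assert horner([1, 3, -2, 7, -5], 2.0) == 2**4 + 3*2**3 - 2*2**2 + 7*2 - 5
```

Each loop iteration does one multiply and one add, which is exactly the nesting Horner’s Rule describes.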
Here’s the rewritten version:
def polynomial_horner(x: float) -> float:
    return (((x + 3) * x - 2) * x + 7) * x - 5
The disassembled bytecode shows only 16 instructions:
2 2 LOAD_FAST 0 (x)
4 LOAD_CONST 1 (3)
6 BINARY_OP 0 (+)
10 LOAD_FAST 0 (x)
12 BINARY_OP 5 (*)
16 LOAD_CONST 2 (2)
18 BINARY_OP 10 (-)
22 LOAD_FAST 0 (x)
24 BINARY_OP 5 (*)
28 LOAD_CONST 3 (7)
30 BINARY_OP 0 (+)
34 LOAD_FAST 0 (x)
36 BINARY_OP 5 (*)
40 LOAD_CONST 4 (5)
42 BINARY_OP 10 (-)
46 RETURN_VALUE
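Rather than counting by hand, we can tally the instructions with dis.get_instructions (exact totals vary slightly across CPython versions, which add housekeeping opcodes like RESUME):

```python
import dis

def polynomial(x):
    return x**4 + 3*x**3 - 2*x**2 + 7*x - 5

def polynomial_horner(x):
    return (((x + 3) * x - 2) * x + 7) * x - 5

def instruction_count(func):
    # dis.get_instructions yields one Instruction per bytecode op
    return sum(1 for _ in dis.get_instructions(func))

print(instruction_count(polynomial), instruction_count(polynomial_horner))
```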
Here is a visual representation of the AST graph
Observations
Doing a quick check on the performance timings: the naive polynomial function takes an average of 0.88 μs per call, while Horner’s takes 0.37 μs per call, showing the difference in performance.
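The exact numbers will differ per machine; a quick way to reproduce the comparison yourself is with timeit:

```python
import timeit

def polynomial(x):
    return x**4 + 3*x**3 - 2*x**2 + 7*x - 5

def polynomial_horner(x):
    return (((x + 3) * x - 2) * x + 7) * x - 5

n = 100_000
naive = timeit.timeit(lambda: polynomial(2.0), number=n) / n
horner = timeit.timeit(lambda: polynomial_horner(2.0), number=n) / n
print(f"naive: {naive * 1e6:.2f} us/call, horner: {horner * 1e6:.2f} us/call")
```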
The BINARY_OP 8 (**) instruction (power operation) is one of the most computationally expensive operations in Python. When we do away with it in Horner’s method, we see a significant cut in both execution time and the bytecode generated.
Each operation undergoes type checking, stack manipulation, and memory access, which introduces overhead due to Python’s dynamic typing.
Even simple numbers like 5 and 3 are heap-allocated Python objects (boxed integers), not raw CPU-native integers.
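You can see the boxing directly: sys.getsizeof reports the size of the whole heap object, not a raw machine word (the byte counts in the comments are typical for 64-bit CPython and may vary by version):

```python
import sys

# A Python int is a full heap object with a refcount and a type pointer,
# not a bare 8-byte machine integer.
print(sys.getsizeof(5))    # typically 28 bytes on 64-bit CPython
print(sys.getsizeof(5.0))  # typically 24 bytes

# Far more than the 8 bytes a C double or int64 would occupy.
assert sys.getsizeof(5) > 8
```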
These nuances explain why writing cleaner, more optimized code—especially in numerical applications—can translate into fewer bytecode instructions, and ultimately, faster execution.
NOTE: There will always be a trade-off between performance and memory allocation. I expect the memory profiles of the two approaches to differ, but that’s an exploration for when we look at memory allocation.
Comparison with F#
For numerical computing I use a lot of F#, so my curiosity pushed me to run the same test and see how the performance plays out on the F# side.
In F# I’ll be using both ildasm and the BenchmarkDotNet library, the best profiler I have used thus far. Our programme.fs will look like so:
open System
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running
open BenchmarkDotNet.Diagnosers

[<MemoryDiagnoser; DisassemblyDiagnoser(exportCombinedDisassemblyReport=true)>]
type PolynomialBenchmarks() =

    [<Params(2.0)>]
    member val x = 0.0 with get, set

    [<Benchmark>]
    member this.NaivePolynomial() =
        let x = this.x
        x ** 4.0 + 3.0 * x ** 3.0 - 2.0 * x ** 2.0 + 7.0 * x - 5.0

    [<Benchmark>]
    member this.HornerPolynomial() =
        let x = this.x
        (((x + 3.0) * x - 2.0) * x + 7.0) * x - 5.0

[<EntryPoint>]
let main argv =
    BenchmarkRunner.Run<PolynomialBenchmarks>() |> ignore
    0
BenchmarkDotNet generates the low-level assembly code, which you can review. Here I’ll use the ildasm output, which is the CIL code: still a lower-level representation than the Python bytecode.
ldarg.0 // x
ldc.r8 4.0
call PowFloat // x**4
ldc.r8 3.0
ldarg.0 // x
ldc.r8 3.0
call PowFloat // x**3
mul // 3*x**3
add // x⁴ + 3x³
ldc.r8 2.0
ldarg.0 // x
ldc.r8 2.0
call PowFloat // x²
mul // 2*x²
sub // (x⁴+3x³)-2x²
ldc.r8 7.0
ldarg.0 // x
mul // 7*x
add // (prev)+7x
ldc.r8 5.0
sub // (prev)-5
ret
This generates 22 CIL instructions, while Horner’s method generates 18. But CIL is still an intermediate representation, not native CPU instructions. If you look at the assembly code the JIT generates for Horner’s method, it’s even more optimized: only 8 CPU instructions, no calls at all, and a code size of just 50 bytes (something I’ll explore later on in F#).
; Program+PolynomialBenchmarks.HornerPolynomial()
vmovsd xmm0,qword ptr [rcx+8]
vaddsd xmm1,xmm0,qword ptr [7FFB220F85A8]
vmulsd xmm1,xmm1,xmm0
vsubsd xmm1,xmm1,qword ptr [7FFB220F85B0]
vmulsd xmm1,xmm1,xmm0
vaddsd xmm1,xmm1,qword ptr [7FFB220F85B8]
vmulsd xmm0,xmm1,xmm0
vsubsd xmm0,xmm0,qword ptr [7FFB220F85C0]
ret
; Total bytes of code 50
The performance is also quite decent, but note the cut with Horner’s method: the time is reduced significantly.
| Method | x | Mean | Error | StdDev | Code Size | Allocated |
|---|---|---|---|---|---|---|
| NaivePolynomial | 2 | 88.8478 ns | 1.8153 ns | 1.6093 ns | 153 B | - |
| HornerPolynomial | 2 | 0.2780 ns | 0.0275 ns | 0.0244 ns | 50 B | - |
Observations.
The arithmetic ops in F# compile to direct CPU instructions, and no allocation happens.
The operations seen in the assembly code BenchmarkDotNet generates are all explicit, visible instructions.
Instructions like BINARY_OP in Python trigger PyNumber_Multiply() in the C API, which then performs dynamic type checking and a method lookup, and finally allocates memory for the resulting number.
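That dynamic lookup is observable from pure Python: BINARY_OP doesn’t know the operand types ahead of time, so the runtime resolves the operator on the type at each call. The Vec2 class below is purely illustrative:

```python
class Vec2:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __mul__(self, scalar):
        # The * BINARY_OP routes through the C-level PyNumber_Multiply,
        # which finds this method on the operand's type at run time.
        return Vec2(self.x * scalar, self.y * scalar)

v = Vec2(1, 2) * 3
print(v.x, v.y)  # 3 6
```

A statically compiled multiply, by contrast, is resolved once at compile time and emitted as a single mul instruction.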
Conclusions.
I initially started this as a set of notes while reading CPython Internals, but curiosity got the better of me and I ended up diving deeper into the instructions generated by different languages and runtimes.
We can confidently conclude that Horner’s method should always be preferred when implementing polynomials—regardless of the language—as it drastically reduces both the number of arithmetic operations and the size of the generated instructions.
For numerical computing, where performance is critical, F# offers a significant performance advantage. It compiles down to highly optimized machine code with minimal allocations, giving it a considerable speed-up compared to dynamically typed languages like Python.