Python GIL vs No-GIL: Real FastAPI benchmarks with free-threaded Python 3.13

Apr 18, 2026

FastAPI 0.136.0 officially supports free-threaded Python. I ran benchmarks to measure its real impact on API performance.

Benchmarked on Python 3.12 (GIL) and Python 3.13.0t (No-GIL), load tested with wrk.

What is the GIL, and why does it matter?

The Global Interpreter Lock (GIL) is a mutex in CPython that allows only one thread to execute Python bytecode at a time. For I/O-bound work (waiting on databases, HTTP calls, file reads), the GIL is released while a thread waits, so threads cooperate well. For CPU-bound work, the GIL is a hard ceiling: no matter how many threads you spin up, only one runs Python code at a time.

Free-threaded Python (3.13t) is a separate CPython build that runs with this lock disabled, enabling true parallel execution of Python threads across CPU cores.
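To see that ceiling on your own machine, here is a minimal sketch that times the same CPU-bound function run twice in sequence and then on two threads. The names busy_sum, time_sequential and time_threaded are illustrative and not part of the benchmark app. On a GIL build the two timings should land close together; on a free-threaded build with idle cores the threaded run should be noticeably shorter. Exact numbers depend on your hardware.

```python
import threading
import time


def busy_sum(n: int) -> int:
    """Pure-Python CPU-bound loop: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_sequential(n: int, repeats: int = 2) -> float:
    """Wall-clock seconds to run busy_sum `repeats` times back to back."""
    start = time.perf_counter()
    for _ in range(repeats):
        busy_sum(n)
    return time.perf_counter() - start


def time_threaded(n: int, repeats: int = 2) -> float:
    """Wall-clock seconds to run busy_sum on `repeats` threads at once."""
    threads = [threading.Thread(target=busy_sum, args=(n,)) for _ in range(repeats)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 2_000_000
    print(f"sequential: {time_sequential(n):.3f}s")
    print(f"threaded:   {time_threaded(n):.3f}s")
```

Run it once with each interpreter and compare the ratio of the two lines rather than the absolute times.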

Setting up the experiment

The goal: run the exact same code on two Python runtimes and measure what changes. No code modifications - just a different interpreter.

Test environment

All benchmarks were executed on:

MacBook M2 (8-core CPU, 8-core GPU), 16 GB RAM

Step 1: Project setup

Create these two files in your project root:

.python-version

3.13.0t

pyproject.toml

[project]
name = "fastapi-gil-benchmark"
version = "0.1.0"
description = "Benchmarking Python GIL vs free-threaded (No-GIL) performance with FastAPI"
authors = []
readme = "README.md"
requires-python = ">=3.12,<3.14"
dependencies = ["fastapi>=0.136.0", "uvicorn>=0.44.0"]

[dependency-groups]
dev = []

The .python-version file tells uv to use the free-threaded build automatically - no manual pyenv switching needed.

Step 2: Install dependencies

uv sync

uv reads .python-version, pulls 3.13.0t if not already installed, creates the virtual environment, and installs dependencies in one step.

Step 3: Verify GIL is off

uv run python -c "import sys; print(sys._is_gil_enabled())"
# False
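The one-liner above only works on Python 3.13 and later, because sys._is_gil_enabled() does not exist on 3.12. Since the same project is run on both interpreters, a small helper that works on either can be handy. This is a sketch; the function name describe_gil is made up for illustration.

```python
import sys
import sysconfig


def describe_gil() -> dict:
    """Report the interpreter version, build type, and runtime GIL state."""
    # Py_GIL_DISABLED is 1 on free-threaded builds; 0 or None otherwise.
    build_flag = sysconfig.get_config_var("Py_GIL_DISABLED")
    # sys._is_gil_enabled() was added in 3.13; before that the GIL is always on.
    check = getattr(sys, "_is_gil_enabled", None)
    return {
        "version": sys.version.split()[0],
        "free_threaded_build": bool(build_flag),
        "gil_enabled": True if check is None else bool(check()),
    }


if __name__ == "__main__":
    print(describe_gil())
```

The two flags are reported separately because a free-threaded build can still run with the GIL turned back on, for example when PYTHON_GIL=1 is set or when an imported C extension does not declare free-threading support.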

Step 4: Create main.py

The same file runs on both Python 3.12 (GIL) and Python 3.13t (No-GIL). It has four benchmark endpoints: two CPU-bound and two I/O-bound, each in a threaded and a sequential variant.

"""FastAPI GIL vs No-GIL benchmark"""

from fastapi import FastAPI
import time
import threading

app = FastAPI()


def cpu_heavy_task(n: int) -> int:
    """CPU-bound work in pure Python: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def io_task():
    time.sleep(2)  # simulate blocking I/O


@app.get("/")
def root():
    return {"message": "GIL vs No-GIL Demo API"}


@app.get("/cpu-thread")
def cpu_thread():
    start = time.time()
    threads = []
    for _ in range(2):
        t = threading.Thread(target=cpu_heavy_task, args=(2_000_000,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return {
        "type": "CPU-bound (threads)",
        "time_taken": round(time.time() - start, 2),
    }


@app.get("/cpu-seq")
def cpu_seq():
    start = time.time()
    cpu_heavy_task(2_000_000)
    cpu_heavy_task(2_000_000)
    return {
        "type": "CPU sequential",
        "time_taken": round(time.time() - start, 2),
    }


@app.get("/io-thread")
def io_thread():
    start = time.time()
    threads = []
    for _ in range(2):
        t = threading.Thread(target=io_task)
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return {
        "type": "IO-bound (threads)",
        "time_taken": round(time.time() - start, 2),
    }


@app.get("/io-seq")
def io_seq():
    start = time.time()
    io_task()
    io_task()
    return {
        "type": "IO sequential",
        "time_taken": round(time.time() - start, 2),
    }

Step 5: Run the server

Use a single worker. Multiple workers spawn separate processes, each with its own interpreter and its own GIL, so they sidestep the GIL entirely and make the comparison invalid.

uv run uvicorn main:app --workers 1 --host 0.0.0.0 --port 8000

Step 6: Load test

wrk -t4 -c20 -d30s http://localhost:8000/cpu-thread
wrk -t4 -c20 -d30s http://localhost:8000/cpu-seq
wrk -t4 -c20 -d30s http://localhost:8000/io-thread
wrk -t4 -c20 -d30s http://localhost:8000/io-seq

Each run uses 4 wrk threads holding 20 concurrent connections for 30 seconds. The --workers 1 flag on the server side is critical: without it you're testing multiprocessing, not the GIL.
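For readers unfamiliar with how a tool like wrk drives load, the pattern is a closed loop: each connection sends a request, waits for the response, then immediately sends the next one. The sketch below models that pattern in Python. It is not a replacement for wrk, and the function name closed_loop_rps is an assumption made for illustration; a Python client adds its own overhead, so use it only to understand the shape of the test.

```python
import threading
import time
from typing import Callable


def closed_loop_rps(call: Callable[[], object], connections: int, duration_s: float) -> float:
    """Run `call` back to back on `connections` threads; return completed calls per second."""
    deadline = time.perf_counter() + duration_s
    counts = [0] * connections  # one slot per worker, so no shared counter is needed

    def worker(slot: int) -> None:
        while time.perf_counter() < deadline:
            call()  # blocks until the "response" arrives
            counts[slot] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(connections)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return sum(counts) / elapsed


if __name__ == "__main__":
    from urllib.request import urlopen

    def hit_cpu_seq() -> bytes:
        # Assumes the server from Step 5 is listening on localhost:8000.
        with urlopen("http://localhost:8000/cpu-seq") as resp:
            return resp.read()

    print(f"{closed_loop_rps(hit_cpu_seq, connections=20, duration_s=30):.2f} req/s")
```

Because every connection waits for its response before sending again, throughput in this kind of test is bounded by the number of connections divided by the time each request takes.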

Results

What the numbers actually mean

1. CPU performance: ~8x improvement

This is the headline result. CPU-bound endpoints jumped from ~4 req/s to ~32 req/s, roughly an 8x increase, which lines up with the 8 CPU cores on the test machine. No code changes. Just a different Python build. FastAPI runs non-async (def) endpoints in a worker thread pool, so this is free-threading doing exactly what it promises: multiple requests now execute in parallel across CPU cores instead of queuing behind each other.

2. The surprising part: threading inside a request still doesn’t help

CPU-bound benchmark (No-GIL):

/cpu-thread → 31.99 RPS (≈ same as sequential)
/cpu-seq → 32.42 RPS (baseline)

Even without the GIL, manually spawning threads inside a single request didn’t improve performance. This trips up a lot of people. The reason: at 20 concurrent connections, your CPU is already saturated by request-level parallelism. Adding threads inside one request just creates more scheduling overhead on an already loaded system.

No-GIL shifts parallelism to the request level. Multiple requests can now run truly in parallel. But threading inside a single endpoint under high load adds overhead, not throughput.
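A back-of-the-envelope model makes the saturation argument concrete. The sketch below is an idealized assumption, not a measurement: it treats all 8 cores as identical (the M2 actually mixes performance and efficiency cores), ignores scheduling overhead, and assumes a free-threaded build. The function name busy_cores is illustrative.

```python
def busy_cores(cores: int, concurrent_requests: int, threads_per_request: int) -> int:
    """Idealized count of cores doing useful work on a free-threaded build."""
    runnable_threads = concurrent_requests * threads_per_request
    # You cannot keep more cores busy than you have, however many threads exist.
    return min(cores, runnable_threads)


if __name__ == "__main__":
    # Benchmark conditions: 8 cores, 20 connections.
    print(busy_cores(8, 20, 1))  # /cpu-seq style:    8
    print(busy_cores(8, 20, 2))  # /cpu-thread style: 8

    # A single request on an otherwise idle server.
    print(busy_cores(8, 1, 1))   # 1
    print(busy_cores(8, 1, 2))   # 2
```

Under load both variants already keep every core busy, so splitting work inside a request has nothing left to gain. The model suggests in-request threads would only pay off when there are fewer concurrent requests than cores, a case this benchmark did not measure.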

3. I/O-bound: unchanged

I/O results are nearly identical across both runtimes. The GIL was never the bottleneck here: it is released during blocking I/O operations anyway. If your FastAPI app is mostly database queries and HTTP calls, don’t expect a difference from free-threading.

I/O-bound benchmark:

/io-thread → 9.30 RPS (GIL) vs 9.31 RPS (No-GIL), no change
/io-seq → 4.65 RPS (GIL) vs 4.65 RPS (No-GIL), no change
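These I/O numbers are what the test setup itself predicts. With 20 connections that each wait for a response before sending the next request, throughput cannot exceed connections divided by seconds per request. The sketch below works through that arithmetic, using the 2-second sleep from io_task; the helper name max_rps is illustrative.

```python
def max_rps(connections: int, seconds_per_request: float) -> float:
    """Upper bound on throughput when each connection has one request in flight."""
    return connections / seconds_per_request


# /io-thread: two 2-second sleeps run in parallel, so about 2 s per request.
io_thread_ceiling = max_rps(20, 2.0)   # 10.0
# /io-seq: two 2-second sleeps run back to back, so about 4 s per request.
io_seq_ceiling = max_rps(20, 4.0)      # 5.0

if __name__ == "__main__":
    print(f"/io-thread: measured 9.30 of {io_thread_ceiling} RPS "
          f"({9.30 / io_thread_ceiling:.0%})")
    print(f"/io-seq:    measured 4.65 of {io_seq_ceiling} RPS "
          f"({4.65 / io_seq_ceiling:.0%})")
```

Both endpoints reach about 93% of their ceiling on either runtime, so the limit here is the sleep duration and the connection count, not the interpreter.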

Verdict

When free-threaded Python actually helps

CPU-heavy endpoints - image processing, ML inference, data transformation
High-concurrency APIs where requests compete for CPU time
Workloads that previously needed multiprocessing to bypass the GIL

Where it doesn’t make a difference

I/O-bound APIs - no measurable difference
Async-first apps - asyncio already handles concurrency well

Things to keep in mind

Free-threading is still maturing - some C extensions may not be thread-safe yet
Thread safety is now your responsibility - shared mutable state needs explicit protection
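As a concrete example of explicit protection, here is a minimal sketch of a shared counter guarded by a lock. The class HitCounter and the helper hammer are hypothetical and not part of the benchmark app. A read-modify-write such as += is not atomic on either build, and with threads genuinely running in parallel on a free-threaded interpreter, unprotected updates are more likely to be lost.

```python
import threading


class HitCounter:
    """Counter that is safe to update from many threads."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._count = 0

    def increment(self) -> None:
        # The lock makes the read-modify-write a single critical section.
        with self._lock:
            self._count += 1

    @property
    def value(self) -> int:
        with self._lock:
            return self._count


def hammer(counter: HitCounter, threads: int = 8, per_thread: int = 10_000) -> int:
    """Increment the counter from several threads and return the final value."""
    def work() -> None:
        for _ in range(per_thread):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(hammer(HitCounter()))  # 80000
```

In a FastAPI app the same idea applies to any module-level cache, counter, or list that several requests touch at once.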

The most compelling part of this experiment: the same code, the same endpoints, and the same benchmark command, with just a different Python binary, produced roughly 8x better CPU throughput. FastAPI 0.136.0 making this officially supported means it’s no longer an experiment. It’s a real option for CPU-bound workloads.