Python GIL vs No-GIL: Real FastAPI benchmarks with free-threaded Python 3.13
FastAPI 0.136.0 officially supports free-threaded Python. I ran benchmarks to measure its real impact on API performance.
Benchmarked on Python 3.12 (GIL) and Python 3.13.0t (No-GIL), load tested with wrk.
What is the GIL, and why does it matter?
The Global Interpreter Lock (GIL) is a mutex in CPython that allows only one thread to execute Python bytecode at a time. For I/O-bound work (waiting on databases, HTTP calls, file reads), the GIL is released while a thread waits, so threads cooperate well. For CPU-bound work, the GIL is a hard ceiling: no matter how many threads you spin up, only one runs Python code at a time.
Free-threaded Python (3.13t) removes this lock entirely, enabling true parallel execution across CPU cores.
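A minimal sketch of that ceiling, runnable on either build without FastAPI. It times the same pure-Python loop run twice in sequence and then in two threads. The helper names (`sum_of_squares`, `gil_enabled`, `run_sequential`, `run_threaded`) are illustrative, and because `sys._is_gil_enabled()` only exists on Python 3.13+, the sketch assumes the GIL is on when the function is missing.

```python
import sys
import threading
import time


def sum_of_squares(n: int) -> int:
    """Pure-Python CPU-bound loop; it holds the GIL while it executes bytecode."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def gil_enabled() -> bool:
    """Report whether the GIL is active; assume it is on builds older than 3.13."""
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else check()


def run_sequential(n: int, jobs: int) -> float:
    """Run the loop `jobs` times back to back and return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(jobs):
        sum_of_squares(n)
    return time.perf_counter() - start


def run_threaded(n: int, jobs: int) -> float:
    """Run the loop in `jobs` threads at once and return elapsed seconds."""
    threads = [threading.Thread(target=sum_of_squares, args=(n,)) for _ in range(jobs)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n, jobs = 2_000_000, 2
    print(f"GIL enabled: {gil_enabled()}")
    print(f"sequential: {run_sequential(n, jobs):.2f}s")
    print(f"threaded:   {run_threaded(n, jobs):.2f}s")
```

On a GIL build the two timings should come out close to each other. On a free-threaded build with idle cores, the threaded run would be expected to finish in roughly half the time, though the exact ratio depends on the machine.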
Setting up the experiment
The goal: run the exact same code on two Python runtimes and measure what changes. No code modifications - just a different interpreter.
Test Environment
All benchmarks were executed on:
- MacBook M2 (8-core CPU, 8-core GPU), 16 GB RAM
Step 1: Project setup
Create these two files in your project root:
.python-version
3.13.0t
pyproject.toml
[project]
name = "fastapi-gil-benchmark"
version = "0.1.0"
description = "Benchmarking Python GIL vs free-threaded (No-GIL) performance with FastAPI"
authors = []
readme = "README.md"
requires-python = ">=3.13,<3.14"
dependencies = ["fastapi>=0.136.0", "uvicorn>=0.44.0"]
[dependency-groups]
dev = []
The .python-version file tells uv to use the free-threaded build automatically, so no manual pyenv switching is needed.
Step 2: Install dependencies
uv sync
uv reads .python-version, pulls 3.13.0t if it is not already installed, creates the virtual environment, and installs dependencies in one step.
Step 3: Verify GIL is off
uv run python -c "import sys; print(sys._is_gil_enabled())"
# False
Step 4: Create main.py file
The same file runs on both Python 3.12 (GIL) and Python 3.13t (No-GIL). It has four benchmark endpoints: two CPU-bound and two I/O-bound, each in a threaded and a sequential variant.
"""FastAPI GIL vs No-GIL benchmark"""
from fastapi import FastAPI
import time
import threading

app = FastAPI()


def cpu_heavy_task(n: int):
    total = 0
    for i in range(n):
        total += i * i
    return total


def io_task():
    time.sleep(2)  # simulate blocking I/O


@app.get("/")
def root():
    return {"message": "GIL vs No-GIL Demo API"}


@app.get("/cpu-thread")
def cpu_thread():
    start = time.time()
    threads = []
    for _ in range(2):
        t = threading.Thread(target=cpu_heavy_task, args=(2_000_000,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return {
        "type": "CPU-bound (threads)",
        "time_taken": round(time.time() - start, 2),
    }


@app.get("/cpu-seq")
def cpu_seq():
    start = time.time()
    cpu_heavy_task(2_000_000)
    cpu_heavy_task(2_000_000)
    return {
        "type": "CPU sequential",
        "time_taken": round(time.time() - start, 2),
    }


@app.get("/io-thread")
def io_thread():
    start = time.time()
    threads = []
    for _ in range(2):
        t = threading.Thread(target=io_task)
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return {
        "type": "IO-bound (threads)",
        "time_taken": round(time.time() - start, 2),
    }


@app.get("/io-seq")
def io_seq():
    start = time.time()
    io_task()
    io_task()
    return {
        "type": "IO sequential",
        "time_taken": round(time.time() - start, 2),
    }

Step 5: Run the server
Use a single worker. Multiple workers spawn separate processes, each with its own interpreter and its own GIL, so they sidestep the GIL and make the comparison invalid.
uv run uvicorn main:app --workers 1 --host 0.0.0.0 --port 8000
Step 6: Load test
wrk -t4 -c20 -d30s http://localhost:8000/cpu-thread
wrk -t4 -c20 -d30s http://localhost:8000/cpu-seq
wrk -t4 -c20 -d30s http://localhost:8000/io-thread
wrk -t4 -c20 -d30s http://localhost:8000/io-seq
Each run uses 4 threads, 20 concurrent connections, and a 30-second duration. The --workers 1 flag on the server side is critical: without it you're testing multiprocessing, not the GIL.
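If wrk is not available, a rough stand-in can be sketched with the standard library alone. This is not how the numbers in this post were produced. `client_loop` and `measure_rps` are hypothetical helpers, and unlike wrk this client opens a new connection for every request, so treat its output as a sanity check rather than a comparable measurement.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Bypass any configured HTTP proxy so requests go straight to the local server.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def client_loop(url: str, deadline: float) -> int:
    """Send GET requests one after another until the deadline; return how many completed."""
    completed = 0
    while time.perf_counter() < deadline:
        with _OPENER.open(url, timeout=60) as response:
            response.read()
        completed += 1
    return completed


def measure_rps(url: str, connections: int = 20, duration: float = 30.0) -> float:
    """Run `connections` client loops in parallel and return completed requests per second."""
    start = time.perf_counter()
    deadline = start + duration
    with ThreadPoolExecutor(max_workers=connections) as pool:
        futures = [pool.submit(client_loop, url, deadline) for _ in range(connections)]
        total = sum(f.result() for f in futures)
    # Divide by real elapsed time, since in-flight requests may finish after the deadline.
    return total / (time.perf_counter() - start)
```

With the server from Step 5 running, `measure_rps("http://localhost:8000/cpu-seq")` mirrors the 20-connection, 30-second settings used above.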
Results
What the numbers actually mean
1. CPU performance: ~8x improvement
This is the headline result. CPU-bound endpoints jumped from ~4 req/s to ~32 req/s, roughly an 8x increase. No code changes. Just a different Python build. This is free-threading doing exactly what it promises: multiple requests now execute in parallel across CPU cores instead of queuing behind each other.
2. The surprising part: threading inside a request still doesn’t help
CPU-bound benchmark (No-GIL):
- /cpu-thread → 31.99 RPS (≈ same as sequential)
- /cpu-seq → 32.42 RPS (baseline)
Even without the GIL, manually spawning threads inside a single request didn’t improve performance. This trips up a lot of people. The reason: at 20 concurrent connections, the CPU is already saturated by request-level parallelism. Adding threads inside one request just creates more scheduling overhead on an already loaded system.
No-GIL shifts parallelism to the request level. Multiple requests can now run truly in parallel. But threading inside a single endpoint under high load adds overhead, not throughput.
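The saturation effect can be explored outside FastAPI with a small simulation. This is a sketch under assumptions: each "request" is just a function call submitted to a thread pool, `handle_sequential` and `handle_threaded` are hypothetical stand-ins for the two CPU endpoints, and `n` is reduced from the 2_000_000 used in main.py so the script finishes quickly.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_heavy_task(n: int) -> int:
    total = 0
    for i in range(n):
        total += i * i
    return total


def handle_sequential(n: int) -> None:
    """One simulated request: two CPU tasks back to back."""
    cpu_heavy_task(n)
    cpu_heavy_task(n)


def handle_threaded(n: int) -> None:
    """One simulated request: the same two tasks in two inner threads."""
    threads = [threading.Thread(target=cpu_heavy_task, args=(n,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def wall_time(handler, n: int, in_flight: int) -> float:
    """Run `in_flight` simulated requests concurrently and return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=in_flight) as pool:
        futures = [pool.submit(handler, n) for _ in range(in_flight)]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 200_000  # smaller than main.py's 2_000_000 to keep the run short
    for in_flight in (1, 20):
        seq = wall_time(handle_sequential, n, in_flight)
        thr = wall_time(handle_threaded, n, in_flight)
        print(f"in_flight={in_flight:>2}  sequential={seq:.2f}s  threaded={thr:.2f}s")
```

On a free-threaded build, the threaded handler should only have a clear edge when a single request is in flight and cores are idle. With 20 in flight, the two variants would be expected to land close together, which is the pattern the benchmark shows.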
3. I/O-bound: unchanged
I/O results are nearly identical across both runtimes. The GIL was never the bottleneck here because it is released during blocking I/O operations anyway. If your FastAPI app is mostly database queries and HTTP calls, don’t expect a difference from free-threading.
I/O-bound benchmark:
- /io-thread → 9.30 RPS (GIL) vs 9.31 RPS (No-GIL): no change
- /io-seq → 4.65 RPS (GIL) vs 4.65 RPS (No-GIL): no change
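These figures sit close to what simple arithmetic predicts. With 20 connections, each waiting on one request at a time, throughput cannot exceed connections divided by seconds per request. The sketch below works that out, assuming the server can run all 20 requests at once (which holds as long as the thread pool serving sync endpoints has at least 20 threads). `throughput_ceiling` is an illustrative helper, and the measured values are the GIL-build numbers from the list above.

```python
def throughput_ceiling(connections: int, seconds_per_request: float) -> float:
    """Upper bound on requests/second when each connection has one request in flight."""
    return connections / seconds_per_request


CONNECTIONS = 20  # wrk -c20
SLEEP = 2.0       # io_task() sleeps for 2 seconds

# /io-thread: the two sleeps overlap, so one request takes about 2 s.
# /io-seq: the two sleeps run back to back, so one request takes about 4 s.
cases = {
    "/io-thread": (1 * SLEEP, 9.30),
    "/io-seq": (2 * SLEEP, 4.65),
}

for path, (latency, measured) in cases.items():
    ceiling = throughput_ceiling(CONNECTIONS, latency)
    print(
        f"{path}: ceiling {ceiling:.2f} RPS, measured {measured:.2f} RPS "
        f"({measured / ceiling:.0%} of ceiling)"
    )
```

Both endpoints reach about 93% of their ceiling (10 RPS and 5 RPS respectively), so the sleep time, not the interpreter, sets the limit.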
Verdict
When free-threaded Python actually helps
- CPU-heavy endpoints - image processing, ML inference, data transformation
- High-concurrency APIs where requests compete for CPU time
- Workloads that previously needed multiprocessing to bypass the GIL
Where It Doesn’t Make a Difference
- I/O-bound APIs - no measurable difference
- Async-first apps - asyncio already handles concurrency well
Things to Keep in Mind
- Free-threading is still maturing - some C extensions may not be thread-safe yet
- Thread safety is now your responsibility - shared mutable state needs explicit protection
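As a small illustration of the last point, here is a sketch of a shared counter guarded by a lock. `HitCounter`, `worker`, and `run` are hypothetical names. The increment is a read-modify-write, so without the lock, concurrent updates can be lost when threads interleave.

```python
import threading


class HitCounter:
    """Shared counter whose read-modify-write is guarded by a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._count = 0

    def increment(self) -> None:
        with self._lock:
            self._count += 1

    @property
    def value(self) -> int:
        with self._lock:
            return self._count


def worker(counter: HitCounter, times: int) -> None:
    for _ in range(times):
        counter.increment()


def run(threads: int = 8, times: int = 10_000) -> int:
    """Increment one shared counter from several threads and return the final value."""
    counter = HitCounter()
    workers = [threading.Thread(target=worker, args=(counter, times)) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.value


if __name__ == "__main__":
    print(run())  # 8 threads x 10_000 increments
```

With the lock in place, the final value always equals threads times increments. The same pattern applies to any module-level dict, list, or cache that multiple request handlers mutate.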
The most compelling part of this experiment: the same code, the same endpoints, and the same benchmark command, with just a different Python binary, produced roughly 8x better CPU throughput. With FastAPI 0.136.0 officially supporting free-threaded Python, this is no longer just an experiment. It’s a real option for CPU-bound workloads.
