
Python GIL vs No-GIL: Real FastAPI benchmarks with free-threaded Python 3.13

4 min read · Apr 18, 2026

FastAPI 0.136.0 officially supports free-threaded Python. I ran benchmarks to measure its real impact on API performance.

Benchmarked on Python 3.12 (GIL) and Python 3.13.0t (No-GIL), load tested with wrk.


What is the GIL, and why does it matter?

The Global Interpreter Lock (GIL) is a mutex in CPython that allows only one thread to execute Python bytecode at a time. For I/O-bound work (waiting on databases, HTTP calls, file reads), the GIL is released while a thread blocks, so threads cooperate well. For CPU-bound work, the GIL is a hard ceiling: no matter how many threads you spin up, only one executes Python bytecode at any given moment.

Free-threaded Python (3.13t) is a separate CPython build that runs with this lock disabled, enabling true parallel execution of threads across CPU cores.
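To make that ceiling concrete, here is a minimal standard-library sketch that times the same pure-Python loop run twice back to back and then in two threads. The function names and loop size are illustrative assumptions, not part of the benchmark itself. On a GIL build the two timings tend to be close; on a free-threaded build on a multi-core machine the threaded run can approach half the sequential time.

```python
import threading
import time


def sum_squares(n: int) -> int:
    """Pure-Python CPU-bound loop; on a GIL build it holds the GIL while running."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_sequential(n: int, runs: int = 2) -> float:
    """Run the loop `runs` times, one after the other, and return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(runs):
        sum_squares(n)
    return time.perf_counter() - start


def time_threaded(n: int, runs: int = 2) -> float:
    """Run the loop in `runs` threads at once and return elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=sum_squares, args=(n,)) for _ in range(runs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 2_000_000
    print(f"sequential: {time_sequential(n):.2f}s")
    print(f"threaded:   {time_threaded(n):.2f}s")
```

Run it once under each interpreter and compare the two lines; the absolute numbers depend on the machine.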

Setting up the experiment

The goal: run the exact same code on two Python runtimes and measure what changes. No code modifications - just a different interpreter.

Test Environment

All benchmarks were executed on:

  • MacBook M2 (8-core CPU, 8-core GPU), 16 GB RAM

Step 1: Project setup

Create these two files in your project root:

.python-version

3.13.0t

pyproject.toml

[project]
name = "fastapi-gil-benchmark"
version = "0.1.0"
description = "Benchmarking Python GIL vs free-threaded (No-GIL) performance with FastAPI"
authors = []
readme = "README.md"
requires-python = ">=3.12,<3.14"
dependencies = ["fastapi>=0.136.0", "uvicorn>=0.44.0"]

[dependency-groups]
dev = []

The .python-version file tells uv to use the free-threaded build automatically, so no manual pyenv switching is needed.

Step 2: Install dependencies

uv sync

uv reads .python-version, pulls 3.13.0t if not already installed, creates the virtual environment, and installs dependencies in one step.

Step 3: Verify the GIL is off

uv run python -c "import sys; print(sys._is_gil_enabled())"
# False
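The one-liner above only works on Python 3.13 and later, because sys._is_gil_enabled() does not exist on 3.12. A small sketch of a check that runs on both interpreters used in this comparison follows; the helper names are my own, and it assumes that any interpreter without sys._is_gil_enabled() has the GIL on.

```python
import sys
import sysconfig


def built_free_threaded() -> bool:
    """True if this interpreter was compiled as a free-threaded build."""
    # Py_GIL_DISABLED is 1 on free-threaded builds, 0 or None otherwise.
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


def gil_enabled() -> bool:
    """True if the GIL is active in this process right now."""
    check = getattr(sys, "_is_gil_enabled", None)
    # Interpreters older than 3.13 lack the function and always have the GIL.
    return True if check is None else check()


print(f"Python {sys.version.split()[0]}")
print(f"free-threaded build: {built_free_threaded()}")
print(f"GIL enabled now:     {gil_enabled()}")
```

The two values can differ: a free-threaded build can still run with the GIL re-enabled, so it is worth printing both before trusting a benchmark run.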

Step 4: Create main.py file

The same file runs on both Python 3.12 (GIL) and Python 3.13t (No-GIL). It has four benchmark endpoints, two CPU-bound and two I/O-bound, each in a threaded and a sequential variant.

"""FastAPI GIL vs No-GIL benchmark"""

from fastapi import FastAPI
import time
import threading

app = FastAPI()


def cpu_heavy_task(n: int):
    total = 0
    for i in range(n):
        total += i * i
    return total


def io_task():
    time.sleep(2)  # simulate blocking I/O


@app.get("/")
def root():
    return {"message": "GIL vs No-GIL Demo API"}


@app.get("/cpu-thread")
def cpu_thread():
    start = time.time()
    threads = []
    for _ in range(2):
        t = threading.Thread(target=cpu_heavy_task, args=(2_000_000,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return {
        "type": "CPU-bound (threads)",
        "time_taken": round(time.time() - start, 2),
    }


@app.get("/cpu-seq")
def cpu_seq():
    start = time.time()
    cpu_heavy_task(2_000_000)
    cpu_heavy_task(2_000_000)
    return {
        "type": "CPU sequential",
        "time_taken": round(time.time() - start, 2),
    }


@app.get("/io-thread")
def io_thread():
    start = time.time()
    threads = []
    for _ in range(2):
        t = threading.Thread(target=io_task)
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return {
        "type": "IO-bound (threads)",
        "time_taken": round(time.time() - start, 2),
    }


@app.get("/io-seq")
def io_seq():
    start = time.time()
    io_task()
    io_task()
    return {
        "type": "IO sequential",
        "time_taken": round(time.time() - start, 2),
    }

Step 5: Run the server

Use a single worker. Multiple workers spawn separate processes, each with its own interpreter and its own GIL, so they bypass the GIL entirely and make the comparison invalid.

uv run uvicorn main:app --workers 1 --host 0.0.0.0 --port 8000

Step 6: Load test

wrk -t4 -c20 -d30s http://localhost:8000/cpu-thread
wrk -t4 -c20 -d30s http://localhost:8000/cpu-seq
wrk -t4 -c20 -d30s http://localhost:8000/io-thread
wrk -t4 -c20 -d30s http://localhost:8000/io-seq

Each run uses 4 wrk threads, 20 concurrent connections, and a 30-second duration. The --workers 1 flag on the server side is critical: without it you're testing multiprocessing, not the GIL.

Results


What the numbers actually mean

1. CPU performance: ~8x improvement

This is the headline result. CPU-bound endpoints jumped from ~4 req/s to ~32 req/s, roughly an 8x increase. No code changes, just a different Python build. FastAPI runs plain def endpoints in a thread pool, so once the GIL is gone those worker threads execute in parallel across CPU cores instead of queuing behind each other. This is free-threading doing exactly what it promises.

2. The surprising part: threading inside a request still doesn’t help

CPU-bound benchmark (No-GIL):

  • /cpu-thread: 31.99 RPS (≈ same as sequential)
  • /cpu-seq: 32.42 RPS (baseline)

Even without the GIL, manually spawning threads inside a single request didn’t improve performance. This trips up a lot of people. The reason: at 20 concurrent connections, the CPU is already saturated by request-level parallelism. Adding threads inside one request just creates more scheduling overhead on an already loaded system.

No-GIL shifts parallelism to the request level. Multiple requests can now run truly in parallel. But threading inside a single endpoint under high load adds overhead, not throughput.
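The effect can be imitated without a server. The sketch below is a rough stand-in, not the benchmark itself: a ThreadPoolExecutor plays the role of the thread pool that serves def endpoints, and each simulated request either does its two chunks of work in sequence or spawns two threads for them. The handler names, request count, and loop size are assumptions chosen to keep the run short.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_heavy_task(n: int) -> int:
    total = 0
    for i in range(n):
        total += i * i
    return total


def handle_sequential(n: int) -> None:
    """One simulated request: two chunks of work, one after the other."""
    cpu_heavy_task(n)
    cpu_heavy_task(n)


def handle_threaded(n: int) -> None:
    """One simulated request: the same two chunks, each in its own thread."""
    threads = [threading.Thread(target=cpu_heavy_task, args=(n,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def simulated_rps(handler, requests: int = 40, concurrency: int = 20,
                  n: int = 100_000) -> float:
    """Push `requests` calls through `concurrency` worker threads; return calls/sec."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(handler, [n] * requests))
    return requests / (time.perf_counter() - start)


if __name__ == "__main__":
    print(f"sequential handler: {simulated_rps(handle_sequential):.1f} req/s")
    print(f"threaded handler:   {simulated_rps(handle_threaded):.1f} req/s")
```

With 20 workers already competing for the available cores, the extra threads inside each handler have no idle core to run on, which is the situation the wrk numbers describe. Expect the two lines to be close; the exact figures will vary by machine and interpreter.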

3. I/O-bound: unchanged

I/O results are nearly identical across both runtimes. The GIL was never the bottleneck here: it gets released during blocking I/O operations anyway. If your FastAPI app is mostly database queries and HTTP calls, don’t expect a difference from free-threading.

I/O-bound benchmark:

  • /io-thread: 9.30 RPS (GIL) vs 9.31 RPS (No-GIL), no change
  • /io-seq: 4.65 RPS (GIL) vs 4.65 RPS (No-GIL), no change
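These figures sit close to a simple ceiling. Assuming each wrk connection keeps exactly one request in flight, throughput cannot exceed connections divided by time per request. A back-of-the-envelope sketch, where max_rps is a helper introduced here for illustration:

```python
def max_rps(connections: int, seconds_per_request: float) -> float:
    """Upper bound on throughput when each connection has one request in flight."""
    return connections / seconds_per_request


CONNECTIONS = 20  # from `wrk -c20`

# /io-thread: the two 2 s sleeps overlap, so a request takes about 2 s.
io_thread_bound = max_rps(CONNECTIONS, 2.0)

# /io-seq: the two 2 s sleeps run back to back, so a request takes about 4 s.
io_seq_bound = max_rps(CONNECTIONS, 4.0)

print(io_thread_bound)  # 10.0 (measured: 9.30 and 9.31)
print(io_seq_bound)     # 5.0  (measured: 4.65)
```

Both endpoints are limited by the sleep duration and the connection count, not by the interpreter, which is why the GIL and No-GIL columns match.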

Verdict

When free-threaded Python actually helps

  • CPU-heavy endpoints - image processing, ML inference, data transformation
  • High-concurrency APIs where requests compete for CPU time
  • Workloads that previously needed multiprocessing to bypass the GIL

Where it doesn’t make a difference

  • I/O-bound APIs - no measurable difference
  • Async-first apps - asyncio already handles concurrency well

Things to keep in mind

  • Free-threading is still maturing - some C extensions may not be thread-safe yet
  • Thread safety is now your responsibility - shared mutable state needs explicit protection
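As an illustration of the second point, here is a minimal sketch of shared state guarded by a lock. HitCounter and hammer are hypothetical names, not part of the benchmark app; the pattern is simply a threading.Lock around a read-modify-write.

```python
import threading


class HitCounter:
    """Shared counter guarded by a lock so concurrent increments are not lost."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._count = 0

    def increment(self) -> None:
        with self._lock:  # the read-modify-write happens as one unit
            self._count += 1

    @property
    def value(self) -> int:
        with self._lock:
            return self._count


def hammer(counter: HitCounter, times: int) -> None:
    """Increment the shared counter `times` times from the calling thread."""
    for _ in range(times):
        counter.increment()


counter = HitCounter()
workers = [threading.Thread(target=hammer, args=(counter, 10_000)) for _ in range(8)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(counter.value)  # 80000
```

Without the lock, `self._count += 1` is a separate read and write, and updates from parallel threads can overwrite each other. That was already a latent bug with the GIL, and it is far easier to hit once threads truly run in parallel.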

The most compelling part of this experiment: the same code, the same endpoints, and the same benchmark command, with just a different Python binary, produced roughly 8x better CPU throughput. FastAPI 0.136.0 making this officially supported means it’s no longer an experiment. It’s a real option for CPU-bound workloads.

Keval Dekivadiya

Written by Keval Dekivadiya

🚀 AI/ML Engineer | BE in Information & Technology 🎓 Crafting innovative AI solutions and pushing tech boundaries.