A Practical Guide to Asynchronous Programming in Python: Building a High-Concurrency Web Crawler and API Service from Scratch

Latest recommended article published on 2026-06-19 17:00:49

Original

Tags

#python #frontend #crawler

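🧭 1. Why Do You Need `asyncio`?

When your program frequently does any of the following:

✅ Makes HTTP requests (crawlers, API calls)
✅ Reads from or writes to a database (e.g. asyncpg, aiomysql)
✅ Performs file I/O (logs, uploads)
✅ Handles real-time WebSocket communication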

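The bottleneck in synchronous code is that every call blocks until it finishes:

for url in urls:
    response = requests.get(url)  # each call blocks for 200~2000 ms

→ 100 requests ≈ 20~200 seconds of waiting, because the requests run one after another.

An asynchronous version overlaps those waits, so the total time can approach the latency of the slowest single request (about 2 seconds here). That is a 10x~100x speedup, provided the server and your concurrency limit allow that many requests in flight.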
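The effect can be checked without touching the network. The following is a minimal sketch, assuming a hypothetical `fake_request` helper that stands in for an HTTP call by sleeping; the URLs and the 0.2 s delay are illustrative, not taken from a real service.

```python
import asyncio
import time
from typing import List, Tuple


async def fake_request(url: str, delay: float = 0.2) -> str:
    """Hypothetical stand-in for an HTTP call: waits without blocking the loop."""
    await asyncio.sleep(delay)
    return f"response from {url}"


async def fetch_all(urls: List[str], delay: float = 0.2) -> Tuple[List[str], float]:
    """Run all simulated requests concurrently and report the elapsed time."""
    start = time.perf_counter()
    # gather() schedules every coroutine at once, so the waits overlap.
    results = await asyncio.gather(*(fake_request(u, delay) for u in urls))
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(100)]
    results, elapsed = asyncio.run(fetch_all(urls))
    print(f"{len(results)} simulated requests in {elapsed:.2f}s")
```

Run sequentially, 100 waits of 0.2 s would take about 20 seconds; here the total should stay close to a single wait, since nothing limits concurrency in this sketch.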

⚙️ 2. The Three Core Elements of `asyncio`: A Quick Refresher

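| Concept | Analogy | Key API |
| --- | --- | --- |
| Coroutine | "A function that can pause" | `async def`, `await` |
| Task | "A coroutine that has been scheduled" | `create_task()`, `TaskGroup` (3.11+) |
| Event Loop | "The allocator of CPU time" | `asyncio.run()`, `get_running_loop()` |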
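To make the three concepts concrete, here is a small sketch. The `compute` coroutine and the returned dictionary are illustrative names chosen for this example; `create_task()` is used rather than `TaskGroup` so that it also runs on Python versions before 3.11.

```python
import asyncio
from typing import Any, Dict


async def compute(x: int) -> int:
    """A coroutine function: calling it only creates a coroutine object."""
    await asyncio.sleep(0.01)
    return x * 2


async def main() -> Dict[str, Any]:
    coro = compute(1)                        # coroutine object, not started yet
    is_coro = asyncio.iscoroutine(coro)

    task = asyncio.create_task(compute(2))   # Task: scheduled on the running loop
    is_task = isinstance(task, asyncio.Task)

    loop = asyncio.get_running_loop()        # the event loop driving this code

    first = await coro                       # runs the coroutine inline
    second = await task                      # waits for the scheduled task
    return {
        "is_coro": is_coro,
        "is_task": is_task,
        "loop_running": loop.is_running(),
        "results": [first, second],
    }


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The difference that matters in practice: a bare coroutine does nothing until it is awaited, while a task starts making progress as soon as the event loop gets control.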

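📌 Key principle:

`await` is a coroutine's "yield point": when the awaited operation is not ready, the coroutine suspends there and the event loop switches to another ready task instead of sitting idle.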
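The switching behaviour is easy to observe. In this sketch, two hypothetical workers record each step in a shared list and then yield with `asyncio.sleep(0)`, which hands control back to the event loop without any real delay.

```python
import asyncio
from typing import List


async def worker(name: str, steps: int, log: List[str]) -> None:
    """Record a step, then yield so other ready tasks can run."""
    for i in range(steps):
        log.append(f"{name}{i}")
        await asyncio.sleep(0)  # yield point: control returns to the event loop


async def run_interleaved() -> List[str]:
    log: List[str] = []
    await asyncio.gather(worker("A", 2, log), worker("B", 2, log))
    return log


if __name__ == "__main__":
    print(asyncio.run(run_interleaved()))  # ['A0', 'B0', 'A1', 'B1']
```

The two workers alternate at each `await`. Without the `await` inside the loop, worker A would finish all of its steps before B ran at all, since nothing would give the event loop a chance to switch.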

🛠️ 3. Hands-On Project 1: High-Concurrency Web Crawler (with Rate Limiting)

✅ Goals

Fetch 50 web pages concurrently
Cap concurrency at 10 (to avoid IP bans)
Retry failed requests automatically
Report response statistics
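Before the full implementation, the concurrency cap can be checked in isolation. This is a sketch under stated assumptions: `limited_job` and `measure_peak` are hypothetical helpers, and the 0.01 s sleep stands in for a real request. It counts how many jobs hold the semaphore at the same moment.

```python
import asyncio
from typing import Dict


async def limited_job(sem: asyncio.Semaphore, state: Dict[str, int]) -> None:
    """Simulated request that tracks how many jobs are in flight."""
    async with sem:                       # blocks once `limit` jobs are inside
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)         # stand-in for network I/O
        state["active"] -= 1


async def measure_peak(total: int = 50, limit: int = 10) -> int:
    """Start `total` jobs at once and return the highest observed concurrency."""
    sem = asyncio.Semaphore(limit)        # created inside the running loop
    state = {"active": 0, "peak": 0}
    await asyncio.gather(*(limited_job(sem, state) for _ in range(total)))
    return state["peak"]


if __name__ == "__main__":
    print(asyncio.run(measure_peak()))    # peak never exceeds the limit of 10
```

All 50 jobs are started together, yet at most 10 are ever inside the `async with` block, which is the property the crawler relies on to avoid overwhelming the target site.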

🔧 Implementation

import asyncio
import aiohttp
import time
from typing import List, Tuple

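# Global limit: at most 10 requests in flight.
# Note: on Python < 3.10, create the semaphore inside the running event loop instead.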
SEMAPHORE = asyncio.Semaphore(10)
TIMEOUT = aiohttp.ClientTimeout(total=10)

async def fetch_url(
    session: aiohttp.ClientSession,
    url: str,
    max_retries: int = 2
) -> Tuple[str, str, float]:
    """Fetch one URL; return (url, first 100 characters of the body, latency in seconds)."""
    for attempt in range(max_retries + 1):
        try:
            async with SEMAPHORE:  # ⚠️ This is what limits concurrency!
                start = time.perf_counter()
                async with session.get(url, timeout=TIMEOUT) as resp:
                    content = await resp.text()
                    latency = time.perf_counter() - start
                    if resp.status == 200:
                        return url, content[:100] + "...", round(latency, 3)
                    else:
                        raise aiohttp.ClientResponseError(
                            resp