Python Web Scraping in Practice ⑤ | Scraping Dynamic Pages: Parsing AJAX and API Endpoints

Latest recommended article published on 2026-06-24 20:29:46

This content is licensed under CC 4.0 BY-SA.

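author: Focused on hands-on Python; sharing practical web scraping and data analysis tips
title: Python Web Scraping in Practice ⑤ | Scraping Dynamic Pages: Parsing AJAX and API Endpoints
update: 2026-04-26
tags: Python, web scraping, dynamic pages, AJAX, API, JSON, asynchronous loading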

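Author: Focused on hands-on Python; sharing practical web scraping and data analysis tips
Updated: April 2026
Audience: developers who already know basic scraping and want to move on to dynamic pages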

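Preface: Can't find the data on the page? It is loaded dynamically

You have almost certainly run into this:

The browser shows the data, but what requests fetches is empty
The page source does not contain the content you want, even though the page clearly displays it
Clicking "Load more" leaves the URL unchanged, yet more content appears

This is not a bug; it is AJAX dynamic loading. The data is not in the HTML. JavaScript fetches it from an API in separate requests.

Once you can scrape dynamic pages, far more sites come within reach of your scraper.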

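1. What is AJAX dynamic loading?

1.1 Traditional pages vs. dynamic pages

Traditional (static) page:

Browser requests the page → server returns complete HTML → browser renders it

The data is in the HTML source, so requests can get it directly.

Dynamic (AJAX) page:

Browser requests HTML → HTML is empty or half-empty
            ↓
Browser runs JS → JS sends an AJAX request to the API → server returns JSON → JS fills in the page

The data lives in the API's JSON response, so fetching the HTML with requests does not return it.

1.2 How do you tell whether a page is dynamic?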

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}

url = "https://example.com/dynamic-page"
response = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

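# If data visible in the browser cannot be found in the HTML, it is loaded dynamically.
# "data-list" is a placeholder; use the class of the container you see on the page.
content = soup.find("div", class_="data-list")
if content and content.get_text(strip=True):
    print("Static page: the data is in the HTML")
else:
    print("Dynamic page: look for the API endpoint")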

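2. Finding the API endpoint with browser developer tools

2.1 Opening developer tools

Chrome/Edge: press F12, or right-click → Inspect
Switch to the Network tab
Check "Preserve log"
Select the XHR or Fetch filter (labeled "Fetch/XHR" in current Chrome)

2.2 Steps to find the API request

Open the target page
F12 → Network → XHR
Trigger data loading on the page (refresh, change page, scroll, or click)
Watch for new requests appearing in the Network panel
Click a request → Preview/Response to inspect the returned data
Click Headers to see the request URL, method, and parameters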
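
As a rough sketch of what to do with the last step's output: a request URL copied from the Headers tab can be split into the bare endpoint and a params dict, which is the shape requests expects. The URL below is made up for illustration, and split_api_url is a hypothetical helper, not part of any library.

```python
from urllib.parse import urlsplit, parse_qsl


def split_api_url(full_url):
    """Split a copied request URL into (endpoint, params dict)."""
    parts = urlsplit(full_url)
    endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # parse_qsl decodes percent-escapes; keep_blank_values keeps empty params.
    # dict() keeps only the last value if a key is repeated in the query string.
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    return endpoint, params


# Hypothetical URL as it might appear in the Network panel
copied = "https://example.com/api/data/list?page=1&size=20&keyword=Python%20crawler"
endpoint, params = split_api_url(copied)
print(endpoint)
print(params)
```

With the query string in a dict, changing page or keyword in code becomes a one-line edit instead of string surgery on the URL.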

2.3 Simulating the API request in code

Once you have found the API endpoint, request the JSON directly with requests:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Referer": "https://example.com/",
    "Accept": "application/json, text/plain, */*",
}

# API endpoint URL (copied from developer tools)
api_url = "https://example.com/api/data/list"

# Query parameters
params = {
    "page": 1,
    "size": 20,
    "keyword": "Python",
}

response = requests.get(api_url, headers=headers, params=params, timeout=10)
response.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error page
data = response.json()

# Field names below are placeholders; match them to the JSON shown in Preview
items = data.get("data", {}).get("list", [])
print(f"Status: {data.get('status', 'unknown')}")
print(f"Items: {len(items)}")

# Extract specific fields
for item in items:
    print(f"  Title: {item.get('title', '')}")
    print(f"  Price: {item.get('price', '')}")
    print(f"  Link: {item.get('url', '')}")
    print()
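
One caveat with the pattern above: response.json() raises if the server answers with an HTML login or anti-bot page instead of JSON, which tends to happen when a header or cookie is missing. Below is a minimal sketch of a defensive check. It is written against plain values so it runs without a network; with requests you would pass response.status_code, response.headers.get("Content-Type", "") and response.text. decode_api_body is a hypothetical helper, and it assumes the API labels its responses with a JSON content type, which not every API does.

```python
import json


def decode_api_body(status_code, content_type, text):
    """Return parsed JSON, or raise ValueError with a short diagnostic."""
    if status_code != 200:
        raise ValueError(f"unexpected HTTP status {status_code}")
    if "json" not in content_type.lower():
        # Typically an HTML login or anti-bot page rather than API data
        raise ValueError(f"expected JSON, got {content_type!r}: {text[:60]!r}")
    return json.loads(text)


ok = decode_api_body(
    200,
    "application/json; charset=utf-8",
    '{"status": "ok", "data": {"list": []}}',
)
print(ok["status"])

try:
    decode_api_body(200, "text/html", "<html>Please log in</html>")
except ValueError as exc:
    print(f"rejected: {exc}")
```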

3. Common AJAX endpoint formats and how to parse them

3.1 GET request + URL parameters

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Referer": "https://example.com/",
}

# Paginated endpoint
api_url = "https://example.com/api/articles"
for page in range(1, 6):
    params = {
        "page": page,
        "per_page": 20,
        "category": "tech",
    }
    response = requests.get(api_url, headers=headers, params=params, timeout=10)
    data = response.json()

    articles = data.get("data", {}).get("articles", [])
    if not articles:
        break  # an empty page means there is nothing left to fetch
    print(f"Page {page}: {len(articles)} articles")
    for article in articles:
        print(f"  - {article['title']}")

3.2 POST request + JSON parameters

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Content-Type": "application/json;charset=UTF-8",
    "Referer": "https://example.com/",
}

api_url = "https://example.com/api/search"
payload = {
    "keyword": "Python scraping",
    "page": 1,
    "pageSize": 20,
    "filters": {
        "category": "book",
        "priceRange": [0, 100]
    }
}

response = requests.post(api_url, headers=headers, json=payload, timeout=10)
data = response.json()

results = data.get("result", {}).get("items", [])
print(f"Found {len(results)} results")
for item in results:
    print(f"  {item['name']} - ¥{item['price']}")
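
Why json= rather than data= in the request above? The two arguments encode the body differently. The sketch below uses only the standard library to show roughly what each body looks like: json= sends a JSON-serialized body, while passing a dict to data= sends a form-encoded one. The payload is a flattened version of the one above, since form encoding has no standard way to express the nested filters object. The exact bytes requests produces may differ slightly.

```python
import json
from urllib.parse import urlencode

payload = {"keyword": "Python", "page": 1, "pageSize": 20}

# Roughly what a JSON body looks like (Content-Type: application/json)
json_body = json.dumps(payload)

# Roughly what a form body looks like
# (Content-Type: application/x-www-form-urlencoded)
form_body = urlencode(payload)

print(json_body)  # {"keyword": "Python", "page": 1, "pageSize": 20}
print(form_body)  # keyword=Python&page=1&pageSize=20
```

A practical rule of thumb: if developer tools show the request body as a JSON object, use json=; if they show it as form data, use data=.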

3.3 APIs that require token authentication

import requests

# Step 1: log in to obtain a token
login_url = "https://example.com/api/login"
login_data = {"username": "user", "password": "pass"}
response = requests.post(login_url, json=login_data, timeout=10)
response.raise_for_status()
token = response.json().get("token", "")
if not token:
    raise RuntimeError("Login response did not contain a token")
print(f"Got token: {token[:20]}...")

# Step 2: call the API with the token
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Authorization": f"Bearer {token}",
    "Referer": "https://example.com/",
}

api_url = "https://example.com/api/user/data"
response = requests.get(api_url, headers=headers, timeout=10)
data = response.json()
print(f"User data: {data}")

4. Dynamic page case studies

4.1 Scraping Sina Weibo search results

import requests
import json
import time
import random

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Referer": "https://s.weibo.com/",
    "Accept": "application/json, text/plain, */*",
    "X-Requested-With": "XMLHttpRequest",
}

keyword = "Python"
all_posts = []

for page in range(1, 6):
    api_url = "https://s.weibo.com/ajax/jsonp/search"
    params = {
        "keyword": keyword,
        "page": page,
    }

    print(f"Searching '{keyword}', page {page}...", end=" ")

    try:
        response = requests.get(api_url, headers=headers, params=params, timeout=10)
        # Some APIs return JSONP; strip the callback wrapper before parsing
        text = response.text.strip()
        if text.startswith("jQuery"):
            text = text[text.index("(") + 1 : text.rindex(")")]

        data = json.loads(text)
        cards = data.get("data", {}).get("cards", [])

        for card in cards:
            mblog = card.get("mblog", {})
            if mblog:
                all_posts.append({
                    "user": mblog.get("user", {}).get("screen_name", ""),
                    "text": mblog.get("text", "")[:50],
                    "reposts": mblog.get("reposts_count", 0),
                    "comments": mblog.get("comments_count", 0),
                    "likes": mblog.get("attitudes_count", 0),
                })

        print(f"got {len(cards)} cards")
        time.sleep(random.uniform(2, 5))

    except Exception as e:
        print(f"failed: {e}")

print(f"\nCollected {len(all_posts)} posts in total")

4.2 Scraping the Zhihu hot list

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Referer": "https://www.zhihu.com/hot",
}

api_url = "https://www.zhihu.com/api/v3/feed/topstory/hot-lists/total"
params = {"limit": 50, "desktop": "true"}

try:
    response = requests.get(api_url, headers=headers, params=params, timeout=10)
    data = response.json()

    items = data.get("data", [])
    print(f"Zhihu hot list: {len(items)} entries")
    print("=" * 60)

    for i, item in enumerate(items, 1):
        target = item.get("target", {})
        title = target.get("title", "")
        hot = item.get("detail_text", "")
        print(f"{i:>2}. {title}")
        print(f"    Heat: {hot}")

except Exception as e:
    print(f"Scrape failed: {e}")
    print("Hint: the Zhihu API may require a login cookie")

5. Handling JSONP responses

Some APIs return JSONP (JSON wrapped in a callback function call) rather than plain JSON:

// Plain JSON
{"name": "Zhang San", "age": 25}

// JSONP
callback({"name": "Zhang San", "age": 25})
jQuery123456({"name": "Zhang San", "age": 25})

import json
import re

def parse_jsonp(jsonp_text):
    """Parse a JSONP response and return the JSON data."""
    text = jsonp_text.strip()

    # Case 1: the response is already plain JSON
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Case 2: take everything between the first "(" and the last ")",
    # allowing an optional trailing semicolon
    match = re.search(r"\((.*)\)\s*;?$", text, re.DOTALL)
    if match:
        return json.loads(match.group(1))

    raise ValueError(f"Cannot parse JSONP: {text[:100]}...")

# Usage example
jsonp_response = 'jQuery123456({"status": "ok", "data": [1, 2, 3]})'
data = parse_jsonp(jsonp_response)
print(data)  # {'status': 'ok', 'data': [1, 2, 3]}

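6. Infinite scroll loading

Many sites (such as Weibo, Taobao, and Pinterest) use infinite scroll instead of pagination:

6.1 Finding the scroll-loading API

Initial page: https://example.com/feed
First scroll: https://example.com/api/feed?after=abc123&count=20
Next scroll: https://example.com/api/feed?after=def456&count=20

The "after" parameter is usually the ID or timestamp of the last item in the previous batch.

6.2 Simulating infinite scroll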

import requests
import time
import random

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Referer": "https://example.com/feed",
}

api_url = "https://example.com/api/feed"
all_items = []
after_cursor = None  # cursor; None for the first request
max_rounds = 10      # load at most 10 rounds

for round_num in range(1, max_rounds + 1):
    params = {"count": 20}
    if after_cursor:
        params["after"] = after_cursor

    print(f"Loading round {round_num}...", end=" ")

    try:
        response = requests.get(api_url, headers=headers, params=params, timeout=10)
        data = response.json()

        items = data.get("items", [])
        if not items:
            print("no more data")
            break

        all_items.extend(items)
        print(f"got {len(items)} items, {len(all_items)} in total")

        # Cursor for the next round (the field name varies by API)
        after_cursor = data.get("next_cursor") or data.get("pagination", {}).get("after")
        if not after_cursor:
            print("Reached the last page")
            break

        time.sleep(random.uniform(1, 3))

    except Exception as e:
        print(f"failed: {e}")
        break

print(f"\nLoaded {len(all_items)} items in total")
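
The same cursor loop can be sketched without any network access, which makes the stopping conditions easier to test. In this sketch the fetch step is passed in as a function, and a repeated cursor is treated as a stop signal so that a misbehaving API cannot trap the loop. collect_with_cursor, FAKE_PAGES and the response shape {"items": [...], "next_cursor": ...} are assumptions for illustration, not any real site's API.

```python
def collect_with_cursor(fetch_page, max_rounds=10):
    """Follow a cursor until it runs out, repeats, or max_rounds is hit.

    fetch_page(cursor) must return a dict shaped like
    {"items": [...], "next_cursor": <str or None>} (assumed shape).
    """
    items, cursor, seen = [], None, set()
    for _ in range(max_rounds):
        page = fetch_page(cursor)
        batch = page.get("items", [])
        if not batch:
            break
        items.extend(batch)
        cursor = page.get("next_cursor")
        # Stop on a missing cursor, or on one already used (avoids loops)
        if not cursor or cursor in seen:
            break
        seen.add(cursor)
    return items


# In-memory stand-in for the API, for illustration only
FAKE_PAGES = {
    None: {"items": [1, 2], "next_cursor": "abc123"},
    "abc123": {"items": [3, 4], "next_cursor": "def456"},
    "def456": {"items": [5], "next_cursor": None},
}

result = collect_with_cursor(lambda cursor: FAKE_PAGES[cursor])
print(result)  # [1, 2, 3, 4, 5]
```

In a real scraper, fetch_page would wrap the requests.get call and the random sleep from the loop above.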

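7. Cheat sheet

Concept	Description
AJAX	Asynchronous JavaScript requests that fetch data without reloading the page
API endpoint	A data interface provided by the backend, usually returning JSON
XHR	XMLHttpRequest, the browser API behind classic AJAX requests and a request type in the Network panel
JSONP	JSON wrapped in a callback function call, used for cross-origin requests
Network panel	The network request monitor in browser developer tools
Cursor	The paging marker used by infinite scroll; points to the next batch of data
Token	An API access credential, usually sent in a request header
X-Requested-With	Header that marks a request as AJAX; its value is usually XMLHttpRequest
Referer	Tells the API which page the request came from
Content-Type	Format of the request body; application/json for JSON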