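Tonight I spent an hour in a battle of wits with Cursor and finally got it to scrape the Database Kernel Monthly (《数据库内核月报》). Compared with the CSDN scrape, the process was a lot bumpier and touched on more techniques, and it showed even more clearly how much prompts, requirement decomposition, and analysis matter.

The process was long, so for now this is just a rough record of the chat; I will tidy it up when I have time.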

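I. Initial requirement

You are a Python web-scraping expert. Write Python code that collects all article links from http://mysql.taobao.org/monthly/ and saves the article content locally in Markdown format under the taobao directory.
Other notes:
1. Work out how the page is rendered so that links are not extracted incorrectly
2. To improve throughput, consider multithreading and other techniques
3. Avoid being blocked by anti-scraping measures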

[Image: A bumpy ride getting Cursor to generate code that scrapes the Database Kernel Monthly]

The suggested code:

import os
import re
import time
import random
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

BASE_URL = "http://mysql.taobao.org/monthly/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
SAVE_DIR = "taobao"
MAX_RETRIES = 3
THREADS = 8

os.makedirs(SAVE_DIR, exist_ok=True)

def get_with_retry(url, retries=MAX_RETRIES, **kwargs):
    for i in range(retries):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=10, **kwargs)
            if resp.status_code == 200:
                return resp
        except Exception:
            if i == retries - 1:
                raise
            time.sleep(random.uniform(1, 3))
    return None

def get_all_links():
    resp = get_with_retry(BASE_URL)
    soup = BeautifulSoup(resp.text, "lxml")
    links = []
    for a in soup.select("a"):
        href = a.get("href", "")
        # Article links usually look like 202x-xx.html
        if re.match(r"\d{4}-\d{2}\.html", href):
            links.append(BASE_URL + href)
    return links

def save_article(url):
    try:
        resp = get_with_retry(url)
        soup = BeautifulSoup(resp.text, "lxml")
        # Get the article title
        title = soup.title.text.strip() if soup.title else url.split("/")[-1]
        # Get the body content
        content_div = soup.find("div", class_="content")
        if not content_div:
            content_div = soup.find("div", id="content")
        content = content_div.get_text("\n", strip=True) if content_div else soup.get_text("\n", strip=True)
        # Build the Markdown content
        md_content = f"# {title}\n\nSource: [{url}]({url})\n\n---\n\n{content}"
        # File name
        filename = os.path.join(SAVE_DIR, url.split("/")[-1].replace(".html", ".md"))
        with open(filename, "w", encoding="utf-8") as f:
            f.write(md_content)
        print(f"Saved: {filename}")
        # Random delay to avoid being blocked
        time.sleep(random.uniform(0.5, 1.5))
    except Exception as e:
        print(f"Download failed: {url}, reason: {e}")

def main():
    links = get_all_links()
    print(f"Found {len(links)} articles, starting download...")
    with ThreadPoolExecutor(max_workers=THREADS) as executor:
        futures = [executor.submit(save_article, url) for url in links]
        for future in as_completed(futures):
            pass
    print("All downloads finished.")

if __name__ == "__main__":
    main()

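II. Found 0 articles

Fetching the articles failed with the message "Found 0 articles, starting download...". Analyze the cause and improve the code.

Problem analysis:

  • Your scraper most likely reports "Found 0 articles" because the article links in the page source do not match your regular expression, or because a redirect or anti-scraping measure changed the response, so no links were extracted.

Suggestions:

  • Inspect the page source: print the first few hundred characters of the page source to confirm the real format of the article links.
  • Loosen the regex: some links may look like 202x-xx/ or something else rather than 202x-xx.html.
  • Check the encoding and anti-scraping behavior: some sites return different content depending on the User-Agent, or return garbled text.

Debugging snippet (run this first and look at the output):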

import requests

url = "http://mysql.taobao.org/monthly/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
resp = requests.get(url, headers=headers)
print(resp.status_code)
print(resp.apparent_encoding)
print(resp.text[:1000])  # print the first 1000 characters

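Please run the code above and paste the output so that I can adjust the regex and parsing logic precisely and make sure every article link is captured.

If you would rather have me write a more robust scraper directly, tell me which strategy you prefer (for example, a looser regex, or simply collecting every link that ends in .html).

The output was: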
D:\book_download\venv\Scripts\python.exe D:/book_download/book_download/book/ai_get_url.py
200
utf-8
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- <title>数据库内核月报</title> -->
<meta name="description" content="数据库内核月报, 来着阿里云 PolarDB 数据库内核团队。">
<link rel="stylesheet" href="/monthly/css/typo.css">
<link rel="stylesheet" href="/monthly/css/animate.css">
<link rel="stylesheet" href="/monthly/css/main.css">
<link rel="canonical" href="http://mysql.taobao.org//monthly/">
<link rel="alternate" type="application/rss+xml" title="数据库内核月报" href="http://mysql.taobao.org//monthly/feed.xml" />
<link rel="stylesheet" href="//cdn.staticfile.org/highlight.js/8.3/styles/tomorrow.min.css">
<script src="/monthly/js/highlight.min.js"></script>
<!-- <link rel="stylesheet" href="/monthly/themes/tomorrow.css">
<script src="/monthly/highlight/highlight.pack.js"> -->
<script>hl
Process finished with exit code 0

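Thanks for the output!

The page returns status code 200 with utf-8 encoding and the source looks normal, but what you pasted is only the first few lines and does not yet include the article links.

Still, it already shows that the page was not redirected or blocked by anti-scraping measures.

Debugging code: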

import requests
from bs4 import BeautifulSoup

url = "http://mysql.taobao.org/monthly/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, "lxml")

# Print the href of every <a> tag
for a in soup.find_all("a"):
    print(a.get("href"))

Please run the code above and paste the output so that I can write a regex and scraper that capture every article link.

The debugging code returned:
/monthly/
/monthly/2025/04
/monthly/2025/03
/monthly/2025/02
/monthly/2025/01
/monthly/2024/12
/monthly/2024/11
/monthly/2024/10

...
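
The href values above make the cause of the empty result visible: the first script's pattern expects file names such as 2024-12.html, while the index page links to directory-style paths. A minimal sketch, using only a few of the hrefs printed above as sample data (the variable names are illustrative), that compares the old pattern with one matching the observed format:

```python
import re

# Sample hrefs copied from the debug output above.
hrefs = ["/monthly/", "/monthly/2025/04", "/monthly/2025/03", "/monthly/2024/12"]

old_pattern = re.compile(r"\d{4}-\d{2}\.html")          # pattern used by the first script
new_pattern = re.compile(r"/monthly/(\d{4})/(\d{2})$")  # matches the observed hrefs

# re.match anchors at the start of the string, so the old pattern never matches.
old_hits = [h for h in hrefs if old_pattern.match(h)]
new_hits = [h for h in hrefs if new_pattern.match(h)]

print(len(old_hits))  # 0
print(new_hits)
```

The trailing $ in the second pattern keeps the bare /monthly/ link and any deeper paths out of the result.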

import os
import re
import time
import random
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

BASE_URL = "http://mysql.taobao.org"
MONTHLY_URL = BASE_URL + "/monthly/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
SAVE_DIR = "taobao"
MAX_RETRIES = 3
THREADS = 8

os.makedirs(SAVE_DIR, exist_ok=True)

def get_with_retry(url, retries=MAX_RETRIES, **kwargs):
    for i in range(retries):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=10, **kwargs)
            if resp.status_code == 200:
                return resp
        except Exception:
            if i == retries - 1:
                raise
            time.sleep(random.uniform(1, 3))
    return None

def get_all_article_links():
    resp = get_with_retry(MONTHLY_URL)
    soup = BeautifulSoup(resp.text, "lxml")
    links = set()
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Match /monthly/YYYY/MM index pages
        m = re.match(r"/monthly/(\d{4})/(\d{2})$", href)
        if m:
            links.add(BASE_URL + href)
    return sorted(links)

def save_article(url):
    try:
        resp = get_with_retry(url)
        soup = BeautifulSoup(resp.text, "lxml")
        # Get the article title
        title = soup.title.text.strip() if soup.title else url.split("/")[-2] + "-" + url.split("/")[-1]
        # Get the body content
        content_div = soup.find("div", class_="content")
        if not content_div:
            content_div = soup.find("div", id="content")
        content = content_div.get_text("\n", strip=True) if content_div else soup.get_text("\n", strip=True)
        # Build the Markdown content
        md_content = f"# {title}\n\nSource: [{url}]({url})\n\n---\n\n{content}"
        # File name
        filename = os.path.join(SAVE_DIR, url.rstrip("/").split("/")[-2] + "-" + url.rstrip("/").split("/")[-1] + ".md")
        with open(filename, "w", encoding="utf-8") as f:
            f.write(md_content)
        print(f"Saved: {filename}")
        time.sleep(random.uniform(0.5, 1.5))
    except Exception as e:
        print(f"Download failed: {url}, reason: {e}")

def main():
    links = get_all_article_links()
    print(f"Found {len(links)} articles, starting download...")
    with ThreadPoolExecutor(max_workers=THREADS) as executor:
        futures = [executor.submit(save_article, url) for url in links]
        for future in as_completed(futures):
            pass
    print("All downloads finished.")

if __name__ == "__main__":
    main()

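III. Scraping the index links instead of the article content

There are two problems
1. The Chinese text in the articles you scraped is garbled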
05
MySQL · 5.7优化 · Metadata Lock子系统的优化
06
MySQL · 5.7特性 · 在线Truncate undo log 表空间
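2. What you scraped is the list of article links in each index page, not the actual content of the articles listed there

Your feedback is very helpful. Below is a detailed analysis of both problems, with a fix for each: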


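1. Garbled Chinese text

Cause:

  • Garbled text usually means requests did not detect the page encoding correctly, so resp.text was decoded with a codec that differs from the page's actual encoding.
  • The site's pages declare <meta charset="utf-8">, but requests sometimes guesses wrong, so the encoding has to be set explicitly.

Fix:

  • After receiving the response, set resp.encoding = 'utf-8' explicitly, then read resp.text.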
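
To see why the explicit encoding helps, here is a minimal sketch that reproduces the symptom without any network access. It assumes the garbling came from UTF-8 bytes being decoded with a single-byte codec such as ISO-8859-1, which, as far as I know, is what requests falls back to for text responses whose Content-Type header names no charset; the sample string is just the site title.

```python
original = "数据库内核月报"
raw_bytes = original.encode("utf-8")       # the bytes a UTF-8 page would send

# Wrong codec: every byte becomes one unrelated character.
garbled = raw_bytes.decode("iso-8859-1")
# Right codec: the original text comes back.
recovered = raw_bytes.decode("utf-8")

print(len(original), len(garbled))  # 7 21
print(recovered == original)        # True
```

Setting resp.encoding = 'utf-8' before reading resp.text corresponds to the second decode call.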

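2. Scraping the article links in the index page instead of the actual article content

Cause:

  • What you are fetching now is each month's index page (for example /monthly/2025/04); the actual content is in the body of that index page (usually <div class="content">), not behind the other links on that page.
  • You only need to scrape the body of each index page; there is no need to follow the sub-links.

Revised approach:

  • Scrape only the body of each /monthly/YYYY/MM index page.
  • Set the encoding explicitly to fix the garbled Chinese text.

import os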
import re
import time
import random
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

BASE_URL = "http://mysql.taobao.org"
MONTHLY_URL = BASE_URL + "/monthly/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
SAVE_DIR = "taobao"
MAX_RETRIES = 3
THREADS = 8

os.makedirs(SAVE_DIR, exist_ok=True)

def get_with_retry(url, retries=MAX_RETRIES, **kwargs):
    for i in range(retries):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=10, **kwargs)
            resp.encoding = 'utf-8'  # set the encoding explicitly to avoid garbled Chinese text
            if resp.status_code == 200:
                return resp
        except Exception:
            if i == retries - 1:
                raise
            time.sleep(random.uniform(1, 3))
    return None

def get_all_article_links():
    resp = get_with_retry(MONTHLY_URL)
    soup = BeautifulSoup(resp.text, "lxml")
    links = set()
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Match /monthly/YYYY/MM index pages
        m = re.match(r"/monthly/(\d{4})/(\d{2})$", href)
        if m:
            links.add(BASE_URL + href)
    return sorted(links)

def save_article(url):
    try:
        resp = get_with_retry(url)
        soup = BeautifulSoup(resp.text, "lxml")
        # Get the article title
        title = soup.title.text.strip() if soup.title else url.split("/")[-2] + "-" + url.split("/")[-1]
        # Get the body content
        content_div = soup.find("div", class_="content")
        if not content_div:
            content_div = soup.find("div", id="content")
        if content_div:
            content = content_div.get_text("\n", strip=True)
        else:
            content = soup.get_text("\n", strip=True)
        # Build the Markdown content
        md_content = f"# {title}\n\nSource: [{url}]({url})\n\n---\n\n{content}"
        # File name
        filename = os.path.join(SAVE_DIR, url.rstrip("/").split("/")[-2] + "-" + url.rstrip("/").split("/")[-1] + ".md")
        with open(filename, "w", encoding="utf-8") as f:
            f.write(md_content)
        print(f"Saved: {filename}")
        time.sleep(random.uniform(0.5, 1.5))
    except Exception as e:
        print(f"Download failed: {url}, reason: {e}")

def main():
    links = get_all_article_links()
    print(f"Found {len(links)} articles, starting download...")
    with ThreadPoolExecutor(max_workers=THREADS) as executor:
        futures = [executor.submit(save_article, url) for url in links]
        for future in as_completed(futures):
            pass
    print("All downloads finished.")

if __name__ == "__main__":
    main()
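
3. Breaking the requirement down further

The Chinese display problem is solved, but what gets scraped is still the article titles in the index page rather than the article content. Let's do this: first try to fetch the content of the articles under @http://mysql.taobao.org/monthly/2024/12/ , again saved in Markdown format under the taobao directory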

import os
import requests
from bs4 import BeautifulSoup

url = "http://mysql.taobao.org/monthly/2024/12/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
SAVE_DIR = "taobao"
os.makedirs(SAVE_DIR, exist_ok=True)

resp = requests.get(url, headers=headers, timeout=10)
resp.encoding = 'utf-8'
soup = BeautifulSoup(resp.text, "lxml")

# Get the title
title = soup.find("h1")
md_content = f"# {title.text.strip() if title else '数据库内核月报'}\n\nSource: [{url}]({url})\n\n---\n\n"

# Collect all entries (usually an h3 or h2 heading plus the content that follows)
for tag in soup.find_all(['h3', 'h2']):
    # Only take entries that start with "# XX"
    if tag.text.strip().startswith("#"):
        md_content += f"## {tag.text.strip()}\n\n"
        # Gather everything under this heading, up to the next h3/h2
        for sibling in tag.find_next_siblings():
            if sibling.name in ['h3', 'h2']:
                break
            # Only take body paragraphs, lists, code blocks, and quotes
            if sibling.name in ['p', 'ul', 'ol', 'pre', 'blockquote']:
                md_content += sibling.get_text("\n", strip=True) + "\n\n"

filename = os.path.join(SAVE_DIR, "2024-12.md")
with open(filename, "w", encoding="utf-8") as f:
    f.write(md_content)

print(f"Saved: {filename}")

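Still not right: you did not open the sub-links in the index page. For example, the first article opens at @http://mysql.taobao.org/monthly/2024/12/01/ . You should open each link in turn and fetch the article content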
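
Before any fetching, the link-selection step can be tried on its own. A minimal sketch, assuming the article hrefs on the month page follow the /monthly/2024/12/NN/ shape of the URL quoted above; the href list is hypothetical sample data, and the optional trailing slash is my own precaution rather than something observed on the site:

```python
import re
from urllib.parse import urljoin

MONTH_URL = "http://mysql.taobao.org/monthly/2024/12/"

# Hypothetical hrefs modelled on the article URL quoted above.
hrefs = [
    "/monthly/",
    "/monthly/2024/12/01/",
    "/monthly/2024/12/02",
    "/monthly/2024/11/01/",
    "/monthly/2024/12/01/",  # duplicate, e.g. a title link and a "read more" link
]

# Two-digit article number, trailing slash optional.
article_re = re.compile(r"^/monthly/2024/12/\d{2}/?$")

# urljoin makes each href absolute; the set drops duplicates.
links = sorted({urljoin(MONTH_URL, h) for h in hrefs if article_re.match(h)})
print(links)
```

Each URL in links would then be fetched and parsed separately.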

import os
import re
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://mysql.taobao.org"
MONTH_URL = "/monthly/2024/12/"
FULL_MONTH_URL = BASE_URL + MONTH_URL
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
SAVE_DIR = "taobao"
os.makedirs(SAVE_DIR, exist_ok=True)

def get_soup(url):
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.encoding = 'utf-8'
    return BeautifulSoup(resp.text, "lxml")

def get_article_links(month_url):
    soup = get_soup(month_url)
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Match article sub-links such as /monthly/2024/12/01/
        if re.match(r"/monthly/2024/12/\d{2}/$", href):
            # Make the link absolute
            if not href.startswith("http"):
                href = BASE_URL + href
            links.append(href)
    return sorted(set(links))

def get_article_content(article_url):
    soup = get_soup(article_url)
    # Title
    h2 = soup.find("h2")
    title = h2.text.strip() if h2 else article_url.split("/")[-2]
    # Body
    content_div = soup.find("div", class_="content")
    if not content_div:
        content_div = soup.find("div", id="content")
    content = content_div.get_text("\n", strip=True) if content_div else soup.get_text("\n", strip=True)
    return f"## {title}\n\nSource: [{article_url}]({article_url})\n\n{content}\n\n---\n"

def main():
    article_links = get_article_links(FULL_MONTH_URL)
    print(f"Found {len(article_links)} articles")
    md_content = f"# 数据库内核月报 2024/12\n\nSource index: [{FULL_MONTH_URL}]({FULL_MONTH_URL})\n\n---\n\n"
    for link in article_links:
        print(f"Fetching: {link}")
        md_content += get_article_content(link)
    filename = os.path.join(SAVE_DIR, "2024-12.md")
    with open(filename, "w", encoding="utf-8") as f:
        f.write(md_content)
    print(f"Saved: {filename}")

if __name__ == "__main__":
    main()

That's it, nicely done~ But each article needs to be saved in its own .md file, with the article title as the file name. Keep going.

import os
import re
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://mysql.taobao.org"
MONTH_URL = "/monthly/2024/12/"
FULL_MONTH_URL = BASE_URL + MONTH_URL
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
SAVE_DIR = "taobao"
os.makedirs(SAVE_DIR, exist_ok=True)

def safe_filename(title):
    # Remove characters that are not allowed in Windows file names
    return re.sub(r'[\\/:*?"<>|]', '', title).strip()

def get_soup(url):
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.encoding = 'utf-8'
    return BeautifulSoup(resp.text, "lxml")

def get_article_links(month_url):
    soup = get_soup(month_url)
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Match article sub-links such as /monthly/2024/12/01/
        if re.match(r"/monthly/2024/12/\d{2}/$", href):
            # Make the link absolute
            if not href.startswith("http"):
                href = BASE_URL + href
            links.append(href)
    return sorted(set(links))

def save_article(article_url):
    soup = get_soup(article_url)
    # Title
    h2 = soup.find("h2")
    title = h2.text.strip() if h2 else article_url.split("/")[-2]
    filename = safe_filename(title) + ".md"
    # Body
    content_div = soup.find("div", class_="content")
    if not content_div:
        content_div = soup.find("div", id="content")
    content = content_div.get_text("\n", strip=True) if content_div else soup.get_text("\n", strip=True)
    md_content = f"# {title}\n\nSource: [{article_url}]({article_url})\n\n{content}\n"
    filepath = os.path.join(SAVE_DIR, filename)
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(md_content)
    print(f"Saved: {filepath}")

def main():
    article_links = get_article_links(FULL_MONTH_URL)
    print(f"Found {len(article_links)} articles")
    for link in article_links:
        save_article(link)
    print("All downloads finished.")

if __name__ == "__main__":
    main()
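
Since the article title now becomes the file name, it is worth checking what the sanitizing step actually does. A small sketch that copies the same regular expression into a standalone function and runs it on one title quoted earlier in this post and one made-up title (the second input is hypothetical):

```python
import re

def safe_filename(title):
    """Drop the characters Windows forbids in file names; keep everything else."""
    return re.sub(r'[\\/:*?"<>|]', "", title).strip()

# Title seen earlier in the scraped output: nothing needs removing.
print(safe_filename("MySQL · 5.7优化 · Metadata Lock子系统的优化"))

# Hypothetical title containing several forbidden characters.
print(safe_filename('InnoDB: redo/undo "日志"?'))
```

Note that the pattern is a blacklist: spaces, dots, and the "·" separator pass through unchanged, and two articles with the same title would still map to the same file.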

IV. Formatting the content

The content is OK. Please also format the text and code inside so that it looks nicer.

import os
import re
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://mysql.taobao.org"
MONTH_URL = "/monthly/2024/12/"
FULL_MONTH_URL = BASE_URL + MONTH_URL
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
SAVE_DIR = "taobao"
os.makedirs(SAVE_DIR, exist_ok=True)

def safe_filename(title):
    # Remove characters that are not allowed in Windows file names
    return re.sub(r'[\\/:*?"<>|]', '', title).strip()

def get_soup(url):
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.encoding = 'utf-8'
    return BeautifulSoup(resp.text, "lxml")

def get_article_links(month_url):
    soup = get_soup(month_url)
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Match article sub-links such as /monthly/2024/12/01/
        if re.match(r"/monthly/2024/12/\d{2}/$", href):
            # Make the link absolute
            if not href.startswith("http"):
                href = BASE_URL + href
            links.append(href)
    return sorted(set(links))

def format_content(content_div):
    md_lines = []
    for elem in content_div.children:
        if elem.name == 'pre':
            # Code block
            code = elem.get_text("\n", strip=True)
            md_lines.append(f'\n```\n{code}\n```\n')
        elif elem.name in ['ul', 'ol']:
            # List
            for li in elem.find_all('li'):
                md_lines.append(f'- {li.get_text(strip=True)}')
            md_lines.append('')
        elif elem.name == 'blockquote':
            # Block quote
            quote = elem.get_text("\n", strip=True)
            md_lines.append(f'> {quote}\n')
        elif elem.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
            # Heading
            level = int(elem.name[1])
            md_lines.append(f'\n{"#"*level} {elem.get_text(strip=True)}\n')
        elif elem.name == 'p' or elem.name is None:
            # Plain paragraph
            text = elem.get_text("\n", strip=True)
            if text:
                md_lines.append(f'{text}\n')
    return '\n'.join(md_lines)

def save_article(article_url):
    soup = get_soup(article_url)
    # Title
    h2 = soup.find("h2")
    title = h2.text.strip() if h2 else article_url.split("/")[-2]
    filename = safe_filename(title) + ".md"
    # Body
    content_div = soup.find("div", class_="content")
    if not content_div:
        content_div = soup.find("div", id="content")
    if content_div:
        content = format_content(content_div)
    else:
        content = soup.get_text("\n", strip=True)
    md_content = f"# {title}\n\nSource: [{article_url}]({article_url})\n\n{content}\n"
    filepath = os.path.join(SAVE_DIR, filename)
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(md_content)
    print(f"Saved: {filepath}")

def main():
    article_links = get_article_links(FULL_MONTH_URL)
    print(f"Found {len(article_links)} articles")
    for link in article_links:
        save_article(link)
    print("All downloads finished.")

if __name__ == "__main__":
    main()

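V. Merging the code to meet the initial requirement

Yes, that's it. Combining the ideas above, let's return to the original requirement: collect all article links from http://mysql.taobao.org/monthly/ and save the article content locally in Markdown format under the taobao directory, with the other requirements unchanged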

import os
import re
import time
import random
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

BASE_URL = "http://mysql.taobao.org"
MONTHLY_URL = BASE_URL + "/monthly/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
SAVE_DIR = "taobao"
THREADS = 8
os.makedirs(SAVE_DIR, exist_ok=True)

def safe_filename(title):
    return re.sub(r'[\\/:*?"<>|]', '', title).strip()

def get_soup(url):
    for _ in range(3):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=10)
            resp.encoding = 'utf-8'
            if resp.status_code == 200:
                return BeautifulSoup(resp.text, "lxml")
        except Exception:
            time.sleep(random.uniform(1, 2))
    return None

def get_all_month_links():
    soup = get_soup(MONTHLY_URL)
    links = set()
    if not soup:
        # The index page could not be fetched, so there is nothing to crawl
        return []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Match /monthly/YYYY/MM index pages
        m = re.match(r"/monthly/(\d{4})/(\d{2})$", href)
        if m:
            links.add(BASE_URL + href + "/")
    return sorted(links)

def get_article_links(month_url):
    soup = get_soup(month_url)
    links = []
    if not soup:
        return links
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Match article sub-links such as /monthly/YYYY/MM/XX/
        m = re.match(r"/monthly/\d{4}/\d{2}/\d{2}/$", href)
        if m:
            if not href.startswith("http"):
                href = BASE_URL + href
            links.append(href)
    return sorted(set(links))

def format_content(content_div):
    md_lines = []
    for elem in content_div.children:
        if elem.name == 'pre':
            code = elem.get_text("\n", strip=True)
            md_lines.append(f'\n```\n{code}\n```\n')
        elif elem.name in ['ul', 'ol']:
            for li in elem.find_all('li'):
                md_lines.append(f'- {li.get_text(strip=True)}')
            md_lines.append('')
        elif elem.name == 'blockquote':
            quote = elem.get_text("\n", strip=True)
            md_lines.append(f'> {quote}\n')
        elif elem.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
            level = int(elem.name[1])
            md_lines.append(f'\n{"#"*level} {elem.get_text(strip=True)}\n')
        elif elem.name == 'p' or elem.name is None:
            text = elem.get_text("\n", strip=True)
            if text:
                md_lines.append(f'{text}\n')
    return '\n'.join(md_lines)

def save_article(article_url):
    soup = get_soup(article_url)
    if not soup:
        print(f"Fetch failed: {article_url}")
        return
    h2 = soup.find("h2")
    title = h2.text.strip() if h2 else article_url.split("/")[-2]
    filename = safe_filename(title) + ".md"
    content_div = soup.find("div", class_="content")
    if not content_div:
        content_div = soup.find("div", id="content")
    if content_div:
        content = format_content(content_div)
    else:
        content = soup.get_text("\n", strip=True)
    md_content = f"# {title}\n\nSource: [{article_url}]({article_url})\n\n{content}\n"
    filepath = os.path.join(SAVE_DIR, filename)
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(md_content)
    print(f"Saved: {filepath}")
    time.sleep(random.uniform(0.5, 1.2))

def main():
    month_links = get_all_month_links()
    print(f"Found {len(month_links)} months")
    all_article_links = []
    for month_url in month_links:
        article_links = get_article_links(month_url)
        print(f"{month_url}: found {len(article_links)} articles")
        all_article_links.extend(article_links)
    print(f"Found {len(all_article_links)} articles in total, starting download...")
    with ThreadPoolExecutor(max_workers=THREADS) as executor:
        futures = [executor.submit(save_article, url) for url in all_article_links]
        for future in as_completed(futures):
            try:
                # result() re-raises any exception from the worker thread
                future.result()
            except Exception as e:
                print(f"Download failed: {e}")
    print("All downloads finished.")

if __name__ == "__main__":
    main()
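
One detail of the thread pool is easy to miss: an exception raised inside a worker only surfaces when future.result() is called. A minimal sketch of collecting successes and failures separately; fake_save is a hypothetical stand-in for save_article that fails for one URL on purpose, so nothing is downloaded:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_save(url):
    """Stand-in for save_article; fails for one URL to show error handling."""
    if url.endswith("/02/"):
        raise ValueError("simulated failure")
    return url

urls = [f"http://mysql.taobao.org/monthly/2024/12/{i:02d}/" for i in range(1, 4)]
saved, failed = [], []

with ThreadPoolExecutor(max_workers=3) as executor:
    # Map each future back to its URL so failures can be reported by URL.
    futures = {executor.submit(fake_save, u): u for u in urls}
    for future in as_completed(futures):
        try:
            saved.append(future.result())  # re-raises any worker exception
        except Exception as exc:
            failed.append((futures[future], str(exc)))

print(sorted(saved))
print(failed)
```

Keeping the failed list also makes it straightforward to retry just those URLs afterwards.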

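VI. Finer-grained directories and image downloads

Nicely done~ The requirement is basically met. A few refinements:
1. Create subdirectories by the year and month in the link, and put each month's articles in the matching one
2. Only articles from 2024 and 2025 are needed
3. If an article contains tables, images, and so on, fetch those as well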
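
The first two refinements can both be driven by the article URL itself. A minimal sketch, assuming article URLs keep the /monthly/YYYY/MM/NN/ shape seen earlier; target_dir and ARTICLE_RE are illustrative names, not taken from the generated script:

```python
import os
import re

SAVE_DIR = "taobao"
# Year is restricted to 2024 or 2025; the trailing slash is optional.
ARTICLE_RE = re.compile(r"/monthly/(2024|2025)/(\d{2})/\d{2}/?$")

def target_dir(article_url):
    """Return taobao/<year>/<month> for a 2024/2025 article URL, else None."""
    m = ARTICLE_RE.search(article_url)
    if not m:
        return None
    year, month = m.groups()
    return os.path.join(SAVE_DIR, year, month)

print(target_dir("http://mysql.taobao.org/monthly/2024/12/01/"))
print(target_dir("http://mysql.taobao.org/monthly/2023/12/01/"))  # None
```

Calling os.makedirs(path, exist_ok=True) on the returned path before writing is safe to repeat from several threads.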

import os
import re
import time
import random
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import urljoin, urlparse

BASE_URL = "http://mysql.taobao.org"
MONTHLY_URL = BASE_URL + "/monthly/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
SAVE_DIR = "taobao"
THREADS = 8
os.makedirs(SAVE_DIR, exist_ok=True)

def safe_filename(title):
    return re.sub(r'[\\/:*?"<>|]', '', title).strip()

def get_soup(url):
    for _ in range(3):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=10)
            resp.encoding = 'utf-8'
            if resp.status_code == 200:
                return BeautifulSoup(resp.text, "lxml")
        except Exception:
            time.sleep(random.uniform(1, 2))
    return None

def get_all_month_links():
    soup = get_soup(MONTHLY_URL)
    links = set()
    if not soup:
        # The index page could not be fetched, so there is nothing to crawl
        return []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Only 2024 and 2025
        m = re.match(r"/monthly/(2024|2025)/(\d{2})$", href)
        if m:
            links.add(BASE_URL + href + "/")
    return sorted(links)

def get_article_links(month_url):
    soup = get_soup(month_url)
    links = []
    if not soup:
        return links
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Match article sub-links such as /monthly/2024/12/01/
        m = re.match(r"/monthly/(2024|2025)/\d{2}/\d{2}/$", href)
        if m:
            if not href.startswith("http"):
                href = BASE_URL + href
            links.append(href)
    return sorted(set(links))

def download_image(img_url, save_path):
    try:
        resp = requests.get(img_url, headers=HEADERS, timeout=10)
        if resp.status_code == 200:
            with open(save_path, 'wb') as f:
                f.write(resp.content)
            return True
    except Exception:
        pass
    return False

def format_table(table):
    rows = table.find_all('tr')
    md = []
    for i, row in enumerate(rows):
        cols = row.find_all(['td', 'th'])
        line = '| ' + ' | '.join(col.get_text(strip=True) for col in cols) + ' |'
        md.append(line)
        if i == 0:
            md.append('|' + '---|' * len(cols))
    return '\n'.join(md) + '\n'

def format_content(content_div, article_url, save_dir):
    md_lines = []
    for elem in content_div.children:
        if elem.name == 'pre':
            code = elem.get_text("\n", strip=True)
            md_lines.append(f'\n```\n{code}\n```\n')
        elif elem.name in ['ul', 'ol']:
            for li in elem.find_all('li'):
                md_lines.append(f'- {li.get_text(strip=True)}')
            md_lines.append('')
        elif elem.name == 'blockquote':
            quote = elem.get_text("\n", strip=True)
            md_lines.append(f'> {quote}\n')
        elif elem.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
            level = int(elem.name[1])
            md_lines.append(f'\n{"#"*level} {elem.get_text(strip=True)}\n')
        elif elem.name == 'p' or elem.name is None:
            text = elem.get_text("\n", strip=True)
            if text:
                md_lines.append(f'{text}\n')
        elif elem.name == 'table':
            md_lines.append(format_table(elem))
        elif elem.name == 'img':
            img_src = elem.get('src')
            if img_src:
                img_url = urljoin(article_url, img_src)
                img_name = os.path.basename(urlparse(img_url).path)
                img_save_path = os.path.join(save_dir, img_name)
                if download_image(img_url, img_save_path):
                    md_lines.append(f'![{img_name}]({img_name})\n')
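    # Handle images nested inside paragraphs.
    # Note: md_lines is already built at this point, so replacing <img> tags
    # in the tree below does not change the returned Markdown; only the
    # download side effect remains.
    for img in content_div.find_all('img'):
        img_src = img.get('src')
        if img_src:
            img_url = urljoin(article_url, img_src)
            img_name = os.path.basename(urlparse(img_url).path)
            img_save_path = os.path.join(save_dir, img_name)
            if download_image(img_url, img_save_path):
                # Replace the original <img> with Markdown image syntax
                img.replace_with(f'![{img_name}]({img_name})')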
    return '\n'.join(md_lines)

def save_article(article_url, year, month):
    soup = get_soup(article_url)
    if not soup:
        print(f"Fetch failed: {article_url}")
        return
    h2 = soup.find("h2")
    title = h2.text.strip() if h2 else article_url.split("/")[-2]
    filename = safe_filename(title) + ".md"
    # Create subdirectories by year and month
    subdir = os.path.join(SAVE_DIR, year, month)
    os.makedirs(subdir, exist_ok=True)
    content_div = soup.find("div", class_="content")
    if not content_div:
        content_div = soup.find("div", id="content")
    if content_div:
        content = format_content(content_div, article_url, subdir)
    else:
        content = soup.get_text("\n", strip=True)
    md_content = f"# {title}\n\nOriginal link: [{article_url}]({article_url})\n\n{content}\n"
    filepath = os.path.join(subdir, filename)
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(md_content)
    print(f"Saved: {filepath}")
    time.sleep(random.uniform(0.5, 1.2))

def main():
    month_links = get_all_month_links()
    print(f"Found {len(month_links)} months")
    all_tasks = []
    for month_url in month_links:
        m = re.match(r".*/monthly/(2024|2025)/(\d{2})/", month_url)
        if not m:
            continue
        year, month = m.group(1), m.group(2)
        article_links = get_article_links(month_url)
        print(f"{month_url}: found {len(article_links)} articles")
        for link in article_links:
            all_tasks.append((link, year, month))
    print(f"Found {len(all_tasks)} articles in total, starting download...")
    with ThreadPoolExecutor(max_workers=THREADS) as executor:
        futures = [executor.submit(save_article, url, year, month) for url, year, month in all_tasks]
        for future in as_completed(futures):
            pass
    print("All downloads finished.")

if __name__ == "__main__":
    main()

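Hmm, don't save the images as separate files. They need to be embedded back into the .md document so it is easier for users to look things up.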

import os
import re
import time
import random
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import urljoin

BASE_URL = "http://mysql.taobao.org"
MONTHLY_URL = BASE_URL + "/monthly/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
SAVE_DIR = "taobao"
THREADS = 8
os.makedirs(SAVE_DIR, exist_ok=True)

def safe_filename(title):
    return re.sub(r'[\\/:*?"<>|]', '', title).strip()

def get_soup(url):
    for _ in range(3):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=10)
            resp.encoding = 'utf-8'
            if resp.status_code == 200:
                return BeautifulSoup(resp.text, "lxml")
        except Exception:
            time.sleep(random.uniform(1, 2))
    return None

def get_all_month_links():
    soup = get_soup(MONTHLY_URL)
    links = set()
    if not soup:
        # The index page could not be fetched after retries; nothing to crawl
        return []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        m = re.match(r"/monthly/(2024|2025)/(\d{2})$", href)
        if m:
            links.add(BASE_URL + href + "/")
    return sorted(links)

def get_article_links(month_url):
    soup = get_soup(month_url)
    links = []
    if not soup:
        return links
    for a in soup.find_all("a", href=True):
        href = a["href"]
        m = re.match(r"/monthly/(2024|2025)/\d{2}/\d{2}/$", href)
        if m:
            if not href.startswith("http"):
                href = BASE_URL + href
            links.append(href)
    return sorted(set(links))

def format_table(table):
    rows = table.find_all('tr')
    md = []
    for i, row in enumerate(rows):
        cols = row.find_all(['td', 'th'])
        line = '| ' + ' | '.join(col.get_text(strip=True) for col in cols) + ' |'
        md.append(line)
        if i == 0:
            md.append('|' + '---|' * len(cols))
    return '\n'.join(md) + '\n'

def format_content(content_div, article_url):
    md_lines = []
    for elem in content_div.children:
        if elem.name == 'pre':
            code = elem.get_text("\n", strip=True)
            md_lines.append(f'\n```\n{code}\n```\n')
        elif elem.name in ['ul', 'ol']:
            for li in elem.find_all('li'):
                md_lines.append(f'- {li.get_text(strip=True)}')
            md_lines.append('')
        elif elem.name == 'blockquote':
            quote = elem.get_text("\n", strip=True)
            md_lines.append(f'> {quote}\n')
        elif elem.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
            level = int(elem.name[1])
            md_lines.append(f'\n{"#"*level} {elem.get_text(strip=True)}\n')
        elif elem.name == 'p' or elem.name is None:
            text = elem.get_text("\n", strip=True)
            if text:
                md_lines.append(f'{text}\n')
        elif elem.name == 'table':
            md_lines.append(format_table(elem))
        elif elem.name == 'img':
            img_src = elem.get('src')
            if img_src:
                img_url = urljoin(article_url, img_src)
                md_lines.append(f'![image]({img_url})\n')
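    # Handle images nested inside paragraphs.
    # Note: md_lines is already built at this point, so replacing <img> tags
    # in the tree below does not change the returned Markdown. Images nested
    # inside <p> or <li> are therefore missing from the output.
    for img in content_div.find_all('img'):
        img_src = img.get('src')
        if img_src:
            img_url = urljoin(article_url, img_src)
            img.replace_with(f'![image]({img_url})')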
    return '\n'.join(md_lines)

def save_article(article_url, year, month):
    soup = get_soup(article_url)
    if not soup:
        print(f"Fetch failed: {article_url}")
        return
    h2 = soup.find("h2")
    title = h2.text.strip() if h2 else article_url.split("/")[-2]
    filename = safe_filename(title) + ".md"
    subdir = os.path.join(SAVE_DIR, year, month)
    os.makedirs(subdir, exist_ok=True)
    content_div = soup.find("div", class_="content")
    if not content_div:
        content_div = soup.find("div", id="content")
    if content_div:
        content = format_content(content_div, article_url)
    else:
        content = soup.get_text("\n", strip=True)
    md_content = f"# {title}\n\nOriginal link: [{article_url}]({article_url})\n\n{content}\n"
    filepath = os.path.join(subdir, filename)
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(md_content)
    print(f"Saved: {filepath}")
    time.sleep(random.uniform(0.5, 1.2))

def main():
    month_links = get_all_month_links()
    print(f"Found {len(month_links)} months")
    all_tasks = []
    for month_url in month_links:
        m = re.match(r".*/monthly/(2024|2025)/(\d{2})/", month_url)
        if not m:
            continue
        year, month = m.group(1), m.group(2)
        article_links = get_article_links(month_url)
        print(f"{month_url}: found {len(article_links)} articles")
        for link in article_links:
            all_tasks.append((link, year, month))
    print(f"Found {len(all_tasks)} articles in total, starting download...")
    with ThreadPoolExecutor(max_workers=THREADS) as executor:
        futures = [executor.submit(save_article, url, year, month) for url, year, month in all_tasks]
        for future in as_completed(futures):
            pass
    print("All downloads finished.")

if __name__ == "__main__":
    main()

The images still aren't showing up.

[Screenshot 8]

[Screenshot 9]

Building on your optimization idea, refine the previous code again so that the scraped images display directly in the .md file.

import os
import re
import time
import random
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import urljoin

BASE_URL = "http://mysql.taobao.org"
MONTHLY_URL = BASE_URL + "/monthly/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
SAVE_DIR = "taobao"
THREADS = 8
os.makedirs(SAVE_DIR, exist_ok=True)

def safe_filename(title):
    # Strip characters that are illegal in Windows file names: \ / : * ? " < > |
    return re.sub(r'[\\/:*?"<>|]', '', title).strip()

def get_soup(url):
    for _ in range(3):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=10)
            resp.encoding = 'utf-8'
            if resp.status_code == 200:
                return BeautifulSoup(resp.text, "lxml")
        except Exception:
            time.sleep(random.uniform(1, 2))
    return None

def get_all_month_links():
    soup = get_soup(MONTHLY_URL)
    links = set()
    if not soup:
        # The index page could not be fetched after retries; nothing to crawl
        return []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Only crawl 2024 and 2025
        m = re.match(r"/monthly/(2024|2025)/(\d{2})$", href)
        if m:
            links.add(BASE_URL + href + "/")
    return sorted(links)

def get_article_links(month_url):
    soup = get_soup(month_url)
    links = []
    if not soup:
        return links
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Match sub-article links such as /monthly/2024/12/01/
        m = re.match(r"/monthly/(2024|2025)/\d{2}/\d{2}/$", href)
        if m:
            if not href.startswith("http"):
                href = BASE_URL + href
            links.append(href)
    return sorted(set(links))

def format_table(table):
    rows = table.find_all('tr')
    md = []
    for i, row in enumerate(rows):
        cols = row.find_all(['td', 'th'])
        line = '| ' + ' | '.join(col.get_text(strip=True) for col in cols) + ' |'
        md.append(line)
        if i == 0:
            md.append('|' + '---|' * len(cols))
    return '\n'.join(md) + '\n'

def format_content(content_div, article_url):
    md_lines = []
    for elem in content_div.children:
        if elem.name == 'pre':
            code = elem.get_text("\n", strip=True)
            md_lines.append(f'\n```\n{code}\n```\n')
        elif elem.name in ['ul', 'ol']:
            for li in elem.find_all('li', recursive=False):
                # Emit images found inside the list item
                imgs = li.find_all('img')
                for img in imgs:
                    img_src = img.get('src')
                    if img_src:
                        img_url = urljoin(article_url, img_src)
                        md_lines.append(f'![image]({img_url})')
                text = li.get_text(" ", strip=True)
                if text:
                    md_lines.append(f'- {text}')
            md_lines.append('')
        elif elem.name == 'blockquote':
            quote = elem.get_text("\n", strip=True)
            md_lines.append(f'> {quote}\n')
        elif elem.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
            level = int(elem.name[1])
            md_lines.append(f'\n{"#"*level} {elem.get_text(strip=True)}\n')
        elif elem.name == 'p' or elem.name is None:
            # Emit images found inside the paragraph (bare text nodes have no find_all)
            if hasattr(elem, 'find_all'):
                imgs = elem.find_all('img')
                for img in imgs:
                    img_src = img.get('src')
                    if img_src:
                        img_url = urljoin(article_url, img_src)
                        md_lines.append(f'![image]({img_url})')
            text = elem.get_text("\n", strip=True)
            if text:
                md_lines.append(f'{text}\n')
        elif elem.name == 'table':
            md_lines.append(format_table(elem))
    return '\n'.join(md_lines)

def save_article(article_url, year, month):
    soup = get_soup(article_url)
    if not soup:
        print(f"Fetch failed: {article_url}")
        return
    h2 = soup.find("h2")
    title = h2.text.strip() if h2 else article_url.split("/")[-2]
    filename = safe_filename(title) + ".md"
    # Create subdirectories by year and month
    subdir = os.path.join(SAVE_DIR, year, month)
    os.makedirs(subdir, exist_ok=True)
    content_div = soup.find("div", class_="content")
    if not content_div:
        content_div = soup.find("div", id="content")
    if content_div:
        content = format_content(content_div, article_url)
    else:
        content = soup.get_text("\n", strip=True)
    md_content = f"# {title}\n\nOriginal link: [{article_url}]({article_url})\n\n{content}\n"
    filepath = os.path.join(subdir, filename)
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(md_content)
    print(f"Saved: {filepath}")
    time.sleep(random.uniform(0.5, 1.2))

def main():
    month_links = get_all_month_links()
    print(f"Found {len(month_links)} months")
    all_tasks = []
    for month_url in month_links:
        m = re.match(r".*/monthly/(2024|2025)/(\d{2})/", month_url)
        if not m:
            continue
        year, month = m.group(1), m.group(2)
        article_links = get_article_links(month_url)
        print(f"{month_url}: found {len(article_links)} articles")
        for link in article_links:
            all_tasks.append((link, year, month))
    print(f"Found {len(all_tasks)} articles in total, starting download...")
    with ThreadPoolExecutor(max_workers=THREADS) as executor:
        futures = [executor.submit(save_article, url, year, month) for url, year, month in all_tasks]
        for future in as_completed(futures):
            pass
    print("All downloads finished.")

if __name__ == "__main__":
    main()

Finally, finally, the requirement is met~

VII. Results

The scrape succeeded~

[Screenshot 10]

How a saved article looks

[Screenshot 11]

VIII. Best Prompt Summary

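Looking back over our conversation, what prompt should I have given you so that you would generate this target code on the first try?

Best prompt (example)

You are a Python web-scraping expert. Please write an efficient crawler in Python for me with the following requirements:

  1. Target site: scrape every monthly-report article from 2024 and 2025 under http://mysql.taobao.org/monthly/.
  2. Directory structure: automatically discover all month and article links for 2024 and 2025, and create local subdirectories by "year/month" (for example taobao/2024/12/).
  3. Saving content: save each article as its own .md file in Markdown format, named after the article title, in the corresponding subdirectory.
  4. Content format: the body must be cleanly laid out and support common formats such as code blocks, tables (converted to Markdown tables), lists, and block quotes.
  5. Image handling: embed images in the Markdown file by their original network URL (that is, ![alt](url)); do not download them locally, and make sure they display directly in a Markdown preview.
  6. Efficiency and anti-scraping: support multithreaded downloads and apply measures such as setting a User-Agent, random delays, and automatic retries.
  7. Robustness: when a network error or a change in page structure occurs, skip the item automatically and continue.
  8. Dependencies: list the third-party libraries required.

Please give the complete, runnable Python code directly.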

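Why this prompt works
  • It pins down the target scope (2024/2025, all monthly reports, all articles)
  • It pins down the directory structure and file naming
  • It pins down the content format and how images are handled
  • It pins down the efficiency and anti-scraping requirements
  • It pins down robustness and dependencies

You can add or remove details to suit your needs, but as long as the points above are covered, you can generally get the high-quality crawler code you want in one shot!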

IX. Learning the Key Concepts

To implement this requirement, what was your thought process for producing the final code? Please list it step by step. What are the key points and the difficulties in this code, and which concepts does it involve? Please tie the analysis to the code, because I'm not sure which parts of the code your steps correspond to.

1. Step Breakdown and Corresponding Code

[Screenshots 12 to 18]

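2. Key Points and Difficulties in the Code
Key point 1: high-quality content parsing and formatting
  • The format_content function converts the HTML body into readable Markdown and handles a range of tags (paragraphs, headings, code blocks, tables, images, and so on).
  • The format_table function converts an HTML table into a Markdown table.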
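To make the header-separator rule in format_table concrete, here is a minimal standard-library sketch that applies the same rule to plain lists of cell strings rather than BeautifulSoup tags. The helper name rows_to_markdown and the pipe-escaping step are assumptions added for illustration; the original function does not escape pipes.

```python
def rows_to_markdown(rows):
    """Render a list of rows (each a list of cell strings) as a Markdown table.

    Hypothetical helper: mirrors format_table's rule of emitting the
    separator line right after the first row.
    """
    lines = []
    for i, cells in enumerate(rows):
        # Assumption: escape literal pipes so a cell cannot split a column.
        safe = [c.strip().replace("|", "\\|") for c in cells]
        lines.append("| " + " | ".join(safe) + " |")
        if i == 0:
            # One '---' per column of the first (header) row.
            lines.append("|" + "---|" * len(cells))
    return "\n".join(lines) + "\n"


table = rows_to_markdown([["Engine", "Isolation"], ["InnoDB", "MVCC"]])
print(table)
```

The separator is written once, after row 0, which is what turns the first row into the table header when the Markdown is rendered.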
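Difficulty 1: handling image URLs
  • urljoin(article_url, img_src) resolves each image src to an absolute URL, which is then inserted as ![image](url) so the image renders directly in the Markdown file.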
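As a quick illustration of why urljoin is used here, the sketch below resolves four kinds of src values against an article URL. The image paths and the img.example.com host are made up for the example and are not taken from the site.

```python
from urllib.parse import urljoin

article_url = "http://mysql.taobao.org/monthly/2024/12/01/"

# Hypothetical src values covering the common cases.
candidates = [
    "pic/01.png",                     # relative to the article directory
    "/monthly/pic/202412/01.png",     # root-relative
    "//img.example.com/a.png",        # protocol-relative
    "https://img.example.com/b.png",  # already absolute
]

resolved = [urljoin(article_url, src) for src in candidates]
for src, url in zip(candidates, resolved):
    print(f"{src!r:40} -> {url}")
```

Because the article URL ends with a slash, a bare relative path is resolved inside the article directory, while an already absolute URL passes through unchanged.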
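Difficulty 2: multithreading and exception handling
  • ThreadPoolExecutor fetches articles concurrently, and get_soup retries automatically, which keeps large batch crawls robust.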
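One detail worth noting: in the script, the as_completed loop body is just `pass`, so an exception raised inside a worker stays stored on its future and is never reported. The following is a hedged sketch of one way to surface failures. fake_fetch is a stand-in that performs no network I/O, and run_all is a hypothetical helper name.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(url):
    """Stand-in for save_article: fails for URLs containing 'bad' (no network I/O)."""
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    return f"saved {url}"


def run_all(urls, worker, max_workers=4):
    """Run worker over urls concurrently; return (successes, failures) keyed by URL."""
    successes, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its URL so errors can be attributed.
        future_to_url = {executor.submit(worker, u): u for u in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                # result() re-raises any exception raised inside the worker.
                successes[url] = future.result()
            except Exception as exc:
                failures[url] = exc
    return successes, failures


ok, failed = run_all(["/a/", "/bad/", "/c/"], fake_fetch)
print(sorted(ok), sorted(failed))
```

Keeping a future-to-URL mapping makes it possible to log or re-queue exactly the articles that failed instead of silently losing them.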
Difficulty 3: valid file names
  • The safe_filename function strips illegal characters from the title so that saving the file does not fail.
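As written, safe_filename returns an empty string when a title consists only of stripped characters, which would produce a file named just ".md". Below is a minimal sketch of a slightly more defensive variant. The name safe_filename_strict, the "untitled" fallback, and the 100-character cap are all assumptions made for illustration.

```python
import re

_ILLEGAL = re.compile(r'[\\/:*?"<>|]')


def safe_filename_strict(title, fallback="untitled", max_len=100):
    # Remove characters Windows forbids in file names, then trim whitespace.
    name = _ILLEGAL.sub("", title).strip()
    # Collapse runs of whitespace (including newlines) into single spaces.
    name = re.sub(r"\s+", " ", name)
    # Assumption: cap the length and fall back to a placeholder when empty.
    return name[:max_len].rstrip() or fallback


print(safe_filename_strict('MySQL · 源码分析 · "redo log"?'))
print(safe_filename_strict("???"))
```

Two titles that differ only in stripped characters would still collide, so a real crawler might also append the article's URL segment to the name.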

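3. Concepts Involved and Where They Appear in the Code
  • requests/BeautifulSoup: every function that requests a page or parses HTML
  • Regular expressions: link filtering, file name sanitizing
  • os/path: directory and file operations
  • concurrent.futures: multithreaded concurrency
  • Markdown syntax: format_content, format_table, and so on
  • Exception handling: try/except, retry mechanism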
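The link filtering relies on two regular expressions from the script, one for month pages and one for article pages. The sketch below runs both against a few sample hrefs; the hrefs are illustrative and were not scraped from the live site.

```python
import re

# Same patterns as in get_all_month_links and get_article_links.
MONTH_RE = re.compile(r"/monthly/(2024|2025)/(\d{2})$")
ARTICLE_RE = re.compile(r"/monthly/(2024|2025)/\d{2}/\d{2}/$")

# Illustrative hrefs only.
hrefs = [
    "/monthly/2024/12",
    "/monthly/2023/05",
    "/monthly/2024/12/01/",
    "/monthly/2025/01/03/",
    "/monthly/",
]

months = [h for h in hrefs if MONTH_RE.match(h)]
articles = [h for h in hrefs if ARTICLE_RE.match(h)]
print(months)
print(articles)
```

The trailing `$` is what keeps the two patterns from overlapping: a month href must end right after the two-digit month, and an article href must end with the slash after the two-digit article number.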

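4. Summary
  • You can go through each function and code block to understand its role in the overall flow.
  • The key points and difficulties all lie in content parsing, image handling, multithreading, and robustness, and each has a concrete implementation in the code.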
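On the robustness point, note that get_soup only sleeps when an exception is raised, so a non-200 response is retried immediately. A hedged sketch of a retry loop that delays after either kind of failure follows. fetch_with_retry is a hypothetical helper, `fetch` is assumed to be any callable returning (status_code, body), and the backoff values are arbitrary.

```python
import random
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `retries` times, pausing between attempts.

    Both exceptions and non-200 responses trigger a delayed retry.
    Returns the body on success, or None when every attempt fails.
    """
    for attempt in range(retries):
        try:
            status, body = fetch(url)
            if status == 200:
                return body
        except Exception:
            pass  # treat as a failed attempt and fall through to the delay
        if attempt < retries - 1:
            # Exponential backoff with jitter: about 1s, then about 2s, ...
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    return None


# Demo with a fake fetcher that succeeds on the third call (no network I/O).
calls = []


def flaky(url):
    calls.append(url)
    if len(calls) == 1:
        raise ConnectionError("simulated network error")
    if len(calls) == 2:
        return 503, ""
    return 200, "<html>ok</html>"


delays = []
body = fetch_with_retry(flaky, "http://mysql.taobao.org/monthly/", sleep=delays.append)
print(body, len(calls), len(delays))
```

Passing `sleep` as a parameter lets the demo record the delays instead of actually waiting, which also makes the helper easy to unit test.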