403 error when scraping a foreign website with Python

I get a 403 error when using Python to scrape a foreign website.

The code is as follows:

import requests
from bs4 import BeautifulSoup

url = 'https://7net.omni7.jp/search/?keyword=%E3%83%8F%E3%82%A4%E3%82%AD%E3%83%A5%E3%83%BC%EF%BC%81%EF%BC%81%E3%82%B9%E3%82%AF%E3%82%A8%E3%82%A2%E7%BC%B6%E3%83%90%E3%83%83%E3%82%B8'
# Browser-like request headers
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'ja,en-US;q=0.9,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Sec-Ch-Ua-Mobile': '?0',
    'Sec-Ch-Ua-Platform': '"Windows"',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.31',
}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    print('Page request succeeded!')
    print(response.status_code)
    print(response.url)
    soup = BeautifulSoup(response.content, 'html.parser')
    print(soup)
elif response.status_code == 302:
    # requests follows redirects by default, so this branch is only
    # reached when the request is made with allow_redirects=False.
    redirect_url = response.headers.get('Location')
    print('Page redirected to new URL:', redirect_url)
else:
    print('Page request failed, status code:', response.status_code)
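
A side note on the 302 check: requests follows redirects by default, so the final response normally carries the status of the last hop and the intermediate hops end up in response.history. A small sketch for listing them (redirect_chain is just a helper name made up for illustration):

```python
import requests


def redirect_chain(response):
    """Return (status_code, url) for every hop, ending with the final response."""
    # response.history holds the intermediate responses requests followed.
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops
```

Calling redirect_chain(response) after a requests.get(...) shows whether a redirect happened on the way. Passing allow_redirects=False to requests.get would instead return the 302 response itself.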

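If I add a cookie to the headers, the data is scraped normally. But I can't be expected to copy the cookie out of the browser and paste it into the headers every time I scrape; that would defeat the purpose of a crawler.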
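
For reference, the cookie workaround could at least be made less manual. The sketch below is only an assumption about how that might look: it parses a cookie string copied once from the browser and stores the session's cookies in a JSON file so later runs can reload them. The function names and the file path are hypothetical, and the cookies will still stop working once the site expires them.

```python
import json
from pathlib import Path

import requests


def parse_cookie_header(cookie_header):
    """Turn 'a=1; b=2' (as copied from browser dev tools) into {'a': '1', 'b': '2'}."""
    cookies = {}
    for part in cookie_header.split(';'):
        name, sep, value = part.strip().partition('=')
        if sep and name:  # skip fragments that have no '='
            cookies[name] = value
    return cookies


def save_cookies(session, path):
    """Write the session's current cookies to a JSON file (name -> value only)."""
    data = requests.utils.dict_from_cookiejar(session.cookies)
    Path(path).write_text(json.dumps(data), encoding='utf-8')


def load_cookies(session, path):
    """Load cookies written by save_cookies into the session, if the file exists."""
    cookie_file = Path(path)
    if cookie_file.exists():
        session.cookies.update(json.loads(cookie_file.read_text(encoding='utf-8')))
```

Typical use would be session.cookies.update(parse_cookie_header(copied_string)) once, then save_cookies(session, 'cookies.json') at the end of a run and load_cookies(session, 'cookies.json') at the start of the next.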

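My idea is to visit some page first to obtain a session, then use that session to request the page I want to scrape.

I also tried the code below, but it still returns a 403 error: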
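
A minimal sketch of that idea, assuming the cookies the site needs are set by plain HTTP responses (requests does not run JavaScript, so cookies created by scripts would not be captured). The helper names make_session and fetch_after_warmup are made up for illustration. The key point is that both requests go through the same Session object:

```python
import time

import requests


def make_session(headers):
    """Return a Session that sends `headers` on every request."""
    session = requests.Session()
    session.headers.update(headers)
    return session


def fetch_after_warmup(session, warmup_url, target_url, delay=3.0):
    """Visit warmup_url so the session stores its cookies, then fetch target_url."""
    first = session.get(warmup_url, timeout=10)
    first.raise_for_status()
    time.sleep(delay)
    # Same session object: cookies received above are sent automatically here.
    return session.get(target_url, timeout=10)
```

With the URLs from this post, that would be fetch_after_warmup(make_session(headers), 'https://7net.omni7.jp/', url). Whether the site then answers 200 depends on how it detects bots, which I don't know.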

import time

import requests
from bs4 import BeautifulSoup

# Create a session
session = requests.Session()

url_top = 'https://7net.omni7.jp/'
url = 'https://7net.omni7.jp/search/?keyword=%E3%83%8F%E3%82%A4%E3%82%AD%E3%83%A5%E3%83%BC%EF%BC%81%EF%BC%81%E3%82%B9%E3%82%AF%E3%82%A8%E3%82%A2%E7%BC%B6%E3%83%90%E3%83%83%E3%82%B8'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'ja,en-US;q=0.9,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Sec-Ch-Ua': '"Google Chrome";v="117", "Not;A=Brand";v="8", "Chromium";v="117"',
    'Sec-Ch-Ua-Mobile': '?0',
    'Sec-Ch-Ua-Platform': '"Windows"',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.31',
}
# First request: load the top page through the session
response = session.get(url_top, headers=headers)
if response.status_code == 200:
    print('Page request succeeded!')
    print(response.status_code)
    print(response.url)
    time.sleep(3)

    # Note: this calls requests.get, not session.get, so the cookies
    # stored in `session` are not sent with this request.
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        print('Page request succeeded!')
        print(response.status_code)
        print(response.url)
        soup = BeautifulSoup(response.content, 'html.parser')
        print(soup)
    elif response.status_code == 302:
        redirect_url = response.headers.get('Location')
        print('Page redirected to new URL:', redirect_url)
    else:
        print('Page request failed, status code:', response.status_code)

elif response.status_code == 302:
    redirect_url = response.headers.get('Location')
    print('Page redirected to new URL:', redirect_url)
else:
    print('Page request failed, status code:', response.status_code)


The output is:

Page request succeeded!
200
https://7net.omni7.jp/top
Page request failed, status code: 403

Process finished with exit code 0
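
The status code alone doesn't say why the request was refused. A rough sketch for collecting a bit more detail from the 403 response (summarize_response is a hypothetical helper; which fields are actually informative depends on the site):

```python
import requests


def summarize_response(response, body_chars=300):
    """Collect the parts of a response that may hint at why it was blocked."""
    return {
        'status': response.status_code,
        'url': response.url,
        'server': response.headers.get('Server'),
        # Names of cookies set by this response only.
        'set_cookie_names': [c.name for c in response.cookies],
        # The start of the body often contains the block page's message.
        'body_start': response.text[:body_chars],
    }
```

Printing summarize_response(response) in the failure branch would show the server header and the beginning of the error page instead of only the number 403.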

I hope someone can help. Thanks.
