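A crawler requests pages far faster than a human visitor, so it can easily exceed the request limit a site enforces. When that happens the site stops serving the crawler or bans its IP. If you have a set of proxy IPs and switch among them, the site sees requests that appear to come from different machines, and your own IP behind the proxies is unaffected. Even with proxies, do not hit the site too often: be considerate of the load you put on it.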
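The idea of switching between proxies can be sketched with the standard library's `urllib.request`. This is only an illustration: the helper names `build_proxy_opener` and `pick_proxy` are made up here, and the addresses come from a documentation-only range, so they are not real proxies.

```python
import random
from urllib import request


def build_proxy_opener(proxy):
    """Return an opener that sends http/https traffic through `proxy` ("host:port")."""
    handler = request.ProxyHandler({'http': proxy, 'https': proxy})
    return request.build_opener(handler)


def pick_proxy(proxies):
    """Pick one proxy at random so consecutive requests use different exits."""
    return random.choice(proxies)


# Hypothetical addresses (198.51.100.0/24 is reserved for documentation).
PROXIES = ['198.51.100.7:8080', '198.51.100.23:3128']

proxy = pick_proxy(PROXIES)
opener = build_proxy_opener(proxy)
# opener.open('http://example.com/', timeout=5) would route the request via `proxy`.
```

Building a separate opener per proxy, rather than installing one globally, keeps each request's routing explicit.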
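1. Get free proxy IPs from http://www.xicidaili.com/nn/1. Open the page, view its source, work out how the markup is structured, locate the data you need, and pull it out with a regular expression. The regular expression, which captures the IP address from one table cell and the port from the next, is r'<td>(([1-9]\.|[1-9][0-9]\.|1[0-9]{2}\.|2[0-4][0-9]\.|25[0-5]\.){3}([1-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))</td>\s+<td>(\d{2,5})</td>'.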
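2. Save the proxy IPs to a file for later use. Proxy IPs go stale quickly, though, and one may stop working within minutes, so it is enough to scrape a fresh batch whenever you need one. You can store them in a file or in a database.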
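3. Check that each proxy IP still works. You can run this check before every page fetch: if a proxy fails, switch to another one and remove the dead proxy from the list.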
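To see what the regular expression from step 1 actually captures, here is a small self-contained sketch. The HTML fragment is a hypothetical stand-in for the real page (its true markup is assumed, not verified), and `extract_proxies` is a name chosen for this example. Note that none of the octet alternatives matches a bare `0`, so an address such as 203.0.113.5 is skipped.

```python
import re

# Same pattern as in step 1: group 1 is the full IP, group 4 is the port.
PROXY_ROW = re.compile(
    r'<td>(([1-9]\.|[1-9][0-9]\.|1[0-9]{2}\.|2[0-4][0-9]\.|25[0-5]\.){3}'
    r'([1-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))</td>\s+<td>(\d{2,5})</td>'
)


def extract_proxies(html):
    """Return "ip:port" strings for every IP cell followed by a port cell."""
    return ['{}:{}'.format(m[0], m[3]) for m in PROXY_ROW.findall(html)]


# Hypothetical table rows; addresses are from documentation-only ranges.
SAMPLE_HTML = '''
<tr>
  <td>198.51.100.7</td>
  <td>8080</td>
</tr>
<tr>
  <td>203.0.113.5</td>
  <td>80</td>
</tr>
<tr>
  <td>198.51.100.23</td>
  <td>3128</td>
</tr>
'''

found = extract_proxies(SAMPLE_HTML)
print(found)  # the row containing a 0 octet is not matched
```

If addresses with a zero octet matter, the octet alternatives would need a `0` case added; the pattern is left as the article gives it here.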
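The code follows. It is split into two files: one fetches the proxy IPs, and the other checks whether they work (there is also a multi-process version of the check).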
# -*- coding: utf-8 -*-
'''
Fetch proxy IPs from www.xicidaili.com and save them to a file
'''
import urllib.request as req
import time
import re
import random

text_html = r'd:/tmp/xici_html.txt'  # raw pages, kept for inspection
text_ips = r'd:/tmp/xici_ips.txt'    # extracted ip:port lines


class Getxi():
    def __init__(self, page):
        self.page = page  # number of list pages to crawl
        self.url = r'http://www.xicidaili.com/nn/{}'

    def request_method(self, p):
        # Debug output: current time in seconds and in milliseconds
        curr_time = time.time()
        sec = int(curr_time)
        millisec = int(round(curr_time * 1000))
        print(sec, ' == ', millisec)
        # Browser-like headers. Accept-Encoding is left out on purpose:
        # urllib does not decompress gzip bodies, so get_html() could not decode them.
        headers = {
            'Cache-Control': 'max-age=0',
            'Connection': 'Keep-Alive',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
            'Host': 'www.xicidaili.com',
            'Referer': 'http://www.xicidaili.com/',
            'Pragma': 'no-cache',
            'Upgrade-Insecure-Requests': '1',
        }
        url_com = self.url.format(p)
        reqs = req.Request(url_com, headers=headers)
        return reqs

    def get_html(self, p):
        reqss = self.request_method(p)
        conn = req.urlopen(reqss)
        html = conn.read().decode('utf-8')
        return html

    def save_html(self, ip_html):
        # Append the raw page; the with-block closes the file
        with open(text_html, 'a', encoding='utf-8') as f:
            f.write(ip_html)

    def save_ips(self, ips):
        with open(text_ips, 'a', encoding='utf-8') as f:
            f.write(ips)

    def parse_html(self, ip_html):
        pattern = re.compile(r'<td>(([1-9]\.|[1-9][0-9]\.|1[0-9]{2}\.|2[0-4][0-9]\.|25[0-5]\.){3}([1-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))</td>\s+<td>(\d{2,5})</td>', re.S)
        tds = pattern.findall(ip_html)
        str1 = ''
        for td in tds:
            # td[0] is the full IP, td[3] is the port
            str1 += '{}:{}\n'.format(td[0].strip(), td[3].strip())
        # print(str1)
        self.save_ips(str1)

    def crawler(self):
        for i in range(self.page):
            html = self.get_html(i + 1)
            self.save_html(html)
            self.parse_html(html)
            # Pause 5 to 15 seconds between pages to go easy on the site
            time.sleep(random.randint(5, 15))


def xixi():
    page = 2
    xi = Getxi(page)
    xi.crawler()


if __name__ == '__main__':
    xixi()
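Step 2 notes that the proxies could go into a database instead of a file. A minimal sketch of that option with the standard library's `sqlite3` follows. The table layout and the function names (`open_store`, `save_proxies`, `fresh_proxies`) are assumptions made for this example and are not part of the original scripts. Storing a fetch timestamp makes it easy to ignore entries that are probably stale.

```python
import sqlite3
import time


def open_store(path=':memory:'):
    """Open (or create) the proxy table; ':memory:' keeps the demo off disk."""
    conn = sqlite3.connect(path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS proxies ('
        'addr TEXT PRIMARY KEY, fetched_at REAL NOT NULL)'
    )
    return conn


def save_proxies(conn, addrs, now=None):
    """Insert or refresh each "ip:port" entry with the time it was scraped."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            'INSERT OR REPLACE INTO proxies (addr, fetched_at) VALUES (?, ?)',
            [(addr, now) for addr in addrs],
        )


def fresh_proxies(conn, max_age, now=None):
    """Return proxies scraped within the last `max_age` seconds."""
    now = time.time() if now is None else now
    rows = conn.execute(
        'SELECT addr FROM proxies WHERE fetched_at >= ? ORDER BY addr',
        (now - max_age,),
    )
    return [row[0] for row in rows]


# Hypothetical data with fixed timestamps so the result is predictable.
conn = open_store()
save_proxies(conn, ['198.51.100.7:8080'], now=1000.0)
save_proxies(conn, ['198.51.100.23:3128'], now=1900.0)
recent = fresh_proxies(conn, max_age=600, now=2000.0)
print(recent)
```

Passing a file path to `open_store` instead of the default would persist the table between runs.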
Checking validity: the page requested through each proxy is http://2018.ip138.com/ic.asp
# -*- coding: utf-8 -*-
'''
Check whether proxy IPs are usable
'''
from urllib import request
import urllib.error
import http.client
import time
import random
import socket

ips_ok_file = r'd:/tmp/xici_1_ok.txt'  # working IPs are written here after checking
ips_file = r'd:/tmp/xici_ips.txt'      # list of IPs to check
url = 'http://2018.ip138.com/ic.asp'   # page that reports the visitor's IP
User_Agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'


class CheckProxyIp():
    def __init__(self):
        self.ok_ips = []  # proxies that passed the check

    def read_ips_file(self):
        with open(ips_file, 'r', encoding='utf-8') as f:
            ips = f.readlines()
        for ip in ips:
            i = ip.strip()
            if not i:
                continue  # skip blank lines
            self.check_ips(i)
            time.sleep(random.randint(1, 5))

    def check_ips(self, ip):
        proxy = {'http': ip, 'https': ip}
        print(proxy)
        proxy_handler = request.ProxyHandler(proxy)
        opener = request.build_opener(proxy_handler)
        opener.addheaders = [('User-Agent', User_Agent)]
        try:
            # Use this opener directly instead of installing it globally
            response = opener.open(url, timeout=3)
            if response.getcode() == 200:
                html = response.read().decode('gbk')
                print(len(html))
                self.ok_ips.append(ip)
            else:
                print('no')
        except UnicodeDecodeError as e:
            print(e)
        except urllib.error.HTTPError as e:
            print(e)
        except urllib.error.URLError as e:
            print(e)
        except socket.timeout as e:
            print(e)
        except http.client.RemoteDisconnected as e:
            print(e)
        except ConnectionResetError as e:
            print(e)

    def save_ok_ip(self):
        print('save ....')
        text = ''.join(ip + '\n' for ip in self.ok_ips)
        print(text)
        with open(ips_ok_file, 'w', encoding='utf-8') as f:
            f.write(text)


def check():
    chcip = CheckProxyIp()
    chcip.read_ips_file()
    chcip.save_ok_ip()


if __name__ == '__main__':
    check()
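Step 3 suggests checking a proxy right before each fetch, switching to another one when it fails and dropping the dead entry. The scripts above do not show that loop, so the following is one possible sketch. `ProxyPool`, `http_check`, and `fake_check` are hypothetical names introduced here. The check function is passed in so the pool can be exercised without network access.

```python
import http.client
from urllib import request


def http_check(proxy, url, timeout=3):
    """Return True if `url` answers with HTTP 200 when fetched through `proxy`."""
    opener = request.build_opener(
        request.ProxyHandler({'http': proxy, 'https': proxy})
    )
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.getcode() == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, timeouts and connection resets are all OSError.
        return False


class ProxyPool:
    def __init__(self, proxies, is_alive):
        self.proxies = list(proxies)
        self.is_alive = is_alive  # callable: proxy string -> bool

    def next_working(self):
        """Return the first proxy that passes the check, removing failed ones."""
        while self.proxies:
            candidate = self.proxies[0]
            if self.is_alive(candidate):
                return candidate
            self.proxies.pop(0)  # dead proxy: drop it and try the next
        return None


def fake_check(proxy):
    # Stand-in for http_check so the demo runs offline.
    return proxy.endswith(':3128')


pool = ProxyPool(['198.51.100.7:8080', '198.51.100.23:3128'], fake_check)
chosen = pool.next_working()
print(chosen, pool.proxies)
```

In real use the pool could be built with something like `lambda p: http_check(p, url)`, where `url` is the test page from the script above.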
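This article showed how to use Python to scrape free proxy IPs from a given site and extract the needed data with a regular expression. It also covered how to store the proxy IPs and how to check that they work, so that a crawler is less likely to get its IP banned. The validity check listed here runs sequentially; a multi-process version is mentioned but not listed.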
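The multi-process check is not included in the article, so the following is not the author's version. It is a sketch of how the checks could run concurrently, using threads from `concurrent.futures` rather than processes, since the work is mostly waiting on the network. `filter_alive` and `fake_check` are hypothetical names, and the check function is injected so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_alive(proxies, is_alive, workers=8):
    """Check all proxies concurrently and keep those that pass, in input order."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # map() yields results in the same order as the input list.
        results = list(executor.map(is_alive, proxies))
    return [proxy for proxy, ok in zip(proxies, results) if ok]


def fake_check(proxy):
    # Stand-in for a real network check: pretend only port 3128 responds.
    return proxy.endswith(':3128')


candidates = ['198.51.100.7:8080', '198.51.100.23:3128', '198.51.100.40:3128']
alive = filter_alive(candidates, fake_check)
print(alive)
```

Running checks in parallel removes the reason for the random sleep between checks in the sequential script, so keep the worker count modest to avoid flooding the test page.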




