Common Libraries for Web Crawlers


This content is licensed under the CC 4.0 BY-SA license.


Common Libraries for Web Crawlers

The urllib library
- The request module
The requests library
The Selenium library
- Selenium locator methods
- Driving Edge with Selenium
BeautifulSoup

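The urllib library

The request module

This is the most basic HTTP request module and is used to simulate sending requests. Just as you would type a URL into a browser and press Enter, you only need to pass the URL and any extra parameters to the library's functions to reproduce that process.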

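	Function/method	Description
Import	import urllib.request
Request	urlopen(url, data, timeout)	Requests a page and returns an HTTPResponse object
	Request(url, data, headers, method)	Builds a Request object, which is then passed to urlopen() to be sent
HTTPResponse methods	read()	Reads the response body as bytes
	readinto(b)	Reads the response body into the buffer b
	getheader(name)	Returns the value of one response header
	getheaders()	Returns all response headers
Request methods	add_header(key, val)	Adds a header to the Request object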
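
The calls in the table can be combined roughly as follows. This is only a minimal sketch: the helper names build_request and fetch, the header values, and the default timeout are illustrative assumptions, not part of urllib itself.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a GET Request carrying custom headers (nothing is sent yet)."""
    req = Request(url, headers={"Accept": "text/html"}, method="GET")
    # add_header() attaches one more header after construction.
    req.add_header("User-Agent", user_agent)
    return req


def fetch(url, timeout=10):
    """Send the request and return (status, content type, body bytes)."""
    with urlopen(build_request(url), timeout=timeout) as resp:
        # resp is an HTTPResponse: getheader() reads one header,
        # read() returns the whole body as bytes.
        return resp.status, resp.getheader("Content-Type"), resp.read()
```

Calling fetch() with a real URL performs the network request; build_request() on its own only prepares the Request object, which makes it easy to inspect the headers before anything is sent.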

The requests library

Function	Description
requests.get(url, params, headers)	Fetches a page and returns a Response object
requests.post(url, data)	Submits a form with a POST request

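Attributes and methods of a Response object

Attribute/method	Description
.status_code	HTTP status code
.raise_for_status()	Raises an exception for an error status code (4xx/5xx), so execution does not continue
.encoding	Encoding used to decode the body
.text	Response body as text
.content	Response body as bytes
.url	URL of the response
.json()	Parses the body as JSON and returns the result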
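
A short sketch of how these pieces are typically used together. The function names fetch_text and describe, the User-Agent value, and the timeout are assumptions made for illustration.

```python
import requests


def fetch_text(url, params=None, timeout=10):
    """GET a page and return its decoded text, failing on 4xx/5xx."""
    headers = {"User-Agent": "Mozilla/5.0"}  # assumed header value
    resp = requests.get(url, params=params, headers=headers, timeout=timeout)
    resp.raise_for_status()  # raises requests.HTTPError on an error status
    return resp.text


def describe(resp):
    """Collect the Response attributes listed above into a dict."""
    return {
        "status": resp.status_code,
        "encoding": resp.encoding,
        "url": resp.url,
        "size": len(resp.content),  # length of the raw bytes
    }
```

Calling raise_for_status() right after the request keeps error handling in one place, so the rest of the code can assume a successful response.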

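The Selenium library

	Function	Description
Import	from selenium import webdriver	Imports webdriver
	from selenium.webdriver.common.keys import Keys	Imports Keys (simulates key presses, e.g. Enter to submit)
	from selenium.webdriver.support.ui import Select	Imports Select (drop-down lists when filling in forms)
	from selenium.webdriver import ActionChains	Imports ActionChains (dragging and dropping elements)
Start-up	driver = webdriver.Chrome(executable_path='*\chromedriver.exe')	Creates a Chrome driver (Selenium 3 style; Selenium 4 takes the driver path through a Service object instead)
	driver = webdriver.Edge()	Creates an Edge driver
Common operations	.get("url")	Opens a URL
	.click()	Clicks an element
	.submit()	Submits a form
	.page_source	Source code of the page
	.current_url	URL of the current page
Input	.send_keys('something', Keys.RETURN)	Types the text and submits it
	.clear()	Clears the input field
Filling in forms	Select(element)	Wraps a <select> element
	.select_by_index(index)	Selects an option by index
	.select_by_visible_text(text)	Selects an option by its visible text
	.select_by_value(value)	Selects an option by its value
	.deselect_all()	Deselects all selected options
	.options	All available options
Drag and drop	ActionChains(driver)
	.drag_and_drop(source, target).perform()	Drags the source element onto the target
Switching context	driver.switch_to.window(handle)	Switches to another window (written switch_to_window() in older code)
	for handle in driver.window_handles: driver.switch_to.window(handle)	Iterates over all open windows
	driver.switch_to.frame(frame)	Switches to a frame (written switch_to_frame() in older code)
Dialogs	driver.switch_to.alert	Accesses the dialog box (written switch_to_alert() in older code)
History	driver.forward()	Goes forward
	driver.back()	Goes back
Cookies	driver.add_cookie(cookie)	Adds a cookie; cookie is a dict
	driver.get_cookies()	Returns the cookies of the page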
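
As a rough sketch of the page-level operations above: both helpers take an already started driver (for example the result of webdriver.Edge()), so they need no imports of their own. The names snapshot and window_titles are hypothetical.

```python
def snapshot(driver, url):
    """Load url and return the page state exposed by the driver."""
    driver.get(url)
    return {
        "url": driver.current_url,        # URL after any redirects
        "html": driver.page_source,       # source of the loaded page
        "cookies": driver.get_cookies(),  # list of cookie dicts
    }


def window_titles(driver):
    """Switch to every open window in turn and collect its title."""
    titles = []
    for handle in driver.window_handles:
        # switch_to.window() is the spelling accepted by current
        # Selenium releases; older code wrote switch_to_window().
        driver.switch_to.window(handle)
        titles.append(driver.title)
    return titles
```

A typical session would create the driver, call snapshot(driver, some_url), and finish with driver.quit() to close the browser.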

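Selenium locator methods

Once the driver has been instantiated, use the following methods to locate elements.

Function	Description
.find_element_by_name()	Locates an element by its name attribute
.find_element_by_id()	Locates an element by its id
.find_element_by_xpath()	Locates an element by an XPath expression
.find_element_by_link_text()	Locates a hyperlink by its full text
.find_element_by_partial_link_text()	Locates a hyperlink by part of its text
.find_element_by_tag_name()	Locates an element by its tag name
.find_element_by_class_name()	Locates an element by its class name
.find_element_by_css_selector()	Locates an element by a CSS selector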

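If several elements match, change element to elements (for example find_elements_by_name()) to get all of them as a list.

You can also use the By class to choose the locator strategy and then match, as in find_element(By.ID, 'value'). The find_element_by_* helpers belong to the Selenium 3 API and were removed in later Selenium 4 releases, where the By form is the one to use. Some attributes of the By class are: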

ID = "id"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
NAME = "name"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"

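Driving Edge with Selenium

Download the driver from https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
After the download finishes, unzip it, make a copy of msedgedriver.exe named MicrosoftWebDriver.exe, and copy both files into the Python installation directory.
Start Edge with Selenium: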

browser = webdriver.Edge()

BeautifulSoup

Function	Description
from bs4 import BeautifulSoup	Import
BeautifulSoup(html, 'html.parser')	Parses the HTML content and returns a BeautifulSoup object

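Attributes and methods of a BeautifulSoup object

Attribute/method	Description
.title	Title of the page
.p	First paragraph of the page
…	Other parts, accessed by their HTML tag name
.get_text()	All text content in the document
.prettify()	Source code in a neatly indented form
.find('tag', class_='title')	The first occurrence of the given tag that matches the attribute filter
.find_all('tag')	All occurrences of the given tag
.find_all(class_='title')	All nodes with the given attribute
.node.text	Text of the node only
.node.name	Tag name of the node only
.node.parent	Parent of the node
.node.children	Children of the node
.node['attribute']	Value of an attribute of the node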
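
A small sketch that exercises the calls above on an inline document, so it runs without any network access. The HTML snippet and the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for this example.
HTML = """
<html><head><title>Demo page</title></head>
<body>
<p class="title"><b>Heading</b></p>
<p class="story">See <a href="https://example.com/a" id="link1">first</a>
and <a href="https://example.com/b" id="link2">second</a>.</p>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

page_title = soup.title.text                        # text of <title>
first_p = soup.find("p", class_="title")            # first matching <p>
links = [a["href"] for a in soup.find_all("a")]     # attribute of each <a>
parent_name = soup.a.parent.name                    # tag name of the parent of the first <a>
```

Here soup.a is shorthand for the first <a> tag in the document, in the same way that .title and .p return the first tag of that name.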