XPath and pyquery

Latest recommended article published on 2025-01-03 17:48:16


This content is licensed under CC 4.0 BY-SA.

Tags

#html #xpath


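This article walks through using PyQuery and XPath to parse HTML: creating a PyQuery object and reading tag content and attributes, then XPath path expressions, predicate filtering, and wildcards, to help readers understand these two HTML parsing tools.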

Today's summary

I. pyquery

pyquery selects tags in a web page with CSS selectors.

1. Get the data (it must be HTML)

from pyquery import PyQuery
with open('files/data.html', encoding='utf-8') as f:
    content = f.read()

2. Create a PyQuery object

html = PyQuery(content)

3. Get tags

Select tags from the whole page with a CSS selector

PyQuery object(CSS selector) - get the matching tags

p = html('div>p')
print(p)

lis = html('li')
print(lis)

f1 = html('#f1')
print(f1)

ps = html('p')
print(ps)

# Select tags with a CSS selector inside an already selected tag
div1 = html('#div1')
p = div1('p')
print(p, type(p))

divs = html('.c1')
print(divs)
ps = divs('p')
print(ps)

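4. Get the content and attributes of a tag

PyQuery object.text() - get the text content of a paired tag
PyQuery object.val() - get the tag's value attribute
PyQuery object.attr(attribute name) - get the given attribute of the tag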

result = html('#p1').text()
print(result)    # 我是段落2

print('=================')
# Get the text of all p tags at once (returned as a single string)
result = html('p').text()
print(result, type(result))

# Get the text of each p tag separately
ps = html('p')
for x in ps:
    print('x:', PyQuery(x).text())

result = html('input').val()
print(result)

all_a = html('a')
for a in all_a:
    print(PyQuery(a).attr('href'))

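II. XPath parsing

XPath is mainly used on HTML and XML files. How it works: you tell the parser the path of the tag you want within the page, and it returns the matching tags.

XML is also a general-purpose data format: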

"""
<supermarket>
    <name>永辉超市</name>
    <goodsList>
        <goods price="100">衣服</goods>
        <goods></goods>
        <goods></goods>
        <goods></goods>
    </goodsList>
</supermarket>
"""

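To show that the same path idea applies to XML, the sketch below parses a shortened copy of the supermarket sample with lxml's etree.fromstring and reads one text value and one attribute. The variable names are illustrative.

```python
from lxml import etree

# Shortened copy of the supermarket sample
xml_text = """
<supermarket>
    <name>永辉超市</name>
    <goodsList>
        <goods price="100">衣服</goods>
        <goods></goods>
    </goodsList>
</supermarket>
"""

root = etree.fromstring(xml_text)  # root is the <supermarket> element

names = root.xpath('/supermarket/name/text()')
prices = root.xpath('//goods/@price')

print(names, prices)
```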
0. Prepare the page data

from lxml import etree

with open('files/data.html', encoding='utf-8') as f:
    content = f.read()

1. Create the parser object

root node = etree.HTML(HTML text)

html = etree.HTML(content)     # <Element html at 0x1045b5e80>
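A quick way to see what etree.HTML returns is to feed it a small fragment. In this minimal sketch (the fragment is made up), the HTML parser wraps the fragment in html and body elements and hands back the html element as the root, which is why the paths that follow start at body.

```python
from lxml import etree

# Hypothetical fragment with no <html> or <body> of its own
root = etree.HTML('<div><p>hello</p></div>')

root_tag = root.tag                          # the root is the html element
child_tags = [child.tag for child in root]   # the parser added a body

print(root_tag, child_tags)
```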

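2. Get nodes (get tags)

node object.xpath(path)

1) Tag name - find the matching child nodes under the current node (relative path)

Get the child node named body under the html node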

result1 = html.xpath('body')
print(result1)   #[<Element body at 0x105501340>]

result2 = html.xpath('body/div')
print(result2)   # [<Element div at 0x10c8a23c0>, <Element div at 0x10c8a2440>, <Element div at 0x10c8a2480>]

result3 = html.xpath('body/div/img')
print(result3)   # [<Element img at 0x10e3115c0>]

div = result2[0]
print(div.xpath('a'))

2) . - write a relative path

result4 = html.xpath('./body/div/font')
print(result4)

result5 = div.xpath('./img')
print(result5)

3) / - absolute path

The path is written starting from the root node, and the result does not depend on which node calls .xpath

result6 = html.xpath('/html/body/div/img')
print(result6)   # [<Element img at 0x10c78f1c0>]

result7 = div.xpath('/html/body/div/img')
print(result7)   # [<Element img at 0x10c78f1c0>]
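The difference between relative and absolute paths is easiest to see when both are run from the same sub-node. A minimal sketch on a made-up page (the markup and names are assumptions):

```python
from lxml import etree

# Hypothetical page
root = etree.HTML(
    '<body><div><img src="a.png"/><a href="#">x</a></div>'
    '<p><img src="b.png"/></p></body>'
)
div = root.xpath('body/div')[0]  # relative to the html root

# Relative: evaluated from div, so only div's own img is found
relative = div.xpath('./img/@src')

# Absolute: starts at the document root even though div makes the call
absolute = div.xpath('/html/body/p/img/@src')

print(relative, absolute)
```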

// - start from anywhere

//img - get every img node in the whole page

//div/img - get every img node in the page that is a child of a div

result8 = html.xpath('//p')
print(result8)

result9 = html.xpath('//div/div/p')
print(result9)

result10 = html.xpath('//div/p')
print(result10)
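One detail worth checking by hand: a path that begins with // searches the whole document even when it is called on a sub-node, while .// limits the search to descendants of that node. The sketch below uses a made-up page to show this.

```python
from lxml import etree

# Hypothetical page
root = etree.HTML(
    '<div id="a"><p>1</p><div><p>2</p></div></div>'
    '<div id="b"><p>3</p></div>'
)

everywhere = root.xpath('//p/text()')      # every p in the page
nested = root.xpath('//div/div/p/text()')  # p inside a div inside a div

b = root.xpath('//div[@id="b"]')[0]
from_b_global = b.xpath('//p/text()')   # still the whole document
from_b_local = b.xpath('.//p/text()')   # only descendants of b

print(everywhere, nested, from_b_global, from_b_local)
```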

.. - the parent of the current node

result11 = div.xpath('../ol')
print(result11)
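A small sketch of stepping up to the parent and back down to a sibling, on a made-up page (the markup is an assumption):

```python
from lxml import etree

# Hypothetical page: a div and an ol that share the same parent
root = etree.HTML(
    '<body><div><a href="#">x</a></div><ol><li>one</li></ol></body>'
)
div = root.xpath('//div')[0]

parent_tag = div.xpath('..')[0].tag      # the parent of div
sibling_items = div.xpath('../ol/li/text()')  # up to body, then down into ol

print(parent_tag, sibling_items)
```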

@ - get an attribute value

//img/@src - get the src attribute of every img tag in the whole page

img = html.xpath('//img/@src')
print(img)

text() - get the text content of a tag

lis = html.xpath('//li/text()')
print(lis)
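The sketch below runs @ and text() on a made-up fragment. It also shows that text() returns only the text nodes directly inside a tag, so text inside a nested child is skipped; the XPath string() function is one way to get the full text of a single node.

```python
from lxml import etree

# Hypothetical fragment
root = etree.HTML(
    '<ul><li>a<b>bold</b>c</li><li>d</li></ul>'
    '<img src="1.png"><img src="2.png">'
)

srcs = root.xpath('//img/@src')          # attribute values as strings
direct_text = root.xpath('//li/text()')  # direct text nodes only
full_text = root.xpath('string(//li[1])')  # all text inside the first li

print(srcs, direct_text, full_text)
```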

III. XPath predicates

An XPath predicate can be thought of as a filter condition; it must be written inside [ ]

0. Prepare the page data

with open('files/data.html', encoding='utf-8') as f:
    content = f.read()

html = etree.HTML(content)

1. Position

[N] - get the N-th one (the N-th among siblings at the same level, counting from 1)

result = html.xpath('//div[1]')
print(result)
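Because the position is counted among siblings, //div[1] can return more than one node: every div that is the first div under its own parent. A minimal sketch on a made-up page, with (//div)[1] shown as the way to take the first div of the whole page:

```python
from lxml import etree

# Hypothetical page with nested divs
root = etree.HTML(
    '<body><div id="a"><div id="a1"></div><div id="a2"></div></div>'
    '<div id="b"></div></body>'
)

first_among_siblings = root.xpath('//div[1]/@id')  # first div under each parent
first_overall = root.xpath('(//div)[1]/@id')       # first div in the document

print(first_among_siblings, first_overall)
```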

2. Attributes

[@attribute] - keep only tags that have the given attribute

//p[@id] - get the p tags that have an id attribute

result = html.xpath('//p[@id]/text()')
print(result)   #['我是段落2', '我是段落3']

[@attribute=value] - keep only tags whose given attribute has the given value

//p[@id="p1"] - get the p tag whose id is p1

result = html.xpath('//p[@id="p1"]/text()')
print(result)   # ['我是段落2']

Child tag content - filter parent tags by the content of a child tag

//div[p="我是段落5"] - get the div that contains a p tag whose content is "我是段落5"

result = html.xpath('//div[p="我是段落5"]')
print(result)

result = html.xpath('//div[span>90]/p/text()')
print(result)
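The attribute and child-content predicates above can be checked on a small made-up page. In this sketch the ids, texts and numbers are assumptions chosen so that each predicate filters something out.

```python
from lxml import etree

# Hypothetical page
root = etree.HTML('''
<div id="x"><p id="p1">para 1</p><p>para 2</p><span>95</span></div>
<div id="y"><p id="p3">para 3</p><span>60</span></div>
''')

with_id = root.xpath('//p[@id]/text()')            # p tags that have an id
only_p1 = root.xpath('//p[@id="p1"]/text()')       # p tag whose id is p1
by_child = root.xpath('//div[p="para 2"]/@id')     # div with a matching child p
by_number = root.xpath('//div[span>90]/p/text()')  # div whose span is above 90

print(with_id, only_p1, by_child, by_number)
```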

3. Wildcards

Use * to match everything

result = html.xpath('//div[@id="div1"]/*')
print(result)

result = html.xpath('//*[@*]')
print(len(result))
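A short sketch of both wildcard forms on a made-up fragment: * in place of a tag name matches any element, and @* inside a predicate matches any attribute.

```python
from lxml import etree

# Hypothetical fragment
root = etree.HTML(
    '<div id="div1"><p>a</p><a href="#">b</a><span class="s">c</span></div>'
)

# Every child element of div1, whatever its tag name
children = [el.tag for el in root.xpath('//div[@id="div1"]/*')]

# Every element in the page that has at least one attribute
with_attrs = [el.tag for el in root.xpath('//*[@*]')]

print(children, with_attrs)
```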

4. Select several paths at once

path1|path2|path3|…

result = html.xpath('//div[@id="div1"]/p/text()|//div[@id="div1"]/a/text()')
print(result)
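One thing to be aware of with |: in lxml the combined result comes back in document order, not in the order the paths were written. A minimal sketch on a made-up fragment:

```python
from lxml import etree

# Hypothetical fragment: a, then p, then a
root = etree.HTML(
    '<div id="div1"><a href="#">link</a><p>para</p><a href="#">link2</a></div>'
)

# The p path is written first, but results follow the page order
combined = root.xpath(
    '//div[@id="div1"]/p/text()|//div[@id="div1"]/a/text()'
)

print(combined)
```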

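The following snippets use BeautifulSoup instead of XPath. They assume soup is a BeautifulSoup object built from the same content:

# Assumption: soup is created like this, using the lxml parser
from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'lxml')

# Get all tags whose tag attribute is 'hot'
result = soup.find_all(attrs={'tag': 'hot'})
print(result)

Get all tags whose tag attribute is 'hot' and whose height attribute is '100'

result = soup.find_all(attrs={'tag': 'hot', 'height': '100'})
print(result)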
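For a runnable version of the attribute filter, the sketch below builds a BeautifulSoup object from a made-up fragment. The markup and the choice of the built-in 'html.parser' are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; 'html.parser' is Python's built-in parser
soup = BeautifulSoup(
    '<p tag="hot" height="100">a</p>'
    '<p tag="hot">b</p>'
    '<p tag="cold" height="100">c</p>',
    'html.parser',
)

# One condition: tag attribute is 'hot'
hot = [t.get_text() for t in soup.find_all(attrs={'tag': 'hot'})]

# Two conditions: both attributes must match
hot_100 = [
    t.get_text()
    for t in soup.find_all(attrs={'tag': 'hot', 'height': '100'})
]

print(hot, hot_100)
```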

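5. Get content and attributes

tag.string - get the text of a paired tag (note: the tag must not contain child tags alongside its text, otherwise the result is None)
tag.contents - get the content of a paired tag (both text and child tags)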

b1 = soup.select('#f1')[0]
print(b1.string)    # 我是font1

b2 = soup.select('#f2')[0]
print('===:', b2.string)   # ===: None

print(b1.contents)   # ['我是font1']
print(b2.contents, b2.contents[-1].string)   # ['我是font2 ', <a href="#">abc</a>]  abc

print(b1.get_text())    # 我是font1
print(b2.get_text())    # 我是font2 abc
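The difference between .string, .contents and get_text() can be reproduced on a two-tag fragment. This is a minimal sketch; the markup is made up and the built-in 'html.parser' is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: f1 holds only text, f2 holds text plus a child tag
soup = BeautifulSoup(
    '<font id="f1">text one</font>'
    '<font id="f2">text two <a href="#">abc</a></font>',
    'html.parser',
)
b1 = soup.select('#f1')[0]
b2 = soup.select('#f2')[0]

plain = b1.string                     # text only, so .string works
mixed = b2.string                     # text plus a child tag gives None
last_child = b2.contents[-1].string   # the <a> child's own text
all_text = b2.get_text()              # text of the tag and its children

print(plain, mixed, last_child, all_text)
```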

6. Get tag attributes

tag object.attrs[attribute name]

b3 = soup.body.div.a
print(b3.attrs['href'])    # https://www.baidu.com

b4 = soup.find_all(attrs={'title': 't百度'})[0]
print(b4.attrs['src'], b4.attrs['title'])    # https://www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png t百度

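IV. Using XPath

Selecting nodes

XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path, or steps.

Expression	Description
nodename	Selects the child nodes of the current node that have this name.
/	Selects from the root node (absolute path).
//	Selects nodes in the document from the current node that match the selection, regardless of their position.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.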

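Examples:

Path expression	Result
bookstore	Selects all child elements of the current node that are named bookstore.
/bookstore	Selects the root element bookstore. Note: a path that starts with a slash ( / ) always represents an absolute path to an element.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, wherever they are in the document.
bookstore//book	Selects all book elements that are descendants of the bookstore element, wherever they are under bookstore.
//@lang	Selects all attributes named lang.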
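The path expressions in this table can be run against a small bookstore document. The sketch below uses lxml; the XML content, including the nested section element, is made up for the demonstration.

```python
from lxml import etree

# Hypothetical bookstore document
xml_text = """
<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
  <section><book><title>Nested</title><price>10.00</price></book></section>
</bookstore>
"""
root = etree.fromstring(xml_text)  # root is the bookstore element

direct_books = root.xpath('book')                        # children named book
child_titles = root.xpath('/bookstore/book/title/text()')
all_titles = root.xpath('//book/title/text()')           # any depth
langs = root.xpath('//@lang')                            # every lang attribute

print(len(direct_books), child_titles, all_titles, langs)
```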

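Predicates

Predicates are used to find a specific node or a node that contains a specific value.

Predicates are written inside square brackets.

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]	Selects all book elements of bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title	Selects all title elements of the book elements of bookstore whose price element has a value greater than 35.00.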
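The positional and numeric predicates from this table can be verified on a three-book document. This is a sketch with made-up titles and prices, evaluated with lxml.

```python
from lxml import etree

# Hypothetical bookstore with three books
root = etree.fromstring("""
<bookstore>
  <book><title lang="eng">A</title><price>29.99</price></book>
  <book><title>B</title><price>39.95</price></book>
  <book><title lang="eng">C</title><price>49.99</price></book>
</bookstore>
""")

first = root.xpath('/bookstore/book[1]/title/text()')
last = root.xpath('/bookstore/book[last()]/title/text()')
second_last = root.xpath('/bookstore/book[last()-1]/title/text()')
first_two = root.xpath('/bookstore/book[position()<3]/title/text()')
eng_titles = root.xpath("//title[@lang='eng']/text()")
expensive = root.xpath('/bookstore/book[price>35.00]/title/text()')

print(first, last, second_last, first_two, eng_titles, expensive)
```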

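Selecting unknown nodes

XPath wildcards can be used to select unknown XML elements.

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches any node of any kind.

Examples:

Path expression	Result
/bookstore/*	Selects all child elements of the bookstore element.
//*	Selects all elements in the document.
//title[@*]	Selects all title elements that have at least one attribute.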

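Selecting several paths

By using the "|" operator in a path expression, you can select several paths.

Path expression	Result
//book/title|//book/price	Selects all title and price elements of all book elements.
//title|//price	Selects all title and price elements in the document.
/bookstore/book/title|//price	Selects all title elements of the book elements of bookstore, plus all price elements in the document.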