1. Python Standard Library: the urllib.request Module, Part 1 (Python 3)

This content is licensed under CC BY-SA 4.0.

Original link: https://my.oschina.net/dataRunner/blog/410206

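This article walks through basic web-scraping operations with Python's urllib package: opening pages with GET or POST requests, decoding page content, and reading the response headers, status code, and URL. It also includes a practical example that downloads a page from a given site and saves it as an HTML or TXT file.

Reference site for learning: http://www.iplaypython.com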

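Header: the response header information of a web page:

Server: the type of server software

Content-Type: the content type of the page (for example text/html) and its character encoding (for example GBK or UTF-8)

Last-Modified: the time the page was last modified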
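The fields listed above can be read one by one from the response object. The following is a minimal sketch rather than the original author's code: `summarize_headers` is a hypothetical helper, and a `data:` URL stands in for a real page so that the example runs offline (an assumption made for illustration; such a response carries a Content-Type but no Server or Last-Modified header).

```python
import urllib.request


def summarize_headers(response):
    """Hypothetical helper: collect the header fields discussed above."""
    headers = response.headers  # the same object that response.info() returns
    return {
        "Server": headers.get("Server"),
        "Content-Type": headers.get("Content-Type"),
        "Last-Modified": headers.get("Last-Modified"),
        # charset parameter of Content-Type, or None if none is declared
        "charset": headers.get_content_charset(),
    }


# Assumption: a data: URL is used instead of a live site so no network is needed.
with urllib.request.urlopen("data:text/html;charset=utf-8,hello") as response:
    summary = summarize_headers(response)

for name, value in summary.items():
    print(name, "=", value)
```

Against a real HTTP server the Server and Last-Modified entries would normally be filled in when the server sends them; a missing header simply comes back as `None`.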

# coding:utf-8

# Study 1
# import urllib
# # List the names available in the package
# print(dir(urllib))
# # Read the help documentation
# help(urllib)
# # PACKAGE CONTENTS
# #     error
# #     parse
# #     request
# #     response
# #     robotparser

# Study 2
# # request is a module inside the urllib package
# import urllib.request
# print(dir(urllib.request))
# help(urllib.request)

# Study 3
import urllib.request
# urlopen supports two request methods: GET (no data argument) and POST (data argument supplied)
help(urllib.request.urlopen)
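The difference between the two request methods can be seen without sending anything. This is a small sketch under stated assumptions: the endpoint `http://example.com/search` and the form fields are hypothetical, and the requests are only built, never sent.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and form fields, used only for illustration.
search_url = "http://example.com/search"
fields = {"q": "python", "page": 1}

# GET: no request body; the parameters travel in the query string.
get_request = urllib.request.Request(search_url + "?" + urllib.parse.urlencode(fields))

# POST: the parameters are URL-encoded, converted to bytes, and passed as data.
post_body = urllib.parse.urlencode(fields).encode("ascii")
post_request = urllib.request.Request(search_url, data=post_body)

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

Either object can then be handed to `urllib.request.urlopen()` in place of a plain URL string to actually send it.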

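# decode() turns the bytes downloaded from a page into text using the page's encoding; encode() turns text back into bytes in a chosen encoding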
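A short offline illustration of the two directions may help here. The sample string is an assumption chosen for the demonstration; the point is that bytes must be decoded with the same codec that produced them.

```python
text = "你好, urllib"  # sample text, chosen only for illustration

# encode(): str -> bytes. The same text gives different bytes per encoding.
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")
print(len(text), len(utf8_bytes), len(gbk_bytes))

# decode(): bytes -> str. The matching codec restores the original text.
restored = gbk_bytes.decode("gbk")
print(restored == text)

# Decoding with the wrong codec fails (or, in other cases, yields garbage).
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError as exc:
    wrong_codec_failed = True
    print("cannot decode GBK bytes as UTF-8:", exc.reason)
```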

# Example 1: utf-8
# # The page is encoded as utf-8
# url="http://www.iplaypython.com"
# html=urllib.request.urlopen(url)
# # Print the response headers, which include the page's declared encoding
# print(html.info())
# html_content=html.read().decode("utf-8")
# print(html_content)

# Example 2: gbk (GBK is a superset of GB2312, so a gb2312 page can be decoded with "gbk")
# # The page is encoded as gb2312
# url="http://www.163.com"
# html=urllib.request.urlopen(url)
# # Print the response headers, which include the page's declared encoding
# print(html.info())
# html_content=html.read().decode("gbk")
# print(html_content)
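Rather than hard-coding "utf-8" or "gbk" per site, the charset declared in the Content-Type header can drive the decoding. This is a sketch, not the author's method: `read_text` and its `fallback` parameter are hypothetical, and a `data:` URL carrying GBK bytes stands in for a real GBK page so that the example runs offline.

```python
import urllib.request


def read_text(url, fallback="utf-8"):
    """Hypothetical helper: decode the body with the charset the server declares."""
    with urllib.request.urlopen(url) as response:
        # get_content_charset() returns None when Content-Type has no charset.
        charset = response.headers.get_content_charset() or fallback
        return response.read().decode(charset, errors="replace")


# Assumption: "%C4%E3%BA%C3" is the percent-encoded GBK byte sequence for "你好".
page_text = read_text("data:text/plain;charset=gbk,%C4%E3%BA%C3")
print(page_text)
```

Some pages declare their encoding only in an HTML `<meta>` tag rather than in the header; this sketch does not look there and would fall back to `fallback` in that case.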

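# Study 4
import urllib.request
# print(dir(html))
# Open the page and inspect the response
url="http://www.iplaypython.com"
html=urllib.request.urlopen(url)
# Print the response headers, which include the page's declared encoding
print(html.info())
# Print the status code returned by the server
print("Status code returned: %s" % html.getcode())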
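"""
HTTP status codes

200 OK (normal access)
301 permanent redirect, 302 temporary redirect
403 forbidden, 404 page not found
500 internal server error

HTTP: The Definitive Guide is devoted to the HTTP protocol; the print edition is recommended.
It is an essential book for web development.

"""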
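One point worth knowing alongside the list of codes: with `urlopen`, 4xx and 5xx responses are raised as `urllib.error.HTTPError` instead of being returned, so the code has to be read from the exception. The sketch below is an illustration under assumptions: `StatusDemoHandler` and `fetch_status` are hypothetical names, and a throwaway server on 127.0.0.1 replaces a real site so that the example runs offline.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StatusDemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler: 200 for "/", 404 for everything else."""

    def do_GET(self):
        found = self.path == "/"
        body = b"ok" if found else b"missing"
        self.send_response(200 if found else 404)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# ProxyHandler({}) ignores proxy environment variables so the loopback
# server is reached directly.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch_status(url):
    """Return the HTTP status code, whether or not urllib raises for it."""
    try:
        with opener.open(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses arrive here


server = HTTPServer(("127.0.0.1", 0), StatusDemoHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = "http://127.0.0.1:%d" % server.server_address[1]

ok_status = fetch_status(base_url + "/")
missing_status = fetch_status(base_url + "/no-such-page")

server.shutdown()
server.server_close()
print(ok_status, missing_status)
```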

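# Get the URL of the page that was actually retrieved (after any redirects)
print(html.geturl())
# Close the response once you are done with it so its resources can be released
html.close()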
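The two calls above can be illustrated together. In this sketch a `with` block closes the response automatically, and a redirect shows that `geturl()` reports the URL finally retrieved rather than the one requested. The names are hypothetical (`RedirectDemoHandler`, the `/old` and `/new` paths), and a throwaway local server is assumed in place of a real site.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectDemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler: /old redirects to /new."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        body = b"moved here"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), RedirectDemoHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# ProxyHandler({}) ignores proxy environment variables for the loopback request.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
requested_url = "http://127.0.0.1:%d/old" % server.server_address[1]

# The with block closes the response on exit, so no explicit close() is needed.
with opener.open(requested_url, timeout=5) as response:
    final_url = response.geturl()
    page_body = response.read().decode("utf-8")

server.shutdown()
server.server_close()
print(requested_url, "->", final_url)
```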

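# Study 5
import urllib.request
# Scrape a page by downloading it to a local file (the target directory must already exist)
# urllib.request.urlretrieve(url,"e:/_python/other/abc.html")  # save the page as an .html file
urllib.request.urlretrieve(url,"e:/_python/other/abc.txt")  # save the page as a .txt file

# urlretrieve opens and closes its own connection, and html was already closed above,
# so nothing else needs to be closed here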
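A portable variant of the download step might look like the sketch below. It rests on assumptions: `save_page` is a hypothetical helper, a temporary directory replaces the hard-coded `e:/` path, and a `data:` URL replaces the live site so that the example runs offline. Note that the Python documentation lists `urlretrieve` under its legacy interface.

```python
import tempfile
import urllib.request
from pathlib import Path


def save_page(url, directory, filename):
    """Hypothetical helper: download url into directory/filename."""
    target = Path(directory) / filename
    # urlretrieve returns a (local filename, headers) tuple.
    local_path, headers = urllib.request.urlretrieve(url, str(target))
    return Path(local_path)


# Assumption: a temporary directory and a data: URL keep the demo self-contained.
with tempfile.TemporaryDirectory() as workdir:
    saved = save_page("data:text/html;charset=utf-8,hello", workdir, "abc.html")
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")

print(saved_name, saved_text)
```

The file extension does not change what is written: the same bytes are saved whether the name ends in `.html` or `.txt`.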
