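Preface
I came across a Bilibili video on multiprocessing, multithreading, and coroutines in Python (see the second reference). It is not long, but I found it well explained, so this post is just a set of brief notes on it.
How to choose between multiprocessing, multithreading, and coroutines
CPU-bound vs. I/O-bound computation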

Comparing multiprocessing, multithreading, and coroutines
- Processes: Process (multiprocessing)
- Threads: Thread (threading)
- Coroutines: Coroutine (asyncio)


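The Python Global Interpreter Lock (GIL)
Two main reasons Python is slow:
- It is a dynamically typed, interpreted language
- Because of the GIL, only one thread executes Python bytecode at a time, so multithreaded Python code cannot run in parallel on multiple CPU cores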
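A rough way to see the effect of the GIL is to time a pure-Python, CPU-bound loop run twice in a row and then in two threads. The sketch below is only illustrative: `count_down` and the loop size are made up for this example, and on a standard (GIL-enabled) CPython build the threaded run is not expected to be noticeably faster.

```python
import threading
import time


def count_down(n):
    """Pure-Python CPU-bound loop (hypothetical workload for illustration)."""
    while n > 0:
        n -= 1
    return n


def run_sequential(n, repeats=2):
    # Run the workload `repeats` times, one after another.
    start = time.perf_counter()
    for _ in range(repeats):
        count_down(n)
    return time.perf_counter() - start


def run_threaded(n, repeats=2):
    # Run the same workload in `repeats` threads at once.
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(repeats)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 5_000_000
    print("sequential:", run_sequential(n), "seconds")
    print("threaded:  ", run_threaded(n), "seconds")
```

The actual timings depend on the machine and the Python build, so none are quoted here.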



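Speeding up a crawler with multithreading
target takes the function itself (its name, without parentheses), and the arguments are passed afterwards as a tuple via args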
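The trailing comma in the tuple matters: `(url, )` is a one-element tuple, whereas `(url)` is just the string itself. A minimal sketch, using a hypothetical `greet` function in place of the crawler:

```python
import threading

results = []


def greet(name):
    # Hypothetical worker used only for this illustration.
    results.append(f"hello {name}")


# args must be an iterable of positional arguments, hence the one-element tuple.
t = threading.Thread(target=greet, args=("cnblogs",))
t.start()
t.join()
print(results)

# ("cnblogs",) is a tuple; ("cnblogs") is only a parenthesised string.
print(type(("cnblogs",)).__name__, type(("cnblogs")).__name__)
```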

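# blog_spider.py
import requests

urls = [
    f"https://www.cnblogs.com/#p{page}"
    for page in range(1, 50 + 1)
]
# print(urls)

def craw(url):
    """Download one page and return its HTML text."""
    r = requests.get(url)
    print(url, len(r.text))
    return r.text  # later sections use the returned HTML

if __name__ == "__main__":
    # Quick check; guarded so it does not run when this module is imported
    craw(urls[0])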
# multi_thread.py
import blog_spider
import threading
import time

def single_thread():
    print("single_thread begin")
    for url in blog_spider.urls:
        blog_spider.craw(url)
    print("single_thread end")

def multi_thread():
    print("multi_thread begin")
    threads = []
    # One thread per URL
    for url in blog_spider.urls:
        threads.append(
            threading.Thread(target=blog_spider.craw, args=(url, ))
        )
    for thread in threads:
        thread.start()
    # Wait for every thread to finish
    for thread in threads:
        thread.join()
    print("multi_thread end")

if __name__ == "__main__":
    start = time.time()
    single_thread()
    end = time.time()
    print("single thread cost:", end - start, "seconds")
    start = time.time()
    multi_thread()
    end = time.time()
    print("multi thread cost:", end - start, "seconds")
A producer-consumer crawler in Python



Locating the class of the element that contains the title
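These notes do not include the `parse` function that the full code calls as `blog_spider.parse`. A minimal sketch of what it might look like follows, using only the standard library. The class name `post-item-title` is an assumption about the cnblogs markup, not something taken from the notes, so check it in the browser's developer tools before relying on it.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect (href, text) pairs for <a> tags carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []
        self._href = None  # href of the matching <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes:
            self._href = attrs.get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse(html, css_class="post-item-title"):  # class name is an assumption
    parser = TitleLinkParser(css_class)
    parser.feed(html)
    return parser.links


# Made-up HTML fragment for a quick check
sample = (
    '<a class="post-item-title" href="https://example.com/p/1">First post</a>'
    '<a href="/x">other</a>'
)
print(parse(sample))
```

The video may well use a third-party HTML library for this step; the standard-library parser is used here only to keep the sketch free of extra dependencies.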

Full code
# _*_ coding=utf-8 _*_
import queue
import blog_spider
import time
import random
import threading

def do_craw(url_queue: queue.Queue, html_queue: queue.Queue):
    # Producer: take a URL, download the page, hand the HTML to the parsers
    while True:
        url = url_queue.get()
        html = blog_spider.craw(url)
        html_queue.put(html)
        print(threading.current_thread().name, f"craw {url}",
              "url_queue.size=", url_queue.qsize())
        time.sleep(random.randint(1, 2))

def do_parse(html_queue: queue.Queue, fout):
    # Consumer: take HTML, parse it, write the results to the output file
    while True:
        html = html_queue.get()
        results = blog_spider.parse(html)
        for result in results:
            fout.write(str(result) + "\n")
        print(threading.current_thread().name, "result.size", len(results),
              "html_queue.size=", html_queue.qsize())
        time.sleep(random.randint(1, 2))

if __name__ == "__main__":
    url_queue = queue.Queue()
    html_queue = queue.Queue()
    for url in blog_spider.urls:
        url_queue.put(url)
    for idx in range(3):
        t = threading.Thread(target=do_craw, args=(url_queue, html_queue), name=f"craw{idx}")
        t.start()
    # The workers loop forever, so fout stays open and the program must be stopped manually
    fout = open("producer_consumer_data.txt", 'w')
    for idx in range(2):
        t = threading.Thread(target=do_parse, args=(html_queue, fout), name=f"parse{idx}")
        t.start()
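In the version above the worker threads loop forever, so the program never exits on its own and the output file is never closed. One common way to shut such a pipeline down is to push a sentinel value through the queues. The sketch below shows the idea with stand-in `fetch` and `parse` functions (both hypothetical, no network access) rather than the real crawler.

```python
import queue
import threading

STOP = None  # sentinel telling a worker to exit


def fetch(url):
    # Stand-in for blog_spider.craw: pretend the page body is the URL repeated.
    return url * 2


def parse(html):
    # Stand-in for blog_spider.parse: return the length of the "page".
    return len(html)


def producer(url_queue, html_queue):
    while True:
        url = url_queue.get()
        if url is STOP:
            break
        html_queue.put(fetch(url))


def consumer(html_queue, results, lock):
    while True:
        html = html_queue.get()
        if html is STOP:
            break
        with lock:
            results.append(parse(html))


def run_pipeline(urls, n_producers=3, n_consumers=2):
    url_queue, html_queue = queue.Queue(), queue.Queue()
    results, lock = [], threading.Lock()

    producers = [threading.Thread(target=producer, args=(url_queue, html_queue))
                 for _ in range(n_producers)]
    consumers = [threading.Thread(target=consumer, args=(html_queue, results, lock))
                 for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()

    for url in urls:
        url_queue.put(url)
    for _ in producers:   # one sentinel per producer
        url_queue.put(STOP)
    for t in producers:
        t.join()

    for _ in consumers:   # producers are done, now stop the consumers
        html_queue.put(STOP)
    for t in consumers:
        t.join()
    return sorted(results)


if __name__ == "__main__":
    print(run_pipeline(["a", "bb", "ccc"]))
```

Because every thread is joined before `run_pipeline` returns, any file the consumers write to could be opened in a `with` block around the call and would be closed cleanly.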
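Thread safety: problems and solutions
A thread-safety problem, by example: withdrawing money from a bank account
A Lock can be used to solve the thread-safety problem

Code before adding the lock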
# _*_ coding=utf-8 _*_
import threading
import time

class Account:
    def __init__(self, balance):
        self.balance = balance

def draw(account, amount):
    if account.balance >= amount:
        time.sleep(0.1)  # makes the thread-safety problem easier to observe
        print(threading.current_thread().name, "withdrawal succeeded")
        account.balance -= amount
        print(threading.current_thread().name, "balance is", account.balance)
    else:
        print(threading.current_thread().name, "insufficient balance, withdrawal failed")

if __name__ == "__main__":
    account = Account(1000)
    ta = threading.Thread(name='ta', target=draw, args=(account, 800))
    tb = threading.Thread(name='tb', target=draw, args=(account, 800))
    ta.start()
    tb.start()
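The output is a mess: both threads can pass the balance check before either one deducts, so both withdrawals succeed and the balance goes negative.

After adding the lock (two extra lines of code)

The output is now correct.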
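The locked version was not preserved in these notes. A minimal sketch of what the two extra lines probably look like is given below: a module-level `threading.Lock` and a `with lock:` block around the check-and-update. The function name `draw`, the return values, and the English messages are choices made for this example.

```python
import threading
import time

lock = threading.Lock()  # extra line 1: one lock shared by all threads


class Account:
    def __init__(self, balance):
        self.balance = balance


def draw(account, amount):
    with lock:  # extra line 2: the check and the update happen as one step
        if account.balance >= amount:
            time.sleep(0.1)  # same delay as before; harmless while the lock is held
            account.balance -= amount
            print(threading.current_thread().name,
                  "withdrawal succeeded, balance is", account.balance)
            return True
        print(threading.current_thread().name,
              "insufficient balance, withdrawal failed")
        return False


if __name__ == "__main__":
    account = Account(1000)
    threads = [threading.Thread(name=name, target=draw, args=(account, 800))
               for name in ("ta", "tb")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("final balance:", account.balance)
```

With the lock held, only one of the two 800 withdrawals from the 1000 balance can succeed, so the balance can no longer go negative.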

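The handy thread pool: ThreadPoolExecutor
- New: the thread has been created but is not doing anything yet
- Ready: waiting for the system to schedule it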



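as_completed() is a generator. While no task has finished it blocks; as soon as some task finishes it yields that task, the body of the for loop runs, and then it blocks again, repeating until all tasks are done. The results also show that tasks which finish first are reported to the main thread first.
Code for this section: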
# _*_ coding=utf-8 _*_
import concurrent.futures
import blog_spider

# craw
with concurrent.futures.ThreadPoolExecutor() as pool:
    htmls = pool.map(blog_spider.craw, blog_spider.urls)
    htmls = list(zip(blog_spider.urls, htmls))
    for url, html in htmls:
        print(url, len(html))
print("craw over")

# parse
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = {}
    for url, html in htmls:
        future = pool.submit(blog_spider.parse, html)
        futures[future] = url
    # Option 1: results in submission order
    # for future, url in futures.items():
    #     print(url, future.result())
    # Option 2: not in submission order; results come back in the order the tasks finish
    for future in concurrent.futures.as_completed(futures):
        url = futures[future]
        print(url, future.result())
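To see the ordering difference without hitting the network, the sketch below submits three tasks that simply sleep for different durations (the `work` function and the delays are made up for illustration). Reading the futures in submission order gives results in that order, while `as_completed` gives them as they finish.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(delay):
    # Hypothetical task: sleep to simulate I/O, then return the delay.
    time.sleep(delay)
    return delay


def completion_order(delays):
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(work, d) for d in delays]
        submitted = [f.result() for f in futures]  # submission order
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(work, d) for d in delays]
        completed = [f.result() for f in as_completed(futures)]  # completion order
    return submitted, completed


if __name__ == "__main__":
    submitted, completed = completion_order([0.3, 0.1, 0.2])
    print("submission order:", submitted)
    print("completion order:", completed)
```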
Speeding up a web service with a thread pool


Use time to simulate a few operations of a web server (reading a file, reading a database, calling an API)
# _*_ coding=utf-8 _*_
import json
import flask
import time

app = flask.Flask(__name__)

def read_file():
    time.sleep(0.1)
    return "file result"

def read_db():
    time.sleep(0.2)
    return "db result"

def read_api():
    time.sleep(0.3)
    return "api result"

@app.route('/')
def index():
    # The three operations run one after another
    result_file = read_file()
    result_db = read_db()
    result_api = read_api()
    return json.dumps({
        "result_file": result_file,
        "result_db": result_db,
        "result_api": result_api
    })

if __name__ == "__main__":
    app.run()
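Speeding it up with a thread pool. Each operation is submitted to the thread pool, and the results are collected with the result() method of the returned futures. Because the three operations now run concurrently, the response time is determined by the slowest of them (a little over 300 ms) rather than by the sum of all three (about 600 ms).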
# _*_ coding=utf-8 _*_
import json
import flask
import time
from concurrent.futures import ThreadPoolExecutor

app = flask.Flask(__name__)
pool = ThreadPoolExecutor()

def read_file():
    time.sleep(0.1)
    return "file result"

def read_db():
    time.sleep(0.2)
    return "db result"

def read_api():
    time.sleep(0.3)
    return "api result"

@app.route('/')
def index():
    # Submit all three operations first so they run concurrently
    result_file = pool.submit(read_file)
    result_db = pool.submit(read_db)
    result_api = pool.submit(read_api)
    return json.dumps({
        "result_file": result_file.result(),
        "result_db": result_db.result(),
        "result_api": result_api.result()
    })

if __name__ == "__main__":
    app.run()
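The same effect can be checked without Flask. This sketch reuses the three simulated operations and times them run one after another versus submitted to a pool. The helper names `run_sequential` and `run_pooled` are introduced here for illustration only.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_file():
    time.sleep(0.1)
    return "file result"


def read_db():
    time.sleep(0.2)
    return "db result"


def read_api():
    time.sleep(0.3)
    return "api result"


def run_sequential():
    # Call the three operations one after another and time the total.
    start = time.perf_counter()
    results = [read_file(), read_db(), read_api()]
    return results, time.perf_counter() - start


def run_pooled(pool):
    # Submit all three first, then collect, so they overlap in time.
    start = time.perf_counter()
    futures = [pool.submit(f) for f in (read_file, read_db, read_api)]
    results = [f.result() for f in futures]
    return results, time.perf_counter() - start


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=3) as pool:
        print("sequential:", run_sequential())
        print("pooled:    ", run_pooled(pool))
```

The sequential run should take roughly the sum of the three sleeps, and the pooled run roughly the longest one.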
Speeding up a program with multiprocessing

The syntax is almost the same


# _*_ coding=utf-8 _*_
import math
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from Crypto.Util import number

# 100 random 40-bit primes used as the CPU-bound workload
PRIMES = []
for i in range(100):
    PRIMES.append(number.getPrime(40))
# print(PRIMES)

def is_prime(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

def single_thread():
    for prime in PRIMES:  # renamed so it does not shadow the imported number module
        is_prime(prime)

def multi_thread():
    with ThreadPoolExecutor() as pool:
        pool.map(is_prime, PRIMES)

def multi_process():
    with ProcessPoolExecutor() as pool:
        pool.map(is_prime, PRIMES)

if __name__ == "__main__":
    start = time.time()
    single_thread()
    end = time.time()
    print('single_thread, cost:', end - start, 'seconds')
    start = time.time()
    multi_thread()
    end = time.time()
    print('multi_thread, cost:', end - start, 'seconds')
    start = time.time()
    multi_process()
    end = time.time()
    print('multi_process, cost:', end - start, 'seconds')
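Multiprocessing is faster for this CPU-bound task. Multithreading is sometimes even slower than the single-threaded version, since the GIL prevents the threads from running in parallel while switching between them still adds overhead.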
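The example above depends on the third-party `Crypto` package to generate test primes. If that is not installed, a smaller experiment is possible with the standard library alone. The sketch below checks a handful of hard-coded numbers in a `ProcessPoolExecutor`. The numbers and the helper `check_all` are chosen for illustration, and `math.isqrt` (Python 3.8+) is swapped in for `math.sqrt` to stay in integer arithmetic.

```python
import math
from concurrent.futures import ProcessPoolExecutor

# Three known primes followed by two composites (illustrative inputs)
NUMBERS = [2147483647, 1000000007, 998244353, 2147483649, 1000000008]


def is_prime(n):
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for i in range(3, math.isqrt(n) + 1, 2):
        if n % i == 0:
            return False
    return True


def check_all(numbers):
    # The is_prime calls are distributed across the pool's worker processes.
    with ProcessPoolExecutor() as pool:
        return dict(zip(numbers, pool.map(is_prime, numbers)))


if __name__ == "__main__":
    for n, result in check_all(NUMBERS).items():
        print(n, result)
```

With so few small inputs the pool's start-up cost will likely dominate, so this only demonstrates the API, not a speed-up.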

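Speeding up a Flask service with a process pool
In other words, in a Flask application that has to do CPU-bound computation, use multiple processes to speed it up.
The code is in fact much like the thread-pool version. However, the video explains that because the processes in a process pool do not share memory, the pool must be created after all function definitions, inside the main block, for it to work properly.
PS: in my own test, creating it the same way as the thread pool also seemed to work. Perhaps the library has changed somewhere? (I have not looked into it.)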
# _*_ coding=utf-8 _*_
import math
import flask
import json
from concurrent.futures import ProcessPoolExecutor

app = flask.Flask(__name__)
process_pool = ProcessPoolExecutor()  # the video says this raises an error here, but it seemed fine in my test; has something changed?

def is_prime(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

@app.route('/is_prime/<numbers>')
def api_is_prime(numbers):
    number_list = [int(x) for x in numbers.split(',')]
    results = process_pool.map(is_prime, number_list)
    return json.dumps(dict(zip(number_list, results)))

if __name__ == "__main__":
    # Per the video: create the pool in the main block, after all other functions are defined.
    # Then visit http://127.0.0.1:5000/is_prime/1,2,3
    # process_pool = ProcessPoolExecutor()
    app.run()
A concurrent crawler with asynchronous I/O


# _*_ coding=utf-8 _*_
import time
import aiohttp
import asyncio
import blog_spider

async def async_craw(url):
    print('craw url:', url)
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            result = await resp.text()
            print(f'craw url: {url}, {len(result)}')

async def main():
    # Schedule one coroutine per URL and wait for all of them
    await asyncio.gather(*(async_craw(url) for url in blog_spider.urls))

if __name__ == "__main__":
    start = time.time()
    # asyncio.run (Python 3.7+) creates and closes the event loop itself
    asyncio.run(main())
    end = time.time()
    print('use time seconds:', end - start)
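Launching fifty requests at once may put a noticeable load on the target site. A common refinement is to cap the number of in-flight coroutines with `asyncio.Semaphore`. The sketch below demonstrates the pattern with `asyncio.sleep` standing in for the HTTP request, so `fake_fetch`, the limit of 3, and the example URLs are all assumptions made for illustration.

```python
import asyncio


async def fake_fetch(url, semaphore, state):
    async with semaphore:  # at most `limit` coroutines inside this block
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
        await asyncio.sleep(0.01)  # stand-in for the network round trip
        state["running"] -= 1
        return url, len(url)


async def crawl_all(urls, limit=3):
    semaphore = asyncio.Semaphore(limit)
    state = {"running": 0, "peak": 0}  # tracks how many fetches overlap
    results = await asyncio.gather(*(fake_fetch(u, semaphore, state) for u in urls))
    return results, state["peak"]


if __name__ == "__main__":
    urls = [f"https://example.com/#p{page}" for page in range(1, 11)]
    results, peak = asyncio.run(crawl_all(urls))
    print(len(results), "pages fetched, peak concurrency:", peak)
```

In a real crawler the `asyncio.sleep` call would be replaced by the `session.get` block from the code above, with the semaphore wrapped around it in the same way.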
References:
- https://www.jianshu.com/p/b9b3d66aa0be
- https://www.bilibili.com/video/BV1bK411A7tV?from=search&seid=10357238605445227627&spm_id_from=333.337.0.0
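This article covers the principles and example uses of multiprocessing, multithreading, and coroutines in Python, including how to choose a suitable concurrency model, how to solve thread-safety problems, how to use thread pools and process pools to improve efficiency, and how to build an efficient concurrent crawler with asynchronous I/O.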
