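This article walks through building an audio dataset of primary-school classroom recordings as a worked example.

1. Search by keyword to collect audio/video links
    from playwright.sync_api import sync_playwright
    # BLVideoSearch is the author's own search helper; its import is not shown in the original.

    if __name__ == "__main__":
        with sync_playwright() as playwright:
            searcher = BLVideoSearch(playwright, headless=True)
            url = searcher.make_url(keyword=["小学公开课"])  # keyword: "primary-school open class"
            searcher.run(url, outfile="videos_url.txt")
This produces a list of links.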
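Before downloading, the link list usually benefits from light cleanup. The following is a minimal sketch, assuming `videos_url.txt` holds one URL per line (the article does not show the file format). `load_links` is a hypothetical helper, not part of the author's code.

```python
from pathlib import Path


def load_links(path):
    """Read one URL per line, skipping blank lines and duplicates while keeping order."""
    seen = set()
    links = []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        url = raw.strip()
        if not url or url in seen:
            continue
        seen.add(url)
        links.append(url)
    return links
```

Calling `load_links("videos_url.txt")` would then give the task list for the download step.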
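2. Batch download and on-the-fly video-to-audio conversion
- you-get: downloads the video file for each link
- ffmpeg: converts each video to audio as soon as it has been downloaded
- subprocess: runs the two commands above in child processes
2.1 Multi-threaded batch download (you-get)
you-get subprocess:
    # -o: output directory; -O: output file name (utt); task: the video URL
    command = [YOUGET, "-o", self.video_dir, "-O", utt, task]
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)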
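The article shows only the single you-get command, not the threading around it. One possible way to run it over many links is a thread pool, sketched below. `download_one`, `download_all`, the `(name, url)` task shape, and the `worker` parameter are all assumptions made for illustration, and `you-get` is assumed to be on the `PATH`.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_one(url, name, video_dir, youget="you-get"):
    """Run one you-get child process; raises CalledProcessError on failure."""
    command = [youget, "-o", video_dir, "-O", name, url]
    subprocess.run(command, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return name


def download_all(tasks, video_dir, max_workers=4, worker=download_one):
    """Download (name, url) pairs concurrently; return (succeeded, failed) name lists."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, url, name, video_dir): name
                   for name, url in tasks}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()
                succeeded.append(name)
            except (subprocess.CalledProcessError, OSError):
                # A failed or missing you-get should not stop the other downloads
                failed.append(name)
    return succeeded, failed
```

Threads are enough here because each worker spends its time waiting on a child process rather than running Python code.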
2.2 On-the-fly video-to-audio conversion
ffmpeg subprocess:
    # -ac 1: mono; -ar 16000: 16 kHz sample rate
    command = [FFMPEG, "-i", video_file, "-ac", "1", "-ar", "16000", audio_file]
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
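"On the fly" here means each video is converted right after its download finishes. A rough sketch of that step follows. The `.wav` output extension, the added `-y` flag (overwrite an existing output), deleting the video afterwards, and the names `to_audio` and `run` are assumptions, not taken from the article.

```python
import subprocess
from pathlib import Path


def to_audio(video_file, audio_dir, ffmpeg="ffmpeg", run=subprocess.run):
    """Convert one video to 16 kHz mono audio, then delete the video file."""
    video_file = Path(video_file)
    audio_file = Path(audio_dir) / (video_file.stem + ".wav")
    # Same flags as in the article, plus -y to overwrite an existing output
    command = [ffmpeg, "-y", "-i", str(video_file),
               "-ac", "1", "-ar", "16000", str(audio_file)]
    run(command, check=True,
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    # Assumption: the video is no longer needed once the audio exists
    video_file.unlink()
    return audio_file
```

Calling `to_audio` at the end of each download worker keeps disk usage low, since at most a few videos exist at any time.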
Information about the downloaded video files is shown below:
The files are finally saved as audio.
3. Parallel multi-worker transcription with whisper or funasr
funsound supports offline transcription across multiple parallel workers; the backend can be either whisper or funasr.
    import time

    from funsound.funasr.onnx.offline.asr import ASR
    from funsound.common.executor import Worker, launch, get_worker_status, submit_task, get_task_progress
    from funsound.utils import *

    def init_engine(id):
        # One ASR engine per worker, each with its own log file
        engine = ASR(cfg_file='conf/funasr_onnx.yaml',
                     log_file=f'log/funasr-{id}.log')
        engine.init_state()
        return engine

    def processor(self, params):
        # params[0] is the audio file path passed to submit_task
        audio_file = params[0]
        result = self.engine.inference(audio_file,
                                       make_sentence_split="punc")
        return result

    # Attach the processing function to the Worker class
    Worker.processor = processor

    if __name__ == "__main__":
        nj = 3  # run 3 workers in parallel
        workers = []
        for id in range(nj):
            engine = init_engine(id)
            worker = Worker(wid=id, log_file=f'log/worker-{id}.log')
            worker.load_engine(engine=engine)
            workers.append(worker)
        launch(workers)
        print(get_worker_status(workers))

        audio_file = "/opt/wangwei/funsound_onnx/funsound/examples/test1.wav"
        task_id = submit_task(workers, params=[audio_file])
        # Poll once per second until the task succeeds or fails
        while True:
            prgs = get_task_progress(task_id)
            print(prgs)
            if prgs['status'] in ["SUCCESS", "FAIL"]:
                if prgs['status'] == "SUCCESS":
                    for line in prgs['result']:
                        print(line)
                break
            time.sleep(1)
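The polling loop above runs forever if a task never reaches `SUCCESS` or `FAIL`. A small helper with a timeout is one way to guard against that. This is only a sketch: `wait_for_task` and its parameters are hypothetical, and the progress function is passed in so the helper does not depend on funsound internals.

```python
import time


def wait_for_task(get_progress, task_id, poll_interval=1.0, timeout=600.0):
    """Poll get_progress(task_id) until SUCCESS or FAIL, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        progress = get_progress(task_id)
        if progress["status"] in ("SUCCESS", "FAIL"):
            return progress
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish within {timeout} s")
        time.sleep(poll_interval)
```

With the imports from the article in scope, the call would look like `prgs = wait_for_task(get_task_progress, task_id)`.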
The recognition results, which include timestamps, look like this:

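4. Manual correction UI
Open the audio/video file together with the JSON file of ASR results.

The UI supports the following:
- Click a sentence to jump to it during playback
- Edit timestamps, recognized text, speaker role, and other fields
- Discard pre-recognized sentences that are not good enough
- Export a JSON annotation file once editing is finished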
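The edit, discard, and export operations listed above can be sketched on plain data. The article does not show the JSON schema, so the sentence keys used in the usage note (`start`, `end`, `text`, `role`) and both function names are hypothetical.

```python
import json


def apply_corrections(sentences, edits, discarded):
    """Return corrected sentences: merge per-index edits, drop discarded indices."""
    corrected = []
    for index, sentence in enumerate(sentences):
        if index in discarded:
            continue
        # Fields in edits override the pre-recognized values; the input is not modified
        corrected.append({**sentence, **edits.get(index, {})})
    return corrected


def export_annotations(sentences, path):
    """Write sentences as UTF-8 JSON, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sentences, f, ensure_ascii=False, indent=2)
```

For example, `apply_corrections(sentences, {0: {"role": "teacher"}}, {3})` would relabel the first sentence and drop the fourth before the list is passed to `export_annotations`.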







