docxtpl/python-docx/openpyxl

Original article. First published 2019-11-26 15:08:01; last modified 2025-06-18 11:14:42.

Tags: #python #programming-languages

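This article describes automated report generation in Python with docxtpl: setting up the template, generating chart images, and inserting data and images into the template to produce a Word file. It also covers fixes for related errors such as ImportError: cannot import name 'soft_unicode', and several ways of converting HTML and PDF into Word, PDF and Markdown files using tools such as libreoffice, comtypes and pdfkit.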

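Automated report generation with docxtpl (from a Word template)
Converting HTML page content to a Word document with python-docx
Converting PDF/Word to a Word document with comtypes (Windows only)
Converting PDF/Word to a Word document with LibreOffice (works on Linux)
- Calling LibreOffice directly
Converting PDF/Word to a Word document with OpenOffice (works on Linux)
Converting HTML page content to a PDF document with pdfkit
Converting PDF content to a docx document with pdf2docx
- Fix: ImportError: DLL load failed while importing cv2
- Fix: TypeError: object of type 'NoneType' has no len()
Converting PDF to Markdown with pdfminer and markdown
- Failed to build cryptography while installing pdfminer.six
- Installing pandoc
Using pyecharts offline
pd.MultiIndex
Consolidation
- MSGraph
- MSWord
- MSExcel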

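Automated report generation with docxtpl (from a Word template)

Before reading this article, you may want to look at my earlier posts on automated report generation.
PS: I used to write on Sina Blog, but moved here after my posts kept getting deleted.

Generating Word documents from text with python-docx (http://blog.sina.com.cn/s/blog_1473990d60102x4fr.html)
Generating Word documents from text with tushare (http://blog.sina.com.cn/s/blog_1473990d60102x4fs.html)
Automated API documentation generation (http://blog.sina.com.cn/s/blog_1473990d60102yhr3.html)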

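Special thanks

https://blog.csdn.net/u012117917/article/details/41604711 CSS color code reference and CSS color chart
https://blog.csdn.net/sunchengquan/article/details/80369304 Batch-processing Word documents with Python in practice
https://blog.csdn.net/zhang__ao/article/details/80745873 Configuring the ECharts title
This article is a hands-on implementation built on the three articles above.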

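This section has three parts:

Template setup
Chart image generation
Inserting data and images into the template to generate a Word file

Template setup

[Image: screenshot of the Word template]
In the template, {{qe}}, {{mie}} and similar tags are placeholders; their names must match the keys used in the code.
Notation (a) marks a placeholder for a single value.
Notation (b) marks a placeholder for table data.
The table data has the following format: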

[{'city': '广东省', 'mie': 13885, 'toimie': '5476.71', 'qie': 18158, 'toiqie': '15736.47'},
 {'city': '浙江省', 'mie': 2350, 'toimie': '2861.79', 'qie': 1059, 'toiqie': '859.79'}]
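
Because the placeholder names in the template have to match the keys passed from the code, it can help to check them before rendering. The following is a minimal sketch using only the standard library. It assumes the template text has already been extracted as a plain string; the helper name, the sample sentence and the sample context are made up for illustration and are not part of docxtpl. It only recognises simple names such as {{qe}}, not dotted expressions used inside table loops.

```python
import re

# Matches simple Jinja2-style placeholders such as {{qe}} or {{ qe }}.
PLACEHOLDER_RE = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def missing_placeholders(template_text, context):
    """Return placeholder names used in the text but absent from the context."""
    names = set(PLACEHOLDER_RE.findall(template_text))
    return sorted(names - set(context))


# Hypothetical template sentence and context, for illustration only.
template_text = "In {{year}} Q{{ quarter }}, {{qe}} companies invested {{ toimie }}."
context = {"year": 2019, "quarter": 3, "qe": 120}

# 'toimie' is used in the text but not supplied in the context.
print(missing_placeholders(template_text, context))
```

An empty list means every simple placeholder in the text has a matching key.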

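Chart image generation

Here I first render the charts to HTML with pyecharts and then take a screenshot of the HTML page.
The method is snapshot_selenium with the Chrome browser, as described at http://pyecharts.org/#/zh-cn/render_images
For the setup, see my earlier post:

High-tech enterprise certification website (page screenshot) (http://blog.sina.com.cn/s/blog_1473990d60102xaf0.html)
For Chrome, put the downloaded and extracted chromedriver.exe into the Scripts folder of your Python installation (or of your Anaconda installation if you use Anaconda).
I will not go into more detail here.
The code is as follows: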

    # ---------------------------------- Charts ----------------------------------
    from pyecharts.charts import Bar, Grid, Pie
    from pyecharts import options as opts

    # Horizontal bar chart
    subtitle = ''
    title = '图1.1：***********投资前10省分布'  # "Figure 1.1: top 10 provinces by investment"
    path_html = './picture/1.小册子-总体情况.html'
    path_png = "./picture/1.小册子-总体情况.png"
    draw_data = table_1[:10].sort_values(['toimie'], ascending=True)
    x_lable = draw_data['city'].to_list()
    y_data_1 = ['%.2f' % i for i in draw_data['toimie'].to_list()]
    y_data_2 = ['%.2f' % i for i in draw_data['toiqie'].to_list()]
    init_opts = opts.InitOpts(width="480px", height="360px")
    plt = (
        Bar(init_opts=init_opts)
        .set_global_opts(
            title_opts=opts.TitleOpts(
                title=title,                          # main title
                # subtitle=subtitle,                  # subtitle
                pos_left='center', pos_top='bottom',  # title position
                title_textstyle_opts={'fontSize': 10.5},
            ),
            yaxis_opts=opts.AxisOpts(),               # Y-axis settings
            xaxis_opts=opts.AxisOpts(                 # X-axis settings
                # type_="category"                    # axis type
            ),
            legend_opts=opts.LegendOpts(
                type_='scroll',                        # scrollable legend
                orient='vertical',                     # legend layout direction
                pos_left="center", pos_top='center',   # legend position
            ),
            tooltip_opts=opts.TooltipOpts(trigger='axis'),
            toolbox_opts=opts.ToolboxOpts(),           # toolbox
            # datazoom_opts=opts.DataZoomOpts(),       # zoom control
        )
        .set_series_opts(
            label_opts=opts.LabelOpts(
                # is_show=False,                       # whether to show values
                position="right",                      # label position
            )
        )
        # .extend_axis(yaxis=opts.AxisOpts())          # secondary axis
        .add_xaxis(x_lable)
        .add_yaxis('****投资总额(亿元)', y_data_1,
                   label_opts=opts.LabelOpts(position='right',      # label position
                                             font_weight='bolder',  # label font weight
                                             # color='#FFC8B4'
                                             ),
                   # color='#FFC8B4'
                   )
        .add_yaxis('对****投资总额(亿元)', y_data_2,
                   label_opts=opts.LabelOpts(position='right', font_weight='bolder'))
        .reversal_axis()  # swap the axes so the bars are horizontal
    )
    grid = Grid(init_opts=init_opts)
    grid.add(plt, grid_opts=opts.GridOpts(pos_top='5'))  # only pos_top is used, to adjust the offset from the top
    grid.render(path_html)


    # Rose (nested ring) chart
    subtitle = '图1.2：**********互投行业明细'  # "Figure 1.2: mutual investment by industry"
    title = '外圈:***投资总额  内圈:对****投资总额'  # outer ring / inner ring legend text
    path_html_1 = './picture/1.小册子-总体情况_1.html'
    path_png_1 = "./picture/1.小册子-总体情况_1.png"
    code_num = len(table_2['code'].to_list())
    # code_num = 10
    draw_data = [[table_2['code'].to_list()[i], '%.2f' % table_2['toimie'].to_list()[i]]
                 for i in range(code_num)]
    draw_data_1 = [[table_2['code'].to_list()[i], '%.2f' % table_2['toiqie'].to_list()[i]]
                   for i in range(code_num)]
    init_opts_pie = opts.InitOpts(width="640px", height="480px")
    plt = (
        Pie(init_opts=init_opts_pie)
        .set_global_opts(
            title_opts=opts.TitleOpts(
                title=title,                          # main title
                subtitle=subtitle,                    # subtitle
                pos_left='center', pos_bottom='0',    # title position
                title_textstyle_opts={                # main title style
                    'fontSize': 16.5,                 # font size
                    "fontWeight": "bolder",           # bold
                    "color": "#444444",               # font color
                },
                subtitle_textstyle_opts={             # subtitle style
                    'fontSize': 16.5,
                    "fontWeight": "bolder",
                    "color": "#000000",
                },
            ),
            yaxis_opts=opts.AxisOpts(),               # Y-axis settings
            xaxis_opts=opts.AxisOpts(                 # X-axis settings
                # type_="category"                    # axis type
            ),
            legend_opts=opts.LegendOpts(
                type_='scroll',                        # scrollable legend
                orient='vertical',                     # legend layout direction
                pos_left="left", pos_top='center',     # legend position
            ),
            tooltip_opts=opts.TooltipOpts(trigger='axis'),
            toolbox_opts=opts.ToolboxOpts(),           # toolbox
            # datazoom_opts=opts.DataZoomOpts(),       # zoom control
        )
        .set_series_opts(
            label_opts=opts.LabelOpts(
                # is_show=False,                       # whether to show values
                position="right",                      # label position
            )
        )
        .add(
            "对****投资总额",
            draw_data_1,
            radius=["15%", "30%"],
            # center=["25%", "50%"],    # centre position
            # rosetype="radius",
            label_opts=opts.LabelOpts(is_show=True, formatter="{b}: {c}", font_weight='bolder'),
        )
        .add(
            "****投资总额",
            draw_data,
            radius=["65%", "80%"],
            # center=["75%", "50%"],
            # rosetype="area",
            label_opts=opts.LabelOpts(is_show=True, formatter="{b}: {c}", font_weight='bolder'),
        )
    )
    grid_1 = Grid(init_opts=init_opts_pie)
    grid_1.add(plt, grid_opts=opts.GridOpts(pos_top='5'))  # only pos_top is used, to adjust the offset from the top
    grid_1.render(path_html_1)


    # Convert the HTML pages to images
    from pyecharts.render import make_snapshot
    from snapshot_selenium import snapshot
    make_snapshot(snapshot, grid.render(path_html), path_png, delay=2, pixel_ratio=2)
    make_snapshot(snapshot, grid_1.render(path_html_1), path_png_1, delay=2, pixel_ratio=2)

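Inserting data and images into the template to generate a Word file

This step inserts the generated data and images into the Word document, using three packages: jinja2, docxtpl and docx (python-docx).
The code is as follows: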

    import jinja2
    from docxtpl import DocxTemplate
    from docxtpl import InlineImage
    from docx.shared import Mm, Inches, Pt

    tpl = DocxTemplate(r'./source/from/1.小册子-总体情况.docx')

    # 2019-11-29: show missing values (NaN) as '-'
    table_1 = [
        {'city': row.city,
         'mie': '-' if '%.0f' % row.mie == 'nan' else '%.0f' % row.mie,
         'toimie': '-' if '%.2f' % row.toimie == 'nan' else '%.2f' % row.toimie,
         'qie': '-' if '%.0f' % row.qie == 'nan' else '%.0f' % row.qie,
         'toiqie': '-' if '%.2f' % row.toiqie == 'nan' else '%.2f' % row.toiqie}
        for index, row in table_1.iterrows()]
    table_2 = [
        {'code': row.code,
         'mie': '-' if '%.0f' % row.mie == 'nan' else '%.0f' % row.mie,
         'toimie': '-' if '%.2f' % row.toimie == 'nan' else '%.2f' % row.toimie,
         'qie': '-' if '%.0f' % row.qie == 'nan' else '%.0f' % row.qie,
         'toiqie': '-' if '%.2f' % row.toiqie == 'nan' else '%.2f' % row.toiqie}
        for index, row in table_2.iterrows()]

    # Keys must match the placeholder names in the template;
    # year, quarter, qe, ... are values computed earlier in the script.
    context = {
        'year': year,
        'quarter': quarter,
        'qe': qe,
        'mie': mie,
        'toimie': toimie,
        'me': me,
        'qie': qie,
        'toiqie': toiqie,
        'pic_1': InlineImage(tpl, path_png, width=Mm(125)),
        'pic_2': InlineImage(tpl, path_png_1, width=Mm(100)),
        'table_1': table_1,
        'table_2': table_2,
    }
    jinja_env = jinja2.Environment(autoescape=True)
    tpl.render(context, jinja_env)
    tpl.save(r'./result/1.小册子-*****总体情况.docx')
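
The missing-value handling above compares the formatted string with 'nan' for every field. As a sketch of an alternative, the check can live in one small helper. The helper name fmt_or_dash is hypothetical, and the NaN values in the second sample row are invented to show the '-' output; only the first row comes from the data format shown earlier.

```python
import math


def fmt_or_dash(value, digits=2):
    """Format a number with fixed decimals, or return '-' for a missing value."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "-"
    return f"{value:.{digits}f}"


# Sample rows; the NaN entries are made up for illustration.
rows = [
    {"city": "广东省", "mie": 13885, "toimie": 5476.71, "qie": 18158, "toiqie": 15736.47},
    {"city": "浙江省", "mie": float("nan"), "toimie": 2861.79, "qie": 1059, "toiqie": float("nan")},
]

table = [
    {
        "city": r["city"],
        "mie": fmt_or_dash(r["mie"], 0),
        "toimie": fmt_or_dash(r["toimie"]),
        "qie": fmt_or_dash(r["qie"], 0),
        "toiqie": fmt_or_dash(r["toiqie"]),
    }
    for r in rows
]

for row in table:
    print(row)
```

The resulting list of dicts has the same shape as table_1 in the context, so it could be passed to the template in the same way.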

Result
[Image: screenshot of the generated Word document]

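Added 2022-07-19: fixing ImportError: cannot import name 'soft_unicode' from 'markupsafe'.

This error is caused by an update to the markupsafe package, so the temporary workaround is to downgrade markupsafe to version 2.0.1 with pip install --upgrade markupsafe==2.0.1. After the downgrade, however, another error appears when running: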

# python-docx
from docxtpl import DocxTemplate
tpl = DocxTemplate("文件名称.docx")
tpl.paragraphs

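This raises AttributeError: 'NoneType' object has no attribute 'paragraphs'. The fix is to call tpl.get_docx() first, which triggers the actual reading of the file.
Before the call:
[Image: output before calling tpl.get_docx()]
After the call, the file content can be read.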
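
To see whether the installed markupsafe is affected, the version can be checked from Python. This is a minimal sketch with the standard library only; it relies on the fact that soft_unicode was removed in MarkupSafe 2.1.0. The helper names are hypothetical.

```python
from importlib import metadata


def parse_version(text):
    """Turn '2.0.1' into (2, 0, 1); trailing non-digit suffixes are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def lacks_soft_unicode(version_text):
    """True for MarkupSafe 2.1 and later, where soft_unicode no longer exists."""
    return parse_version(version_text) >= (2, 1)


def installed_markupsafe_version():
    """Return the installed markupsafe version string, or None if it is absent."""
    try:
        return metadata.version("markupsafe")
    except metadata.PackageNotFoundError:
        return None


version = installed_markupsafe_version()
if version is None:
    print("markupsafe is not installed")
elif lacks_soft_unicode(version):
    print(version, "has no soft_unicode; consider pinning markupsafe==2.0.1")
else:
    print(version, "still provides soft_unicode")
```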

Added 2023-04-13: reading paragraphs, tables and other content with docxtpl

import pandas as pd
from docxtpl import DocxTemplate
tpl = DocxTemplate("112.docx")
tpl.get_docx()  # triggers loading of the file
# Read the paragraph text
for i in tpl.paragraphs:
    print(i.text)
# Read the tables
def getDocxTableToDF(table):
    '''
    Convert a docx table into a DataFrame.
    :param table: docx.table.Table object
    :return: df
    '''
    total = []
    for row in table.rows:
        r_list = []
        for col in row.cells:
            r_list.append(col.text)
            # print(col.text)
        total.append(r_list)
    total = pd.DataFrame.from_records(total)
    return total

for table in tpl.tables:
    getDocxTableToDF(table)
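
The DataFrame built above has numeric column labels, with the table's header row as its first data row. If the first row of the Word table holds the column names, the nested list can be turned into records first. This is a sketch with a hypothetical helper name and made-up sample rows; it assumes every row has as many cells as the header.

```python
def rows_to_records(rows):
    """Use the first row as column names and return one dict per remaining row."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Cell texts as they would be collected from table.rows (sample values).
rows = [
    ["city", "mie", "toimie"],
    ["广东省", "13885", "5476.71"],
    ["浙江省", "2350", "2861.79"],
]

for record in rows_to_records(rows):
    print(record)
```

A list of dicts like this can also be passed to pd.DataFrame.from_records, which then uses the keys as column names.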

Converting HTML page content to a Word document with python-docx

Reference: https://stackoverflow.com/questions/55041766/how-to-add-waltchunk-and-its-relationship-with-python-docx

from docx.opc.constants import RELATIONSHIP_TYPE as RT
from docx.opc.part import Part
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from docx import Document

from lxml.etree import tostring
from lxml import html
import requests

def add_alt_chunk(doc: Document, html_text: str):
    # Embed the HTML as an altChunk part and reference it from the document body
    package = doc.part.package
    partname = package.next_partname('/word/altChunk%d.html')
    alt_part = Part(partname, 'text/html', html_text.encode(), package)
    r_id = doc.part.relate_to(alt_part, RT.A_F_CHUNK)
    alt_chunk = OxmlElement('w:altChunk')
    alt_chunk.set(qn('r:id'), r_id)
    doc.element.body.sectPr.addprevious(alt_chunk)

if __name__ == "__main__":
    # Fetch the HTML page and use XPath to extract the target content
    link = "https://example.com/article.html"  # placeholder: replace with the page to fetch
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36',
    }
    response = requests.get(link, headers=headers)
    tree = html.fromstring(response.text)
    article = tree.xpath('''/html/body/div[@class="wrap"]//article[@class="articleCon"]''')[0]  # article body

    # Create a document to hold the extracted content
    doc = Document()
    article_str = tostring(article, encoding="utf8")  # serialize back to a string
    add_alt_chunk(doc, article_str.decode("utf8"))
    doc.save("111.docx")

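Original page
[Image: screenshot of the original web page]
After conversion

However, a docx produced this way cannot be read directly with docxtpl: the paragraphs and other content do not come out. It has to be processed with the method below before the paragraph information can be read.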

Converting PDF/Word to a Word document with comtypes (Windows only)

Reference: https://pythonhosted.org/comtypes/
Reference: https://github.com/enthought/comtypes
Reference: https://stackoverflow.com/questions/6011115/doc-to-pdf-using-python
Install the package with pip install comtypes.

import os
import comtypes.client

file_path = os.getcwd()  # assumption: directory that holds the source and target files

word = comtypes.client.CreateObject('Word.Application')  # start Word through COM
wdFormatPDF = 17  # save-format code for PDF
PDFFormatwd = 16  # save-format code for the default Word (docx) format
source_docx = "111.docx"
source_abspath = os.path.abspath(os.path.join(file_path, source_docx))
doc = word.Documents.Open(source_abspath)

target_pdf = '112.pdf'
target_docx = '112.docx'
doc.SaveAs(os.path.join(file_path, target_pdf), FileFormat=wdFormatPDF)
doc.SaveAs(os.path.join(file_path, target_docx), FileFormat=PDFFormatwd)

doc.Close()
word.Quit()

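PS: doc.SaveAs does not replace an existing file by default, and I could not find a switch for that, so it is best to delete the target with os.remove before saving; otherwise an existing file can prevent the new one from being written.
PS1: The machine running this code needs the corresponding Office software installed beforehand; otherwise you get the error [WinError -2147221005] Invalid class string (无效的类字符串).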
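
The os.remove step mentioned above can be wrapped so that a missing target is not an error. This is a small sketch; the helper name is hypothetical and the file in the demo is an empty stand-in created in a temporary directory.

```python
import os
import tempfile


def remove_if_exists(path):
    """Delete the file at path; return True if something was removed."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False


# Demo with a stand-in file instead of a real conversion result.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "112.pdf")
    open(target, "w").close()
    print(remove_if_exists(target))  # the stale file existed and was removed
    print(remove_if_exists(target))  # nothing left to remove
```

Calling it on each target path right before doc.SaveAs keeps repeated runs from failing on leftover files.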

Converting PDF/Word to a Word document with LibreOffice (works on Linux)

Reference: https://stackoverflow.com/questions/34618767/clean-up-xml-of-a-docx-document-with-python-linux-binary
Reference: https://stackoverflow.com/questions/50982064/converting-docx-to-pdf-with-pure-python-on-linux-without-libreoffice
Reference: https://ourcodeworld.com/articles/read/867/how-to-convert-a-word-file-to-pdf-docx-to-pdf-in-libreoffice-with-the-cli-in-ubuntu-2004
Reference: https://stackoverflow.com/questions/30349542/command-libreoffice-headless-convert-to-pdf-test-docx-outdir-pdf-is-not
Reference: https://ask.libreoffice.org/t/cannot-print-anything-exporting-to-pdf-produces-blank-pdf-lubuntu-20-04/52285/10
Reference: https://github.com/mila/pyoo
System: Ubuntu 20.04.4 LTS (GNU/Linux 5.4.0-146-generic x86_64)
First refresh the apt package index, then install libreoffice:

sudo apt-get update
sudo apt-get install libreoffice

Then check whether Java is installed:

java -version

If it is not, install it with the following command:

sudo apt-get install default-jre

After installation, query the version to confirm that it succeeded:

libreoffice --version
LibreOffice 6.4.7.2 40(Build:2)

The code is as follows:

    # requires "import subprocess" at module level
    def convertFile_linux(self, source_abspath, file_path, file_type):
        # --headless: start libreoffice without a UI.
        # --convert-to <format>: convert the input file to the given format, passed as the positional value of this option (PDF in our example).
        # --outdir: directory in which the converted file is stored; ./ means the current directory.
        # The only known limitation so far is that the output file name cannot be changed.
        cmd = ['libreoffice', '--headless', '--convert-to', file_type,
               source_abspath, '--outdir', file_path]
        p = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
        # communicate() reads both pipes while waiting, so a full pipe cannot block the child
        stdout, stderr = p.communicate(timeout=10)
        if stderr:
            raise subprocess.SubprocessError(stderr)

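PS: This method cannot normalize a docx. A docx-to-docx conversion does not turn the non-standard XML text into a standard form, so docxtpl still cannot read the paragraph/table information afterwards.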
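
Since the output file name cannot be chosen on the command line, one workaround is to predict the name LibreOffice will produce and rename the file afterwards. The sketch below uses hypothetical helper names and assumes the converted file keeps the source's base name with the target extension. The demo fakes the converted file instead of running LibreOffice.

```python
import tempfile
from pathlib import Path


def build_convert_command(source, outdir, file_type):
    """Argument list for: libreoffice --headless --convert-to <type> <source> --outdir <dir>."""
    return ["libreoffice", "--headless", "--convert-to", file_type,
            str(source), "--outdir", str(outdir)]


def expected_output(source, outdir, file_type):
    """Predicted output path: source base name plus the target extension."""
    extension = file_type.split(":", 1)[0]  # "pdf:writer_pdf_Export" -> "pdf"
    return Path(outdir) / (Path(source).stem + "." + extension)


def rename_output(source, outdir, file_type, new_name):
    """Rename the converted file inside outdir and return the new path."""
    produced = expected_output(source, outdir, file_type)
    target = produced.with_name(new_name)
    produced.replace(target)
    return target


print(build_convert_command("/data/111.docx", "/data/out", "pdf"))

# Demo: simulate the converted file, then rename it.
with tempfile.TemporaryDirectory() as tmp:
    expected_output("/data/111.docx", tmp, "pdf").write_bytes(b"")
    renamed = rename_output("/data/111.docx", tmp, "pdf", "report.pdf")
    print(renamed.name, renamed.exists())
```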

Calling LibreOffice directly

Reference: https://stackoverflow.com/questions/61457120/how-to-use-libreoffice-api-uno-with-python-windows
Reference: https://help.libreoffice.org/6.3/en-US/text/sbasic/python/main0000.html
First, start LibreOffice:

soffice "-accept=socket,host=localhost,port=2002;urp;" --norestore --nologo --nodefault --headless

You also need to make sure that your firewall is not blocking connections to this port.

# Import the uno module
import uno

# Connect to LibreOffice
localContext = uno.getComponentContext()
resolver = localContext.ServiceManager.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", localContext)
ctx = resolver.resolve("uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext")
smgr = ctx.ServiceManager
desktop = smgr.createInstanceWithContext("com.sun.star.frame.Desktop", ctx)

# Open a Word file
url = uno.systemPathToFileUrl("C:/test.doc")  # change this path
doc = desktop.loadComponentFromURL(url, "_blank", 0, ())

# Copy its content: select the whole text in the view and take it as a transferable.
# The text object itself has no copy/paste methods, so the controller's
# getTransferable/insertTransferable calls are used here instead.
text = doc.Text
cursor = text.createTextCursor()
cursor.gotoStart(False)
cursor.gotoEnd(True)
doc.CurrentController.select(cursor)
transferable = doc.CurrentController.getTransferable()

# Create a new Writer document
new_doc = desktop.loadComponentFromURL("private:factory/swriter", "_blank", 0, ())

# Paste the content into the new document
new_doc.CurrentController.insertTransferable(transferable)

# Save the new document
new_url = uno.systemPathToFileUrl("C:/new_test.doc")  # change this path
new_doc.storeAsURL(new_url, ())
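
Before connecting through UNO, it can be useful to confirm that something is actually listening on the port given to soffice and that the firewall lets the connection through. This is a minimal sketch using only the standard library; the helper name is hypothetical, and host and port default to the values from the soffice command above.

```python
import socket


def port_is_open(host="localhost", port=2002, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if port_is_open():
    print("LibreOffice port 2002 is reachable")
else:
    print("Nothing is listening on port 2002, or the connection is blocked")
```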

Converting PDF/Word to a Word document with OpenOffice (works on Linux)

Reference: https://github.com/mila/pyoo
Reference: https://developer.aliyun.com/article/791858
System: Ubuntu 20.04.4 LTS (GNU/Linux 5.4.0-146-generic x86_64)
First, go to http://www.openoffice.org/download/ and download the matching package to the server.
[Image: screenshot of the OpenOffice download page]
Download the installation package:

sudo wget https://cytranet.dl.sourceforge.net/project/openofficeorg.mirror/4.1.14/binaries/zh-CN/Apache_OpenOffice_4.1.14_Linux_x86-64_install-deb_zh-CN.tar.gz

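Converting HTML page content to a PDF document with pdfkit

Reference: https://www.geeksforgeeks.org/python-convert-html-pdf/
Reference: https://github.com/JazzCore/python-pdfkit
Reference: https://www.zhihu.com/tardis/zm/art/94608155?source_id=1003
Reference: https://www.cnblogs.com/mianbaoshu/p/13366074.html
First, install the Python package with pip install pdfkit. Then, as the project page says, on Ubuntu/Debian install the external program with sudo apt-get install wkhtmltopdf; on Windows, download the installer from the wkhtmltopdf site. After that, add its location to the PATH environment variable (or add it from code with os).
PS: You can also download the program package from the official site, extract it into the project folder, and point to that file directly.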
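
Adding the wkhtmltopdf location from code can look like the sketch below. The helper names and the directory in the usage comment are hypothetical; only os and shutil from the standard library are used.

```python
import os
import shutil


def add_to_path(directory, environ=os.environ):
    """Prepend directory to PATH unless it is already listed; return the new PATH."""
    entries = [e for e in environ.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        entries.insert(0, directory)
    environ["PATH"] = os.pathsep.join(entries)
    return environ["PATH"]


def find_wkhtmltopdf(extra_dir=None):
    """Return the full path of the wkhtmltopdf executable, or None if not found."""
    if extra_dir:
        add_to_path(extra_dir)
    return shutil.which("wkhtmltopdf")


# Example: look on the current PATH only. To search an extracted package as well,
# pass its bin directory, e.g. find_wkhtmltopdf("./wkhtmltopdf/bin") (hypothetical path).
print(find_wkhtmltopdf())
```

python-pdfkit also accepts an explicit binary location through pdfkit.configuration(wkhtmltopdf=...), which avoids changing PATH at all.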
