Word to Markdown: WordPI, a Custom Parsing Tool

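When working with Word files (.doc or .docx), we often need to extract images, build a table of contents, or convert the content to Markdown so it is easier to share or edit. To cover these needs I wrote a Python class called WordPI that bundles several utilities for parsing and converting Word files. This article walks through what WordPI does, how to use it, and a few practical scenarios.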

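What is WordPI?

WordPI is a Python utility class I designed for working with Word files. It relies on the python-docx library to parse .docx files and on the LibreOffice command-line tool for format conversion. Its core features are:

1. Convert .doc files to .docx: uses LibreOffice to convert legacy .doc files to the modern .docx format.

2. Extract images from .docx files: extracts every embedded image and saves it in the matching format (such as .png).

3. Convert image files to PNG: special formats (such as .wmf) can be converted to ordinary .png files through LibreOffice.

4. Build a table of contents: reads the headings in a Word file and produces a hierarchical outline (Markdown output is supported).

5. Convert Word files to Markdown: converts the document content (text, tables, and images) to Markdown for later use.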

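Installation and dependencies

WordPI needs the following:

python-docx: parses .docx files.

pip install python-docx

LibreOffice: performs the format conversions. Install it and make sure the soffice command-line tool is available.
- macOS: install with Homebrew: brew install libreoffice
- Windows/Linux: download the installer from the LibreOffice website.

Python: version 3.8 or later is recommended.

WordPI also uses the standard-library modules os, logging, and subprocess, which need no separate installation.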
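
Before running any conversion it can help to confirm that soffice is actually reachable. The following is a minimal sketch, not part of WordPI: find_soffice and MACOS_SOFFICE are hypothetical names, and the fallback path is simply the macOS location that appears in the source listing at the end of this article.

```python
import os
import shutil
from typing import Optional

# Assumed fallback: the default macOS install location used in the source listing.
MACOS_SOFFICE = "/Applications/LibreOffice.app/Contents/MacOS/soffice"


def find_soffice() -> Optional[str]:
    """Return a usable soffice path, or None if LibreOffice cannot be found."""
    found = shutil.which("soffice")  # look on PATH first
    if found:
        return found
    if os.path.isfile(MACOS_SOFFICE):
        return MACOS_SOFFICE
    return None


path = find_soffice()
print(path or "soffice not found; install LibreOffice or add it to PATH")
```

If the function returns None, the conversion methods described below will fail, so checking once at startup gives a clearer error than a failed subprocess call.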

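Core features in detail

The sections below go through WordPI's main methods one by one, each with a code example.

1. Initialization and file check

word_pi = WordPI(word_file_path="example.doc")

On initialization, WordPI checks that the file exists and extracts the file name without its extension. If the path is invalid or the extension is neither .doc nor .docx, an exception is raised.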
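
The check described above can be sketched as a standalone function. This is an illustration under my own assumptions rather than WordPI's exact code: check_word_path and ALLOWED_EXTENSIONS are hypothetical names, and the choice of FileNotFoundError and ValueError is one reasonable option.

```python
import os

ALLOWED_EXTENSIONS = {".doc", ".docx"}


def check_word_path(path: str) -> str:
    """Validate a Word file path and return its base name without the extension."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"File not found: {path}")
    # splitext only strips the last suffix, so "report.v2.docx" keeps "report.v2"
    stem, ext = os.path.splitext(os.path.basename(path))
    if ext.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported extension: {ext!r}")
    return stem
```

Using os.path.splitext instead of splitting on the first dot keeps file names that contain extra dots intact.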

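2. Converting a .doc file to .docx

word_pi.convert_doc_to_docx(word_file_path="example.doc", output_folder="output")

This method uses LibreOffice to convert a .doc file to .docx. If no output directory is given, the directory of the input file is used. On success it prints:

Conversion succeeded: example.doc -> output/example.docx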
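
To make the conversion step concrete, here is a small sketch of how the soffice argument list and the expected output path can be derived from the input path. build_docx_command is a hypothetical helper, not a WordPI method; the flags are the ones used in the source listing at the end of this article.

```python
import os
from typing import List, Optional, Tuple


def build_docx_command(doc_path: str, output_folder: Optional[str] = None) -> Tuple[List[str], str]:
    """Return the soffice argument list and the expected .docx output path."""
    # Fall back to the input file's directory, or "." for a bare file name
    out_dir = output_folder or os.path.dirname(doc_path) or "."
    stem = os.path.splitext(os.path.basename(doc_path))[0]
    expected = os.path.join(out_dir, stem + ".docx")
    command = ["soffice", "--headless", "--convert-to", "docx", "--outdir", out_dir, doc_path]
    return command, expected


command, expected = build_docx_command("example.doc", "output")
print(expected)
# To run the conversion for real (requires LibreOffice):
# import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.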

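3. Extracting images from a .docx file

word_pi.extract_images_from_docx(word_file_path="example.docx", output_folder="images")

This method extracts every image in the .docx file and saves each one with an extension derived from its MIME type. Images in .wmf format are converted to .png automatically. Sample output:

MIME type of image 1: image/png
Image saved to images/image_1.png
MIME type of image 2: image/x-wmf
Image saved to images/image_2.wmf
Image type is wmf; converting to PNG
File converted to PNG: images/image_2.png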
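
WordPI reads images through python-docx, but a .docx file is also an ordinary ZIP archive whose images sit under word/media/, which is what the commented-out alternative in the source listing relies on. The sketch below illustrates that alternative only; list_media_files is a hypothetical name, and the archive is a stand-in built in memory so the example runs without a real document.

```python
import io
import os
import zipfile
from typing import List


def list_media_files(docx_bytes: bytes) -> List[str]:
    """Return the names of files stored under word/media/ in a .docx archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return [
            os.path.basename(name)
            for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        ]


# Build a tiny stand-in archive; a real .docx contains many more parts.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
    archive.writestr("word/media/image1.png", b"fake image bytes")

print(list_media_files(buffer.getvalue()))  # prints ['image1.png']
```

The ZIP route is simpler but loses the MIME type that python-docx exposes, so the extension has to be trusted as stored.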

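4. Building the table of contents

catalogue = word_pi.get_catalogue(word_file_path="example.docx", mark=True)
print(catalogue)

This method reads the headings in the Word file (Heading 1, Heading 2, and so on) and builds a table of contents. With mark=True it returns Markdown heading lines, for example:

['# 1 Basic Function Design Description', '## 1.1 Identity Authentication']

With mark=False it returns a list of (level, text) tuples, for example:

[(1, '1 Basic Function Design Description'), (2, '1.1 Identity Authentication')]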
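
The two return shapes are easy to convert between. As a rough sketch, the helpers below turn the (level, text) tuples into Markdown heading lines or into an indented bullet outline. Both function names are hypothetical and not part of WordPI.

```python
from typing import List, Tuple


def to_markdown_headings(entries: List[Tuple[int, str]]) -> List[str]:
    """Turn (level, title) pairs into Markdown heading lines."""
    return [f"{'#' * level} {title}" for level, title in entries]


def to_nested_list(entries: List[Tuple[int, str]]) -> str:
    """Render (level, title) pairs as an indented Markdown bullet list."""
    return "\n".join(f"{'  ' * (level - 1)}- {title}" for level, title in entries)


entries = [(1, "1 Overview"), (2, "1.1 Authentication")]
print(to_markdown_headings(entries))
print(to_nested_list(entries))
```

Keeping the tuple form as the primary result and deriving the Markdown from it means the outline can be rendered in other ways later without reparsing the document.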

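5. Converting a Word file to Markdown

markdown_content = word_pi.convert_word_to_markdown(
    word_file_path="example.docx",
    output_dir="output",
    mark=True,
)
print(markdown_content)

This method converts the content of a Word file to Markdown. If output_dir is given, the result is saved there as a .md file; otherwise the Markdown is only returned as a string. Headings, tables, and images are converted, with images embedded as ![](image_path).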
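
In the source listing at the end of this article, tables are emitted as HTML rather than as Markdown pipe tables, which keeps multi-line cells intact. The sketch below shows the general idea in a simplified form. rows_to_html_table is a hypothetical helper, and the HTML escaping is something this sketch adds on its own.

```python
from html import escape
from typing import List


def rows_to_html_table(rows: List[List[str]]) -> str:
    """Render rows of cell text as an HTML table; the first row becomes the header."""
    lines = ["<table>"]
    for index, row in enumerate(rows):
        tag = "th" if index == 0 else "td"
        # Escape cell text so characters like < and & cannot break the markup
        cells = "".join(f"<{tag}>{escape(cell)}</{tag}>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


print(rows_to_html_table([["Field", "Type"], ["id", "int"], ["tags", "list<str>"]]))
```

Most Markdown renderers pass inline HTML through unchanged, so a table produced this way can be appended to the Markdown output directly.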

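Use cases

1. Document cleanup: convert .doc files to .docx, extract images, or generate Markdown for archiving or publishing.

2. Content migration: generate Markdown from Word documents and use it directly in a blog or on GitHub.

3. Quick preview: extract the table of contents to see at a glance how a document is organized and what it focuses on.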

Example code

The following complete example shows how to process a Word file with WordPI:

from word_pi import WordPI

# Initialize
word_pi = WordPI("example.doc")

# Convert to .docx
word_pi.convert_doc_to_docx(word_file_path="example.doc", output_folder="output")

# Extract images
word_pi.extract_images_from_docx(word_file_path="output/example.docx", output_folder="images")

# Get the table of contents
catalogue = word_pi.get_catalogue(word_file_path="output/example.docx", mark=True)
print("Table of contents:", catalogue)

# Convert to Markdown
markdown = word_pi.convert_word_to_markdown(word_file_path="output/example.docx", output_dir="output")
print("Markdown content:", markdown)

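Notes

1. LibreOffice configuration: make sure the soffice command is available. The image conversion code hardcodes the default macOS path, so users on Windows or Linux need to adjust it.

2. Image format support: only .wmf is converted at the moment; other formats (such as .emf) can be added as needed.

3. Error handling: basic error logging is built in, but the logging configuration should be completed for real-world use.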
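
On the logging point: Python's root logger defaults to the WARNING level, so INFO-level messages are dropped unless logging is configured somewhere. A minimal setup could look like the sketch below. configure_logging is a hypothetical helper, and the format string is just one reasonable choice.

```python
import logging


def configure_logging(level: int = logging.INFO) -> logging.Logger:
    """Attach a console handler to the root logger and return it."""
    # basicConfig is a no-op if the root logger already has handlers
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    return logging.getLogger()


logger = configure_logging()
logger.info("WordPI logging configured")
```

Call this once at program start, before creating a WordPI instance, so that messages logged during initialization are visible.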

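Summary

WordPI is a practical utility I wrote to simplify parsing and converting Word files. Whether you need format conversion, image extraction, or Markdown generation, it can save you time. If you have similar document-processing needs, give it a try and adapt or extend it as required. Comments about your experience or suggestions for improvement are welcome!

Source code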

import os
import logging
import subprocess
from typing import Union

from docx import Document  # python-docx, used to read DOCX files
from docx.enum.shape import WD_INLINE_SHAPE_TYPE


logger = logging.getLogger(__name__)


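class WordPI:
    """
    WordPI

    Features:
    1. Convert .doc files to .docx with LibreOffice.
    2. Extract all images from a .docx file (.wmf images are converted to .png).
    3. Convert image files (such as .wmf) to PNG with LibreOffice.
    4. Build a table of contents from the document headings.
    5. Convert a Word file to Markdown.
    """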


    def __init__(self, word_file_path: str = None):
        if word_file_path is None or not os.path.exists(word_file_path):
            raise FileNotFoundError(f"File not found: {word_file_path}")

        # splitext strips only the last suffix, so dots inside the name are kept
        name, ext = os.path.splitext(os.path.basename(word_file_path))
        if ext.lower() not in (".doc", ".docx"):
            raise ValueError(f"Unsupported file extension: {ext!r}")

        self.word_file_path = word_file_path
        self.name = name  # file name without extension

    def get_word_doc(self, word_file_path: str = None) -> Union[Document, None]:
        """Return the python-docx document object for a Word file, or None on failure."""
        if word_file_path is None:
            word_file_path = self.word_file_path
        try:
            return Document(word_file_path)
        except Exception as e:
            logger.error(f"Error opening Word file: {e}")
            return None


    def convert_doc_to_docx(self, word_file_path: str = None, output_folder: str = None) -> None:
        """
        Convert a .doc file to .docx with LibreOffice.
        :param word_file_path: path of the input .doc file; defaults to the path given at construction
        :param output_folder: output directory (optional); defaults to the input file's directory
        """
        # Resolve the default path before touching the file system
        if word_file_path is None:
            word_file_path = self.word_file_path

        if not os.path.exists(word_file_path):
            raise FileNotFoundError(f"File not found: {word_file_path}")

        # Fall back to the input file's directory ("." for a bare file name)
        if not output_folder:
            output_folder = os.path.dirname(word_file_path) or "."

        # Expected output path
        docx_file = os.path.join(output_folder, os.path.splitext(os.path.basename(word_file_path))[0] + ".docx")

        # Build the LibreOffice command
        command = [
            "soffice",  # LibreOffice command-line tool (must be on PATH)
            "--headless",  # run without a GUI
            "--convert-to", "docx",  # target format
            "--outdir", output_folder,  # output directory
            word_file_path  # input file
        ]
        # Run the command
        try:
            subprocess.run(command, check=True)
            print(f"Conversion succeeded: {word_file_path} -> {docx_file}")
        except (subprocess.CalledProcessError, FileNotFoundError) as e:
            # FileNotFoundError here means soffice itself could not be found
            print(f"Conversion failed: {e}")



    def convert_image_to_png(self, image_file_path: str, output_path: str) -> None:
        """
        Convert an image file (such as .wmf) to PNG with LibreOffice.

        Args:
            image_file_path (str): path of the input image.
            output_path (str): path of the output file. LibreOffice names the result
                after the input file, so this should be <output dir>/<input name>.png.
        """
        if not os.path.exists(image_file_path):
            raise FileNotFoundError(f"Input file not found: {image_file_path}")

        # Make sure the output directory exists ("." when output_path is a bare file name)
        output_dir = os.path.dirname(output_path) or "."
        os.makedirs(output_dir, exist_ok=True)

        # Build the LibreOffice command
        command = [
            '/Applications/LibreOffice.app/Contents/MacOS/soffice',  # LibreOffice path (macOS default; adjust on other systems)
            '--headless',  # run without a GUI
            '--convert-to', 'png',  # target format
            '--outdir', output_dir,  # output directory
            image_file_path  # input file
        ]

        try:
            # Run the command
            subprocess.run(command, check=True)
            print(f"File converted to PNG: {output_path}")
        except subprocess.CalledProcessError as e:
            print(f"Conversion failed: {e}")
        except Exception as e:
            print(f"An error occurred: {e}")


    # Alternative kept for reference: read images straight from the ZIP archive (needs `import zipfile`).
    # def extract_images_from_docx(docx_path: str, output_folder: str) -> None:
    #     # Make sure the output folder exists
    #     if not os.path.exists(output_folder):
    #         os.makedirs(output_folder)

    #     # Open the .docx file as a ZIP archive
    #     with zipfile.ZipFile(docx_path, 'r') as zip_ref:
    #         # Walk every file in the archive
    #         for file_info in zip_ref.infolist():
    #             # Files under word/media/ are the embedded images
    #             if file_info.filename.startswith('word/media/'):
    #                 # Write the file to the output folder
    #                 image_name = os.path.basename(file_info.filename)
    #                 output_path = os.path.join(output_folder, image_name)
    #                 with open(output_path, 'wb') as image_file:
    #                     image_file.write(zip_ref.read(file_info))
    #                 print(f"Extracted: {output_path}")
    

    def get_image_extension(self, content_type: str) -> str:
        """
        Return the file extension that matches an image MIME type.

        Args:
            content_type (str): image MIME type, for example "image/jpeg" or "image/png".

        Returns:
            str: the matching extension, such as "jpg" or "png"; "unknown" if the MIME type is not recognized.
        """
        extensions = {
            "image/jpeg": "jpg",       # JPEG
            "image/png": "png",        # PNG
            "image/gif": "gif",        # GIF
            "image/bmp": "bmp",        # BMP
            "image/tiff": "tiff",      # TIFF
            "image/svg+xml": "svg",    # SVG
            "image/x-emf": "emf",      # Enhanced Metafile
            "image/x-wmf": "wmf",      # Windows Metafile
            "image/webp": "webp"       # WebP
        }
        return extensions.get(content_type, "unknown")  # fall back to "unknown" for unrecognized types


    def extract_images_from_docx(self, word_file_path: str, output_folder: str) -> None:
        """
        Extract every image from a DOCX file and save it to the output folder.

        Args:
            word_file_path (str): path of the DOCX file; None uses the path given at construction.
            output_folder (str): folder in which the extracted images are saved.

        Steps:
            1. Open the DOCX file and parse it.
            2. Walk the document relationships (rels) and pick out the images.
            3. Derive the file extension from each image's MIME type.
            4. Save the images to the output folder.
        """
        if word_file_path is None:
            word_file_path = self.word_file_path

        if not os.path.exists(word_file_path):
            raise FileNotFoundError(f"Input file not found: {word_file_path}")

        # Open the DOCX file
        doc = Document(word_file_path)

        # Create the output folder if needed
        os.makedirs(output_folder, exist_ok=True)

        # Count images separately so file names run 1, 2, 3 without gaps
        image_index = 0
        for rel in doc.part.rels.values():
            # Skip external links and relationships that are not images
            if rel.is_external or "image" not in rel.reltype:
                continue
            image_index += 1
            img_part = rel.target_part  # image part
            img_bytes = img_part.blob  # raw image bytes
            img_ext = self.get_image_extension(img_part.content_type)  # extension from MIME type

            print(f"MIME type of image {image_index}: {img_part.content_type}")

            # Build the file name and write the image
            img_filename = os.path.join(output_folder, f'image_{image_index}.{img_ext}')
            with open(img_filename, 'wb') as img_file:
                img_file.write(img_bytes)
            print(f'Image saved to {img_filename}')

            # TODO: only .wmf is converted for now; extend once other formats show up in real data
            if img_ext == "wmf":
                print(f"Image type is {img_ext}; converting to PNG")
                img_filename_png = os.path.join(output_folder, f'image_{image_index}.png')
                self.convert_image_to_png(image_file_path=img_filename, output_path=img_filename_png)
                # Remove the original only if the conversion actually produced a PNG
                if os.path.exists(img_filename_png):
                    os.remove(img_filename)


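    # The return annotation is quoted so it is not evaluated on Python versions before 3.10
    def get_catalogue(self, word_file_path: str = None, mark: bool = False) -> "list[tuple[int, str]] | list[str]":
        """
        Return the heading outline of a Word file.
        mark=True returns Markdown headings: ['# 1 Basic Function Design Description', '## 1.1 Identity Authentication']
        mark=False returns (level, text) tuples: [(1, '1 Basic Function Design Description'), (2, '1.1 Identity Authentication')]
        """
        if word_file_path is None:
            word_file_path = self.word_file_path

        if not os.path.exists(word_file_path):
            raise FileNotFoundError(f"Input file not found: {word_file_path}")

        doc = Document(word_file_path)
        toc = []
        # Walk the paragraphs in the document
        for para in doc.paragraphs:
            style = para.style.name
            # Only heading styles (Heading 1, Heading 2, ...) are of interest
            if style.startswith('Heading'):
                parts = style.split()
                if len(parts) < 2 or not parts[1].isdigit():
                    continue  # a heading-like style without a numeric level
                # Heading level, e.g. "Heading 1" -> 1
                level = int(parts[1])
                text = para.text.strip()
                if text:  # skip empty headings
                    if mark:
                        prefix = "#" * level
                        toc.append(f"{prefix} {text}")
                    else:
                        toc.append((level, text))

        return toc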
    
    def convert_word_to_markdown(self, word_file_path: str = None, output_dir: str = None, mark: bool = False) -> Union[str, None]:
        """
        Convert a Word file to Markdown.
        output_dir = None: only return the Markdown string (pictures are not saved).
        output_dir = str: also save the pictures and a .md file in that directory.
        mark: currently unused; kept so existing calls that pass it still work.
        Returns the Markdown string, or None if processing fails.
        """
        if word_file_path is None:
            word_file_path = self.word_file_path

        if not os.path.exists(word_file_path):
            raise FileNotFoundError(f"Input file not found: {word_file_path}")

        doc = Document(word_file_path)

        if output_dir is not None:
            # Create the output directory
            os.makedirs(output_dir, exist_ok=True)

        # HTML page template for tables (currently unused)
        html_header = """
        <html lang="zh-CN">
        <head>
            <meta charset="UTF-8">
            <title>Interface tables</title>
            <style>
                table { width: 100%; border-collapse: collapse; margin: 20px 0; }
                th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
                th { background-color: #f2f2f2; }
                .section-title th { background-color: #e0e0e0; font-weight: bold; text-align: center; }
                pre { margin: 0; white-space: pre-wrap; }
            </style>
        </head>
        <body>
        """
        html_footer = """
        </body>
        </html>
        """
        # Collected Markdown fragments
        result = []
        image_counter = 1

        # Render one table as HTML
        def process_table(table):
            html_content = '<table>\n    <tbody>\n'
            has_content = False

            for row_idx, row in enumerate(table.rows):
                cells = [cell.text.strip() for cell in row.cells]
                if not any(cells):  # empty row
                    html_content += '        <tr><td colspan="6">&nbsp;</td></tr>\n'
                    continue

                has_content = True
                if row_idx == 0 and len(cells) > 0 and cells[0]:
                    # First row is treated as the table title
                    html_content += f'        <tr><th colspan="6" style="text-align: center; font-weight: bold;">{cells[0]}</th></tr>\n'
                elif row_idx == 1:  # header row
                    html_content += '        <tr>\n'
                    for header in cells:
                        html_content += f'            <th>{header}</th>\n'
                    html_content += '        </tr>\n'
                else:  # data row
                    html_content += '        <tr>\n'
                    for cell in cells:
                        # Cells that look like JSON keep their line breaks
                        if any(c in cell for c in ['{', '}', '[', ']']):
                            html_content += f'            <td><pre>{cell}</pre></td>\n'
                        else:
                            html_content += f'            <td>{cell}</td>\n'
                    html_content += '        </tr>\n'

            html_content += '    </tbody>\n</table>\n'
            return html_content if has_content else None

        try:
            # Walk the body elements in document order
            for block in doc.element.body:
                # Tables
                if block.tag.endswith('}tbl'):
                    table = next((t for t in doc.tables if t._element == block), None)
                    if table:
                        table_html = process_table(table)
                        if table_html:
                            result.append(table_html)
                    continue

                # Paragraphs
                if not block.tag.endswith('}p'):
                    continue
                para = next((p for p in doc.paragraphs if p._element == block), None)
                if para is None:
                    continue
                text = para.text.strip()
                if text:
                    style = para.style.name
                    if style.startswith('Heading'):
                        parts = style.split()
                        level = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else 1
                        result.append(f"{'#' * level} {text}")
                    else:
                        result.append(text)

                # Inline pictures embedded in this paragraph
                for inline in block.xpath('.//wp:inline'):
                    shape_obj = next((s for s in doc.inline_shapes if s._inline == inline), None)
                    if shape_obj is None or shape_obj.type != WD_INLINE_SHAPE_TYPE.PICTURE:
                        continue
                    print(f" Picture (inline in paragraph), width: {shape_obj.width.cm:.2f} cm, height: {shape_obj.height.cm:.2f} cm")
                    if output_dir is None:
                        continue  # no directory to save pictures into
                    # Save the picture with an extension that matches its MIME type
                    image_rid = shape_obj._inline.graphic.graphicData.pic.blipFill.blip.embed
                    image_part = doc.part.related_parts[image_rid]
                    image_ext = self.get_image_extension(image_part.content_type)
                    image_filename = os.path.join(output_dir, f"image_{image_counter}.{image_ext}")
                    with open(image_filename, "wb") as f:
                        f.write(image_part.blob)
                    print(f"  Picture saved to: {image_filename}")
                    # Link by file name: the .md file is written to the same directory
                    result.append(f"![]({os.path.basename(image_filename)})")
                    image_counter += 1

            result_content = "\n\n".join(result)

            if output_dir is not None:
                # Save the result next to the pictures, named after the input file
                stem = os.path.splitext(os.path.basename(word_file_path))[0]
                md_path = os.path.join(output_dir, f"{stem}.md")
                with open(md_path, "w", encoding="utf-8") as f:
                    f.write(result_content)
                print(f"Done. Result saved to {md_path}")

            return result_content

        except Exception as e:
            print(f"Error while processing the document: {e}")
            return None
