Parsing the gnt and dgrl Formats of the CASIA Handwriting Dataset (HWDB)

Tags

#algorithm #python

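This article shows how to parse the Gnt and Dgrl files of the CASIA handwriting dataset with Python for a handwriting recognition project. Parsing Gnt means reading the pixel data and saving it as PNG; parsing Dgrl means extracting the text labels and the image information. Multiprocessing is used to speed up batch processing of the images.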

Contents

Introduction
Parsing the Gnt format
Parsing the Dgrl format
References

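Introduction

I have recently been working on a handwriting recognition project, and the dataset I found online for testing models is CASIA (see the CASIA dataset website).
Its gnt and dgrl files have to be parsed before use, so I tried a lot of ready-made code from around the web.
In the end I settled on the article "Parsing the gnt and dgrl formats of the CASIA handwriting dataset HWDB1.0".
I modified its gnt-handling code to add multiprocessing, which speeds up batch processing of the images.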
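
A minimal sketch of how the multiprocessing idea could look, assuming the .gnt files sit in one directory: the list of paths is split into chunks, and each chunk is handed to a worker process through multiprocessing.Pool. The helper chunk, the stand-in worker count_bytes and the directory name gnt_dir are hypothetical names used only for illustration, not the code from this article; a real worker would parse the files in its chunk instead of measuring them.

```python
from multiprocessing import Pool
from pathlib import Path


def chunk(items, n_chunks):
    """Split items into at most n_chunks contiguous lists of near-equal size."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_bytes(paths):
    """Hypothetical stand-in worker: total size of the files in one chunk."""
    return sum(Path(p).stat().st_size for p in paths)


if __name__ == '__main__':
    gnt_dir = Path('gnt_dir')  # placeholder directory, an assumption
    gnt_paths = sorted(gnt_dir.glob('*.gnt')) if gnt_dir.is_dir() else []
    if gnt_paths:
        n_workers = 4  # arbitrary choice for the sketch
        with Pool(processes=n_workers) as pool:
            # Each worker process receives one list of paths
            sizes = pool.map(count_bytes, chunk(gnt_paths, n_workers))
        print(sum(sizes))
```

Passing a list of paths per task, rather than one path per task, matches the shape of process_gnt_file(gnt_paths) used later in this article, which also takes a list.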

Parsing the Gnt format

import struct
from pathlib import Path
from PIL import Image
from multiprocessing import Pool



def process_gnt_file(gnt_paths):

    label_list = []
    for gnt_path in gnt_paths:
        count = 0
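        print(f'gnt path ---> {gnt_path}')

        with open(str(gnt_path), 'rb') as f:
            # In binary mode read() returns bytes, so end of file is b'', not ''
            while f.read(1) != b'':
                f.seek(-1, 1)  # step back over the byte consumed by the EOF probe
                count += 1
                try:
                    # Read the record header fields (little-endian)
                    length_bytes = struct.unpack('<I', f.read(4))[0]  # record size in bytes

                    tag_code = f.read(2)  # 2-byte character code

                    width = struct.unpack('<H', f.read(2))[0]

                    height = struct.unpack('<H', f.read(2))[0]

                    im = Image.new('RGB', (width, height))
                    img_array = im.load()  # pixel access object
                    for x in range(height):  # x is the row index
                        for y in range(width):  # y is the column index
                            # Read one grayscale pixel value
                            pixel = struct.unpack('<B', f.read(1))[0]
                            # Pillow indexes pixels as [column, row]
                            img_array[y, x] = (pixel, pixel, pixel)

                    filename = str(count) + '.png'

                    # Decode the GBK tag code into the Chinese character
                    tag_code = tag_code.decode('gbk').strip('\x00')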
                    save_path = f'{save_dir}/zf_images_train/{gnt_path.stem}'
                    if not Path(save_path).exists(
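
The record layout read by the loop above (a 4-byte size, a 2-byte tag code, a 2-byte width, a 2-byte height, then width × height one-byte pixels) can also be sketched as a small generator that uses only the standard library. This is an illustrative sketch, not the code from this article: the names iter_gnt_records and make_record are hypothetical, and stopping quietly on a truncated record is an assumption about how incomplete files should be handled.

```python
import io
import struct

HEADER = struct.Struct('<I2sHH')  # size, tag code, width, height (10 bytes)


def iter_gnt_records(stream):
    """Yield (label, width, height, bitmap) for each record in a binary stream."""
    while True:
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:  # end of file or truncated header
            break
        _size, tag_code, width, height = HEADER.unpack(header)
        bitmap = stream.read(width * height)  # one byte per grayscale pixel
        if len(bitmap) < width * height:  # truncated bitmap
            break
        label = tag_code.decode('gbk', errors='replace').strip('\x00')
        yield label, width, height, bitmap


def make_record(char, width, height, pixels):
    """Build one synthetic record for the demo (hypothetical helper)."""
    tag = char.encode('gbk').ljust(2, b'\x00')
    return HEADER.pack(HEADER.size + len(pixels), tag, width, height) + pixels


if __name__ == '__main__':
    data = make_record('啊', 2, 2, bytes([0, 255, 255, 0]))
    for label, w, h, bitmap in iter_gnt_records(io.BytesIO(data)):
        print(label, w, h, len(bitmap))
```

With Pillow installed, Image.frombytes('L', (width, height), bitmap) should build the grayscale image from each bitmap in a single call, which avoids the per-pixel Python loop.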
