Training Tesseract-OCR Samples to Recognize Digits in Images

Latest recommended article published on 2026-06-17 14:47:55


This content is licensed under CC 4.0 BY-SA.

Tags

#python #pytesseract #Tesseract-OCR #jTessBoxEditor


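This article describes how to train Tesseract-OCR to recognize in-game digits more accurately. The process is: prepare sample images, then use jTessBoxEditor to merge them, generate a box file, correct the recognized text, define a font properties file, and build the language file, so that the digits are recognized accurately.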


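I recently wanted to write some Python code that recognizes the prices of items in the CrossFire (CF) exchange. I first ran this on the command line: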

tesseract.exe grab.jpg result -l eng

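With the default Tesseract language data, the digits are always recognized as letters or garbage, as shown in the figure below:

So, following the post at https://blog.csdn.net/yasi_xi/article/details/8763385, I trained language data that recognizes the in-game digits much more reliably.

Training samples:

The regions to recognize are the sale price and "My CF Points" in the figure below.

Python code: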

import time

import pytesseract
import win32con
import win32gui
from PIL import ImageGrab


def get_bin_table(threshold=105):
    # Build the lookup table that maps each gray value (0-255) to 0 or 1
    return [0 if i < threshold else 1 for i in range(256)]


def grab():
    # Look up the window handle by its exact title (no class name filter)
    hwnd = win32gui.FindWindow(None, "Crossfire20170910_0000.bmp - 画图")
    if not hwnd:
        raise RuntimeError("Target window not found")
    print(hwnd)
    left, top, right, bottom = win32gui.GetWindowRect(hwnd)
    print(left, top, right, bottom)
    win32gui.ShowWindow(hwnd, win32con.SW_SHOWNORMAL)
    win32gui.SetForegroundWindow(hwnd)
    time.sleep(0.2)  # give the window time to come to the foreground
    img = ImageGrab.grab((870, 478, 913, 495))  # capture the screen region to recognize
    img.show()
    imggray = img.convert('L')  # convert to grayscale
    table = get_bin_table()
    out = imggray.point(table, '1')  # binarize with the lookup table
    # out.show()
    text = pytesseract.image_to_string(out)  # pass lang='chi_sim' to recognize Simplified Chinese
    text = text.upper()
    print(text)

    # img.save('C:/Users/Ysx/PycharmProjects/ocr/out/%s.jpg' % text)


if __name__ == '__main__':
    grab()
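To make the thresholding step concrete, here is a minimal, dependency-free sketch of what the lookup table does to individual gray values. The `binarize` helper and the sample pixel row are illustrative assumptions and not part of the original script, where `point(table, '1')` applies the table to every pixel of the grayscale image.

```python
def get_bin_table(threshold=105):
    """Lookup table: gray values below the threshold map to 0, the rest to 1."""
    return [0 if i < threshold else 1 for i in range(256)]


def binarize(pixels, threshold=105):
    """Apply the table to a flat sequence of 0-255 gray values (hypothetical helper)."""
    table = get_bin_table(threshold)
    return [table[p] for p in pixels]


# Illustrative gray values only: three below the threshold, three at or above it.
row = [12, 40, 104, 105, 180, 255]
binary_row = binarize(row)
print(binary_row)
```

If the digits come out broken or merged with the background, adjusting `threshold` is the first thing to try.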

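The success rate with the default language data is low: the digits come out as garbage or letters. So I decided to train my own accurate language data that recognizes only digits.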

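1. Training environment: first install jdk-10.0.1_windows-x64_bin.exe, which provides the Java runtime. Then download jTessBoxEditor, the tool used to prepare the training samples.

2. Sample images: capture images like the ones below from CF (the more the better, although there are only 10 digits in total).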

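3. Merge the images: run jTessBoxEditor and choose Tools -> Merge TIFF from the menu bar. In the dialog, select the sample images (hold Shift to select several) and save the merged file as num.font.exp0.tif.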
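The merged file name follows the Tesseract training naming convention `[lang].[fontname].exp[num].tif`; here the language is `num` and the font name is `font`. A small sketch that builds and checks such names, where both helper names are hypothetical:

```python
import re

# [lang].[fontname].exp[num].tif
NAME_PATTERN = re.compile(r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.tif$")


def training_tiff_name(lang, font, num=0):
    """Build a training image name such as num.font.exp0.tif."""
    return f"{lang}.{font}.exp{num}.tif"


def parse_training_tiff_name(name):
    """Split a training image name into (lang, font, num), or raise ValueError."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a [lang].[fontname].exp[num].tif name: {name!r}")
    return match.group("lang"), match.group("font"), int(match.group("num"))


name = training_tiff_name("num", "font")
print(name, parse_training_tiff_name(name))
```

The same `num.font.exp0` prefix is reused by every later step, so it is worth getting right before merging.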

4. Generate the box file:

Run on the command line:

cd C:\Users\Ysx\PycharmProjects\ocr\train

tesseract.exe num.font.exp0.tif num.font.exp0 batch.nochop makebox

* This generates num.font.exp0.box. The box file contains the characters Tesseract recognized together with their coordinates.
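Each line of a box file has the form `<char> <left> <bottom> <right> <top> <page>`, with coordinates measured from the bottom-left corner of the image. As a rough sketch, the snippet below parses such lines and flags any character that is not a digit, which is what needs fixing in the correction step. The sample lines are made up for illustration and are not taken from the author's num.font.exp0.box.

```python
from collections import namedtuple

Box = namedtuple("Box", "char left bottom right top page")


def parse_box_line(line):
    """Parse one '<char> <left> <bottom> <right> <top> <page>' line."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return Box(parts[0], left, bottom, right, top, page)


# Illustrative content only; a real box file comes from the makebox command.
sample = """\
1 3 2 9 15 0
B 11 2 19 15 0
5 21 2 29 15 1
"""

boxes = [parse_box_line(line) for line in sample.splitlines() if line.strip()]
unexpected = [b for b in boxes if b.char not in set("0123456789")]
for b in unexpected:
    print(f"page {b.page}: {b.char!r} at ({b.left}, {b.bottom}) is not a digit")
```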

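5. Correct the text: open num.font.exp0.tif in jTessBoxEditor, add boxes for digits that were not recognized, and fix digits that were recognized incorrectly.

6. Define the font properties file: create a new text file in the current directory, enter font 0 0 0 0 0, and save it as font_properties (be sure to remove the .txt extension).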

The format is <fontname> <italic> <bold> <fixed> <serif> <fraktur>, where each value after the font name is 1 or 0 to indicate whether the font has that property.
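Saving from Notepad tends to add a .txt extension, so one alternative is to write the file from Python. This is only a sketch: the helper names are hypothetical, and it writes into a temporary directory so that it can be run safely; in practice the target would be the training directory. The font name must match the middle part of num.font.exp0.tif.

```python
import os
import tempfile


def font_properties_line(fontname, italic=False, bold=False,
                         fixed=False, serif=False, fraktur=False):
    """Format '<fontname> <italic> <bold> <fixed> <serif> <fraktur>' with 1/0 flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([fontname] + [str(int(flag)) for flag in flags])


def write_font_properties(directory, line):
    """Write the entry to a file named exactly 'font_properties' (no extension)."""
    path = os.path.join(directory, "font_properties")
    with open(path, "w", encoding="ascii", newline="\n") as f:
        f.write(line + "\n")
    return path


line = font_properties_line("font")
with tempfile.TemporaryDirectory() as tmp:
    path = write_font_properties(tmp, line)
    with open(path, encoding="ascii") as f:
        written = f.read()
print(line)
```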

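7. Generate the language file: create a new text file in the current directory, enter the following commands, and save it as 1.bat.

rem Create the font_properties file in this directory before running this batch file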

echo Run Tesseract for Training..  
tesseract.exe num.font.exp0.tif num.font.exp0 nobatch box.train  
  
echo Compute the Character Set..  
unicharset_extractor.exe num.font.exp0.box  
mftraining -F font_properties -U unicharset -O num.unicharset num.font.exp0.tr  
  
echo Clustering..  
cntraining.exe num.font.exp0.tr  
  
echo Rename Files..  
rename normproto num.normproto  
rename inttemp num.inttemp  
rename pffmtable num.pffmtable  
rename shapetable num.shapetable   
  
echo Create Tessdata..  
combine_tessdata.exe num.

Double-click 1.bat to run it, then copy the generated num.traineddata into the tessdata directory of your Tesseract-OCR installation.
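For readers who prefer to drive the same steps from Python, the batch file can be mirrored roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and it assumes the Tesseract 3.x training tools used in 1.bat are on PATH. Only the command list is printed here; `run_training` is defined but not called.

```python
import os
import subprocess

# Files produced by mftraining/cntraining that need the language prefix.
RENAMED = ("normproto", "inttemp", "pffmtable", "shapetable")


def training_commands(lang="num", font="font", exp=0):
    """Return the commands from 1.bat as argument lists, in order."""
    base = f"{lang}.{font}.exp{exp}"
    return [
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train"],
        ["unicharset_extractor", f"{base}.box"],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],
        ["cntraining", f"{base}.tr"],
    ]


def run_training(workdir, lang="num", font="font", exp=0):
    """Run every step in workdir and return the path of the .traineddata file."""
    for cmd in training_commands(lang, font, exp):
        subprocess.run(cmd, cwd=workdir, check=True)  # stop at the first failure
    for name in RENAMED:
        os.replace(os.path.join(workdir, name),
                   os.path.join(workdir, f"{lang}.{name}"))
    subprocess.run(["combine_tessdata", f"{lang}."], cwd=workdir, check=True)
    return os.path.join(workdir, f"{lang}.traineddata")


commands = training_commands()
for cmd in commands:
    print(" ".join(cmd))
```

Unlike the batch file, `check=True` aborts as soon as one tool fails, which makes a broken box file or a missing font_properties file easier to spot.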

8. Use the trained num language data:

Run on the command line: