TrOCR — Transformer-based Optical Recognition Model

Introduction

Mar 5, 2023

The electronic translation of images of typed, handwritten, or printed text into machine-encoded text is known as optical character recognition (OCR). The source could be a page that has been scanned, a photo of the page, or text that has been overlaid on an image. OCR is used to convert the text from these sources into machine-readable form.

Before discussing Transformer Based OCR in more detail, let’s first understand how an OCR pipeline functions.

Most OCR pipelines consist of two modules.
1. Text detection module
2. Text recognition module

Text Detection Module

As its name suggests, the text detection module looks for any instances of text in the source. Its goal is to localize every text block in the image, either at the word level (individual words) or at the text-line level. The task is analogous to an object detection problem in which the objects of interest are the text blocks.

There are many popular object detection algorithms available, including YOLOv4/5/6, Detectron, Mask-RCNN, and others.

Text Recognition Module

The text recognition module’s objective is to decipher the content of the detected text block and convert the visual cues into tokens of natural language.

A text recognition module frequently has two sub-modules.

Image Understanding
Word Piece Generation

Image Understanding → CNN-based modules are used to understand the image crop produced by the text detection module.

Word Piece Generation → After understanding the image, the module autoregressively outputs the text it contains.
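The two-stage structure described above can be sketched in a few lines of Python. This is only an illustrative outline: `run_ocr_pipeline`, `detect_boxes`, `crop` and `recognize` are hypothetical names, and the stand-in functions below fake a detector and a recognizer on a toy "image" made of strings.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)

def run_ocr_pipeline(image, detect_boxes: Callable, crop: Callable,
                     recognize: Callable) -> List[Tuple[Box, str]]:
    """Run detection first, then recognition on every detected block."""
    results = []
    for box in detect_boxes(image):           # stage 1: text detection
        region = crop(image, box)             # cut out one text block
        results.append((box, recognize(region)))  # stage 2: text recognition
    return results

# Toy stand-ins (assumptions for illustration): the "image" is a list of
# strings, one per text line, and recognition just upper-cases the crop.
fake_image = ["hello world", "second line"]

def fake_detect(image):
    return [(0, row, len(text), row + 1) for row, text in enumerate(image)]

def fake_crop(image, box):
    left, top, right, _bottom = box
    return image[top][left:right]

def fake_recognize(region):
    return region.upper()

results = run_ocr_pipeline(fake_image, fake_detect, fake_crop, fake_recognize)
print([text for _box, text in results])  # ['HELLO WORLD', 'SECOND LINE']
```

In a real pipeline the detector would be one of the object detection models listed earlier, and the recognizer would be a model such as TrOCR.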

TrOCR Working

TrOCR (Transformer-based OCR) is one of the earliest works to use pre-trained image and text transformers together. It is an end-to-end, transformer-based model for text recognition.

The figure below depicts the TrOCR architecture. The Vision Transformer encoder is on the left side of the diagram, while the RoBERTa (text Transformer) decoder is on the right.

Vision Transformer (ViT) as Encoder

The image is split into NxN patches, and each patch is treated like a token in a sentence. The patches are flattened (2D -> 1D), linearly projected, and combined with positional embeddings. The resulting sequence is then passed through the layers of the transformer encoder.

In an OCR pipeline, the source image is first broken up into multiple small text boxes. To keep the localised text boxes uniform, each cropped image is resized to a fixed size HxW. The resized image is then split into HW/(PxP) patches, each of size PxP, where P is the patch size.
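To make the patch arithmetic concrete, here is a minimal pure-Python sketch of splitting an image into flattened, non-overlapping patches. The function name `patchify` and the tiny 4x8 grayscale "image" are assumptions chosen for illustration; real implementations work on tensors and include the colour channels.

```python
def patchify(image, patch_size):
    """Split a 2D image (list of rows) into flattened P x P patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    # Walk over the patch grid row by row, left to right
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the 2D patch into a 1D list (2D -> 1D)
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

H, W, P = 4, 8, 2
# Toy image whose pixel value encodes its position
image = [[r * W + c for c in range(W)] for r in range(H)]
patches = patchify(image, P)

print(len(patches))  # H*W / (P*P) = 8
print(patches[0])    # [0, 1, 8, 9]
```

Each flattened patch has P*P values here; in the model, every such vector is linearly projected to the model dimension before the positional embedding is added.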

Each Transformer layer contains a multi-head self-attention module and a fully connected feed-forward network. Each of these two sub-layers is followed by a residual connection and layer normalisation.

Note: Residual connections help gradients flow through the stacked layers during backpropagation.
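The "sub-layer, then residual connection, then layer normalisation" pattern can be sketched as follows. This is a simplified illustration, not the model's actual code: `residual_block` and `toy_sublayer` are hypothetical names, and the learnable scale and shift parameters of layer normalisation are left out.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalise a vector to zero mean and (roughly) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]

def residual_block(vector, sublayer):
    """Apply a sub-layer, add the input back, then normalise."""
    sublayer_out = sublayer(vector)
    # Residual connection: the input skips around the sub-layer
    summed = [x + y for x, y in zip(vector, sublayer_out)]
    return layer_norm(summed)

def toy_sublayer(vector):
    # Stand-in for self-attention or the feed-forward network
    return [0.5 * v for v in vector]

x = [1.0, 2.0, 3.0, 4.0]
y = residual_block(x, toy_sublayer)
print([round(v, 3) for v in y])
```

Because the input is added back unchanged, the gradient always has a direct path around the sub-layer, which is the point made in the note above.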

RoBERTa as Decoder

The decoder module takes as input the output embeddings extracted from a specific depth of the Vision Transformer.

The decoder module is also a transformer: a stack of identical layers whose structure is similar to that of the encoder layers, except that each layer inserts an “encoder-decoder attention” module between the multi-head self-attention and the feed-forward network, so that it can attend to different parts of the encoder’s output. In the encoder-decoder attention module, the keys and values are derived from the encoder output, while the queries are derived from the decoder input.
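The role split in encoder-decoder attention (queries from the decoder, keys and values from the encoder) can be illustrated with a small single-head sketch. It is deliberately simplified: `cross_attention` is a hypothetical name, there are no learned projection matrices, and the toy vectors are made up for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states):
    """Single-head attention: queries = decoder, keys/values = encoder."""
    dim = len(encoder_states[0])
    outputs = []
    for query in decoder_states:
        # Scaled dot product between one query and every encoder key
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in encoder_states]
        weights = softmax(scores)
        # Weighted sum of the encoder values
        context = [sum(w * value[i] for w, value in zip(weights, encoder_states))
                   for i in range(dim)]
        outputs.append(context)
    return outputs

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 image patches
decoder_states = [[10.0, 0.0], [0.0, 10.0]]            # 2 text tokens
context = cross_attention(decoder_states, encoder_states)
print(len(context), len(context[0]))  # 2 2
```

There is one output vector per decoder position, and each one is a mixture of encoder outputs weighted by how well they match that position's query.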

The decoder’s embeddings are projected from the model dimension (768) to the vocabulary size V (50265).

The softmax function calculates the probabilities over the vocabulary and we use beam search to get the final output.
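As a rough illustration of beam search, the sketch below keeps the `beam_width` best partial sequences at every step instead of only the single best one. Everything here is a toy: `beam_search`, `toy_model` and the probability table are assumptions made for the example, and practical implementations usually add refinements such as length normalisation.

```python
import math

def beam_search(next_token_log_probs, beam_width, max_len, eos_token):
    """Return the highest-scoring token sequence found by beam search."""
    beams = [([], 0.0)]  # (tokens so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            # Extend every live beam with every possible next token
            for token, log_prob in next_token_log_probs(tokens).items():
                candidates.append((tokens + [token], score + log_prob))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:  # keep only the best
            if tokens[-1] == eos_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # include sequences cut off at max_len
    return max(finished, key=lambda item: item[1])[0]

# Toy next-token distribution: "a" looks best first, but "b z" wins overall
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "x": 0.1},
}

def toy_model(tokens):
    probs = TABLE.get(tuple(tokens), {"<eos>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}

print(beam_search(toy_model, beam_width=2, max_len=4, eos_token="<eos>"))
# ['b', 'z', '<eos>']
print(beam_search(toy_model, beam_width=1, max_len=4, eos_token="<eos>"))
# ['a', 'x', '<eos>']  (greedy search misses the better sequence)
```

With the Hugging Face model used later in this article, beam search can be requested by passing `num_beams` to `model.generate`.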

Advantages and Disadvantages of TrOCR

Pros

TrOCR is an end-to-end Transformer-based OCR model for text recognition, and the first work to jointly use pre-trained image (CV) and text (NLP) Transformers for the text recognition task in OCR.

TrOCR uses a conventional transformer-based encoder-decoder paradigm that is convolution-free and does not require any complicated pre- or post-processing to achieve state-of-the-art accuracy.

Cons

The model expects images that contain a single line of text. The example in the implementation section shows what happens when this is not the case.

Implementation

1. Handwritten Text

# Notebook cell: install the Hugging Face Transformers library
!pip install transformers

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
from IPython.display import display

# Load the pre-trained processor and model for handwritten text
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten")

Visualize Images

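def show_image(pathStr):
  # Load the image as RGB, display it in the notebook, and return it
  img = Image.open(pathStr).convert("RGB")
  display(img)
  return img

def ocr_image(src_img):
  # Preprocess the image into pixel values for the encoder
  pixel_values = processor(images=src_img, return_tensors="pt").pixel_values
  # Generate token ids with the decoder, then decode them to text
  generated_ids = model.generate(pixel_values)
  return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]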

hw_image = show_image('/HWLetter.png')

ocr_image(hw_image)

The output for this image is “0.0”. As noted in the cons, the model expects an image containing a single line of text, so we have to crop the image down to one text line.

hw_image1 = hw_image.crop((0, 10, hw_image.size[0], 40))
display(hw_image1)

Now, see the output

ocr_image(hw_image1)

Output → Dean Sister, but her you, feel bad for years,
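The crop above hard-codes a single line. One rough way to cover a whole page is to cut it into horizontal bands of equal height and recognise each band separately. The helper below is a hypothetical sketch (`line_boxes` is not part of any library), and it assumes evenly spaced lines, which real handwriting rarely has; a proper text detection module is the more robust option.

```python
def line_boxes(width, height, line_height, top_margin=0):
    """Return (left, upper, right, lower) crop boxes, one per text band."""
    boxes = []
    top = top_margin
    # Only keep bands that fit completely inside the image
    while top + line_height <= height:
        boxes.append((0, top, width, top + line_height))
        top += line_height
    return boxes

boxes = line_boxes(width=800, height=100, line_height=30, top_margin=10)
print(boxes)
# [(0, 10, 800, 40), (0, 40, 800, 70), (0, 70, 800, 100)]

# With the objects defined in this article, each band could then be read with:
#   texts = [ocr_image(hw_image.crop(box))
#            for box in line_boxes(*hw_image.size, 30, 10)]
```

The boxes use the same (left, upper, right, lower) order as the `crop` call above, and the first box reproduces that manual crop when the image width is used.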

2. Printed Images and Scans

printed_processor = TrOCRProcessor.from_pretrained('microsoft/trocr-base-printed')
printed_model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-printed')

def ocr_printed_image(src_img):
  pixel_values = printed_processor(images=src_img, return_tensors="pt").pixel_values
  generated_ids = printed_model.generate(pixel_values)
  return printed_processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

invoice_image = show_image('/Invoice1.png')

Cropping

invoice_image1 = invoice_image.crop((0, 200, invoice_image.size[0], 225))
display(invoice_image1)

ocr_printed_image(invoice_image1)

Output → JOH SHIM NO. JOH SHIMH INVOICE DATE 102909

References:

TrOCR — Paper: TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
TrOCR — Models Library: https://huggingface.co/models?filter=trocr
TrOCR — Most Downloaded Model: https://huggingface.co/microsoft/trocr-base-printed
TrOCR — Hugging Face Home Page: https://huggingface.co/docs/transformers/model_doc/trocr

Written by Tejpal Kumawat