How Gemini 2.0 Flash is Revolutionizing Table Extraction from PDFs: A Deep Dive with Real Benchmarks

4 min read · May 23, 2025

Extracting structured data from unstructured documents — especially tables from PDFs — has long been a pain point for data engineers and AI practitioners. With the advent of large language models (LLMs) like OpenAI’s GPT-4, Anthropic’s Claude, and Google’s Gemini 2.0 Flash, the landscape is rapidly changing. In this article, I’ll walk you through a real-world benchmarking project that puts Gemini 2.0 Flash to the test, extracting tables from nearly a thousand PDFs and comparing its performance to other leading models.

Git repo: https://github.com/sahil0094/PDF-Extraction-Gemini-2-Flash-001

Why Table Extraction from PDFs is Hard

PDFs are designed for human readability, not machine parsing. Tables may be embedded as images, use inconsistent formatting, or span multiple pages. Traditional rule-based or OCR-based approaches often fail on complex layouts or scanned documents.

LLMs, with their ability to “see” and “understand” both text and images, offer a new hope. But how well do they really perform? And how do you measure that performance at scale?

Project Overview: Benchmarking LLMs for Table Extraction

This project is an open-source benchmarking suite, inspired by the rd-tablebench repository and the research article “Ingesting Millions of PDFs and why Gemini 2.0 Changes Everything”. The goal: systematically evaluate how well different LLMs can extract tables from PDFs.

What Does the Project Do?

  • Processes hundreds of PDFs in parallel
  • Sends each PDF to an LLM (Gemini, GPT-4, Claude) for table extraction
  • Grades the extracted tables against ground truth
  • Tracks token usage, latency, and cost
  • Aggregates and reports results
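
To make the "in parallel" part concrete, here is a minimal sketch of how a batch could be fanned out over a thread pool while recording per-request latency. It is not the repository's code: run_in_parallel and extract_fn are hypothetical names, and extract_fn stands in for whatever function sends one PDF to a model. Threads are a reasonable fit here because the work is dominated by waiting on API responses.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_parallel(pdf_paths, extract_fn, max_workers=8):
    """Call extract_fn(path) for every path on a thread pool.

    extract_fn is a hypothetical callable standing in for the LLM request.
    Returns {path: {"output": ..., "latency_s": ..., "error": ...}}.
    """

    def timed_call(path):
        start = time.perf_counter()
        try:
            output, error = extract_fn(path), None
        except Exception as exc:  # one failing PDF should not stop the batch
            output, error = None, repr(exc)
        latency = time.perf_counter() - start
        return {"output": output, "latency_s": latency, "error": error}

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(timed_call, path): path for path in pdf_paths}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

The max_workers value is an arbitrary default; in practice it would be tuned to the API's rate limits.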

Step 1: Preparing the Data

The dataset consists of 945 PDFs — a mix of digital and scanned documents, each containing at least one table. For each PDF, there’s a corresponding ground truth table (in HTML format) for evaluation.

Directory structure:

input/
  pdfs/          # The raw PDF files
  groundtruth/   # Ground truth tables for each PDF
results/
  outputs/       # Extracted tables (HTML) from each model
  scores/        # Grading and evaluation results
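
Before any model is called, each PDF has to be matched with its ground truth file. The sketch below shows one way to do that. It assumes the ground truth for name.pdf is stored as name.html, which is a naming convention I am guessing at rather than one confirmed by the repository; pair_pdfs_with_groundtruth is a hypothetical helper.

```python
from pathlib import Path


def pair_pdfs_with_groundtruth(pdf_dir, groundtruth_dir):
    """Return (pairs, missing) by matching files on their stem.

    Assumption: the ground truth for "name.pdf" is "name.html".
    """
    pairs, missing = [], []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        gt_path = Path(groundtruth_dir) / f"{pdf_path.stem}.html"
        if gt_path.is_file():
            pairs.append((pdf_path, gt_path))
        else:
            missing.append(pdf_path)  # report instead of silently skipping
    return pairs, missing
```

Returning the unmatched PDFs separately makes it easy to spot gaps in the dataset before spending any API budget.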

Step 2: Converting PDFs to Images

Since many tables are embedded as images (not selectable text), the first page of each PDF is converted to a PNG image using the pdf2image library. This image is then base64-encoded for API submission.

from pdf2image import convert_from_path
from io import BytesIO
import base64

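def convert_pdf_to_base64_image(pdf_path):
    # Render only the first page (pdf2image requires poppler to be installed).
    images = convert_from_path(pdf_path, first_page=1, last_page=1)
    # Write the page to an in-memory PNG, then base64-encode it.
    img_buffer = BytesIO()
    images[0].save(img_buffer, format="PNG")
    return base64.b64encode(img_buffer.getvalue()).decode("utf-8")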

Step 3: Prompting Gemini 2.0 Flash (and Other LLMs)

The core of the extraction is a carefully crafted prompt:

“Convert the image to an HTML table. The output should begin with <table> and end with </table>. Specify rowspan and colspan attributes when they are greater than 1. Do not specify any other attributes. Only use table related HTML tags, no additional formatting is required.”

This prompt, along with the image, is sent to the LLM. For Gemini 2.0 Flash, the Google Generative AI Python SDK is used:

import google.generativeai as genai

genai.configure(api_key=GEMINI_API_KEY)
gemini_model = genai.GenerativeModel("gemini-2.0-flash-001")

image_part = {"mime_type": "image/png", "data": base64_image}
response = gemini_model.generate_content([prompt, image_part])

The model returns an HTML snippet containing the extracted table.

Step 4: Postprocessing and Saving Results

The returned HTML is parsed to extract only the <table>…</table> portion. This is saved for later grading.

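def parse_gemini_response(content: str):
    # Locate the first <table>...</table> span in the model output.
    start = content.find("<table>")
    end = content.find("</table>", start) if start != -1 else -1
    if start == -1 or end == -1:
        return None  # opening or closing tag is missing
    return content[start:end + len("</table>")]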
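
Models do not always follow the prompt to the letter: the table may arrive with attributes on the opening tag, in upper case, or wrapped in extra text. As a hedged alternative to the exact-string search, here is a regex-based sketch that tolerates those variations. extract_first_table and TABLE_RE are hypothetical names, and like the simple version it stops at the first closing tag, so nested tables are not handled.

```python
import re

# Matches <table ...> through the first </table>, ignoring case and newlines.
TABLE_RE = re.compile(r"<table\b[^>]*>.*?</table\s*>", re.IGNORECASE | re.DOTALL)


def extract_first_table(content):
    """Return the first complete table element in content, or None."""
    match = TABLE_RE.search(content)
    return match.group(0) if match else None
```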

Step 5: Grading the Extracted Tables

How do you know if the extraction was successful? The project uses a robust grading system:

  • HTML-to-Array Conversion: Both the ground truth and extracted HTML tables are converted to NumPy arrays for cell-wise comparison.
  • Cell Similarity: Uses Levenshtein distance to measure how similar each cell is.
  • Row/Column Alignment: Employs the Needleman-Wunsch algorithm (famous in bioinformatics) to align rows and columns, handling insertions, deletions, and misalignments.
  • Final Score: Produces a similarity score between 0 (completely different) and 1 (perfect match).
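
The HTML-to-array step is where rowspan and colspan matter. Here is a minimal sketch of that conversion using only the standard library. It returns nested lists rather than a NumPy array to stay dependency-free, and it copies a spanning cell's text into every slot it covers. Both are my assumptions, not necessarily how the project's grader works, and html_table_to_grid is a hypothetical name.

```python
from html.parser import HTMLParser


def _span(value):
    """Parse a rowspan/colspan attribute, defaulting to 1."""
    try:
        return max(int(value), 1)
    except (TypeError, ValueError):
        return 1


class _TableParser(HTMLParser):
    """Collects rows of [text, rowspan, colspan] cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # cell currently receiving text, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
            self._cell = None
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell with no enclosing <tr>
                self.rows.append([])
            attrs = dict(attrs)
            self._cell = ["", _span(attrs.get("rowspan")), _span(attrs.get("colspan"))]
            self.rows[-1].append(self._cell)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0] += data


def html_table_to_grid(html):
    """Expand an HTML table into a rectangular list of lists of strings."""
    parser = _TableParser()
    parser.feed(html)
    parser.close()
    occupied = {}  # (row, col) -> cell text
    for r, row in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in occupied:  # slot already filled by a rowspan above
                c += 1
            clean = " ".join(text.split())  # collapse runs of whitespace
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied[(r + dr, c + dc)] = clean
            c += colspan
    if not occupied:
        return []
    n_rows = max(r for r, _ in occupied) + 1
    n_cols = max(c for _, c in occupied) + 1
    return [[occupied.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]
```

If NumPy is available, passing the resulting grid to numpy.array gives the array form described above.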

Example grading function:

def table_similarity(ground_truth, prediction):
    # Normalize, align, and score the tables
    ...
    return similarity_score
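
To give a feel for how Levenshtein distance and Needleman-Wunsch fit together, here is a simplified, self-contained sketch. It is not the project's grading code: it aligns rows only (cells are aligned within each matched row), uses a gap score of 0, and divides by the larger table's row count. Those choices, along with the names cell_similarity, needleman_wunsch, row_similarity, and table_similarity_sketch, are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance with unit cost for insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def cell_similarity(a, b):
    """1.0 for identical strings, 0.0 when every character differs."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def needleman_wunsch(xs, ys, sim, gap=0.0):
    """Best global alignment score; sim scores a pair, gap scores a skip."""
    score = [[0.0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, len(ys) + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(xs[i - 1], ys[j - 1]),  # pair them
                score[i - 1][j] + gap,                            # skip xs item
                score[i][j - 1] + gap,                            # skip ys item
            )
    return score[-1][-1]


def row_similarity(row_a, row_b):
    """Align the cells of two rows and normalize to the range 0..1."""
    if not row_a and not row_b:
        return 1.0
    return needleman_wunsch(row_a, row_b, cell_similarity) / max(len(row_a), len(row_b))


def table_similarity_sketch(truth, pred):
    """Align the rows of two grids and normalize to the range 0..1."""
    if not truth and not pred:
        return 1.0
    return needleman_wunsch(truth, pred, row_similarity) / max(len(truth), len(pred))
```

With these settings, a prediction that drops one of three rows but gets the other two exactly right scores 2/3, because the alignment skips the missing row instead of shifting every later row out of place.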

Step 6: Aggregating Metrics

For each model, the system tracks:

  • Total PDFs processed
  • Total input and output tokens
  • Average API latency
  • Average accuracy and standard deviation
  • Total cost

Example results (from the project):

  • Total PDFs processed: 945
  • Total input tokens: 294,840
  • Total output tokens: 670,605
  • Average API latency: 12.73 seconds
  • Average Accuracy: 0.84 (std 0.15)
  • Total cost: $0.30
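
As a sanity check on the cost figure, here is a small sketch of the aggregation. The prices are assumptions ($0.10 per million input tokens and $0.40 per million output tokens), so check current pricing before relying on them. The record schema and the names estimate_cost and summarize are also hypothetical, as is the use of the population standard deviation. Under those assumed prices, the token totals above work out to about $0.2977, which rounds to the reported $0.30.

```python
from statistics import mean, pstdev

# Assumed prices in USD per one million tokens; verify against current pricing.
INPUT_PRICE_PER_M = 0.10
OUTPUT_PRICE_PER_M = 0.40


def estimate_cost(input_tokens, output_tokens,
                  input_price=INPUT_PRICE_PER_M, output_price=OUTPUT_PRICE_PER_M):
    """Cost in USD for the given token counts at per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def summarize(records):
    """Aggregate per-PDF records (hypothetical schema) into run-level metrics.

    Each record is a dict with input_tokens, output_tokens, latency_s, score.
    Raises statistics.StatisticsError if records is empty.
    """
    total_in = sum(r["input_tokens"] for r in records)
    total_out = sum(r["output_tokens"] for r in records)
    scores = [r["score"] for r in records]
    return {
        "pdfs": len(records),
        "input_tokens": total_in,
        "output_tokens": total_out,
        "avg_latency_s": mean(r["latency_s"] for r in records),
        "avg_score": mean(scores),
        "std_score": pstdev(scores),  # population std, an assumption
        "cost_usd": estimate_cost(total_in, total_out),
    }


# Token totals reported above: 294,840 input and 670,605 output.
print(round(estimate_cost(294_840, 670_605), 4))
```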

What Makes Gemini 2.0 Flash Stand Out?

  • Speed: Gemini 2.0 Flash is built for low latency. API calls in this run averaged 12.73 seconds, and sending requests in parallel keeps large batches practical.
  • Accuracy: Averages a 0.84 similarity score (std 0.15) on complex, real-world tables, including those embedded as images.
  • Cost Efficiency: Processes all 945 PDFs for about $0.30.

Conclusion

This project demonstrates that Gemini 2.0 Flash is a game-changer for extracting structured data from unstructured documents. By combining powerful LLMs, smart prompting, and robust evaluation, it’s now possible to automate what was once a manual, error-prone process — at scale and at low cost.

If you’re working with document data, consider benchmarking Gemini 2.0 Flash on your own datasets. The code is open-source, extensible, and ready for real-world use.
