
Building an LLM Document Extraction Benchmark Framework

· 5 min read
Shreya Soni
Intern

Large Language Models (LLMs) are increasingly used for structured information extraction from documents such as resumes, invoices, and reports. However, models differ in extraction accuracy, execution time, consistency, and output quality, so choosing the right model for a document extraction task is a real challenge.

To address this, we built an LLM Document Extraction Benchmark System that compares multiple LLMs on structured document extraction tasks. The framework evaluates models using common prompts and documents, then measures their performance using metrics such as execution time, accuracy, precision, recall, and F1 score.

The project supports both local and cloud-based LLMs and provides a benchmarking pipeline for comparing extraction quality, execution time, and structured output consistency across different models.

🎯 Project Goal

The goal of this project is to build a generic LLM Benchmarking Framework for evaluating how effectively different Large Language Models perform structured information extraction from real-world documents.

The framework helps compare models based on:

  • Extraction accuracy
  • Execution time
  • Output consistency
  • Structured response quality

The system benchmarks multiple local and cloud-based LLMs using the same documents and prompts to ensure fair and reliable comparison for document extraction workflows.

🧠 Models Evaluated

The project benchmarks both local and cloud-based LLMs.

| Model | Platform |
| --- | --- |
| Llama3 (8B) | Ollama |
| Mistral (7B) | Ollama |
| Qwen2.5 (7B) | Ollama |
| GPT-4.1 | Azure OpenAI |
| Azure Llama 3.1 | Azure AI |

Each model processes identical prompts and documents, allowing direct comparison of extraction quality and execution performance.
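As a rough sketch of how one identical prompt might be sent to each local model, the snippet below targets Ollama's REST endpoint `/api/generate` on its default port. The model tags, the helper names `build_ollama_request` and `call_ollama`, and the use of `urllib` are assumptions made for illustration; the project's own client code may look different.

```python
import json
import urllib.request

# Assumed Ollama tags for the three local models (not taken from the project).
LOCAL_MODELS = ["llama3:8b", "mistral:7b", "qwen2.5:7b"]
OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_ollama_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for one model."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_ollama(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    request = build_ollama_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

Looping over `LOCAL_MODELS` and calling `call_ollama(tag, prompt)` with the same `prompt` string keeps the input identical across models; the cloud models would need their own client functions with the same signature.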

🏗️ System Architecture

Benchmark Architecture

The benchmarking framework was designed so that every model receives the same document, extraction prompt, and evaluation workflow to ensure fair comparison across all models.

The project was implemented using:

  • FastAPI for backend API workflows
  • Ollama for running local LLMs
  • Azure OpenAI APIs for cloud-based model testing
  • PyPDF for document text extraction
  • Pandas & OpenPyXL for automated benchmark report generation

The pipeline extracts text from documents, sends standardized prompts to multiple LLMs, parses the generated responses into structured JSON outputs, validates the outputs, and compares model performance using metrics such as accuracy, precision, recall, F1 score, execution time, and output consistency.

Each model was executed independently, and execution time was measured separately to compare model latency and extraction performance fairly.
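Per-model timing of this kind can be sketched with `time.perf_counter`. The function names and the idea of passing each model as a plain callable are assumptions for illustration, not the project's actual interface.

```python
import time
from typing import Callable, Dict, Tuple


def timed_call(model_fn: Callable[[str], str], prompt: str) -> Tuple[str, float]:
    """Run one model call and return (output, elapsed_seconds)."""
    start = time.perf_counter()
    output = model_fn(prompt)
    elapsed = time.perf_counter() - start
    return output, elapsed


def benchmark_models(
    models: Dict[str, Callable[[str], str]], prompt: str
) -> Dict[str, dict]:
    """Call each model in turn with the same prompt, timing each call separately."""
    results = {}
    for name, model_fn in models.items():
        output, elapsed = timed_call(model_fn, prompt)
        results[name] = {"output": output, "seconds": round(elapsed, 2)}
    return results
```

Running the models one after another, rather than concurrently, keeps one model's load from inflating another model's measured time.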

One of the major challenges was handling inconsistent LLM outputs such as invalid JSON, missing fields, and hallucinated values. To solve this, additional parsing, validation, and formatting logic was implemented before evaluation.

⚡ Backend API using FastAPI

The project also includes a FastAPI backend that allows extraction and benchmarking through APIs.

Example Endpoint

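from fastapi import FastAPI, UploadFile

app = FastAPI()


@app.post("/extract")
async def extract(file: UploadFile, prompt: str):
    # `file` is the uploaded document; `prompt` arrives as a query parameter.
    # Placeholder response: the extraction itself would be triggered here.
    return {
        "message": "Extraction Started",
        "prompt": prompt,
    }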

FastAPI provides:

  • High-performance APIs
  • Automatic Swagger documentation
  • Easy integration with AI workflows
  • Scalable backend support

🌐 Swagger UI Support

FastAPI automatically generates interactive API documentation.

http://127.0.0.1:8000/docs

Swagger UI allows direct testing of extraction endpoints from the browser.

Benchmarking

📂 Supported Document Formats

The framework supports multiple document formats:

  • PDF
  • TXT
  • DOCX
  • XLSX

This flexibility makes the system adaptable for different enterprise document workflows.
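One common way to support several formats is to dispatch on the file extension. The sketch below covers only TXT and PDF; the `READERS` registry and the function names are hypothetical, and DOCX and XLSX readers would be registered the same way with their own libraries.

```python
from pathlib import Path


def read_txt(path: Path) -> str:
    """Read a plain-text file as UTF-8."""
    return path.read_text(encoding="utf-8")


def read_pdf(path: Path) -> str:
    """Concatenate the text of every page in a PDF."""
    # Imported lazily so this module still loads where pypdf is not installed.
    from pypdf import PdfReader

    reader = PdfReader(str(path))
    # extract_text() may yield nothing for image-only pages, hence the `or ""`.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# Hypothetical registry mapping file extensions to reader functions.
READERS = {".txt": read_txt, ".pdf": read_pdf}


def load_document(path: str) -> str:
    """Return the plain text of a document, chosen by file extension."""
    file_path = Path(path)
    reader = READERS.get(file_path.suffix.lower())
    if reader is None:
        raise ValueError(f"Unsupported document format: {file_path.suffix}")
    return reader(file_path)
```

Because every reader returns plain text, the rest of the pipeline does not need to know which format the document started in.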

✨ Prompt-Based Extraction

One of the most important features of the system is prompt-driven extraction. Instead of hardcoding extraction rules, users define the fields to extract dynamically in the prompt.

Example Prompt

Extract name, email, phone, skills, education, and experience from the document.

Given a prompt like this, the system identifies the requested fields in the document (a resume, in this example) and generates structured JSON output.
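A standardized prompt could be assembled from the requested field names along the following lines. The wording of the instructions and the helper name `build_extraction_prompt` are assumptions for illustration; the project's real prompt template is not shown in this post.

```python
import json
from typing import List


def build_extraction_prompt(fields: List[str], document_text: str) -> str:
    """Combine the requested fields and the document into one prompt string."""
    # An empty template shows the model exactly which keys are expected.
    template = json.dumps({field: None for field in fields}, indent=2)
    return (
        "Extract the following fields from the document: "
        + ", ".join(fields)
        + ".\n"
        + "Respond with JSON only, using exactly these keys "
        + "(use null for any field that is not present):\n"
        + template
        + "\n\nDocument:\n"
        + document_text
    )
```

Usage might look like `build_extraction_prompt(["name", "email", "phone"], resume_text)`, where `resume_text` is the text extracted from the uploaded file. Asking for `null` instead of a guess gives the model an explicit alternative to inventing a value.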

📊 Benchmarking Metrics

The framework evaluates models using:

  • Execution Time
  • Accuracy
  • Precision
  • Recall
  • F1 Score

This helps identify the best-performing model for real-world extraction systems.
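The post does not spell out how field-level results map to these metrics, so the sketch below uses one plausible convention, stated here as an assumption: a field whose value matches the ground truth is a true positive, a field with a wrong value is a false positive, and a field the model left empty is a false negative. Accuracy is then the share of ground-truth fields extracted correctly.

```python
from typing import Any, Dict


def _normalize(value: Any) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(str(value).lower().split())


def score_extraction(truth: Dict[str, Any], predicted: Dict[str, Any]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 (as percentages) for one document."""
    tp = fp = fn = 0
    for field, expected in truth.items():
        value = predicted.get(field)
        if value is None or value == "":
            fn += 1  # field missing from the model output
        elif _normalize(value) == _normalize(expected):
            tp += 1  # field extracted correctly
        else:
            fp += 1  # field present but wrong
    total = len(truth)
    accuracy = tp / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": round(accuracy * 100, 2),
        "precision": round(precision * 100, 2),
        "recall": round(recall * 100, 2),
        "f1": round(f1 * 100, 2),
    }
```

Under this convention, a model that returns all 7 of 7 requested fields but gets 2 of them wrong scores 71.43 for accuracy and precision, 100 for recall, and 83.33 for F1, because no field was missing.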

📊 Benchmark Performance Comparison

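| Metric | Llama | Mistral | Qwen | GPT-4.1 | Azure-Llama |
| --- | --- | --- | --- | --- | --- |
| Execution Time (sec) | 58.24 | 89.71 | 62.28 | 4.12 | 1.99 |
| Accuracy (%) | 100 | 71.43 | 85.71 | 100 | 85.71 |
| Precision (%) | 100 | 71.43 | 85.71 | 100 | 85.71 |
| Recall (%) | 100 | 100 | 100 | 100 | 100 |
| F1 Score (%) | 100 | 83.33 | 92.31 | 100 | 92.31 |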

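These comparisons show how the models trade off extraction accuracy against execution speed. In this run, Llama and GPT-4.1 both scored 100 on every metric, but GPT-4.1 finished in 4.12 seconds against 58.24 seconds for Llama, while Azure-Llama was the fastest at 1.99 seconds with an F1 score of 92.31.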

Generated Reports

The framework automatically generates:

Benchmark Output Report

benchmark_output.xlsx

Contains:

  • Extracted fields
  • Ground truth values
  • Predictions from each model
  • Execution times

Accuracy Report

benchmark_accuracy.xlsx

Contains:

  • Accuracy
  • Precision
  • Recall
  • F1 Score
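Reports like these can be produced by collecting one row per field and handing the rows to pandas. The row layout, the `execution_time_sec` row, and the function names below are assumptions for illustration; only the output file name comes from this post.

```python
from typing import Any, Dict, List


def build_output_rows(
    truth: Dict[str, Any],
    predictions: Dict[str, Dict[str, Any]],
    seconds: Dict[str, float],
) -> List[Dict[str, Any]]:
    """Build one row per field: ground truth plus each model's prediction."""
    rows = []
    for field, expected in truth.items():
        row = {"field": field, "ground_truth": expected}
        for model, predicted in predictions.items():
            row[model] = predicted.get(field)
        rows.append(row)
    # Final row records each model's execution time in seconds.
    time_row: Dict[str, Any] = {"field": "execution_time_sec", "ground_truth": None}
    time_row.update(seconds)
    rows.append(time_row)
    return rows


def write_report(rows: List[Dict[str, Any]], path: str = "benchmark_output.xlsx") -> None:
    """Write the rows to an Excel file."""
    # Imported lazily; writing .xlsx files with pandas also requires openpyxl.
    import pandas as pd

    pd.DataFrame(rows).to_excel(path, index=False)
```

The accuracy report could be written the same way, with one row per metric and one column per model.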

🧩 Challenges Faced​

Building reliable LLM extraction systems comes with several challenges.

1️⃣ Inconsistent Outputs

LLMs may generate:

  • Invalid JSON
  • Missing fields
  • Additional explanations
  • Incomplete responses

✅ Solution

  • Structured prompts
  • Validation layers
  • Parsing logic
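A validation layer of this kind might normalize every response to the same set of keys and record what was wrong with it. The function name `validate_record` and the wording of the problem messages are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def validate_record(
    record: Any, required_fields: List[str]
) -> Tuple[Dict[str, Any], List[str]]:
    """Return (clean_record, problems) with exactly the required keys."""
    if not isinstance(record, dict):
        empty = {field: None for field in required_fields}
        return empty, ["response is not a JSON object"]

    problems = []
    clean = {}
    for field in required_fields:
        value = record.get(field)
        # Treat empty strings and empty containers the same as a missing field.
        if value in (None, "", [], {}):
            problems.append(f"missing field: {field}")
            value = None
        clean[field] = value

    # Keys the prompt never asked for are reported and dropped.
    for extra in sorted(set(record) - set(required_fields)):
        problems.append(f"unexpected field: {extra}")
    return clean, problems
```

Every model's output then has the same shape before evaluation, and the `problems` list gives a simple per-model count of consistency issues.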

2️⃣ Hallucination

Sometimes models generate information not present in the document.

✅ Solution

  • Better prompt design
  • Validation checks
  • Controlled output formatting
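One simple validation check, sketched here as an assumption and not as the project's actual logic, is to flag any extracted value that does not appear in the source text. It is a heuristic: it catches invented emails or skills but will also flag legitimate values that the model reformatted.

```python
import re
from typing import Any, Dict, List


def _normalize(text: Any) -> str:
    """Lowercase and collapse whitespace for a forgiving substring match."""
    return re.sub(r"\s+", " ", str(text)).strip().lower()


def find_unsupported_values(record: Dict[str, Any], document_text: str) -> List[str]:
    """Return the fields whose value (or any list item) is absent from the document."""
    haystack = _normalize(document_text)
    unsupported = []
    for field, value in record.items():
        if value is None:
            continue  # nothing was extracted, so nothing can be hallucinated
        items = value if isinstance(value, list) else [value]
        if any(_normalize(item) not in haystack for item in items):
            unsupported.append(field)
    return unsupported
```

Fields returned by this check could be reviewed manually or counted as errors during evaluation.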

3️⃣ Parsing Failures

Improperly formatted outputs can break downstream workflows.

✅ Solution

  • Exception handling
  • Fallback parsing
  • JSON validation
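Fallback parsing can be sketched as two attempts: strict `json.loads` first, then a second try on the outermost `{...}` span, which handles responses wrapped in explanations or code fences. The function name `parse_llm_json` is hypothetical.

```python
import json
from typing import Optional


def parse_llm_json(raw: str) -> Optional[dict]:
    """Parse a model response into a dict, or return None if that is not possible."""
    # First attempt: the whole response is already valid JSON.
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass

    # Fallback: keep only the text between the first "{" and the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Returning `None` instead of raising lets the benchmark record a parsing failure for that model and continue with the next one.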

🔥 Key Features

  • Multi-model LLM benchmarking
  • Prompt-driven extraction
  • Structured JSON output
  • FastAPI backend integration
  • Swagger UI support
  • Execution time benchmarking
  • Automated Excel reports
  • Support for multiple document formats

🎯 Conclusion

LLM benchmarking plays an important role in identifying which models perform better for structured document extraction tasks based on factors such as accuracy, execution time, and output consistency.

This project demonstrates how multiple local and cloud-based LLMs can be evaluated using a common benchmarking pipeline with standardized prompts, structured validation, and automated reporting. By integrating FastAPI, prompt-based extraction, validation workflows, and benchmarking metrics, the framework provides a practical and scalable approach for comparing document extraction performance across different models.

The project also highlights that selecting the right LLM depends not only on intelligence, but also on reliability, speed, and consistency for real-world AI workflows.

🔗 GitHub Repository

Check out the complete project here:
https://github.com/admin-suketa/LLM_Document_Extraction_Benchmark

📚 References

  1. FastAPI Documentation (2026). Available at: https://fastapi.tiangolo.com/

  2. Ollama Documentation (2026). Available at: https://ollama.com/

  3. OpenAI API Documentation (2026). Available at: https://platform.openai.com/docs/

  4. PyPDF Documentation (2026). Available at: https://pypdf.readthedocs.io/

  5. Microsoft Azure OpenAI Documentation (2026). Available at: https://learn.microsoft.com/en-us/azure/ai-services/openai/