AI Document Intelligence — OCR + LLM

Built a two-stage pipeline for automated document extraction using OCR and a locally hosted LLM.

Shory Insurance Brokers

4+ Document Types Automated

Local LLM via Ollama

JSON Structured Output to Backend

The Problem

Every insurance application required verified identity and residency documents: Emirates IDs, labour contracts, visas, and residence permits. Reading these manually to extract fields such as name, ID number, expiry date, and employer was slow and error-prone, and it created a bottleneck at every customer onboarding step. A misread field meant a delayed application or a rejected submission.

I built a two-stage automated pipeline to remove that manual step. First, an OCR layer extracted raw text from document scans regardless of layout. Then a locally hosted LLM, served through Ollama, interpreted the extracted text and returned only the specific structured fields the backend needed. The prompt was constrained to a fixed field schema so that the output was deterministic JSON, with no hallucinated or ambiguous fields.

Pipeline Architecture

Built a two-stage pipeline in Python. Stage 1 used OCR to extract raw text from scanned documents regardless of layout, handling the range of formats, orientations, and print qualities found across Emirates IDs, labour contracts, visas, and residence permits. Stage 2 passed that raw text to a locally hosted LLM running via Ollama, prompted to identify and extract only the fields required by the backend (name, ID number, expiry, nationality, employer, and so on) and to return them as a deterministic, structured JSON object. There were no external API calls: the LLM ran locally, keeping all document data on-premise.
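The two stages can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the production code: the OCR engine is not named here, so it is passed in as a callable (something like `pytesseract.image_to_string` would be one option); the model name `llama3`, the key names in `FIELDS`, and the helper names `build_prompt`, `ask_ollama`, and `extract_fields` are all hypothetical. The request targets Ollama's default local endpoint, `/api/generate`, with JSON output requested and temperature set to 0.

```python
import json
import urllib.request
from typing import Callable

# Ollama's default local endpoint; nothing leaves the machine.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Hypothetical key names, modelled on the fields the backend needs.
FIELDS = [
    "document_type", "full_name", "id_number",
    "nationality", "date_of_birth", "expiry_date",
]


def build_prompt(raw_text: str) -> str:
    """Wrap the OCR text in an instruction that names the allowed keys."""
    return (
        "Extract the following fields from the document text and reply with "
        "a single JSON object containing only these keys: "
        + ", ".join(FIELDS)
        + ". Use null for any field that is not present.\n\n"
        "Document text:\n" + raw_text
    )


def ask_ollama(prompt: str, model: str = "llama3") -> str:
    """Stage 2: send the prompt to a local Ollama server, return its reply."""
    payload = {
        "model": model,                  # assumed model name
        "prompt": prompt,
        "stream": False,                 # one complete response, not chunks
        "format": "json",                # ask Ollama for valid JSON
        "options": {"temperature": 0},   # reduce run-to-run variation
    }
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def extract_fields(
    scan_path: str,
    ocr: Callable[[str], str],
    llm: Callable[[str], str] = ask_ollama,
) -> dict:
    """Run OCR (stage 1), then the LLM (stage 2), and keep only known keys."""
    raw_text = ocr(scan_path)
    data = json.loads(llm(build_prompt(raw_text)))
    if not isinstance(data, dict):
        raise ValueError("LLM reply was not a JSON object")
    # Drop anything outside the schema; fill absent fields with None.
    return {key: data.get(key) for key in FIELDS}
```

Passing the OCR and LLM steps in as callables keeps the two stages independent, so either can be swapped or stubbed out in tests without touching the other.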

Sample Structured Output

Ollama LLM → Backend: Structured JSON Response

{
  "document_type": "emirates_id",
  "full_name": "[REDACTED]",
  "id_number": "784-[REDACTED]",
  "nationality": "[REDACTED]",
  "date_of_birth": "[REDACTED]",
  "expiry_date": "[REDACTED]",
  "confidence": "high"
}

// LLM prompted with field schema — returns only required fields
// Ollama running locally — zero external data exposure
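Before a response like the one above is handed to the backend, it can be checked against the expected schema. The sketch below is one way to do that using only the standard library; it is an assumption about how such a check might look, not a description of the deployed code. The key set mirrors the sample output, the `medium` and `low` confidence values are assumed (only `high` appears in the sample), and `validate_extraction` is a hypothetical helper name.

```python
import json

# Keys taken from the sample response above.
REQUIRED_KEYS = {
    "document_type", "full_name", "id_number", "nationality",
    "date_of_birth", "expiry_date", "confidence",
}
# Only "high" appears in the sample; the other values are assumptions.
ALLOWED_CONFIDENCE = {"high", "medium", "low"}


def validate_extraction(raw_reply: str) -> list[str]:
    """Return a list of problems; an empty list means the reply passed."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - set(data)
    extra = set(data) - REQUIRED_KEYS
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    if extra:
        problems.append("unexpected keys: " + ", ".join(sorted(extra)))
    if data.get("confidence") not in ALLOWED_CONFIDENCE:
        problems.append("confidence must be one of: high, low, medium")

    # The sample Emirates ID number starts with "784-"; check that prefix only.
    id_number = data.get("id_number")
    if data.get("document_type") == "emirates_id" and not (
        isinstance(id_number, str) and id_number.startswith("784-")
    ):
        problems.append("id_number does not start with 784-")
    return problems
```

A reply that produces any problems could be routed to manual review instead of being written to the backend.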

Business Impact

4 document types fully automated — Emirates IDs, labour contracts, visas, and residence permits
Eliminated manual document reading from the onboarding workflow — zero human transcription needed for standard documents
Two-stage design (OCR → LLM) handled layout variation and print quality issues that single-step extraction could not
Local Ollama inference kept all customer document data on-premise — no third-party API exposure for sensitive PII

Stack: Python, Ollama, OCR, REST APIs, SQL Server, Azure DevOps