AI Document Intelligence — OCR + LLM
Built a two-stage pipeline for automated document extraction using OCR and a locally hosted LLM.
Shory Insurance Brokers
The Problem
Every insurance application required verified identity and residency documents: Emirates IDs, labour contracts, visas, and residence permits. Reading these manually to extract fields such as name, ID number, expiry date, and employer was slow and error-prone, and it created a bottleneck at every customer onboarding step. A misread field meant a delayed application or a rejected submission.
I built a two-stage automated pipeline to remove that bottleneck. First, an OCR layer extracted raw text from document scans regardless of layout. Then a locally hosted LLM (served through Ollama) interpreted the extracted text and returned only the specific structured fields the backend needed. The design goals were no hallucinated values, no ambiguity, and deterministic JSON output.
Pipeline Architecture
The pipeline was built in Python in two stages. Stage 1 used OCR to extract raw text from scanned documents regardless of layout, handling the range of formats, orientations, and print qualities found across Emirates IDs, labour contracts, visas, and residence permits. Stage 2 passed that raw text to a locally hosted LLM running via Ollama, prompted to identify and extract only the fields required by the backend (name, ID number, expiry, nationality, employer, etc.) and return them as a deterministic, structured JSON object. There were no external API calls: the LLM ran locally, keeping all document data on-premises.
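The two stages can be sketched roughly as follows. This is an illustration under stated assumptions, not the production code: the field list, the model name "llama3", and the helper names (build_prompt, ollama_generate, extract_fields, run_pipeline) are hypothetical, and the OCR engine is passed in as a plain callable because the write-up does not name one. The request targets Ollama's default local endpoint with JSON output mode and temperature 0, one common way to push a local model toward repeatable, schema-shaped replies.

```python
import json
import urllib.request
from typing import Callable, Sequence

# Assumed field list: the write-up names these fields but not an exact schema.
REQUIRED_FIELDS = ("full_name", "id_number", "nationality", "expiry_date", "employer")

# Default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_prompt(raw_text: str, fields: Sequence[str]) -> str:
    """Build a prompt that asks for exactly the listed keys and nothing else."""
    return (
        "Extract the following fields from the document text below. "
        "Respond with a single JSON object containing exactly these keys: "
        f"{', '.join(fields)}. Use null for any field that is not present.\n\n"
        f"Document text:\n{raw_text}"
    )


def ollama_generate(prompt: str, model: str = "llama3") -> str:
    """Stage 2 transport: send the prompt to the local Ollama server."""
    payload = json.dumps({
        "model": model,                  # hypothetical model name
        "prompt": prompt,
        "stream": False,                 # one complete reply instead of chunks
        "format": "json",                # ask Ollama for a valid JSON reply
        "options": {"temperature": 0},   # reduce run-to-run variation
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def extract_fields(
    raw_text: str,
    generate: Callable[[str], str] = ollama_generate,
    fields: Sequence[str] = REQUIRED_FIELDS,
) -> dict:
    """Stage 2: turn raw OCR text into a dict holding only the required keys."""
    reply = json.loads(generate(build_prompt(raw_text, fields)))
    # Drop anything the model added beyond the schema; fill gaps with None.
    return {name: reply.get(name) for name in fields}


def run_pipeline(
    image_path: str,
    ocr: Callable[[str], str],
    generate: Callable[[str], str] = ollama_generate,
) -> dict:
    """Stage 1 (OCR) followed by Stage 2 (LLM field extraction)."""
    raw_text = ocr(image_path)  # any OCR engine that returns plain text
    return extract_fields(raw_text, generate)
```

Passing the OCR and generation steps in as callables keeps the sketch testable without a scanner image or a running model, and filtering the reply against the field list means extra keys from the model never reach the backend.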
Sample Structured Output
{
  "document_type": "emirates_id",
  "full_name": "[REDACTED]",
  "id_number": "784-[REDACTED]",
  "nationality": "[REDACTED]",
  "date_of_birth": "[REDACTED]",
  "expiry_date": "[REDACTED]",
  "confidence": "high"
}
// LLM prompted with field schema; returns only required fields
// Ollama running locally; zero external data exposure
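A backend receiving an object like the sample above might check it before use. The following is a minimal sketch of such a check, not something the write-up describes: the function name validate_emirates_id_record is hypothetical, the ID pattern (784, a four-digit year, seven digits, one check digit) and the ISO date format are assumptions, and the real backend may have used different rules or none at all.

```python
import json
import re
from datetime import date

# Keys taken from the sample output above.
EXPECTED_KEYS = {
    "document_type", "full_name", "id_number", "nationality",
    "date_of_birth", "expiry_date", "confidence",
}

# Assumed layout: 784-YYYY-NNNNNNN-N. Adjust if the real format differs.
EMIRATES_ID_PATTERN = re.compile(r"^784-\d{4}-\d{7}-\d$")


def validate_emirates_id_record(raw_json: str) -> list:
    """Return a list of problems found in the reply; an empty list means it passed."""
    try:
        record = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    missing = EXPECTED_KEYS - record.keys()
    extra = record.keys() - EXPECTED_KEYS
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")

    id_number = record.get("id_number")
    if not isinstance(id_number, str) or not EMIRATES_ID_PATTERN.match(id_number):
        problems.append("id_number does not match the assumed pattern")

    # Assumed date format: ISO 8601 (YYYY-MM-DD).
    for key in ("date_of_birth", "expiry_date"):
        try:
            date.fromisoformat(str(record.get(key)))
        except ValueError:
            problems.append(f"{key} is not an ISO date")
    return problems
```

A record that fails any check could be routed to manual review instead of being written to the backend, which keeps a malformed reply from silently becoming customer data.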
Business Impact
- 4 document types fully automated — Emirates IDs, labour contracts, visas, and residence permits
- Eliminated manual document reading from the onboarding workflow: standard documents no longer needed human transcription
- Two-stage design (OCR → LLM) handled layout variation and print quality issues that single-step extraction could not
- Local Ollama inference kept all customer document data on-premises, with no third-party API exposure for sensitive PII