Some customers want our help with document processing — for example emails or invoices: we need to extract the relevant information and turn it into structured records.
I’d treat this as an “ingestion → extraction → validation → structuring → publishing” pipeline, with humans-in-the-loop where it matters (accuracy, exceptions, and policy).
1) Define the target “structured record”
Start with a schema per document type (invoice, receipt, purchase order, email request, etc.). Keep it simple and extensible.
Invoice example (core fields)
Vendor: name, VAT ID, address
Buyer: name, VAT ID
Invoice: number, issue date, due date, currency
Totals: subtotal, tax breakdown, total
Line items: description, qty, unit price, tax rate
Payment: IBAN, payment terms
Provenance: source file/email id, received timestamp, page count
Evidence: “this value came from this snippet / bbox / page”
That last part (evidence) is crucial for trust and audits.
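The core invoice fields above could be sketched as a schema, e.g. with Python `TypedDict`s — a minimal sketch; the field names and the flat shape are illustrative, not a fixed spec:

```python
from typing import Optional, TypedDict

class Evidence(TypedDict):
    page: int
    snippet: str  # the raw text the value was read from

class LineItem(TypedDict):
    description: str
    qty: float
    unit_price: float
    tax_rate: float

class InvoiceRecord(TypedDict):
    vendor_name: str
    vendor_vat_id: Optional[str]  # "return null if missing, don't guess"
    invoice_number: str
    issue_date: str               # ISO 8601
    due_date: Optional[str]
    currency: str
    subtotal: float
    total: float
    line_items: list[LineItem]
    evidence: dict[str, Evidence]  # field name -> where it came from
```

Keeping `evidence` as a first-class part of the record (rather than a side channel) is what makes the audit story cheap later.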
2) Ingest + normalize
Inputs: PDFs, scans, email bodies, email attachments, EDI-ish PDFs, images.
Steps:
Collect from sources (email inbox, upload folder, API).
Convert to a canonical “document bundle”:
text (best-effort)
layout (pages, blocks)
images (per page)
metadata (sender, dates, thread id)
De-duplicate (hashing) and classify.
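The de-duplication step can be sketched with a plain content hash (SHA-256 here; the in-memory `seen` set stands in for whatever store you actually use):

```python
import hashlib

def bundle_fingerprint(raw_bytes: bytes) -> str:
    """Content hash used to de-duplicate re-sent or re-uploaded documents."""
    return hashlib.sha256(raw_bytes).hexdigest()

seen: set[str] = set()  # in production: a DB table keyed by fingerprint

def is_duplicate(raw_bytes: bytes) -> bool:
    fp = bundle_fingerprint(raw_bytes)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```

Hashing the raw bytes catches exact re-sends; near-duplicates (same invoice re-scanned) need a fuzzier check downstream.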
3) Classify document type + route
Use a lightweight classifier:
Heuristics (sender, keywords like “Invoice”, “Factura”, amounts, IBAN)
ML/LLM classification as fallback
Route to an extractor specialized for:
Invoices
Receipts
Contracts
Emails (requests, approvals, complaints, support)
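The heuristic layer could look like this — keyword/regex signals per type, with everything below a threshold falling through to the ML/LLM classifier (the signal lists and threshold are illustrative):

```python
import re

# Keyword/regex signals per document type; extend per language and customer.
SIGNALS = {
    "invoice": [r"\binvoice\b", r"\bfactura\b", r"\bIBAN\b", r"\bVAT\b"],
    "receipt": [r"\breceipt\b", r"\bchange due\b", r"\bcash\b"],
    "contract": [r"\bagreement\b", r"\bhereinafter\b", r"\bparty\b"],
}

def classify(text: str, threshold: int = 2) -> str:
    """Count matching signals per type; weak scores defer to ML/LLM."""
    scores = {
        doc_type: sum(bool(re.search(p, text, re.I)) for p in patterns)
        for doc_type, patterns in SIGNALS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "needs_ml_classification"
```

Cheap, explainable, and it keeps the expensive classifier off the hot path for the easy 80%.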
4) Extract with “hybrid” methods (best results in practice)
Don’t bet everything on one technique.
For digital PDFs (text-based):
Parse text + layout (tables, key-value zones)
Use deterministic patterns for high-signal fields (VAT/IVA IDs, dates, invoice number formats, IBAN)
For scanned PDFs/images:
OCR
Then the same as above, but with lower confidence
LLM step (structured):
Ask the model to output strict JSON that matches your schema
Provide the model with:
extracted text
layout hints (tables, page headings)
instructions like “return null if missing, don’t guess”
Have the model also return citations/evidence (snippet + page, or bbox id) for each field.
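The deterministic-pattern layer for high-signal fields might start like this — a sketch; real coverage needs per-country VAT variants and stricter IBAN length rules per country code:

```python
import re

# Illustrative patterns for high-signal fields; not exhaustive.
PATTERNS = {
    "iban": re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}\b"),
    "es_vat": re.compile(r"\b(?:ES)?[A-Z]\d{8}\b|\b(?:ES)?\d{8}[A-Z]\b"),
    "iso_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def extract_high_signal(text: str) -> dict[str, list[str]]:
    """Run every deterministic pattern; LLM extraction fills the rest."""
    return {name: pat.findall(text) for name, pat in PATTERNS.items()}
```

Values found this way get high confidence by construction; the LLM output is then cross-checked against them rather than trusted blindly.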
5) Validate and score confidence
Run validators after extraction:
Invoice number present?
Totals match: sum(line_items) ≈ subtotal, subtotal + taxes ≈ total
Dates are sensible (due date ≥ issue date)
VAT/IVA format valid per country
IBAN checksum valid
Currency matches symbols
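A few of these validators sketched out — the IBAN mod-97 check is the standard ISO 7064 procedure; the tolerance on the totals check is a judgment call (rounding differs per vendor):

```python
from datetime import date

def iban_valid(iban: str) -> bool:
    """ISO 7064 mod-97: move the first four chars to the end, map
    letters to numbers (A=10 ... Z=35), and require remainder 1."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not s.isalnum():
        return False
    rearranged = s[4:] + s[:4]
    as_digits = "".join(str(int(c, 36)) for c in rearranged)
    return int(as_digits) % 97 == 1

def dates_sensible(issue_date: date, due_date: date) -> bool:
    return due_date >= issue_date

def totals_consistent(line_items, subtotal, taxes, total, tol=0.01):
    """sum(qty * unit_price) ~ subtotal and subtotal + taxes ~ total."""
    items_sum = sum(i["qty"] * i["unit_price"] for i in line_items)
    return abs(items_sum - subtotal) <= tol and abs(subtotal + taxes - total) <= tol
```

Each failed validator should be recorded by name — it both lowers the confidence score and tells the reviewer exactly what to look at.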
Compute an overall confidence score and decide automation level:
High confidence → auto-ingest
Medium → “review required”
Low → “manual entry”
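The routing decision reduces to a small function — thresholds here are placeholders to tune per customer and per document type:

```python
def route(field_confidences: dict[str, float], validators_passed: bool) -> str:
    """Gate automation on the *weakest* field, not the average:
    one wrong IBAN costs more than ten correct descriptions."""
    worst = min(field_confidences.values(), default=0.0)
    if validators_passed and worst >= 0.9:
        return "auto_ingest"
    if worst >= 0.6:
        return "review_required"
    return "manual_entry"
```

Using the minimum rather than the mean is deliberate: averaging hides a single low-confidence critical field.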
6) Human-in-the-loop review UI (where you win deals)
For medium confidence cases:
Show the document side-by-side with extracted fields
Highlight evidence snippets
One-click fix + “why” (so you can learn)
Every correction becomes training data:
vendor-specific templates
recurring line-item patterns
preferred mappings (e.g., account codes, cost centers)
7) Map to the customer’s systems
Structured output typically needs to flow into:
ERP/accounting (NetSuite, SAP, Odoo, QuickBooks, Xero)
CRM/ticketing (HubSpot, Zendesk, Jira)
Document repository / knowledge base
Use a canonical internal model → export adapters:
JSON (API)
CSV (legacy)
UBL / Factur-X / PEPPOL-like formats if needed
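The adapter layer can stay thin over the canonical dict model — a sketch for the JSON and CSV targets; UBL/Factur-X would need a dedicated serializer, not this:

```python
import csv
import io
import json

def to_json(record: dict) -> str:
    """API-facing export of the canonical record."""
    return json.dumps(record, ensure_ascii=False)

def to_csv(records: list[dict]) -> str:
    """Flat CSV for legacy imports; assumes homogeneous records."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

The point of the canonical model is that adding a new target system means adding one adapter, not touching extraction.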
8) Store as “structured + source + provenance”
Keep:
Original document (immutable)
Extracted structured record (versioned)
Evidence map (field → snippet/page/bbox)
Processing log (model version, OCR version, rules triggered)
This makes audits, dispute resolution, and debugging straightforward.
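The processing log can be one immutable row per extraction run — field names here are illustrative:

```python
from datetime import datetime, timezone

def processing_log_entry(model_version: str, ocr_version: str,
                         rules_triggered: list[str]) -> dict:
    """Append-only record of what produced a given structured record."""
    return {
        "processed_at": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "ocr_version": ocr_version,
        "rules_triggered": rules_triggered,
    }
```

When a customer disputes a value six months later, this row plus the evidence map answers "which model, which rules, which snippet" without re-running anything.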
9) Practical deployment approach
Phase 1 (2–4 weeks): one document type (invoices) + 2–3 pilot customers
Build schema, ingestion, extraction, validation, review UI, export to one target system.
Phase 2: vendor learning + email intake
Recognize repeat vendors and handle their quirks
Parse “invoice via email” workflows (threading, attachments, approvals)
Phase 3: multi-doc workflows
Purchase order ↔ invoice matching
Payment status updates
Exceptions and disputes
10) What I’d implement first (minimal but powerful)
Ingestion from email + uploads
Invoice classifier
Extraction pipeline:
PDF text extraction → LLM JSON
OCR fallback
Validation engine (math + formats)
Review UI with evidence highlights
Export to customer’s accounting system
Continuous learning from corrections
If you tell me:
which formats (Spanish “Factura” PDFs? scans?),
where the structured data needs to end up (ERP/CRM?), and
typical volume (10/day vs 10k/day),
I can sketch an architecture and a concrete “first pilot” plan with the exact fields and validators.
How to archive emails
Perfect — I’ll format this as a Seed-ready archive document, following the conventions you described:
One Email Thread document
One Email Message document per message
Attachments linked
Structured metadata
Full provenance-ready structure
Clean block structure for htmlToBlocks() or markdown import
Below is a Seed archive representation in Markdown form, ready for seed document create.
Invoice
Re: [seed-hypermedia/seed] Bug: Notifications sent you to an older document version. (Issue #256)