Hosted on loislane.hyper.media via the Hypermedia Protocol

    Some customers want our help with document processing, for example emails or invoices. We need to extract the information they contain and turn it into structured data.

    I’d treat this as an “ingestion → extraction → validation → structuring → publishing” pipeline, with humans-in-the-loop where it matters (accuracy, exceptions, and policy).

    1) Define the target “structured record”

      Start with a schema per document type (invoice, receipt, purchase order, email request, etc.). Keep it simple and extensible.

      Invoice example (core fields)

        Vendor: name, VAT ID, address

        Buyer: name, VAT ID

        Invoice: number, issue date, due date, currency

        Totals: subtotal, tax breakdown, total

        Line items: description, qty, unit price, tax rate

        Payment: IBAN, payment terms

        Provenance: source file/email id, received timestamp, page count

        Evidence: “this value came from this snippet / bbox / page”

    That last part (evidence) is crucial for trust and audits.
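
    As a rough illustration, the invoice fields above could be modelled with Python dataclasses. This is only a sketch: every class and field name here is an assumption made for the example, not an existing schema, and a real deployment would likely add per-country fields.

```python
from dataclasses import dataclass, field, asdict
from decimal import Decimal
from typing import Optional


@dataclass
class Evidence:
    """Where a value was found in the source document."""
    page: int
    snippet: str
    bbox: Optional[tuple] = None  # (x0, y0, x1, y1) when layout is available


@dataclass
class Party:
    name: Optional[str] = None
    vat_id: Optional[str] = None
    address: Optional[str] = None


@dataclass
class LineItem:
    description: str
    qty: Decimal
    unit_price: Decimal
    tax_rate: Decimal


@dataclass
class Invoice:
    vendor: Party
    buyer: Party
    number: Optional[str] = None
    issue_date: Optional[str] = None  # ISO 8601 date string
    due_date: Optional[str] = None
    currency: Optional[str] = None
    subtotal: Optional[Decimal] = None
    total: Optional[Decimal] = None
    line_items: list = field(default_factory=list)  # list of LineItem
    iban: Optional[str] = None
    payment_terms: Optional[str] = None
    source_id: Optional[str] = None  # provenance: file or email id
    evidence: dict = field(default_factory=dict)  # field name -> Evidence


invoice = Invoice(
    vendor=Party(name="Acme S.L."),
    buyer=Party(name="Example GmbH"),
    number="INV-001",
)
invoice.evidence["number"] = Evidence(page=1, snippet="Invoice No. INV-001")
record = asdict(invoice)  # plain dicts, ready for serialization
```

    Every field defaults to None, so a missing value stays visibly missing instead of being filled with a guess.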

    2) Ingest + normalize

      Inputs: PDFs, scans, email bodies, email attachments, EDI-ish PDFs, images.

      Steps:

        Collect from sources (email inbox, upload folder, API).

        Convert to a canonical “document bundle”:

          text (best-effort)

          layout (pages, blocks)

          images (per page)

          metadata (sender, dates, thread id)

        De-duplicate (hashing) and classify.
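
      A minimal sketch of the bundle and the hashing step, assuming exact byte-level duplicates are the main concern. The `DocumentBundle` and `Deduplicator` names are hypothetical; a re-scanned copy of the same invoice has different bytes and would need a fuzzier comparison on the extracted text.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DocumentBundle:
    """Canonical form of one incoming document."""
    raw: bytes                                      # original file, untouched
    text: str = ""                                  # best-effort extracted text
    pages: list = field(default_factory=list)       # layout blocks per page
    metadata: dict = field(default_factory=dict)    # sender, dates, thread id

    @property
    def content_hash(self) -> str:
        # SHA-256 of the raw bytes identifies byte-identical duplicates.
        return hashlib.sha256(self.raw).hexdigest()


class Deduplicator:
    """Remembers the hashes of documents that were already ingested."""

    def __init__(self) -> None:
        self._seen = set()

    def is_new(self, bundle: DocumentBundle) -> bool:
        digest = bundle.content_hash
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


dedup = Deduplicator()
first = DocumentBundle(raw=b"%PDF-1.7 example", metadata={"sender": "billing@example.com"})
resent = DocumentBundle(raw=b"%PDF-1.7 example", metadata={"sender": "other@example.com"})
```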

    3) Classify document type + route

      Use a lightweight classifier:

        Heuristics (sender, keywords like “Invoice”, “Factura”, amounts, IBAN)

        ML/LLM classification as fallback

      Route to an extractor specialized for:

        Invoices

        Receipts

        Contracts

        Emails (requests, approvals, complaints, support)
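
      The heuristics-first, model-as-fallback idea could look roughly like this. The keyword lists, the `min_hits` threshold and the `fallback` callable (standing in for an ML/LLM classifier) are all assumptions chosen for illustration.

```python
import re
from typing import Callable, Optional

# Hypothetical keyword patterns per document type.
KEYWORDS = {
    "invoice": [r"\binvoice\b", r"\bfactura\b", r"\bIBAN\b"],
    "receipt": [r"\breceipt\b", r"\brecibo\b"],
    "contract": [r"\bcontract\b", r"\bagreement\b"],
}


def classify(
    text: str,
    fallback: Optional[Callable[[str], str]] = None,
    min_hits: int = 2,
) -> str:
    """Return a document type, using keyword hits first and a fallback second."""
    scores = {
        doc_type: sum(1 for p in patterns if re.search(p, text, re.IGNORECASE))
        for doc_type, patterns in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] >= min_hits:
        return best
    if fallback is not None:
        return fallback(text)  # e.g. an ML or LLM classifier
    return "unknown"
```

      The returned label can then be used as a key into a table of specialized extractors.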

    4) Extract with “hybrid” methods (best results in practice)

      Don’t bet everything on one technique.

      For digital PDFs (text-based):

        Parse text + layout (tables, key-value zones)

        Use deterministic patterns for high-signal fields (VAT/IVA IDs, dates, invoice number formats, IBAN)

      For scanned PDFs/images:

        OCR

        Then apply the same parsing and pattern matching as for digital PDFs, but assign lower confidence to the results
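
      For the deterministic part, a sketch along these lines collects candidates together with the page and line they came from. The regular expressions are simplified assumptions (real invoice number formats vary by vendor), and candidates still need validation afterwards.

```python
import re

# Simplified, illustrative patterns; not exhaustive.
PATTERNS = {
    "iban": re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){3,7}(?: ?[A-Z0-9]{1,3})?\b"),
    "date": re.compile(r"\b(?:\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4})\b"),
    "invoice_number": re.compile(
        r"\b(?:invoice|factura)\s*(?:no\.?|n[ºo]\.?|number|#)?\s*:?\s*([A-Z]{0,4}-?\d[\w/-]*)",
        re.IGNORECASE,
    ),
}


def find_candidates(pages):
    """Scan page texts and return candidate values with their evidence."""
    found = []
    for page_no, text in enumerate(pages, start=1):
        for name, pattern in PATTERNS.items():
            for match in pattern.finditer(text):
                # Use the capture group when the pattern defines one.
                value = match.group(1) if pattern.groups else match.group(0)
                if name == "iban":
                    value = value.replace(" ", "")
                # Keep the whole line as the evidence snippet.
                start = text.rfind("\n", 0, match.start()) + 1
                end = text.find("\n", match.end())
                snippet = text[start:end if end != -1 else len(text)].strip()
                found.append(
                    {"field": name, "value": value, "page": page_no, "snippet": snippet}
                )
    return found


sample_page = (
    "ACME S.L.\n"
    "Invoice No. INV-2024-001\n"
    "Date: 2024-01-15\n"
    "IBAN: ES91 2100 0418 4502 0005 1332\n"
)
candidates = find_candidates([sample_page])
```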

      LLM step (structured):

        Ask the model to output strict JSON that matches your schema

        Provide the model with:

          extracted text

          layout hints (tables, page headings)

          instructions like “return null if missing, don’t guess”

        Have the model also return citations/evidence (snippet + page, or bbox id) for each field.
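
      One way to enforce “strict JSON, no guessing” is to check the model's reply before accepting it. The sketch below assumes a hypothetical reply shape in which each field is an object with `value` and `evidence` keys; it rejects unknown keys and any value whose evidence snippet does not appear in the source text.

```python
import json

# Hypothetical subset of the schema used for this example.
EXPECTED_FIELDS = {"invoice_number", "issue_date", "currency", "total"}


def parse_model_output(raw: str, source_text: str) -> dict:
    """Parse the model's JSON reply and reject anything that is not grounded."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict) or set(data) != EXPECTED_FIELDS:
        raise ValueError("output keys do not match the schema")
    for name, entry in data.items():
        if not isinstance(entry, dict) or set(entry) != {"value", "evidence"}:
            raise ValueError(f"{name}: expected 'value' and 'evidence' keys")
        if entry["value"] is None:
            continue  # a missing field is acceptable; a guess is not
        evidence = entry["evidence"]
        if not isinstance(evidence, dict) or not isinstance(evidence.get("snippet"), str):
            raise ValueError(f"{name}: value returned without evidence")
        if evidence["snippet"] not in source_text:
            raise ValueError(f"{name}: evidence snippet not found in source")
    return data
```

      Checking the snippet against the source is a cheap guard against invented values; it does not prove the value is right, only that the cited text exists.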

    5) Validate and score confidence

      Run validators after extraction:

        Invoice number present?

        Totals match: sum(line_items) ≈ subtotal, subtotal + taxes ≈ total

        Dates are sensible (due date ≥ issue date)

        VAT/IVA format valid per country

        IBAN checksum valid

        Currency code matches the symbols used in the document (e.g., EUR with €)

      Compute an overall confidence score and decide automation level:

        High confidence → auto-ingest

        Medium → “review required”

        Low → “manual entry”
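
      A few of these validators and the routing decision, sketched under simple assumptions: the IBAN check is the standard mod-97 checksum, while the confidence score (fraction of checks passed) and the 0.9 / 0.6 thresholds are placeholders that would need tuning on real data.

```python
from datetime import date
from decimal import Decimal


def iban_is_valid(iban: str) -> bool:
    """Mod-97 check: move the first four characters to the end, map letters to 10..35."""
    s = iban.replace(" ", "").upper()
    if len(s) < 15 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def totals_match(line_totals, subtotal, taxes, total, tolerance=Decimal("0.01")) -> bool:
    """sum(line_items) ≈ subtotal and subtotal + taxes ≈ total, within a rounding tolerance."""
    lines_ok = abs(sum(line_totals, Decimal("0")) - subtotal) <= tolerance
    total_ok = abs(subtotal + taxes - total) <= tolerance
    return lines_ok and total_ok


def dates_sensible(issue_date: str, due_date: str) -> bool:
    """Both dates are ISO 8601 strings; the due date must not precede the issue date."""
    return date.fromisoformat(due_date) >= date.fromisoformat(issue_date)


def route(checks: dict, high: float = 0.9, medium: float = 0.6) -> str:
    """Turn validator results into an automation level (thresholds are placeholders)."""
    if not checks:
        return "manual_entry"
    score = sum(1 for ok in checks.values() if ok) / len(checks)
    if score >= high:
        return "auto_ingest"
    if score >= medium:
        return "review_required"
    return "manual_entry"
```

      In practice some checks probably deserve more weight than others (a failed totals check is more serious than a missing payment term), so a weighted score may be a better fit than a plain fraction.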

    6) Human-in-the-loop review UI (where you win deals)

      For medium confidence cases:

        Show the document side-by-side with extracted fields

        Highlight evidence snippets

        One-click fix plus a “why” note (so the system can learn from the correction)

      Every correction becomes training data:

        vendor-specific templates

        recurring line-item patterns

        preferred mappings (e.g., account codes, cost centers)
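
      As a very small illustration of learning from corrections, reviewer fixes can be counted per vendor and field, and the most frequent value suggested next time. The `Correction` and `VendorMemory` names and the `min_support` threshold are assumptions; this is a frequency table, not a trained model.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Correction:
    """One reviewer fix, including the optional 'why' note."""
    vendor_id: str
    field: str
    extracted: Optional[str]
    corrected: str
    reason: str = ""


class VendorMemory:
    """Remembers which corrected value reviewers chose per vendor and field."""

    def __init__(self) -> None:
        self._counts = defaultdict(Counter)

    def record(self, correction: Correction) -> None:
        key = (correction.vendor_id, correction.field)
        self._counts[key][correction.corrected] += 1

    def suggest(self, vendor_id: str, field: str, min_support: int = 2) -> Optional[str]:
        """Suggest a value only after it has been chosen at least min_support times."""
        counts = self._counts.get((vendor_id, field))
        if not counts:
            return None
        value, seen = counts.most_common(1)[0]
        return value if seen >= min_support else None


memory = VendorMemory()
memory.record(Correction("acme", "cost_center", None, "CC-100", "always marketing"))
memory.record(Correction("acme", "cost_center", "CC-200", "CC-100"))
```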

    7) Map to the customer’s systems

      Structured output typically needs to flow into:

        ERP/accounting (NetSuite, SAP, Odoo, QuickBooks, Xero)

        CRM/ticketing (HubSpot, Zendesk, Jira)

        Document repository / knowledge base

      Use a canonical internal model → export adapters:

        JSON (API)

        CSV (legacy)

        UBL / Factur-X / PEPPOL-like formats if needed
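
      The “canonical model → export adapters” idea can be sketched as a table of small functions, one per target format. The record layout and column names below are assumptions for the example; UBL or Factur-X adapters would plug into the same table but are not shown.

```python
import csv
import io
import json

CSV_COLUMNS = ["invoice_number", "currency", "description", "qty", "unit_price"]


def to_json(record: dict) -> str:
    """JSON adapter for API consumers."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


def to_csv(record: dict) -> str:
    """CSV adapter for legacy imports: one row per line item."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for item in record["line_items"]:
        writer.writerow(
            {
                "invoice_number": record["invoice_number"],
                "currency": record["currency"],
                "description": item["description"],
                "qty": item["qty"],
                "unit_price": item["unit_price"],
            }
        )
    return buffer.getvalue()


EXPORTERS = {"json": to_json, "csv": to_csv}


def export(record: dict, fmt: str) -> str:
    """Look up the adapter for a format and apply it to the canonical record."""
    if fmt not in EXPORTERS:
        raise ValueError(f"no exporter registered for {fmt!r}")
    return EXPORTERS[fmt](record)


canonical = {
    "invoice_number": "INV-001",
    "currency": "EUR",
    "line_items": [{"description": "Widget", "qty": 2, "unit_price": "9.50"}],
}
```

      Keeping the adapters separate from extraction means a new target system only needs a new entry in the table.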

    8) Store as “structured + source + provenance”

      Keep:

        Original document (immutable)

        Extracted structured record (versioned)

        Evidence map (field → snippet/page/bbox)

        Processing log (model version, OCR version, rules triggered)

      This makes audits, dispute resolution, and debugging straightforward.
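
      An in-memory sketch of that storage layout, assuming the original is keyed by its SHA-256 hash and each save appends a new version. `RecordStore` and its method names are hypothetical; a real system would back this with a database or object store.

```python
import copy
import hashlib
from datetime import datetime, timezone


class RecordStore:
    """Keeps the original bytes immutable and appends versioned records."""

    def __init__(self) -> None:
        self._originals = {}  # doc_id -> original bytes
        self._versions = {}   # doc_id -> list of version dicts

    def add_original(self, raw: bytes) -> str:
        doc_id = hashlib.sha256(raw).hexdigest()
        self._originals.setdefault(doc_id, raw)  # never overwritten
        return doc_id

    def save_version(self, doc_id: str, record: dict, evidence: dict, log: dict) -> int:
        if doc_id not in self._originals:
            raise KeyError(f"unknown document {doc_id}")
        versions = self._versions.setdefault(doc_id, [])
        versions.append(
            {
                "version": len(versions) + 1,
                "saved_at": datetime.now(timezone.utc).isoformat(),
                "record": copy.deepcopy(record),      # structured fields
                "evidence": copy.deepcopy(evidence),  # field -> snippet/page/bbox
                "log": copy.deepcopy(log),            # model version, OCR version, rules
            }
        )
        return len(versions)

    def latest(self, doc_id: str) -> dict:
        return self._versions[doc_id][-1]

    def history(self, doc_id: str) -> list:
        return list(self._versions.get(doc_id, []))
```

      Because earlier versions are never modified, a reviewer's correction shows up as a new version next to the machine output it replaced.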

    9) Practical deployment approach

      Phase 1 (2–4 weeks): one document type (invoices) + 2–3 pilot customers

        Build schema, ingestion, extraction, validation, review UI, export to one target system.

      Phase 2: vendor learning + email intake

        Recognize repeat vendors and handle their quirks

        Parse “invoice via email” workflows (threading, attachments, approvals)

      Phase 3: multi-doc workflows

        Purchase order ↔ invoice matching

        Payment status updates

        Exceptions and disputes

    10) What I’d implement first (minimal but powerful)

      Ingestion from email + uploads

      Invoice classifier

      Extraction pipeline:

        PDF text extraction → LLM JSON

        OCR fallback

      Validation engine (math + formats)

      Review UI with evidence highlights

      Export to customer’s accounting system

      Continuous learning from corrections

      If you tell me:

        which formats (Spanish “Factura” PDFs? scans?),

        where the structured data needs to end up (ERP/CRM?), and

        typical volume (10/day vs 10k/day),

      I can sketch an architecture and a concrete “first pilot” plan with the exact fields and validators.

    How to archive emails

      Perfect — I’ll format this as a Seed-ready archive document, following the conventions you described:

        One Email Thread document

        One Email Message document per message

        Attachments linked

        Structured metadata

        Full provenance-ready structure

        Clean block structure for htmlToBlocks() or markdown import

      Below is a Seed archive representation in Markdown form, ready for seed document create.

    Invoice

    Re: [seed-hypermedia/seed] Bug: Notifications sent you to an older document version. (Issue #256)