How to Extract and Store Invoice Data in a Knowledge Collection

Process incoming invoices with AI and make them searchable in your knowledge base.

What This Integration Does

Invoice processing usually means one of two things: tedious manual data entry, or a brittle template-based extraction tool that breaks when a vendor changes their letterhead. In Spojit you can instead build a workflow that does three things. It uses a Connector node in Agent mode to pull structured fields from any invoice format, writes those fields to a database for accounting and reporting, and embeds the original document into a Knowledge collection so that accounts-payable staff can later search past invoices by free-text query.

A Mailhook trigger starts a run the moment an invoice email arrives. The run fetches the PDF attachment bytes, extracts the text, and passes it to a Connector node in Agent mode, which returns a structured object (vendor, total, line items, due date) under a Response Schema. That object is written to your database, and the original PDF is embedded into the Knowledge collection. Each run processes one email, so a re-sent invoice starts a new run: guard against duplicate rows with a unique key (see Common Pitfalls), and note that re-embedding the same file name overwrites the prior copy in the collection.

This tutorial uses a persistent Knowledge collection because the goal is a searchable invoice archive that lasts. If you only need to extract fields from one document and do not want it in any long-lived collection, use a Transient Knowledge collection instead: see How to Create NetSuite Sales Orders from Emailed PO PDFs for the transient pattern.

Prerequisites

A Mailhook trigger so invoice emails push a run directly to a Spojit address (no mailbox connection needed). Point your vendors or a forwarding rule at the generated @mailhook.spojit.com address.
A mysql or mongodb connection for the structured invoice store.
A persistent Knowledge collection (e.g. invoice-archive) created in the Knowledge section of the sidebar, for full-text indexing.
An AI model available to your workspace, used by the Connector node in Agent mode for structured extraction.

Step 1: Mailhook Trigger and Attachment Bytes

Set the Trigger node's Trigger Type to Mailhook, set an optional address prefix (for example invoices), then click Generate email address and point your vendors or a forwarding rule at the result. Add an optional Subject regex or From allowlist to filter out unwanted mail. The trigger output is available as {{ input }} and includes subject, from, receivedAt, and an attachments list of references.

Add an Attachment node (it requires the Mailhook trigger) to fetch the actual PDF bytes. Set Mode to Single, set the Content type filter to application/pdf or a Filename pattern of *.pdf, and turn on Fail if no attachment matches. Its output is { filename, contentType, size, content }, where content is the base64 PDF you reuse downstream.
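If you want to inspect an attachment outside the workflow (for example in a local test harness), the output shape above can be decoded as follows. This is a minimal sketch: decode_pdf_attachment is a hypothetical helper name, and the sample dict simply mirrors the { filename, contentType, size, content } shape described above.

```python
import base64


def decode_pdf_attachment(attachment: dict) -> bytes:
    """Decode the base64 `content` field and check that it looks like a PDF."""
    content_type = attachment.get("contentType")
    if content_type != "application/pdf":
        raise ValueError(f"unexpected content type: {content_type!r}")
    # validate=True rejects characters outside the base64 alphabet
    raw = base64.b64decode(attachment["content"], validate=True)
    if not raw.startswith(b"%PDF-"):
        raise ValueError("attachment does not start with a PDF header")
    return raw


# Sample object in the same shape as the Attachment node output (values invented).
sample = {
    "filename": "inv-1001.pdf",
    "contentType": "application/pdf",
    "size": 9,
    "content": base64.b64encode(b"%PDF-1.7\n").decode("ascii"),
}
pdf_bytes = decode_pdf_attachment(sample)
```

Inside the workflow you do not need to decode anything: the downstream nodes take the base64 string directly.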

Step 2: Connector - Extract Invoice Text

Add a Connector node in Direct mode pointing at the pdf connector. Call get-info first to confirm the PDF is valid and grab its page count, then call extract-text for the body, passing the base64 {{ attachment.content }} from Step 1. Follow it with a Condition node that aborts cleanly on encrypted or zero-page documents: those should be sent to a manual-review queue, not the indexed collection.
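The Condition logic above can be prototyped as a small function. This is a sketch only: the field names encrypted and pageCount are hypothetical stand-ins for whatever get-info returns in your workspace, and the 20-character threshold for "blank" text is an arbitrary assumption.

```python
from typing import Optional


def manual_review_reason(info: dict, text: str, min_chars: int = 20) -> Optional[str]:
    """Return a reason to divert the PDF to manual review, or None to continue.

    `encrypted` and `pageCount` are hypothetical field names for the get-info
    result; adjust them to the actual output of the pdf connector.
    """
    if info.get("encrypted"):
        return "encrypted PDF"
    if not info.get("pageCount"):
        return "zero-page or unreadable PDF"
    if len(text.strip()) < min_chars:
        # Scanned, image-only invoices typically extract as blank text.
        return "no extractable text (likely a scanned image)"
    return None
```

A None result means the run can proceed to structured extraction; any other value is the message to attach to the manual-review item.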

Step 3: Connector in Agent Mode - Structured Extraction

Add a Connector node in Agent mode and define a Response Schema to force structured JSON output. The prompt should pin the model to your schema and forbid invented fields:

Extract the following fields from this invoice text. Return JSON
matching the schema exactly. If a field isn't present, use null.

Schema:
{
  "vendor": string,
  "invoiceNumber": string,
  "issueDate": string (YYYY-MM-DD),
  "dueDate": string (YYYY-MM-DD),
  "currency": string (ISO 4217),
  "subtotal": number,
  "tax": number,
  "total": number,
  "lineItems": [{ "description": string, "qty": number, "unitPrice": number, "amount": number }]
}

Invoice text:
{{ pdfText.text }}

Although the Response Schema already constrains the output shape, you can additionally validate the result with the json connector's validate tool against the same schema. Route any validation failure to the Human approval step in Step 6 rather than letting it corrupt downstream tables.
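To make the validation rules concrete, here is a minimal sketch of the same checks in plain Python. It is not the json connector's implementation: validate_invoice and the spec dictionaries are hypothetical names, and the rules (null allowed for any field, no extra fields, strict YYYY-MM-DD dates, three-letter uppercase currency codes) are read off the prompt schema above.

```python
import re
from datetime import date

NUMBER = (int, float)
INVOICE_SPEC = {
    "vendor": str, "invoiceNumber": str, "issueDate": str, "dueDate": str,
    "currency": str, "subtotal": NUMBER, "tax": NUMBER, "total": NUMBER,
    "lineItems": list,
}
LINE_SPEC = {"description": str, "qty": NUMBER, "unitPrice": NUMBER, "amount": NUMBER}


def _is_iso_date(value: str) -> bool:
    """True only for a real calendar date written exactly as YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def _check_fields(obj, spec: dict, path: str, errors: list) -> None:
    if not isinstance(obj, dict):
        errors.append(f"{path.rstrip('.') or 'invoice'}: expected an object")
        return
    for key in sorted(set(obj) - set(spec)):
        errors.append(f"{path}{key}: unexpected field")
    for key, expected in spec.items():
        if key not in obj:
            errors.append(f"{path}{key}: missing")
        elif obj[key] is None:
            continue  # the prompt asks for null when a field is absent
        elif isinstance(obj[key], bool) or not isinstance(obj[key], expected):
            errors.append(f"{path}{key}: wrong type")


def validate_invoice(invoice) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: list = []
    _check_fields(invoice, INVOICE_SPEC, "", errors)
    if not isinstance(invoice, dict):
        return errors
    for field in ("issueDate", "dueDate"):
        value = invoice.get(field)
        if isinstance(value, str) and not _is_iso_date(value):
            errors.append(f"{field}: expected YYYY-MM-DD")
    currency = invoice.get("currency")
    if isinstance(currency, str) and not re.fullmatch(r"[A-Z]{3}", currency):
        errors.append("currency: expected a three-letter code")
    items = invoice.get("lineItems")
    if isinstance(items, list):
        for index, item in enumerate(items):
            _check_fields(item, LINE_SPEC, f"lineItems[{index}].", errors)
    return errors
```

An empty result lets the run continue; a non-empty list is a natural payload for the Human node's Message.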

Step 4: Store the Structured Data

For a relational store, add a Connector node in Direct mode on the mysql connector with the insert-rows tool, writing the top-level invoice fields to an invoices table and the line items to an invoice_lines table keyed on invoiceId. For a document store, use the mongodb connector with the insert-documents tool and write the full structured object as one document. In either case, build an invoiceId (for example from vendor plus invoice number) so the structured row and the indexed document share the same reference. Reuse that invoiceId as the Knowledge File Name in the next step.
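One way to derive the invoiceId is to slug the vendor and invoice number so that casing and spacing differences in a resend still map to the same reference. This is a sketch under that assumption: build_invoice_id and the underscore separator are illustrative choices, not something Spojit prescribes.

```python
import re
import unicodedata


def _slug(text: str) -> str:
    """Lowercase, strip accents, and collapse non-alphanumeric runs to '-'."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def build_invoice_id(vendor: str, invoice_number: str) -> str:
    """Build a stable reference shared by the database row and the Knowledge file."""
    vendor_part, number_part = _slug(vendor), _slug(invoice_number)
    if not vendor_part or not number_part:
        raise ValueError("vendor and invoice number are both required")
    return f"{vendor_part}_{number_part}"
```

For example, build_invoice_id("Acme Supplies Pty Ltd", "INV-1001") gives acme-supplies-pty-ltd_inv-1001. The trade-off is that invoice numbers differing only in punctuation (INV/1001 versus INV-1001) collapse to the same ID, which is usually what you want for resends but is worth checking against your vendors' numbering.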

Step 5: Knowledge Node - Embed the Original

Add a Knowledge node in Embed mode. Set Collection to your persistent invoice-archive collection, set Document Type to PDF, and set Document Input to the base64 attachment from Step 1:

Collection:      invoice-archive
Document Type:   PDF
File Name:       {{ invoiceId }}
Document Input:  {{ attachment.content }}

Setting File Name to the same invoiceId ties the embedded document to the structured row. Because Embed mode overwrites when a file name already exists, re-running the same invoice after a correction replaces the prior copy cleanly. The Output Variable reports the chunk count once embedding finishes.

Step 6: Human Approval for Low-Confidence Extractions

Although it is described last, this gate sits before the writes in Steps 4 and 5. Add a Condition node that routes invoices to a Human node when key fields are null, totals do not reconcile (subtotal + tax != total), or the vendor is not on your allow-list. Configure the Human node with a clear Message that includes the extracted values via {{ }} variables and an Approval slot for your accounts-payable reviewer. If the reviewer approves, the run continues to Steps 4 and 5; if they reject, the run halts before anything is written, so the bad extraction never reaches your tables or the collection. The Human node approves or rejects only: it does not edit fields and there is no on-reject branch, so for corrections, fix the source data and re-send the invoice to the Mailhook address.
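The three routing rules can be sketched as one function that returns the reasons an invoice needs review. The names here are hypothetical: review_reasons, the choice of KEY_FIELDS, and the one-cent tolerance are assumptions to adapt, not Spojit settings.

```python
from decimal import Decimal

# Which fields count as "key" is an assumption; adjust to your own policy.
KEY_FIELDS = ("vendor", "invoiceNumber", "dueDate", "subtotal", "tax", "total")


def review_reasons(invoice: dict, vendor_allow_list, tolerance=Decimal("0.01")) -> list:
    """Return the reasons to route this invoice to the Human node (empty = none)."""
    reasons = []

    missing = [field for field in KEY_FIELDS if invoice.get(field) is None]
    if missing:
        reasons.append("missing key fields: " + ", ".join(missing))

    amounts = [invoice.get(name) for name in ("subtotal", "tax", "total")]
    if all(amount is not None for amount in amounts):
        # Go through str() so binary float noise (0.1 + 0.2) does not trip the check.
        subtotal, tax, total = (Decimal(str(amount)) for amount in amounts)
        if abs(subtotal + tax - total) > tolerance:
            reasons.append(f"totals do not reconcile: {subtotal} + {tax} != {total}")

    allowed = {name.strip().lower() for name in vendor_allow_list}
    vendor = invoice.get("vendor")
    if vendor is not None and vendor.strip().lower() not in allowed:
        reasons.append(f"vendor not on allow-list: {vendor}")

    return reasons
```

An empty list means the invoice can skip the Human node; otherwise the joined reasons make a useful Message for the reviewer.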

Tips

Validate math - add a reconciliation check (line-item sum == subtotal, subtotal + tax == total) before writing. Most extraction errors fail this check first.
Normalize vendor names - run extracted vendor names through a small fuzzy match against your existing vendor list so the same supplier doesn't appear three different ways.
Currency hygiene - some vendors omit the currency symbol. Fall back to a workspace-level default (e.g. AUD) and flag any mismatch for human review.
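For the vendor-name tip, a small fuzzy match can be sketched with Python's standard difflib module. normalize_vendor is a hypothetical helper, and the 0.8 similarity cutoff is an assumption you would tune against your own vendor list.

```python
import difflib
from typing import Optional


def normalize_vendor(extracted: str, known_vendors, cutoff: float = 0.8) -> Optional[str]:
    """Map an extracted vendor name to its canonical spelling, or None if no close match."""
    # Compare case-insensitively but return the canonical spelling from the list.
    lookup = {name.casefold(): name for name in known_vendors}
    matches = difflib.get_close_matches(
        extracted.strip().casefold(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None
```

A None result is a good signal to send the invoice to human review rather than silently creating a new vendor spelling.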

Common Pitfalls

Scanned (image-only) PDFs - scanned invoices have no text layer, so text extraction returns blank text. Always check for empty extraction and route to an OCR path or human entry instead of indexing nothing.
Date formats - vendors use DD/MM/YYYY, MM/DD/YYYY, and named months interchangeably. The structured-output prompt should enforce YYYY-MM-DD and reject ambiguous values.
Duplicate invoices - resends with the same vendor and invoice number are common. Make (vendor, invoiceNumber) a unique key on the database table so duplicates fail loudly.
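The date-format pitfall can also be guarded in code after extraction. The sketch below normalizes a few common shapes to YYYY-MM-DD and returns None when a numeric date could be read either way; normalize_invoice_date is a hypothetical helper, and named months are parsed in English only (it assumes the default C locale).

```python
import re
from datetime import date, datetime
from typing import Optional

NAMED_MONTH_FORMATS = ("%d %B %Y", "%d %b %Y", "%B %d, %Y", "%b %d, %Y")


def normalize_invoice_date(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if it is invalid or ambiguous."""
    text = raw.strip()
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        try:
            return date.fromisoformat(text).isoformat()
        except ValueError:
            return None
    for fmt in NAMED_MONTH_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    match = re.fullmatch(r"(\d{1,2})[/.](\d{1,2})[/.](\d{4})", text)
    if not match:
        return None
    first, second, year = (int(group) for group in match.groups())
    if first == second:
        day, month = first, second      # 05/05/2024 reads the same either way
    elif first > 12 and second <= 12:
        day, month = first, second      # must be DD/MM/YYYY
    elif second > 12 and first <= 12:
        day, month = second, first      # must be MM/DD/YYYY
    else:
        return None                     # e.g. 03/04/2024: cannot tell day from month
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return None
```

Treat a None result like any other null key field and route the invoice to review rather than guessing.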

Testing

Run five real invoices from different vendors through the workflow. For each, verify the row in the database matches the original PDF field-for-field, and confirm a Knowledge query like "what did Acme Supplies bill us for in March?" returns the right document. Then deliberately feed in a malformed invoice and confirm it lands in the Human review queue without poisoning the index.