How to Extract and Store Invoice Data in a Knowledge Collection

Process incoming invoices with AI and make them searchable in your knowledge base.

What This Integration Does

Invoice processing usually means one of two things: tedious manual data entry, or a brittle template-based extraction tool that breaks when a vendor changes their letterhead. This workflow uses an AI Agent to pull structured fields from any invoice format, writes those fields to a relational store for accounting and reporting, and embeds the original document into a Knowledge collection so AP staff can search past invoices by free-text query later.

Each invoice flows through three coordinated stages. First, the PDF is parsed to raw text. Second, the AI Agent extracts a structured object (vendor, total, line items, due date). Third, that object goes into a database while the full invoice text goes into the Knowledge collection. The two outputs stay linked via a shared invoiceId so a search hit can drill back into the structured row, and vice versa.

This tutorial uses a persistent Knowledge collection because the goal is a searchable invoice archive that lasts. If you only need to extract fields from one document and don't want it in any long-lived collection, use a Transient Knowledge collection instead - see How to Create NetSuite Sales Orders from Emailed PO PDFs for the transient pattern.

Prerequisites

  • A trigger source: an Email Trigger watching invoices@yourcompany.com, a Webhook for upload tools, or a scheduled ftp poll for vendor drops.
  • The pdf utility connector for text extraction.
  • A mysql or mongodb connection for the structured invoice store.
  • A Knowledge collection (e.g. invoice-archive) for full-text indexing.
  • Access to the AI Agent for structured-output extraction.

Step 1: Trigger and Capture the PDF

Add a Trigger node and configure it for your preferred source. For email, filter on attachments with a .pdf extension and on senders matching your vendor allow-list. The trigger needs to expose filename, bytes, and a receivedAt timestamp so downstream steps can tag the resulting records.

Step 2: Connector - Extract Invoice Text

Add a Connector node pointing at the pdf connector. Call get-info first to confirm the PDF is valid and grab its page count, then call extract-text for the body. Wrap the pair in a Condition that aborts cleanly on encrypted or zero-page documents - those should land in a manual-review queue, not the indexed collection.

Step 3: AI Agent - Structured Extraction

Add a Connector node that invokes the AI Agent with a structured-output schema. The prompt should pin the model to your schema and forbid hallucinated fields:

Extract the following fields from this invoice text. Return JSON
matching the schema exactly. If a field isn't present, use null.

Schema:
{
  "vendor": string,
  "invoiceNumber": string,
  "issueDate": string (YYYY-MM-DD),
  "dueDate": string (YYYY-MM-DD),
  "currency": string (ISO 4217),
  "subtotal": number,
  "tax": number,
  "total": number,
  "lineItems": [{ "description": string, "qty": number, "unitPrice": number, "amount": number }]
}

Invoice text:
{{ invoiceText }}

Set the temperature to 0.1 for stable extraction. Validate the result with json validate against the same schema; any failure should route to a Human review queue rather than corrupt downstream tables.

Step 4: Store the Structured Data

For a relational store, add a Connector node on the mysql connector with the insert-rows tool, writing the top-level invoice fields to an invoices table and the line items to an invoice_lines table keyed on invoiceId. For a document store, use mongodb insert-documents with the full structured object as one document. In either case, generate invoiceId up front (the same ID is used in the next step) so the structured row and the indexed document can be cross-referenced.

Step 5: Knowledge Node - Embed the Original

Add a Knowledge node in embed mode targeting the invoice-archive collection. Compose a small metadata header so retrieval has hooks to match against:

Vendor: {{ extracted.vendor }}
Invoice #: {{ extracted.invoiceNumber }}
Total: {{ extracted.total }} {{ extracted.currency }}
Issue Date: {{ extracted.issueDate }}

{{ invoiceText }}

Set sourceId to the invoiceId from Step 4. Re-indexing the same invoice (after a correction in Human review) replaces the prior copy cleanly.

Step 6: Human Review for Low-Confidence Extractions

Add a Condition node that routes invoices to a Human node when key fields are null, totals don't reconcile (subtotal + tax != total), or the vendor isn't in the allow-list. The Human node should show the extracted JSON next to the original PDF and let the reviewer correct fields before the workflow continues to Step 4 and 5. Approved corrections write back through the same steps, replacing the prior record.

Tips

  • Validate math - add a reconciliation check (line-item sum == subtotal, subtotal + tax == total) before writing. Most extraction errors fail this check first.
  • Normalize vendor names - run extracted vendor names through a small fuzzy match against your existing vendor list so the same supplier doesn't appear three different ways.
  • Currency hygiene - some vendors omit the currency symbol. Default to a workspace-level default (e.g. AUD) and flag any mismatch for human review.

Common Pitfalls

  • OCR-only PDFs - scanned invoices return blank text. Always check for empty extraction and route to an OCR path or human entry instead of indexing nothing.
  • Date formats - vendors use DD/MM/YYYY, MM/DD/YYYY, and named months interchangeably. The structured-output prompt should enforce YYYY-MM-DD and reject ambiguous values.
  • Duplicate invoices - same vendor, same invoice number resends are common. Make (vendor, invoiceNumber) a unique key on the database table so duplicates fail loudly.

Testing

Run five real invoices from different vendors through the workflow. For each, verify the row in the database matches the original PDF field-for-field, and confirm a Knowledge query like "what did Acme Supplies bill us for in March?" returns the right document. Then deliberately feed in a malformed invoice and confirm it lands in the Human review queue without poisoning the index.

Learn More

Did this answer your question? Thanks for the feedback There was a problem submitting your feedback. Please try again later.