How to Extract Invoice Data with PDF Tools and AI

Extract structured data from PDF invoices using PDF Tools and AI.

What This Integration Does

Vendor invoices are inconsistent: every supplier has its own layout, fonts, and field order. Template-based PDF parsing breaks the moment a vendor redesigns its invoice. This workflow pairs the pdf connector, which extracts the text, with an AI Agent that reads the document by meaning rather than by position, so a new template still produces a clean record with vendor, totals, dates, and line items in the same shape as every other invoice.

The workflow takes one PDF (from email, FTP, or a webhook upload), extracts the text, runs it through the AI Agent with a strict schema, validates the result, and writes it to your AP system or a database. Failures are caught and routed to a human review queue, so accounts payable always knows when to look.

This approach uses the pdf connector for raw text extraction. For an alternative that reasons over the PDF directly via the Knowledge Base in Transient mode (no PDF text extraction step needed), see How to Create NetSuite Sales Orders from Emailed PO PDFs.

Prerequisites

  • The pdf connector available in your workspace.
  • An AI Agent enabled in Spojit.
  • A destination for structured invoices: netsuite, mongodb, mysql, or a webhook into your AP system.
  • Optionally, an ftp or Email trigger source.

Step 1: Receive the Invoice

Drop a Trigger node. Use the Email sub-type for vendor inboxes, a Webhook for a portal upload, or a Schedule trigger that polls ftp list-directory on a shared drop folder followed by download-file for new PDFs.

Step 2: Extract Text from the PDF

Add a Connector node pointing at the pdf connector. Call get-info first to see how many pages the document has, then extract-text to pull the textual content. For multi-page invoices where the totals appear on the first page or two, use extract-pages to grab pages 1-2 and skip the pages of fine print.

Step 3: Clean the Raw Text

PDF text often arrives with irregular whitespace, mid-word line breaks, and column noise. Add a Connector node calling the text connector with trim, then replace with a regex to collapse multiple spaces. Cleaner input gives the AI Agent less noise to misread, which generally improves extraction quality.
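
For reference, the same cleanup can be sketched outside the workflow in plain Python. This is an illustration of the trim-and-collapse idea, not the text connector's implementation; the function name clean_pdf_text and the rule for rejoining hyphenated words are assumptions.

```python
import re


def clean_pdf_text(raw: str) -> str:
    """Normalise whitespace in text extracted from a PDF (illustrative sketch)."""
    # Use a single newline convention.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words hyphenated across a line break ("amo-\nunt" -> "amount").
    # Restricted to lowercase letters so identifiers such as "PO-\n1234" survive.
    text = re.sub(r"([a-z])-\n([a-z])", r"\1\2", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim each line, then collapse three or more newlines into one blank line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Line breaks are kept on purpose: column and row boundaries often carry meaning on an invoice, so only horizontal whitespace is collapsed.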

Step 4: Extract Structured Fields with the AI Agent

Add a Connector node in Agent Mode with Structured Output. Use a model capable enough (Sonnet works well) that line-item arithmetic is reliable, and give it a prompt like this:

Extract the invoice fields below from the document text. Use ISO 8601 for
dates and ISO 4217 for currency codes. Do not invent fields that are not
in the source. Line items must sum to (total - tax) within 1 cent.

Document:
{{ cleanText }}

Schema:

{
  "vendorName":    { "type": "string" },
  "vendorAddress": { "type": "string" },
  "invoiceNumber": { "type": "string" },
  "poNumber":      { "type": "string" },
  "invoiceDate":   { "type": "string" },
  "dueDate":       { "type": "string" },
  "currency":      { "type": "string" },
  "subtotal":      { "type": "number" },
  "tax":           { "type": "number" },
  "total":         { "type": "number" },
  "lineItems": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "description": { "type": "string" },
        "qty":         { "type": "number" },
        "unitPrice":   { "type": "number" },
        "amount":      { "type": "number" }
      },
      "required": ["description", "amount"]
    }
  }
}
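
Before moving on to validation, it can help to confirm that the agent's output actually has this shape. The following is a minimal Python sketch using only the standard library; the function names are hypothetical, and it assumes that top-level fields are optional, because the schema above marks only description and amount as required.

```python
from numbers import Real

STRING_FIELDS = ("vendorName", "vendorAddress", "invoiceNumber", "poNumber",
                 "invoiceDate", "dueDate", "currency")
NUMBER_FIELDS = ("subtotal", "tax", "total")


def is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def schema_errors(invoice: dict) -> list:
    """Return a list of human-readable shape problems; empty means OK."""
    errors = []
    for name in STRING_FIELDS:
        if name in invoice and not isinstance(invoice[name], str):
            errors.append(f"{name}: expected string")
    for name in NUMBER_FIELDS:
        if name in invoice and not is_number(invoice[name]):
            errors.append(f"{name}: expected number")
    items = invoice.get("lineItems", [])
    if not isinstance(items, list):
        return errors + ["lineItems: expected array"]
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            errors.append(f"lineItems[{i}]: expected object")
            continue
        for required in ("description", "amount"):
            if required not in item:
                errors.append(f"lineItems[{i}].{required}: missing")
        if "description" in item and not isinstance(item["description"], str):
            errors.append(f"lineItems[{i}].description: expected string")
        for name in ("qty", "unitPrice", "amount"):
            if name in item and not is_number(item[name]):
                errors.append(f"lineItems[{i}].{name}: expected number")
    return errors
```

A non-empty result is a reasonable trigger for the human review queue, since a record that fails the shape check will also fail the checks in the next step.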

Step 5: Validate

Add three checks:

  • validation connector with iso-date on invoiceDate and dueDate.
  • math connector with sum across lineItems[].amount, then a Condition asserting the sum is within 1 cent of total - tax.
  • A duplicate check - hash the file with the encoding connector's hash-sha256 and look up mongodb count-documents on that hash to skip invoices you've already processed.
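
The three checks above can be sketched in plain Python to make the logic explicit. This is an illustration under stated assumptions, not the connectors' implementation: the function names are hypothetical, and the set of seen hashes stands in for the mongodb lookup.

```python
import hashlib
from datetime import date
from decimal import Decimal


def is_iso_date(value) -> bool:
    """True if value parses as an ISO 8601 calendar date such as 2024-01-31."""
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


def line_items_reconcile(invoice: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """True if the line item amounts sum to (total - tax) within the tolerance."""
    # Decimal(str(x)) avoids binary floating-point drift on currency values.
    items_sum = sum(
        (Decimal(str(item["amount"])) for item in invoice["lineItems"]),
        Decimal("0"),
    )
    expected = Decimal(str(invoice["total"])) - Decimal(str(invoice["tax"]))
    return abs(items_sum - expected) <= tolerance


def file_sha256(pdf_bytes: bytes) -> str:
    """Hex SHA-256 digest of the raw PDF bytes."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def is_duplicate(file_hash: str, seen_hashes: set) -> bool:
    # seen_hashes stands in for a database lookup on the stored hash.
    return file_hash in seen_hashes
```

The sum uses signed amounts, so a negative line item for a discount or credit reconciles naturally as long as total and tax reflect it.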

Step 6: Write the Record and Archive the Source

Fan out with a Parallel node:

  • Write the structured invoice via netsuite create-record (vendor bill), mongodb insert-documents, or your AP system's REST endpoint via http http-post.
  • Archive the source PDF with ftp upload-file for audit retention.
  • Post a one-line summary to slack send-message in the AP channel.
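
As a rough mental model of the fan-out, the three branches are independent and a failure in one should not hide the outcome of the others. The Python sketch below illustrates that idea with a thread pool; it does not describe how the Parallel node is implemented, and fan_out and the task names in the usage note are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(tasks: dict) -> dict:
    """Run independent zero-argument callables concurrently.

    Returns a dict mapping each task name to ("ok", result) or ("error", message).
    """
    outcomes = {}
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = {name: pool.submit(fn) for name, fn in tasks.items()}
        for name, future in futures.items():
            try:
                outcomes[name] = ("ok", future.result())
            except Exception as exc:
                # Record the failure instead of aborting the other branches.
                outcomes[name] = ("error", str(exc))
    return outcomes
```

Called with entries such as "write", "archive", and "notify", this returns one outcome per branch, which makes it straightforward to send only the failed branch to review.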

Tips

  • Stamp each record with the source filename and SHA-256 hash - it makes audits painless.
  • For vendors that consistently use the same layout, build a vendor-specific prompt variant via a Condition node keyed on vendor name. Accuracy goes up noticeably.
  • When the AI returns a value with low confidence, route to Human review rather than auto-posting to your books.
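
The first two tips can be sketched in Python as follows. Everything named here is hypothetical: the field names sourceFile and sourceSha256, the vendor note, and the assumption that the vendor name is already known before extraction (for example from the sender address).

```python
import hashlib

BASE_PROMPT = "Extract the invoice fields below from the document text."

# Hypothetical vendor notes; real entries would come from your own vendors.
VENDOR_NOTES = {
    "acme supplies": "Totals are in the box at the bottom right of page 1.",
}


def normalise_vendor(name: str) -> str:
    # Lowercase and collapse whitespace so lookups tolerate formatting noise.
    return " ".join(name.lower().split())


def prompt_for(vendor_name: str) -> str:
    """Return the base prompt, plus a vendor-specific note when one exists."""
    note = VENDOR_NOTES.get(normalise_vendor(vendor_name))
    if note is None:
        return BASE_PROMPT
    return f"{BASE_PROMPT}\n\nVendor-specific note: {note}"


def stamp(record: dict, filename: str, pdf_bytes: bytes) -> dict:
    """Return a copy of the record with source filename and SHA-256 attached."""
    return {
        **record,
        "sourceFile": filename,
        "sourceSha256": hashlib.sha256(pdf_bytes).hexdigest(),
    }
```

Keeping the base prompt unchanged and appending a note means unknown vendors still go through the general path.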

Common Pitfalls

  • Scanned PDFs - extract-text returns empty on image-only PDFs. Detect this and route to an OCR step; don't let an empty extraction silently produce an empty invoice.
  • Multi-currency - invoices in EUR or GBP may arrive in the same workflow as USD ones. Always capture currency and never assume USD downstream.
  • Discounts and credits - some invoices have negative line items. Make sure your validation tolerates them and your destination system accepts them.
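
For the scanned-PDF case, one way to detect an image-only document is to compare the amount of extracted text with the page count. The Python sketch below is a heuristic under stated assumptions: the function name and the 20-characters-per-page threshold are illustrative and should be tuned against your own invoices.

```python
def looks_scanned(extracted_text: str, page_count: int,
                  min_chars_per_page: int = 20) -> bool:
    """Heuristic: True when the PDF yields almost no text for its page count."""
    # Count only visible characters; whitespace alone should not pass.
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return page_count > 0 and visible < min_chars_per_page * page_count
```

When this returns True, route the file to OCR or human review rather than passing near-empty text to the AI Agent.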

Testing

Collect 10 invoices from your top vendors, plus 2-3 unusual ones. Run the workflow with destination writes disabled and inspect the structured output for each. Patch the prompt or the pre-clean step for any vendor that fails consistently, then enable writes and forward live traffic.
