How to Extract Structured Data from PDF Documents with AI

Use PDF Tools and AI to pull structured data from invoices, contracts, and reports.

What This Integration Does

PDFs are everywhere but the data inside them is locked behind layout. This Spojit workflow takes a PDF (received by email, fetched from an FTP server, or posted via webhook), extracts the raw text, then sends that text to a Connector node in Agent mode with a Response Schema describing exactly the fields you want. The result is a clean JSON record - vendor, amount, due date, line items - that downstream nodes can write into NetSuite, MongoDB, or your accounting system.

It handles invoices, contracts, purchase orders, expense receipts, and reports. Each document type gets its own schema and prompt; the rest of the workflow is the same.

For an alternative that uses the Knowledge node in Transient mode (one-run embed then query, automatic cleanup) instead of the PDF Tools text-extract path, see How to Extract Invoice Data with PDF Tools and AI.

Prerequisites

Access to Agent mode on the Connector node so the extraction step can return structured output. Pick a model with a generous context window for multi-page documents.
A way to receive PDFs into the workflow: an ftp connection, the Email Trigger (Gmail or Outlook, with attachments), or a Webhook Trigger that carries a file.
A target system to write the structured data to (netsuite, mongodb, mysql, etc.).
A clear schema per document type - know what fields you need before you start.

Step 1: Get the PDF

Drop a Trigger node and pick the source:

Email Trigger polling a connected Gmail or Outlook mailbox - the attachment is exposed to the workflow as a reference, with bytes fetched on demand.
Webhook Trigger if a portal or partner posts the PDF directly.
Schedule Trigger paired with a Connector node on ftp / list-directory and download-file for batch pickup from a drop folder.

Step 2: Sniff the Document

Before extracting full text, add a Connector node on pdf / get-info to read page count and metadata. Skip multi-hundred-page documents (probably contracts, not invoices) or branch to a different prompt for them. This is also where you'd detect password-protected files and route them to a human.
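The routing decision described above can be sketched in a few lines. This is an illustration only: the shape of the pdf / get-info output is not documented here, so the `pageCount` and `encrypted` keys, the route names, and the 200-page cutoff are all assumptions.

```python
def route_document(info: dict, max_pages: int = 200) -> str:
    """Pick a branch for a PDF from get-info style metadata.

    `info` is a hypothetical shape: {"pageCount": int, "encrypted": bool}.
    Returns "extract", "long-document", or "human-review".
    """
    # Password-protected files cannot be read automatically.
    if info.get("encrypted", False):
        return "human-review"
    page_count = info.get("pageCount")
    # Missing or nonsensical metadata is safer in front of a person.
    if not isinstance(page_count, int) or page_count < 1:
        return "human-review"
    # Very long documents get a different prompt (or are skipped).
    if page_count > max_pages:
        return "long-document"
    return "extract"
```

In a workflow, the same three outcomes would map onto the branches of a Condition node.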

Step 3: Extract Text

Add a Connector node on pdf / extract-text. For documents longer than roughly 5 to 10 pages, run pdf / extract-pages first to grab only the relevant range (e.g. the pages that carry the invoice header and line items rather than every appendix). For scanned PDFs (images, not text), the connector falls back to OCR but accuracy drops - flag these for closer review.

Step 4: Classify the Document (Optional)

If your inbound stream mixes document types, add a short Connector node in Agent mode with a tiny Response Schema:

{ "documentType": { "type": "string", "enum": ["invoice", "purchase-order", "receipt", "contract", "other"] } }

Then a Condition node routes each document to the right extraction prompt.
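As a rough sketch of that routing, the classifier output can be used as a lookup key into a table of prompts. The prompt texts and the `pick_prompt` helper are hypothetical; only the `documentType` values come from the schema above.

```python
from typing import Optional

# Illustrative prompts, one per document type that has an extraction path.
EXTRACTION_PROMPTS = {
    "invoice": "Extract the invoice fields. Return null for anything missing.",
    "purchase-order": "Extract the purchase order fields. Return null for anything missing.",
    "receipt": "Extract the receipt fields. Return null for anything missing.",
    "contract": "Extract the parties, effective date, and term. Return null for anything missing.",
}


def pick_prompt(classification: dict) -> Optional[str]:
    """Return the extraction prompt for a classified document.

    None means there is no extraction path ("other" or an unexpected
    value), so the document should go to manual review instead.
    """
    doc_type = classification.get("documentType", "other")
    return EXTRACTION_PROMPTS.get(doc_type)
```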

Step 5: AI Extract with a Response Schema

Add the main extraction Connector node in Agent mode. Provide the extracted text in the prompt and a strict Response Schema that forces JSON output. Example for invoices:

{
  "type": "object",
  "properties": {
    "vendor":      { "type": "string" },
    "vendorTaxId": { "type": ["string", "null"] },
    "invoiceNumber": { "type": "string" },
    "issueDate":   { "type": "string", "format": "date" },
    "dueDate":     { "type": ["string", "null"], "format": "date" },
    "currency":    { "type": "string" },
    "subtotal":    { "type": "number" },
    "tax":         { "type": "number" },
    "total":       { "type": "number" },
    "lineItems": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity":    { "type": "number" },
          "unitPrice":   { "type": "number" },
          "lineTotal":   { "type": "number" }
        }
      }
    }
  },
  "required": ["vendor", "invoiceNumber", "issueDate", "currency", "total"]
}

Prompt the model to return null for missing fields rather than guessing, and to copy numbers verbatim (no rounding).
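Even with a strict schema, it can be worth re-checking the response before it moves downstream. The sketch below uses only the Python standard library and mirrors the `required` list from the example schema; the `check_invoice` helper is hypothetical and is not part of any connector.

```python
import json
from typing import List

REQUIRED = ["vendor", "invoiceNumber", "issueDate", "currency", "total"]
NUMERIC = ["subtotal", "tax", "total"]


def check_invoice(raw: str) -> List[str]:
    """Return a list of problems found in a model response (empty if none)."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]

    problems = []
    # Required fields must be present and non-null.
    for field in REQUIRED:
        if record.get(field) is None:
            problems.append(f"missing required field: {field}")
    # Amounts must be JSON numbers, not strings (bool is excluded explicitly).
    for field in NUMERIC:
        value = record.get(field)
        if value is None:
            continue
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field} is not a number")
    return problems
```

A non-empty list is a reasonable signal to send the document to the review queue rather than retrying blindly.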

Step 6: Validate, Persist, and Notify

Add a Transform node that checks arithmetic: sum(lineItems.lineTotal) + tax == total (within a small tolerance). Mismatches go to a review queue via mongodb / insert-documents. Clean records flow to your target - e.g. netsuite / upsert-record for vendor bills, or mongodb / update-documents for an internal invoices collection. Post a summary to slack / send-message for visibility.
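The arithmetic check is simple enough to sketch. The version below is a minimal illustration, assuming the record follows the Step 5 schema; the `totals_agree` name, the 0.01 tolerance, and the choice to fail records without line totals are assumptions, not Transform node behavior.

```python
from decimal import Decimal


def totals_agree(record: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """True when sum(lineItems.lineTotal) + tax matches total within tolerance."""
    items = record.get("lineItems") or []
    # With no usable line totals there is nothing to verify against,
    # so treat the record as a mismatch and let a person look at it.
    if not items or any(item.get("lineTotal") is None for item in items):
        return False
    # Convert via str() so binary float noise (0.1 + 0.2) does not leak in.
    line_sum = sum((Decimal(str(item["lineTotal"])) for item in items), Decimal("0"))
    tax = Decimal(str(record.get("tax") or 0))
    total = Decimal(str(record["total"]))
    return abs(line_sum + tax - total) <= tolerance
```

Records for which this returns False would go to the review queue; the rest continue to the target system.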

Tips

Keep per-vendor notes in a Knowledge collection. Store vendor quirks (where line items live, currency conventions) as documents in a persistent Knowledge collection, then add a Knowledge node in Query mode to pull the matching notes into the extraction prompt.
Two-pass on hard documents. Run a cheap small model first to classify and pick which pages matter; run an expensive model only on those pages.
Keep the raw text. Store both the extracted text and the structured fields. When extraction is wrong, the text is what you re-process - you don't want to re-OCR.

Common Pitfalls

OCR on scanned invoices. Accuracy is much lower than on text-native PDFs. Treat OCR-derived results as lower confidence and route them to a human.
Number formatting. European invoices use "1.234,56", US uses "1,234.56". Tell the model the locale (or detect it from the vendor country) before parsing totals.
Multi-page line items. Long invoices split line items across pages. If you narrow the page range in Step 3, make sure it covers every page that carries line items, not just page 1.
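Hallucinated totals. Models sometimes "fix" arithmetic by inventing a total that matches subtotal + tax. The arithmetic check in Step 6 flags totals that disagree with the line items, but it cannot flag a total invented to be consistent, so also compare the total against the figure printed in the stored raw text.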
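For the number-formatting pitfall above, one option is to keep amounts as strings until the locale is known and convert them in a single place. The sketch below assumes the decimal separator has already been determined (for example, from the vendor country); `parse_amount` is a hypothetical helper.

```python
from decimal import Decimal


def parse_amount(text: str, decimal_sep: str) -> Decimal:
    """Parse a printed amount given its decimal separator ("." or ",")."""
    if decimal_sep not in (".", ","):
        raise ValueError("decimal_sep must be '.' or ','")
    group_sep = "," if decimal_sep == "." else "."
    cleaned = (
        text.strip()
        .replace(" ", "")        # plain spaces used as thousands separators
        .replace("\u00a0", "")   # non-breaking spaces, common in PDF text
        .replace(group_sep, "")  # drop the grouping separator
        .replace(decimal_sep, ".")
    )
    return Decimal(cleaned)
```

Both `parse_amount("1.234,56", ",")` and `parse_amount("1,234.56", ".")` produce the same `Decimal` value, which is the point: the locale decision is made explicitly rather than guessed per number.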

Testing

Collect 10-20 real PDFs covering each vendor format you expect to see. Run the workflow, then compare the extracted JSON against manually entered ground truth, field by field. Iterate on the prompt and schema until accuracy on the sample is acceptable (for example, 95% or better on critical fields such as total). Run for a week with all results going to the review queue before letting the workflow write to production systems.
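Per-field accuracy against the ground truth can be computed with a short script. This is a sketch under the assumption that extracted and ground-truth records are stored as parallel lists of dictionaries in the same order; `field_accuracy` is a hypothetical helper and uses exact equality, which may be too strict for free-text fields.

```python
from typing import Dict, List


def field_accuracy(
    extracted: List[dict], truth: List[dict], fields: List[str]
) -> Dict[str, float]:
    """Fraction of documents where each field exactly matches ground truth."""
    if not truth or len(extracted) != len(truth):
        raise ValueError("need two non-empty lists of equal length")
    return {
        field: sum(
            1 for got, want in zip(extracted, truth) if got.get(field) == want.get(field)
        )
        / len(truth)
        for field in fields
    }
```

Tracking the result per field, rather than as one overall score, makes it easier to see whether a prompt change improved total while quietly hurting dueDate.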