How to Extract Structured Data from PDF Documents with AI
Use PDF Tools and AI to pull structured data from invoices, contracts, and reports.
What This Integration Does
PDFs are everywhere, but the data inside them is locked behind layout. This workflow takes a PDF (received by email, dropped on an SFTP server, or posted via webhook), extracts the raw text, then sends that text to an AI step with a Structured Output schema describing exactly the fields you want. The result is a clean JSON record - vendor, amount, due date, line items - that downstream nodes can write into NetSuite, Mongo, or your accounting system.
It handles invoices, contracts, purchase orders, expense receipts, and reports. Each document type gets its own schema and prompt; the rest of the workflow is the same.
For an alternative that uses the Knowledge node in Transient mode (one-run embed + query, automatic cleanup) instead of the PDF Tools text-extract path, see How to Create NetSuite Sales Orders from Emailed PO PDFs.
Prerequisites
- A workspace LLM provider configured. Pick a model with a generous context window for multi-page documents.
- A way to receive PDFs: an ftp connection, the Email Trigger with attachment support, or a webhook with a file upload.
- A target system to write the structured data to (netsuite, mongodb, mysql, etc.).
- A clear schema per document type - know what fields you need before you start.
Step 1: Get the PDF
Drop a Trigger node and pick the source:
- Email Trigger reading an inbox like invoices@company.com - the attachment is exposed to the workflow.
- Webhook Trigger if a portal or partner posts the PDF directly.
- Schedule Trigger paired with a Connector node on ftp / list-directory + download-file for batch pickup from an SFTP drop folder.
Step 2: Sniff the Document
Before extracting full text, add a Connector node on pdf / get-info to read page count and metadata. Skip multi-hundred-page documents (probably contracts, not invoices) or branch to a different prompt for them. This is also where you'd detect password-protected files and route them to a human.
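The routing decision described above can be sketched as a small function. This is a minimal illustration, not the connector's API: the `encrypted` and `pageCount` keys are assumed names for the pdf / get-info output, and the 100-page cutoff is a placeholder you would tune to your own documents.

```python
from typing import Any, Mapping

# Hypothetical cutoff: anything longer is treated as a contract or report.
MAX_INVOICE_PAGES = 100


def route_document(info: Mapping[str, Any]) -> str:
    """Pick a workflow branch from PDF metadata.

    `info` stands in for the pdf / get-info output. The key names used
    here are assumptions, not the connector's documented field names.
    """
    if info.get("encrypted", False):
        return "human-review"       # password-protected: text cannot be extracted
    pages = info.get("pageCount")
    if not isinstance(pages, int) or pages < 1:
        return "human-review"       # missing or implausible metadata
    if pages > MAX_INVOICE_PAGES:
        return "long-document"      # skip, or branch to a different prompt
    return "standard-extraction"
```

In the workflow itself the same logic would live in a Condition node; the function form just makes the three outcomes explicit.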
Step 3: Extract Text
Add a Connector node on pdf / extract-text. For documents longer than about 5-10 pages, run pdf / extract-pages first to grab the relevant range (e.g. just the invoice header page rather than every appendix). For scanned PDFs (images, not text), the connector falls back to OCR, but accuracy drops - flag these for closer review.
Step 4: Classify the Document (Optional)
If your inbound stream mixes document types, add a Connector node configured as a short LLM call with a tiny schema:
{ "documentType": { "type": "string", "enum": ["invoice", "purchase-order", "receipt", "contract", "other"] } }
Then a Condition node routes each document to the right extraction prompt.
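As a rough sketch of what that routing amounts to, the classifier's output can be mapped to a branch name. The branch names below are hypothetical labels, and sending "other" (or anything unexpected) to human review is an assumption rather than something the platform prescribes.

```python
from typing import Any, Mapping

# Hypothetical branch names, one per document type in the enum above.
BRANCH_BY_TYPE = {
    "invoice": "extract-invoice",
    "purchase-order": "extract-purchase-order",
    "receipt": "extract-receipt",
    "contract": "extract-contract",
}


def pick_branch(classification: Mapping[str, Any]) -> str:
    """Map the classifier's documentType to an extraction branch."""
    doc_type = classification.get("documentType")
    if not isinstance(doc_type, str):
        return "human-review"       # malformed classifier output
    # "other" and unknown values fall through to a person (assumed policy).
    return BRANCH_BY_TYPE.get(doc_type, "human-review")
```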
Step 5: AI Extract with Structured Output
Add the main extraction Connector node configured as a tool-augmented LLM call. Provide the extracted text and a strict Structured Output schema. Example for invoices:
{
  "type": "object",
  "properties": {
    "vendor": { "type": "string" },
    "vendorTaxId": { "type": ["string", "null"] },
    "invoiceNumber": { "type": "string" },
    "issueDate": { "type": "string", "format": "date" },
    "dueDate": { "type": ["string", "null"], "format": "date" },
    "currency": { "type": "string" },
    "subtotal": { "type": "number" },
    "tax": { "type": "number" },
    "total": { "type": "number" },
    "lineItems": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "unitPrice": { "type": "number" },
          "lineTotal": { "type": "number" }
        }
      }
    }
  },
  "required": ["vendor", "invoiceNumber", "issueDate", "currency", "total"]
}
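Prompt the model to return null for missing fields rather than guessing, and to copy numbers verbatim (no rounding). Only fields whose type includes "null" (vendorTaxId and dueDate in this schema) can carry a null, so widen the type of any other field you expect to be absent.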
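Even with a strict schema, it can help to re-check the model's reply before it moves downstream. The following is a minimal sketch in plain Python that covers only the required list and the numeric fields from the invoice schema; `check_invoice` is a hypothetical helper, and a full JSON Schema validator would be the more thorough choice.

```python
import json
from typing import List

REQUIRED = ["vendor", "invoiceNumber", "issueDate", "currency", "total"]
NUMERIC = ["subtotal", "tax", "total"]


def check_invoice(raw: str) -> List[str]:
    """Return a list of problems found in the model's JSON reply."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]

    problems = []
    for field in REQUIRED:
        # A required field that is absent or null counts as missing.
        if record.get(field) is None:
            problems.append(f"missing required field: {field}")
    for field in NUMERIC:
        value = record.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if value is not None and not is_number:
            problems.append(f"{field} is not a number")
    return problems
```

An empty list means the record passed these checks; anything else is a reason to send the document to review rather than to the target system.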
Step 6: Validate, Persist, and Notify
Add a Transform node that checks arithmetic: sum(lineItems.lineTotal) + tax == total (within a small tolerance). Mismatches go to a review queue via mongodb / insert-documents. Clean records flow to your target - e.g. netsuite / upsert-record for vendor bills, or mongodb / update-documents for an internal invoices collection. Post a summary to slack / send-message for visibility.
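The arithmetic check might look like the sketch below. It assumes the record follows the invoice schema from Step 5 and that every line item carries a lineTotal; the 0.01 tolerance is an assumed value, and `totals_match` is a hypothetical name for whatever your Transform node implements.

```python
from decimal import Decimal

# Assumed tolerance of one cent; adjust for currencies with other precision.
TOLERANCE = Decimal("0.01")


def totals_match(record: dict, tolerance: Decimal = TOLERANCE) -> bool:
    """True when sum(lineItems.lineTotal) + tax is within tolerance of total."""
    # Go through str() so float inputs such as 0.1 do not bring binary noise
    # into the Decimal arithmetic.
    line_sum = sum(
        (Decimal(str(item["lineTotal"])) for item in record.get("lineItems", [])),
        Decimal("0"),
    )
    tax = Decimal(str(record.get("tax") or 0))   # absent or null tax counts as 0
    total = Decimal(str(record["total"]))
    return abs(line_sum + tax - total) <= tolerance
```

Records where this returns False go to the review queue; the rest continue to the target system.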
Tips
- Store extraction templates in the Knowledge base. Per-vendor quirks (where line items live, currency conventions) live nicely in Knowledge and feed the prompt automatically when the vendor is recognised.
- Two-pass on hard documents. Run a cheap small model first to classify and pick which pages matter; run an expensive model only on those pages.
- Keep the raw text. Store both the extracted text and the structured fields. When extraction is wrong, the text is what you re-process - you don't want to re-OCR.
Common Pitfalls
- OCR on scanned invoices. Accuracy is much lower than on text-native PDFs. Treat the output as lower-confidence and route these to a human.
- Number formatting. European invoices use "1.234,56", US uses "1,234.56". Tell the model the locale (or detect it from the vendor country) before parsing totals.
- Multi-page line items. Long invoices split line items across pages. Extract every page that carries line items, not just page 1.
- Hallucinated totals. Models sometimes "fix" arithmetic by inventing a total that matches subtotal + tax. The arithmetic check in Step 6 catches this.
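For the number-formatting pitfall, one option is to normalise amount strings deterministically once the locale is known, rather than relying on the model alone. The sketch below assumes you already know the decimal separator (for example from the vendor country); `parse_amount` is a hypothetical helper, not part of any connector.

```python
from decimal import Decimal


def parse_amount(text: str, decimal_sep: str) -> Decimal:
    """Parse '1.234,56' (decimal_sep=',') or '1,234.56' (decimal_sep='.')."""
    if decimal_sep not in (",", "."):
        raise ValueError("decimal_sep must be ',' or '.'")
    group_sep = "." if decimal_sep == "," else ","
    # Drop spaces and grouping separators, then normalise the decimal mark.
    cleaned = text.strip().replace(" ", "").replace(group_sep, "")
    return Decimal(cleaned.replace(decimal_sep, "."))
```

Using Decimal keeps the value exact, which matters when the result feeds an arithmetic check against line totals.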
Testing
Collect 10-20 real PDFs covering each vendor format you expect to see. Run the workflow, then compare the extracted JSON against manually entered ground truth. Iterate on the prompt and schema until accuracy on the sample is acceptable (95%+ on critical fields like total). Run for a week with all results going to the review queue before letting the workflow write to production systems.
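Per-field accuracy against the ground truth can be computed with a few lines of Python. This is a sketch that assumes both lists are in the same document order and uses exact equality per field; `field_accuracy` is a hypothetical helper name.

```python
from typing import Dict, List


def field_accuracy(
    extracted: List[dict], truth: List[dict], fields: List[str]
) -> Dict[str, float]:
    """Fraction of documents where each field exactly matches ground truth."""
    if len(extracted) != len(truth):
        raise ValueError("extracted and truth must be the same length")
    if not truth:
        raise ValueError("need at least one document")
    scores = {}
    for field in fields:
        hits = sum(
            1 for got, want in zip(extracted, truth)
            if got.get(field) == want.get(field)
        )
        scores[field] = hits / len(truth)
    return scores
```

Comparing each field's score with your threshold (for example 0.95 on total) shows which fields still need prompt or schema work.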