PDF Tools
Extract text and manipulate PDF documents.
Overview
The PDF Tools connector is a built-in utility for working with PDF files inside a workflow. It pulls plain text out of documents (the foundation of any AI-driven invoice or contract pipeline) and provides page-level operations for merging, splitting, extracting, removing, rotating, and inspecting PDFs.
It's most useful in document-processing pipelines: pair extract-text with an AI Transform node to turn invoices into structured JSON, use split to break a multi-document batch into one workflow run per file, or use merge to assemble a shipping label with a packing slip into a single document for printing.
What You Can Do
The PDF connector exposes these tools:
- extract-text - Pull plain text from a PDF, page by page.
- get-info - Read document metadata (page count, title, author, encryption status).
- merge - Combine multiple PDFs into one document.
- split - Split a PDF into separate documents at given page boundaries.
- extract-pages - Pull out a range of pages as a new PDF.
- remove-pages - Delete a range of pages from a PDF.
- rotate - Rotate one or more pages by 90, 180, or 270 degrees.
Authentication and Setup
No connection or authentication is required. These tools are built into the platform and available in every workflow by default - just drop a Connector node onto the canvas and pick the tool you need.
Using in a Workflow
Add a Connector node, select PDF Tools, and pick a mode:
- Direct Mode - Recommended for document pipelines. Call extract-text against a known file, then feed the result into a Transform or AI node.
- Agent Mode - Useful when you want an AI agent to decide whether to extract text, split, or merge based on a prose instruction.
For batch processing, place the PDF node inside a Loop so each incoming file (from FTP, Gmail Trigger, or a knowledge collection) gets its own extraction step.
Tips
- Always pair extract-text with an AI step for structured extraction. The text is rarely useful on its own, but it's exactly the input AI invoice and contract parsers need.
- Use get-info as a guard before processing - skip encrypted PDFs or files over a sensible page limit.
- Split before extracting on very large PDFs. Per-page extraction keeps prompts inside an AI model's context window.
- Merge late, not early - keep intermediate documents separate while you process them, then combine at the very end if a single artifact is needed.
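The guard and the per-page batching tips can be sketched in a scripting step. This is a minimal illustration, not the connector's API: the metadata keys `encrypted` and `page_count`, the 200-page limit, and the idea that extract-text output is available as a list of per-page strings are all assumptions, so check the PDF Tools documentation for the real field names. Character count stands in as a rough proxy for model tokens.

```python
from typing import Any, List, Mapping, Sequence, Tuple

MAX_PAGES = 200  # hypothetical limit; tune to your own workload


def should_process(info: Mapping[str, Any], max_pages: int = MAX_PAGES) -> Tuple[bool, str]:
    """Decide whether a PDF is worth sending to extraction.

    `info` is assumed to be get-info style metadata with the hypothetical
    keys "encrypted" (bool) and "page_count" (int).
    """
    if info.get("encrypted", False):
        return False, "encrypted"
    page_count = info.get("page_count")
    if not isinstance(page_count, int) or page_count < 1:
        return False, "unknown page count"
    if page_count > max_pages:
        return False, "too many pages"
    return True, "ok"


def batch_pages(page_texts: Sequence[str], max_chars: int) -> List[List[str]]:
    """Group consecutive pages so each batch stays within max_chars.

    A single page longer than max_chars still gets a batch of its own,
    so the caller may need to truncate or split it further.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    size = 0
    for text in page_texts:
        # Start a new batch when adding this page would exceed the budget.
        if current and size + len(text) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        batches.append(current)
    return batches
```

In a workflow, the first function would sit behind a Condition node that rejects files early, and each batch from the second would become one prompt for the AI step.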
Common Pitfalls
- Scanned PDFs have no text - extract-text returns image-only pages as empty. OCR isn't included; use an AI vision model or an OCR connector for scans.
- Layout-sensitive extraction - Multi-column documents and tables lose their structure in plain text. Use an AI step (Structured Output mode) rather than regex to recover fields.
- Encrypted PDFs - Password-protected files can't be read. Decrypt upstream (e.g. in a Code Runner step) or reject them in a Condition node.
- Page indexing - Pages are 1-based. Off-by-one errors when calling extract-pages or remove-pages are common.
- File size limits - Very large PDFs (hundreds of MB) may exceed the workflow payload limit; split them on the source side or stream from FTP.
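Two of these pitfalls are cheap to catch before they cause trouble downstream. The sketch below flags pages that came back empty (likely scans that need OCR or a vision model) and validates a 1-based page range before it is passed to extract-pages or remove-pages. It assumes extracted text is available as a list of per-page strings and that page ranges are inclusive; neither detail is confirmed here, and both helper names are hypothetical.

```python
from typing import List, Sequence


def pages_needing_ocr(page_texts: Sequence[str], min_chars: int = 1) -> List[int]:
    """Return 1-based page numbers whose extracted text is empty or near-empty."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


def check_page_range(first: int, last: int, page_count: int) -> range:
    """Validate a 1-based, inclusive page range.

    Returns the equivalent 0-based indices, which is handy when the
    range is also used to index a Python list of per-page results.
    """
    if not 1 <= first <= last <= page_count:
        raise ValueError(
            f"pages {first}-{last} are outside 1-{page_count} (pages are 1-based)"
        )
    return range(first - 1, last)
```

For example, `pages_needing_ocr(["Invoice 42", "   ", ""])` reports pages 2 and 3, and `check_page_range(0, 2, 5)` raises a `ValueError` instead of silently selecting the wrong pages.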
Common Use Cases
- Extract Structured Data from PDF Documents with AI
- Extract Invoice Data with PDF Tools and AI
- Build an AI-Powered Invoice Processing Pipeline
- Set Up Email-Triggered Document Processing
- Build a Knowledge Base from PDFs
- Extract and Store Invoice Data in a Knowledge Collection
Related Articles
For technical API details and field specifications, see the PDF Tools documentation.