How to Build a Knowledge Base from PDFs
Upload PDF documents into a searchable knowledge base for your workflows.
What This Integration Does
PDFs are the universal format for policies, handbooks, product manuals, and supplier specifications - and the worst format for finding anything. This workflow turns a folder of PDFs into a Knowledge collection that any downstream workflow or AI Agent can query in plain English. Once it is running, dropping a new PDF anywhere your trigger watches is enough to make the content searchable across the whole platform.
The pipeline accepts PDFs from one of several sources (email attachments, FTP drops, or direct webhook uploads), extracts the text per page, and feeds it to the Knowledge node in embed mode. Metadata such as page count and source path is captured so later queries can return precise citations. Re-running the workflow against the same PDF replaces the existing document instead of creating duplicates.
This tutorial uses a persistent collection because the goal is a long-lived, multi-workflow-readable archive. For a one-off "extract structured fields from a single PDF and discard it" pattern, use a Transient Knowledge collection instead - see How to Create NetSuite Sales Orders from Emailed PO PDFs.
Prerequisites
- A Knowledge collection created in advance (for example company-documents).
- An input source: a Trigger set to Email or Webhook, or an FTP connection for scheduled directory polling.
- The pdf utility connector available in your workspace.
Step 1: Choose and Configure the Trigger
Drop a Trigger node onto the canvas. Pick one of:
- Email - watches a shared mailbox for messages with PDF attachments. Useful when staff forward documents to kb@yourcompany.com.
- Webhook - exposes a URL your internal upload tool or browser form can POST a PDF to.
- Schedule - paired with an FTP list-directory call for vendor drops (see the auto-index FTP tutorial for the polling pattern).
Step 2: Connector - Inspect the PDF
Add a Connector node pointing at the pdf connector and pick the get-info tool. Pass the file bytes from the trigger. The tool returns the page count, the document title, and whether the file is encrypted. Use this to short-circuit bad inputs early: add a Condition node that routes encrypted or zero-page PDFs to a notification branch rather than letting them poison the index.
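The gating logic can be sketched in Python, purely as an illustration of the decision the Condition node makes. The PdfInfo field names and the branch labels are assumptions, not the connector's documented output.

```python
from dataclasses import dataclass


@dataclass
class PdfInfo:
    """Hypothetical shape of the get-info result; field names are assumed."""
    page_count: int
    title: str
    encrypted: bool


def route_pdf(info: PdfInfo) -> str:
    """Return the branch a Condition node would take for this file."""
    if info.encrypted:
        return "notify:encrypted"  # cannot extract text without a password
    if info.page_count <= 0:
        return "notify:empty"      # nothing to index
    return "index"
```

Checking encryption first means a locked file is always reported as encrypted, even if its page count cannot be read.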
Step 3: Connector - Extract Text
Add another Connector step on the pdf connector with the extract-text tool:
{
  "file": "{{ trigger.attachment }}",
  "preserveLayout": false,
  "pageRange": "1-{{ pdfInfo.pageCount }}"
}
For very large documents (more than 200 pages), pair this with pdf split in a Loop so you process the file in 50-page chunks instead of one giant request.
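To make the chunking concrete, here is a minimal Python sketch that computes the page ranges a Loop would iterate over. The function name is hypothetical, and the "start-end" string format is assumed to match the pageRange value shown above.

```python
def page_ranges(page_count: int, chunk_size: int = 50) -> list[str]:
    """Split pages 1..page_count into inclusive "start-end" ranges."""
    if page_count < 1 or chunk_size < 1:
        return []
    ranges = []
    for start in range(1, page_count + 1, chunk_size):
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size - 1, page_count)
        ranges.append(f"{start}-{end}")
    return ranges


print(page_ranges(230))
# ['1-50', '51-100', '101-150', '151-200', '201-230']
```

Each range can then be passed as pageRange on one Loop iteration, with the extracted text concatenated in order afterwards.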
Step 4: Transform - Build the Document Header
Add a Transform node to compose the final text payload that goes into the index. Prepend a small header so future Knowledge queries can return useful citations:
Title: {{ pdfInfo.title || trigger.filename }}
Source: {{ trigger.sourcePath }}
Pages: {{ pdfInfo.pageCount }}
{{ extractedText }}
Strip page-number-only lines and repeated headers/footers if you can identify a pattern - they pollute embeddings and rarely help retrieval.
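One way to do this cleanup, sketched in Python under the assumption that you have the extracted text as one string per page. The function name, the page-number pattern, and the 60% threshold are illustrative choices, not platform settings.

```python
import re
from collections import Counter

# Matches lines such as "7", "Page 7", "7 / 12", or "page 7 of 12".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s*(of|/)\s*\d+)?\s*$", re.IGNORECASE)


def strip_page_furniture(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Drop page-number-only lines and lines repeated on most pages."""
    # Count on how many pages each distinct non-empty line appears.
    seen_on = Counter()
    for page in pages:
        seen_on.update({line.strip() for line in page.splitlines() if line.strip()})

    # Treat a line as a header/footer if it appears on most pages.
    # Require at least 3 pages so short documents are not over-pruned.
    cutoff = max(3, min_share * len(pages))
    repeated = {line for line, count in seen_on.items() if count >= cutoff}

    cleaned = []
    for page in pages:
        kept = [
            line for line in page.splitlines()
            if line.strip() not in repeated and not PAGE_NUMBER.match(line)
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

Note that the pattern also removes body lines that contain only a number, such as a lone year or a single table cell, so review the output on a sample before applying it to documents with numeric tables.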
Step 5: Knowledge Node - Embed
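Add a Knowledge node and set it to embed mode. Choose your collection, pass the composed text, and set sourceId to a stable identifier such as the full source path (folder plus filename). The node chunks the text, generates embeddings, and writes them to the vector store. A re-run with the same sourceId replaces the prior version rather than creating a second copy. Include the modified timestamp in sourceId only if you want each revision kept as its own document: a changed timestamp yields a new sourceId, so the older version is not replaced.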
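A minimal sketch of building such a key in Python, assuming the identifier is derived from the source path and an optional vendor ID. The function name and the "vendor:path" layout are hypothetical; only the need for a stable, collision-free sourceId comes from this tutorial.

```python
def make_source_id(source_path: str, vendor_id: str = "") -> str:
    """Build a key that is identical on every re-run for the same file."""
    # Normalize separators so "vendors//acme\\spec.pdf" and
    # "/vendors/acme/spec.pdf" map to the same key.
    segments = [s for s in source_path.replace("\\", "/").split("/") if s]
    normalized = "/".join(segments)
    vendor = vendor_id.strip()
    return f"{vendor}:{normalized}" if vendor else normalized


print(make_source_id("/vendors/acme/spec.pdf"))            # vendors/acme/spec.pdf
print(make_source_id("inbox/spec.pdf", vendor_id="acme"))  # acme:inbox/spec.pdf
```

Keeping the path readable in the key also makes it usable as a citation when query results return the sourceId.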
Step 6: Query Later from Any Workflow
From this point on, any other workflow can add a Knowledge node in query mode against the same collection. Pass the user's question, set topK to 4-8, and feed the result chunks plus their sourceId into an AI Agent prompt for grounded answers with citations.
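As an illustration of how the result chunks and their sourceId values might be folded into a prompt, here is a small Python sketch. The Chunk shape and the prompt wording are assumptions; adapt them to whatever the query node actually returns.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical shape of one query result; field names are assumed."""
    text: str
    source_id: str


def build_grounded_prompt(question: str, chunks: list[Chunk], top_k: int = 6) -> str:
    """Number the top_k chunks and attach each one's source for citation."""
    lines = [
        "Answer using only the numbered excerpts below.",
        "Cite excerpts by number, and say so if they do not contain the answer.",
        "",
    ]
    for i, chunk in enumerate(chunks[:top_k], start=1):
        lines.append(f"[{i}] (source: {chunk.source_id})")
        lines.append(chunk.text.strip())
        lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

Numbering the excerpts gives the agent a short handle to cite, and the sourceId next to each number lets you map a citation back to the original PDF.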
Tips
- Normalize whitespace - PDFs frequently produce extracted text with weird line breaks; a text replace step before embedding improves chunk boundaries.
- Tag by source folder - if you have folders like /policies and /vendor-docs, store the folder name as a Knowledge tag so queries can filter to a specific corpus.
- Plan for OCR-only PDFs - scanned PDFs return empty text. Detect this (extracted length under, say, 200 characters) and route to an OCR-capable step instead of indexing emptiness.
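The whitespace and OCR tips can be sketched in Python as follows. Both function names are hypothetical, and the 200-character threshold is the rough figure suggested above rather than a platform default.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse stray line breaks and spaces while keeping paragraph breaks."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Blank lines separate paragraphs; everything else becomes single spaces.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = (" ".join(p.split()) for p in paragraphs)
    return "\n\n".join(p for p in cleaned if p)


def needs_ocr(extracted_text: str, min_chars: int = 200) -> bool:
    """Flag PDFs whose text layer is empty or nearly empty."""
    # Count only non-whitespace characters so blank pages do not pass.
    return len("".join(extracted_text.split())) < min_chars
```

Run needs_ocr on the raw extraction result and normalize_whitespace only on documents that pass, so scanned files are routed away before any cleanup work is done.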
Common Pitfalls
- Duplicate documents - if sourceId is just the filename, two different vendors with spec.pdf collide. Include the folder path or vendor ID in the key.
- Encrypted PDFs - extract-text fails silently or with a cryptic error. Always run get-info first and bail out cleanly.
- Memory pressure on giant files - one 500 MB scanned PDF can stall a worker. Cap the trigger's accepted file size and use pdf split for anything over 50 MB.
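The size guard from the last pitfall might look like this in Python. The 50 MB split threshold comes from the text above; the 500 MB rejection cap and the branch names are placeholders to tune for your own workers.

```python
MB = 1024 * 1024


def size_route(size_bytes: int, split_over_mb: int = 50, reject_over_mb: int = 500) -> str:
    """Pick a processing branch from the file size alone."""
    if size_bytes > reject_over_mb * MB:
        return "reject"  # over the accepted cap: notify instead of processing
    if size_bytes > split_over_mb * MB:
        return "split"   # process in page chunks via pdf split
    return "direct"      # small enough for a single extract-text call
```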
Testing
Index three known PDFs with very different content. Open a second workflow, drop a Knowledge query node against the collection, and ask a question whose answer you know lives in exactly one of the documents. Confirm the result chunk comes from that document and that the sourceId matches. Then re-run the indexing workflow with the same PDFs and confirm the collection size stays the same.