How to Build a Searchable Knowledge Base from PDF Documents
Upload your PDF documents into a searchable knowledge collection powered by AI.
What This Integration Does
PDFs are where institutional knowledge goes to die. Once a manual or specification is saved as a PDF and dropped in a shared drive, finding anything inside it relies on whoever can remember its filename. This workflow turns a PDF library into a Knowledge collection any workflow can query in natural language, so a question like "what is the warranty period for the Model X regulator?" returns a real answer with a citation.
The workflow accepts PDFs from one or more sources (FTP drops, email attachments, or direct webhook uploads), validates them, extracts the text per page, and embeds the result into the Knowledge collection. Each embedded document carries a stable sourceId so re-uploads replace rather than duplicate, and tags so downstream queries can scope to a particular product line or document type.
Prerequisites
- A Knowledge collection created in advance (for example product-manuals).
- The pdf utility connector.
- An input source: an ftp connection for scheduled drops, an Email Trigger for emailed manuals, or a Webhook Trigger for upload tools.
Step 1: Choose Your Trigger
Drop a Trigger node and pick the sub-type that matches how PDFs reach you:
- Schedule - paired with an ftp list-directory call when manuals arrive in a shared drop folder.
- Email - watches a shared mailbox (e.g. manuals@yourcompany.com) for PDF attachments.
- Webhook - exposes a URL your internal upload form can POST a PDF to.
All three should produce the same downstream envelope: { filename, sourcePath, bytes }.
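A minimal sketch of that normalization, assuming each trigger can hand over a path-like string and the raw file content. The helper name make_envelope is hypothetical; only the three envelope keys come from the workflow described here.

```python
import posixpath


def make_envelope(source_path: str, data: bytes) -> dict:
    """Build the { filename, sourcePath, bytes } envelope from any trigger.

    `source_path` is whatever locates the file at its origin (an FTP path,
    a mailbox path, an upload name); the filename is derived from it.
    """
    filename = posixpath.basename(source_path)
    if not filename:
        # A path ending in "/" names a folder, not a file.
        raise ValueError(f"source path has no filename: {source_path!r}")
    return {"filename": filename, "sourcePath": source_path, "bytes": data}
```

Deriving the filename from the path in one place keeps the three trigger branches from drifting apart in how they name the same file.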
Step 2: Connector - Validate the PDF
Add a Connector node pointing at the pdf connector with the get-info tool. The response includes page count, title metadata, and whether the file is encrypted. Use a Condition node to skip encrypted or zero-page files and route them to a manual-review queue rather than letting them pollute the index.
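The Condition logic can be sketched as a small routing function. The field names encrypted and pageCount are assumptions about the shape of the get-info response, and the route labels are illustrative only.

```python
def route_pdf(info: dict) -> str:
    """Decide where a PDF goes after get-info.

    Assumes `info` carries an `encrypted` flag and a `pageCount` integer;
    missing fields are treated as the unsafe case.
    """
    if info.get("encrypted", True):
        return "manual-review"
    if info.get("pageCount", 0) < 1:
        return "manual-review"
    return "extract"
```

Treating a missing encrypted flag as encrypted is a deliberately conservative choice: an unreadable response goes to review instead of into the index.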
Step 3: Connector - Extract Text
Add another Connector step on the pdf connector using the extract-text tool:
{
  "file": "{{ envelope.bytes }}",
  "preserveLayout": false
}
For very large manuals (over 200 pages), split the file first and wrap the extraction in a Loop over the pdf split output, processing 50 pages at a time so a single oversized document doesn't time out the workflow.
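The 50-page batching is simple arithmetic. A sketch, assuming pages are numbered from 1 and ranges are inclusive; the function name page_batches is hypothetical.

```python
def page_batches(page_count: int, batch_size: int = 50) -> list[tuple[int, int]]:
    """Return inclusive (first_page, last_page) ranges covering the document."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        (start, min(start + batch_size - 1, page_count))
        for start in range(1, page_count + 1, batch_size)
    ]
```

For a 230-page manual this yields five ranges, the last one covering pages 201 to 230.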
Step 4: Transform - Add a Metadata Header
Add a Transform node to compose the final indexable text. The header gives the Knowledge node hooks to match against for facets like product line or document type:
Title: {{ pdfInfo.title || envelope.filename }}
Source: {{ envelope.sourcePath }}
Pages: {{ pdfInfo.pageCount }}
Indexed: {{ now }}
{{ extractedText }}
If filenames follow a convention (e.g. manual-modelX-rev3.pdf), parse the model and revision with regex extract and add them as tags in the next step.
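A sketch of that parse for the manual-modelX-rev3.pdf convention. The pattern and the mapping of the model segment to a productLine tag are assumptions; adjust both to the naming scheme actually in use.

```python
import re

# Matches names like "manual-modelX-rev3.pdf" (case-insensitive).
FILENAME_PATTERN = re.compile(
    r"^manual-(?P<model>[A-Za-z0-9]+)-rev(?P<revision>\d+)\.pdf$",
    re.IGNORECASE,
)


def tags_from_filename(filename: str) -> dict:
    """Extract tags from a conventionally named file, or {} if it doesn't match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return {}
    return {"productLine": match.group("model"), "revision": match.group("revision")}
```

Returning an empty dict for non-matching names lets the workflow index the document anyway, just without the extra tags.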
Step 5: Knowledge Node - Embed
Add a Knowledge node in embed mode targeting your product-manuals collection. Set sourceId to a stable identifier such as {{ envelope.sourcePath }} so re-uploads of the same manual replace cleanly; avoid values that change between revisions (page count, for instance), because a changed key creates a duplicate instead of a replacement. Apply tags like productLine, revision, and docType so future queries can scope to "warranty manuals for Model X only" without searching the whole corpus.
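One way to derive a sourceId is from the normalized source path, so that same-named files in different folders get different keys while a re-upload of the same file gets the same one. This is a sketch; lowercasing assumes the source treats paths case-insensitively, and make_source_id is a hypothetical helper.

```python
import posixpath


def make_source_id(source_path: str) -> str:
    """Normalize a source path into a stable key for the Knowledge node."""
    # Unify separators, then collapse duplicate slashes and "." segments.
    unified = source_path.strip().replace("\\", "/")
    normalized = posixpath.normpath(unified)
    # Assumption: paths differing only in case refer to the same manual.
    return normalized.lower().lstrip("/")
```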
Step 6: Build the Auto-Ingestion Loop
For a self-service knowledge base, wrap Steps 1-5 in a scheduled ingestion workflow: an ftp list-directory call every 15 minutes, filtering against a small mongodb tracking collection so already-indexed files are skipped. New files are downloaded with ftp download-file and pushed through the same Steps 2-5. The tracking record stores filename, modifiedAt, and indexedAt so updated manuals re-index automatically while unchanged ones are ignored.
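The skip-or-reindex decision can be sketched with a plain dictionary standing in for the mongodb tracking collection. The listing entry shape (filename and modifiedAt) is an assumption about what the ftp listing provides.

```python
from datetime import datetime, timezone


def files_to_index(listing: list[dict], tracked: dict[str, dict]) -> list[dict]:
    """Return listing entries that are new or modified since they were indexed.

    `tracked` maps filename -> tracking record with a `modifiedAt` datetime.
    """
    pending = []
    for entry in listing:
        record = tracked.get(entry["filename"])
        if record is None or entry["modifiedAt"] > record["modifiedAt"]:
            pending.append(entry)
    return pending


# Usage: one unchanged file, one updated file, one new file.
jan = datetime(2024, 1, 1, tzinfo=timezone.utc)
feb = datetime(2024, 2, 1, tzinfo=timezone.utc)
tracked = {
    "a.pdf": {"modifiedAt": jan, "indexedAt": jan},
    "b.pdf": {"modifiedAt": jan, "indexedAt": jan},
}
listing = [
    {"filename": "a.pdf", "modifiedAt": jan},
    {"filename": "b.pdf", "modifiedAt": feb},
    {"filename": "c.pdf", "modifiedAt": feb},
]
pending_names = [entry["filename"] for entry in files_to_index(listing, tracked)]
```

Here pending_names contains b.pdf and c.pdf; a.pdf is skipped because its modification time matches the tracking record.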
Tips
- Clean repetitive headers and footers - "Page X of Y" lines and corporate footers pollute embeddings. Strip them in the Transform step with a text replace or a regex replace.
- Index revision metadata - manuals get updated; revisions matter. Tag each document with its revision so queries can be scoped to "current revision only" by default.
- Plan for OCR-only PDFs - scanned manuals return empty or near-empty extracted text. Detect them (extracted text under ~200 characters) and route to an OCR step or a human-entry queue.
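Two of the tips above can be sketched together: dropping "Page X of Y" lines and flagging documents whose extracted text is too short to be real content. The pattern covers only that one marker style, and the 200-character threshold is the rough figure from the tip, not a tuned value.

```python
import re

# A whole line consisting only of "Page X of Y" (case-insensitive).
PAGE_MARKER = re.compile(r"^\s*Page\s+\d+\s+of\s+\d+\s*$", re.IGNORECASE)


def strip_page_markers(text: str) -> str:
    """Remove lines that contain nothing but a page marker."""
    kept = [line for line in text.splitlines() if not PAGE_MARKER.match(line)]
    return "\n".join(kept)


def needs_ocr(text: str, min_chars: int = 200) -> bool:
    """Flag extractions that are likely scanned images with no text layer."""
    return len(text.strip()) < min_chars
```

Running needs_ocr after strip_page_markers avoids counting boilerplate toward the length threshold.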
Common Pitfalls
- Same filename, different manual - manual.pdf from two different product folders collides on filename. Include the folder path or product line in sourceId.
- Encrypted PDFs - extract-text fails confusingly on encrypted files. Always run get-info first and short-circuit cleanly.
- Document churn - a manual updated weekly creates 52 versions a year if you don't replace on sourceId. Always re-use the same key for the same manual; only update the revision tag.
Testing
Index three PDFs that cover distinct topics. From a second workflow, drop a Knowledge query node and ask a question whose answer lives in exactly one of the three (e.g. a specific warranty period). Confirm the retrieved chunk comes from the right document with a reasonable score. Then re-upload one of the PDFs unchanged and confirm the collection size stays constant.