How to Build a Searchable Knowledge Base from PDF Documents

Upload your PDF documents into a searchable knowledge collection powered by AI.

What This Integration Does

PDFs are where institutional knowledge goes to die. Once a manual or specification is saved as a PDF and dropped in a shared drive, finding anything inside it depends on someone remembering its filename. This workflow turns a PDF library into a Knowledge collection that any workflow can query in natural language, so a question like "what is the warranty period for the Model X regulator?" returns a real answer with a citation.

The workflow accepts PDFs from one or more sources (FTP drops, email attachments, or direct webhook uploads), validates them, extracts their text, and embeds the result into the Knowledge collection. Each embedded document is keyed by its File Name, so re-uploading a manual under the same name overwrites the previous version rather than creating a duplicate. Spojit handles chunking and embedding for you.

Prerequisites

A Knowledge collection created in advance (for example product-manuals).
The pdf utility connector.
An input source: an ftp connection for scheduled drops, an Email Trigger (connected to a Gmail or Outlook mailbox) for emailed manuals, or a Webhook Trigger for upload tools.

Step 1: Choose Your Trigger

Drop a Trigger node and pick the sub-type that matches how PDFs reach you:

Schedule - paired with an ftp list-directory call when manuals arrive in a shared drop folder.
Email - polls a connected Gmail or Outlook mailbox (e.g. manuals@yourcompany.com) for PDF attachments. The trigger emits attachments[] references whose bytes are fetched on demand.
Webhook - exposes a URL your internal upload form can POST a PDF to.

All three should produce the same downstream envelope: { filename, sourcePath, bytes }.
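
As a rough illustration of that normalisation, the Python sketch below maps each source onto the same three fields. Only the envelope field names come from the text above. The helper names, their arguments, and the "email:" and "webhook:" sourcePath prefixes are assumptions made for the example, not Spojit's actual trigger output.

```python
import posixpath


def make_envelope(filename: str, source_path: str, data: bytes) -> dict:
    """Build the common downstream envelope: filename, sourcePath, bytes."""
    return {"filename": filename, "sourcePath": source_path, "bytes": data}


def from_ftp(remote_path: str, data: bytes) -> dict:
    # Assumption: the FTP branch knows the full remote path of the file.
    return make_envelope(posixpath.basename(remote_path), remote_path, data)


def from_email(mailbox: str, attachment_name: str, data: bytes) -> dict:
    # Assumption: sourcePath records which mailbox the attachment arrived in.
    return make_envelope(attachment_name, f"email:{mailbox}/{attachment_name}", data)


def from_webhook(upload_name: str, data: bytes) -> dict:
    # Assumption: the upload form supplies the original file name.
    return make_envelope(upload_name, f"webhook:{upload_name}", data)
```

Whatever the real payloads look like, the point is that Steps 2-5 only ever see one shape, so adding a fourth source later touches a single branch.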

Step 2: Connector - Validate the PDF

Add a Connector node pointing at the pdf connector with the get-info tool. The response includes page count, title metadata, and whether the file is encrypted. Use a Condition node to skip encrypted or zero-page files and route them to a manual-review queue rather than letting them pollute the index.
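
The Condition logic amounts to two checks, sketched here in Python. The field names encrypted and pageCount are assumptions about the shape of the get-info response, and the route labels are illustrative.

```python
def validation_route(info: dict) -> str:
    """Decide where a PDF goes based on a get-info style result.

    Assumed fields (hypothetical names): 'encrypted' (bool), 'pageCount' (int).
    """
    if info.get("encrypted", False):
        return "manual-review"  # encrypted files cannot be extracted cleanly
    if int(info.get("pageCount") or 0) <= 0:
        return "manual-review"  # zero-page or malformed file
    return "index"
```

Treating a missing page count as zero errs on the side of manual review rather than indexing an empty document.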

Step 3: Connector - Extract Text

Add another Connector step on the pdf connector using the extract-text tool. Map its document input to the PDF bytes you captured upstream (for example {{ envelope.bytes }}). The tool returns the extracted text of the document.

For very large manuals (over 200 pages), use the pdf split tool to break the file into smaller pieces, then wrap the extraction in a Loop over those pieces so a single oversized document doesn't time out the workflow.
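
To make the splitting concrete, the Python sketch below computes the 1-based, inclusive page ranges that each piece would cover. The piece size of 50 pages is an arbitrary assumption, and how ranges are passed to the split tool is not specified here.

```python
def page_ranges(page_count: int, pages_per_piece: int = 50) -> list:
    """Return (first_page, last_page) tuples covering 1..page_count."""
    if page_count < 0 or pages_per_piece <= 0:
        raise ValueError("page_count must be >= 0 and pages_per_piece > 0")
    return [
        (start, min(start + pages_per_piece - 1, page_count))
        for start in range(1, page_count + 1, pages_per_piece)
    ]
```

For example, page_ranges(230, 100) yields three pieces: pages 1-100, 101-200, and 201-230. The Loop then extracts one piece per iteration and the results are concatenated in order.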

Step 4: Transform - Add a Metadata Header

Add a Transform node to compose the final indexable text into an output variable such as headedText. Putting the title, source, and revision into the text itself means a natural-language query can surface and cite those details, since the Knowledge node indexes the words you embed:

Title: {{ pdfInfo.title || envelope.filename }}
Source: {{ envelope.sourcePath }}
Pages: {{ pdfInfo.pageCount }}
Indexed: {{ now }}

{{ extractedText }}

If filenames follow a convention (e.g. manual-modelX-rev3.pdf), parse the model and revision with regex extract and add them to the header above as extra lines (for example Model: and Revision:) so they are indexed along with the text.
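
A pattern for the example convention might look like the Python sketch below. The exact pattern is an assumption based on the single sample name manual-modelX-rev3.pdf, so adjust it to your own naming scheme.

```python
import re

# Assumed convention: manual-<model>-rev<number>.pdf
FILENAME_PATTERN = re.compile(
    r"^manual-(?P<model>[A-Za-z0-9]+)-rev(?P<revision>\d+)\.pdf$",
    re.IGNORECASE,
)


def parse_manual_filename(filename: str):
    """Return {'model', 'revision'} or None if the name does not fit."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    return {"model": match.group("model"), "revision": int(match.group("revision"))}
```

Returning None for non-conforming names lets the workflow fall back to the plain filename instead of failing.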

Step 5: Knowledge Node - Embed

Add a Knowledge node in Embed mode. Set Collection to your persistent product-manuals collection. Set File Name to a stable identifier for the manual, for example {{ envelope.filename }}. Embedding under a name that already exists overwrites the previous version, so re-uploads of the same manual replace cleanly instead of duplicating. Set Document Type to PDF if you are embedding the raw file, or Plain Text if you are embedding the metadata-headed text from Step 4. Point Document Input at the content you want indexed (for example {{ headedText }}), and set an Output Variable so you can read back the chunk count after embedding.

Step 6: Build the Auto-Ingestion Loop

For a self-service knowledge base, wrap Steps 1-5 in a scheduled ingestion workflow: an ftp list-directory call every 15 minutes, filtering against a small mongodb tracking collection so already-indexed files are skipped. New files are downloaded with ftp download-file and pushed through the same Steps 2-5. The tracking record stores filename, modifiedAt, and indexedAt so updated manuals re-index automatically while unchanged ones are ignored.
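
The skip-or-reindex decision can be sketched in Python as a comparison against the tracking record. The record fields follow the text above; representing timestamps as datetime objects and passing None for an unseen file are assumptions of this example.

```python
from datetime import datetime
from typing import Optional


def needs_indexing(listed_modified_at: datetime, tracking_record: Optional[dict]) -> bool:
    """Return True for new files and for files modified since the last index run.

    tracking_record is the stored {filename, modifiedAt, indexedAt} document,
    or None when the file has never been indexed.
    """
    if tracking_record is None:
        return True
    return listed_modified_at > tracking_record["modifiedAt"]
```

After a successful embed, the workflow would update modifiedAt and indexedAt in the tracking record so the next run skips the file.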

Tips

Clean repetitive headers and footers - "Page X of Y" lines and corporate footers pollute embeddings. Strip them in the Transform step with a text replace or a regex replace.
Index revision metadata - manuals get updated, and revisions matter. Include each document's revision in the metadata header from Step 4 so answers can cite which revision they came from.
Plan for OCR-only PDFs - scanned manuals return empty extracted text. Detect (length under ~200 characters) and route to an OCR step or a human-entry queue.
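
The first and third tips can be sketched in Python as two small helpers. The "Page X of Y" pattern and the 200-character threshold come from the tips above; the function names are illustrative, and real footers will likely need extra patterns.

```python
import re

# Matches a whole line that contains only "Page X of Y" (plus optional spaces).
PAGE_LINE = re.compile(r"^[ \t]*Page \d+ of \d+[ \t]*(?:\n|\Z)", re.IGNORECASE | re.MULTILINE)


def strip_page_lines(text: str) -> str:
    """Remove standalone 'Page X of Y' lines, leaving other text untouched."""
    return PAGE_LINE.sub("", text)


def looks_scanned(text: str, min_chars: int = 200) -> bool:
    """Heuristic: very little extracted text suggests an image-only PDF."""
    return len(text.strip()) < min_chars
```

The pattern only removes lines that consist of the page marker alone, so a sentence that happens to begin with "Page 1 of 3" is left in place.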

Common Pitfalls

Same filename, different manual - manual.pdf from two different product folders collides on filename. Include the folder path or product line in the File Name.
Encrypted PDFs - extract-text fails confusingly on encrypted files. Always run get-info first and short-circuit cleanly.
Document churn - a manual updated weekly creates 52 versions a year if you don't replace on File Name. Always re-use the same File Name for the same manual; only the revision recorded in the metadata header should change.
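
One way to get a key that is both collision-free and stable is to derive it from the full source path, as in the Python sketch below. The double-underscore separator and the assumption that the Knowledge node accepts such names are illustrative choices, not documented behaviour.

```python
import posixpath


def stable_file_name(source_path: str) -> str:
    """Derive a File Name from the full source path.

    The same path always maps to the same key (so re-uploads overwrite),
    while identical filenames in different folders get different keys.
    """
    normalized = posixpath.normpath(source_path).strip("/")
    return normalized.replace("/", "__")
```

With this helper, /regulators/modelX/manual.pdf and /pumps/modelY/manual.pdf no longer collide, and a weekly re-upload of either one keeps replacing the same entry.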

Testing

Index three PDFs that cover distinct topics. From a second workflow, drop a Knowledge query node and ask a question whose answer lives in exactly one of the three (e.g. a specific warranty period). Confirm the retrieved chunk comes from the right document with a reasonable score. Then re-upload one of the PDFs unchanged and confirm the collection size stays constant.