How to Build a Knowledge Base from PDFs

Upload PDF documents into a searchable knowledge base for your workflows.

What This Integration Does

PDFs are the universal format for policies, handbooks, product manuals, and supplier specifications, and the worst format for finding anything. This Spojit workflow turns a stream of PDFs into a persistent Knowledge collection that any workflow in your workspace can query in plain English. Once it is running, sending a new PDF to the source your trigger watches is enough to make the content searchable across your workspace.

The pipeline accepts PDFs from one of several sources (email attachments, FTP drops, or direct webhook uploads), pulls the text out of each file with the pdf connector, and feeds it to the Knowledge node in embed mode. You set a stable File Name so later queries can attribute results, and so a re-run with the same name overwrites the existing document instead of creating a duplicate.

This tutorial uses a persistent collection because the goal is a long-lived archive that multiple workflows can read. For a one-off pattern where you extract structured fields from a single PDF and then discard it, pick Transient in the collection dropdown instead, which auto-creates a collection per run and cleans it up on completion. See How to Create NetSuite Sales Orders from Emailed PO PDFs for that pattern.

Prerequisites

A Knowledge collection created in advance from the Knowledge section of the sidebar (for example company-documents). Its embedding model is fixed at creation.
An input source: a Trigger set to Email or Webhook, or an FTP connection for scheduled directory polling.
The built-in pdf utility connector, which needs no authentication.

Step 1: Choose and Configure the Trigger

Drop a Trigger node onto the canvas. Pick one of:

Email: polls a connected Gmail or Outlook mailbox for messages with PDF attachments. Useful when staff forward documents to a mailbox like kb@yourcompany.com that you connect under Connections. Attachments arrive as references, so you fetch the bytes on demand.
Webhook: exposes a URL your internal upload tool or browser form can POST a PDF to. The output is the parsed JSON body.
Schedule: a 5-field cron plus timezone, paired with an FTP list-directory then download-file call for vendor drops. See How to Auto-Index Documents Arriving via FTP for the polling pattern.

Step 2: Connector - Inspect the PDF

Add a Connector node in Direct mode pointing at the pdf connector and pick the get-info tool. Pass the PDF bytes from your trigger source and bind the result to an output variable such as pdfInfo, which Step 4 reads. The tool returns metadata such as page count, document title, and whether the file is encrypted. Use this to short-circuit bad inputs early: add a Condition node that routes encrypted or zero-page PDFs to a notification branch rather than letting them poison the index.
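The routing rule for the Condition node can be sketched in Python. This is only an illustration of the decision logic, not the connector's API: the field names encrypted and pageCount are assumptions about the shape of the get-info result, and triage_pdf is a hypothetical helper.

```python
from typing import Any, Mapping


def triage_pdf(info: Mapping[str, Any]) -> str:
    """Return 'index' for usable PDFs, otherwise a short reason for the notification branch."""
    # 'encrypted' and 'pageCount' are assumed field names, not confirmed get-info output.
    if info.get("encrypted", False):
        return "notify: encrypted"
    # A missing or null page count is treated the same as zero pages.
    if int(info.get("pageCount") or 0) <= 0:
        return "notify: no pages"
    return "index"
```

In the workflow itself, the same two checks would become the Condition node's branches, with anything other than "index" going to the notification path.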

Step 3: Connector - Extract Text

Add another Connector node in Direct mode on the pdf connector and pick the extract-text tool. Map its document input to the PDF bytes from your trigger and bind the result to an output variable such as extractedText:

extractedText <- pdf.extract-text( {{ pdfBytes }} )

For very large documents (more than 200 pages), pair this with the pdf split tool inside a Loop so you process the file in smaller chunks instead of one giant request.
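As a rough sketch of the chunking arithmetic, the Loop needs a list of page ranges to hand to the split tool one at a time. The helper below is hypothetical, and both the chunk size of 50 pages and the inclusive 1-based ranges are assumptions; check which range format the split tool actually expects.

```python
from typing import Iterator, Tuple


def page_ranges(total_pages: int, chunk_size: int = 50) -> Iterator[Tuple[int, int]]:
    """Yield inclusive 1-based (first_page, last_page) ranges covering the whole document."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(1, total_pages + 1, chunk_size):
        # The last range is clipped so it never runs past the final page.
        yield start, min(start + chunk_size - 1, total_pages)


# A 230-page file in chunks of 100 pages.
print(list(page_ranges(230, 100)))  # [(1, 100), (101, 200), (201, 230)]
```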

Step 4: Transform - Build the Document Header

Add a Transform node to compose the final text payload that goes into the collection. Prepend a small header so future Knowledge queries surface useful context with each result:

Title: {{ pdfInfo.title }}
Pages: {{ pdfInfo.pageCount }}

{{ extractedText }}

Strip page-number-only lines and repeated headers and footers if you can identify a pattern: they pollute retrieval and rarely help.
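One possible cleanup heuristic, sketched in Python under stated assumptions: a line counts as a page number if it is only digits, optionally in a "Page 2 of 3" or "2/3" form, and counts as a repeated header or footer if the identical line appears at least min_repeats times. Both rules and the helper name are illustrative; the repeat rule can also drop legitimate lines that recur in the body, so tune it against your own documents.

```python
import re
from collections import Counter

# Matches "7", "Page 7", "Page 7 of 12", and "7/12" when they fill the whole line.
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d+(?:\s*(?:of|/)\s*\d+)?\s*$", re.IGNORECASE)


def strip_page_furniture(text: str, min_repeats: int = 3) -> str:
    """Drop page-number-only lines and lines repeated at least min_repeats times."""
    lines = text.splitlines()
    counts = Counter(line.strip() for line in lines if line.strip())
    kept = []
    for line in lines:
        stripped = line.strip()
        if PAGE_NUMBER.match(stripped):
            continue  # page-number-only line
        if stripped and counts[stripped] >= min_repeats:
            continue  # likely a running header or footer
        kept.append(line)
    return "\n".join(kept)
```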

Step 5: Knowledge Node - Embed

Add a Knowledge node and set its mode to Embed. Set Collection to your persistent collection (for example company-documents). Set Document Type to Plain Text since you are passing extracted text. For Document Input, pass a base64 reference to the composed text: run it through the encoding connector base64-encode tool first, then reference that variable here. Set File Name to a stable identifier such as the original filename plus a folder or vendor prefix: a re-run with the same File Name overwrites the prior version rather than creating a second copy. Bind Output Variable to capture the chunk count and metadata.
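For reference, the composition and encoding steps amount to the following in Python. This is a sketch of the data transformation only: the workflow does this with the Transform node and the encoding connector, and the function names here are hypothetical. It assumes the text is encoded as UTF-8 before base64 encoding.

```python
import base64


def compose_document(title: str, page_count: int, body: str) -> str:
    """Prepend the Title/Pages header from Step 4 to the extracted text."""
    return f"Title: {title}\nPages: {page_count}\n\n{body}"


def encode_document(text: str) -> str:
    """Base64-encode the text as UTF-8 (assumed encoding) and return an ASCII string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


payload = encode_document(compose_document("Travel Policy", 12, "Employees must..."))
# Decoding the payload returns the composed text unchanged.
assert base64.b64decode(payload).decode("utf-8").startswith("Title: Travel Policy")
```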

If you would rather skip the text-extraction steps, you can set Document Type to PDF and pass the raw PDF as a base64 reference in Document Input instead, and let the node handle extraction for you.

Step 6: Query Later from Any Workflow

Because Knowledge collections are workspace-scoped, any other workflow can add a Knowledge node in Query mode against the same collection. Set Collection to company-documents, put the user's question in Prompt, set Result Count to a value between 4 and 8, and pick a Model for synthesis. Add an optional Response Schema if you want structured JSON back, and bind Output Variable to the answer. The node retrieves the most relevant chunks and synthesizes a grounded answer.

Tips

Normalize whitespace: PDFs frequently produce extracted text with odd line breaks. A text replace or regex replace step before embedding produces cleaner results.
Separate corpora with collections: if you want policies and vendor docs queried independently, create a separate persistent collection for each rather than mixing them, since each Knowledge query targets a single collection.
Plan for scanned PDFs: image-only PDFs return little or no text from extract-text. Detect this (extracted length under, say, 200 characters) and embed the file with Document Type set to Images via OCR instead of indexing emptiness.
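The whitespace and scanned-PDF tips can be sketched as two small Python helpers. These are illustrative only, with hypothetical names: in the workflow the equivalent would be a regex replace step and a Condition node. The 200-character threshold comes from the tip above; counting only non-whitespace characters is an added assumption so that pages of blank lines do not pass the check.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces and tabs, trim line edges, and cap blank lines at one."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t]+", " ", text)       # runs of spaces or tabs become one space
    text = re.sub(r" ?\n ?", "\n", text)      # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)    # at most one blank line in a row
    return text.strip()


def looks_scanned(text: str, min_chars: int = 200) -> bool:
    """Return True when the extracted text is too short to be a real text layer."""
    # Whitespace is ignored so pages of empty lines still count as "no text".
    return len("".join(text.split())) < min_chars
```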

Common Pitfalls

Duplicate documents: if File Name is just the original filename, two different vendors that both send spec.pdf collide and one overwrites the other. Include the folder path or vendor ID in the name.
Encrypted PDFs: extract-text can fail silently or with a cryptic error. Always run get-info first and bail out cleanly.
Memory pressure on giant files: one 500 MB scanned PDF can stall a worker. Cap the trigger's accepted file size and use pdf split for anything over 50 MB.
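To make the collision pitfall concrete, here is one way to derive a document key that includes the folder path and an optional vendor ID. It is a sketch, not a platform requirement: stable_file_name is a hypothetical helper, and the lowercase form, the double-underscore separator, and the allowed character set are all arbitrary choices.

```python
import re
from pathlib import PurePosixPath


def stable_file_name(source_path: str, vendor_id: str = "") -> str:
    """Build a deterministic key from the vendor ID and every folder in the source path."""
    normalized = source_path.replace("\\", "/")
    parts = [p for p in PurePosixPath(normalized).parts if p not in ("/", ".", "..")]
    if vendor_id:
        parts.insert(0, vendor_id)
    slug = "__".join(parts).lower()
    # Replace anything outside a conservative character set with a hyphen.
    return re.sub(r"[^a-z0-9._-]+", "-", slug).strip("-")


print(stable_file_name("/drops/acme/spec.pdf"))      # drops__acme__spec.pdf
print(stable_file_name("spec.pdf", vendor_id="Globex"))  # globex__spec.pdf
```

Because the function is deterministic, re-running the workflow on the same file yields the same key, while files with the same basename from different folders or vendors get different keys.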

Testing

Index three known PDFs with very different content. Open a second workflow, drop a Knowledge node in Query mode against the collection, and ask a question whose answer you know lives in exactly one of the documents. Confirm the result chunk comes from that document and that the File Name it reports matches. Then re-run the indexing workflow with the same PDFs and confirm the number of documents in the collection stays the same, which shows that matching File Names overwrite rather than duplicate.