How to Build a Document Search System from Multiple File Types

Create a unified search across PDFs, Word docs, spreadsheets, emails, and more.

What This Integration Does

Real businesses store knowledge across a dozen file formats. A unified search means staff can ask one question and get back relevant content whether the original lived in a PDF policy, a CSV pricing sheet, an email thread, or a JSON config dump. This Spojit workflow normalizes all of those formats into a single Knowledge collection that any workflow can query in Query mode.

The workflow accepts inputs from several triggers, branches per file type to the appropriate extractor, normalizes the result into plain text with a metadata header, and embeds the final payload into a persistent Knowledge collection. Each document is embedded under a descriptive File Name that names its source and format, so re-running against an updated source overwrites the prior version cleanly.

Prerequisites

A Knowledge collection that will hold the unified index.
An FTP connection for scheduled drops, a webhook-style Trigger for direct uploads, or a Front connection for emails - any combination of these.
The pdf, csv, and json utility connectors available in your workspace.

Step 1: Multi-Source Trigger Surface

Build three small entry workflows that all hand off to a shared Subworkflow:

A Webhook Trigger for direct uploads from internal tools.
A Schedule Trigger that polls an ftp directory with list-directory and download-file.
A Schedule Trigger that calls front list-conversations for shared inboxes.

Each entry normalizes its input into a common envelope: { filename, mimeType, content, sourceTag }, where content is the base64 document body the Knowledge node expects. That way the indexing subworkflow is the only thing that needs to know about format-specific extractors.
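As a rough illustration of the envelope described above, here is a minimal Python sketch of what each entry workflow would produce. The `Envelope` class and `make_envelope` helper are hypothetical names used only for illustration; in Spojit the same shape would be assembled with workflow nodes rather than code.

```python
import base64
from dataclasses import dataclass, asdict


@dataclass
class Envelope:
    # Field names mirror the envelope keys used in the text.
    filename: str
    mimeType: str
    content: str      # base64-encoded document body
    sourceTag: str


def make_envelope(filename: str, mime_type: str, raw: bytes, source_tag: str) -> dict:
    """Wrap raw bytes from any entry workflow in the common envelope."""
    encoded = base64.b64encode(raw).decode("ascii")
    return asdict(Envelope(filename, mime_type, encoded, source_tag))


# Example: a CSV picked up from an FTP drop (sample values are made up).
envelope = make_envelope("pricing.csv", "text/csv", b"sku,price\nA1,9.99\n", "contracts-ftp")
```

Keeping the body base64-encoded at this stage means binary formats such as PDF and text formats such as CSV travel through the same field without special cases.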

Step 2: Branch by File Type

Inside the shared subworkflow, add a Condition node that routes on mimeType. Each branch handles one format family:

PDF - pdf get-info for validation, then extract-text.
CSV / TSV - csv parse followed by to-json; build a text summary using a Transform step.
JSON - json validate, then prettify for an indexable representation.
Plain text / Markdown / HTML - pass straight through, optionally running an HTML-to-text cleanup via text replace.
Email body - already-text content delivered from the Front entry; concatenate subject, participants, and body.
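The routing decision made by the Condition node can be sketched in Python as a lookup from MIME type to branch. The branch labels and the exact MIME strings below are assumptions for illustration (in particular, the type your Front entry assigns to email content may differ); adjust them to whatever your entry workflows actually emit.

```python
def pick_branch(mime_type: str) -> str:
    """Return the branch label for a MIME type (labels are illustrative)."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    base = mime_type.split(";", 1)[0].strip().lower()
    routes = {
        "application/pdf": "pdf",
        "text/csv": "csv",
        "text/tab-separated-values": "csv",
        "application/json": "json",
        "text/plain": "text",
        "text/markdown": "text",
        "text/html": "text",
        "message/rfc822": "email",  # assumed type for the Front entry
    }
    # Anything unrecognized goes to an explicit fallback branch.
    return routes.get(base, "unsupported")
```

Sending unknown types to an explicit "unsupported" branch, rather than letting them fall into the plain-text path, makes unexpected uploads visible instead of silently indexing noise.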

Step 3: Normalize and Add a Metadata Header

Add a Transform step that prepends a consistent metadata header to every document:

Title: {{ envelope.filename }}
Format: {{ envelope.mimeType }}
Source: {{ envelope.sourceTag }}

{{ extractedText }}

This header is what powers useful citations in answers later: the Knowledge node quotes it during synthesis, and your UI can show "this came from a CSV uploaded via the contracts FTP folder".
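For reference, the same header logic expressed as a small Python function. This is only a sketch of what the Transform step does; `add_header` is a hypothetical name, and the envelope keys are the ones defined in Step 1.

```python
def add_header(envelope: dict, extracted_text: str) -> str:
    """Prepend the Title / Format / Source header to the extracted text."""
    header = (
        f"Title: {envelope['filename']}\n"
        f"Format: {envelope['mimeType']}\n"
        f"Source: {envelope['sourceTag']}\n"
    )
    # A blank line separates the header from the document body.
    return header + "\n" + extracted_text.strip() + "\n"
```

Because the header is identical for every format, a later prompt can refer to "Format" or "Source" without knowing which extractor produced the text.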

Step 4: Knowledge Node - Embed Mode

Add a Knowledge node and set its mode to Embed. Choose your persistent collection in the Collection dropdown and set Document Type to Plain Text, since your Transform step already produced clean text with the metadata header. Feed the base64 body into Document Input with {{ envelope.content }}; make sure that field now holds the normalized text from Step 3 (base64-encoded) rather than the original file bytes, otherwise the metadata header never reaches the index. Set File Name to a descriptive value such as {{ envelope.sourceTag }}-{{ envelope.filename }}. Because embedding overwrites any existing document with the same name, encoding the source and filename here is what gives you clean re-indexing. The Output Variable returns the chunk count and metadata.

Keep every format in one collection. Because the metadata header (format, source) is embedded inside each chunk, a later Query prompt can ask for "only spreadsheet sources" or "only emails from the support inbox" in plain language without needing separate collections.
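The overwrite-by-name behavior is easiest to see with a toy model. In the Python sketch below, a plain dictionary stands in for the Knowledge collection; it is not how the collection is implemented, only an illustration of why a File Name built from sourceTag and filename gives clean re-indexing.

```python
def document_name(envelope: dict) -> str:
    """Build the File Name from the source tag and the original filename."""
    return f"{envelope['sourceTag']}-{envelope['filename']}"


# Stand-in for the collection: one entry per File Name.
index: dict[str, str] = {}


def embed(envelope: dict, text: str) -> None:
    """Store text under its File Name, replacing any previous version."""
    index[document_name(envelope)] = text


first = {"sourceTag": "contracts-ftp", "filename": "pricing.csv"}
embed(first, "version 1")
embed(first, "version 2")  # same name, so this replaces version 1
embed({"sourceTag": "support-inbox", "filename": "pricing.csv"}, "emailed copy")
```

After these three calls the toy index holds two entries: the updated FTP copy and the separately named email copy. Including the source tag in the name is what keeps same-named files from different sources from overwriting each other.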

Step 5: Build the Search Endpoint

In a second workflow, add a Webhook Trigger so external tools can post a search request. The parsed JSON body is available as {{ input }}, so a caller sends something like { "question": "...", "format": "spreadsheet" }. Add a Knowledge node and set its mode to Query:

Collection: the persistent collection you embedded into.
Prompt: a natural-language question that folds in any caller filter, for example Answer using only {{ input.format }} sources: {{ input.question }}.
Result Count: how many chunks to retrieve (default 5; raise to 8 for broader questions).
Model: the AI model that synthesizes the answer.
Response Schema: optional JSON schema to force a structured { answer, sources } shape.

The Knowledge node does its own AI synthesis against the retrieved chunks and writes the result to its Output Variable. Return that answer to the caller with a Response node.
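To make the Prompt and Response Schema fields concrete, here is a small Python sketch. `build_prompt` is a hypothetical helper showing one way to handle a request that omits the format filter, and `RESPONSE_SCHEMA` is one plausible JSON Schema for the { answer, sources } shape; the exact schema dialect the Knowledge node accepts is an assumption here.

```python
def build_prompt(body: dict) -> str:
    """Fold the optional format filter into a natural-language prompt."""
    question = body["question"].strip()
    fmt = (body.get("format") or "").strip()
    if fmt:
        return f"Answer using only {fmt} sources: {question}"
    # No filter supplied: ask the question as-is.
    return question


# Assumed JSON Schema for a structured { answer, sources } response.
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "sources": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["answer", "sources"],
}
```

Handling the missing-filter case explicitly avoids sending a prompt that reads "Answer using only  sources: ..." when a caller leaves format out.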

Step 6: Logging and Re-Index Tooling

Write each indexing attempt (success or failure) to a small mongodb collection or mysql table so you can see ingestion volume and parse-failure rates at a glance. Add a second workflow with a Manual Trigger that takes a sourceTag and re-runs ingestion for everything tagged that way - handy when you change the header format or want documents re-chunked.
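One possible shape for the ingestion log, sketched in Python. SQLite from the standard library is used here purely as a stand-in so the example runs anywhere; the table name, columns, and helper functions are all assumptions, and in the workflow you would write the same fields through the mongodb or mysql connector.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE ingestion_log (
           logged_at   TEXT NOT NULL,
           source_tag  TEXT NOT NULL,
           filename    TEXT NOT NULL,
           mime_type   TEXT NOT NULL,
           status      TEXT NOT NULL CHECK (status IN ('ok', 'failed')),
           chunk_count INTEGER,
           error       TEXT
       )"""
)


def log_attempt(conn, envelope, status, chunk_count=None, error=None):
    """Record one indexing attempt, successful or not."""
    conn.execute(
        "INSERT INTO ingestion_log VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            envelope["sourceTag"],
            envelope["filename"],
            envelope["mimeType"],
            status,
            chunk_count,
            error,
        ),
    )
    conn.commit()


def failure_rates(conn) -> dict:
    """Return the fraction of failed attempts per MIME type."""
    rows = conn.execute(
        "SELECT mime_type, AVG(status = 'failed') FROM ingestion_log GROUP BY mime_type"
    ).fetchall()
    return dict(rows)
```

Storing source_tag on every row also gives the re-index workflow something to select on when it needs the list of documents for a given sourceTag.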

Tips

Pick one collection, not one per format - unified search relies on having every format in the same collection so one query reaches all of them.
Use the same embedding model throughout - the embedding model is fixed when you create the collection, so embed and query always run against it. Do not create a second collection, with a different model, just to hold another format.
Cap document size - reject anything over a sensible per-format limit (5 MB for PDFs, 10 MB for CSVs) in the entry workflows so the indexing path never sees pathological inputs.
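The size cap can be sketched as a simple check on the decoded envelope content. The 5 MB and 10 MB limits come from the tip above; the default limit for other formats and the helper names are assumptions for illustration.

```python
import base64

MB = 1024 * 1024

# Per-format limits from the tip above.
SIZE_LIMITS = {
    "application/pdf": 5 * MB,
    "text/csv": 10 * MB,
}
DEFAULT_LIMIT = 5 * MB  # assumed limit for formats not listed


def decoded_size(content_b64: str) -> int:
    """Size in bytes of the document once base64 is decoded."""
    return len(base64.b64decode(content_b64))


def within_limit(envelope: dict) -> bool:
    """True if the document is small enough to enter the indexing path."""
    limit = SIZE_LIMITS.get(envelope["mimeType"], DEFAULT_LIMIT)
    return decoded_size(envelope["content"]) <= limit
```

Measuring the decoded size rather than the base64 string length matters because base64 inflates data by roughly a third.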

Common Pitfalls

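Inconsistent MIME detection - relying solely on the file extension misclassifies plenty of files. Use the trigger's reported content type, and when it is missing or generic, fall back to sniffing the first few bytes of the content (a PDF, for example, starts with %PDF-).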
Embedding raw HTML - HTML tags add noise that hurts retrieval. Either strip them with text replace first, or set Document Type to HTML so the Knowledge node extracts readable text for you.
Spreadsheet semantics - a CSV embedded as raw rows is rarely useful. Generate a short natural-language summary in the Transform step (column names, row count, a couple of sample rows) so retrieval has something semantic to match against.
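Two of the pitfalls above lend themselves to short sketches in Python: a content sniff used only when the reported type is missing or generic, and a natural-language summary for a CSV. Both functions are hypothetical helpers with deliberately simple heuristics, not a complete detector or the exact text a Transform step must produce.

```python
import csv
import io
from typing import Optional


def sniff_mime(raw: bytes, reported: Optional[str] = None) -> str:
    """Prefer the reported type; otherwise guess from the first bytes."""
    if reported and reported != "application/octet-stream":
        return reported.split(";", 1)[0].strip().lower()
    head = raw.lstrip()[:64]
    if raw.startswith(b"%PDF-"):
        return "application/pdf"
    if head[:1] in (b"{", b"["):
        return "application/json"
    if head.lower().startswith((b"<!doctype html", b"<html")):
        return "text/html"
    return "text/plain"


def summarize_csv(text: str, sample_rows: int = 2) -> str:
    """Describe a CSV in prose: columns, row count, and a few sample rows."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return "Empty spreadsheet."
    header, data = rows[0], rows[1:]
    lines = [f"Spreadsheet with {len(data)} rows and columns: {', '.join(header)}."]
    for row in data[:sample_rows]:
        pairs = ", ".join(f"{col}={val}" for col, val in zip(header, row))
        lines.append(f"Sample row: {pairs}.")
    return "\n".join(lines)
```

The summary can be placed ahead of the raw rows in the normalized text, so retrieval matches on column names and sample values while the full data remains available in the document.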

Testing

Embed one document of each format you intend to support, using deliberately different topics so you know which file each answer should come from. To rehearse the embed-then-query flow on a single run before committing to your persistent collection, use a Transient collection first. Then call the search endpoint with one question targeted at each format (five questions if you support all five format families from Step 2). Confirm each answer draws from the correct source, and that adding a {{ input.format }} constraint to the Prompt narrows the result as expected.
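The checks above can be scripted. The Python sketch below assumes the endpoint returns the structured { answer, sources } shape with sources as a list of strings, which only holds if you configured a Response Schema that way. The `ask` callable is a placeholder for however you call your webhook, and `fake_ask` exists only so the example runs without a live endpoint.

```python
from typing import Callable


def run_checks(
    ask: Callable[[str, str], dict],
    cases: list[tuple[str, str, str]],
) -> list[str]:
    """Return one failure message per case whose expected source is missing."""
    failures = []
    for question, fmt, expected_source in cases:
        result = ask(question, fmt)
        sources = result.get("sources", [])
        if not any(expected_source in s for s in sources):
            failures.append(f"{question!r}: expected {expected_source}, got {sources}")
    return failures


def fake_ask(question: str, fmt: str) -> dict:
    """Stand-in for the real endpoint; always cites the pricing CSV."""
    return {"answer": "stub", "sources": ["contracts-ftp-pricing.csv"]}


cases = [
    ("What does SKU A1 cost?", "spreadsheet", "pricing.csv"),
    ("How many leave days do staff get?", "pdf", "policy.pdf"),
]
failures = run_checks(fake_ask, cases)
```

With the stub, the first case passes and the second is reported as a failure, which is the kind of signal you want when an answer draws from the wrong file.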