How to Build a Document Search System from Multiple File Types
Create a unified search across PDFs, Word docs, spreadsheets, emails, and more.
What This Integration Does
Real businesses store knowledge across a dozen file formats. A unified search means staff can ask one question and get back relevant content whether the original lived in a PDF policy, a CSV pricing sheet, an email thread, or a JSON config dump. This workflow normalizes all of those formats into a single Knowledge collection that any workflow or AI Agent can query.
The workflow accepts inputs from several triggers, branches per file type to the appropriate extractor, normalizes the result into plain text with a metadata header, and pushes the final payload into a Knowledge collection. Each document carries a sourceId and a format tag so queries can be scoped or filtered later. Re-running against an updated source replaces the prior version cleanly.
Prerequisites
- A Knowledge collection that will hold the unified index.
- An FTP connection for scheduled drops, a webhook-style Trigger for direct uploads, or a Front connection for emails - any combination of these.
- The pdf, csv, and json utility connectors available in your workspace.
Step 1: Multi-Source Trigger Surface
Build three small entry workflows that all hand off to a shared Subworkflow:
- A Webhook Trigger for direct uploads from internal tools.
- A Schedule Trigger that polls an ftp directory with list-directory and download-file.
- A Schedule Trigger that calls front list-conversations for shared inboxes.
Each entry normalizes its input into a common envelope: { filename, mimeType, bytes, sourceTag }. That way the indexing subworkflow is the only thing that needs to know about format-specific extractors.
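The envelope hand-off can be sketched in a few lines of Python. This is a minimal illustration rather than the platform's own schema: the make_envelope helper is a hypothetical name, and stripping parameters such as "; charset=utf-8" from the content type is an assumption about what the entry workflows should do. Only the four field names come from the text above.

```python
def make_envelope(filename: str, mime_type: str, data: bytes, source_tag: str) -> dict:
    """Build the common envelope that every entry workflow hands off.

    Hypothetical helper: only the four keys come from the guide.
    """
    if not filename:
        raise ValueError("filename is required")
    if not source_tag:
        raise ValueError("sourceTag is required")
    # Assumption: normalize "Text/CSV; charset=utf-8" down to "text/csv"
    # so the routing step can compare against plain media types.
    clean_mime = mime_type.split(";")[0].strip().lower()
    return {
        "filename": filename,
        "mimeType": clean_mime,
        "bytes": data,
        "sourceTag": source_tag,
    }
```

Keeping this normalization in the entry workflows means the shared subworkflow can assume mimeType is already lowercase and parameter-free.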
Step 2: Branch by File Type
Inside the shared subworkflow, add a Condition node that routes on mimeType. Each branch handles one format family:
- PDF - pdf get-info for validation, then extract-text.
- CSV / TSV - csv parse followed by to-json; build a text summary using a Transform step.
- JSON - json validate, then prettify for an indexable representation.
- Plain text / Markdown / HTML - pass straight through, optionally running an HTML-to-text cleanup via text replace.
- Email body - already-text content delivered from the Front entry; concatenate subject, participants, and body.
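The routing decision made by the Condition node can be sketched as a lookup with a byte-sniffing fallback. This is an illustration under assumptions: pick_branch and the branch labels are hypothetical names, and the message/rfc822 entry is a guess at how the Front entry might tag email content.

```python
import json

# Media type -> branch label. The labels are hypothetical; the
# "message/rfc822" entry assumes the Front entry tags emails that way.
BRANCH_BY_MIME = {
    "application/pdf": "pdf",
    "text/csv": "csv",
    "text/tab-separated-values": "csv",
    "application/json": "json",
    "text/plain": "text",
    "text/markdown": "text",
    "text/html": "text",
    "message/rfc822": "email",
}


def pick_branch(mime_type: str, data: bytes) -> str:
    """Return the branch for a document, sniffing bytes if the type is unknown."""
    mime = mime_type.split(";")[0].strip().lower()
    if mime in BRANCH_BY_MIME:
        return BRANCH_BY_MIME[mime]
    # Fallback 1: PDF files begin with the "%PDF-" signature.
    if data.startswith(b"%PDF-"):
        return "pdf"
    # Fallback 2: content that starts like a JSON object or array and parses.
    if data.lstrip()[:1] in (b"{", b"["):
        try:
            json.loads(data.decode("utf-8"))
            return "json"
        except ValueError:  # covers JSON and UTF-8 decoding errors
            pass
    return "unsupported"
```

Returning an explicit "unsupported" label, rather than guessing, gives the workflow a branch where it can log and skip files it cannot parse.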
Step 3: Normalize and Add a Metadata Header
Add a Transform step that prepends a consistent metadata header to every document:
Title: {{ envelope.filename }}
Format: {{ envelope.mimeType }}
Source: {{ envelope.sourceTag }}
Captured: {{ now }}
{{ extractedText }}
This header is what powers useful citations in answers later: the AI Agent will quote it, and your UI can show "this came from a CSV uploaded via the contracts FTP folder".
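For reference, the same header logic expressed as a small Python function. This is a sketch, not the Transform step itself: with_header is a hypothetical name, the ISO 8601 UTC timestamp format is an assumption, and so is the blank line separating the header from the body.

```python
from datetime import datetime, timezone
from typing import Optional


def with_header(envelope: dict, extracted_text: str,
                now: Optional[datetime] = None) -> str:
    """Prepend the four-line metadata header to the extracted text."""
    # Assumption: capture time is recorded as an ISO 8601 UTC timestamp.
    captured = (now or datetime.now(timezone.utc)).isoformat()
    header = "\n".join([
        f"Title: {envelope['filename']}",
        f"Format: {envelope['mimeType']}",
        f"Source: {envelope['sourceTag']}",
        f"Captured: {captured}",
    ])
    # Assumption: one blank line separates the header from the body.
    return f"{header}\n\n{extracted_text.strip()}\n"
```

Passing now explicitly makes the output reproducible, which helps when comparing a re-indexed document against its previous version.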
Step 4: Knowledge Node - Embed
Add a Knowledge node in embed mode targeting the unified collection. Pass the normalized text, set sourceId to {{ envelope.sourceTag }}::{{ envelope.filename }}, and store the format and original source as tags. Tags are crucial - they're how a later query can scope to "only spreadsheets" or "only emails from the support inbox".
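The values passed to the embed step can be sketched as follows. The payload shape is an assumption made for illustration: build_embed_payload and the "text" and "tags" key names are hypothetical, and only the sourceId pattern and the two tag values come from the step above.

```python
def build_embed_payload(envelope: dict, text: str, format_family: str) -> dict:
    """Assemble the values handed to the Knowledge node in embed mode.

    Hypothetical shape: the real node's field names may differ.
    """
    # sourceId pattern from the guide: sourceTag::filename.
    source_id = f"{envelope['sourceTag']}::{envelope['filename']}"
    return {
        "sourceId": source_id,
        "text": text,
        "tags": {
            "format": format_family,
            "sourceTag": envelope["sourceTag"],
        },
    }
```

Because the sourceId is derived only from the source tag and filename, re-running ingestion for the same file produces the same id, so the updated document can replace the prior version.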
Step 5: Build the Search Endpoint
In a second workflow, add a Webhook Trigger that accepts a search query plus optional filters (format, sourceTag). Feed those to a Knowledge node in query mode:
{
  "query": "{{ trigger.q }}",
  "topK": 8,
  "filters": {
    "format": "{{ trigger.format }}",
    "sourceTag": "{{ trigger.sourceTag }}"
  }
}
Hand the result chunks to an AI Agent step to synthesize an answer, then return both the answer and the raw citations via a Response node.
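Because the filters are optional, the query payload may need to omit them when the caller leaves them blank. Whether the Knowledge node treats an empty-string filter as "no filter" is not stated here, so the sketch below assumes it may not and drops empty values. build_query is a hypothetical helper name.

```python
def build_query(q: str, fmt: str = "", source_tag: str = "", top_k: int = 8) -> dict:
    """Build the query-mode payload, leaving out filters that were not supplied."""
    query = q.strip()
    if not query:
        raise ValueError("query must not be empty")
    # Keep only the filters the caller actually provided.
    candidates = {"format": fmt, "sourceTag": source_tag}
    filters = {key: value for key, value in candidates.items() if value}
    payload = {"query": query, "topK": top_k}
    if filters:
        payload["filters"] = filters
    return payload
```

With this shape, an unfiltered search sends no filters key at all, and a search scoped to spreadsheets sends only the format filter.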
Step 6: Logging and Re-Index Tooling
Write each indexing attempt (success or failure) to a small mongodb collection or mysql table so you can see ingestion volume and parse-failure rates at a glance. Add a second workflow with a Manual Trigger that takes a sourceTag and re-runs ingestion for everything tagged that way - handy when you change the header format or want to reflow chunks.
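A sketch of what the ingestion log might record, using Python's built-in sqlite3 as a stand-in so the example runs anywhere. The table name, column names, and helper functions are all hypothetical; the same layout would carry over to a mysql table or a mongodb collection.

```python
import sqlite3


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the ingestion log. sqlite3 is a stand-in store."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ingest_log (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               source_id TEXT NOT NULL,
               format TEXT NOT NULL,
               status TEXT NOT NULL CHECK (status IN ('success', 'failure')),
               error TEXT,
               logged_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def log_attempt(conn, source_id, fmt, ok, error=None):
    """Record one indexing attempt, successful or not."""
    conn.execute(
        "INSERT INTO ingest_log (source_id, format, status, error) VALUES (?, ?, ?, ?)",
        (source_id, fmt, "success" if ok else "failure", error),
    )
    conn.commit()


def failure_rates(conn) -> dict:
    """Return the share of failed attempts per format, between 0.0 and 1.0."""
    rows = conn.execute(
        "SELECT format, AVG(status = 'failure') FROM ingest_log GROUP BY format"
    ).fetchall()
    return {fmt: rate for fmt, rate in rows}
```

Logging failures alongside successes is what makes the parse-failure rate computable; a log of successes alone would show volume but hide broken extractors.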
Tips
- Pick one collection, not one per format - unified search relies on having all formats in the same vector space.
- Use tags for scoping, not separate collections - tags compose; collections do not.
- Cap document size - reject anything over a sensible per-format limit (5 MB for PDFs, 10 MB for CSVs) in the entry workflows so the indexing path never sees pathological inputs.
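The size cap from the last tip can be sketched as a lookup table checked in the entry workflows. The 5 MB and 10 MB figures come from the tip; the default for other formats, the reading of 1 MB as 1024 * 1024 bytes, and the within_limit name are assumptions.

```python
MB = 1024 * 1024  # assumption: 1 MB is treated as 1024 * 1024 bytes

# Per-format caps from the tip above.
SIZE_LIMITS = {"pdf": 5 * MB, "csv": 10 * MB}
DEFAULT_LIMIT = 5 * MB  # assumption: cap applied to formats not listed


def within_limit(format_family: str, size_bytes: int) -> bool:
    """Return True when a document is small enough to send to indexing."""
    return size_bytes <= SIZE_LIMITS.get(format_family, DEFAULT_LIMIT)
```

Checking the size before download-file or extraction runs keeps oversized inputs from ever reaching the shared subworkflow.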
Common Pitfalls
- Inconsistent mime detection - relying solely on file extension misclassifies plenty of files. Use the trigger's reported content type and, when that is missing or generic, fall back on sniffing the first few bytes (for example, a leading %PDF- signature).
- Embedding raw HTML - HTML tags pollute vectors. Strip them with text replace or an HTML parser before embedding.
- Spreadsheet semantics - a CSV embedded as raw rows is rarely useful. Generate a short natural-language summary in the Transform step (column names, row count, a couple of sample rows) so retrieval has something semantic to match against.
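Two of these pitfalls lend themselves to a short sketch: stripping HTML and summarizing a CSV. The Python below uses only the standard library and is an illustration, not the Transform step's actual logic. The function names are hypothetical, and the summary wording and the assumption that the first CSV row is a header are choices made for the example.

```python
import csv
import io
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect visible text, skipping script and style contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Strip tags and collapse whitespace so only readable text is embedded."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def summarize_csv(raw: str, sample_rows: int = 2) -> str:
    """Describe a CSV in plain language. Assumes the first row is a header."""
    rows = list(csv.reader(io.StringIO(raw)))
    if not rows:
        return "Empty spreadsheet."
    header, body = rows[0], rows[1:]
    lines = [f"Spreadsheet with {len(body)} rows and columns: {', '.join(header)}."]
    for row in body[:sample_rows]:
        pairs = ", ".join(f"{name}={value}" for name, value in zip(header, row))
        lines.append(f"Sample row: {pairs}.")
    return "\n".join(lines)
```

The summary puts column names into the embedded text, so a question that mentions "price" or "sku" has something to match even when no individual cell value resembles the question.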
Testing
Index one document of each format you intend to support, using deliberately different topics so you know which file each answer should come from. Hit the search endpoint with five questions, one targeted at each format. Confirm each answer cites the correct source and that the format filter works when you narrow the query.