How to Auto-Index Documents Arriving via FTP

Automatically process and index new documents uploaded to your FTP server.

What This Integration Does

Plenty of business documents - vendor catalogs, signed contracts, monthly statements, EDI drops - still arrive via FTP. Letting them sit in a folder where nobody can find them is a waste; surfacing them through a Knowledge collection turns the FTP drop zone into a searchable archive your AI workflows can reach.

This workflow polls an FTP directory on a schedule, identifies files that have not yet been processed, downloads them, extracts text per file type, and embeds each one into a Knowledge collection. A tracking table keeps the workflow idempotent so a file is never indexed twice, and a failure branch raises an alert if a particular file refuses to parse.

Prerequisites

An FTP connection (FTP, FTPS, or SFTP) with read access to the source directory.
A Knowledge collection that will hold the indexed documents.
A MongoDB or MySQL connection for the small "already-processed" tracking table.
A Slack or SMTP connection for parse-failure notifications (optional but recommended).

Step 1: Schedule Trigger

Add a Trigger node and set its Trigger Type to Schedule. Give it a 5-field cron expression and an IANA timezone, for example */15 * * * * with Australia/Sydney to poll every 15 minutes. Tighten it to */5 * * * * if files need to be searchable almost immediately. The Schedule trigger's output is {{ scheduledAt }}; no other input is needed.
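To make the polling cadence concrete, the following Python sketch expands the minute field of a five-field cron expression such as */15 * * * * into the minutes of each hour at which the trigger would fire. It is an illustration only: it handles just the *, */N, and single-number forms, and the function names are hypothetical rather than part of the platform.

```python
def minute_field(expression: str) -> str:
    """Return the first of the five whitespace-separated cron fields."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected a 5-field cron expression")
    return fields[0]


def minutes_for(field: str) -> list:
    """Expand a cron minute field ('*', '*/N', or a single number) to minutes."""
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        step = int(field[2:])
        if step <= 0:
            raise ValueError("step must be positive")
        return list(range(0, 60, step))
    minute = int(field)
    if not 0 <= minute < 60:
        raise ValueError("minute out of range")
    return [minute]


print(minutes_for(minute_field("*/15 * * * *")))  # [0, 15, 30, 45]
print(len(minutes_for(minute_field("*/5 * * * *"))))  # 12 polls per hour
```

Moving from */15 to */5 triples the number of polls per hour, which is worth weighing against the load on the FTP server.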

Step 2: Connector - List the Drop Directory

Add a Connector node pointing at the ftp connector with the list-directory tool:

{
  "path": "/incoming/contracts/",
  "recursive": false
}

This returns one entry per file, with fields such as name, size, and a last-modified timestamp. If you have nested folders (for example one per vendor), enable recursion and downstream nodes can use the path prefix as a tag. Use a Transform node if you want to reshape the listing before looping.
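When recursion is enabled, a Transform step could derive a vendor tag from each entry's path. The Python sketch below shows the idea. It assumes each listing entry carries a path field and that the folder directly beneath the drop root names the vendor; both the field name and the folder layout are assumptions for illustration, not documented connector output.

```python
import posixpath
from typing import Optional


def vendor_tag(path: str, root: str = "/incoming/contracts/") -> Optional[str]:
    """Return the first folder below root, or None for files directly in root."""
    relative = posixpath.relpath(path, root)
    if relative.startswith(".."):
        raise ValueError(f"{path!r} is outside {root!r}")
    parts = relative.split("/")
    return parts[0] if len(parts) > 1 else None


# Illustrative entries; a real listing would come from the list-directory tool.
entries = [
    {"path": "/incoming/contracts/acme/msa.pdf"},
    {"path": "/incoming/contracts/readme.txt"},
]
tagged = [{**e, "tag": vendor_tag(e["path"])} for e in entries]
print(tagged)
```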

Step 3: Filter Out Already-Processed Files

Add a Loop over the file list. Inside it, run a Connector step against your tracking store - for example mongodb find-documents with { "fileKey": "{{ file.name }}-{{ file.modifiedAt }}" }. Using both name and modified-at as the key means an updated version of a file will re-index automatically. Follow with a Condition that continues only when no record is found.
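The filtering in this step amounts to a membership test on the composite key. Here is a minimal Python sketch of that logic, with an in-memory set standing in for the tracking store. The name and modifiedAt field names follow the template above; the sample values are invented.

```python
def file_key(entry: dict) -> str:
    """Composite key: a changed modified-at timestamp yields a new key."""
    return f"{entry['name']}-{entry['modifiedAt']}"


def unprocessed(listing: list, processed_keys: set) -> list:
    """Keep only entries whose key has no record in the tracking store."""
    return [e for e in listing if file_key(e) not in processed_keys]


# The set stands in for the tracking store; values are illustrative.
processed = {"msa.pdf-2024-05-01T09:00:00Z"}
listing = [
    {"name": "msa.pdf", "modifiedAt": "2024-05-01T09:00:00Z"},     # already indexed
    {"name": "msa.pdf", "modifiedAt": "2024-05-03T14:30:00Z"},     # corrected copy
    {"name": "prices.csv", "modifiedAt": "2024-05-02T08:15:00Z"},  # new file
]
for entry in unprocessed(listing, processed):
    print(file_key(entry))
```

The corrected copy of msa.pdf passes the filter because its timestamp, and therefore its key, differs from the recorded one.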

Step 4: Download the File

For each new file, add a Connector node on the ftp connector with the download-file tool to pull the bytes into the workflow. The result exposes the file content as base64, ready to hand to the Knowledge node. Optionally branch by extension with a Condition:

PDF - call pdf get-info first to confirm it is a valid document and capture the page count for logging.
CSV - call csv info to capture row and column counts before indexing.
JSON - call json validate so a malformed file is skipped instead of breaking the run.
Other - route unsupported extensions to the skip path with a logged warning rather than failing the run.
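The branching above is a lookup from file extension to pre-check tool, with a skip as the fallback. The Python sketch below mirrors that routing. The tool names come from the list above, while the function and the return shape are hypothetical.

```python
import posixpath

# Extension -> pre-check tool named in the list above; anything else is skipped.
PRECHECKS = {
    ".pdf": "pdf get-info",
    ".csv": "csv info",
    ".json": "json validate",
}


def route(filename: str) -> tuple:
    """Return ('precheck', tool) for supported files, else ('skip', reason)."""
    ext = posixpath.splitext(filename)[1].lower()
    tool = PRECHECKS.get(ext)
    if tool is None:
        return ("skip", f"unsupported extension {ext or '(none)'}")
    return ("precheck", tool)


for name in ["Catalog.PDF", "prices.csv", "notes.docx"]:
    print(name, route(name))
```

Lower-casing the extension keeps files such as Catalog.PDF on the supported path instead of the skip path.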

Step 5: Embed into the Knowledge Collection

Add a Knowledge node and set its mode to Embed. Pick your persistent Collection in the dropdown, set Document Type to match the file (for example PDF, CSV/TSV, or JSON), and point Document Input at the base64 content from Step 4, for example {{ download.content }}. Set File Name to the file's name or path rather than the timestamped key from Step 3: the key changes whenever the file is modified, so only a stable name lets a re-index of a corrected copy overwrite the existing document rather than duplicate it. The Knowledge node handles text extraction and chunking internally; its Output Variable returns the chunk count and metadata. Keep the same embedding model the collection was created with.

Step 6: Record Success and Handle Failure

On the success path, write the file key and a status of indexed to the tracking store. On the failure path, reached through a Condition that checks the result of any step that can fail, write a failed record with the error message and post to your data-ops channel with slack send-message so someone can investigate before the file gets stuck. If you would rather a person sign off before a problem file is parked, add a Human node so an approver in the Approvals inbox confirms the skip. A rejection halts the run, so put the Human node at the end of the failure branch, after the failed record and the alert have been written.
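The two tracking records might look like the output of the Python sketch below. Only fileKey and the indexed and failed status values come from this guide; the recordedAt and error field names, and both function names, are assumptions for illustration.

```python
from datetime import datetime, timezone
from typing import Optional


def tracking_record(file_key: str, error: Optional[str] = None) -> dict:
    """Build the document written to the tracking store for one file."""
    record = {
        "fileKey": file_key,
        "status": "failed" if error else "indexed",
        "recordedAt": datetime.now(timezone.utc).isoformat(),  # assumed field
    }
    if error:
        record["error"] = error  # assumed field; reused in the alert text
    return record


def alert_text(record: dict) -> str:
    """Compose the body of the parse-failure notification."""
    return f"Indexing failed for {record['fileKey']}: {record['error']}"


ok = tracking_record("prices.csv-2024-05-02T08:15:00Z")
bad = tracking_record("scan.pdf-2024-05-02T08:20:00Z", error="not a valid PDF")
print(ok["status"], bad["status"])
print(alert_text(bad))
```

Writing a failed record as well as an indexed one means the Step 3 lookup finds the file on the next poll, so a file that cannot be parsed is not retried and re-alerted every cycle.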

Tips

Use the modified timestamp in the key - this gives you free re-indexing when a vendor uploads a corrected copy with the same filename.
Process files one at a time - a Loop walks the listing sequentially, which keeps large batches from competing for resources. If you need fan-out, a Parallel node can split the work across a few branches.
Archive after indexing - move processed files to an archive/ subfolder using ftp rename so the drop directory stays clean. Build the destination path with the date connector's format tool.
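For the archive tip, the destination path is the source directory plus an archive/ folder and, optionally, a date component. The Python sketch below assumes a per-day subfolder named YYYY-MM-DD. That layout is one possible choice rather than something the ftp or date connectors prescribe.

```python
import posixpath
from datetime import date


def archive_path(source_path: str, processed_on: date) -> str:
    """Place the file under <dir>/archive/<YYYY-MM-DD>/ next to its source."""
    directory, name = posixpath.split(source_path)
    day = processed_on.strftime("%Y-%m-%d")
    return posixpath.join(directory, "archive", day, name)


print(archive_path("/incoming/contracts/msa.pdf", date(2024, 5, 3)))
# /incoming/contracts/archive/2024-05-03/msa.pdf
```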

Common Pitfalls

Partial uploads - FTP does not lock a file while it is being written, so if a vendor is still uploading when your poll fires, you can download a truncated copy. Either filter on a minimum file age (e.g. only files older than 60 seconds) or check for a sentinel .done companion file.
Encoding - CSVs from older systems often arrive as Windows-1252 rather than UTF-8, which can leave non-ASCII characters garbled in the index. Run the file through the csv parse and csv to-json tools to normalise it before embedding if you see corruption.
Large PDFs - a very large catalog can be slow to embed in one pass. Use the pdf split tool to break it into parts, then Loop over each part into the Knowledge node.
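The first two pitfalls can be sketched in Python as a readiness check and an encoding fallback. Several details are assumptions: the sentinel is taken to be the file name with .done appended, modifiedAt is taken to be an ISO 8601 string with an explicit offset, and the re-encoding function is a plain-Python stand-in for the normalisation step, not the csv tools themselves.

```python
from datetime import datetime, timedelta, timezone


def is_ready(entry: dict, names: set, now: datetime,
             min_age: timedelta = timedelta(seconds=60),
             require_sentinel: bool = False) -> bool:
    """True when a file looks fully uploaded and is safe to download."""
    if entry["name"].endswith(".done"):
        return False  # sentinels are markers, not documents
    if require_sentinel:
        return entry["name"] + ".done" in names  # assumed naming scheme
    modified = datetime.fromisoformat(entry["modifiedAt"])
    return now - modified >= min_age


def to_utf8(raw: bytes) -> bytes:
    """Return raw unchanged if it is valid UTF-8, else re-encode from Windows-1252."""
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values undefined, hence errors="replace".
        return raw.decode("cp1252", errors="replace").encode("utf-8")


now = datetime(2024, 5, 3, 14, 31, 0, tzinfo=timezone.utc)
fresh = {"name": "a.pdf", "modifiedAt": "2024-05-03T14:30:30+00:00"}
settled = {"name": "b.pdf", "modifiedAt": "2024-05-03T14:29:00+00:00"}
print(is_ready(fresh, set(), now), is_ready(settled, set(), now))
print(to_utf8("café".encode("cp1252")).decode("utf-8"))
```

The age check needs no cooperation from the vendor but delays every file by at least the minimum age. The sentinel check is immediate but only works if the uploader writes the marker last.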

Testing

Drop three sample files into the FTP directory: one PDF, one CSV, and one file with an unsupported extension. Run the workflow manually. Confirm the PDF and CSV land in the Knowledge collection with sensible chunking, the unsupported file is skipped (not failed), and the tracking store now has three entries. Re-run and confirm nothing new is indexed.