How to Auto-Index Documents Arriving via FTP

Automatically process and index new documents uploaded to your FTP server.

What This Integration Does

Plenty of business documents - vendor catalogs, signed contracts, monthly statements, EDI drops - still arrive via FTP. Letting them sit in a folder where nobody can find them is a waste; surfacing them through a Knowledge collection turns the FTP drop zone into a searchable archive your AI workflows can reach.

This workflow polls an FTP directory on a schedule, identifies files that have not yet been processed, downloads them, extracts text per file type, and embeds each one into a Knowledge collection. A tracking table keeps the workflow idempotent so a file is never indexed twice, and a failure branch raises an alert if a particular file refuses to parse.

Prerequisites

  • An FTP connection (FTP, FTPS, or SFTP) with read access to the source directory.
  • A Knowledge collection that will hold the indexed documents.
  • A MongoDB or MySQL connection for the small "already-processed" tracking table.
  • A Slack or SMTP connection for parse-failure notifications (optional but recommended).

Step 1: Schedule Trigger

Add a Trigger node and set the sub-type to Schedule. Every 15 minutes is a sensible default for vendor drops; tighten it to 5 minutes if files need to be searchable almost immediately. The schedule trigger needs no input, so just give it a clear name like ftp-index-poll.

Step 2: Connector - List the Drop Directory

Add a Connector node pointing at the ftp connector with the list-directory tool:

{
  "path": "/incoming/contracts/",
  "recursive": false
}

This returns one entry per file with name, size, and modifiedAt. If you have nested folders (e.g. one per vendor), set recursive: true and downstream nodes can use the path prefix as a tag.
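
As a rough illustration of the "path prefix as a tag" idea, the Python sketch below derives a vendor tag from a file's full path. It assumes each listing entry exposes a full POSIX-style path and that the first folder under the drop directory names the vendor; the function name vendor_tag and the sample paths are hypothetical, not part of the ftp connector.

```python
from pathlib import PurePosixPath
from typing import Optional


def vendor_tag(file_path: str, root: str = "/incoming/contracts/") -> Optional[str]:
    """Return the first folder under root, or None for files directly in root.

    Raises ValueError if file_path is not located under root.
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(root))
    parts = relative.parts
    # A single part means the file sits directly in the drop directory.
    return parts[0] if len(parts) > 1 else None


print(vendor_tag("/incoming/contracts/acme/q1-pricing.pdf"))  # acme
print(vendor_tag("/incoming/contracts/q1-pricing.pdf"))       # None
```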

Step 3: Filter Out Already-Processed Files

Add a Loop over the file list. Inside it, run a Connector step against your tracking store - for example mongodb find-documents with { "fileKey": "{{ file.name }}-{{ file.modifiedAt }}" }. Using both name and modified-at as the key means an updated version of a file will re-index automatically. Follow with a Condition that continues only when no record is found.
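
The filtering logic can be pictured outside the workflow builder with a short Python sketch. It assumes listing entries are dictionaries with the name and modifiedAt fields described in Step 2, and that the tracking store's keys have already been loaded into a set; the helper names file_key and filter_unprocessed are illustrative only.

```python
def file_key(name: str, modified_at: str) -> str:
    """Combine name and modified-at the same way the Step 3 template does."""
    return f"{name}-{modified_at}"


def filter_unprocessed(files: list[dict], processed_keys: set[str]) -> list[dict]:
    """Keep only listing entries whose key is absent from the tracking store."""
    return [
        f for f in files
        if file_key(f["name"], f["modifiedAt"]) not in processed_keys
    ]


# Sample data (made up): contract-a.pdf was re-uploaded after it was indexed.
listing = [
    {"name": "contract-a.pdf", "size": 2048, "modifiedAt": "2024-05-02T09:30:00Z"},
    {"name": "prices.csv", "size": 512, "modifiedAt": "2024-05-01T11:00:00Z"},
]
already_indexed = {
    "contract-a.pdf-2024-05-01T10:00:00Z",
    "prices.csv-2024-05-01T11:00:00Z",
}

new_files = filter_unprocessed(listing, already_indexed)
print([f["name"] for f in new_files])  # ['contract-a.pdf']
```

Because the newer contract-a.pdf has a different modified-at value, its key no longer matches the stored one and the file is picked up again, while the unchanged prices.csv is skipped.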

Step 4: Download and Extract

For each new file, call the ftp connector's download-file tool to pull the bytes into the workflow. Then branch by extension with a Condition:

  • PDF - call pdf get-info first to confirm it's a valid document and capture page count, then pdf extract-text for the body.
  • CSV - call csv parse to get rows, then csv to-json for a structured representation.
  • JSON - call json validate followed by json prettify so the embedded version is human-readable.
  • Other - skip with a logged warning rather than failing the run.
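
The branching above amounts to a lookup from file extension to an ordered list of tool calls. The Python sketch below shows that mapping under the assumption that matching should be case-insensitive; the tool names are copied from the list above, while the dictionary and the extraction_plan helper are hypothetical.

```python
from pathlib import PurePosixPath

# Tool names come from the list above; the mapping itself is illustrative.
EXTRACTION_STEPS = {
    ".pdf": ["pdf get-info", "pdf extract-text"],
    ".csv": ["csv parse", "csv to-json"],
    ".json": ["json validate", "json prettify"],
}


def extraction_plan(filename: str) -> list[str]:
    """Return the ordered tool calls for a file, or an empty list to skip it."""
    suffix = PurePosixPath(filename).suffix.lower()
    return EXTRACTION_STEPS.get(suffix, [])


for name in ["Catalog.PDF", "prices.csv", "notes.docx"]:
    plan = extraction_plan(name)
    if plan:
        print(f"{name}: {' -> '.join(plan)}")
    else:
        # Unsupported types are skipped with a warning, not treated as failures.
        print(f"{name}: skipped (unsupported extension)")
```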

Step 5: Embed into the Knowledge Collection

Add a Knowledge node in embed mode. Compose a small header so the indexed chunk knows where it came from:

Source: ftp://{{ ftp.host }}{{ file.path }}
Filename: {{ file.name }}
Indexed: {{ now }}

{{ extractedText }}

Set the document's sourceId to the file key from Step 3 so a future re-index replaces rather than duplicates.
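
To make the header template concrete, here is a minimal Python sketch that renders the same four-part layout. How {{ now }} is actually formatted by the platform is not specified here, so the ISO 8601 timestamp is an assumption, as are the function name compose_document and the sample host.

```python
from datetime import datetime, timezone


def compose_document(host: str, path: str, name: str, text: str, now: datetime) -> str:
    """Prefix the extracted text with a provenance header and a blank line."""
    header = (
        f"Source: ftp://{host}{path}\n"
        f"Filename: {name}\n"
        f"Indexed: {now.isoformat()}\n"  # ISO 8601 is an assumed format
    )
    return f"{header}\n{text}"


doc = compose_document(
    host="ftp.example.com",
    path="/incoming/contracts/contract-a.pdf",
    name="contract-a.pdf",
    text="This agreement is made between ...",
    now=datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc),
)
print(doc)
```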

Step 6: Record Success and Handle Failure

On the success path, write the file key and a status of indexed to the tracking store. On the failure path (use a Condition against the result of any step that can throw), write a failed record with the error message and call slack send-message to alert your data-ops channel, so someone can investigate before the file gets stuck. Optionally add a Human node so that a person must mark the file as "skipped" before the workflow stops retrying it.
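
One possible shape for the tracking record is sketched below in Python. The fileKey field and the indexed and failed status values come from the steps above; the error and updatedAt field names, and the tracking_record helper itself, are assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Optional


def tracking_record(
    file_key: str,
    error: Optional[str] = None,
    now: Optional[datetime] = None,
) -> dict:
    """Build the document written to the tracking store for one file."""
    now = now or datetime.now(timezone.utc)
    record = {
        "fileKey": file_key,
        "status": "failed" if error else "indexed",
        "updatedAt": now.isoformat(),  # hypothetical field name
    }
    if error:
        record["error"] = error  # hypothetical field name
    return record


stamp = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
ok = tracking_record("prices.csv-2024-05-01T11:00:00Z", now=stamp)
bad = tracking_record("scan.pdf-2024-05-01T11:05:00Z", error="not a valid PDF", now=stamp)
print(ok["status"], bad["status"])  # indexed failed
```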

Tips

  • Use the modified timestamp in the key - this gives you free re-indexing when a vendor uploads a corrected copy with the same filename.
  • Throttle the loop - PDF extraction is CPU-heavy; set the loop concurrency to 3-5 rather than letting 100 files extract in parallel.
  • Archive after indexing - move processed files to /incoming/contracts/archive/{{ year }}/{{ month }}/ using ftp rename so the drop directory stays clean.
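
For the archive tip, the target path can be computed as in this Python sketch. It assumes {{ month }} should be zero-padded so folders sort correctly, which the platform's template may or may not do; archive_path is a hypothetical helper, not an ftp connector tool.

```python
from datetime import date
from pathlib import PurePosixPath


def archive_path(name: str, when: date, root: str = "/incoming/contracts/archive") -> str:
    """Build the year/month archive destination for a processed file."""
    # Zero-padding the month is an assumption made for sortable folder names.
    return str(PurePosixPath(root) / f"{when.year:04d}" / f"{when.month:02d}" / name)


print(archive_path("contract-a.pdf", date(2024, 5, 1)))
# /incoming/contracts/archive/2024/05/contract-a.pdf
```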

Common Pitfalls

  • FTP file locks - if a vendor is still writing a file when your poll fires, you can download a truncated version. Either filter on a minimum file age (e.g. only files older than 60 seconds) or check for a sentinel .done companion file.
  • Encoding - CSVs from older systems often arrive as Windows-1252 rather than UTF-8. The csv parse tool needs the encoding hint or you will get garbled non-ASCII characters indexed.
  • Large PDFs - single-shot extraction of a 1000-page catalog is likely to time out. Use pdf split first, then loop over each part.
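
The first two pitfalls can be guarded against with small checks like the ones in this Python sketch. It assumes the modified-at value has been parsed into a timezone-aware datetime, and that trying UTF-8 before falling back to Windows-1252 (cp1252 in Python) is acceptable for your data; both helper names are hypothetical, and the csv parse tool's own encoding option is not shown.

```python
from datetime import datetime, timedelta, timezone


def is_settled(
    modified_at: datetime,
    now: datetime,
    min_age: timedelta = timedelta(seconds=60),
) -> bool:
    """True when the file has gone unmodified for at least min_age."""
    return now - modified_at >= min_age


def decode_csv(data: bytes) -> tuple[str, str]:
    """Decode as UTF-8 first, then fall back to Windows-1252.

    Returns the decoded text and the name of the encoding that was used.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # cp1252 leaves a few bytes undefined, so replace them rather than fail.
        return data.decode("cp1252", errors="replace"), "cp1252"


now = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
print(is_settled(now - timedelta(seconds=30), now))  # False
print(is_settled(now - timedelta(seconds=90), now))  # True
print(decode_csv("café".encode("cp1252")))           # ('café', 'cp1252')
```

The UTF-8-first order matters: valid UTF-8 would also decode under cp1252 without error, just into the wrong characters, so the stricter codec has to be tried first.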

Testing

Drop three sample files into the FTP directory: one PDF, one CSV, and one file with an unsupported extension. Run the workflow manually. Confirm the PDF and CSV land in the Knowledge collection with sensible chunking, the unsupported file is skipped (not failed), and the tracking store now has three entries. Re-run and confirm nothing new is indexed.
