How to Auto-Index Documents Arriving via FTP
Automatically process and index new documents uploaded to your FTP server.
What This Integration Does
Plenty of business documents - vendor catalogs, signed contracts, monthly statements, EDI drops - still arrive via FTP. Letting them sit in a folder where nobody can find them is a waste; surfacing them through a Knowledge collection turns the FTP drop zone into a searchable archive your AI workflows can reach.
This workflow polls an FTP directory on a schedule, identifies files that have not yet been processed, downloads them, extracts text per file type, and embeds each one into a Knowledge collection. A tracking table keeps the workflow idempotent so a file is never indexed twice, and a failure branch raises an alert if a particular file refuses to parse.
Prerequisites
- An FTP connection (FTP, FTPS, or SFTP) with read access to the source directory.
- A Knowledge collection that will hold the indexed documents.
- A MongoDB or MySQL connection for the small "already-processed" tracking table.
- A Slack or SMTP connection for parse-failure notifications (optional but recommended).
Step 1: Schedule Trigger
Add a Trigger node and set the sub-type to Schedule. Every 15 minutes is a sensible default for vendor drops; tighten it to 5 minutes if files need to be searchable almost immediately. The schedule trigger needs no input, so just give it a clear name like ftp-index-poll.
Step 2: Connector - List the Drop Directory
Add a Connector node pointing at the ftp connector with the list-directory tool:
{
  "path": "/incoming/contracts/",
  "recursive": false
}
This returns one entry per file with name, size, and modifiedAt. If you have nested folders (e.g. one per vendor), set recursive: true and downstream nodes can use the path prefix as a tag.
Step 3: Filter Out Already-Processed Files
Add a Loop over the file list. Inside it, run a Connector step against your tracking store - for example mongodb find-documents with { "fileKey": "{{ file.name }}-{{ file.modifiedAt }}" }. Using both name and modified-at as the key means an updated version of a file will re-index automatically. Follow with a Condition that continues only when no record is found.
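The filtering logic above can be sketched outside the workflow builder to make the key scheme concrete. This is a minimal illustration, not the platform's implementation: the function names `file_key` and `select_new_files` are hypothetical, the timestamp format is assumed, and the set of processed keys stands in for the tracking-store lookup.

```python
from typing import Dict, Iterable, List, Set


def file_key(name: str, modified_at: str) -> str:
    """Combine name and modified-at, mirroring the fileKey template in the query."""
    return f"{name}-{modified_at}"


def select_new_files(files: Iterable[Dict], processed_keys: Set[str]) -> List[Dict]:
    """Keep only the entries whose key is absent from the tracking store."""
    return [
        f for f in files
        if file_key(f["name"], f["modifiedAt"]) not in processed_keys
    ]


# Hypothetical directory listing with the fields the list-directory step returns.
listing = [
    {"name": "acme.pdf", "size": 1024, "modifiedAt": "2024-05-01T10:00:00Z"},
    {"name": "globex.csv", "size": 2048, "modifiedAt": "2024-05-02T08:30:00Z"},
]
# Stand-in for the keys already recorded in the tracking store.
processed = {"acme.pdf-2024-05-01T10:00:00Z"}

new_files = select_new_files(listing, processed)
```

Because the timestamp is part of the key, a re-upload of acme.pdf with a later modifiedAt produces a key that is not in the store, so it passes the filter again.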
Step 4: Download and Extract
For each new file, call the ftp connector's download-file tool to pull the bytes into the workflow. Then branch by extension with a Condition:
- PDF - call pdf get-info first to confirm it's a valid document and capture page count, then pdf extract-text for the body.
- CSV - call csv parse to get rows, then csv to-json for a structured representation.
- JSON - call json validate followed by json prettify so the embedded version is human-readable.
- Other - skip with a logged warning rather than failing the run.
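The branching by extension can be pictured as a lookup table. The sketch below is illustrative only: the tool names come from the list above, while `TOOL_CHAIN` and `plan_extraction` are hypothetical names, and returning `None` stands in for the "skip with a logged warning" branch.

```python
from pathlib import PurePosixPath
from typing import List, Optional

# Extension -> ordered tool calls, taken from the branches described above.
TOOL_CHAIN = {
    ".pdf": ["pdf get-info", "pdf extract-text"],
    ".csv": ["csv parse", "csv to-json"],
    ".json": ["json validate", "json prettify"],
}


def plan_extraction(filename: str) -> Optional[List[str]]:
    """Return the tool sequence for a file, or None when it should be skipped."""
    # Lower-case the suffix so REPORT.PDF and report.pdf take the same branch.
    ext = PurePosixPath(filename).suffix.lower()
    return TOOL_CHAIN.get(ext)


for name in ["catalog.PDF", "prices.csv", "notes.docx"]:
    plan = plan_extraction(name)
    if plan is None:
        print(f"warning: skipping unsupported file {name}")
    else:
        print(f"{name}: {' -> '.join(plan)}")
```

Matching on the lower-cased suffix is a design choice worth copying into the Condition itself, since vendor uploads often mix upper- and lower-case extensions.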
Step 5: Embed into the Knowledge Collection
Add a Knowledge node in embed mode. Compose a small header so the indexed chunk knows where it came from:
Source: ftp://{{ ftp.host }}{{ file.path }}
Filename: {{ file.name }}
Indexed: {{ now }}
{{ extractedText }}
Set the document's sourceId to the file key from Step 3 so a future re-index replaces rather than duplicates.
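To show what the composed document looks like once the template is filled in, here is a small sketch. The function `build_document` and its parameters are hypothetical, and the ISO 8601 timestamp is an assumption about how {{ now }} renders.

```python
from datetime import datetime, timezone
from typing import Optional


def build_document(host: str, path: str, name: str, text: str,
                   now: Optional[datetime] = None) -> str:
    """Prefix the extracted text with the provenance header from the template."""
    now = now or datetime.now(timezone.utc)
    header = "\n".join([
        f"Source: ftp://{host}{path}",
        f"Filename: {name}",
        f"Indexed: {now.isoformat()}",  # assumed timestamp format
    ])
    return f"{header}\n{text}"


doc = build_document(
    host="ftp.example.com",  # placeholder host
    path="/incoming/contracts/acme.pdf",
    name="acme.pdf",
    text="Master services agreement ...",
    now=datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc),
)
print(doc)
```

Keeping the header short matters because it is embedded along with the body; a long preamble would dilute the first chunk of every document.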
Step 6: Record Success and Handle Failure
On the success path, write the file key and a status of indexed to the tracking store. On the failure path (use a Condition against the result of any step that can throw), write a failed record with the error message and send a slack send-message to your data-ops channel so someone can investigate before the file gets stuck. Optionally use the Human node to require a human to mark the file as "skipped" before the workflow stops retrying it.
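The two tracking-store writes might carry records shaped like the following. This is a sketch under assumptions: only the fileKey field and the indexed/failed statuses come from the steps above; the `error` field name and the `tracking_record` helper are hypothetical.

```python
from typing import Dict, Optional


def tracking_record(file_key: str, error: Optional[str] = None) -> Dict[str, str]:
    """Build the document written to the tracking store for one file."""
    if error is None:
        return {"fileKey": file_key, "status": "indexed"}
    # Keep the error message so whoever picks up the alert can see why parsing failed.
    return {"fileKey": file_key, "status": "failed", "error": error}


ok = tracking_record("acme.pdf-2024-05-01T10:00:00Z")
bad = tracking_record("broken.pdf-2024-05-01T11:00:00Z", error="invalid PDF header")
```

Note that the Step 3 filter as described continues only when no record is found, so whether a failed record blocks or permits a retry depends on whether that lookup also checks the status.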
Tips
- Use the modified timestamp in the key - this gives you free re-indexing when a vendor uploads a corrected copy with the same filename.
- Throttle the loop - PDF extraction is CPU-heavy; set the loop concurrency to 3-5 rather than letting 100 files extract in parallel.
- Archive after indexing - move processed files to /incoming/contracts/archive/{{ year }}/{{ month }}/ using ftp rename so the drop directory stays clean.
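For the archive tip, the destination path can be computed as below. This is a minimal sketch: `archive_path` is a hypothetical helper, and zero-padding the month is an assumption about how {{ month }} should render, chosen so folders sort correctly.

```python
import posixpath
from datetime import date


def archive_path(src_path: str, today: date,
                 root: str = "/incoming/contracts/archive") -> str:
    """Build the year/month archive destination for a processed file."""
    return posixpath.join(
        root,
        str(today.year),
        f"{today.month:02d}",  # assumed zero-padded month
        posixpath.basename(src_path),
    )


dest = archive_path("/incoming/contracts/acme.pdf", date(2024, 5, 1))
print(dest)
```

If the listing in Step 2 ever runs with recursive: true, an archive folder nested under the drop directory would be listed again, so it may be worth excluding it or archiving elsewhere.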
Common Pitfalls
- FTP file locks - if a vendor is still writing a file when your poll fires, you can download a truncated version. Either filter on a minimum file age (e.g. only files older than 60 seconds) or check for a sentinel .done companion file.
- Encoding - CSVs from older systems often arrive as Windows-1252 rather than UTF-8. The csv parse tool needs the encoding hint or you will get garbled non-ASCII characters indexed.
- Large PDFs - a 1000-page catalog will time out single-shot extraction. Use pdf split first, then loop over each part.
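Two of these pitfalls reduce to small checks that are easy to reason about in code. The sketch below is illustrative: `is_settled` and `decode_csv_bytes` are hypothetical helpers, the 60-second threshold is the example value from the list, and trying UTF-8 before falling back to Windows-1252 is one possible heuristic rather than what the csv parse tool does.

```python
from datetime import datetime, timedelta, timezone


def is_settled(modified_at: datetime, now: datetime, min_age_seconds: int = 60) -> bool:
    """True when the file has not been modified for at least min_age_seconds."""
    return now - modified_at >= timedelta(seconds=min_age_seconds)


def decode_csv_bytes(raw: bytes) -> str:
    """Try UTF-8 first, then fall back to Windows-1252 for legacy exports."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values undefined, so replace rather than fail.
        return raw.decode("cp1252", errors="replace")


now = datetime(2024, 5, 1, 10, 1, 0, tzinfo=timezone.utc)
fresh = datetime(2024, 5, 1, 10, 0, 30, tzinfo=timezone.utc)  # 30 seconds old
old = datetime(2024, 5, 1, 9, 59, 0, tzinfo=timezone.utc)     # 2 minutes old

legacy = "café".encode("cp1252")
modern = "café".encode("utf-8")
```

The age check assumes the FTP server's clock and the workflow's clock roughly agree; with significant skew, the sentinel .done file is the more dependable signal.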
Testing
Drop three sample files into the FTP directory: one PDF, one CSV, and one file with an unsupported extension. Run the workflow manually. Confirm the PDF and CSV land in the Knowledge collection with sensible chunking, the unsupported file is skipped (not failed), and the tracking store now has three entries. Re-run and confirm nothing new is indexed.