How to Auto-Index Incoming Emails into Your Knowledge Base
Automatically add important emails and attachments to your searchable knowledge base.
What This Integration Does
Important context lives in email: vendor notices, support escalations, contract amendments, policy updates from HR. None of that is searchable once it slips below the fold of someone's inbox. This workflow watches your shared inbox, picks out the messages that actually matter, and pushes both the body and any attachments into your Knowledge base so the whole company can query them later.
The workflow runs on a short schedule, pulls recent conversations from a shared inbox via the front connector, filters them by tag or sender, and embeds each one as a Knowledge document. Attachments get extracted and indexed separately so a PDF contract attached to an email becomes a first-class searchable artifact in its own right. Each indexed message records its source ID so re-runs are idempotent.
Prerequisites
- A Front connection with read access to the shared inbox you want to mirror, plus permission to list tags and contacts.
- A Knowledge collection set up to hold inbox content (e.g. support-inbox-archive).
- A small persistence store - a MongoDB collection or MySQL table - to track which conversation IDs have already been indexed.
Step 1: Schedule Trigger
Add a Trigger node and set the type to Schedule. Every 10-15 minutes is a good cadence for shared inboxes - frequent enough that nothing sits unindexed for long, infrequent enough that you stay well inside Front's rate limits. Expose a since variable that holds the previous run's timestamp.
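One way to keep since unambiguous is to store and format it in UTC. The sketch below is a minimal illustration rather than part of the workflow product: compute_since and default_lookback_minutes are hypothetical names, and the ISO 8601 output format is an assumption about what your connector's filter accepts.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def compute_since(last_run: Optional[datetime],
                  default_lookback_minutes: int = 15) -> str:
    """Return the UTC timestamp the next run should filter from.

    Hypothetical helper: the output format is an assumption, adjust it
    to whatever your connector expects.
    """
    if last_run is None:
        # First run: look back one schedule interval from now.
        last_run = datetime.now(timezone.utc) - timedelta(
            minutes=default_lookback_minutes
        )
    if last_run.tzinfo is None:
        # A naive datetime is ambiguous, so refuse it instead of guessing.
        raise ValueError("last_run must be timezone-aware")
    return last_run.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

Rejecting naive datetimes up front means a local-time value cannot slip into the filter unnoticed.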
Step 2: Connector - List Recent Conversations
Add a Connector node pointing at the front connector and pick the list-conversations tool. Filter for conversations updated since the last run and, optionally, only those carrying a specific tag such as archive-to-kb:
{
  "q": "updated_at:>{{ since }} AND tag:archive-to-kb",
  "limit": 100
}
This returns conversation metadata including the body, participants, and a list of attachment URLs.
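If you build the filter in code rather than in the node's template, a small helper keeps the optional tag clause tidy. This is a sketch that simply mirrors the payload shown above; build_list_query is a hypothetical name, and the query syntax is taken from this guide, not verified against the connector.

```python
from typing import Optional


def build_list_query(since: str,
                     tag: Optional[str] = "archive-to-kb",
                     limit: int = 100) -> dict:
    """Assemble the list-conversations payload used in this guide."""
    clauses = [f"updated_at:>{since}"]
    if tag:
        # The tag filter is optional; omit it to list every updated conversation.
        clauses.append(f"tag:{tag}")
    return {"q": " AND ".join(clauses), "limit": limit}
```

Passing tag=None drops the tag clause and leaves only the updated_at filter.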
Step 3: Loop and Skip Already-Indexed Messages
Wrap the next steps in a Loop over the conversation list. For each conversation, run a Connector step against your tracking store (for example a mongodb find-documents with { "conversationId": "{{ conv.id }}" }). Follow it with a Condition node - if a record exists, short-circuit the iteration. Otherwise continue.
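The skip logic amounts to a membership check per conversation. The sketch below uses an in-memory set as a stand-in for the tracking store; in the real workflow the lookup is the Connector step described above. The name unindexed is hypothetical.

```python
from typing import Iterable, Iterator, Set


def unindexed(conversations: Iterable[dict],
              indexed_ids: Set[str]) -> Iterator[dict]:
    """Yield only conversations whose ID is not yet in the tracking store.

    indexed_ids stands in for the MongoDB or MySQL lookup.
    """
    for conv in conversations:
        if conv["id"] in indexed_ids:
            # Already indexed: short-circuit this iteration.
            continue
        yield conv
```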
Step 4: Embed the Email Body
Add a Knowledge node in embed mode. Build a small text document that combines the subject, participants, and body so the indexed chunk has the context it needs to be retrieved later:
Subject: {{ conv.subject }}
From: {{ conv.from }}
Date: {{ conv.created_at }}
{{ conv.body_text }}
Set the document's sourceId to the Front conversation ID so re-indexing replaces rather than duplicates.
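The same document can be assembled in code if you preprocess conversations before the Knowledge node. This is a minimal sketch: build_document is a hypothetical name, the returned {"sourceId", "text"} shape is an assumption about what the embed step accepts, and the blank line between headers and body is a formatting choice, not a requirement.

```python
def build_document(conv: dict) -> dict:
    """Combine subject, sender, date, and body into one embeddable text."""
    text = "\n".join([
        f"Subject: {conv['subject']}",
        f"From: {conv['from']}",
        f"Date: {conv['created_at']}",
        "",  # blank line separates the header block from the body
        conv["body_text"],
    ])
    # Reusing the conversation ID as sourceId lets a re-index replace the
    # earlier document instead of adding a duplicate.
    return {"sourceId": conv["id"], "text": text}
```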
Step 5: Process Attachments
Add a nested Loop over conv.attachments. For each attachment, branch on file type with a Condition:
- For PDFs, call the pdf connector's extract-text tool, then pass the result to a Knowledge embed step.
- For CSVs, call csv parse followed by csv to-json, then embed a summary plus a sample of rows.
- For JSON, use json prettify before embedding.
Tag each attachment document with the parent conversation ID so a Knowledge query can show "this answer came from an attachment on conversation X".
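The branching above can be expressed as a lookup from file extension to the tool chain to run. The sketch is illustrative only: plan_attachment and TOOL_CHAINS are hypothetical names, the filename field is an assumption about the attachment metadata, and the (connector, tool) pairs reflect one reading of the tool names in this step.

```python
from pathlib import PurePosixPath
from typing import Optional

# (connector, tool) pairs to run in order for each supported file type.
TOOL_CHAINS = {
    ".pdf": [("pdf", "extract-text")],
    ".csv": [("csv", "parse"), ("csv", "to-json")],
    ".json": [("json", "prettify")],
}


def plan_attachment(attachment: dict, conversation_id: str) -> Optional[dict]:
    """Return the processing plan for one attachment, or None if unsupported."""
    suffix = PurePosixPath(attachment["filename"]).suffix.lower()
    steps = TOOL_CHAINS.get(suffix)
    if steps is None:
        return None  # unsupported type: skip rather than embed raw bytes
    return {
        "filename": attachment["filename"],
        "steps": steps,
        # Tag with the parent conversation so answers can cite their origin.
        "tags": {"conversationId": conversation_id},
    }
```

Matching on the lowercased extension keeps files such as Contract.PDF on the PDF branch.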
Step 6: Record the Conversation as Indexed
After a successful embed, write the conversation ID and timestamp to the tracking store via a mongodb insert-documents or mysql insert-rows call. That's what makes the workflow idempotent - the next scheduled run will skip anything already in the tracking table.
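To see why the tracking write makes re-runs safe, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for the MySQL table. The table name, column names, and the open_store and mark_indexed helpers are all hypothetical; the point is the primary key on the conversation ID.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the tracking store and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS indexed_conversations ("
        "conversation_id TEXT PRIMARY KEY, indexed_at TEXT NOT NULL)"
    )
    return conn


def mark_indexed(conn: sqlite3.Connection, conversation_id: str) -> bool:
    """Record a conversation; return False if it was already recorded."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO indexed_conversations VALUES (?, ?)",
        (conversation_id, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    # rowcount is 0 when the primary key already existed and the row was ignored.
    return cur.rowcount == 1
```

With a unique key in place, a second write for the same conversation is a no-op and does not create a duplicate row.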
Tips
- Use tags as a soft filter - rather than indexing everything, train the team to tag conversations archive-to-kb. You end up with a tighter, more useful index.
- Chunk long threads - if a thread has 50 replies, embed each comment as its own document rather than one giant blob; retrieval quality is much better.
- Watch attachment size - cap PDF text extraction at 5-10 MB to avoid feeding 500-page contracts through the whole pipeline.
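The thread-chunking tip might look like the sketch below, which emits one document per comment. It assumes each conversation carries a comments list whose items have id, author, and body_text fields; those names, like thread_documents itself, are illustrative and not confirmed by the connector.

```python
from typing import List


def thread_documents(conv: dict) -> List[dict]:
    """Split a long thread into one embeddable document per comment."""
    docs = []
    for comment in conv["comments"]:
        docs.append({
            # Conversation ID plus comment ID keeps each chunk's sourceId
            # stable across re-runs, even when new replies arrive later.
            "sourceId": f"{conv['id']}:{comment['id']}",
            "text": (
                f"Subject: {conv['subject']}\n"
                f"From: {comment['author']}\n\n"
                f"{comment['body_text']}"
            ),
        })
    return docs
```

Repeating the subject in every chunk gives each comment enough context to be retrieved on its own.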
Common Pitfalls
- Private content - shared inboxes contain customer PII. Either strip emails and phone numbers before embedding, or restrict the Knowledge collection so only authorized roles can query it.
- Pagination - list-conversations caps at 100 per page. Loop until the result page is empty, or you'll miss conversations during high-volume periods.
- Timezone drift on since - Front timestamps are UTC. If you compute since in local time, you will lose or duplicate records at every DST change.
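The pagination pitfall can be avoided with a loop that keeps fetching until a page comes back empty. In this sketch, fetch_page is a hypothetical callable that wraps your list-conversations call and takes a zero-based page number; your connector may use a cursor or page token instead, in which case the loop shape stays the same but the argument changes.

```python
from typing import Callable, Iterator, List


def iter_all_conversations(
    fetch_page: Callable[[int], List[dict]],
) -> Iterator[dict]:
    """Yield every conversation across pages until an empty page is returned."""
    page = 0
    while True:
        results = fetch_page(page)
        if not results:
            break  # an empty page means there is nothing left to fetch
        yield from results
        page += 1
```

For example, with three stubbed pages where the last is empty, the generator yields the items from the first two and then stops.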
Testing
Hand-tag two or three conversations with archive-to-kb. Run the workflow manually. Open the Knowledge collection and confirm one document per conversation plus one per attachment, with sensible chunking and the conversation ID stored as sourceId. Re-run and confirm zero new documents are written. Only then enable the schedule.
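For the document-count check, a quick calculation gives the number to compare against. The sketch assumes each conversation exposes an attachments list and that every attachment is of a type handled in Step 5; unsupported types would lower the real count. expected_document_count is a hypothetical helper.

```python
from typing import Iterable


def expected_document_count(conversations: Iterable[dict]) -> int:
    """One document per conversation plus one per attachment."""
    return sum(1 + len(conv.get("attachments", [])) for conv in conversations)
```

If the collection holds more documents than this after the second run, the sourceId replacement or the tracking-store check is not working.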