How to Auto-Index Incoming Emails into Your Knowledge Base

Automatically add important emails and attachments to your searchable knowledge base.

What This Integration Does

Important context lives in email: vendor notices, support escalations, contract amendments, policy updates from HR. None of that is searchable once it slips below the fold of someone's inbox. This workflow watches your shared inbox, picks out the messages that actually matter, and pushes both the body and any attachments into your knowledge base so the whole company can query them later.

The workflow runs on a short schedule, pulls recent conversations from a shared inbox via the front connector, filters them by tag or sender, and embeds each one as a Knowledge document. Attachments get extracted and indexed separately so a PDF contract attached to an email becomes a first-class searchable artifact in its own right. Each indexed message records its source ID so re-runs are idempotent.

Prerequisites

  • A Front connection with read access to the shared inbox you want to mirror, plus permission to list tags and contacts.
  • A Knowledge collection set up to hold inbox content (e.g. support-inbox-archive).
  • A small persistence store - a MongoDB collection or MySQL table - to track which conversation IDs have already been indexed.

Step 1: Schedule Trigger

Add a Trigger node and set the type to Schedule. Every 10-15 minutes is a good cadence for shared inboxes - frequent enough that nothing sits unindexed for long, infrequent enough that you stay well inside Front's rate limits. Expose a since variable that holds the previous run's timestamp.
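As a rough sketch of how the since value could be derived, the snippet below normalizes the previous run's timestamp to UTC before formatting it. The function name since_value, the 15-minute fallback, and the ISO 8601 output format are assumptions for illustration; check which timestamp format your trigger and the Front connector actually expect.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def since_value(previous_run: Optional[datetime], lookback_minutes: int = 15) -> str:
    """Return the `since` value for this run as an ISO 8601 string in UTC.

    `previous_run` is the timestamp saved by the last run (hypothetical input).
    On the very first run there is none, so fall back to a short lookback.
    """
    if previous_run is None:
        previous_run = datetime.now(timezone.utc) - timedelta(minutes=lookback_minutes)
    if previous_run.tzinfo is None:
        # Refuse naive datetimes: they are the usual source of DST bugs.
        raise ValueError("previous_run must be timezone-aware")
    return previous_run.astimezone(timezone.utc).isoformat(timespec="seconds")
```

Rejecting naive datetimes outright is a deliberate choice here: it forces the caller to state the timezone instead of silently assuming local time.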

Step 2: Connector - List Recent Conversations

Add a Connector node pointing at the front connector and pick the list-conversations tool. Filter for conversations updated since the last run and, optionally, only those carrying a specific tag such as archive-to-kb:

{
  "q": "updated_at:>{{ since }} AND tag:archive-to-kb",
  "limit": 100
}

This returns conversation metadata including the body, participants, and a list of attachment URLs.
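The request above asks for at most 100 results, so a busy inbox may need several calls. The sketch below shows one way to drain every page. The fetch_page callable, the results key, and the next_page_token key are hypothetical stand-ins, not the connector's documented response shape.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_conversations(fetch_page: Callable[[Optional[str]], Dict]) -> Iterator[Dict]:
    """Yield conversations from every page until the results run out.

    `fetch_page(token)` is a hypothetical wrapper around the list-conversations
    call; it takes a page token (None for the first page) and returns a dict
    with assumed keys "results" and "next_page_token".
    """
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        results = page.get("results", [])
        if not results:
            return  # an empty page means there is nothing left to read
        yield from results
        token = page.get("next_page_token")
        if token is None:
            return  # no further pages advertised
```

Writing this as a generator lets the downstream loop start processing the first page before later pages are fetched.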

Step 3: Loop and Skip Already-Indexed Messages

Wrap the next steps in a Loop over the conversation list. For each conversation, run a Connector step against your tracking store (for example a mongodb find-documents with { "conversationId": "{{ conv.id }}" }). Follow it with a Condition node: if a record exists, skip to the next conversation; otherwise continue with the steps below.
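The skip logic amounts to a filter over the conversation list. Here is a minimal sketch, assuming each conversation is a dict with an id field and that is_indexed is a hypothetical lookup against the tracking store:

```python
from typing import Callable, Dict, Iterable, List


def select_unindexed(
    conversations: Iterable[Dict],
    is_indexed: Callable[[str], bool],
) -> List[Dict]:
    """Keep only conversations whose ID is not yet in the tracking store.

    `is_indexed` stands in for the find-documents lookup; any callable that
    answers "have we seen this conversation ID?" will do.
    """
    return [conv for conv in conversations if not is_indexed(conv["id"])]
```

For a quick local check, a plain set can play the tracking store: select_unindexed(convs, seen.__contains__).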

Step 4: Embed the Email Body

Add a Knowledge node in embed mode. Build a small text document that combines the subject, participants, and body so the indexed chunk has the context it needs to be retrieved later:

Subject: {{ conv.subject }}
From: {{ conv.from }}
Date: {{ conv.created_at }}

{{ conv.body_text }}

Set the document's sourceId to the Front conversation ID so re-indexing replaces rather than duplicates.
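If you prefer to assemble the document in a code step rather than a template, the sketch below produces the same layout. The conversation field names mirror the template variables above; the text key in the returned dict is an assumption about how the embed step receives its input.

```python
from typing import Dict


def build_document(conv: Dict[str, str]) -> Dict[str, str]:
    """Combine subject, sender, date, and body into one embeddable document."""
    text = (
        f"Subject: {conv['subject']}\n"
        f"From: {conv['from']}\n"
        f"Date: {conv['created_at']}\n"
        "\n"
        f"{conv['body_text']}"
    )
    # Reusing the conversation ID as sourceId lets a re-index replace the
    # earlier document instead of creating a duplicate.
    return {"sourceId": conv["id"], "text": text}
```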

Step 5: Process Attachments

Add a nested Loop over conv.attachments. For each attachment, branch on file type with a Condition:

  • For PDFs, call the pdf connector's extract-text tool, then pass the result to a Knowledge embed step.
  • For CSVs, call csv parse followed by csv to-json, then embed a summary plus a sample of rows.
  • For JSON, use json prettify before embedding.

Tag each attachment document with the parent conversation ID so a Knowledge query can show "this answer came from an attachment on conversation X".
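The branching above can be pictured as a lookup from file extension to the tools that run before the embed step. This is only an illustrative sketch: the filename input, the parentConversationId key, and the decision to skip unknown types are assumptions, while the tool names come from the list above.

```python
from pathlib import PurePosixPath
from typing import Dict, List

# Tools to run, in order, before embedding each supported file type.
ROUTES: Dict[str, List[str]] = {
    ".pdf": ["pdf extract-text"],
    ".csv": ["csv parse", "csv to-json"],
    ".json": ["json prettify"],
}


def route_attachment(filename: str) -> List[str]:
    """Return the tool chain for an attachment, or [] if the type is unsupported."""
    suffix = PurePosixPath(filename).suffix.lower()
    return ROUTES.get(suffix, [])


def attachment_document(conversation_id: str, filename: str, text: str) -> Dict[str, str]:
    """Build an attachment document tagged with its parent conversation ID."""
    return {
        "parentConversationId": conversation_id,  # hypothetical metadata key
        "filename": filename,
        "text": text,
    }
```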

Step 6: Record the Conversation as Indexed

After a successful embed, write the conversation ID and timestamp to the tracking store via a mongodb insert-documents or mysql insert-rows call. That's what makes the workflow idempotent - the next scheduled run will skip anything already in the tracking store.
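To illustrate the tracking write, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for the MongoDB collection or MySQL table. The table and column names are made up for the example; the point is that a primary key on the conversation ID makes a repeated write harmless.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the tracking store and create the table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS indexed_conversations ("
        "conversation_id TEXT PRIMARY KEY, indexed_at TEXT NOT NULL)"
    )
    return conn


def mark_indexed(conn: sqlite3.Connection, conversation_id: str) -> bool:
    """Record a conversation as indexed; return False if it was already there."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO indexed_conversations VALUES (?, ?)",
        (conversation_id, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return cur.rowcount == 1


def is_indexed(conn: sqlite3.Connection, conversation_id: str) -> bool:
    """Check whether a conversation ID is already in the tracking store."""
    row = conn.execute(
        "SELECT 1 FROM indexed_conversations WHERE conversation_id = ?",
        (conversation_id,),
    ).fetchone()
    return row is not None
```

Whichever store you use, a unique constraint on the conversation ID gives the same protection if two runs overlap.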

Tips

  • Use tags as a soft filter - rather than indexing everything, train the team to tag conversations archive-to-kb. You end up with a tighter, more useful index.
  • Chunk long threads - if a thread has 50 replies, embed each comment as its own document rather than one giant blob; retrieval quality is much better.
  • Watch attachment size - cap PDF text extraction at 5-10 MB to avoid feeding 500-page contracts through the whole pipeline.
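The chunking and size-cap tips can be sketched as follows. The comment fields (id, body_text), the composite sourceId format, and the 10 MB default are assumptions chosen for illustration, so adjust them to whatever your connector returns.

```python
from typing import Dict, List

MAX_ATTACHMENT_BYTES = 10 * 1024 * 1024  # upper end of the 5-10 MB range


def thread_documents(conversation_id: str, comments: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Turn each comment in a long thread into its own embeddable document."""
    return [
        {
            # Composite ID (assumed format) keeps each comment replaceable on re-index.
            "sourceId": f"{conversation_id}:{comment['id']}",
            "parentConversationId": conversation_id,
            "text": comment["body_text"],
        }
        for comment in comments
    ]


def within_size_cap(size_bytes: int, cap: int = MAX_ATTACHMENT_BYTES) -> bool:
    """Return True if an attachment is small enough to send through extraction."""
    return size_bytes <= cap
```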

Common Pitfalls

  • Private content - shared inboxes contain customer PII. Either strip emails and phone numbers before embedding, or restrict the Knowledge collection so only authorized roles can query it.
  • Pagination - list-conversations caps at 100 per page. Loop until the result page is empty or you'll miss high-volume periods.
  • Timezone drift on since - Front timestamps are UTC. If you compute since in local time, you will skip or re-process records at every DST change.
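For the private-content pitfall, a regex pass before embedding is one possible starting point. The patterns below are deliberately simple heuristics, not a complete PII filter: they will miss some formats and may also redact other long digit runs, such as order numbers.

```python
import re

# Heuristic patterns (assumptions, not an exhaustive PII definition).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[email]", text)
    return PHONE_RE.sub("[phone]", text)
```

If redaction would remove information people need to search on, restricting the Knowledge collection to authorized roles is the alternative described above.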

Testing

Hand-tag two or three conversations with archive-to-kb. Run the workflow manually. Open the Knowledge collection and confirm one document per conversation plus one per attachment, with sensible chunking and the conversation ID stored as sourceId. Re-run and confirm zero new documents are written. Only then enable the schedule.
