How to Set Up Email-Triggered Document Processing

Automatically process documents received via email.

What This Integration Does

Vendors, partners, and customers send important documents - invoices, purchase orders, contracts, statements - by email. This Spojit workflow watches a connected mailbox, reads each attachment, extracts its contents, runs them through a Connector node in Agent mode to pull structured fields, and lands clean records in your system of record. No more "did anyone process that PO?" on the team channel.

The workflow runs every time a new email matching your filter arrives. Attachments are processed individually, and each one produces both a structured record (for your database) and an audit log (for compliance). Failed extractions route to a human review queue rather than silently dropping documents.

Prerequisites

A connected Gmail or Outlook mailbox for the Email trigger (add it under Connections).
The built-in pdf and csv connectors for attachment parsing (no auth needed), plus the xml connector if you expect XML attachments.
A destination connection where structured records will be written (e.g. mongodb, netsuite, or mysql).

Step 1: Email Trigger

Drop a Trigger node and set its type to Email. Filter by sender domain, subject pattern, or label so you only run on real document emails - not newsletters or replies. The trigger exposes the message body, sender, subject, and an array of attachments.
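
To make the filter logic concrete, here is a minimal Python sketch of the kind of check the trigger filter performs. It is not Spojit's implementation: the domain list, the subject pattern, and the should_process name are all hypothetical, and you would configure the equivalent rules in the Trigger node itself.

```python
import re
from email.utils import parseaddr

# Hypothetical filter settings; replace with your own vendors and keywords.
ALLOWED_DOMAINS = {"vendor.example", "partner.example"}
SUBJECT_PATTERN = re.compile(r"\b(invoice|purchase order|statement)\b", re.IGNORECASE)


def should_process(sender: str, subject: str) -> bool:
    """Return True only for mail from an allowed domain with a document-like subject."""
    _, address = parseaddr(sender)            # strips any display name
    local, at, domain = address.rpartition("@")
    if not at or not local:                   # not a usable address
        return False
    return domain.lower() in ALLOWED_DOMAINS and bool(SUBJECT_PATTERN.search(subject))
```

Requiring both conditions keeps newsletters (wrong sender) and casual replies from real vendors (wrong subject) out of the workflow.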

Step 2: Loop Over Attachments

Add a Loop node iterating over {{ email.attachments }}. For each attachment, branch on file type with a Condition node:

.pdf -> PDF extraction path
.csv / .xlsx -> spreadsheet path
.xml -> XML path
anything else -> route to human review
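
The branching above can be sketched in Python as a lookup on the file extension. The route names and the route_for function are illustrative only; in the workflow, the Condition node expresses the same mapping.

```python
from pathlib import PurePosixPath

# Extension -> processing path, mirroring the branches listed above.
ROUTES = {
    ".pdf": "pdf",
    ".csv": "spreadsheet",
    ".xlsx": "spreadsheet",
    ".xml": "xml",
}


def route_for(filename: str) -> str:
    """Pick a processing path from the attachment's file extension."""
    suffix = PurePosixPath(filename).suffix.lower()  # ".PDF" -> ".pdf"
    return ROUTES.get(suffix, "human_review")        # unknown types go to a person
```

Lowercasing the extension matters in practice, since attachments named INVOICE.PDF would otherwise fall through to human review.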

Step 3: Extract the Content

For PDFs, add a Connector node pointing at the pdf connector with the extract-text tool. For very long PDFs, use extract-pages to grab just the pages that contain the data (often the first page of an invoice). For CSVs, use the csv connector's parse tool followed by to-json to get structured rows. For XML, use the xml connector's to-json tool.

Step 4: Extract Structured Fields in Agent Mode

Add a Connector node in Agent mode and set a Response Schema to force structured JSON output. The schema depends on what you're processing - here's one for invoices:

{
  "vendor":         { "type": "string" },
  "invoiceNumber":  { "type": "string" },
  "invoiceDate":    { "type": "string", "description": "ISO 8601 date" },
  "dueDate":        { "type": "string", "description": "ISO 8601 date" },
  "currency":       { "type": "string", "description": "ISO 4217 code" },
  "subtotal":       { "type": "number" },
  "tax":            { "type": "number" },
  "total":          { "type": "number" },
  "lineItems": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "description": { "type": "string" },
        "qty":         { "type": "number" },
        "unitPrice":   { "type": "number" },
        "amount":      { "type": "number" }
      }
    }
  }
}
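
If you want a defensive check downstream of the Agent mode step, a sketch like the following confirms that a returned record has the top-level fields and types from the schema above. The schema_problems helper and the sample values are hypothetical, not part of Spojit.

```python
from numbers import Real

# Top-level fields of the invoice schema above and the Python type each should have.
EXPECTED_TYPES = {
    "vendor": str, "invoiceNumber": str, "invoiceDate": str,
    "dueDate": str, "currency": str,
    "subtotal": Real, "tax": Real, "total": Real,
    "lineItems": list,
}


def schema_problems(record: dict) -> list[str]:
    """List missing or mistyped top-level fields; an empty list means the record looks fine."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(record[field], bool) or not isinstance(record[field], expected):
            problems.append(f"wrong type for {field}")
    return problems


# Made-up record in the shape the schema describes.
sample_record = {
    "vendor": "Acme Ltd", "invoiceNumber": "INV-1042",
    "invoiceDate": "2024-03-15", "dueDate": "2024-04-14", "currency": "USD",
    "subtotal": 25.0, "tax": 2.5, "total": 27.5,
    "lineItems": [{"description": "Widget", "qty": 5, "unitPrice": 5.0, "amount": 25.0}],
}
```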

Step 5: Validate Before Writing

Add a Connector node calling the math connector's sum tool over the line item amounts, then a Condition node that compares the result to total - tax, allowing a small tolerance for rounding. If they don't match, route to a Human review node. This catches OCR errors, missed line items, or hallucinated numbers before they reach your books.
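
As a worked version of this check, the sketch below sums the line item amounts and compares them to total - tax. It uses Decimal to avoid floating-point noise and assumes a one-cent tolerance; both the tolerance and the line_items_reconcile name are illustrative choices, not Spojit behavior.

```python
from decimal import Decimal


def line_items_reconcile(record: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """True when the line item amounts add up to total minus tax, within the tolerance."""
    # Convert via str() so 19.99 becomes Decimal("19.99"), not a long binary fraction.
    line_total = sum(
        (Decimal(str(item["amount"])) for item in record["lineItems"]),
        Decimal("0"),
    )
    expected = Decimal(str(record["total"])) - Decimal(str(record["tax"]))
    return abs(line_total - expected) <= tolerance
```

A record that fails this check is the one to send to human review; the mismatch usually points at a missed line item or a misread digit.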

Step 6: Persist and Notify

Run a Parallel node:

Store the structured record via mongodb insert-documents, netsuite create-record, or mysql insert-rows.
Upload the original attachment to ftp upload-file (or any storage destination) so the source document is preserved for audit.
Post a one-line summary to slack send-message so the team sees what was processed.

Tips

Always check the email subject and sender against an allowlist - email is a common attack vector, and you don't want to OCR a PDF from an unknown sender.
Hash the attachment contents (via the encoding connector's hash-sha256 tool) and store the hash to dedupe - vendors often re-send the same invoice.
For scanned PDFs that come back empty from extract-text, route to a dedicated OCR step rather than failing the workflow.
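
The dedupe tip can be sketched in Python with the standard hashlib module, as a stand-in for the encoding connector's hash-sha256 tool. The in-memory set is an assumption for illustration; in a real workflow you would store the hashes in your destination database.

```python
import hashlib


def attachment_fingerprint(content: bytes) -> str:
    """SHA-256 hex digest of the raw attachment bytes."""
    return hashlib.sha256(content).hexdigest()


def is_duplicate(content: bytes, seen: set[str]) -> bool:
    """Report whether this exact file was already processed, recording it if not."""
    digest = attachment_fingerprint(content)
    if digest in seen:
        return True
    seen.add(digest)
    return False
```

Hashing the bytes rather than the filename catches re-sends even when the vendor renames the file, though it will not catch a re-exported PDF whose bytes differ.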

Common Pitfalls

Split invoices - some vendors split a single invoice across two PDFs in one email. If you see this, process the email's attachments together rather than treating each file in isolation.
Encoded subjects - non-ASCII characters in From or Subject arrive as MIME encoded-word strings (RFC 2047, the =?charset?encoding?text?= form). Decode before filtering or you'll silently drop matches.
Date format drift - vendors use every date format ever invented. Ask for ISO 8601 explicitly in your Agent mode prompt and Response Schema, then validate it via the validation connector's iso-date tool.
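
Two of these pitfalls are easy to illustrate with the Python standard library. The sketch below decodes a MIME-encoded header and checks that a string is a valid ISO 8601 calendar date. The function names are hypothetical, and is_iso_date is only a stand-in for the validation connector's iso-date tool, whose exact rules may differ.

```python
from datetime import date
from email.header import decode_header, make_header


def decode_mime_header(raw: str) -> str:
    """Turn an RFC 2047 encoded header into readable text; plain ASCII passes through."""
    return str(make_header(decode_header(raw)))


def is_iso_date(value: str) -> bool:
    """True for a real calendar date written as YYYY-MM-DD."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True
```

Decoding first means a filter on "Rechnung" still matches a subject that arrived as =?utf-8?q?Rechnung_f=C3=BCr_M=C3=A4rz?=, and the date check rejects both foreign formats and impossible dates such as 2024-02-30.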

Testing

Forward 10 historical emails to a staging mailbox connected to a duplicate workflow that writes to a sandbox database. Compare the extracted records against the original documents, field by field. Once you see clean extraction across format variations, point the trigger at production.
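
One way to do that comparison is to key in the expected values for each test document by hand and diff them against what the workflow wrote. This is a minimal sketch, assuming both sides are available as Python dicts; the field_mismatches name is hypothetical.

```python
def field_mismatches(extracted: dict, expected: dict) -> dict:
    """Map each differing field to its (extracted, expected) pair; empty means a clean match."""
    keys = extracted.keys() | expected.keys()  # include fields missing on either side
    return {
        key: (extracted.get(key), expected.get(key))
        for key in sorted(keys)
        if extracted.get(key) != expected.get(key)
    }
```

Running this over all the test emails and counting the non-empty results gives you a simple pass rate to watch as you adjust the prompt or schema.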