How to Use AI Agents to Process Unstructured Data

Turn messy, unstructured text into clean, structured data using AI agents.

What This Integration Does

Most business data starts its life as prose: support emails, meeting notes, lead forms with a "tell us more" textarea, chat transcripts, scanned faxes (once they have been converted to text). This workflow takes any of those text blobs and uses an AI Agent with Structured Output to pull out the fields you actually want, so they can be stored, indexed, and acted on like normal database rows.

The workflow is a thin shell with one decision point: which extraction schema applies. You define the schemas once (support ticket, invoice, lead form, etc.), and the workflow classifies the input, picks the right schema, runs extraction, validates, and writes the result to your downstream system.

Prerequisites

  • A workspace LLM provider configured.
  • A trigger source - Email Trigger for inbox-driven workflows, Webhook for form/API inputs, or a database connector for batch processing of legacy text.
  • A target system to write the extracted records to (a CRM, an ERP, or a database such as MongoDB or MySQL).
  • A clear list of document types you'll handle, each with its target schema.

Step 1: Trigger and Normalize Input

Drop a Trigger node:

  • Email Trigger for inbox sources - exposes subject, body, and attachments.
  • Webhook Trigger for form submissions or API posts.
  • Manual for one-off processing of pasted text.

Follow with a Transform node that produces a uniform shape: { source, receivedAt, rawText, attachments }. This way downstream nodes don't have to care where the text came from.
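
As a rough illustration of what that Transform step does, here is a minimal Python sketch. The payload keys it reads (subject, body, message, text) are assumptions about what each trigger emits, not documented field names, so adjust them to whatever your triggers actually produce.

```python
from datetime import datetime, timezone
from typing import Any


def normalize_input(source: str, payload: dict[str, Any]) -> dict[str, Any]:
    """Map a trigger payload onto { source, receivedAt, rawText, attachments }."""
    if source == "email":
        # Assumed email payload keys: subject, body.
        raw_text = f"{payload.get('subject', '')}\n\n{payload.get('body', '')}".strip()
    elif source == "webhook":
        # Assumed webhook payload key: message (the free-text form field).
        raw_text = str(payload.get("message", "")).strip()
    else:
        # Manual trigger: pasted text under an assumed "text" key.
        raw_text = str(payload.get("text", "")).strip()

    return {
        "source": source,
        # Keep the trigger's timestamp if it supplied one, else stamp it now (UTC).
        "receivedAt": payload.get("receivedAt") or datetime.now(timezone.utc).isoformat(),
        "rawText": raw_text,
        "attachments": list(payload.get("attachments", [])),
    }
```

Every branch returns the same four keys, which is the point: the classifier and extractors only ever read rawText.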

Step 2: Classify the Document Type

Add a Connector node configured as an LLM call, with a small Structured Output schema that picks one of your known types:

{
  "documentType": {
    "type": "string",
    "enum": ["support-ticket", "invoice", "lead-form", "meeting-notes", "other"]
  },
  "confidence": { "type": "number" }
}

Have the classifier prompt define confidence as a number between 0 and 1. Classifications that fall below a threshold you choose get routed to a human review queue instead of being processed blind.
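
The gating logic can be sketched in a few lines of Python. The 0.7 cut-off and the choice to send "other" to review are assumptions made for illustration; pick values that match your own test results.

```python
CONFIDENCE_THRESHOLD = 0.7  # assumed cut-off; tune it against your own examples

# Types that have an extraction branch. "other" is deliberately left out,
# so it falls through to human review (an assumption, not a platform rule).
KNOWN_TYPES = {"support-ticket", "invoice", "lead-form", "meeting-notes"}


def route_classification(result: dict) -> str:
    """Return the branch name for a classifier result, or "human-review"."""
    doc_type = result.get("documentType")
    confidence = result.get("confidence", 0.0)

    if doc_type not in KNOWN_TYPES:
        return "human-review"
    if not isinstance(confidence, (int, float)) or confidence < CONFIDENCE_THRESHOLD:
        return "human-review"
    return doc_type
```

A missing or non-numeric confidence is treated the same as a low one, so a malformed classifier response never reaches an extractor.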

Step 3: Route to the Right Extraction Schema

Add a Condition node on documentType. Each branch runs its own extraction Connector node configured as a tool-augmented LLM call with a specific Structured Output schema.

Example schemas (field names only; give each field a type in the real schema):

Support ticket:
{ "customerEmail", "issueType", "priority", "summary", "productArea" }

Invoice:
{ "vendor", "invoiceNumber", "amount", "currency", "dueDate", "lineItems[]" }

Lead form:
{ "name", "company", "role", "interest", "budgetRange", "timelineWeeks" }

Meeting notes:
{ "attendees[]", "decisions[]", "actionItems[{ owner, task, dueDate }]" }

Keep each schema strict and small. Smaller schemas mean cheaper, more reliable extractions.
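
To show what "strict" can look like in practice, the sketch below expands the support ticket field list into a full JSON Schema object. The helper name and the enum values for issueType and priority are hypothetical placeholders; the source only names the fields.

```python
import json


def strict_object_schema(properties: dict) -> dict:
    """Wrap property definitions in an object schema that requires every
    listed field and rejects any field that is not listed."""
    return {
        "type": "object",
        "properties": properties,
        "required": list(properties),
        "additionalProperties": False,
    }


# Enum values are illustrative placeholders, not platform defaults.
support_ticket_schema = strict_object_schema({
    "customerEmail": {"type": "string"},
    "issueType": {"type": "string", "enum": ["bug", "billing", "how-to", "other"]},
    "priority": {"type": "string", "enum": ["low", "medium", "high"]},
    "summary": {"type": "string"},
    "productArea": {"type": "string"},
})

print(json.dumps(support_ticket_schema, indent=2))
```

Closed enums and additionalProperties set to false leave the model less room to improvise, which is usually what makes a small schema more reliable.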

Step 4: Validate the Extracted Record

After extraction, add a Transform node followed by a validation pass for the mechanical checks: validation / email on email fields and validation / iso-date on dates. Then add a Condition node that checks that all required fields are present and non-null. Records that pass continue; records that fail go to a review queue with the missing fields flagged.
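
For readers who want to see the checks spelled out, here is a minimal Python sketch of the same logic. It is not how the platform's validation nodes are implemented; the email pattern is deliberately loose, and the date check accepts date-only ISO strings such as 2024-01-31.

```python
import re
from datetime import date

# Loose shape check only: something@something.something, no whitespace.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_record(record, required, email_fields=(), date_fields=()):
    """Return a list of problems; an empty list means the record passes."""
    problems = []

    # Required fields must be present and not null/empty.
    for field in required:
        if record.get(field) in (None, "", []):
            problems.append(f"missing: {field}")

    # Format checks only run on fields that actually have a value.
    for field in email_fields:
        value = record.get(field)
        if value and not EMAIL_RE.match(str(value)):
            problems.append(f"invalid email: {field}")

    for field in date_fields:
        value = record.get(field)
        if value:
            try:
                date.fromisoformat(str(value))
            except ValueError:
                problems.append(f"invalid date: {field}")

    return problems
```

Returning the list of problems, rather than a plain pass/fail flag, gives the review queue the flagged fields it needs.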

Step 5: Persist to the Target System

Use a second Condition on document type to fan out to the right downstream call:

  • Support ticket: monday / create-item on the support board.
  • Invoice: netsuite / upsert-record as a vendor bill.
  • Lead form: klaviyo / create-profile plus add-profiles-to-list for nurture.
  • Meeting notes: mongodb / insert-documents into a meetings collection, then loop over actionItems to create one task per item.
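
The fan-out above amounts to a lookup from document type to one or more operations, plus a loop for meeting action items. A small Python sketch, assuming a hypothetical task payload of { title, assignee, dueDate } (your task system's fields will differ):

```python
# Operation names come from the list above; how they are invoked is platform-specific.
TARGETS = {
    "support-ticket": ["monday / create-item"],
    "invoice": ["netsuite / upsert-record"],
    "lead-form": ["klaviyo / create-profile", "klaviyo / add-profiles-to-list"],
    "meeting-notes": ["mongodb / insert-documents"],
}


def operations_for(document_type: str) -> list[str]:
    """Return the downstream operations for a document type (empty if unknown)."""
    return TARGETS.get(document_type, [])


def tasks_from_meeting(meeting: dict) -> list[dict]:
    """Turn each action item into one task payload (hypothetical shape)."""
    tasks = []
    for item in meeting.get("actionItems", []):
        if not item.get("task"):
            continue  # skip action items that have no task text
        tasks.append({
            "title": item["task"],
            "assignee": item.get("owner"),
            "dueDate": item.get("dueDate"),
        })
    return tasks
```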

Step 6: Acknowledge and Audit

End with two parallel branches:

  • slack / send-message to the relevant team channel summarising the extracted record and linking to the downstream target.
  • mongodb / insert-documents into an extractions audit collection storing { source, rawText, documentType, extractedFields, model, tokensUsed }. This is your evidence trail when something downstream looks wrong.
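
A sketch of the two payloads these branches send, in Python. The audit record uses exactly the fields listed above; the Slack message layout and the target_url parameter are assumptions for illustration.

```python
def build_audit_record(normalized, document_type, extracted, model, tokens_used):
    """Assemble the document stored in the extractions audit collection."""
    return {
        "source": normalized["source"],
        "rawText": normalized["rawText"],
        "documentType": document_type,
        "extractedFields": extracted,
        "model": model,
        "tokensUsed": tokens_used,
    }


def build_slack_summary(document_type, extracted, target_url):
    """Format a short plain-text summary (layout is a hypothetical choice)."""
    lines = [f"New {document_type} extracted"]
    lines += [f"- {key}: {value}" for key, value in extracted.items()]
    lines.append(f"Record: {target_url}")
    return "\n".join(lines)
```

If you mask PII before the AI step, build the audit record from the masked text so the audit collection does not become the one place the raw values survive.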

Tips

  • Schemas are the contract. Spending five minutes nailing each schema saves hours of prompt tweaking later.
  • One purpose per schema. Don't try to extract every possible field "just in case". Trim to what you'll actually use downstream.
  • Confidence-gate the classifier. Sending an ambiguous document to the wrong extractor produces convincing-looking garbage. Route it to a human instead.

Common Pitfalls

  • Mixed-format inputs. Forwarded emails contain the original plus reply threads plus signatures. Strip quoted text and signatures (a regex / replace step) before extraction or the model will pull data from the wrong message.
  • Encoding issues. Pasted text from Word docs contains smart quotes and non-breaking spaces. Normalize with text / trim plus a regex pass before extraction.
  • Schema sprawl. Every added document type raises the odds that the classifier misfires on adjacent types. Keep the list under 8 types and re-run your test set whenever you add one.
  • PII handling. Unstructured text is where the riskiest PII lives. Decide upfront what gets masked before the AI step and what gets stored in the audit collection.
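
The first two pitfalls can be handled in a single pre-extraction cleanup pass. The Python sketch below is one way to do it; the reply-marker patterns are assumptions that cover only a few common mail clients, so expect to extend them for your own inbox.

```python
import re
import unicodedata

# Assumed markers for "everything below is an older message or a signature".
REPLY_MARKERS = [
    re.compile(r"^On .+ wrote:\s*$"),
    re.compile(r"^-{2,}\s*Original Message\s*-{2,}\s*$", re.IGNORECASE),
    re.compile(r"^--\s*$"),  # conventional signature delimiter
]

# Smart quotes and non-breaking spaces mapped to plain ASCII equivalents.
TYPOGRAPHIC = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u00a0": " ",
})


def clean_text(raw: str) -> str:
    """Normalize typography, then drop quoted lines, reply threads, and signatures."""
    text = unicodedata.normalize("NFC", raw).translate(TYPOGRAPHIC)
    kept = []
    for line in text.splitlines():
        if any(pattern.match(line) for pattern in REPLY_MARKERS):
            break  # stop at the first reply header or signature delimiter
        if line.lstrip().startswith(">"):
            continue  # skip individually quoted lines
        kept.append(line.rstrip())
    return "\n".join(kept).strip()
```

Cutting at the first marker is aggressive on purpose: losing a trailing line is usually cheaper than extracting fields from the wrong message.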

Testing

Assemble 5-10 real examples per document type plus a handful of intentionally ambiguous ones. Run the full workflow and check three things per example: did the classifier pick the right type, did extraction fill in the required fields, and did the downstream write actually happen. Set an accuracy target before you start, adjust the classifier prompt and the per-type extraction prompts until you meet it, then turn the trigger on.
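
Tallying the three checks by hand gets tedious past a few dozen examples. Here is a minimal Python sketch of a scorer, assuming you record each run as a dict with the hypothetical keys expectedType, predictedType, missingFields, and written:

```python
def score_run(results: list[dict]) -> dict:
    """Summarize a test run as the share of examples passing each check."""
    total = len(results)
    if total == 0:
        return {"total": 0, "classified": 0.0, "complete": 0.0, "written": 0.0}

    # One count per check: right type, no missing required fields, write happened.
    classified = sum(r["predictedType"] == r["expectedType"] for r in results)
    complete = sum(not r["missingFields"] for r in results)
    written = sum(bool(r["written"]) for r in results)

    return {
        "total": total,
        "classified": classified / total,
        "complete": complete / total,
        "written": written / total,
    }
```

Keeping the three ratios separate shows which stage to fix: a low classified score points at the classifier prompt, a low complete score at an extraction prompt or schema.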
