How to Use AI Agents to Process Unstructured Data
Turn messy, unstructured text into clean, structured data using AI agents.
What This Integration Does
Most business data starts its life as prose: support emails, meeting notes, lead forms with a "tell us more" textarea, chat transcripts, scanned faxes. This workflow takes any of those text blobs and uses an AI Agent with Structured Output to pull out the fields you actually want - so they can be stored, indexed, and acted on like normal database rows.
The workflow is a thin shell with one decision point: which extraction schema applies. You define the schemas once (support ticket, invoice, lead form, etc.), and the workflow classifies the input, picks the right schema, runs extraction, validates, and writes the result to your downstream system.
Prerequisites
- A workspace LLM provider configured.
- A trigger source - Email Trigger for inbox-driven workflows, Webhook for form/API inputs, or a database connector for batch processing of legacy text.
- A target system to write the extracted records to (a CRM, an ERP, or a database such as mongodb or mysql).
- A clear list of document types you'll handle, each with its target schema.
Step 1: Trigger and Normalize Input
Drop a Trigger node:
- Email Trigger for inbox sources - exposes subject, body, and attachments.
- Webhook Trigger for form submissions or API posts.
- Manual for one-off processing of pasted text.
Follow with a Transform node that produces a uniform shape: { source, receivedAt, rawText, attachments }. This way downstream nodes don't have to care where the text came from.
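The normalization step can be sketched in Python as below. This is only an illustration of the target shape, not the Transform node's actual configuration: the function name normalize_input and the payload keys "message" and "text" are assumptions, while subject, body, and attachments come from the Email Trigger described above.

```python
from datetime import datetime, timezone
from typing import Any


def normalize_input(source: str, payload: dict[str, Any]) -> dict[str, Any]:
    """Map a trigger payload onto { source, receivedAt, rawText, attachments }."""
    if source == "email":
        # Email Trigger exposes subject, body, and attachments.
        subject = payload.get("subject", "")
        body = payload.get("body", "")
        raw_text = f"{subject}\n\n{body}".strip()
    elif source == "webhook":
        # Assumption: the form's free-text field arrives under "message".
        raw_text = str(payload.get("message", "")).strip()
    else:
        # Assumption: manual runs pass pasted text under "text".
        raw_text = str(payload.get("text", "")).strip()

    return {
        "source": source,
        "receivedAt": datetime.now(timezone.utc).isoformat(),
        "rawText": raw_text,
        "attachments": list(payload.get("attachments") or []),
    }
```

Whatever the trigger, every later step then reads rawText and nothing else.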
Step 2: Classify the Document Type
Add a Connector node as an LLM call with a tiny Structured Output schema picking from your known types:
{
  "documentType": {
    "type": "string",
    "enum": ["support-ticket", "invoice", "lead-form", "meeting-notes", "other"]
  },
  "confidence": { "type": "number" }
}
Classifications whose confidence falls below a threshold you choose get routed to a human review queue instead of being processed blind.
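A minimal sketch of that gate, written in Python for illustration. The 0.75 threshold, the "human-review" route name, and the choice to send "other" to review are all assumptions to tune against your own data; only the documentType values come from the schema above.

```python
KNOWN_TYPES = {"support-ticket", "invoice", "lead-form", "meeting-notes", "other"}

# Assumed starting point; adjust after measuring on real examples.
CONFIDENCE_THRESHOLD = 0.75


def route_classification(result: dict) -> str:
    """Return the branch to follow: a document type or 'human-review'."""
    doc_type = result.get("documentType")
    confidence = result.get("confidence")

    # Unknown labels and the catch-all "other" have no extractor to run.
    if doc_type not in KNOWN_TYPES or doc_type == "other":
        return "human-review"

    # A missing or non-numeric confidence is treated as low confidence.
    if not isinstance(confidence, (int, float)):
        return "human-review"
    if confidence < CONFIDENCE_THRESHOLD:
        return "human-review"

    return doc_type
```

Treating a malformed classifier response as low confidence keeps the failure mode safe: the document waits for a person rather than reaching the wrong extractor.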
Step 3: Route to the Right Extraction Schema
Add a Condition node on documentType. Each branch runs its own extraction Connector node configured as a tool-augmented LLM call with a specific Structured Output schema.
Example schemas (field names only, in shorthand):
Support ticket:
{ "customerEmail", "issueType", "priority", "summary", "productArea" }
Invoice:
{ "vendor", "invoiceNumber", "amount", "currency", "dueDate", "lineItems[]" }
Lead form:
{ "name", "company", "role", "interest", "budgetRange", "timelineWeeks" }
Meeting notes:
{ "attendees[]", "decisions[]", "actionItems[{ owner, task, dueDate }]" }
Keep each schema strict and small. Smaller schemas mean cheaper, more reliable extractions.
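To show what "strict" could look like in practice, here is the support ticket shorthand expanded into a full JSON Schema, built in Python. This is a sketch under assumptions: the helper strict_schema, the priority enum values, and the choice of required fields are illustrative, and whether your Structured Output setting accepts required and additionalProperties depends on the LLM provider.

```python
def strict_schema(fields: dict[str, dict], required: list[str]) -> dict:
    """Wrap field definitions in an object schema that rejects extra keys."""
    return {
        "type": "object",
        "properties": fields,
        "required": required,
        # Strict: the model may not invent fields outside the contract.
        "additionalProperties": False,
    }


SUPPORT_TICKET_SCHEMA = strict_schema(
    {
        "customerEmail": {"type": "string"},
        "issueType": {"type": "string"},
        # Assumed priority levels; replace with your own.
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        "summary": {"type": "string"},
        "productArea": {"type": "string"},
    },
    # Assumption: productArea is optional, the rest are required.
    required=["customerEmail", "issueType", "priority", "summary"],
)
```

Enumerating allowed values wherever a field has a closed set of answers tends to matter more than prompt wording, because it removes free-text variation before validation ever runs.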
Step 4: Validate the Extracted Record
After extraction, add a Transform + validation pass for the easy checks (validation / email on email fields, validation / iso-date on dates). Add a Condition node that checks all required fields are present and non-null. Records that pass continue; records that fail go to a review queue with the missing fields flagged.
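The checks in this step amount to something like the following Python sketch. It is not the platform's validation implementation: validate_record and its parameters are hypothetical names, and the email pattern is a deliberately loose shape check rather than full address validation.

```python
import re
from datetime import date

# Loose shape check: something@something.tld, no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_iso_date(value) -> bool:
    """True if value is a string that parses as an ISO calendar date."""
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def validate_record(record, required, email_fields=(), date_fields=()):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field in required:
        # Missing, null, empty string, and empty list all count as absent.
        if record.get(field) in (None, "", []):
            problems.append(f"{field}: missing")
    for field in email_fields:
        value = record.get(field)
        if value and not EMAIL_RE.match(str(value)):
            problems.append(f"{field}: not an email address")
    for field in date_fields:
        value = record.get(field)
        if value and not is_iso_date(value):
            problems.append(f"{field}: not an ISO date")
    return problems
```

Returning the list of problems, rather than a bare pass or fail, gives the review queue the flagged fields it needs.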
Step 5: Persist to the Target System
Use a second Condition on document type to fan out to the right downstream call:
- Support ticket: monday / create-item on the support board.
- Invoice: netsuite / upsert-record as a vendor bill.
- Lead form: klaviyo / create-profile plus add-profiles-to-list for nurture.
- Meeting notes: mongodb / insert-documents into a meetings collection, then loop over actionItems creating tasks.
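The fan-out above is essentially a lookup table from document type to connector calls. The Python sketch below only plans the calls; it does not invoke any connector. The connector and action names are the ones listed above, except the task target: the guide does not name one, so "task-tool" / "create-task" is a hypothetical placeholder.

```python
from typing import Any

# Document type -> ordered (connector, action) pairs, as listed in this step.
PERSIST_ROUTES: dict[str, list[tuple[str, str]]] = {
    "support-ticket": [("monday", "create-item")],
    "invoice": [("netsuite", "upsert-record")],
    "lead-form": [("klaviyo", "create-profile"), ("klaviyo", "add-profiles-to-list")],
    "meeting-notes": [("mongodb", "insert-documents")],
}


def plan_writes(doc_type: str, record: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the downstream calls to make for one validated record."""
    if doc_type not in PERSIST_ROUTES:
        raise ValueError(f"no persistence route for document type {doc_type!r}")

    calls = [
        {"connector": connector, "action": action, "payload": record}
        for connector, action in PERSIST_ROUTES[doc_type]
    ]

    if doc_type == "meeting-notes":
        # One task per action item; the task connector name is hypothetical.
        for item in record.get("actionItems", []):
            calls.append({"connector": "task-tool", "action": "create-task", "payload": item})
    return calls
```

Raising on an unmapped type is deliberate: a record that reaches this step without a route points to a gap in the Condition node, not something to drop silently.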
Step 6: Acknowledge and Audit
End with two parallel branches:
- slack / send-message to the relevant team channel summarising the extracted record and linking to the downstream target.
- mongodb / insert-documents into an extractions audit collection storing { source, rawText, documentType, extractedFields, model, tokensUsed }. This is your evidence trail when something downstream looks wrong.
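As a rough illustration of the two payloads, the Python sketch below builds the audit document and the Slack message text. The audit field names are the ones listed above; the function names, the message layout, and the target_url parameter are assumptions.

```python
from typing import Any


def build_audit_record(
    normalized: dict[str, Any],
    document_type: str,
    extracted_fields: dict[str, Any],
    model: str,
    tokens_used: int,
) -> dict[str, Any]:
    """Assemble the document stored in the extractions audit collection."""
    return {
        "source": normalized["source"],
        "rawText": normalized["rawText"],
        "documentType": document_type,
        "extractedFields": extracted_fields,
        "model": model,
        "tokensUsed": tokens_used,
    }


def build_slack_summary(
    document_type: str, extracted_fields: dict[str, Any], target_url: str
) -> str:
    """Render a short plain-text summary with a link to the written record."""
    lines = [f"New {document_type} extracted:"]
    for key, value in extracted_fields.items():
        lines.append(f"- {key}: {value}")
    lines.append(f"Record: {target_url}")
    return "\n".join(lines)
```

If rawText may contain PII, this is the point to apply whatever masking you decided on before the audit write.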
Tips
- Schemas are the contract. Spending five minutes nailing each schema saves hours of prompt tweaking later.
- One purpose per schema. Don't try to extract every possible field "just in case". Trim to what you'll actually use downstream.
- Confidence-gate the classifier. Sending an ambiguous document to the wrong extractor produces convincing-looking garbage. Route it to a human instead.
Common Pitfalls
- Mixed-format inputs. Forwarded emails contain the original plus reply threads plus signatures. Strip quoted text and signatures (a regex / replace step) before extraction or the model will pull data from the wrong message.
- Encoding issues. Pasted text from Word docs contains smart quotes and non-breaking spaces. Normalize with text / trim + a regex pass first.
- Schema sprawl. Adding a 12th document type often means the classifier starts misfiring on adjacent types. Keep types under 8 and re-test when you add one.
- PII handling. Unstructured text is where the riskiest PII lives. Decide upfront what gets masked before the AI step and what gets stored in the audit collection.
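The first two pitfalls can be handled by one cleanup pass before the AI step. The Python sketch below is one possible approach, not the platform's regex / replace or text / trim behaviour: the "On ... wrote:" header and the "-- " signature delimiter are common email conventions assumed here, and real inboxes will need more patterns.

```python
import re

# Smart quotes and non-breaking spaces mapped to plain ASCII equivalents.
SMART_CHARS = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u00a0": " ",
})

# Assumed conventions: a reply header line and the "-- " signature delimiter.
QUOTE_HEADER_RE = re.compile(r"^On .+ wrote:\s*$")
SIGNATURE_RE = re.compile(r"^-- ?$")


def clean_text(raw: str) -> str:
    """Normalize characters, then drop quoted replies and signatures."""
    text = raw.translate(SMART_CHARS)
    kept = []
    for line in text.splitlines():
        if QUOTE_HEADER_RE.match(line) or SIGNATURE_RE.match(line):
            # Everything below is a quoted thread or a signature block.
            break
        if line.lstrip().startswith(">"):
            # Inline quoted line from an earlier message.
            continue
        kept.append(line.rstrip())
    return "\n".join(kept).strip()
```

Cutting at the first reply header is a blunt rule: it suits replies where the new message sits on top, and it will discard content from emails that answer inline below the quote.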
Testing
Assemble 5-10 real examples per document type plus a handful of intentionally ambiguous ones. Run the full workflow and check three things per example: did the classifier pick the right type, did extraction fill in the required fields, and did the downstream write actually happen. Adjust the classifier prompt and per-type extraction prompts until accuracy is acceptable, then turn the trigger on.
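The first of those three checks is easy to score automatically. Here is a minimal Python sketch, assuming each example is a dict with hypothetical rawText and expectedType keys and that classify is any callable wrapping your classifier step.

```python
from collections import defaultdict
from typing import Callable


def classifier_accuracy(
    examples: list[dict], classify: Callable[[str], str]
) -> dict[str, float]:
    """Return the share of correctly classified examples per expected type."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for example in examples:
        expected = example["expectedType"]
        totals[expected] += 1
        if classify(example["rawText"]) == expected:
            correct[expected] += 1
    return {doc_type: correct[doc_type] / totals[doc_type] for doc_type in totals}
```

Reporting accuracy per type rather than as one overall number makes it visible when a newly added type starts stealing documents from its neighbours.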