How to Use AI Agents to Process Unstructured Data

Turn messy, unstructured text into clean, structured data using AI agents.

What This Integration Does

Most business data starts its life as prose: support emails, meeting notes, lead forms with a "tell us more" textarea, chat transcripts. This Spojit workflow takes any of those text blobs and uses a Connector node in Agent mode with a Response Schema to pull out the fields you actually want, so they can be stored, indexed, and acted on like normal database rows.

The main decision point in the workflow is which extraction schema applies. You define the schemas once (support ticket, invoice, lead form, and so on), and the workflow classifies the input, picks the matching schema, runs extraction, validates the result, and writes it to your downstream system. It runs whenever its trigger fires, and each run is independent: re-running the same input produces a fresh extraction and leaves earlier records untouched.

Prerequisites

A trigger source: an Email trigger for inbox-driven workflows, a Webhook trigger for form or API inputs, or a database connector such as mongodb or mysql for batch processing of legacy text.
A target system to write the extracted records (your CRM or ERP connector, or mongodb/mysql).
A clear list of document types you will handle, each with its target schema.
An awareness that Agent mode calls consume AI credits, so keep schemas small.

Step 1: Trigger and Normalize Input

Drop a Trigger node:

Email trigger for inbox sources: exposes subject, textBody, htmlBody, and attachments.
Webhook trigger for form submissions or API posts.
Manual trigger for one-off processing of pasted text.

Follow with a Transform node that produces a uniform shape: { source, receivedAt, rawText, attachments }. This way downstream nodes don't have to care where the text came from.
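
As a rough illustration of what that Transform step does, here is a minimal Python sketch. The email field names (subject, textBody, htmlBody, attachments) come from the trigger description above. The "text" key for webhook and manual payloads, and the choice to prepend the subject line, are assumptions made for the example, not Spojit behavior.

```python
import re
from datetime import datetime, timezone


def normalize_input(source: str, payload: dict) -> dict:
    """Map a trigger payload onto { source, receivedAt, rawText, attachments }."""
    if source == "email":
        # Prefer the plain-text body; fall back to HTML with tags crudely removed.
        text = payload.get("textBody") or re.sub(r"<[^>]+>", " ", payload.get("htmlBody") or "")
        if payload.get("subject"):
            # Assumption: the subject often carries useful context, so keep it.
            text = f"{payload['subject']}\n\n{text}"
    else:
        # Hypothetical: webhook and manual payloads carry their text in a "text" key.
        text = payload.get("text") or ""
    return {
        "source": source,
        "receivedAt": datetime.now(timezone.utc).isoformat(),
        "rawText": text.strip(),
        "attachments": payload.get("attachments") or [],
    }
```

Whatever the real payloads look like, the point is the same: every branch returns the same four keys, so later nodes only ever read rawText.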

Step 2: Classify the Document Type

Add a Connector node in Agent mode with a tiny Response Schema that constrains the output to JSON and makes the model pick one of your known types:

{
  "documentType": {
    "type": "string",
    "enum": ["support-ticket", "invoice", "lead-form", "meeting-notes", "other"]
  },
  "confidence": { "type": "number" }
}

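Classifications whose confidence falls below a threshold you choose get routed to a Human node for approval instead of being processed blind. Ask for confidence on a 0 to 1 scale in the classifier prompt so the threshold has a fixed meaning.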
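
The gate itself is a small piece of logic. A minimal sketch, assuming confidence is reported on a 0 to 1 scale and using 0.7 as a placeholder threshold (an assumed value, not a Spojit default). Sending "other" to review is also an assumption, made because no extraction schema exists for it.

```python
KNOWN_TYPES = {"support-ticket", "invoice", "lead-form", "meeting-notes", "other"}
CONFIDENCE_THRESHOLD = 0.7  # assumed value; tune it against your own test set


def route_classification(result: dict, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return "extract" or "human-review" for a classifier result."""
    doc_type = result.get("documentType")
    confidence = result.get("confidence")
    # Unknown types and the catch-all "other" have no extraction schema.
    if doc_type not in KNOWN_TYPES or doc_type == "other":
        return "human-review"
    # A missing or non-numeric confidence is treated as low confidence.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return "human-review"
    return "extract" if confidence >= threshold else "human-review"
```

For example, route_classification({"documentType": "invoice", "confidence": 0.4}) returns "human-review".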

Step 3: Route to the Right Extraction Schema

Add a Condition node on documentType. Each branch runs its own extraction Connector node in Agent mode with a specific Response Schema.

Example schemas (field names only, written as shorthand rather than full schema definitions):

Support ticket:
{ "customerEmail", "issueType", "priority", "summary", "productArea" }

Invoice:
{ "vendor", "invoiceNumber", "amount", "currency", "dueDate", "lineItems[]" }

Lead form:
{ "name", "company", "role", "interest", "budgetRange", "timelineWeeks" }

Meeting notes:
{ "attendees[]", "decisions[]", "actionItems[{ owner, task, dueDate }]" }

Keep each schema strict and small. Smaller schemas mean cheaper, more reliable extractions.
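
To make "strict" concrete, here is one way the shorthand field lists could be expanded into a JSON-Schema-style object with every field required and no extra fields allowed. This is a sketch only: typing every scalar as a string is a simplifying assumption, and whether the Response Schema field accepts exactly this shape is also an assumption.

```python
import json

# Field lists copied from the examples above; a trailing "[]" marks an array field.
FIELDS = {
    "support-ticket": ["customerEmail", "issueType", "priority", "summary", "productArea"],
    "invoice": ["vendor", "invoiceNumber", "amount", "currency", "dueDate", "lineItems[]"],
    "lead-form": ["name", "company", "role", "interest", "budgetRange", "timelineWeeks"],
}


def build_schema(doc_type: str) -> dict:
    """Expand a field list into a strict JSON-Schema-style object."""
    properties = {}
    for field in FIELDS[doc_type]:
        if field.endswith("[]"):
            properties[field[:-2]] = {"type": "array"}
        else:
            # Assumption: scalars are typed as strings here; tighten per field.
            properties[field] = {"type": "string"}
    return {
        "type": "object",
        "properties": properties,
        "required": list(properties),
        "additionalProperties": False,
    }


print(json.dumps(build_schema("support-ticket"), indent=2))
```

Generating schemas from a single field list keeps the extraction schema and the later required-field check from drifting apart.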

Step 4: Validate the Extracted Record

After extraction, add a Transform node with a validation pass for the mechanical checks: validation / email on email fields and validation / iso-date on date fields. Then add a Condition node that checks every required field is present and non-null. Records that pass continue; records that fail go to a Human node for review with the missing or invalid fields flagged.
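
The checks in this step amount to the following logic. The sketch below is plain Python, not the Spojit validation actions themselves; the email pattern is deliberately loose, and the function names are made up for the example.

```python
import re
from datetime import date

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # loose shape check only
ISO_DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def is_iso_date(value) -> bool:
    """True when value is a YYYY-MM-DD string naming a real calendar date."""
    if not isinstance(value, str) or not ISO_DATE_RE.match(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def validate_record(record: dict, required, email_fields=(), date_fields=()) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing: {name}" for name in required if record.get(name) in (None, "")]
    for name in email_fields:
        value = record.get(name)
        if value and not (isinstance(value, str) and EMAIL_RE.match(value)):
            problems.append(f"invalid email: {name}")
    for name in date_fields:
        value = record.get(name)
        if value and not is_iso_date(value):
            problems.append(f"invalid date: {name}")
    return problems
```

Returning the list of problems, rather than a bare pass or fail, is what lets the Human node show the reviewer exactly which fields need attention.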

Step 5: Persist to the Target System

Use a second Condition node on documentType to fan out to the right downstream call:

Support ticket: monday / create-item on the support board.
Invoice: netsuite / upsert-record as a vendor bill.
Lead form: klaviyo / create-profile plus add-profiles-to-list for nurture.
Meeting notes: mongodb / insert-documents into a meetings collection, then a Loop node over actionItems creating tasks.
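
For the meeting-notes branch, the Loop node's job can be sketched as below. The input follows the meeting-notes schema above (owner, task, dueDate per action item); the output task shape (title, assignee, dueDate) is hypothetical and should be replaced with whatever your task system expects.

```python
def tasks_from_meeting(record: dict) -> list:
    """Build one task payload per action item in a meeting-notes record."""
    tasks = []
    for item in record.get("actionItems") or []:
        if not item.get("task"):
            continue  # nothing actionable was extracted for this item
        tasks.append({
            "title": item["task"],
            "assignee": item.get("owner"),   # may be None if no owner was named
            "dueDate": item.get("dueDate"),  # may be None if no date was given
        })
    return tasks
```

Items without a task description are skipped rather than written as empty tasks, which keeps extraction gaps out of the task list.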

Step 6: Acknowledge and Audit

End with a Parallel node that fans out into two branches:

slack / send-message to the relevant team channel summarising the extracted record and linking to the downstream target.
mongodb / insert-documents into an extractions audit collection storing { source, rawText, documentType, extractedFields, model, tokensUsed }. This is your evidence trail when something downstream looks wrong.
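
A sketch of the two payloads these branches need. The audit keys are the ones listed above; the summary wording and the target_url parameter are illustrative assumptions.

```python
def build_audit_record(normalized: dict, doc_type: str, extracted: dict,
                       model: str, tokens_used: int) -> dict:
    """Assemble the audit document with the fields listed above."""
    return {
        "source": normalized["source"],
        "rawText": normalized["rawText"],
        "documentType": doc_type,
        "extractedFields": extracted,
        "model": model,
        "tokensUsed": tokens_used,
    }


def build_summary(doc_type: str, extracted: dict, target_url: str) -> str:
    """One-line channel message; the wording here is only an example."""
    fields = ", ".join(f"{key}={value}" for key, value in extracted.items())
    return f"New {doc_type} extracted ({fields}): {target_url}"
```

Because the two branches share no state, they can safely run side by side in the Parallel node.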

Tips

Schemas are the contract. Spending five minutes nailing each schema saves hours of prompt tweaking later.
One purpose per schema. Don't try to extract every possible field "just in case". Trim to what you'll actually use downstream.
Confidence-gate the classifier. Sending an ambiguous document to the wrong extractor produces convincing-looking garbage. Route it to a human instead.

Common Pitfalls

Mixed-format inputs. Forwarded emails contain the original plus reply threads plus signatures. Strip quoted text and signatures (a regex / replace step) before extraction or the model will pull data from the wrong message.
Encoding issues. Text pasted from Word documents contains smart quotes and non-breaking spaces. Normalize it with text / trim plus a regex / replace pass before extraction.
Schema sprawl. Each added document type gives the classifier more adjacent types to confuse, and by the 12th it often starts misfiring. Keep the list under 8 types and re-run your test set whenever you add one.
PII handling. Unstructured text is where the riskiest PII lives. Decide upfront what gets masked before the AI step and what gets stored in the audit collection.
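
The first two pitfalls come down to a cleaning pass before extraction. The sketch below shows the idea in Python; the marker patterns are heuristics that cover common English-language mail clients, not an exhaustive list, and real mail will need more cases.

```python
import re
import unicodedata

# Lines that usually start a quoted reply or a signature block (heuristic).
CUT_MARKERS = re.compile(
    r"^(On .+ wrote:|-{2,}\s*Original Message\s*-{2,}|From: .+|-- ?)$",
    re.IGNORECASE,
)

REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u00a0": " ",                  # non-breaking space
}


def clean_text(raw: str) -> str:
    """Normalize typography, then keep only the text above the first reply or signature marker."""
    text = unicodedata.normalize("NFC", raw)
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    kept = []
    for line in text.splitlines():
        # Stop at the first marker line or the first ">"-quoted line.
        if CUT_MARKERS.match(line.strip()) or line.lstrip().startswith(">"):
            break
        kept.append(line)
    return "\n".join(kept).strip()
```

Cutting at the first marker is intentionally aggressive: it is usually better to lose part of a thread than to extract fields from the wrong message.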

Testing

Assemble 5 to 10 real examples per document type plus a handful of intentionally ambiguous ones. Run the full workflow and check three things per example: did the classifier pick the right type, did extraction fill in the required fields, and did the downstream write actually happen. Adjust the classifier prompt and the per-type extraction prompts until accuracy meets a target you set beforehand, then turn the trigger on.
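
If you record the three checks for each example, scoring a test run is a few lines of Python. The result keys used here (expectedType, predictedType, missingFields, written) are hypothetical names for however you log each run.

```python
from collections import defaultdict


def accuracy_by_type(results: list) -> dict:
    """Share of examples per expected type that passed all three checks."""
    totals = defaultdict(lambda: {"n": 0, "passed": 0})
    for r in results:
        bucket = totals[r["expectedType"]]
        bucket["n"] += 1
        passed = (
            r["predictedType"] == r["expectedType"]  # classifier picked the right type
            and not r["missingFields"]               # extraction filled required fields
            and r["written"]                         # downstream write happened
        )
        bucket["passed"] += int(passed)
    return {doc_type: b["passed"] / b["n"] for doc_type, b in totals.items()}
```

Scoring per type rather than overall makes it obvious when one new document type has degraded its neighbours.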