How to Use Structured Output for Reliable AI Data Extraction

Constrain AI responses to a specific JSON schema for predictable, machine-readable output.

What This Integration Does

Free-form AI output is great for chat but terrible for automation. The moment a downstream Spojit step expects a number where the model returned the word "twelve", the workflow breaks. A Response Schema pins a Connector node running in Agent Mode to a JSON shape you define, so every response has the same fields, the same nesting, and the same types, ready to feed into a connector, a branching Condition node, or another step.

The workflow runs whenever upstream content arrives through a Webhook trigger, a Schedule trigger, or an Email/Mailhook trigger. The agent reads the raw input and returns a single JSON object that matches your schema. You then validate the result in a following step, so malformed responses fail loudly rather than poisoning the rest of the workflow.

Prerequisites

A Connector node you can switch to Agent Mode (Agent Mode runs an AI agent and supports a Response Schema).
A clear idea of the fields you need to extract and their types.
A handful of representative input samples to test against.

Step 1: Choose a Trigger

Drop a Trigger node onto the canvas. For ad-hoc extraction, use a Webhook trigger that accepts a JSON payload with the raw text. For batch jobs, use a Schedule trigger and pull pending rows from your source system at the top of each run.

Step 2: Stage the Raw Content

If your source is a PDF, add a Connector node pointing at the pdf connector and pick the extract-text tool. For HTML or scraped text, route through a Transform node first to strip markup and collapse whitespace. The cleaner the input, the more reliably the model fills the schema.
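
To make the clean-up concrete, here is a minimal Python sketch of what "strip markup and collapse whitespace" can mean. It illustrates the logic only and is not the Transform node's own syntax; the names _TextExtractor and clean_html are hypothetical, and it uses nothing beyond the Python standard library.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_html(raw: str) -> str:
    """Strip tags, then collapse every run of whitespace to a single space."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    text = " ".join(parser.parts)
    return re.sub(r"\s+", " ", text).strip()


print(clean_html("<p>Order  <b>#42</b></p>\n<script>x=1</script><p>Total: 12.50</p>"))
```

Joining text nodes with a space keeps words from adjacent elements apart, at the cost of occasionally splitting a word that is styled mid-word; for extraction input that trade-off is usually acceptable.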

Step 3: Configure Agent Mode with a Response Schema

Add a Connector node and switch it to Agent Mode. Fill in the Response Schema field with the JSON schema you want returned. Every property should have a type and a description: the description is what guides the model.

{
  "type": "object",
  "properties": {
    "customerName": { "type": "string", "description": "Full legal name from the document header" },
    "orderTotal":   { "type": "number", "description": "Grand total in the document's currency" },
    "currency":     { "type": "string", "description": "ISO 4217 currency code, e.g. USD" },
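    "items": {
      "type": "array",
      "description": "Line items listed on the order",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string", "description": "Product name as printed on the line" },
          "qty":  { "type": "integer", "description": "Quantity ordered" },
          "unitPrice": { "type": "number", "description": "Price per unit in the document's currency" }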
        },
        "required": ["name", "qty"]
      }
    }
  },
  "required": ["customerName", "orderTotal", "items"]
}
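
Because the descriptions carry most of the guidance, it can help to lint a schema for properties that lack a type or a description before pasting it into the Response Schema field. The sketch below is one way to do that in plain Python; find_undocumented and the sample schema are hypothetical, and the check covers only the "properties" and array "items" keywords used in this guide.

```python
def find_undocumented(schema: dict, path: str = "$") -> list[str]:
    """Return the paths of properties lacking a "type" or a "description"."""
    problems = []
    for name, prop in schema.get("properties", {}).items():
        prop_path = f"{path}.{name}"
        if "type" not in prop:
            problems.append(f"{prop_path}: missing type")
        if "description" not in prop:
            problems.append(f"{prop_path}: missing description")
        # Recurse into nested objects and into array element schemas.
        if prop.get("type") == "object":
            problems.extend(find_undocumented(prop, prop_path))
        elif prop.get("type") == "array" and isinstance(prop.get("items"), dict):
            problems.extend(find_undocumented(prop["items"], prop_path + "[]"))
    return problems


# Hypothetical sample schema, deliberately missing one description.
sample = {
    "type": "object",
    "properties": {
        "invoiceId": {"type": "string", "description": "Invoice number from the header"},
        "lines": {
            "type": "array",
            "description": "Line items",
            "items": {
                "type": "object",
                "properties": {"sku": {"type": "string"}},
            },
        },
    },
}

print(find_undocumented(sample))  # ['$.lines[].sku: missing description']
```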

Step 4: Write a Tight Prompt

Your prompt does not need to repeat the schema - the model already sees it. Focus on instructions the schema can't express:

Extract order data from the document below. If an optional field is missing,
omit it rather than guessing. Use integers for quantities and decimal numbers
for prices. Return all monetary values in the document's original currency.

Document:
{{ extractText.text }}

Step 5: Validate the Result

Add a Connector node pointing at the validation connector with the json tool, or use a Condition node to assert the fields you can't live without (e.g. {{ agent.orderTotal }} > 0). A failed assertion can route to a Human node for manual review instead of writing bad data downstream.
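
The assertions this step performs could look like the following in plain Python. This is a sketch of the logic only, not the validation connector's implementation; check_order and is_number are hypothetical names, and the field names are taken from the example schema in Step 3.

```python
def is_number(value) -> bool:
    # bool is a subclass of int in Python, so rule it out explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_order(order: dict) -> list[str]:
    """Return a list of problems; an empty list means the object passed."""
    errors = []
    for field in ("customerName", "orderTotal", "items"):
        if field not in order:
            errors.append(f"missing required field: {field}")

    if "orderTotal" in order:
        total = order["orderTotal"]
        if not is_number(total):
            errors.append("orderTotal must be a number")
        elif total <= 0:
            errors.append("orderTotal must be greater than 0")

    if "items" in order:
        items = order["items"]
        if not isinstance(items, list):
            errors.append("items must be an array")
        else:
            for i, item in enumerate(items):
                if not isinstance(item, dict):
                    errors.append(f"items[{i}] must be an object")
                    continue
                if not isinstance(item.get("name"), str):
                    errors.append(f"items[{i}].name must be a string")
                qty = item.get("qty")
                if not isinstance(qty, int) or isinstance(qty, bool):
                    errors.append(f"items[{i}].qty must be an integer")
    return errors
```

A non-empty list is the signal to take the failure branch, for example the route to a Human node, rather than writing the object downstream.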

Step 6: Hand Off the Clean Object

The agent's output is now a typed JSON object you can pass straight into any connector - mongodb insert-documents, mysql insert-rows, netsuite create-record, etc. No string parsing required.

Tips

Field descriptions are the highest-leverage thing you can tune - vague descriptions produce vague extractions.
Mark only the truly mandatory fields as required; over-requiring forces the model to hallucinate when data is missing.
For lists with unknown length, use "items" arrays rather than numbered fields (item1, item2): schemas with arrays generalize far better.
Let Miraxa, the intelligent layer across your automation, scaffold the node for you. Ask it to add a Connector node in Agent Mode with a Response Schema, then fine-tune the fields in the properties panel.

Common Pitfalls

Number vs string - if you declare a field as number, the model still returns valid JSON, but a formatted value like "1,200.00" can be truncated at the comma and come back as 1. Strip thousands separators upstream, or declare the field as a string and parse it yourself.
Schema too large - large or deeply nested schemas inflate the prompt and reduce extraction quality. If you're extracting more than ~15 fields, split the work across multiple agent steps.
Optional fields silently dropped - the model omits fields it can't find. Don't write code that assumes every property is present; default to null in the next Transform step.
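
Two of these pitfalls have small mechanical fixes, sketched below in Python under stated assumptions: parse_amount assumes "," is a thousands separator and "." is the decimal point (which is wrong for locales that swap them), and OPTIONAL_FIELDS is a hypothetical list you would maintain to match your own schema.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical: the properties of your schema that are not in "required".
OPTIONAL_FIELDS = ("currency",)


def parse_amount(raw: str) -> Decimal:
    """Parse strings such as "1,200.00" without losing digits.

    Assumes "," is a thousands separator and "." is the decimal point.
    """
    cleaned = raw.strip().replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation as exc:
        raise ValueError(f"not a numeric amount: {raw!r}") from exc


def with_defaults(order: dict) -> dict:
    """Return a copy in which every optional field exists, as None if absent."""
    result = dict(order)
    for field in OPTIONAL_FIELDS:
        result.setdefault(field, None)
    return result
```

Decimal is used instead of float so that monetary values keep their exact digits; swap in float if the downstream connector expects one.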

Testing

Run the workflow against three representative inputs - one clean, one messy, and one that's deliberately missing a field. Compare the returned JSON against ground truth. If any of the three comes back with a wrong or missing value, tighten the field descriptions or add a clarifying line to the prompt before turning the trigger on.
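
The comparison against ground truth can be scripted. The following Python sketch assumes you have some way to run the workflow on a sample and get back a dict; that is represented here by a hypothetical extract callable, and diff_fields and run_cases are likewise illustrative names.

```python
from typing import Callable

MISSING = "<missing>"  # marker for a field absent from one side


def diff_fields(expected: dict, actual: dict) -> dict:
    """Map each top-level field that differs to an (expected, actual) pair."""
    diffs = {}
    for key in sorted(expected.keys() | actual.keys()):
        exp = expected.get(key, MISSING)
        act = actual.get(key, MISSING)
        if exp != act:
            diffs[key] = (exp, act)
    return diffs


def run_cases(
    cases: list[tuple[str, str, dict]],
    extract: Callable[[str], dict],
) -> dict[str, dict]:
    """Run extract over (label, raw_input, expected) cases.

    Returns only the failing labels, each mapped to its field differences.
    """
    failures = {}
    for label, raw, expected in cases:
        diffs = diff_fields(expected, extract(raw))
        if diffs:
            failures[label] = diffs
    return failures
```

An empty result from run_cases means all samples matched; anything else names the sample and the exact fields to look at when revising descriptions or the prompt.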