How to Use Structured Output for Reliable AI Data Extraction

Constrain AI responses to a specific JSON schema for predictable, machine-readable output.

What This Integration Does

Free-form LLM output is great for chat, but terrible for automation. The moment a downstream Spojit step expects a number where the model returned the word "twelve", the workflow breaks. Structured Output pins the AI Agent to a JSON schema you define so every response has the same fields, in the same shape, with the same types - ready to feed into a database, an API, or a branching Condition node.

The workflow runs whenever upstream content arrives (email, webhook, schedule, or a file landing in storage). The AI Agent reads the raw input and returns a single JSON object that matches your schema. Validation runs in the same step, so malformed responses fail loudly rather than poisoning the rest of the workflow.

Prerequisites

  • An AI Agent enabled in Spojit (any provider - Vertex, Anthropic, OpenAI).
  • A clear idea of the fields you need to extract and their types.
  • A handful of representative input samples to test against.

Step 1: Choose a Trigger

Drop a Trigger node onto the canvas. For ad-hoc extraction, use a Webhook trigger that accepts a JSON payload with the raw text. For batch jobs, use a Schedule trigger and pull pending rows from your source system at the top of each run.

Step 2: Stage the Raw Content

If your source is a PDF, add a Connector node pointing at the pdf connector and pick the extract-text tool. For HTML or scraped text, route through a Transform node first to strip markup and collapse whitespace. The cleaner the input, the more reliably the model fills the schema.
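
If you want to prototype the cleanup outside the workflow first, the following is a minimal Python sketch of "strip markup and collapse whitespace". It uses only the standard library; the names clean_html and _TextExtractor are illustrative and not part of Spojit, and the Transform node's own expression syntax may look different.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Joining with a space keeps adjacent blocks from running together,
        # at the cost of splitting words that span inline tags.
        return " ".join(self._chunks)


def clean_html(raw: str) -> str:
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return re.sub(r"\s+", " ", parser.text()).strip()
```

Running a few real samples through a helper like this is a quick way to see what the model will actually receive before you wire up the agent.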

Step 3: Configure the AI Agent with a JSON Schema

Add a Connector node and switch it to Agent Mode. Enable Structured Output and paste in the schema you want returned. Every property should have a type and a description - the description is what guides the model.

{
  "type": "object",
  "properties": {
    "customerName": { "type": "string", "description": "Full legal name from the document header" },
    "orderTotal":   { "type": "number", "description": "Grand total in the document's currency" },
    "currency":     { "type": "string", "description": "ISO 4217 currency code, e.g. USD" },
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
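          "name": { "type": "string", "description": "Line item name as printed on the document" },
          "qty":  { "type": "integer", "description": "Quantity ordered" },
          "unitPrice": { "type": "number", "description": "Price per unit in the document's currency" }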
        },
        "required": ["name", "qty"]
      }
    }
  },
  "required": ["customerName", "orderTotal", "items"]
}

Step 4: Write a Tight Prompt

Your prompt does not need to repeat the schema - the model already sees it. Focus on instructions the schema can't express:

Extract order data from the document below. If an optional field is missing, omit it
rather than guessing. Use integers for quantities and decimal numbers for
prices. Return all monetary values in the document's original currency.

Document:
{{ extractText.text }}

Step 5: Validate the Result

Add a Connector node pointing at the validation connector with the json tool, or use a Condition node to assert the fields you can't live without (e.g. {{ agent.orderTotal }} > 0). A failed assertion can route to a Human node for manual review instead of writing bad data downstream.
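
To make the checks in this step concrete, here is a rough Python sketch of the same logic: required fields, basic types, and the orderTotal > 0 assertion. It is a hand-rolled checker for a small subset of JSON Schema keywords (type, properties, required, items), not a full validator and not how the validation connector is implemented; check, validate_order, and ORDER_SCHEMA are hypothetical names.

```python
# Trimmed copy of the Step 3 schema (descriptions omitted for brevity).
ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "customerName": {"type": "string"},
        "orderTotal": {"type": "number"},
        "currency": {"type": "string"},
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "qty": {"type": "integer"},
                    "unitPrice": {"type": "number"},
                },
                "required": ["name", "qty"],
            },
        },
    },
    "required": ["customerName", "orderTotal", "items"],
}


def check(value, schema, path="$"):
    """Return a list of problems; an empty list means the value passed."""
    errors = []
    expected = schema.get("type")
    if expected == "object":
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: required field is missing")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors.extend(check(value[key], sub, f"{path}.{key}"))
    elif expected == "array":
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        for i, item in enumerate(value):
            errors.extend(check(item, schema.get("items", {}), f"{path}[{i}]"))
    elif expected == "string":
        if not isinstance(value, str):
            errors.append(f"{path}: expected string")
    elif expected == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            errors.append(f"{path}: expected integer")
    elif expected == "number":
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{path}: expected number")
    return errors


def validate_order(order):
    """Schema checks first, then the business rule from this step."""
    errors = check(order, ORDER_SCHEMA)
    if not errors and order["orderTotal"] <= 0:
        errors.append("$.orderTotal: must be greater than 0")
    return errors
```

An empty list corresponds to the "pass" branch; anything else is what you would route to the Human node, with the error paths giving the reviewer a starting point.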

Step 6: Hand Off the Clean Object

The agent's output is now a typed JSON object you can pass straight into any connector - mongodb insert-documents, mysql insert-rows, netsuite create-record, etc. No string parsing required.

Tips

  • Field descriptions are the highest-leverage thing you can tune - vague descriptions produce vague extractions.
  • Mark only the truly mandatory fields as required; over-requiring forces the model to hallucinate when data is missing.
  • For lists with unknown length, use "items" arrays rather than numbered fields (item1, item2) - schemas with arrays generalize far better.

Common Pitfalls

  • Number vs string - if you declare a field as number, the model still returns valid JSON, but a source value like "1,200.00" can be cut off at the comma and come back as 1. Strip thousand separators upstream or accept a string and parse it.
  • Schema too large - very large or deeply nested schemas blow up the prompt and reduce extraction quality. Split into multiple agent steps if you're extracting more than ~15 fields.
  • Optional fields silently dropped - the model omits fields it can't find. Don't write code that assumes every property is present; default to null in the next Transform step.
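
Two of these pitfalls come down to a few lines of post-processing. The sketch below shows one way to handle them in Python, assuming amounts use "," as the thousands separator and "." as the decimal mark (this is not true for every locale) and that currency is the only optional top-level field. The names parse_amount, with_defaults, and OPTIONAL_FIELDS are illustrative; a Transform step would express the same idea in its own syntax.

```python
from decimal import Decimal, InvalidOperation

# Assumption: the optional top-level fields of the Step 3 schema.
OPTIONAL_FIELDS = ("currency",)


def parse_amount(raw: str) -> Decimal:
    """Parse a string amount such as "1,200.00" into a Decimal.

    Assumes ',' is a thousands separator and '.' is the decimal mark.
    """
    cleaned = raw.strip().replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"not a numeric amount: {raw!r}") from None


def with_defaults(record: dict, optional=OPTIONAL_FIELDS) -> dict:
    """Return a copy of the record with absent optional fields set to None."""
    filled = dict(record)
    for key in optional:
        filled.setdefault(key, None)
    return filled
```

Raising on unparseable input rather than returning 0 keeps a bad amount from slipping through as a plausible-looking number.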

Testing

Run the workflow against three representative inputs - one clean, one messy, and one that's deliberately missing a field. Compare the returned JSON against ground truth. If even one of the three trips up, tighten the field descriptions or add a clarifying line to the prompt before turning the trigger on.
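
Comparing against ground truth by eye gets tedious after a few runs. A small field-level diff like the Python sketch below can help; it assumes you have saved the expected object for each sample by hand, and diff_fields is a hypothetical helper rather than a Spojit feature.

```python
def diff_fields(expected, actual, path="$"):
    """Return a list of paths where the output differs from ground truth."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        diffs = []
        for key in sorted(set(expected) | set(actual)):
            sub = f"{path}.{key}"
            if key not in actual:
                diffs.append(f"{sub}: missing from output")
            elif key not in expected:
                diffs.append(f"{sub}: unexpected field")
            else:
                diffs.extend(diff_fields(expected[key], actual[key], sub))
        return diffs
    if isinstance(expected, list) and isinstance(actual, list):
        if len(expected) != len(actual):
            return [f"{path}: expected {len(expected)} items, got {len(actual)}"]
        diffs = []
        for i, (exp_item, act_item) in enumerate(zip(expected, actual)):
            diffs.extend(diff_fields(exp_item, act_item, f"{path}[{i}]"))
        return diffs
    if expected != actual:
        return [f"{path}: expected {expected!r}, got {actual!r}"]
    return []
```

An empty list means the sample matched. The paths in a non-empty list point at the field descriptions worth tightening first.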
