How to Use AI to Clean and Normalize Messy Data

Fix inconsistent data formats, typos, and missing values using AI before syncing to other systems.

What This Integration Does

Data imported from CSV files, spreadsheets, or legacy systems is rarely clean: addresses split across columns inconsistently, phone numbers in five different formats, "USA" vs "United States" vs "us". Pure rule-based cleaning handles the easy cases; the messy ones need a model that can read context. In Spojit, this workflow runs each record through a Connector node in Agent mode with a Response Schema so the cleaned data lands in a predictable shape, then writes it back to the source (or to a "cleaned" mirror).

The workflow is typically batch (a CSV upload or a nightly database sync), but works the same way for ad-hoc cleanup before a one-off migration. Each record is processed independently so it parallelises trivially.

Prerequisites

A data source: a csv file, a mysql/mongodb connection, or an API your http connector can call.
A target system to write the cleaned data to (often the same source, updated in place).
Enough AI credits in your workspace to run the Agent-mode step across your dataset.
A clear definition of "clean" - the target schema, casing convention, and phone/date formats.

Step 1: Trigger and Fetch the Raw Data

Drop a Trigger node. For batch cleanup, use Manual or Schedule. For an emailed CSV, use a Mailhook trigger, add an Attachment node to fetch the file bytes, then a Connector node on csv / parse to turn the file into JSON rows. For a posted CSV, use a Webhook trigger ahead of csv / parse. For database cleanup, use mysql / execute-query or mongodb / find-documents scoped to records that look dirty (e.g. WHERE phone NOT REGEXP '^\\+').

Step 2: Quick Rule-Based Pre-Clean

Don't make the model do work that trim can do. Add a Transform node (or chain of utility Connector nodes) for the cheap fixes:

text / trim and text / case on string columns.
regex / replace to strip obvious junk (e.g. (N/A|null|--) -> empty).
validation / email to mark which emails are syntactically valid.

This pass is free (no AI credits) and dramatically reduces what the Agent-mode step has to handle.

Step 3: Loop Over the Records

Add a Loop node (ForEach mode) over the array of rows. For very large datasets, configure the loop to run in batches of 5-20 records and pass each batch to the AI step in one call - much cheaper than one call per row.

Step 4: AI Clean with a Response Schema

Inside the loop, add a Connector node in Agent mode. Use a clear prompt and a Response Schema that matches your target model, so every record comes back as predictable JSON:

Prompt:
You normalize customer records. For each input record, return a cleaned record.
- Names: Title Case, full first and last where possible.
- Address: split into street, city, region, postcode, countryCode (ISO 3166-1 alpha-2).
- Phone: E.164 (e.g. +61400123456). If country is ambiguous, leave null.
- Email: lowercase, trimmed.
- If a field is unrecoverable, return null. Do not invent data.

Response Schema:
{
  "type": "object",
  "properties": {
    "firstName":   { "type": ["string", "null"] },
    "lastName":    { "type": ["string", "null"] },
    "email":       { "type": ["string", "null"] },
    "phone":       { "type": ["string", "null"] },
    "address": {
      "type": "object",
      "properties": {
        "street":      { "type": ["string", "null"] },
        "city":        { "type": ["string", "null"] },
        "region":      { "type": ["string", "null"] },
        "postcode":    { "type": ["string", "null"] },
        "countryCode": { "type": ["string", "null"] }
      }
    }
  }
}

Step 5: Validate and Write Back

After the AI step, add a Condition node that checks for required fields. Valid records flow into a Connector node on your target system (mongodb / update-documents, mysql / update-rows, or shopify / update-customer for example). Records still missing key fields go to a "needs review" collection so a human can fix them.

Step 6: Report

End with a Connector node on slack / send-message that reports the run summary: total records processed, count cleaned, count rejected, link to the review queue. Also log per-record decisions with a mongodb / insert-documents call so you can audit the changes later.

Tips

Show examples in the prompt. A handful of input/output pairs (one-shot or few-shot) dramatically improves consistency on edge cases.
Run on a sample first. Process 100 rows manually before turning loose on the full dataset - it's cheap insurance against a bad prompt burning through your AI credits.
Keep the original. Don't overwrite source fields. Store cleaned values in parallel columns (email_clean, phone_clean) so you can roll back.

Common Pitfalls

Hallucinated data. Models love to fill blanks. Repeat "do not invent data, return null for unknowns" in the prompt and check for it in spot checks.
Locale-sensitive formats. US "MM/DD/YYYY" vs EU "DD/MM/YYYY" is ambiguous on the 1st-12th of each month. Pass a country hint when you have one.
Cost on big datasets. Every record sent to the Agent-mode step spends AI credits, so a million rows adds up fast. Batch records per call and pick the most economical model that still passes your sample tests.
Tone drift. The model may "improve" names ("Bob" -> "Robert"). Explicitly forbid expansions in the prompt.

Testing

Run on a representative sample (e.g. 200 rows including the ugliest examples you can find). Eyeball every row of output and label correct/incorrect/hallucinated. Iterate on the prompt until accuracy is acceptable, then run the full batch. Keep the original and cleaned versions side by side for one cycle before deleting the originals.