How to Use AI to Clean and Normalize Messy Data

Fix inconsistent data formats, typos, and missing values using AI before syncing to other systems.

What This Integration Does

Data imported from CSV files, spreadsheets, or legacy systems is rarely clean: addresses split across columns inconsistently, phone numbers in five different formats, "USA" vs "United States" vs "us". Pure rule-based cleaning handles the easy cases; the messy ones need a model that can read context. This workflow runs each record through an AI step with Structured Output so the cleaned data lands in a predictable schema, then writes it back to the source (or to a "cleaned" mirror).

The workflow is typically batch (a CSV upload or a nightly database sync), but works the same way for ad-hoc cleanup before a one-off migration. Each record is processed independently so it parallelises trivially.

Prerequisites

  • A data source: a csv file, a mysql/mongodb connection, or an API your http connector can call.
  • A target system to write the cleaned data to (often the same source, updated in place).
  • A workspace LLM provider configured.
  • A clear definition of "clean" - the target schema, casing convention, and phone/date formats.

Step 1: Trigger and Fetch the Raw Data

Drop a Trigger node. For batch cleanup, use Manual or Schedule. For inbound CSV, use Webhook or Email (attachment) and follow with a Connector node on csv / parse to turn the file into JSON rows. For database cleanup, use mysql / execute-query or mongodb / find-documents scoped to records that look dirty (e.g. WHERE phone NOT REGEXP '^\\+').

Step 2: Quick Rule-Based Pre-Clean

Don't make the model do work that trim can do. Add a Transform node (or chain of utility Connector nodes) for the cheap fixes:

  • text / trim and text / case on string columns.
  • regex / replace to strip obvious junk (e.g. (N/A|null|--) -> empty).
  • validation / email to mark which emails are syntactically valid.

This pass is free (no LLM tokens) and dramatically reduces what the AI step has to handle.

Step 3: Loop Over the Records

Add a Loop node (ForEach mode) over the array of rows. For very large datasets, configure the loop to run in batches of 5-20 records and pass each batch to the AI step in one call - much cheaper than one call per row.

Step 4: AI Clean with Structured Output

Inside the loop, add a Connector node configured as a tool-augmented LLM call. Use a clear prompt and a strict output schema matching your target model:

Prompt:
You normalize customer records. For each input record, return a cleaned record.
- Names: Title Case, full first and last where possible.
- Address: split into street, city, region, postcode, countryCode (ISO 3166-1 alpha-2).
- Phone: E.164 (e.g. +61400123456). If country is ambiguous, leave null.
- Email: lowercase, trimmed.
- If a field is unrecoverable, return null. Do not invent data.
Schema:
{
  "type": "object",
  "properties": {
    "firstName":   { "type": ["string", "null"] },
    "lastName":    { "type": ["string", "null"] },
    "email":       { "type": ["string", "null"] },
    "phone":       { "type": ["string", "null"] },
    "address": {
      "type": "object",
      "properties": {
        "street":      { "type": ["string", "null"] },
        "city":        { "type": ["string", "null"] },
        "region":      { "type": ["string", "null"] },
        "postcode":    { "type": ["string", "null"] },
        "countryCode": { "type": ["string", "null"] }
      }
    }
  }
}

Step 5: Validate and Write Back

After the AI step, add a Condition node that checks for required fields. Valid records flow into a Connector node on your target system (mongodb / update-documents, mysql / update-rows, or shopify / update-customer for example). Records still missing key fields go to a "needs review" collection so a human can fix them.

Step 6: Report

End with a Connector node on slack / send-message that reports the run summary: total records processed, count cleaned, count rejected, link to the review queue. Also log per-record decisions to a Mongo collection so you can audit the changes later.

Tips

  • Show examples in the prompt. A handful of input/output pairs (one-shot or few-shot) dramatically improves consistency on edge cases.
  • Run on a sample first. Process 100 rows manually before turning loose on the full dataset - it's cheap insurance against a bad prompt burning your token budget.
  • Keep the original. Don't overwrite source fields. Store cleaned values in parallel columns (email_clean, phone_clean) so you can roll back.

Common Pitfalls

  • Hallucinated data. Models love to fill blanks. Repeat "do not invent data, return null for unknowns" in the prompt and check for it in spot checks.
  • Locale-sensitive formats. US "MM/DD/YYYY" vs EU "DD/MM/YYYY" is ambiguous on the 1st-12th of each month. Pass a country hint when you have one.
  • Cost on big datasets. A million rows at $0.001 each is $1,000. Batch records per call and use the cheapest model that still passes your sample tests.
  • Tone drift. The model may "improve" names ("Bob" -> "Robert"). Explicitly forbid expansions in the prompt.

Testing

Run on a representative sample (e.g. 200 rows including the ugliest examples you can find). Eyeball every row of output and label correct/incorrect/hallucinated. Iterate on the prompt until accuracy is acceptable, then run the full batch. Keep the original and cleaned versions side by side for one cycle before deleting the originals.

Learn More

Did this answer your question? Thanks for the feedback There was a problem submitting your feedback. Please try again later.