How to Use AI to Clean Database Records Before Syncing
Run AI quality checks on database records before pushing them to other systems.
What This Integration Does
Operational databases collect grime over time: duplicate customer rows, phone numbers in five different formats, addresses missing postcodes, names typed in all-caps. When you sync those records downstream into a CRM, a marketing tool, or an ERP, the mess multiplies. This workflow runs a cleaning pass first - using an LLM agent to normalize, deduplicate, and patch records - so only clean data leaves your database.
The workflow runs on a schedule (or on demand against a specific batch), pulls candidate records out of MySQL or MongoDB, asks an AI agent to produce a cleaned version of each row, writes the cleaned record back, and only then triggers the downstream sync. Each pass is idempotent: cleaned rows are flagged so the next run skips them.
Prerequisites
- A MySQL or MongoDB connection with read and write access to the table or collection holding the dirty records.
- A column or field to mark cleaned rows (e.g.
cleaned_attimestamp orstatus = 'clean'). - An LLM provider configured for your workspace (Vertex AI, OpenAI, or Anthropic) so the agent step can call a model.
- A downstream connection (CRM, ERP, marketing tool) that will receive the cleaned records.
Step 1: Schedule Trigger
Drop a Trigger node and set its type to Schedule. Hourly is a good cadence for high-volume tables; nightly is fine if your downstream sync also runs nightly. For ad-hoc cleanups, use a Manual trigger instead so you can kick off a batch from the dashboard.
Step 2: Query Dirty Records
Add a Connector node for the mysql connector and pick the execute-query tool. Pull only records that haven't been cleaned yet and cap the batch so a single run stays predictable:
SELECT id, first_name, last_name, email, phone, address_line, city, postcode
FROM customers
WHERE cleaned_at IS NULL
ORDER BY id
LIMIT 200
For MongoDB, swap in the mongodb connector with find-documents and a filter like { "cleaned_at": { "$exists": false } } with a limit of 200.
Step 3: Loop the Batch Through the AI Agent
Wrap the next steps in a Loop node (ForEach) over the rows returned in Step 2. Inside the loop, add an AI agent step with a prompt like:
You receive one customer record. Return a JSON object with the same fields, cleaned:
- Trim whitespace, fix capitalization (Title Case for names and city)
- Normalize phone to E.164 (+CCXXXXXXXXX)
- Lowercase the email
- If postcode is missing but city is provided, infer the postcode where confident; otherwise leave blank
- Set a "confidence" field between 0 and 1
Input: {{ row }}
Configure the agent to return structured output so the next step can rely on the JSON shape.
Step 4: Validate and Branch on Confidence
Add a Connector node for the validation connector and call email on the cleaned email field. Then add a Condition node that checks {{ cleaned.confidence }} >= 0.8 and that validation passed. High-confidence rows continue to the update step; low-confidence rows route to a review collection (or a Slack alert) instead of being written back blindly.
Step 5: Write Cleaned Records Back
On the high-confidence branch, add a Connector node for the mysql connector and pick the update-rows tool. Configure:
- Table:
customers - Where:
{ "id": "{{ row.id }}" } - Set: the cleaned fields plus
cleaned_at = NOW()so the next run skips this row
For MongoDB, use the mongodb connector with update-documents and $set the cleaned fields.
Step 6: Trigger the Downstream Sync
After the loop completes, add a Subworkflow node that invokes your existing sync workflow (e.g. push-to-crm) scoped to the rows you just cleaned. Pass the list of ids as input so the sync workflow only touches freshly-cleaned records and doesn't re-process the whole table.
Tips
- Batch size matters - 100 to 500 rows per run keeps LLM costs predictable and lets you bail out quickly if something goes wrong.
- Keep the original - store the raw value in a shadow column (
phone_raw) before overwriting. If the model gets a row wrong, you can roll back. - Cheap model first - most cleaning is mechanical. A small, fast model handles 90% of rows; only escalate to a stronger model on low-confidence ones.
Common Pitfalls
- Hallucinated postcodes - LLMs will happily invent a postcode that looks plausible. Always require
confidence >= 0.8and a separate validation pass before writing. - Re-cleaning the same rows - if you forget to set
cleaned_at, the next run pulls the same batch again. Always update the marker field in the same query as the cleaned fields. - PII in logs - the agent step logs prompts and outputs by default. If your rows contain sensitive data, scrub fields before logging or turn step logging off.
Testing
Run the workflow manually with a LIMIT 5 on the query and the downstream subworkflow disabled. Inspect each cleaned row in the database: did capitalization, phone format, and email lowercase as expected? Do the cleaned_at timestamps appear? Once five rows look right, raise the limit and turn the schedule on.