How to Use AI to Clean Database Records Before Syncing

Run AI quality checks on database records before pushing them to other systems.

What This Integration Does

Operational databases collect grime over time: duplicate customer rows, phone numbers in five different formats, addresses missing postcodes, names typed in all-caps. When you sync those records downstream into a CRM, a marketing tool, or an ERP, the mess multiplies. This Spojit workflow runs a cleaning pass first - using a Connector node in Agent mode to normalize, deduplicate, and patch records - so only clean data leaves your database.

The workflow runs on a schedule (or on demand against a specific batch), pulls candidate records out of MySQL or MongoDB, asks a Connector node in Agent mode to produce a cleaned version of each row, writes the cleaned record back, and only then triggers the downstream sync. Each pass is idempotent: cleaned rows are flagged so the next run skips them.

Prerequisites

A MySQL or MongoDB connection with read and write access to the table or collection holding the dirty records.
A column or field to mark cleaned rows (e.g. cleaned_at timestamp or status = 'clean').
A Connector node configured in Agent mode for the cleaning step (it uses AI credits and lets you force structured output with a Response Schema).
A downstream connection (CRM, ERP, marketing tool) that will receive the cleaned records.

Step 1: Schedule Trigger

Drop a Trigger node and set its type to Schedule. Hourly is a good cadence for high-volume tables; nightly is fine if your downstream sync also runs nightly. For ad-hoc cleanups, use a Manual trigger instead so you can kick off a batch from the dashboard.

Step 2: Query Dirty Records

Add a Connector node for the mysql connector and pick the execute-query tool. Pull only records that haven't been cleaned yet and cap the batch so a single run stays predictable:

SELECT id, first_name, last_name, email, phone, address_line, city, postcode
FROM customers
WHERE cleaned_at IS NULL
ORDER BY id
LIMIT 200

For MongoDB, swap in the mongodb connector with find-documents and a filter like { "cleaned_at": { "$exists": false } } with a limit of 200.

Step 3: Loop the Batch Through Agent Mode

Wrap the next steps in a Loop node (ForEach) over the rows returned in Step 2. Inside the loop, add a Connector node in Agent mode with a prompt like:

You receive one customer record. Return a JSON object with the same fields, cleaned:
- Trim whitespace, fix capitalization (Title Case for names and city)
- Normalize phone to E.164 (+CCXXXXXXXXX)
- Lowercase the email
- If postcode is missing but city is provided, infer the postcode where confident; otherwise leave blank
- Set a "confidence" field between 0 and 1
Input: {{ row }}

Set a Response Schema on the node so Agent mode returns the cleaned record as predictable JSON the next step can rely on.

Step 4: Validate and Branch on Confidence

Add a Connector node for the validation connector and call email on the cleaned email field. Then add a Condition node that checks {{ cleaned.confidence }} >= 0.8 and that validation passed. High-confidence rows continue to the update step; low-confidence rows route to a review collection (or a Slack alert) instead of being written back blindly.

Step 5: Write Cleaned Records Back

On the high-confidence branch, add a Connector node for the mysql connector and pick the update-rows tool. Configure:

Table: customers
Where: { "id": "{{ row.id }}" }
Set: the cleaned fields plus cleaned_at = NOW() so the next run skips this row

For MongoDB, use the mongodb connector with update-documents and $set the cleaned fields.

Step 6: Trigger the Downstream Sync

After the loop completes, add a Subworkflow node that invokes your existing sync workflow (e.g. push-to-crm) scoped to the rows you just cleaned. Pass the list of ids as input so the sync workflow only touches freshly-cleaned records and doesn't re-process the whole table.

Tips

Batch size matters - 100 to 500 rows per run keeps LLM costs predictable and lets you bail out quickly if something goes wrong.
Keep the original - store the raw value in a shadow column (phone_raw) before overwriting. If the model gets a row wrong, you can roll back.
Cheap model first - most cleaning is mechanical. A small, fast model handles 90% of rows; only escalate to a stronger model on low-confidence ones.

Common Pitfalls

Hallucinated postcodes - LLMs will happily invent a postcode that looks plausible. Always require confidence >= 0.8 and a separate validation pass before writing.
Re-cleaning the same rows - if you forget to set cleaned_at, the next run pulls the same batch again. Always update the marker field in the same query as the cleaned fields.
PII in prompts - the cleaning prompt and its output appear in execution history. If your rows contain sensitive data, drop or mask fields you do not need to clean (with a Transform node) before they reach the Agent mode step.

Testing

Run the workflow manually with a LIMIT 5 on the query and the downstream subworkflow disabled. Inspect each cleaned row in the database: did capitalization, phone format, and email lowercase as expected? Do the cleaned_at timestamps appear? Once five rows look right, raise the limit and turn the schedule on.