Data Quality Rules
Sluice’s data quality engine runs a configurable set of rules against the raw, staged source data before the transform phase. Rows that fail are written to a rejection CSV with the violating field, value, rule, severity, and message recorded. The same data is summarised in a machine-readable JSON file that you can drop into a CI artefact.
This page is the reference for every built-in rule. To wire DQ into a pipeline, see Writing a Pipeline YAML.
Severity levels
Section titled “Severity levels”Every check declares a severity. Severity decides what happens when a row fails:
| Severity | Effect on the row | Effect on the run |
|---|---|---|
critical | Removed from the output. | Halts the pipeline if dq.stopOnCritical: true (the default). |
warning | Kept in the output, but logged in the rejection CSV. | Run continues. |
info | Kept in the output. | Recorded in the DQ summary JSON only — not in the rejection CSV. |
The default dq.stopOnCritical: true makes Sluice fail loud on any critical violation, so a failing pipeline never silently drops half its data. Set it to false in pipelines where you’d rather emit the clean rows and inspect rejections out-of-band.
Rule reference
Section titled “Rule reference”Each rule is configured as a check under a dq.rules[].field entry:
dq: rules: - field: EMAIL checks: - type: notNull severity: critical - type: email severity: warningnotNull
Section titled “notNull”Fails when the value is null, undefined, an empty string, or whitespace-only.
- type: notNull severity: criticalNo value. Use this on every field that downstream systems treat as required — bc, ifs, and bluecherry adapters all reject loads with null required columns.
unique
Section titled “unique”Fails when the value appears more than once across the full source dataset. Detects duplicates by exact string match.
- type: unique severity: criticalNo value. Common pairing: notNull + unique on a primary-key column.
pattern
Section titled “pattern”Fails when the value does not match the supplied ECMAScript regex. The pattern is compiled with new RegExp(value) and tested with .test() — anchor it with ^…$ if you mean the whole string.
- type: pattern value: "^[A-Z0-9]{3,10}$" severity: warning message: "Must be 3-10 uppercase alphanumeric characters"message is optional; if omitted, Sluice uses a sensible default.
Fails when the value does not match an RFC 5322-flavoured email regex. Whitespace is trimmed before matching.
- type: email severity: warningNo value.
ukPostcode
Section titled “ukPostcode”Fails when the value is not a recognisable UK postcode (current outward + inward formats including SW1A 1AA, EC1A 1BB, M1 1AE, B33 8TH, CR2 6XH, DN55 1PT). Spaces are stripped before testing, so both SW1A1AA and SW1A 1AA validate.
- type: ukPostcode severity: warningNo value. For non-UK postcodes, use pattern with a country-specific regex.
maxLength
Section titled “maxLength”Fails when the string length exceeds value.
- type: maxLength value: 100 severity: warningvalue is an integer.
Fails when the numeric value of the field is less than value. The field value is coerced to Number(); non-numeric strings fail with a coercion error.
- type: min value: 0 severity: criticalFails when the numeric value of the field is greater than value. Same coercion rules as min.
- type: max value: 500000 severity: warningallowedValues
Section titled “allowedValues”Fails when the value is not in the array. Comparison is case-sensitive — normalise upstream with a cleanse: uppercase transform if you want to compare regardless of case.
- type: allowedValues value: [GB, IE, US, DE, FR] severity: warningvalue is an array of strings.
Rejection CSV format
Section titled “Rejection CSV format”When DQ runs, Sluice writes a rejection CSV at dq.rejectionFile (default: ./output/{pipeline.name}-rejected.csv). One row per violation — a single source row that fails three checks produces three lines in the rejection CSV.
row_index,field,value,rule,severity,message4,EMAIL,not-an-email,email,warning,must be a valid email address5,CUST_CODE,,notNull,critical,must not be null5,CUST_NAME,,notNull,critical,must not be null12,POST_CODE,12345,ukPostcode,warning,must be a valid UK postcodeHand the rejection CSV back to whoever owns the source data; it’s the actionable output from a Sluice run.
DQ summary JSON
Section titled “DQ summary JSON”Alongside the rejection CSV, Sluice writes {outputDir}/{pipeline.name}-dq-summary.json:
{ "pipeline": "acme-corp-customers", "runAt": "2026-04-15T09:30:00Z", "rowsChecked": 1842, "rowsPassed": 1801, "rowsRejected": 41, "violations": { "critical": 0, "warning": 38, "info": 3 }, "byField": { "POST_CODE": { "critical": 0, "warning": 22 }, "EMAIL": { "critical": 0, "warning": 16 } }}Drop this into a GitHub Actions artefact step and it becomes a per-PR data-quality report.
Multi-source DQ rules
Section titled “Multi-source DQ rules”In a multi-source pipeline (with sources + merge), each rule can be scoped to a specific pre-merge source by adding sourceId:
dq: rules: - field: STYLE_NO # Pre-merge: runs against stg_raw_sql-server only. sourceId: sql-server checks: - { type: notNull, severity: critical } - { type: unique, severity: critical } - field: STYLE_DESC # Post-merge: runs against stg_merged. checks: - { type: notNull, severity: critical } - { type: maxLength, value: 255, severity: warning }Rules without sourceId run post-merge against the merged staging table. Per-source rejection files are auto-named by appending -{sourceId} to the configured rejectionFile. Rows failing a critical pre-merge rule are filtered out of that source’s staging table before the merge phase.
Custom DQ rules
Section titled “Custom DQ rules”Sluice ships with the nine built-in rules above. For domain-specific validation (e.g. “must be a valid IBAN”, “must match an existing customer in the target”) you can add custom rules via the Plugin System:
- Tier 1 — Composite rules (YAML) — group built-in checks into a named rule. Zero code.
- Tier 2 — File plugins (TypeScript) — drop a
*.rule.tsfile in yourplugins/directory. - Tier 3 — npm packages — publish a reusable rule package as
@caracal-lynx/etl-rules-*.
Caracal Lynx maintains private rule packages for UK compliance (@caracal-lynx/etl-rules-uk) and fashion/retail data standards (@caracal-lynx/etl-rules-fashion); see Commercial Support.