Skip to content

How It Works

This page is for the technically curious — anyone evaluating Sluice for a real migration who wants to know what’s happening under the hood.

The short version: every Sluice pipeline runs through six phases, with an embedded DuckDB instance acting as the staging store between them. The phases are sequenced by PipelineRunner (or MultiSourcePipelineRunner for pipelines with multiple sources). Adapters are pluggable; everything else is engine code with strict module boundaries.

Where Sluice sits between source systems and the target ERP:

flowchart LR
classDef person fill:#FFE8B2,stroke:#B88400,stroke-width:2px,color:#000
classDef input fill:#E3F2FD,stroke:#1565C0,stroke-width:1.5px,color:#000
classDef core fill:#1565C0,stroke:#0D47A1,stroke-width:2px,color:#FFF
classDef target fill:#C8E6C9,stroke:#2E7D32,stroke-width:1.5px,color:#000
classDef report fill:#F3E5F5,stroke:#6A1B9A,stroke-width:1.5px,color:#000
Author([Pipeline Author<br/>consultant / engineer]):::person
subgraph Inputs[Inputs]
YAML[/Pipeline YAML<br/>config/]:::input
Lookups[/Lookup CSVs<br/>& SQL/]:::input
Env[/.env credentials/]:::input
end
subgraph Sluice[Sluice Toolkit]
CLI{{sluice CLI}}:::core
Runner[[PipelineRunner]]:::core
DuckDB[(DuckDB<br/>staging)]:::core
end
subgraph Sources[Legacy Sources]
MSSQL[(MSSQL)]
PG[(PostgreSQL)]
CSV[CSV / XLSX]
REST[REST APIs]
end
subgraph Targets[Target ERPs & Files]
BC[Business Central<br/>REST]:::target
IFS[IFS ERP<br/>CSV import]:::target
BlueCherry[BlueCherry ERP<br/>CSV import]:::target
GenericCsv[Generic CSV / PG]:::target
end
subgraph Outputs[Run Artifacts]
Rejections[/Rejection CSV/]:::report
Summary[/DQ summary JSON/]:::report
State[/Run state JSON/]:::report
Log[/pino JSON log/]:::report
end
Author --> YAML
Author --> Lookups
Author --> Env
YAML --> CLI
Env --> CLI
CLI --> Runner
Sources --> Runner
Lookups --> Runner
Runner <--> DuckDB
Runner --> Targets
Runner --> Rejections
Runner --> Summary
Runner --> State
Runner --> Log

For a pipeline with a single source: block, PipelineRunner walks six phases in order:

sequenceDiagram
autonumber
actor User
participant CLI as sluice CLI
participant Run as PipelineRunner
participant Cfg as ConfigLoader
participant Src as SourceAdapter
participant Duck as DuckDB (StagingStore)
participant DQ as DQEngine
participant Tx as TransformEngine
participant Tgt as TargetAdapter
participant FS as Filesystem
User->>CLI: sluice run pipeline.yaml
CLI->>Cfg: load(path)
Cfg->>Cfg: resolve ${ENV_VAR}
Cfg->>Cfg: Zod validate
Cfg-->>CLI: Pipeline
CLI->>Run: new + run()
Run->>Duck: open()
Run->>Src: connect(config)
Run->>Src: extract(store, 'stg_raw')
Src->>Duck: createTable + insertBatch*
Src-->>Run: ExtractResult
Run->>Src: disconnect()
Run->>DQ: validate('stg_raw')
DQ->>Duck: query
DQ->>FS: write rejection CSV + summary JSON
alt critical & stopOnCritical
DQ-->>Run: throw PipelineDQError
Run->>FS: write state (failed)
Run-->>CLI: exit 2
end
Run->>Tx: resolveLookups()
Tx->>Duck: load lookup sources
Run->>Tx: transform('stg_raw' → 'stg_transformed')
alt dryRun OR validate-only
Run->>FS: write state + summary
Run-->>CLI: exit 0
else full run
Run->>Tgt: connect(config)
Run->>Tgt: load('stg_transformed')
Tgt->>Duck: stream rows
Tgt-->>Run: LoadResult
Run->>Tgt: disconnect()
Run->>FS: write state JSON
Run->>Duck: close()
Run-->>CLI: exit 0
end

Each phase is its own engine module:

PhaseModuleResponsibility
1. Config loadsrc/config/YAML → ${ENV_VAR} resolution → Zod validation → composite-rule expansion
2. Extractsrc/adapters/source/Stream source data into DuckDB stg_raw
3. DQsrc/dq/Run rules against stg_raw; write rejection CSV + summary JSON
4. Transformsrc/transform/Apply field mappings, lookups, cleanse ops, expressions → stg_transformed
5. Loadsrc/adapters/target/Stream stg_transformed to the target adapter
6. Run statesrc/runner.tsWrite {outputDir}/{pipeline.name}-state.json for incremental mode

When a pipeline declares sources[] + merge:, the CLI auto-routes to MultiSourcePipelineRunner (a subclass of PipelineRunner). It runs extract + per-source DQ in priority order, then merges, then runs post-merge DQ + transform + load:

flowchart TB
classDef phase fill:#E3F2FD,stroke:#1565C0,stroke-width:1.5px,color:#000
classDef stage fill:#FFF9C4,stroke:#F9A825,stroke-width:1.5px,color:#000
classDef decision fill:#FFE0B2,stroke:#E65100,stroke-width:1.5px,color:#000
classDef output fill:#F3E5F5,stroke:#6A1B9A,stroke-width:1.5px,color:#000
A[Load & validate YAML<br/>Zod refine: sources XOR source]:::phase
B[Load plugins<br/>file + sluice.config.yaml]:::phase
C[Open DuckDB]:::phase
subgraph PerSource[Per-source extract + DQ - priority-ordered]
direction TB
S1[Extract source 1]:::phase --> R1[renameColumns] --> F1{incremental?}:::decision
F1 -- yes & matches<br/>incrementalSource --> I1[filter by<br/>incrementalField] --> D1[DQ rules<br/>where sourceId = s1]:::phase
F1 -- no --> D1
D1 --> RW1[rewrite stg_raw_s1<br/>accepted rows only]:::stage
S2[Extract source 2]:::phase --> R2[renameColumns] --> D2[DQ rules<br/>where sourceId = s2]:::phase --> RW2[rewrite stg_raw_s2]:::stage
SN[Extract source N]:::phase --> RN[renameColumns] --> DN[DQ rules<br/>where sourceId = sN]:::phase --> RWN[rewrite stg_raw_sN]:::stage
end
M[[MergeEngine<br/>strategy: coalesce / priority-override / union / intersect]]:::phase
MJ[(stg_merge_joined)]:::stage
MG[(stg_merged)]:::stage
MC[(stg_merge_conflicts)]:::stage
CLog[/conflictLog CSV<br/>if configured/]:::output
PMDQ[Post-merge DQ<br/>rules with no sourceId]:::phase
TX[Transform<br/>stg_merged → stg_transformed]:::phase
LD[Load to target]:::phase
ST[/State file with<br/>per-source sources block/]:::output
CL[Close DuckDB]:::phase
A --> B --> C --> PerSource
RW1 --> M
RW2 --> M
RWN --> M
M --> MJ --> MG
M --> MC --> CLog
MG --> PMDQ --> TX --> LD --> ST --> CL

The four built-in merge strategies cover the patterns we hit most often:

flowchart LR
classDef s fill:#BBDEFB,stroke:#1565C0,color:#000
subgraph Coalesce[coalesce]
CA[priority 1: null<br/>priority 2: Alice<br/>priority 3: Alicia]:::s --> CR[Alice]:::s
end
subgraph PO[priority-override]
PA[priority 1: null<br/>priority 2: Alice<br/>priority 3: Alicia]:::s --> PR[null]:::s
end
subgraph Union[union]
UA[A: 100 keys<br/>B: 120 keys<br/>overlap 80]:::s --> UR[140 rows<br/>deduped by key]:::s
end
subgraph Intersect[intersect]
IA[A: 100 keys<br/>B: 120 keys<br/>overlap 80]:::s --> IR[80 rows]:::s
end

DuckDB is the engine’s staging layer. It’s an embedded, single-file analytical database — no server, no daemon, no rebuild on Node ABI bumps. Sluice uses the official @duckdb/node-api package, which is Promise-native, TypeScript-first, and ABI-stable.

What this gives Sluice:

  • Set-based DQ. Rules like unique are evaluated as a SQL GROUP BY … HAVING COUNT(*) > 1 against the staging table. Nothing else fits in memory at scale.
  • Set-based merge. Multi-source merges are SQL JOIN + COALESCE over priority-ordered sources, executed by DuckDB rather than imperatively in JavaScript.
  • Cheap dry-runs. Pass stagingDb: ':memory:' (or --dry-run) and the whole pipeline runs without touching disk except to write the rejection CSV.
  • Predictable type behaviour. Source types map cleanly to DuckDB’s columnar types (VARCHAR, BIGINT, DOUBLE, BOOLEAN, TIMESTAMP).

The default staging file is {outputDir}/{pipeline.name}.duckdb. It’s safe to delete between runs; Sluice rebuilds it from scratch every time.

Pipelines are described declaratively, not procedurally. The YAML schema is the contract. Zod validates the YAML at load time, and every TypeScript type in Sluice is inferred from the Zod schema (z.infer<typeof PipelineSchema>). There are no manual interface declarations for config types — code and config can’t drift apart.

${ENV_VAR} tokens are resolved by the loader before Zod validation, so secrets never appear in YAML and the schema only ever sees plain strings.

The codebase follows a strict dependency direction:

cli.ts → runner.ts / multi-source-runner.ts
├─→ adapters/ (source + target)
├─→ staging/ (DuckDB wrapper)
├─→ dq/ (rule engine)
├─→ transform/ (field mapping engine)
├─→ merge/ (multi-source merge engine)
├─→ plugins/ (Tier 2/3 plugin discovery)
├─→ config/ (schema + loader)
└─→ enrich/ (Phase 4a — types only; impl is private)
utils/ ← (imported by everyone)

plugins/ is imported by runner, dq, transform, and merge, but it must not import any of them — that’s how custom rules and adapters can be registered without creating a cycle. enrich/ is type-only in the open-source core; the runtime implementation lives in the private @caracal-lynx/sluice-enrich package.