
Microsoft Fabric Copy Job: Where It Breaks in Production

The tutorial shows the happy path. Production shows you the watermark column that silently stopped advancing three weeks ago.

Copy Job hides the watermark, and that is the problem

Copy Job stores incremental watermark state internally. You pick a column — typically a LastModified or an integer ID — and Fabric tracks the high-water mark between runs. The UI shows a single value. There is no exposed control table, no audit row you can SELECT against, and no native way to roll the watermark back without recreating the job.

This matters when the source clock skews or when an upstream job backfills records with timestamps earlier than the current watermark. Copy Job will not pick those rows up. The run completes. Rows read: 0. Rows written: 0. Status: Succeeded. Three weeks later someone notices the dim_customer table is missing 14,000 records that exist in the source.

The second failure mode: watermark column type mismatches. If your source column is datetime2(7) and Fabric infers datetime, the truncation can cause the comparison WHERE LastModified > @watermark to skip rows whose sub-second precision falls inside the same second as the previous run's max value. You will not see this in the run history. You see it when row counts diverge from the source system.
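
One way to catch the mismatch before it costs rows is to check the source column's declared type directly rather than trusting the inferred mapping. A minimal sketch against a SQL Server source, using a hypothetical dbo.orders table and LastModified column:

```sql
-- Hypothetical precision check: read the watermark column's declared type and
-- scale straight from the source catalog.
SELECT c.name AS column_name,
       t.name AS type_name,
       c.scale
FROM sys.columns c
JOIN sys.types t ON t.user_type_id = c.user_type_id
WHERE c.object_id = OBJECT_ID('dbo.orders')   -- hypothetical source table
  AND c.name = 'LastModified';
-- If this reports datetime2 with scale 7 while the Copy Job mapping shows
-- datetime, the comparison runs at roughly 3 ms precision and boundary rows
-- can slip past the watermark.
```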

The practical mitigation is to never trust the internal watermark alone. Run a daily reconciliation query against the source — SELECT COUNT(*), MAX(LastModified) FROM source.table — and compare the result to the destination. If the delta exceeds a threshold for two consecutive days, treat the Copy Job as broken regardless of its green status. This is the same class of problem as a Power BI dataset refresh that completes successfully against a stale gateway cache: the platform reports success because the operation it ran did not fail.
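
A minimal sketch of that reconciliation, assuming a hypothetical customer source table feeding dim_customer; run each probe against its own system and compare the two results in whatever scheduler you already trust:

```sql
-- Probe the source system:
SELECT COUNT(*) AS row_count, MAX(LastModified) AS max_modified
FROM dbo.customer;          -- hypothetical source table

-- Probe the Fabric destination through its SQL analytics endpoint:
SELECT COUNT(*) AS row_count, MAX(LastModified) AS max_modified
FROM dbo.dim_customer;      -- hypothetical destination table

-- A widening row-count gap, or a source max_modified newer than the
-- destination's, is the real health signal regardless of run status.
```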

Schema drift gets coerced, not flagged

When a source adds a column, Copy Job's default behavior depends on the mapping mode. With auto-mapping enabled and 'Allow schema drift' on, the new column flows through. With explicit mappings, the new column is silently dropped. Neither path raises an error.

The more dangerous case is type narrowing. A source column changes from int to bigint because someone hit 2.1 billion. Fabric's mapping still says int. The Copy Job continues to run. Rows whose value exceeds 2^31 - 1 fail individual row writes — but if 'fault tolerance' is set to skip incompatible rows (the default in many templates), those failures get logged to a skipped-rows file in the staging location and the activity reports success.
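
A pre-flight query catches the narrowing before the job does. This sketch assumes a hypothetical dbo.orders source whose order_id column was widened to bigint upstream while the mapping still says int:

```sql
-- Count source rows that no longer fit a 32-bit int destination column.
SELECT COUNT(*) AS overflow_rows
FROM dbo.orders
WHERE order_id > 2147483647;   -- max value of a signed 32-bit int
-- Any non-zero result means the mapping must be widened to bigint before the
-- next run; otherwise these rows land in the skipped-rows file while the
-- activity still reports success.
```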

Check the run details JSON for dataConsistencyVerification and dataInconsistency counters. The portal surfaces 'Rows read' and 'Rows written' prominently but tucks the skipped count into a sub-panel. If rows_read - rows_written > 0 and you did not configure deduplication, you have silent data loss.

For a Postgres source — and 51.9% of professional developers work with Postgres according to Stack Overflow's 2024 survey — Copy Job uses the npgsql driver under the hood. NUMERIC columns with no explicit precision get mapped to decimal(38,18) by default, which truncates anything wider. SQL Server sources hit a similar issue with sql_variant columns, which get coerced to nvarchar(max) and lose their underlying type metadata.

The defensible pattern is to lock the destination schema with explicit Delta table DDL, set Copy Job to fail on schema mismatch rather than coerce, and treat any DDL change in the source as a deployment that requires updating the job mapping. Schema-on-read is fine for exploration. It is not fine for the table that feeds the executive dashboard.
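
A minimal sketch of that locked-down destination, written as Fabric lakehouse Spark SQL; the table and column names are illustrative:

```sql
-- Declare the destination schema explicitly instead of letting Copy Job infer it.
CREATE TABLE dim_customer (
    customer_id   BIGINT    NOT NULL,
    customer_name STRING,
    last_modified TIMESTAMP NOT NULL
) USING DELTA;
-- With the schema pinned and the job set to fail on mismatch, a drifted or
-- narrowed source column stops the run instead of being silently coerced.
```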

Copy Job production readiness checklist

1. Watermark column has matching precision in source and destination
2. Destination Delta table has explicit DDL, not auto-inferred schema
3. Schema mismatch policy set to fail, not coerce
4. Diagnostic logging enabled to Log Analytics workspace
5. Alert configured for retryCount > 0 on succeeded runs
6. Partition column distribution checked: max/median duration ratio < 4
7. Independent row-count reconciliation runs daily against source
8. Freshness contract defined per destination table

The retry policy that masks transient errors until they become permanent

Copy Job's default retry policy is 3 attempts with 30-second intervals on transient errors. Errors classified as transient include network timeouts, throttling responses (HTTP 429, 503), and the catch-all UserErrorFailedToConnectToSqlServer when the connection pool is saturated.

The trap: a misconfigured firewall or expired service principal credential can manifest as a connection timeout on the first attempt, succeed on retry because a different gateway node picks up the request, and then fail consistently the next day when the misrouted connection is no longer available. The first day's run history shows 'Succeeded after 1 retry' — easy to ignore. The next day the job fails outright with ErrorCode=UserErrorFailedToConnectToSqlServer and now you are debugging under pressure.

Fabric Copy Job does not surface per-attempt error detail in the activity output by default. To get it you have to enable diagnostic logging to a Log Analytics workspace and query the FabricItemsOperations table:

```kusto
FabricItemsOperations
| where OperationName == "CopyJob.Run"
| where Status == "Succeeded"
| where AdditionalProperties.retryCount > 0
| project TimeGenerated, ItemName,
          retryCount = AdditionalProperties.retryCount,
          lastError = AdditionalProperties.lastErrorMessage
```

A succeeded run with retryCount > 0 is a leading indicator. It tells you the job is one configuration drift away from breaking. Treat it the same way you treat a failing-but-eventually-passing CI build: fix the underlying flake, do not normalize it. The runs that succeed-with-retry today are the runs that page you at 2am next week.

Parallel copy and the partition that doesn't partition

For large tables Copy Job offers parallel copy with degree-of-parallelism settings. The UI lets you pick 1 through 32. Behind the scenes Fabric splits the source query either by physical partitions (when the source supports them) or by a dynamic range over a numeric column you select.

Dynamic range partitioning is where this falls apart. If you pick a column with skewed distribution — a CustomerID where 80% of rows belong to a single tenant, for instance — Fabric splits the range evenly by min/max. One partition gets 80% of the rows. The other 31 finish in seconds and sit idle while the heavy partition runs serially. Total throughput is barely better than DOP=1, but you are paying for 32 compute units.

Worse, the heavy partition can hit the 2-hour activity timeout. The partition fails, the activity fails, and Fabric does not provide a partial-success path. You re-run the entire job.

Check the partition strategy by inspecting the run's executionDetails.profile.queue field. It shows per-partition row counts and durations. If max_partition_duration / median_partition_duration > 4, your partition column is wrong. Pick a hash-distributed surrogate key, or pre-partition the source with a computed column like ABS(CHECKSUM(NEWID())) % 32.
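
Before committing to a partition column, you can approximate the engine's even min/max split yourself. The sketch below assumes a hypothetical dbo.orders source with CustomerID as the candidate column:

```sql
-- Emulate a 32-way even range split over CustomerID and count rows per bucket.
WITH bounds AS (
    SELECT MIN(CAST(CustomerID AS bigint)) AS lo,
           MAX(CAST(CustomerID AS bigint)) AS hi
    FROM dbo.orders
)
SELECT (CAST(o.CustomerID AS bigint) - b.lo) * 32 / (b.hi - b.lo + 1) AS range_bucket,
       COUNT(*) AS rows_in_bucket
FROM dbo.orders o
CROSS JOIN bounds b
GROUP BY (CAST(o.CustomerID AS bigint) - b.lo) * 32 / (b.hi - b.lo + 1)
ORDER BY rows_in_bucket DESC;
-- If the largest bucket dwarfs the median, dynamic range partitioning on this
-- column will leave most workers idle; pick a different key or a hash-style
-- computed column instead.
```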

For sources without numeric keys — most SaaS APIs, JSON file dumps — parallel copy degrades to file-level parallelism, which only helps if you have many roughly-equal files. A single 40GB JSON file gets one worker regardless of DOP setting.

Detecting the silent success

The pattern across all four failure modes is the same: Copy Job's status field reports the operation, not the outcome. A run that read zero rows because the watermark didn't advance, a run that coerced a bigint to int and skipped overflow rows, a run that succeeded after three retries against a degrading endpoint — all read 'Succeeded' in the activity log. Native Fabric monitoring alerts on failures, not on these patterns.

MetricSign's Fabric Pipelines integration tracks Copy Job runs against historical baselines and flags refresh_delayed when a job's row counts or duration diverge from its rolling average, even when the job itself reports success. The retry-count signal from diagnostic logs is grouped with the eventual failure so you see the warning and the outage as one incident rather than separately filed tickets a week apart.

The operational discipline that matters more than any tool: every Copy Job destination table should have a freshness contract. Define it explicitly — dim_customer must have a row written in the last 4 hours. Check it independently of the job's run history. When the contract breaks, investigate even if every run shows green.
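
A minimal freshness probe for the dim_customer example, assuming a last_modified column on the destination and T-SQL against the Fabric SQL endpoint:

```sql
-- Check the freshness contract independently of the Copy Job's run history.
SELECT CASE WHEN MAX(last_modified) >= DATEADD(HOUR, -4, SYSUTCDATETIME())
            THEN 'fresh' ELSE 'stale' END AS contract_status,
       MAX(last_modified) AS newest_row
FROM dbo.dim_customer;
-- Schedule this outside the pipeline and alert on 'stale', even when every run
-- shows green.
```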
