Taking snapshots using dbt in Azure Synapse: The Hash Distribution Issue
Photo by William Priess on Unsplash

Dbt is a powerful library used for data transformation and analysis. It supports various data warehouse solutions through different dbt adapters. Each adapter is designed to handle the unique syntax, functions, and configurations of specific data warehouses.

The error

In this article, we will discuss a common error that occurs when using dbt with Azure Synapse Dedicated Pool. The error message looks like this:

18:11:08    ('42000', '[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Merge statements with a WHEN NOT MATCHED [BY TARGET] clause must target a hash distributed table.

The error is encountered while taking a snapshot of a table using the dbt-synapse adapter.

Reproducing the Error

Let’s understand how this error occurs. Imagine we have a model called model_gamma, which is a table containing columns like id and modified_date. To keep track of updates to this table, we create a snapshot named snapshot_gamma.

Query behind snapshot_gamma

Initially, when we run the dbt snapshot command, it creates the first snapshot successfully. However, the error arises when trying to take another snapshot of model_gamma. This is because dbt attempts to modify the snapshot_gamma table, leading to the error.

The Cause

The error is related to the use of the SQL MERGE command under the hood when taking snapshots with Azure Synapse. Specifically, the error occurs when using the MERGE command with a WHEN NOT MATCHED BY TARGET clause.

dbt-compiled query for snapshot_gamma, located in the project's target folder

The MERGE command performs a full table scan and combines INSERT, UPDATE, and DELETE statements. The error suggests that if the WHEN NOT MATCHED BY TARGET clause is used, the target table (snapshot_gamma) must be hash-distributed.

Round-Robin vs Hash

Azure Synapse supports different distribution styles like Round-Robin and Hash. These are different approaches of storing data in the data warehouse.

The Round-Robin approach ensures data is evenly distributed across nodes. Each node gets an equal share of data, but data points with similar values might not end up in the same node.

The Hash approach uses a hash function (sort of a rule of thumb) to determine which node to save a data point in. Data points with equal values go into the same node, improving data locality and potentially benefiting certain queries like the MERGE statement.

The Solution

To fix the error, we need to change the distribution style of the snapshot_gamma table to hash-distributed. We can pass specific configurations to our models and snapshots on Azure Sypanse thanks to dbt-synapse.

First, delete the original snapshot_gamma table:

drop table snapshot.snapshot_gamma;

Then, modify the snapshot file to indicate the desired distribution approach. That is, a HASH distribution along the id column:

Query behind snapshot_gamma, after specifying a Hash distribution

After making these changes, run the dbt snapshot command again to create the first snapshot table.

Results from running snapshot_gamma from scratch

Finally, confirm that the snapshot works without errors when taken afterward.

Results from running snapshot_gamma a second time

By understanding the error and the differences between Round-Robin and Hash distribution in Azure Synapse, we can effectively fix the issue and ensure smooth data processing with dbt. Happy querying.

Taking snapshots using dbt in Azure Synapse: The Hash Distribution Issue
Older post

Understanding DAGs in dbt: Erring Models and Failing Tests

Three scenarios shows us what happens downstream when our models and/or tests do not succeed

Newer post

Understanding DAGs in dbt: Threads, Errors and Failing Fast

Three scenarios show us the implications of running a Directed Acyclic Graph (DAG) in multi-threaded dbt environments, shedding light on how errors and the "fail fast" argument impact model execution

Taking snapshots using dbt in Azure Synapse: The Hash Distribution Issue