At a certain point in our journey as analysts, we would need to retire a data pipeline. Understandably, we would then like to pause the dbt resources (models, tests, seeds, snapshots) associated with this pipeline to conserve data warehouse storage and computing power. Our goal is to proceed without causing disruptions to the dbt project – a delicate process indeed.
In the context of dbt, a rushed approach could be to delete all soon-to-be-paused resources (.sql and .py) from the project. However, this is labor-intensive and prone to error. Instead, the following are some ideas on approaches in dbt when it comes to pausing and eventually retiring resources.
For the following examples, we use dbt-bigquery 1.6.4
, with sample data on music from Spotify and Last.fm. The dbt project repository is available on GitHub.
idea 1 -> –exclude
One method for pausing resources in dbt is to utilize the exclude flag. Similar to --select
, --exclude
allows us to remove specific resources from our dbt run
(or test
, seed
, snapshot
, build
). Using this flag, we can execute dbt while excluding the resources we intend to pause.
For example, imagine we would like to pause all models depending on the source table bigquery.spotify_album
from a dbt build
. Then we would run the command:
dbt build --exclude source:bigquery.spotify_album+
We can also use tags. For instance, by tagging all resources with ‘legacy’, we can exclude them from a dbt build
:
dbt build --exclude tag:legacy
One advantage of this approach is that the exclude flag also removes all tests associated with the resource. We can leave our resources in their respective folders, and it requires no changes to dependencies downstream from our paused resource.
However, a drawback is the need to consistently include the exclude flag in the CLI command. This may require modifying existing jobs running on dbt, such as through containers.
idea 2 -> enabled: false
Another way to pause dbt resources is to disable them by setting the enabled configuration option to false. This approach essentially makes our dbt project consider the paused resources as non-existent.
To disable a resource, simply set enabled: false
atop its .sql file as a configuration option. Alternatively, we can set the enable option at the project configuration level in dbt_project.yml
. For example, if we wanted to disable all staging files located in models/staging/spotify/
, we would include the following in the YAML configuration:
models:
my_project:
staging:
spotify:
+enabled: false
One advantage of this approach is that it requires little modification, especially if our resources are well-organized into folders. It is also a versatile approach when the resource is an end-node, such as a production model without any downstream dependencies. Also, there are no modifications needed to the CLI command.
Nevertheless, disabling resources becomes challenging when they have many dependencies downstream. For example, if one of our resources is listed in an exposure, we would have to remove the resource from the exposure.
idea 3 -> analyses/
Another option for pausing resources is to turn them into analyses. For example, if we want to pause the model stg_spotify_album
, we would move it from models/staging/spotify
into analyses/
.
One of the advantages of turning resources into analyses is that they will still compile and be found in the target/compiled
folder. This allows us to use the compiled query and keep building upon the resource collaboratively. Also, no modifications to existing CLI commands are required.
The clear disadvantage of this approach is manually moving all paused resources from models/
to analyses/
. Additionally, the dependency issue, similar to disabled models, will persist. We would need to move all downstream dependencies from our paused resource to analyses/
.
conclusion
The above are ideas on the possible steps to take to pause dbt resources. Eventually, we would like to migrate these resources out of our data warehouse, for which model versioning can be a good long-term approach.
There is no right or wrong when it comes to pausing models in dbt. All approaches have their pros and cons, which need to be analyzed on a case-by-case basis. Are there any other elegant ways to pause resources in dbt? Let us know in the comments. Happy querying.