Robert is a Lead in the Data Team at AMP Capital.
He is passionate about building data and analytics applications that empower people to do their best work.
We love dbt and believe it is a great tool for managing your data transformations. This article is all about why and how. This is not an introductory article: you need a basic understanding of containers and serverless architecture, and you should be familiar with dbt, ideally having tried the dbt tutorial. If not, please bookmark this post and come back later.
If you read the dbt Viewpoint and start nodding more and more as you go, then dbt is for you. In the data pipeline, we must capture business logic to give the data meaning.
If the data and business logic are maintained by the business units, then the relevance and meaning are rich, but the result is probably disorganised, badly tested, and inconsistently documented.
If it was put together by an IT function, then it was probably relevant five years ago, but has received no funding since. Neither scenario allows the business to gain the insights it needs to react to change.
We believe that giving data meaning is fundamentally an analytics function which is best done by the business. We also believe analytics is best achieved through code. We love dbt because it’s a tool aimed at the business data analyst who codes:
This powerful combo means dbt allows the business to explore and experiment, enables complexity to scale elegantly, and provides automation via mature CICD practices and tools.
Here is why dbt is a great fit for the microservices/serverless world we live in:
In order to support dbt, your solution will need:
Serverless architecture will give us just that.
We recommend starting with the default structure created by the `dbt init` command and checking out the best practices from the folks at dbt (dbt documentation, dbt app reference, and discourse).
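For reference, the scaffold created by `dbt init` looks roughly like this (the exact files vary by dbt version, and the project name here is a placeholder):

```
my_dbt_project/
├── dbt_project.yml
├── README.md
├── analyses/
├── macros/
├── models/
│   └── example/
├── seeds/
├── snapshots/
└── tests/
```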
Some features of dbt that we like:
dbt allows flexible project configuration: settings can be applied globally at the project level and overridden all the way down to the individual model. It is well documented and follows logical inheritance patterns.
We will highlight just one important feature: environment variables.
Your dbt project exists in a broader data platform ecosystem, so it needs to interoperate with all your other components, particularly in a serverless architecture. We use environment variables in our dbt project to configure it for the given environment.
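As a minimal sketch, dbt's built-in `env_var()` function lets a single `profiles.yml` serve every environment. The profile name, the Snowflake adapter, and the variable names below are illustrative assumptions, not something prescribed by the article:

```yaml
# profiles.yml — illustrative only; swap in the adapter and
# variables for your own warehouse.
my_project:
  target: "{{ env_var('DBT_TARGET', 'dev') }}"   # defaults to dev if unset
  outputs:
    dev:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      database: "{{ env_var('DBT_DATABASE') }}"
      schema: analytics
```

Because the values resolve at runtime, the same Docker image can be promoted unchanged from dev to prod by varying only the container's environment.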
Some benefits of this approach:
Before we start, it is worth mentioning that dbt works well with the ELT pattern, where your data is extracted ("E") and loaded ("L") into your warehouse separately, and dbt then performs the transform ("T") for you. Your data is extracted from different sources and loaded into your data warehouse, which then triggers dbt serverless to perform the necessary transformations and tests.
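To make the "T" concrete, a dbt model is just a SQL `select` over data already loaded into the warehouse. The source and column names below are illustrative assumptions:

```sql
-- models/staging/stg_orders.sql — a minimal staging model sketch.
-- {{ source(...) }} points dbt at the raw table loaded by your "L" step.
select
    order_id,
    customer_id,
    order_date,
    amount
from {{ source('raw', 'orders') }}
```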
The following diagram shows the overall architecture of an ELT pipeline with dbt.
dbt serverless is a process that orchestrates running dbt models against your warehouse. The design is based on AWS services, but it can be implemented with equivalent services from any other cloud provider. The services you will need are:
| Service | Function | Examples of products |
| --- | --- | --- |
| Orchestration service | Manages the dbt life cycle | Airflow, AWS Step Functions |
| Container management service | Runs your dbt models on demand | AWS ECS |
| Code repo | Version-controls your dbt project (models, docs, tests) | GitHub, Bitbucket, GitLab |
| Container registry | Hosts your dbt Docker images (dbt + models) | AWS ECR, Docker Hub |
| Build / CICD server | Builds a new dbt image containing the latest dbt models and pushes it to the container registry | Jenkins |
| Static web host | Hosts the dbt documentation portal | Amazon S3 |
The following diagram shows how the above services can be tied together:
At the heart of this solution is your dbt project. It is packaged into a Docker image (your latest models plus the dbt library) and published to your container registry, ready to be executed on demand. This lets you create an Elastic Container Service (ECS) task that refers to your latest dbt image, which the Step Function starts whenever required. The dbt image needs to be rebuilt whenever your models change, so that each dbt run executes the latest code.
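A Dockerfile for such an image can be sketched as follows. The base image, the `dbt-snowflake` adapter, and the paths are assumptions; substitute the adapter for your own warehouse:

```dockerfile
# Sketch: package the dbt library together with the latest models.
FROM python:3.11-slim
RUN pip install --no-cache-dir dbt-snowflake
WORKDIR /dbt
COPY . /dbt
# Pull in any packages declared in packages.yml
RUN dbt deps
# The orchestrator supplies the subcommand ("run", "test", ...)
ENTRYPOINT ["dbt"]
CMD ["run"]
```

Your CICD server rebuilds and pushes this image on every merge, so the ECS task definition always points at an image containing the latest models.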
Regardless of how you load your data, once a new batch is loaded you need to trigger dbt serverless (or schedule it to run after your batch loading routine) so that your staging data is updated.
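As one way to wire up that notification, a small Lambda-style handler can start the Step Functions execution when a batch finishes loading. This is a sketch under assumptions: the payload keys, the `DBT_STATE_MACHINE_ARN` variable, and the event shape are all hypothetical names, not part of dbt or prescribed by AWS:

```python
import json
import os


def build_execution_input(batch_id: str, target: str = "dev") -> str:
    """Build the JSON payload handed to the Step Functions execution.

    The batch_id / dbt_target keys are an illustrative convention only.
    """
    return json.dumps({"batch_id": batch_id, "dbt_target": target})


def handler(event, context):
    """Hypothetical handler fired when a new batch finishes loading.

    Starts the state machine that runs the dbt ECS task. boto3 is
    imported lazily so the module can be unit-tested without AWS access.
    """
    import boto3  # provided by the Lambda runtime

    sfn = boto3.client("stepfunctions")
    response = sfn.start_execution(
        stateMachineArn=os.environ["DBT_STATE_MACHINE_ARN"],
        input=build_execution_input(event["batch_id"]),
    )
    return {"executionArn": response["executionArn"]}
```

Keeping the payload construction in a pure function makes it easy to test without touching AWS.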
Orchestrating your dbt runs
The orchestrator orchestrates (ha!) the dbt life cycle and it can be triggered:
The orchestrator can perform the following steps:
Some notes regarding the above process:
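A typical dbt life cycle (run the models, then test them) can be sketched as a Step Functions state machine using the ECS `runTask.sync` integration. The cluster, task definition, and container names below are placeholders, and a real Fargate task would also need a `NetworkConfiguration`:

```json
{
  "Comment": "Sketch of a dbt run/test life cycle; names are placeholders.",
  "StartAt": "DbtRun",
  "States": {
    "DbtRun": {
      "Type": "Task",
      "Resource": "arn:aws:states:::ecs:runTask.sync",
      "Parameters": {
        "Cluster": "dbt-cluster",
        "TaskDefinition": "dbt-task",
        "LaunchType": "FARGATE",
        "Overrides": {
          "ContainerOverrides": [
            { "Name": "dbt", "Command": ["run"] }
          ]
        }
      },
      "Next": "DbtTest"
    },
    "DbtTest": {
      "Type": "Task",
      "Resource": "arn:aws:states:::ecs:runTask.sync",
      "Parameters": {
        "Cluster": "dbt-cluster",
        "TaskDefinition": "dbt-task",
        "LaunchType": "FARGATE",
        "Overrides": {
          "ContainerOverrides": [
            { "Name": "dbt", "Command": ["test"] }
          ]
        }
      },
      "End": true
    }
  }
}
```

The `.sync` suffix makes each state wait for the ECS task to finish and fail the execution if the container exits non-zero, so a failed `dbt run` never proceeds to `dbt test`.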
In this article we talked about dbt and how to build a serverless platform around it. There are other aspects that we haven’t talked about which can be covered in future articles, for example:
We encourage you to consider dbt for your enterprise data transformation tasks wherever you need version control, automated testing, and dynamic documentation generation.
dbt Serverless starter project, which includes the Step Functions code and supporting infrastructure.
Starter project from executing the `dbt init` command.
Mature project that diverges significantly from the basic structure.