DBT: A new way to transform data and build pipelines at The Telegraph | by Stefano Solimito | The Telegraph Engineering | Medium
https://medium.com/the-telegraph-engineering/dbt-a-new-way-to-handle-data-transformation-at-the-telegraph-868ce3964eb4
16/11/20, 11:01 PM
The challenge
During the last four years, I have had multiple discussions about how to handle data transformation or, more broadly, Extract, Transform and Load (ETL) processes. The number of tools you can choose from on the market is overwhelming, and committing to the wrong technology could hurt your ability to effectively support different business units and drive decisions based on reliable figures.
At The Telegraph, the data lake has been built on top of Cloud Storage and BigQuery and, according to Google, the natural choice to perform ETL or ELT should be Dataflow (Apache Beam). For most companies, this might be true. But when you go beyond the general use cases covered in the "getting started" guides and start dealing with real-world challenges, what is supposed to be an easy choice might not be so easy.
In our case, adopting Apache Beam has proven not to be the easiest solution, for the following reasons:

- The Java SDK is much better supported than the Python SDK, and most members of our team already have extensive Python expertise but are not proper Java developers. Also, our data scientists work only in Python, so it would mean having codebases in multiple languages to support, making it hard to rotate engineers across different projects.
- Dataflow connects really well with Google products, but in 2015 the number of connectors was limited and we needed to interact with AWS, on-premise servers, etc.
- Our analysts and data scientists tend to speak in SQL, and it is much easier to collaborate with them if, in engineering, we don't need to translate the SQL logic they produce into Java or Python.
As a side note, we adopted Apache Beam in a second phase, but only for real-time data pipelines. In this specific case, having a windowing strategy and being able to perform operations on a stream of records was paramount to the success of certain projects.
The second product you might be tempted to use on Google Cloud Platform (GCP) is Dataproc. If you already have a Spark or Hadoop cluster in place and want to migrate to GCP, it would make perfect sense to consider this option. But in our case, we only had a small Hadoop cluster, and rewriting the logic of the pipelines that were running there was not a problem.
The third product that we considered, and even used for a while, is Talend (free version). If your company wants to fully commit to Talend and buy its enterprise version, then it is a great choice; but if you don't have a strong enough case and decide to adopt the free version, you might face some of the following challenges:

- You have to come up with your own CI/CD, and the testability of your artefacts is limited.
- You have to find Talend experts who are able to drive the development, enforcing best practices to produce high-quality pipelines.
- If the decision to adopt the tool for ETL is not shared across the entire company, in a few years you could end up maintaining a scattered set of technologies that perform similar tasks.
For the reasons above, we considered building our own Python ETL library as a wrapper around the functionalities provided by Google and AWS, in order to make our life easier when interacting with the main services of both clouds. Even this approach has proved to be far from perfect. Due to the effort required to design and develop our own library, and all the maintenance needed to keep it updated and add new features, we started to look for something that could integrate well with this approach and reduce the scope of the library.
In June 2019, we started to test DBT for the transformation part with the
idea of continuing to perform Extraction and Load using the Python library
and relying on Apache Beam for real-time data processing.
What is DBT?
DBT (Data Build Tool) is a command-line tool that enables data analysts and engineers to transform data in their warehouses simply by writing select statements. DBT performs the T (Transform) of ETL but it doesn't offer support for Extraction and Load operations. It allows companies to write transformations as queries and orchestrate them in a more efficient way.
There are currently more than 280 companies running DBT in production
and The Telegraph is among them.
DBT's only function is to take code, compile it to SQL, and then run it against your database. It currently supports the following warehouses:

- Postgres
- Redshift
- BigQuery
- Snowflake
- Presto
DBT can be easily installed using pip (the Python package installer) and comes with both a CLI and a UI. The DBT application is written in Python and is open source, which potentially allows any customisation you might need.
The CLI offers a set of functionalities to execute your data pipelines: run models, run tests, compile, generate documentation, etc.
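For illustration, a typical sequence of CLI invocations might look like the following sketch (the target name is hypothetical):

```shell
# Compile the models to raw SQL without executing them
dbt compile

# Run all models against the configured target warehouse
dbt run --target prod

# Execute the schema and custom data tests
dbt test

# Generate and serve the documentation site (the DBT UI)
dbt docs generate
dbt docs serve
```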
The UI doesn't offer the possibility to change your data pipeline; it is used mostly for documentation purposes.
In the image below you can see how the data lineage of a given table is highlighted in the DBT UI. This helps to quickly understand which data sources are involved in a given transformation and the flow of the data from source to target. This type of visualisation can facilitate discussion with less technical people who are not interested in the detailed implementation of the process but want an overall view.
Initialising a project in DBT is very simple: running "dbt init" in the CLI automatically creates the project structure for you. This ensures that all engineers work with the same template, thereby enforcing a common standard.
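As a sketch, initialising a hypothetical project produces a standard layout along these lines (folder names may vary slightly between DBT versions):

```shell
dbt init my_project
# my_project/
# ├── dbt_project.yml    # project configuration
# ├── models/            # SQL models (select statements)
# ├── macros/            # reusable Jinja2 macros
# └── tests/             # custom data tests
```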
DBT also offers maximum flexibility: if, for some reason, the generated project structure doesn't fit your needs, it can be customised by editing the project configuration file (dbt_project.yml) to rearrange folders as you prefer.
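For example, a minimal dbt_project.yml rearranging the folders could look like the following sketch (the project name, profile and paths are hypothetical):

```yaml
name: telegraph_analytics        # hypothetical project name
version: '1.0'
profile: bigquery_profile        # connection profile defined in profiles.yml

# Point DBT at custom folder locations instead of the defaults
source-paths: ["transformations"]
macro-paths: ["custom_macros"]

models:
  telegraph_analytics:
    staging:
      materialized: view         # default materialisation for this folder
    marts:
      materialized: table
```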
One of the most important concepts in DBT is the model. Every model is a select statement that is orchestrated with the other models to transform the data in the desired way. Every model is written using the query language of your favourite data warehouse (DW). It can also be enriched using Jinja2, allowing you to use variables, loops and conditionals, reference other models and sources, and call reusable macros.
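For instance, a minimal model mixing plain SQL with Jinja2 might look like the following sketch (the table and column names are hypothetical):

```sql
-- models/fact_interaction.sql
{{ config(materialized='table') }}

select
    user_id,
    event_type,
    count(*) as interactions
from {{ ref('stg_events') }}   -- ref() resolves the upstream model
where event_date = '{{ var("execution_date") }}'
group by user_id, event_type
```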
The output of each model can be stored in different ways, depending on the desired behaviour: as a table, as a view, incrementally (inserting or updating only the records added since the last run) or ephemerally (not written to the database at all, but inlined into dependent models). The choice of materialisation determines how each transformation persists your data.
Simple tests can be defined using YAML syntax, placing the test file in the same folder as your models.
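A minimal schema-test sketch in YAML, using DBT's built-in tests (the model and column names are hypothetical):

```yaml
version: 2

models:
  - name: fact_interaction
    columns:
      - name: user_id
        tests:
          - not_null          # fails if any NULL values are present
          - unique
      - name: event_type
        tests:
          - accepted_values:
              values: ['click', 'view', 'share']
```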
The query below ensures that the "bk_source_driver" field of the "fact_interaction" model doesn't have more than 5% of its values set to NULL.
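A custom data test with this intent can be written as a query that returns rows only when the condition is violated, since DBT treats any returned row as a failure. A minimal BigQuery sketch (not the exact query from the original screenshot):

```sql
-- tests/bk_source_driver_null_ratio.sql
select
    count(*) as total_rows,
    countif(bk_source_driver is null) as null_rows
from {{ ref('fact_interaction') }}
having countif(bk_source_driver is null) / count(*) > 0.05
```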
Models in DBT rely on the output of other models or on data sources. Data sources can also be defined using YAML syntax and are reusable, documentable entities that are accessible in DBT models.
In the example below, you can see how it is possible to define a source on top of daily sharded BigQuery tables. It is also possible to use variables to dynamically select the desired shard. In this specific case, the "execution_date" variable is passed as input to DBT and defines which shards are used during the transformation process.
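A sketch of such a source definition might look like this (the dataset and table names are hypothetical, and it assumes Jinja variables are rendered inside the identifier field):

```yaml
version: 2

sources:
  - name: web_analytics
    database: telegraph-data          # hypothetical GCP project
    schema: raw_events
    tables:
      # Daily sharded table, e.g. events_20190601, events_20190602, ...
      - name: events
        identifier: "events_{{ var('execution_date') }}"
```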
DBT also offers the possibility to write your own functions (macros). These can be used to simplify models, but also to create more powerful queries, adding expressive power to your SQL without sacrificing readability. The macro in the example below is used to union multiple daily shards of the same table, depending on the "execution_date" variable passed in and the number of past shards we want to take into consideration.
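A macro with that behaviour could be sketched with a Jinja2 loop along these lines (all names are hypothetical, and it assumes the DBT Jinja context exposes Python's datetime module as modules.datetime):

```sql
{% macro union_daily_shards(dataset, table_prefix, execution_date, days_back) %}
    {# Unions the shard for execution_date and the previous days_back-1 shards #}
    {% set base = modules.datetime.datetime.strptime(execution_date, '%Y%m%d') %}
    {% for i in range(days_back) %}
        {% set shard = (base - modules.datetime.timedelta(days=i)).strftime('%Y%m%d') %}
        select * from `{{ dataset }}.{{ table_prefix }}_{{ shard }}`
        {% if not loop.last %} union all {% endif %}
    {% endfor %}
{% endmacro %}
```

A model could then call `{{ union_daily_shards('raw_events', 'events', var('execution_date'), 3) }}` to read the last three shards as one relation.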
Conclusions
The Telegraph's data engineering team has tested DBT (Core version) for the past two months, and it has proved to be a great tool for all of the projects that required data transformation. As a summary of our experience, here is a list of the tool's pros and cons.
Pros:

- It doesn't require any hard-to-find skills from the jobs market. If your engineers are familiar with SQL and have a basic knowledge of Python, that's enough to approach DBT.
- All of the computational work is pushed down to your DW. This allows you to attain high performance when using a technology such as BigQuery or Snowflake.
- It allows you to test your data (schema tests, referential integrity tests, custom tests) and ensure data quality.
- Transformation logic can be split into multiple models and macros that can be tested separately.
- It's well documented and the learning curve is not very steep.
Cons:

- It is SQL based, so it might offer less readability compared with tools that have an interactive UI.
- It covers only the T of ETL, so you will need other tools to perform Extraction and Load.
If you are interested in practical tips to get the best out of DBT, have a look at this series of articles:

Part 1: https://medium.com/photobox-technology-product-and-design/practical-tips-to-get-the-best-out-of-data-building-tool-dbt-part-1-8cfa21ef97c5

Part 2: https://medium.com/photobox-technology-product-and-design/practical-tips-to-get-the-best-out-of-data-build-tool-dbt-part-2-a3581c76723c

Part 3: https://medium.com/photobox-technology-product-and-design/practical-tips-to-get-the-best-out-of-data-build-tool-dbt-part-3-38cefad40e59
Stefano Solimito is a Principal Data Engineer at The Telegraph. You can follow
him on LinkedIn.