
DBT: A new way to transform data and build pipelines at The Telegraph

Stefano Solimito
Aug 23, 2019 · 10 min read

The Telegraph is a 164-year-old company where data has always had a central
role. In 2015, with the advent of the cloud and the need for a platform able to
process a huge quantity of data, we started to build our big data platform. We
decided to use Google Cloud and, since delivering our first PoC, over the years
we have kept improving our platform to better support the business.


The challenge
During the last 4 years, I have had multiple discussions on how to handle data
transformation or, more broadly, ETL (Extract, Transform and Load)
processes. The number of tools available on the market is overwhelming, and
committing to the wrong technology could have a negative impact on your
ability to effectively support different business units and drive decisions
based on reliable figures.

At The Telegraph, the data lake has been built on top of Cloud Storage
and BigQuery and, according to Google, the natural choice to perform ETL
or ELT should be Dataflow (Apache Beam). For most companies, this might
be true. But when you go beyond the general use cases covered in the
“getting started” guides and start to face real-world challenges,
what is supposed to be an easy choice might not be so easy.

In our case, adopting Apache Beam has proven not to be the easiest solution
for the following reasons:

The Java SDK is much better supported than the Python SDK, and most
members of our team already have extensive expertise in Python but are
not proper Java developers. Also, our data scientists only work in
Python, and it would mean having codebases in multiple languages to
support, making it hard to rotate engineers across different projects.

Most of our batch data pipelines are structured as just a series of
queries to run against BigQuery. Considering this, Apache Beam doesn’t
add much value to the ETL process.

Dataflow connects really well with Google products, but in 2015 the
number of connectors was limited and we needed to interact with AWS,
on-premise servers, etc.

Our analysts and data scientists tend to speak in SQL, and it’s much
easier to collaborate with them if, in engineering, we don’t need to
translate the SQL logic they are producing into Java or Python.


As a side note, we adopted Apache Beam in a second phase, but only for
real-time data pipelines. In fact, in this specific case having a windowing
strategy and being able to perform operations on a stream of records was
paramount for the success of certain projects.

The second product you might be tempted to use on Google Cloud Platform
(GCP) is Dataproc. If you already have a Spark or Hadoop cluster in place
and you want to migrate to GCP, it would make perfect sense to consider
this option. But in our case, we only had a small Hadoop cluster, and
rewriting the logic of the pipelines that were running there was not a
problem.

The third product that we considered and even used for a while is Talend
(free version). If your company wants to fully commit to Talend and buy its
enterprise version then it is a great choice, but if you don’t have a strong
enough case and you decide to adopt the free version, you might face some
of the following challenges:

It is hard to apply version control to your data pipelines.

You have to come up with your own CI/CD and the testability of your
artefacts is limited.

You have to rely on components provided by the community that might
become outdated and eventually unsupported. Or you might decide to
develop your own components, committing resources to building and
keeping them up to date.

You have to find Talend experts who are able to drive the development,
enforcing best practices to produce high-quality pipelines.

If the decision to adopt the tool for ETL is not shared across the
entire company, you could end up in a few years with a scattered set of
technologies performing similar tasks that you have to maintain.

For the reasons above, we considered building our own Python ETL library
as a wrapper around the functionality provided by Google and AWS, in order to
make our life easier when interacting with the main services of both clouds.
Even this approach has proved to be far from perfect. Due to the effort
required to design and develop our own library, and all the maintenance
required to keep it updated and include new features, we started to look for
something that could integrate well with this approach and reduce the
scope of the library.

In June 2019, we started to test DBT for the transformation part with the
idea of continuing to perform Extraction and Load using the Python library
and relying on Apache Beam for real-time data processing.

What is DBT?
DBT (Data Build Tool) is a command-line tool that enables data
analysts and engineers to transform data in their warehouses simply by
writing select statements.

DBT performs the T (Transform) of ETL, but it doesn’t offer support for
Extraction and Load operations. It allows companies to write
transformations as queries and orchestrate them in a more efficient way.
There are currently more than 280 companies running DBT in production,
and The Telegraph is among them.

DBT’s only function is to take code, compile it to SQL, and then run it
against your database.

Multiple databases are supported, including:

Postgres

Redshift

BigQuery

Snowflake

Presto

DBT can be easily installed using pip (the Python package installer) and it
comes with both a CLI and a UI. The DBT application is written in Python and
is open source, which can potentially allow any customisation that you might
need.

The CLI offers a set of functionalities to execute your data pipelines: run
tests, compile, generate documentation, etc.
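
As an illustration, a typical workflow with the CLI looks like the following
(the project name is hypothetical):

pip install dbt        # install the CLI
dbt init my_project    # scaffold a new project
dbt compile            # render Jinja and generate the SQL artefacts
dbt run                # execute the models against your warehouse
dbt test               # run schema and data tests
dbt docs generate      # build the documentation
dbt docs serve         # browse the documentation UI locally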

The UI doesn’t offer the possibility to change your data pipeline and is
used mostly for documentation purposes.


In the image below you can see how the data lineage of a certain table is
highlighted in the DBT UI. This helps to quickly understand which data sources
are involved in a certain transformation and the flow of the data from
source to target. This type of visualisation can facilitate discussion with less
technical people who are not interested in the detailed implementation of
the process, but want an overall view.

Initialising a project in DBT is very simple; running “dbt init” in the CLI
automatically creates the project structure for you. This will ensure that all
engineers will work with the same template and thereby enforces a
common standard.

DBT also offers maximum flexibility: if, for some reason, the project
structure produced doesn’t fit your needs, it can be customised by editing
the project configuration file (dbt_project.yml) to rearrange folders as you
prefer.
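
As a sketch, a minimal dbt_project.yml could look like this (names and paths
are illustrative, not our production configuration):

name: 'my_analytics_project'      # hypothetical project name
version: '1.0'
profile: 'my_bigquery_profile'    # connection details live in profiles.yml
source-paths: ["models"]          # where DBT looks for models
macro-paths: ["macros"]           # where DBT looks for macros
models:
  my_analytics_project:
    materialized: view            # default materialisation, overridable per model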

One of the most important concepts in DBT is the concept of a model. Every
model is a select statement that has to be orchestrated with the other
models to transform the data in the desired way. Every model is written
using the query language of your favourite data warehouse (DW). It can
also be enriched using Jinja2, allowing you to:

https://medium.com/the-telegraph-engineering/dbt-a-new-way-to-handle-data-transformation-at-the-telegraph-868ce3964eb4 Page 6 of 14
DBT: A new way to transform data and build pipelines at The Telegraph | by Stefano Solimito | The Telegraph Engineering | Medium 16/11/20, 11:01 PM

Write tidier parameterised queries.

Encapsulate any functionality that you might want to reuse in macros
that can be used as functions in your queries.

Hide the complexity of your transformation, allowing the reader to
focus on the logic itself.

Below is an example of a model using BigQuery Standard SQL syntax.
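
A minimal sketch of what such a model can look like — the source table and the
injected technical rows are illustrative, not the actual Telegraph model:

{{ config(materialized='table') }}

{% set technical_rows = [
    ('-1', 'Unknown'),
    ('-2', 'Not applicable')
] %}

select
    sk_interaction,
    interaction_name
from {{ ref('stg_interaction') }}  -- hypothetical staging model

{% for sk, name in technical_rows %}
union all
select
    '{{ sk }}' as sk_interaction,
    '{{ name }}' as interaction_name
{% endfor %}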


In this specific case, Jinja is used to inject into the output a set of technical
rows that are not present in the source. Jinja allows iteration on the matrix
of values and reshapes each row in a way that can be included in the main
table, enabling you to keep your query more concise.

The output of each model can be stored in different ways, depending on the
desired behaviour (a configuration sketch follows the list):

Materialise a table — full refresh.

Append to a table — incrementally build your output.

Ephemeral — the output will not be stored in your DW, but it can be
used as a data source by other models.
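
The materialisation is chosen per model, for example (a sketch;
'incremental' can be swapped for 'table' or 'ephemeral'):

{{ config(materialized='incremental') }}

select *
from {{ ref('stg_interaction') }}  -- hypothetical upstream model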

Every DBT model can be complemented with a schema definition. This
means that the documentation lives in the same repository as your
codebase, making it easier to understand every step of what has been
developed. Also, since the documentation is always in front of the engineers
who are working on the pipelines, and since it is directly generated from the
codebase, minimal maintenance is required to keep it up to date.

In data-driven companies, ensuring the quality of the data delivered to the
business is always of paramount importance. DBT helps to serve high-quality
data, allowing you to write different types of tests to check your data.

Simple tests can be defined using YAML syntax, placing the test file in the
same folder as your models.
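
A sketch of such a file, reconstructing the tests described below (the
dimension model name and the accepted values are illustrative):

version: 2

models:
  - name: fact_interaction
    columns:
      - name: sk_interaction
        tests: [unique, not_null]
      - name: bk_source_driver
        tests: [unique, not_null]
      - name: count_interactions
        tests: [not_null]
      - name: fk_interaction_detail
        tests:
          - unique
          - not_null
          - relationships:
              to: ref('dim_interaction_detail')   # hypothetical dimension model
              field: sk_interaction_detail
      - name: performance_band
        tests:
          - accepted_values:
              values: ['low', 'medium', 'high']   # illustrative values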

For this specific example, the defined tests are:

sk_interaction and bk_source_driver both assume unique values and are
never null.

count_interactions is never null.

fk_interaction_detail is not null, assumes unique values, and all the
foreign keys fk_interaction_detail are able to join with the surrogate
keys sk_interaction_detail in the interaction detail dimension. This is
called a referential integrity test and helps to ensure that your star
schema has been built in the proper way.

fk_interaction_text has similar test criteria.

performance_band can assume only the defined array of values.

More advanced testing can be implemented using SQL syntax.

The query below ensures that the “bk_source_driver” field of the model
“fact_interaction” doesn’t have more than 5% of its values set to NULL.
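
A sketch of how this can be written as a custom data test (in DBT, such a test
is a query that must return zero rows to pass; the file name is illustrative):

-- tests/fact_interaction_bk_source_driver_null_rate.sql
with stats as (
    select
        countif(bk_source_driver is null) as null_count,
        count(*) as total_count
    from {{ ref('fact_interaction') }}
)
select *
from stats
where null_count > total_count * 0.05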

Models on DBT rely on the output of other models or on data sources. Data
sources can also be defined using YAML syntax and are reusable,
documentable entities that are accessible in DBT models.

In the example below, you can see how it is possible to define a source on
top of daily sharded BigQuery tables. It is also possible to use variables to
dynamically select the desired shard. In this specific case, the
“execution_date” variable is passed as an input to DBT and defines which
shards are used during the transformation process.
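
A sketch of such a source definition (project, dataset and table names are
illustrative):

version: 2

sources:
  - name: web_analytics              # hypothetical source name
    database: my-gcp-project         # GCP project
    schema: raw_web                  # BigQuery dataset
    tables:
      - name: sessions
        # resolves to the daily shard, e.g. sessions_20190823
        identifier: "sessions_{{ var('execution_date') }}"

A model can then read from it with {{ source('web_analytics', 'sessions') }}.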


DBT also offers the possibility to write your own functions (macros). These
can be used to simplify models, but also to create more powerful queries,
adding more expressive power to your SQL without sacrificing readability.

The macro in the example below is used to union multiple daily shards of the
same table, depending on the “execution_date” variable passed and the
number of past shards we want to take into consideration.
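
A minimal sketch of such a macro, assuming execution_date is passed as a
'YYYYMMDD' string and using the modules.datetime context that DBT exposes to
Jinja (all names are illustrative):

{% macro union_daily_shards(dataset, table_prefix, execution_date, num_shards) %}
    {# parse the shard date once, then walk backwards num_shards days #}
    {% set exec_dt = modules.datetime.datetime.strptime(execution_date, '%Y%m%d') %}
    {% for i in range(num_shards) %}
        {% set shard = (exec_dt - modules.datetime.timedelta(days=i)).strftime('%Y%m%d') %}
        select * from `{{ dataset }}.{{ table_prefix }}{{ shard }}`
        {% if not loop.last %}union all{% endif %}
    {% endfor %}
{% endmacro %}

A model would call it as
{{ union_daily_shards('raw_web', 'sessions_', var('execution_date'), 7) }}.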


Conclusions
The Telegraph’s data engineering team has tested DBT (Core version) for
the past two months, and it has proved to be a great tool for all of the
projects that required data transformation. As a summary of our experience,
here is a list of the tool’s pros and cons.

Pros:

It is open source and open to customisation.

It is easy to apply version control.

The documentation lives with your DBT project and is automatically
generated from your codebase.

It doesn’t require hard-to-find skills on the jobs market. If your engineers
are familiar with SQL and have a basic knowledge of Python, that’s
enough to approach DBT.

The template of each project is automatically generated by running dbt
init. This enforces a standard for all of our data pipelines.

All of the computational work is pushed to your DW. This allows
you to attain high performance when using a technology like
BigQuery or Snowflake.

Because of the point above, orchestrating a DBT pipeline requires
minimal resources.

It allows you to test your data (schema tests, referential integrity tests,
custom tests) and ensures data quality.

It makes it easier to debug complex chains of queries. They can be
split into multiple models and macros that can be tested separately.

It’s well documented and the learning curve is not very steep.

Cons:

SQL based; it might offer less readability compared with tools that have
an interactive UI.

Lack of debugging functionalities is a problem, especially when you
write complex macros.

Sometimes you will find yourself overriding DBT’s standard behaviour,
rewriting macros that are used behind the scenes. This requires an
understanding of the source code.

The UI is for documentation-only purposes. It helps you to visualise
the transformation process, but it’s up to your data engineers to keep the
DBT project tidy and understandable. Having an interactive UI that
allows you to visually see the flow of the pipeline and amend the queries
can be helpful, especially when it comes to complex data pipelines.

Documentation generation for BigQuery is time-consuming due to a
poor implementation that scans all of the shards inside a dataset.

It covers only the T of ETL, so you will need other tools to perform
Extraction and Load.

If you are interested in practical tips to get the best out of DBT, have a look
at this series of articles:
Part 1: https://medium.com/photobox-technology-product-and-design/practical-tips-to-get-the-best-out-of-data-building-tool-dbt-part-1-8cfa21ef97c5
Part 2: https://medium.com/photobox-technology-product-and-design/practical-tips-to-get-the-best-out-of-data-build-tool-dbt-part-2-a3581c76723c
Part 3: https://medium.com/photobox-technology-product-and-design/practical-tips-to-get-the-best-out-of-data-build-tool-dbt-part-3-38cefad40e59

Stefano Solimito is a Principal Data Engineer at The Telegraph. You can follow
him on LinkedIn.


