
Introduction to DBT (data build tool)

Agenda for the session:

1. Understand the fundamentals of DBT (data build tool)
2. Components of a DBT project
3. Sample project introduction
4. Pipeline documentation in DBT
5. Pipeline testing in DBT
6. Pipeline deployment in DBT
Have you ever come across the DBT tool before?
What is DBT?

1. Data build tool (DBT) is a development framework used to transform data inside data warehouses.

2. It follows the ELT pattern: after the data is extracted and loaded into the warehouse, it is transformed in place.

3. DBT is a combination of SQL and Jinja.
3.1 Jinja: a common templating language in the Python ecosystem. It can be used to write Python-like expressions and control structures inside SQL.

Let’s look at a quick example:
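A minimal sketch of Jinja embedded in SQL (the variable, table, and column names are illustrative, not from this deck):

{% set payment_status = 'completed' %}

select *
from payments
where status = '{{ payment_status }}'

At compile time the Jinja expression is replaced with its value, producing plain SQL: select * from payments where status = 'completed'.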

DBT Jinja functions:


The “exceptions” namespace is used to raise warnings and handle errors in dbt user space.

“exceptions.raise_compiler_error” raises a compiler error, typically when a condition you check is not met, and causes the run to fail.

“exceptions.warn” raises a warning when a condition fails but does not cause the run to fail.
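A hedged sketch of how these functions are typically called from a model or macro (the variable name and messages are illustrative; exceptions.raise_compiler_error, exceptions.warn, var, and flags.FULL_REFRESH are documented dbt Jinja functions):

{% if var('run_date', none) is none %}
    {{ exceptions.raise_compiler_error("A 'run_date' variable must be provided") }}
{% endif %}

{% if flags.FULL_REFRESH %}
    {% do exceptions.warn("Full refresh requested; this rebuilds the whole table") %}
{% endif %}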
Use of Jinja along with SQL

Jinja control structures can be embedded directly in a model's SQL file; dbt compiles the template into plain SQL before running it against the warehouse, as in the sketch below.
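A minimal sketch using a common payment-methods loop (the table and column names are illustrative):

{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer'] %}

select
    order_id,
    {% for method in payment_methods %}
    sum(case when payment_method = '{{ method }}' then amount else 0 end) as {{ method }}_amount
    {% if not loop.last %},{% endif %}
    {% endfor %}
from payments
group by order_id

It is compiled into plain SQL, with one aggregate column per payment method:

select
    order_id,
    sum(case when payment_method = 'credit_card' then amount else 0 end) as credit_card_amount,
    sum(case when payment_method = 'coupon' then amount else 0 end) as coupon_amount,
    sum(case when payment_method = 'bank_transfer' then amount else 0 end) as bank_transfer_amount
from payments
group by order_id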
Contd
4. SQL combined with Jinja gives DBT the power to model pipelines, document them, and test them as well.

5. Main components of DBT:

5.1 Compiler: compiles the SQL and Jinja code into plain SQL
5.2 Query results: displays the results of a run
5.3 Lineage: shows the dependency flow between tables
Why DBT?

1. Data plays a crucial role in business decision-making, but turning raw data into a ready-to-use format traditionally requires a set of skilled data engineers.

2. Because of this growing dependency on data engineers to make production-ready data available, DBT emerged as a solution that makes data transformation easier.

3. With DBT, anyone with a fair knowledge of SQL can design, build, test, and deploy analytics-ready data models.

4. We no longer need to rely solely on data engineers for analytics-ready data; data analysts can take part in the transformation process as well.
Role of DBT:
Do you think DBT can replace the role of data engineers?
DBT Workflow Diagram:
Features of DBT:

1. Build data models through simple SELECT statements in SQL files that perform the data transformation.

2. Apply modern software engineering principles such as version control, data model lineage, and CI/CD.

3. Use a template-based language (Jinja) to extend the capabilities of SQL.

4. Generate documentation for the models built.

5. Perform data model testing.

DBT Project Structure:
Elements of a DBT Project:

6.1 Analyses: a folder containing SQL statements written for analysis but not part of the data model. These can query data directly from models.

6.2 Packages: collections of libraries that provide a particular set of functionality. DBT Hub is a registry where available packages are listed; the dbt deps command installs them, as sketched below.
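A minimal packages.yml sketch (the packages shown are real entries on DBT Hub, but the version pins are illustrative):

packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_date
    version: 0.10.0

Running dbt deps then downloads these packages into the project.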

6.3 Logs: contain extensive logging of all the activity in a project.

6.4 Macros: reusable pieces of Jinja code that implement a specific piece of functionality, much like functions; a sketch follows below.
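A minimal macro sketch (the macro name and logic are hypothetical, not from this deck):

-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    round({{ column_name }} / 100.0, 2)
{% endmacro %}

A model can then call {{ cents_to_dollars('amount') }} instead of repeating the expression everywhere.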
6.5 Models: a SQL file containing a SELECT statement that performs a data transformation.

6.5.1 Materialization: the DDL strategy dbt uses to create the model's equivalent table or view in the data warehouse. A sketch of a model with its materialization config follows below.
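A minimal model sketch combining a SELECT, a materialization config, and the macro from the previous sketch (model, source, and column names are illustrative, and the source() call assumes a matching sources definition in a .yml file):

-- models/staging/stg_payments.sql
{{ config(materialized='view') }}

select
    id as payment_id,
    order_id,
    payment_method,
    {{ cents_to_dollars('amount') }} as amount
from {{ source('jaffle_shop', 'payments') }}

With materialized='view', dbt wraps this SELECT in a CREATE VIEW statement; materialized='table' would produce a CREATE TABLE instead.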
6.6 Seeds: CSV files that can be loaded into the data warehouse. They are best suited to static data that doesn't change frequently.
Use cases:
1. Mapping country codes to country names
2. Mapping currencies to country names
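A minimal seed sketch (the file contents are illustrative):

-- seeds/country_codes.csv
country_code,country_name
US,United States
IN,India
GB,United Kingdom

Running dbt seed loads the CSV into the warehouse as a table, which models can then reference with {{ ref('country_codes') }}.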

7. Snapshots: record changes to a table of mutable data over time. They apply SCD Type 2 to capture the changes, tracking how a row in the table evolves over time.
Sample snapshot:
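A minimal snapshot sketch using the timestamp strategy (the snapshot, source, and column names are illustrative):

-- snapshots/orders_snapshot.sql
{% snapshot orders_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}

On each run, dbt compares incoming rows with the stored ones and inserts a new row (with validity timestamps) whenever a tracked row has changed.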
Snapshot strategies for detecting changes:

1. Timestamp:
1.1 Uses a column recording when each source row was last updated to find rows newer than those already in the snapshot table
1.2 “updated_at” names the column that represents when the source row was last updated

2. Check:
2.1 Compares a given list of column values between the current and historical rows
2.2 If any of the values changes, a new row is inserted with the updated values
2.3 “check_cols” is used to list the columns to compare, as sketched below
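For the check strategy, only the config block of the snapshot changes, comparing explicit columns instead of a timestamp (the columns shown are illustrative):

{{
    config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='check',
        check_cols=['status', 'amount']
    )
}}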

8. Targets: hold the connection configuration for different environments.

Ex. As developers we may use sandbox environments to run our models, while production has a dedicated target env holding all our tables and datasets.
Target “prd” => tells dbt to use only the tables and datasets specified under “prd” in the profiles.yml file, as sketched below.
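A minimal profiles.yml sketch with a sandbox and a production target (the project and dataset names are hypothetical):

jaffle_shop:
  target: prd
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-sandbox-project
      dataset: dbt_dev
      threads: 4
    prd:
      type: bigquery
      method: oauth
      project: my-prod-project
      dataset: analytics
      threads: 8

Setting target: prd (or passing --target prd on the command line) makes dbt build into the production dataset rather than the sandbox.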
Sample Project:

Prerequisites:
1. Set up a GCP account and connect to BigQuery.
2. Cloud warehouses like Snowflake and BigQuery, as well as databases like Postgres, are compatible with DBT.
3. Create datasets in BigQuery, which are the equivalent of schemas in databases.
4. Create BigQuery credentials and a dbt Cloud account.
5. Finally, connect the dbt account to BigQuery.

https://docs.getdbt.com/guides/getting-started/getting-set-up/setting-up-bigquery
Sample Project:

1. We use the sample public jaffle shop dataset, which contains data about customers, the orders they place, and the payment details for those orders.
2. Main tables:
2.1 Customers: customer_id and name for each customer
2.2 Orders: order details along with each order's status
2.3 Payments: details of the payment transactions made for the orders, such as payment method, payment status, and amount payable
2.4 Fact table: order details along with the amount payable
2.5 Dim table: final table populated with metrics drawn from the tables above (a sketch of the fact model follows below)
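A hedged sketch of the fact model described above, joining orders to payment amounts (the model and column names follow jaffle shop tutorial conventions but are not taken verbatim from this project):

-- models/fct_orders.sql
select
    o.order_id,
    o.customer_id,
    o.status,
    sum(p.amount) as amount
from {{ ref('stg_orders') }} as o
left join {{ ref('stg_payments') }} as p
    on o.order_id = p.order_id
group by 1, 2, 3

The {{ ref(...) }} calls are what dbt uses to build the lineage graph shown next.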
Lineage for our sample project:
DBT Documentation:

1. In the .yml file we add a description under a column or table, which serves as its documentation.
2. .md files are Markdown files used for extensive documentation that is not possible in .yml files.
3. These are called doc blocks.
4. Their reference is given in the .yml files, as sketched below.
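A minimal sketch of a doc block and its reference (the block name, model, and description text are illustrative):

-- models/docs.md
{% docs order_status %}
The current state of the order: placed, shipped, completed, return_pending, or returned.
{% enddocs %}

-- models/schema.yml
version: 2
models:
  - name: stg_orders
    columns:
      - name: status
        description: '{{ doc("order_status") }}'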
Command to generate docs: dbt docs generate
View docs: this generates the documentation, which can then be browsed.
When the view docs page is opened, we can see the complete project.
At first, table-level details can be seen; then column-level descriptions can be viewed. The lineage graph can also be viewed.
Tests:
1. We write the generic tests in the .yml file
1.1 Not-null test
🡪 It produces the data records that fail the test, i.e., the records that are null

All tests in dbt run this way: they are queries that return the records that fail the test, so a test passes when it returns zero rows.
Tests:
1. These are the tests scheduled to run for the orders table
2. We have 3 tests declared for the orders table, as sketched below
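A sketch of three generic tests on the orders model, mirroring the three tests mentioned above (the exact accepted values are illustrative):

-- models/schema.yml
version: 2
models:
  - name: staging_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']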
Tests which fail:
1. In accepted_values, deliberately leave out one value and see how the failure can be traced back
2. To test the model: dbt test --select staging_orders
Get the compiled SQL from the logs and run it.
Now we can find the value that is not part of our accepted values and has caused the test to fail.

We can then either add this value to the accepted values or filter it out of our model, as sketched below.
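The compiled SQL that dbt logs for an accepted_values test looks roughly like this (a sketch, not the exact query dbt generates; the dataset name is hypothetical):

select status
from my_project.staging_orders
group by status
having status not in ('placed', 'shipped', 'completed', 'return_pending')

Any rows returned are the offending values; here, rows with status 'returned' would surface because that value was left out of the accepted list.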
Deployment:

1. Configure an environment and jobs to get started (a typical job is sketched below)

2. Commit your code to the master branch before running it
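An illustrative sequence of commands for a dbt Cloud job (not this project's exact job definition):

dbt deps
dbt seed
dbt run
dbt test

The job runs against the deployment environment's target, so the models build into the production dataset rather than a developer sandbox.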
Documentation:

Jinja functions: https://docs.getdbt.com/reference/dbt-jinja-functions

DBT project structure: https://towardsdatascience.com/anatomy-of-a-dbt-project-50e810abc695

DBT Hub: https://hub.getdbt.com/calogica/dbt_date/latest/

DBT tests: https://hub.getdbt.com/calogica/dbt_expectations/latest/
https://www.getdbt.com/product/data-testing/
