Data Engineering with dbt
Roberto Zagni
BIRMINGHAM—MUMBAI
Data Engineering with dbt
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means, without the prior written permission of the publisher, except in the case
of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information
presented. However, the information contained in this book is sold without warranty, either express
or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable
for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and
products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot
guarantee the accuracy of this information.
ISBN 978-1-80324-628-4
www.packtpub.com
To the four females in my daily life: my wife, who supports me every day, my daughters,
who keep me grounded in reality, and our dog Lily for the sparkles of love and happiness
that she spreads around every day.
To my mother and my late father who through their sacrifices and support allowed me
to become what I wanted to be.
– Roberto Zagni
Contributors
I would like to thank my customers and colleagues for all the problems and discussions on the way to a working solution; they helped me become a better software and data engineer and collect a wide array of experiences in software and data engineering.
This book is my little contribution to the data engineering community.
I hope that I have been able to put the core set of knowledge that I would have loved to have in my
days as a data engineer in one place, along with a simple, opinionated way to build data platforms
using the modern data stack and proven patterns that scale and simplify everyday work.
About the reviewers
Hari Krishnan has been in the data space for close to 20 years now. He started at Infosys Limited,
working on mainframe technology for about 6 years, and then moved over to Informatica, then
eventually into business intelligence, big data, and the cloud in general. Currently, he is senior manager
of data engineering at Beachbody LLC, where he manages a team of data engineers with Airflow, dbt,
and Snowflake as their primary tech stack. He built the data lake and migrated the data warehouse
as well as the ETL/ELT pipelines from on-premises to the cloud. He spent close to 13 years working for
Infosys and has spent the last 7 years with Beachbody. He is a technology enthusiast and always has
an appetite to discover, explore, and innovate new avenues in the data space.
Daniel Joshua Jayaraj S R is a data evangelist and business intelligence engineer, with over six years of
experience in the field of data analytics, data modeling, and visualization. He has helped organizations
understand the full potential of their data by providing stakeholders with strong business-oriented
visuals, thereby enhancing data-driven decisions. He has worked with multiple tools and technologies
during his career and completed his master’s in big data and business analytics.
I would like to thank my mother, S J Inbarani, who has been my motivation my whole life. I would also
like to thank Roberto Zagni for allowing me to review this wonderful book on dbt.
Table of Contents
Preface xv
2
Setting Up Your dbt Cloud Development Environment 53
Technical requirements 54
  Setting up your GitHub account 54
  Introducing Version Control 54
  Creating your GitHub account 56
  Setting up your first repository for dbt 61
Setting up your dbt Cloud account 63
  Signing up for a dbt Cloud account 63
  Setting up your first dbt Cloud project 65
  Adding the default project to an empty repository 80
Comparing dbt Core and dbt Cloud workflows 85
  dbt Core workflows 85
  dbt Cloud workflows 88
Experimenting with SQL in dbt Cloud 91
  Exploring the dbt Cloud IDE 92
  Executing SQL from the dbt IDE 93
Introducing the source and ref dbt functions 94
  Exploring the dbt default model 95
  Using ref and source to connect models 97
  Running your first models 98
  Testing your first models 100
  Editing your first model 101
Summary 103
Further reading 103
3
Data Modeling for Data Engineering 105
Technical requirements 106
What is and why do we need data modeling? 106
  Understanding data 106
  What is data modeling? 106
  Why we need data modeling 107
  Complementing a visual data model 109
Conceptual, logical, and physical data models 109
  Conceptual data model 110
  Logical data model 111
  Physical data model 113
  Tools to draw data models 114
Entity-Relationship modeling 114
  Main notation 114
  Cardinality 115
  Time perspective 120
  An example of an E-R model at different levels of detail 122
  Generalization and specialization 122
Modeling use cases and patterns 124
  Header-detail use case 124
  Hierarchical relationships 126
  Forecasts and actuals 131
  Libraries of standard data models 132
Common problems in data models 132
  Fan trap 133
  Chasm trap 135
Modeling styles and architectures 137
  Kimball method or dimensional modeling or star schema 138
  Unified Star Schema 141
  Inmon design style 144
  Data Vault 145
  Data mesh 148
Our approach, the Pragmatic Data Platform - PDP 150
Summary 151
Further reading 151
4
Analytics Engineering as the New Core of Data Engineering 153
Technical requirements 154
The data life cycle and its evolution 154
  Understanding the data flow 154
  Data creation 155
  Data movement and storage 156
  Data transformation 162
  Business reporting 164
  Feeding back to the source systems 165
Understanding the modern data stack 166
  The traditional data stack 166
  The modern data stack 167
Defining analytics engineering 168
  The roles in the modern data stack 169
  The analytics engineer 169
DataOps – software engineering best practices for data 170
  Version control 171
  Quality assurance 171
  The modularity of the code base 172
  Development environments 173
  Designing for maintainability 174
Summary 176
Further reading 176
5
Transforming Data with dbt 177
Technical requirements 177
The dbt Core workflow for ingesting and transforming data 178
Introducing our stock tracking project 181
  The initial data model and glossary 181
  Setting up the project in dbt, Snowflake, and GitHub 183
Defining data sources and providing reference data 187
  Defining data sources in dbt 187
  Loading the first data for the portfolio project 192
How to write and test transformations 198
  Writing the first dbt model 198
  Real-time lineage and project navigation 200
  Deploying the first dbt model 200
  Committing the first dbt model 201
  Configuring our project and where we store data 202
  Re-deploying our environment to the desired schema 204
  Configuring the layers for our architecture 207
  Ensuring data quality with tests 209
  Generating the documentation 217
Summary 219
7
Working with Dimensional Data 259
Adding dimensional data 260
  Creating clear data models for the refined and data mart layers 260
Loading the data of the first dimension 262
  Creating and loading a CSV as a seed 262
  Configuring the seeds and loading them 263
  Adding data types and a load timestamp to your seed 263
Building the STG model for the first dimension 266
  Defining the external data source for seeds 266
  Creating an STG model for the security dimension 267
  Adding the default record to the STG 269
Saving history for the dimensional data 269
  Saving the history with a snapshot 270
Building the REF layer with the dimensional data 271
Adding the dimensional data to the data mart 272
Exercise – adding a few more hand-maintained dimensions 273
Summary 274
8
Delivering Consistency in Your Data 275
Technical requirements 275
Keeping consistency by reusing code – macros 276
  Repetition is inherent in data projects 276
  Why copy and paste kills your future self 277
  How to write a macro 277
  Refactoring the “current” CTE into a macro 278
  Fixing data loaded from our CSV file 282
  The basics of macro writing 284
Building on the shoulders of giants – dbt packages 290
  Creating dbt packages 291
  How to import a package in dbt 292
  Browsing through noteworthy packages for dbt 296
  Adding the dbt-utils package to our project 299
Summary 302
9
Delivering Reliability in Your Data 303
Testing to provide reliability 303
  Types of tests 304
  Singular tests 304
  Generic tests 305
  Defining a generic test 308
Testing the right things in the right places 313
  What do we test? 314
  Where to test what? 316
  Testing our models to ensure good quality 319
Summary 329
10
Agile Development 331
Technical requirements 331
Agile development and collaboration 331
Organizing work the agile way 336
  Managing the backlog in an agile way 337
S3.x – developing with dbt models the pipeline for the XYZ table 354
S4 – an acceptance test of the data produced in the data mart 356
S5 – development and verification of the report in the BI application 357
Summary 357
11
Team Collaboration 359
Enabling collaboration 359
  Core collaboration practices 360
  Collaboration with dbt Cloud 361
Working with branches and PRs 363
  Working with Git in dbt Cloud 364
  The dbt Cloud Git process 364
Keeping your development environment healthy 368
  Suggested Git branch naming 369
  Adopting frequent releases 371
  Making your first PR 372
Summary 377
Further reading 377
Summary 426
13
Moving Beyond the Basics 427
Technical requirements 427
Building for modularity 428
  Modularity in the storage layer 429
  Modularity in the refined layer 431
  Modularity in the delivery layer 434
Managing identity 436
  Identity and semantics – defining your concepts 436
  Different types of keys 437
  Main uses of keys 440
Master Data management 440
  Data for Master Data management 441
  A light MDM approach with dbt 442
Saving history at scale 445
  Understanding the save_history macro 447
  Understanding the current_from_history macro 455
Summary 458
14
Enhancing Software Quality 459
Technical requirements 459
Refactoring and evolving models 460
  Dealing with technical debt 460
Implementing real-world code and business rules 462
  Replacing snapshots with HIST tables 463
  Renaming the REF_ABC_BANK_SECURITY_INFO model 466
  Handling orphans in facts 468
  Calculating closed positions 473
  Calculating transactions 482
Publishing dependable datasets 484
  Managing data marts like APIs 485
  What shape should you use for your data mart? 485
  Self-completing dimensions 487
  History in reports – that is, slowly changing dimensions type two 492
Summary 493
Further reading 493
15
Patterns for Frequent Use Cases 495
Technical requirements 495
Ingestion patterns 495
  Basic setup for ingestion 498
  Loading data from files 501
  External tables 505
  Landing tables 507
Index 539
Chapter 2, Setting Up Your dbt Cloud Development Environment, gets you started with dbt by creating
your GitHub and dbt Cloud accounts. You will learn why version control is important and what the data
engineering workflow is when working with dbt.
You will also understand the difference between the open source dbt Core and the commercial dbt
Cloud. Finally, you will experiment with the default project, set up your environment for running
basic SQL with dbt on Snowflake, and understand the key dbt functions: ref and source.
Chapter 3, Data Modeling for Data Engineering, shows why and how you describe data, and how to
travel through different abstraction levels, from business processes to the storage of the data that
supports them: conceptual, logical, and physical data models.
You will understand entities, relationships, attributes, entity-relationship (E-R) diagrams, modeling
use cases and modeling patterns, Data Vault, dimensional models, wide tables, and business reporting.
Chapter 4, Analytics Engineering as the New Core of Data Engineering, showcases the full data life cycle
and the different roles and responsibilities of the people who work on data.
You will understand the modern data stack, the role of dbt, and analytics engineering. You will learn
how to adopt software engineering practices to build data platforms (or DataOps), and about working
as a team, not as a silo.
Chapter 5, Transforming Data with dbt, shows you how to develop an example application in dbt,
covering all the steps to create, deploy, run, test, and document a data application with dbt.
Chapter 6, Writing Maintainable Code, continues the example that we started in the previous chapter,
and we will guide you to configure dbt and write some basic but functionally complete code to build the
three layers of our reference architecture: staging/storage, refined data, and delivery with data marts.
Chapter 7, Working with Dimensional Data, shows you how to incorporate dimensional data into your
data models and utilize it, together with your facts, for a multitude of purposes. We will explore how to create
data models, edit the data for our reference architecture, and incorporate the dimensional data in data
marts. We will also recap everything we learned in the previous chapters with an example.
Chapter 8, Delivering Consistency in Your Data, shows you how to add consistency to your transformations.
You will learn how to go beyond basic SQL and bring the power of scripting into your code, write
your first macros, and learn how to use external libraries in your projects.
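As a taste of what a macro looks like, here is a minimal sketch in dbt's Jinja templating; the macro name and the column it converts are invented for illustration, not taken from the book's project:

```sql
{# macros/cents_to_dollars.sql (hypothetical example) #}
{# Wraps a repeated conversion expression so it is written once and reused #}
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
```

Any model can then call `{{ cents_to_dollars('amount_cents') }}` wherever the conversion is needed, instead of copy-pasting the expression into every transformation.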
Chapter 9, Delivering Reliability in Your Data, shows you how to ensure the reliability of your code by
adding tests that verify your expectations and check the results of your transformations.
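For instance, dbt's built-in generic tests can be declared in a model's .yml file; this sketch uses hypothetical model and column names:

```yaml
# models/staging/stg_orders.yml (hypothetical example)
version: 2
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:          # dbt's built-in generic tests
          - unique
          - not_null
```

Running `dbt test` then checks these expectations against the actual data and fails the run if any of them are violated.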
Chapter 10, Agile Development, teaches you how to develop with agility by mixing philosophy and
practical hints, discussing how to keep the backlog agile through the phases of your projects, and
taking a deep dive into building data marts.
Chapter 11, Team Collaboration, touches on a few practices that help developers work as a team and
the support that dbt provides toward this.
Chapter 12, Deployment, Execution, and Documentation Automation, helps you learn how to automate
the operation of your data platform, by setting up environments and jobs that automate the release
and execution of your code following your deployment design.
Chapter 13, Moving Beyond the Basics, helps you learn how to manage the identity of your entities so that
you can apply master data management to combine data from different systems. At the same time,
you will review the best practices to apply modularity in your pipelines to simplify their evolution
and maintenance. You will also discover macros to implement patterns.
Chapter 14, Enhancing Software Quality, helps you discover and apply more advanced patterns that
provide high-quality results in real-life projects, and you will experiment with how to evolve your
code with confidence through refactoring.
Chapter 15, Patterns for Frequent Use Cases, presents you with a small library of patterns that are
frequently used for ingesting data from external files and storing this ingested data in what we call
history tables. You will also get the insights and the code to ingest data in Snowflake.
If you are using the digital version of this book, we advise you to type the code yourself or access
the code from the book’s GitHub repository (a link is available in the next section). Doing so will
help you avoid any potential errors related to the copying and pasting of code.
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file
extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Create
the new database using the executor role. We named it PORTFOLIO_TRACKING.”
When we wish to draw your attention to a particular part of a code block, the relevant lines or items
are set in bold:
CREATE VIEW BIG_ORDERS AS
SELECT * FROM ORDERS
WHERE TOTAL_AMOUNT > 1000;
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance,
words in menus or dialog boxes appear in bold. Here is an example: “Select System info from the
Administration panel.”
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at
customercare@packtpub.com and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen.
If you have found a mistake in this book, we would be grateful if you would report this to us. Please
visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would
be grateful if you would provide us with the location address or website name. Please contact us at
copyright@packt.com with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you
are interested in either writing or contributing to a book, please visit authors.packtpub.com.