Slides

LLM-driven Data Engineering
Day 1
EcZachly Inc
Is data engineering doomed to
AI?
And what you can do about it!
EcZachly Inc
Here’s why data engineering is doomed
(according to shitty LinkedIn influencers)
EcZachly Inc
- LLMs are getting good at writing SQL

- LLMs increase productivity, reduce debug time, and reduce documentation
writing time
- LLMs can even write Spark code! Y’all saw the English SDK for Databricks
right? EcZachly Inc
- LLMs will guarantee amazing data quality!
LLMs and SQL
EcZachly Inc
I’m sure you’ve seen the
SQL is an important part of the data engineering equation
- LLMs are good but not great at SQL (we will see this in the lab today)
EcZachly Inc
I’m going to show you a chart soon
EcZachly Inc
- Red means, LLMs are going to have a significant impact, soon

- Yellow means, LLMs aren’t quite there yet but could disrupt in the future
- Green means, LLMs are far from accomplishing this
EcZachly Inc
EcZachly Inc
EcZachly Inc
Answering business questions (red)
EcZachly Inc
LangChain hooks into databases

directly and can easily make sense
of data warehouses!
This disruption is imminent! EcZachly Inc

Fixing Broken Pipelines (red)
EcZachly Inc
LLMs + Agents will be able to

significantly reduce the oncall burden.
- LLMs that process stack traces /

quality failures can recommend
possible remedies to unblock
(maybe even via Slack)
The amount of data engineering hours

EcZachly Inc
that are recovered from this is
something to be celebrated!
Tricky failures will still need manual

troubleshooting though
Writing analytical queries (red)
EcZachly Inc
- This will be disrupted 100% but complexity

matters here a lot.
- SQL practitioners probably won’t feel much
difference here
- Self-serve analytics will catch fire here though
This is something to be celebrated!
EcZachly Inc
Data engineers won’t have to hand hold
stakeholders through their analytical questions
nearly as much!
Writing pipeline code in:
SQL, Spark, or Flink (orange)
EcZachly Inc
LLMs can oftentimes give us a

good boilerplate to start with, but
debugging it is often not worth it!
We are still pretty far from

EcZachly Inc
prompts generating us optimized
and correct data pipelines!
Why ChatGPT sucks for SQL!
EcZachly Inc
LLMs are like having a good junior engineer

who can write SQL for you but you have to
check their work
The chat.openai.com interface is

NON-DETERMINISTIC!
EcZachly Inc
The same prompt will output different things
depending on when you input it!
REMEMBER FROM CLASS, IF THINGS ARE NOT

IDEMPOTENT, THEY ARE NOT TO BE TRUST
Debugging
EcZachly Inc
EcZachly Inc
Writing data documentation (orange)
EcZachly Inc
If you’ve done the upfront work of all the

conceptual and physical data modeling,
ChatGPT can give you beautiful
summaries and example queries. (We’ll
see this in the lab today)
ChatGPT will miss most business EcZachly Inc
context unless directly given it and if
you’re directly giving ChatGPT the
context, you might as well write the
documentation yourself at that point
Sprint Planning (orange)
EcZachly Inc
LLMs should be able to give us a

good idea about sprint planning.
The soft skills needed to prioritize EcZachly Inc

and push back are sadly not
present in ChatGPT
Dashboarding (orange)
EcZachly Inc
Presenting data is actually pretty

easy. Usually simple GROUP BY
queries does the job
Knowing you customer and what

EcZachly Inc
data to display and when is the
hard part that ChatGPT will
continue to suck at
Physical Data Modeling (orange)
EcZachly Inc
Optimizing logical/conceptual
data models to a specific
architecture is something
ChatGPT will excel at.
Having all the best practices in EcZachly Inc
place and understand the interplay
between data table, pipeline, etc is
hard for ChatGPT to understand
Writing tests for you pipelines (orange)
EcZachly Inc
Generating fake data will be a big

pain that LLMs can give us
already.
EcZachly Inc
Knowing if we have adequate test
coverage is something ChatGPT
will continue to be bad at
Automated Data Quality Checks (orange)
EcZachly Inc
Suggesting automated checks is

something LLMs can do already
Having the business context to EcZachly Inc

know what checks are valuable is
something LLMs will have a hard
time with
Building data processing frameworks
(green)
EcZachly Inc
Once you hit a sufficient level of

complexity, LLMs fall apart.
Imagine asking ChatGPT,

“Give me the code for the next EcZachly Inc
generation of Spark”
Creating data best practices (green)
EcZachly Inc
Data best practices are most

often learned from experience.
Experience that needs to
permeate through the
organization. EcZachly Inc
ChatGPT will catch up 2-3 years

late!
Conceptual / Logical Data Modeling (green)
EcZachly Inc
Data Modeling is a very soft skills

focused task.
Understanding what data is

needed, where it is, and how to get EcZachly Inc
it is a human-oriented task that
ChatGPT will continue to struggle
with
What’s the lab going to be about today?
EcZachly Inc
- When you should and shouldn’t use LLMs

- What types of tasks are LLMs better suited for
- How can you get the most out of each prompt?
EcZachly Inc
When should you use LLMs?
EcZachly Inc
- Debugging specific problems (similar to StackOverflow or Google)

- Getting your starter DAG boilerplate
- Boilerplate documentation
- Solving a very specific part of a query
EcZachly Inc
Generating Documentation
EcZachly Inc
- LLMs are actually decent at generating text and documentation

- Business context is often lacking though!
- Make sure to specify to use Markdown in the prompt
- The more context you give, the better the documentation
EcZachly Inc
How to use the OpenAI API
EcZachly Inc
- Install the openai Python library

- Get an OpenAI API key (REMEMBER ITS PAY PER USE)
EcZachly Inc
How the prompts work in Python API
EcZachly Inc
- You have multiple roles you can play

- System
- Assistant
- User
EcZachly Inc
The System Role
EcZachly Inc
Think of this as the “contextual” clues that guide the prompt the right way
EcZachly Inc
The Assistant Role
EcZachly Inc
This is for when you want to have a multiple prompt/response conversation

with ChatGPT
EcZachly Inc
The User Role
EcZachly Inc
This is the prompt that is used for the specific task at hand
EcZachly Inc
API Configuration
EcZachly Inc
Temperature - Scales from 0 to 2. Closer to 0 the more deterministic the

output.
Max Tokens - you can set this as high as 4000. One token is about 4 characters
in English
EcZachly Inc
Today’s Lab
EcZachly Inc
- We’ll be interacting with GPT-4 in today’s lab

- We’ll generate an Airflow DAG
- A SQL query
- And documentation for our data sets
EcZachly Inc

Slides

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Slides

Uploaded by

Copyright:

Available Formats

LLM-driven Data Engineering

- LLMs are getting good at writing SQL

I’m sure you’ve seen the

SQL is an important part of the data engineering equation

- Red means, LLMs are going to have a signiﬁcant impact, soon

LangChain hooks into databases

This disruption is imminent! EcZachly Inc

LLMs + Agents will be able to

- LLMs that process stack traces /

The amount of data engineering hours

Tricky failures will still need manual

- This will be disrupted 100% but complexity

LLMs can oftentimes give us a

We are still pretty far from

LLMs are like having a good junior engineer

The chat.openai.com interface is

REMEMBER FROM CLASS, IF THINGS ARE NOT

If you’ve done the upfront work of all the

LLMs should be able to give us a

The soft skills needed to prioritize EcZachly Inc

Presenting data is actually pretty

Knowing you customer and what

Generating fake data will be a big

Suggesting automated checks is

Having the business context to EcZachly Inc

Once you hit a suﬃcient level of

Imagine asking ChatGPT,

Data best practices are most

ChatGPT will catch up 2-3 years

Data Modeling is a very soft skills

Understanding what data is

- When you should and shouldn’t use LLMs

- Debugging speciﬁc problems (similar to StackOverﬂow or Google)

- LLMs are actually decent at generating text and documentation

- Install the openai Python library

- You have multiple roles you can play

This is for when you want to have a multiple prompt/response conversation

Temperature - Scales from 0 to 2. Closer to 0 the more deterministic the

- We’ll be interacting with GPT-4 in today’s lab

You might also like