Download as pdf or txt
Download as pdf or txt
You are on page 1of 31

LLM-driven Data Engineering

Day 1
EcZachly Inc
Is data engineering doomed to
AI?
And what you can do about it!
EcZachly Inc
Here’s why data engineering is doomed
(according to shitty LinkedIn influencers)
EcZachly Inc

- LLMs are getting good at writing SQL


- LLMs increase productivity, reduce debug time, and reduce documentation
writing time
- LLMs can even write Spark code! Y’all saw the English SDK for Databricks
right? EcZachly Inc
- LLMs will guarantee amazing data quality!
LLMs and SQL
EcZachly Inc

I’m sure you’ve seen the

SQL is an important part of the data engineering equation

- LLMs are good but not great at SQL (we will see this in the lab today)
EcZachly Inc
I’m going to show you a chart soon
EcZachly Inc

- Red means, LLMs are going to have a significant impact, soon


- Yellow means, LLMs aren’t quite there yet but could disrupt in the future
- Green means, LLMs are far from accomplishing this

EcZachly Inc
EcZachly Inc

EcZachly Inc
Answering business questions (red)
EcZachly Inc

LangChain hooks into databases


directly and can easily make sense
of data warehouses!

This disruption is imminent! EcZachly Inc


Fixing Broken Pipelines (red)
EcZachly Inc

LLMs + Agents will be able to


significantly reduce the oncall burden.

- LLMs that process stack traces /


quality failures can recommend
possible remedies to unblock
(maybe even via Slack)

The amount of data engineering hours


EcZachly Inc
that are recovered from this is
something to be celebrated!

Tricky failures will still need manual


troubleshooting though
Writing analytical queries (red)
EcZachly Inc

- This will be disrupted 100% but complexity


matters here a lot.
- SQL practitioners probably won’t feel much
difference here
- Self-serve analytics will catch fire here though
This is something to be celebrated!
EcZachly Inc
Data engineers won’t have to hand hold
stakeholders through their analytical questions
nearly as much!
Writing pipeline code in:
SQL, Spark, or Flink (orange)
EcZachly Inc

LLMs can oftentimes give us a


good boilerplate to start with, but
debugging it is often not worth it!

We are still pretty far from


EcZachly Inc
prompts generating us optimized
and correct data pipelines!
Why ChatGPT sucks for SQL!
EcZachly Inc

LLMs are like having a good junior engineer


who can write SQL for you but you have to
check their work

The chat.openai.com interface is


NON-DETERMINISTIC!
EcZachly Inc
The same prompt will output different things
depending on when you input it!

REMEMBER FROM CLASS, IF THINGS ARE NOT


IDEMPOTENT, THEY ARE NOT TO BE TRUST
Debugging
EcZachly Inc

EcZachly Inc
Writing data documentation (orange)
EcZachly Inc

If you’ve done the upfront work of all the


conceptual and physical data modeling,
ChatGPT can give you beautiful
summaries and example queries. (We’ll
see this in the lab today)
ChatGPT will miss most business EcZachly Inc
context unless directly given it and if
you’re directly giving ChatGPT the
context, you might as well write the
documentation yourself at that point
Sprint Planning (orange)
EcZachly Inc

LLMs should be able to give us a


good idea about sprint planning.

The soft skills needed to prioritize EcZachly Inc


and push back are sadly not
present in ChatGPT
Dashboarding (orange)
EcZachly Inc

Presenting data is actually pretty


easy. Usually simple GROUP BY
queries does the job

Knowing you customer and what


EcZachly Inc
data to display and when is the
hard part that ChatGPT will
continue to suck at
Physical Data Modeling (orange)
EcZachly Inc

Optimizing logical/conceptual
data models to a specific
architecture is something
ChatGPT will excel at.
Having all the best practices in EcZachly Inc
place and understand the interplay
between data table, pipeline, etc is
hard for ChatGPT to understand
Writing tests for you pipelines (orange)
EcZachly Inc

Generating fake data will be a big


pain that LLMs can give us
already.

EcZachly Inc
Knowing if we have adequate test
coverage is something ChatGPT
will continue to be bad at
Automated Data Quality Checks (orange)
EcZachly Inc

Suggesting automated checks is


something LLMs can do already

Having the business context to EcZachly Inc


know what checks are valuable is
something LLMs will have a hard
time with
Building data processing frameworks
(green)
EcZachly Inc

Once you hit a sufficient level of


complexity, LLMs fall apart.

Imagine asking ChatGPT,


“Give me the code for the next EcZachly Inc
generation of Spark”
Creating data best practices (green)
EcZachly Inc

Data best practices are most


often learned from experience.
Experience that needs to
permeate through the
organization. EcZachly Inc

ChatGPT will catch up 2-3 years


late!
Conceptual / Logical Data Modeling (green)
EcZachly Inc

Data Modeling is a very soft skills


focused task.

Understanding what data is


needed, where it is, and how to get EcZachly Inc
it is a human-oriented task that
ChatGPT will continue to struggle
with
What’s the lab going to be about today?
EcZachly Inc

- When you should and shouldn’t use LLMs


- What types of tasks are LLMs better suited for
- How can you get the most out of each prompt?

EcZachly Inc
When should you use LLMs?
EcZachly Inc

- Debugging specific problems (similar to StackOverflow or Google)


- Getting your starter DAG boilerplate
- Boilerplate documentation
- Solving a very specific part of a query
EcZachly Inc
Generating Documentation
EcZachly Inc

- LLMs are actually decent at generating text and documentation


- Business context is often lacking though!
- Make sure to specify to use Markdown in the prompt
- The more context you give, the better the documentation
EcZachly Inc
How to use the OpenAI API
EcZachly Inc

- Install the openai Python library


- Get an OpenAI API key (REMEMBER ITS PAY PER USE)

EcZachly Inc
How the prompts work in Python API
EcZachly Inc

- You have multiple roles you can play


- System
- Assistant
- User

EcZachly Inc
The System Role
EcZachly Inc

Think of this as the “contextual” clues that guide the prompt the right way

EcZachly Inc
The Assistant Role
EcZachly Inc

This is for when you want to have a multiple prompt/response conversation


with ChatGPT

EcZachly Inc
The User Role
EcZachly Inc

This is the prompt that is used for the specific task at hand

EcZachly Inc
API Configuration
EcZachly Inc

Temperature - Scales from 0 to 2. Closer to 0 the more deterministic the


output.
Max Tokens - you can set this as high as 4000. One token is about 4 characters
in English
EcZachly Inc
Today’s Lab
EcZachly Inc

- We’ll be interacting with GPT-4 in today’s lab


- We’ll generate an Airflow DAG
- A SQL query
- And documentation for our data sets

EcZachly Inc

You might also like