
ETL pipeline

- Extract, transform via Python, then load


- <Something here. I did not catch>
- ELT: Get data from sources and load it immediately into the data warehouse
Where do you want to do the processing?
- If on a separate machine, ETL
- If in the target destination itself, ELT (see the sketch right below)
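
A rough sketch of the two approaches (not from the class itself), assuming pandas and a SQLAlchemy engine pointed at the warehouse; the connection string, file name, and table names are made up:

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@warehouse:5432/analytics")

# ETL: transform in Python on a separate machine, then load the result
df = pd.read_csv("sales.csv")                                        # extract
df["total"] = df["qty"] * df["unit_price"]                           # transform outside the warehouse
df.to_sql("sales_clean", engine, if_exists="replace", index=False)   # load

# ELT: load the raw data first, then transform inside the warehouse itself
pd.read_csv("sales.csv").to_sql("sales_raw", engine, if_exists="replace", index=False)
with engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE sales_clean AS "
        "SELECT *, qty * unit_price AS total FROM sales_raw"
    ))

In the ETL version the transformation runs in Python outside the warehouse; in the ELT version the raw data is loaded as-is and the transformation runs as SQL inside the target itself.
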
How will you handle processing where one task has to finish before you can proceed to the next task?

Imagine an ML pipeline with intermediate data: how do you make sure you don’t rerun a part of the script that has already finished?
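
One simple approach (a minimal sketch; the file names and helper are made up) is to have each step write its output to a file and skip the step whenever that file already exists:

import os
import pandas as pd

def run_step(output_path, step_fn):
    # Skip the step if its intermediate output is already on disk
    if os.path.exists(output_path):
        print(f"skipping, {output_path} already exists")
        return
    step_fn(output_path)

def clean(path):
    # Example step: drop missing rows and persist the intermediate result
    pd.read_csv("raw.csv").dropna().to_csv(path, index=False)

run_step("clean.csv", clean)

Workflow managers such as Luigi (listed below) automate exactly this kind of check.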

How do you schedule a batch run?


- Cron job
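
For example, a hypothetical crontab entry (the script path, log path, and schedule are made up) that runs a batch pipeline every day at 02:00:

# minute hour day-of-month month day-of-week  command
0 2 * * * python3 /path/to/pipeline.py >> /var/log/pipeline.log 2>&1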

Workflow managers
- Apache Airflow
o Disadvantage: needs an always-running server (scheduler and webserver)
- Luigi
o Checks for the existence of intermediate output files (targets); if a target does not exist, it reruns that part of the pipeline (see the sketch after this list)
- MLFlow
- Kubeflow
o Create jobs in Kubernetes
- Metaflow
o Improved version of Luigi
- Prefect
o Can use Dask for parallel/distributed execution
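
A minimal Luigi sketch of that file-existence idea (the task names and file names are made up): each task declares an output target, and Luigi only runs tasks whose targets do not exist yet.

import luigi

class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget("raw.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("id,amount\n1,10\n2,20\n")

class Transform(luigi.Task):
    def requires(self):
        return Extract()

    def output(self):
        return luigi.LocalTarget("clean.csv")

    def run(self):
        with self.input().open() as fin, self.output().open("w") as fout:
            fout.write(fin.read().upper())

if __name__ == "__main__":
    # If clean.csv already exists, Transform is skipped entirely
    luigi.build([Transform()], local_scheduler=True)

Delete clean.csv to force that part of the pipeline to run again on the next invocation.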

Airflow
- You really have to code (flexible)
- One data pipeline is a DAG
- It also uses a meta database
o Postgres (better for production)
o SQLite (only supports running tasks sequentially, so no parallelism; not recommended for production, but fine for simpler local setups)
- In Airflow, you create the DAGs yourself (see the minimal dag.py sketch after this list)
- Setting and starting up Airflow
o Initialize database
▪ airflow db init
o Run scheduler
▪ airflow scheduler
o Run web interface
▪ airflow webserver
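
A minimal dag.py sketch, assuming Airflow 2.x; the task logic is just placeholders, and the dag_id matches the my_first_dag used in the commands below:

# dag.py
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def transform():
    print("transforming...")

with DAG(
    dag_id="my_first_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load = BashOperator(task_id="load", bash_command="echo loading")

    # Dependencies: extract runs before transform, transform before load
    extract >> transform_task >> load

Place the file in Airflow's DAGs folder so the scheduler picks it up; the commands below can then be used to check and run it.
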
Airflow commands
- python dag.py
o Checks that dag.py is syntactically correct (the file parses and imports cleanly; no tasks are executed)
- airflow tasks
o Manage tasks
- airflow tasks test <dag_id> <task_id>
o Runs a single task of a DAG in isolation, without recording state in the metadata database (an execution date can be passed as an optional extra argument)
- airflow dags test my_first_dag
o This command performs a single run of the entire DAG for a given execution date, as if it had been triggered, but without registering anything in the metadata database. It is useful for checking end to end that the whole DAG works before deploying it.
- airflow dags trigger my_first_dag
o This command triggers a new DAG run: it starts the whole workflow defined by the DAG and runs its tasks according to their dependencies. It is typically used in production, or whenever you want to start a DAG run on demand instead of waiting for its schedule.

Airflow (after Redshift)


- If there is a large amount of data, write it to storage first before loading it into the warehouse
