
Airflow

Workflow Management System

previa [at] gmail.com


Outline

• Why Airflow? What is it?
• Workflow (DAG)
• Architecture
• DEMO
• Scheduler
• Jinja2 & macros
• Command
• Backfill
• Q/A
Why Airflow?

ETL Hell
● How to control it?
○ Complicated and implicit
● How to schedule it?
○ Time-driven and event-driven
● How to deal with failure?
○ Retry, notify, logging

REF: http://tinyurl.com/yclq3slz
Why Airflow?

Considerations
○ General purpose
○ Backfilling
○ Rich UI
○ Flow definition

We chose Airflow

REF: http://tinyurl.com/ycadwhta
About Airflow

● Open-sourced by Airbnb in June 2015
● Joined the Apache Software Foundation (ASF) incubation program in March 2016
● 660+ contributors, 5.7k+ commits, 10k+ stars
● Used by 200+ companies:
○ Adobe, Airbnb, HBO, Intel, IFTTT, Lyft, PayPal, Pandora, Quora, Reddit, SimilarWeb, Tesla, Twitter, Vevo, 9GAG, Square, Yahoo, ..
○ Wondershare since Dec 2018

● Workflows we currently run on Airflow
○ Order, User, sales insight, …
○ Cluster maintenance
Workflow (DAG)

● Workflow
● DAG (directed acyclic graph)
● DagRun
● Task
● Operator
● TaskInstance

REF: http://tinyurl.com/ydxyuscd
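
How these terms relate can be sketched in plain Python. This is a toy model, not Airflow's API: the task names (`extract`, `transform`, `load`, `notify`) are hypothetical, and the standard library's `graphlib.TopologicalSorter` (Python 3.9+) stands in for the scheduler's dependency resolution.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a task, each value is the set of
# upstream tasks that must finish before it can start.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def run_order(dependencies):
    """Return one valid execution order for the tasks in a DAG.

    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(dependencies).static_order())


def dag_run(dependencies, execution_date):
    """Model a DagRun: one (task, execution_date) pair per task,
    which is roughly what a TaskInstance identifies."""
    return [(task, execution_date) for task in run_order(dependencies)]


if __name__ == "__main__":
    for task_instance in dag_run(dag, "2019-01-01"):
        print(task_instance)
```

In real Airflow the tasks are created from Operators inside a DAG file and the scheduler creates the DagRuns; the model only shows why the graph has to be acyclic for an execution order to exist.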
Workflow (DAG)

● Operator
○ Execute
■ Bash, SSH, Python
○ DB
■ MySQL, MSSQL, Postgres, Presto, Redshift, Hive
○ WebService
■ HTTP, Email, Slack, S3
Workflow (DAG)

● Sensor
○ FileSensor, Database, SageMaker, Redshift, IMAP, BigQuery, Cassandra, EMR, FTP, SFTP, HDFS, Jira, Azure, ...

● Flow control
○ BranchPythonOperator
○ Pass data between tasks => XCom
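
Branching and passing data between tasks can be modeled roughly as below. This is a plain-Python sketch, not Airflow code: a dict stands in for XCom storage, and the task ids (`load_rows`, `skip_load`) and the `row_count` key are hypothetical. The one Airflow behavior it mirrors is that a branch callable returns the task id to follow and the other branch is skipped.

```python
def choose_branch(xcom):
    """Branch callable: return the task_id of the branch to follow."""
    return "load_rows" if xcom["row_count"] > 0 else "skip_load"


def run_flow(row_count):
    """Run a tiny flow: push a value, then pick one of two branches."""
    xcom = {}                      # stand-in for XCom storage
    xcom["row_count"] = row_count  # an upstream task "pushes" a value
    branches = ["load_rows", "skip_load"]
    chosen = choose_branch(xcom)   # the branch task "pulls" the value
    skipped = [b for b in branches if b != chosen]
    return chosen, skipped


if __name__ == "__main__":
    print(run_flow(5))
    print(run_flow(0))
```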
Workflow (DAG)

• Skip
– Raise AirflowSkipException

• Join (Trigger Rules)
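
Why a join needs a trigger rule can be seen in a simplified model. The rule names below (`all_success`, `all_failed`, `all_done`, `one_success`, `one_failed`, `none_failed`) are Airflow trigger rules, but the evaluation logic is a sketch under simplifying assumptions, not Airflow's implementation; for instance, it ignores how skips propagate downstream.

```python
FINISHED = {"success", "failed", "upstream_failed", "skipped"}
FAILED = {"failed", "upstream_failed"}


def should_trigger(rule, upstream_states):
    """Decide whether a join task may run, given its upstream task states."""
    states = list(upstream_states)
    succeeded = sum(s == "success" for s in states)
    failed = sum(s in FAILED for s in states)
    all_finished = all(s in FINISHED for s in states)

    if rule == "all_success":   # every upstream task succeeded
        return succeeded == len(states)
    if rule == "all_failed":    # every upstream task failed
        return failed == len(states)
    if rule == "all_done":      # every upstream task finished, any outcome
        return all_finished
    if rule == "one_success":   # at least one upstream task succeeded
        return succeeded > 0
    if rule == "one_failed":    # at least one upstream task failed
        return failed > 0
    if rule == "none_failed":   # all finished and none failed (skips allowed)
        return all_finished and failed == 0
    raise ValueError(f"unknown trigger rule: {rule}")


if __name__ == "__main__":
    # After a branch, one upstream path is skipped.
    after_branch = ["success", "skipped"]
    for rule in ("all_success", "one_success", "none_failed"):
        print(rule, should_trigger(rule, after_branch))
```

With `all_success`, a join placed after a branch does not fire because one upstream path was skipped; `one_success` or `none_failed` lets it run.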


Architecture(Multi-Node)

• parallelism
• dag_concurrency
• max_active_runs_per_dag
• non_pooled_task_slot_count
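
A rough sketch of how these settings interact, assuming they sit in the `[core]` section of `airflow.cfg` as in Airflow 1.10. The numbers are sample values for illustration, and `max_tasks_for_one_dag` is a hypothetical helper that deliberately ignores pools and executor worker limits.

```python
import configparser

# Sample values only; check your own airflow.cfg.
SAMPLE_CFG = """
[core]
# Max task instances running at once across the whole installation.
parallelism = 32
# Max task instances running at once within a single DAG.
dag_concurrency = 16
# Max active DagRuns per DAG.
max_active_runs_per_dag = 16
# Slots available to tasks that are not assigned to a pool.
non_pooled_task_slot_count = 128
"""

KEYS = (
    "parallelism",
    "dag_concurrency",
    "max_active_runs_per_dag",
    "non_pooled_task_slot_count",
)


def load_limits(text):
    """Parse the concurrency settings from airflow.cfg-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {key: parser.getint("core", key) for key in KEYS}


def max_tasks_for_one_dag(limits):
    """Simplified upper bound on concurrently running tasks of one DAG."""
    return min(limits["parallelism"], limits["dag_concurrency"])


if __name__ == "__main__":
    limits = load_limits(SAMPLE_CFG)
    print(limits)
    print(max_tasks_for_one_dag(limits))
```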
Architecture(Multi-Node)

Dag, Metadata

REF:http://tinyurl.com/y6uzjcsg
UI

DEMO
http://longfei.leanote.com/post/airflow-operations
Scheduler

● start_date: 2019-01-01
● schedule_interval: 1 1 * * *

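#     real run date        execution_date
N/A   2019-01-01 01:01     (no run yet)
1     2019-01-02 01:01     2019-01-01 01:01
2     2019-01-03 01:01     2019-01-02 01:01

execution_date of run N = first scheduled time + (N - 1) * interval
real run date of run N = execution_date + interval (a run starts at the end of the period it covers)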

● Airflow uses UTC (~1.10.x)
○ Patch source code
■ http://tinyurl.com/yatqngh8
○ Modify DAG file
■ http://tinyurl.com/y8kmzz65
● schedule_interval also supports @hourly, @daily
○ Use the crontab form to stagger runs and avoid load peaks
● Set depends_on_past to true if needed
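
The relation between the real run date and execution_date can be checked with a few lines of standard-library Python. This is a sketch for a fixed daily interval only; `daily_runs` is a hypothetical helper, and the first scheduled time (2019-01-01 01:01) is taken from the start_date and cron expression on this slide.

```python
from datetime import datetime, timedelta


def daily_runs(first_tick, count, interval=timedelta(days=1)):
    """Return (run_number, real_run_date, execution_date) tuples.

    execution_date marks the start of the period a run covers;
    the run itself starts one interval later.
    """
    runs = []
    for n in range(1, count + 1):
        execution_date = first_tick + (n - 1) * interval
        runs.append((n, execution_date + interval, execution_date))
    return runs


if __name__ == "__main__":
    # First scheduled time on or after start_date for "1 1 * * *".
    first_tick = datetime(2019, 1, 1, 1, 1)
    for n, run_date, execution_date in daily_runs(first_tick, 2):
        print(n, run_date, execution_date)
```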
Jinja2 & macros

Variable                                              Description
{{ ds }}                                              the execution date as YYYY-MM-DD
{{ ds_nodash }}                                       the execution date as YYYYMMDD
{{ yesterday_ds }}, {{ tomorrow_ds }}                 the day before / after the execution date as YYYY-MM-DD
{{ yesterday_ds_nodash }}, {{ tomorrow_ds_nodash }}   the day before / after the execution date as YYYYMMDD
{{ ts }}                                              execution_date.isoformat()
{{ dag }}, {{ task }}                                 the DAG object, the Task object
{{ task_instance }}, {{ ti }}                         the task_instance object
{{ params }}                                          user-defined params dictionary

• Leverage macros & Jinja2
• Let Airflow control time
– Don’t manipulate time in code; pass dates in as templated args instead
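
What the date variables expand to can be reproduced with the standard library. This is a sketch, not Airflow's templating engine: `template_context` and `render` are hypothetical helpers, a simple string replace stands in for Jinja2, and the `orders` table name is made up.

```python
from datetime import datetime, timedelta


def template_context(execution_date):
    """Build the date variables from the table for one execution date."""
    one_day = timedelta(days=1)
    fmt = "%Y-%m-%d"
    ds = execution_date.strftime(fmt)
    yesterday_ds = (execution_date - one_day).strftime(fmt)
    tomorrow_ds = (execution_date + one_day).strftime(fmt)
    return {
        "ds": ds,
        "ds_nodash": ds.replace("-", ""),
        "yesterday_ds": yesterday_ds,
        "yesterday_ds_nodash": yesterday_ds.replace("-", ""),
        "tomorrow_ds": tomorrow_ds,
        "tomorrow_ds_nodash": tomorrow_ds.replace("-", ""),
        "ts": execution_date.isoformat(),
    }


def render(template, context):
    """Substitute {{ name }} placeholders (a stand-in for Jinja2)."""
    for name, value in context.items():
        template = template.replace("{{ " + name + " }}", value)
    return template


if __name__ == "__main__":
    context = template_context(datetime(2019, 1, 1, 1, 1))
    print(render("SELECT * FROM orders WHERE dt = '{{ ds }}'", context))
```

Because the date arrives through the template, re-running or backfilling the same task for another execution date needs no code change.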


Command

• Run in background
– webserver
– scheduler
– flower / worker
• Develop & Test
– list_tasks / list_dags
– serve_logs
– run / test
– trigger_dag
• Backfill
– A new metric needs historical data
– Re-run failed tasks
– Options
• --donot_pickle, --dry_run, --rerun_failed_tasks, --ignore_dependencies
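
A backfill invocation can be assembled as below. This sketch assumes the Airflow 1.10-style CLI (`airflow backfill -s START -e END dag_id`); `backfill_command` is a hypothetical helper and `sales_insight` a made-up DAG id. It only builds the command and does not run it.

```python
import shlex


def backfill_command(dag_id, start_date, end_date,
                     dry_run=False, rerun_failed=False):
    """Build the argument list for an `airflow backfill` call."""
    cmd = ["airflow", "backfill", "-s", start_date, "-e", end_date]
    if dry_run:
        cmd.append("--dry_run")             # show what would run, run nothing
    if rerun_failed:
        cmd.append("--rerun_failed_tasks")  # retry failed tasks in the range
    cmd.append(dag_id)
    return cmd


if __name__ == "__main__":
    cmd = backfill_command("sales_insight", "2019-01-01", "2019-01-07",
                           dry_run=True)
    # Quote each argument so the line can be pasted into a shell.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Starting with `--dry_run` is a cheap way to confirm the date range before filling in historical data for real.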
Security

• Supported authentication
– Password
– LDAP
– Custom Auth
– Kerberos
– OAuth
• GitHub Enterprise Auth
• Google OAuth
• By default
– All access is open
• Secure access via SSL (https)
• REF
– http://tinyurl.com/y8nkzvsy
Reference

● Developing elegant workflows in Python code with Apache Airflow (2017 PyCon@Euro)
○ https://www.youtube.com/watch?v=XJf-f56JbFM&t=1257
● Modern ETL-ing with Python and Airflow (and Spark) (2017 PyCon@De)
○ https://www.youtube.com/watch?v=tcJhSaowzUI
● Apache Airflow in Production: A Fictional Example
○ https://www.youtube.com/watch?v=iTg-a4icf_I
● A Practical Introduction to Airflow (2016 PyData@SF)
○ https://www.youtube.com/watch?v=cHATHSB_450
Q/A
