Airflow Introduction
ETL Hell
● How to control it?
○ Complicated and implicit
● How to schedule it?
○ Time-driven and event-driven
● How to deal with failure?
○ Retry, notify, log
REF: http://tinyurl.com/yclq3slz
Why Airflow?
Considerations
○ General purpose
○ Backfilling
○ Rich UI
○ Flow definition
We chose Airflow
REF: http://tinyurl.com/ycadwhta
About Airflow
Workflow
DagRun
Task
Operator
TaskInstance
REF: http://tinyurl.com/ydxyuscd
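The relationship between these terms can be sketched with a few plain dataclasses. This is a conceptual model only, not Airflow's own classes: the class fields, the "etl" DAG name, and the task names are all hypothetical. The point it illustrates is that a DagRun is one run of the workflow for an execution date, and a TaskInstance is one task inside one DagRun.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Task:
    """A node in the workflow: an operator configured with a task_id."""
    task_id: str
    operator: str  # name of the operator type, e.g. "BashOperator"


@dataclass(frozen=True)
class DagRun:
    """One run of the whole workflow for a given execution date."""
    dag_id: str
    execution_date: datetime


@dataclass(frozen=True)
class TaskInstance:
    """One task inside one DagRun: the unit that actually executes."""
    task: Task
    dag_run: DagRun


# Hypothetical two-task workflow, run for two execution dates.
tasks = [Task("extract", "BashOperator"), Task("load", "PythonOperator")]
runs = [DagRun("etl", datetime(2019, 1, day)) for day in (1, 2)]

# Every (task, run) pair is a separate TaskInstance with its own state.
instances = [TaskInstance(task, run) for run in runs for task in tasks]
print(len(instances))  # 2 tasks x 2 runs
```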
Workflow (DAG)
● Operator
○ Execute
■ Bash, SSH, Python
○ DB
■ MySQL, MsSQL, Postgres, Presto, Redshift, Hive
○ WebService
■ HTTP, Email, Slack, S3
Workflow (DAG)
● Sensor
○ FileSensor, Database, SageMaker, Redshift, IMAP, BigQuery, Cassandra, EMR, FTP, SFTP, HDFS, Jira, Azure, ...
● Flow control
○ BranchPythonOperator
■ Pass data between tasks => XCom
Workflow (DAG)
• Skip
– Raise AirflowSkipException
• parallelism
• dag_concurrency
• max_active_runs_per_dag
• non_pooled_task_slot_count
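The four concurrency settings above live in the `[core]` section of `airflow.cfg` in the Airflow 1.10 line. As a rough sketch, the snippet below parses a sample fragment with the standard library so the limits can be inspected from Python. The numeric values are illustrative only and should be checked against the configuration shipped with your version.

```python
import configparser

# Illustrative values only; check the airflow.cfg shipped with your version.
SAMPLE_CFG = """
[core]
# Max task instances running at once across the whole installation
parallelism = 32
# Max task instances running at once within a single DAG
dag_concurrency = 16
# Max active DagRuns per DAG
max_active_runs_per_dag = 16
# Slots available to tasks that are not assigned to a pool
non_pooled_task_slot_count = 128
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CFG)
core = parser["core"]

# Collect every setting in the section as an integer.
limits = {name: core.getint(name) for name in core}
for name, value in limits.items():
    print(f"{name} = {value}")
```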
Architecture (Multi-Node)
DAG, Metadata
REF: http://tinyurl.com/y6uzjcsg
UI
DEMO
http://longfei.leanote.com/post/airflow-operations
Scheduler
● start_date: 2019-01-01
● schedule_interval: 1 1 * * *
Variable    Description
{{ ts }}    execution_date.isoformat()
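The scheduling example above can be worked through with plain `datetime` arithmetic. This is a sketch of Airflow's interval-based scheduling as commonly described, not the scheduler's own code: the execution date labels the start of an interval, and the run is triggered once that interval has ended. The UTC timezone is an assumption.

```python
from datetime import datetime, timedelta, timezone

start_date = datetime(2019, 1, 1, tzinfo=timezone.utc)  # UTC is an assumption

# "1 1 * * *" means minute 1, hour 1, every day: a daily tick at 01:01.
first_tick = start_date.replace(hour=1, minute=1)
if first_tick < start_date:
    first_tick += timedelta(days=1)

# The execution date labels the start of the schedule interval.
execution_date = first_tick
# The run itself starts only after the interval it covers has ended.
triggered_at = execution_date + timedelta(days=1)

# The slide describes {{ ts }} as execution_date.isoformat().
ts = execution_date.isoformat()
print(ts)
print(triggered_at.isoformat())
```

Under these assumptions the first run carries an execution date of 2019-01-01 01:01 but does not start until 2019-01-02 01:01.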
• Run in background
– webserver
– scheduler
– flower / worker
• Develop & Test
– list_tasks / list_dags
– serve_logs
– run / test
– trigger_dag
• Backfill
– A new metric needs historical data
– Re-run failed tasks
– Options
• --donot_pickle, --dry_run, --rerun_failed_tasks, --ignore_dependencies
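A backfill fills in one run per schedule interval over a date range, for example with `airflow backfill -s 2019-01-01 -e 2019-01-07 <dag_id>` in the 1.10 CLI. The sketch below only enumerates the execution dates such a range would cover, assuming a plain daily schedule and inclusive bounds. The helper name `daily_execution_dates` is hypothetical, and the exact boundary handling depends on the DAG's schedule.

```python
from datetime import date, timedelta


def daily_execution_dates(start, end):
    """Return every daily execution date from start to end, inclusive."""
    days = (end - start).days
    return [start + timedelta(days=offset) for offset in range(days + 1)]


# Hypothetical range: one week of history for a newly added metric.
dates = daily_execution_dates(date(2019, 1, 1), date(2019, 1, 7))
for execution_date in dates:
    print(execution_date.isoformat())
```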
Security
• Support
– Password
– LDAP
– Custom Auth
– Kerberos
– OAuth
• GitHub Enterprise Auth
• Google OAuth
• By default
– All access is open
• Secure access via SSL (https)
• REF
– http://tinyurl.com/y8nkzvsy
Reference