
AIRFLOW

➢ Apache Airflow is an open-source platform used for orchestrating complex workflows and data processing pipelines. It enables users to schedule, monitor, and manage workflows programmatically through Python scripts. Airflow provides a rich set of features for workflow management, including task dependencies, dynamic scheduling, extensibility, and monitoring capabilities.

Key Concepts:

➢ DAGs (Directed Acyclic Graphs):
● A DAG is a collection of tasks arranged in a specific order with dependencies between them.
● Tasks represent individual units of work, such as running a Python script, executing an SQL query, or transferring files.
● DAGs define the workflow logic and dependencies between tasks (a minimal sketch follows this list).
➢ Operators:
● Operators define the type of task to be executed within a DAG.
● Airflow provides built-in operators for common tasks, such as BashOperator, PythonOperator, and database-specific SQL operators.
● Custom operators can be developed to extend Airflow's functionality (a custom-operator sketch also follows this list).
➢ Scheduler and Executors:
● The Airflow scheduler decides when tasks should run; executors determine how and where they actually run.
● The CeleryExecutor distributes task execution across multiple worker nodes.
● Other options include the SequentialExecutor (the default in a basic installation), the LocalExecutor for parallel execution on a single machine, and the KubernetesExecutor for running tasks on Kubernetes clusters.
➢ Web Interface:
● Airflow comes with a web-based user interface for visualizing and monitoring workflows.
● Users can view DAGs, task statuses, and execution logs, and trigger or pause DAGs through the web interface.
➢ Plugins:
● Airflow supports a plugin architecture to extend its capabilities.
● Plugins can be used to add custom operators, hooks, sensors, and other components to Airflow.
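As a minimal sketch of how a DAG and its operators fit together (the DAG id, schedule, and Python callable below are illustrative, not taken from the example later in this document), two tasks can be chained with the >> dependency operator:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def print_greeting():
    # Plain Python function executed by the PythonOperator.
    print("Hello from Airflow!")


with DAG('example_minimal_dag',          # illustrative DAG id
         start_date=datetime(2022, 7, 7),
         schedule_interval='@daily',
         catchup=False) as dag:

    extract = BashOperator(task_id='extract',
                           bash_command="echo 'extracting data'")

    greet = PythonOperator(task_id='greet',
                           python_callable=print_greeting)

    # extract must finish successfully before greet runs.
    extract >> greet

Dropping a file like this into the configured DAGs folder is enough for the scheduler to pick it up and for it to appear in the web interface.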
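For the custom operators mentioned above, a bare-bones extension subclasses BaseOperator and implements execute(); the class name and message parameter here are made up for illustration:

from airflow.models.baseoperator import BaseOperator


class GreetingOperator(BaseOperator):
    # Illustrative custom operator that logs a configurable message.

    def __init__(self, message='hello', **kwargs):
        super().__init__(**kwargs)
        self.message = message

    def execute(self, context):
        # execute() is called on the worker when the task instance runs.
        self.log.info('GreetingOperator says: %s', self.message)
        return self.message

In Airflow 2 such a class only needs to live in an importable module (or be shipped through a plugin) to be used inside a DAG like any built-in operator, for example GreetingOperator(task_id='greet', message='hi').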

DAGs (Directed Acyclic Graphs):

Introduction:

In Apache Airflow, Directed Acyclic Graphs (DAGs) represent a collection of tasks with
dependencies and a defined schedule. DAGs are defined using Python scripts and are
the backbone of workflows managed by Airflow. Each DAG describes how tasks are
structured and how they relate to each other.

Example:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
# Import paths for provider packages may differ slightly across Airflow/provider versions.
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.apache.druid.operators.druid import DruidOperator

# Project-specific helper module; healthcheck_uuid is likewise assumed to be
# defined or imported elsewhere in the project.
import notification

default_args = {
    'owner': 'Gaurav',
    'start_date': datetime(2022, 7, 7),
    'sla': timedelta(minutes=5),
    'params': {
        'endpoint': healthcheck_uuid
    }
}

mysql_params = {
    'mysql_conf_file': '/opt/airflow/mysql.cnf',
    'conn_suffix': 'db-zenu'
}

with DAG('age_of_listings_count_daily',
         default_args=default_args,
         schedule_interval='1 0 * * *',          # daily at 00:01
         template_searchpath=['/opt/airflow/templates'],
         catchup=False,
         concurrency=1) as dag:

    # Ping the health-check endpoint when the DAG run starts.
    t0 = notification.dag_start_healthcheck_notify("start")

    # Dump the daily data with a Bash script found on the template search path.
    t1 = BashOperator(task_id='get_daily',
                      bash_command='druid_dump_age_of_listings_count.bash',
                      params=mysql_params,
                      retries=2, retry_delay=timedelta(seconds=15),
                      on_failure_callback=notification.task_fail_healthcheck_notify)

    # Wait for the dumped .csv.gz file to appear in S3.
    t2 = S3KeySensor(task_id='check_daily',
                     bucket_key="{{ var.json.s3_age_of_listings_count_data_path.s3_path }}/"
                                "{{ execution_date.format('YYYY-MM-DD') }}*.csv.gz",
                     bucket_name="{{ var.json.s3_age_of_listings_count_data_path.bucket_name }}",
                     wildcard_match=True,
                     retries=2, retry_delay=timedelta(seconds=5),
                     on_failure_callback=notification.task_fail_healthcheck_notify)

    # Submit the Druid ingestion spec.
    t3 = DruidOperator(task_id='ingest_daily',
                       json_index_file='druid_ingest_age_of_listings_count.json',
                       druid_ingest_conn_id='druid_ingest_conn',
                       on_failure_callback=notification.task_fail_healthcheck_notify)

    # Ping the health-check endpoint when the DAG run succeeds.
    tn = notification.dag_success_healthcheck_notify("end")

    t0 >> t1 >> t2 >> t3 >> tn
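Read in order, this DAG sends a start notification, dumps the daily data with a Bash script, waits for the resulting .csv.gz file to land in S3, ingests it into Druid, and finally sends a success notification; the >> chaining on the last line defines exactly that execution order.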
