Drivy has been using Airflow to orchestrate tasks for 2 years now. We thought it was the best tool on the market when we wanted to start digging into data. The purpose was to understand how well our features were performing. We didn’t really know how the data was going to be used, and by whom. We wanted something easy to use and set up. We set up everything on an ec2 instance. 75 workflows later, we wanted to upgrade our Airflow version and move from a local to a celeryExecutor mode. In a local mode there is only one worker (which is also the webserver and the scheduler). In the celeryExecutor, on the contrary, there are several workers which can execute tasks in parallel. Our number of DAGs is constantly growing and Celery mode is the best choice to handle this growth.
Airflow is Airbnb’s baby. It is an open-source project which schedules DAGs. Dag stands for Directed Acyclic Graph. Basically, they are an organized collection of tasks. Thanks to Airflow’s nice UI, it is possible to look at how DAGs are currently doing and how they perform. If a DAG fails an email is sent with its logs. It can be manually re-triggered through the UI. Dags can combine lot of different types of tasks (bash, python, sql…) and interact with different datasources. Airflow is a really handy tool to transform and load data from a point A to a point B. You can check their documentation over here.
A simple Airflow DAG with several tasks:
An Airflow cluster has a number of daemons that work together : a webserver, a scheduler and one or several workers.
The airflow webserver accepts HTTP requests and allows the user to interact with it. It provides the ability to act on the DAG status (pause, unpause, trigger). When the webserver is started, it starts gunicorn workers to handle different requests in parallel.
The Airflow scheduler monitors DAGs. It triggers the task instances whose dependencies have been met. It monitors and stays in synchronisation with a folder for all DAG objects, and periodically inspects tasks to see if they can be triggered.
Airflow workers are daemons that actually execute the logic of tasks. They manage one to many CeleryD processes to execute the desired tasks of a particular DAG.
Airflow daemons don’t need to register with each other and don’t need to know about each other. They all take care of a specific task and when they are all running, everything works as expected. The scheduler periodically polls to see if any DAGs which are registered need to be executed. If a specific DAG needs to be triggered, then the scheduler creates a new DagRun instance in the Metastore and starts to trigger the individual tasks in the DAG. The scheduler will do that by pushing messages into the queuing service. A message contains information about the task to execute (DAG_id, task_id..) and what function needs to be performed. In some cases, the user will interact with the web server. He can manually trigger a DAG to be ran. A DAGRun is created and the scheduler will start trigger individual tasks the same way as described before. Celeryd processes, controlled by workers, periodically pull from the queuing service. When a celeryd process pulls a task message, it updates the task instance in the metastore to a running state and begins executing the code provided. When the task ends (in a success or fail state) it updates the state of the task.
In a single-node architecture all components are on the same node. To use a single node architecture, Airflow has to be configured with the LocalExecutor mode.
The single-node architecture is widely used by the users in case they have a moderate amount of DAGs. In this mode, the worker pulls tasks to run from an IPC (Inter Process Communication) queue. This mode doesn’t any need external dependencies. It scales up well until all resources on the server are used. This solution works pretty well. However, to scale out to multiple servers, the Celery executor mode has to be used. Celery executor uses Celery (and a message-queuing server) to distribute the load on a pool of workers.
In a multi node architecture daemons are spread in different machines. We decided to colocate the webserver and the scheduler. To use this architecture, Airflow has to be configure with the Celery Executor mode.
In this mode, a Celery backend has to be set (Redis in our case). Celery is an asynchronous queue based on distributed message passing. Airflow uses it to execute several tasks concurrently on several workers server using multiprocessing. This mode allows to scale up the Airflow cluster really easily by adding new workers.
Multi-node architecture provides several benefits :
In the next episode, we’ll look at how we automate Airflow cluster deployment :).