Basic Airflow Configuration for a Production Environment

Airflow remains one of the most widely used tools for job scheduling and monitoring. Its ability to scale well and its flexibility to run many kinds of jobs are the main reasons it has become so popular among similar open-source products. In this article, we will discuss how to configure Airflow so that it runs well in a production environment. After following the entire article, we will be able to:

  • Run Airflow using LocalExecutor
  • Apply authentication to access the Airflow dashboard

In the previous article, we discussed how to deploy Airflow on Docker using the puckel/docker-airflow image. We will continue from those steps and apply some basic configuration to the deployed Docker image.

The currently running Docker container (after following the steps in the previous article) uses the default SequentialExecutor. As stated in the official documentation,

This executor will only run one task instance at a time, can be used for debugging.

This is not the behavior we want, since we need the scheduler to be able to process multiple tasks at the same time. So, we need to change the executor type. We will use LocalExecutor (assuming we only have one machine; otherwise, CeleryExecutor would be preferred).

LocalExecutor executes tasks locally in parallel. It uses the multiprocessing Python library and queues to parallelize the execution of tasks.

By default, Airflow runs on SQLite when using SequentialExecutor. However, SQLite doesn't support multiple connections, so we need to deploy another database for Airflow to store its metadata. We will use PostgreSQL as the alternative, via the postgres image that can be pulled from Docker Hub.

Before running the container, we have to specify the user that will access this database. We store this configuration in a postgres.env file containing lines like the ones below.
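
A minimal sketch of postgres.env; these variable names are the ones the official postgres image reads, while the credentials and database name are example values you should replace with your own:

    POSTGRES_USER=airflow
    POSTGRES_PASSWORD=airflow
    POSTGRES_DB=airflow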

We can then run the container, passing postgres.env as the environment file. We also want to give this container an identifier, since it will be referenced by the Airflow container in the next step, so we pass a container name to the --name option.
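
Something like the following should work; the container name postgres-airflow is only an example:

    docker run -d --name postgres-airflow --env-file postgres.env postgres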

The database container is ready. Now we want to connect the Airflow container to the running database container. To do that, we have to specify some configuration for the Airflow container. Airflow stores all of its configuration in the airflow.cfg file inside the container. However, as mentioned in the documentation of puckel/docker-airflow, we can override those configurations through environment variables named in the AIRFLOW__<section>__<key> format. We store those environment variables in an airflow.env file.
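
As a sketch, assuming the credentials and container name used above, airflow.env can point Airflow's metadata database at the PostgreSQL container like this:

    AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres-airflow:5432/airflow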

Since we want to run with LocalExecutor, we also need to specify the executor type by appending another environment variable to airflow.env.
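
This is the [core] executor setting expressed in the same environment-variable format:

    AIRFLOW__CORE__EXECUTOR=LocalExecutor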

Once the connection is configured, we need to initialize the database by creating the tables that Airflow needs. Fortunately, this step is easy, as Airflow already includes the setup command. The only thing we have to run is the airflow initdb command.
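
One way to do this (a sketch under the assumptions above; the exact invocation may differ in your setup) is to run the command in a one-off container that loads the same airflow.env and can reach the database container:

    docker run --rm --env-file airflow.env \
        --link postgres-airflow:postgres-airflow \
        puckel/docker-airflow airflow initdb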

We can check the result by connecting to the database container and seeing whether the tables have been created in the airflow database.
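
For example, assuming the container, user, and database names used above, psql's \dt command lists the tables:

    docker exec -it postgres-airflow psql -U airflow -d airflow -c '\dt'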

After successfully connecting the Airflow container to the PostgreSQL container and initializing the tables in the database, we are finally able to run the Airflow container. (Optional: fill in the --name option for the container.)
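
A sketch of the command, again assuming the container names used above; webserver is the command the puckel/docker-airflow image uses to start the web UI:

    docker run -d --name airflow \
        --env-file airflow.env \
        --link postgres-airflow:postgres-airflow \
        -p 8080:8080 \
        puckel/docker-airflow webserver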

If everything runs smoothly, you can access the dashboard on port 8080 of your machine. Try running some DAGs (for debugging, you may want to load the example DAGs by passing LOAD_EX=y through the -e option when running the Airflow container). You will see that Airflow can now run tasks in parallel.

So, running jobs in parallel is no longer an issue. However, a problem remains: we still allow everybody to access the dashboard without any authentication. “Everyone is the admin” is obviously not something we want in our system. We want to give a role to everybody who uses the dashboard so that they can use it accordingly.

To do that, we will configure an authentication mechanism for everyone who wants to log in to our Airflow dashboard. Airflow already ships with an authentication mechanism, so our job is only to make it work in our container. We need to add some lines to the airflow.env file.
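
As a sketch, assuming the Airflow 1.10-era password authentication backend, the added lines could look like the following (the Fernet key value is a placeholder you generate yourself):

    AIRFLOW__WEBSERVER__AUTHENTICATE=True
    AIRFLOW__WEBSERVER__AUTH_BACKEND=airflow.contrib.auth.backends.password_auth
    AIRFLOW__CORE__FERNET_KEY=<your-generated-fernet-key>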

By the way, if you don’t have a Fernet key yet, you can generate one using the Python snippet below.
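
A minimal snippet (it requires the cryptography package):

    from cryptography.fernet import Fernet

    # Print a freshly generated, URL-safe base64-encoded key
    print(Fernet.generate_key().decode())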

Then, we need to recreate the running Airflow container (remove it and run it again) so that it picks up the updated airflow.env file; a plain restart will not reload the environment file. If everything goes well, we will see a login page when trying to open the Airflow dashboard.

Airflow doesn’t provide an option to specify a first user when deploying the container. As a workaround, we can jump into the container and manually add the user from a Python console inside the container.
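
For example, assuming the Airflow container is named airflow as above:

    docker exec -it airflow python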

Specify the desired username and password in the script below, and then run it in the opened Python console.
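
Here is a sketch of such a script, assuming the password_auth backend configured earlier; the username, email, and password are placeholders to replace with your own values:

    # Create the first (super-admin) user for the password_auth backend
    import airflow
    from airflow import models, settings
    from airflow.contrib.auth.backends.password_auth import PasswordUser

    user = PasswordUser(models.User())
    user.username = 'admin'                 # placeholder username
    user.email = 'admin@example.com'        # placeholder email
    user.password = 'change-this-password'  # placeholder password
    user.superuser = True                   # grant super-admin rights

    session = settings.Session()
    session.add(user)
    session.commit()
    session.close()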

This user has the super-admin role and can be used to create other users with specific roles from inside the Airflow dashboard. When we log in as the super-admin user, the Airflow dashboard looks like the image below.

We have now successfully applied LocalExecutor to our running Airflow Docker container so that we can run jobs in parallel. We have also configured an authentication mechanism and role-based credentials for the Airflow dashboard. This should be a good starting point for running Airflow in a production environment. There are lots of other configurations to explore, so stay curious :)
