Airflow Basic Configuration for Production Environment

Airflow remains one of the most widely used job schedulers and monitoring tools. Its ability to scale well and its flexibility to run many types of jobs are the main reasons it has become so popular among similar open-source products. In this article, we will discuss how to configure Airflow so that it runs well in a production environment. After following the entire article, we will be able to:

  • Run Airflow using the LocalExecutor
  • Apply authentication to access the Airflow dashboard

In the previous article, we discussed how to deploy Airflow on Docker using the puckel/docker-airflow image. We will continue from those steps and apply some basic configuration to the deployed Docker image.

The currently running Docker container (after following the steps in the previous article) uses the default SequentialExecutor. As stated in the official documentation, the SequentialExecutor runs only one task instance at a time.

This kind of behavior is not what we expect, since we need the scheduler to be able to process multiple jobs at the same time. So, we need to change the executor type. We will use the LocalExecutor (assuming we only have one machine; otherwise, the CeleryExecutor would be preferable).

By default, Airflow uses SQLite when running with the SequentialExecutor. However, SQLite doesn't support multiple connections, so we need to deploy another database for Airflow to store its metadata. We will use a PostgreSQL database as the alternative, with the postgres image that can be pulled from Docker Hub.

docker pull postgres:latest

Before running the container, we have to specify the user that will use this database. We store this configuration in a postgres.env file containing the lines below.

POSTGRES_USER=airflow_user
POSTGRES_PASSWORD=airflow_pass
POSTGRES_DB=airflow

We can then run the container, passing postgres.env as the environment file. We also want to give this container an identifier, as it will be referenced by the Airflow container in the next step, so we pass a container name to the --name option.

docker run -d --env-file postgres.env --name postgres_airflow postgres
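
A quick sanity check that the database container is up (using the container name chosen above):

docker ps --filter name=postgres_airflow
docker logs postgres_airflow    # typically ends with "database system is ready to accept connections"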

The database container is ready. Now, we want to connect the Airflow container to the running database container. To do that, we have to specify some configuration for the Airflow container. Airflow stores all of its configuration in the airflow.cfg file inside the container. However, as mentioned in the documentation of puckel/docker-airflow, we can override those configurations with environment variables named in the AIRFLOW__<section>__<key> format; for example, sql_alchemy_conn under the [core] section becomes AIRFLOW__CORE__SQL_ALCHEMY_CONN. We store those environment variables in an airflow.env file.

AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow_user:airflow_pass@postgres_airflow:5432/airflow
POSTGRES_USER=airflow_user
POSTGRES_PASSWORD=airflow_pass
POSTGRES_HOST=postgres_airflow
POSTGRES_PORT=5432
POSTGRES_DB=airflow

As we want to run using the LocalExecutor, we also need to specify the executor type by appending the following environment variable to airflow.env.

AIRFLOW__CORE__EXECUTOR=LocalExecutor

Once the connection is configured, we need to initialize the database by creating the tables needed by Airflow. Fortunately, this step is easy because Airflow already includes the setup function; the only thing we have to run is the airflow initdb command.

docker run --rm --link postgres_airflow:postgres_airflow --env-file airflow.env puckel/docker-airflow airflow initdb

We can check the result by connecting to the database container and seeing whether the tables have been created in the airflow database.
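
One way to get a psql prompt is to exec into the database container (using the container name and credentials defined in postgres.env above) and then list the tables with \dt:

docker exec -ti postgres_airflow psql -U airflow_user -d airflow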

airflow_user=# \c airflow
You are now connected to database "airflow" as user "airflow_user".
airflow=# \dt
                  List of relations
 Schema |          Name           | Type  |  Owner
--------+-------------------------+-------+----------
 public | ab_permission           | table | postgres
 public | ab_permission_view      | table | postgres
 public | ab_permission_view_role | table | postgres
 public | ab_register_user        | table | postgres
 public | ab_role                 | table | postgres
 public | ab_user                 | table | postgres
 public | ab_user_role            | table | postgres
 public | ab_view_menu            | table | postgres
 public | alembic_version         | table | postgres
 public | chart                   | table | postgres
 public | connection              | table | postgres
 public | dag                     | table | postgres
 public | dag_pickle              | table | postgres
 public | dag_run                 | table | postgres
 public | import_error            | table | postgres
 public | job                     | table | postgres
 public | known_event             | table | postgres
 public | known_event_type        | table | postgres
 public | kube_resource_version   | table | postgres
 public | kube_worker_uuid        | table | postgres
 public | log                     | table | postgres
 public | sla_miss                | table | postgres
 public | slot_pool               | table | postgres
 public | task_fail               | table | postgres
 public | task_instance           | table | postgres
 public | task_reschedule         | table | postgres
 public | users                   | table | postgres
 public | variable                | table | postgres
 public | xcom                    | table | postgres
(29 rows)

After successfully connecting the Airflow container to the PostgreSQL container and initializing the tables in the database, we are finally able to run the Airflow container. (Optional: fill in the --name option for the container; we use airflow_server here since later commands refer to it.)

docker run -d --link postgres_airflow:postgres_airflow -p 8080:8080 --env-file airflow.env --name airflow_server puckel/docker-airflow webserver
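
Before opening the browser, we can check that the webserver started correctly (assuming the container was named airflow_server as above):

docker logs -f airflow_server      # watch the container start up
curl -I http://localhost:8080      # should return an HTTP response once the webserver is ready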

If everything runs smoothly, you can access the dashboard on port 8080 of your machine. Try to run some DAGs (for debugging, you may want to load the example DAGs by specifying LOAD_EX=y with the -e option when running the Airflow container). You will see that Airflow can now run jobs in parallel.
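
If the example DAGs are loaded, one quick way to see tasks running in parallel is to unpause and trigger one of them from the Airflow CLI inside the container (example_bash_operator ships with the Airflow examples; the container name is the one chosen above):

docker exec -ti airflow_server airflow unpause example_bash_operator
docker exec -ti airflow_server airflow trigger_dag example_bash_operator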

So, running jobs in parallel is not an issue now. However, a problem remains because we still allow everybody to access the dashboard without any authentication. This is obviously not something that we want to have in our system. We want to give a role to everybody who uses the dashboard so that they can use it accordingly.

To do that, we will configure an authentication mechanism for everyone who wants to log in to our Airflow dashboard. Airflow already has an authentication mechanism, so our job is only to make it work in our container. We need to add the following lines to the airflow.env file.

AIRFLOW__WEBSERVER__RBAC=True
AIRFLOW__CORE__FERNET_KEY=pZcwcoB8RQfjtE9n0Du5Weu8zLKoFphKkiGDBihOwcM=
AIRFLOW__WEBSERVER__AUTHENTICATE=True
AIRFLOW__WEBSERVER__AUTH_BACKEND=airflow.contrib.auth.backends.password_auth

By the way, if you don't have a Fernet key yet, you can generate one using the Python snippet below.

>>> from cryptography import fernet
>>> fernet.Fernet.generate_key()
b'pZcwcoB8RQfjtE9n0Du5Weu8zLKoFphKkiGDBihOwcM='
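
If the cryptography package is not installed on your machine, the same one-liner can also be run inside the already-running Airflow container (this assumes the image ships with cryptography, which puckel/docker-airflow should, since it installs the crypto extra):

docker exec -ti airflow_server python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"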

Then, we need to restart our running container using the updated airflow.env file (one way to do this is shown below). If everything goes well, we will see a login page when trying to open the Airflow dashboard.
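
A simple way to restart is to remove the old container and run it again with the same options as before:

docker stop airflow_server && docker rm airflow_server
docker run -d --link postgres_airflow:postgres_airflow -p 8080:8080 --env-file airflow.env --name airflow_server puckel/docker-airflow webserver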

Airflow doesn't give us an option to create an initial user when deploying the container. As a workaround, we can jump into the container and manually add the user from a Python console inside the container.

docker exec -ti airflow_server python

Specify the desired user and password in the script below, and then run it in the opened Python console.

>>> import airflow
>>> from airflow import models, settings
>>> from airflow.contrib.auth.backends.password_auth import PasswordUser
>>> user = PasswordUser(models.User())
>>> user.username = 'my_user'
>>> user.email = 'my_user@company.com'
>>> user.password = 'my_user_password'
>>> session = settings.Session()
>>> session.add(user)
>>> session.commit()
>>> session.close()
>>> exit()
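
Alternatively, since RBAC is enabled in airflow.env above, the first admin user can also be created with the airflow create_user CLI (present in Airflow 1.10; the username, email, and password below are placeholders to replace with your own):

docker exec -ti airflow_server airflow create_user -r Admin -u my_user -e my_user@company.com -f My -l User -p my_user_password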

This user will have the super-admin role and can be used to create other users with specific roles from inside the Airflow dashboard. The Airflow dashboard, after logging in with the super-admin user, looks like the image below.

We have now successfully applied the LocalExecutor to our running Airflow Docker container so that we can run jobs in parallel. We have also configured an authentication mechanism and role-based credentials for the Airflow dashboard. This should be a good starting point for using Airflow in a production environment. There are lots of other configurations that can be explored, so stay curious :)
