Airflow Basic Configuration for Production Environment
Airflow remains one of the most widely used job scheduling and monitoring tools. Its ability to scale and its flexibility to run many types of jobs are the main reasons it is more popular than similar open-source products. In this article, we will discuss how to configure Airflow so that it runs well in a production environment. After following the entire article, we will be able to:
- Run Airflow using LocalExecutor
- Apply authentication to access the Airflow dashboard
In the previous article, we discussed how to deploy Airflow on Docker using the puckel/docker-airflow image. We will continue from those steps and apply some basic configuration to the deployed Docker image.
The currently running Docker container (after following the steps in the previous article) uses the default SequentialExecutor. As the official documentation says:
This executor will only run one task instance at a time, can be used for debugging.
This is not the behavior we want, since we need the scheduler to process multiple jobs at the same time, so we need to change the executor type. We will use LocalExecutor (assuming we only have one machine; otherwise CeleryExecutor would be preferable).
LocalExecutor executes tasks locally in parallel. It uses the multiprocessing Python library and queues to parallelize the execution of tasks.
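The pattern described in that quote can be sketched with the standard library alone. This is only a minimal illustration of worker processes pulling tasks from a queue, not Airflow's actual LocalExecutor code; the names `run_tasks` and `_worker` are hypothetical, and the `fork` start method assumes a Unix-like host such as the Docker container.

```python
import multiprocessing as mp


def _worker(task_queue, result_queue):
    """Pull task names until a None sentinel arrives and report each one."""
    while True:
        task = task_queue.get()
        if task is None:  # sentinel: no more work for this worker
            break
        # A real executor would run the task here; we only record who took it.
        result_queue.put((task, mp.current_process().name))


def run_tasks(task_names, parallelism=2):
    """Run the given tasks on `parallelism` worker processes."""
    ctx = mp.get_context("fork")  # assumption: Unix-like host
    task_queue, result_queue = ctx.Queue(), ctx.Queue()
    workers = [
        ctx.Process(target=_worker, args=(task_queue, result_queue))
        for _ in range(parallelism)
    ]
    for worker in workers:
        worker.start()
    for name in task_names:
        task_queue.put(name)
    for _ in workers:
        task_queue.put(None)  # one sentinel per worker
    # Collect results before joining so no worker blocks on a full queue.
    results = [result_queue.get() for _ in task_names]
    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    for task, worker_name in run_tasks(["extract", "transform", "load"]):
        print(f"{task} handled by {worker_name}")
```

With `parallelism=1` the same code degrades to one task at a time, which is roughly the behavior of SequentialExecutor.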
By default, Airflow runs on SQLite when using SequentialExecutor. However, SQLite is not suited to the multiple concurrent connections that a parallel executor needs, so we have to deploy another database for Airflow to store its metadata. We use PostgreSQL as one alternative, through the postgres image that can be pulled from Docker Hub.
docker pull postgres:latest
Before running the container, we have to specify the user that will use this database. We store this configuration in a postgres.env file.
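As a sketch of what such a file might contain, the snippet below renders one using the environment variables read by the official postgres image (POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB). The user and database names are taken from the psql session shown later in this article; the password is a placeholder, and the helper `render_env` is hypothetical.

```python
def render_env(settings):
    """Render a dict as KEY=value lines, the format `docker run --env-file` reads."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())


postgres_env = render_env({
    "POSTGRES_USER": "airflow_user",
    "POSTGRES_PASSWORD": "change_me",  # placeholder: pick your own secret
    "POSTGRES_DB": "airflow",          # assumption: database used by Airflow
})
print(postgres_env, end="")
```

Saving the printed lines as postgres.env gives a file that can be passed to the container with the --env-file option.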
We can then run the container, passing postgres.env as the environment file. We also want to give this container an identifier, since the Airflow container will reference it in the next step, so we pass a container name with the --name option.
docker run -d --env-file postgres.env --name postgres_airflow postgres
The database container is ready. Now we want to connect the Airflow container to the running database container. To do that, we have to specify some configuration for the Airflow container. Airflow stores all of its configuration in the airflow.cfg file inside the container. However, as mentioned in the puckel/docker-airflow documentation, we can also set these options through environment variables named in the AIRFLOW__<section>__<key> format. We store those environment variables in the airflow.env file.
Since we want to run with LocalExecutor, we also need to specify the executor type by appending some other environment variables to the airflow.env file.
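To illustrate the naming scheme, here is a small sketch that derives the variable names and renders an airflow.env for the database connection and the executor. It assumes the `core` options `sql_alchemy_conn` and `executor` as they appear in an Airflow 1.10 airflow.cfg, the container name postgres_airflow as the database host, PostgreSQL's default port 5432, and placeholder credentials that must match postgres.env. The helper names are hypothetical.

```python
from urllib.parse import quote_plus


def airflow_env_name(section, key):
    """Map an airflow.cfg option to its AIRFLOW__<section>__<key> variable."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


def postgres_conn(user, password, host, database, port=5432):
    """Build a SQLAlchemy URL; the password is percent-encoded for safety."""
    return (
        f"postgresql+psycopg2://{user}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


settings = {
    # Placeholder credentials: keep them in sync with postgres.env.
    ("core", "sql_alchemy_conn"): postgres_conn(
        "airflow_user", "change_me", "postgres_airflow", "airflow"
    ),
    ("core", "executor"): "LocalExecutor",
}
airflow_env = "".join(
    f"{airflow_env_name(section, key)}={value}\n"
    for (section, key), value in settings.items()
)
print(airflow_env, end="")
```

The host part of the URL works because the --link option used below makes the database container reachable under the name postgres_airflow.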
Once the connection is configured, we need to initialize the database by creating the tables Airflow needs. Fortunately, this step is easy because Airflow already includes a setup command. All we have to run is the airflow initdb command.
docker run --rm --link postgres_airflow:postgres_airflow --env-file airflow.env puckel/docker-airflow airflow initdb
We can check the result by connecting to the database container and checking whether the tables have been created in the airflow database.
airflow_user=# \c airflow
You are now connected to database "airflow" as user "airflow_user".
airflow=# \dt
List of relations
Schema | Name | Type | Owner
public | ab_permission | table | postgres
public | ab_permission_view | table | postgres
public | ab_permission_view_role | table | postgres
public | ab_register_user | table | postgres
public | ab_role | table | postgres
public | ab_user | table | postgres
public | ab_user_role | table | postgres
public | ab_view_menu | table | postgres
public | alembic_version | table | postgres
public | chart | table | postgres
public | connection | table | postgres
public | dag | table | postgres
public | dag_pickle | table | postgres
public | dag_run | table | postgres
public | import_error | table | postgres
public | job | table | postgres
public | known_event | table | postgres
public | known_event_type | table | postgres
public | kube_resource_version | table | postgres
public | kube_worker_uuid | table | postgres
public | log | table | postgres
public | sla_miss | table | postgres
public | slot_pool | table | postgres
public | task_fail | table | postgres
public | task_instance | table | postgres
public | task_reschedule | table | postgres
public | users | table | postgres
public | variable | table | postgres
public | xcom | table | postgres
After connecting the Airflow container to the PostgreSQL container and initializing the tables in the database, we are finally able to run the Airflow container. (Optionally, use the --name option to name the container.)
docker run -d --link postgres_airflow:postgres_airflow -p 8080:8080 --env-file airflow.env --name airflow_server puckel/docker-airflow webserver
If everything runs smoothly, you can access the dashboard on port 8080 of your machine. Try running some DAGs (for debugging, you may want to load the example DAGs by passing LOAD_EX=y with the -e option when running the Airflow container). You will see that Airflow can now run jobs in parallel.
So, running jobs in parallel is no longer an issue. However, a problem remains: we still allow everybody to access the dashboard without any authentication. “Everyone is the admin” is obviously not something we want in our system. We want to give a role to everyone who uses the dashboard so that they can use it accordingly.
To do that, we will configure an authentication mechanism for everyone who wants to log in to our Airflow dashboard. Airflow already has an authentication mechanism, so our job is only to make it work in our container. We need to add some lines to the airflow.env file.
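The exact lines are not preserved here, so the following is only a sketch of what they might look like. It assumes the password backend of Airflow 1.10 (the same module from which PasswordUser is imported later in this article) and the options `webserver.authenticate`, `webserver.auth_backend` and `core.fernet_key`. The helpers `is_valid_fernet_key` and `auth_env_lines` are hypothetical, and the demo key is generated with the standard library only to show the expected shape of a Fernet key (32 random bytes in URL-safe base64).

```python
import base64
import os


def is_valid_fernet_key(key):
    """Check that the key decodes from URL-safe base64 to exactly 32 bytes."""
    try:
        return len(base64.urlsafe_b64decode(key)) == 32
    except (ValueError, TypeError):
        return False


def auth_env_lines(fernet_key):
    """Render the authentication settings as airflow.env lines."""
    if not is_valid_fernet_key(fernet_key):
        raise ValueError("fernet_key must be 32 URL-safe base64-encoded bytes")
    return (
        "AIRFLOW__WEBSERVER__AUTHENTICATE=True\n"
        "AIRFLOW__WEBSERVER__AUTH_BACKEND="
        "airflow.contrib.auth.backends.password_auth\n"
        f"AIRFLOW__CORE__FERNET_KEY={fernet_key}\n"
    )


# Demo key only: generate your own and keep it secret.
demo_key = base64.urlsafe_b64encode(os.urandom(32)).decode()
print(auth_env_lines(demo_key), end="")
```

Validating the key before writing it out catches a truncated or mistyped key early instead of at container start.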
By the way, if you don’t have a Fernet key yet, you can generate one using the Python snippet below.
>>> from cryptography import fernet
>>> fernet.Fernet.generate_key().decode()
Then we need to restart the running container with the updated airflow.env file. If everything goes well, we will see a login page when opening the Airflow dashboard.
Airflow doesn’t provide an option to specify a first user when deploying the container. As a workaround, we can jump into the container and add the user manually from a Python console inside it.
docker exec -ti airflow_server python
Specify the desired user and password in the script below, and then run it in the opened Python console.
>>> import airflow
>>> from airflow import models, settings
>>> from airflow.contrib.auth.backends.password_auth import PasswordUser
>>> user = PasswordUser(models.User())
>>> user.username = 'my_user'
>>> user.email = 'firstname.lastname@example.org'
>>> user.password = 'my_user_password'
>>> session = settings.Session()
>>> session.add(user)
>>> session.commit()
>>> session.close()
This user has the super-admin role and can be used to create other users with specific roles from inside the Airflow dashboard. Logging in with this user opens the Airflow dashboard.
We have now applied LocalExecutor to our running Airflow Docker container so that we can run jobs in parallel. We have also configured an authentication mechanism and role-based credentials for the Airflow dashboard. This should be a good starting point for using Airflow in a production environment. There are many other configuration options to explore, so stay curious :)