Airflow is a widely used workflow management system. It was originally developed by Airbnb to manage the data pipelines around its fast-growing data sets, and it is now incubating at the Apache Software Foundation after Airbnb decided to open source it.
Airflow makes your work easier by letting you programmatically schedule, manage, and monitor your workflows. It also comes with additional features that are very useful in production, such as sending an alert when a job ends with an error. It is also very handy because a scheduled task can run a shell script, Python code, or even a Docker container.
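To illustrate that flexibility: workflows in Airflow are defined as Python files describing a DAG of tasks. The sketch below is a minimal, hypothetical example (the DAG id `hello_offline`, the schedule, and the command are all assumptions for illustration) using the standard `BashOperator` from the Airflow 1.10 series that the image used here ships with.

```python
# A minimal Airflow DAG definition (a sketch; the dag_id, schedule, and
# command are illustrative). Placing a file like this in the DAGs folder
# is enough for the scheduler to pick it up.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="hello_offline",          # hypothetical name
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",      # run once a day
)

# A task that simply runs a shell command.
hello = BashOperator(
    task_id="say_hello",
    bash_command="echo 'Hello from the offline server'",
    dag=dag,
)
```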
However, those benefits are meaningless unless we can install Airflow in the production environment. The trickiest part of production (especially in a critical, highly secured data center) is that the servers are not connected to the internet. One of the easiest solutions is to download all the packages required to install Airflow on an internet-connected device, then upload and install them on the offline servers. Unfortunately, there are so many required packages that this becomes a very cumbersome job. Here, Docker comes to help!
The first part of the job is to install Docker, both on your offline server and on your internet-connected device. You can use the official package from your OS repository. For security reasons, I suggest using the official package, especially if you run an enterprise operating system such as Red Hat Enterprise Linux, so that you can still get official support for your Docker service as well. However, if that is not suitable for you, you can use the community edition instead. We will not cover Docker installation in this article, as the official documentation already includes very good guidance.
After successfully installing Docker, we need to get a Docker image of Airflow. There are many Airflow images on the internet, so you can choose the one you prefer. In this article, we will use the
puckel/docker-airflow image, which can be pulled from Docker Hub.
On your internet-connected device, pull the Docker image by typing the command below.
docker pull puckel/docker-airflow
The image will be downloaded to your device shortly. Once the download completes, we want to move the image to the offline server. We can export the image to a file using this command.
docker save puckel/docker-airflow -o <export_file>
Then, we can copy the file to our offline server. Make sure you double-check the result of the copy, since we don’t want any missing or corrupted data in the file. You can use
md5sum for this. Once the file arrives safely on our offline server, we can load the Airflow image from the file by running
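The checksum verification can be sketched as follows. The filename airflow-image.tar is an assumption for illustration; substitute the name you passed to docker save.

```shell
# Stand-in for the exported image; in practice this file is the output
# of "docker save ... -o airflow-image.tar" (illustrative filename).
echo "image bytes" > airflow-image.tar

# On the internet-connected device: record the checksum of the export.
md5sum airflow-image.tar > airflow-image.tar.md5

# On the offline server, after copying both files over: -c re-computes
# the checksum and compares it against the recorded one.
md5sum -c airflow-image.tar.md5   # prints "airflow-image.tar: OK" if intact
```

If the check fails, copy the file again before loading it, since a truncated archive will not load cleanly.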
docker load -i <export_file>
You can check whether the image loaded successfully by running
docker images and seeing whether
puckel/docker-airflow appears in the image list. Once the image is loaded successfully, you can use it to start your Airflow service.
docker run -d -p 8080:8080 -e LOAD_EX=y puckel/docker-airflow webserver
If it runs successfully, the Airflow web UI will appear when you open port 8080 on your offline server.
There are some other configuration options for running the Airflow container, as described in the GitHub documentation of puckel/docker-airflow.
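For example, the image supports mounting a host folder as the DAGs directory, so your own workflow files show up in the UI. A sketch, where the host path ~/airflow/dags is an assumption (use your own folder); /usr/local/airflow/dags is the path inside the puckel image where Airflow looks for DAGs:

```shell
# Run the webserver with a host folder mounted as the DAGs directory.
# ~/airflow/dags is an illustrative host path; the container path
# /usr/local/airflow/dags is where this image expects DAG files.
docker run -d -p 8080:8080 \
    -v ~/airflow/dags:/usr/local/airflow/dags \
    puckel/docker-airflow webserver
```

Any DAG file dropped into the mounted host folder will be picked up without rebuilding or restarting the container.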