More than 3 years have passed since last update.

A simple script to show Airflow 1.10.9 tasks states in non-GUI environments

Posted at 2022-05-17

TL;DR

Suppose you have an Airflow 1.10.9 (this is important) instance running with REST API enabled listening
on https://airflow.work.in (for example). Suppose further that you have DAG called my_dag with tasks task-1, task-2 and task-3 backfilled from 2022-04-01 till 2022-04-10. Then, to see the each task's completion status in non-graphical environment,
here is what you do:

virtualenv venv # can skip this and next, if you are not afraid of impacting your current environment
. ./venv/bin/activate

> export DAG_TREE_VIEW__AIRFLOW_URL="https://airflow.work.in" #specify Airflow's REST API's URL (and port, if necessary)
> echo '{"my_dag":["task-1","task-2","task-3"]}' > task_list.json #at the moment, we need to specify tasks manually for this approach to work; see below;

> pip3 install git+https://github.com/nailbiter/alex_python_toolbox@v4.14.0
> toolbox-deck dag-tree-view -ostr my_dag 2022-04-01 2022-04-10

dt      2022-04-01 2022-04-02 2022-04-03 2022-04-04 2022-04-05 2022-04-06 2022-04-07 2022-04-08 2022-04-09 2022-04-10
task_id                                                                                                              
task-1           S          S          S          S          S          S          S          S          S          S
task-2           S          S          S          S          S          S          S          S          S          S
task-3           S          S          S          S          S          S          S          S          S          S

This shows that every task in this period ended with status Success. Other status include:

S (for Success)
C (for sCheduled)
R (for Runnning)
K (for sKipped)
F (for Failed)
Q (for Queued)
U (for Upstream_failed)

When this is helpful

When you want to get the Airflow tree view output but for some reason do not have GUI enabled on your environment (e.g. access Airflow through a remote proxy).
Then, this may come handy if you do not want to roll the remote desktop server.

How it works

For those curious about the code, here it is. It is pretty basic at the moment and by no means
is meant to be taken seriously. I rather think of it is an emergency kit tool, for the situation described above. I personally have found it helpful.

What this code does basically, is referring to Airflow's REST API for every task/date combination,
and the renders the data as table (as above). This is good, because it allows to use this script whenever you have Airflow with REST API enabled, without making explicit assumptions about (co)location and/or infrastructure (e.g. whether you use
dockerized Airflow, installed Airflow or remote Airflow-as-a-service).

As you can imagine, unless the Airflow server and the computer where you run the script are co-located, this takes quite a lot of time. Of partial help then may be the -c/--cache-lifetime-min n key (or
DAG_TREE_VIEW__CACHE_LIFETIME_MIN environment variable), which defines the n-minute caching for url requests (that is, if n=5, then any reference within 5 minutes to the same url will be taken from local cache). The default value for n is zero (i.e. no
cache).

The main shortcoming of this approach is the necessity of passing task names for every DAG explicitly: to my knowledge, Airflow 1.10.9 does not provide an endpoint for that (
although newer versions do). By default, the script assumes that you have a JSON file task_list.json at your current folder, with the structure

{
  "dag-1": ["task-1","task-2"],
  ...
}

describing tasks belonging to every DAG. If desirable, you can provide your own script, e.g. script <dag-id>, which will print (newline-separated) task names for every given dag-id. In that case, the script name can be passed to dag-tree-view
via the --task-list-supplier key (or DAG_TREE__TASK_LIST_SUPPLIER environment variable)

The code is easy enough and thanks to the power of pandas, which is used to render the final output, it can be easily extended (e.g. to output static html, which can the be viewed by terminal-based browser).

The case when your REST API is password-protected can be handled by passing login:password pair (colon-separated) via the -p/--airflow-login-password key (or
DAG_TREE_VIEW__AIRFLOW_LOGIN_PASSWORD environment variable).

So far the script makes the implicit assumption that your DAG is ran once per day. The script can definitely be tweaked to adapt for hour-based, month-based etc. DAGs, but I have not yet figured out the intuitive command-line interface
to handle all cases.

Future work

Adapt the script to DAGs of arbitrary frequency (not only day-based)
somehow circumvent the necessity of passing task names explicitly (see above)
speed-up (e.g. by executing API requests parallel instead of sequential)
currently, tasks are listed in alphabetical order; somehow get the info about tasks inter-dependency and order tasks in topological order

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up