During my work I often have to run long Airflow backfills. They take a very long time, and my boss often asks me, "How much longer do you need?"
It is not easy to answer this question: although Airflow displays INFO messages such as
[2021-07-06 14:24:11,117] {backfill_job.py:364} INFO - [backfill progress] | finished run 65 of 67 | tasks waiting: 0 | succeeded: 266 | running: 2 | failed: 0 | skipped: 67 | deadlocked: 0 | not ready: 0
it does not show the estimated remaining time (ETA).
To solve this problem, I ended up using the simple script below (see https://github.com/nailbiter/for/blob/master/forpython/fordatawise/non-reusable/backfill-progress.py for the latest version):
import re

import click
from tqdm import tqdm


@click.command()
def backfill_progress():
    idx = 0  # index of the last finished run seen so far
    max_cnt, tqdm_object = None, None
    pat = re.compile(r".*finished run (\d+) of (\d+).*")
    while True:
        try:
            line = input()
        except EOFError:
            # the backfill command has closed its end of the pipe
            break
        m = pat.match(line)
        if m is not None:
            i, cnt = [int(m.group(k + 1)) for k in range(2)]
            if max_cnt is None:
                # first progress line: the total number of runs is now known
                max_cnt = cnt
                tqdm_object = tqdm(total=max_cnt)
            if i > idx:
                tqdm_object.update(i - idx)
                idx = i
        # forward every line, whether it matched or not
        click.echo(line)


if __name__ == "__main__":
    backfill_progress()
The usage is as follows (assuming the script above is saved as backfill-progress.py
in your current folder):
airflow backfill DAG_ID -s YYYY-MM-DD -e YYYY-MM-DD | python3 backfill-progress.py
The generated output is the usual backfill log together with a tqdm progress bar showing the ETA.
As can be seen from the source code, the idea is simple: upon receiving a line from the backfill
command's output (here we make use of Unix's mighty piping mechanism), the script searches the line for
text matching the regex finished run (\d+) of (\d+)
and, if a match is found, updates the progress bar and the ETA estimate. The line is also unconditionally
forwarded to stdout (so we can view the backfill's output as well).
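To make the matching step concrete, here is a small sketch that applies the same regex to a progress line in the format shown earlier (shortened here) and to an ordinary log line. The helper name parse_progress and the second log line are made up for illustration; they are not part of the script.

```python
import re

# Same pattern as in the script above.
pat = re.compile(r".*finished run (\d+) of (\d+).*")

# A progress line in the format shown earlier (shortened).
progress_line = (
    "[2021-07-06 14:24:11,117] {backfill_job.py:364} INFO - [backfill progress] "
    "| finished run 65 of 67 | tasks waiting: 0 | succeeded: 266 | running: 2"
)
# A made-up non-progress line, for illustration only.
other_line = "[2021-07-06 14:24:05,001] {some_module.py:1} INFO - something else"


def parse_progress(line):
    """Return (finished, total) for a progress line, or None otherwise."""
    m = pat.match(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


print(parse_progress(progress_line))  # (65, 67)
print(parse_progress(other_line))     # None
```

Only lines for which this returns a pair move the progress bar; everything else is just passed through.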
Before wrapping up, I have to mention that the progress bar and the ETA estimate are displayed via the great tqdm package. The click
library is used only for convenience, and this dependency can easily be removed.
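As a rough illustration of how the click dependency could be dropped, here is one possible sketch that reads from sys.stdin directly. The names follow_backfill, make_bar and main are my own choices for this example, and passing the progress bar in as a factory is just one way to keep the loop easy to try out without a terminal.

```python
import re
import sys

# re.search finds the pattern anywhere in the line, so the surrounding
# ".*" from the original pattern is not needed here.
PROGRESS_RE = re.compile(r"finished run (\d+) of (\d+)")


def follow_backfill(lines, out, make_bar):
    """Copy every line to `out`, advancing a progress bar on progress lines.

    `make_bar(total)` must return an object with update(n) and close(),
    for example a tqdm instance. Returns the last finished run index seen.
    """
    idx = 0
    bar = None
    for line in lines:
        m = PROGRESS_RE.search(line)
        if m is not None:
            i, cnt = int(m.group(1)), int(m.group(2))
            if bar is None:
                # first progress line: the total number of runs is now known
                bar = make_bar(cnt)
            if i > idx:
                bar.update(i - idx)
                idx = i
        # lines read from a file object keep their trailing newline
        out.write(line)
        out.flush()
    if bar is not None:
        bar.close()
    return idx


def main():
    # tqdm is imported here so that follow_backfill itself has no
    # third-party dependency
    from tqdm import tqdm

    follow_backfill(sys.stdin, sys.stdout, lambda total: tqdm(total=total))
```

Calling main() under the usual if __name__ == "__main__": guard should give roughly the same behavior as the click version when used at the end of the same pipe.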