How the pipeline runs

The pipeline is separated into jobs. Each job is a full run of one data source through each of its implemented steps. For instance:

pipe lob fd crawl -f 2024-01-01 && pipe lob fd parse -l && pipe lob fd clean

These chained commands are found in run.sh.

Note

You will notice the run script supports two paths, large and small. This is an implementation detail relevant only for our cloud deployment.

All jobs run once a day starting around midnight UTC.

Chain of Command(s)

pipe commands are chained together using the Linux logical AND operator, &&. The chain works off exit codes: 0 means the command succeeded, and any non-zero code signals that something went wrong.
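As a minimal illustration of that short-circuit behavior (using subprocess to invoke a shell; false and echo here are stand-ins, not real pipe modules):

    import subprocess

    # `false` exits 1, so the shell never reaches the second command;
    # `true` exits 0, so the chain continues to `echo`.
    stopped = subprocess.run("false && echo reached", shell=True,
                             capture_output=True, text=True)
    ran = subprocess.run("true && echo reached", shell=True,
                         capture_output=True, text=True)

    print(stopped.returncode, repr(stopped.stdout))  # 1 ''
    print(ran.returncode, repr(ran.stdout))          # 0 'reached\n'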

In cli.py, the whole run of a module is wrapped in a try/except block (see the repo):

try:
    result = args.func(rargs, args)
except Exception:
    logger.error(f"got error during {args.func.__name__}:\n")
    traceback.print_exc()
    sys.exit(1)
else:
    if result:
        logger.info(f"command {args.func} completed. Exit 0.")
        return
    else:
        logger.warning(
            f"command {args.func} returned {result}! Setting PIPE_CONTINUE=False."
        )
        with open(".pipe_continue", "w") as file:
            file.write("False")
        return

Any exception leads to sys.exit(1). In that case the && chain stops at that point, ensuring that, for example, parse does not run on "latest" data that crawl crashed while collecting.

Note also PIPE_CONTINUE. This is a flag held in a hidden file (.pipe_continue) that is True unless some step has determined that, despite no error occurring, the pipeline should not continue. This lets crawl or browse, on finding no new data in a run, signal to parse and clean that there is nothing left to do. filter uses it as well.
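A sketch of how a downstream step might read the flag. The helper name pipe_should_continue is hypothetical (the repo's actual reader may differ); treating a missing file as True matches the description above, since the flag only blocks the pipeline when a step has explicitly written "False".

    from pathlib import Path

    def pipe_should_continue(flag_path: str = ".pipe_continue") -> bool:
        """Hypothetical helper: a missing flag file, or any content other
        than the literal "False", means the pipeline should keep going."""
        path = Path(flag_path)
        if not path.exists():
            return True
        return path.read_text().strip() != "False"

    # A step like parse would bail out early when the flag is down:
    if not pipe_should_continue():
        print("no new data upstream; nothing to do")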