How the pipeline runs
The pipeline is separated into jobs. Each job is a full run of one data source through each of its implemented steps. For instance:
pipe lob fd crawl -f 2024-01-01 && pipe lob fd parse -l && pipe lob fd clean
These chained commands are found in run.sh.
Note
You will notice the run script supports two paths, large and small. This is an
implementation detail relevant only for our cloud deployment.
All jobs run once a day starting around midnight UTC.
Chain of Command(s)
pipe commands are chained together with the shell's logical AND operator, &&. The operator works on process exit codes: 0 means no issue, and any nonzero code (cli.py exits with 1) signals a problem and stops the chain.
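For illustration, the same short-circuit behavior can be sketched in Python with subprocess. The step list below just mirrors the run.sh example above; this is a model of what && does, not how run.sh is implemented:

import subprocess
import sys

# Mirrors the run.sh chain above. Each step is one `pipe` invocation
# that exits 0 on success and nonzero on failure.
steps = [
    ["pipe", "lob", "fd", "crawl", "-f", "2024-01-01"],
    ["pipe", "lob", "fd", "parse", "-l"],
    ["pipe", "lob", "fd", "clean"],
]

for step in steps:
    # subprocess.run waits for the command; returncode is its exit code.
    # A nonzero code stops the loop, exactly like && short-circuiting.
    result = subprocess.run(step)
    if result.returncode != 0:
        sys.exit(result.returncode)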
In cli.py, the whole run of a module is wrapped in a try/except block
(in repo):
try:
    result = args.func(rargs, args)
except Exception:
    logger.error(f"got error during {args.func.__name__}:\n")
    traceback.print_exc()
    sys.exit(1)
else:
    if result:
        logger.info(f"command {args.func} completed. Exit 0.")
        return
    else:
        logger.warning(
            f"command {args.func} returned {result}! Setting PIPE_CONTINUE=False."
        )
        with open(".pipe_continue", "w") as file:
            file.write("False")
        return
Any exception leads to sys.exit(1), and the && chain stops at that point. This ensures that, for example, parse does not run on the “latest” data if crawl crashed partway through collecting it.
Note also PIPE_CONTINUE. This is a flag held in a hidden file (.pipe_continue) that stays True unless some step finds that, although there was no error, the pipeline should not continue. It lets crawl or browse, on finding no new data this run, signal to parse and clean that there is nothing more to do; filter uses the flag as well.
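A downstream step can then consume the flag along these lines. This is a minimal sketch under the conventions above, not the actual implementation, and the helper name pipe_should_continue is made up:

from pathlib import Path

def pipe_should_continue() -> bool:
    # Hypothetical helper. The flag file is assumed to live in the
    # working directory and to default to True when absent.
    flag = Path(".pipe_continue")
    return not flag.exists() or flag.read_text().strip() != "False"

def parse(rargs, args):
    if not pipe_should_continue():
        # An earlier step (e.g. crawl) found nothing new. Returning a
        # truthy value makes cli.py exit 0, so the && chain still ends
        # cleanly rather than being treated as a failure.
        return True
    ...  # the actual parse work goes here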