How the pipeline runs
====================================

The pipeline is separated into jobs. Each job is a full run of one data source
through each of its implemented steps. For instance:

.. code-block:: bash

   pipe lob fd crawl -f 2024-01-01 && pipe lob fd parse -l && pipe lob fd clean

These chained commands are found in ``run.sh``.

.. note::

   You will notice the run script supports two paths, ``large`` and ``small``.
   This is an implementation detail relevant only for our cloud deployment.

All jobs run once a day, starting around midnight UTC.

Chain of Command(s)
------------------------------------

``pipe`` commands are chained together using the Linux `logical AND operator `_,
``&&``. Chaining works through Linux exit codes: ``0`` means the command
succeeded, while a non-zero code such as ``1`` signals a problem, and the chain
stops there.

In ``cli.py``, the whole run of a module is wrapped in a ``try/except`` block
(`in repo `_):

.. code-block:: python

   try:
       result = args.func(rargs, args)
   except Exception:
       logger.error(f"got error during {args.func.__name__}:\n")
       traceback.print_exc()
       sys.exit(1)
   else:
       if result:
           logger.info(f"command {args.func} completed. Exit 0.")
           return
       else:
           logger.warning(
               f"command {args.func} returned {result}! Setting PIPE_CONTINUE=False."
           )
           with open(".pipe_continue", "w") as file:
               file.write("False")
           return

Any exception leads to ``sys.exit(1)``. In that case, the ``&&`` chain stops at
that point, ensuring that, for example, ``parse`` does not run on the "latest"
data if ``crawl`` crashed while collecting it.

Note also ``PIPE_CONTINUE``. This is a flag held in a hidden file
(``.pipe_continue``) that is ``True`` unless some step has determined that,
despite there being no error, the pipeline should not continue. This allows
``crawl`` or ``browse`` to find no new data on a given run and signal to
``parse`` and ``clean`` that there is nothing more to do. ``filter`` also uses
it.
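
This page does not show how downstream steps read the flag back. As a minimal
sketch, assuming a hypothetical helper (``PIPE_CONTINUE_FILE`` and
``pipe_should_continue`` are illustrative names, not taken from the repo), a
step such as ``parse`` could check it first and return immediately when there
is nothing to do:

.. code-block:: python

   import sys
   from pathlib import Path

   # Hypothetical constant pointing at the same hidden flag file used in cli.py.
   PIPE_CONTINUE_FILE = Path(".pipe_continue")


   def pipe_should_continue() -> bool:
       """Return False only if a previous step wrote "False" to the flag file."""
       if not PIPE_CONTINUE_FILE.exists():
           return True  # no step has objected, keep going
       return PIPE_CONTINUE_FILE.read_text().strip() != "False"


   def parse(rargs, args):
       if not pipe_should_continue():
           # Nothing new to process; return a truthy result so cli.py logs
           # completion and the process exits 0, keeping the && chain intact.
           print("PIPE_CONTINUE is False, skipping parse.")
           return True
       ...  # actual parsing work would go here
       return True

Returning normally here matters: exiting with a non-zero code would stop the
``&&`` chain as if an error had occurred, whereas the flag only means there is
nothing left to do on this run.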