How the pipeline runs
====================================
The pipeline is separated into jobs. Each job is a full run of one data source through
each of its implemented steps. For instance:

.. code-block:: bash

   pipe lob fd crawl -f 2024-01-01 && pipe lob fd parse -l && pipe lob fd clean

These chained commands are found in ``run.sh``.

.. note::

   You will notice the run script supports two paths, ``large`` and ``small``. This is an
   implementation detail relevant only for our cloud deployment.

All jobs run once a day starting around midnight UTC.
Chain of Command(s)
------------------------------------
``pipe`` commands are chained together using the shell's logical AND operator, ``&&``.
The operator works off exit codes: ``0`` means the command succeeded, while a non-zero
code such as ``1`` signals a problem and stops the rest of the chain.
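
To make the short-circuit behaviour concrete, here is a minimal Python sketch of what the
``&&`` chain in ``run.sh`` amounts to. The step commands are copied from the example above,
but the script itself is purely illustrative and is not part of the repository:

.. code-block:: python

   import subprocess
   import sys

   # Illustrative only: replicate "cmd1 && cmd2 && cmd3" in Python.
   steps = [
       ["pipe", "lob", "fd", "crawl", "-f", "2024-01-01"],
       ["pipe", "lob", "fd", "parse", "-l"],
       ["pipe", "lob", "fd", "clean"],
   ]

   for step in steps:
       completed = subprocess.run(step)
       if completed.returncode != 0:
           # A non-zero exit code stops the chain, exactly as ``&&`` would.
           sys.exit(completed.returncode)
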
In ``cli.py``, the whole run of a module is wrapped in a ``try``/``except`` block:

.. code-block:: python

   try:
       result = args.func(rargs, args)
   except Exception:
       logger.error(f"got error during {args.func.__name__}:\n")
       traceback.print_exc()
       sys.exit(1)
   else:
       if result:
           logger.info(f"command {args.func} completed. Exit 0.")
           return
       else:
           logger.warning(
               f"command {args.func} returned {result}! Setting PIPE_CONTINUE=False."
           )
           with open(".pipe_continue", "w") as file:
               file.write("False")
           return

Any exception leads to ``sys.exit(1)``, so the ``&&`` chain stops at that point. This ensures
that, for example, ``parse`` does not run on the "latest" data if ``crawl`` crashed while
collecting it.
Note also ``PIPE_CONTINUE``. This flag lives in a hidden file (``.pipe_continue``) and stays ``True``
unless some step decides that, even though no error occurred, the pipeline should not continue. It lets
``crawl`` or ``browse`` signal to ``parse`` and ``clean`` that no new data was found on this run and
there is nothing more to do. ``filter`` uses it in the same way.
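
The write side of the flag appears in the ``cli.py`` excerpt above; a downstream step could consult
it along the lines of the sketch below. The ``pipe_should_continue`` helper is hypothetical and only
illustrates the idea, it is not the repository's actual function:

.. code-block:: python

   import sys
   from pathlib import Path

   def pipe_should_continue(flag_file: str = ".pipe_continue") -> bool:
       """Hypothetical helper: a missing flag file means "keep going";
       only an explicit "False" written by an earlier step stops the run."""
       path = Path(flag_file)
       if not path.exists():
           return True
       return path.read_text().strip() != "False"

   # e.g. at the start of ``parse`` or ``clean``:
   if not pipe_should_continue():
       print("PIPE_CONTINUE is False; nothing new to process.")
       sys.exit(0)
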