How the pipeline runs
====================================
The pipeline is separated into jobs. Each job is a full run of one data source through
each of its implemented steps. For instance:

.. code-block:: bash

   pipe lob fd crawl -f 2024-01-01 && pipe lob fd parse -l && pipe lob fd clean

These chained commands are found in ``run.sh``.

.. note::

   You will notice the run script supports two paths, ``large`` and ``small``. This is an
   implementation detail relevant only for our cloud deployment.

All jobs run once a day starting around midnight UTC.
Chain of Command(s)
------------------------------------
``pipe`` commands are chained together using the shell's logical AND operator, ``&&``.
The operator works off exit codes: ``0`` means the command succeeded, while a non-zero
code such as ``1`` signals a problem and stops the rest of the chain.
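
To make the short-circuit behaviour concrete, here is a minimal Python sketch of what the
``&&`` chain in ``run.sh`` amounts to. The step commands are copied from the example above,
but the script itself is purely illustrative and is not part of the repository:

.. code-block:: python

   import subprocess
   import sys

   # Illustrative only: replicate "cmd1 && cmd2 && cmd3" in Python.
   steps = [
       ["pipe", "lob", "fd", "crawl", "-f", "2024-01-01"],
       ["pipe", "lob", "fd", "parse", "-l"],
       ["pipe", "lob", "fd", "clean"],
   ]

   for step in steps:
       completed = subprocess.run(step)
       if completed.returncode != 0:
           # A non-zero exit code stops the chain, exactly as ``&&`` would.
           sys.exit(completed.returncode)
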
In ``cli.py``, the whole run of a module is wrapped in a ``try``/``except`` block:

.. code-block:: python

   try:
       result = args.func(rargs, args)
   except Exception:
       logger.error(f"got error during {args.func.__name__}:\n")
       traceback.print_exc()
       sys.exit(1)
   else:
       if result:
           logger.info(f"command {args.func} completed. Exit 0.")
           return
       else:
           logger.warning(
               f"command {args.func} returned {result}! Setting PIPE_CONTINUE=False."
           )
           with open(".pipe_continue", "w") as file:
               file.write("False")
           return

Any exception leads to ``sys.exit(1)``, so the ``&&`` chain stops at that point. This ensures
that, for example, ``parse`` does not run on the "latest" data if ``crawl`` crashed while
collecting it.
Note also ``PIPE_CONTINUE``. This flag lives in a hidden file (``.pipe_continue``) and stays ``True``
unless some step decides that, even though no error occurred, the pipeline should not continue. It lets
``crawl`` or ``browse`` signal to ``parse`` and ``clean`` that no new data was found on this run and
there is nothing more to do. ``filter`` uses it in the same way.
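
The write side of the flag appears in the ``cli.py`` excerpt above; a downstream step could consult
it along the lines of the sketch below. The ``pipe_should_continue`` helper is hypothetical and only
illustrates the idea, it is not the repository's actual function:

.. code-block:: python

   import sys
   from pathlib import Path

   def pipe_should_continue(flag_file: str = ".pipe_continue") -> bool:
       """Hypothetical helper: a missing flag file means "keep going";
       only an explicit "False" written by an earlier step stops the run."""
       path = Path(flag_file)
       if not path.exists():
           return True
       return path.read_text().strip() != "False"

   # e.g. at the start of ``parse`` or ``clean``:
   if not pipe_should_continue():
       print("PIPE_CONTINUE is False; nothing new to process.")
       sys.exit(0)
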