######
Crawl
######

Crawl collects structured data from the web, typically as HTML or JSON. Source implementations give HTTP requests and metadata to the core architecture, which manages all networking and storage asynchronously. Skip to "Writing Crawlers" at your own peril.

.. contents:: Contents
   :local:
   :depth: 1

*****
Trees
*****

Each website is represented as a tree: nodes are "pages" (which could be an HTML document or an API response) and the edges between them are HTTP requests. This structure is used by all web crawling projects. We can call this the `sitemap `_.

The IJF's crawling setup has unique goals that require a homemade engine. As outlined in the :ref:`overview `, the IJF deals in ``rid``. All raw data is assigned to some record, identified by a unique ``rid``, before it is ever saved in our storage.

So there are two trees in the crawl: the sitemap described above and the new *record tree*. Each node in the record tree is data, like in the sitemap, but assigned metadata including an ``rid``. Further, to prevent confusion later on, each node in the ``rid`` tree holds data forming part or whole of only one record. The edges are still HTTP requests. The work of a crawler is traversing the sitemap and, along the way, building the record tree.

Example: ``lob fd`` and the searchable database
===============================================

The `federal lobbyist registry `_ is an instructive example of our crawling projects. It has the structure we always see in target sites: the searchable database.

.. image:: images/search-db.svg

This diagram shows the searchable database sitemap. There are *N* pages of results, each linking to *P* records. The HTTP request to get each result page is usually some condition like a date range followed by a pagination count. The HTTP request to get each main page is usually given some ID found in the result card, hopefully in a friendly ``<a>`` tag.

At the above link, we can see that each result card corresponds to one record, and can guess that some unique string identifying those records -- what we require for building the RID -- lives therein. So, we can ``section()`` the ``results`` page into ``P`` records, each with an ``rid``.

.. _rid-tree-section:

.. image:: images/rid-tree-section.svg

One record is comprised of a subtree of the record tree holding all pages associated with that record. We don't section ``main`` in this example because each main page describes just that record -- which is normal, but not guaranteed. Also note that the RID is inherited from its first appearance in the subtree by all descendant nodes.

*****
Seeds
*****

These trees grow from seeds. In normal web crawlers this is sometimes called the `URL frontier `_: a priority queue of URLs from which a crawler will recursively follow all ``<a>`` tags, forming a tree.

We do deep web scraping. In these searchable databases, none of the data we need is accessible through a chain of surface-level, same-site links. We need to make database queries to get results describing the desired range of records, most often within two dates. So a static URL does not give us the information we need to crawl that site. Instead, we seed the crawler dynamically, at runtime, in a function called ``seed()``.

``seed()``: the start of the event loop
=======================================

Every call of ``seed`` produces a new HTTP request from which crawling can begin or resume.
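As a first sketch of the shape of this function, a ``seed`` for a paginated searchable database might look like the following. The endpoint URL, the ``page`` parameter name, and the end-of-range check are illustrative assumptions, not any real source:

.. code-block:: python
   :caption: hypothetical seed for a paginated searchable database

   from datetime import datetime

   from pipeline.crawl import tree
   from pipeline.crawl.scheduler import Scheduler  # assumed import path
   from pipeline.utils.http import Request

   SEARCH_URL = "https://example.org/registry/search"  # hypothetical endpoint


   def seed(scheduler: Scheduler) -> tree.Edge | None:
       # Stop seeding once the scheduler says we are past the last results page
       # (the exact end-of-range condition is source-dependent).
       if scheduler.indexer.page_start > scheduler.indexer.max_idx:
           return None

       # Ask for the next page of search results; the scheduler tracks position.
       return tree.Edge(
           p_rid="",              # a seed has no parent record
           p_rdate=datetime.min,  # placeholder "no parent date"
           label="results",
           req=Request(
               method="GET",
               url=SEARCH_URL,
               params={"page": scheduler.indexer.page_start},
           ),
       )

A real ``seed`` is usually driven by the scheduler's ``Runtime``, as in the ``lob sk`` example further down this page.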
In the ``lob fd`` example, calling ``seed`` would return a ``POST`` request querying the first page of registrations from the federal site, kicking off the process shown in the second diagram.

.. image:: images/seed.svg

``seed`` can access a global scheduler object that holds all the identifying information about where you are in the crawl, implying *what's next* in the crawl. What you must provide in the request is source-dependent, but is often: ``page_start``, the starting index of the page; and/or ``start_date`` and ``end_date``, the full date range of records regardless of pagination. The core crawling architecture (the black box above) updates the scheduler during its run, finishing before ``seed`` is called again.

Keeping with our ``lob fd`` example, calling ``seed`` would return a request that points to one page of results, which we would process as in :ref:`the above figure `.

**********
Algorithm
**********

We can now outline the whole crawling process from ``seed`` to ``seed``.

.. image:: images/crawl-algo.svg

0. ``seed`` is called, sending a request into an outgoing queue of requests.
1. Each request carries a ``label`` (implied earlier when using ``result`` and ``main``) indicating the kind of page the request will provide. Labels are set arbitrarily per source.
2. The response to any request is ``Data``, a combination of its preceding edge's ``label`` and the underlying response bytes.
3. ``section`` acts here as described before, splitting search results into one ``Data`` per record.
4. ``parse`` is a source-specific function that takes all sectioned ``Data`` and fulfills two crucial processes:

   a. finding the ``rid`` for this ``Data``, or choosing to inherit its parent's
   b. finding any ``edges`` in this ``Data``, which are requests to other pages in the sitemap

5. The ``edges`` list goes back into the request queue, like the request provided by ``seed``.

Steps 2-5 repeat until no further edges are found. At that point, ``seed`` runs again, starting the cycle anew.

****************
Writing crawlers
****************

Features and the CLI
====================

.. note::

   I have to redo the CLI docs. There aren't any.

The CLI for ``crawl`` is documented here. The arguments become this struct:

.. autoclass:: pipeline.args.CrawlArgs

Every crawler must support either:

- ``-H``: historical mode
- ``-f/-t``: dated mode

The latter is ideal. Since we run daily, we prefer on each run to cover "the last day". Some sources do not support date filtering but are simultaneously small enough that covering their whole population takes less than 10 minutes. In those cases, just implement ``-H``.

``-c/-s``, or start/stop, is also supported in the core but not currently used. None of our data sources are yet large enough, without date filtering, for it to be necessary.

All ``CrawlArgs`` values above are available in the scheduler, and used during ``seed``.

Logistics
=========

Create a new crawler for some ``db, s`` at ``src/pipeline/crawl/crawlers//``. This directory must be a Python module, so it must contain an ``__init__.py`` file which exposes all the necessary objects. They can be written in the init file, or in any arbitrary structure. This module must have three public functions: ``seed``, ``sections`` and ``parse``.
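For instance, a minimal ``__init__.py`` might simply re-export the three functions from helper modules. The helper module names in this sketch are made up; only the three public names matter to the engine:

.. code-block:: python
   :caption: hypothetical crawler __init__.py

   # Everything can live directly in __init__.py, or be split across helper
   # files and re-exported here -- the engine only looks up these three names.
   from .seeding import seed
   from .sectioning import sections
   from .parsing import parse

   __all__ = ["seed", "sections", "parse"]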
They must have the following signatures:

.. code-block:: python

   def seed(scheduler: Scheduler) -> tree.Edge | None:

   def sections(data: tree.Data) -> list[tree.Data]:

   def parse(
       data: tree.Data, p_rid: str, p_rdate: datetime
   ) -> tuple[str, datetime, list[tree.Edge]]:

``Edge`` and ``Request``
========================

These two core classes are used to represent outgoing network traffic.

.. autoclass:: pipeline.crawl.tree.Edge

.. autoclass:: pipeline.utils.http.Request

Seed
====

``seed`` uses the global scheduler to return either an ``Edge`` or ``None``. The scheduler API is documented in full :doc:`here `. ``seed`` returns ``None`` once you have exhausted all the data in the date or index range.

Whichever of ``-f/-t`` or ``-H`` is passed to the CLI, that choice is visible during ``seed`` via the scheduler. On construction, the scheduler is assigned a ``Runtime`` which reflects the passed CLI arguments.

.. autoclass:: pipeline.crawl.scheduler.runtime.Runtime

This is a characteristic example of using the runtime in ``seed``:

.. code-block:: python
   :caption: lob sk crawl

   def seed(scheduler: Scheduler) -> tree.Edge | None:
       path = Path(__file__).parent.resolve()

       # in HIST, we just cover all pages
       if scheduler.runtime == Runtime.HIST:
           with open(path / "adv.json") as file:
               req = json.load(file)
           req["json"]["start"] = scheduler.indexer.page_start - 1

       elif scheduler.runtime == Runtime.IDX:
           raise NotImplementedError("IDX Runtime not supported in lob sk")

       # if DATE, provide the date
       elif scheduler.runtime == Runtime.DATE:
           if (
               MAX_RECORDS_IN_DATE_RANGE <= scheduler.indexer.max_idx
               and scheduler.seeds > 0
           ):
               return None
           with open(path / "adv.json") as file:
               req = json.load(file)
           req["json"]["PostedFromDate"] = scheduler.calendar.from_date.date().isoformat()
           req["json"]["PostedToDate"] = scheduler.calendar.to_date.date().isoformat()

Section
=======

In most sources, you will only need to section a search results page. Sometimes, the way a source structures its data is different from the schema we've designed. In the Quebec lobbyist registry, records are organized per-lobbyist, whereas we do per-entity:

.. code-block:: python
   :caption: lob qc crawl

   def sections(data: tree.Data) -> list[tree.Data]:
       label, text = data
       map = json.loads(text)

       match label:
           case "search_results":
               return [tree.Data("result", json.dumps(result)) for result in map]
           case _:
               sections = []

               # post, consultant: split clients
               if not is_pre(data) and not is_org(data):
                   for client in map["clients"]:
                       # saving "clients" as an array, even though always one element,
                       # to match style we see in raw JSON
                       sections.append(
                           tree.Data("main", json.dumps({**map, "clients": [client]}))
                       )

               # pre, any: multiple mandates?
               elif is_pre(data):
                   for mandat in map["mandts"]:
                       # like clients above, saving mandats in an array
                       sections.append(
                           tree.Data("main", json.dumps({**map, "mandts": [mandat]}))
                       )

               # this leaves post org, which doesn't require splitting
               elif not is_pre(data) and is_org(data):
                   sections = [tree.Data("main", json.dumps(map))]

               return sections

Parse
=====

Parse is responsible for collecting:

1. some string unique to that record, for RID
2. a date associated with that record, for RDate
3. all connected pages, as Edges

It parses this information out of the provided ``Data``. ``parse`` also receives the ``rid`` and ``rdate`` of the parent node, if one exists, so it can choose to inherit them.

The pipeline container has both ``lxml`` and ``bs4`` for parsing HTML.
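As an illustration of all three responsibilities together, a ``parse`` for the ``result``/``main`` pages of a source like ``lob fd`` might look like the sketch below. The XPath expressions, the ``regId`` parameter, and the URL are assumptions, not the real federal site:

.. code-block:: python
   :caption: hypothetical parse for a result/main pair

   from datetime import datetime

   from lxml import html

   from pipeline.crawl import tree
   from pipeline.utils.http import Request

   MAIN_URL = "https://example.org/registry/view"  # hypothetical endpoint


   def parse(
       data: tree.Data, p_rid: str, p_rdate: datetime
   ) -> tuple[str, datetime, list[tree.Edge]]:
       label, text = data
       root = html.fromstring(text)

       if label == "result":
           # 1. RID: a registration number unique to this record
           reg_no = root.xpath("string(//span[@class='reg-no'])").strip()
           # 2. RDate: a date shown on the result card
           rdate = datetime.fromisoformat(root.xpath("string(//time/@datetime)"))
           # 3. Edges: one request to the record's main page
           edge = tree.Edge(
               p_rid=reg_no,
               p_rdate=rdate,
               label="main",
               req=Request(method="GET", url=MAIN_URL, params={"regId": reg_no}),
           )
           return reg_no, rdate, [edge]

       # A `main` page describes the same record as its parent result card,
       # so inherit the parent's RID and RDate; no further edges to follow.
       return p_rid, p_rdate, []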
There is also the ``Marker`` utility class, which provides a nicer API to ``lxml.etree``:

.. autoclass:: pipeline.utils.html.marker.Marker
   :members:

********
Advanced
********

HTTP Sessions
=============

By default, all requests go through a single HTTP session. It is an ``aiohttp.ClientSession``, documented `here `_.

Cookies
^^^^^^^

``ClientSession`` automatically accepts and updates cookies. This has been sufficient for all websites we've crawled so far. You can pass arbitrary, request-level cookies as a dictionary to ``Request.cookies``.

Throttling
^^^^^^^^^^

We have four different ways to affect our throughput to the server:

1. ``-n``: total number of concurrent crawlers between ``seed`` calls
2. ``-ns``: number of sessions for ``-n`` many crawlers
3. ``--session-cap``: total simultaneous requests possible per session
4. ``-w``: flat throttle in seconds on all HTTP requests to the server

Managing sessions
^^^^^^^^^^^^^^^^^

``-ns`` or ``-n-sessions`` sets the total number of HTTP sessions in the crawl. Regardless of the number, the engine ensures that every edge to a descendant node is sent through the same session as its parent. We typically don't need multiple sessions; but if we do, we equalize ``-n`` and ``-ns``, so that every concurrent crawler has its own session.

SKIP and STOP
=============

By default, any visited page is saved. If you don't want to save a page, you can either skip over it, still saving any subsequent pages, or stop at that page without saving. To do either, pass ``__SKIP`` or ``__STOP`` as the RID in ``parse``:

.. code-block:: python

   if data.label == "result":
       # Search results themselves are skipped and we go right to `main`,
       # which uses the same logic as normal mode.
       href = root.first("//a").attr("href")
       bn = re.search(r"Bn=(.+)&", href).group(1)
       req = tree.Edge(
           p_rid="",
           p_rdate=DMIN,
           label="main",
           req=Request(
               method="GET", url=CHA_PAGE_URL, params={"selectedCharityBn": bn}
           ),
       )
       return "__SKIP", DMIN, [req]

Duplicates
==========

The engine does not save any duplicate pages. The internal representation of a node, which is linked to a SQL table, ``Pages``, implements these methods:

.. code-block:: python
   :caption: pipeline.database.models.crawl.Pages

   def _uniq(self) -> str:
       """Returns a string of concatenated values that uniquely identify any Pages object."""
       return self.cid + self.rid + self.label

   def __hash__(self) -> int:
       return hash(self._uniq())

   def __eq__(self, other) -> bool:
       if isinstance(other, Pages):
           return self._uniq() == other._uniq()
       return False

If a source implementation returns a page with the same ``rid, label`` pair as any previous page in this crawl (identified by ``cid``, the crawl ID), it is not ingested and a warning log is emitted.

By the nature of their structure, websites often serve duplicate records, and as extra protection we often account for that inside the source implementation.
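One way to do that is to keep a set of identifiers already seen by ``parse`` and return ``__STOP`` for repeats. This is only a sketch: the JSON field names are hypothetical, and whether ``__STOP`` or simply dropping the duplicate during ``sections`` is the right move is a per-source judgment:

.. code-block:: python
   :caption: hypothetical in-source duplicate guard

   import json
   from datetime import datetime

   from pipeline.crawl import tree

   # Record identifiers already seen during this crawl run.
   _seen: set[str] = set()


   def parse(
       data: tree.Data, p_rid: str, p_rdate: datetime
   ) -> tuple[str, datetime, list[tree.Edge]]:
       label, text = data
       record = json.loads(text)

       rid = record["registrationNumber"]                    # hypothetical field
       rdate = datetime.fromisoformat(record["postedDate"])  # hypothetical field

       # The site serves the same record under several search conditions;
       # stop here so the duplicate page is neither saved nor expanded.
       if rid in _seen:
           return "__STOP", rdate, []

       _seen.add(rid)
       return rid, rdate, []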