######
Crawl
######

Crawl collects structured data from the web, typically as HTML or JSON. Source implementations give HTTP requests and metadata to the core architecture, which manages all networking and storage asynchronously. Skip to "Writing Crawlers" at your own peril.

.. contents:: Contents
   :local:
   :depth: 1

*****
Trees
*****

Each website is represented as a tree: nodes are "pages" (which could be an HTML document or an API response) and the edges between them are HTTP requests. This structure is used by all web crawling projects. We can call this the `sitemap `_.

The IJF's crawling setup has unique goals that require a homemade engine. As outlined in the :ref:`overview `, the IJF deals in ``rid``. All raw data is assigned to some record, identified by a unique ``rid``, before it is ever saved in our storage.

So there are two trees in the crawl: the sitemap described above and the new *record tree*. Each node in the record tree is data, like in the sitemap, but assigned metadata including an ``rid``. Further, to prevent confusion later on, each node in the ``rid`` tree holds data forming part or whole of only one record. The edges are still HTTP requests. The work of a crawler is traversing the sitemap and, along the way, building the record tree.

Example: ``lob fd`` and the searchable database
===============================================

The `federal lobbyist registry `_ is an instructive example of our crawling projects. It has the structure we always see in target sites: the searchable database.

.. image:: images/search-db.svg

This diagram shows the searchable database sitemap. There are *N* pages of results, each linking to *P* records. The HTTP request to get each result page is usually some condition like a date range followed by a pagination count. The HTTP request to get each main page is usually given some ID found in the result card, hopefully in a friendly ``<a>`` tag.

At the above link, we can see that each result card corresponds to one record, and can guess that some unique string identifying those records -- what we require for building the RID -- lives therein. So, we can ``section()`` the ``results`` page into ``P`` records, each with an ``rid``.

.. _rid-tree-section:

.. image:: images/rid-tree-section.svg

One record is comprised of a subtree of the record tree holding all pages associated with that record. We don't section ``main`` in this example because each main page describes just that record -- which is normal, but not guaranteed. Also note that the RID is inherited from its first appearance in the subtree by all descendant nodes.

*****
Seeds
*****

These trees grow from seeds. In normal web crawlers this is sometimes called the `URL frontier `_: a priority queue of URLs from which a crawler will recursively follow all ``<a>`` tags, forming a tree.

We do deep web scraping. In these searchable databases, none of the data we need is accessible through a chain of surface-level, same-site links. We need to make database queries to get results describing the desired range of records, most often within two dates. So a static URL does not give us the information we need to crawl that site. Instead, we seed the crawler dynamically, at runtime, in a function called ``seed()``.

``seed()``: the start of the event loop
=======================================

Every call of ``seed`` produces a new HTTP request from which crawling can begin or resume.
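As a first sketch of the shape of this function, a ``seed`` for a paginated searchable database might look like the following. The endpoint URL, the ``page`` parameter name, and the end-of-range check are illustrative assumptions, not any real source:

.. code-block:: python
   :caption: hypothetical seed for a paginated searchable database

   from datetime import datetime

   from pipeline.crawl import tree
   from pipeline.crawl.scheduler import Scheduler  # assumed import path
   from pipeline.utils.http import Request

   SEARCH_URL = "https://example.org/registry/search"  # hypothetical endpoint


   def seed(scheduler: Scheduler) -> tree.Edge | None:
       # Stop seeding once the scheduler says we are past the last results page
       # (the exact end-of-range condition is source-dependent).
       if scheduler.indexer.page_start > scheduler.indexer.max_idx:
           return None

       # Ask for the next page of search results; the scheduler tracks position.
       return tree.Edge(
           p_rid="",              # a seed has no parent record
           p_rdate=datetime.min,  # placeholder "no parent date"
           label="results",
           req=Request(
               method="GET",
               url=SEARCH_URL,
               params={"page": scheduler.indexer.page_start},
           ),
       )

A real ``seed`` is usually driven by the scheduler's ``Runtime``, as in the ``lob sk`` example further down this page.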
In the ``lob fd`` example, calling ``seed`` would return a ``POST`` request querying the first page of registrations from the federal site, kicking off the process shown in the second diagram.

.. image:: images/seed.svg

``seed`` can access a global scheduler object that holds all the identifying information about where you are in the crawl, implying *what's next* in the crawl. What you must provide in the request is source-dependent, but is often: ``page_start``, the starting index of the page; and/or ``start_date`` and ``end_date``, the full date range of records regardless of pagination. The core crawling architecture (the black box above) updates the scheduler during its run, finishing before ``seed`` is called again.

Keeping with our ``lob fd`` example, calling ``seed`` would return a request that points to one page of results, which we would process as in :ref:`the above figure `.

**********
Algorithm
**********

We can now outline the whole crawling process from ``seed`` to ``seed``.

.. image:: images/crawl-algo.svg

0. ``seed`` is called, sending a request into an outgoing queue of requests.
1. Each request carries a ``label`` (implied earlier when using ``result`` and ``main``) indicating the kind of page the request will provide. Labels are set arbitrarily per source.
2. The response to any request is ``Data``, a combination of its preceding edge's ``label`` and the underlying response bytes.
3. ``section`` acts here as described before, splitting search results into one ``Data`` per record.
4. ``parse`` is a source-specific function that takes all sectioned ``Data`` and fulfills two crucial processes:

   a. finding the ``rid`` for this ``Data``, or choosing to inherit its parent's
   b. finding any ``edges`` in this ``Data``, which are requests to other pages in the sitemap

5. The ``edges`` list goes back into the request queue, like the request provided by ``seed``.

Steps 2-5 repeat until no further edges are found. At that point, ``seed`` runs again, starting the cycle anew.

****************
Writing crawlers
****************

Features and the CLI
====================

.. note::

   I have to redo the CLI docs. There aren't any.

The CLI for ``crawl`` is documented here. The arguments become this struct:

.. autoclass:: pipeline.args.CrawlArgs

Every crawler must support either:

- ``-H``: historical mode
- ``-f/-t``: dated mode

The latter is ideal. Since we run daily, we prefer on each run to cover "the last day". Some sources do not support date filtering but are simultaneously small enough that covering their whole population takes less than 10 minutes. In those cases, just implement ``-H``.

``-c/-s``, or start/stop, is also supported in the core but not currently used. None of our data sources are yet large enough, without date filtering, for it to be necessary.

All ``CrawlArgs`` values above are available in the scheduler, and used during ``seed``.

Logistics
=========

Create a new crawler for some ``db, s`` at ``src/pipeline/crawl/crawlers//``. This directory must be a Python module, so it must contain an ``__init__.py`` file which exposes all the necessary objects. They can be written in the init file, or in any arbitrary structure. This module must have three public functions: ``seed``, ``sections`` and ``parse``.
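For instance, a minimal ``__init__.py`` might simply re-export the three functions from helper modules. The helper module names in this sketch are made up; only the three public names matter to the engine:

.. code-block:: python
   :caption: hypothetical crawler __init__.py

   # Everything can live directly in __init__.py, or be split across helper
   # files and re-exported here -- the engine only looks up these three names.
   from .seeding import seed
   from .sectioning import sections
   from .parsing import parse

   __all__ = ["seed", "sections", "parse"]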
They must have the following signatures:

.. code-block:: python

   def seed(scheduler: Scheduler) -> tree.Edge | None:

   def sections(data: tree.Data) -> list[tree.Data]:

   def parse(
       data: tree.Data, p_rid: str, p_rdate: datetime
   ) -> tuple[str, datetime, list[tree.Edge]]:

``Edge`` and ``Request``
========================

These two core classes are used to represent outgoing network traffic.

.. autoclass:: pipeline.crawl.tree.Edge

.. autoclass:: pipeline.utils.http.Request

Seed
====

``seed`` uses the global scheduler to return either an ``Edge`` or ``None``. The scheduler API is documented in full :doc:`here `. ``seed`` returns ``None`` once you have exhausted all the data in the date or index range.

Whichever of ``-f/-t`` or ``-H`` is passed to the CLI, that choice is visible during ``seed`` via the scheduler. On construction, the scheduler is assigned a ``Runtime`` which reflects the passed CLI arguments.

.. autoclass:: pipeline.crawl.scheduler.runtime.Runtime

This is a characteristic example of using the runtime in ``seed``:

.. code-block:: python
   :caption: lob sk crawl

   def seed(scheduler: Scheduler) -> tree.Edge | None:
       path = Path(__file__).parent.resolve()

       # in HIST, we just cover all pages
       if scheduler.runtime == Runtime.HIST:
           with open(path / "adv.json") as file:
               req = json.load(file)
           req["json"]["start"] = scheduler.indexer.page_start - 1

       elif scheduler.runtime == Runtime.IDX:
           raise NotImplementedError("IDX Runtime not supported in lob sk")

       # if DATE, provide the date
       elif scheduler.runtime == Runtime.DATE:
           if (
               MAX_RECORDS_IN_DATE_RANGE <= scheduler.indexer.max_idx
               and scheduler.seeds > 0
           ):
               return None
           with open(path / "adv.json") as file:
               req = json.load(file)
           req["json"]["PostedFromDate"] = scheduler.calendar.from_date.date().isoformat()
           req["json"]["PostedToDate"] = scheduler.calendar.to_date.date().isoformat()

Section
=======

In most sources, you will only need to section a search results page. Sometimes, the way a source structures its data is different from the schema we've designed. In the Quebec lobbyist registry, records are organized per-lobbyist, whereas we do per-entity:

.. code-block:: python
   :caption: lob qc crawl

   def sections(data: tree.Data) -> list[tree.Data]:
       label, text = data
       map = json.loads(text)

       match label:
           case "search_results":
               return [tree.Data("result", json.dumps(result)) for result in map]
           case _:
               sections = []

               # post, consultant: split clients
               if not is_pre(data) and not is_org(data):
                   for client in map["clients"]:
                       # saving "clients" as an array, even though always one element,
                       # to match style we see in raw JSON
                       sections.append(
                           tree.Data("main", json.dumps({**map, "clients": [client]}))
                       )

               # pre, any: multiple mandates?
               elif is_pre(data):
                   for mandat in map["mandts"]:
                       # like clients above, saving mandats in an array
                       sections.append(
                           tree.Data("main", json.dumps({**map, "mandts": [mandat]}))
                       )

               # this leaves post org, which doesn't require splitting
               elif not is_pre(data) and is_org(data):
                   sections = [tree.Data("main", json.dumps(map))]

               return sections

Parse
=====

Parse is responsible for collecting:

1. some string unique to that record, for RID
2. a date associated with that record, for RDate
3. all connected pages, as Edges

It parses this information out of the provided ``Data``. ``parse`` also receives the ``rid`` and ``rdate`` of the parent node, if one exists, so it can choose to inherit them.

The pipeline container has both ``lxml`` and ``bs4`` for parsing HTML.
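As an illustration of all three responsibilities together, a ``parse`` for the ``result``/``main`` pages of a source like ``lob fd`` might look like the sketch below. The XPath expressions, the ``regId`` parameter, and the URL are assumptions, not the real federal site:

.. code-block:: python
   :caption: hypothetical parse for a result/main pair

   from datetime import datetime

   from lxml import html

   from pipeline.crawl import tree
   from pipeline.utils.http import Request

   MAIN_URL = "https://example.org/registry/view"  # hypothetical endpoint


   def parse(
       data: tree.Data, p_rid: str, p_rdate: datetime
   ) -> tuple[str, datetime, list[tree.Edge]]:
       label, text = data
       root = html.fromstring(text)

       if label == "result":
           # 1. RID: a registration number unique to this record
           reg_no = root.xpath("string(//span[@class='reg-no'])").strip()
           # 2. RDate: a date shown on the result card
           rdate = datetime.fromisoformat(root.xpath("string(//time/@datetime)"))
           # 3. Edges: one request to the record's main page
           edge = tree.Edge(
               p_rid=reg_no,
               p_rdate=rdate,
               label="main",
               req=Request(method="GET", url=MAIN_URL, params={"regId": reg_no}),
           )
           return reg_no, rdate, [edge]

       # A `main` page describes the same record as its parent result card,
       # so inherit the parent's RID and RDate; no further edges to follow.
       return p_rid, p_rdate, []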
There is also the ``Marker`` utility class, which provides a nicer API to ``lxml.etree``:

.. autoclass:: pipeline.utils.html.marker.Marker
   :members:

********
Advanced
********

HTTP Sessions
=============

By default, all requests go through a single HTTP session. It is an ``aiohttp.ClientSession``, documented `here `_.

Cookies
^^^^^^^

``ClientSession`` automatically accepts and updates cookies. This has been sufficient for all websites we've crawled so far. You can pass arbitrary, request-level cookies as a dictionary to ``Request.cookies``.

Throttling
^^^^^^^^^^

We have four different ways to affect our throughput to the server:

1. ``-n``: total number of concurrent crawlers between ``seed`` calls
2. ``-ns``: number of sessions for ``-n`` many crawlers
3. ``--session-cap``: total simultaneous requests possible per session
4. ``-w``: flat throttle in seconds on all HTTP requests to the server

Managing sessions
^^^^^^^^^^^^^^^^^

``-ns`` or ``-n-sessions`` sets the total number of HTTP sessions in the crawl. Regardless of the number, the engine ensures that every edge to a descendant node is sent through the same session as its parent. We typically don't need multiple sessions; but if we do, we equalize ``-n`` and ``-ns``, so that every concurrent crawler has its own session.

SKIP and STOP
=============

By default, any visited page is saved. If you don't want to save a page, you can either skip over it, still saving any subsequent pages, or stop at that page without saving. To do either, pass ``__SKIP`` or ``__STOP`` as the RID in ``parse``:

.. code-block:: python

   if data.label == "result":
       # Search results themselves are skipped and we go right to `main`,
       # which uses the same logic as normal mode.
       href = root.first("//a").attr("href")
       bn = re.search(r"Bn=(.+)&", href).group(1)
       req = tree.Edge(
           p_rid="",
           p_rdate=DMIN,
           label="main",
           req=Request(
               method="GET", url=CHA_PAGE_URL, params={"selectedCharityBn": bn}
           ),
       )
       return "__SKIP", DMIN, [req]

Duplicates
==========

The engine does not save any duplicate pages. The internal representation of a node, which is linked to a SQL table, ``Pages``, implements these methods:

.. code-block:: python
   :caption: pipeline.database.models.crawl.Pages

   def _uniq(self) -> str:
       """Returns a string of concatenated values that uniquely identify any Pages object."""
       return self.cid + self.rid + self.label

   def __hash__(self) -> int:
       return hash(self._uniq())

   def __eq__(self, other) -> bool:
       if isinstance(other, Pages):
           return self._uniq() == other._uniq()
       return False

If a source implementation returns a page with the same ``rid, label`` pair as any previous page in this crawl (identified by ``cid``, the crawl ID), it is not ingested and a warning log is emitted.

By the nature of their structure, websites often serve duplicate records, and as extra protection we often account for that inside the source implementation.
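One way to do that is to keep a set of identifiers already seen by ``parse`` and return ``__STOP`` for repeats. This is only a sketch: the JSON field names are hypothetical, and whether ``__STOP`` or simply dropping the duplicate during ``sections`` is the right move is a per-source judgment:

.. code-block:: python
   :caption: hypothetical in-source duplicate guard

   import json
   from datetime import datetime

   from pipeline.crawl import tree

   # Record identifiers already seen during this crawl run.
   _seen: set[str] = set()


   def parse(
       data: tree.Data, p_rid: str, p_rdate: datetime
   ) -> tuple[str, datetime, list[tree.Edge]]:
       label, text = data
       record = json.loads(text)

       rid = record["registrationNumber"]                    # hypothetical field
       rdate = datetime.fromisoformat(record["postedDate"])  # hypothetical field

       # The site serves the same record under several search conditions;
       # stop here so the duplicate page is neither saved nor expanded.
       if rid in _seen:
           return "__STOP", rdate, []

       _seen.add(rid)
       return rid, rdate, []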