Crawl

Crawl collects structured data from the web, typically as HTML or JSON. Source implementations give HTTP requests and metadata to the core architecture, which manages all networking and storage asynchronously.

Skip to “Writing Crawlers” at your own peril.

Trees

Each website is represented as a tree: nodes are “pages” (an HTML document or an API response) and the edges between them are HTTP requests. This structure is common to all web crawling projects. We call this the sitemap.

The IJF’s crawling setup has unique goals that require a homemade engine. As outlined in the overview, the IJF deals in rids: all raw data is assigned to some record, identified by a unique rid, before it is ever saved in our storage.

So there are two trees in the crawl: the sitemap described above and the new record tree. Each node in the record tree is data, as in the sitemap, but carries metadata including an rid. Further, to prevent confusion later on, each node in the record tree holds data forming part or all of exactly one record. The edges are still HTTP requests.

The work of a crawler is traversing the sitemap and, along the way, building the record tree.

Example: lob fd and the searchable database

The federal lobbyist registry is an instructive example of our crawling projects. It has the structure we always see in target sites: the searchable database.

_images/search-db.svg

This diagram shows the searchable database sitemap. There are N pages of results, each linking to P records. The HTTP request to get each result page usually carries some condition, like a date range, followed by a pagination count. The HTTP request to get each main page is usually given some ID found in the result card, hopefully in a friendly <a> tag.

At the registry linked above, we can see that each result card corresponds to one record, and can guess that some unique string identifying those records – what we require for building the RID – lives therein. So we can run sections() on the results page, splitting it into P records, each with an rid.

_images/rid-tree-section.svg

One record corresponds to a subtree of the record tree containing all pages associated with that record. We don’t section main in this example because each main page describes just that record – which is normal, but not guaranteed. Also note that the rid is inherited from its first appearance in the subtree by all descendant nodes.
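As a preview of the sections() hook documented under “Writing crawlers” below, sectioning such a results page might look like this sketch; the label names, HTML structure, and XPath selector are assumptions:

from pipeline.crawl import tree
from pipeline.utils.html.marker import Marker


def sections(data: tree.Data) -> list[tree.Data]:
    label, text = data
    if label != "result_page":  # assumed label; main pages pass through whole
        return [data]
    root = Marker(text)
    # one Data per result card, each destined for its own record
    return [
        tree.Data("result", card.to_str())
        for card in root.all("//div[@class='result-card']")  # assumed selector
    ]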

Seeds

These trees grow from seeds. In normal web crawlers this is sometimes called the URL frontier: a priority queue of URLs from which a crawler recursively follows all <a> tags, forming a tree.

We do deep web scraping. In these searchable databases, none of the data we need is accessible through a chain of surface-level, same-site links. We need to make database queries to get results describing the desired range of records, most often bounded by two dates.

So a static URL does not give us the information we need to crawl such a site. Instead, we seed the crawler dynamically at runtime, in a function called seed().

seed(): the start of the event loop

Every call of seed produces a new HTTP request from which crawling can begin or resume. In the above example, calling seed would return a POST request querying the first page of registrations from the federal site, kicking off the process shown in the second diagram.

_images/seed.svg

seed can access a global scheduler object that holds all the identifying information about where you are in the crawl, and by implication what’s next. What you must provide in the request is source-dependent, but is often: page_start, the starting index of the page; and/or start_date and end_date, the full date range of records regardless of pagination.

The core crawling architecture (the black box above) updates the scheduler during its run, finishing before seed is called again.

Keeping with our lob fd example, calling seed would return a request pointing to one page of results, which we would then process as in the figure above.
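As a sketch only, such a seed might look like the following. It uses the Edge, Request, and Scheduler objects documented under “Writing crawlers” below; the URL, request body, and the DMIN “no date” sentinel (borrowed from a later example) are placeholders, and termination logic is elided:

def seed(scheduler: Scheduler) -> tree.Edge | None:
    # returning None when the range is exhausted is elided here
    return tree.Edge(
        p_rid="",      # no parent: this is a root request
        p_rdate=DMIN,  # assumed "no date" sentinel
        label="result",
        req=Request(
            method="POST",
            url="https://example.org/registry/search",    # placeholder
            json={"page": scheduler.indexer.page_start},  # placeholder body
        ),
    )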

Algorithm

We can now outline the whole crawling process from seed to seed.

_images/crawl-algo.svg
  1. seed is called, sending a request into an outgoing queue of requests

  2. Note that label, used implicitly above with result and main, indicates the kind of page that the request will provide. Labels are set arbitrarily per source.

  3. The response for any request is Data, a combination of its preceding edge’s label and the underlying response bytes.

  4. sections is applied here as described before, splitting search results into one Data per record.

  5. parse is a source-specific function that takes each sectioned Data and fulfills two crucial processes:

     a. finding the rid for this Data, or choosing to inherit its parent’s

     b. finding any edges in this Data, which are requests to other pages in the sitemap

  6. The edges list goes back into the request queue, like the request provided by seed.

Steps 3-6 repeat until no further edges are found. At that point, seed runs again, starting the cycle anew.
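A minimal, synchronous sketch of this loop, with the async queueing and storage the core performs elided; the fetch() helper is an assumption standing in for the core’s networking:

def crawl(source, scheduler, fetch):
    while (edge := source.seed(scheduler)) is not None:  # step 1
        queue = [edge]
        while queue:
            edge = queue.pop()
            data = fetch(edge.req, edge.label)     # steps 2-3: Data(label, bytes)
            for part in source.sections(data):     # step 4
                rid, rdate, edges = source.parse(  # step 5
                    part, edge.p_rid, edge.p_rdate
                )
                queue.extend(edges)                # step 6: back into the queue
        # the core updates the scheduler here, before seed runs again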

Writing crawlers

Features and the CLI

Note

I have to redo the cli docs. There aren’t any.

The CLI for crawl is documented here. The arguments become this struct:

class pipeline.args.CrawlArgs(*args, **kwargs)

Supported arguments for crawl module.

Parameters:
  • n_tasks – int – number of concurrent crawling tasks (10)

  • start – int – start index of crawl (1)

  • stop – int – end index of crawl, inclusive (1,000,000)

  • from_date – str – start date of crawl (1970-01-01)

  • to_date – str – end date of crawl (datetime.now())

  • n_sessions – int – number of separate HTTP sessions (1)

  • session_cap – int – max simultaneous requests per session (n_tasks)

  • throttle – int – flat wait in seconds on every request (0)

Every crawler must support either:

  • -H: historical mode

  • -f/-t: dated mode

The latter is ideal: since we run daily, we prefer each run to cover “the last day”. Some sources do not support date filtering but are small enough that covering their whole population takes less than 10 minutes. In those cases, just implement -H.

-c/-s, or start/stop, is also supported in the core but not currently used: none of our data sources without date filtering are yet large enough to need it.

All CrawlArgs values above are available in the scheduler, and used during seed.

Logistics

Create a new crawler for some database db and source s at: src/pipeline/crawl/crawlers/<db>/<s>. This directory must be a Python module, so it must contain an __init__.py file which exposes all the necessary objects. They can be written in the init file, or in any arbitrary structure.

This module must have three public functions: seed, sections and parse. They must have the following signatures:

def seed(scheduler: Scheduler) -> tree.Edge | None:

def sections(data: tree.Data) -> list[tree.Data]:

def parse(
    data: tree.Data, p_rid: str, p_rdate: datetime
) -> tuple[str, datetime, list[tree.Edge]]:
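A bare skeleton of such a module might look like this; the Scheduler import path is an assumption, and real implementations follow in the sections below:

# src/pipeline/crawl/crawlers/<db>/<s>/__init__.py -- skeleton only
from datetime import datetime

from pipeline.crawl import tree
from pipeline.crawl.scheduler import Scheduler  # assumed import path


def seed(scheduler: Scheduler) -> tree.Edge | None:
    """Return the next request to crawl from, or None when exhausted."""


def sections(data: tree.Data) -> list[tree.Data]:
    """Split one page's Data into one Data per record."""


def parse(
    data: tree.Data, p_rid: str, p_rdate: datetime
) -> tuple[str, datetime, list[tree.Edge]]:
    """Return (rid, rdate, edges) for this Data."""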

Edge and Request

These two core classes are used to represent outgoing network traffic.

class pipeline.crawl.tree.Edge(p_rid: str, p_rdate: datetime, label: str, req: Request, media: bool = False)

Directed connection from one node to a descendant in the record tree.

Parameters:
  • p_rid – str – Parent’s RID. Constructed unhashed but hashed internally later.

  • p_rdate – datetime – Parent’s RDate

  • label – str – label for referenced node

  • req – Request – underlying Request object

class pipeline.utils.http.Request(method: str, url: str, params: dict[str, Any] = <factory>, headers: dict[str, str] = <factory>, json: dict | None = None, data: dict | bytes | None = None, cookies: dict[str, str] | None = None)

The core Request object used everywhere in the pipeline. Used with either the requests or aiohttp library.

Parameters:
  • method – str – method, e.g. “GET”

  • url – str – URL of request

  • params – dict[str, Any] – URL parameters

  • headers – dict[str, str] – request headers

  • json – dict | None = None – json-formatted data accompanying request

  • data – dict | bytes | None – non-json data, like form/multipart

  • cookies – dict[str, str] | None = None – arbitrary cookies, separate from running cookie_jar in HTTP session(s)
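For example, an edge from a result card to its record’s main page might be built like this; the URL, label, and RID values are placeholders:

from datetime import datetime

from pipeline.crawl import tree
from pipeline.utils.http import Request

edge = tree.Edge(
    p_rid="A1B2C3",                # parent's RID, unhashed
    p_rdate=datetime(2024, 1, 1),  # parent's RDate
    label="main",                  # kind of page this request will yield
    req=Request(
        method="GET",
        url="https://example.org/registry/record",  # placeholder
        params={"id": "A1B2C3"},
    ),
)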

Seed

seed uses the global scheduler to return either an Edge or None. The scheduler API is documented in full here.

seed returns None once you’ve exhausted all the data in the date or index range.

Whether -f/-t or -H is passed to the CLI, the choice is visible during seed via the scheduler: on construction, the scheduler is assigned a Runtime which reflects the passed CLI arguments.

class pipeline.crawl.scheduler.runtime.Runtime(*values)

Models the possible runtimes of a crawl:

  HIST :: all records, regardless of index, date
  IDXBOUND :: records from and/or to some ind(ex|ices)
  DATEBOUND :: records from and/or to some date(s)

This is a characteristic example of using the runtime in seed:

lob sk crawl
def seed(scheduler: Scheduler) -> tree.Edge | None:
    path = Path(__file__).parent.resolve()

    # in HIST, we just cover all pages
    if scheduler.runtime == Runtime.HIST:
        with open(path / "adv.json") as file:
            req = json.load(file)
        req["json"]["start"] = scheduler.indexer.page_start - 1
    elif scheduler.runtime == Runtime.IDX:
        raise NotImplementedError("IDX Runtime not supported in lob sk")
    # if DATE, provide the date
    elif scheduler.runtime == Runtime.DATE:
        if (
            MAX_RECORDS_IN_DATE_RANGE <= scheduler.indexer.max_idx
            and scheduler.seeds > 0
        ):
            return None
        with open(path / "adv.json") as file:
            req = json.load(file)
        req["json"]["PostedFromDate"] = scheduler.calendar.from_date.date().isoformat()
        req["json"]["PostedToDate"] = scheduler.calendar.to_date.date().isoformat()

Section

In most sources, you will only need to section a search results page. Sometimes, though, the way a source structures its data differs from the schema we’ve designed. In the Quebec lobbyist registry, records are organized per-lobbyist, whereas we do per-entity:

lob qc crawl
def sections(data: tree.Data) -> list[tree.Data]:
    label, text = data
    map = json.loads(text)
    match label:
        case "search_results":
            return [tree.Data("result", json.dumps(result)) for result in map]
        case _:
            sections = []
            # post, consultant: split clients
            if not is_pre(data) and not is_org(data):
                for client in map["clients"]:
                    # saving "clients" as an array, even though always one element,
                    # to match style we see in raw JSON
                    sections.append(
                        tree.Data("main", json.dumps({**map, "clients": [client]}))
                    )
            # pre, any: multiple mandates?
            elif is_pre(data):
                for mandat in map["mandts"]:
                    # like clients above, saving mandats in array
                    sections.append(
                        tree.Data("main", json.dumps({**map, "mandts": [mandat]}))
                    )
            # this leaves post org, which doesn't require splitting
            elif not is_pre(data) and is_org(data):
                sections = [tree.Data("main", json.dumps(map))]

            return sections

Parse

Parse is responsible for collecting:

  1. some string unique to that record, for RID

  2. a date associated with that record, for RDate

  3. all connected pages, as Edges

It parses this information out of the provided Data. parse is also handed the rid and rdate of the parent node, if one exists, so it can choose to inherit them.

The pipeline container has both lxml and bs4 for parsing HTML. There is also the Marker utility class that provides a nicer API to lxml.etree:

class pipeline.utils.html.marker.Marker(elem: str | _Element)
all(expr: str) → list[Self]

Returns all matching element objects as Markers.

attr(attribute: str) → str

Returns value of the given attribute.

extract(pat: str, expr: str = '.') → str | tuple[str]

Gets regex capture from element’s text, using re.search. If expr, uses text of first matching element. If not, uses current elem.

Parameters:
  • pat (str) – regex string

  • expr (str | None) – XPATH expression for element to use

Returns:

str of regex match if only one group, else tuple of all groups.

first(expr: str) → Self

Returns first matching element object as Marker.

save(label: str = 'tmp') → None

Saves string content of Marker to the tmp html directory.

to_str() → str

Returns string version of self, via etree.tostring().

txt(expr: str = '.') → str

Takes XPATH expression and returns text of first matching element. If no expr, returns text of current element. Uses the text() XPATH function, not lxml Element.text attribute.

/text() sometimes returns multiple values. By default, all are joined into one string with no separator.

Parameters:

expr (str | None) – XPATH expression. Defaults to “.”.

Returns:

text string of first elem matching expr.
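Putting these pieces together, a parse for the searchable-database shape from earlier might look like the following sketch. The XPath expressions, regex, labels, and REGISTRY_URL are illustrative assumptions:

import re
from datetime import datetime

from pipeline.crawl import tree
from pipeline.utils.html.marker import Marker
from pipeline.utils.http import Request

REGISTRY_URL = "https://example.org/registry/record"  # placeholder


def parse(
    data: tree.Data, p_rid: str, p_rdate: datetime
) -> tuple[str, datetime, list[tree.Edge]]:
    label, text = data
    root = Marker(text)

    if label == "result":
        # a sectioned result card: pull the record's unique ID out of its link
        href = root.first("//a").attr("href")
        rid = re.search(r"id=(\d+)", href).group(1)         # assumed href shape
        rdate = datetime.fromisoformat(root.txt("//time"))  # assumed date element
        # one edge onward to the record's main page
        edge = tree.Edge(
            p_rid=rid,
            p_rdate=rdate,
            label="main",
            req=Request(method="GET", url=REGISTRY_URL, params={"id": rid}),
        )
        return rid, rdate, [edge]

    # main pages describe the same record: inherit the parent's rid and rdate,
    # with no further edges to follow
    return p_rid, p_rdate, []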

Advanced

HTTP Sessions

By default, all requests go through a single HTTP session. It is an aiohttp.ClientSession, documented here.

Cookies

ClientSession automatically accepts and updates cookies. This has been sufficient for all websites we’ve crawled so far. You can pass arbitrary, request-level cookies as a dictionary to Request.cookies.
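For example, a request-level cookie might be attached like this; the URL and cookie name are placeholders:

from pipeline.utils.http import Request

req = Request(
    method="GET",
    url="https://example.org/registry",  # placeholder
    cookies={"lang": "en"},              # sent with this request only
)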

Throttling

We have four different ways to affect our throughput to the server:

  1. -n: total number of concurrent crawlers between seed calls

  2. -ns: number of sessions for -n many crawlers

  3. --session-cap: total simultaneous requests possible per session

  4. -w: flat throttle in seconds on all HTTP requests to the server
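For instance, -n 4 -ns 4 --session-cap 2 -w 1 would run four crawlers with one session each, at most two simultaneous requests per session, and a flat one-second wait on every request.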

Managing sessions

-ns or --n-sessions sets the total number of HTTP sessions in the crawl. Regardless of the number, the engine ensures that every edge to a descendant node is sent through the same session as its parent.

We typically don’t need multiple sessions, but if we do, we equalize -n and -ns so that every concurrent crawler has its own session.

SKIP and STOP

By default, any visited page is saved. If you don’t want to save a page, you can either skip over it, saving any subsequent pages, or stop at that page without saving.

To do either, pass __SKIP or __STOP as the RID in parse. In the excerpt below, root is a Marker over the page HTML, and DMIN and CHA_PAGE_URL are constants from the surrounding module:

if data.label == "result":
    # Search results themselves are skipped and we go right to `main`,
    # which uses the same logic as normal mode.
    href = root.first("//a").attr("href")
    bn = re.search(r"Bn=(.+)&", href).group(1)
    req = tree.Edge(
        p_rid="",
        p_rdate=DMIN,
        label="main",
        req=Request(
            method="GET", url=CHA_PAGE_URL, params={"selectedCharityBn": bn}
        ),
    )
    return "__SKIP", DMIN, [req]

Duplicates

The engine does not save any duplicate pages. The internal representation of a node, linked to the SQL table Pages, implements these methods:

pipeline.database.models.crawl.Pages
 def _uniq(self) -> str:
     """Returns string of concatenated values that uniquely identify any Pages
     object."""
     return self.cid + self.rid + self.label

 def __hash__(self) -> int:
     return hash(self._uniq())

 def __eq__(self, other) -> bool:
     if isinstance(other, Pages):
         return self._uniq() == other._uniq()
     return False

If a source implementation returns a page with the same rid, label pair as any previous page in this crawl (identified by cid, the crawl ID), it is not ingested and a warning is logged.
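In effect, two pages sharing cid, rid, and label hash and compare as equal, so a second instance is dropped. A sketch, assuming Pages can be constructed with these fields as keyword arguments:

a = Pages(cid="crawl-1", rid="r1", label="main")
b = Pages(cid="crawl-1", rid="r1", label="main")
assert a == b
assert len({a, b}) == 1  # the duplicate would not be ingested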

Websites often serve duplicate records by the nature of their structure; as extra protection, we often account for that inside the source implementation as well.