Crawl
Crawl collects structured data from the web, typically as HTML or JSON. Source implementations hand HTTP requests and metadata to the core architecture, which manages all networking and storage asynchronously.
Skip to “Writing Crawlers” at your own peril.
Trees
Each website is represented as a tree: nodes are “pages” (an HTML document or an API response) and the edges between them are HTTP requests. Virtually every web crawling project builds on this structure. We call this the sitemap.
The IJF’s crawling setup has goals that require a homemade engine. As outlined in the overview,
the IJF deals in rids: all raw data is assigned to some record, identified by a unique rid, before
it is ever saved in our storage.
So there are two trees in the crawl: the sitemap described above and a new record tree. Each node in
the record tree is data, as in the sitemap, but carries metadata including an rid. Further, to
prevent confusion later on, each node in the record tree holds data forming part or all of exactly one record.
The edges are still HTTP requests.
The work of a crawler is traversing the sitemap and, along the way, building the record tree.
Example: lob fd and the searchable database
The federal lobbyist registry is an instructive example of our crawling projects. It has the structure we see in almost every target site: the searchable database.
This diagram shows the searchable database sitemap. There are N pages of results, each linking P records.
The HTTP request to get each result page is usually parameterized by some condition, like a date range, plus
a pagination count. The HTTP request to get each main page usually carries some ID found in the result card,
hopefully in a friendly <a> tag.
From the registry linked above, we can see that each result card corresponds to one record, and can guess that some unique string
identifying those records – what we need to build the RID – lives therein. So we can section() the
results page into P records, each with an rid.
One record is composed of a subtree of the record tree containing all pages associated with that record. We don’t
section main pages in this example because each main page describes just that record – which is typical, but not
guaranteed. Also note that the RID is inherited from its first appearance in the subtree by all descendant nodes.
Seeds
These trees grow from seeds. In conventional web crawlers this is usually called the URL frontier:
a priority queue of URLs from which a crawler recursively follows all <a> tags, forming a tree.
We do deep web scraping. In these searchable databases, none of the data we need is accessible through a chain of surface-level same-site links. We need to make database queries to get results describing the desired range of records, most often within two dates.
So a static URL does not give us what we need to crawl such a site. Instead, we seed the crawler
dynamically, at runtime, in a function called seed().
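In practice, dynamic seeding often amounts to walking a date window per call. A stdlib-only sketch of that idea (the helper and its arguments are hypothetical; the real seed() reads its bounds from the scheduler):

```python
from datetime import date, timedelta

def date_windows(start: date, stop: date, days: int = 1):
    """Yield (from_date, to_date) windows covering [start, stop] inclusive."""
    cur = start
    while cur <= stop:
        end = min(cur + timedelta(days=days - 1), stop)
        yield cur, end
        cur = end + timedelta(days=1)

# Three daily windows a crawler could query one seed() call at a time.
windows = list(date_windows(date(2024, 1, 1), date(2024, 1, 3)))
```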
seed(): the start of the event loop
Every call of seed produces a new HTTP request from which crawling can begin or resume. In the above
example, calling seed would return a POST request querying the first page of registrations from
the federal site, kicking off the process shown in the second diagram.
seed can access a global scheduler object that holds all the identifying information about where you
are in the crawl, and therefore what comes next. What you must provide in the request is
source-dependent, but is often: page_start, the starting index of the page; and/or start_date and
end_date, the full date range of records regardless of pagination.
The core crawling architecture (black box above) updates the scheduler during its run, finishing before seed is called again.
Keeping with our lob fd example, calling seed would return a request that points to one page of
results, which we would process as shown in the figure above.
Algorithm
We can now outline the whole crawling process from seed to seed.
1. seed is called, sending a request into an outgoing queue of requests. Note that label, implied earlier when using result and main, indicates the kind of page the request will provide; labels are set arbitrarily per source.
2. The response to any request is Data, a combination of its preceding edge’s label and the underlying response bytes.
3. sections runs as described before, splitting search results into one Data per record.
4. parse is a source-specific function that takes each sectioned Data and fulfills two crucial processes:
   a. finding the rid for this Data, or choosing to inherit its parent’s
   b. finding any edges in this Data, which are requests to other pages in the sitemap
5. The edges list goes back into the request queue, like the request provided by seed.
Steps 2-5 repeat until no further edges are found. At that point, seed runs again, starting the
cycle anew.
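The cycle above can be sketched synchronously. All names here are stand-ins: the real engine runs this loop asynchronously with a request queue shared across tasks.

```python
from collections import deque

def crawl(seed, sections, parse, fetch):
    """Synchronous sketch of the seed-to-seed cycle (stand-in API only)."""
    visited = []
    while (edge := seed()) is not None:             # 1. seed provides a request
        queue = deque([edge])
        while queue:                                # 2-5. drain the edge queue
            _label, url, p_rid = queue.popleft()
            data = fetch(url)                       # response bytes become Data
            visited.append(url)
            for section in sections(data):          # one Data per record
                rid, edges = parse(section, p_rid)  # rid found or inherited
                queue.extend(edges)                 # new edges re-enter queue
    return visited

# Tiny fake source: one results page linking one main page.
seeds = iter([("result", "results?page=1", "")])
order = crawl(
    seed=lambda: next(seeds, None),
    sections=lambda data: [data],
    parse=lambda data, p_rid: (
        "rid-1",
        [("main", "main?id=1", "rid-1")] if data == "results" else [],
    ),
    fetch=lambda url: "results" if url.startswith("results") else "main",
)
```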
Writing crawlers
Features and the CLI
Note
I have to redo the cli docs. There aren’t any.
The CLI for crawl is documented here. The arguments become this struct:
- class pipeline.args.CrawlArgs(*args, **kwargs)
Supported arguments for crawl module.
- Parameters:
n_tasks – int – number of concurrent crawling tasks (10)
start – int – start index of crawl (1)
stop – int – end index of crawl, inclusive (1,000,000)
from_date – str – start date of crawl (1970-01-01)
to_date – str – end date of crawl (datetime.now())
n_sessions – int – number of separate HTTP sessions (1)
session_cap – int – max simultaneous requests per session (n_tasks)
throttle – int – flat wait in seconds on every request (0)
Every crawler must support either:
- -H: historical mode
- -f/-t: dated mode
The latter is ideal. Since we run daily, we prefer each run to cover “the last day”. Some sources do
not support date filtering but are small enough that covering their whole population takes
less than 10 minutes. In those cases, just implement -H.
-c/-s, or start/stop, is also supported in the core but not currently used. None of our data sources is
yet large enough, without date filtering, to make them necessary.
All CrawlArgs values above are available in the scheduler, and used during seed.
Logistics
Create a new crawler for some database db and source s at src/pipeline/crawl/crawlers/<db>/<s>. This
directory must be a Python module, so it must contain an __init__.py file which exposes
all the necessary objects. They can be written in the init file, or in any arbitrary structure.
This module must have three public functions: seed, sections and parse. They must
have the following signatures:
def seed(scheduler: Scheduler) -> tree.Edge | None:
def sections(data: tree.Data) -> list[tree.Data]:
def parse(
data: tree.Data, p_rid: str, p_rdate: datetime
) -> tuple[str, datetime, list[tree.Edge]]:
Edge and Request
These two core classes are used to represent outgoing network traffic.
- class pipeline.crawl.tree.Edge(p_rid: str, p_rdate: datetime, label: str, req: Request, media: bool = False)
Directed connection from one node to a descendant in the record tree.
- Parameters:
p_rid – str – Parent’s RID. Constructed unhashed but hashed internally later.
p_rdate – datetime – Parent’s RDate
label – str – label for referenced node
req – Request – underlying Request object
- class pipeline.utils.http.Request(method: str, url: str, params: dict[str, ~typing.Any]=<factory>, headers: dict[str, str]=<factory>, json: dict | None = None, data: dict | bytes | None = None, cookies: dict[str, str] | None=None)
The core Request object used everywhere in the pipeline. Used either with requests or aiohttp libraries.
- Parameters:
method – str – method, e.g. “GET”
url – str – URL of request
params – dict[str, Any] – URL parameters
headers – dict[str, str] – request headers
json – dict | None = None – json-formatted data accompanying request
data – dict | bytes | None – non-json data, like form/multipart
cookies – dict[str, str] | None = None – arbitrary cookies, separate from running cookie_jar in HTTP session(s)
Seed
seed uses the global scheduler to return either an Edge or None. The
scheduler API is documented in full here.
seed returns None if you’ve exhausted all the data in the date or index range.
Whichever of -f/-t or -H is passed to the CLI, the choice is visible during seed via the
scheduler. On construction, the scheduler is assigned a Runtime which reflects the passed
CLI arguments.
- class pipeline.crawl.scheduler.runtime.Runtime(*values)
Models the possible runtimes of a crawl:
- HIST :: all records, regardless of index or date
- IDXBOUND :: records from and/or to some ind(ex|ices)
- DATEBOUND :: records from and/or to some date(s)
This is a characteristic example of using the runtime in seed:
def seed(scheduler: Scheduler) -> tree.Edge | None:
    path = Path(__file__).parent.resolve()
    # in HIST, we just cover all pages
    if scheduler.runtime == Runtime.HIST:
        with open(path / "adv.json") as file:
            req = json.load(file)
        req["json"]["start"] = scheduler.indexer.page_start - 1
    elif scheduler.runtime == Runtime.IDX:
        raise NotImplementedError("IDX Runtime not supported in lob sk")
    # if DATE, provide the date
    elif scheduler.runtime == Runtime.DATE:
        if (
            MAX_RECORDS_IN_DATE_RANGE <= scheduler.indexer.max_idx
            and scheduler.seeds > 0
        ):
            return None
        with open(path / "adv.json") as file:
            req = json.load(file)
        req["json"]["PostedFromDate"] = scheduler.calendar.from_date.date().isoformat()
        req["json"]["PostedToDate"] = scheduler.calendar.to_date.date().isoformat()
Section
In most sources, you will only need to section a search results page. Sometimes, though, a source structures its data differently than the schema we’ve designed. In the Quebec lobbyist registry, records are organized per lobbyist, whereas ours are per entity:
def sections(data: tree.Data) -> list[tree.Data]:
    label, text = data
    map = json.loads(text)
    match label:
        case "search_results":
            return [tree.Data("result", json.dumps(result)) for result in map]
        case _:
            sections = []
            # post, consultant: split clients
            if not is_pre(data) and not is_org(data):
                for client in map["clients"]:
                    # saving "clients" as an array, even though always one element,
                    # to match style we see in raw JSON
                    sections.append(
                        tree.Data("main", json.dumps({**map, "clients": [client]}))
                    )
            # pre, any: multiple mandates?
            elif is_pre(data):
                for mandat in map["mandts"]:
                    # like clients above, saving mandats in array
                    sections.append(
                        tree.Data("main", json.dumps({**map, "mandts": [mandat]}))
                    )
            # this leaves post org, which doesn't require splitting
            elif not is_pre(data) and is_org(data):
                sections = [tree.Data("main", json.dumps(map))]
            return sections
Parse
Parse is responsible for collecting:
- some string unique to that record, for the RID
- a date associated with that record, for the RDate
- all connected pages, as Edges
It parses this information out of the provided Data. parse also receives the rid
and rdate of the parent node, if one exists, which it may choose to inherit.
The pipeline container has both lxml and bs4 for parsing HTML. There is also the
Marker utility class that provides a nicer API to lxml.etree:
- class pipeline.utils.html.marker.Marker(elem: str | _Element)
- all(expr: str) list[Self]
Returns all matching element objects as Markers.
- attr(attribute: str) str
Returns value of the given attribute.
- extract(pat: str, expr: str = '.') str | tuple[str]
Gets regex capture from element’s text, using re.search. If expr, uses text of first matching element. If not, uses current elem.
- Parameters:
pat (str) – regex string
expr (str | None) – XPATH expression for element to use
- Returns:
str of regex match if only one group, else tuple of all groups.
- first(expr: str) Self
Returns first matching element object as Marker.
- save(label: str = 'tmp') None
Saves string content of Marker to the tmp html directory.
- to_str() str
Returns string version of self, via etree.tostring().
- txt(expr: str = '.') str
Takes XPATH expression and returns text of first matching element. If no expr, returns text of current element. Uses the text() XPATH function, not lxml Element.text attribute.
/text() sometimes returns multiple values. By default, all are joined into one string with no separator.
- Parameters:
expr (str | None) – XPATH expression. Defaults to “.”.
- Returns:
text string of first elem matching expr.
Advanced
HTTP Sessions
By default, all requests go through a single HTTP session. It is an aiohttp.ClientSession,
documented here.
Throttling
We have four different ways to affect our throughput to the server:
- -n: total number of concurrent crawlers between seed calls
- -ns: number of sessions for -n many crawlers
- --session-cap: total simultaneous requests possible per session
- -w: flat throttle in seconds on all HTTP requests to the server
Managing sessions
-ns or -n-sessions sets the total number of HTTP sessions in the crawl. Regardless of
the number, the engine ensures that every edge to a descendant node is sent through the same
session as its parent.
We typically don’t need multiple sessions; but if we do, we equalize -n and -ns, so that
every concurrent crawler has its own session.
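Conceptually, --session-cap is a semaphore wrapped around each session. A stdlib asyncio sketch of the idea (all names hypothetical):

```python
import asyncio

async def fetch_with_cap(sem: asyncio.Semaphore, i: int, log: list[int]) -> None:
    # At most session_cap of these coroutines hold the semaphore at once.
    async with sem:
        log.append(i)
        await asyncio.sleep(0)  # stand-in for the actual HTTP round trip

async def main() -> list[int]:
    session_cap = 2                       # analogous to --session-cap
    sem = asyncio.Semaphore(session_cap)  # one semaphore per HTTP session
    log: list[int] = []
    await asyncio.gather(*(fetch_with_cap(sem, i, log) for i in range(5)))
    return log

done = asyncio.run(main())
```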
SKIP and STOP
By default, every visited page is saved. If you don’t want to save a page, you can either skip over it, still saving any subsequent pages, or stop at that page without saving.
To do either, pass __SKIP or __STOP as the RID in parse:
if data.label == "result":
    # Search results themselves are skipped and we go right to `main`,
    # which uses the same logic as normal mode.
    href = root.first("//a").attr("href")
    bn = re.search(r"Bn=(.+)&", href).group(1)
    req = tree.Edge(
        p_rid="",
        p_rdate=DMIN,
        label="main",
        req=Request(
            method="GET", url=CHA_PAGE_URL, params={"selectedCharityBn": bn}
        ),
    )
    return "__SKIP", DMIN, [req]
Duplicates
The engine does not save any duplicate pages. The internal representation of a node,
which is linked to a SQL table, Pages, implements these methods:
def _uniq(self) -> str:
    """Returns string of concatenated values that uniquely identify any Pages
    object."""
    return self.cid + self.rid + self.label

def __hash__(self) -> int:
    return hash(self._uniq())

def __eq__(self, other) -> bool:
    if isinstance(other, Pages):
        return self._uniq() == other._uniq()
    return False
If a source implementation returns a page with the same (rid, label) pair as any
previous page in this crawl (keyed by cid, the crawl ID), then it is not ingested
and a warning log is emitted.
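Given the __hash__/__eq__ pair above, deduplication is plain set membership. A self-contained sketch using a minimal stand-in for Pages:

```python
class Pages:
    # Minimal stand-in carrying only the identity fields used by _uniq();
    # the real class is backed by a SQL table.
    def __init__(self, cid: str, rid: str, label: str):
        self.cid, self.rid, self.label = cid, rid, label

    def _uniq(self) -> str:
        return self.cid + self.rid + self.label

    def __hash__(self) -> int:
        return hash(self._uniq())

    def __eq__(self, other) -> bool:
        if isinstance(other, Pages):
            return self._uniq() == other._uniq()
        return False

seen: set[Pages] = set()
seen.add(Pages("crawl-1", "rid-a", "main"))
dupe = Pages("crawl-1", "rid-a", "main")  # same (cid, rid, label)
is_duplicate = dupe in seen               # would be warned about, not ingested
```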
Websites often serve duplicate records by their very structure, so as extra protection we often account for that inside the source implementation as well.