Overview
This document describes the IJF’s technical aims and its pipeline at a high level. It is relevant to contributing developers and anyone else working with the IJF.
IJF Databases
This pipeline is the engine of the IJF’s public interest databases. These are standardized and highly accessible collections of public data, usually disclosed by the government. The databases are normally amalgamations of multiple data sources describing the same kind of data, each in a different jurisdiction.
Take, for example, our lobbying database. It combines data from the lobbyist registries at the federal level and in nearly all provinces and territories: even though these are different sources maintained by different governments, they are sufficiently similar that we can create a unifying record structure which any of these sources can fulfill.
Core concepts
What makes a record
Here we describe some core concepts in the pipeline:
- db: Name of database. Always a three-letter prefix like lob.
- s: Name of source. Typically an official two-letter code like bc, but also fd (federal).
- rid: ID of record. An MD5 hash whose input includes the db and s of the record, such that each rid is absolutely unique.
Every record that we create is part of some db, coming from a particular s, and
holding a unique rid.
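As a rough illustration, a rid might be derived along these lines. This is a minimal sketch: the source_key argument and the exact hash input format are assumptions; only the use of an MD5 hash over inputs that include db and s comes from the description above.

```python
import hashlib

def make_rid(db: str, s: str, source_key: str) -> str:
    """Illustrative only: derive a stable record ID from the database prefix,
    the source code and some per-record key taken from the source (hypothetical)."""
    raw = f"{db}:{s}:{source_key}".encode("utf-8")
    return hashlib.md5(raw).hexdigest()

# make_rid("lob", "fd", "registration-12345") -> a 32-character hex digest
```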
Where records are
The unification of records from different sources within a database culminates in a set of SQL tables. Internally, these are called schemas, of which there is one per database (though the term has a different meaning in SQL databases):
- class pipeline.database.models.lob.LobSchema(*, meta: LobMeta, org: LobOrg, rep: LobRep, fund: list[LobFund], affiliate: list[LobAffiliate], lobbyist: list[LobLobbyist], firm: list[LobFirm], office: list[LobOffice], subject: list[LobSubject], category: list[LobCategory], target: list[LobTarget])
Schema for one record in the lobbying database.
- model_config = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
A schema typically comprises multiple tables, including a Meta table that holds, among other things,
the rid and s of each record:
- class pipeline.database.models.lob.LobMeta(*, rid: str, s: str, added: datetime, cid: str, rnum: str, active: bool, start_date: datetime, end_date: datetime | None)
Metadata about this registration version.
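For a sense of what such a record looks like in code, a Meta row could be constructed roughly as follows. This is a sketch: the import path follows the class shown above, but every field value is invented.

```python
from datetime import datetime

from pipeline.database.models.lob import LobMeta

meta = LobMeta(
    rid="0b0e84b1e2c74a1c9f1d2e3a4b5c6d7e",  # MD5-style hash, see "What makes a record"
    s="fd",                                   # federal source
    added=datetime.utcnow(),
    cid="12345",                              # invented example values
    rnum="987654-1",
    active=True,
    start_date=datetime(2024, 1, 1),
    end_date=None,
)
```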
The Pipeline
The core technologies in our pipeline are Docker, the Python repository, a PostgreSQL database and some AWS services – primarily S3 for hosting raw and intermediate data and ECS for running the different pipeline jobs on schedule.
The Python repository is made up of different Steps, each of which has its own module in src/pipeline/.
Each step has a general way of transforming records for some source, and an implementation specific to each db, s
pair that uses it. These implementations are found in each step's …ers subdirectory – e.g. crawl/crawlers/lob/fd/.
└── src
└── pipeline
├── __init__.py
├── cli.py
├── crawl <-- crawl step module
│ ├── __init__.py <-- shared crawling code
│ ├── core.py <-- /
│ ├── crawlers
│ │ └── lob
│ │ ├── fd
│ │ │ ├── __init__.py <-- crawler for lob-fd
Steps
Each step either takes input data from storage or collects it, transforms it, and pushes it to one of our storage
media. With our current databases, the final step for all data is clean, after which the data is loaded
into our production SQL database. Between steps, intermediate data is stored in S3.
As noted, each step has a core architecture and a set of implementations. All logic that must be source-specific is left to those implementations (e.g. at what URL is a file held?). The core architecture standardizes what is expected to be common to all implementations – especially storage. In general, an implementation needs to know nothing about how its processed data is stored so that the next step can access it; the core handles that.
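To make that split concrete, the relationship between a step's core and one of its implementations might be sketched like this. The class and method names are hypothetical; the repository's actual interfaces may differ.

```python
from abc import ABC, abstractmethod

class BaseCrawler(ABC):
    """Hypothetical core class: owns storage and bookkeeping, shared by all crawlers."""

    db: str  # e.g. "lob"
    s: str   # e.g. "fd"

    @abstractmethod
    async def fetch(self) -> list[bytes]:
        """Source-specific: which URLs to hit and how to walk them."""

    async def run(self) -> None:
        # Core behaviour shared by every implementation: collect the documents,
        # then persist them so the next step can find them, without the
        # implementation knowing anything about storage.
        for doc in await self.fetch():
            self._store(doc)

    def _store(self, doc: bytes) -> None:
        ...  # core storage logic (S3 object + metadata row), elided here

class LobFdCrawler(BaseCrawler):
    """Hypothetical implementation that would live under crawl/crawlers/lob/fd/."""

    db, s = "lob", "fd"

    async def fetch(self) -> list[bytes]:
        ...  # source-specific requests against the federal lobbyist registry
```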
This diagram shows all currently built steps. They are split into two informal paths:
1. The scraping path, which gets document-style data from APIs and web frontends
2. The download path, which gets bulk downloads in one or a few large flat files
Lobbying and charities take (1) and donations takes (2).
Crawl
An asynchronous web crawling engine, and the most complex step in the pipeline. All collected raw data is, at runtime and before storage, associated with a particular date and a unique rid. Metadata about every crawled document is stored in a SQL table and the data itself is stored in S3.
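For orientation, the relationship between a crawled document's metadata row and its raw object might look roughly like this. The column names and the S3 key layout are invented for illustration.

```python
from datetime import date

# One row in the crawl metadata table (hypothetical columns) ...
crawl_meta = {
    "rid": "0b0e84b1e2c74a1c9f1d2e3a4b5c6d7e",
    "db": "lob",
    "s": "fd",
    "crawled": date(2024, 6, 1),
}

# ... and one raw object in S3, addressed so that parse can later find it by rid.
s3_key = f"{crawl_meta['db']}/{crawl_meta['s']}/{crawl_meta['crawled']}/{crawl_meta['rid']}.html"
```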
Not pictured is browse, a drop-in replacement for crawl that uses a Playwright browser to scrape
websites that require client-side JS.
Parse
Takes raw data from crawl or browse and, for each rid, parses out information into a schema-compliant
record for the given database. Recall that one record is typically one row in a Meta table and one or
many rows in other associated tables.
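As a sketch of the idea: the parser function, the selectors and the field values below are all invented; only LobMeta itself comes from the models shown earlier, and the other tables of the record are omitted.

```python
from datetime import datetime

from bs4 import BeautifulSoup

from pipeline.database.models.lob import LobMeta

def parse_document(rid: str, s: str, html: str) -> LobMeta:
    """Hypothetical parser for one raw document: pull fields out of the HTML
    and build the Meta row of a schema-compliant record."""
    soup = BeautifulSoup(html, "html.parser")
    return LobMeta(
        rid=rid,
        s=s,
        added=datetime.utcnow(),
        cid=soup.select_one("#client-id").text.strip(),             # invented selectors
        rnum=soup.select_one("#registration-number").text.strip(),
        active=True,
        start_date=datetime.fromisoformat(soup.select_one("#start-date").text.strip()),
        end_date=None,
    )
```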
Download
Simple concurrent downloader to collect bulk archives, CSVs etc. from the internet. Unlike crawl, rid and date
are not assigned until later. We store the files as we get them, raw.
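A minimal sketch of that idea using only the standard library; the URLs and the local destination are invented, and the real step's concurrency model and storage may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical bulk files for a download-path database.
URLS = [
    "https://example.org/bulk/contributions_2023.csv",
    "https://example.org/bulk/contributions_2024.csv",
]

def download(url: str, dest_dir: Path = Path("raw")) -> Path:
    """Fetch one bulk file and store it exactly as received."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]
    urlretrieve(url, target)
    return target

with ThreadPoolExecutor(max_workers=4) as pool:
    paths = list(pool.map(download, URLS))
```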
Filter
Compares the latest version of each downloaded file against the most recently stored version, returning a filtered version
that contains only the new rows. This step can be skipped, but it keeps data small during the more complicated tidy.
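In pandas terms, the comparison could be sketched like this. The file names and the all-columns comparison are assumptions; the real step likely keys on specific identifier columns.

```python
import pandas as pd

def filter_new_rows(latest: pd.DataFrame, previous: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows of latest that do not appear in previous."""
    merged = latest.merge(previous, how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

latest = pd.read_csv("raw/contributions_2024-06.csv")
previous = pd.read_csv("raw/contributions_2024-05.csv")
new_rows = filter_new_rows(latest, previous)
```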
Tidy
Transforms data from across downloaded/filtered files into a set of schema-compliant records for that database.
Whereas parse typically uses HTML parsing tools, tidy would normally use pandas.
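As a sketch of what a tidy implementation might do: the input columns and the output shape are invented, and a real implementation would build full schema-compliant records rather than plain dicts.

```python
import pandas as pd

def tidy(filtered: pd.DataFrame) -> list[dict]:
    """Hypothetical tidy step: reshape bulk rows into record fragments."""
    renamed = filtered.rename(columns={
        "Contributor Name": "contributor",  # invented source column names
        "Amount ($)": "amount",
        "Date Received": "received",
    })
    renamed["received"] = pd.to_datetime(renamed["received"])
    # Datatype change ("$1,000.00" -> 1000.0) happens here rather than in clean.
    renamed["amount"] = pd.to_numeric(renamed["amount"].str.replace("[$,]", "", regex=True))
    return renamed[["contributor", "amount", "received"]].to_dict(orient="records")
```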
Clean
Runs element-level, superficial cleaning on schema-compliant records to bring the data into its final state before
SQL ingestion. For example, it fixes names stored as Lastname, Firstname. Much “cleaning” work like $1,000.00 -> 1000.00
is already done during parse or tidy, since it entails changing datatypes.
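For instance, the name fix mentioned above could be as simple as this sketch; the real cleaning functions are presumably more defensive.

```python
def fix_name(name: str) -> str:
    """Turn 'Lastname, Firstname' into 'Firstname Lastname'; leave other values alone."""
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        return f"{first} {last}"
    return name

assert fix_name("Doe, Jane") == "Jane Doe"
assert fix_name("Jane Doe") == "Jane Doe"
```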