Overview
This document describes the IJF’s technical aims and its pipeline at a high level. It is relevant to contributing developers and anyone else working with the IJF.
IJF Databases
This pipeline is the engine of the IJF’s public interest databases. These are standardized and highly accessible collections of public data, usually disclosed by the government. The databases are normally amalgamations of multiple data sources describing the same kind of data, each in a different jurisdiction.
Take, for example, our lobbying database. It combines data from the lobbyist registries at the federal level and in nearly all provinces and territories: even though these are different sources maintained by different governments, they are sufficiently similar that we can create a unifying record structure which any of these sources can fulfill.
Core concepts
What makes a record
Here we describe some core concepts in the pipeline:
- db: Name of database. Always a three-letter prefix like lob.
- s: Name of source. Typically an official two-letter code like bc, but also fd (federal).
- rid: ID of record. An MD5 hash whose input includes the db and s of the record, such that each rid is absolutely unique.
Every record that we create is part of some db, coming from a particular s, and
holding a unique rid.
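As a rough illustration, a rid might be derived along these lines. This is a minimal sketch: the source_key argument and the exact hash input format are assumptions; only the use of an MD5 hash over inputs that include db and s comes from the description above.

```python
import hashlib

def make_rid(db: str, s: str, source_key: str) -> str:
    """Illustrative only: derive a stable record ID from the database prefix,
    the source code and some per-record key taken from the source (hypothetical)."""
    raw = f"{db}:{s}:{source_key}".encode("utf-8")
    return hashlib.md5(raw).hexdigest()

# make_rid("lob", "fd", "registration-12345") -> a 32-character hex digest
```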
Where records are
The unification of records from different sources within a database culminates in a set of SQL tables. Internally, these are called schemas, of which there is one per database (though the term has a different meaning in SQL databases):
- class pipeline.database.models.lob.LobSchema(*, meta: LobMeta, org: LobOrg, rep: LobRep, fund: list[LobFund], affiliate: list[LobAffiliate], lobbyist: list[LobLobbyist], firm: list[LobFirm], office: list[LobOffice], subject: list[LobSubject], category: list[LobCategory], target: list[LobTarget])
Schema for one record in the lobbying database.
- model_config = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
A schema typically comprises multiple tables, including a Meta table that holds, among other things,
the rid and s of each record:
- class pipeline.database.models.lob.LobMeta(*, rid: str, s: str, added: datetime, cid: str, rnum: str, active: bool, start_date: datetime, end_date: datetime | None)
Metadata about this registration version.
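For a sense of what such a record looks like in code, a Meta row could be constructed roughly as follows. This is a sketch: the import path follows the class shown above, but every field value is invented.

```python
from datetime import datetime

from pipeline.database.models.lob import LobMeta

meta = LobMeta(
    rid="0b0e84b1e2c74a1c9f1d2e3a4b5c6d7e",  # MD5-style hash, see "What makes a record"
    s="fd",                                   # federal source
    added=datetime.utcnow(),
    cid="12345",                              # invented example values
    rnum="987654-1",
    active=True,
    start_date=datetime(2024, 1, 1),
    end_date=None,
)
```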
The Pipeline
The core technologies in our pipeline are Docker, the Python repository, a PostgreSQL database and some AWS services – primarily S3 for hosting raw and intermediate data and ECS for running the different pipeline jobs on schedule.
The Python repository is made up of different Steps, each of which has its own module in src/pipeline/.
Each step has a general way of transforming records for some source, and an implementation specific to each db, s
pair that uses it. These implementations are found in each step's …ers subdirectory – e.g. crawl/crawlers/lob/fd/.
└── src
└── pipeline
├── __init__.py
├── cli.py
├── crawl <-- crawl step module
│ ├── __init__.py <-- shared crawling code
│ ├── core.py <-- /
│ ├── crawlers
│ │ └── lob
│ │ ├── fd
│ │ │ ├── __init__.py <-- crawler for lob-fd
Steps
Each step either takes input data from storage or collects it, transforms it, and pushes it to one of our storage
media. With our current databases, the final step for all data is clean, after which the data is loaded
into our production SQL database. Between steps, intermediate data is stored in S3.
As noted, each step has a core architecture and a set of implementations. All logic that must be source-specific is left to those implementations (e.g. at what URL is a file held?). The core architecture standardizes what is expected to be common to all implementations – especially storage. In general, an implementation needs to know nothing about how its processed data is stored so that the next step can access it; the core handles that.
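To make that split concrete, the relationship between a step's core and one of its implementations might be sketched like this. The class and method names are hypothetical; the repository's actual interfaces may differ.

```python
from abc import ABC, abstractmethod

class BaseCrawler(ABC):
    """Hypothetical core class: owns storage and bookkeeping, shared by all crawlers."""

    db: str  # e.g. "lob"
    s: str   # e.g. "fd"

    @abstractmethod
    async def fetch(self) -> list[bytes]:
        """Source-specific: which URLs to hit and how to walk them."""

    async def run(self) -> None:
        # Core behaviour shared by every implementation: collect the documents,
        # then persist them so the next step can find them, without the
        # implementation knowing anything about storage.
        for doc in await self.fetch():
            self._store(doc)

    def _store(self, doc: bytes) -> None:
        ...  # core storage logic (S3 object + metadata row), elided here

class LobFdCrawler(BaseCrawler):
    """Hypothetical implementation that would live under crawl/crawlers/lob/fd/."""

    db, s = "lob", "fd"

    async def fetch(self) -> list[bytes]:
        ...  # source-specific requests against the federal lobbyist registry
```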
This diagram shows all currently built steps. They are split into two informal paths:
1. The scraping path, which gets document-style data from APIs and web frontends
2. The download path, which gets bulk downloads in one or a few large flat files
Lobbying and charities take (1) and donations takes (2).
Crawl
An asynchronous web crawling engine, and the most complex step in the pipeline. All collected raw data is, at runtime and before storage, associated with a particular date and a unique rid. Metadata about every crawled document is stored in a SQL table and the data itself is stored in S3.
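For orientation, the relationship between a crawled document's metadata row and its raw object might look roughly like this. The column names and the S3 key layout are invented for illustration.

```python
from datetime import date

# One row in the crawl metadata table (hypothetical columns) ...
crawl_meta = {
    "rid": "0b0e84b1e2c74a1c9f1d2e3a4b5c6d7e",
    "db": "lob",
    "s": "fd",
    "crawled": date(2024, 6, 1),
}

# ... and one raw object in S3, addressed so that parse can later find it by rid.
s3_key = f"{crawl_meta['db']}/{crawl_meta['s']}/{crawl_meta['crawled']}/{crawl_meta['rid']}.html"
```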
Not pictured is browse, a drop-in replacement for crawl that uses a Playwright browser to scrape
websites that require client-side JS.
Parse
Takes raw data from crawl or browse and, for each rid, parses out information into a schema-compliant
record for the given database. Recall that one record is typically one row in a Meta table and one or
many rows in other associated tables.
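As a sketch of the idea: the parser function, the selectors and the field values below are all invented; only LobMeta itself comes from the models shown earlier, and the other tables of the record are omitted.

```python
from datetime import datetime

from bs4 import BeautifulSoup

from pipeline.database.models.lob import LobMeta

def parse_document(rid: str, s: str, html: str) -> LobMeta:
    """Hypothetical parser for one raw document: pull fields out of the HTML
    and build the Meta row of a schema-compliant record."""
    soup = BeautifulSoup(html, "html.parser")
    return LobMeta(
        rid=rid,
        s=s,
        added=datetime.utcnow(),
        cid=soup.select_one("#client-id").text.strip(),             # invented selectors
        rnum=soup.select_one("#registration-number").text.strip(),
        active=True,
        start_date=datetime.fromisoformat(soup.select_one("#start-date").text.strip()),
        end_date=None,
    )
```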
Download
Simple concurrent downloader to collect bulk archives, CSVs etc. from the internet. Unlike crawl, rid and date
are not assigned until later. We store the files as we get them, raw.
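A minimal sketch of that idea using only the standard library; the URLs and the local destination are invented, and the real step's concurrency model and storage may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical bulk files for a download-path database.
URLS = [
    "https://example.org/bulk/contributions_2023.csv",
    "https://example.org/bulk/contributions_2024.csv",
]

def download(url: str, dest_dir: Path = Path("raw")) -> Path:
    """Fetch one bulk file and store it exactly as received."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]
    urlretrieve(url, target)
    return target

with ThreadPoolExecutor(max_workers=4) as pool:
    paths = list(pool.map(download, URLS))
```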
Filter
Compares the latest version of each downloaded file against the most recently stored version, returning a filtered version
that contains only the new rows. This step can be skipped, but it keeps data small during the more complicated tidy.
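In pandas terms, the comparison could be sketched like this. The file names and the all-columns comparison are assumptions; the real step likely keys on specific identifier columns.

```python
import pandas as pd

def filter_new_rows(latest: pd.DataFrame, previous: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows of latest that do not appear in previous."""
    merged = latest.merge(previous, how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

latest = pd.read_csv("raw/contributions_2024-06.csv")
previous = pd.read_csv("raw/contributions_2024-05.csv")
new_rows = filter_new_rows(latest, previous)
```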
Tidy
Transforms data from across downloaded/filtered files into a set of schema-compliant records for that database.
Whereas parse typically uses HTML parsing tools, tidy would normally use pandas.
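As a sketch of what a tidy implementation might do: the input columns and the output shape are invented, and a real implementation would build full schema-compliant records rather than plain dicts.

```python
import pandas as pd

def tidy(filtered: pd.DataFrame) -> list[dict]:
    """Hypothetical tidy step: reshape bulk rows into record fragments."""
    renamed = filtered.rename(columns={
        "Contributor Name": "contributor",  # invented source column names
        "Amount ($)": "amount",
        "Date Received": "received",
    })
    renamed["received"] = pd.to_datetime(renamed["received"])
    # Datatype change ("$1,000.00" -> 1000.0) happens here rather than in clean.
    renamed["amount"] = pd.to_numeric(renamed["amount"].str.replace("[$,]", "", regex=True))
    return renamed[["contributor", "amount", "received"]].to_dict(orient="records")
```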
Clean
Runs element-level, superficial cleaning on schema-compliant records to bring the data into its final state before
SQL ingestion. For example, it fixes names stored as Lastname, Firstname. Much “cleaning” work like $1,000.00 -> 1000.00
is already done during parse or tidy, since it entails changing datatypes.
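For instance, the name fix mentioned above could be as simple as this sketch; the real cleaning functions are presumably more defensive.

```python
def fix_name(name: str) -> str:
    """Turn 'Lastname, Firstname' into 'Firstname Lastname'; leave other values alone."""
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        return f"{first} {last}"
    return name

assert fix_name("Doe, Jane") == "Jane Doe"
assert fix_name("Jane Doe") == "Jane Doe"
```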