Overview
========================================

This document describes the IJF's technical aims and its pipeline at a high level. It is relevant to contributing developers and to anyone else working with the IJF.

IJF Databases
###################

This pipeline is the engine of the IJF's public interest databases. These are standardized and highly accessible collections of public data, usually disclosed by the government. The databases are normally amalgamations of multiple data sources describing the same kind of data, each in a different jurisdiction.

For example, our `lobbying database `_ combines data from the lobbyist registries at the federal level and nearly all provinces and territories: even though these are different sources maintained by different governments, they are sufficiently similar that we can create a unifying *record* structure which any of these sources can fulfill.

.. _core-concepts:

Core concepts
--------------------------------

What makes a record
********************************

Three identifiers recur throughout the pipeline:

.. glossary::

   ``db``
      Name of the database. Always a three-letter prefix like ``lob``.

   ``s``
      Name of the source. Typically an `official two-letter code `_ like ``bc``, but also ``fd`` (federal).

   ``rid``
      ID of the record. An MD5 hash whose input includes the ``db, s`` of the record, such that each ``rid`` is globally unique.

Every record that we create is part of some ``db``, comes from a particular ``s``, and holds a unique ``rid``.
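The exact hash inputs are an internal detail, but a minimal sketch of the scheme described above might look like the following. ``make_rid`` and its trailing ``parts`` argument are hypothetical, for illustration only:

.. code-block:: python
   :caption: Hypothetical sketch of ``rid`` derivation

   import hashlib

   def make_rid(db: str, s: str, *parts: str) -> str:
       """Illustrative only: hash the database prefix, the source code
       and any source-specific identifiers into a stable record ID.
       The real pipeline's hash inputs may differ."""
       payload = "|".join((db, s, *parts))
       return hashlib.md5(payload.encode("utf-8")).hexdigest()

   # e.g. a (made-up) federal lobbying registration
   rid = make_rid("lob", "fd", "registration-12345")

Because ``db`` and ``s`` are part of the hash input, two otherwise identical documents from different registries still yield distinct IDs.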
Where records are
********************************

The unification of records from different sources within a database culminates in a set of SQL tables. Internally, these are called *schemas*, of which there is one per database (though note that the term has a different meaning in SQL databases):

.. autoclass:: pipeline.database.models.lob.LobSchema
   :members:

A schema typically comprises multiple tables, including a ``Meta`` table that holds, among other things, the ``rid`` and ``s`` of each record:

.. autoclass:: pipeline.database.models.lob.LobMeta

The Pipeline
###################

The core technologies in our pipeline are Docker, the Python repository, a PostgreSQL database and some AWS services -- primarily S3 for hosting raw and intermediate data and ECS for running different pipeline jobs on schedule.

The Python repository comprises different ``Step`` types, each of which has its own module in ``src/pipeline/``. Each step has a general way of transforming records for some source, and an implementation specific to each ``db, s`` pair that uses it. These implementations are found in each step's ``ers`` subdirectory -- e.g. ``crawl/crawlers/lob/fd/``.

.. code-block:: bash
   :caption: Directory tree

   └── src
       └── pipeline
           ├── __init__.py
           ├── cli.py
           ├── crawl                        <-- crawl step module
           │   ├── __init__.py              <-- shared crawling code
           │   ├── core.py                  <-- shared crawling code
           │   ├── crawlers
           │   │   └── lob
           │   │       ├── fd
           │   │       │   ├── __init__.py  <-- crawler for lob-fd

Steps
--------------------------

Each step either takes input data from storage or collects it, transforms it, and pushes it to one of our storage media. With our current databases, the ultimate step for all data is ``clean``, after which the data is loaded into our production SQL database. Between steps, intermediate data is stored in S3.

As noted above, each step has a core architecture and a set of implementations. All logic that must be source-specific is left to those implementations (e.g. at what URL is a file held?). The core architecture standardizes what we expect to be common to all implementations -- especially storage. In general, an implementation knows nothing about how its processed data is stored for the next step to access; the core architecture handles that.

.. image:: images/steps.svg

This diagram shows all currently built steps. They are split into two informal paths:

1. *Scraping path*, which gets document-style data from APIs and web frontends
2. *Download path*, which gets bulk downloads in one or a few large flat files

Lobbying and charities take (1); donations takes (2).

Crawl
*************************

An asynchronous web crawling engine and the most complex step in the pipeline. All collected raw data is, at runtime and before storage, associated with a particular date and a unique ``rid``. Metadata about every crawled document is stored in a SQL table and the data itself is stored in S3.

Not pictured is ``browse``, a drop-in replacement for ``crawl`` that uses a Playwright browser to scrape websites that require client-side JS.

Parse
*************************

Takes raw data from ``crawl`` or ``browse`` and, for each ``rid``, parses out information into a schema-compliant record for the given database. Recall that one *record* is typically one row in a ``Meta`` table and one or more rows in other associated tables.

Download
*************************

A simple concurrent downloader that collects bulk archives, CSVs, etc. from the internet. Unlike ``crawl``, the ``rid`` and date are not assigned until later; we store the files raw, as we get them.

Filter
*************************

Compares the latest version of each downloaded file against the previous one, returning a filtered version that holds only the new rows. This step can be skipped, but it keeps data small during the more complicated ``tidy``.

Tidy
*************************

Transforms data from across the downloaded/filtered files into a set of schema-compliant records for that database. Whereas ``parse`` typically uses HTML parsing tools, ``tidy`` would normally use ``pandas``.

Clean
*************************

Runs element-level, superficial cleaning on schema-compliant records to bring the data into its final state before SQL ingestion -- for example, reordering ``Lastname, Firstname``. Much "cleaning" work, like ``$1,000.00`` --> ``1000.00``, is already done during ``parse`` or ``tidy``, since it entails changing the datatype.
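To make that division of labour concrete, here is a minimal sketch of the kind of element-level fix ``clean`` performs. The function name and signature are ours, for illustration; the real cleaners live in the pipeline's ``clean`` implementations:

.. code-block:: python
   :caption: Hypothetical sketch of an element-level cleaner

   def clean_name(raw: str) -> str:
       """Illustrative only: reorder 'Lastname, Firstname' into
       'Firstname Lastname', leaving other values untouched."""
       if "," in raw:
           last, _, first = raw.partition(",")
           return f"{first.strip()} {last.strip()}"
       return raw.strip()

   assert clean_name("Doe, Jane") == "Jane Doe"
   assert clean_name("Acme Lobbying Inc.") == "Acme Lobbying Inc."

Note that the fix is superficial and type-preserving, which is exactly what keeps it in ``clean`` rather than in ``parse`` or ``tidy``.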