Implementation

A functional data pipeline that transforms thesis metadata from 101 universities into an enriched, deployable research site.

294K

Records

101

Universities

The Messy Reality

There is no standardized way to get thesis and dissertation data from American universities. Every institution is different. Some run DSpace, others run bepress Digital Commons, others run Hyrax, EPrints, Figshare, or custom platforms. Some expose rich degree metadata; others give you nothing but a title and a name. Some truncate descriptions at 500 characters. Some put the department in the subject field, others in the contributor field, others in an OAI set name, and some don't record it at all.

The federal data is no better. Older USCIS files truncate employer names at roughly 35 characters, so "THE CURATORS OF THE UNIVERSITY OF MISSOURI" becomes "THE CURATORS OF THE UNIV OF MISSOU". The same university might appear under 5 different legal names across H-1B, LCA, OPT, and PERM filings. DOL publishes LCA data as 3.6 GB of Excel files with inconsistent column names across fiscal years. OPT data comes as PDFs. IPEDS uses numeric UNITID keys. NSF uses different institution names than IPEDS.

The result is a system held together by 105 university-specific configurations, 292 employer name mappings, 8 department extraction strategies, an 8-step degree inference chain, and hundreds of ad hoc rules for edge cases that were discovered one university at a time. A bepress repository in Arkansas formats its ETD metadata differently than a DSpace instance in Massachusetts, which formats it differently than a Figshare deployment at Carnegie Mellon.

This is, inherently, a messy problem. The question is how to manage the mess.

Functional Architecture

The answer is a functional pipeline with immutable state. The project started as a tangle of scripts that read and wrote files in unpredictable order, which led to data corruption bugs where one script would silently overwrite another's output. The pipeline instead composes discrete steps that each take state in and produce state out.

The pipeline follows railway-oriented programming: each step receives an immutable PipelineState and returns a new one with outcomes appended. If any step fails, subsequent steps are automatically skipped. No step can silently corrupt another's output because the data flow is explicit and unidirectional.

pipeline = compose(
    step1_harvest,
    step2_process,
    step3_parse_federal,
    step4_enrich,
    step5_generate,
    step6_deploy,
)
state = pipeline(initial_state)  # PipelineState flows through

All state is frozen (@dataclass(frozen=True)). No mutation. Each step produces a StepOutcome with success/failure status, metrics, errors, and warnings. The manifest tracks exactly when each step last ran for each university, what the record counts were, and whether the data quality is acceptable.
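The composition and short-circuit behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the project's runner.py: the field names on StepOutcome and PipelineState, the make_step helper, and the failed property are assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple


@dataclass(frozen=True)
class StepOutcome:
    # Field names are assumed; the text only says an outcome carries
    # status, metrics, errors, and warnings.
    step: str
    success: bool
    metrics: Tuple[Tuple[str, int], ...] = ()
    errors: Tuple[str, ...] = ()
    warnings: Tuple[str, ...] = ()


@dataclass(frozen=True)
class PipelineState:
    outcomes: Tuple[StepOutcome, ...] = ()

    @property
    def failed(self) -> bool:
        return any(not outcome.success for outcome in self.outcomes)


Step = Callable[[PipelineState], PipelineState]


def compose(*steps: Step) -> Step:
    """Chain steps left to right, skipping the rest once one has failed."""
    def run(state: PipelineState) -> PipelineState:
        for step in steps:
            if state.failed:
                break  # railway-style short circuit
            state = step(state)
        return state
    return run


def make_step(name: str, ok: bool = True) -> Step:
    """Hypothetical helper that builds a trivial step for demonstration."""
    def step(state: PipelineState) -> PipelineState:
        errors = () if ok else (f"{name} failed",)
        outcome = StepOutcome(step=name, success=ok, errors=errors)
        # replace() returns a new frozen state; the input is never mutated.
        return replace(state, outcomes=state.outcomes + (outcome,))
    return step


pipeline = compose(
    make_step("harvest"),
    make_step("process", ok=False),
    make_step("enrich"),
)
final = pipeline(PipelineState())
```

Here the "enrich" step never runs, because "process" recorded a failure before it.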

The hundreds of special cases don't go away — they can't, because the underlying data really is that heterogeneous — but they're contained within well-defined boundaries. University-specific configs live in universities.json. Employer name variants live in entity_map.json. Department extraction strategies are dispatched by config key. Each rule is discoverable, testable, and auditable rather than scattered across ad hoc scripts.

Design Principles

Immutable state — frozen dataclasses, no side effects in core logic. You can't accidentally overwrite enriched data by re-running the HTML generator.
Pure core, effectful shell — I/O only at boundaries (HTTP, files, S3). Business logic is testable without network calls.
Single source of truth — universities.json (105 configs covering 101 universities), entity_map.json (292 mappings), manifest.json (tracking). Each kind of data lives in exactly one file, so no two config files can disagree.
Idempotent steps — re-running any step produces the same output. Safe to retry after failures.
Explicit data ownership — only Step 4 (Enrich) writes to site/data/. Step 5 (Generate) reads it but never writes to it. This single rule eliminated an entire class of data corruption bugs.

1 Harvest

Fetch thesis and dissertation metadata from university OAI-PMH repositories. Each university exposes structured metadata via the Open Archives Initiative Protocol, but every repository is different.

Build Harvest Plan

A pure function examines the manifest to determine which universities need re-harvesting. If a CSV already exists and the checksum matches, the university is skipped. The plan considers last-run timestamps and record counts to decide what's stale.
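A rough sketch of the checksum rule, assuming a manifest that stores one SHA-256 digest per university. The names ManifestEntry, csv_sha256, and build_harvest_plan are hypothetical, and the real plan also weighs last-run timestamps and record counts, which this sketch leaves out.

```python
import hashlib
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class ManifestEntry:
    # Digest recorded after the last successful harvest (assumed field).
    csv_sha256: Optional[str]


def sha256_bytes(data: bytes) -> str:
    """Hex digest of a CSV's bytes, computed by the caller at the I/O boundary."""
    return hashlib.sha256(data).hexdigest()


def build_harvest_plan(
    manifest: Mapping[str, ManifestEntry],
    csv_checksums: Mapping[str, Optional[str]],
) -> Tuple[str, ...]:
    """Return the university keys that need re-harvesting.

    Pure function: it only compares values passed in. A university is
    skipped when its CSV exists on disk and its digest matches the manifest.
    """
    plan = []
    for key in sorted(manifest):
        on_disk = csv_checksums.get(key)  # None means no CSV on disk
        if on_disk is not None and on_disk == manifest[key].csv_sha256:
            continue
        plan.append(key)
    return tuple(plan)
```

Keeping file reads outside the function means the plan can be tested with plain dictionaries.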

Repository Platforms

The scraper handles 10 different repository platforms, each with its own quirks:

bepress (53 repos) — Digital Commons. Uses publication: set specs. Most support oai_etdms for degree metadata.
DSpace (41 repos) — Standard DSpace 5/6/7. Uses com_ or col_ set handles. Some have REST APIs for richer metadata.
Hyrax (2 repos) — Oregon State, CU Boulder. Samvera/Hyrax framework with custom schemas.
Figshare (1 repo) — CMU. REST API enrichment for thesis records stored in Figshare.
EPrints, Invenio, Blacklight — Rare platforms with institution-specific parsing.

Metadata Prefixes

The same university can expose different levels of detail depending on the prefix requested:

oai_dc — Dublin Core (standard). Title, creator, date, subject, description, type. Used by most repositories.
oai_etdms — ETD Metadata Standard. Adds degree.name (e.g., "Doctor of Philosophy"), degree.level ("Dissertation" vs "Thesis"), and degree.discipline (department). Switching 5 bepress repos from oai_dc to oai_etdms improved their degree labeling from <20% to 85–100%.

REST API Enrichment

10 universities get supplementary metadata via DSpace REST API calls using concurrent fetching (15 threads). This extracts fields not available through OAI-PMH:

NC State, Rice, Cornell, Harvard — thesis.degree.discipline, thesis.degree.level
U Delaware, RPI, Temple — dc.description.degree, dc.description.department
U Penn — Custom upenn.graduate.group field for department
UW Seattle — HTML citation metadata scrape (not REST)

Enrichment checkpoints every 500 records and resolves OAI handle URIs to DSpace UUIDs via /pid/find.
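The concurrent fetch with periodic checkpoints might look something like this. The fetch and on_checkpoint callables are injected, so the sketch makes no network calls itself. The function name and signature are assumptions, not the contents of scripts/enrich_concurrent.py.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def enrich_concurrently(
    handles: Iterable[str],
    fetch: Callable[[str], dict],
    on_checkpoint: Callable[[Dict[str, dict]], None],
    max_workers: int = 15,
    checkpoint_every: int = 500,
) -> Dict[str, dict]:
    """Fetch metadata for each handle on a thread pool.

    on_checkpoint receives a snapshot every `checkpoint_every` results,
    plus one final snapshot if the total is not an exact multiple.
    """
    handle_list = list(handles)
    results: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order while fetching concurrently.
        for handle, metadata in zip(handle_list, pool.map(fetch, handle_list)):
            results[handle] = metadata
            if len(results) % checkpoint_every == 0:
                on_checkpoint(dict(results))
    if len(results) % checkpoint_every != 0:
        on_checkpoint(dict(results))  # flush the final partial batch
    return results
```

In practice fetch would wrap the REST call and on_checkpoint would write the snapshot to disk. Both are left to the caller here.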

Write CSV + Checkpoint

Each university produces data/western_us_dissertations/{key}_etds.csv with 19 columns. SHA-256 checksums and resumption tokens are checkpointed for crash recovery.

101

Repositories

294K

Records

Source: pipeline/step1_harvest.py → scripts/scrape_oai_etds.py, scripts/enrich_concurrent.py

2 Process

Transform raw CSVs into structured, labeled records. This is the most logic-dense step — 105 university configurations, 16 different department extraction strategies, and an 8-step degree inference chain determine PhD vs Master's and normalize department names for every record.

Per-University Configuration

105 config entries handle the diversity across institutions. Some universities need multiple configs (PhD and Master's from separate collections, merged at output):

Caltech — PhD-only collection scraped from department website. Uses degree_filter: 'phd', custom field mappings (creator_field: 'name', url_field: 'thesis_url').
Missouri S&T — Two separate OAI sets (doctoral dissertations + master's theses), each with degree_label override. Merged via _merge: True.
UC System — Single CSV, 9 configs (one per campus) using campus_filter. All degree_label: 'phd' since the system archives dissertations only.
UT Dallas — Two data sources (OAI + REST API grad school), with exclude_type: 'article|image|photograph' to filter noise from mixed collections.
Iowa State — exclude_type: 'article' to remove published articles mixed into the ETD collection.

Infer Degree Type

An 8-step inference chain determines PhD vs Master's. Each step is tried in priority order; the first match wins:

1. Config override — degree_label: 'phd' or 'ms'. Used for 20 configs where the collection is known (Caltech, UC System, Missouri S&T splits).
2. Explicit degree fields — degree_type, degree_name, degree columns from REST API enrichment or OAI metadata.
3. ETD degree_level — From oai_etdms: "Dissertation" → PhD, "Thesis" → MS. Handles variants like Mississippi State's "Graduate Thesis - Open Access".
4. OAI set membership — publication:dissertations → PhD. Columbia uses degree_level:doctoral and degree_level:master sets.
5. Type field — Dublin Core type: "doctoral" → PhD, "masters" → MS. Generic "Thesis" alone is NOT mapped (ambiguous).
6. Relation field — Penn State stores PHD/MASTERS in the relation field.
7. Description regex — MIT's primary labeling source: pattern matching on Ph.D., S.M., M.S., M.Eng, Sc.D., etc. Catches ~51% of MIT records; the other 49% have descriptions too short to classify.
8. Late OAI fallback — publication:theses → MS, but only if no publication:dissertations set exists (disambiguating repos that split vs repos that lump everything).

Current labeling rate: ~82% across all universities. The oai_etdms switch brought BYU from 6% to 97%, Mississippi State from 7% to 100%, and U Arkansas from 9% to 100%.
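The first-match-wins chain can be pictured as an ordered tuple of small rule functions. The sketch below covers only four of the eight steps. The record keys, rule names, and regular expressions are illustrative assumptions, not the patterns used in the real pipeline.

```python
import re
from typing import Callable, Mapping, Optional

Record = Mapping[str, str]
Config = Mapping[str, str]
Rule = Callable[[Record, Config], Optional[str]]


def config_override(record: Record, config: Config) -> Optional[str]:
    # Step 1: an explicit degree_label in the university config wins outright.
    return config.get("degree_label")


def etd_degree_level(record: Record, config: Config) -> Optional[str]:
    # Step 3: oai_etdms degree_level, tolerant of variants such as
    # "Graduate Thesis - Open Access".
    level = record.get("degree_level", "").lower()
    if "dissertation" in level:
        return "phd"
    if "thesis" in level:
        return "ms"
    return None


def dc_type(record: Record, config: Config) -> Optional[str]:
    # Step 5: Dublin Core type. A bare "Thesis" is ambiguous and stays unmapped.
    value = record.get("type", "").lower()
    if "doctoral" in value:
        return "phd"
    if "masters" in value:
        return "ms"
    return None


# Assumed patterns for step 7; the real chain has many more.
_PHD = re.compile(r"\b(?:Ph\.\s?D|PhD|Sc\.\s?D|ScD)\b")
_MS = re.compile(r"\b(?:S\.M|M\.S|M\.Eng|MEng)\b")


def description_regex(record: Record, config: Config) -> Optional[str]:
    text = record.get("description", "")
    if _PHD.search(text):
        return "phd"
    if _MS.search(text):
        return "ms"
    return None


RULES: tuple = (config_override, etd_degree_level, dc_type, description_regex)


def infer_degree(record: Record, config: Config) -> Optional[str]:
    """Try each rule in priority order and return the first answer."""
    for rule in RULES:
        label = rule(record, config)
        if label is not None:
            return label
    return None  # left unlabeled
```

Because the rules live in an ordered tuple, changing priority or adding a university-specific rule is a one-line edit.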

Extract Department

8 university-specific extractors plus a generic fallback handle the wide variety of ways departments appear in metadata:

OAI set mapping (8 configs) — Missouri S&T, MTU, SDSU map OAI set specs to department names via lookup tables (e.g., mec_aereng → "Mechanical & Aerospace Eng").
Subject field (86 configs) — Generic regex extraction on the DC subject field. Handles hierarchies like "Engineering -- Software".
Contributor field — Georgia Tech, Virginia Tech, U Buffalo extract departments from the DC contributor field (advisor department affiliations).
Column-based (4 configs) — UC system, UW Seattle, UVA read from a dedicated department column. Handles Python list strings like "['Dental Sciences', 'Dentistry']".
Custom per-university — RPI (subject format), UT Dallas (subject format), U Missouri (hierarchical), Stony Brook (REST enrichment), Stanford (department normalization).

Fallback chain: enriched fields → primary dept_source → title → OAI set names → description → "General"
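Dispatching extractors by config key, with "General" as the last resort, could be sketched as below. Only two strategies are shown. The record keys (department, set_spec), the set_lookup config entry, and the strategy names are assumptions, and the intermediate fallbacks (title, description) are omitted.

```python
import ast
from typing import Callable, Dict, Mapping, Optional

Record = Mapping[str, str]
Config = Mapping[str, object]


def from_column(record: Record, config: Config) -> Optional[str]:
    """Read a dedicated department column, unwrapping Python list strings."""
    raw = record.get("department", "").strip()
    if raw.startswith("["):
        try:
            values = ast.literal_eval(raw)  # safe parsing of literals only
        except (ValueError, SyntaxError):
            return None
        if isinstance(values, list) and values:
            return str(values[0])
        return None
    return raw or None


def from_oai_set(record: Record, config: Config) -> Optional[str]:
    """Map an OAI set spec to a department name via a per-university table."""
    lookup = config.get("set_lookup", {})
    return lookup.get(record.get("set_spec", ""))


# Strategy names are hypothetical; the source dispatches on a config key.
EXTRACTORS: Dict[str, Callable[[Record, Config], Optional[str]]] = {
    "column": from_column,
    "oai_set": from_oai_set,
}


def extract_department(record: Record, config: Config) -> str:
    extractor = EXTRACTORS.get(str(config.get("dept_source", "")))
    department = extractor(record, config) if extractor else None
    return department or "General"
```

For a list string such as "['Dental Sciences', 'Dentistry']" this sketch keeps the first entry. Whether the real extractor does the same is not stated in the text.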

STEM / Non-STEM Split

Once a department name is extracted, it's classified as STEM or non-STEM against a taxonomy of 65 non-STEM departments (Education, English, History, Political Science, Business, etc.). The entire pipeline then runs twice — once for each track:

Parallel Processing Tracks

Every step from here forward produces two independent outputs. A record classified into "Computer Science" flows through the STEM track; a record in "Education" flows through the non-STEM track.

STEM → data/processed/stem/{key}.json → site/data/stem/{key}.json → site/stem.html
Non-STEM → data/processed/nonstem/{key}.json → site/data/nonstem/{key}.json → site/non_stem.html

Deduplicate

Records are deduplicated by (name, year, dept, title[:50]). Typically removes 1–2% where universities have overlapping OAI sets or dual-indexed collections.
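A minimal sketch of that deduplication key, assuming each record is a dict with name, year, dept, and title keys (the real records may be shaped differently at this stage):

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, object]


def dedup_key(record: Record) -> Tuple[object, object, object, str]:
    # Only the first 50 characters of the title take part in the key, so
    # trailing differences in long titles do not split duplicates.
    return (record["name"], record["year"], record["dept"], str(record["title"])[:50])


def deduplicate(records: Iterable[Record]) -> List[Record]:
    """Keep the first occurrence of each key, preserving input order."""
    seen = set()
    unique: List[Record] = []
    for record in records:
        key = dedup_key(record)
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique
```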

AI-Assisted Classification (Optional)

For the ~56K records that remain unlabeled after the 8-step inference chain, two supplementary classifiers are available:

sklearn classifier — TF-IDF on 228K labeled thesis titles + LogisticRegression. Free, fast, ~72% cross-validation accuracy. Applied 13K high-confidence predictions, bringing overall labeling from 74.5% to 82.9%.
Claude Haiku API — Batch-classifies titles in groups of 50 with department context. ~85% accuracy, ~$4 for all 56K records. Cached in pipeline/ai_classifications.json to avoid re-classifying.

Output Formats

Step 2 produces three output formats for different consumers:

Intermediate: Per-University JSON

data/processed/{stem,nonstem}/{key}.json: compact arrays consumed by Step 4 (Enrich) and Step 5 (Generate). Each record is an 8-element array: [dept, year, name, classification, confidence, title, url, degree]

Archival: Parquet (ZSTD)

data/processed/stem_records.parquet (12 MB) and nonstem_records.parquet (5 MB). Columnar format with named fields: university, department, year, name, title, url, degree. ZSTD-compressed via DuckDB. Designed for reproducibility — anyone can load these in DuckDB, pandas, or R to validate the methodology.

-- Validate degree labeling in DuckDB:
SELECT university,
       COUNT(*) as total,
       COUNT(CASE WHEN degree='phd' THEN 1 END) as phd,
       COUNT(CASE WHEN degree='ms' THEN 1 END) as ms,
       ROUND(100.0 * COUNT(NULLIF(degree,'')) / COUNT(*), 1) as labeled_pct
FROM 'data/processed/stem_records.parquet'
GROUP BY university ORDER BY total DESC;

82%

Labeled

225K

PhD

98K

Master's

84K

Unlabeled

Source: pipeline/step2_process.py → scripts/generate_name_lists.py

3 Parse Federal

Parse official US government data sources into per-university aggregates. These datasets track the immigration-via-education pipeline from graduation through work authorization to permanent residency.

H-1B Employer Data Hub

USCIS publishes approval/denial counts by employer and fiscal year. We parse FY2010–2026 CSVs (both UTF-16 crosstab and standard formats) and match employer names to university keys.

Entity Name Resolution

292 employer name mappings in entity_map.json handle the many ways universities appear in federal filings:

USCIS truncation — Pre-2020 files truncate at ~35 characters: "THE CURATORS OF THE UNIV OF MISSOU", "UNIVERSITY OF ILLINOIS AT URBANA-"
Abbreviations — "UNIVERSITY" → "UNIV", "INSTITUTE" → "INST": "VIRGINIA POLYTECHNIC INST & STATE UNIV"
Legal names — "THE LELAND STANFORD JR UNIVERSITY" (4 variants), "BOARD OF TRUSTEES OF THE UNIVERSITY OF ILLINOIS"
Multi-campus systems — "UNIVERSITY OF CALIFORNIA, BERKELEY" vs "UNIVERSITY OF CALIFORNIA BERKELEY" vs "UNIVERSITY OF CALIFORNIA AT BERKELEY" (8 UC campuses, 3+ variants each)
Punctuation — "COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK" (5 variants with different comma/article patterns)
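One way to picture the lookup is a normalization pass followed by an exact match against the mapping table. The normalization rules and the university keys below are assumptions for illustration. The real entity_map.json holds 292 hand-curated entries and may be keyed differently.

```python
import re
from typing import Mapping, Optional


def normalize_employer(name: str) -> str:
    """Uppercase, replace punctuation with spaces, and collapse whitespace."""
    upper = name.upper()
    no_punct = re.sub(r"[^A-Z0-9& ]+", " ", upper)
    return re.sub(r"\s+", " ", no_punct).strip()


# Hypothetical slice of an entity map; the keys on the right are made up.
ENTITY_MAP: Mapping[str, str] = {
    "UNIVERSITY OF CALIFORNIA BERKELEY": "ucb",
    "UNIVERSITY OF CALIFORNIA AT BERKELEY": "ucb",
    "THE CURATORS OF THE UNIV OF MISSOU": "umsystem",  # truncated variant
    "UNIVERSITY OF ILLINOIS AT URBANA": "uiuc",         # truncated variant
}


def resolve_employer(name: str, entity_map: Mapping[str, str] = ENTITY_MAP) -> Optional[str]:
    """Return the university key for a federal employer name, if mapped."""
    return entity_map.get(normalize_employer(name))
```

Normalizing first folds the comma and hyphen variants into one spelling, so the table only has to list variants that differ in wording, such as "AT BERKELEY" or a truncated tail.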

LCA Disclosure Data

DOL Labor Condition Applications (FY2015–2025): 32 Excel files totaling 3.6 GB. Extracts job titles (normalized to remove Excel artifacts like ="Assistant Professor"), SOC occupation codes, wage statistics with percentile breakdowns, and yearly certification counts.
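Stripping the Excel formula wrapper from job titles might be done along these lines. The function name is hypothetical, and the real normalization may handle more artifacts than this one.

```python
import re

# Matches values exported as ="...", e.g. ="Assistant Professor".
_EXCEL_WRAPPER = re.compile(r'^="(.*)"$')


def clean_job_title(raw: str) -> str:
    """Remove the ="..." wrapper and collapse runs of whitespace."""
    text = raw.strip()
    match = _EXCEL_WRAPPER.match(text)
    if match:
        text = match.group(1)
    return " ".join(text.split())
```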

DuckDB acceleration: XLSX files are converted to CSV once (cached in lca/csv_cache/), then queried via DuckDB SQL for fast employer-name joins. Falls back to openpyxl row-by-row parsing (~25 min) if DuckDB is unavailable.

OPT / STEM OPT

ICE/SEVP SEVIS data (CY 2017–2024): OPT, STEM OPT 24-month extension, and CPT work authorization counts by university. Pre-parsed from PDF tables published by DHS. National trends show 11x OPT growth from 2007 to 2024.

PERM (Green Card)

DOL PERM permanent labor certification data: employer-sponsored green card applications and approval rates. 72 universities with data.

Source: pipeline/step3_parse_federal.py

4 Enrich

Merge all data sources into final per-university JSON files. This is the only step that writes to site/data/ — a critical design rule that prevents the overwrite bugs that plagued earlier versions.

Merge Federal Data

Each university's processed records are combined with the H-1B, OPT, LCA, and PERM data from Step 3.

System H-1B Mapping

Multi-campus universities often file H-1B petitions under the parent system's legal name. system_h1b_map.json maps campuses to their parent system (e.g., Missouri S&T → UM System), so campus-level views show the system's aggregate data with a disclaimer explaining the attribution.
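The campus-to-system fallback could be sketched as follows. This assumes a campus's own filings are preferred when they exist, which the text does not confirm. The function name, the key names, and the disclaimer wording are all hypothetical.

```python
from typing import Mapping, Optional, Tuple


def resolve_h1b(
    key: str,
    h1b_by_key: Mapping[str, dict],
    system_map: Mapping[str, str],
) -> Tuple[dict, Optional[str]]:
    """Return (h1b_data, disclaimer) for a university key.

    Falls back to the parent system's aggregate when the campus has no
    filings of its own and attaches a note explaining the attribution.
    """
    own = h1b_by_key.get(key)
    if own:
        return own, None
    parent = system_map.get(key)
    if parent and parent in h1b_by_key:
        note = f"H-1B figures shown are filed under the parent system ({parent})."
        return h1b_by_key[parent], note
    return {}, None
```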

Add IPEDS / HERD / GSS

Three additional federal data sources are merged per university, matched by IPEDS UNITID:

IPEDS (NCES) — Enrollment demographics, completions by citizenship/race/gender, tuition, financial aid. Key metric: NRA share of STEM doctoral completions.
NSF HERD — R&D expenditure time series (2010–2024) by university. Shows funding growth trajectory.
NSF GSS — Graduate enrollment by field and citizenship status. Shows NRA share per STEM discipline.

Attach Department Links

Department-level links to faculty directories, PhD student listings, and repository search pages from data/dept_links.json. These let users jump from a thesis record directly to the department that produced it.

Write Site JSONs

Final enriched JSONs are written to site/data/stem/{key}.json and site/data/nonstem/{key}.json — 202 files total. These are both the pipeline's definitive output and the files served directly to the browser.

Final JSON Schema (per university)

{
  "key": "mit",
  "name": "MIT",
  "nra": 43.8,
  "records": [[dept, year, name, class, conf, title, url, degree], ...],
  "h1b": {"FY2020": {"initial_approvals": 123, ...}, ...},
  "opt": {"CY2020": {"opt": 456, "stem_opt": 234}, ...},
  "lca": {"total_lcas": 890, "top_titles": [...], "wage_median": 85000},
  "perm": {"total": 45, "certified": 42},
  "enrollment": {"total": 7201, "foreign": 2983, "foreign_pct": 41.4},
  "funding": {"yearly_rd_thousands": {"2024": 1945000}},
  "dept_links": {"Computer Science": {"home": "...", "faculty": "..."}}
}

202

JSON Files

50MB

Total Size

Source: pipeline/step4_enrich.py

5 Generate

Build the HTML pages from enriched JSON files. This step has read-only access to site/data/ — it never modifies the per-university JSONs.

Nationwide Statistics

The landing page displays 4 interactive Chart.js visualizations: STEM PhD completions by citizenship (IPEDS), OPT growth pipeline (SEVIS), R&D funding vs American PhDs (HERD + IPEDS), and NRA enrollment by STEM field (GSS).

University Index

A lightweight index (key, name, NRA%, record count) is embedded in the HTML. Individual university data is lazy-loaded on demand via fetch to minimize initial page weight.

Client-Side Rendering

Each university's records are rendered client-side: departments, year blocks, name rows with direct thesis links, degree badges, origin classification flags, and expandable H-1B/OPT/LCA/PERM panels with Chart.js visualizations.

Deep Linking

Every university, department, year, and data panel has a shareable URL fragment. Coach marks teach first-time visitors how to copy deep links with a single click.

Source: pipeline/step5_generate.py → scripts/generate_name_lists_html.py

6 Deploy

Upload the site to AWS with optimized caching and compression for fast global delivery.

Pre-Compress JSON

All JSON files are gzipped at compression level 9 before upload. They are uploaded with a Content-Encoding: gzip header so that browsers decompress them transparently.
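The pre-compression itself needs only the standard library. This is a minimal sketch. The function name and the compact separators are assumptions, and the upload is left out.

```python
import gzip
import json


def precompress_json(payload: object) -> bytes:
    """Serialize to compact JSON and gzip at the maximum compression level."""
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw, compresslevel=9)
```

The resulting bytes are what gets uploaded. The object has to carry Content-Encoding: gzip, otherwise browsers would receive the compressed bytes as if they were JSON text.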

Upload to S3

Files are uploaded with tiered cache headers:

HTML — max-age=300 (5 minutes) — allows quick updates
JSON data — max-age=3600 (1 hour) — pre-gzipped
Static assets — max-age=604800 (1 week) — JS, images, favicon
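The tiers above reduce to a small pure function from object key to upload headers. The dictionary keys below follow the argument names boto3 accepts in ExtraArgs (CacheControl, ContentEncoding, ContentType), but the sketch does not call boto3. The function name and the content types are assumptions.

```python
from pathlib import PurePosixPath
from typing import Dict


def upload_headers(key: str) -> Dict[str, str]:
    """Pick cache and encoding headers from the file extension."""
    suffix = PurePosixPath(key).suffix.lower()
    if suffix == ".html":
        # Short TTL so page updates show up quickly.
        return {"ContentType": "text/html", "CacheControl": "max-age=300"}
    if suffix == ".json":
        # Data files are pre-gzipped before upload.
        return {
            "ContentType": "application/json",
            "CacheControl": "max-age=3600",
            "ContentEncoding": "gzip",
        }
    # Everything else (JS, images, favicon) is treated as a static asset.
    return {"CacheControl": "max-age=604800"}
```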

Invalidate CloudFront

A full invalidation (/*) is issued after upload to ensure edge caches serve fresh content immediately. CloudFront provides HTTP/2, TLS termination, and sub-100ms latency worldwide.

227

Files

~3MB

Gzipped

<100ms

Edge Latency

Source: pipeline/step6_deploy.py

Living with Complexity

The pipeline currently has 105 university configs, 292 entity name mappings, 16 department extraction strategies, and an 8-step degree inference chain with dozens of regex patterns. This is not elegant, and it's not meant to be. It's a direct reflection of the problem: 101 universities, each running different software, following different metadata standards, with different ideas about what fields to populate and how to format them.

Every one of those ad hoc rules exists because someone discovered a specific university doing something unexpected. MIT puts degree types in the description field. Penn State puts them in the relation field. Columbia uses OAI set names with degree_level:doctoral. BYU returns "MS" as a degree name, while Binghamton returns "Master of Arts (MA)". USCIS truncates Stanford's legal name to "THE LELAND STANFORD JR UNIVERSITY" in some years but not others.

The functional pipeline doesn't eliminate this complexity — nothing can. What it does is make the complexity auditable. Every special case is a config entry, a mapping, or a clearly-scoped extraction function. When a new university is added or an existing one changes its repository software, the fix is a config change and a re-run, not a debugging expedition across 30 interleaved scripts. The manifest tells you exactly what state each university is in. The immutable state guarantees that fixing one university can't break another.

The system is, by nature, an accumulation of discovered edge cases. The pipeline is the structure that keeps that accumulation from becoming unmaintainable.

CLI Usage

# Run full pipeline
python -m pipeline run

# Run specific step(s)
python -m pipeline run --step enrich,generate,deploy

# Single university
python -m pipeline run --university mit

# Dry run
python -m pipeline run --dry-run

# Status check
python -m pipeline status

Project Structure

pipeline/
  __init__.py              # Step registry
  __main__.py              # CLI entry point
  runner.py                # compose_pipeline()
  state.py                 # Frozen dataclasses
  universities.json        # 105 configs covering 101 universities
  entity_map.json          # 292 employer name mappings
  system_h1b_map.json      # Multi-campus H-1B fallback
  manifest.json            # Per-university tracking
  degree_classifier.pkl    # sklearn model
  ai_classifications.json  # Claude Haiku cache
  step1_harvest.py         # OAI-PMH scraping
  step2_process.py         # Record classification
  step2b_ai_classify.py    # AI degree classification
  step3_parse_federal.py   # H-1B, LCA, OPT, PERM
  step4_enrich.py          # Merge all sources
  step5_generate.py        # HTML generation
  step6_deploy.py          # S3 + CloudFront