A functional data pipeline that transforms thesis metadata from 101 universities into an enriched, deployable research site.
There is no standardized way to get thesis and dissertation data from American universities. Every institution is different. Some run DSpace, others run bepress Digital Commons, others run Hyrax, EPrints, Figshare, or custom platforms. Some expose rich degree metadata; others give you nothing but a title and a name. Some truncate descriptions at 500 characters. Some put the department in the subject field, others in the contributor field, others in an OAI set name, and some don't record it at all.
The federal data is no better. USCIS truncates employer names at 35 characters, so "THE CURATORS OF THE UNIVERSITY OF MISSOURI" becomes "THE CURATORS OF THE UNIV OF MISSOU". The same university might appear under 5 different legal names across H-1B, LCA, OPT, and PERM filings. DOL publishes LCA data as 3.6 GB of Excel files with inconsistent column names across fiscal years. OPT data comes as PDFs. IPEDS uses numeric UNITID keys. NSF uses different institution names than IPEDS.
The result is a system held together by 105 university-specific configurations, 292 employer name mappings, 8 department extraction strategies, an 8-step degree inference chain, and hundreds of ad hoc rules for edge cases that were discovered one university at a time. A bepress repository in Arkansas formats its ETD metadata differently than a DSpace instance in Massachusetts, which formats it differently than a Figshare deployment at Carnegie Mellon.
This is, inherently, a messy problem. The question is how to manage the mess.
The answer is a functional pipeline with immutable state. Rather than a tangle of scripts that read and write files in unpredictable order — which is how this project started, and which led to data corruption bugs where one script would silently overwrite another's output — the pipeline composes discrete steps that each take state in and produce state out.
The pipeline follows railway-oriented programming: each step receives an immutable PipelineState and returns a new one with outcomes appended. If any step fails, subsequent steps are automatically skipped. No step can silently corrupt another's output because the data flow is explicit and unidirectional.
    pipeline = compose(
        step1_harvest,
        step2_process,
        step3_parse_federal,
        step4_enrich,
        step5_generate,
        step6_deploy,
    )
    state = pipeline(initial_state)  # PipelineState flows through
All state is frozen (@dataclass(frozen=True)). No mutation. Each step produces a StepOutcome with success/failure status, metrics, errors, and warnings. The manifest tracks exactly when each step last ran for each university, what the record counts were, and whether the data quality is acceptable.
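The composition and skip-on-failure behavior described above can be sketched in a few lines. This is a minimal illustration rather than the project's actual code: the field names on `StepOutcome` and `PipelineState`, the `record` helper, and the three toy steps are assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple


@dataclass(frozen=True)
class StepOutcome:
    # Field names are illustrative; the real StepOutcome may differ.
    step: str
    success: bool
    metrics: Tuple[Tuple[str, int], ...] = ()
    errors: Tuple[str, ...] = ()
    warnings: Tuple[str, ...] = ()


@dataclass(frozen=True)
class PipelineState:
    outcomes: Tuple[StepOutcome, ...] = ()

    @property
    def failed(self) -> bool:
        return any(not outcome.success for outcome in self.outcomes)


Step = Callable[[PipelineState], PipelineState]


def record(state: PipelineState, outcome: StepOutcome) -> PipelineState:
    """Return a new state with one more outcome; the input is left untouched."""
    return replace(state, outcomes=state.outcomes + (outcome,))


def compose(*steps: Step) -> Step:
    """Chain steps left to right, skipping everything after the first failure."""
    def run(state: PipelineState) -> PipelineState:
        for step in steps:
            if state.failed:
                break
            state = step(state)
        return state
    return run


# Three toy steps (hypothetical) that show the railway behavior.
def harvest(state: PipelineState) -> PipelineState:
    return record(state, StepOutcome("harvest", True, metrics=(("records", 3),)))


def process(state: PipelineState) -> PipelineState:
    return record(state, StepOutcome("process", False, errors=("bad csv",)))


def enrich(state: PipelineState) -> PipelineState:
    return record(state, StepOutcome("enrich", True))


initial = PipelineState()
final = compose(harvest, process, enrich)(initial)
print([outcome.step for outcome in final.outcomes])
```

Because `process` reports a failure, `enrich` never runs, and `initial` is still the empty state it started as. Tuples are used instead of lists so that the frozen dataclasses are immutable all the way down.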
The hundreds of special cases don't go away — they can't, because the underlying data really is that heterogeneous — but they're contained within well-defined boundaries. University-specific configs live in universities.json. Employer name variants live in entity_map.json. Department extraction strategies are dispatched by config key. Each rule is discoverable, testable, and auditable rather than scattered across ad hoc scripts.
- universities.json (101 configs), entity_map.json (292 mappings), manifest.json (tracking). Not three config files that might disagree.
- Only Step 4 (Enrich) writes to site/data/. Step 5 (Generate) reads it but never writes to it. This single rule eliminated an entire class of data corruption bugs.
Fetch thesis and dissertation metadata from university OAI-PMH repositories. Each university exposes structured metadata via the Open Archives Initiative Protocol, but every repository is different.
A pure function examines the manifest to determine which universities need re-harvesting. If a CSV already exists and the checksum matches, the university is skipped. The plan considers last-run timestamps and record counts to decide what's stale.
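A planning function of this kind might look like the sketch below. It is only an illustration under stated assumptions: the `ManifestEntry` fields, the 30-day `max_age` threshold, and the rule that a zero record count counts as stale are all hypothetical, since the text does not say how timestamps and counts are weighed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class ManifestEntry:
    # Hypothetical shape of one university's manifest record.
    last_run: Optional[datetime]
    record_count: int
    checksum: Optional[str]


def plan_harvest(
    manifest: Mapping[str, ManifestEntry],
    csv_checksums: Mapping[str, str],
    now: datetime,
    max_age: timedelta = timedelta(days=30),  # assumed staleness threshold
) -> List[str]:
    """Return the university keys that need re-harvesting.

    Pure: reads its arguments and returns a list, with no file or network I/O.
    """
    stale = []
    for key, entry in sorted(manifest.items()):
        on_disk = csv_checksums.get(key)
        if on_disk is None or on_disk != entry.checksum:
            stale.append(key)  # CSV missing or changed since the last run
        elif entry.last_run is None or now - entry.last_run > max_age:
            stale.append(key)  # harvested too long ago (assumed rule)
        elif entry.record_count == 0:
            stale.append(key)  # an empty harvest is suspicious (assumed rule)
    return stale


now = datetime(2025, 1, 31)
manifest = {
    "mit": ManifestEntry(datetime(2025, 1, 20), 100, "aaa"),
    "byu": ManifestEntry(datetime(2025, 1, 20), 50, "bbb"),
    "cmu": ManifestEntry(datetime(2024, 6, 1), 70, "ccc"),
    "new": ManifestEntry(None, 0, None),
}
checksums = {"mit": "aaa", "byu": "changed", "cmu": "ccc"}
plan = plan_harvest(manifest, checksums, now)
print(plan)
```

Passing `now` and the on-disk checksums in as arguments, rather than reading the clock or the filesystem inside the function, is what keeps the plan pure and easy to test.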
The scraper handles 10 different repository platforms, each with its own quirks:
- bepress Digital Commons: publication: set specs. Most support oai_etdms for degree metadata.
- DSpace: com_ or col_ set handles. Some have REST APIs for richer metadata.
The same university can expose different levels of detail depending on the prefix requested:
- oai_etdms: degree.name (e.g., "Doctor of Philosophy"), degree.level ("Dissertation" vs "Thesis"), and degree.discipline (department). Switching 5 bepress repos from oai_dc to oai_etdms improved their degree labeling from <20% to 85–100%.
Ten universities get supplementary metadata via DSpace REST API calls using concurrent fetching (15 threads). This extracts fields not available through OAI-PMH:
- thesis.degree.discipline, thesis.degree.level
- dc.description.degree, dc.description.department
- upenn.graduate.group field for department
The enrichment checkpoints every 500 records and resolves OAI handle URIs to DSpace UUIDs via /pid/find.
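The concurrent enrichment with periodic checkpoints could be structured roughly as follows. This sketch uses the standard library's `ThreadPoolExecutor`; the function name, the `fetch` and `save_checkpoint` callables, and the fake handles are assumptions for illustration, and no real DSpace endpoint is called.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence

Fields = Dict[str, str]


def enrich_concurrently(
    handles: Sequence[str],
    fetch: Callable[[str], Fields],
    save_checkpoint: Callable[[Dict[str, Fields]], None],
    workers: int = 15,
    checkpoint_every: int = 500,
) -> Dict[str, Fields]:
    """Fetch extra metadata for each handle, checkpointing as results arrive.

    Assumes handles are unique. pool.map yields results in input order and
    re-raises any exception thrown by fetch.
    """
    results: Dict[str, Fields] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for handle, fields in zip(handles, pool.map(fetch, handles)):
            results[handle] = fields
            if len(results) % checkpoint_every == 0:
                save_checkpoint(dict(results))
    if len(results) % checkpoint_every:
        save_checkpoint(dict(results))  # flush the final partial batch
    return results


# Stand-ins for the real REST call and the real checkpoint writer.
def fake_fetch(handle: str) -> Fields:
    return {"thesis.degree.level": "Doctoral", "handle": handle}


checkpoint_sizes: List[int] = []
handles = [f"hdl/{i}" for i in range(1200)]
enriched = enrich_concurrently(
    handles, fake_fetch, lambda snapshot: checkpoint_sizes.append(len(snapshot))
)
print(checkpoint_sizes)
```

Consuming results in input order keeps the checkpoint contents deterministic, at the cost of waiting on a slow request before later results are recorded.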
Each university produces data/western_us_dissertations/{key}_etds.csv with 19 columns. SHA-256 checksums and resumption tokens are checkpointed for crash recovery.
Source: pipeline/step1_harvest.py → scripts/scrape_oai_etds.py, scripts/enrich_concurrent.py
Transform raw CSVs into structured, labeled records. This is the most logic-dense step — 105 university configurations, 16 different department extraction strategies, and an 8-step degree inference chain determine PhD vs Master's and normalize department names for every record.
105 config entries handle the diversity across institutions. Some universities need multiple configs (PhD and Master's from separate collections, merged at output):
- degree_filter: 'phd', custom field mappings (creator_field: 'name', url_field: 'thesis_url').
- degree_label override. Merged via _merge: True.
- campus_filter. All degree_label: 'phd' since the system archives dissertations only.
- exclude_type: 'article|image|photograph' to filter noise from mixed collections.
- exclude_type: 'article' to remove published articles mixed into the ETD collection.
An 8-step inference chain determines PhD vs Master's. Each step is tried in priority order; the first match wins:
1. degree_label: 'phd' or 'ms'. Used for 20 configs where the collection is known (Caltech, UC System, Missouri S&T splits).
2. degree_type, degree_name, degree columns from REST API enrichment or OAI metadata.
3. "Dissertation" → PhD, "Thesis" → MS. Handles variants like Mississippi State's "Graduate Thesis - Open Access".
4. publication:dissertations → PhD. Columbia uses degree_level:doctoral and degree_level:master sets.
5. type: "doctoral" → PhD, "masters" → MS. Generic "Thesis" alone is NOT mapped (ambiguous).
6. PHD/MASTERS in the relation field.
7. Ph.D., S.M., M.S., M.Eng, Sc.D., etc. Catches ~51% of MIT records; the other 49% have descriptions too short to classify.
8. publication:theses → MS, but only if no publication:dissertations set exists (disambiguating repos that split vs repos that lump everything).
Current labeling rate: ~82% across all universities. The oai_etdms switch brought BYU from 6% to 97%, Mississippi State from 7% to 100%, and U Arkansas from 9% to 100%.
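A first-match-wins chain like the one above is naturally expressed as an ordered list of small rule functions. The sketch below implements only three of the eight steps, and the record keys (`degree_label`, `degree_level`, `description`) and regular expressions are assumptions chosen for the example, not the project's actual patterns.

```python
import re
from typing import Callable, Mapping, Optional, Sequence

Record = Mapping[str, str]
Rule = Callable[[Record], Optional[str]]

# Illustrative patterns for degree abbreviations found in descriptions.
PHD_RE = re.compile(r"\b(?:Ph\.?\s?D|Sc\.?\s?D)\b")
MS_RE = re.compile(r"\b(?:S\.M|M\.S|M\.Eng)\b")


def from_config(rec: Record) -> Optional[str]:
    """An explicit per-collection label always wins."""
    label = rec.get("degree_label", "")
    return label if label in ("phd", "ms") else None


def from_level(rec: Record) -> Optional[str]:
    """'Dissertation' means PhD and 'Thesis' means MS, including longer variants."""
    level = rec.get("degree_level", "").lower()
    if "dissertation" in level:
        return "phd"
    if "thesis" in level:
        return "ms"
    return None


def from_description(rec: Record) -> Optional[str]:
    """Look for degree abbreviations in the free-text description."""
    text = rec.get("description", "")
    if PHD_RE.search(text):
        return "phd"
    if MS_RE.search(text):
        return "ms"
    return None


RULES: Sequence[Rule] = (from_config, from_level, from_description)


def infer_degree(rec: Record, rules: Sequence[Rule] = RULES) -> str:
    """Try each rule in priority order; an empty string means unlabeled."""
    for rule in rules:
        label = rule(rec)
        if label is not None:
            return label
    return ""


print(infer_degree({"degree_level": "Graduate Thesis - Open Access"}))
print(infer_degree({"description": "Thesis: Ph.D., Massachusetts Institute of Technology"}))
```

Keeping each rule as a separate function means a new university quirk becomes one more entry in `RULES`, and each rule can be tested on its own.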
8 university-specific extractors plus a generic fallback handle the wide variety of ways departments appear in metadata:
- mec_aereng → "Mechanical & Aerospace Eng"
- "Engineering -- Software"
- department column. Handles Python list strings like "['Dental Sciences', 'Dentistry']".
Fallback chain: enriched fields → primary dept_source → title → OAI set names → description → "General"
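Two pieces of this step lend themselves to a short sketch: parsing department values that arrive as Python list strings, and walking the fallback chain until something non-empty turns up. The source keys in `FALLBACK_ORDER` and the choice to keep the first list element are assumptions for illustration.

```python
import ast
from typing import Mapping, Optional

# Hypothetical keys standing in for the sources named in the fallback chain.
FALLBACK_ORDER = ("enriched", "dept_source", "title", "oai_sets", "description")


def parse_department(raw: str) -> Optional[str]:
    """Return a clean department name, or None if nothing usable is present."""
    text = raw.strip()
    if text.startswith("[") and text.endswith("]"):
        try:
            items = ast.literal_eval(text)  # safe parse of "['A', 'B']"
        except (ValueError, SyntaxError):
            return text or None  # not a real list literal; keep the raw text
        if isinstance(items, list):
            for item in items:
                if isinstance(item, str) and item.strip():
                    return item.strip()  # assumed: the first entry wins
            return None
    return text or None


def extract_department(candidates: Mapping[str, str]) -> str:
    """Walk the fallback chain and default to 'General'."""
    for source in FALLBACK_ORDER:
        dept = parse_department(candidates.get(source, ""))
        if dept:
            return dept
    return "General"


print(parse_department("['Dental Sciences', 'Dentistry']"))
print(extract_department({"enriched": "", "oai_sets": "Physics"}))
print(extract_department({}))
```

`ast.literal_eval` is used instead of `eval` because it only accepts literals, so a malformed or hostile metadata value cannot execute code.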
Once a department name is extracted, it's classified as STEM or non-STEM against a taxonomy of 65 non-STEM departments (Education, English, History, Political Science, Business, etc.). The entire pipeline then runs twice — once for each track:
Every step from here forward produces two independent outputs. A record classified into "Computer Science" flows through the STEM track; a record in "Education" flows through the non-STEM track.
- STEM: data/processed/stem/{key}.json → site/data/stem/{key}.json → site/stem.html
- Non-STEM: data/processed/nonstem/{key}.json → site/data/nonstem/{key}.json → site/non_stem.html
Records are deduplicated by (name, year, dept, title[:50]). This typically removes 1–2% of records, where universities have overlapping OAI sets or dual-indexed collections.
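The track split and the deduplication key can be illustrated together. In this sketch the `NON_STEM` set holds only the five departments named in the text (the real taxonomy has 65), and the dictionary record shape and sample records are assumptions for the example.

```python
from typing import Dict, Iterable, List

Record = Dict[str, object]

# Only the departments named in the text; the full taxonomy is larger.
NON_STEM = {"Education", "English", "History", "Political Science", "Business"}


def track_for(dept: str) -> str:
    """Route a department to the 'stem' or 'nonstem' output track."""
    return "nonstem" if dept in NON_STEM else "stem"


def dedupe(records: Iterable[Record]) -> List[Record]:
    """Keep the first record for each (name, year, dept, title[:50]) key."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["name"], rec["year"], rec["dept"], str(rec["title"])[:50])
        if key in seen:
            continue
        seen.add(key)
        unique.append(rec)
    return unique


long_title = "A" * 50
records = [
    {"name": "Doe, Jane", "year": 2021, "dept": "Physics", "title": long_title + " (set 1)"},
    {"name": "Doe, Jane", "year": 2021, "dept": "Physics", "title": long_title + " (set 2)"},
    {"name": "Roe, Sam", "year": 2021, "dept": "Education", "title": "Another title"},
]
unique = dedupe(records)
print(len(unique), [track_for(str(rec["dept"])) for rec in unique])
```

Truncating the title to 50 characters lets two copies of the same thesis match even when one repository appends a suffix or subtitle to the title.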
For the ~56K records that remain unlabeled after the 8-step inference chain, two supplementary classifiers are available:
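- A scikit-learn model (degree_classifier.pkl).
- An AI classifier (Claude Haiku, step2b_ai_classify.py). Results are cached in pipeline/ai_classifications.json to avoid re-classifying.
Step 2 produces three output formats for different consumers: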
data/processed/{stem,nonstem}/{key}.json: compact arrays consumed by Step 4 (Enrich) and Step 5 (Generate). Each record is an 8-element array: [dept, year, name, classification, confidence, title, url, degree]
data/processed/stem_records.parquet (12 MB) and nonstem_records.parquet (5 MB). Columnar format with named fields: university, department, year, name, title, url, degree. ZSTD-compressed via DuckDB. Designed for reproducibility — anyone can load these in DuckDB, pandas, or R to validate the methodology.
    -- Validate degree labeling in DuckDB:
    SELECT university,
           COUNT(*) AS total,
           COUNT(CASE WHEN degree = 'phd' THEN 1 END) AS phd,
           COUNT(CASE WHEN degree = 'ms' THEN 1 END) AS ms,
           ROUND(100.0 * COUNT(NULLIF(degree, '')) / COUNT(*), 1) AS labeled_pct
    FROM 'data/processed/stem_records.parquet'
    GROUP BY university
    ORDER BY total DESC;
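For the compact JSON arrays described above, a consumer needs to map positions back to field names. The following is a small sketch of that decoding; the `decode` helper is hypothetical and the sample row is made up for illustration.

```python
from typing import Dict, Sequence

# Field order as documented for the 8-element record arrays.
FIELDS = ("dept", "year", "name", "classification", "confidence", "title", "url", "degree")


def decode(row: Sequence[object]) -> Dict[str, object]:
    """Turn one positional record into a dict keyed by field name."""
    if len(row) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} elements, got {len(row)}")
    return dict(zip(FIELDS, row))


# A made-up row; real classification and confidence values will differ.
row = ["Computer Science", 2021, "Doe, Jane", "", 0.0,
       "An Example Title", "https://example.edu/etd/1", "phd"]
record = decode(row)
print(record["dept"], record["degree"])
```

Checking the length up front turns a silently misaligned record into an immediate error, which matters when the array layout is shared between two pipeline steps and the browser.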
Source: pipeline/step2_process.py → scripts/generate_name_lists.py
Parse official US government data sources into per-university aggregates. These datasets track the immigration-via-education pipeline from graduation through work authorization to permanent residency.
USCIS publishes approval/denial counts by employer and fiscal year. We parse FY2010–2026 CSVs (both UTF-16 crosstab and standard formats) and match employer names to university keys.
292 employer name mappings in entity_map.json handle the many ways universities appear in federal filings:
- "THE CURATORS OF THE UNIV OF MISSOU", "UNIVERSITY OF ILLINOIS AT URBANA-"
- "UNIVERSITY" → "UNIV", "INSTITUTE" → "INST": "VIRGINIA POLYTECHNIC INST & STATE UNIV"
- "THE LELAND STANFORD JR UNIVERSITY" (4 variants), "BOARD OF TRUSTEES OF THE UNIVERSITY OF ILLINOIS"
- "UNIVERSITY OF CALIFORNIA, BERKELEY" vs "UNIVERSITY OF CALIFORNIA BERKELEY" vs "UNIVERSITY OF CALIFORNIA AT BERKELEY" (8 UC campuses, 3+ variants each)
- "COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK" (5 variants with different comma/article patterns)
DOL Labor Condition Applications (FY2015–2025): 32 Excel files totaling 3.6 GB. The parser extracts job titles (normalized to remove Excel artifacts like ="Assistant Professor"), SOC occupation codes, wage statistics with percentile breakdowns, and yearly certification counts.
DuckDB acceleration: XLSX files are converted to CSV once (cached in lca/csv_cache/), then queried via DuckDB SQL for fast employer-name joins. Falls back to openpyxl row-by-row parsing (~25 min) if DuckDB is unavailable.
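The employer-name matching and the job-title cleanup described above can both be sketched with the standard library. This is one possible approach, not the project's confirmed method: the text only says that entity_map.json holds the mappings, so the normalization rules, the `ENTITY_MAP` contents, and the university keys below are assumptions.

```python
import re
from typing import Mapping, Optional

ABBREVIATIONS = {"UNIVERSITY": "UNIV", "INSTITUTE": "INST"}
STOPWORDS = {"THE", "AT"}  # assumed: words that vary between filings


def clean_excel_artifact(value: str) -> str:
    """Turn an Excel text formula such as ="Assistant Professor" into plain text."""
    text = value.strip()
    match = re.fullmatch(r'="(.*)"', text)
    return match.group(1) if match else text


def normalize_employer(name: str) -> str:
    """Collapse case, punctuation, abbreviations, and filler words."""
    text = re.sub(r"[^A-Z0-9& ]+", " ", name.upper())
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(word for word in words if word not in STOPWORDS)


# Hypothetical excerpt of a normalized entity map.
ENTITY_MAP = {
    "UNIV OF CALIFORNIA BERKELEY": "berkeley",
    "CURATORS OF UNIV OF MISSOU": "missouri",
}


def match_university(name: str, entity_map: Mapping[str, str] = ENTITY_MAP) -> Optional[str]:
    """Return the university key for an employer name, or None if unmapped."""
    return entity_map.get(normalize_employer(name))


for variant in (
    "UNIVERSITY OF CALIFORNIA, BERKELEY",
    "UNIVERSITY OF CALIFORNIA BERKELEY",
    "UNIVERSITY OF CALIFORNIA AT BERKELEY",
):
    print(match_university(variant))
print(clean_excel_artifact('="Assistant Professor"'))
```

Normalizing before lookup shrinks the number of explicit mappings needed for punctuation and abbreviation variants, but truncated names still require their own entries because the missing characters cannot be recovered.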
ICE/SEVP SEVIS data (CY 2017–2024): OPT, STEM OPT 24-month extension, and CPT work authorization counts by university. Pre-parsed from PDF tables published by DHS. National trends show 11x OPT growth from 2007 to 2024.
DOL PERM permanent labor certification data: employer-sponsored green card applications and approval rates. 72 universities with data.
Source: pipeline/step3_parse_federal.py
Merge all data sources into final per-university JSON files. This is the only step that writes to site/data/ — a critical design rule that prevents the overwrite bugs that plagued earlier versions.
Each university's processed records are combined with H-1B, OPT, LCA, and PERM data from Step 3.
Multi-campus universities often file H-1B petitions under the parent system's legal name. system_h1b_map.json maps campuses to their parent system (e.g., Missouri S&T → UM System), so campus-level views show the system's aggregate data with a disclaimer explaining the attribution.
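The campus-to-system fallback might be implemented along these lines. The function name, the returned dictionary shape, and the keys `missouri_st` and `um_system` are assumptions for illustration; the sketch also assumes the fallback applies only when a campus has no H-1B data of its own.

```python
from typing import Dict, Mapping

H1BData = Mapping[str, Mapping[str, int]]


def h1b_for_campus(
    key: str,
    h1b_by_key: Mapping[str, H1BData],
    system_map: Mapping[str, str],
) -> Dict[str, object]:
    """Return a campus's H-1B data, falling back to its parent system.

    The 'system_level' flag tells the site to show the attribution disclaimer.
    """
    if key in h1b_by_key:
        return {"data": h1b_by_key[key], "attributed_to": key, "system_level": False}
    parent = system_map.get(key)
    if parent is not None and parent in h1b_by_key:
        return {"data": h1b_by_key[parent], "attributed_to": parent, "system_level": True}
    return {"data": {}, "attributed_to": None, "system_level": False}


# Made-up numbers and keys, for illustration only.
h1b = {
    "um_system": {"FY2020": {"initial_approvals": 10}},
    "mit": {"FY2020": {"initial_approvals": 20}},
}
system_map = {"missouri_st": "um_system"}

campus = h1b_for_campus("missouri_st", h1b, system_map)
print(campus["attributed_to"], campus["system_level"])
```

Returning the attribution alongside the data keeps the disclaimer logic out of the page templates: the generator only has to check one flag.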
Three additional federal data sources are merged per university, matched by IPEDS UNITID:
Department-level links to faculty directories, PhD student listings, and repository search pages from data/dept_links.json. These let users jump from a thesis record directly to the department that produced it.
Final enriched JSONs are written to site/data/stem/{key}.json and site/data/nonstem/{key}.json — 202 files total. These are both the pipeline's definitive output and the files served directly to the browser.
    {
      "key": "mit",
      "name": "MIT",
      "nra": 43.8,
      "records": [[dept, year, name, class, conf, title, url, degree], ...],
      "h1b": {"FY2020": {"initial_approvals": 123, ...}, ...},
      "opt": {"CY2020": {"opt": 456, "stem_opt": 234}, ...},
      "lca": {"total_lcas": 890, "top_titles": [...], "wage_median": 85000},
      "perm": {"total": 45, "certified": 42},
      "enrollment": {"total": 7201, "foreign": 2983, "foreign_pct": 41.4},
      "funding": {"yearly_rd_thousands": {"2024": 1945000}},
      "dept_links": {"Computer Science": {"home": "...", "faculty": "..."}}
    }
Source: pipeline/step4_enrich.py
Build the HTML pages from enriched JSON files. This step has read-only access to site/data/ — it never modifies the per-university JSONs.
The landing page displays 4 interactive Chart.js visualizations: STEM PhD completions by citizenship (IPEDS), OPT growth pipeline (SEVIS), R&D funding vs American PhDs (HERD + IPEDS), and NRA enrollment by STEM field (GSS).
A lightweight index (key, name, NRA%, record count) is embedded in the HTML. Individual university data is lazy-loaded on demand via fetch to minimize initial page weight.
Each university's records are rendered client-side: departments, year blocks, name rows with direct thesis links, degree badges, origin classification flags, and expandable H-1B/OPT/LCA/PERM panels with Chart.js visualizations.
Every university, department, year, and data panel has a shareable URL fragment. Coach marks teach first-time visitors how to copy deep links with a single click.
Source: pipeline/step5_generate.py → scripts/generate_name_lists_html.py
Upload the site to AWS with optimized caching and compression for fast global delivery.
All JSON files are gzipped at compression level 9 before upload. Uploaded with Content-Encoding: gzip header so browsers decompress transparently.
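The pre-compression step can be reproduced with Python's `gzip` module. This sketch covers only the compression and the headers that would accompany the upload; the `gzip_json` helper and the sample payload are assumptions, and the upload call itself is left out.

```python
import gzip
import json
from typing import Dict


def gzip_json(payload: object) -> bytes:
    """Serialize compactly, then gzip at the maximum compression level."""
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw, compresslevel=9)


def upload_headers() -> Dict[str, str]:
    """HTTP headers that let browsers decompress the body transparently."""
    return {"Content-Type": "application/json", "Content-Encoding": "gzip"}


# Made-up payload in the shape of a per-university file.
payload = {"key": "mit", "records": [["Physics", 2021, "Doe, Jane"]] * 200}
blob = gzip_json(payload)
plain_size = len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))
print(len(blob), "compressed bytes vs", plain_size, "uncompressed")
```

Compressing once at deploy time means the slower level-9 setting costs nothing per request, since the stored object is served as-is.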
Files are uploaded with tiered cache headers:
- max-age=300 (5 minutes): allows quick updates
- max-age=3600 (1 hour): pre-gzipped
- max-age=604800 (1 week): JS, images, favicon
A full invalidation (/*) is issued after upload to ensure edge caches serve fresh content immediately. CloudFront provides HTTP/2, TLS termination, and sub-100ms latency worldwide.
Source: pipeline/step6_deploy.py
The pipeline currently has 105 university configs, 292 entity name mappings, 16 department extraction strategies, and an 8-step degree inference chain with dozens of regex patterns. This is not elegant, and it's not meant to be. It's a direct reflection of the problem: 101 universities, each running different software, following different metadata standards, with different ideas about what fields to populate and how to format them.
Every one of those ad hoc rules exists because someone discovered a specific university doing something unexpected. MIT puts degree types in the description field. Penn State puts them in the relation field. Columbia uses OAI set names with degree_level:doctoral. BYU returns "MS" as a degree name, while Binghamton returns "Master of Arts (MA)". USCIS truncates Stanford's legal name to "THE LELAND STANFORD JR UNIVERSITY" in some years but not others.
The functional pipeline doesn't eliminate this complexity — nothing can. What it does is make the complexity auditable. Every special case is a config entry, a mapping, or a clearly-scoped extraction function. When a new university is added or an existing one changes its repository software, the fix is a config change and a re-run, not a debugging expedition across 30 interleaved scripts. The manifest tells you exactly what state each university is in. The immutable state guarantees that fixing one university can't break another.
The system is, by nature, an accumulation of discovered edge cases. The pipeline is the structure that keeps that accumulation from becoming unmaintainable.
    # Run full pipeline
    python -m pipeline run
    # Run specific step(s)
    python -m pipeline run --step enrich,generate,deploy
    # Single university
    python -m pipeline run --university mit
    # Dry run
    python -m pipeline run --dry-run
    # Status check
    python -m pipeline status
    pipeline/
      __init__.py              # Step registry
      __main__.py              # CLI entry point
      runner.py                # compose_pipeline()
      state.py                 # Frozen dataclasses
      universities.json        # 101 university configs
      entity_map.json          # 292 employer name mappings
      system_h1b_map.json      # Multi-campus H-1B fallback
      manifest.json            # Per-university tracking
      degree_classifier.pkl    # sklearn model
      ai_classifications.json  # Claude Haiku cache
      step1_harvest.py         # OAI-PMH scraping
      step2_process.py         # Record classification
      step2b_ai_classify.py    # AI degree classification
      step3_parse_federal.py   # H-1B, LCA, OPT, PERM
      step4_enrich.py          # Merge all sources
      step5_generate.py        # HTML generation
      step6_deploy.py          # S3 + CloudFront