Data Sources & Methodology

Where our data comes from, how it was collected, and what it can and cannot tell you

Contents

  1. Federal Datasets
  2. Understanding Citizenship Categories
  3. Dissertation Metadata Collection
  4. Name Classification
  5. Limitations & Caveats
  6. Analytics & Tracking
  7. Links to Original Sources

Immigration via Education Pipeline

This site tracks the full pathway from international graduate enrollment to permanent residency. Degree production comes from university repository metadata; every later stage uses official US government data:

PhD/MS Production (University repositories) → OPT Work Auth (ICE/SEVP SEVIS) → H-1B Visa (USCIS + DOL LCA) → Green Card (DOL PERM)

Stage | Source / Agency | Coverage | Universities
Thesis/Dissertation Data | University repositories (OAI-PMH) | 2016-2024 | 101
OPT Students Employed | ICE/SEVP SEVIS Reports | CY 2017-2024 | 59 (Top 100)
H-1B Petition Approvals | USCIS H-1B Employer Data Hub | FY 2010-2026 | 98
H-1B Job Titles & Salaries | DOL LCA Disclosure Data | FY 2020-2025 | 78
Green Card Sponsorship | DOL PERM Disclosure Data | FY 2024 | 71

Note on employer entity names: Universities file H-1B and PERM petitions under their legal entity name, which may differ from their common name and can change over time. Multi-campus university systems (e.g., University of Missouri, SUNY, University of California) often file under a single systemwide entity, making per-campus attribution imprecise. We match employer names to universities using pattern matching with known name variants, but some filings may be attributed to the wrong campus within a system.
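
The matching step described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual matcher: the variant patterns, the `NAME_VARIANTS` table, and the function names are assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical variant patterns; the site's real variant list is not shown here.
NAME_VARIANTS = {
    "University of Missouri": [
        r"\bcurators of the university of missouri\b",
        r"\buniversity of missouri\b",
    ],
    "University of California": [
        r"\bregents of the university of california\b",
        r"\buniversity of california\b",
    ],
}

# Compile once so repeated lookups stay cheap.
COMPILED = {
    school: [re.compile(pattern) for pattern in patterns]
    for school, patterns in NAME_VARIANTS.items()
}


def normalize(employer: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", employer.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def match_university(employer: str) -> Optional[str]:
    """Return the first university whose variant pattern matches, else None."""
    name = normalize(employer)
    for school, patterns in COMPILED.items():
        if any(p.search(name) for p in patterns):
            return school
    return None


print(match_university("THE CURATORS OF THE UNIVERSITY OF MISSOURI"))
print(match_university("Regents of the University of California, Davis"))
print(match_university("Acme Corp"))
```

The second call illustrates the caveat in the note: a filing made under a systemwide entity resolves only to the system, so any per-campus attribution would need additional evidence such as work location.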

1. Federal Datasets

This site draws on seven major federal data sources, each with different strengths and coverage.

IPEDS (Integrated Postsecondary Education Data System)

Agency: NCES / U.S. Dept. of Education
Coverage: 1984 – 2024

IPEDS is the primary federal database for higher education statistics. Every Title IV institution reports annually. We use four IPEDS surveys:

IPEDS Data Center

NSF Survey of Earned Doctorates (SED)

Agency: NSF / NCSES
Coverage: 1979 – 2024

An annual census of every research doctorate recipient from a U.S. institution. Unlike IPEDS, the SED collects individual-level data including source country for temporary visa holders, postgraduation plans, and financial support. The SED is the source for our national doctorate trend charts, stay-rate analysis, and country-of-origin rankings (China, India, etc.).

NSF SED homepage · 2024 tables

NSF Graduate Student Survey (GSS)

Agency: NSF / NCSES
Coverage: 2023 – 2024

Institution-reported data on enrollment, financial support, and demographics of graduate students in science and engineering departments. Provides field-level enrollment counts with foreign/domestic separation at each institution — more granular than IPEDS enrollment for S&E fields.

NSF GSS homepage

NSF Workforce Surveys (SDR & NSCG)

Agency: NSF / NCSES
Coverage: 2021 – 2023

The Survey of Doctorate Recipients (SDR) surveys ~80,000 PhD holders in the U.S. workforce. The National Survey of College Graduates (NSCG) surveys ~95,000 college graduates. These are the only federal surveys that separate native-born citizens from naturalized citizens and permanent residents — a distinction that IPEDS and the SED cannot make. The NSCG also records specific birth country (via the BTHST_TOGA variable), enabling China-specific analysis.

SDR microdata · NSCG microdata

NSF Higher Education R&D Survey (HERD)

Agency: NSF / NCSES
Coverage: FY 2024

Reports total R&D expenditures by institution and funding source (federal, state, industry, etc.). Used to show how much research funding flows to universities with the highest NRA concentrations.

NSF HERD homepage

ICE SEVIS Data (SEVP)

Agency: DHS / ICE / SEVP
Coverage: Jan 2024 – Mar 2026 (monthly snapshots)

Student and Exchange Visitor Information System data tracking all F-1 and M-1 visa holders. Provides current counts by country of citizenship, education level, and U.S. state. Also includes STEM-specific breakdowns at the state level. As of March 2026, China has 229,463 active student records.

SEVIS Data Mapping Tool

DOL H-1B Disclosure & USAspending

Agencies: DOL / Treasury
Coverage: FY2024 – FY2025

H-1B LCA data from the Department of Labor covers every Labor Condition Application for H-1B visas, including employer, job title, wage, and work location. Used to understand the post-graduation employment pipeline. USAspending federal grant data shows total federal awards to each university.

DOL OFLC data · USAspending.gov

2. Understanding Citizenship Categories

Different federal surveys define citizenship groups differently. Understanding these definitions is critical to interpreting the data correctly.

The IPEDS Blind Spot: IPEDS, the most widely cited source for higher education demographics, cannot distinguish between U.S. citizens and permanent residents (green card holders). Both are grouped into a single "U.S. citizen or permanent resident" bucket, then broken down by race/ethnicity. Only "Non-Resident Alien" (temporary visa holders) is reported separately.

This means that when IPEDS reports 60% of PhDs go to "domestic" students, that 60% includes an unknown number of green card holders and naturalized citizens who were born abroad.

How each survey defines groups

Survey | Categories | Can separate citizens from green card holders?
IPEDS (all surveys) | NRA = temporary visa only. Race/ethnicity categories (Asian, White, Black, Hispanic, etc.) include BOTH citizens AND permanent residents combined. | No
NSF SED | "U.S. citizen or permanent resident" vs. "Temporary visa holder." Same grouping as IPEDS. | No
NSF SDR | CTZN field has 5 values: 1 = Native-born U.S. citizen; 2 = Naturalized U.S. citizen (foreign-born); 3 = Permanent resident (green card); 4 = Temporary visa holder; 5 = Living outside the U.S. | Yes
NSF NSCG | Same CTZN coding as SDR, plus: BTHST_TOGA = specific birth country code; BTHRGN = birth region; CTZDUAL = dual citizenship flag | Yes, plus birth country
SEVIS / SEVP | Country of citizenship for each active F-1/M-1 record. | N/A (visa holders only)
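
To make the difference between the two groupings concrete, here is a small sketch that collapses SDR-style CTZN codes into the two buckets IPEDS and the SED report. The code values come from the table above; the sample list, the `ipeds_style_group` helper, and the "Not in scope" label are made up for illustration, and the tally is unweighted, so it is not a substitute for a properly weighted survey estimate.

```python
from collections import Counter

# CTZN codes as listed in the table above.
CTZN_LABELS = {
    1: "Native-born U.S. citizen",
    2: "Naturalized U.S. citizen (foreign-born)",
    3: "Permanent resident (green card)",
    4: "Temporary visa holder",
    5: "Living outside the U.S.",
}


def ipeds_style_group(ctzn: int) -> str:
    """Collapse a CTZN code into the coarse grouping used by IPEDS and the SED."""
    if ctzn in (1, 2, 3):
        return "U.S. citizen or permanent resident"
    if ctzn == 4:
        return "Temporary visa holder"
    return "Not in scope"  # code 5; label is an assumption for this sketch


# Made-up respondent codes, purely illustrative.
sample_ctzn = [1, 1, 1, 2, 3, 4, 1, 2]

detailed = Counter(CTZN_LABELS[c] for c in sample_ctzn)
collapsed = Counter(ipeds_style_group(c) for c in sample_ctzn)

print(detailed)
print(collapsed)
```

In the collapsed tally, codes 1, 2, and 3 become indistinguishable, which is exactly the information IPEDS and the SED do not carry.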

The SDR finding: 28% of "domestic" PhDs are foreign-born.

The 2023 Survey of Doctorate Recipients (80,143 respondents) reveals the composition of what IPEDS calls "U.S. citizen or permanent resident":

That means roughly 28% of the "domestic" PhD workforce that IPEDS counts as American is actually foreign-born: 17.9% naturalized citizens plus 7.0% permanent residents (24.9% combined), plus additional foreign-born respondents in the native-citizen category. Only about two-thirds of PhD holders working in the U.S. were born here.

Key terms

3. Dissertation Metadata Collection

The name-list pages on this site are built from dissertation and thesis metadata harvested directly from university digital repositories. No individual student records are accessed — only publicly available metadata from institutional repositories.

Collection methods

Method | Description | Universities
OAI-PMH | Open Archives Initiative Protocol for Metadata Harvesting. A standard protocol supported by most DSpace, Fedora, and bepress repositories. We send ListRecords requests to each university's OAI endpoint and collect Dublin Core metadata (title, creator, date, subject, description, type). | ~100 universities (Arizona, BYU, Cornell, CMU, Columbia, Portland State, etc.)
REST API | Stanford Digital Repository exposes a Searchworks/PURL API. We query for genre:Thesis records and collect structured metadata including department and advisor. | Stanford
eScholarship API | The University of California system's eScholarship platform supports OAI-PMH with campus-specific sets. We harvest all 10 UC campuses via their OAI endpoint. | UC Berkeley, UCLA, UCSD, UC Davis, UC Irvine, UCSB, UCSC, UCR, UC Merced, UCSF
HTML scraping | For repositories without API access, structured HTML parsing of thesis listing pages. | Caltech, UW Seattle
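
As a rough illustration of the OAI-PMH method, the sketch below builds a ListRecords request URL and parses Dublin Core fields out of a response. It is not the project's harvester: the endpoint `https://example.edu/oai`, the sample XML, and the function names are assumptions, and no network request is made. The XML namespaces and the `verb`, `metadataPrefix`, and `resumptionToken` parameters are those defined by the OAI-PMH 2.0 protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
FIELDS = ("title", "creator", "date", "subject", "description", "type")


def list_records_url(base_url, resumption_token=None):
    """Build a ListRecords URL; follow-up pages send only the resumption token."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (records, next_token) from one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        dc = rec.find(".//oai_dc:dc", NS)
        if dc is None:  # deleted records carry a header but no metadata
            continue
        records.append({
            field: [el.text for el in dc.findall(f"dc:{field}", NS) if el.text]
            for field in FIELDS
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token or None)


# Hand-written sample response for a hypothetical repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.edu:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Dissertation</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:date>2024</dc:date>
          <dc:type>Dissertation</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, next_token = parse_records(SAMPLE)
print(list_records_url("https://example.edu/oai"))
print(records[0]["creator"], next_token)
```

A full harvest would loop, requesting `list_records_url(base, next_token)` until the response no longer carries a resumption token.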

In total, the project has collected metadata for approximately 1.4 million theses and dissertations across 107+ universities. After filtering to doctoral dissertations in STEM fields, the classified name lists cover about 80 high-NRA institutions.
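
The filtering step could look something like the sketch below, which keeps records whose Dublin Core `type` suggests a doctoral dissertation and whose `subject` suggests a STEM field. The keyword lists and function names are hypothetical; the project's actual filter criteria are not documented on this page.

```python
# Hypothetical keyword lists; the real criteria are not documented here.
DOCTORAL_HINTS = ("dissertation", "doctoral", "phd", "ph.d")
STEM_HINTS = (
    "computer science", "physics", "chemistry",
    "engineering", "mathematics", "biology",
)


def is_doctoral(record):
    """True if any dc:type value mentions a doctoral-level keyword."""
    types = " ".join(record.get("type", [])).lower()
    return any(hint in types for hint in DOCTORAL_HINTS)


def is_stem(record):
    """True if any dc:subject value mentions a STEM keyword."""
    subjects = " ".join(record.get("subject", [])).lower()
    return any(hint in subjects for hint in STEM_HINTS)


def filter_stem_doctoral(records):
    return [r for r in records if is_doctoral(r) and is_stem(r)]


# Made-up records in the shape of harvested Dublin Core metadata.
sample = [
    {"title": ["A"], "type": ["Doctoral Dissertation"], "subject": ["Computer Science"]},
    {"title": ["B"], "type": ["Master's Thesis"], "subject": ["Physics"]},
    {"title": ["C"], "type": ["Dissertation"], "subject": ["Medieval History"]},
]

kept = filter_stem_doctoral(sample)
print([r["title"][0] for r in kept])
```

Keyword matching of this kind depends on how each repository fills in `type` and `subject`, so coverage will vary by institution.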

Metadata fields collected

4. Name Classification

Author names from dissertation metadata are classified by likely national origin using a combination of rule-based and statistical methods. The classification assigns each name to a region/origin category (e.g., Chinese, Indian, Korean, Anglo/European, Hispanic, etc.).

This approach is necessary because no federal data source provides country-of-origin information for individual PhD recipients at the institution and field level. IPEDS reports only "NRA" without country breakdown. The SED reports country totals nationally but not per institution. Name classification bridges this gap, albeit with inherent limitations (see below).
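
As a toy illustration of combining a rule-based step with a statistical fallback, consider the sketch below. Everything in it is hypothetical: the lookup table, the probability values (placeholders, not real estimates), the threshold, and the function names. It does not reproduce the site's classifier; it only shows the general shape of such a pipeline, including an explicit "Unclassified" outcome for ambiguous names.

```python
# Hypothetical rule table: surnames treated as unambiguous for this sketch.
RULES = {"zhang": "Chinese", "patel": "Indian", "kim": "Korean"}

# Placeholder probabilities for an ambiguous surname; NOT real estimates.
SURNAME_PROBS = {
    "lee": {"Korean": 0.5, "Chinese": 0.3, "Anglo/European": 0.2},
}


def surname_of(creator):
    """Extract a lowercase surname from 'Last, First' or 'First Last'."""
    if "," in creator:
        return creator.split(",", 1)[0].strip().lower()
    parts = creator.split()
    return parts[-1].lower() if parts else ""


def classify(creator, threshold=0.6):
    """Apply the rule table first, then the statistical fallback."""
    surname = surname_of(creator)
    if surname in RULES:
        return RULES[surname]
    probs = SURNAME_PROBS.get(surname)
    if probs:
        label, p = max(probs.items(), key=lambda kv: kv[1])
        if p >= threshold:
            return label
    return "Unclassified"


print(classify("Zhang, Wei"))
print(classify("Lee, Sam"))
print(classify("Jane Doe"))
```

Leaving low-confidence names unclassified, rather than forcing a label, is one way to keep ambiguous surnames from inflating any single category.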

5. Limitations & Caveats

Federal data limitations

Dissertation & name classification limitations

Source | Agency | URL | Data Updated
IPEDS Data Center | NCES / Dept. of Education | nces.ed.gov | 2024 academic year
Survey of Earned Doctorates | NSF / NCSES | ncses.nsf.gov | 2024 survey year
Graduate Student Survey | NSF / NCSES | ncses.nsf.gov | 2024 survey year
Survey of Doctorate Recipients | NSF / NCSES | ncses.nsf.gov | 2023 survey
National Survey of College Graduates | NSF / NCSES | ncses.nsf.gov | 2023 survey
HERD Survey | NSF / NCSES | ncses.nsf.gov | FY 2024
SEVIS Data Mapping Tool | DHS / ICE / SEVP | studyinthestates.dhs.gov | March 2026
H-1B LCA Disclosure | DOL / OFLC | dol.gov | FY2025 Q1
USAspending | U.S. Treasury | usaspending.gov | 2024
American Community Survey | U.S. Census Bureau | data.census.gov | 2022 ACS
S&E Indicators (NSB) | NSF / NSB | ncses.nsf.gov/indicators | 2024 and 2026 editions

6. Analytics & Tracking

Site Analytics

This site uses a lightweight, self-hosted analytics system. No third-party tracking (no Google Analytics, no Facebook pixel, no ad networks).

What we collect

Data Point | Purpose
Page views | Understand which universities and pages are most viewed
Click events | Track which departments and toggles users interact with
Share link clicks | Measure how often content is shared (🔗 button clicks)
Country & city | Geographic distribution of visitors (from CloudFront headers)
IP address | Geo-analysis and unique visitor estimation
Screen size & user agent | Device type analysis

What we don't collect

Infrastructure

Events are sent to analytics.andy-barr.com/collect via the Beacon API. Data is stored in AWS DynamoDB with a 90-day TTL (auto-deleted after 90 days). The analytics dashboard is at analytics.andy-barr.com (password-protected).
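
For readers curious how a 90-day TTL typically works, the sketch below builds an event item with an expiry timestamp. DynamoDB's TTL feature expects a numeric attribute holding a Unix epoch time in seconds; the attribute name `expires_at`, the other field names, and the `build_event_item` helper are assumptions for this example rather than the site's actual schema.

```python
import time

TTL_DAYS = 90
SECONDS_PER_DAY = 86_400


def build_event_item(event_type, page, now=None):
    """Build an analytics item that expires TTL_DAYS after it was recorded."""
    now = int(time.time()) if now is None else int(now)
    return {
        "event_type": event_type,
        "page": page,
        "timestamp": now,
        # Hypothetical TTL attribute: epoch seconds after which the item may be deleted.
        "expires_at": now + TTL_DAYS * SECONDS_PER_DAY,
    }


item = build_event_item("page_view", "/methodology", now=1_700_000_000)
print(item["expires_at"] - item["timestamp"])  # 7776000 seconds = 90 days
```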

Share tracking

When you click any 🔗 share button, we record which anchor was shared (e.g., "purdue--computer-science--2024") so we can understand which content is most shared. The URL you copy goes to your clipboard — we don't track where you paste it.
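
An anchor in the format shown above could be generated along these lines. This is a guess at the scheme based on the single example "purdue--computer-science--2024"; the `slugify` and `share_anchor` helpers are hypothetical.

```python
import re


def slugify(text):
    """Lowercase and replace runs of non-alphanumeric characters with one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def share_anchor(university, department, year):
    """Join the slugged parts with a double hyphen, as in the example anchor."""
    return "--".join([slugify(university), slugify(department), str(year)])


print(share_anchor("Purdue", "Computer Science", 2024))  # purdue--computer-science--2024
```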

Page last updated: April 1, 2026. Data freshness varies by source; see individual entries above for the most recent data year available in each dataset.