Where our data comes from, how it was collected, and what it can and cannot tell you
This site tracks the full pathway from international graduate enrollment to permanent residency, using official U.S. government data supplemented by thesis metadata from university repositories:
| Stage | Source Agency | Coverage | Universities |
|---|---|---|---|
| Thesis/Dissertation Data | University repositories (OAI-PMH) | 2016-2024 | 101 |
| OPT Students Employed | ICE/SEVP SEVIS Reports | CY 2017-2024 | 59 (Top 100) |
| H-1B Petition Approvals | USCIS H-1B Employer Data Hub | FY 2010-2026 | 98 |
| H-1B Job Titles & Salaries | DOL LCA Disclosure Data | FY 2020-2025 | 78 |
| Green Card Sponsorship | DOL PERM Disclosure Data | FY 2024 | 71 |
Note on employer entity names: Universities file H-1B and PERM petitions under their legal entity name, which may differ from their common name and can change over time. Multi-campus university systems (e.g., University of Missouri, SUNY, University of California) often file under a single systemwide entity, making per-campus attribution imprecise. We match employer names to universities using pattern matching with known name variants, but some filings may be attributed to the wrong campus within a system.
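The matching step described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual matcher: the `NAME_VARIANTS` table, the `normalize` and `match_employer` helpers, and the specific patterns are all assumptions made for the example.

```python
import re

# Hypothetical variant table: the real list of known name variants is not
# shown in this document, so these entries are illustrative only.
NAME_VARIANTS = {
    "University of Missouri (system)": [
        r"\bcurators of the university of missouri\b",
        r"\buniversity of missouri\b",
    ],
    "University of California (system)": [
        r"\bregents of the university of california\b",
        r"\buniversity of california\b",
    ],
    "Purdue University": [r"\bpurdue university\b"],
}

# Compile once so repeated lookups stay cheap.
COMPILED = {
    university: [re.compile(pattern) for pattern in patterns]
    for university, patterns in NAME_VARIANTS.items()
}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def match_employer(raw_name: str):
    """Return the first university whose pattern matches, or None."""
    norm = normalize(raw_name)
    for university, patterns in COMPILED.items():
        if any(p.search(norm) for p in patterns):
            return university
    return None
```

Note that a systemwide filer such as "The Regents of the University of California" resolves only to the system in this sketch, which is exactly the per-campus ambiguity described above.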
This site draws on seven major federal data sources, each with different strengths and coverage.
The Integrated Postsecondary Education Data System (IPEDS) is the primary federal database for higher education statistics. Every Title IV institution reports annually. We use four IPEDS surveys:
The Survey of Earned Doctorates (SED) is an annual census of every research doctorate recipient from a U.S. institution. Unlike IPEDS, the SED collects individual-level data, including source country for temporary visa holders, postgraduation plans, and financial support. The SED is the source for our national doctorate trend charts, stay-rate analysis, and country-of-origin rankings (China, India, etc.).
The Graduate Student Survey collects institution-reported data on enrollment, financial support, and demographics of graduate students in science and engineering (S&E) departments. It provides field-level enrollment counts with foreign/domestic separation at each institution, which is more granular than IPEDS enrollment for S&E fields.
The Survey of Doctorate Recipients (SDR) surveys ~80,000 PhD holders in the U.S. workforce. The National Survey of College Graduates (NSCG) surveys ~95,000 college graduates. These are the only federal surveys that separate native-born citizens from naturalized citizens and permanent residents — a distinction that IPEDS and the SED cannot make. The NSCG also records specific birth country (via the BTHST_TOGA variable), enabling China-specific analysis.
The Higher Education Research and Development (HERD) Survey reports total R&D expenditures by institution and funding source (federal, state, industry, etc.). We use it to show how much research funding flows to universities with the highest non-resident alien (NRA) concentrations.
The Student and Exchange Visitor Information System (SEVIS) tracks all F-1 and M-1 visa holders. It provides current counts by country of citizenship, education level, and U.S. state, and includes STEM-specific breakdowns at the state level. As of March 2026, China has 229,463 active student records.
H-1B LCA data from the Department of Labor covers every Labor Condition Application for H-1B visas, including employer, job title, wage, and work location. We use it to understand the post-graduation employment pipeline. USAspending federal grant data shows total federal awards to each university.
Different federal surveys define citizenship groups differently. Understanding these definitions is critical to interpreting the data correctly.
The IPEDS Blind Spot: IPEDS, the most widely cited source for higher education demographics, cannot distinguish between U.S. citizens and permanent residents (green card holders). Both are grouped into a single "U.S. citizen or permanent resident" bucket, then broken down by race/ethnicity. Only "Non-Resident Alien" (temporary visa holders) is reported separately.
This means that when IPEDS reports 60% of PhDs go to "domestic" students, that 60% includes an unknown number of green card holders and naturalized citizens who were born abroad.
| Survey | Categories | Can separate citizens from green card holders? |
|---|---|---|
| IPEDS (all surveys) | NRA = temporary visa only. Race/ethnicity categories (Asian, White, Black, Hispanic, etc.) include BOTH citizens AND permanent residents combined. | No |
| NSF SED | "U.S. citizen or permanent resident" vs. "Temporary visa holder." Same grouping as IPEDS. | No |
| NSF SDR | CTZN field has 5 values: 1 = Native-born U.S. citizen; 2 = Naturalized U.S. citizen (foreign-born); 3 = Permanent resident (green card); 4 = Temporary visa holder; 5 = Living outside the U.S. | Yes |
| NSF NSCG | Same CTZN coding as SDR, plus: BTHST_TOGA = specific birth country code; BTHRGN = birth region; CTZDUAL = dual citizenship flag | Yes, plus birth country |
| SEVIS / SEVP | Country of citizenship for each active F-1/M-1 record. | N/A (visa holders only) |
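To make the difference between the two coding schemes concrete, the sketch below maps the five CTZN values listed in the table to their labels and then collapses them into the coarser IPEDS-style buckets. The function names and the handling of code 5 are assumptions for illustration, and the sketch counts unweighted records, whereas published survey estimates normally apply survey weights.

```python
from collections import Counter

# CTZN codes as listed in the table above.
CTZN_LABELS = {
    1: "Native-born U.S. citizen",
    2: "Naturalized U.S. citizen (foreign-born)",
    3: "Permanent resident (green card)",
    4: "Temporary visa holder",
    5: "Living outside the U.S.",
}


def ipeds_style_bucket(ctzn: int) -> str:
    """Collapse a CTZN code into the coarser grouping IPEDS reports."""
    if ctzn in (1, 2, 3):
        return "U.S. citizen or permanent resident"
    if ctzn == 4:
        return "Non-Resident Alien"
    if ctzn == 5:
        # Assumption: respondents abroad have no IPEDS counterpart.
        return "Outside U.S. (no IPEDS equivalent)"
    raise ValueError(f"Unknown CTZN code: {ctzn}")


def summarize(ctzn_values):
    """Return (detailed counts, IPEDS-style counts) for a list of codes."""
    detailed = Counter(CTZN_LABELS[c] for c in ctzn_values)
    collapsed = Counter(ipeds_style_bucket(c) for c in ctzn_values)
    return detailed, collapsed
```

Three distinct CTZN groups (codes 1, 2, and 3) land in the same collapsed bucket, which is the information IPEDS and the SED cannot recover.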
The SDR finding: 28% of "domestic" PhDs are foreign-born.
The 2023 Survey of Doctorate Recipients (80,143 respondents) reveals the composition of what IPEDS calls "U.S. citizen or permanent resident":
That means roughly 28% of the "domestic" PhD workforce that IPEDS counts as American is actually foreign-born: 17.9% are naturalized citizens, 7.0% are permanent residents, and the remainder are foreign-born respondents within the native-citizen category. Only about two-thirds of PhD holders working in the U.S. were born here.
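The arithmetic behind the 28% figure can be checked directly from the numbers quoted above. Because the total is stated only as "roughly 28%", the remainder computed here is approximate and is an inference from the quoted figures, not a separately published number.

```python
# Shares quoted in the text above (percent of the "domestic" PhD workforce).
naturalized = 17.9
permanent_resident = 7.0
stated_foreign_born_total = 28.0  # "roughly 28%", so treat as approximate

# Share explained by the two explicitly named groups.
identified = naturalized + permanent_resident

# Implied share of foreign-born people in the native-citizen category.
remainder = stated_foreign_born_total - identified

print(f"Naturalized + permanent resident: {identified:.1f}%")
print(f"Implied remainder: about {remainder:.1f} percentage points")
```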
The name-list pages on this site are built from dissertation and thesis metadata harvested directly from university digital repositories. No individual student records are accessed — only publicly available metadata from institutional repositories.
| Method | Description | Universities |
|---|---|---|
| OAI-PMH | Open Archives Initiative Protocol for Metadata Harvesting. A standard protocol supported by most DSpace, Fedora, and bepress repositories. We send ListRecords requests to each university's OAI endpoint and collect Dublin Core metadata (title, creator, date, subject, description, type). | ~100 universities (Arizona, BYU, Cornell, CMU, Columbia, Portland State, etc.) |
| REST API | Stanford Digital Repository exposes a Searchworks/PURL API. We query for genre:Thesis records and collect structured metadata including department and advisor. | Stanford |
| eScholarship API | The University of California system's eScholarship platform supports OAI-PMH with campus-specific sets. We harvest all 10 UC campuses via their OAI endpoint. | UC Berkeley, UCLA, UCSD, UC Davis, UC Irvine, UCSB, UCSC, UCR, UC Merced, UCSF |
| HTML scraping | For repositories without API access, structured HTML parsing of thesis listing pages. | Caltech, UW Seattle |
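The OAI-PMH harvest described in the table can be sketched as two small helpers: one that builds a `ListRecords` request URL and one that extracts the Dublin Core fields from a response. This is a simplified illustration using only the Python standard library, not the project's actual harvester; the function names are hypothetical, and fetching, retries, and rate limiting are left out.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Standard OAI-PMH 2.0 and Dublin Core namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
DC_FIELDS = ("title", "creator", "date", "subject", "description", "type")


def list_records_url(base_url, set_spec=None, resumption_token=None):
    """Build a ListRecords URL for the first page or a follow-up page."""
    if resumption_token:
        # Per the protocol, resumptionToken is sent without other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def parse_list_records(xml_text):
    """Return (records, resumption_token) parsed from one response page."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        # Each Dublin Core field may repeat, so collect values as lists.
        records.append({
            name: [el.text for el in rec.iterfind(f".//dc:{name}", NS) if el.text]
            for name in DC_FIELDS
        })
    token_el = root.find(".//oai:resumptionToken", NS)
    has_token = token_el is not None and token_el.text
    return records, (token_el.text.strip() if has_token else None)
```

A harvest loop would fetch `list_records_url(...)`, parse the response, and repeat with the returned token until it comes back as `None`. Deleted records carry no metadata, so they appear here with empty field lists.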
In total, the project has collected metadata for approximately 1.4 million theses and dissertations across 107+ universities. After filtering to doctoral dissertations in STEM fields, the classified name lists cover about 80 high-NRA institutions.
Author names from dissertation metadata are classified by likely national origin using a combination of rule-based and statistical methods. The classification assigns each name to a region/origin category (e.g., Chinese, Indian, Korean, Anglo/European, Hispanic, etc.).
This approach is necessary because no federal data source provides country-of-origin information for individual PhD recipients at the institution and field level. IPEDS reports only "NRA" without country breakdown. The SED reports country totals nationally but not per institution. Name classification bridges this gap, albeit with inherent limitations (see below).
| Source | Agency | URL | Data Updated |
|---|---|---|---|
| IPEDS Data Center | NCES / Dept. of Education | nces.ed.gov | 2024 academic year |
| Survey of Earned Doctorates | NSF / NCSES | ncses.nsf.gov | 2024 survey year |
| Graduate Student Survey | NSF / NCSES | ncses.nsf.gov | 2024 survey year |
| Survey of Doctorate Recipients | NSF / NCSES | ncses.nsf.gov | 2023 survey |
| National Survey of College Graduates | NSF / NCSES | ncses.nsf.gov | 2023 survey |
| HERD Survey | NSF / NCSES | ncses.nsf.gov | FY 2024 |
| SEVIS Data Mapping Tool | DHS / ICE / SEVP | studyinthestates.dhs.gov | March 2026 |
| H-1B LCA Disclosure | DOL / OFLC | dol.gov | FY 2025 Q1 |
| USAspending | U.S. Treasury | usaspending.gov | 2024 |
| American Community Survey | U.S. Census Bureau | data.census.gov | 2022 ACS |
| S&E Indicators (NSB) | NSF / NSB | ncses.nsf.gov/indicators | 2024 and 2026 editions |
This site uses a lightweight, self-hosted analytics system. There is no third-party tracking: no Google Analytics, no Facebook pixel, no ad networks.
| Data Point | Purpose |
|---|---|
| Page views | Understand which universities and pages are most viewed |
| Click events | Track which departments and toggles users interact with |
| Share link clicks | Measure how often content is shared (🔗 button clicks) |
| Country & city | Geographic distribution of visitors (from CloudFront headers) |
| IP address | Geo-analysis and unique visitor estimation |
| Screen size & user agent | Device type analysis |
Events are sent to analytics.andy-barr.com/collect via the Beacon API.
Data is stored in AWS DynamoDB with a 90-day TTL (auto-deleted after 90 days).
The analytics dashboard is at analytics.andy-barr.com (password-protected).
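DynamoDB's TTL feature expires items based on a numeric attribute holding a Unix epoch timestamp in seconds. A 90-day retention window could therefore be stamped on each event roughly as shown below. The attribute names (`expires_at`, `event_type`, `page`) and the helper itself are assumptions for illustration, not the site's actual schema.

```python
import time

TTL_DAYS = 90
SECONDS_PER_DAY = 86_400


def build_event_item(event_type, page, now=None):
    """Build an analytics item with an expiry 90 days after `now`.

    `now` is a Unix timestamp in seconds; it defaults to the current time
    and is injectable so the function is easy to test.
    """
    now = int(time.time()) if now is None else int(now)
    return {
        "event_type": event_type,
        "page": page,
        "timestamp": now,
        # The table's TTL setting would point at this attribute.
        "expires_at": now + TTL_DAYS * SECONDS_PER_DAY,
    }
```

TTL deletion in DynamoDB runs as a background process, so an expired item may remain readable for a while after its expiry time has passed.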
When you click any 🔗 share button, we record which anchor was shared (e.g., "purdue--computer-science--2024") so we can understand which content is most shared. The URL you copy goes to your clipboard — we don't track where you paste it.
Page last updated: April 1, 2026. Data freshness varies by source; see individual entries above for the most recent data year available in each dataset.