How the atlas was grounded in recent AI labor research
The atlas is model-generated, but it is not meant to stand alone. This page explains the external research layer behind it: which public papers were reviewed, what kind of evidence each one contributes, which cross-paper claims recur, and where those claims can be attached back to occupations without pretending they are the same thing as the atlas score.
Papers reviewed: 5
Labs represented: 2
Recurring themes: 4
Occupation seeds: 5
How this research layer was assembled
This is not a paper dump. The goal was to build a small, legible evidence layer for the atlas: enough external grounding to interpret the map better, without blurring together evidence that means very different things.
Step 1
Start with a narrow public source set
The research layer uses a deliberately small set of recent OpenAI and Anthropic papers that say something concrete about workplace use, task capability, or labor-market effects.
Step 2
Separate evidence by what it actually measures
Some papers show observed usage, some benchmark frontier capability, and some frame workforce or policy implications. Keeping those buckets separate avoids false certainty; a typed sketch of that separation follows these steps.
Step 3
Pull out claims that can travel to occupations
For each paper, the useful outputs are recurring claims, caveats, and any occupations or occupation families the paper names directly enough to support a specific note.
Step 4
Use papers as interpretation, not hidden scoring
The papers help explain, challenge, or sharpen atlas estimates. They do not quietly overwrite the atlas score unless the scoring method itself changes.
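To make step 2 concrete, here is a minimal TypeScript sketch of how the three evidence buckets could be kept apart in the data itself. Everything in it, from the type names to the example record, is illustrative rather than the atlas's actual schema.

```ts
// Hypothetical sketch: keeping the three evidence buckets distinct at
// the type level so they cannot be silently blended. Names are
// illustrative, not the atlas's real data model.

type EvidenceKind = "observed-usage" | "benchmark-capability" | "policy-framing";

interface PaperClaim {
  paper: string;       // e.g. "GDPval (OpenAI, 2025)"
  kind: EvidenceKind;  // what the claim actually measures
  statement: string;   // short paraphrase, not a direct quote
  locator: string;     // section and pages, e.g. "Section 3.1, pp. 4-5"
  caveats: string[];   // limits the paper itself states
}

// Claims of different kinds can sit side by side without ever being
// averaged into a single exposure number.
const exampleClaim: PaperClaim = {
  paper: "Which Economic Tasks are Performed with AI? (Anthropic, 2025)",
  kind: "observed-usage",
  statement: "Software development and writing dominate observed Claude usage.",
  locator: "Abstract, pp. 1-2",
  caveats: ["Platform-specific usage; capability may exceed adoption."],
};
```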
How to read the papers against the atlas
The papers and the atlas answer related but different questions. The useful move is to let each source do the job it is actually good at.
What the papers contribute
They contribute observed product use, benchmark tasks built from expert work, and policy context. That is stronger evidence than the atlas can offer for what frontier models are doing now, or what they are plausibly close to doing soon.
What the atlas contributes
The atlas covers the full BLS occupation taxonomy in one place, keeps labor-market context like pay and projected growth attached to each occupation, and lets readers compare several AI lenses across the whole U.S. job structure.
What to keep separate
Observed usage, benchmark capability, and workforce-policy proposals answer different questions. This page keeps them separate instead of pretending they collapse into one universal exposure score.
Four claims that recur across the source set
Observed use is still concentrated in digital, language-heavy work
Across Anthropic's task-level usage papers and OpenAI's narrative blueprint, the strongest early signal is still software, writing, analysis, and other screen-based knowledge work rather than the whole labor market at once.
Augmentation currently edges pure automation in product usage
The observed usage studies consistently show people using AI to draft, iterate, explain, and review more often than to fully hand over end-to-end work, even if API usage skews more automated.
Capability is ahead of adoption
The Anthropic labor note and OpenAI's GDPval benchmark point the same way: current models can do more than current usage implies, but that gap is mediated by tools, workflow, regulation, verification, and organizational change.
Exposure is uneven across occupations and worker groups
Physical-world jobs remain relatively insulated in current usage studies, while more educated and higher-paid white-collar roles show more observed exposure and more evidence of workflow change.
What each paper contributes
AI at Work: OpenAI's Workforce Blueprint
October 2025
What this source measures
Combines product-usage observations, early market interpretation, and workforce-transition proposals.
How it informs the atlas
Useful as a policy and timing lens for interpreting replacement scores cautiously and for explaining why current usage can lag capability.
It is not an occupation-by-occupation empirical exposure table, so it should not be treated as a direct labeling source for the full atlas.
Near-term use looks more collaborative than fully substitutive
OpenAI frames current workplace use as decision support, writing, research, and streamlining of routine work, and argues the observed pattern is still more enabling than replacing.
Foreword, pp. 2-4.
Workplace adoption often starts bottom-up
The paper says employees often begin using ChatGPT before formal enterprise deployment, with writing, market research, and data analysis showing up early across business functions.
Foreword, pp. 2-3.
Capability is moving faster than labor-market measurement
OpenAI links its GDPval benchmark to a claim that GPT-5-level systems already match or exceed professionals on about half of the benchmarked economically valuable tasks.
Foreword, p. 3.
Labor Market Impacts of AI: A New Measure and Early Evidence
March 5, 2026
What this source measures
Introduces an "observed exposure" measure that combines theoretical LLM feasibility with real usage weighted toward work-related and more automated use; a toy sketch of that blend appears just below.
How it informs the atlas
This is the closest external analogue to our replacement metric because it explicitly tries to bridge theoretical exposure and observed use.
It is still built from Claude-centered usage and a custom weighting scheme, so it is informative context rather than a drop-in target variable for our own labels.
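The following toy sketch shows only the general shape of such a blend: feasibility anchored against usage that is discounted toward work-like, automated interactions. The Anthropic note defines its measure precisely and differently; every field and weight below is invented for illustration.

```ts
// Toy illustration only; all weights are invented. The real "observed
// exposure" measure is defined in the Anthropic note itself.

interface TaskSignals {
  feasibility: number;  // 0-1: could a current LLM plausibly do the task?
  usageShare: number;   // 0-1: share of observed conversations touching it
  workRelated: number;  // 0-1: fraction of that usage that is work-like
  automated: number;    // 0-1: fraction that is delegation, not iteration
}

function observedExposureSketch(t: TaskSignals): number {
  // Discount raw usage toward work-related, automated interactions,
  // then anchor against theoretical feasibility. Weights are arbitrary.
  const weightedUsage = t.usageShare * (0.5 * t.workRelated + 0.5 * t.automated);
  return 0.5 * t.feasibility + 0.5 * weightedUsage;
}
```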
Current deployment still trails theoretical capability
Anthropic argues actual task coverage remains only a fraction of what current models could theoretically do, which is a strong warning against equating capability with labor displacement.
Key findings, p. 2.
Higher observed exposure lines up with weaker projected growth
The note reports that occupations with higher observed exposure are projected by BLS to grow less through 2034.
Key findings, p. 2.
Early labor effects are subtle rather than dramatic
Anthropic reports no broad unemployment spike for exposed workers since late 2022, but it does see suggestive evidence that hiring of younger workers has slowed in exposed occupations.
Key findings, p. 2.
GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks
October 2025
What this source measures
Benchmarks frontier models on expert-authored, economically valuable tasks from predominantly digital occupations.
How it informs the atlas
Strong input for our digital-adjacency and augmentation reasoning because it measures capability on serious real-world deliverables.
Because GDPval is only 44 occupations and intentionally digital, it cannot stand in for exposure across manual, care, or field-heavy occupations.
The benchmark targets high-value digital work, not the full labor market
GDPval covers 44 occupations across the top 9 GDP sectors and deliberately focuses on predominantly digital roles.
Abstract and Section 2.1, pp. 1-3.
The task set is grounded in real expert work product
Tasks are built from work contributed by experienced practitioners and are evaluated with human expert pairwise comparisons rather than only automatic grading.
Abstract and Sections 2.2-2.5, pp. 1-4.
Frontier models are approaching expert-quality output on this narrow slice
OpenAI reports that the strongest frontier systems are approaching industry experts on the GDPval gold subset, which raises the upper bound on near-term exposure for digital occupations.
Section 3.1, pp. 4-5.
Which Economic Tasks are Performed with AI? Evidence from Millions of Claude Conversations
February 2025
What this source measures
Maps millions of Claude conversations onto O*NET tasks to show where AI is already being used in the economy.
How it informs the atlas
One of the best external anchors for our augmentation and physical-world insulation metrics.
It is platform-specific usage evidence, so it can understate occupations where capability exists but product adoption, regulation, or workflow integration lag.
Observed use is concentrated in software and writing
Anthropic finds software development and writing tasks together account for nearly half of total observed Claude usage.
Abstract, pp. 1-2.
Adoption is broad but shallow across many occupations
The paper reports that about 36% of occupations show AI use in at least a quarter of their tasks, but only a small share show deep task penetration.
Abstract and contributions, pp. 1-3.
Augmentation edges automation in observed product use
Anthropic estimates 57% of usage is augmentative and 43% is more automation-like, while occupations involving physical manipulation show minimal current use.
Abstract and Section 1 contributions, pp. 1-3.
The Anthropic Economic Index Report: Economic Primitives
January 15, 2026
What this source measures
Adds task complexity, autonomy, success rate, and work-versus-coursework distinctions to Claude usage analysis.
How it informs the atlas
Best source in this set for occupation-specific nuance beyond a single headline score, especially around what kind of work remains after AI takes on some tasks.
It still reflects Claude usage and success rather than a cross-provider equilibrium, so it should complement rather than override our ensemble estimates.
Success rates meaningfully change occupational exposure
When Anthropic weights tasks by both importance and Claude success rate, some occupations such as data entry keyers and database architects show large swaths of work within reach; a weighted-average sketch follows this paper's findings.
Introduction, pp. 3-4.
Observed use remains mixed between collaboration and delegation
Anthropic reports augmentation again exceeds automation on Claude.ai, even while automated use remains stronger in first-party API traffic.
Chapter 1 overview, pp. 4-5.
Task removal can imply deskilling or upskilling depending on the occupation
The report uses travel agents and property managers to show that removing AI-covered tasks can either hollow out the most complex work or strip away bookkeeping-heavy work and leave more strategic responsibilities.
Introduction, pp. 3-4.
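A hedged sketch of the importance-and-success weighting idea named in this paper's findings: occupation-level coverage as an importance-weighted average of per-task success. The report's actual estimator may differ; the interface and numbers below are invented for illustration.

```ts
// Sketch of importance-weighted task coverage. Invented names and
// numbers; the report's own estimator may be more involved.

interface ScoredTask {
  importance: number;   // task importance weight (O*NET-style)
  successRate: number;  // 0-1: observed model success on the task
}

function coverageSketch(tasks: ScoredTask[]): number {
  const totalImportance = tasks.reduce((sum, t) => sum + t.importance, 0);
  if (totalImportance === 0) return 0;
  const weighted = tasks.reduce((sum, t) => sum + t.importance * t.successRate, 0);
  return weighted / totalImportance;
}

// Two tasks: one critical and mostly within reach, one minor and not.
coverageSketch([
  { importance: 4, successRate: 0.9 },
  { importance: 1, successRate: 0.2 },
]); // ≈ 0.76: most importance-weighted work is within reach
```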
How to attach paper evidence to occupations
A good next layer is a short paper-backed note inside each occupation detail view: evidence that sits beside the atlas estimate to explain it, pressure-test it, or add nuance that a single score cannot carry on its own.
Step 1
Start with explicit mentions, not fuzzy semantic matching
Seed the feature only with occupations or occupation families the papers name directly. That keeps the first pass auditable and avoids inventing authority where the papers were actually more general.
Step 2
Store statement, evidence type, and locator separately
Each note should keep a short paraphrased statement, the paper citation, a page locator, and a tag for whether it reflects observed usage, benchmark capability, or policy framing; a schema sketch follows these steps.
Step 3
Attach notes as secondary evidence, not as score overrides
A citation should sit beside the atlas estimate to explain or challenge it. It should not silently rewrite a score unless the scoring method itself has changed to incorporate that evidence.
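Here is an illustrative TypeScript schema for such a note, with the fields from step 2 plus the occupation it attaches to. All names are hypothetical, not the atlas's real data model; the point is that rendering puts notes beside the score without ever mutating it.

```ts
// Hypothetical note schema: the note explains the atlas estimate but
// has no code path that can change it.

type EvidenceKind = "observed-usage" | "benchmark-capability" | "policy-framing";

interface OccupationNote {
  socCode: string;    // occupation the paper names directly, e.g. "43-9021" (data entry keyers)
  statement: string;  // short paraphrased claim
  kind: EvidenceKind; // what the evidence actually measures
  citation: string;   // paper title and year
  locator: string;    // page locator, e.g. "Introduction, pp. 3-4"
}

// Notes sit beside the atlas estimate; they never mutate the score.
function describeOccupation(atlasScore: number, notes: OccupationNote[]): string[] {
  const lines = [`Atlas estimate: ${atlasScore.toFixed(2)}`];
  for (const n of notes) {
    lines.push(`[${n.kind}] ${n.statement} (${n.citation}, ${n.locator})`);
  }
  return lines;
}
```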
Occupation families with the clearest paper trail
These are strong first candidates because the papers either name the occupations directly or describe a narrow enough occupation family to support an auditable note.
Observed Claude usage is especially concentrated in software development, writing, and analytical work, so these occupations are good candidates for paper-backed exposure notes.
These are direct occupation families where our high digital-adjacency and replacement scores can be paired with external observed-usage evidence.
Occupations requiring physical manipulation of the environment show minimal current Claude usage, making them strong examples for the physical-world insulation metric.
This gives the atlas a concrete external citation for why some low-replacement cells stay relatively green even when they are economically large.
When Anthropic factors in task success rates, data entry keyers and database architects are examples where Claude appears capable across a large share of the job.
These are unusually clean occupation-level hooks for the detail view because the report names them directly and says something more specific than a generic exposure score.
Anthropic uses travel agents as an example where AI-covered tasks may remove more complex planning work and leave more routine ticketing and payment work behind.
This is exactly the kind of nuance a single replacement score cannot carry on its own.
Anthropic uses property managers as the opposite case, where removing bookkeeping-heavy tasks can leave more negotiation and stakeholder management work behind.
This supports adding citation-backed notes that distinguish augmentation from deskilling even within occupations that look similarly exposed on a single color scale.