Structural AI Benchmarks

What validation accuracy doesn't tell you

Standard metrics measure performance on held-out test sets. They cannot tell you whether your model's internal structure will survive contact with a different distribution — a different simulator, a different scanner, a different domain. Transfer Oracle measures that directly.

All results below are produced by a single API call: POST /v1/audit/transfer.
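As a sketch of what that call might look like from Python: the endpoint path is taken from this page, but the host, payload field names, and response shape below are illustrative assumptions, not the documented schema.

```python
import json
import urllib.request

API_URL = "https://api.example.com/v1/audit/transfer"  # host is a placeholder


def build_audit_payload(train_embeddings, deploy_embeddings, model_id):
    """Assemble a transfer-audit request body.

    Field names are illustrative assumptions, not the documented schema.
    """
    return {
        "model_id": model_id,
        "train_embeddings": train_embeddings,    # per-sample vectors from training data
        "deploy_embeddings": deploy_embeddings,  # per-sample vectors from the target domain
    }


def audit_transfer(payload):
    """POST the payload to the audit endpoint (not invoked in this sketch)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_audit_payload([[0.1, 0.2]], [[0.3, 0.4]], "resnet18-pathmnist")
print(sorted(payload))  # ['deploy_embeddings', 'model_id', 'train_embeddings']
```

The request is just embeddings plus an identifier; no labels are required, which is what lets the audit run before deployment.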

Real data

MedMNIST benchmark

A ResNet-18 (ImageNet pretrained) is trained on PathMNIST (colon pathology histology). We then ask: can this model transfer to completely different medical imaging domains — skin lesions, blood cell microscopy, abdominal CT? Each dataset has its own published validation accuracy (measured on its own test set), but that number says nothing about cross-domain transfer. Transfer Oracle does.

How to read this table: "Own-domain val" is each dataset's accuracy on its own test split — it measures in-distribution performance, not transferability. "Transfer oracle" measures whether the PathMNIST-trained model's internal structure covers the deployment distribution. A large gap means the model will silently fail when deployed cross-domain.
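To make the coverage column concrete, here is a toy stand-in for the idea: the fraction of deployment embeddings whose nearest training embedding lies within some radius. This is an illustration of the concept only, not Transfer Oracle's actual metric.

```python
import math


def coverage(train_embs, deploy_embs, radius):
    """Toy structural coverage: fraction of deployment points whose
    nearest training point lies within `radius`."""
    covered = 0
    for d in deploy_embs:
        nearest = min(math.dist(d, t) for t in train_embs)
        if nearest <= radius:
            covered += 1
    return covered / len(deploy_embs)


train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
deploy = [(0.1, 0.1), (0.9, 0.1), (3.0, 3.0), (0.0, 0.8)]
print(coverage(train, deploy, radius=0.5))  # 0.75 — one deploy point is uncovered
```

Low coverage means deployment samples land in regions the model has never seen structurally, regardless of how high its own-domain accuracy is.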

Source: medmnist.com | Training: PathMNIST (colon pathology) | Model: ResNet-18 (ImageNet pretrained) | 500 samples per dataset | Computed via POST /v1/audit/transfer

| Deploy target | Domain | Own-domain val (in-distribution) | Transfer oracle (cross-domain) | Coverage | Risk |
|---|---|---|---|---|---|
| PathMNIST | Colon pathology (same domain) | 91% | 68% | 100% | MED |
| DermaMNIST | Skin lesions (different domain) | 73% | 3% | 90% | HIGH |
| BloodMNIST | Blood cell microscopy (different domain) | 96% | 18% | 50% | HIGH |
| OrganAMNIST | Abdominal CT organs (different domain) | 94% | 9% | 90% | HIGH |

Key finding: BloodMNIST scores 96% on its own test set — but when you deploy the PathMNIST-trained model to blood cell images, the transfer oracle drops to 18% with only 50% structural coverage. The model looks excellent by standard metrics, yet its internal representations completely fail to cover the deployment distribution. That gap is invisible without Transfer Oracle.
Real data

CIFAR-10-C benchmark

A ResNet-18 (ImageNet pretrained) trained on clean CIFAR-10 is deployed against 5 corruption types at 5 severity levels (25 scenarios). Same 10 classes, same task — only the distribution changes. The gold standard for testing robustness to distribution shift.
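The 25 scenarios enumerate cleanly, so a full audit sweep is just a loop over corruption and severity. The payload shape below is an illustrative assumption; the corruption names follow the CIFAR-10-C convention.

```python
from itertools import product

CORRUPTIONS = ["gaussian_noise", "defocus_blur", "fog", "contrast", "jpeg_compression"]
SEVERITIES = range(1, 6)  # CIFAR-10-C defines severity levels 1-5

# One audit request per (corruption, severity) pair; each scenario would be
# sent to the transfer-audit endpoint with 500 samples, as in the table below.
scenarios = [
    {"corruption": c, "severity": s, "n_samples": 500}
    for c, s in product(CORRUPTIONS, SEVERITIES)
]
print(len(scenarios))  # 5 corruptions x 5 severities = 25
```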

Clean baseline: oracle 68% | coverage 100% | diagnosis SAFE

Source: Hendrycks & Dietterich 2019 | Model: ResNet-18 (ImageNet pretrained) | 500 samples per scenario | Computed via POST /v1/audit/transfer

| Corruption | Type | Sev 1 | Sev 2 | Sev 3 | Sev 4 | Sev 5 |
|---|---|---|---|---|---|---|
| Gaussian noise | Noise | 36% (85% cov) | 24% (65% cov) | 16% (50% cov) | 14% (50% cov) | 13% (50% cov) |
| Defocus blur | Blur | 66% (100% cov) | 61% (100% cov) | 54% (100% cov) | 50% (90% cov) | 35% (80% cov) |
| Fog | Weather | 65% (100% cov) | 62% (100% cov) | 55% (100% cov) | 47% (90% cov) | 30% (80% cov) |
| Contrast | Digital | 64% (100% cov) | 57% (100% cov) | 49% (90% cov) | 40% (80% cov) | 25% (55% cov) |
| JPEG compression | Digital | 56% (100% cov) | 52% (100% cov) | 52% (100% cov) | 48% (100% cov) | 44% (100% cov) |

Key finding: Gaussian noise at severity 4-5 triggers RED_FLAG — oracle drops to 13-14% with only 50% coverage. Meanwhile JPEG compression stays SAFE across all severities (44-56%, 100% coverage). Transfer Oracle distinguishes harmful from benign distribution shifts: the risk is specific to the corruption type, not just its severity.
Simulated scenarios

The danger zone

High validation accuracy + low oracle score = a model that will fail in production. These scenarios illustrate the pattern across diverse deployment domains.

| Domain | Scenario | Val acc. | Oracle | Risk | Root cause |
|---|---|---|---|---|---|
| Robotics | Policy A → Real Robot | 91% | 12% | HIGH | Sim lighting + joint noise not represented in training distribution |
| Medical AI | Scanner A → Scanner B | 88% | 34% | HIGH | Vendor-specific reconstruction kernels create invisible domain shift |
| Agriculture | Lab images → Field | 84% | 41% | HIGH | 67% of deploy samples land in map regions with no training coverage |
| Satellite | Sentinel-2 → PlanetScope | 79% | 58% | MED | Spectral band mismatch causes systematic dimensional shift |
| NLP | LoRA ft → prod domain | 76% | 71% | LOW | Transfer is structurally sound — safe to deploy |

Sim-to-Real transfer

Policies trained in simulation routinely fail in the real world — not because the task is wrong, but because the structural distribution is different. Transfer Oracle quantifies this gap before you run a single real-world experiment.

| Scenario | Sim oracle | Real oracle | Coverage | Dim. shift | Verdict |
|---|---|---|---|---|---|
| Isaac Sim → Real arm (6-DOF) | 67% | 14% | 31% | High | Fail |
| MuJoCo → Real hand (5-finger) | 71% | 28% | 44% | High | Fail |
| Gazebo → Outdoor wheeled | 69% | 52% | 68% | Med | Marginal |
| IsaacGym → Quadruped flat | 74% | 71% | 89% | Low | Pass |
| Webots → Warehouse AMR | 72% | 69% | 91% | Low | Pass |
What this means: A policy with 91% sim task success can have a 14% real-world oracle score. Transfer Oracle identifies this from embeddings alone — before any real hardware is involved. Coverage tells you what fraction of real-world states the sim policy has ever seen structurally.

LoRA adapter transfer

Fine-tuning a base model with LoRA adapters changes its internal structure in ways that standard eval metrics don't capture. Transfer Oracle audits adapter deltas to tell you whether an adapter trained for one domain will hold up in another.

| Base model | Task transfer | Delta score | Coverage | Recommendation |
|---|---|---|---|---|
| Llama-3 8B | Legal → Medical | 0.12 | 38% | Re-train on medical corpus — legal adapter collapses on med terminology |
| Mistral 7B | Code → SQL | 0.71 | 84% | Transfer is structurally sound. Minor edge-case gaps in joins. |
| Gemma 2B | EN → DE translation | 0.34 | 52% | Adapter covers common vocabulary; fails on technical compound nouns. |
| Phi-3 Mini | General → Customer support | 0.68 | 79% | Acceptable transfer. Monitor for escalation-pattern blind spots. |
| Qwen-2 7B | Chat → RAG retrieval | 0.09 | 23% | Adapter not viable — retrieval requires fundamentally different structure. |

Delta score measures structural alignment of the adapter's weight changes against the target distribution. 0 = no alignment, 1 = perfect structural coverage.
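One way to picture a score like this: treat the flattened adapter weight change as a vector and measure its alignment against a direction summarising the target distribution. The sketch below is a heavily simplified stand-in, not Transfer Oracle's actual computation; every name in it is hypothetical.

```python
import math


def cosine(u, v):
    """Plain cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def delta_score(adapter_delta, target_direction):
    """Toy delta score: cosine alignment of the flattened adapter weight
    change against a target-distribution direction, clamped to [0, 1].
    Illustrative only; one of many possible [0, 1] mappings."""
    return max(0.0, cosine(adapter_delta, target_direction))


aligned = delta_score([1.0, 0.5, 0.0], [2.0, 1.0, 0.0])    # parallel vectors
orthogonal = delta_score([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # no shared direction
print(round(aligned, 2), round(orthogonal, 2))
```

A delta that points the same way as the target direction scores near 1; an orthogonal or opposing delta scores 0, matching the "no alignment" end of the scale described above.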

Video-to-robot policy transfer

Learning robot policies from video demonstrations is compelling but structurally risky. Transfer Oracle audits the embedding space of video-derived policies against real robot rollout distributions — before hardware experiments.

| Video source | Target system | Policy oracle | Structural finding |
|---|---|---|---|
| YouTube cooking videos | Kitchen manipulation robot | 18% | Human motion variance too high — robot kinematics not covered |
| Industrial assembly footage | Factory pick-and-place | 61% | Good structural match — camera angle + object scale consistent |
| Human grasping dataset | 3-finger gripper | 43% | Partial coverage — fingertip dynamics differ from human fingers |
| Surgical procedure video | Laparoscopic assistant | 29% | High oracle variance — instrument occlusion creates uncovered regions |

Sim-to-Sim transfer

Even moving a policy between simulators is non-trivial. Physics engines, rendering pipelines, and sensor models differ structurally. Transfer Oracle quantifies cross-simulator compatibility before migration.

| Migration | Verdict | Oracle | Finding |
|---|---|---|---|
| Isaac Sim → MuJoCo | Pass | 78% | Physics dynamics align well. Minor contact model differences at joint limits. |
| Gazebo (ROS2) → Webots | Pass | 71% | Good structural match. Sensor noise models differ — validate perception pipeline. |
| Isaac Sim → Gazebo (ROS1) | Marginal | 44% | Significant rendering gap. Depth sensor simulation differs substantially. |
| PyBullet → Isaac Sim | Fail | 31% | Contact dynamics and soft-body simulation are fundamentally different. |

Cross-architecture transfer

Swapping model architectures — ResNet to ViT, LSTM to Transformer — changes the internal representation even when trained on identical data. Transfer Oracle measures representational compatibility without needing labels.

| Architecture swap | Representation | Oracle | Finding |
|---|---|---|---|
| ResNet-50 → ViT-B/16 | ImageNet features | 63% | Spatial vs patch-based representations diverge at high-frequency edges. |
| LSTM → Transformer | Time-series encoding | 71% | Long-range dependencies align well; short-range local patterns differ. |
| CNN → MLP-Mixer | Aerial imagery | 48% | Global vs local receptive fields cause coverage gaps in fine-grained regions. |
| BERT-base → RoBERTa-large | Sentiment features | 84% | Strong structural alignment. Safe to transfer downstream task representations. |
| EfficientNet-B4 → ConvNeXt-S | Medical imaging | 56% | Depthwise convolution vs full convolution creates systematic shift on fine structures. |
| GPT-2 → Mistral-7B | Code completion | 39% | Scale difference creates substantial structural gap — requires re-alignment. |

Run these benchmarks on your models

One API call. Works with any ML framework, any architecture. Get oracle scores, coverage maps, dimensional shift analysis, and prioritised recommendations.

Already have API access? Call POST /v1/audit/transfer, or view the documentation.