Structural AI Benchmarks

What validation accuracy doesn't tell you

Standard metrics measure performance on held-out test sets. They cannot tell you whether your model's internal structure will survive contact with a different distribution — a different simulator, a different scanner, a different domain. Transfer Oracle measures that directly.

All results below are produced by a single API call: POST /v1/audit/transfer.
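As a sketch of what that call might look like from Python: the endpoint path is taken from this page, but the host, payload field names, and response shape below are illustrative assumptions, not the documented schema.

```python
import json
import urllib.request

API_URL = "https://api.example.com/v1/audit/transfer"  # host is a placeholder


def build_audit_payload(train_embeddings, deploy_embeddings, model_id):
    """Assemble a transfer-audit request body.

    Field names are illustrative assumptions, not the documented schema.
    """
    return {
        "model_id": model_id,
        "train_embeddings": train_embeddings,    # per-sample vectors from training data
        "deploy_embeddings": deploy_embeddings,  # per-sample vectors from the target domain
    }


def audit_transfer(payload):
    """POST the payload to the audit endpoint (not invoked in this sketch)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_audit_payload([[0.1, 0.2]], [[0.3, 0.4]], "resnet18-pathmnist")
print(sorted(payload))  # ['deploy_embeddings', 'model_id', 'train_embeddings']
```

The request is just embeddings plus an identifier; no labels are required, which is what lets the audit run before deployment.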

Real data

MedMNIST benchmark

A ResNet-18 (ImageNet pretrained) is trained on PathMNIST (colon pathology histology). We then ask: can this model transfer to completely different medical imaging domains — skin lesions, blood cell microscopy, abdominal CT? Each dataset has its own published validation accuracy (measured on its own test set), but that number says nothing about cross-domain transfer. Transfer Oracle does.

How to read this table: "Own-domain val" is each dataset's accuracy on its own test split — it measures in-distribution performance, not transferability. "Transfer oracle" measures whether the PathMNIST-trained model's internal structure covers the deployment distribution. A large gap means the model will silently fail when deployed cross-domain.
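To make the coverage column concrete, here is a toy stand-in for the idea: the fraction of deployment embeddings whose nearest training embedding lies within some radius. This is an illustration of the concept only, not Transfer Oracle's actual metric.

```python
import math


def coverage(train_embs, deploy_embs, radius):
    """Toy structural coverage: fraction of deployment points whose
    nearest training point lies within `radius`."""
    covered = 0
    for d in deploy_embs:
        nearest = min(math.dist(d, t) for t in train_embs)
        if nearest <= radius:
            covered += 1
    return covered / len(deploy_embs)


train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
deploy = [(0.1, 0.1), (0.9, 0.1), (3.0, 3.0), (0.0, 0.8)]
print(coverage(train, deploy, radius=0.5))  # 0.75 — one deploy point is uncovered
```

Low coverage means deployment samples land in regions the model has never seen structurally, regardless of how high its own-domain accuracy is.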

Source: medmnist.com | Training: PathMNIST (colon pathology) | Model: ResNet-18 (ImageNet pretrained) | 500 samples per dataset | Computed via POST /v1/audit/transfer

| Deploy target | Domain | Own-domain val (in-distribution) | Transfer oracle (cross-domain) | Coverage | Risk |
|---|---|---|---|---|---|
| PathMNIST | Colon pathology (same domain) | 91% | 68% | 100% | MED |
| DermaMNIST | Skin lesions (different domain) | 73% | 3% | 90% | HIGH |
| BloodMNIST | Blood cell microscopy (different domain) | 96% | 18% | 50% | HIGH |
| OrganAMNIST | Abdominal CT organs (different domain) | 94% | 9% | 90% | HIGH |

Key finding: BloodMNIST scores 96% on its own test set — but when you deploy the PathMNIST-trained model to blood cell images, the transfer oracle drops to 18% with only 50% structural coverage. The model looks excellent by standard metrics, yet its internal representations completely fail to cover the deployment distribution. That gap is invisible without Transfer Oracle.
Real data

CIFAR-10-C benchmark

A ResNet-18 (ImageNet pretrained) trained on clean CIFAR-10 is deployed against 5 corruption types at 5 severity levels (25 scenarios). Same 10 classes, same task — only the distribution changes. The gold standard for testing robustness to distribution shift.
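The 25 scenarios enumerate cleanly, so a full audit sweep is just a loop over corruption and severity. The payload shape below is an illustrative assumption; the corruption names follow the CIFAR-10-C convention.

```python
from itertools import product

CORRUPTIONS = ["gaussian_noise", "defocus_blur", "fog", "contrast", "jpeg_compression"]
SEVERITIES = range(1, 6)  # CIFAR-10-C defines severity levels 1-5

# One audit request per (corruption, severity) pair; each scenario would be
# sent to the transfer-audit endpoint with 500 samples, as in the table below.
scenarios = [
    {"corruption": c, "severity": s, "n_samples": 500}
    for c, s in product(CORRUPTIONS, SEVERITIES)
]
print(len(scenarios))  # 5 corruptions x 5 severities = 25
```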

Clean baseline: oracle 68% | coverage 100% | diagnosis SAFE

Source: Hendrycks & Dietterich 2019 | Model: ResNet-18 (ImageNet pretrained) | 500 samples per scenario | Computed via POST /v1/audit/transfer

| Corruption | Type | Sev 1 | Sev 2 | Sev 3 | Sev 4 | Sev 5 |
|---|---|---|---|---|---|---|
| Gaussian noise | Noise | 36% (85% cov) | 24% (65% cov) | 16% (50% cov) | 14% (50% cov) | 13% (50% cov) |
| Defocus blur | Blur | 66% (100% cov) | 61% (100% cov) | 54% (100% cov) | 50% (90% cov) | 35% (80% cov) |
| Fog | Weather | 65% (100% cov) | 62% (100% cov) | 55% (100% cov) | 47% (90% cov) | 30% (80% cov) |
| Contrast | Digital | 64% (100% cov) | 57% (100% cov) | 49% (90% cov) | 40% (80% cov) | 25% (55% cov) |
| JPEG compression | Digital | 56% (100% cov) | 52% (100% cov) | 52% (100% cov) | 48% (100% cov) | 44% (100% cov) |

Key finding: Gaussian noise at severity 4-5 triggers RED_FLAG — oracle drops to 13-14% with only 50% coverage. Meanwhile JPEG compression stays SAFE across all severities (44-56%, 100% coverage). Transfer Oracle distinguishes harmful from benign distribution shifts: the risk is specific to the corruption type, not just its severity.
Simulated scenarios

The danger zone

High validation accuracy + low oracle score = a model that will fail in production. These scenarios illustrate the pattern across diverse deployment domains.

| Domain | Scenario | Val acc. | Oracle | Risk | Root cause |
|---|---|---|---|---|---|
| Robotics | Policy A → Real Robot | 91% | 12% | HIGH | Sim lighting + joint noise not represented in training distribution |
| Medical AI | Scanner A → Scanner B | 88% | 34% | HIGH | Vendor-specific reconstruction kernels create invisible domain shift |
| Agriculture | Lab images → Field | 84% | 41% | HIGH | 67% of deploy samples land in map regions with no training coverage |
| Satellite | Sentinel-2 → PlanetScope | 79% | 58% | MED | Spectral band mismatch causes systematic dimensional shift |
| NLP | LoRA ft → prod domain | 76% | 71% | LOW | Transfer is structurally sound — safe to deploy |

Sim-to-Real transfer

Policies trained in simulation routinely fail in the real world — not because the task is wrong, but because the structural distribution is different. Transfer Oracle quantifies this gap before you run a single real-world experiment.

| Scenario | Sim oracle | Real oracle | Coverage | Dim. shift | Verdict |
|---|---|---|---|---|---|
| Isaac Sim → Real arm (6-DOF) | 67% | 14% | 31% | High | Fail |
| MuJoCo → Real hand (5-finger) | 71% | 28% | 44% | High | Fail |
| Gazebo → Outdoor wheeled | 69% | 52% | 68% | Med | Marginal |
| IsaacGym → Quadruped flat | 74% | 71% | 89% | Low | Pass |
| Webots → Warehouse AMR | 72% | 69% | 91% | Low | Pass |
What this means: A policy with 91% sim task success can have a 14% real-world oracle score. Transfer Oracle identifies this from embeddings alone — before any real hardware is involved. Coverage tells you what fraction of real-world states the sim policy has ever seen structurally.

LoRA adapter transfer

Fine-tuning a base model with LoRA adapters changes its internal structure in ways that standard eval metrics don't capture. Transfer Oracle audits adapter deltas to tell you whether an adapter trained for one domain will hold up in another.

| Base model | Task transfer | Delta score | Coverage | Recommendation |
|---|---|---|---|---|
| Llama-3 8B | Legal → Medical | 0.12 | 38% | Re-train on medical corpus — legal adapter collapses on med terminology |
| Mistral 7B | Code → SQL | 0.71 | 84% | Transfer is structurally sound. Minor edge-case gaps in joins. |
| Gemma 2B | EN → DE translation | 0.34 | 52% | Adapter covers common vocabulary; fails on technical compound nouns. |
| Phi-3 Mini | General → Customer support | 0.68 | 79% | Acceptable transfer. Monitor for escalation-pattern blind spots. |
| Qwen-2 7B | Chat → RAG retrieval | 0.09 | 23% | Adapter not viable — retrieval requires fundamentally different structure. |

Delta score measures structural alignment of the adapter's weight changes against the target distribution. 0 = no alignment, 1 = perfect structural coverage.
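One way to picture a score like this: treat the flattened adapter weight change as a vector and measure its alignment against a direction summarising the target distribution. The sketch below is a heavily simplified stand-in, not Transfer Oracle's actual computation; every name in it is hypothetical.

```python
import math


def cosine(u, v):
    """Plain cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def delta_score(adapter_delta, target_direction):
    """Toy delta score: cosine alignment of the flattened adapter weight
    change against a target-distribution direction, clamped to [0, 1].
    Illustrative only; one of many possible [0, 1] mappings."""
    return max(0.0, cosine(adapter_delta, target_direction))


aligned = delta_score([1.0, 0.5, 0.0], [2.0, 1.0, 0.0])    # parallel vectors
orthogonal = delta_score([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # no shared direction
print(round(aligned, 2), round(orthogonal, 2))
```

A delta that points the same way as the target direction scores near 1; an orthogonal or opposing delta scores 0, matching the "no alignment" end of the scale described above.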

Video-to-robot policy transfer

Learning robot policies from video demonstrations is compelling but structurally risky. Transfer Oracle audits the embedding space of video-derived policies against real robot rollout distributions — before hardware experiments.

| Video source | Target system | Policy oracle | Structural finding |
|---|---|---|---|
| YouTube cooking videos | Kitchen manipulation robot | 18% | Human motion variance too high — robot kinematics not covered |
| Industrial assembly footage | Factory pick-and-place | 61% | Good structural match — camera angle + object scale consistent |
| Human grasping dataset | 3-finger gripper | 43% | Partial coverage — fingertip dynamics differ from human fingers |
| Surgical procedure video | Laparoscopic assistant | 29% | High oracle variance — instrument occlusion creates uncovered regions |

Sim-to-Sim transfer

Even moving a policy between simulators is non-trivial. Physics engines, rendering pipelines, and sensor models differ structurally. Transfer Oracle quantifies cross-simulator compatibility before migration.

| Migration | Verdict | Oracle | Finding |
|---|---|---|---|
| Isaac Sim → MuJoCo | Pass | 78% | Physics dynamics align well. Minor contact model differences at joint limits. |
| Gazebo (ROS2) → Webots | Pass | 71% | Good structural match. Sensor noise models differ — validate perception pipeline. |
| Isaac Sim → Gazebo (ROS1) | Marginal | 44% | Significant rendering gap. Depth sensor simulation differs substantially. |
| PyBullet → Isaac Sim | Fail | 31% | Contact dynamics and soft-body simulation are fundamentally different. |

Cross-architecture transfer

Swapping model architectures — ResNet to ViT, LSTM to Transformer — changes the internal representation even when trained on identical data. Transfer Oracle measures representational compatibility without needing labels.

| Architecture swap | Representation | Oracle | Finding |
|---|---|---|---|
| ResNet-50 → ViT-B/16 | ImageNet features | 63% | Spatial vs patch-based representations diverge at high-frequency edges. |
| LSTM → Transformer | Time-series encoding | 71% | Long-range dependencies align well; short-range local patterns differ. |
| CNN → MLP-Mixer | Aerial imagery | 48% | Global vs local receptive fields cause coverage gaps in fine-grained regions. |
| BERT-base → RoBERTa-large | Sentiment features | 84% | Strong structural alignment. Safe to transfer downstream task representations. |
| EfficientNet-B4 → ConvNeXt-S | Medical imaging | 56% | Depthwise convolution vs full convolution creates systematic shift on fine structures. |
| GPT-2 → Mistral-7B | Code completion | 39% | Scale difference creates substantial structural gap — requires re-alignment. |

Run these benchmarks on your models

One API call. Works with any ML framework, any architecture. Get oracle scores, coverage maps, dimensional shift analysis, and prioritised recommendations.

Already have API access? Call POST /v1/audit/transfer, or view the documentation.