Standard metrics measure performance on held-out test sets. They cannot tell you whether your model's internal structure will survive contact with a different distribution — a different simulator, a different scanner, a different domain. Transfer Oracle measures that directly.
All results below are produced by a single API call: POST /v1/audit/transfer.
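For example, a minimal call from Python might look like the sketch below. The endpoint path is the one above; the host, the auth header, and the payload/response field names (`source_embeddings`, `deploy_embeddings`, `oracle`, `coverage`, `risk`) are illustrative assumptions, not the documented schema.

```python
# Minimal sketch of an audit call. Host, auth, and field names are
# illustrative assumptions; consult the API reference for the real schema.
import numpy as np
import requests

API_URL = "https://api.example.com/v1/audit/transfer"  # hypothetical host
API_KEY = "..."  # your key

# Placeholder features; in practice, use your model's penultimate-layer
# embeddings for the source samples and the deployment samples.
source_feats = np.random.randn(500, 512)
deploy_feats = np.random.randn(500, 512)

resp = requests.post(
    API_URL,
    json={"source_embeddings": source_feats.tolist(),
          "deploy_embeddings": deploy_feats.tolist()},
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
audit = resp.json()
print(audit["oracle"], audit["coverage"], audit["risk"])
```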
A ResNet-18 (ImageNet pretrained) is trained on PathMNIST (colon pathology histology). We then ask: can this model transfer to completely different medical imaging domains — skin lesions, blood cell microscopy, abdominal CT? Each dataset has its own published validation accuracy (measured on its own test set), but that number says nothing about cross-domain transfer. Transfer Oracle does.
How to read this table: "Own-domain val" is each dataset's accuracy on its own test split — it measures in-distribution performance, not transferability. "Transfer oracle" measures whether the PathMNIST-trained model's internal structure covers the deployment distribution. A large gap means the model will silently fail when deployed cross-domain.
Source: medmnist.com | Training: PathMNIST (colon pathology) | Model: ResNet-18 (ImageNet pretrained) | 500 samples per dataset | Computed via POST /v1/audit/transfer
| Deploy target | Domain | Own-domain val (in-distribution) | Transfer oracle (cross-domain) | Coverage | Risk |
|---|---|---|---|---|---|
| PathMNIST | Colon pathology (same domain) | 91% | 68% | 100% | MED |
| DermaMNIST | Skin lesions (different domain) | 73% | 3% | 90% | HIGH |
| BloodMNIST | Blood cell microscopy (different domain) | 96% | 18% | 50% | HIGH |
| OrganAMNIST | Abdominal CT organs (different domain) | 94% | 9% | 90% | HIGH |
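As a reading aid, the decision rule behind the Risk column can be made explicit in a few lines. The numbers are the cross-domain rows above; the 30-point threshold is an illustrative cutoff, not the product's actual risk rubric.

```python
# Flag deploy targets where strong in-distribution accuracy hides a transfer
# failure. Numbers come from the table above; the threshold is illustrative.
results = {
    "DermaMNIST":  {"own_val": 0.73, "oracle": 0.03},
    "BloodMNIST":  {"own_val": 0.96, "oracle": 0.18},
    "OrganAMNIST": {"own_val": 0.94, "oracle": 0.09},
}
for name, r in results.items():
    gap = r["own_val"] - r["oracle"]
    if gap > 0.30:
        print(f"{name}: HIGH risk; own-domain accuracy overstates transfer by {gap:.0%}")
```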
A ResNet-18 (ImageNet pretrained) trained on clean CIFAR-10 is deployed against 5 corruption types at 5 severity levels (25 scenarios), following the CIFAR-10-C benchmark. Same 10 classes, same task; only the distribution changes. This is the standard stress test for robustness to distribution shift.
Clean baseline: oracle 68% | coverage 100% | diagnosis SAFE
Source: Hendrycks & Dietterich 2019 | Model: ResNet-18 (ImageNet pretrained) | 500 samples per scenario | Computed via POST /v1/audit/transfer | Cells: oracle score (coverage)
| Corruption | Type | Sev 1 | Sev 2 | Sev 3 | Sev 4 | Sev 5 |
|---|---|---|---|---|---|---|
| Gaussian noise | Noise | 36% (85% cov) | 24% (65% cov) | 16% (50% cov) | 14% (50% cov) | 13% (50% cov) |
| Defocus blur | Blur | 66% (100% cov) | 61% (100% cov) | 54% (100% cov) | 50% (90% cov) | 35% (80% cov) |
| Fog | Weather | 65% (100% cov) | 62% (100% cov) | 55% (100% cov) | 47% (90% cov) | 30% (80% cov) |
| Contrast | Digital | 64% (100% cov) | 57% (100% cov) | 49% (90% cov) | 40% (80% cov) | 25% (55% cov) |
| JPEG compression | Digital | 56% (100% cov) | 52% (100% cov) | 52% (100% cov) | 48% (100% cov) | 44% (100% cov) |
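One way to drive the 25-scenario sweep above is a simple loop over corruptions and severities. The helpers `load_clean_cifar10`, `load_cifar10c`, and `embed` are hypothetical, and the request schema is the same illustrative one as earlier.

```python
# Sweep 5 corruption types x 5 severity levels through the audit endpoint.
# load_clean_cifar10(), load_cifar10c(...), and embed(...) are hypothetical
# helpers; field names are illustrative, not the documented schema.
import requests

API_URL = "https://api.example.com/v1/audit/transfer"  # hypothetical host
API_KEY = "..."
CORRUPTIONS = ["gaussian_noise", "defocus_blur", "fog", "contrast", "jpeg_compression"]

clean_feats = embed(load_clean_cifar10()[:500])  # clean training features
scores = {}
for corruption in CORRUPTIONS:
    for severity in range(1, 6):
        images = load_cifar10c(corruption, severity)[:500]  # 500 samples per scenario
        resp = requests.post(
            API_URL,
            json={"source_embeddings": clean_feats,
                  "deploy_embeddings": embed(images)},
            headers={"Authorization": f"Bearer {API_KEY}"},
        )
        resp.raise_for_status()
        audit = resp.json()
        scores[(corruption, severity)] = (audit["oracle"], audit["coverage"])
```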
High validation accuracy + low oracle score = a model that will fail in production. These scenarios illustrate the pattern across diverse deployment domains.
| Domain | Scenario | Val acc. | Oracle | Risk | Root cause |
|---|---|---|---|---|---|
| Robotics | Policy A → Real Robot | 91% | 12% | HIGH | Sim lighting + joint noise not represented in training distribution |
| Medical AI | Scanner A → Scanner B | 88% | 34% | HIGH | Vendor-specific reconstruction kernels create invisible domain shift |
| Agriculture | Lab images → Field | 84% | 41% | HIGH | 67% of deploy samples land in map regions with no training coverage |
| Satellite | Sentinel-2 → PlanetScope | 79% | 58% | MED | Spectral band mismatch causes systematic dimensional shift |
| NLP | LoRA fine-tune → production domain | 76% | 71% | LOW | Transfer is structurally sound; safe to deploy |
Policies trained in simulation routinely fail in the real world — not because the task is wrong, but because the structural distribution is different. Transfer Oracle quantifies this gap before you run a single real-world experiment.
| Scenario | Sim oracle | Real oracle | Coverage | Dim. shift | Verdict |
|---|---|---|---|---|---|
| Isaac Sim → Real arm (6-DOF) | 67% | 14% | 31% | High | Fail |
| MuJoCo → Real hand (5-finger) | 71% | 28% | 44% | High | Fail |
| Gazebo → Outdoor wheeled | 69% | 52% | 68% | Med | Marginal |
| IsaacGym → Quadruped flat | 74% | 71% | 89% | Low | Pass |
| Webots → Warehouse AMR | 72% | 69% | 91% | Low | Pass |
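The sim-to-real comparison in the table reduces to two audits of the same policy features: one against held-out sim rollouts, one against real rollouts. `featurize(...)` and the rollout inputs are hypothetical, and the field names remain illustrative.

```python
# Audit a policy's structural coverage on sim rollouts and real rollouts,
# then compare. featurize(...) is a hypothetical helper; fields illustrative.
import requests

API_URL = "https://api.example.com/v1/audit/transfer"  # hypothetical host
API_KEY = "..."

def audit(source_feats, deploy_feats):
    resp = requests.post(
        API_URL,
        json={"source_embeddings": source_feats, "deploy_embeddings": deploy_feats},
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    resp.raise_for_status()
    return resp.json()

sim = audit(train_feats, featurize(sim_rollouts))    # hypothetical inputs
real = audit(train_feats, featurize(real_rollouts))
print(f"sim oracle {sim['oracle']:.0%} -> real oracle {real['oracle']:.0%}, "
      f"coverage {real['coverage']:.0%}")
```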
Fine-tuning a base model with LoRA adapters changes its internal structure in ways that standard eval metrics don't capture. Transfer Oracle audits adapter deltas to tell you whether an adapter trained for one domain will hold up in another.
| Base model | Task transfer | Delta score | Coverage | Recommendation |
|---|---|---|---|---|
| Llama-3 8B | Legal → Medical | 0.12 | 38% | Re-train on medical corpus — legal adapter collapses on med terminology |
| Mistral 7B | Code → SQL | 0.71 | 84% | Transfer is structurally sound. Minor edge-case gaps in joins. |
| Gemma 2B | EN → DE translation | 0.34 | 52% | Adapter covers common vocabulary; fails on technical compound nouns. |
| Phi-3 Mini | General → Customer support | 0.68 | 79% | Acceptable transfer. Monitor for escalation-pattern blind spots. |
| Qwen-2 7B | Chat → RAG retrieval | 0.09 | 23% | Adapter not viable — retrieval requires fundamentally different structure. |
Delta score measures structural alignment of the adapter's weight changes against the target distribution. 0 = no alignment, 1 = perfect structural coverage.
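A sketch of how an adapter audit might be wired up: reconstruct each layer's LoRA delta (ΔW = B·A) from a PEFT-style checkpoint and submit the deltas alongside target-domain samples. The `lora_A`/`lora_B` key naming matches the common PEFT convention but should be verified against your checkpoint; `target_samples` and the request/response fields are illustrative.

```python
# Build per-layer LoRA deltas (delta_W = B @ A) and submit them for audit.
# Key names follow the common PEFT convention; verify against your checkpoint.
# target_samples and request/response fields are illustrative assumptions.
import torch
import requests

state = torch.load("adapter_model.bin", map_location="cpu")

deltas = {}
for key, A in state.items():
    if "lora_A" in key:
        B = state[key.replace("lora_A", "lora_B")]
        layer = key.split(".lora_A")[0]
        deltas[layer] = (B @ A).flatten().tolist()  # delta_W = B @ A, shape (out, in)

resp = requests.post(
    "https://api.example.com/v1/audit/transfer",  # hypothetical host
    json={"adapter_deltas": deltas, "target_samples": target_samples},
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
audit = resp.json()
print(audit["delta_score"], audit["coverage"])
```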
Learning robot policies from video demonstrations is compelling but structurally risky. Transfer Oracle audits the embedding space of video-derived policies against real robot rollout distributions — before hardware experiments.
| Video source | Target system | Policy oracle | Structural finding |
|---|---|---|---|
| YouTube cooking videos | Kitchen manipulation robot | 18% | Human motion variance too high; robot kinematics not covered |
| Industrial assembly footage | Factory pick-and-place | 61% | Good structural match; camera angle + object scale consistent |
| Human grasping dataset | 3-finger gripper | 43% | Partial coverage; fingertip dynamics differ from human fingers |
| Surgical procedure video | Laparoscopic assistant | 29% | High oracle variance; instrument occlusion creates uncovered regions |
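The call pattern is the same as before: embed the video demonstrations and the logged robot states with the policy's encoder, then audit one against the other, all before any hardware time. The `policy.embed(...)` helper, its inputs, and the field names below are assumptions.

```python
# Audit video-derived policy embeddings against real robot rollout embeddings.
# policy.embed(...), the data loaders, and field names are illustrative.
import requests

video_feats = policy.embed(video_demo_clips)       # embeddings from video demos
rollout_feats = policy.embed(logged_robot_states)  # embeddings from logged rollouts

resp = requests.post(
    "https://api.example.com/v1/audit/transfer",  # hypothetical host
    json={"source_embeddings": video_feats, "deploy_embeddings": rollout_feats},
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
print(resp.json()["oracle"])
```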
Even moving a policy between simulators is non-trivial. Physics engines, rendering pipelines, and sensor models differ structurally. Transfer Oracle quantifies cross-simulator compatibility before migration.
Representative findings from cross-simulator audits:
- Physics dynamics align well; minor contact model differences at joint limits.
- Good structural match; sensor noise models differ, so validate the perception pipeline.
- Significant rendering gap; depth sensor simulation differs substantially.
- Contact dynamics and soft-body simulation are fundamentally different.
Swapping model architectures — ResNet to ViT, LSTM to Transformer — changes the internal representation even when trained on identical data. Transfer Oracle measures representational compatibility without needing labels.
| Data domain | Structural finding |
|---|---|
| ImageNet features | Spatial vs patch-based representations diverge at high-frequency edges. |
| Time-series encoding | Long-range dependencies align well; short-range local patterns differ. |
| Aerial imagery | Global vs local receptive fields cause coverage gaps in fine-grained regions. |
| Sentiment features | Strong structural alignment. Safe to transfer downstream task representations. |
| Medical imaging | Depthwise convolution vs full convolution creates systematic shift on fine structures. |
| Code completion | Scale difference creates substantial structural gap; requires re-alignment. |
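Because the audit operates on representations rather than labels, comparing two architectures only requires embedding the same unlabeled batch with each and submitting both sets. The extractor helpers and field names below are assumptions.

```python
# Label-free architecture comparison: embed the same unlabeled batch with
# both architectures and audit one representation against the other.
# resnet_embed / vit_embed are hypothetical extractors; fields illustrative.
import requests

feats_src = resnet_embed(unlabeled_images)  # e.g. ResNet penultimate features
feats_tgt = vit_embed(unlabeled_images)     # e.g. ViT [CLS] features, same images

resp = requests.post(
    "https://api.example.com/v1/audit/transfer",  # hypothetical host
    json={"source_embeddings": feats_src, "deploy_embeddings": feats_tgt},
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
audit = resp.json()
print(audit["oracle"], audit["coverage"])
```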
One API call. Works with any ML framework, any architecture. Get oracle scores, coverage maps, dimensional shift analysis, and prioritised recommendations.
Already have API access? POST /v1/audit/transfer — View documentation