Benchmark Theater

ACI on top of GPT-5.5, with the benchmark story visible turn by turn.

This UI is built for live demo sessions. Benchmark sessions are listed on the left, score deltas run across the top, replay sits in the middle, and a prompt explorer on the right lets Aramco inspect how ACI compares to memory systems and what changes when ACI governs them.

Data source: Fallback demo bundle
/benchmark_results
Sessions loaded: 2
Scoreboard

ACI vs External Systems

Fallback session showing how the theater renders side-by-side continual benchmark turns.
Run tag: fallback
aci_native
Weighted score: 0.91
Online accuracy: 88.0%
Avg incremental: 0.0%
p95 latency: 410 ms
Cost / sample: $0.0112
bifrost_native
Weighted score: 0.78
Online accuracy: 79.0%
Avg incremental: 0.0%
p95 latency: 540 ms
Cost / sample: $0.0108
neo4j_native
Weighted score: 0.67
Online accuracy: 71.0%
Avg incremental: 0.0%
p95 latency: 980 ms
Cost / sample: $0.0121
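The "score deltas" mentioned in the introduction can be illustrated with the scoreboard values above. The following is a minimal sketch, not the theater's actual code: it assumes a delta is the aci_native weighted score minus each external system's weighted score, and the names `WEIGHTED_SCORES` and `score_deltas` are hypothetical.

```python
# Weighted scores copied from the fallback scoreboard above.
WEIGHTED_SCORES = {
    "aci_native": 0.91,
    "bifrost_native": 0.78,
    "neo4j_native": 0.67,
}


def score_deltas(scores, baseline="aci_native"):
    """Return baseline score minus each other system's score.

    Assumption: this is how a 'delta' is defined; the page does not say.
    Values are rounded to 2 decimals to hide floating-point noise.
    """
    base = scores[baseline]
    return {
        name: round(base - value, 2)
        for name, value in scores.items()
        if name != baseline
    }


deltas = score_deltas(WEIGHTED_SCORES)
for name, delta in deltas.items():
    print(f"aci_native vs {name}: {delta:+.2f}")
```

A positive delta means aci_native scored higher than the compared system in this fallback session.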
Replay

Turn-by-turn benchmark playback

Turn 1 (eval_current)
Current Input
What changed in the welding standard for heat treatment?
Expected: compare_versions
Task: engineering_task_1
Dataset: mtop
aci_native
compare_versions
Matches expected label
Confidence: 92.0%
Latency: 320 ms
Cost: $0.0112
bifrost_native
compare_versions
Matches expected label
Confidence: 77.0%
Latency: 440 ms
Cost: $0.0108
neo4j_native
find_document
Diverges from expected label
Confidence: 51.0%
Latency: 910 ms
Cost: $0.0121
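The match/diverge status shown on each replay card can be sketched as a simple comparison between the predicted label and the expected label. This is an illustrative sketch under assumptions, not the theater's implementation: the `TurnResult` record and the `label_status` helper are hypothetical, and the values are copied from Turn 1 above.

```python
from dataclasses import dataclass


@dataclass
class TurnResult:
    # Hypothetical record shape; field names are assumptions.
    system: str
    predicted: str
    confidence: float
    latency_ms: int
    cost_usd: float


EXPECTED = "compare_versions"  # expected label for Turn 1

TURN_1 = [
    TurnResult("aci_native", "compare_versions", 0.92, 320, 0.0112),
    TurnResult("bifrost_native", "compare_versions", 0.77, 440, 0.0108),
    TurnResult("neo4j_native", "find_document", 0.51, 910, 0.0121),
]


def label_status(result, expected):
    """Return the card status text for one system's prediction."""
    if result.predicted == expected:
        return "Matches expected label"
    return "Diverges from expected label"


statuses = {r.system: label_status(r, EXPECTED) for r in TURN_1}
for system, status in statuses.items():
    print(f"{system}: {status}")
```

Exact string equality is the simplest possible rule; a real scorer might normalize labels or accept aliases.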