Benchmark Theater

ACI on top of GPT-5.5, with the benchmark story visible turn by turn.

This UI is built for live demo sessions. Benchmark sessions sit on the left, score deltas at the top, replay in the middle, and a prompt explorer on the right, so Aramco can inspect how ACI compares to memory systems and what changes when ACI governs them.

Data source

Fallback demo bundle

/benchmark_results

Sessions loaded: 2

Scoreboard

ACI vs External Systems

Fallback session showing how the theater renders side-by-side continual benchmark turns.

Run tag: fallback

aci_native

Weighted score: 0.91

Online accuracy: 88.0%

Avg incremental: 0.0%

p95 latency: 410 ms

Cost / sample: $0.0112

bifrost_native

Weighted score: 0.78

Online accuracy: 79.0%

Avg incremental: 0.0%

p95 latency: 540 ms

Cost / sample: $0.0108

neo4j_native

Weighted score: 0.67

Online accuracy: 71.0%

Avg incremental: 0.0%

p95 latency: 980 ms

Cost / sample: $0.0121

Replay

Turn-by-turn benchmark playback

Turn 1

eval_current

Current Input

What changed in the welding standard for heat treatment?

Expected: compare_versions

Task: engineering_task_1

Dataset: mtop

aci_native

compare_versions

Matches expected label

Confidence: 92.0%

Latency: 320 ms

Cost: $0.0112

bifrost_native

compare_versions

Matches expected label

Confidence: 77.0%

Latency: 440 ms

Cost: $0.0108

neo4j_native

find_document

Diverges from expected label

Confidence: 51.0%

Latency: 910 ms

Cost: $0.0121
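
The per-system status lines in this turn ("Matches expected label" / "Diverges from expected label") can be read as a plain comparison of each system's predicted label against the expected one. The sketch below illustrates that reading using the Turn 1 values shown above. The `TurnResult` schema and the `label_status` helper are hypothetical names introduced for illustration; the actual theater code and its data format are not shown in this page, so treat this as an assumption rather than the real implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnResult:
    """One system's answer for a single replay turn (hypothetical schema)."""
    system: str
    predicted: str
    confidence: float  # fraction in [0, 1], e.g. 0.92 for 92.0%
    latency_ms: int
    cost_usd: float


def label_status(result: TurnResult, expected: str) -> str:
    """Return the status line for a prediction (assumed exact string match)."""
    if result.predicted == expected:
        return "Matches expected label"
    return "Diverges from expected label"


# Turn 1 values copied from the replay panel above.
EXPECTED = "compare_versions"
TURN_1 = [
    TurnResult("aci_native", "compare_versions", 0.92, 320, 0.0112),
    TurnResult("bifrost_native", "compare_versions", 0.77, 440, 0.0108),
    TurnResult("neo4j_native", "find_document", 0.51, 910, 0.0121),
]

if __name__ == "__main__":
    for r in TURN_1:
        print(f"{r.system}: {r.predicted} -> {label_status(r, EXPECTED)}")
```

Under this assumption, aci_native and bifrost_native match the expected label for Turn 1 and neo4j_native diverges, which agrees with the statuses displayed in the replay.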