Speed Variant (11:1) vs. Full Context (2:1)
When running large language models across two GPUs of different speeds, the tensor split ratio determines how much of the model sits on each GPU. A balanced split (2:1) maximizes available context but forces every token through the slower GPU. An aggressive split (11:1) keeps almost everything on the fast GPU but limits context size.
This benchmark compares both configurations in a real-world AIfred multi-agent tribunal: 3 agents (AIfred, Sokrates, Salomo) debating the same philosophical question across 2 rounds — identical prompt, identical model weights, different tensor placement.
| Component | Specification |
|---|---|
| GPU 0 (CUDA0) | Quadro RTX 8000 — 48 GB GDDR6, 672 GB/s bandwidth |
| GPU 1 (CUDA1) | Tesla P40 — 24 GB GDDR5, 346 GB/s bandwidth |
| Interconnect | PCIe 3.0 x16 (CPU↔GPU0), PCIe 3.0 x4 eGPU (GPU1) |
| Model | Qwen3-Next-80B-A3B-Instruct Q4_K_M (46.6 GB, 48 layers, 512 experts / 10 active) |
| llama.cpp | v8076, CUDA Graphs enabled |
| Test prompt | "Ist Wasser nass?" (Is water wet?) — 6-turn multi-agent debate |
| Variant | Tensor split | Context | KV cache quantization |
|---|---|---|---|
| Speed | `-ts 11,1` (92% on RTX 8000) | `-c 32768` | `-ctk q4_0 -ctv q4_0` |
| Normal | `-ts 2,1` (67% on RTX 8000) | `-c 262144` (native max) | `-ctk q4_0 -ctv q4_0` |

AIfred's calibration system uses binary search to find the maximum tensor split that still fits in VRAM at 32K context.
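AIfred's actual calibration code isn't reproduced here; a minimal sketch of how such a binary search might look, assuming a hypothetical `fits_in_vram(ratio)` probe that launches llama.cpp with the candidate `-ts ratio,1` split and reports whether allocation succeeded:

```python
def calibrate_tensor_split(fits_in_vram, lo=1.0, hi=15.0, tol=0.25):
    """Binary-search the largest GPU0 share (X in `-ts X,1`) that still
    fits the model plus a 32K context in VRAM.

    `fits_in_vram` is a hypothetical probe, e.g. one that starts the
    server with the candidate split and checks for an OOM error.
    """
    if not fits_in_vram(lo):
        raise RuntimeError("even the most balanced candidate split does not fit")
    best = lo
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fits_in_vram(mid):
            best = mid   # fits: try pushing more weights onto the fast GPU
            lo = mid
        else:
            hi = mid     # OOM: back off toward a more balanced split
    return best

# Toy stand-in probe: pretend anything up to 11.2:1 fits at 32K context.
print(round(calibrate_tensor_split(lambda r: r <= 11.2), 2))  # → 11.06
```

The probe is the expensive part (each step is a model load), so a coarse tolerance like 0.25 keeps calibration to a handful of launches.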
Round 1:

| Agent | Metric | Speed (11:1) | Normal (2:1) | Delta |
|---|---|---|---|---|
| AIfred | TTFT | 3.89s | 3.86s | +0.8% |
| | PP | 310.1 tok/s | 312.5 tok/s | -0.8% |
| | Gen tok/s | 33.4 | 29.0 | +15.2% |
| | Inference | 9.6s | 10.5s | -8.6% |
| Sokrates | TTFT | 8.39s | 8.16s | +2.8% |
| | PP | 319.0 tok/s | 325.9 tok/s | -2.1% |
| | Gen tok/s | 35.0 | 30.2 | +15.9% |
| | Inference | 22.9s | 25.2s | -9.1% |
| Salomo | TTFT | 7.76s | 7.33s | +5.9% |
| | PP | 331.3 tok/s | 343.3 tok/s | -3.5% |
| | Gen tok/s | 30.9 | 28.3 | +9.2% |
| | Inference | 17.7s | 19.7s | -10.2% |
Round 2:

| Agent | Metric | Speed (11:1) | Normal (2:1) | Delta |
|---|---|---|---|---|
| AIfred R2 | TTFT | 11.57s | 11.30s | +2.4% |
| | PP | 334.2 tok/s | 339.3 tok/s | -1.5% |
| | Gen tok/s | 18.8 | 20.5 | -8.3% |
| | Inference | 17.6s | 21.0s | -16.2% |
| Sokrates R2 | TTFT | 13.14s | 13.10s | +0.3% |
| | PP | 333.0 tok/s | 338.3 tok/s | -1.6% |
| | Gen tok/s | 27.2 | 20.8 | +30.6% |
| | Inference | 26.5s | 24.9s | +6.4% |
| Salomo R2 | TTFT | 12.47s | 11.65s | +7.0% |
| | PP | 336.3 tok/s | 347.7 tok/s | -3.3% |
| | Gen tok/s | 19.2 | 21.2 | -9.4% |
| | Inference | 19.2s | 22.3s | -13.9% |
| Metric | Speed (11:1) | Normal (2:1) | Delta |
|---|---|---|---|
| Total inference time | 113.5s | 123.6s | -8.2% |
| Total tokens generated | 3,087 | 3,044 | +1.4% |
| Effective throughput | 27.2 tok/s | 24.6 tok/s | +10.6% |
| Average PP | 327.3 tok/s | 334.5 tok/s | -2.2% |
| Context utilization | 11% of 32K | 1% of 262K | — |
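The summary figures can be cross-checked from the totals; effective throughput is simply total tokens divided by total inference time (numbers taken from the table above):

```python
# Run totals for the 6-turn tribunal, as reported above.
speed = {"tokens": 3087, "seconds": 113.5}
normal = {"tokens": 3044, "seconds": 123.6}

def throughput(run):
    """Effective throughput in tokens per second."""
    return run["tokens"] / run["seconds"]

tp_speed = throughput(speed)
tp_normal = throughput(normal)
print(f"{tp_speed:.1f} vs {tp_normal:.1f} tok/s")  # → 27.2 vs 24.6 tok/s
```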
With small context, every generated token benefits from having 92% of model layers on the faster RTX 8000 (672 GB/s) instead of routing through the P40 (346 GB/s). R1 average: 33.1 vs 29.2 tok/s.
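A crude roofline-style model makes the small-context result plausible. Assuming decode is memory-bandwidth-bound and each token must stream every layer's weights in sequence, per-token time scales with each GPU's share of the weights divided by its bandwidth. This is a sketch under those assumptions, ignoring PCIe transfers, attention cost, and MoE sparsity:

```python
BW_RTX8000 = 672.0  # GB/s memory bandwidth
BW_P40 = 346.0      # GB/s memory bandwidth

def relative_decode_time(frac_on_gpu0):
    # Layers run sequentially, so per-token time is the sum of the
    # time each GPU spends streaming its share of the weights.
    return frac_on_gpu0 / BW_RTX8000 + (1 - frac_on_gpu0) / BW_P40

t_speed = relative_decode_time(11 / 12)   # -ts 11,1 → ~92% on GPU0
t_normal = relative_decode_time(2 / 3)    # -ts 2,1  → ~67% on GPU0
print(f"predicted speedup: {t_normal / t_speed:.2f}x")  # → 1.22x
```

The model over-predicts (~22% vs the ~13% observed in Round 1), which is consistent with part of the per-token cost not being weight-bandwidth-bound.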
As the conversation grows, attention over accumulated KV cache becomes the dominant cost. R2 shows higher variance (18.8–27.2 tok/s) suggesting that output length and content complexity matter more than tensor placement at these context sizes.
PP benefits from parallel KV cache writes across both GPUs. With 11:1, 92% of the KV cache must be written to one GPU, creating a memory bandwidth bottleneck during prompt ingestion. The effect is small but consistent.
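The prefill effect can be sketched with the same assumptions, but with a max() instead of a sum: if both GPUs can write their share of the KV cache concurrently, prompt ingestion waits for the slower of the two. A hedged illustration of that intuition, not a validated model:

```python
BW_RTX8000 = 672.0  # GB/s memory bandwidth
BW_P40 = 346.0      # GB/s memory bandwidth

def relative_prefill_time(frac_on_gpu0):
    # If the two GPUs write their KV cache shares in parallel,
    # the one that finishes last sets the pace.
    return max(frac_on_gpu0 / BW_RTX8000, (1 - frac_on_gpu0) / BW_P40)

t_speed = relative_prefill_time(11 / 12)   # 92% funneled onto one GPU
t_normal = relative_prefill_time(2 / 3)    # work spread more evenly
print(f"predicted PP penalty for 11:1: {t_speed / t_normal:.2f}x")
```

The predicted penalty (~1.4x) is far larger than the observed 2–3%, suggesting prefill is dominated by compute rather than KV-cache writes; the imbalance is only a secondary term, which matches the "small but consistent" effect above.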
Shorter wall-clock inference does not always mean higher tok/s. Example: the AIfred R2 speed variant finishes in 17.6s (332 tokens, 18.8 tok/s) vs. normal in 21.0s (430 tokens, 20.5 tok/s). The speed variant was faster in wall-clock terms because it generated fewer tokens, not because it generated them faster. The model is nondeterministic, so output length varies between runs. Always compare tok/s (throughput) rather than inference time when evaluating hardware configurations.
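This pitfall is easy to encode as a check. A small sketch using the AIfred R2 numbers from above, comparing runs on throughput rather than wall-clock time:

```python
def tok_per_s(tokens, seconds):
    """Generation throughput in tokens per second."""
    return tokens / seconds

# AIfred R2: the run that looks "faster" in wall-clock terms...
speed_run = {"tokens": 332, "seconds": 17.6}
normal_run = {"tokens": 430, "seconds": 21.0}

tp_speed = tok_per_s(**speed_run)
tp_normal = tok_per_s(**normal_run)

# ...actually has the lower throughput: it finished sooner only
# because it generated ~100 fewer tokens.
assert speed_run["seconds"] < normal_run["seconds"]
assert tp_speed < tp_normal
```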
| Use Case | Recommended | Reason |
|---|---|---|
| Short conversations (<8K context) | Speed (11:1) | 10–15% faster generation |
| Multi-turn tribunal (typical) | Speed (11:1) | ~10% faster overall; 32K is sufficient |
| Long conversations (>32K tokens) | Normal (2:1) | Speed variant would truncate history |
| RAG with large documents | Normal (2:1) | Full context needed for retrieval |
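The recommendation table maps naturally onto a small helper. A hypothetical chooser for this RTX 8000 + P40 pair (the function name and thresholds are illustrative, not part of any shipped AIfred logic):

```python
def choose_tensor_split(expected_context_tokens, uses_rag=False):
    """Pick llama.cpp flags for the dual-GPU setup described above.

    Hypothetical helper: the thresholds mirror the recommendation
    table, trading context capacity for ~10% generation throughput.
    """
    if uses_rag or expected_context_tokens > 32_768:
        # Long history or large retrieved documents: keep full 262K context.
        return ["-ts", "2,1", "-c", "262144", "-ctk", "q4_0", "-ctv", "q4_0"]
    # Typical tribunal or short chat: favor the fast GPU.
    return ["-ts", "11,1", "-c", "32768", "-ctk", "q4_0", "-ctv", "q4_0"]

print(" ".join(choose_tensor_split(8_000)))
# → -ts 11,1 -c 32768 -ctk q4_0 -ctv q4_0
```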