Tensor Split Benchmark

Speed Variant (11:1) vs. Full Context (2:1)

Date: February 19, 2026
Model: Qwen3-Next-80B-A3B Q4_K_M
Mode: Auto-Consensus (2 rounds, 6 turns)

What This Benchmark Measures

When running large language models across two GPUs of different speeds, the tensor split ratio determines how much of the model sits on each GPU. A balanced split (2:1) maximizes available context but forces every token through the slower GPU. An aggressive split (11:1) keeps almost everything on the fast GPU but limits context size.

This benchmark compares both configurations in a real-world AIfred multi-agent tribunal: 3 agents (AIfred, Sokrates, Salomo) debating the same philosophical question across 2 rounds — identical prompt, identical model weights, different tensor placement.

Hardware & Configuration

| Component | Specification |
|---|---|
| GPU 0 (CUDA0) | Quadro RTX 8000 — 48 GB GDDR6, 672 GB/s bandwidth |
| GPU 1 (CUDA1) | Tesla P40 — 24 GB GDDR5, 346 GB/s bandwidth |
| Interconnect | PCIe 3.0 x16 (CPU↔GPU0), PCIe 3.0 x4 eGPU (GPU1) |
| Model | Qwen3-Next-80B-A3B-Instruct Q4_K_M (46.6 GB, 48 layers, 512 experts / 10 active) |
| llama.cpp | v8076, CUDA Graphs enabled |
| Test prompt | "Ist Wasser nass?" (Is water wet?) — 6-turn multi-agent debate |

Speed Variant (11:1)

  • Tensor split: -ts 11,1 (92% on RTX 8000)
  • Context: -c 32768
  • KV cache: -ctk q4_0 -ctv q4_0
  • VRAM: 52 GB (35.3 + 16.8 GB)

Normal / Full Context (2:1)

  • Tensor split: -ts 2,1 (67% on RTX 8000)
  • Context: -c 262144 (native max)
  • KV cache: -ctk q4_0 -ctv q4_0
  • VRAM: ~65 GB (balanced)
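The two flag sets above can be collected into a small launcher sketch. This is illustrative only: the binary name, model filename, and `PROFILES` dict are assumptions; the `-ts`, `-c`, `-ctk`, and `-ctv` flags are the ones quoted in the configurations.

```python
# Hypothetical launcher sketch for the two profiles described above.
# Model path and "llama-server" invocation are assumptions, not AIfred's code.
PROFILES = {
    "speed":  {"ts": "11,1", "ctx": 32768},   # aggressive split, limited context
    "normal": {"ts": "2,1",  "ctx": 262144},  # balanced split, native max context
}

def build_cmd(profile, model="qwen3-next-80b-a3b-q4_k_m.gguf"):
    """Assemble the llama.cpp server command line for a profile."""
    p = PROFILES[profile]
    return ["llama-server", "-m", model,
            "-ts", p["ts"], "-c", str(p["ctx"]),
            "-ctk", "q4_0", "-ctv", "q4_0"]

print(" ".join(build_cmd("speed")))
```

Keeping both profiles side by side like this makes switching a one-word change rather than a hand-edited flag list.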

How 11:1 Was Found

AIfred's calibration system uses binary search to find the maximum tensor split that still fits in VRAM at 32K context:

```
[1] 99:1 → failed    [2] 50:1 → failed    [3] 26:1 → failed    [4] 14:1 → failed
[5]  8:1 → fits      [6] 11:1 → fits ← SELECTED               [7] 12:1 → failed
```
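The probe sequence can be reproduced with a "halve until it fits, then bisect" search. The sketch below is a simplification: the halving factor and the `fits` predicate (here a stub; in reality a model-load attempt at 32K context) are assumptions chosen to roughly mirror the log, not AIfred's actual calibration code.

```python
def max_fitting_split(fits, start=99):
    """Find the largest fast-GPU share x (for -ts x,1) that still fits in VRAM.

    Strategy: halve the candidate until one fits, then bisect between the
    last failure and the first success. `fits` probes whether the model
    loads at the target context size.
    """
    hi = start                       # assumed-failing upper bound
    x = start
    while not fits(x):               # halve until something fits
        hi = x
        x = max(1, round(x / 1.9))   # 1.9 ≈ the log's step; an assumption
    lo = x                           # known-good lower bound
    while hi - lo > 1:               # classic bisection on the remaining gap
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo

# Stub probe reproducing the observed outcome: splits above 11:1 fail at 32K.
print(max_fitting_split(lambda x: x <= 11))  # → 11
```

With this stub the search visits roughly the same candidates as the log (99, 52, 27, 14, 7, then 10/12/11) and lands on 11, matching the selected configuration.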

Results at a Glance

| Metric | Speed (11:1) | Normal (2:1) | Delta |
|---|---|---|---|
| Generation Speed (R1) | 33.1 tok/s | 29.2 tok/s | +13.4% |
| Generation Speed (R2) | 21.7 tok/s | 20.8 tok/s | +4.3% |
| Prompt Processing | 327 tok/s | 335 tok/s | -2.2% |
| Total Wall-Clock | 113.5 s | 123.6 s | -8.2% |

Detailed Per-Turn Comparison

Round 1 — Initial Responses

| Turn | Metric | Speed (11:1) | Normal (2:1) | Delta |
|---|---|---|---|---|
| AIfred | TTFT | 3.89 s | 3.86 s | +0.8% |
| | PP | 310.1 tok/s | 312.5 tok/s | -0.8% |
| | Gen | 33.4 tok/s | 29.0 tok/s | +15.2% |
| | Inference | 9.6 s | 10.5 s | -8.6% |
| Sokrates | TTFT | 8.39 s | 8.16 s | +2.8% |
| | PP | 319.0 tok/s | 325.9 tok/s | -2.1% |
| | Gen | 35.0 tok/s | 30.2 tok/s | +15.9% |
| | Inference | 22.9 s | 25.2 s | -9.1% |
| Salomo | TTFT | 7.76 s | 7.33 s | +5.9% |
| | PP | 331.3 tok/s | 343.3 tok/s | -3.5% |
| | Gen | 30.9 tok/s | 28.3 tok/s | +9.2% |
| | Inference | 17.7 s | 19.7 s | -10.2% |

Round 2 — Refinement, Critical Review, Synthesis

| Turn | Metric | Speed (11:1) | Normal (2:1) | Delta |
|---|---|---|---|---|
| AIfred R2 | TTFT | 11.57 s | 11.30 s | +2.4% |
| | PP | 334.2 tok/s | 339.3 tok/s | -1.5% |
| | Gen | 18.8 tok/s | 20.5 tok/s | -8.3% |
| | Inference | 17.6 s | 21.0 s | -16.2% |
| Sokrates R2 | TTFT | 13.14 s | 13.10 s | +0.3% |
| | PP | 333.0 tok/s | 338.3 tok/s | -1.6% |
| | Gen | 27.2 tok/s | 20.8 tok/s | +30.6% |
| | Inference | 26.5 s | 24.9 s | +6.4% |
| Salomo R2 | TTFT | 12.47 s | 11.65 s | +7.0% |
| | PP | 336.3 tok/s | 347.7 tok/s | -3.3% |
| | Gen | 19.2 tok/s | 21.2 tok/s | -9.4% |
| | Inference | 19.2 s | 22.3 s | -13.9% |

Session Totals

| Metric | Speed (11:1) | Normal (2:1) | Delta |
|---|---|---|---|
| Total inference time | 113.5 s | 123.6 s | -8.2% |
| Total tokens generated | 3,087 | 3,044 | +1.4% |
| Effective throughput | 27.2 tok/s | 24.6 tok/s | +10.6% |
| Average PP | 327.3 tok/s | 334.5 tok/s | -2.2% |
| Context utilization | 11% of 32K | 1% of 262K | n/a |
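The session-level figures follow directly from the totals. A quick check (the small rounding differences against the table come from the table's already-rounded per-turn figures):

```python
# Session totals from the benchmark: tokens generated, total inference seconds.
speed = {"tokens": 3087, "seconds": 113.5}   # 11:1 split
normal = {"tokens": 3044, "seconds": 123.6}  # 2:1 split

def throughput(run):
    """Effective tokens per second over the whole session."""
    return run["tokens"] / run["seconds"]

tp_speed = throughput(speed)    # ≈ 27.2 tok/s
tp_normal = throughput(normal)  # ≈ 24.6 tok/s
wall_clock_delta = (speed["seconds"] - normal["seconds"]) / normal["seconds"]

print(f"speed: {tp_speed:.1f} tok/s, normal: {tp_normal:.1f} tok/s, "
      f"wall-clock delta: {wall_clock_delta:+.1%}")
```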

Key Findings

1. Generation is 10–15% faster with aggressive split — in Round 1

With small context, every generated token benefits from having 92% of model layers on the faster RTX 8000 (672 GB/s) instead of routing through the P40 (346 GB/s). R1 average: 33.1 vs 29.2 tok/s.
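A back-of-envelope model makes the bandwidth argument concrete. This is an illustration, not the benchmark's methodology: it assumes generation is purely memory-bandwidth-bound, that the two GPUs' layer groups run sequentially, and that all 46.6 GB of weights are read per token (the MoE actually reads only active experts, which is one reason the measured gap is smaller than this estimate).

```python
MODEL_GB = 46.6                    # Q4_K_M weight size
BW_FAST, BW_SLOW = 672.0, 346.0    # GB/s: RTX 8000, Tesla P40

def token_time(fast_frac):
    """Seconds per token if each GPU streams its weight shard sequentially."""
    fast = MODEL_GB * fast_frac / BW_FAST
    slow = MODEL_GB * (1 - fast_frac) / BW_SLOW
    return fast + slow

t_speed = token_time(11 / 12)   # -ts 11,1 → ~92% on the fast GPU
t_normal = token_time(2 / 3)    # -ts 2,1  → ~67% on the fast GPU
print(f"predicted speedup: {t_normal / t_speed - 1:.0%}")
```

The dense-read simplification predicts roughly a 22% gap, versus the measured 10–15%: directionally right, overpredicted because sparse expert reads dilute the bandwidth advantage.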

2. The advantage shrinks in Round 2 (~4%)

As the conversation grows, attention over the accumulated KV cache becomes the dominant cost. R2 also shows higher variance (18.8–27.2 tok/s), suggesting that output length and content complexity matter more than tensor placement at these context sizes.

3. Prompt processing is ~2% faster with balanced split

PP benefits from parallel KV cache writes across both GPUs. With 11:1, 92% of the KV cache must be written to one GPU, creating a memory bandwidth bottleneck during prompt ingestion. The effect is small but consistent.

4. Inference time ≠ generation speed

Shorter wall-clock inference does not always mean higher tok/s. Example: the AIfred R2 speed variant finishes in 17.6 s (332 tokens, 18.8 tok/s), while the normal variant takes 21.0 s (430 tokens, 20.5 tok/s). The speed variant won on wall-clock time because it generated fewer tokens, not because it generated them faster. The model is nondeterministic, so output length varies between runs; always compare tok/s (throughput) rather than raw inference time when evaluating hardware configurations.
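The AIfred R2 numbers make the distinction concrete (token counts and times taken from the table above):

```python
# (tokens generated, inference seconds) for the AIfred R2 turn
speed_tokens, speed_secs = 332, 17.6    # 11:1 split
normal_tokens, normal_secs = 430, 21.0  # 2:1 split

speed_tps = speed_tokens / speed_secs     # throughput of the speed run
normal_tps = normal_tokens / normal_secs  # throughput of the normal run

# Shorter wall clock, yet lower throughput: the speed run was "faster"
# only because it happened to generate fewer tokens.
print(f"speed: {speed_secs}s at {speed_tps:.1f} tok/s; "
      f"normal: {normal_secs}s at {normal_tps:.1f} tok/s")
```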

When to Use Which

| Use Case | Recommended | Reason |
|---|---|---|
| Short conversations (<8K context) | Speed (11:1) | 10–15% faster generation |
| Multi-turn tribunal (typical) | Speed (11:1) | ~10% faster overall; 32K is sufficient |
| Long conversations (>32K tokens) | Normal (2:1) | Speed variant would truncate history |
| RAG with large documents | Normal (2:1) | Full context needed for retrieval |
Rule of thumb: Use the speed variant by default. Switch to normal only when you actually need >32K context tokens. AIfred's calibration system creates both configurations automatically and selects based on context requirements.
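That selection rule reduces to a one-line predicate. The function name and hard-coded threshold below are illustrative, not AIfred's actual API:

```python
SPEED_CTX_LIMIT = 32_768  # -c of the 11:1 speed profile

def pick_profile(required_ctx_tokens: int) -> str:
    """Default to the speed profile whenever the conversation fits in 32K."""
    return "speed-11:1" if required_ctx_tokens <= SPEED_CTX_LIMIT else "normal-2:1"

print(pick_profile(8_000))    # → speed-11:1
print(pick_profile(100_000))  # → normal-2:1
```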