Speed Variant (11:1) vs. Full Context (2:1)
When running large language models across two GPUs of different speeds, the tensor split ratio determines how much of the model sits on each GPU. A balanced split (2:1) maximizes available context but forces every token through the slower GPU. An aggressive split (11:1) keeps almost everything on the fast GPU but limits context size.
This benchmark compares both configurations in a real-world AIfred multi-agent tribunal: 3 agents (AIfred, Sokrates, Salomo) debating the same philosophical question across 2 rounds — identical prompt, identical model weights, different tensor placement.
| Component | Specification |
|---|---|
| GPU 0 (CUDA0) | Quadro RTX 8000 — 48 GB GDDR6, 672 GB/s bandwidth |
| GPU 1 (CUDA1) | Tesla P40 — 24 GB GDDR5, 346 GB/s bandwidth |
| Interconnect | PCIe 3.0 x16 (CPU↔GPU0), PCIe 3.0 x4 eGPU (GPU1) |
| Model | Qwen3-Next-80B-A3B-Instruct Q4_K_M (46.6 GB, 48 layers, 512 experts / 10 active) |
| llama.cpp | v8076, CUDA Graphs enabled |
| Test prompt | "Ist Wasser nass?" (Is water wet?) — 6-turn multi-agent debate |
| Variant | Tensor split | Context | KV cache quantization |
|---|---|---|---|
| Speed | `-ts 11,1` (92% on RTX 8000) | `-c 32768` | `-ctk q4_0 -ctv q4_0` |
| Normal | `-ts 2,1` (67% on RTX 8000) | `-c 262144` (native max) | `-ctk q4_0 -ctv q4_0` |

AIfred's calibration system uses binary search to find the maximum tensor split that still fits in VRAM at 32K context.
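AIfred's actual calibration code isn't reproduced here; a minimal sketch of how such a binary search might look, assuming a hypothetical `fits_in_vram(ratio)` probe that launches llama.cpp with the candidate `-ts ratio,1` split and reports whether allocation succeeded:

```python
def calibrate_tensor_split(fits_in_vram, lo=1.0, hi=15.0, tol=0.25):
    """Binary-search the largest GPU0 share (X in `-ts X,1`) that still
    fits the model plus a 32K context in VRAM.

    `fits_in_vram` is a hypothetical probe, e.g. one that starts the
    server with the candidate split and checks for an OOM error.
    """
    if not fits_in_vram(lo):
        raise RuntimeError("even the most balanced candidate split does not fit")
    best = lo
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fits_in_vram(mid):
            best = mid   # fits: try pushing more weights onto the fast GPU
            lo = mid
        else:
            hi = mid     # OOM: back off toward a more balanced split
    return best

# Toy stand-in probe: pretend anything up to 11.2:1 fits at 32K context.
print(round(calibrate_tensor_split(lambda r: r <= 11.2), 2))  # → 11.06
```

The probe is the expensive part (each step is a model load), so a coarse tolerance like 0.25 keeps calibration to a handful of launches.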
Round 1:

| Agent | Metric | Speed (11:1) | Normal (2:1) | Delta |
|---|---|---|---|---|
| AIfred | TTFT | 3.89s | 3.86s | +0.8% |
| | PP | 310.1 tok/s | 312.5 tok/s | -0.8% |
| | Gen tok/s | 33.4 | 29.0 | +15.2% |
| | Inference | 9.6s | 10.5s | -8.6% |
| Sokrates | TTFT | 8.39s | 8.16s | +2.8% |
| | PP | 319.0 tok/s | 325.9 tok/s | -2.1% |
| | Gen tok/s | 35.0 | 30.2 | +15.9% |
| | Inference | 22.9s | 25.2s | -9.1% |
| Salomo | TTFT | 7.76s | 7.33s | +5.9% |
| | PP | 331.3 tok/s | 343.3 tok/s | -3.5% |
| | Gen tok/s | 30.9 | 28.3 | +9.2% |
| | Inference | 17.7s | 19.7s | -10.2% |
Round 2:

| Agent | Metric | Speed (11:1) | Normal (2:1) | Delta |
|---|---|---|---|---|
| AIfred R2 | TTFT | 11.57s | 11.30s | +2.4% |
| | PP | 334.2 tok/s | 339.3 tok/s | -1.5% |
| | Gen tok/s | 18.8 | 20.5 | -8.3% |
| | Inference | 17.6s | 21.0s | -16.2% |
| Sokrates R2 | TTFT | 13.14s | 13.10s | +0.3% |
| | PP | 333.0 tok/s | 338.3 tok/s | -1.6% |
| | Gen tok/s | 27.2 | 20.8 | +30.6% |
| | Inference | 26.5s | 24.9s | +6.4% |
| Salomo R2 | TTFT | 12.47s | 11.65s | +7.0% |
| | PP | 336.3 tok/s | 347.7 tok/s | -3.3% |
| | Gen tok/s | 19.2 | 21.2 | -9.4% |
| | Inference | 19.2s | 22.3s | -13.9% |
| Metric | Speed (11:1) | Normal (2:1) | Delta |
|---|---|---|---|
| Total inference time | 113.5s | 123.6s | -8.2% |
| Total tokens generated | 3,087 | 3,044 | +1.4% |
| Effective throughput | 27.2 tok/s | 24.6 tok/s | +10.6% |
| Average PP | 327.3 tok/s | 334.5 tok/s | -2.2% |
| Context utilization | 11% of 32K | 1% of 262K | — |
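The summary figures can be cross-checked from the totals; effective throughput is simply total tokens divided by total inference time (numbers taken from the table above):

```python
# Run totals for the 6-turn tribunal, as reported above.
speed = {"tokens": 3087, "seconds": 113.5}
normal = {"tokens": 3044, "seconds": 123.6}

def throughput(run):
    """Effective throughput in tokens per second."""
    return run["tokens"] / run["seconds"]

tp_speed = throughput(speed)
tp_normal = throughput(normal)
print(f"{tp_speed:.1f} vs {tp_normal:.1f} tok/s")  # → 27.2 vs 24.6 tok/s
```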
With small context, every generated token benefits from having 92% of model layers on the faster RTX 8000 (672 GB/s) instead of routing through the P40 (346 GB/s). R1 average: 33.1 vs 29.2 tok/s.
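A crude roofline-style model makes the small-context result plausible. Assuming decode is memory-bandwidth-bound and each token must stream every layer's weights in sequence, per-token time scales with each GPU's share of the weights divided by its bandwidth. This is a sketch under those assumptions, ignoring PCIe transfers, attention cost, and MoE sparsity:

```python
BW_RTX8000 = 672.0  # GB/s memory bandwidth
BW_P40 = 346.0      # GB/s memory bandwidth

def relative_decode_time(frac_on_gpu0):
    # Layers run sequentially, so per-token time is the sum of the
    # time each GPU spends streaming its share of the weights.
    return frac_on_gpu0 / BW_RTX8000 + (1 - frac_on_gpu0) / BW_P40

t_speed = relative_decode_time(11 / 12)   # -ts 11,1 → ~92% on GPU0
t_normal = relative_decode_time(2 / 3)    # -ts 2,1  → ~67% on GPU0
print(f"predicted speedup: {t_normal / t_speed:.2f}x")  # → 1.22x
```

The model over-predicts (~22% vs the ~13% observed in Round 1), which is consistent with part of the per-token cost not being weight-bandwidth-bound.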
As the conversation grows, attention over accumulated KV cache becomes the dominant cost. R2 shows higher variance (18.8–27.2 tok/s) suggesting that output length and content complexity matter more than tensor placement at these context sizes.
PP benefits from parallel KV cache writes across both GPUs. With 11:1, 92% of the KV cache must be written to one GPU, creating a memory bandwidth bottleneck during prompt ingestion. The effect is small but consistent.
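The prefill effect can be sketched with the same assumptions, but with a max() instead of a sum: if both GPUs can write their share of the KV cache concurrently, prompt ingestion waits for the slower of the two. A hedged illustration of that intuition, not a validated model:

```python
BW_RTX8000 = 672.0  # GB/s memory bandwidth
BW_P40 = 346.0      # GB/s memory bandwidth

def relative_prefill_time(frac_on_gpu0):
    # If the two GPUs write their KV cache shares in parallel,
    # the one that finishes last sets the pace.
    return max(frac_on_gpu0 / BW_RTX8000, (1 - frac_on_gpu0) / BW_P40)

t_speed = relative_prefill_time(11 / 12)   # 92% funneled onto one GPU
t_normal = relative_prefill_time(2 / 3)    # work spread more evenly
print(f"predicted PP penalty for 11:1: {t_speed / t_normal:.2f}x")
```

The predicted penalty (~1.4x) is far larger than the observed 2–3%, suggesting prefill is dominated by compute rather than KV-cache writes; the imbalance is only a secondary term, which matches the "small but consistent" effect above.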
Shorter wall-clock inference does not always mean higher tok/s. Example: the AIfred R2 speed variant finishes in 17.6s (332 tokens, 18.8 tok/s) vs. normal in 21.0s (430 tokens, 20.5 tok/s). The speed variant was faster in wall-clock terms because it generated fewer tokens, not because it generated them faster. The model is nondeterministic, so output length varies between runs. Always compare tok/s (throughput) rather than inference time when evaluating hardware configurations.
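This pitfall is easy to encode as a check. A small sketch using the AIfred R2 numbers from above, comparing runs on throughput rather than wall-clock time:

```python
def tok_per_s(tokens, seconds):
    """Generation throughput in tokens per second."""
    return tokens / seconds

# AIfred R2: the run that looks "faster" in wall-clock terms...
speed_run = {"tokens": 332, "seconds": 17.6}
normal_run = {"tokens": 430, "seconds": 21.0}

tp_speed = tok_per_s(**speed_run)
tp_normal = tok_per_s(**normal_run)

# ...actually has the lower throughput: it finished sooner only
# because it generated ~100 fewer tokens.
assert speed_run["seconds"] < normal_run["seconds"]
assert tp_speed < tp_normal
```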
| Use Case | Recommended | Reason |
|---|---|---|
| Short conversations (<8K context) | Speed (11:1) | 10–15% faster generation |
| Multi-turn tribunal (typical) | Speed (11:1) | ~10% faster overall; 32K is sufficient |
| Long conversations (>32K tokens) | Normal (2:1) | Speed variant would truncate history |
| RAG with large documents | Normal (2:1) | Full context needed for retrieval |
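The recommendation table maps naturally onto a small helper. A hypothetical chooser for this RTX 8000 + P40 pair (the function name and thresholds are illustrative, not part of any shipped AIfred logic):

```python
def choose_tensor_split(expected_context_tokens, uses_rag=False):
    """Pick llama.cpp flags for the dual-GPU setup described above.

    Hypothetical helper: the thresholds mirror the recommendation
    table, trading context capacity for ~10% generation throughput.
    """
    if uses_rag or expected_context_tokens > 32_768:
        # Long history or large retrieved documents: keep full 262K context.
        return ["-ts", "2,1", "-c", "262144", "-ctk", "q4_0", "-ctv", "q4_0"]
    # Typical tribunal or short chat: favor the fast GPU.
    return ["-ts", "11,1", "-c", "32768", "-ctk", "q4_0", "-ctv", "q4_0"]

print(" ".join(choose_tensor_split(8_000)))
# → -ts 11,1 -c 32768 -ctk q4_0 -ctv q4_0
```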