Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-GGUF

GGUF quantized release of the Claude Opus / Sonnet reasoning distillation on Qwen3.6-27B, with native MTP speculative decoding support in llama.cpp.

Key numbers: Q4_K_M + MTP2 reaches 114.78 tok/s generation with 80.33% draft acceptance, 64% faster than the non-MTP baseline (69.98 tok/s). On the same machine, this release produces about 2x the visible answer content of the original qwen3.6-27b (2845 vs 1336 characters) and answers 4/4 test prompts correctly, against 3/4 for the original.

Quick Download

| File | Size | Best for |
| --- | --- | --- |
| Q4_K_M (recommended) | 15.66 GB | Best overall balance |
| Q6_K | 20.89 GB | Quality-first |
| Q2_K | 10.12 GB | Extreme compression |
| Q8_0 | 27.05 GB | High-fidelity experiments |
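To fetch a single quant rather than the whole repository, something like the following may work. It is a sketch that assumes the `huggingface-cli` tool from the `huggingface_hub` package is installed, and that every file follows the naming pattern of the Q4_K_M file shown in the Usage section (only that one file name is confirmed by this card). The helper names `gguf_name` and `fetch_quant` are hypothetical.

```shell
#!/bin/sh
# Repository id as shown on this page.
REPO="Brian6145/Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-GGUF"

# Assumption: every quant uses the same naming pattern as the Q4_K_M file.
# Check the repository file list before relying on it.
gguf_name() {
    # $1 = quant label, e.g. Q4_K_M
    printf 'Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-%s.gguf\n' "$1"
}

# Downloads one file into the current directory (not called here).
fetch_quant() {
    huggingface-cli download "$REPO" "$(gguf_name "$1")" --local-dir .
}

# Print the file name that fetch_quant would request for the recommended quant.
gguf_name Q4_K_M
```

Calling `fetch_quant Q4_K_M` would then download only the recommended 15.66 GB file.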

Compared to Original qwen3.6-27b

Same-machine benchmark against the original (non-quantized) qwen3.6-27b:

Release vs original efficiency comparison

The GGUF side includes llama-cli cold start, so this is a conservative estimate.

| Metric | Original | This release |
| --- | --- | --- |
| Average response time | 10.93 s | 10.09 s |
| Correctness (4 prompts) | 3/4 | 4/4 |
| Visible answer chars | 1336 | 2845 |
| Hidden reasoning overhead | 9002 chars | minimal |

The original spends a large fraction of its token budget on hidden reasoning chains. This release converts that budget into visible answers, making it better suited for interactive local use.
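The ratios behind these statements can be recomputed from the table. The sketch below assumes the 9002 hidden-reasoning characters are counted separately from the 1336 visible ones, which the card does not state explicitly; the function names are illustrative.

```shell
#!/bin/sh
# Figures copied from the comparison table.
ORIG_VISIBLE=1336
ORIG_HIDDEN=9002
THIS_VISIBLE=2845

# Ratio of visible answer characters, this release vs the original.
visible_ratio() {
    awk -v a="$THIS_VISIBLE" -v b="$ORIG_VISIBLE" 'BEGIN { printf "%.2f\n", a / b }'
}

# Share of the original's output spent on hidden reasoning.
# Assumption: hidden and visible character counts do not overlap.
hidden_share_pct() {
    awk -v h="$ORIG_HIDDEN" -v v="$ORIG_VISIBLE" 'BEGIN { printf "%.1f\n", 100 * h / (h + v) }'
}

echo "visible ratio: $(visible_ratio)x"
echo "original hidden share: $(hidden_share_pct)%"
```

Under that assumption, the original spends roughly 87% of its output characters on hidden reasoning, and this release shows about 2.13x as many visible characters.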

Compatibility

Requires a recent llama.cpp build with Qwen3.5/3.6 MTP support. Files produced by older conversion pipelines may lack the required metadata and fail to load with the error "failed to create MTP context".

Verified stack:

  • Windows CUDA build of llama.cpp
  • GPU: NVIDIA RTX PRO 6000 Blackwell 96 GB
  • Flags: -ngl 999 --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-ngl 999
  • LM Studio 0.4.14+ enables MTP by default, with zero configuration
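One way to check whether a local build knows the MTP flags listed above is to search its help output. This is only a sketch: it assumes `./llama-cli --help` lists the flags by the names used in this card, and a match does not prove the build can load the MTP metadata. The helper name `has_mtp_flags` is hypothetical.

```shell
#!/bin/sh
# Succeeds if the help text on stdin mentions both MTP-related flags
# used in this card's verified stack.
has_mtp_flags() {
    help_text=$(cat)
    printf '%s\n' "$help_text" | grep -q -- '--spec-type' &&
        printf '%s\n' "$help_text" | grep -q -- '--spec-draft-n-max'
}

# Only probe a binary if one is present in the current directory.
if [ -x ./llama-cli ]; then
    if ./llama-cli --help 2>&1 | has_mtp_flags; then
        echo "MTP flags found"
    else
        echo "MTP flags missing: update llama.cpp"
    fi
fi
```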

Benchmarks

Quantization Comparison (short context)

Test: three-person logic puzzle, n=160, GPU + MTP2.

| Variant | Prompt tok/s | Generation tok/s | Draft acceptance |
| --- | --- | --- | --- |
| Q2_K + MTP2 | 439.73 | 118.01 | 68.66% |
| Q4_K_M + MTP2 | 240.55 | 114.78 | 80.33% |
| Q6_K + MTP2 | 503.87 | 99.85 | 78.86% |
| Q8_0 + MTP2 | 421.04 | 78.86 | 69.17% |

MTP vs non-MTP baseline (Q4_K_M):

| Variant | Prompt tok/s | Generation tok/s |
| --- | --- | --- |
| Non-MTP | 796.22 | 69.98 |
| MTP2 | 240.55 | 114.78 |
| MTP3 | 390.77 | 117.16 |

MTP2 offers the best acceptance/throughput tradeoff. MTP3 generates slightly faster (117.16 vs 114.78 tok/s), but its draft acceptance drops to 69.48% from 80.33%.
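The "64% faster" figure follows directly from the generation column. A minimal sketch of the arithmetic, using a hypothetical helper `speedup_pct`:

```shell
#!/bin/sh
# Percentage gain of an MTP run over the non-MTP baseline (both in tok/s).
speedup_pct() {
    awk -v base="$1" -v mtp="$2" 'BEGIN { printf "%.1f\n", (mtp / base - 1) * 100 }'
}

BASELINE=69.98   # Non-MTP generation tok/s for Q4_K_M
echo "MTP2: +$(speedup_pct "$BASELINE" 114.78)%"   # prints MTP2: +64.0%
echo "MTP3: +$(speedup_pct "$BASELINE" 117.16)%"   # prints MTP3: +67.4%
```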

Long Context

Prompt lengths ~6.6K (ctx8k) and ~26.7K (ctx32k). Generation is intentionally short (17-23 tokens) to isolate prompt processing.

| Context | Variant | Prompt tok/s | Generation tok/s | Draft acceptance |
| --- | --- | --- | --- | --- |
| ctx8k | Q2_K | 1304.11 | 104.41 | 83.33% |
| ctx8k | Q4_K_M | 2798.63 | 31.73 | 60.00% |
| ctx8k | Q6_K | 2415.74 | 69.48 | 60.00% |
| ctx8k | Q8_0 | 2143.06 | 63.78 | 60.00% |
| ctx32k | Q2_K | 2450.46 | 71.41 | 78.57% |
| ctx32k | Q4_K_M | 2846.65 | 87.42 | 83.33% |
| ctx32k | Q6_K | 2620.59 | 81.02 | 71.43% |
| ctx32k | Q8_0 | 3120.27 | 71.19 | 71.43% |

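Q4_K_M is the most balanced variant across both short and long contexts, although its ctx8k generation rate (31.73 tok/s) is an outlier. With only 17-23 generated tokens per run, the generation and acceptance columns in this table are likely noisy, so the prompt tok/s column is the more reliable one. Q6_K is a solid quality-first choice.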
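To put the prompt-processing rates in wall-clock terms, the prompt length can be divided by the prompt tok/s figure. The sketch below does this for the Q4_K_M rows, using the approximate prompt lengths stated above (6.6K and 26.7K tokens); `prefill_seconds` is a hypothetical helper, and the result ignores model load time.

```shell
#!/bin/sh
# Seconds needed to process a prompt of N tokens at a given prompt tok/s rate.
prefill_seconds() {
    awk -v n="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", n / rate }'
}

# Q4_K_M rows from the long-context table.
echo "ctx8k:  $(prefill_seconds 6600 2798.63) s"    # prints ctx8k:  2.36 s
echo "ctx32k: $(prefill_seconds 26700 2846.65) s"   # prints ctx32k: 9.38 s
```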

Note: BF16 + MTP2 (historical reference) yielded 20.49 tok/s prompt and 0.85 tok/s generation on this GPU, so quantization is required for practical throughput on this hardware.

Usage

LM Studio (zero config)

Upgrade to LM Studio 0.4.14 or later. Load the GGUF file and MTP speculative decoding is enabled automatically, with no settings or flags to configure.

llama-cli

# Regular inference
./llama-cli -m Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q4_K_M.gguf -ngl 999 -c 8192 -p "Your prompt here"

# With MTP enabled
./llama-cli -m Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q4_K_M.gguf -ngl 999 -c 8192 \
  --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-ngl 999 -p "Your prompt here"

Recommended args:

  • Short replies: -c 4096 --temp 0 --top-k 1 --spec-draft-n-max 2
  • Long reasoning: -c 8192 --temp 0 --top-k 1 --spec-draft-n-max 2
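The two recommended argument sets differ only in context size, so they can be wrapped in a small helper. This is a sketch rather than an official launcher: the profile names "short" and "long" and the function `profile_args` are hypothetical, and the remaining flags are taken from the MTP command shown above.

```shell
#!/bin/sh
# Prints the recommended argument string for one of the two profiles.
profile_args() {
    case "$1" in
        short) ctx=4096 ;;
        long)  ctx=8192 ;;
        *) echo "unknown profile: $1" >&2; return 1 ;;
    esac
    echo "-c $ctx --temp 0 --top-k 1 --spec-draft-n-max 2"
}

MODEL="Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q4_K_M.gguf"

# Run only if both the binary and the model file are present.
# Word splitting of $(profile_args ...) is intentional.
if [ -x ./llama-cli ] && [ -f "$MODEL" ]; then
    ./llama-cli -m "$MODEL" -ngl 999 --spec-type draft-mtp --spec-draft-ngl 999 \
        $(profile_args short) -p "Your prompt here"
fi
```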

Quality Validation

All four quantized variants passed:

  • GGUF header integrity check
  • GPU draft-mtp loadability
  • Same-prompt logic consistency (all converge to the same answer: A=lying, B=truth, C=lying)

| Variant | Quality verdict | Recommendation |
| --- | --- | --- |
| Q2_K | Usable, most aggressive compression | Extreme compression only |
| Q4_K_M | Best balance | Default |
| Q6_K | More stable quality | Quality-first choice |
| Q8_0 | Fine, but not always faster than Q6_K | High-fidelity experiments |

Note: Windows PowerShell CLI may corrupt Chinese prompt arguments. Use UTF-8 prompt files, API calls, or your own inference service for Chinese workloads.
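A prompt file keeps non-ASCII text off the command line entirely. The sketch below assumes a POSIX shell (for example Git Bash or WSL on Windows) and uses llama-cli's `-f` option to read the prompt from a file; the file name `prompt.txt` is hypothetical. In PowerShell the same idea applies as long as the file is saved as UTF-8.

```shell
#!/bin/sh
MODEL="Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q4_K_M.gguf"
PROMPT_FILE="prompt.txt"   # hypothetical file name

# Write the prompt to a file so no Chinese text passes through CLI arguments.
# The script itself must be saved as UTF-8 for the bytes to be written correctly.
printf '%s\n' '你好' > "$PROMPT_FILE"

# Run only if both the binary and the model file are present.
if [ -x ./llama-cli ] && [ -f "$MODEL" ]; then
    ./llama-cli -m "$MODEL" -ngl 999 -c 8192 -f "$PROMPT_FILE"
fi
```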

Known Limitations

  • Requires a recent llama.cpp build (files from older exports may lack the Qwen3.5/3.6 MTP metadata)
  • Q8_0 is not guaranteed to be faster than Q6_K on bandwidth-limited GPUs
  • Chinese prompts may need extra encoding care in Windows CLI environments

V1 → V2

V2 optimizes distillation targets, reasoning chain compression, and MTP deployment compatibility. Coding accuracy, tool calling stability, and debugging efficiency are all meaningfully improved.

Model details

  • Repository: Brian6145/Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-GGUF
  • Format: GGUF (2-bit, 4-bit, 6-bit, and 8-bit quantizations)
  • Model size: 27B params
  • Architecture: qwen35
  • Base model: Qwen/Qwen3.6-27B