💜Qwen3.6 - How to Run Locally

Run the new Qwen3.6-27B and 35B-A3B models locally!

Qwen3.6 is Alibaba’s new family of multimodal hybrid-thinking models, including Qwen3.6-27B and Qwen3.6-35B-A3B. It delivers top performance for its size, supports 256K context, and covers 201 languages. It excels at agentic coding, vision, and chat tasks. Qwen3.6-27B runs on setups with 18GB of RAM, and 35B-A3B runs on 22GB. You can now run and train the models in Unsloth Studio.


Run Qwen3.6 Tutorials · MTP Guide

Qwen3.6 GGUFs use Unsloth Dynamic 2.0 for SOTA quant performance - quants are calibrated on real-world use-case datasets and important layers are upcast. Thank you Qwen for day-zero access.

  • Developer Role Support for Codex, OpenCode and more: Our uploads now support the developer role for agentic coding tools.

  • Tool calling: Like Qwen3.5, we improved parsing of nested objects to make tool calling more reliable.

Qwen3.6 running in Unsloth Studio.

⚙️ Usage Guide

Table: Inference hardware requirements (units = total memory: RAM + VRAM, or unified memory)

| Qwen3.6 | 3-bit | 4-bit | 6-bit | 8-bit | BF16 |
| --- | --- | --- | --- | --- | --- |
| 27B | 15 GB | 18 GB | 24 GB | 30 GB | 55 GB |
| 35B-A3B | 17 GB | 23 GB | 30 GB | 38 GB | 70 GB |


To train Qwen3.6, you can refer to our previous Qwen3.5 fine-tuning guide.

  • Maximum context window: 262,144 (can be extended to 1M via YaRN)

  • presence_penalty = 0.0 to 2.0 (default 0.0, i.e. off). Increase it to reduce repetition, but higher values may slightly degrade performance

  • Adequate Output Length: 32,768 tokens for most queries
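The recommended settings above map onto llama.cpp flags roughly as follows - a sketch only, with a placeholder model repo name:

```shell
# Max context 262,144 tokens; 32,768-token output budget;
# presence penalty 0.0 by default (raise toward 2.0 to reduce repetition).
llama-cli -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL \
    --ctx-size 262144 -n 32768 --presence-penalty 0.0
```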


If you're getting gibberish, your context length might be set too low. Alternatively, try --cache-type-k bf16 --cache-type-v bf16, which may help.

As Qwen3.6 is a hybrid reasoning model, thinking and non-thinking modes have different settings:

Thinking mode:

| Setting | General tasks | Precise coding tasks (e.g. WebDev) |
| --- | --- | --- |
| temperature | 1.0 | 0.6 |
| top_p | 0.95 | 0.95 |
| top_k | 20 | 20 |
| min_p | 0.0 | 0.0 |
| presence_penalty | 1.5 | 0.0 |
| repeat_penalty | disabled or 1.0 | disabled or 1.0 |

Thinking mode for general tasks:

Thinking mode for precise coding tasks:
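Both presets can be passed as llama.cpp sampling flags - a sketch, with the model repo name as a placeholder:

```shell
# Thinking mode, general tasks:
llama-cli -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL --jinja \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5

# Thinking mode, precise coding tasks:
llama-cli -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL --jinja \
    --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0
```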

Instruct (non-thinking) mode settings:

| Setting | General tasks | Reasoning tasks |
| --- | --- | --- |
| temperature | 0.7 | 1.0 |
| top_p | 0.8 | 0.95 |
| top_k | 20 | 20 |
| min_p | 0.0 | 0.0 |
| presence_penalty | 1.5 | 1.5 |
| repeat_penalty | disabled or 1.0 | disabled or 1.0 |


Instruct (non-thinking) for general tasks:

Instruct (non-thinking) for reasoning tasks:
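The instruct presets translate to the same llama.cpp sampling flags - a sketch with a placeholder model repo name:

```shell
# Instruct (non-thinking) mode, general tasks:
llama-cli -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL --jinja \
    --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --presence-penalty 1.5

# Instruct (non-thinking) mode, reasoning tasks:
llama-cli -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL --jinja \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5
```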

Qwen3.6 Inference Tutorials:

We'll be using the Dynamic 4-bit UD-Q4_K_XL GGUF variants for inference workloads. Click below to jump to the instructions for each model:


MTP Guide · Run in Unsloth Studio · Run in llama.cpp


Currently no Qwen3.6 GGUF works in Ollama due to the separate mmproj vision files. Use llama.cpp-compatible backends instead.

⚡ MTP Guide

MTP (Multi-Token Prediction) speculative decoding lets models like Qwen3.6 generate ~1.4-2x faster with no change in accuracy. Both Qwen3.6-27B and 35B-A3B see a >1.4x speed-up over the original baseline, which is especially useful for local models.

Qwen3.6-27B can now generate at 140 tokens/s and Qwen3.6-35B-A3B at 220 tokens/s! See MTP Benchmarks for more details.

In practice, MTP predicts several future tokens, then the main model verifies those tokens in parallel. This reduces the number of forward passes needed during generation and makes output faster. We found --spec-draft-n-max 2 to work best!

1

Install the specific llama.cpp PR branch from GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF then continue as usual - Metal support is on by default.
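The generic build steps might look like the following - check out the PR branch from the link above first (its number isn't reproduced here):

```shell
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# check out the MTP pull-request branch linked above before building
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```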

2

If you want to use llama.cpp directly to load models, you can run the command below - (:Q4_K_XL) is the quantization type. You can also download the weights via Hugging Face (step 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. The model has a maximum context length of 256K tokens.

Follow one of the commands for the specific models:

27B MTP · 35B-A3B MTP

MTP Qwen3.6-27B:

Thinking mode:


Please see Qwen3.6's new Preserved Thinking.

General tasks:
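A sketch of the thinking-mode launch with MTP drafting enabled - the repo and quant names are placeholders for the actual Qwen3.6 GGUF upload:

```shell
# Thinking mode, general tasks; --spec-draft-n-max 2 enables MTP drafting.
./llama.cpp/build/bin/llama-cli \
    -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL \
    --jinja --ctx-size 32768 \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 \
    --spec-draft-n-max 2
```

For other quants, swap the `-hf` target accordingly.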

For precise coding tasks, change: temperature=0.6, presence-penalty=0.0

Non-thinking mode:

General tasks:

For reasoning tasks, change: temperature=1.0, top-p=0.95

MTP Qwen3.6-35B-A3B:

Thinking mode:


Please see Qwen3.6's new Preserved Thinking.

General tasks:

For precise coding tasks, change: temperature=0.6, presence-penalty=0.0

Non-thinking mode:

General tasks:

For reasoning tasks, change: temperature=1.0, top-p=0.95

3

Download the model via the code below (after running pip install huggingface_hub hf_transfer). You can choose Q4_K_M or other quantized versions like UD-Q4_K_XL. We recommend using at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging
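A sketch of a selective download via huggingface_hub - the repo id in the commented call is a placeholder, so substitute the actual Qwen3.6 GGUF repository:

```python
import os

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # use hf_transfer for faster downloads

def quant_patterns(quant: str) -> list[str]:
    # Grab only the chosen quant's files plus the separate mmproj vision file.
    return [f"*{quant}*", "mmproj*"]

def download_quant(repo_id: str, quant: str = "UD-Q4_K_XL") -> str:
    from huggingface_hub import snapshot_download  # pip install huggingface_hub
    return snapshot_download(repo_id=repo_id, allow_patterns=quant_patterns(quant))

# local_dir = download_quant("unsloth/Qwen3.6-27B-GGUF")  # placeholder repo id
```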

4

Then run the model in conversation mode:
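Assuming the GGUF from step 3 landed in a local folder, a conversation-mode launch might look like this (paths are placeholders for wherever the files were saved):

```shell
# llama-cli drops into interactive chat when a chat template is present.
./llama.cpp/build/bin/llama-cli \
    --model ~/models/Qwen3.6-27B-UD-Q4_K_XL.gguf \
    --jinja --ctx-size 32768 \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5
```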

🦥 Unsloth Studio Guide

Qwen3.6 can be run and fine-tuned in Unsloth Studio, our new open-source web UI for local AI. Unsloth Studio lets you run models locally on MacOS, Windows, Linux and:

1

Install Unsloth

Run in your terminal:

MacOS, Linux, WSL:

Windows PowerShell:

2

Launch Unsloth

MacOS, Linux, WSL and Windows:

Then open http://127.0.0.1:8888 (or your specific URL) in your browser.

3

Search and download Qwen3.6

On first launch you will need to create a password to secure your account and sign in again later. You’ll then see a brief onboarding wizard to choose a model, dataset, and basic settings. You can skip it at any time.

Then go to the Studio Chat tab and search for Qwen3.6 in the search bar and download your desired model and quant.

4

Run Qwen3.6

Inference parameters are auto-set when using Unsloth Studio, but you can still change them manually. You can also edit the context length, chat template and other settings.

For more information, you can view our Unsloth Studio inference guide. Below, the 2-bit Qwen3.6 GGUF made 30+ tool calls, searched 20 sites and executed Python code:

🦙 Llama.cpp Guides

For this guide we will be utilizing the Dynamic 4-bit quant, which works great on a 24GB RAM / Mac device for fast inference on llama.cpp. Because the model is only around 72GB at full F16 precision, we won't need to worry much about performance. See our GGUF collection.

27B · 35B-A3B

1

Obtain the latest llama.cpp from GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF then continue as usual - Metal support is on by default.

2

If you want to use llama.cpp directly to load models, you can run the command below - (:Q4_K_XL) is the quantization type. You can also download the weights via Hugging Face (step 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. The model has a maximum context length of 256K tokens.

Follow one of the commands for the specific models:

27B · 35B-A3B

Qwen3.6-27B:

Thinking mode:


Please see Qwen3.6's new Preserved Thinking.

General tasks:
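A sketch of the thinking-mode launch with llama.cpp downloading the quant automatically - the repo and quant names are placeholders for the actual Qwen3.6 GGUF upload:

```shell
# Thinking mode, general tasks.
./llama.cpp/build/bin/llama-cli \
    -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL \
    --jinja --ctx-size 32768 \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5
```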

For precise coding tasks, change: temperature=0.6, presence-penalty=0.0

Non-thinking mode:

General tasks:

For reasoning tasks, change: temperature=1.0, top-p=0.95

Qwen3.6-35B-A3B:

Thinking mode:


Please see Qwen3.6's new Preserved Thinking.

General tasks:

For precise coding tasks, change: temperature=0.6, presence-penalty=0.0

Non-thinking mode:

General tasks:

For reasoning tasks, change: temperature=1.0, top-p=0.95

3

Download the model via the code below (after running pip install huggingface_hub hf_transfer). You can choose Q4_K_M or other quantized versions like UD-Q4_K_XL. We recommend using at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging

4

Then run the model in conversation mode:

Llama-server & OpenAI completion library

To deploy Qwen3.6 for production, we use llama-server. In a new terminal (e.g. via tmux), deploy the model via:
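A serving sketch with an OpenAI-compatible endpoint - the repo, quant, and port are placeholders to adapt:

```shell
# Serve the model on http://127.0.0.1:8001/v1 with default sampling settings.
./llama.cpp/build/bin/llama-server \
    -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL \
    --jinja --ctx-size 32768 --port 8001 \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5
```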

Then in a new terminal, after doing pip install openai, do:
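A minimal client sketch against the llama-server endpoint - the port and model name are assumptions, so match whatever you passed to llama-server:

```python
def ask(prompt: str, base_url: str = "http://127.0.0.1:8001/v1") -> str:
    from openai import OpenAI  # pip install openai
    client = OpenAI(base_url=base_url, api_key="sk-no-key-required")
    completion = client.chat.completions.create(
        model="qwen3.6",  # llama-server serves one model regardless of the name
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
        top_p=0.95,
        presence_penalty=1.5,
    )
    return completion.choices[0].message.content

# print(ask("Why is the sky blue?"))  # requires the server to be running
```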

🍎 MLX Dynamic Quants

We also uploaded dynamic Qwen3.6 4bit and 8bit quants for MacOS devices! Our MLX quant algorithm is still evolving, and we’re actively refining it wherever improvements can be made.

Qwen3.6-27B MLX:

Qwen3.6-35B-A3B MLX:

To try them out use:
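A sketch using the mlx-lm CLI - the repo id is a placeholder for the actual MLX upload:

```shell
pip install mlx-lm
mlx_lm.generate --model unsloth/Qwen3.6-27B-MLX-4bit \
    --prompt "Why is the sky blue?" --max-tokens 512
```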

See below for Qwen3.6-27B KL Divergence (KLD) and Perplexity (PPL) scores (lower is better):

| Model | Mean KLD | Median KLD | PPL | P90 KLD | P99.9 KLD | Size |
| --- | --- | --- | --- | --- | --- | --- |
| — | 0.0028 | 0.0003 | 4.812 | 0.0019 | 0.192 | 34.7 GB |
| — | 0.0037 | 0.0007 | 4.809 | 0.0032 | 0.343 | 30.5 GB |
| — | 0.0227 | 0.0053 | 4.821 | 0.0293 | 2.339 | 26.2 GB |
| — | 0.0325 | 0.0087 | 4.843 | 0.0466 | 3.693 | 26.2 GB |
| — | 0.0479 | 0.0153 | 4.902 | 0.0769 | 4.035 | 25.6 GB |
| — | 0.0734 | 0.0223 | 4.976 | 0.1261 | 5.529 | 24.1 GB |

💡 Thinking: Enable/Disable + Preserve Thinking

Qwen3.6 also has Preserve Thinking, which keeps the thinking trace from previous turns in the conversation. This increases the number of tokens you use, but can improve accuracy in continued conversations. Unsloth Studio has 'Think' and 'Preserved Thinking' toggles for Qwen3.6:

Unsloth Studio has Think toggle by default and a new Preserved Thinking toggle

To enable preserve thinking in llama.cpp, use 'preserve_thinking' (set to 'true' or 'false') instead of 'enable_thinking' or 'disable_thinking'.

For normal thinking, you can enable / disable thinking in llama.cpp by following the below commands. Use 'true' and 'false' interchangeably.

llama-server:
Enable Thinking · Disable Thinking

Linux, MacOS, WSL:

Windows PowerShell:

As an example for Qwen3.6-35B-A3B to enable preserve thinking (default is enabled):
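A hedged sketch - it assumes llama-server's --chat-template-kwargs flag and the 'preserve_thinking' template variable described above, with placeholder repo and port:

```shell
./llama.cpp/build/bin/llama-server \
    -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL \
    --jinja --port 8001 \
    --chat-template-kwargs '{"preserve_thinking": true}'
```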

And then in Python:
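A stdlib-only sketch of a chat request that sets the template kwarg per request - the endpoint, port, and 'preserve_thinking' key mirror this guide's settings and are assumptions, not a verified API:

```python
import json
import urllib.request

payload = {
    "model": "qwen3.6",
    "messages": [{"role": "user", "content": "Continue our plan from before."}],
    # Assumed template kwarg, per this guide; set False to drop prior traces.
    "chat_template_kwargs": {"preserve_thinking": True},
}

def send(payload: dict,
         url: str = "http://127.0.0.1:8001/v1/chat/completions") -> dict:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# send(payload)  # requires a running llama-server
```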

👨‍💻 OpenAI Codex & Claude Code

To run the model in local agentic coding workloads, you can follow our guide. Just change the model name to your 'Qwen3.6' variant and ensure you follow the correct Qwen3.6 parameters and usage instructions. Use the llama-server we set up earlier.

After following the instructions for Claude Code for example you will see:

We can then ask, for example, Create a Python game for Chess:

📊 Benchmarks

Unsloth GGUF Benchmarks

We conducted Mean KL Divergence benchmarks for Qwen3.6-35B-A3B GGUFs across providers to help you pick the best quant.

  • KL Divergence puts nearly all Unsloth GGUFs on the SOTA Pareto frontier

  • KLD shows how well a quantized model matches the original BF16 output distribution, indicating retained accuracy.

  • This makes Unsloth the top performer in 21 of 22 sizes

  • Only Q6_K was updated with more Dynamic layers, and we introduced a new UD-IQ4_NL_XL quant

35B-A3B - KLD benchmarks (lower is better)

MTP Benchmarks

We benchmarked the new quants we made for 27B and the 35B MoE. In general, dense models gain much more from MTP (1.4-2x) than MoE models do (1.15-1.25x).

With this, Qwen3.6-27B can now reach 140 tokens/s generation with UD-Q2_K_XL, and Qwen3.6-35B-A3B 220 tokens/s! Some of the throughput numbers are noisy, so don't infer that some quants are slower than others.

In terms of average speedup, we see a 1.4x for dense models at draft tokens = 2 and for the MoE around 1.15 to 1.2x.

We do not recommend more than 2 draft tokens because the acceptance rate drops precipitously from 83% to 50% with 4 draft tokens, and the forward passes for MTP become less beneficial.

Official Qwen Benchmarks

Qwen3.6-27B

Qwen3.6-35B-A3B
