How to use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

# Download (or reuse from the local cache) one quant from the repo.
llm = Llama.from_pretrained(
	repo_id="zaakirio/gemma-4-12b-it-uncensored-GGUF",
	filename="gemma-4-12b-it-uncensored-Q4_K_M.gguf",
)
# Text-only request: the language .gguf alone rejects images
# (image input needs the separate mmproj file, see Multimodal).
response = llm.create_chat_completion(
	messages=[
		{"role": "user", "content": "Hello, who are you?"}
	]
)
print(response["choices"][0]["message"]["content"])

gemma-4-12b-it-uncensored - GGUF

GGUF quantizations of zaakirio/gemma-4-12b-it-uncensored, a decensored (Heretic-abliterated) version of google/gemma-4-12B-it.

These files run with llama.cpp.

⚠️ Requires a current llama.cpp build. Gemma 4 (gemma4_unified) is a brand-new architecture; only recent llama.cpp builds can load these files. Older builds may fail with an unknown architecture error - build from source (or use a current release) if you hit that. Always pass --jinja so the chat template is applied.

Files

Filenames follow gemma-4-12b-it-uncensored-<QUANT>.gguf.

| Quant | Size | Notes |
|---|---|---|
| Q2_K | 4.50 GB | Smallest; lowest quality. Very tight memory only. |
| Q3_K_M | 5.67 GB | Small; usable on low RAM. |
| Q4_K_S | 6.54 GB | Compact 4-bit. |
| Q4_K_M | 6.87 GB | Recommended - best size/quality balance. |
| Q5_K_M | 7.96 GB | Higher quality, slightly larger. |
| Q6_K | 9.11 GB | Near-lossless. |
| Q8_0 | 11.80 GB | Effectively lossless vs the BF16 source. |
| f16 | 22.20 GB | Full precision; reference / re-quantizing. |

Not sure which to pick? Start with Q4_K_M. Go up to Q5/Q6/Q8 if you have the memory and want maximum fidelity; drop to Q3/Q2 only if you're memory-constrained.
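
The rule of thumb above can be sketched as a small helper that picks the largest quant fitting a memory budget. The sizes are copied from the table; the `pick_quant` name and the 2 GB headroom for context and runtime overhead are illustrative assumptions, not measured requirements. f16 is left out because the table lists it as a reference / re-quantizing file.

```python
from typing import Optional

# File sizes in GB, copied from the quant table (f16 omitted on purpose).
QUANT_SIZES_GB = {
    "Q2_K": 4.50,
    "Q3_K_M": 5.67,
    "Q4_K_S": 6.54,
    "Q4_K_M": 6.87,
    "Q5_K_M": 7.96,
    "Q6_K": 9.11,
    "Q8_0": 11.80,
}


def pick_quant(memory_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the largest quant whose file fits in memory_gb minus headroom.

    headroom_gb is an assumed allowance for the KV cache and runtime
    overhead; tune it for your context size and backend.
    """
    budget = memory_gb - headroom_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= budget]
    if not fitting:
        return None  # nothing fits; free up memory or use a smaller model
    return max(fitting)[1]
```

For example, `pick_quant(9)` leaves a 7 GB budget and selects Q4_K_M, the recommended default.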

Multimodal projector (for image input - see Multimodal):

| File | Size | Notes |
|---|---|---|
| mmproj-gemma-4-12B-it-bf16.gguf | 0.16 GB | Vision encoder - pair with any quant above. |

Usage

llama.cpp (auto-downloads the chosen quant from this repo):

# Interactive chat
llama-cli -hf zaakirio/gemma-4-12b-it-uncensored-GGUF:Q4_K_M --jinja

# OpenAI-compatible server with web UI
llama-server -hf zaakirio/gemma-4-12b-it-uncensored-GGUF:Q4_K_M --jinja -c 4096

Or with a file you've already downloaded:

llama-cli -m gemma-4-12b-it-uncensored-Q4_K_M.gguf --jinja -p "Hello, who are you?"

Download a single file:

pip install -U "huggingface_hub[cli]"
hf download zaakirio/gemma-4-12b-it-uncensored-GGUF \
  --include "gemma-4-12b-it-uncensored-Q4_K_M.gguf" --local-dir ./
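
The same single-file download can be done from Python. This is a minimal sketch using `hf_hub_download` from the `huggingface_hub` package; the helper names `quant_filename` and `download_quant` are hypothetical, and the filename pattern is the one given under Files.

```python
REPO_ID = "zaakirio/gemma-4-12b-it-uncensored-GGUF"


def quant_filename(quant: str) -> str:
    """Build a filename following gemma-4-12b-it-uncensored-<QUANT>.gguf."""
    return f"gemma-4-12b-it-uncensored-{quant}.gguf"


def download_quant(quant: str = "Q4_K_M", local_dir: str = ".") -> str:
    """Download one quant file and return its local path."""
    # Imported lazily so the filename helper works without the package installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=quant_filename(quant),
        local_dir=local_dir,
    )
```

Calling `download_quant("Q4_K_M")` fetches the recommended quant into the current directory.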

Prompt format & settings

The chat template is embedded in the GGUF, so chat-aware tools apply it automatically (always pass --jinja with llama.cpp). For reference, Gemma 4's format is:

<|turn>user
{prompt}<turn|>
<|turn>model
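
For illustration only, the format above can be rendered by hand as follows. The source shows a single user turn; the multi-turn layout and the mapping of the `assistant` role to `model` are assumptions, and `render_gemma4_prompt` is a hypothetical helper. In practice, prefer the embedded template, which also covers cases this sketch ignores.

```python
def render_gemma4_prompt(messages):
    """Render chat messages into the turn format shown above.

    Assumption: every turn is wrapped as <|turn>{role}\n{content}<turn|>,
    and the prompt ends with an open model turn for the reply.
    """
    parts = []
    for message in messages:
        # Assumed mapping: OpenAI-style "assistant" becomes Gemma's "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<|turn>{role}\n{message['content']}<turn|>\n")
    parts.append("<|turn>model\n")  # open turn the model will complete
    return "".join(parts)
```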

Recommended sampling (Google defaults): --temp 1.0 --top-p 0.95 --top-k 64.
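
When talking to llama-server over its OpenAI-compatible API, the same sampling values can be sent per request. The sketch below assumes the server is listening on its default address, http://127.0.0.1:8080, and that it accepts `top_k` as an extra field next to the standard OpenAI parameters; the function names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://127.0.0.1:8080"):
    """Build a POST request for /v1/chat/completions with the recommended sampling."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,  # assumed to be accepted by llama-server as an extension
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, base_url="http://127.0.0.1:8080"):
    """Send the prompt to a running llama-server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server started as shown under Usage, `chat("Hello, who are you?")` returns the model's reply as a string.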

Thinking mode: Gemma 4 has a reasoning channel. To disable it, pass --chat-template-kwargs '{"enable_thinking":false}' to llama-server.

Multimodal (image input)

Gemma 4 is multimodal, but in llama.cpp the vision tower ships as a separate projector file. The language .gguf alone is text-only and will reject images. This repo includes mmproj-gemma-4-12B-it-bf16.gguf for that purpose.

When you load via -hf, llama.cpp auto-downloads the projector from this repo - images just work:

llama-server -hf zaakirio/gemma-4-12b-it-uncensored-GGUF:Q4_K_M --jinja

With local files, pass it explicitly with --mmproj:

llama-server -m gemma-4-12b-it-uncensored-Q4_K_M.gguf \
  --mmproj mmproj-gemma-4-12B-it-bf16.gguf --jinja

# download both files
hf download zaakirio/gemma-4-12b-it-uncensored-GGUF \
  --include "gemma-4-12b-it-uncensored-Q4_K_M.gguf" "mmproj-gemma-4-12B-it-bf16.gguf" --local-dir ./

The projector pairs with any quant in the table above. It's the unmodified Gemma 4 vision encoder. Abliteration only touches the language weights, so the vision tower is unchanged from the base model. Prefer bf16 here: the encoder is small, so there's no benefit to quantizing it.
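
Once a server is running with the projector loaded, an image can be sent as an OpenAI-style `image_url` content part. This sketch builds such a message from local image bytes using a base64 data URI; it assumes the server accepts data URIs in `image_url`, and `image_message` is a hypothetical helper name.

```python
import base64


def image_message(text, image_bytes, mime="image/jpeg"):
    """Build a user message that pairs a text prompt with an inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                # Data URI so the server does not need to fetch a remote URL.
                "image_url": {"url": f"data:{mime};base64,{encoded}"},
            },
        ],
    }
```

The returned dictionary goes into the `messages` list of a chat completion request to the server's /v1/chat/completions endpoint.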

About the base model

The source model is a decensored derivative of google/gemma-4-12B-it produced with Heretic (automatic directional ablation). Compared with the original:

| Metric | Decensored | Original |
|---|---|---|
| Refusals (/100 harmful prompts) | 23 | 99 |
| KL divergence (harmless prompts) | 0.043 | 0 (by definition) |

The refusal count is Heretic's keyword heuristic, which is known to over-count (it flags disclaimer-wrapped compliance as a refusal; ~11% precision per arXiv:2512.13655). We report only the measured marker figure and did not run a classifier-based eval on this model, so real compliance is likely higher. See the source model card for parameters and details.

Intended use & disclaimer

This model has had its refusal behaviour substantially removed and will comply with requests the original would have declined. Provided for research and unrestricted local use. You are responsible for how you use it and for complying with applicable law and the base model's Gemma license, which carries over to this derivative. Not for all audiences.

Provenance
