osmGemma-4-12B-uncensored-bf16

Full-precision (bf16) abliterated google/gemma-4-12B-it — the complete encoder-free unified multimodal model (text · image · audio · video) with refusals removed via Heretic. This is the artifact that runs refusal-free vision + audio + video today (in 🤗 transformers), and the source for the MLX quants below. By osmAPI.

⚠️ Abliterated model — read this

Refusal directions were surgically removed from the parent. It will answer many prompts the parent refuses. No new capabilities were added — only refusal behavior was reduced. Use responsibly and within applicable law.

🔓 Refusal removal — before / after

Measured with Heretic's evaluator on 100 harmful prompts (mlabonne/harmful_behaviors test[:100]), greedy decoding, refusal-marker classifier:

| Model | Refusals | Refusal rate |
|---|---|---|
| google/gemma-4-12B-it (original) | 99 / 100 | 99.0% |
| this model (abliterated) | 12 / 100 | 12.0% |

↓ 87 fewer refusals (99 → 12), an 87.9% relative reduction, at a KL divergence of 0.053 from the original. That is well below 0.5, the value this card treats as the damage threshold, which suggests general capabilities are largely preserved.
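
The headline figures follow directly from the two refusal counts. As a minimal sketch, the rates and the relative reduction can be reproduced as shown here. The marker list and the `is_refusal` helper are hypothetical illustrations of a refusal-marker classifier, not Heretic's actual evaluator:

```python
# Hypothetical refusal markers; Heretic's real marker list may differ.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am unable")


def is_refusal(response: str, markers=REFUSAL_MARKERS) -> bool:
    """Flag a response as a refusal if it contains any marker (case-insensitive)."""
    text = response.lower()
    return any(marker in text for marker in markers)


def refusal_rate(refusals: int, total: int) -> float:
    """Refusal rate as a percentage of evaluated prompts."""
    return 100.0 * refusals / total


def relative_reduction(before: int, after: int) -> float:
    """Percentage drop in refusal count relative to the original model."""
    return 100.0 * (before - after) / before


# Counts reported for the 100-prompt evaluation.
original, abliterated, total = 99, 12, 100
print(refusal_rate(original, total))                        # 99.0
print(refusal_rate(abliterated, total))                     # 12.0
print(round(relative_reduction(original, abliterated), 1))  # 87.9
```

Note that the 87.9% figure is relative to the original 99 refusals; in absolute terms the refusal rate drops by 87 percentage points.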

📊 Specs

| Spec | Value |
|---|---|
| Precision | bfloat16 (full precision) |
| Disk size | ~23.9 GB |
| Base | google/gemma-4-12B-it (11.95B parameters, 48 layers, 256K context, 140+ languages) |
| Modalities | text · image · audio · video in, text out (encoder-free / unified) |
| Refusal-free multimodal today | ✅ via 🤗 transformers |
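
The disk size is consistent with the parameter count, since bf16 stores each weight in 2 bytes. A back-of-the-envelope check, assuming decimal gigabytes and ignoring tokenizer and config files (`weights_size_gb` is an illustrative helper, not part of any library):

```python
def weights_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# 11.95B parameters at 16 bits per weight (bf16).
print(round(weights_size_gb(11.95e9, 16), 1))  # 23.9
```

The same formula gives only a rough lower bound for quantized variants, because quantized checkpoints typically keep some tensors at higher precision and store extra scale metadata.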

⚡ Inference & compatibility

| Runtime | Supported? | Notes |
|---|---|---|
| 🤗 transformers (PyTorch · CUDA/MPS) | full multimodal (text · image · audio · video) | needs torchvision + librosa |
| vLLM (CUDA) | ⚠️ quantize first | convert to FP8/AWQ/GPTQ; gemma4_unified serving support is rolling out |
| MLX (Apple Silicon) | ➡️ use the MLX quants below | text today; vision pending mlx-vlm |
| Ollama / llama.cpp | ❌ needs GGUF | conversion pending llama.cpp gemma4_unified support |

🚀 Quick start — transformers (text)

pip install -U "transformers>=5.10" torch torchvision librosa accelerate

from transformers import AutoProcessor, AutoModelForMultimodalLM

mid = "osmapi/osmGemma-4-12B-uncensored-bf16"
processor = AutoProcessor.from_pretrained(mid)
model = AutoModelForMultimodalLM.from_pretrained(mid, dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain abliteration in two sentences."},
]
inputs = processor.apply_chat_template(messages, tokenize=True, return_dict=True,
    return_tensors="pt", add_generation_prompt=True, enable_thinking=False).to(model.device)
n = inputs["input_ids"].shape[-1]
out = model.generate(**inputs, max_new_tokens=256)
print(processor.parse_response(processor.decode(out[0][n:], skip_special_tokens=False)))

Setting enable_thinking=True turns on reasoning mode; parse_response then separates the thinking channel from the final answer.

🖼️🎙️ Vision & audio (image · audio · video)

Full multimodal runs here today — pass image/audio/video in the message content:

messages = [{"role": "user", "content": [
    {"type": "image", "url":   "https://.../photo.jpg"},   # image → key "url"
    {"type": "audio", "audio": "https://.../clip.wav"},    # audio → key "audio" (≤30s)
    {"type": "text",  "text":  "Describe what you see and hear."},
]}]
inputs = processor.apply_chat_template(messages, tokenize=True, return_dict=True,
    return_tensors="pt", add_generation_prompt=True).to(model.device)
n = inputs["input_ids"].shape[-1]
out = model.generate(**inputs, max_new_tokens=512)
print(processor.parse_response(processor.decode(out[0][n:], skip_special_tokens=False)))

Audio ≤ 30 s (native ASR + speech translation) · images variable-resolution · video ≤ 60 s (~1 fps).
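
Longer recordings need to be split before they are passed to the processor. A minimal sketch of planning ≤ 30 s audio windows and estimating the frame count for a video clip follows. `plan_audio_chunks` and `video_frame_budget` are hypothetical helpers, and the limits are simply the ones quoted above:

```python
import math

MAX_AUDIO_S = 30.0   # per-clip audio limit quoted above
MAX_VIDEO_S = 60.0   # per-clip video limit quoted above
VIDEO_FPS = 1.0      # approximate sampling rate quoted above


def plan_audio_chunks(duration_s: float, max_len_s: float = MAX_AUDIO_S):
    """Return (start, end) windows in seconds, each no longer than max_len_s."""
    if duration_s <= 0:
        return []
    count = math.ceil(duration_s / max_len_s)
    return [(i * max_len_s, min((i + 1) * max_len_s, duration_s)) for i in range(count)]


def video_frame_budget(duration_s: float, fps: float = VIDEO_FPS) -> int:
    """Approximate number of sampled frames; rejects clips over the limit."""
    if duration_s > MAX_VIDEO_S:
        raise ValueError(f"video is {duration_s:.0f}s; limit is {MAX_VIDEO_S:.0f}s")
    return max(1, round(duration_s * fps))


print(plan_audio_chunks(75.0))   # [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
print(video_frame_budget(45.0))  # 45
```

Each window can then be cut from the source file and sent as its own audio entry; how the per-window outputs are stitched back together is left to the caller.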

🍎 Running on Mac

This bf16 repo runs in 🤗 transformers on Apple Silicon (MPS) with full multimodal support, as above. For lighter, faster MLX serving, use the MLX quants of this model (see the family table) with one of: oMLX (inference server + macOS menu-bar app, SSD KV cache), vMLX, LM Studio (MLX engine), Ollama 0.19+, or mlx-vlm directly. These runtimes can serve the MLX quants once their bundled mlx-lm/mlx-vlm adds gemma4_unified support; until then, text works today via mlx-vlm plus a small shim.

🗂️ Quant family

| Repo | Scheme | Eff. BPW | Size | Notes |
|---|---|---|---|---|
| osmGemma-4-12B-uncensored-bf16 (abliterated, full multimodal) | bf16 | 16 | ~23.9 GB | you are here |
| osmGemma-4-12B-uncensored-8bit-mlx | 8-bit affine | 8.805 | ~13.7 GB | |
| osmGemma-4-12B-uncensored-mxfp4-mlx | MXFP4 (4-bit microscaling) | 7.628 | ~11.9 GB | |
| osmGemma-4-12B-uncensored-mixed-4.2bpw-mlx | mixed 3/4-bit | 4.2 | ~6.6 GB | |
| google/gemma-4-12B-it (base, not abliterated) | bf16 | 16 | ~24 GB | |
| google/gemma-4-12B-it-assistant (MTP draft) | | | | can be added later (⏳ planned) |

🧬 Lineage

google/gemma-4-12B                       (Google DeepMind — base pretrain)
        ↓  instruction tuning
google/gemma-4-12B-it               (multimodal, encoder-free)
        ↓  Heretic 1.3.0 — directional ablation, Optuna/TPE-optimized over 100 trials, best Pareto trial #55
this repo — abliterated bf16             (refusals 99→12 / 100, KL 0.053)
        ↓  mlx-vlm quantization
MLX quants (8-bit · MXFP4 · mixed)       — see family table
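
Directional ablation, the step attributed to Heretic in the lineage above, removes the component of a weight matrix's output that lies along a chosen "refusal direction". The toy sketch below shows only that core projection, W' = W − r rᵀ W for a unit vector r, on a tiny matrix in plain Python. It is not Heretic's implementation: the per-layer ablation weights and the Optuna/TPE search are omitted, and `ablate_direction` and the example values are purely illustrative.

```python
import math


def ablate_direction(weight, direction):
    """Return W - r r^T W, where r is `direction` normalized to unit length.

    `weight` is a list of rows (output_dim x input_dim); `direction` lives in
    the output space, so the result can no longer write along r.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    r = [x / norm for x in direction]
    rows, cols = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r.
    proj = [sum(r[i] * weight[i][j] for i in range(rows)) for j in range(cols)]
    return [[weight[i][j] - r[i] * proj[j] for j in range(cols)] for i in range(rows)]


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # toy 3x2 weight matrix
r = [0.0, 3.0, 4.0]                        # toy "refusal direction"
W_abl = ablate_direction(W, r)

# After ablation, every column of W_abl is orthogonal to r.
residual = [sum(r[i] * W_abl[i][j] for i in range(3)) for j in range(2)]
print([round(abs(x), 9) for x in residual])  # [0.0, 0.0]
```

Rows of W that have no overlap with r (the first row here) are left untouched, which is why a well-chosen direction can reduce refusals while keeping the KL divergence from the original small.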

🙏 Credits

| Role | Project |
|---|---|
| Abliteration & release | osmAPI |
| Abliteration tool | Heretic by p-e-w |
| Research | osmAPI Research Team · Terv Student Research Team |
| Base model | Google DeepMind (Gemma 4) |

📜 License

Apache-2.0 (inherited from the base). Also subject to the Gemma 4 Terms of Use.
