gemma-4-12B-it-txt-mlx

This is the text backbone of google/gemma-4-12B-it converted to MLX for Apple Silicon, in bfloat16 (no quantization).

Gemma 4 12B is a gemma4_unified any-to-any model: text, vision and audio in, text out. This conversion keeps the language model only — the vision and audio towers are dropped, which is why the name carries a txt marker. If you need image or audio understanding, use the original model with a multimodal runtime. If you want a fast, local, text-and-tool-calling model that runs on stock mlx_lm, this is it.

The language model weights are unmodified. Apart from dropping the vision and audio towers, the only change made during conversion is setting model_type to gemma4 in the config, so that a stock mlx_lm install loads the model without any extra code.

What works

  • Runs directly with the standard mlx_lm CLI and server — nothing custom to install.
  • Tool calling is reliable. Gemma 4 emits calls in its own delimiter-based format (<|tool_call>call:name{arg:<|"|>value<|"|>}<tool_call|>); mlx_lm's built-in gemma4 tool parser turns that back into standard OpenAI tool_calls JSON, so OpenAI-compatible clients work out of the box.
  • Reasoning ("thinking") is supported. mlx_lm turns it on by default for this model; see below if you want it off.
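
To make the delimiter format above concrete, here is a minimal sketch of how such a call could be mapped to an OpenAI-style tool_calls entry. This is not mlx_lm's parser: the function name parse_tool_calls, the regular expressions, and the generated call ids are all hypothetical. It only handles string arguments wrapped in <|"|> delimiters, as in the example shown above.

```python
import json
import re

# One call: <|tool_call>call:NAME{...}<tool_call|>
CALL_RE = re.compile(r"<\|tool_call>call:([\w.\-]+)\{(.*?)\}<tool_call\|>", re.DOTALL)
# One string argument: key:<|"|>value<|"|>
ARG_RE = re.compile(r'(\w+):<\|"\|>(.*?)<\|"\|>', re.DOTALL)


def parse_tool_calls(text):
    """Illustrative only: convert raw model output into OpenAI-style tool calls.

    Assumption: every argument is a string delimited by <|"|>. Other value
    types are ignored by this sketch.
    """
    calls = []
    for index, match in enumerate(CALL_RE.finditer(text)):
        name, body = match.group(1), match.group(2)
        arguments = {key: value for key, value in ARG_RE.findall(body)}
        calls.append({
            "id": f"call_{index}",  # hypothetical id scheme
            "type": "function",
            # OpenAI clients expect the arguments as a JSON-encoded string.
            "function": {"name": name, "arguments": json.dumps(arguments)},
        })
    return calls


raw = '<|tool_call>call:get_weather{city:<|"|>Paris<|"|>}<tool_call|>'
print(parse_tool_calls(raw))
```

In normal use none of this is needed, because the server already returns parsed tool_calls; the sketch is only meant to show what the translation involves.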

Usage

Generate from the command line:

mlx_lm.generate --model jedisct1/gemma-4-12B-it-txt-mlx \
  --prompt "Explain the Monty Hall problem in three sentences." --temp 0.0

From Python:

from mlx_lm import load, generate

model, tokenizer = load("jedisct1/gemma-4-12B-it-txt-mlx")
messages = [{"role": "user", "content": "Write a haiku about pointers."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))

Tool calling and OpenAI-compatible serving

Start the server:

mlx_lm.server --model jedisct1/gemma-4-12B-it-txt-mlx --port 8080

Then call it like any OpenAI chat endpoint with a tools array. The response comes back with finish_reason: "tool_calls" and a normal tool_calls list:

curl http://127.0.0.1:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "jedisct1/gemma-4-12B-it-txt-mlx",
  "messages": [{"role": "user", "content": "What is the weather in Paris in celsius?"}],
  "tools": [{"type": "function", "function": {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {"type": "object",
      "properties": {"city": {"type": "string"}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}},
      "required": ["city"]}}}],
  "temperature": 0.0
}'
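
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names (build_payload, ask, extract_tool_calls) are hypothetical, and it assumes the server from the previous step is listening on 127.0.0.1:8080 and returns the usual OpenAI response shape described above.

```python
import json
import urllib.request

URL = "http://127.0.0.1:8080/v1/chat/completions"
MODEL = "jedisct1/gemma-4-12B-it-txt-mlx"

WEATHER_TOOL = {"type": "function", "function": {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"},
                                  "unit": {"type": "string",
                                           "enum": ["celsius", "fahrenheit"]}},
                   "required": ["city"]}}}


def build_payload(question, tools, temperature=0.0):
    """Build the JSON body for a single-turn chat request with tools."""
    return {"model": MODEL,
            "messages": [{"role": "user", "content": question}],
            "tools": tools,
            "temperature": temperature}


def ask(question, tools, url=URL):
    """POST the request to the server and return the decoded JSON response."""
    data = json.dumps(build_payload(question, tools)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_tool_calls(response):
    """Return (name, arguments) pairs, or [] if the model answered in text."""
    choice = response["choices"][0]
    if choice.get("finish_reason") != "tool_calls":
        return []
    calls = []
    for call in choice["message"].get("tool_calls") or []:
        arguments = call["function"]["arguments"]
        if isinstance(arguments, str):  # OpenAI format: JSON-encoded string
            arguments = json.loads(arguments)
        calls.append((call["function"]["name"], arguments))
    return calls
```

With the server running, extract_tool_calls(ask("What is the weather in Paris in celsius?", [WEATHER_TOOL])) should yield the requested function name and its decoded arguments.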

This model has been exercised end-to-end as the backend for the swival coding agent (file reads/writes, edits, shell commands, grep, listing) and drives its full tool loop. For agentic and coding work, use temperature 0.0 — greedy decoding is markedly more reliable for tool arguments and code than the model's creative defaults (temperature 1.0, top_k 64, top_p 0.95).
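
One step of such a tool loop can be sketched as follows: execute each requested call locally, then append the results as tool messages for the next request. This follows the general OpenAI message convention rather than anything specific to swival or mlx_lm; the get_weather stub, the TOOLS registry, and run_tool_calls are hypothetical names, and the exact fields the chat template expects in a tool message are an assumption.

```python
import json


def get_weather(city, unit="celsius"):
    # Hypothetical stub with made-up data; a real tool would query a service.
    return {"city": city, "unit": unit, "temperature": 18}


# Maps tool names advertised to the model onto local Python callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(messages, assistant_message, registry=TOOLS):
    """Execute the assistant's tool calls and return the extended history."""
    follow_up = list(messages) + [assistant_message]
    for call in assistant_message.get("tool_calls") or []:
        function = call["function"]
        arguments = function["arguments"]
        if isinstance(arguments, str):  # JSON-encoded string in OpenAI format
            arguments = json.loads(arguments)
        handler = registry.get(function["name"])
        if handler is None:
            result = {"error": f"unknown tool: {function['name']}"}
        else:
            result = handler(**arguments)
        follow_up.append({
            "role": "tool",
            "tool_call_id": call.get("id", ""),
            "name": function["name"],
            "content": json.dumps(result),
        })
    return follow_up
```

The returned list is sent back as the messages field of the next request, and the loop repeats until the model replies with plain text instead of further tool calls.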

Thinking on/off

mlx_lm enables Gemma 4's thinking channel by default. To turn it off, pass the chat-template argument when serving:

mlx_lm.server --model jedisct1/gemma-4-12B-it-txt-mlx --chat-template-args '{"enable_thinking": false}'

With mlx_lm.generate, the equivalent flag is --chat-template-config '{"enable_thinking": false}'. Tool calling is correct either way; turning thinking off is faster, while leaving it on tends to plan tool use more carefully.
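
When building prompts in Python instead of using the CLI, the same switch can probably be passed as a keyword argument to the chat template. The sketch below assumes that apply_chat_template forwards extra keyword arguments to the template, as Hugging Face tokenizers do, and that the template reads enable_thinking as in the flags above; build_prompt is a hypothetical helper.

```python
def build_prompt(tokenizer, messages, thinking=True):
    """Render a chat prompt with the thinking channel switched on or off.

    Assumption: the tokenizer forwards unknown keyword arguments such as
    enable_thinking to the chat template.
    """
    return tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        enable_thinking=thinking,
    )
```

With a tokenizer obtained from mlx_lm's load, build_prompt(tokenizer, messages, thinking=False) would then take the place of the apply_chat_template call in the Python usage example.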

Faster generation with a draft model

Speculative decoding speeds up this bf16 model substantially. Pair it with the small E2B draft model, which shares the same tokenizer and vocabulary, for roughly 3–4x higher throughput with identical output:

mlx_lm.generate \
  --model jedisct1/gemma-4-12B-it-txt-mlx \
  --draft-model jedisct1/gemma-4-E2B-it-txt-mlx-4bit \
  --num-draft-tokens 6 \
  --prompt "Write a Python function that returns the nth Fibonacci number." --temp 0.0

Available draft models: gemma-4-E2B-it-txt-mlx-4bit (fastest) and gemma-4-E2B-it-txt-mlx-8bit. Tool calls are preserved exactly.
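
Why the output can stay identical is easiest to see in a toy version of greedy speculative decoding. The sketch below is not mlx_lm's implementation: the two "models" are hypothetical deterministic functions over integer tokens, and verification is done one position at a time, whereas a real runtime checks all drafted tokens in a single batched forward pass of the large model.

```python
def greedy_decode(target, prompt, max_new):
    """Reference: plain greedy decoding with the target model only."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(target, draft, prompt, max_new, num_draft=6):
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    seq = list(prompt)
    limit = len(prompt) + max_new
    while len(seq) < limit:
        # 1. The cheap draft model proposes a short run of tokens.
        proposal = []
        for _ in range(min(num_draft, limit - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. The target checks each proposed token in order.
        accepted = 0
        for token in proposal:
            expected = target(seq + proposal[:accepted])
            if token != expected:
                # First mismatch: keep the agreed prefix plus the target's token.
                seq.extend(proposal[:accepted])
                seq.append(expected)
                break
            accepted += 1
        else:
            # Every proposed token matched; the target adds one more if room.
            seq.extend(proposal)
            if len(seq) < limit:
                seq.append(target(seq))
    return seq[len(prompt):]


def toy_target(seq):
    # Hypothetical stand-in for the large model's greedy next-token choice.
    return (seq[-1] * 7 + len(seq)) % 50


def toy_draft(seq):
    # Hypothetical draft: agrees with the target except at every 4th position.
    guess = toy_target(seq)
    return guess if len(seq) % 4 else (guess + 1) % 50


prompt = [3, 1, 4]
assert (speculative_decode(toy_target, toy_draft, prompt, 20)
        == greedy_decode(toy_target, prompt, 20))
```

Every emitted token is either confirmed or produced by the target, so the draft only affects how many tokens are settled per verification step, not which tokens are chosen.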

Conversion details

  • Source: google/gemma-4-12B-it.
  • Tool: mlx_lm.convert, mlx_lm 0.31.3 / mlx 0.31.2.
  • Precision: bfloat16, no quantization.
  • Vision and audio weights (vision_embedder, embed_vision, embed_audio) were removed; only the language model is kept. model_type set to gemma4.

License

Released under the Apache 2.0 license, the same terms as the original Gemma 4 release. Please review Google's terms before use.
