Libraries

How to use jedisct1/gemma-4-12B-it-txt-mlx with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("jedisct1/gemma-4-12B-it-txt-mlx")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use jedisct1/gemma-4-12B-it-txt-mlx with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "jedisct1/gemma-4-12B-it-txt-mlx"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "jedisct1/gemma-4-12B-it-txt-mlx"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use jedisct1/gemma-4-12B-it-txt-mlx with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "jedisct1/gemma-4-12B-it-txt-mlx"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default jedisct1/gemma-4-12B-it-txt-mlx

Run Hermes

hermes

MLX LM

How to use jedisct1/gemma-4-12B-it-txt-mlx with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "jedisct1/gemma-4-12B-it-txt-mlx"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "jedisct1/gemma-4-12B-it-txt-mlx"
# Call the OpenAI-compatible server with curl (mlx_lm.server listens on port 8080 by default)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "jedisct1/gemma-4-12B-it-txt-mlx",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'

gemma-4-12B-it-txt-mlx

This is the text backbone of google/gemma-4-12B-it converted to MLX for Apple Silicon, in bfloat16 (no quantization).

Gemma 4 12B is a gemma4_unified any-to-any model: text, vision and audio in, text out. This conversion keeps the language model only — the vision and audio towers are dropped, which is why the name carries a txt marker. If you need image or audio understanding, use the original model with a multimodal runtime. If you want a fast, local, text-and-tool-calling model that runs on stock mlx_lm, this is it.

The weights are the unmodified language model; the only change made during conversion is setting the config model_type to gemma4 so that a stock mlx_lm install loads it without any extra code.
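
As a rough illustration of the config change described above, the sketch below rewrites the top-level model_type field of a config.json. It is not the script used for this conversion: the helper name set_model_type and the one-field sample config are made up for the example, and a real config.json contains many more fields.

```python
import json
import tempfile
from pathlib import Path


def set_model_type(config_path, new_type="gemma4"):
    """Rewrite the top-level model_type in a config.json and return the old value."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    old_type = config.get("model_type")
    config["model_type"] = new_type
    path.write_text(json.dumps(config, indent=2) + "\n")
    return old_type


# Demo on a throwaway file; the sample contents are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "config.json"
    cfg.write_text(json.dumps({"model_type": "gemma4_unified"}))
    previous = set_model_type(cfg)
    patched = json.loads(cfg.read_text())["model_type"]

print(previous, "->", patched)
```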

What works

Runs directly with the standard mlx_lm CLI and server — nothing custom to install.
Tool calling is reliable. Gemma 4 emits calls in its own delimiter-based format (<|tool_call>call:name{arg:<|"|>value<|"|>}<tool_call|>); mlx_lm's built-in gemma4 tool parser turns that back into standard OpenAI tool_calls JSON, so OpenAI-compatible clients work out of the box.
Reasoning ("thinking") is supported. mlx_lm turns it on by default for this model; see below if you want it off.
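
To make the tool-call translation concrete, here is a minimal sketch that converts the delimiter format quoted above into an OpenAI-style tool_calls list. It is not mlx_lm's parser. It assumes string-valued arguments only, and it assumes that several arguments are separated by commas, which the quoted example does not show. The function name parse_tool_calls is hypothetical.

```python
import json
import re

# Delimiters copied from the format quoted above; the rest is an assumption.
CALL_RE = re.compile(r"<\|tool_call>call:(\w+)\{(.*?)\}<tool_call\|>", re.DOTALL)
ARG_RE = re.compile(r'(\w+):<\|"\|>(.*?)<\|"\|>', re.DOTALL)


def parse_tool_calls(text):
    """Return OpenAI-style tool_calls for every delimited call found in text."""
    calls = []
    for name, body in CALL_RE.findall(text):
        # Only string-valued arguments are handled in this sketch.
        arguments = dict(ARG_RE.findall(body))
        calls.append({
            "type": "function",
            "function": {"name": name, "arguments": json.dumps(arguments)},
        })
    return calls


raw = '<|tool_call>call:get_weather{city:<|"|>Paris<|"|>}<tool_call|>'
tool_calls = parse_tool_calls(raw)
print(json.dumps(tool_calls, indent=2))
```

When serving through mlx_lm.server this translation already happens on the server side, so clients should not need code like this. It is shown only to illustrate the mapping.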

Usage

Generate from the command line:

mlx_lm.generate --model jedisct1/gemma-4-12B-it-txt-mlx \
  --prompt "Explain the Monty Hall problem in three sentences." --temp 0.0

From Python:

from mlx_lm import load, generate

model, tokenizer = load("jedisct1/gemma-4-12B-it-txt-mlx")
messages = [{"role": "user", "content": "Write a haiku about pointers."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))

Tool calling and OpenAI-compatible serving

Start the server:

mlx_lm.server --model jedisct1/gemma-4-12B-it-txt-mlx --port 8080

Then call it like any OpenAI chat endpoint with a tools array. The response comes back with finish_reason: "tool_calls" and a normal tool_calls list:

curl http://127.0.0.1:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "jedisct1/gemma-4-12B-it-txt-mlx",
  "messages": [{"role": "user", "content": "What is the weather in Paris in celsius?"}],
  "tools": [{"type": "function", "function": {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {"type": "object",
      "properties": {"city": {"type": "string"}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}},
      "required": ["city"]}}}],
  "temperature": 0.0
}'

This model has been exercised end-to-end as the backend for the swival coding agent (file reads/writes, edits, shell commands, grep, listing) and drives its full tool loop. For agentic and coding work, use temperature 0.0 — greedy decoding is markedly more reliable for tool arguments and code than the model's creative defaults (temperature 1.0, top_k 64, top_p 0.95).
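
One step of such a tool loop can be sketched in Python as follows. This is an illustration of the standard OpenAI chat-completions message shapes, not swival's implementation. The get_weather function, its canned return value, and the hand-written sample_response are all hypothetical stand-ins for a real tool and a real server reply.

```python
import json


def get_weather(city, unit="celsius"):
    """Hypothetical local tool: returns canned data instead of calling a weather API."""
    return {"city": city, "unit": unit, "temperature": 18}


TOOLS = {"get_weather": get_weather}


def build_request(messages, tool_schemas, model="jedisct1/gemma-4-12B-it-txt-mlx"):
    """Build a chat-completions payload with greedy decoding, as recommended above."""
    return {"model": model, "messages": messages,
            "tools": tool_schemas, "temperature": 0.0}


def run_tool_calls(response, messages):
    """Run every tool call in an OpenAI-style response and append the results."""
    choice = response["choices"][0]
    if choice.get("finish_reason") != "tool_calls":
        return messages
    assistant = choice["message"]
    messages = messages + [assistant]
    for call in assistant["tool_calls"]:
        function = TOOLS[call["function"]["name"]]
        arguments = json.loads(call["function"]["arguments"])
        result = function(**arguments)
        messages.append({"role": "tool", "tool_call_id": call["id"],
                         "content": json.dumps(result)})
    return messages


# Hand-written sample in the OpenAI response shape, not captured server output.
sample_response = {"choices": [{"finish_reason": "tool_calls", "message": {
    "role": "assistant", "content": "", "tool_calls": [{
        "id": "call_0", "type": "function", "function": {
            "name": "get_weather",
            "arguments": json.dumps({"city": "Paris", "unit": "celsius"})}}]}}]}

history = [{"role": "user", "content": "What is the weather in Paris in celsius?"}]
history = run_tool_calls(sample_response, history)
print(history[-1])
```

In a real loop, the payload from build_request would be POSTed to /v1/chat/completions, and the extended history sent back to the server until the reply no longer has finish_reason "tool_calls".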

Thinking on/off

mlx_lm enables Gemma 4's thinking channel by default. To turn it off, pass the chat-template argument when serving:

mlx_lm.server --model jedisct1/gemma-4-12B-it-txt-mlx --chat-template-args '{"enable_thinking": false}'

With mlx_lm.generate, the equivalent flag is --chat-template-config '{"enable_thinking": false}'. Tool calling works either way: turning thinking off is faster, while leaving it on tends to plan tool use more carefully.
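
Because the flag name differs between the two commands, a small launcher helper can keep them straight. The sketch below only assembles argument lists using the flag names given above. The helper name command and the TEMPLATE_FLAG table are hypothetical conveniences, not part of mlx_lm.

```python
import json
import shlex

MODEL = "jedisct1/gemma-4-12B-it-txt-mlx"

# Flag names as given in the text above: the server and the generate CLI differ.
TEMPLATE_FLAG = {
    "mlx_lm.server": "--chat-template-args",
    "mlx_lm.generate": "--chat-template-config",
}


def command(tool, enable_thinking, extra=()):
    """Return the argv list for tool, disabling thinking only when asked to."""
    argv = [tool, "--model", MODEL]
    if not enable_thinking:
        argv += [TEMPLATE_FLAG[tool], json.dumps({"enable_thinking": False})]
    return argv + list(extra)


server_cmd = command("mlx_lm.server", enable_thinking=False, extra=["--port", "8080"])
print(shlex.join(server_cmd))
```

The resulting list can be passed directly to subprocess.run, which avoids shell-quoting problems with the JSON argument.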

Faster generation with a draft model

This bf16 model benefits substantially from speculative decoding. Pair it with the small E2B draft model (same tokenizer and vocabulary) for roughly a 3–4x throughput increase with identical output:

mlx_lm.generate \
  --model jedisct1/gemma-4-12B-it-txt-mlx \
  --draft-model jedisct1/gemma-4-E2B-it-txt-mlx-4bit \
  --num-draft-tokens 6 \
  --prompt "Write a Python function that returns the nth Fibonacci number." --temp 0.0

Drafts: gemma-4-E2B-it-txt-mlx-4bit (fastest) and gemma-4-E2B-it-txt-mlx-8bit. Tool calls are preserved exactly.
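
Why the output stays identical under greedy decoding can be seen in a toy sketch. The two "models" below are simple deterministic functions invented for the illustration, and the loop is a simplified version of greedy speculative decoding, not mlx_lm's implementation. The draft proposes a run of tokens, the target checks them, and only tokens the target itself would have chosen are kept.

```python
def target_next(seq):
    """Toy stand-in for the large model: a deterministic next-token rule."""
    return (seq[-1] * 3 + 1) % 7


def draft_next(seq):
    """Toy draft model: agrees with the target unless the context length is a multiple of 5."""
    guess = target_next(seq)
    return guess if len(seq) % 5 else (guess + 1) % 7


def greedy_decode(next_token, prompt, n):
    """Plain greedy decoding: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def speculative_decode(prompt, n, num_draft_tokens=6):
    """Greedy speculative decoding; returns the tokens and the number of verify rounds."""
    seq, produced, verify_rounds = list(prompt), 0, 0
    while produced < n:
        # 1. The draft model cheaply proposes a run of tokens.
        ctx, draft = list(seq), []
        for _ in range(num_draft_tokens):
            draft.append(draft_next(ctx))
            ctx.append(draft[-1])
        # 2. The target checks the run. A real runtime scores all positions in one
        #    batched forward pass; here the target rule is applied position by position.
        verify_rounds += 1
        all_accepted = True
        for guess in draft:
            if produced == n:
                break
            token = target_next(seq)  # always keep the target's own choice
            seq.append(token)
            produced += 1
            if token != guess:  # first mismatch: discard the rest of the draft
                all_accepted = False
                break
        # 3. If every draft token matched, the same pass yields one extra token.
        if all_accepted and produced < n:
            seq.append(target_next(seq))
            produced += 1
    return seq[len(prompt):], verify_rounds


baseline = greedy_decode(target_next, [1], 20)
spec_out, rounds = speculative_decode([1], 20)
print(spec_out == baseline, rounds)
```

Every emitted token comes from the target rule, so the result matches plain greedy decoding. The speedup comes from needing fewer target passes than tokens when the draft guesses well.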

Conversion details

Source: google/gemma-4-12B-it.
Tool: mlx_lm.convert, mlx_lm 0.31.3 / mlx 0.31.2.
Precision: bfloat16, no quantization.
Vision and audio weights (vision_embedder, embed_vision, embed_audio) were removed; only the language model is kept. model_type set to gemma4.
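
The weight filtering described above amounts to dropping tensors by name. The sketch below shows the idea on a plain dictionary. It is not the actual conversion code, the sample tensor names are hypothetical, and it assumes that the three component names appear as substrings of the weight keys.

```python
# Component names taken from the conversion notes above.
DROPPED_MARKERS = ("vision_embedder", "embed_vision", "embed_audio")


def keep_language_weights(weights):
    """Return only the entries whose name mentions none of the dropped components."""
    return {
        name: tensor
        for name, tensor in weights.items()
        if not any(marker in name for marker in DROPPED_MARKERS)
    }


# Hypothetical tensor names with placeholder values, for illustration only.
sample_weights = {
    "model.language_model.layers.0.self_attn.q_proj.weight": "tensor-a",
    "model.vision_embedder.patch_embedding.weight": "tensor-b",
    "model.embed_vision.projection.weight": "tensor-c",
    "model.embed_audio.projection.weight": "tensor-d",
}

kept = keep_language_weights(sample_weights)
print(sorted(kept))
```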

License

Released under the Apache 2.0 license, the same terms as the original Gemma 4 release. Please review Google's terms before use.

Model size: 12B params
Tensor type: BF16
Format: Safetensors (MLX)

Model tree for jedisct1/gemma-4-12B-it-txt-mlx

Base model: google/gemma-4-12B
Finetuned: google/gemma-4-12B-it
Finetuned (32): this model