gemma-4-12B-it-txt-mlx
This is the text backbone of google/gemma-4-12B-it
converted to MLX for Apple Silicon, in bfloat16 (no
quantization).
Gemma 4 12B is a gemma4_unified any-to-any model: text, vision and audio in, text out. This
conversion keeps the language model only — the vision and audio towers are dropped, which is why the
name carries a txt marker. If you need image or audio understanding, use the original model with a
multimodal runtime. If you want a fast, local, text-and-tool-calling model that runs on stock
mlx_lm, this is it.
The weights are the unmodified language model; the only change made during conversion is setting the
config model_type to gemma4 so that a stock mlx_lm install loads it without any extra code.
What works
- Runs directly with the standard mlx_lm CLI and server: nothing custom to install.
- Tool calling is reliable. Gemma 4 emits calls in its own delimiter-based format (<|tool_call>call:name{arg:<|"|>value<|"|>}<tool_call|>); mlx_lm's built-in gemma4 tool parser turns that back into standard OpenAI tool_calls JSON, so OpenAI-compatible clients work out of the box.
- Reasoning ("thinking") is supported. mlx_lm turns it on by default for this model; see below if you want it off.
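To make the delimiter format above concrete, here is a toy Python parser for the single-call, string-argument shape shown in that example. It is not mlx_lm's gemma4 parser: the regular expressions and the parse_tool_call name are assumptions for illustration, and real model output may contain non-string values or nesting that this sketch ignores.

```python
import json
import re

# One call in the shape shown above: <|tool_call>call:NAME{ARGS}<tool_call|>
CALL_RE = re.compile(r"<\|tool_call>call:(\w+)\{(.*?)\}<tool_call\|>", re.DOTALL)
# key:<|"|>value<|"|> pairs inside the braces (string values only).
ARG_RE = re.compile(r'(\w+):<\|"\|>(.*?)<\|"\|>', re.DOTALL)


def parse_tool_call(text):
    """Return an OpenAI-style tool call dict, or None if no call is found.

    Hypothetical helper: the "id" field that servers normally add is omitted.
    """
    match = CALL_RE.search(text)
    if match is None:
        return None
    name, raw_args = match.groups()
    args = dict(ARG_RE.findall(raw_args))
    # OpenAI clients expect "arguments" as a JSON-encoded string.
    return {
        "type": "function",
        "function": {"name": name, "arguments": json.dumps(args)},
    }


sample = (
    '<|tool_call>call:get_weather{city:<|"|>Paris<|"|>,'
    'unit:<|"|>celsius<|"|>}<tool_call|>'
)
call = parse_tool_call(sample)
print(call)
```

In practice the server does this translation for you; the sketch only shows what "turns that back into tool_calls JSON" means.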
Usage
Generate from the command line:
mlx_lm.generate --model jedisct1/gemma-4-12B-it-txt-mlx \
--prompt "Explain the Monty Hall problem in three sentences." --temp 0.0
From Python:
from mlx_lm import load, generate
model, tokenizer = load("jedisct1/gemma-4-12B-it-txt-mlx")
messages = [{"role": "user", "content": "Write a haiku about pointers."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
Tool calling and OpenAI-compatible serving
Start the server:
mlx_lm.server --model jedisct1/gemma-4-12B-it-txt-mlx --port 8080
Then call it like any OpenAI chat endpoint with a tools array. The response comes back with
finish_reason: "tool_calls" and a normal tool_calls list:
curl http://127.0.0.1:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "jedisct1/gemma-4-12B-it-txt-mlx",
"messages": [{"role": "user", "content": "What is the weather in Paris in celsius?"}],
"tools": [{"type": "function", "function": {
"name": "get_weather",
"description": "Get the current weather for a city.",
"parameters": {"type": "object",
"properties": {"city": {"type": "string"}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}},
"required": ["city"]}}}],
"temperature": 0.0
}'
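The same request can be sent from Python. This is a minimal sketch using only the standard library; build_payload, extract_tool_calls, and chat are hypothetical helper names, and it assumes the server started above is listening on 127.0.0.1:8080. The response fields it reads follow the OpenAI chat-completions shape described above.

```python
import json
import urllib.request

URL = "http://127.0.0.1:8080/v1/chat/completions"  # assumes --port 8080
MODEL = "jedisct1/gemma-4-12B-it-txt-mlx"

WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}


def build_payload(question):
    """Build the same JSON body as the curl example."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "tools": [WEATHER_TOOL],
        "temperature": 0.0,
    }


def extract_tool_calls(response):
    """Return (name, arguments) pairs, or [] if the model did not call a tool."""
    choice = response["choices"][0]
    if choice.get("finish_reason") != "tool_calls":
        return []
    calls = choice["message"].get("tool_calls") or []
    # "arguments" arrives as a JSON string, so decode it.
    return [(c["function"]["name"], json.loads(c["function"]["arguments"])) for c in calls]


def chat(payload, url=URL):
    """POST the payload to the local server and return the decoded response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

With the server running, extract_tool_calls(chat(build_payload("What is the weather in Paris in celsius?"))) returns the parsed calls if the model chose to call the tool.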
This model has been exercised end-to-end as the backend for the
swival coding agent (file reads/writes, edits, shell commands, grep, listing)
and drives its full tool loop. For agentic and coding work, use temperature 0.0 — greedy
decoding is markedly more reliable for tool arguments and code than the model's creative defaults
(temperature 1.0, top_k 64, top_p 0.95).
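One turn of a tool loop like the one described above can be sketched as follows. This is not swival's implementation: TOOLS, grep, and run_tool_calls are hypothetical names, and the message shapes assume the OpenAI chat-completions convention of answering each call with a "tool" role message that carries the matching tool_call_id.

```python
import json


def grep(pattern, text):
    """Toy stand-in for an agent tool: return the lines containing pattern."""
    return [line for line in text.splitlines() if pattern in line]


# Hypothetical registry mapping tool names to local Python functions.
TOOLS = {"grep": grep}


def run_tool_calls(assistant_message):
    """Execute every tool call in an assistant message.

    Returns the "tool" role messages to append to the conversation.
    Failures are reported back to the model instead of raising, so the
    loop can continue and the model can correct its arguments.
    """
    results = []
    for call in assistant_message.get("tool_calls", []):
        function = call["function"]
        try:
            arguments = json.loads(function["arguments"])
            output = TOOLS[function["name"]](**arguments)
        except Exception as exc:
            output = {"error": f"{type(exc).__name__}: {exc}"}
        results.append(
            {
                "role": "tool",
                "tool_call_id": call.get("id", ""),
                "content": json.dumps(output),
            }
        )
    return results
```

A full loop would append the assistant message and these results to the conversation, send it back to the server at temperature 0.0, and repeat until finish_reason is no longer "tool_calls".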
Thinking on/off
mlx_lm enables Gemma 4's thinking channel by default. To turn it off, pass the chat-template
argument when serving:
mlx_lm.server --model jedisct1/gemma-4-12B-it-txt-mlx --chat-template-args '{"enable_thinking": false}'
or --chat-template-config '{"enable_thinking": false}' with mlx_lm.generate. Tool calling is
correct either way; thinking off is faster, thinking on tends to plan tool use more carefully.
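From Python, the same switch can usually be passed straight to the chat template. This is a minimal sketch, assuming the tokenizer forwards extra keyword arguments to the template (as Hugging Face tokenizers do) and that the template reads enable_thinking, the argument named above; build_prompt is a hypothetical helper.

```python
def build_prompt(tokenizer, user_text, thinking=True):
    """Render a single-turn prompt with the thinking channel on or off.

    Assumption: extra keyword arguments reach the chat template unchanged.
    """
    messages = [{"role": "user", "content": user_text}]
    return tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        enable_thinking=thinking,
    )


# Usage with mlx_lm (requires the model to be downloaded):
# from mlx_lm import load, generate
# model, tokenizer = load("jedisct1/gemma-4-12B-it-txt-mlx")
# prompt = build_prompt(tokenizer, "Write a haiku about pointers.", thinking=False)
# print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```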
Faster generation with a draft model
This bf16 model benefits a lot from speculative decoding. Pair it with the small E2B draft (same tokenizer/vocab) for roughly a 3–4x throughput increase with identical output:
mlx_lm.generate \
--model jedisct1/gemma-4-12B-it-txt-mlx \
--draft-model jedisct1/gemma-4-E2B-it-txt-mlx-4bit \
--num-draft-tokens 6 \
--prompt "Write a Python function that returns the nth Fibonacci number." --temp 0.0
Drafts: gemma-4-E2B-it-txt-mlx-4bit (fastest) and gemma-4-E2B-it-txt-mlx-8bit.
Tool calls are preserved exactly.
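Why the output stays identical under greedy decoding can be shown with a toy model. The sketch below is not mlx_lm's implementation: target and draft are made-up deterministic functions standing in for the two models, and a real runtime verifies all drafted tokens in one batched forward pass rather than one call per token. The point is only that every emitted token is the target's own greedy choice, so the draft affects speed and never content.

```python
def greedy_decode(target, prompt, n_new):
    """Plain greedy decoding: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens[len(prompt):]


def speculative_decode(target, draft, prompt, n_new, num_draft_tokens=6):
    """Greedy speculative decoding. Returns (new_tokens, verification_rounds)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(tokens) < goal:
        # 1. The cheap draft model guesses a few tokens ahead.
        proposal = []
        for _ in range(num_draft_tokens):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each guess and always keeps its own token.
        accepted = []
        for guess in proposal:
            truth = target(tokens + accepted)
            accepted.append(truth)
            if truth != guess:
                break  # first mismatch: discard the remaining guesses
        else:
            # Every guess matched, so the target yields one extra token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
        rounds += 1
    return tokens[len(prompt):goal], rounds


def target(tokens):
    """Toy 'large model': a deterministic next-token rule."""
    return (tokens[-1] * 3 + len(tokens)) % 7


def draft(tokens):
    """Toy 'small model': agrees with the target except at every fifth position."""
    guess = target(tokens)
    return (guess + 1) % 7 if len(tokens) % 5 == 0 else guess


slow = greedy_decode(target, [1], 20)
fast, rounds = speculative_decode(target, draft, [1], 20)
print(fast == slow, rounds)
```

The number of verification rounds is a rough proxy for target-model passes; the actual speedup depends on how often the draft agrees with the target and on the relative cost of the two models.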
Conversion details
- Source: google/gemma-4-12B-it.
- Tool: mlx_lm.convert, mlx_lm 0.31.3 / mlx 0.31.2.
- Precision: bfloat16, no quantization.
- Vision and audio weights (vision_embedder, embed_vision, embed_audio) were removed; only the language model is kept. model_type set to gemma4.
License
Released under the Apache 2.0 license, the same terms as the original Gemma 4 release. Please review Google's terms before use.