What this is
A QLoRA fine‑tune of Gemma 4 E4B (≈4B effective) that bakes Tacita's per‑turn director behaviors directly into the weights — preamble, native thinking, end‑to‑end search reasoning, honesty, and Tacita identity — so the assistant needs far fewer serial passes per turn and runs fully offline.
This is the GGUF / llama.cpp build, consumed by tacita-desktop (Rust runtime). For mobile (flutter_gemma), use the LiteRT‑LM build: Gemma‑4‑Tacita‑E4B‑litert‑lm.
What's baked in
- 🗣️ Inline preamble + native thinking: route‑declaration preamble, then `enable_thinking` is honored natively; emits nothing when thinking is off (fixes the stock‑Gemma llama.cpp #21338 class of bugs).
- 🔎 End‑to‑end web search: decides when to search, builds a multilingual query plan inside the tool call (`{queries:[{q,lang}]}`), grounds answers, and retries off‑topic results.
- 🤝 Honesty: refuses unanswerable live‑data questions without inventing numbers.
- 🪪 Tacita identity: knows it is Tacita, built on Gemma 4, private on‑device.
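As an illustration of the query‑plan shape named above (`{queries:[{q,lang}]}`), here is a minimal sketch of how a consumer might validate the tool‑call arguments before running a search. Only the `queries`, `q`, and `lang` field names come from this card; the function name `parse_query_plan`, the assumption that the arguments arrive as a JSON string, and the example queries are hypothetical.

```python
import json


def parse_query_plan(arguments_json: str) -> list:
    """Validate a plan shaped like {"queries": [{"q": ..., "lang": ...}]}.

    Hypothetical helper: returns a list of (query, language) tuples,
    or raises ValueError if the plan does not match the expected shape.
    """
    plan = json.loads(arguments_json)
    if not isinstance(plan, dict):
        raise ValueError("query plan must be a JSON object")

    queries = plan.get("queries")
    if not isinstance(queries, list) or not queries:
        raise ValueError("query plan needs a non-empty 'queries' list")

    parsed = []
    for item in queries:
        q = item.get("q") if isinstance(item, dict) else None
        lang = item.get("lang") if isinstance(item, dict) else None
        # Both fields must be non-empty strings.
        if not (isinstance(q, str) and q.strip()):
            raise ValueError(f"missing or empty 'q' in {item!r}")
        if not (isinstance(lang, str) and lang.strip()):
            raise ValueError(f"missing or empty 'lang' in {item!r}")
        parsed.append((q.strip(), lang.strip()))
    return parsed


# Made-up example arguments, one query per language.
example = (
    '{"queries": ['
    '{"q": "weather Lisbon tomorrow", "lang": "en"}, '
    '{"q": "tempo Lisboa amanha", "lang": "pt"}]}'
)
plan = parse_query_plan(example)
```

Rejecting malformed plans up front keeps a bad tool call from turning into an empty or off‑topic search.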
Capability metadata
The GGUF bundle is stamped with tacita.* keys (GGUF metadata namespace). The Tacita desktop runtime reads them once at load and bypasses the director call that matches each declared capability. A stock llama.cpp runtime ignores them and runs the model normally.
tacita.model = gemma-4-tacita
tacita.variant = E4B
tacita.tier = desktop-pro
tacita.capabilities = [preamble_inline, thinking_native, search_plan_inline, ...]
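The bypass logic can be sketched roughly as follows, assuming the GGUF key/value metadata has already been parsed into a plain dictionary. The helper names `tacita_capabilities` and `needs_director_call` are hypothetical, and the real tacita-desktop runtime is written in Rust, so this Python is illustrative only.

```python
def tacita_capabilities(metadata: dict) -> set:
    """Return the capabilities declared under tacita.capabilities, if any."""
    caps = metadata.get("tacita.capabilities", [])
    return set(caps) if isinstance(caps, (list, tuple)) else set()


def needs_director_call(metadata: dict, capability: str) -> bool:
    """A director call is needed only when the capability is not baked in."""
    return capability not in tacita_capabilities(metadata)


# Metadata as stamped on this build (only the capabilities named above).
stamped = {
    "tacita.model": "gemma-4-tacita",
    "tacita.variant": "E4B",
    "tacita.tier": "desktop-pro",
    "tacita.capabilities": [
        "preamble_inline",
        "thinking_native",
        "search_plan_inline",
    ],
}

# A stock GGUF file carries no tacita.* keys at all.
stock = {"general.name": "some other model"}

skip_preamble = not needs_director_call(stamped, "preamble_inline")
stock_needs_preamble = needs_director_call(stock, "preamble_inline")
```

Treating a missing key as "no capabilities" means an unstamped model falls back to the full director pipeline.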
Quantizations
| File | Bits | Size (approx) | Use |
|---|---|---|---|
| `*-Q4_K_M.gguf` | 4‑bit | ~2.6 GB | default desktop |
| `*-Q8_0.gguf` | 8‑bit | ~4.3 GB | quality‑first |
| `*-F16.gguf` | 16‑bit | ~8 GB | reference |
Run with llama.cpp (b‑recent) or any GGUF runtime. eos_token is <end_of_turn> (Gemma‑4 turn format).
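To make the choice between quantizations concrete, here is a small sketch that picks the highest‑precision file fitting a memory budget. The sizes are the approximate figures listed in this card. The `pick_quant` helper and the 1 GB default headroom are assumptions for illustration; file size is only a lower bound on runtime memory, since the KV cache and runtime buffers come on top.

```python
from typing import Optional

# Approximate file sizes in GB, taken from this card,
# ordered from highest to lowest precision.
QUANTS = [("F16", 8.0), ("Q8_0", 4.3), ("Q4_K_M", 2.6)]


def pick_quant(budget_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the highest-precision quantization whose file fits the budget.

    headroom_gb is an assumed allowance for KV cache and runtime buffers;
    tune it for your context length and hardware.
    """
    for name, size_gb in QUANTS:
        if size_gb + headroom_gb <= budget_gb:
            return name
    return None  # nothing fits within the budget


choice_6gb = pick_quant(6.0)
choice_4gb = pick_quant(4.0)
choice_3gb = pick_quant(3.0)
```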
License
Base Gemma 4 is Apache 2.0; this derivative inherits Apache 2.0.
Provenance
Training labels = adapted permissively‑licensed open datasets + an open‑weights teacher — never frontier‑API model outputs.