How to use burtenshaw/gemma-4-12b-sdpo-pi-mono-trace-feedback with PEFT:
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, then attach the LoRA adapter from the Hub.
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-4-12B-it")
model = PeftModel.from_pretrained(base_model, "burtenshaw/gemma-4-12b-sdpo-pi-mono-trace-feedback")
```
gemma-4-12b-sdpo-pi-mono-trace-feedback
LoRA adapter trained with TRL SDPO on filtered badlogicgames/pi-mono coding-agent traces that contain concrete tool errors or later user corrections.
This adapter is a self-distillation experiment, not a general-purpose coding model release. It uses TRL's experimental SDPO trainer with include_environment_feedback=True, so filtered trace diagnostics are supplied as privileged_context for teacher-conditioned reprompts.
Training Run
| Field | Value |
|---|---|
| Hub repo | burtenshaw/gemma-4-12b-sdpo-pi-mono-trace-feedback |
| Base model | google/gemma-4-12B-it |
| Dataset | badlogicgames/pi-mono |
| Dataset split | train |
| Selected samples | 64 |
| Minimum filter score | 13.3 |
| Average filter score | 15.298 |
| Max steps | 24 |
| Learning rate | 5e-05 |
| Training method | 4-bit NF4 QLoRA + TRL SDPO |
| LoRA rank | 8 |
| LoRA alpha | 16 |
| Num generations | 2 |
| Success reward threshold | 0.35 |
| Max prompt length | 768 |
| Max completion length | 160 |
- Trackio: Trackio run
- Run name: gemma4-12b-sdpo-pi-mono-full-20260604
- Output directory: outputs/sdpo-pi-mono-trace-feedback
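The quantization and adapter rows in the table map onto standard `transformers` and `peft` config objects roughly as follows. This is a sketch, not the training script: the compute dtype, `target_modules`, and `task_type` values are assumptions that the card does not state.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization for the frozen base model (values from the table).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: not stated in the card
)

# LoRA rank and alpha come from the table; the rest is assumed.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules="all-linear",  # assumption: the card does not list target modules
    task_type="CAUSAL_LM",        # assumption
)
```

With rank 8 and alpha 16, the adapter update is scaled by alpha / rank = 2.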
Data Filtering
The source data is badlogicgames/pi-mono. The script uses the Dataset Viewer parquet export by default because the raw JSONL files have schema drift that can make direct load_dataset() reconstruction brittle.
A row is kept when it has a substantive prompt plus at least one concrete tool error, environment diagnostic, or later user correction. Rows are scored higher for test/build failures, runtime exceptions, missing files or commands, explicit user feedback, and evidence that the trace continued after the failure.
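The keep-and-score rule described above could be sketched as follows. The weights, the continuation bonus, the prompt-length threshold, and the `TraceRow` type are all hypothetical: the card names the signals but not their values, and the real script may differ.

```python
from dataclasses import dataclass, field

# Hypothetical weights; only the category names come from the card.
CATEGORY_WEIGHTS = {
    "test_or_assertion": 3.0,
    "build_lint_compile": 3.0,
    "runtime_exception": 2.5,
    "missing_file_or_command": 2.0,
    "user_feedback": 2.0,
    "command_error": 1.0,
}
DEFAULT_WEIGHT = 0.5                 # assumed weight for any other category
CONTINUED_AFTER_FAILURE_BONUS = 1.5  # assumed bonus
MIN_PROMPT_CHARS = 40                # assumed cutoff for a "substantive" prompt


@dataclass
class TraceRow:
    prompt: str
    categories: set = field(default_factory=set)
    continued_after_failure: bool = False


def keep_row(row: TraceRow) -> bool:
    """Keep rows with a substantive prompt and at least one failure signal."""
    return len(row.prompt.strip()) >= MIN_PROMPT_CHARS and bool(row.categories)


def score_row(row: TraceRow) -> float:
    """Sum the per-category weights, plus a bonus if the trace kept going."""
    score = sum(CATEGORY_WEIGHTS.get(c, DEFAULT_WEIGHT) for c in row.categories)
    if row.continued_after_failure:
        score += CONTINUED_AFTER_FAILURE_BONUS
    return score


def select_top(rows, limit):
    """Return the `limit` highest-scoring rows that pass the keep rule."""
    kept = [r for r in rows if keep_row(r)]
    return sorted(kept, key=score_row, reverse=True)[:limit]
```

Calling `select_top(rows, limit=64)` would mirror the 64 selected samples reported in the table, assuming selection is a top-k cut by score.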
Selected-data summary:
- Source sessions: 64
- Source files: 64
- Score range: 13.3 to 19.5
- Average score: 15.298
Category counts:
- build_lint_compile: 40
- command_error: 64
- missing_file_or_command: 43
- other_marked_error: 3
- permission_auth: 3
- runtime_exception: 21
- test_or_assertion: 63
- tool_schema_validation: 2
- user_feedback: 62
First selected sample preview:
score: 19.5
categories: build_lint_compile, command_error, missing_file_or_command, runtime_exception, test_or_assertion, user_feedback
reward_terms: build_lint_compile, command_error, missing_file_or_command, runtime_exception, test_or_assertion, user_feedback, environment, diagnostic, promises:332, triggeruncaughtexception
prompt:
Analyze GitHub issue(s): https://github.com/badlogic/pi-mono/issues/2291
For each issue:
1. Read the issue in full, including all comments and linked issues/PRs.
2. Do not trust analysis written in the issue. Independently verify behavior and derive your own analysis from the code and execution path.
3. **For bugs**:
- Ignore any root cause analysis in the issue (likely wrong)
- Read all related code files in full (no truncation)
- Trace the code path and identify the actual root cause
- Propose a fix
4. **For feature requests**:
- Do not trust implementation proposals in the issue without verification
- Read all related code files in full (no truncation)
- Propose the most concise implementation approach
- List affected files and changes needed
Do NOT implement unless explicitly asked. Analyze and propose only.
privileged_context:
Tool/environment diagnostic 1 (test_or_assertion, runtime_exception, command_error):
node:internal/process/promises:332
triggerUncaughtException(err, true /* fromPromise */);
^
Error: Transform failed with 3 errors:
/eval.ts:29:2: ERROR: Top-level await is currently not supported with the "cjs" output format
/eval.ts:31:22: ERROR: Top-level await is currently not supported with the "cjs" output format
/eval.ts:46:2: ERROR: Top-level await is currently not supported with the "cjs" output format
at failureErrorWithLog ($WORKSPACE/node_modules/esbuild/lib/main.js:1748:15)
at $WOR
[... trimmed ...]
el' does not exist in type 'SimpleStreamOptions'.
packages/coding-agent/src/core/extensions/runner.ts(242,46): error TS2304: Cannot find name 'ProviderConfig'.
Command exited with code 2
Tool/environment diagnostic 3 (test_or_assertion):
No changes made to packages/coding-agent/src/core/agent-session.ts. The replacement produced identical content. This might indicate an issue with special characters or the text not existing as expected.
Later user correction 1:
what'st he most concise fix? this sounds overly complex
Later user correction 2:
well, we can't just fix it for session_start then
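The layout of the privileged_context preview above can be reproduced with a small helper. This is a minimal sketch: the function name, the `(categories, text)` input shape, and the character budget are assumptions, and only the section headers and the trim marker are taken from the preview.

```python
TRIM_MARKER = "\n[... trimmed ...]\n"


def build_privileged_context(diagnostics, corrections, max_chars=600):
    """Join diagnostics and user corrections into one context string.

    diagnostics: list of (categories, text) pairs.
    corrections: list of later user messages.
    max_chars:   assumed budget; longer contexts keep their head and tail.
    """
    sections = []
    for i, (categories, text) in enumerate(diagnostics, start=1):
        header = f"Tool/environment diagnostic {i} ({', '.join(categories)}):"
        sections.append(f"{header}\n{text.strip()}")
    for i, text in enumerate(corrections, start=1):
        sections.append(f"Later user correction {i}:\n{text.strip()}")

    context = "\n".join(sections)
    if len(context) > max_chars:
        half = max_chars // 2
        # Drop the middle so the first error and the final outcome both survive.
        context = context[:half] + TRIM_MARKER + context[-half:]
    return context
```

Trimming the middle rather than the end keeps both the first error message and the later corrections, which matches where the trim marker falls in the preview.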
Reward Function
The included trace_grounding_reward is intentionally lightweight. It rewards completions that look like concrete coding-agent responses and mention terms grounded in the trace diagnostic. That is enough to exercise SDPO, Trackio, HF Jobs, LoRA push-to-Hub, and the filtered trace format.
For a serious training run, replace the heuristic reward with a verifier that can replay or grade the task, such as tests, build checks, lint output, tool-call validation, or another sandboxed outcome signal.
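For illustration, a heuristic in the same spirit might look like the sketch below. It is not the card's `trace_grounding_reward`: the function name, the action-word list, and the 0.7/0.3 weighting are hypothetical. It also assumes the trainer passes completions plus dataset columns as keyword arguments, as TRL's GRPO-style reward functions do.

```python
import re

# Hypothetical list of verbs that suggest a concrete coding-agent action.
ACTION_WORDS = {"fix", "run", "edit", "check", "update", "replace", "add", "remove"}


def _words(text):
    """Lowercased identifier-like tokens in the text."""
    return set(re.findall(r"[a-z_][a-z0-9_]*", text.lower()))


def trace_grounding_reward_sketch(completions, privileged_context, **kwargs):
    """Score each completion in [0, 1] by overlap with its trace diagnostic."""
    rewards = []
    for completion, context in zip(completions, privileged_context):
        if not isinstance(completion, str):
            # Conversational format: take the last message's content.
            completion = completion[-1]["content"]
        words = _words(completion)
        # Ignore very short tokens so stop words do not count as grounding.
        context_terms = {w for w in _words(context) if len(w) >= 4}
        grounded = len(words & context_terms) / max(len(context_terms), 1)
        concrete = 1.0 if words & ACTION_WORDS else 0.0
        rewards.append(0.7 * grounded + 0.3 * concrete)
    return rewards
```

A completion that names `ProviderConfig` and proposes a fix scores higher than one that ignores the diagnostic, but nothing here checks that the proposed fix is correct, which is why a verifier-based reward is the better choice for real training.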
Training Metrics
- epoch: 0.375
- total_flos: 0.0
- train_loss: -0.3888257265401383
- train_runtime: 952.6648
- train_samples_per_second: 0.05
- train_steps_per_second: 0.025
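These figures can be cross-checked against the Training Run table with a few lines of arithmetic. The interpretation in the comments is one plausible reading, not something the card states.

```python
max_steps = 24               # from the Training Run table
selected_samples = 64        # from the Training Run table
train_runtime_s = 952.6648   # reported train_runtime, in seconds
reported_epoch = 0.375
reported_samples_per_second = 0.05
reported_steps_per_second = 0.025

# Steps per second follows directly from step count and runtime.
steps_per_second = max_steps / train_runtime_s
print(round(steps_per_second, 3))

# 0.375 of an epoch over 64 samples is 24 prompts, i.e. about one per step.
prompts_seen = reported_epoch * selected_samples
print(prompts_seen)

# The samples/steps ratio is about 2, which would be consistent with
# two generations per prompt (an interpretation, not a stated fact).
samples_per_step = reported_samples_per_second / reported_steps_per_second
print(round(samples_per_step))
```

The run therefore covered well under half of the 64 selected samples, which fits the card's framing of this as a smoke test.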
Reproduce
hf jobs uv run <raw-gist-url-for-train_sdpo_pi_mono_full.py> \
--flavor a10g-large \
--timeout 3h \
--secrets HF_TOKEN \
--env HUB_MODEL_ID=burtenshaw/gemma-4-12b-sdpo-pi-mono-trace-feedback \
--env TRACKIO_SPACE_ID=burtenshaw/trackio \
-- \
--mode train \
--run-name gemma4-12b-sdpo-pi-mono-full-20260604
Limitations
- The data is filtered from existing traces, so it reflects the trace collector's task distribution and failure modes.
- The reward is a smoke-test heuristic and should not be interpreted as a reliable coding benchmark.
- The model is pushed as a PEFT LoRA adapter; load it with the base model listed above.
- SDPO is experimental in TRL, so pin versions for long-running comparisons.