gemma-4-12b-sdpo-pi-mono-trace-feedback

LoRA adapter trained with TRL SDPO on filtered badlogicgames/pi-mono coding-agent traces that contain concrete tool errors or later user corrections.

This adapter is a self-distillation experiment, not a general-purpose coding model release. It uses TRL's experimental SDPO trainer with include_environment_feedback=True, so filtered trace diagnostics are supplied as privileged_context for teacher-conditioned reprompts.

Training Run

Field	Value
Hub repo	`burtenshaw/gemma-4-12b-sdpo-pi-mono-trace-feedback`
Base model	`google/gemma-4-12B-it`
Dataset	`badlogicgames/pi-mono`
Dataset split	`train`
Selected samples	`64`
Minimum filter score	`13.3`
Average filter score	`15.298`
Max steps	`24`
Learning rate	`5e-05`
Training method	`4-bit NF4 QLoRA + TRL SDPO`
LoRA rank	`8`
LoRA alpha	`16`
Num generations	`2`
Success reward threshold	`0.35`
Max prompt length	`768`
Max completion length	`160`

Trackio: Trackio run
Run name: gemma4-12b-sdpo-pi-mono-full-20260604
Output directory: outputs/sdpo-pi-mono-trace-feedback

Data Filtering

The source data is badlogicgames/pi-mono. The training script (train_sdpo_pi_mono_full.py) reads the Dataset Viewer parquet export by default, because the raw JSONL files have schema drift that can make direct load_dataset() reconstruction brittle.
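As a rough illustration of reading the parquet export rather than the raw JSONL, the sketch below lists parquet file URLs through the public Dataset Viewer `/parquet` endpoint. The helper names (`parquet_api_url`, `select_split_urls`, `list_parquet_urls`) are hypothetical and are not taken from the training script, which may resolve the export differently.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Dataset Viewer endpoint that lists the auto-converted parquet files.
PARQUET_ENDPOINT = "https://datasets-server.huggingface.co/parquet"


def parquet_api_url(dataset: str) -> str:
    """Build the Dataset Viewer URL that lists parquet files for a dataset."""
    return f"{PARQUET_ENDPOINT}?{urlencode({'dataset': dataset})}"


def select_split_urls(payload: dict, split: str) -> list[str]:
    """Pick the parquet file URLs that belong to one split (pure function)."""
    return [
        entry["url"]
        for entry in payload.get("parquet_files", [])
        if entry.get("split") == split
    ]


def list_parquet_urls(dataset: str, split: str = "train") -> list[str]:
    """Query the endpoint over the network and return URLs for the split."""
    with urlopen(parquet_api_url(dataset), timeout=30) as response:
        payload = json.load(response)
    return select_split_urls(payload, split)
```

The resulting URLs could then be passed to `load_dataset("parquet", data_files=urls)`, which avoids rebuilding a schema from the drifting JSONL files.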

A row is kept when it has a substantive prompt plus at least one concrete tool error, environment diagnostic, or later user correction. Rows are scored higher for test/build failures, runtime exceptions, missing files or commands, explicit user feedback, and evidence that the trace continued after the failure.
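The keep-and-score rule above could look roughly like the following sketch. The regular expressions, weights, and the `MIN_PROMPT_CHARS` cutoff are all assumptions made for illustration; this card does not state the values the real filter uses, so the scores here will not match the reported 13.3 to 19.5 range.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns and weights; the real filter's values are not given in this card.
CATEGORY_RULES = {
    "test_or_assertion": (re.compile(r"assert|test.*fail", re.I), 3.0),
    "build_lint_compile": (re.compile(r"error TS\d+|Transform failed|lint", re.I), 3.0),
    "runtime_exception": (re.compile(r"Exception|Traceback|Error:"), 2.5),
    "missing_file_or_command": (
        re.compile(r"No such file|command not found|does not exist", re.I),
        2.0,
    ),
}
OTHER_ERROR_WEIGHT = 1.0
USER_FEEDBACK_WEIGHT = 2.0
CONTINUED_AFTER_FAILURE_WEIGHT = 1.0
MIN_PROMPT_CHARS = 40  # assumed stand-in for "substantive prompt"


@dataclass
class TraceRow:
    prompt: str
    diagnostics: list[str] = field(default_factory=list)
    user_corrections: list[str] = field(default_factory=list)
    continued_after_failure: bool = False


def score_row(row: TraceRow):
    """Return (score, categories) for a kept row, or None when the row is dropped."""
    if len(row.prompt.strip()) < MIN_PROMPT_CHARS:
        return None
    if not row.diagnostics and not row.user_corrections:
        return None

    score = 0.0
    categories = set()
    text = "\n".join(row.diagnostics)
    for name, (pattern, weight) in CATEGORY_RULES.items():
        if pattern.search(text):
            categories.add(name)
            score += weight
    if row.diagnostics and not categories:
        # A diagnostic exists but matched no named category.
        categories.add("other_marked_error")
        score += OTHER_ERROR_WEIGHT
    if row.user_corrections:
        categories.add("user_feedback")
        score += USER_FEEDBACK_WEIGHT
    if row.continued_after_failure:
        score += CONTINUED_AFTER_FAILURE_WEIGHT
    return score, sorted(categories)
```

Rows returning `None` are discarded; the rest can be sorted by score and cut at a minimum threshold, in the spirit of the "Minimum filter score" field in the table.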

Selected-data summary:

Source sessions: 64
Source files: 64
Score range: 13.3 to 19.5
Average score: 15.298

Category counts:

build_lint_compile: 40
command_error: 64
missing_file_or_command: 43
other_marked_error: 3
permission_auth: 3
runtime_exception: 21
test_or_assertion: 63
tool_schema_validation: 2
user_feedback: 62

First selected sample preview:

score: 19.5
categories: build_lint_compile, command_error, missing_file_or_command, runtime_exception, test_or_assertion, user_feedback
reward_terms: build_lint_compile, command_error, missing_file_or_command, runtime_exception, test_or_assertion, user_feedback, environment, diagnostic, promises:332, triggeruncaughtexception

prompt:
Analyze GitHub issue(s): https://github.com/badlogic/pi-mono/issues/2291

For each issue:

1. Read the issue in full, including all comments and linked issues/PRs.
2. Do not trust analysis written in the issue. Independently verify behavior and derive your own analysis from the code and execution path.

3. **For bugs**:
   - Ignore any root cause analysis in the issue (likely wrong)
   - Read all related code files in full (no truncation)
   - Trace the code path and identify the actual root cause
   - Propose a fix

4. **For feature requests**:
   - Do not trust implementation proposals in the issue without verification
   - Read all related code files in full (no truncation)
   - Propose the most concise implementation approach
   - List affected files and changes needed

Do NOT implement unless explicitly asked. Analyze and propose only.

privileged_context:
Tool/environment diagnostic 1 (test_or_assertion, runtime_exception, command_error):
node:internal/process/promises:332
    triggerUncaughtException(err, true /* fromPromise */);
    ^

Error: Transform failed with 3 errors:
/eval.ts:29:2: ERROR: Top-level await is currently not supported with the "cjs" output format
/eval.ts:31:22: ERROR: Top-level await is currently not supported with the "cjs" output format
/eval.ts:46:2: ERROR: Top-level await is currently not supported with the "cjs" output format
    at failureErrorWithLog ($WORKSPACE/node_modules/esbuild/lib/main.js:1748:15)
    at $WOR

[... trimmed ...]

el' does not exist in type 'SimpleStreamOptions'.
packages/coding-agent/src/core/extensions/runner.ts(242,46): error TS2304: Cannot find name 'ProviderConfig'.

Command exited with code 2

Tool/environment diagnostic 3 (test_or_assertion):
No changes made to packages/coding-agent/src/core/agent-session.ts. The replacement produced identical content. This might indicate an issue with special characters or the text not existing as expected.

Later user correction 1:
what'st he most concise fix? this sounds overly complex

Later user correction 2:
well, we can't just fix it for session_start then

Reward Function

The included trace_grounding_reward is intentionally lightweight: it rewards completions that read like concrete coding-agent responses and that mention terms grounded in the trace diagnostic. That is enough to exercise SDPO, Trackio, HF Jobs, LoRA push-to-Hub, and the filtered trace format end to end, but it does not check whether a proposed fix is correct.
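A heuristic of this kind might be sketched as follows. This is not the included trace_grounding_reward: the token pattern, the `ACTION_HINTS` list, the 0.7/0.3 weighting, and the saturation at five shared terms are illustrative assumptions. The signature assumes plain-string completions and a `privileged_context` column passed as a keyword argument, which may not match what the experimental SDPO trainer actually provides.

```python
import re

# Identifier-like tokens of four or more characters (paths, error codes, names).
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_./-]{3,}")
# Hypothetical cues that a completion proposes a concrete action.
ACTION_HINTS = ("fix", "change", "replace", "update", "run", "edit", "because")


def grounded_terms(text: str) -> set[str]:
    """Lower-cased tokens with trailing punctuation removed."""
    return {token.lower().rstrip("./-") for token in TOKEN.findall(text)}


def trace_grounding_reward_sketch(completions, privileged_context, **kwargs):
    """Return one reward in [0, 1] per completion."""
    rewards = []
    for completion, context in zip(completions, privileged_context):
        shared = grounded_terms(completion) & grounded_terms(context)
        grounding = min(len(shared) / 5.0, 1.0)  # saturates at five shared terms
        lowered = completion.lower()
        concreteness = 1.0 if any(hint in lowered for hint in ACTION_HINTS) else 0.0
        rewards.append(round(0.7 * grounding + 0.3 * concreteness, 4))
    return rewards
```

A completion that names `ProviderConfig` and `TS2304` from the diagnostic scores higher than a generic acknowledgement, but nothing here verifies that the proposed change compiles or passes tests.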

For a serious training run, replace the heuristic reward with a verifier that can replay or grade the task, such as tests, build checks, lint output, tool-call validation, or another sandboxed outcome signal.

Training Metrics

epoch: 0.375
total_flos: 0.0
train_loss: -0.3888257265401383
train_runtime: 952.6648
train_samples_per_second: 0.05
train_steps_per_second: 0.025
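The reported figures can be cross-checked against the run table with simple arithmetic, as sketched below. The variable names are illustrative, and the reading of the samples-per-step ratio is only an observation about the rounded numbers, not a statement about how the trainer counts samples.

```python
# Values copied from the run table and the metrics above.
max_steps = 24
selected_samples = 64
train_runtime_s = 952.6648
reported_samples_per_second = 0.05

# 24 steps over 64 selected samples matches the reported epoch.
epoch = max_steps / selected_samples

# 24 steps over the runtime matches the reported steps per second.
steps_per_second = max_steps / train_runtime_s

# Ratio of the two reported rates: roughly two samples per step,
# which may correspond to "Num generations: 2" in the table.
samples_per_step = reported_samples_per_second / steps_per_second

print(epoch)
print(round(steps_per_second, 3))
print(round(samples_per_step))
```

At 24 steps the run covers well under one pass over the 64 selected samples, which fits the card's framing of it as an experiment.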

Reproduce

hf jobs uv run <raw-gist-url-for-train_sdpo_pi_mono_full.py> \
  --flavor a10g-large \
  --timeout 3h \
  --secrets HF_TOKEN \
  --env HUB_MODEL_ID=burtenshaw/gemma-4-12b-sdpo-pi-mono-trace-feedback \
  --env TRACKIO_SPACE_ID=burtenshaw/trackio \
  -- \
  --mode train \
  --run-name gemma4-12b-sdpo-pi-mono-full-20260604

Limitations

The data is filtered from existing traces, so it reflects the trace collector's task distribution and failure modes.
The reward is a smoke-test heuristic and should not be interpreted as a reliable coding benchmark.
The model is pushed as a PEFT LoRA adapter; load it with the base model listed above.
SDPO is experimental in TRL, so pin versions for long-running comparisons.
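Loading the adapter on top of the base model could look like the sketch below, using `PeftModel.from_pretrained` from PEFT. Whether `AutoModelForCausalLM` is the right auto class for this base checkpoint is an assumption, as are the `device_map` and `torch_dtype` settings; the function `load_adapter` is a hypothetical helper, not part of the release.

```python
BASE_MODEL_ID = "google/gemma-4-12B-it"
ADAPTER_ID = "burtenshaw/gemma-4-12b-sdpo-pi-mono-trace-feedback"


def load_adapter(base_model_id: str = BASE_MODEL_ID, adapter_id: str = ADAPTER_ID):
    """Load the base model, then attach the LoRA adapter weights."""
    # Imported lazily so this module can be inspected without the heavy
    # dependencies (transformers, peft, accelerate) installed.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    # Assumption: the base checkpoint loads through AutoModelForCausalLM.
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        device_map="auto",   # requires accelerate
        torch_dtype="auto",
    )
    model = PeftModel.from_pretrained(base_model, adapter_id)
    model.eval()
    return tokenizer, model
```

Calling `tokenizer, model = load_adapter()` downloads both repositories from the Hub, so it needs network access and, if the base model is gated, an authenticated token.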
