gemma-4-12b-sdpo-pi-mono-trace-feedback

LoRA adapter trained with TRL SDPO on filtered badlogicgames/pi-mono coding-agent traces that contain concrete tool errors or later user corrections.

This adapter is a self-distillation experiment, not a general-purpose coding model release. It uses TRL's experimental SDPO trainer with include_environment_feedback=True, so filtered trace diagnostics are supplied as privileged_context for teacher-conditioned reprompts.

Training Run

Field                     Value
Hub repo                  burtenshaw/gemma-4-12b-sdpo-pi-mono-trace-feedback
Base model                google/gemma-4-12B-it
Dataset                   badlogicgames/pi-mono
Dataset split             train
Selected samples          64
Minimum filter score      13.3
Average filter score      15.298
Max steps                 24
Learning rate             5e-05
Training method           4-bit NF4 QLoRA + TRL SDPO
LoRA rank                 8
LoRA alpha                16
Num generations           2
Success reward threshold  0.35
Max prompt length         768
Max completion length     160
  • Trackio: Trackio run
  • Run name: gemma4-12b-sdpo-pi-mono-full-20260604
  • Output directory: outputs/sdpo-pi-mono-trace-feedback
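
As a rough illustration of how the quantization and LoRA rows in the table above could map onto library configuration, here is a minimal sketch using `BitsAndBytesConfig` from transformers and `LoraConfig` from PEFT. The rank, alpha, and NF4 quantization type come from the table; the compute dtype and target modules are not stated in this card and are assumptions, as are the names `QLORA_HPARAMS` and `build_qlora_configs`.

```python
# Hyperparameters copied from the Training Run table.
QLORA_HPARAMS = {
    "lora_rank": 8,
    "lora_alpha": 16,
    "quant_type": "nf4",
}


def build_qlora_configs(hparams=QLORA_HPARAMS):
    """Return (BitsAndBytesConfig, LoraConfig) for a 4-bit NF4 QLoRA setup."""
    # Imported lazily so the hyperparameter dict can be inspected without a GPU stack.
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type=hparams["quant_type"],
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: not stated in the card
    )
    lora_config = LoraConfig(
        r=hparams["lora_rank"],
        lora_alpha=hparams["lora_alpha"],
        target_modules="all-linear",  # assumption: target modules are not stated in the card
        task_type="CAUSAL_LM",
    )
    return quant_config, lora_config
```

With rank 8 and alpha 16, the LoRA update is scaled by alpha / rank = 2 under PEFT's default scaling.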

Data Filtering

The source data is badlogicgames/pi-mono. The script uses the Dataset Viewer parquet export by default because the raw JSONL files have schema drift that can make direct load_dataset() reconstruction brittle.
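
One way to read from the parquet export rather than the raw JSONL files is to point `load_dataset()` at the `refs/convert/parquet` branch that the Dataset Viewer maintains. The sketch below assumes that branch exists for this dataset and that the default configuration is the one wanted; the training script may load the export differently, and `load_trace_split` is a hypothetical helper name.

```python
DATASET_ID = "badlogicgames/pi-mono"
# Branch where the Dataset Viewer publishes its auto-converted parquet files.
PARQUET_REVISION = "refs/convert/parquet"


def load_trace_split(split="train"):
    """Load one split from the parquet export instead of the raw JSONL files."""
    # Imported lazily so the constants above can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, revision=PARQUET_REVISION, split=split)
```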

A row is kept when it has a substantive prompt plus at least one concrete tool error, environment diagnostic, or later user correction. Rows are scored higher for test/build failures, runtime exceptions, missing files or commands, explicit user feedback, and evidence that the trace continued after the failure.
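
The keep-and-score rule above can be sketched roughly as follows. This is an illustration only: the regex patterns, weights, minimum prompt length, and the `score_row` name are all hypothetical, since the card does not list the script's actual values, and only a subset of the categories is covered.

```python
import re

# Hypothetical category patterns and weights; the real script's values are not given in this card.
CATEGORY_RULES = {
    "test_or_assertion": (r"assertion|tests? failed", 3.0),
    "build_lint_compile": (r"error TS\d+|compile error", 3.0),
    "runtime_exception": (r"exception|traceback", 2.5),
    "missing_file_or_command": (r"no such file|command not found", 2.0),
    "command_error": (r"exited with code [1-9]", 1.5),
}
USER_FEEDBACK_WEIGHT = 2.0
CONTINUED_AFTER_FAILURE_BONUS = 1.0
MIN_PROMPT_CHARS = 40  # hypothetical threshold for a "substantive" prompt


def score_row(prompt, diagnostics, user_corrections, turns_after_failure=0):
    """Return (score, categories); a score of 0.0 means the row is dropped."""
    if len(prompt.strip()) < MIN_PROMPT_CHARS:
        return 0.0, []

    text = "\n".join(diagnostics)
    categories = []
    score = 0.0
    for name, (pattern, weight) in CATEGORY_RULES.items():
        if re.search(pattern, text, re.IGNORECASE):
            categories.append(name)
            score += weight

    if user_corrections:
        categories.append("user_feedback")
        score += USER_FEEDBACK_WEIGHT

    # Keep the row only if there is at least one concrete error or correction.
    if not categories:
        return 0.0, []

    # Small bonus when the trace kept going after the failure.
    if turns_after_failure > 0:
        score += CONTINUED_AFTER_FAILURE_BONUS
    return score, categories
```

Rows would then be sorted by score and cut at a minimum threshold (13.3 in this run, on the script's own scale rather than the toy weights used here).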

Selected-data summary:

  • Source sessions: 64
  • Source files: 64
  • Score range: 13.3 to 19.5
  • Average score: 15.298

Category counts:

  • build_lint_compile: 40
  • command_error: 64
  • missing_file_or_command: 43
  • other_marked_error: 3
  • permission_auth: 3
  • runtime_exception: 21
  • test_or_assertion: 63
  • tool_schema_validation: 2
  • user_feedback: 62

First selected sample preview:

score: 19.5
categories: build_lint_compile, command_error, missing_file_or_command, runtime_exception, test_or_assertion, user_feedback
reward_terms: build_lint_compile, command_error, missing_file_or_command, runtime_exception, test_or_assertion, user_feedback, environment, diagnostic, promises:332, triggeruncaughtexception

prompt:
Analyze GitHub issue(s): https://github.com/badlogic/pi-mono/issues/2291

For each issue:

1. Read the issue in full, including all comments and linked issues/PRs.
2. Do not trust analysis written in the issue. Independently verify behavior and derive your own analysis from the code and execution path.

3. **For bugs**:
   - Ignore any root cause analysis in the issue (likely wrong)
   - Read all related code files in full (no truncation)
   - Trace the code path and identify the actual root cause
   - Propose a fix

4. **For feature requests**:
   - Do not trust implementation proposals in the issue without verification
   - Read all related code files in full (no truncation)
   - Propose the most concise implementation approach
   - List affected files and changes needed

Do NOT implement unless explicitly asked. Analyze and propose only.

privileged_context:
Tool/environment diagnostic 1 (test_or_assertion, runtime_exception, command_error):
node:internal/process/promises:332
    triggerUncaughtException(err, true /* fromPromise */);
    ^

Error: Transform failed with 3 errors:
/eval.ts:29:2: ERROR: Top-level await is currently not supported with the "cjs" output format
/eval.ts:31:22: ERROR: Top-level await is currently not supported with the "cjs" output format
/eval.ts:46:2: ERROR: Top-level await is currently not supported with the "cjs" output format
    at failureErrorWithLog ($WORKSPACE/node_modules/esbuild/lib/main.js:1748:15)
    at $WOR

[... trimmed ...]

el' does not exist in type 'SimpleStreamOptions'.
packages/coding-agent/src/core/extensions/runner.ts(242,46): error TS2304: Cannot find name 'ProviderConfig'.

Command exited with code 2

Tool/environment diagnostic 3 (test_or_assertion):
No changes made to packages/coding-agent/src/core/agent-session.ts. The replacement produced identical content. This might indicate an issue with special characters or the text not existing as expected.

Later user correction 1:
what'st he most concise fix? this sounds overly complex

Later user correction 2:
well, we can't just fix it for session_start then

Reward Function

The included trace_grounding_reward is intentionally lightweight. It rewards completions that look like concrete coding-agent responses and mention terms grounded in the trace diagnostic. That is enough to exercise SDPO, Trackio, HF Jobs, LoRA push-to-Hub, and the filtered trace format.
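
To make the idea concrete, here is a minimal sketch of a reward in this spirit. It is not the shipped trace_grounding_reward: the hint words, the term extraction, and the 50/50 weighting are assumptions chosen for illustration, and `trace_grounding_reward_sketch` is a hypothetical name. Only the 0.35 success threshold is taken from the Training Run table.

```python
import re

SUCCESS_THRESHOLD = 0.35  # from the Training Run table
# Hypothetical words that suggest a concrete coding-agent response.
ACTION_HINTS = ("fix", "error", "run", "file", "test", "build")


def grounding_terms(privileged_context, min_len=5):
    """Extract lowercase identifier-like terms from the diagnostic text."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]+", privileged_context)
    return {t.lower() for t in tokens if len(t) >= min_len}


def trace_grounding_reward_sketch(completion, privileged_context):
    """Score in [0, 1]: half for agent-like wording, half for grounded terms."""
    words = set(re.findall(r"[a-z_][a-z0-9_]+", completion.lower()))
    # Fraction of hint words that appear in the completion.
    style = sum(hint in words for hint in ACTION_HINTS) / len(ACTION_HINTS)
    # Fraction of diagnostic terms that the completion mentions.
    terms = grounding_terms(privileged_context)
    grounded = len(terms & words) / len(terms) if terms else 0.0
    return 0.5 * style + 0.5 * grounded
```

A completion that names the failing identifier and error code from the diagnostic scores above the threshold, while a generic acknowledgement scores zero. A heuristic like this can be gamed by echoing the diagnostic, which is one reason it is only suitable for smoke tests.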

For a serious training run, replace the heuristic reward with a verifier that can replay or grade the task, such as tests, build checks, lint output, tool-call validation, or another sandboxed outcome signal.

Training Metrics

  • epoch: 0.375
  • total_flos: 0.0
  • train_loss: -0.3888257265401383
  • train_runtime: 952.6648
  • train_samples_per_second: 0.05
  • train_steps_per_second: 0.025

Reproduce

hf jobs uv run <raw-gist-url-for-train_sdpo_pi_mono_full.py> \
  --flavor a10g-large \
  --timeout 3h \
  --secrets HF_TOKEN \
  --env HUB_MODEL_ID=burtenshaw/gemma-4-12b-sdpo-pi-mono-trace-feedback \
  --env TRACKIO_SPACE_ID=burtenshaw/trackio \
  -- \
  --mode train \
  --run-name gemma4-12b-sdpo-pi-mono-full-20260604

Limitations

  • The data is filtered from existing traces, so it reflects the trace collector's task distribution and failure modes.
  • The reward is a smoke-test heuristic and should not be interpreted as a reliable coding benchmark.
  • The model is pushed as a PEFT LoRA adapter; load it with the base model listed above.
  • SDPO is experimental in TRL, so pin versions for long-running comparisons.
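
Loading the adapter on top of the base model could look like the sketch below, which uses `PeftModel.from_pretrained` from PEFT. Whether `AutoModelForCausalLM` is the right auto class for this base model, and the bfloat16 dtype, are assumptions not confirmed by this card; `device_map="auto"` also requires the accelerate package. `load_adapter_model` is a hypothetical helper name.

```python
BASE_MODEL_ID = "google/gemma-4-12B-it"
ADAPTER_ID = "burtenshaw/gemma-4-12b-sdpo-pi-mono-trace-feedback"


def load_adapter_model(base_model_id=BASE_MODEL_ID, adapter_id=ADAPTER_ID):
    """Load the base model, then attach the LoRA adapter with PEFT."""
    # Imported lazily so the IDs above can be reused without the full stack installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        torch_dtype=torch.bfloat16,  # assumption: dtype is not stated in the card
        device_map="auto",
    )
    # Wrap the base model with the adapter weights from the Hub repo.
    model = PeftModel.from_pretrained(base_model, adapter_id)
    return model, tokenizer
```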