Add CHI-Bench eval results — agent harness: Hermes

#46
by hlnchen - opened

Adds CHI-Bench (actAVA) evaluation results for moonshotai/Kimi-K2.6.

  • Benchmark: actava/chi-bench (evaluation_framework: harbor)
  • Agent harness: Hermes (best-performing harness for this model)
  • Protocol: 75 managed-care tasks × 3 trials each; metric: pass@1 (%)
  • Scores (pass@1, %): Overall 15.6 | PA 18.7 | UM 21.3 | CM 6.7; reliability (pass^3): 6.7
  • Source: CHI-Bench paper, arXiv:2605.16679
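
For reviewers who want to sanity-check the numbers, here is a minimal sketch of how pass@1 and pass^3 could be computed from per-task trial outcomes. It assumes pass@1 is the per-trial success rate averaged over tasks and pass^3 is the share of tasks where all three trials succeed; the paper's exact definitions may differ. The function names and the `toy` data are hypothetical and are not taken from the harbor framework or from the actual CHI-Bench runs.

```python
# Sketch only: metric definitions are assumptions, not the CHI-Bench reference code.

def pass_at_1(results):
    """Per-trial success rate, averaged over tasks, as a percentage."""
    per_task = [sum(trials) / len(trials) for trials in results.values()]
    return 100.0 * sum(per_task) / len(per_task)


def pass_hat_k(results):
    """Percentage of tasks where every trial succeeded (pass^k, k = trials per task)."""
    solved_all = [all(trials) for trials in results.values()]
    return 100.0 * sum(solved_all) / len(solved_all)


# Hypothetical outcomes: task id -> one boolean per trial (3 trials each).
toy = {
    "task_a": [True, True, True],
    "task_b": [True, False, False],
    "task_c": [False, False, False],
    "task_d": [False, True, False],
}

print(f"pass@1 = {pass_at_1(toy):.1f}%")
print(f"pass^3 = {pass_hat_k(toy):.1f}%")

# Consistency check on the reported scores, ASSUMING the three categories
# (PA, UM, CM) are equally weighted in the Overall score.
category_scores = {"PA": 18.7, "UM": 21.3, "CM": 6.7}
overall_from_categories = round(sum(category_scores.values()) / len(category_scores), 1)
print(f"mean of category scores = {overall_from_categories}")
```

Under the equal-weighting assumption, the mean of the three category scores rounds to the reported Overall of 15.6.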

Submitted as community-provided results; close the PR if disputed.

