LFM 2.5 Audio 1.5B — ASR LoRA v2 (50K samples, Voice → Home Assistant Commands)

A LoRA adapter fine-tuned on LFM2.5-Audio-1.5B to convert spoken Home Assistant commands into structured function calls (for example, HassLightTurnOn|$area=hall).

Trained entirely on Apple Silicon using MLX. No cloud GPU required.

This is v2 — trained on the full 49,909-sample OHF-Voice dataset vs 950 samples in v1. More data, fewer steps, better generalisation.


What It Does

Takes audio of a spoken command and outputs a structured function call:

Input audio: "switch on the light in the hall"
Output:      HassLightTurnOn|$area=hall

Output format: FunctionName|$arg1=val1|$arg2=val2 — pipe-delimited, parseable.
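
Because the format is pipe-delimited, a few lines of Python are enough to turn the model output into a name and an argument dictionary. The sketch below is illustrative and not part of the repository; `parse_call` is a hypothetical helper, and it assumes argument values never contain a `|` character.

```python
def parse_call(text: str) -> tuple[str, dict[str, str]]:
    """Split 'Name|$k=v|$k2=v2' into (name, {k: v, k2: v2})."""
    name, *fields = text.strip().split("|")
    if not name:
        raise ValueError(f"missing function name: {text!r}")
    args = {}
    for field in fields:
        # Each argument must look like '$key=value'.
        key, sep, value = field.partition("=")
        if not key.startswith("$") or not sep:
            raise ValueError(f"malformed argument: {field!r}")
        args[key[1:]] = value  # drop the leading '$'
    return name, args


print(parse_call("HassLightTurnOn|$area=hall"))
# ('HassLightTurnOn', {'area': 'hall'})
```

A call without arguments, such as HassGetCurrentTime, parses to the name and an empty dictionary.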


Training Details

| Parameter | Value |
|---|---|
| Base model | mlx-community/LFM2.5-Audio-1.5B-8bit |
| Dataset | Paulescu/OHF-Voice-audio-20260504 (train split, 49,909 samples) |
| LoRA rank | 16 |
| LoRA alpha | 32.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable params | 884,736 / 168,786,688 total (0.52%) |
| Optimizer | AdamW, lr=5e-5, cosine decay |
| Warmup steps | 500 |
| Effective batch size | 8 (grad_accumulation=8) |
| Best checkpoint | Step 4,000 (val_loss 0.020) |
| Hardware | Apple Silicon (M-series) |
| Framework | MLX |
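
Two of the numbers above can be checked with simple arithmetic. The sketch below recomputes the trainable-parameter share and the LoRA scaling factor. The alpha / rank scaling is the convention from the original LoRA formulation and is assumed here; the training code in the repository may apply it differently.

```python
rank, alpha = 16, 32.0
trainable, total = 884_736, 168_786_688

# Conventional LoRA scaling applied to the low-rank update (assumed convention).
scale = alpha / rank

# Share of parameters that receive gradients.
trainable_pct = 100 * trainable / total

print(f"LoRA scale: {scale}")                 # LoRA scale: 2.0
print(f"Trainable share: {trainable_pct:.2f}%")  # Trainable share: 0.52%
```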

Evaluation Results

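Evaluated on the test split of Paulescu/OHF-Voice-audio-20260504: 397 stratified samples across 41 function classes, system prompt "Perform ASR.", temperature 0.0. The cap is 10 samples per function; some functions contribute fewer (for example, HassGetCurrentDate has 6 and HassGetCurrentTime has 7), which is why the total is 397 rather than 410.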
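
A per-class cap of this kind can be sketched in a few lines. This is a minimal illustration, not the logic of scripts/lfm_asr_eval.py; `stratified_sample`, its parameters, and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(examples, key, per_class=10, seed=0):
    """Pick at most `per_class` examples from each class, reproducibly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ex in examples:
        by_class[key(ex)].append(ex)
    picked = []
    for label in sorted(by_class):
        pool = by_class[label]
        # Classes with fewer than `per_class` examples contribute everything.
        picked.extend(rng.sample(pool, min(per_class, len(pool))))
    return picked


# Toy data: class "A" has 25 examples, class "B" only 6.
examples = [("A", i) for i in range(25)] + [("B", i) for i in range(6)]
subset = stratified_sample(examples, key=lambda ex: ex[0])
print(len(subset))  # 16
```

With a cap of 10, the small class yields all 6 of its examples, so the total falls short of 10 per class.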

| Metric | v1 (950 samples, 10K steps) | v2 (50K samples, 4K steps) | Cookbook (full FT, A100) |
|---|---|---|---|
| Format compliance | 99.0% | 100.0% | 99.7% |
| Function-name accuracy | 82.0% | 92.2% | 98.8% |
| Argument accuracy | 61.0% | 70.0% | ~97% |

Key takeaway: going from 950 to 49,909 training samples raised function-name accuracy by 10.2 percentage points (82.0% → 92.2%) and argument accuracy by 9.0 percentage points (61.0% → 70.0%), with fewer training steps (4K vs 10K). In this setup, data volume mattered more than training duration for LoRA fine-tuning.
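
The card does not spell out how the three metrics are computed. One plausible reading is sketched below as an assumption, not as the logic of the evaluation script: an output is format-compliant if it matches the pipe-delimited grammar, the name counts as correct if it equals the reference name, and the arguments count as correct only if the name matches and the argument dictionaries are identical. The argument names in the toy data are made up.

```python
import re

# Grammar assumed from the card: Name, then zero or more |$key=value fields.
CALL_RE = re.compile(r"^[A-Za-z]\w*(\|\$\w+=[^|]+)*$")


def split_call(text):
    """Split a well-formed call into (name, args dict)."""
    name, *fields = text.split("|")
    return name, dict(f[1:].split("=", 1) for f in fields)


def score(pairs):
    """pairs: list of (prediction, reference) strings."""
    n = len(pairs)
    fmt = name_ok = args_ok = 0
    for pred, ref in pairs:
        if not CALL_RE.match(pred):
            continue  # malformed output scores zero on every metric
        fmt += 1
        p_name, p_args = split_call(pred)
        r_name, r_args = split_call(ref)
        if p_name == r_name:
            name_ok += 1
            if p_args == r_args:
                args_ok += 1
    return {"format": fmt / n, "name": name_ok / n, "args": args_ok / n}


pairs = [
    ("HassLightTurnOn|$area=hall", "HassLightTurnOn|$area=hall"),
    ("HassSetVolumeRelative|$volume_step=up", "HassSetVolumeRelative|$volume_step=down"),
    ("HassCancelAllTimers", "HassStartTimer|$minutes=5"),
    ("switch on the light", "HassLightTurnOn|$area=hall"),
]
print(score(pairs))  # {'format': 0.75, 'name': 0.5, 'args': 0.25}
```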

Per-function highlights

| Function | Name acc | Arg acc | Note |
|---|---|---|---|
| HassStartTimer | 10/10 | 10/10 | |
| HassSetPosition | 10/10 | 10/10 | |
| HassGetCurrentDate | 6/6 | 6/6 | |
| HassGetCurrentTime | 7/7 | 7/7 | |
| HassCancelAllTimers | 10/10 | 9/10 | |
| HassSetVolume | 10/10 | 9/10 | |
| HassRespond | 0/5 | 0/5 | known weak spot |
| HassSetVolumeRelative | 10/10 | 2/10 | relative values are hard |

Key Finding: Modality Positioning

LFM 2.5 Audio's __call__ interface appends audio embeddings after all text tokens. Under causal attention each position attends only to earlier positions, so the assistant tokens cannot attend to audio that is placed after them. Trained with that layout, the model produces plain transcription regardless of fine-tuning.

The fix: use model._prefill with an explicit modalities array that places AUDIO_IN tokens inside the user turn, before the assistant tokens. This matches the inference-time ChatState.add_audio() layout.

Wrong  (model.__call__):   [system][user " "][assistant "HassLightTurnOn..."][AUDIO]
Correct (model._prefill):  [system][user AUDIO×N ][assistant "HassLightTurnOn..."]

Impact: val_loss 1.40 → 0.05 in 500 steps; model starts producing function calls instead of transcriptions.

Implemented in train/losses/lfm_audio_loss.py.
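
The effect of the two layouts can be illustrated without any model code. The toy sketch below is not taken from lfm_audio_loss.py; it labels each sequence position with a segment tag and counts how many audio positions each assistant position can attend to under a causal mask (position i sees positions 0 through i). The segment lengths are arbitrary.

```python
def visible_audio(layout):
    """For each assistant position, count the audio positions it can attend to."""
    counts = []
    for i, tag in enumerate(layout):
        if tag == "assistant":
            # Causal mask: only positions 0..i are visible from position i.
            counts.append(sum(1 for t in layout[: i + 1] if t == "audio"))
    return counts


# Audio appended after the assistant tokens (the __call__ layout).
wrong = ["system", "user", "assistant", "assistant", "audio", "audio", "audio"]
# Audio placed inside the user turn (the _prefill layout).
correct = ["system", "user", "audio", "audio", "audio", "assistant", "assistant"]

print(visible_audio(wrong))    # [0, 0]
print(visible_audio(correct))  # [3, 3]
```

In the first layout the assistant tokens see no audio at all, so the loss cannot teach the model to condition its function call on what was spoken.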


Usage

import mlx.core as mx
from mlx_audio.sts.models.lfm_audio import LFM2AudioModel, LFM2AudioProcessor
from mlx_audio.sts.models.lfm_audio.processor import ChatState
from mlx_audio.sts.models.lfm_audio.model import LFMModality

# Load base model + adapter
processor = LFM2AudioProcessor.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-8bit")
model     = LFM2AudioModel.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-8bit")

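# Apply LoRA and load weights.
# train.lora lives in the akashicMarga/mlx-audio-train repo (not in mlx_audio),
# so run this from a checkout of that repo.
from train.lora import apply_lora, load_adapters, LoRAConfig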
apply_lora(model, LoRAConfig(model_type="lfm_audio", rank=16, alpha=32.0))
load_adapters(model, "path/to/adapters.safetensors")
model.eval()

# Load the recorded command and generate the function call
import soundfile as sf
audio_numpy, sr = sf.read("command.wav")
audio_mx = mx.array(audio_numpy)

chat = ChatState(processor)
chat.new_turn("system"); chat.add_text("Perform ASR."); chat.end_turn()
chat.new_turn("user");   chat.add_audio(audio_mx, sample_rate=sr); chat.end_turn()
chat.new_turn("assistant")

output = ""
for token, modality in model.generate_from_chat_state(
    chat, mode="sequential", max_new_tokens=64, temperature=0.0, top_k=1
):
    if modality == LFMModality.TEXT:
        tok_id = int(token.item())
        if tok_id == 7:  # <|im_end|>
            break
        output += processor.tokenizer.decode([tok_id])

print(output)  # e.g. HassLightTurnOn|$area=hall

Or run the included demo:

git clone https://github.com/akashicMarga/mlx-audio-train
cd mlx-audio-train
python scripts/lfm_asr_demo.py --adapter path/to/checkpoint-best
# open http://localhost:7860

Reproduce / Experiment

Full training code and configs: akashicMarga/mlx-audio-train.

# Prepare full 50K dataset (requires HF_TOKEN for gated dataset)
python scripts/prepare_ohf_voice.py --output-dir data/ohf_voice

# Train (early-stop at ~4K steps or let it run to 20K)
caffeinate -i python scripts/train.py --config configs/lfm_audio_asr_50k.yaml

# Eval
python scripts/lfm_asr_eval.py \
    --adapter checkpoints/lfm-audio-asr-50k/checkpoint-best \
    --samples-per-function 10 \
    --output evals/

Things worth experimenting with:

  • Higher LoRA rank (32, 64) — more capacity for exact argument values
  • Full fine-tuning — remove LoRA, train all params; should approach cookbook accuracy
  • Unfreeze audio encoder — may help with accented or noisy speech
  • Longer training (20K steps) — val_loss was still healthy at 4K; more epochs may push arg accuracy further
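
To put the rank and step-count suggestions in context, the sketch below estimates how many passes over the training split each run length represents and how the trainable-parameter count would grow with rank. It rests on two assumptions: each step consumes one effective batch of 8 samples, and LoRA parameters scale linearly with rank for a fixed set of target modules (true for the standard A/B factorisation). `epochs` and `trainable_at` are hypothetical helpers.

```python
TRAIN_SAMPLES = 49_909
EFFECTIVE_BATCH = 8
BASE_RANK, BASE_TRAINABLE = 16, 884_736


def epochs(steps):
    """Approximate passes over the training split after `steps` steps."""
    return steps * EFFECTIVE_BATCH / TRAIN_SAMPLES


def trainable_at(rank):
    """Projected trainable parameters, assuming linear scaling with rank."""
    return BASE_TRAINABLE * rank // BASE_RANK


for steps in (4_000, 20_000):
    print(f"{steps:>6} steps ~ {epochs(steps):.2f} epochs")
for rank in (16, 32, 64):
    print(f"rank {rank:>2}: {trainable_at(rank):,} trainable params")
```

Under these assumptions the step-4,000 checkpoint has seen less than one full pass over the data (about 0.64 epochs), while 20K steps would be roughly 3.2 epochs.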

Citation / Acknowledgements
