LFM 2.5 Audio 1.5B — ASR LoRA v2 (50K samples, Voice → Home Assistant Commands)

LoRA adapter fine-tuned on LFM2.5-Audio-1.5B to convert spoken Home Assistant commands into structured function calls (e.g. HassLightTurnOn|$area=hall).

Trained entirely on Apple Silicon using MLX. No cloud GPU required.

This is v2 — trained on the full 49,909-sample OHF-Voice dataset vs 950 samples in v1. More data, fewer steps, better generalisation.

What It Does

Takes audio of a spoken command and outputs a structured function call:

Input audio: "switch on the light in the hall"
Output:      HassLightTurnOn|$area=hall

Output format: FunctionName|$arg1=val1|$arg2=val2 — pipe-delimited, parseable.
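Because the format is a plain pipe-delimited string, a few lines of Python are enough to turn it into a name and an argument dictionary. The helper below is a minimal sketch and is not part of the released code: `parse_function_call` is a hypothetical name, and it assumes argument values never contain a `|` character, since no escaping rule is described here.

```python
from typing import Dict, Tuple


def parse_function_call(text: str) -> Tuple[str, Dict[str, str]]:
    """Split 'Name|$key=value|...' into (name, {key: value}).

    Hypothetical helper: assumes values contain no '|' characters.
    """
    parts = text.strip().split("|")
    name = parts[0]
    if not name:
        raise ValueError(f"missing function name in {text!r}")
    args: Dict[str, str] = {}
    for part in parts[1:]:
        # Every argument is expected to look like '$key=value'.
        if not part.startswith("$") or "=" not in part:
            raise ValueError(f"malformed argument {part!r} in {text!r}")
        key, value = part[1:].split("=", 1)
        args[key] = value
    return name, args


print(parse_function_call("HassLightTurnOn|$area=hall"))
# ('HassLightTurnOn', {'area': 'hall'})
```

A call without arguments, such as a bare function name, parses to the name and an empty dictionary.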

Training Details

Setting	Value
Base model	`mlx-community/LFM2.5-Audio-1.5B-8bit`
Dataset	`Paulescu/OHF-Voice-audio-20260504` (train split, 49,909 samples)
LoRA rank	16
LoRA alpha	32.0
Target modules	`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
Trainable params	884,736 / 168,786,688 total (0.52%)
Optimizer	AdamW, lr=5e-5, cosine decay
Warmup steps	500
Effective batch size	8 (grad_accumulation=8)
Best checkpoint	Step 4,000 (val_loss 0.020)
Hardware	Apple Silicon (M-series)
Framework	MLX
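To make the optimizer row concrete, here is a rough sketch of a warmup-plus-cosine learning-rate curve using the values from the table (peak 5e-5, 500 warmup steps). The shape of the warmup (linear from zero), the 20,000-step horizon, and the floor of 0 are assumptions for illustration; the actual schedule in the training config may differ.

```python
import math


def learning_rate(step: int, peak_lr: float = 5e-5, warmup_steps: int = 500,
                  total_steps: int = 20_000, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    total_steps and min_lr are assumed values, not taken from the config.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clipped to 1.0 after the horizon.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 250, 500, 4_000, 20_000):
    print(step, f"{learning_rate(step):.2e}")
```

Under these assumptions the rate at step 4,000 is still close to the peak, since only a small part of the cosine phase has elapsed.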

Evaluation Results

Evaluated on the Paulescu/OHF-Voice-audio-20260504 test split: 397 stratified samples (up to 10 per function across 41 classes; some functions have fewer than 10 test examples), system prompt "Perform ASR.", temperature 0.0.
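A cap of 10 per function over 41 classes gives at most 410 samples, so a total of 397 is what one would expect if a few functions have fewer than 10 test examples. The sketch below shows one way such a capped, stratified subset could be drawn. It is an illustration only: `stratified_sample`, the `(audio_id, target)` tuple layout, and the fixed seed are assumptions, not the logic of scripts/lfm_asr_eval.py.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Example = Tuple[str, str]  # (audio_id, target string such as 'Name|$k=v')


def stratified_sample(examples: Sequence[Example], per_class: int = 10,
                      seed: int = 0) -> List[Example]:
    """Draw up to per_class examples for each function name."""
    by_function: Dict[str, List[Example]] = defaultdict(list)
    for example in examples:
        # The function name is everything before the first pipe.
        function_name = example[1].split("|", 1)[0]
        by_function[function_name].append(example)
    rng = random.Random(seed)
    sample: List[Example] = []
    for function_name in sorted(by_function):
        group = by_function[function_name]
        # Classes with fewer than per_class examples contribute all of them.
        sample.extend(rng.sample(group, min(per_class, len(group))))
    return sample


# Toy data: one class with 25 examples and one with only 6.
toy = [(f"a{i}", "HassLightTurnOn|$area=hall") for i in range(25)]
toy += [(f"b{i}", "HassGetCurrentDate") for i in range(6)]
print(len(stratified_sample(toy)))  # 16
```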

Metric	v1 (950 samples, 10K steps)	v2 (50K samples, 4K steps)	Cookbook (full FT, A100)
Format compliance	99.0%	100.0%	99.7%
Function-name accuracy	82.0%	92.2%	98.8%
Argument accuracy	61.0%	70.0%	~97%

Key takeaway: going from 950 to 49,909 training samples raised function-name accuracy by 10.2 percentage points (82.0% → 92.2%) and argument accuracy by 9.0 points (61.0% → 70.0%), with fewer training steps (4K vs 10K). In this comparison, data volume mattered more than training duration.
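The exact metric definitions live in the eval script and are not restated here, so the following is only a plausible reading of the three rows. It assumes format compliance means the output matches the Name|$key=value grammar, function-name accuracy means the name matches the reference, and argument accuracy means the name and all arguments match exactly. The function names `parse` and `score` and the toy strings (including their argument names) are made up for illustration.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumed grammar: a name followed by zero or more '|$key=value' groups.
CALL = re.compile(r"[A-Za-z]\w*(\|\$\w+=[^|]+)*")


def parse(text: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (name, args) for a well-formed call, or None otherwise."""
    text = text.strip()
    if not CALL.fullmatch(text):
        return None
    name, *raw_args = text.split("|")
    return name, dict(arg[1:].split("=", 1) for arg in raw_args)


def score(predictions: List[str], references: List[str]) -> Dict[str, float]:
    """Compute the three rates over paired predictions and references."""
    total = len(references)
    if total == 0:
        raise ValueError("no references to score against")
    well_formed = name_hits = arg_hits = 0
    for pred_text, ref_text in zip(predictions, references):
        pred = parse(pred_text)
        if pred is None:
            continue  # a plain transcription fails all three metrics
        well_formed += 1
        ref = parse(ref_text)
        if ref is not None and pred[0] == ref[0]:
            name_hits += 1
            # Assumption: arguments only count when the name is also right.
            if pred[1] == ref[1]:
                arg_hits += 1
    return {
        "format_compliance": well_formed / total,
        "function_name_accuracy": name_hits / total,
        "argument_accuracy": arg_hits / total,
    }


# Toy strings, not drawn from the dataset.
preds = ["HassLightTurnOn|$area=hall", "HassSetVolume|$level=50", "switch on the light"]
refs = ["HassLightTurnOn|$area=hall", "HassSetVolume|$level=60", "HassLightTurnOn|$area=hall"]
print(score(preds, refs))
```

In the toy run, two of three outputs are well formed and carry the right name, while only one also has the right argument value.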

Per-function highlights

Function	Name acc	Arg acc
HassStartTimer	10/10	10/10
HassSetPosition	10/10	10/10
HassGetCurrentDate	6/6	6/6
HassGetCurrentTime	7/7	7/7
HassCancelAllTimers	10/10	9/10
HassSetVolume	10/10	9/10
HassRespond	0/5	0/5 ← known weak spot
HassSetVolumeRelative	10/10	2/10 ← relative values hard

Key Finding: Modality Positioning

LFM 2.5 Audio's __call__ interface appends audio embeddings after all text tokens. Under causal attention, the assistant tokens then precede the audio and cannot attend to it during training, so the adapter never learns to map audio to function calls and the model keeps producing plain transcriptions regardless of fine-tuning.

The fix: use model._prefill with an explicit modalities array that places AUDIO_IN tokens inside the user turn, before the assistant tokens. This matches the inference-time ChatState.add_audio() layout.

Wrong  (model.__call__):   [system][user " "][assistant "HassLightTurnOn..."][AUDIO]
Correct (model._prefill):  [system][user AUDIO×N ][assistant "HassLightTurnOn..."]

Impact: val_loss 1.40 → 0.05 in 500 steps; model starts producing function calls instead of transcriptions.

Implemented in train/losses/lfm_audio_loss.py.
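The ordering argument can be checked with a toy model of causal attention, where position i can only see positions at or before i. The sketch below does not call mlx-audio and does not reproduce the `_prefill` signature; the segment labels and lengths are arbitrary placeholders chosen to mirror the two layouts above.

```python
from typing import List, Tuple


def expand(segments: List[Tuple[str, int]]) -> List[str]:
    """Turn [(label, length), ...] into one label per sequence position."""
    return [label for label, length in segments for _ in range(length)]


def assistant_sees_all_audio(tokens: List[str]) -> bool:
    """True if every audio position precedes every assistant position.

    Under causal attention a token attends only to earlier positions,
    so this is the condition for the response to be conditioned on audio.
    """
    audio = [i for i, label in enumerate(tokens) if label == "audio"]
    assistant = [i for i, label in enumerate(tokens) if label == "assistant"]
    return max(audio) < min(assistant)


# Placeholder lengths: 4 system tokens, 20 audio frames, 6 assistant tokens.
wrong = expand([("system", 4), ("user", 1), ("assistant", 6), ("audio", 20)])
correct = expand([("system", 4), ("audio", 20), ("assistant", 6)])

print(assistant_sees_all_audio(wrong))    # False
print(assistant_sees_all_audio(correct))  # True
```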

Usage

import mlx.core as mx
from mlx_audio.sts.models.lfm_audio import LFM2AudioModel, LFM2AudioProcessor
from mlx_audio.sts.models.lfm_audio.processor import ChatState
from mlx_audio.sts.models.lfm_audio.model import LFMModality

# Load base model + adapter
processor = LFM2AudioProcessor.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-8bit")
model     = LFM2AudioModel.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-8bit")

# Apply LoRA and load weights
from train.lora import apply_lora, load_adapters, LoRAConfig
apply_lora(model, LoRAConfig(model_type="lfm_audio", rank=16, alpha=32.0))
load_adapters(model, "path/to/adapters.safetensors")
model.eval()

# Transcribe
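import soundfile as sf
# Read as float32; average the channels if the file is not mono
audio_numpy, sr = sf.read("command.wav", dtype="float32")
if audio_numpy.ndim > 1:
    audio_numpy = audio_numpy.mean(axis=1)
audio_mx = mx.array(audio_numpy)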

chat = ChatState(processor)
chat.new_turn("system"); chat.add_text("Perform ASR."); chat.end_turn()
chat.new_turn("user");   chat.add_audio(audio_mx, sample_rate=sr); chat.end_turn()
chat.new_turn("assistant")

output = ""
for token, modality in model.generate_from_chat_state(
    chat, mode="sequential", max_new_tokens=64, temperature=0.0, top_k=1
):
    if modality == LFMModality.TEXT:
        tok_id = int(token.item())
        if tok_id == 7:  # <|im_end|>
            break
        output += processor.tokenizer.decode([tok_id])

print(output)  # e.g. HassLightTurnOn|$area=hall

Or run the included demo:

git clone https://github.com/akashicMarga/mlx-audio-train
cd mlx-audio-train
python scripts/lfm_asr_demo.py --adapter path/to/checkpoint-best
# open http://localhost:7860

Reproduce / Experiment

Full training code and configs: akashicMarga/mlx-audio-train.

# Prepare full 50K dataset (requires HF_TOKEN for gated dataset)
python scripts/prepare_ohf_voice.py --output-dir data/ohf_voice

# Train (early-stop at ~4K steps or let it run to 20K)
caffeinate -i python scripts/train.py --config configs/lfm_audio_asr_50k.yaml

# Eval
python scripts/lfm_asr_eval.py \
    --adapter checkpoints/lfm-audio-asr-50k/checkpoint-best \
    --samples-per-function 10 \
    --output evals/

Things worth experimenting with:

Higher LoRA rank (32, 64) — more capacity for exact argument values
Full fine-tuning — remove LoRA, train all params; should approach cookbook accuracy
Unfreeze audio encoder — may help with accented or noisy speech
Longer training (20K steps) — val_loss was still healthy at 4K; more epochs may push arg accuracy further
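For the rank experiments, it helps to know how the adapter size moves. In standard LoRA, a layer with input size d_in and output size d_out adds rank × (d_in + d_out) parameters, so the trainable count grows linearly with rank. The sketch below applies that to the 884,736 figure reported at rank 16, assuming the same target modules are kept. The alpha / rank scaling factor is the usual LoRA convention and is assumed, not confirmed, to be what train.lora uses.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added to one linear layer: A (rank x d_in) + B (d_out x rank)."""
    return rank * (d_in + d_out)


def scaled_trainable(base_params: int, base_rank: int, new_rank: int) -> int:
    """Trainable parameters at new_rank, given the count at base_rank."""
    return base_params * new_rank // base_rank


BASE_PARAMS, BASE_RANK, ALPHA = 884_736, 16, 32.0  # values from the table above

for rank in (16, 32, 64):
    params = scaled_trainable(BASE_PARAMS, BASE_RANK, rank)
    # Assumed convention: the LoRA update is multiplied by alpha / rank.
    print(f"rank={rank:>2}  trainable={params:,}  scale={ALPHA / rank}")
```

If alpha stays at 32.0 while the rank doubles, the assumed scaling factor halves, so it may be worth raising alpha together with the rank to keep the update magnitude comparable.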

Citation / Acknowledgements

Base model: LiquidAI/LFM2.5-Audio-1.5B
Dataset: Paulescu/OHF-Voice-audio-20260504
Cookbook reference: Liquid4All/cookbook
Training framework: mlx-audio
v1 adapter: akashicmarga/LFM2.5-Audio-1.5B-ASR-LoRA

