LFM 2.5 Audio 1.5B — ASR LoRA v2 (50K samples, Voice → Home Assistant Commands)
LoRA adapter fine-tuned on LFM2.5-Audio-1.5B to convert spoken Home Assistant commands into structured function calls (e.g. HassLightTurnOn|$area=hall).
Trained entirely on Apple Silicon using MLX. No cloud GPU required.
This is v2 — trained on the full 49,909-sample OHF-Voice dataset vs 950 samples in v1. More data, fewer steps, better generalisation.
What It Does
Takes audio of a spoken command and outputs a structured function call:
Input audio: "switch on the light in the hall"
Output: HassLightTurnOn|$area=hall
Output format: FunctionName|$arg1=val1|$arg2=val2 — pipe-delimited, parseable.
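Because the format is pipe-delimited, a few lines of Python are enough to turn the model output into a function name and an argument dictionary. The following is a minimal sketch, not code from the repository: `parse_function_call` is a hypothetical helper, and it assumes argument values never contain a `|` character.

```python
from typing import Dict, Tuple


def parse_function_call(text: str) -> Tuple[str, Dict[str, str]]:
    """Split 'Name|$key=value|...' into (name, {key: value})."""
    parts = text.strip().split("|")
    name = parts[0]
    if not name:
        raise ValueError(f"missing function name in {text!r}")
    args: Dict[str, str] = {}
    for part in parts[1:]:
        # Each argument is expected to look like '$key=value'.
        if not part.startswith("$") or "=" not in part:
            raise ValueError(f"malformed argument {part!r} in {text!r}")
        key, value = part[1:].split("=", 1)
        args[key] = value
    return name, args


print(parse_function_call("HassLightTurnOn|$area=hall"))
# ('HassLightTurnOn', {'area': 'hall'})
```

Raising `ValueError` on malformed input keeps a stray plain transcription from being dispatched to Home Assistant as if it were a command.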
Training Details
| Setting | Value |
|---|---|
| Base model | mlx-community/LFM2.5-Audio-1.5B-8bit |
| Dataset | Paulescu/OHF-Voice-audio-20260504 (train split, 49,909 samples) |
| LoRA rank | 16 |
| LoRA alpha | 32.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable params | 884,736 / 168,786,688 total (0.52%) |
| Optimizer | AdamW, lr=5e-5, cosine decay |
| Warmup steps | 500 |
| Effective batch size | 8 (grad_accumulation=8) |
| Best checkpoint | Step 4,000 (val_loss 0.020) |
| Hardware | Apple Silicon (M-series) |
| Framework | MLX |
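To make the optimizer row concrete, here is a rough sketch of a warmup-plus-cosine learning-rate schedule using the peak rate and warmup steps from the table. Several details are assumptions rather than facts from the training config: the warmup is taken to be linear from zero, the decay is taken to end at zero, and `TOTAL_STEPS` is set to the 20K full-length run as a placeholder.

```python
import math

PEAK_LR = 5e-5        # from the table
WARMUP_STEPS = 500    # from the table
TOTAL_STEPS = 20_000  # assumption: length of a full run, not stated in the table


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay towards zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


for step in (0, 250, 500, 4_000, 20_000):
    print(step, f"{learning_rate(step):.2e}")
```

Under these assumptions the best checkpoint at step 4,000 is still early in the decay, so the learning rate there is close to its peak.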
Evaluation Results
Evaluated on the Paulescu/OHF-Voice-audio-20260504 test split: 397 stratified samples (up to 10 per function across 41 classes; some functions have fewer than 10, as the per-function table shows), system prompt "Perform ASR.", temperature 0.0.
| Metric | v1 (950 samples, 10K steps) | v2 (50K samples, 4K steps) | Cookbook (full FT, A100) |
|---|---|---|---|
| Format compliance | 99.0% | 100.0% | 99.7% |
| Function-name accuracy | 82.0% | 92.2% | 98.8% |
| Argument accuracy | 61.0% | 70.0% | ~97% |
Key takeaway: going from 950 → 49,909 training samples raised function-name accuracy by 10.2 percentage points (82.0% → 92.2%) and argument accuracy by 9 points (61.0% → 70.0%), with fewer training steps (4K vs 10K). In this comparison, data volume mattered more than training duration.
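For readers who want to score their own runs, the three metrics can be sketched as follows. The definitions here are assumptions, not taken from the evaluation script: format compliance means the prediction parses as `FunctionName|$key=value|...`, function-name accuracy means the parsed name matches the reference, and argument accuracy means both the name and the full argument dictionary match. `try_parse` and `score` are hypothetical helpers.

```python
import re
from typing import Dict, List, Optional, Tuple

Call = Tuple[str, Dict[str, str]]
NAME_PATTERN = re.compile(r"[A-Za-z_]\w*")  # assumed shape of a function name


def try_parse(text: str) -> Optional[Call]:
    """Return (name, args), or None if the text is not a well-formed call."""
    parts = text.strip().split("|")
    if not NAME_PATTERN.fullmatch(parts[0]):
        return None
    args: Dict[str, str] = {}
    for part in parts[1:]:
        if not part.startswith("$") or "=" not in part:
            return None
        key, value = part[1:].split("=", 1)
        args[key] = value
    return parts[0], args


def score(predictions: List[str], references: List[str]) -> Dict[str, float]:
    """Compute the three metrics as fractions of the reference count."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and aligned")
    well_formed = name_hits = arg_hits = 0
    for pred_text, ref_text in zip(predictions, references):
        pred, ref = try_parse(pred_text), try_parse(ref_text)
        if pred is None or ref is None:
            continue  # malformed output counts against every metric
        well_formed += 1
        if pred[0] == ref[0]:
            name_hits += 1
            # Assumption: arguments only count when the name is also right.
            if pred[1] == ref[1]:
                arg_hits += 1
    n = len(references)
    return {
        "format_compliance": well_formed / n,
        "function_name_accuracy": name_hits / n,
        "argument_accuracy": arg_hits / n,
    }
```

With these definitions a plain transcription such as "switch on the light" fails the format check because it contains spaces, which matches the failure mode the adapter is meant to remove.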
Per-function highlights
| Function | Name acc | Arg acc |
|---|---|---|
| HassStartTimer | 10/10 | 10/10 |
| HassSetPosition | 10/10 | 10/10 |
| HassGetCurrentDate | 6/6 | 6/6 |
| HassGetCurrentTime | 7/7 | 7/7 |
| HassCancelAllTimers | 10/10 | 9/10 |
| HassSetVolume | 10/10 | 9/10 |
| HassRespond | 0/5 | 0/5 ← known weak spot |
| HassSetVolumeRelative | 10/10 | 2/10 ← relative values hard |
Key Finding: Modality Positioning
LFM 2.5 Audio's __call__ interface appends audio embeddings after all text tokens. Under causal attention each position sees only earlier positions, so the assistant tokens cannot attend to audio that is placed after them. Trained this way, the model produces plain transcriptions regardless of fine-tuning.
The fix: use model._prefill with an explicit modalities array that places AUDIO_IN tokens inside the user turn, before the assistant tokens. This matches the inference-time ChatState.add_audio() layout.
Wrong (model.__call__): [system][user " "][assistant "HassLightTurnOn..."][AUDIO]
Correct (model._prefill): [system][user AUDIO×N ][assistant "HassLightTurnOn..."]
Impact: val_loss 1.40 → 0.05 in 500 steps; model starts producing function calls instead of transcriptions.
Implemented in train/losses/lfm_audio_loss.py.
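The effect of the two layouts can be checked with a toy model of causal attention. This sketch does not call mlx-audio or `model._prefill`; it only labels each position with the segment it belongs to and asks whether any assistant position comes at or after an audio position. `can_attend`, the segment labels, and `N_AUDIO` are all hypothetical.

```python
from typing import List


def can_attend(layout: List[str], query_segment: str, key_segment: str) -> bool:
    """Under causal attention, position i sees only positions j <= i."""
    query_positions = [i for i, seg in enumerate(layout) if seg == query_segment]
    key_positions = [j for j, seg in enumerate(layout) if seg == key_segment]
    return any(j <= i for i in query_positions for j in key_positions)


N_AUDIO = 4  # hypothetical number of audio embedding positions

# Audio appended after all text tokens (the __call__ layout described above).
wrong = ["system", "user", "assistant", "assistant"] + ["audio"] * N_AUDIO

# Audio placed inside the user turn, before the assistant tokens.
correct = ["system", "user"] + ["audio"] * N_AUDIO + ["user", "assistant", "assistant"]

print(can_attend(wrong, "assistant", "audio"))    # False
print(can_attend(correct, "assistant", "audio"))  # True
```

In the first layout no assistant position can see any audio position, so the loss cannot tie the function call to what was spoken.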
Usage
import mlx.core as mx
from mlx_audio.sts.models.lfm_audio import LFM2AudioModel, LFM2AudioProcessor
from mlx_audio.sts.models.lfm_audio.processor import ChatState
from mlx_audio.sts.models.lfm_audio.model import LFMModality
# Load base model + adapter
processor = LFM2AudioProcessor.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-8bit")
model = LFM2AudioModel.from_pretrained("mlx-community/LFM2.5-Audio-1.5B-8bit")
# Apply LoRA and load weights
from train.lora import apply_lora, load_adapters, LoRAConfig
apply_lora(model, LoRAConfig(model_type="lfm_audio", rank=16, alpha=32.0))
load_adapters(model, "path/to/adapters.safetensors")
model.eval()
# Transcribe
import soundfile as sf
audio_numpy, sr = sf.read("command.wav")
audio_mx = mx.array(audio_numpy)
chat = ChatState(processor)
chat.new_turn("system"); chat.add_text("Perform ASR."); chat.end_turn()
chat.new_turn("user"); chat.add_audio(audio_mx, sample_rate=sr); chat.end_turn()
chat.new_turn("assistant")
output = ""
for token, modality in model.generate_from_chat_state(
    chat, mode="sequential", max_new_tokens=64, temperature=0.0, top_k=1
):
    if modality == LFMModality.TEXT:
        tok_id = int(token.item())
        if tok_id == 7:  # <|im_end|>
            break
        output += processor.tokenizer.decode([tok_id])
print(output) # e.g. HassLightTurnOn|$area=hall
Or run the included demo:
git clone https://github.com/akashicMarga/mlx-audio-train
cd mlx-audio-train
python scripts/lfm_asr_demo.py --adapter path/to/checkpoint-best
# open http://localhost:7860
Reproduce / Experiment
Full training code and configs: akashicMarga/mlx-audio-train.
# Prepare full 50K dataset (requires HF_TOKEN for gated dataset)
python scripts/prepare_ohf_voice.py --output-dir data/ohf_voice
# Train (early-stop at ~4K steps or let it run to 20K)
caffeinate -i python scripts/train.py --config configs/lfm_audio_asr_50k.yaml
# Eval
python scripts/lfm_asr_eval.py \
  --adapter checkpoints/lfm-audio-asr-50k/checkpoint-best \
  --samples-per-function 10 \
  --output evals/
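The `--samples-per-function 10` flag suggests per-function stratified sampling. One plausible way to do this is sketched below; it is an illustration, not the logic of `lfm_asr_eval.py`. `stratified_sample` is a hypothetical helper that takes `(audio_path, target)` pairs and keeps at most `per_function` examples for each function name, so functions with fewer test examples contribute all they have. That would explain a total below 10 × 41 = 410.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (audio_path, target string such as "Name|$key=value")


def stratified_sample(
    examples: List[Example], per_function: int, seed: int = 0
) -> List[Example]:
    """Keep at most `per_function` examples for each function name."""
    by_function: Dict[str, List[Example]] = defaultdict(list)
    for example in examples:
        function_name = example[1].split("|", 1)[0]
        by_function[function_name].append(example)

    rng = random.Random(seed)  # fixed seed so the evaluation set is repeatable
    sample: List[Example] = []
    for function_name in sorted(by_function):
        group = by_function[function_name]
        sample.extend(rng.sample(group, min(per_function, len(group))))
    return sample
```

Sorting the function names before sampling keeps the result independent of the order in which examples were read from disk.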
Things worth experimenting with:
- Higher LoRA rank (32, 64) — more capacity for exact argument values
- Full fine-tuning — remove LoRA, train all params; should approach cookbook accuracy
- Unfreeze audio encoder — may help with accented or noisy speech
- Longer training (20K steps) — val_loss was still healthy at 4K; more epochs may push arg accuracy further
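As a rough guide to the cost of a higher rank, the trainable parameter count can be extrapolated from the v2 figure. This assumes standard LoRA, where each adapted linear layer adds `rank * (d_in + d_out)` parameters, and the same target modules and layers as v2; under those assumptions the count grows linearly with rank. `lora_trainable_params` is a hypothetical helper.

```python
RANK_V2 = 16
TRAINABLE_V2 = 884_736  # trainable params reported for v2 at rank 16

# Standard LoRA adds rank * (d_in + d_out) params per adapted layer, so the
# total is rank * sum(d_in + d_out) over all adapted layers.
DIMS_SUM = TRAINABLE_V2 // RANK_V2  # 55,296 under this assumption


def lora_trainable_params(rank: int) -> int:
    """Estimated trainable params at a given rank, same target layers as v2."""
    return rank * DIMS_SUM


for rank in (16, 32, 64):
    print(rank, f"{lora_trainable_params(rank):,}")
# 16 884,736
# 32 1,769,472
# 64 3,538,944
```

Even at rank 64 the adapter would stay a small fraction of the 168,786,688 total parameters listed in the training table.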
Citation / Acknowledgements
- Base model: LiquidAI/LFM2.5-Audio-1.5B
- Dataset: Paulescu/OHF-Voice-audio-20260504
- Cookbook reference: Liquid4All/cookbook
- Training framework: mlx-audio
- v1 adapter: akashicmarga/LFM2.5-Audio-1.5B-ASR-LoRA