Show HN: 17MB pronunciation scorer exceeds human inter-annotator agreement at the phoneme level
I built an English pronunciation assessment engine that fits in 17MB and runs in under 300ms on CPU.

Architecture: CTC forced alignment + GOP (Goodness of Pronunciation) scoring + ensemble heads (MLP + XGBoost). No wav2vec2 or large self-supervised models; the entire pipeline uses a quantized NeMo Citrinet-256 as the acoustic backbone.
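For anyone unfamiliar with GOP: here is a minimal sketch of one common posterior-based variant, computed from frame-level posteriors plus a forced-alignment segment. This is the general technique, not necessarily this engine's exact formulation:

    import numpy as np

    def gop_score(log_posteriors: np.ndarray, segment: tuple, phone_id: int) -> float:
        """Posterior-based GOP for one phone segment from the forced alignment.

        log_posteriors: (T, num_phones) frame-level log-softmax output
        segment: (start_frame, end_frame) for this phone
        phone_id: index of the canonical (expected) phone
        """
        start, end = segment
        frames = log_posteriors[start:end]
        # Mean per-frame margin between the canonical phone and the best
        # competing phone: near 0 means the canonical phone dominates,
        # large negative values flag a likely mispronunciation.
        # (With CTC posteriors, blank-dominated frames are often excluded first.)
        canonical = frames[:, phone_id]
        best = frames.max(axis=1)
        return float((canonical - best).mean())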

Benchmarked on speechocean762 (the standard academic benchmark, 2,500 utterances):
- Phone-level PCC: 0.580, exceeding human inter-annotator agreement (0.555)
- Sentence-level PCC: 0.710, exceeding human agreement (0.675)
- 70x smaller than wav2vec2-based SOTA models
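For clarity on the metric: PCC is the Pearson correlation between predicted scores and the expert labels. A minimal sketch of the phone-level evaluation (the array files are placeholders standing in for the model's outputs and the speechocean762 labels):

    import numpy as np
    from scipy.stats import pearsonr

    # Placeholder files: one score per phone in the test set.
    model_scores = np.load("model_phone_scores.npy")
    human_scores = np.load("human_phone_scores.npy")  # e.g. averaged expert labels

    pcc, _ = pearsonr(model_scores, human_scores)
    print(f"Phone-level PCC vs. human labels: {pcc:.3f}")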

Trade-off: we're roughly 10-15% below SOTA on raw accuracy, but for real-time feedback in language-learning apps the latency and size wins are worth it.

Available as a REST API, an MCP server (for AI agents), and on Azure Marketplace.
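To make the use case concrete, a hypothetical REST call (the endpoint URL and field names are invented for illustration; the real API will differ):

    import requests

    # Hypothetical endpoint and field names; consult the actual API docs.
    with open("sample.wav", "rb") as audio:
        resp = requests.post(
            "https://api.example.com/v1/assess",
            files={"audio": audio},
            data={"reference_text": "the quick brown fox"},
            timeout=10,
        )
    print(resp.json())  # e.g. phone-level GOP scores plus a sentence score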

Demo: https://huggingface.co/spaces/fabiosuizu/pronunciation-asses...

Interested in feedback on the scoring approach and use cases people would find valuable.

very cool, weights? training recipe?