Architecture: CTC forced alignment + Goodness of Pronunciation (GOP) scoring + ensemble heads (MLP + XGBoost). No wav2vec2 or large self-supervised models: the entire pipeline uses a quantized NeMo Citrinet-256 as the acoustic backbone.
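For the curious, the classic GOP formulation scores each canonical phone by its average log posterior over the frames that forced alignment assigns to it. A minimal sketch of that idea (array names and shapes are illustrative, not our actual code):

    import numpy as np

    def gop_scores(posteriors, alignment):
        """posteriors: (T, P) per-frame phone posteriors from the CTC head.
        alignment: (phone_id, start_frame, end_frame) triples from forced alignment.
        Returns one GOP value per canonical phone: the mean log posterior of
        the expected phone over its frames (higher = closer to canonical).
        """
        scores = []
        for phone_id, start, end in alignment:
            frame_probs = posteriors[start:end, phone_id]
            scores.append(float(np.mean(np.log(frame_probs + 1e-10))))
        return scores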
Benchmarked on speechocean762 (standard academic benchmark, 2,500 utterances):

- Phone-level score correlation (PCC): 0.580, exceeding human inter-annotator agreement (0.555)
- Sentence-level score: 0.710, exceeding human agreement (0.675)
- Model is 70x smaller than wav2vec2-based SOTA
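To be clear on the metric: PCC is the Pearson correlation between predicted scores and the human labels, and "human agreement" is the analogous correlation computed between annotators. Toy example with stand-in numbers:

    from scipy.stats import pearsonr

    predicted = [1.8, 0.4, 2.0, 1.1]  # model's per-phone scores (stand-in values)
    human     = [2.0, 0.0, 2.0, 1.0]  # annotator labels on speechocean762's 0-2 phone scale
    pcc, _ = pearsonr(predicted, human)
    print(f"phone-level PCC: {pcc:.3f}")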
Trade-off: we're ~10-15% below SOTA on raw accuracy. But for real-time feedback in language-learning apps, the latency and size savings are worth it.
Available as a REST API, an MCP server (for AI agents), and on Azure Marketplace.
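If you want a feel for the REST API, a call looks roughly like this (the endpoint URL, field names, and response shape below are placeholders, not the published contract):

    import requests

    with open("hello_world.wav", "rb") as f:
        resp = requests.post(
            "https://api.example.com/v1/assess",  # hypothetical endpoint
            files={"audio": f},
            data={"text": "hello world"},  # the transcript to score against
            timeout=30,
        )
    resp.raise_for_status()
    print(resp.json())  # e.g. a sentence score plus a per-phone GOP breakdown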
Demo: https://huggingface.co/spaces/fabiosuizu/pronunciation-asses...
Interested in feedback on the scoring approach and use cases people would find valuable.