From Weak Labels to Strong Results: Utilizing 5,000 Hours of Noisy Classroom Transcripts with Minimal Accurate Data

Authors: Ahmed Adel Attia, Dorottya Demszky, Jing Liu, Carol Espy-Wilson

Published date: May 2025

Publication: arXiv

Recent progress in speech recognition has relied on models trained on vast amounts of labeled data. However, classroom Automatic Speech Recognition (ASR) faces the real-world challenge of abundant weak transcripts paired with only a small amount of accurate, gold-standard data. In such low-resource settings, high transcription costs make re-transcription impractical. To address this, we ask: what is the best approach when abundant inexpensive weak transcripts coexist with limited gold-standard data, as is the case for classroom speech data? We propose Weakly Supervised Pretraining (WSP), a two-step process where models are first pretrained on weak transcripts in a supervised manner, and then fine-tuned on accurate data. Our results, based on both synthetic and real weak transcripts, show that WSP outperforms alternative methods, establishing it as an effective training methodology for low-resource ASR in real-world scenarios.

Project: Automatic Speech Recognition
