Speech Technology

Mapping the Emotional Landscape of Classroom Discourse: Classrooms are not merely spaces for information transfer; they are complex social environments where emotion and learning are deeply intertwined. While traditional educational analytics have focused on what is being said (content and syntax), they often miss the vital context of how it is said. A student saying "I can't do this" might be expressing frustration, resignation, or a playful bid for help—distinctions that text alone cannot capture.

To bridge this gap, EDSI is developing a multimodal (Audio + Text) Deep Learning model designed to map the emotional landscape of the classroom. By training algorithms to score utterances on the continuous dimensions of Valence (positivity vs. negativity), Arousal (calm vs. excitement), and Dominance (submissiveness vs. agency), we aim to move beyond binary sentiment analysis. This nuanced approach allows us to measure the "emotional temperature" of classroom moments, quantifying episodes of high engagement, confusion, or confidence.

Leveraging the rich, multimodal data of the EDSI repository, this work seeks to understand how emotional dynamics influence student participation and learning outcomes. Ultimately, this tool will contribute to a more holistic feedback system for educators, helping them recognize and foster the socio-emotional conditions that lead to equitable and effective instruction.
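As a rough illustration of the modeling setup described above, the sketch below shows a late-fusion regressor that maps an utterance's precomputed audio and text embeddings to continuous Valence, Arousal, and Dominance scores. The class name `VADRegressor`, the embedding dimensions, and the layer widths are illustrative assumptions, not details of EDSI's actual architecture.

```python
# Minimal sketch (PyTorch): a late-fusion regressor that maps an
# utterance's audio and text embeddings to continuous Valence/Arousal/
# Dominance (VAD) scores. Dimensions and layer widths are assumptions.
import torch
import torch.nn as nn

class VADRegressor(nn.Module):
    def __init__(self, audio_dim=768, text_dim=768, hidden=256):
        super().__init__()
        # Separate projections let each modality be rescaled independently
        # before fusion.
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 3),  # one output per dimension: V, A, D
            nn.Tanh(),             # bound each score to [-1, 1]
        )

    def forward(self, audio_emb, text_emb):
        # Concatenate the projected modalities, then regress VAD jointly.
        fused = torch.cat([self.audio_proj(audio_emb),
                           self.text_proj(text_emb)], dim=-1)
        return self.head(fused)

# Random stand-ins for precomputed utterance embeddings (in practice these
# would come from a speech encoder and a sentence encoder).
audio_emb = torch.randn(4, 768)
text_emb = torch.randn(4, 768)
vad = VADRegressor()(audio_emb, text_emb)  # shape: (4, 3)
print(vad)
```

Bounding the outputs with a tanh keeps all three scores on a shared [-1, 1] scale, so the "emotional temperature" of different classroom moments can be compared directly.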

Multimodal Speaker Identification in Classroom Audio: The automated analysis of classroom discourse is a critical frontier in educational technology, promising to provide scalable, objective feedback on instructional quality and student engagement. However, the K-12 classroom presents a "perfect storm" for audio processing, characterized by non-stationary background noise, significant reverberation, and the highly variable spectral characteristics of children's speech. Under these adverse conditions, traditional acoustic-only models frequently struggle to answer the fundamental questions of "who spoke when" and "who said what," often failing to distinguish target speakers from peer discussions.

To address these challenges, EDSI is working to build a highly accurate multimodal speaker identification framework that supplements physical acoustic data with semantic context. Recognizing that speaker identity is defined by both an acoustic signature (vocal characteristics) and a linguistic signature (speech content), we integrate high-fidelity acoustic embeddings with Large Language Models (LLMs). This "LLM-adaptive" approach uses transcript-based cues—such as a teacher nominating a student by name—to create "contextual anchors" that refine and correct speaker attribution.

The results of this work demonstrate the viability of granular, automated classroom analytics. Our current system achieves 99.3% accuracy in distinguishing teacher and student roles, and Top-3 accuracy of 95.8% for identifying individual students during substantive turns (those in which the speaker talks for more than 10 seconds). By accurately attributing speech even in complex acoustic environments, this project advances the feasibility of feedback systems capable of monitoring individual student participation, a necessary step toward supporting equitable instruction at scale.
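To make the "contextual anchor" idea concrete, the sketch below fuses an acoustic similarity score against enrolled voiceprints with a boost applied when the preceding teacher turn names a student. Everything here is a hypothetical simplification: the function names, the student names, the additive bonus, and especially the literal string match, which stands in for the LLM-derived nomination cues the project actually uses.

```python
# Minimal sketch (NumPy): fuse acoustic speaker similarity with a
# transcript-derived "contextual anchor" (a teacher nominating a student
# by name in the preceding turn). The weighting and the name-matching
# heuristic are illustrative assumptions only.
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_speakers(turn_emb, enrolled, prev_teacher_text, anchor_bonus=0.3):
    """Return students ranked by a fused acoustic + contextual score.

    turn_emb: embedding of the current speech turn.
    enrolled: dict mapping student name -> enrolled voiceprint embedding.
    prev_teacher_text: transcript of the teacher turn preceding this one.
    """
    scores = {}
    for name, voiceprint in enrolled.items():
        score = cosine(turn_emb, voiceprint)           # acoustic signature
        if name.lower() in prev_teacher_text.lower():  # contextual anchor
            score += anchor_bonus                      # nomination boost
        scores[name] = score
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy enrollment with hypothetical student names and random voiceprints.
rng = np.random.default_rng(0)
enrolled = {"Maya": rng.normal(size=192), "Jonah": rng.normal(size=192)}
turn_emb = enrolled["Jonah"] + 0.5 * rng.normal(size=192)  # a noisy turn
print(rank_speakers(turn_emb, enrolled, "Jonah, what do you think?"))
```

Returning a ranking rather than a hard assignment mirrors the Top-3 evaluation reported above: the contextual boost only needs to lift the correct student into the top few candidates, even when the acoustic evidence alone is ambiguous.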