Automatic speaker tracking in audio recordings
A central topic in spoken-language-systems research is what’s called speaker diarization, or computationally determining how many speakers feature in a recording and which of them speaks when. Speaker diarization would be an essential function of any program that automatically annotated audio or video recordings.

To date, the best diarization systems have used what’s called supervised machine learning: They’re trained on sample recordings that a human has indexed, indicating which speaker enters when. In the October issue of IEEE Transactions on Audio, Speech, and Language Processing, however, MIT researchers describe a new speaker-diarization system that achieves comparable results without supervision: No prior indexing is necessary.

Moreover, one of the MIT researchers’ innovations was a new, compact way to represent the differences between individual speakers’ voices, which could be of use in other spoken-language computational tasks.

“You can know something about the identity of a person from the sound of their voice, so this...
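To make the task concrete, the sketch below illustrates the general unsupervised idea: cluster short-window voice features so that each cluster is hypothesized to correspond to one speaker, with no labeled training data. It uses synthetic stand-in features and assumes the number of speakers is known in advance, a simplification the system described in the paper does not require; this is an illustration of unsupervised diarization in general, not the MIT researchers’ method.

```python
# A minimal sketch of unsupervised diarization via clustering.
# Assumptions (not from the article): synthetic stand-in features in place
# of real acoustic features such as MFCCs, and a known speaker count.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for per-frame voice features: two synthetic "speakers" whose
# feature vectors cluster around different means.
speaker_a = rng.normal(loc=0.0, scale=1.0, size=(200, 13))
speaker_b = rng.normal(loc=3.0, scale=1.0, size=(200, 13))
frames = np.vstack([speaker_a, speaker_b, speaker_a])  # A, then B, then A again

# Cluster frames with no labeled training data; each cluster is taken to
# correspond to one speaker.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(frames)

# Collapse per-frame labels into contiguous speaker turns: "who speaks when."
turns, start = [], 0
for i in range(1, len(labels) + 1):
    if i == len(labels) or labels[i] != labels[start]:
        turns.append((start, i, int(labels[start])))
        start = i

for begin, end, speaker in turns:
    print(f"frames {begin:4d}-{end:4d}: speaker {speaker}")
```

On clean, well-separated features like these, the clustering recovers the three speaker turns; the hard parts of real diarization, and the focus of the research described here, lie in representing voices compactly enough that such clusters separate, and in doing so without knowing the number of speakers ahead of time.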