Félicien Vallet, Jean Carrive, Hakim Nabi, Mathieu Derval
This work has been conducted with the help of Jim Uro (Université de Technologie de Compiègne) and Jérémy Andriamakaoly (Télécom ParisTech) during their respective internships.
We describe the Speech Trax system that aims at analyzing the audio content of TV and radio documents. In particular, we focus on the speaker tracking task, which is very valuable for indexing purposes. First, we detail the overall architecture of the system and show the results obtained on a large-scale experiment, the largest to our knowledge for this type of content (about 1,300 speakers). Then, we present the Speech Trax demonstrator that gathers the results of various automatic speech processing techniques on top of our speaker tracking system (speaker diarization, speech transcription, etc.). Finally, we provide insight into the performance obtained and suggest directions for future improvements.
As stated in , speaker tracking derives from the larger field of speaker recognition, which has been a very active research topic for several decades. However, the challenge of dealing with “big data” is only now being tackled. For instance, in , the authors propose a statistical utterance comparison that, coupled with kernelized locality-sensitive hashing (KLSH), is used to search very large populations of speakers (about 10,000). Similarly, in , the authors propose a system based on i-vectors, the state-of-the-art approach for speaker recognition, and locality-sensitive hashing (LSH) to speed up the comparison of a given target against 1,000 referenced speaker models. Finally, it is worth mentioning the Speaker Recognition Evaluation and i-vector Machine Learning Challenge campaigns organized and led by NIST. However, it has to be noted that in all these cases, the speech segments used for the speaker recognition task do not come from broadcast material. They come either from databases designed for research on consumer devices as in , telephone and microphone speech as in the NIST challenges, or technical talks posted on YouTube as in .
Speaker tracking in audiovisual content is nevertheless slowly getting attention. In , based on speaker diarization techniques, the authors propose to link speakers in an unsupervised fashion across about 1,800 hours of Dutch television broadcasts. Similarly, the prototype developed at the BBC gathers 3 years of radio archives amounting to 70,000 programs. In , the authors present the speaker recognition feature of this prototype: 780 speaker models are built, based on Gaussian Mixture Models (GMM), and compared using the Kullback-Leibler divergence through an LSH index. Finally, an approach relying on TV material issued from the REPERE challenge  is proposed in . In particular, the authors investigate the i-vector framework and propose a specific protocol to identify candidates using a 533-speaker dictionary.
In this section, we focus on the speaker recognition feature of our demonstrator. In particular, we describe how our speaker dictionary is created and detail the implementation choices made for the speaker recognition system itself.
As detailed in our previous work , we propose to automatically gather segments for famous speakers. The approach is based on the simple hypothesis that, during a TV newscast, when the name of a person appears on screen, that person is currently speaking. To this end, an optical character recognition (OCR) software presented in  is used to detect names (see Figure 1). The transcribed text is compared against a list of referenced people issued from Ina’s thesaurus using the Levenshtein distance. Then, if a match is found, the corresponding speech turn, obtained using the LIUM Speaker Diarization tool , is associated with the person’s identity.
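The name-matching step described above can be sketched as follows. This is a minimal illustration, not the actual Ina implementation: the thesaurus entries and the distance threshold are illustrative assumptions.

```python
# Sketch of matching an OCR-transcribed name against a referenced
# person list using the Levenshtein distance. Threshold and names
# are hypothetical examples, not values from the actual system.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def match_ocr_name(ocr_text: str, thesaurus: list, max_dist: int = 2):
    """Return the closest referenced person, or None if no entry is
    within max_dist edits (OCR output is often slightly noisy)."""
    best = min(thesaurus, key=lambda n: levenshtein(ocr_text.lower(), n.lower()))
    return best if levenshtein(ocr_text.lower(), best.lower()) <= max_dist else None

print(match_ocr_name("Francois Hollande", ["François Hollande", "Laurent Fabius"]))
```

A small tolerance (here 2 edits) absorbs typical OCR confusions such as missing diacritics, while rejecting names absent from the thesaurus.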
Speaker recognition methods rely heavily on the quality of their training data. Unfortunately, the automated method described previously does not reach the necessary quality on its own. Therefore, a web interface has been designed to manually validate the segments attributed to a given personality (see Figure 2).
It allows us to confirm or reject, for each collected segment, the identity of the assumed speaker. At the end of the validation process, a total of 2,290 personalities constitute our speaker dictionary (Table 1 shows the details). However, it has to be kept in mind that, despite our efforts to reduce it, such dictionaries are by nature greatly imbalanced. On top of the number of files and the cumulated time, both in total and on average, Table 1 displays the number of sessions: we qualify as belonging to the same acoustic session speech segments issued from the same program. This information is of utmost importance: for the evaluation of the speaker recognition process, segments belonging to the same session cannot be part of both the training and the testing sets, as similar acoustic conditions would otherwise be both learned and evaluated, leading to biased results.
[Table 1: speaker dictionary statistics — number of sessions, number of segments, and cumulated time, in total and on average per speaker (about 3 minutes 10 seconds on average).]
State-of-the-art techniques for speaker recognition now rely on the i-vector paradigm. As stated in , an i-vector is a compact representation of a speaker’s utterance obtained after projection into a low-dimensional total variability subspace, trained using factor analysis and making no distinction between speaker and channel/session information. The speaker- and channel-dependent GMM supervector M, issued from the concatenation of the speaker GMM means, can be defined as:

M = m + Tw,

where m is the speaker- and channel-independent supervector, T is the total variability matrix and w is the i-vector.
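Once utterances are mapped to i-vectors, speaker comparison reduces to comparing fixed-length vectors. The system described below adopts PLDA scoring; cosine similarity, sketched here, is the simpler classical baseline for i-vector comparison (the vectors are toy placeholders, not real i-vectors).

```python
# Minimal sketch of i-vector comparison via cosine scoring. Real
# i-vectors are typically 400-600 dimensional; the 4-dim vectors
# below are toy placeholders for illustration only.
import math

def cosine_score(w_target, w_test):
    """Cosine similarity between two i-vectors: the higher the score,
    the more likely the test utterance matches the target speaker."""
    dot = sum(a * b for a, b in zip(w_target, w_test))
    norm = math.sqrt(sum(a * a for a in w_target)) * \
           math.sqrt(sum(b * b for b in w_test))
    return dot / norm

print(cosine_score([0.2, -0.5, 0.1, 0.9], [0.3, -0.4, 0.0, 0.8]))
```

In practice the decision is taken by thresholding this score, with the threshold tuned on development data.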
We chose to use a system similar to the one described in , meaning that we rely on the ALIZE v3.0 toolkit . However, based on a series of preliminary experiments, we decided to adopt the Probabilistic Linear Discriminant Analysis (PLDA) scoring approach defined in . The Universal Background Model (UBM), the total variability matrix and the PLDA-related matrices were learnt on data issued from the REPERE challenge .
In order to evaluate the performance of the previously described speaker recognition system, we build a closed-set experiment exploiting our speaker dictionary. In a closed-set evaluation, all the tested speakers, or targets, possess a corresponding voice model. For this, the speaker dictionary is split into two parts, one dedicated to the training of the models and the other to the testing. Contrary to the protocol presented in , we take great care not to mix up segments belonging to the same session between the train and the test sets. Besides, to ensure a balanced volume of data among the various speakers, we limit the number of sessions considered to 10 in total. We then allocate the sessions and segments in the same fashion as in , meaning that we set a minimum of 30 seconds and a maximum of 150 seconds for the training phase. Thus, with all these constraints, the number of sessions to be added to the train base is determined as follows for a speaker:
It has to be noted that a great variance is observed in the number of segments belonging to one session. Thus, to ensure that the process is statistically significant, we randomly generate 5 training and testing sets and average the performance figures.
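The session-disjoint allocation above can be sketched as follows. The 30 s / 150 s bounds and the 10-session cap come from the protocol described in the text; everything else (function name, session durations) is an illustrative assumption.

```python
# Sketch of session-disjoint train/test allocation: whole sessions,
# never individual segments, go to one side of the split, so acoustic
# conditions are not shared between training and testing.
import random

MIN_TRAIN, MAX_TRAIN, MAX_SESSIONS = 30.0, 150.0, 10

def split_speaker(sessions, seed=0):
    """sessions: dict mapping session id -> speech duration in seconds.
    Returns (train_ids, test_ids) for one speaker."""
    ids = sorted(sessions)[:MAX_SESSIONS]
    random.Random(seed).shuffle(ids)
    train, total = [], 0.0
    for sid in ids:
        if total >= MIN_TRAIN:
            break                        # enough training material
        if total + sessions[sid] > MAX_TRAIN:
            continue                     # would overshoot the cap
        train.append(sid)
        total += sessions[sid]
    test = [sid for sid in ids if sid not in train]
    return train, test
```

Drawing several random splits like this one (5 in the experiment) and averaging the scores reduces the variance induced by the uneven number of segments per session.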
Table 2 displays the averaged results obtained with our speaker recognition system over 1,290 speakers satisfying the previously described conditions. Scores are computed both at the segment and at the session level.
Besides the classical metrics of precision, recall and equal error rate (EER, the rate at which false acceptance and false rejection errors are equal), we also report measures indicating whether the targeted speaker is amongst the first 10 retrieved candidates. This is a slightly different approach from the classical precision-at-rank measure based on result pages of web search, which is defined from the number of relevant documents retrieved at a given rank.
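The EER mentioned above can be computed by sweeping the decision threshold until the false-acceptance rate (FAR) and false-rejection rate (FRR) meet. The scores below are toy values, not outputs of the actual system.

```python
# Simple EER approximation: sweep the threshold over all observed
# scores and report the error rate where FAR and FRR are closest.

def eer(target_scores, impostor_scores):
    """target_scores: scores of genuine trials; impostor_scores:
    scores of impostor trials. Returns the approximate EER."""
    best_gap, best_rate = float("inf"), None
    for thr in sorted(target_scores + impostor_scores):
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        if abs(far - frr) < best_gap:
            best_gap, best_rate = abs(far - frr), (far + frr) / 2
    return best_rate

print(eer([0.9, 0.8, 0.7, 0.4], [0.5, 0.3, 0.2, 0.1]))  # 0.25
```

Production systems interpolate the DET curve for a more precise EER, but this discrete sweep conveys the idea.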
The results highlight the fact that identification by session performs better than by segment, which is consistent with the results obtained in  with a reference segmentation. Also, one can note that, compared with the classical metrics, these additional measures show that targeted speakers are most of the time to be found among the first 10 retrieved candidates.
The idea of the Speech Trax demonstrator is to propose new ways of exploring audiovisual collections based on the oral interventions of famous French speakers. From an archiving point of view, it also aims to produce a raw and imperfect documentation of the contents kept at Ina, to help their detailed description. To this end, we use an approach similar to the one presented in . As illustrated in the block diagram of Figure 3, Speech Trax relies on various automatic speech processing techniques: speech/music discrimination, speaker diarization, speaker recognition and speech transcription.
A corpus of about 250 hours of broadcast news and magazines was selected. It comes from 6 public channels: 3 TV channels (France 2, France 5, France 24) and 3 radio stations (France Inter, France Info, France Culture). We chose to process data from March 2014 because of the important events that occurred at that time: French local elections, the phone tapping of former president Sarkozy, the invasion of Ukraine by Russian troops, the missing Malaysia Airlines flight 370, etc. Programs are cut into 15-minute slices, both to reduce speaker diarization errors and to ease the browsing of the media in the user interface. Besides, ads are manually discarded, which is the only manual operation performed.
As described in Figure 3, the first step is to identify speech tracks by running a speech/music discrimination method; to this end, we used the approach proposed in . Once the speech tracks are identified, the LIUM Speaker Diarization tool  is used to provide a speaker segmentation and clustering of the analyzed shows. This then allows us to track speakers using the speaker recognition system described earlier. However, in this case, it is worth mentioning that the tested speech segments are produced by speaker diarization and are not manually validated, inducing potential errors. From the dictionary described earlier, a sub-dataset of 1,783 speakers possessing at least 45 seconds of speech is used. Finally, the commercial system VoxSigma from Vocapia Research is used to provide automatic speech transcriptions.
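The chain of Figure 3 can be summarized schematically as follows. Every step function here is a hypothetical stub standing in for the actual tools (speech/music discrimination, LIUM diarization, the speaker recognition system, VoxSigma transcription); what matters is the order of the steps and the data flow between them.

```python
# Schematic of the Speech Trax processing chain. All step functions
# are placeholder stubs, not the real tools.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float
    end: float
    cluster: str = ""        # diarization cluster label
    speaker: str = ""        # identity from the speaker dictionary
    confidence: float = 0.0
    transcript: str = ""

def detect_speech(path):                 # 1. speech/music discrimination
    return [(0.0, 10.0)]
def diarize(spans):                      # 2. segmentation + clustering
    return [Segment(s, e, cluster="S0") for s, e in spans]
def recognize(seg):                      # 3. speaker tracking
    return "unknown", 0.0
def transcribe(seg):                     # 4. speech transcription
    return ""

def process_program(audio_path):
    segments = diarize(detect_speech(audio_path))
    for seg in segments:
        seg.speaker, seg.confidence = recognize(seg)
        seg.transcript = transcribe(seg)
    return segments

print(process_program("program_2014-03.wav"))
```

Note that recognition errors made at the diarization stage propagate downstream, which is why the text stresses that the tested segments are not manually validated.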
The Speech Trax GUI is a responsive, flat-design web application with all actions accessible on a single page. The video player used is amalia.js, Ina’s open-source HTML5 player; Figure 4 provides a view of this metadata-enriched player. Speech Trax enables users to navigate inside a corpus of radio and video documents according to the interventions of famous speakers. To do so, the user enters the name of the person of interest in the search bar; an auto-completion module shows whether this speaker has been identified in the corpus. The user then just needs to select the excerpt of his or her choice to listen to the speaker’s intervention. For TV material, the images of the excerpts are automatically extracted based on the speaker’s interventions. Once a program is shown on screen, the user can also browse through the media and access all the interventions of an identified speaker by clicking on the magnifying glass next to his/her name. To the right of the name, the user can also see the confidence with which the labeling was made.
The processing of the described corpus enabled the identification of 533 unique speakers, most of them being anchors, presenters, sports personalities, politicians or experts. It has to be noted that the validation threshold has deliberately been set rather high in order to improve the user experience; for professional usage, this threshold would be set lower to ensure greater exhaustiveness. Better performance appears to be obtained for radio than for TV material. One could argue that radio speech is cleaner since no visual cues are available to the listener. Also, it is worth noting that, as highlighted in Table 3, there are on average fewer speakers in the radio programs processed than in their TV counterparts.
[Table 3: average number of speakers per program for the radio and TV channels processed.]
The distribution of the identified speakers across channels is also of interest. Table 4 reveals that the vast majority of speakers are retrieved on a single channel. A closer look at that population shows that these persons are for the greater part anchors, presenters, columnists, etc. On the other end of the spectrum, a handful of personalities are retrieved on all or almost all channels. It is interesting to note that these personalities are all politicians: François Hollande, Laurent Fabius, Jean-Claude Gaudin, Marine Le Pen, Jean-François Copé, etc. At the same time, it is also worth keeping in mind that politicians are among the speakers with the most training segments in the dictionary, on account of their regular appearances on TV.
[Table 4: number of identified speakers per number of channels on which they are retrieved.]
Finally, following the famous zoo introduced in , one can notice that several speakers appear to be wolves, meaning that they are particularly successful at imitating other speakers: their speech is very likely to be accepted as that of another speaker. Personalities such as Benjamin Millepied, Manu Payet and Béatrice Idiard-Chamois clearly behave as wolves in our demonstrator, which seems due to noisy speech models. It is also worth noting that many errors are caused by telephone speech. For instance, the speaker Alain Cayzac, who must have telephone data in his speech model, is often identified when a speaker is interviewed on the phone. If needed in the future, a simple band-pass detection would easily enable such segments to be discarded.
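One way to implement the band-pass check suggested above: telephone speech concentrates its energy in roughly the 300-3400 Hz band, so the ratio of in-band to total spectral energy can flag phone segments. This is a sketch on a synthetic signal; the band limits are the standard telephone bandwidth, and any decision threshold would need tuning on real data.

```python
# Flagging telephone-band audio via the ratio of spectral energy
# inside 300-3400 Hz. Naive DFT, adequate for a short demo frame.
import cmath, math

def band_energy_ratio(samples, rate, lo=300.0, hi=3400.0):
    n = len(samples)
    total = in_band = 0.0
    for k in range(1, n // 2):
        freq = k * rate / n
        coef = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                   for i, s in enumerate(samples))
        power = abs(coef) ** 2
        total += power
        if lo <= freq <= hi:
            in_band += power
    return in_band / total if total else 0.0

rate = 16000
# A 1 kHz tone lies entirely inside the telephone band:
tone = [math.sin(2 * math.pi * 1000 * i / rate) for i in range(512)]
print(band_energy_ratio(tone, rate))
```

A real implementation would use an FFT over many frames and average the ratio, but the principle is the same.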
Relying on various state-of-the-art speech processing techniques, Speech Trax is a first attempt to index and retrieve famous speakers in Ina’s archiving context, and the results are very encouraging. Future uses of Speech Trax could allow users to navigate archives differently, for instance by formulating new kinds of queries: “find media where A speaks with B”, “get me contents where C talks about subject D”, etc.
However, a boost in performance could be obtained by using multimodality to confirm, correct or invalidate the identity of detected speakers. Technologies such as speech transcription, optical character recognition or face recognition could be directly plugged in to enhance the identification results, as is done in the MediaEval task “Multimodal Person Discovery in Broadcast TV” . Also, as in , Ina could rely on crowdsourcing to extend the size of the speaker dictionary, but also to clean it when necessary, allowing a steady improvement of the system.
 N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788–798, 2011.
 G. Doddington, W. Liggett, A. Martin, M. Przybocki, and D. Reynolds. Sheep, goats, lambs and wolves: A statistical analysis of speaker performance in the NIST 1998 speaker recognition evaluation. In International Conference on Spoken Language Processing, Sydney, Australia, December 1998.
 A. Giraudel, M. Carré, V. Mapelli, J. Kahn, O. Galibert, and L. Quintard. The REPERE corpus: a multimodal corpus for person recognition. In International Conference on Language Resources and Evaluation, Istanbul, Turkey, May 2012.
 M. Huijbregts and D. van Leeuwen. Towards automatic speaker retrieval for large multimedia archives. In International Workshop on Automated Information Extraction in Media Production, Florence, Italy, October 2010.
 W. Jeon and Y.-M. Cheng. Efficient speaker search over large populations using kernelized locality-sensitive hashing. In International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, March 2012.
 A. Larcher, J.-F. Bonastre, B. Fauve, K.-A. Lee, C. Lévy, H. Li, J. Mason, and J.-Y. Parfait. ALIZE 3.0: open-source toolkit for state-of-the-art speaker recognition. In Interspeech, Lyon, France, September 2013.
We wish to thank the developers who participated in this project and who cannot be named for legal reasons. We would also like to thank Élisabeth Chapalain from Ina for her help in validating speech segments in the speaker dictionary, as well as our academic partners Dr. Sylvain Meignier and Dr. Anthony Larcher from LIUM and Dr. Julien Pinquier from IRIT.