Benchmarking Speech-To-Text, Machine-Translation and Text-To-Speech

Proof of Concept by

Sébastien Doré / Ubisoft

Philippe Anel, CTO / Mediawen

Erwan de Kerautem, CEO / Mediawen

#ubinnovationlab

Just how good is AI at automated translation, subtitling and voiceover?

AI-driven automated translation and speech-to-text are such hot topics that tech giants are investing heavily in them, announcing every so often that they have achieved the lowest word error rate yet.


We teamed up with French start-up Mediawen because we wanted to see for ourselves how well AI performs at these tasks. By combining state-of-the-art solutions from key players in the field (Google, Microsoft and IBM) with the best research-led solutions and its own algorithms, Mediawen is improving on their results.

We tested six videos.

Speech-to-text was applied to all six videos, and automated translation to all but the GRW gameplay sequence. In that sequence, the mix of sounds, music and voices meant that AI performed poorly, confirming our hunch that each sound file should be translated individually. For both speech-to-text and translation, we tested tools from Google, Microsoft, IBM and Voxolab, picking the best-performing solution in each case.

The best performer varied depending on whether the task was speech-to-text or translation, and on the direction of translation. Next, a human translator corrected the AI output. Mediawen displays all corrections made, along with their type; some results required a lot more human intervention than others.
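The "pick the best performer per task and direction" step can be sketched in a few lines of Python. This is only an illustration: the function name `pick_best_engines`, the engine labels `engine_a` and `engine_b`, and every score in `EXAMPLE_SCORES` are hypothetical placeholders, not results from our tests, and the sketch assumes each engine has already been given a single accuracy score per task and direction.

```python
def pick_best_engines(scores):
    """Return the highest-scoring engine for each (task, direction) pair.

    `scores` is an iterable of (task, direction, engine, accuracy) tuples.
    The result maps (task, direction) to (engine, accuracy).
    """
    best = {}
    for task, direction, engine, accuracy in scores:
        key = (task, direction)
        # Keep the engine with the highest accuracy seen so far for this key.
        if key not in best or accuracy > best[key][1]:
            best[key] = (engine, accuracy)
    return best


# Hypothetical placeholder data, invented purely for illustration.
EXAMPLE_SCORES = [
    ("speech-to-text", "fr", "engine_a", 0.92),
    ("speech-to-text", "fr", "engine_b", 0.95),
    ("translation", "fr->en", "engine_a", 0.86),
    ("translation", "fr->en", "engine_b", 0.81),
    ("translation", "en->fr", "engine_b", 0.84),
]

if __name__ == "__main__":
    for (task, direction), (engine, accuracy) in pick_best_engines(EXAMPLE_SCORES).items():
        print(f"{task} [{direction}]: {engine} ({accuracy:.0%})")
```

Keying the lookup on both the task and the direction reflects the observation above that no single tool was best everywhere.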

In a spirit of complete transparency, all the videos we processed are accessible here. Click on the thumbnail below, then start the video. The planet icon in the video player brings up a menu allowing you to cycle through the various speech-to-text, translation and voiceover tests.








bs-radar episode 1

bs-radar episode 2



Our analysis allowed us to estimate accuracy for both speech-to-text and translation. Not counting minor errors (such as omission of a capital letter), accuracy was around 90-95% for speech-to-text, and around 85% for translation. According to Mediawen, 85% is the point at which automation becomes worthwhile – in other words, at which it is quicker to correct than to start from scratch. Speech-to-text is therefore already time-saving, and translation is almost there.
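For readers who want to run a similar estimate on their own transcripts, here is a minimal Python sketch. It assumes accuracy is measured as one minus the word error rate (word-level edit distance divided by the number of reference words), with capitalization ignored as a "minor error". That is a common convention, not necessarily the exact method used in our analysis, and the names `edit_distance`, `word_accuracy` and `worth_post_editing` are hypothetical. The 85% threshold is the figure Mediawen quotes.

```python
WORTHWHILE_THRESHOLD = 0.85  # the point Mediawen cites as making automation worthwhile


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis, ignore_case=True):
    """Return 1 - WER, clamped at 0, comparing whitespace-separated words."""
    if ignore_case:
        # Treat a missing capital letter as a minor error and ignore it.
        reference, hypothesis = reference.lower(), hypothesis.lower()
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    wer = edit_distance(ref_words, hyp_words) / len(ref_words)
    return max(0.0, 1.0 - wer)


def worth_post_editing(accuracy, threshold=WORTHWHILE_THRESHOLD):
    """True when correcting the output should beat starting from scratch."""
    return accuracy >= threshold
```

Here `reference` would be the human-corrected text and `hypothesis` the raw AI output; punctuation is left attached to words, so a stricter or looser tokenization would shift the score slightly.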

These results can be improved by teaching unfamiliar vocabulary to the AI. Since our videos use a lot of videogame vocabulary, Ubisoft jargon, and English words mixed into French, this is likely to be effective for us.

Using these techniques, we could get near-perfect speech-to-text results as well as automated translation that, while still requiring human intervention, would save us valuable time.

Voiceover was tested on just one video. We were pleasantly surprised by the quality, especially in English. The variety of voices and accents was also impressive.

Copyright (c) 2017-2018 MediaWen International