Here at AST, Quality comes first. You might say it’s our creed. So as we welcome you to the initial “Caption-IT” Blog, we’d like to talk about Quality. We’ll look at what it means in terms of technical standards and approaches, and how AST holds the standard for accessibility high: complete and accurate, and nothing less.
The key to quality captioning is to start with a quality transcript. It may seem like a trivial matter to simply transcribe what a speaker is saying … until you try it yourself. The average speaker speaks between 120 and 150 words per minute (and even faster in some cases). Keeping up with this pace accurately is a challenge. An audio recording, of course, has no punctuation in it, so the transcriber must use knowledge and “common sense” to determine sentence and paragraph structure. Specialized terminology and proper names represent another hurdle to accuracy. Background noise, speaker dialect, mispronunciations, and speaker hesitations all make the task even more difficult. Getting an accurate transcript is a complex process, even for a trained and experienced transcriber.
With everyone’s attention focused on ways to reduce costs, it is very tempting to seek shortcuts in captioning. We see a plethora of new vendors in this field offering solutions based on speech recognition, “crowdsourcing” or student labor pools. Use a critical eye as you examine these solutions. If it is difficult for a trained, experienced transcriber to produce an accurate transcript, then it is a daunting challenge for an untrained person (like a student, or the typical worker picking up tasks in a crowdsourcing arrangement) to successfully complete the task.
Automated Speech Recognition, of course, is the most tempting offer. Wouldn’t it be wonderful if this task could be completely automated? There are some impressive demos from speech recognition systems and while they make the process seem flawless, these demos are not indicative of the actual performance you will get when captioning your own videos. YouTube’s “AutoCaption” feature represents the typical quality you should expect from a speech recognition system. Steer clear of their demo pieces and check out the performance on real videos. Here are a few I’ve run into:
That last one is one of my personal favorites. Try out AutoCaption on these. When you are listening to the audio and reading the text simultaneously in a language you know, your mind will tend to “fill-in” the errors for you. This is one of the reasons why editing error-filled speech recognition output also does not work that well. To get the real experience, try watching these videos with the audio off and the captions on – it will make the true error rate more apparent.
In today’s typical captioning task, there are rarely constraints on who the speaker is, what the topic is, or the acoustic conditions. This results in high error rates from today’s speech recognition systems—often in excess of 20%. To put this in perspective, readers report that error rates above 3% significantly degrade the intelligibility of text, and by the time the error rate reaches 10% they report that they are unable to even discern the topic being discussed (see our Research).
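To make these percentages concrete, here is a minimal Python sketch of how an error rate might be computed for a caption file. The post does not say how its error rates were measured, so this assumes the common word error rate definition (word-level substitutions, insertions, and deletions divided by the number of words in the correct transcript). The function names `word_error_rate` and `errors_per_minute` and the sample sentences are hypothetical and exist only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: errors are counted as substitutions, insertions, and
    deletions of whole words, ignoring case. Punctuation is not stripped.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from captions
                curr[j - 1] + 1,     # extra word inserted in captions
                prev[j - 1] + cost,  # words match, or one replaced the other
            ))
        prev = curr
    return prev[-1] / len(ref)


def errors_per_minute(words_per_minute: float, error_rate: float) -> float:
    """Expected number of wrong words a viewer sees per minute of speech."""
    return words_per_minute * error_rate


if __name__ == "__main__":
    # Hypothetical example: what was said vs. what the captions showed.
    said = "the key to quality captioning is a quality transcript"
    shown = "the key to quality caption in is a quantity transcript"
    print(f"word error rate: {word_error_rate(said, shown):.0%}")

    # Speaking rate from earlier in the post, at three error rates.
    for rate in (0.03, 0.10, 0.20):
        print(f"{rate:.0%} errors at 150 wpm: "
              f"about {errors_per_minute(150, rate):.1f} wrong words per minute")
```

Under these assumptions, a speaker at 150 words per minute captioned with a 20% error rate yields roughly 30 wrong words every minute, which helps explain why readers lose the thread so quickly.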
The argument that “something is better than nothing” is often put forward for these low-quality approaches and tools. But this really is not a valid argument: error-filled captions are at best difficult for the viewer to follow, and at worst convey information that is simply wrong. For public information and education content, particularly academic content, the accuracy rates typical of speech-to-text tools would not meet legal accessibility guidelines, nor have they proven reliable enough for delivering academic instruction.
For content owners who do not want to see their message distorted by error, risk civil rights lawsuits, or compromise academic integrity, a solution that ensures the highest quality is needed.
At AST we are sensitive to the need to contain the costs for transcription and captioning. By making extensive use of our proprietary automation technologies, while avoiding the pitfalls of speech recognition, we are proud to offer one of the highest-quality and lowest-cost solutions on the market. We believe that approaches using speech recognition technology, crowdsourcing, and untrained transcribers are simply not adequate to provide a quality result to your viewers. Captioning with a 20% error rate may provide comic relief, but it offers nothing in the way of accessibility.
As video adoption continues to increase and you are inundated with offers to caption your content using speech recognition, edited speech recognition, crowdsourcing, or student labor – view them critically. At AST, we will continue our approach that combines superior technology and trained professional transcribers, as we keep our commitment to quality captioning.