Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is a capability that enables a program to process human speech into a written format. While it’s commonly confused with voice recognition, speech recognition focuses on translating speech from a verbal format to a text one, whereas voice recognition seeks only to identify an individual user’s voice.
IBM has had a prominent role within speech recognition since its inception, releasing “Shoebox” in 1962. This machine had the ability to recognize 16 different words, advancing the initial work from Bell Labs from the 1950s. However, IBM didn’t stop there, but continued to innovate over the years, launching the VoiceType Simply Speaking application in 1996. This speech recognition software had a 42,000-word vocabulary, supported English and Spanish, and included a spelling dictionary of 100,000 words. While speech technology had a limited vocabulary in the early days, it is utilized in a wide range of industries today, such as automotive, technology, and healthcare. Its adoption has only continued to accelerate in recent years due to advancements in deep learning and big data. Research (link resides outside ibm.com) shows that this market is expected to be worth USD 24.9 billion by 2025.
Many speech recognition applications and devices are available, but the more advanced solutions use AI and machine learning. They integrate grammar, syntax, structure, and composition of audio and voice signals to understand and process human speech. Ideally, they learn as they go — evolving responses with each interaction.
The best systems also allow organizations to customize and adapt the technology to their specific requirements — everything from language and nuances of speech to brand recognition. For example:
- Language weighting: Improve precision by weighting specific words that are spoken frequently (such as product names or industry jargon), beyond terms already in the base vocabulary.
- Speaker labeling: Output a transcription that cites or tags each speaker’s contributions to a multi-participant conversation.
- Acoustics training: Attend to the acoustical side of the business. Train the system to adapt to an acoustic environment (like the ambient noise in a call center) and speaker styles (like voice pitch, volume and pace).
- Profanity filtering: Use filters to identify certain words or phrases and sanitize speech output.
Meanwhile, speech recognition continues to advance. Companies like IBM are making inroads in several areas to improve human and machine interaction.
The vagaries of human speech have made development challenging. It’s considered to be one of the most complex areas of computer science – involving linguistics, mathematics and statistics. Speech recognizers are made up of a few components, such as the speech input, feature extraction, feature vectors, a decoder, and a word output. The decoder leverages acoustic models, a pronunciation dictionary, and language models to determine the appropriate output.
Speech recognition technology is evaluated on its accuracy rate, i.e. word error rate (WER), and speed. A number of factors can impact word error rate, such as pronunciation, accent, pitch, volume, and background noise. Reaching human parity – meaning an error rate on par with that of two humans speaking – has long been the goal of speech recognition systems. Research from Lippmann (link resides outside ibm.com) estimates the word error rate to be around 4 percent, but it’s been difficult to replicate the results from this paper.
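The word error rate described above can be computed with a word-level edit distance. The following is a minimal sketch (not any vendor’s implementation), assuming a non-empty reference transcript:

```python
# Word error rate (WER) sketch: WER = (substitutions + insertions +
# deletions) / number of reference words, computed with word-level
# Levenshtein distance via dynamic programming.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("please order the pizza", "please order a pizza"))  # 0.25
```

One substitution against a four-word reference yields a WER of 0.25; real evaluation toolkits report substitutions, insertions, and deletions separately as well.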
Various algorithms and computation techniques are used to convert speech into text and improve the accuracy of transcription. Below are brief explanations of some of the most commonly used methods:
- Natural language processing (NLP): While NLP isn’t necessarily a specific algorithm used in speech recognition, it is the area of artificial intelligence that focuses on the interaction between humans and machines through language, both spoken and written. Many mobile devices incorporate speech recognition into their systems to conduct voice search—e.g. Siri—or provide more accessibility around texting.
- Hidden Markov models (HMM): Hidden Markov models build on the Markov chain model, which stipulates that the probability of a given state hinges on the current state, not on prior states. While a Markov chain model is useful for observable events, such as text inputs, hidden Markov models allow us to incorporate hidden events, such as part-of-speech tags, into a probabilistic model. They are utilized as sequence models within speech recognition, assigning labels to each unit—i.e. words, syllables, sentences, etc.—in the sequence. These labels create a mapping with the provided input, allowing the model to determine the most appropriate label sequence.
- N-grams: This is the simplest type of language model (LM), which assigns probabilities to sentences or phrases. An N-gram is a sequence of N words. For example, “order the pizza” is a trigram or 3-gram and “please order the pizza” is a 4-gram. Grammar and the probability of certain word sequences are used to improve recognition accuracy.
- Neural networks: Primarily leveraged for deep learning algorithms, neural networks process training data by mimicking the interconnectivity of the human brain through layers of nodes. Each node is made up of inputs, weights, a bias (or threshold) and an output. If that output value exceeds a given threshold, it “fires” or activates the node, passing data to the next layer in the network. Neural networks learn this mapping function through supervised learning, adjusting based on the loss function through the process of gradient descent. While neural networks tend to be more accurate and can accept more data, this comes at a performance efficiency cost as they tend to be slower to train compared to traditional language models.
- Speaker diarization (SD): Speaker diarization algorithms identify and segment speech by speaker identity. This helps programs better distinguish individuals in a conversation and is frequently applied in call centers to distinguish customers from sales agents.
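The N-gram idea above can be sketched with a toy bigram model. The corpus and counts here are invented for illustration; a real decoder would estimate these probabilities from far larger data and smooth them:

```python
from collections import Counter

# Maximum-likelihood bigram model: P(w2 | w1) = count(w1 w2) / count(w1).
# This is the same idea an ASR decoder uses to prefer likely word sequences.
corpus = [
    "please order the pizza",
    "please order the salad",
    "order the pizza now",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def bigram_prob(w1: str, w2: str) -> float:
    """Estimate P(w2 | w1); returns 0.0 for unseen history words."""
    if unigrams[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("order", "the"))   # 1.0: "order" is always followed by "the"
print(bigram_prob("the", "pizza"))   # "the" precedes "pizza" in 2 of 3 cases
```

Given two acoustically similar hypotheses, the decoder can use these probabilities to favor the word sequence the language model considers more likely.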
A wide range of industries utilize different applications of speech technology today, helping businesses and consumers save time and even lives. Some examples include:
Automotive: Speech recognizers improve driver safety by enabling voice-activated navigation systems and search capabilities in car radios.
Technology: Virtual agents are increasingly becoming integrated within our daily lives, particularly on our mobile devices. We use voice commands to access them through our smartphones, such as through Google Assistant or Apple’s Siri, for tasks such as voice search, or through our speakers, via Amazon’s Alexa or Microsoft’s Cortana, to play music. They’ll only continue to integrate into the everyday products that we use, fueling the “Internet of Things” movement.
Healthcare: Doctors and nurses leverage dictation applications to capture and log patient diagnoses and treatment notes.
Sales: Speech recognition technology has a couple of applications in sales. It can help a call center transcribe thousands of phone calls between customers and agents to identify common call patterns and issues. AI chatbots can also talk to people via a webpage, answering common queries and solving basic requests without needing to wait for a contact center agent to be available. In both instances, speech recognition systems help reduce time to resolution for consumer issues.
Security: As technology integrates into our daily lives, security protocols are an increasing priority. Voice-based authentication adds a viable level of security.
IBM has pioneered the development of Speech Recognition tools and services that enable organizations to automate their complex business processes while gaining essential business insights. IBM Watson Speech to Text is a cloud-native solution that uses deep-learning AI algorithms to apply knowledge about grammar, language structure, and audio/voice signal composition to create customizable speech recognition for optimal text transcription.
What Is Speech Recognition?
The human voice allows people to express their thoughts, emotions, and ideas through sound. Speech separates us from computing technology, but both similarly rely on words to transform ideas into shared understanding. In the past, we interfaced with computers and applications only through keyboards, controllers, and consoles—all hardware. But today, speech recognition software bridges the gap that separates speech and text.
First, let’s start with the meaning of automatic speech recognition: it’s the process of converting what speakers say into written or electronic text. Potential business applications include everything from customer support to translation services.
Now that you understand what speech recognition is, read on to learn how speech recognition works, different speech recognition types, and how your business can benefit from speech recognition applications.
How does speech recognition work?
Speech recognition technologies capture the human voice with physical devices like receivers or microphones. The hardware digitizes recorded sound vibrations into electrical signals. Then, the software attempts to identify sounds and phonemes—the smallest unit of speech—from the signals and match these sounds to corresponding text. Depending on the application, this text displays on the screen or triggers a directive—like when you ask your smart speaker to play a specific song and it does.
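The digitization step described above can be sketched in a few lines. The sample bytes here are synthetic stand-ins for microphone output; real pipelines read thousands of such samples per second (e.g. 16 kHz audio):

```python
import struct

# A microphone signal typically arrives as 16-bit PCM samples. Here we
# unpack raw little-endian bytes and normalize them to the [-1.0, 1.0]
# range that downstream feature extraction usually expects.
raw = struct.pack("<4h", 0, 16384, -16384, 32767)  # four synthetic samples

samples = struct.unpack("<4h", raw)
normalized = [s / 32768.0 for s in samples]
print(normalized)  # [0.0, 0.5, -0.5, ~1.0]
```

From these normalized samples, the software computes acoustic features (such as spectrogram frames) that are matched against phoneme and word models.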
Background noise, accents, slang, and cross talk can interfere with speech recognition, but advancements in artificial intelligence (AI) and machine learning technologies filter through these anomalies to increase precision and performance.
Thanks to new and emerging machine learning algorithms, speech recognition offers advanced capabilities:
- Natural language processing is a branch of computer science that uses AI to emulate how humans engage in and understand speech and text-based interactions.
- Hidden Markov Models (HMM) are statistical models that assign text labels to units of speech—like words, syllables, and sentences—in a sequence. Labels map to the provided input to determine the correct label or text sequence.
- N-grams are language models that assign probabilities to sentences or phrases to improve speech recognition accuracy. These contain sequences of words and use prior sequences of the same words to understand or predict new words and phrases. These calculations improve the predictions of sentence automatic completion systems, spell-check results, and even grammar checks.
- Neural networks consist of node layers that together emulate the learning and decision-making capabilities of the human brain. Nodes contain inputs, weights, a threshold, and an output value. Outputs that exceed the threshold activate the corresponding node and pass data to the next layer. Recurrent variants also retain context from earlier words, continually improving recognition accuracy.
- Connectionist temporal classification is a neural network algorithm that uses probability to map text transcript labels to incoming audio. It helps train neural networks to understand speech and build out node networks.
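The connectionist temporal classification (CTC) mapping mentioned above rests on a simple collapsing rule. This is an illustrative sketch of that rule only (not Twilio’s implementation): the network emits one label per audio frame, including a special blank symbol, and collapsing repeats then dropping blanks recovers the transcript.

```python
BLANK = "-"  # CTC's special "no label" symbol between characters

def ctc_collapse(frame_labels):
    """Greedy CTC decoding step: merge repeated labels, drop blanks."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# Frame-by-frame output "hh-e-lll-l-oo" collapses to "hello"; the blank
# between the two l-runs is what preserves the genuine double letter.
print(ctc_collapse(list("hh-e-lll-l-oo")))
```

Because many frame sequences collapse to the same transcript, CTC training sums the probability over all of them, which lets the network learn from audio without frame-level alignments.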
Features of speech recognition
Not all speech recognition works the same. Implementations vary by application, but each uses AI to quickly process speech at a high—but not flawless—quality level. Many speech recognition technologies include the same features:
- Filtering identifies and censors—or removes—specified words or phrases to sanitize text outputs.
- Language weighting assigns more value to frequently spoken words—like proper nouns or industry jargon—to improve speech recognition precision.
- Speaker labeling distinguishes between multiple conversing speakers by identifying contributions based on vocal characteristics.
- Acoustics training analyzes conditions—like ambient noise and particular speaker styles—then tailors the speech recognition software to that environment. It’s useful when recording speech in busy locations, like call centers and offices.
- Voice recognition helps speech recognition software pivot the listening approach to each user’s accent, dialect, and grammatical library.
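The filtering feature in the list above can be sketched with a simple word mask. The word list and replacement style here are assumptions for illustration; production filters typically use configurable vocabularies:

```python
import re

# Sanitize a transcript by masking blocked words: matches are whole
# words and case-insensitive, and each match is replaced with asterisks.
BLOCKED = ["darn", "heck"]  # hypothetical word list

def sanitize(transcript: str) -> str:
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, BLOCKED)) + r")\b",
                         re.IGNORECASE)
    return pattern.sub(lambda m: "*" * len(m.group()), transcript)

print(sanitize("Well, darn it!"))  # "Well, **** it!"
```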
5 benefits of speech recognition technology
The popularity and convenience of speech recognition technology have made speech recognition a big part of everyday life. Adoption of this technology will only continue to spread, so learn more about how speech recognition transforms how we live and work:
- Speed: Speaking with your voice is faster than typing with your fingers—in most cases.
- Assistance: Speech recognition technology lets devices listen to directions from users and act accordingly. For instance, if your vehicle’s sound system has speech recognition capabilities, you can tell it to tune the radio to a particular channel or map directions to a specified address.
- Productivity: Dictating your thoughts and ideas instead of typing them out saves time and effort that you can redirect toward other tasks. To illustrate, picture yourself dictating a report into your smartphone while walking or driving to your next meeting.
- Intelligence: Speech recognition applications learn from and adapt to your unique speech habits and environment, identifying and understanding you better over time.
- Accessibility: Speech recognition lets people with visual impairments who can’t see a keyboard enter text by voice. Software and websites like Google Meet and YouTube can accommodate hearing-impaired viewers with text captions of live speech translated into the user’s specific language.
Business speech recognition use cases
Speech recognition directly connects products and services to customers. It powers interactive voice response (IVR) software that routes customers to the right support agents—each more productive with faster, hands-free communication. Along the way, speech recognition captures actionable insights from customer conversations that you can use to bolster your organization’s operational and marketing processes.
Here are some real-world speech recognition contexts and applications:
- SMS/MMS messages: Dictate and send SMS or MMS messages hands-free in environments where typing isn’t practical.
- Chatbot discussions: Get answers to product or service-related questions any time of day or night with chatbots.
- Web browsing: Browse the internet without a mouse, keyboard, or touch screen through voice commands.
- Active learning: Enable students to enjoy interactive learning applications—such as those that teach a new language—while teachers create lesson plans.
- Document writing: Use speech-to-text to draft a Google or Word document when you can’t access a physical or digital keyboard. You can later return to the document and refine it once you have an opportunity to use a keyboard. Doctors and nurses often use these applications to log patient diagnoses and treatment notes efficiently.
- Phone transcriptions: Help callers and receivers transcribe a conversation between two or more speakers with phone APIs.
- Interviews: Turn spoken words into a comprehensive transcript the interviewer can reference later. When a journalist interviews someone, they may want to record it so they can stay active and attentive without risking misquotes.
Try Twilio’s Speech Recognition API
Speech-to-text applications help you connect to larger and more diverse audiences. But to deploy these capabilities at scale, you need flexible and affordable speech recognition technology—and that’s where we can help.
Twilio’s Speech Recognition API performs real-time transcription, converting speech to text in 119 languages and dialects. Make your customer service more accessible on a pay-as-you-go plan, with no upfront fees and free support. Get started for free!
Speech recognition is the capability of an electronic device to understand spoken words. A microphone records a person's voice and the hardware converts the signal from analog sound waves to digital audio. The audio data is then processed by software, which interprets the sound as individual words.
A common type of speech recognition is "speech-to-text" or "dictation" software, such as Dragon Naturally Speaking, which outputs text as you speak. While you can buy speech recognition programs, modern versions of the Macintosh and Windows operating systems include a built-in dictation feature. This capability allows you to record text as well as perform basic system commands.
In Windows, some programs support speech recognition automatically while others do not. You can enable speech recognition for all applications by selecting All Programs → Accessories → Ease of Access → Windows Speech Recognition and clicking "Enable dictation everywhere." In OS X, you can enable dictation in the "Dictation & Speech" system preference pane. Simply check the "On" button next to Dictation to turn on the speech-to-text capability. To start dictating in a supported program, select Edit → Start Dictation. You can also view and edit spoken commands in OS X by opening the "Accessibility" system preference pane and selecting "Speakable Items."
Another type of speech recognition is interactive speech, which is common on mobile devices, such as smartphones and tablets. Both iOS and Android devices allow you to speak to your phone and receive a verbal response. The iOS version is called "Siri," and serves as a personal assistant. You can ask Siri to save a reminder on your phone, tell you the weather forecast, give you directions, or answer many other questions. This type of speech recognition is considered a natural user interface (or NUI), since it responds naturally to your spoken input.
While many speech recognition systems only support English, some speech recognition software supports multiple languages. This requires a unique dictionary for each language and extra algorithms to understand and process different accents. Some dictation systems, such as Dragon Naturally Speaking, can be trained to understand your voice and will adapt over time to understand you more accurately.
The definition of Speech Recognition on this page is an original definition written by the TechTerms.com team.