Detailed Guide on Sample Rate for ASR! [2023]

• The Fundamentals of Automatic Speech Recognition (ASR)
• Sample Rate Explained
• Why a Higher Sample Rate Produces Better Audio Quality
• Limitations of a Higher Sample Rate
• How to Choose the Right Sample Rate for ASR
• Recommended Sample Rates for Various ASR Use Cases
• FutureBeeAI is Here to Assist You

In our blog, we love exploring a bunch of cool AI stuff, like making computers understand speech, recognize things in pictures, create art, chat like humans, read handwriting, and much more!

Today, we're diving deep into a crucial aspect of training speech recognition models: the sample rate. We'll keep things simple and explain why it's a big deal.

By the end of this blog, you'll know exactly how to pick the perfect sample rate for your speech recognition project and why it matters so much! So, let's get started!

The Fundamentals of Automatic Speech Recognition (ASR)

Automatic Speech Recognition, or ASR for short, is a branch of artificial intelligence dedicated to the conversion of spoken words into written text. For an ASR model to effectively understand any language, it must undergo rigorous training using a substantial amount of spoken language data in that particular language.

This speech dataset comprises audio files recorded in the target language along with their corresponding transcriptions. The audio files are recordings of human speech, and one of their crucial technical attributes is the sample rate, alongside bit depth, file format, and so on. We will cover those other technical features in a future post.

When training an ASR model, we have two main options: using an existing open-source or off-the-shelf dataset, or creating our own custom training dataset. For open-source or off-the-shelf datasets, it is essential to verify the sample rate at which the audio data was recorded. For custom speech dataset collection, it is equally vital to ensure that all audio data is recorded at the specified sample rate.

In summary, the selection of audio files with the required sample rate plays a pivotal role in the ASR training process. To gain a deeper understanding of sample rate, let's delve into its intricacies.

Sample Rate Explained

Let's dive into the concept of sample rate. In simple terms, sample rate refers to the number of audio samples captured in one second. You might also hear it called sampling frequency or sampling rate.

To measure the sample rate, we use Hertz (Hz) as the unit of measurement. Often, you'll see it expressed in kilohertz (kHz) in everyday discussions.

Now, let's visualize what the sample rate looks like on an audio graph.

Sample Rate Graph

The red line in the graph represents the sound signal, while the yellow dots scattered along it represent individual samples. Think of sample rate as a measure of how many of these samples are taken in a single second. For instance, if you have an audio file with an 8 kHz sample rate, it means that 8,000 samples are captured per second for that audio file.

Now, imagine you want to recreate the sound signal from these samples. Which scenario do you think would make it easier: having a high sample rate or a low one?

To clarify, think of the graph again. If you have more dots (samples), you can reconstruct the sound signal more accurately compared to having fewer dots. Essentially, a higher sample rate means a more detailed representation of the audio signal, allowing you to encode more information and ultimately resulting in better audio quality.

So, if you have two audio files, one with an 8 kHz sample rate and another with a 48 kHz sample rate, the 48 kHz file will generally sound much better.

Why a Higher Sample Rate Produces Better Audio Quality

Let's dive into why a higher sample rate allows for more information to be encoded.

Picture trying to capture images of a fast-moving car on a road. Your frequency of capturing images can be likened to the sample rate. If your capture frequency is too low, you'll miss important moments because the car is moving too quickly.

But if your capture frequency is high, you can capture each crucial moment, making it possible to faithfully reproduce the visual.

This same principle applies to audio. If your sample rate is low, meaning you're capturing fewer sound signals in a given time, you might miss subtle nuances in speech. Consequently, when you attempt to reproduce the audio, it won't match the original quality.

However, when you have a high enough sample rate, you capture all the nuances of speech, enabling accurate audio reproduction.

In fact, with a sufficiently high sample rate, you can reproduce audio so accurately that humans can't distinguish it from the original.

But what qualifies as a "high enough" sample rate? Does this mean that a higher sample rate is always better?

Not necessarily. Using the image analogy again, if your capture frequency is excessively high, you might end up with duplicate images. Similarly, in audio, an excessively high sample rate can capture unnecessary background noise and other irrelevant details.

To determine the right sample rate, we turn to the Nyquist theorem. It states that to avoid aliasing and accurately capture a signal, you must sample it at a rate of at least twice the highest frequency you want to capture.
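In symbols, with \(f_s\) the sample rate and \(f_{max}\) the highest frequency you care about:

\[ f_s \ge 2 f_{max} \]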

For humans, our ears are sensitive to frequencies between 20 Hz and 20 kHz. Following Nyquist, a sample rate of at least 40 kHz is needed to cover that full range. This is why CD audio uses a 44.1 kHz sample rate and professional audio commonly uses 48 kHz, with the extra few kilohertz acting as a buffer against data loss during analog-to-digital conversion.

Limitations of a Higher Sample Rate

However, despite its high audio quality, a 48 kHz audio file may not be suitable for training Automatic Speech Recognition (ASR) models, for several reasons:

- High sample rates require more computational power, making them less practical for certain applications.
- Increased computational demands mean higher power consumption and a larger carbon footprint.
- Audio files with higher sample rates are larger, so they need more storage space.
- Larger file sizes also mean slower data transmission between modules.
- As discussed earlier, a higher sample rate captures more information, which often includes background noise, and this can effectively amplify noise in the data.
- Not all ASR systems or AI modules support high sample rates, which can limit interoperability.

How to Choose the Right Sample Rate for ASR

So how do you choose the optimal sample rate for an ASR system? Let's find out.

It primarily depends on the use case and the frequency range of human speech. Speech intelligibility is carried mostly by frequencies between roughly 300 Hz and 3400 Hz. Doubling the upper limit according to Nyquist gives about 6.8 kHz, so a sample rate of 8000 Hz is sufficient to capture the human voice accurately. This is why 8 kHz is commonly used in speech recognition systems, telecommunication channels, and codecs.

While offering sufficient quality, 8 kHz also brings the advantages of lower computational cost, lower power consumption, and less data to store and transfer. That doesn't mean 8 kHz gives the best quality; it is rather a sweet spot in the trade-off between quality and practical constraints.

As mentioned earlier, choosing the right sample rate also depends on the use case. Many HD voice devices use 16 kHz because it captures high-frequency information more accurately than 8 kHz. So if you have the computational budget to train your AI model at a higher rate, you can choose 16 kHz in place of 8 kHz.

In most cases, ASR models for voice recognition tasks often do not require sample rates exceeding 22 kHz. On the other hand, in scenarios where exceptional audio quality is essential, such as music and audio production, a sample rate of 44 kHz to 48 kHz is preferred.

For Text-to-Speech (TTS) applications, which require detailed acoustic characteristics, sample rates of 22.05 kHz, 32 kHz, 44.1 kHz, or 48 kHz are used to ensure accurate audio reproduction from text.

Recommended Sample Rates for Various ASR Use Cases

By now it is clear that choosing the optimal sample rate depends on your use case. Below are some common ASR use cases and the sample rates generally used for them.

Voice Assistants (e.g., Siri, Alexa, Google Assistant):

- Optimal Sample Rate: 16 kHz to 48 kHz - These applications prioritize high-quality audio for natural language understanding. Sample rates between 16 kHz and 48 kHz are often used to capture clear and detailed voice input.

Conversational AI, Telephony, and IVR Systems:

- Optimal Sample Rate: 8 kHz - Traditional telephone systems, Interactive Voice Response (IVR) systems, and call center ASR solutions typically use an 8 kHz sample rate to match telephony standards.

Transcription Services (e.g., Speech-to-Text):

- Optimal Sample Rate: 16 kHz to 48 kHz - When transcribing spoken content for applications like transcription services, podcasts, or video captions, higher sample rates in the range of 16 kHz to 48 kHz are often preferred for accuracy.

Medical Transcription and Dictation:

- Optimal Sample Rate: 16 kHz to 48 kHz - Medical transcription and dictation applications typically benefit from higher sample rates to capture medical professionals' detailed speech accurately.

Remember that the optimal sample rate can vary based on the specific requirements and constraints of each ASR use case. It's essential to conduct testing and evaluation to determine the best sample rate for your application while considering factors like audio quality, computational resources, and the intended environment.
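If your source audio does not match the sample rate you have settled on, you can resample it before training or inference. Below is a minimal sketch using torchaudio; the file names and the 16 kHz target are just illustrative assumptions.

```python
import torchaudio

# Load a recording at whatever rate it was captured at (e.g., 48 kHz studio audio).
waveform, orig_sr = torchaudio.load("recording_48k.wav")  # hypothetical file

# Downsample to 16 kHz, a common rate for ASR training data.
target_sr = 16000
resampler = torchaudio.transforms.Resample(orig_freq=orig_sr, new_freq=target_sr)
waveform_16k = resampler(waveform)

torchaudio.save("recording_16k.wav", waveform_16k, target_sr)
```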

FutureBeeAI is Here to Assist You!

We at FutureBeeAI assist AI organizations working on any ASR use case with our extensive speech data offerings. With our pre-made datasets, including general conversation, call center conversation, and scripted monologue speech, you can scale your AI model development. All of these datasets are diverse across 40+ languages and 6+ industries. You can check out all the published speech data here.

Speech Data collection app Yugo

Along with that, with our state-of-the-art mobile application and global crowd community, you can collect custom speech datasets tailored to your requirements. Our data collection mobile application, Yugo, allows you to record both scripted and conversational speech data with flexible technical settings such as sample rate, bit depth, file format, and audio channels. Check out our Yugo application here.

Feel free to reach out to us in case you need any help with training datasets for your ASR use cases. We would love to assist you!



DeepSpeech for Dummies - A Tutorial and Overview

What is DeepSpeech and how does it work? This post shows basic examples of how to use DeepSpeech for asynchronous and real time transcription.


What is DeepSpeech?

DeepSpeech is a neural network architecture first published by a research team at Baidu. In 2017, Mozilla created an open source implementation of this paper, dubbed "Mozilla DeepSpeech".

The original DeepSpeech paper from Baidu popularized the concept of “end-to-end” speech recognition models. “End-to-end” means that the model takes in audio, and directly outputs characters or words. This is compared to traditional speech recognition models, like those built with popular open source libraries such as Kaldi or CMU Sphinx, that predict phonemes, and then convert those phonemes to words in a later, downstream process.

The goal of “end-to-end” models, like DeepSpeech, was to simplify the speech recognition pipeline into a single model. In addition, the theory introduced by the Baidu research paper was that training large deep learning models, on large amounts of data, would yield better performance than classical speech recognition models.

Today, the Mozilla DeepSpeech library offers pre-trained speech recognition models that you can build with, as well as tools to train your own DeepSpeech models. Another cool feature is the ability to contribute to DeepSpeech’s public training dataset through the Common Voice project.

In the below tutorial, we’re going to walk you through installing and transcribing audio files with the Mozilla DeepSpeech library (which we’ll just refer to as DeepSpeech going forward).

Basic DeepSpeech Example

DeepSpeech is easy to get started with. As discussed in our overview of Python Speech Recognition in 2021, you can download, and get started with, DeepSpeech using Python's built-in package installer, pip. If you have cURL installed, you can download DeepSpeech's pre-trained English model files from the DeepSpeech GitHub repo as well. Notice that the files we're downloading below are the '.scorer' and '.pbmm' files.
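A sketch of those steps, assuming the v0.9.3 release (adjust the version to match the DeepSpeech package you install):

```bash
pip install deepspeech

curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
```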

A quick heads up - when using DeepSpeech, it is important to consider that only 16 kilohertz (kHz) .wav files are supported as of late September 2021.

Let’s go through some example code on how to asynchronously transcribe speech with DeepSpeech. If you’re using a Unix distribution, you’ll need to install Sound eXchange (sox). Sox can be installed by using either ‘apt’ for Ubuntu/Debian or ‘dnf’ for Fedora as shown below.
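For example (pick the line that matches your distribution):

```bash
sudo apt install sox   # Ubuntu/Debian
sudo dnf install sox   # Fedora
```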

Now let's also install the Python libraries we'll need to get this to work. We're going to need the DeepSpeech library, webrtcvad for voice activity detection, and pyqt5 for accessing multimedia (sound) capabilities on desktop systems. We already installed DeepSpeech earlier, so we can install the other two libraries with pip like so:
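```bash
pip install webrtcvad pyqt5
```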

Now that we have all of our dependencies, let’s create a transcriber. When we’re finished, we will be able to transcribe any ‘.wav’ audio file just like the example shown below.


Before we get started on building our transcriber, make sure the model files we downloaded earlier are saved in the ‘./models’ directory of the working directory. The first thing we’re going to do is create a voice activity detection (VAD) function and use that to extract the parts of the audio file that have voice activity.

How can we create a VAD function? We’re going to need a function to read in the ‘.wav’ file, a way to generate frames of audio, and a way to create a buffer to collect the parts of the audio that have voice activity. Frames of audio are objects that we construct that contain the byte data of the audio, the timestamp in the total audio, and the duration of the frame. Let’s start by creating our wav file reader function.

All we need to do is open the file given, assert that the channels, sample width, sample rate are what we need, and finally get the frames and return the data as PCM data along with the sample rate and duration. We’ll use ‘contextlib’ to open, read, and close the wav file.

We’re expecting audio files with 1 channel, a sample width of 2, and a sample rate of either 8000, 16000, or 32000. We calculate duration as the number of frames divided by the sample rate.
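Here is a minimal sketch of such a reader using only the standard library; the name read_wave is our own choice, but the checks mirror what was just described:

```python
import contextlib
import wave

def read_wave(path):
    """Read a .wav file and return (PCM data, sample rate, duration in seconds)."""
    with contextlib.closing(wave.open(path, 'rb')) as wf:
        assert wf.getnchannels() == 1          # mono audio only
        assert wf.getsampwidth() == 2          # 16-bit samples
        sample_rate = wf.getframerate()
        assert sample_rate in (8000, 16000, 32000)
        frames = wf.getnframes()
        pcm_data = wf.readframes(frames)
        duration = frames / sample_rate
    return pcm_data, sample_rate, duration
```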

Now that we have a way to read in the wav file, let’s create a frame generator to generate individual frames containing the size, timestamp, and duration of a frame. We’re going to generate frames in order to ensure that our audio is processed in reasonably sized clips and to separate out segments with and without speech.


The below generator function takes the frame duration in milliseconds, the PCM audio data, and the sample rate as inputs. It uses that data to create an offset starting at 0, a frame size, and a duration. While we have not yet produced enough frames to cover the entire audio file, the function will continue to yield frames and add to our timestamp and offset.
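A sketch of the Frame object and generator described above (the 16-bit sample width gives two bytes per sample, hence the factor of 2):

```python
class Frame(object):
    """A frame of audio: raw bytes plus its timestamp and duration in the stream."""
    def __init__(self, bytes, timestamp, duration):
        self.bytes = bytes
        self.timestamp = timestamp
        self.duration = duration

def frame_generator(frame_duration_ms, audio, sample_rate):
    """Yield successive frames of frame_duration_ms milliseconds from PCM audio."""
    n = int(sample_rate * (frame_duration_ms / 1000.0) * 2)  # bytes per frame (16-bit samples)
    offset = 0
    timestamp = 0.0
    duration = (float(n) / sample_rate) / 2.0
    while offset + n < len(audio):
        yield Frame(audio[offset:offset + n], timestamp, duration)
        timestamp += duration
        offset += n
```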

After being able to generate frames of audio, we’ll create a function called vad_collector to separate out the parts of audio with and without speech. This function requires an input of the sample rate, the frame duration in milliseconds, the padding duration in milliseconds, a webrtcvad.Vad object, and a collection of audio frames. This function, although not explicitly called as such, is also a generator function that generates a series of PCM audio data.

The first thing we’re going to do in this function is get the number of padding frames and create a ring buffer with a dequeue. Ring buffers are most commonly used for buffering data streams.

We’ll have two states, triggered and not triggered, to indicate whether or not the VAD collector function should be adding frames to the list of voiced frames or yielding that list in bytes.

Starting with an empty list of voiced frames and a not triggered state, we loop through each frame. If we are not in a triggered state, and the frame is decided to be speech, then we add it to the buffer. If after this addition of the new frame to the buffer more than 90% of the buffer is decided to be speech, we enter the triggered state, appending the buffered frames to voiced frames and clearing the buffer.

If the function is already in a triggered state when we process a frame, then we append that frame to the voiced frames list regardless of whether it is speech or not. We then append it, and the truth value for whether it is speech or not, to the buffer. After appending to the buffer, if the buffer is more than 90% non-speech, then we change our state to not-triggered, yield voiced frames as bytes, and clear both the voiced frames list and the ring buffer. If, by the end of the frames, there are still frames in voiced frames, yield them as bytes.
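Putting that logic together, the collector might look like the following sketch (the 90% thresholds match the description above):

```python
import collections

def vad_collector(sample_rate, frame_duration_ms, padding_duration_ms, vad, frames):
    """Filter out non-voiced audio, yielding each voiced stretch as a bytes object."""
    num_padding_frames = int(padding_duration_ms / frame_duration_ms)
    ring_buffer = collections.deque(maxlen=num_padding_frames)
    triggered = False
    voiced_frames = []
    for frame in frames:
        is_speech = vad.is_speech(frame.bytes, sample_rate)
        if not triggered:
            ring_buffer.append((frame, is_speech))
            num_voiced = len([f for f, speech in ring_buffer if speech])
            # Enter the triggered state once >90% of buffered frames are speech.
            if num_voiced > 0.9 * ring_buffer.maxlen:
                triggered = True
                voiced_frames.extend(f for f, _ in ring_buffer)
                ring_buffer.clear()
        else:
            voiced_frames.append(frame)
            ring_buffer.append((frame, is_speech))
            num_unvoiced = len([f for f, speech in ring_buffer if not speech])
            # Leave the triggered state once >90% of buffered frames are non-speech.
            if num_unvoiced > 0.9 * ring_buffer.maxlen:
                triggered = False
                yield b''.join(f.bytes for f in voiced_frames)
                ring_buffer.clear()
                voiced_frames = []
    if voiced_frames:
        yield b''.join(f.bytes for f in voiced_frames)
```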

That’s all we need to do to make sure that we can read in our wav file and use it to generate clips of PCM audio with voice activity detection. Now let’s create a segment generator that will return more than just the segment of byte data for the audio, but also the metadata needed to transcribe it. This function requires only one parameter, the ‘.wav’ file. It is meant to filter out all the audio frames that it does not detect voice on, and return the parts of the audio file with voice. The function returns a tuple of the segments, the sample rate of the audio file, and the length of the audio file.
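Using the helpers sketched above, the segment generator can be as simple as this (the 30 ms frames and 300 ms padding are typical webrtcvad settings, not values mandated by DeepSpeech):

```python
import webrtcvad

def vad_segment_generator(wav_file, aggressiveness=1):
    """Return (voiced segments, sample rate, audio length in seconds) for a wav file."""
    audio, sample_rate, audio_length = read_wave(wav_file)
    vad = webrtcvad.Vad(aggressiveness)
    frames = frame_generator(30, audio, sample_rate)
    segments = vad_collector(sample_rate, 30, 300, vad, list(frames))
    return segments, sample_rate, audio_length
```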

Now that we’ve handled the wav file and have created all the functions necessary to turn a wav file into segments of voiced PCM audio data that DeepSpeech can process, let’s create a way to load and resolve our models.

We'll create two functions called load_model and resolve_models. Intuitively, the load_model function loads a model, returning the DeepSpeech object, the model load time, and the scorer load time. This function requires a model and a scorer. It measures the time it takes to load the model and scorer using a timer (Python's timeit.default_timer works well for this), and it creates a DeepSpeech 'Model' object from the 'model' parameter passed in.

The resolve_models function takes a directory name indicating which directory the models are in. It then grabs the first file ending in '.pbmm' and the first file ending in '.scorer' and loads them as the models.
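A sketch of both functions, assuming the DeepSpeech Python package is importable as deepspeech:

```python
import glob
import os
from timeit import default_timer as timer

import deepspeech

def load_model(models, scorer):
    """Load the acoustic model and external scorer, timing how long each takes."""
    model_load_start = timer()
    ds = deepspeech.Model(models)
    model_load_time = timer() - model_load_start

    scorer_load_start = timer()
    ds.enableExternalScorer(scorer)
    scorer_load_time = timer() - scorer_load_start

    return ds, model_load_time, scorer_load_time

def resolve_models(dir_name):
    """Pick the first .pbmm and .scorer files found in dir_name."""
    pb = glob.glob(os.path.join(dir_name, '*.pbmm'))[0]
    scorer = glob.glob(os.path.join(dir_name, '*.scorer'))[0]
    return pb, scorer
```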

Being able to segment out the speech from our wav file, and load up our models, is all the preprocessing we need leading up to doing the actual Speech-to-Text conversion.

Let's now create a function that will allow us to transcribe our speech segments. This function will have three parameters: the DeepSpeech object (returned from load_model), the audio, and fs, the sampling rate of the audio. All it does, other than keep track of processing time, is call the DeepSpeech object's stt function on the audio.
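For example (in recent DeepSpeech releases the sample rate is fixed by the model, so fs is kept here only to match the description above):

```python
from timeit import default_timer as timer

import numpy as np

def stt(ds, audio, fs):
    """Run Speech-to-Text on one segment of 16-bit PCM audio bytes."""
    audio = np.frombuffer(audio, dtype=np.int16)
    inference_start = timer()
    output = ds.stt(audio)           # fs is unused; the model defines its own rate
    inference_time = timer() - inference_start
    return output, inference_time
```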


Alright, all our support functions are ready to go, let’s do the actual Speech-to-Text conversion.

In our “main” function below we’ll go ahead and directly provide a path to the models we downloaded and moved to the ‘./models’ directory of our working directory at the beginning of this tutorial.

We can ask the user for the level of aggressiveness for filtering out non-voice, or just automatically set it to 1 (from a scale of 0-3). We’ll also need to know where the audio file is located.

After that, all we have to do is use the functions we made earlier to load and resolve our models, load up the audio file, and run the Speech-to-Text inference on each segment of audio. The rest of the code below is just for debugging purposes to show you the filename, the duration of the file, how long it took to run inference on a segment, and the load times for the model and the scorer.

The function will save your transcript to a '.txt' file, as well as output the transcription in the terminal.
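Tying it together, here is a sketch of such a main function; the './models' path and the argument handling are our assumptions, matching the setup described above:

```python
import os
import sys

def main(audio_path, aggressiveness=1):
    # Locate and load the model files saved in ./models earlier.
    model_path, scorer_path = resolve_models('./models')
    ds, model_load_time, scorer_load_time = load_model(model_path, scorer_path)

    # Split the wav file into voiced segments and transcribe each one.
    segments, sample_rate, audio_length = vad_segment_generator(audio_path, aggressiveness)
    transcript = []
    for i, segment in enumerate(segments):
        text, inference_time = stt(ds, segment, sample_rate)
        print("Segment %d (%.2fs inference): %s" % (i, inference_time, text))
        transcript.append(text)

    # Save the transcript next to the audio file and report the debug timings.
    with open(os.path.splitext(audio_path)[0] + '.txt', 'w') as f:
        f.write(' '.join(transcript))
    print("Audio length: %.2fs | model load: %.2fs | scorer load: %.2fs"
          % (audio_length, model_load_time, scorer_load_time))

if __name__ == '__main__':
    main(sys.argv[1])
```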

That’s it! That’s all we have to do to use DeepSpeech to do Speech Recognition on an audio file. That’s a surprisingly large amount of code. A while ago, I also wrote an article on how to do this in much less code with the AssemblyAI Speech-to-Text API. You can read about how to do Speech Recognition in Python in under 25 lines of code if you don’t want to go through all of this code to use DeepSpeech.

Basic DeepSpeech Real-Time Speech Recognition Example

Now that we’ve seen how we can do asynchronous Speech Recognition with DeepSpeech, let’s also build a real time Speech Recognition example. Just like before, we’ll start with installing the right requirements. Similar to the asynchronous example above, we’ll need webrtcvad, but we’ll also need pyaudio, halo, numpy, and scipy.

Halo is for an indicator that the program is streaming, numpy and scipy are used for resampling our audio to the right sampling rate.
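So the dependency install for this example looks roughly like:

```bash
pip install deepspeech webrtcvad pyaudio halo numpy scipy
```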

How will we build a real time Speech Recognition program with DeepSpeech? Just as we did in the example above, we’ll need to separate out voice activity detected segments of audio from segments with no voice activity. If the audio frame has voice activity, then we’ll feed it into the DeepSpeech model to be transcribed.

Let’s make an object for our voice activity detected audio frames, we’ll call it VADAudio (voice activity detection audio). To start, we’ll define the format, the rate, the number of channels, and the number of frames per second for our class.

Every class needs an __init__ function. The __init__ function for our VADAudio class, defined below, will take in four parameters: a callback, a device, an input rate, and a file. Everything but the input_rate will default to None if they are not passed at creation.

The input sampling rate will default to the processing rate we defined on the class above. When we initialize our class, we will also create an instance method called proxy_callback which returns a tuple of None and the pyAudio signal to continue, but before it returns, it calls the callback function, hence the name proxy_callback.

Upon initialization, the first thing we do is set ‘callback’ to a function that puts the data into the buffer queue belonging to the object instance. We initialize an empty queue for the instance’s buffer queue. We set the device and input rate to the values passed in, and the sample rate to the Class’ sample rate. Then, we derive our block size and block size input as quotients of the Class’ sample rate and the input rate divided by the number of blocks per second respectively. Blocks are the discrete segments of audio data that we will work with.

Next, we create a PyAudio object and declare a set of keyword arguments: format, set to the VADAudio Class' format value we declared earlier; channels, set to the Class's channel value; rate, set to the input rate; input, set to True; frames_per_buffer, set to the block size input calculated earlier; and stream_callback, set to the proxy_callback instance function we created earlier. We'll also set our aggressiveness for filtering background noise to the aggressiveness passed in, which defaults to 3, the highest filter level, and we set the chunk size to None for now.

If a device is passed into the initialization of the object, we set an additional keyword argument, input_device_index, to the device. The device is the input device used, but what we actually pass through is the index of the device as defined by pyAudio; this is only necessary if you want to use an input device that is not your computer's default. If no device was passed in but a file object was, we change the chunk size to 320 and open the file to read in as bytes.

Finally, we open and start a PyAudio stream with the keyword arguments dictionary we made.

Our VADAudio Class will have 6 functions: resample, read_resampled, read, write_wav, a frame generator, and a voice activity detected segment collector. Let’s start by making the resample function. Due to limitations in technology, not all microphones support DeepSpeech’s native processing sampling rate. This function takes in audio data and an input sample rate, and returns a string of the data resampled into 16 kHz.
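For instance, the resample method might look like the following sketch, assuming the class defines RATE_PROCESS = 16000 (DeepSpeech's expected rate) and uses scipy for the resampling:

```python
import numpy as np
from scipy import signal

def resample(self, data, input_rate):
    """Resample raw 16-bit PCM bytes from input_rate to the 16 kHz processing rate."""
    data16 = np.frombuffer(data, dtype=np.int16)
    resample_size = int(len(data16) / input_rate * self.RATE_PROCESS)
    resampled = signal.resample(data16, resample_size)
    return np.array(resampled, dtype=np.int16).tobytes()
```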

Next, we’ll make the read and read_resampled functions together because they do basically the same thing. The read function “reads” the audio data, and the read_resampled function will read the resampled audio data. The read_resampled function will be used to read audio that wasn’t sampled at the right sampling rate initially.

The write_wav function takes a filename and data. It opens a file with the filename and allows writing of bytes with a sample width of 2 and a frame rate equal to the instance’s sample rate, and writes the data as the frames before closing the wave file.

Before we create our frame generator, we’ll set a property for the frame duration in milliseconds using the block size and sample rate of the instance.

Now, let’s create our frame generator. The frame generator will either yield the raw data from the microphone/file, or the resampled data using the read and read_resampled functions from the Audio class. If the input rate is equal to the default rate, then it will simply read in the raw data, else it will return the resampled data.

The final function we’ll need in our VADAudio is a way to collect our audio frames. This function takes a padding in milliseconds, a ratio that controls when the function “triggers” similar to the one in the basic async example above, and a set of frames that defaults to None.

The default value for padding_ms is 300, and the default for the ratio is 0.75. The padding is for padding the audio segments, and a ratio of 0.75 here means that if 75% of the audio in the buffer is speech, we will enter the triggered state. If there are no frames passed in, we’ll call the frame generator function we created earlier. We’ll define the number of padding frames as the padding in milliseconds divided by the frame duration in milliseconds that we derived earlier.

The ring buffer for this example will use a dequeue with a max length of the number of padding frames. We will start in a not triggered state. We will loop through each of the frames, returning if we hit a frame with a length of under 640. As long as the length of the frame is over 640, we check to see if the audio contains speech.

Now, we execute the same algorithm we did above for the basic example in order to collect audio frames that contain speech. While not triggered, we append speech frames to the ring buffer, triggering the state if the amount of speech frames to the total frames is above the threshold or ratio we passed in earlier.

Once triggered, we yield each frame in the buffer and clear the buffer. In a triggered state, we immediately yield the frame, and then append the frame to the ring buffer. We then check the ring buffer for the ratio of non-speech frames to speech frames and if that is over our predefined ratio, we untrigger, yield a None frame, and then clear the buffer.

Alright - we’ve finished creating all the functions for the audio class we’ll use to stream to our DeepSpeech model and get real time Speech-to-Text transcription. Now it’s time to create a main function that we’ll run to actually do our streaming transcription.

First we’ll give our main function the location of our model and scorer. Then we’ll create a VADAudio object with aggressiveness, device, rate, and file passed in.

Using the vad_collector function we created earlier, we get the frames and set up our spinner/indicator. We use the DeepSpeech model we created from the model passed through the argument to create a stream. After initializing an empty byte array called wav_data , we go through each frame.

For each frame, if the frame is not None, we show a spinner spinning and then feed the audio content into our stream. If we’ve sent in the argument to save as a .wav file, then that file is also extended. If the frame is a None object, then we end the “utterance” and save the .wav file created, if we created one at all, and clear the byte array. Then we close the stream and open a new one.
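Putting the pieces together, the streaming loop can be sketched as follows; the model paths are assumptions from the earlier download step, the constructor call assumes the VADAudio class accepts the aggressiveness setting described above, and the optional .wav saving is omitted:

```python
import deepspeech
import numpy as np
from halo import Halo

def main():
    # Load the model and scorer downloaded earlier.
    model = deepspeech.Model('./models/deepspeech-0.9.3-models.pbmm')
    model.enableExternalScorer('./models/deepspeech-0.9.3-models.scorer')

    # Stream voiced frames from the microphone through the VAD class we built.
    vad_audio = VADAudio(aggressiveness=3, input_rate=16000)
    frames = vad_audio.vad_collector()

    spinner = Halo(spinner='line')
    stream_context = model.createStream()
    for frame in frames:
        if frame is not None:
            spinner.start()
            # Feed voiced audio into the open DeepSpeech stream.
            stream_context.feedAudioContent(np.frombuffer(frame, np.int16))
        else:
            # A None frame marks the end of an utterance: finish, print, restart.
            spinner.stop()
            print("Recognized:", stream_context.finishStream())
            stream_context = model.createStream()

if __name__ == '__main__':
    main()
```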

Just like with the asynchronous Speech-to-Text transcription, the real-time transcription is an awful lot of code to do real time Speech Recognition. If you don’t want to manage all this code, you can check out our guide on how to do real time Speech Recognition in Python in much less code using the AssemblyAI Speech-to-Text API.

This ends part one of our DeepSpeech overview and tutorial. In this tutorial, we went over how to do basic Speech Recognition on a .wav file, and how to do Speech Recognition in real time, with DeepSpeech. Part two will be about training your own models with DeepSpeech, and how accurately it performs. It will be coming soon - so be on the lookout for that!

For more information, follow us @assemblyai and @yujian_tang on Twitter, and subscribe to our newsletter.


Introduction to Speech Processing

8.4. Speaker Recognition and Verification #

8.4.1. Introduction to Speaker Recognition #

Speaker recognition is the task of identifying a speaker using their voice. Speaker recognition is classified into two parts: speaker identification and speaker verification. While speaker identification is the process of determining which voice in a group of known voices best matches the speaker, speaker verification is the task of accepting or rejecting the identity claim of a speaker by analyzing their acoustic samples. Speaker verification systems are computationally less complex than speaker identification systems, since they require a comparison against only one or two models, whereas speaker identification requires comparing one model against N speaker models.

Speaker verification methods are divided into text-dependent and text-independent methods. In text-dependent methods, the speaker verification system has prior knowledge about the text to be spoken and the user is expected to speak this text. However, in a text-independent system, the system has no prior knowledge about the text to be spoken and the user is not expected to be cooperative. Text-dependent systems achieve high speaker verification performance from relatively short utterances, while text-independent systems require long utterances to train reliable models and achieve good performance.

[Block diagram of a basic speaker verification system, showing the enrolment (training) phase and the verification (testing) phase.]

As it is shown in the above block diagram of a basic speaker verification system, a speaker verification system involves two main phases: the training phase in which the target speakers are enrolled and the testing phase in which a decision about the identity of the speaker is taken. From a training point of view, speaker models can be classified into generative and discriminative. Generative models such as Gaussian Mixture Model (GMM) estimate the feature distribution within each speaker. Discriminative models such as Support Vector Machine and Deep Neural Network (DNN), in contrast, model the boundary between speakers.

The performance of speaker verification systems is degraded by the variability in channels and sessions between enrolment and verification speech signals. Factors which affect channel/session variability include:

Channel mismatch between enrolment and verification speech signals such as using different microphones in enrolment and verification speech signals.

Environmental noise and reverberation conditions.

The differences in speaker voice such as ageing, health, speaking style and emotional state.

Transmission channel such as landline, mobile phone, microphone and voice over Internet protocol (VoIP).

8.4.2. Front-end Processing #

Several front-end processing steps are typically applied to the speech signal to extract the features used by the speaker verification system. Front-end processing mainly consists of voice activity detection (VAD), feature extraction, and channel compensation techniques:

Voice activity detection (VAD): The main goal of voice activity detection is to determine which segments of a signal are speech and which are non-speech. A robust VAD algorithm can improve the performance of a speaker verification system by making sure that speaker identity is calculated only from speech regions, so the choice of VAD algorithm matters when designing a robust speaker verification system. The three widely used approaches to VAD are energy based, model based, and hybrid.

Feature extraction techniques are used to transform the speech signal into acoustic feature vectors. The extracted acoustic features should carry the essential characteristics of the speech signal that identify the speaker by their voice. The aim of feature extraction is to reduce the dimension of the acoustic feature vectors by removing unwanted information and emphasizing the speaker-specific information. MFCCs are the most commonly used feature extraction technique in modern speaker verification.

Channel compensation techniques are used to reduce the effect of channel mismatch and environmental noise. Channel compensation can be applied at different stages of speaker verification, such as the feature and model domains. Various channel compensation techniques such as cepstral mean subtraction (CMS), feature warping, cepstral mean variance normalization (CMVN) and relative spectral (RASTA) processing have been used to reduce the effect of channel mismatch during the feature extraction phase. In the model domain, Joint Factor Analysis (JFA) and i-vectors are used to combat enrolment and verification mismatch.

8.4.3. Speaker Modeling Techniques #

One of the crucial issues in speaker diarization is the techniques employed for speaker modeling. Several modeling techniques have been used in speaker recognition and speaker diarization tasks. The state-of-the-art speaker modeling techniques in speaker diarization are the following:

8.4.3.1. Gaussian Mixture Modeling (GMM) - Universal Background Model (UBM) Approach #

A Gaussian Mixture Model (GMM) is a parametric probability density function represented as a weighted sum of Gaussian component densities. GMMs have been successfully used to model speech features in different speech processing applications. A Gaussian mixture model is a weighted sum of M component Gaussian densities, each of which is a multivariate Gaussian density. A GMM is represented by its mean vectors, covariance matrices and mixture weights.

The covariance matrices of a GMM,  \( \Sigma_i \) , can be full rank or constrained to be diagonal. The parameters of a GMM can also be shared, or tied, among the Gaussian components. The number of GMM components and type of covariance matrices are often determined based on the amount of data available for estimating GMM parameters.

In speaker recognition, a speaker can be modeled by a GMM from training data or using Maximum A Posteriori (MAP) adaptation. While the speaker model is built using the training utterances of a specific speaker in the GMM training, the model is also usually adapted from a large number of speakers called Universal Background Model in MAP adaptation.

Given a set of training vectors and a GMM configuration, there are several techniques available for estimating the parameters of a GMM. The most popular method is maximum likelihood (ML) estimation.

The ML estimation finds the model parameters that maximize the likelihood of the GMM given a set of data. Assuming independence between the training vectors \( X = \{x_1,\dots,x_N\} \), the GMM likelihood is typically written as:
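With \( w_i \) the mixture weights and \( p_i(\cdot) \) the component Gaussian densities, the standard expression is

\[ p(X|\lambda) = \prod_{t=1}^{N} p(x_t|\lambda) = \prod_{t=1}^{N} \sum_{i=1}^{M} w_i\, p_i(x_t) \]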

Since direct maximization of this likelihood is not possible in closed form, the ML parameters are obtained iteratively using the expectation-maximization (EM) algorithm. EM iteratively estimates new model parameters \(\bar{\lambda}\) from a given model \(\lambda\) such that \(p(X|\bar{\lambda}) \ge p(X|\lambda)\).


The parameters of a GMM can also be estimated using Maximum A Posteriori (MAP) estimation, in addition to the EM algorithm. The MAP estimation technique derives a speaker model by adapting from a universal background model (UBM). The "Expectation" step of EM and MAP is the same; MAP then combines the new sufficient statistics with the old statistics of the prior mixture parameters.

Given a prior model and training vectors from the desired class, \( X = \{x_1, \dots, x_T\} \), we first determine the probabilistic alignment of the training vectors with the prior mixture components. For mixture \( i \) in the prior model, \( Pr(i|x_t,\lambda_{UBM}) \) is computed as the share of mixture component \( i \) in the total likelihood:
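In the standard GMM-UBM formulation this posterior is

\[ Pr(i \mid x_t, \lambda_{UBM}) = \frac{w_i\, p_i(x_t)}{\sum_{j=1}^{M} w_j\, p_j(x_t)} \]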

Then, the sufficient statistics for the weight, mean and variance parameters are computed as follows:
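These are the usual count, first-moment and second-moment statistics:

\[ n_i = \sum_{t=1}^{T} Pr(i \mid x_t), \qquad E_i(x) = \frac{1}{n_i} \sum_{t=1}^{T} Pr(i \mid x_t)\, x_t, \qquad E_i(x^2) = \frac{1}{n_i} \sum_{t=1}^{T} Pr(i \mid x_t)\, x_t^2 \]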

Finally, the new sufficient statistics from the training data are used to update the prior sufficient statistics for mixture  \( i \) to create the adapted mixture weight, mean and variance for mixture \(i\) as follows:
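In the standard GMM-UBM adaptation these updates are

\[ \hat{w}_i = \left[ \alpha^w_i\, n_i / T + (1 - \alpha^w_i)\, w_i \right] \gamma \]
\[ \hat{\mu}_i = \alpha^m_i\, E_i(x) + (1 - \alpha^m_i)\, \mu_i \]
\[ \hat{\sigma}^2_i = \alpha^v_i\, E_i(x^2) + (1 - \alpha^v_i)\left( \sigma^2_i + \mu^2_i \right) - \hat{\mu}^2_i \]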

The adaptation coefficients controlling the balance between old and new estimates are  \( \{\alpha^w_i, \alpha^m_i, \alpha^v_i\} \) for the weights, means and variances, respectively. The scale factor, \( \gamma \) , is computed over all adapted mixture weights to ensure they sum to unity.

8.4.3.2. i-Vectors #

Different approaches have been developed recently to improve the performance of speaker recognition systems. The most popular ones were based on the GMM-UBM. Joint Factor Analysis (JFA) then built on the success of the GMM-UBM approach. JFA modeling defines two distinct spaces: the speaker space, defined by the eigenvoice matrix, and the channel space, represented by the eigen-channel matrix. However, the channel factors estimated using JFA, which are supposed to model only channel effects, also contain information about speakers. A new speaker verification approach was therefore proposed that uses factor analysis as a feature extractor defining only a single space, instead of two separate spaces. In this new space, a given speech recording is represented by a new vector, called the total factors, as it contains the speaker and channel variabilities simultaneously. Speaker recognition based on the i-vector framework is currently the state-of-the-art in the field.

Given an utterance, the speaker and channel dependent GMM supervector is defined as follows:
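In the usual total-variability notation,

\[ M = m + Tw \]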

where \( m \) is a speaker- and channel-independent supervector, \( T \) is a rectangular matrix of low rank and \( w \) is a random vector with a standard normal distribution \( N(0,1) \). The components of the vector \( w \) are the total factors, and these new vectors are called i-vectors. \( M \) is assumed to be normally distributed with mean vector \( m \) and covariance matrix \( TT^t \).

The total factor is a hidden variable, which can be defined by its posterior distribution conditioned to the Baum–Welch statistics for a given utterance. This posterior distribution is a Gaussian distribution and the mean of this distribution corresponds exactly to i-vector. The Baum–Welch statistics are extracted using the UBM.

Given a sequence of \( L \) frames \( \{y_1, y_2, \dots, y_L\} \) and a UBM \( \Omega \) composed of \( C \) mixture components defined in a feature space of dimension \( F \), the Baum-Welch statistics needed to estimate the i-vector for a given speech utterance \( u \) are given by:
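In their usual form, these are the zeroth-order and centered first-order statistics

\[ N_c = \sum_{t=1}^{L} P(c \mid y_t, \Omega), \qquad F_c = \sum_{t=1}^{L} P(c \mid y_t, \Omega)\, (y_t - m_c) \]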

where  \( m_c \) is the mean of UBM mixture component \( c \) . The i-vector for a given utterance can be obtained using the following equation:
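In the standard i-vector formulation this is

\[ w(u) = \left( I + T^t\, \Sigma^{-1}\, N_u\, T \right)^{-1} T^t\, \Sigma^{-1}\, \hat{F}(u) \]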

where \( N_u \) is a diagonal matrix of dimension \( CF \times CF \) whose diagonal blocks are \( N_c I \ (c = 1, \dots, C) \). The supervector obtained by concatenating all first-order Baum-Welch statistics \( F_c \) for a given utterance \( u \) is represented by \( \hat{F}(u) \), which has dimension \( CF \times 1 \). The diagonal covariance matrix \( \Sigma \), with dimension \( CF \times CF \), estimated during factor analysis training, models the residual variability not captured by the total variability matrix \( T \).


One of the most widely used feature normalization techniques for i-vectors is length normalization. Length normalization ensures that the distribution of i-vectors matches the Gaussian normal distribution and makes the distributions of i-vectors more similar. Performing whitening before length normalization improves the performance of speaker verification systems. i-vector normalization improves the Gaussianity of the i-vectors and reduces the gap between the underlying assumptions of the model and the real distribution of the data. It also reduces the dataset shift between development and test i-vectors.
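A common way to write the combined whitening and length-normalization step is

\[ w \leftarrow \frac{\Sigma^{-1/2}\,(w - \mu)}{\left\| \Sigma^{-1/2}\,(w - \mu) \right\|} \]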

where \( \mu \) and \( \Sigma \) are the mean and the covariance matrix of a training corpus, respectively. The data is standardized according to the covariance matrix \( \Sigma \) and length-normalized (i.e., the i-vectors are confined to the hypersphere of unit radius).

The two most widely used intersession compensation techniques for i-vectors are Within-Class Covariance Normalization (WCCN) and Linear Discriminant Analysis (LDA). WCCN uses the within-class covariance matrix to normalize the cosine kernel functions in order to compensate for intersession variability. LDA attempts to define a reduced set of axes that minimize the within-speaker variability caused by channel effects while maximizing the between-speaker variability.

8.4.3.2.1. Cosine Distance #

Once the i-vectors are extracted from the outputs of speech clusters, cosine distance scoring tests the hypothesis of whether two i-vectors belong to the same speaker or to different speakers. Given two i-vectors, the cosine distance between them is calculated as follows:
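That is, the score is the normalized inner product, compared against a decision threshold:

\[ cos(w_i, w_j) = \frac{w_i^t\, w_j}{\| w_i \|\, \| w_j \|} \;\gtrless\; \theta \]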

where  \( \theta \) is the threshold value, and  \( cos(w_i,w_j) \) is the cosine distance score between clusters  \( i \) and \( j \) . The corresponding i-vectors extracted for clusters  \( i \) and  \( j \) are represented by  \( w_i \) and \( w_j \) , respectively.

The cosine distance scoring considers only the angle between two i-vectors, not their magnitude. Since the non-speaker information such as session and channel variabilities affect the i-vector magnitude, removing the magnitudes can increase the robustness of i-vector systems.

8.4.3.3. Probabilistic Linear Discriminant Analysis #

The i-vector representation followed by the probabilistic linear discriminant analysis (PLDA) modeling technique is the state-of-the-art in speaker verification systems. PLDA has been successfully applied in speaker recognition experiments. It is also applied to handle speaker and session variability in the speaker verification task. It has also been applied successfully in speaker clustering, since it can separate the speaker-specific and noise-specific parts of an audio signal, which is essential for speaker diarization.


In PLDA, assuming that the training data consists of \( J \) i-vectors from each of \( I \) speakers, the \( j \)'th i-vector of the \( i \)'th speaker is modeled as:
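In the standard PLDA formulation this decomposition is

\[ w_{ij} = \mu + F h_i + G y_{ij} + \epsilon_{ij} \]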

where \( \mu \) is the overall speaker- and segment-independent mean of the i-vectors in the training dataset, the columns of the matrix \( F \) define the between-speaker variability subspace, and the columns of the matrix \( G \) define the basis for the within-speaker variability subspace. The residual \( \epsilon_{ij} \) represents any remaining unexplained data variation. The components of the vector \( h_i \) are the eigenvoice factor loadings and the components of the vector \( y_{ij} \) are the eigen-channel factor loadings. The term \( Fh_i \) depends only on the identity of the speaker, not on the particular segment.

Although the PLDA model assumes Gaussian behavior, there is empirical evidence that channel and speaker effects result in i-vectors that are non-Gaussian. It has been reported that using a Student's t-distribution in place of the assumed Gaussian distributions in the PLDA model improves performance. Since this modeling is complicated, a non-linear transformation of i-vectors called radial Gaussianization has been proposed instead: it whitens the i-vectors and performs length normalization, which restores the Gaussian assumptions of the PLDA model.

A variant of the PLDA model called Gaussian PLDA (GPLDA) has been shown to provide better results. Because of its low computational requirements and its performance, it is the most widely used PLDA variant. In the GPLDA model, the within-speaker variability is modeled by a full-covariance residual term, which allows us to omit the channel subspace. The generative GPLDA model for an i-vector is represented by:
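In its usual form,

\[ w_{ij} = \mu + F h_i + \epsilon_{ij} \]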

The residual term \( \epsilon_{ij} \), representing the within-speaker variability, is assumed to have a normal distribution with zero mean and full covariance matrix \( \Sigma \).

Given two i-vectors \( w_1 \) and \( w_2 \), PLDA computes the likelihood ratio of the two i-vectors as follows:
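In the usual form this is the log-likelihood ratio

\[ score(w_1, w_2) = \ln \frac{p(w_1, w_2 \mid H_1)}{p(w_1 \mid H_0)\, p(w_2 \mid H_0)} \]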

where the hypothesis  \( H_1 \) indicates that both i-vectors belong to the same speaker and  \( H_0 \) indicates they belong to two different speakers.

8.4.3.4. Deep Learning (DL) #

The recent advances in computing hardware, new DL architectures and training methods, and access to large amounts of training data have inspired the research community to make use of DL technology in speaker recognition systems again. DL techniques can be used in the frontend and/or backend of a speaker recognition system. The whole end-to-end recognition process can even be performed by a DL architecture.

Deep Learning Frontends: The traditional i-vector approach consists of mainly three stages: Baum-Welch statistics collection, i-vector extraction, and the PLDA backend. Recently, it has been shown that if the Baum-Welch statistics are computed with respect to a DNN rather than a GMM, or if bottleneck features are used in addition to conventional spectral features, a substantial improvement can be achieved. Another possible use of DL in the frontend is to represent the speaker characteristics of a speech signal with a single low-dimensional vector using a DL architecture, rather than the traditional i-vector algorithm. These vectors are often referred to as speaker embeddings. Typically, the inputs of the neural network are a sequence of feature vectors and the outputs are speaker classes.

Deep Learning Backends: One of the most effective backend techniques for i-vectors is PLDA which performs the scoring along with the session variability compensation. Usually, a large number of different speakers with several speech samples each are necessary for PLDA to work efficiently. Access to the speaker labeled data is costly and in some cases almost impossible. Moreover, the amount of the performance gain, in terms of accuracy, for short utterances is not as much as that for long utterances. These facts motivated the research community to look for DL based alternative backends. Several techniques have been proposed. Most of these approaches use the speaker labels of the background data for training, as in PLDA, and mostly with no significant gain compared to PLDA.

Deep Learning End-to-Ends: It is also interesting to train an end-to-end recognition system capable of doing multiple stages of signal processing with a unified DL architecture. The neural network will be responsible for the whole process from the feature extraction to the final similarity scores. However, working directly on the audio signals in the time domain is still computationally too expensive and, therefore, the current end-to-end DL systems take mainly the handcrafted feature vectors, e.g., MFCCs, as inputs. Recently, there have been several attempts to build an end-to-end speaker recognition system using DL though most of them focus on text-dependent speaker recognition.

8.4.4. Applications of Speaker Recognition #

Transaction authentication – Toll fraud prevention, telephone credit card purchases, telephone brokerage (e.g., stock trading)

Access control – Physical facilities, computers and data networks

Monitoring – Remote time and attendance logging, home parole verification, prison telephone usage

Information retrieval – Customer information for call centers, audio indexing (speech skimming device), speaker diarization

Forensics – Voice sample matching

8.4.5. Performance Evaluations #

The performance of speaker verification is measured in terms of errors. The types of errors and the evaluation metrics commonly used in speaker verification systems are the following.

8.4.6. Types of errors #

False acceptance: A false acceptance occurs when the speech segments from an imposter speaker are falsely accepted as a target speaker by the system.

False rejection: A false rejection occurs when the target speaker is rejected by the verification systems.

8.4.7. Performance metrics #

The performance of speaker verification systems can be measured using the equal error rate (EER) and the minimum decision cost function (mDCF). These measures represent different performance characteristics of a system, though the accuracy of the measurements depends on the number of trials evaluated, which must be large enough to robustly compute the relevant statistics. Speaker verification performance can also be represented graphically using the detection error trade-off (DET) plot. The EER is obtained at the operating point where the false acceptance rate and false rejection rate have the same value; a lower EER means better performance, because the sum of the false acceptance and false rejection errors at that point decreases. The decision cost function (DCF) is defined by assigning a cost to each type of error and taking into account the prior probabilities of target and impostor trials. The decision cost function is defined as:
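In its usual form, with the priors written as \( P_{target} \) and \( P_{impostor} \),

\[ DCF = C_{miss}\, P_{miss}\, P_{target} + C_{fa}\, P_{fa}\, P_{impostor} \]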

where \( C_{miss} \) and \( C_{fa} \) are the costs of a missed detection and a false alarm, respectively. The prior probabilities of target and impostor trials are given by \( P_{target} \) and \( P_{impostor} \), respectively. The percentages of missed targets and falsely accepted impostor trials are represented by \( P_{miss} \) and \( P_{fa} \), respectively. The mDCF is used to evaluate speaker verification by selecting the minimum value of the DCF as the decision threshold is varied.
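Written with the dependence on the decision threshold \( \theta \) made explicit,

\[ mDCF = \min_{\theta} \left[ C_{miss}\, P_{miss}(\theta)\, P_{target} + C_{fa}\, P_{fa}(\theta)\, P_{impostor} \right] \]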

where  \( P_{miss} \) and  \( P_{fa} \) are the miss and false alarm rates recorded from the trials, and the other parameters are adjusted to suit the evaluation of application-specific requirements.

8.4.8. See also #

Forensic Speaker Recognition


Introducing Whisper


We’ve trained and are open-sourcing a neural net called Whisper that approaches human level robustness and accuracy on English speech recognition.


Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. We show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. We are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing.

The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.

Other existing approaches frequently use smaller, more closely paired audio-text training datasets, [^reference-1] [^reference-2] [^reference-3] or use broad but unsupervised audio pretraining. [^reference-4] [^reference-5] [^reference-6]  Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper’s zero-shot performance across many diverse datasets we find it is much more robust and makes 50% fewer errors than those models.

About a third of Whisper’s audio dataset is non-English, and it is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech to text translation and outperforms the supervised SOTA on CoVoST2 to English translation zero-shot.

We hope Whisper's high accuracy and ease of use will allow developers to add voice interfaces to a much wider set of applications. Check out the paper, model card, and code to learn more details and to try out Whisper.
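As a quick illustration, transcription with the open-sourced package takes only a few lines; the "base" model size and the file name here are just example choices:

```python
import whisper

# Load one of the released checkpoints; "base" is a small, fast option.
model = whisper.load_model("base")

# Transcribe a local audio file; Whisper resamples and chunks the audio internally.
result = model.transcribe("audio.mp3")
print(result["text"])
```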

Cloud Speech-to-Text API v1 (revision 119)


Class RecognitionConfig

  • java.lang.Object
  • java.util.AbstractMap<java.lang.String,java.lang.Object>
  • com.google.api.client.util.GenericData
  • com.google.api.client.json.GenericJson
  • com.google.api.services.speech.v1.model.RecognitionConfig

This is the Java data model class that specifies how to parse/serialize into the JSON that is transmitted over HTTP when working with the Cloud Speech-to-Text API. For a detailed explanation see: https://developers.google.com/api-client-library/java/google-http-java-client/json

Method Summary

In addition to the members inherited from GenericJson, GenericData, AbstractMap, and Map, RecognitionConfig exposes getter/setter pairs for its fields: getAudioChannelCount/setAudioChannelCount, getDiarizationConfig/setDiarizationConfig, getEnableAutomaticPunctuation/setEnableAutomaticPunctuation, getEnableSeparateRecognitionPerChannel/setEnableSeparateRecognitionPerChannel, getEnableWordTimeOffsets/setEnableWordTimeOffsets, getEncoding/setEncoding, getLanguageCode/setLanguageCode, getMaxAlternatives/setMaxAlternatives, getMetadata/setMetadata, getProfanityFilter/setProfanityFilter, getSampleRateHertz/setSampleRateHertz, getSpeechContexts/setSpeechContexts, and getUseEnhanced/setUseEnhanced.
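For orientation, the JSON that this class serializes to looks roughly like the following sketch; the field names mirror the accessors above, while the specific values (LINEAR16 encoding, 16 kHz, "en-US") are illustrative assumptions:

```python
# A sketch of the JSON payload modeled by RecognitionConfig.
recognition_config = {
    "encoding": "LINEAR16",            # raw 16-bit PCM
    "sampleRateHertz": 16000,          # must match the audio's actual sample rate
    "languageCode": "en-US",
    "audioChannelCount": 1,
    "enableAutomaticPunctuation": True,
    "enableWordTimeOffsets": False,
    "maxAlternatives": 1,
}
```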



Speech Command Classification with torchaudio

This tutorial will show you how to correctly format an audio dataset and then train/test an audio classifier network on the dataset.

Colab has a GPU option available. In the menu tabs, select “Runtime”, then “Change runtime type”. In the pop-up that follows, you can choose GPU. After the change, your runtime should automatically restart (which means information from executed cells disappears).

First, let’s import the common torch packages such as torchaudio that can be installed by following the instructions on the website.

Let’s check if a CUDA GPU is available and select our device. Running the network on a GPU will greatly decrease the training/testing runtime.
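Neither of those code cells survived this page's extraction; a minimal sketch of what they contain (assuming torch and torchaudio are installed):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchaudio

# Use a CUDA GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
```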

Importing the Dataset

We use torchaudio to download and represent the dataset. Here we use SpeechCommands, which is a dataset of 35 commands spoken by different people. The dataset SPEECHCOMMANDS is a torch.utils.data.Dataset version of the dataset. In this dataset, all audio files are about 1 second long (and so about 16000 time frames long).

The actual loading and formatting steps happen when a data point is being accessed, and torchaudio takes care of converting the audio files to tensors. If one wants to load an audio file directly instead, torchaudio.load() can be used. It returns a tuple containing the newly created tensor along with the sampling frequency of the audio file (16kHz for SpeechCommands).

Going back to the dataset, here we create a subclass that splits it into standard training, validation, testing subsets.

A data point in the SPEECHCOMMANDS dataset is a tuple made of a waveform (the audio signal), the sample rate, the utterance (label), the ID of the speaker, the number of the utterance.
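The subclass itself was stripped from this page; a sketch along the lines of the official tutorial, which filters the dataset using the validation_list.txt and testing_list.txt files shipped with SpeechCommands:

```python
import os
from torchaudio.datasets import SPEECHCOMMANDS


class SubsetSC(SPEECHCOMMANDS):
    """SpeechCommands restricted to the standard training/validation/testing split."""

    def __init__(self, subset: str):
        super().__init__("./", download=True)

        def load_list(filename):
            filepath = os.path.join(self._path, filename)
            with open(filepath) as fileobj:
                return [os.path.normpath(os.path.join(self._path, line.strip())) for line in fileobj]

        if subset == "validation":
            self._walker = load_list("validation_list.txt")
        elif subset == "testing":
            self._walker = load_list("testing_list.txt")
        elif subset == "training":
            excludes = set(load_list("validation_list.txt") + load_list("testing_list.txt"))
            self._walker = [w for w in self._walker if w not in excludes]


train_set = SubsetSC("training")
test_set = SubsetSC("testing")

# Unpack one data point: waveform, sample rate, label, speaker ID, utterance number.
waveform, sample_rate, label, speaker_id, utterance_number = train_set[0]
```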


Let’s find the list of labels available in the dataset.

The 35 audio labels are commands that are said by users. The first few files are people saying “marvin”.

The last file is someone saying “visual”.
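The listing code is not shown here; a one-liner in the spirit of the tutorial, which simply collects the unique utterance labels:

```python
labels = sorted(set(datapoint[2] for datapoint in train_set))
print(len(labels), labels[:5])  # 35 commands in total
```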

Formatting the Data

This is a good place to apply transformations to the data. For the waveform, we downsample the audio for faster processing without losing too much of the classification power.

We don’t need to apply other transformations here. It is common for some datasets though to have to reduce the number of channels (say from stereo to mono) by either taking the mean along the channel dimension, or simply keeping only one of the channels. Since SpeechCommands uses a single channel for audio, this is not needed here.

We are encoding each word using its index in the list of labels.

To turn a list of data points made of audio recordings and utterances into two batched tensors for the model, we implement a collate function, which is used by the PyTorch DataLoader to iterate over the dataset in batches. Please see the documentation for more information about working with a collate function.

In the collate function, we also apply the resampling and the text encoding.
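The corresponding snippets did not survive extraction; a condensed sketch of the resampling transform, the label encoding, and the collate function (variable names are mine, and in this sketch the resampling is applied per batch in the training loop further below rather than inside the collate function):

```python
new_sample_rate = 8000
# Downsample from the original 16 kHz to 8 kHz for faster processing.
transform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=new_sample_rate)
transformed = transform(waveform)


def label_to_index(word):
    # Encode each word as its position in the list of labels.
    return torch.tensor(labels.index(word))


def pad_sequence(batch):
    # Pad every clip in the batch to the length of the longest one.
    batch = [item.t() for item in batch]
    batch = torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0.0)
    return batch.permute(0, 2, 1)


def collate_fn(batch):
    tensors, targets = [], []
    for waveform, _, label, *_ in batch:
        tensors.append(waveform)
        targets.append(label_to_index(label))
    return pad_sequence(tensors), torch.stack(targets)


train_loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True, collate_fn=collate_fn)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=256, shuffle=False, collate_fn=collate_fn)
```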

Define the Network

For this tutorial we will use a convolutional neural network to process the raw audio data. Usually more advanced transforms are applied to the audio data, however CNNs can be used to accurately process the raw data. The specific architecture is modeled after the M5 network architecture described in this paper . An important aspect of models processing raw audio data is the receptive field of their first layer’s filters. Our model’s first filter is length 80 so when processing audio sampled at 8kHz the receptive field is around 10ms (and at 4kHz, around 20 ms). This size is similar to speech processing applications that often use receptive fields ranging from 20ms to 40ms.

We will use the same optimization technique used in the paper, an Adam optimizer with weight decay set to 0.0001. At first, we will train with a learning rate of 0.01, but we will use a scheduler to decrease it to 0.001 during training after 20 epochs.
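Neither the network definition nor the optimizer setup is shown above; a sketch of an M5-style model and the optimizer/scheduler described in the text (layer sizes follow the M5 paper, but treat the exact code as illustrative):

```python
class M5(nn.Module):
    def __init__(self, n_input=1, n_output=35, stride=16, n_channel=32):
        super().__init__()
        # First filter is length 80: roughly 10 ms of context on 8 kHz audio.
        self.conv1 = nn.Conv1d(n_input, n_channel, kernel_size=80, stride=stride)
        self.bn1 = nn.BatchNorm1d(n_channel)
        self.pool1 = nn.MaxPool1d(4)
        self.conv2 = nn.Conv1d(n_channel, n_channel, kernel_size=3)
        self.bn2 = nn.BatchNorm1d(n_channel)
        self.pool2 = nn.MaxPool1d(4)
        self.conv3 = nn.Conv1d(n_channel, 2 * n_channel, kernel_size=3)
        self.bn3 = nn.BatchNorm1d(2 * n_channel)
        self.pool3 = nn.MaxPool1d(4)
        self.conv4 = nn.Conv1d(2 * n_channel, 2 * n_channel, kernel_size=3)
        self.bn4 = nn.BatchNorm1d(2 * n_channel)
        self.pool4 = nn.MaxPool1d(4)
        self.fc1 = nn.Linear(2 * n_channel, n_output)

    def forward(self, x):
        x = self.pool1(F.relu(self.bn1(self.conv1(x))))
        x = self.pool2(F.relu(self.bn2(self.conv2(x))))
        x = self.pool3(F.relu(self.bn3(self.conv3(x))))
        x = self.pool4(F.relu(self.bn4(self.conv4(x))))
        x = F.avg_pool1d(x, x.shape[-1]).squeeze(-1)   # global average pooling over time
        return F.log_softmax(self.fc1(x), dim=-1)


model = M5(n_input=transformed.shape[0], n_output=len(labels)).to(device)

optimizer = optim.Adam(model.parameters(), lr=0.01, weight_decay=0.0001)
# Drop the learning rate from 0.01 to 0.001 after 20 epochs.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
```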

Training and Testing the Network

Now let’s define a training function that will feed our training data into the model and perform the backward pass and optimization steps. For training, the loss we will use is the negative log-likelihood. The network will then be tested after each epoch to see how the accuracy varies during the training.

Now that we have a training function, we need to make one for testing the network's accuracy. We will set the model to eval() mode and then run inference on the test dataset. Calling eval() sets the training variable in all modules in the network to false. Certain layers like batch normalization and dropout layers behave differently during training, so this step is crucial for getting correct results.

Finally, we can train and test the network. We will train the network for ten epochs, then reduce the learning rate and train for ten more epochs. The network will be tested after each epoch to see how the accuracy varies during the training.
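A compact sketch of those loops (negative log-likelihood loss, evaluation after every epoch; helper names are mine):

```python
transform = transform.to(device)  # resample each batch on the same device as the model


def train_one_epoch(epoch):
    model.train()
    for data, target in train_loader:
        data, target = data.to(device), target.to(device)
        output = model(transform(data))
        loss = F.nll_loss(output, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


def test_one_epoch(epoch):
    model.eval()
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            pred = model(transform(data)).argmax(dim=-1)
            correct += (pred == target).sum().item()
    print(f"Epoch {epoch}: test accuracy {correct / len(test_loader.dataset):.2%}")


for epoch in range(1, 22):       # 21 epochs; the scheduler lowers the LR after epoch 20
    train_one_epoch(epoch)
    test_one_epoch(epoch)
    scheduler.step()
```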

The network should be more than 65% accurate on the test set after 2 epochs, and 85% after 21 epochs. Let’s look at the last words in the train set, and see how the model did on them.

Let’s find an example that isn’t classified correctly, if there is one.

Feel free to try with one of your own recordings of one of the labels! For example, using Colab, say “Go” while executing the cell below. This will record one second of audio and try to classify it.

Conclusion

In this tutorial, we used torchaudio to load a dataset and resample the signal. We have then defined a neural network that we trained to recognize a given command. There are also other data preprocessing methods, such as finding the mel frequency cepstral coefficients (MFCC), that can reduce the size of the dataset. This transform is also available in torchaudio as torchaudio.transforms.MFCC .


Pre-trained models for automatic speech recognition


In this section, we’ll cover how to use the pipeline() to leverage pre-trained models for speech recognition. In Unit 2 , we introduced the pipeline() as an easy way of running speech recognition tasks, with all pre- and post-processing handled under-the-hood and the flexibility to quickly experiment with any pre-trained checkpoint on the Hugging Face Hub. In this Unit, we’ll go a level deeper and explore the different attributes of speech recognition models and how we can use them to tackle a range of different tasks.

As detailed in Unit 3, speech recognition models broadly fall into one of two categories:

  • Connectionist Temporal Classification (CTC): encoder-only models with a linear classification (CTC) head on top
  • Sequence-to-sequence (Seq2Seq): encoder-decoder models, with a cross-attention mechanism between the encoder and decoder

Prior to 2022, CTC was the more popular of the two architectures, with encoder-only models such as Wav2Vec2, HuBERT and XLSR achieving breakthroughs in the pre-training / fine-tuning paradigm for speech. Big corporations, such as Meta and Microsoft, pre-trained the encoder on vast amounts of unlabelled audio data for many days or weeks. Users could then take a pre-trained checkpoint, and fine-tune it with a CTC head on as little as 10 minutes of labelled speech data to achieve strong performance on a downstream speech recognition task.

However, CTC models have their shortcomings. Appending a simple linear layer to an encoder gives a small, fast overall model, but one that can be prone to phonetic spelling errors. We’ll demonstrate this for the Wav2Vec2 model below.

Probing CTC Models

Let’s load a small excerpt of the LibriSpeech ASR dataset to demonstrate Wav2Vec2’s speech transcription capabilities:

We can pick one of the 73 audio samples and inspect the audio sample as well as the transcription:
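The loading and inspection code is omitted on this page; a sketch assuming the small LibriSpeech validation excerpt used throughout the course (the sample index is just for illustration):

```python
from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

sample = dataset[2]
print(sample["text"])                       # reference transcription
print(sample["audio"]["sampling_rate"])     # 16000 – LibriSpeech audio is sampled at 16 kHz
```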

Alright! Christmas and roast beef, sounds great! 🎄 Having chosen a data sample, we now load a fine-tuned checkpoint into the pipeline() . For this, we’ll use the official Wav2Vec2 base checkpoint fine-tuned on 100 hours of LibriSpeech data:

Next, we’ll take an example from the dataset and pass its raw data to the pipeline. Since the pipeline consumes any dictionary that we pass it (meaning it cannot be re-used), we’ll pass a copy of the data. This way, we can safely re-use the same audio sample in the following examples:
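A sketch of that pipeline call, assuming the facebook/wav2vec2-base-100h checkpoint (the Wav2Vec2 base model fine-tuned on 100 hours of LibriSpeech):

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-100h")

# Pass a copy so the original sample dictionary stays intact for later examples.
output = pipe(sample["audio"].copy())
print(output["text"])
```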

We can see that the Wav2Vec2 model does a pretty good job at transcribing this sample - at first glance it looks generally correct. Let’s put the target and prediction side-by-side and highlight the differences:

Comparing the target text to the predicted transcription, we can see that all words sound correct, but some are not spelled accurately. For example:

  • CHRISTMAUS vs. CHRISTMAS
  • ROSE vs. ROAST
  • SIMALYIS vs. SIMILES

This highlights the shortcoming of a CTC model. A CTC model is essentially an ‘acoustic-only’ model: it consists of an encoder which forms hidden-state representations from the audio inputs, and a linear layer which maps the hidden-states to characters:

This means that the system almost entirely bases its prediction on the acoustic input it was given (the phonetic sounds of the audio), and so has a tendency to transcribe the audio in a phonetic way (e.g. CHRISTMAUS ). It gives less importance to the language modelling context of previous and successive letters, and so is prone to phonetic spelling errors. A more intelligent model would identify that CHRISTMAUS is not a valid word in the English vocabulary, and correct it to CHRISTMAS when making its predictions. We’re also missing two big features in our prediction - casing and punctuation - which limits the usefulness of the model’s transcriptions to real-world applications.

Graduation to Seq2Seq

Cue Seq2Seq models! As outlined in Unit 3, Seq2Seq models are formed of an encoder and decoder linked via a cross-attention mechanism. The encoder plays the same role as before, computing hidden-state representations of the audio inputs, while the decoder plays the role of a language model . The decoder processes the entire sequence of hidden-state representations from the encoder and generates the corresponding text transcriptions. With global context of the audio input, the decoder is able to use language modelling context as it makes its predictions, correcting for spelling mistakes on-the-fly and thus circumventing the issue of phonetic predictions.

There are two downsides to Seq2Seq models:

  • They are inherently slower at decoding, since the decoding process happens one step at a time, rather than all at once
  • They are more data hungry, requiring significantly more training data to reach convergence

In particular, the need for large amounts of training data has been a bottleneck in the advancement of Seq2Seq architectures for speech. Labelled speech data is difficult to come by, with the largest annotated datasets at the time clocking in at just 10,000 hours. This all changed in 2022 upon the release of Whisper . Whisper is a pre-trained model for speech recognition published in September 2022 by the authors Alec Radford et al. from OpenAI. Unlike its CTC predecessors, which were pre-trained entirely on un-labelled audio data, Whisper is pre-trained on a vast quantity of labelled audio-transcription data, 680,000 hours to be precise.

This is an order of magnitude more data than the un-labelled audio data used to train Wav2Vec 2.0 (60,000 hours). What is more, 117,000 hours of this pre-training data is multilingual (or “non-English”) data. This results in checkpoints that can be applied to over 96 languages, many of which are considered low-resource , meaning the language lacks a large corpus of data suitable for training.

When scaled to 680,000 hours of labelled pre-training data, Whisper models demonstrate a strong ability to generalise to many datasets and domains. The pre-trained checkpoints achieve results competitive with state-of-the-art systems, with near 3% word error rate (WER) on the test-clean subset of LibriSpeech and a new state-of-the-art on TED-LIUM with 4.7% WER ( c.f. Table 8 of the Whisper paper ).

Of particular importance is Whisper’s ability to handle long-form audio samples, its robustness to input noise and ability to predict cased and punctuated transcriptions. This makes it a viable candidate for real-world speech recognition systems.

The remainder of this section will show you how to use the pre-trained Whisper models for speech recognition using 🤗 Transformers. In many situations, the pre-trained Whisper checkpoints are extremely performant and give great results, thus we encourage you to try using the pre-trained checkpoints as a first step to solving any speech recognition problem. Through fine-tuning, the pre-trained checkpoints can be adapted for specific datasets and languages to further improve upon these results. We’ll demonstrate how to do this in the upcoming subsection on fine-tuning .

The Whisper checkpoints come in five configurations of varying model sizes. The smallest four are trained on either English-only or multilingual data. The largest checkpoint is multilingual only. All nine of the pre-trained checkpoints are available on the Hugging Face Hub . The checkpoints are summarised in the following table with links to the models on the Hub. “VRAM” denotes the required GPU memory to run the model with the minimum batch size of 1. “Rel Speed” is the relative speed of a checkpoint compared to the largest model. Based on this information, you can select a checkpoint that is best suited to your hardware.

Let’s load the Whisper Base checkpoint, which is of comparable size to the Wav2Vec2 checkpoint we used previously. Preempting our move to multilingual speech recognition, we’ll load the multilingual variant of the base checkpoint. We’ll also load the model on the GPU if available, or CPU otherwise. The pipeline() will subsequently take care of moving all inputs / outputs from the CPU to the GPU as required:

Great! Now let’s transcribe the audio as before. The only change we make is passing an extra argument, max_new_tokens , which tells the model the maximum number of tokens to generate when making its prediction:
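Putting those two paragraphs together, a sketch of loading the multilingual base checkpoint and transcribing the same sample:

```python
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-base", device=device)

output = pipe(sample["audio"].copy(), max_new_tokens=256)
print(output["text"])
```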

Easy enough! The first thing you’ll notice is the presence of both casing and punctuation. Immediately this makes the transcription easier to read compared to the un-cased and un-punctuated transcription from Wav2Vec2. Let’s put the transcription side-by-side with the target:

Whisper has done a great job at correcting the phonetic errors we saw from Wav2Vec2 - both Christmas and roast are spelled correctly. We see that the model still struggles with SIMILES , being incorrectly transcribed as similarly , but this time the prediction is a valid word from the English vocabulary. Using a larger Whisper checkpoint can help further reduce transcription errors, at the expense of requiring more compute and a longer transcription time.

We’ve been promised a model that can handle 96 languages, so let’s leave English speech recognition for now and go global 🌎! The Multilingual LibriSpeech (MLS) dataset is the multilingual equivalent of the LibriSpeech dataset, with labelled audio data in six languages. We’ll load one sample from the Spanish split of the MLS dataset, making use of streaming mode so that we don’t have to download the entire dataset:

Again, we’ll inspect the text transcription and take a listen to the audio segment:

This is the target text that we’re aiming for with our Whisper transcription. Although we now know that we can probably do better than this, since our model is also going to predict punctuation and casing, neither of which are present in the reference. Let’s forward the audio sample to the pipeline to get our text prediction. One thing to note is that the pipeline consumes the dictionary of audio inputs that we input, meaning the dictionary can’t be re-used. To circumvent this, we’ll pass a copy of the audio sample, so that we can re-use the same audio sample in the following code examples:

Great - this looks very similar to our reference text (arguably better since it has punctuation and casing!). You’ll notice that we forwarded the "task" as a generate key-word argument (generate kwarg). Setting the "task" to "transcribe" forces Whisper to perform the task of speech recognition , where the audio is transcribed in the same language that the speech was spoken in. Whisper is also capable of performing the closely related task of speech translation , where the audio in Spanish can be translated to text in English. To achieve this, we set the "task" to "translate" :
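A sketch of the Spanish example and the two task settings (the MLS dataset identifier on the Hub is assumed to be facebook/multilingual_librispeech):

```python
from datasets import load_dataset

dataset = load_dataset(
    "facebook/multilingual_librispeech", "spanish", split="validation", streaming=True
)
sample = next(iter(dataset))

# Speech recognition: Spanish audio -> Spanish text.
print(pipe(sample["audio"].copy(), max_new_tokens=256, generate_kwargs={"task": "transcribe"}))

# Speech translation: Spanish audio -> English text.
print(pipe(sample["audio"].copy(), max_new_tokens=256, generate_kwargs={"task": "translate"}))
```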

Now that we know we can toggle between speech recognition and speech translation, we can pick our task depending on our needs. Either we recognise from audio in language X to text in the same language X (e.g. Spanish audio to Spanish text), or we translate from audio in any language X to text in English (e.g. Spanish audio to English text).

To read more about how the "task" argument is used to control the properties of the generated text, refer to the model card for the Whisper base model.

Long-Form Transcription and Timestamps

So far, we’ve focussed on transcribing short audio samples of less than 30 seconds. We mentioned that one of the appeals of Whisper was its ability to work on long audio samples. We’ll tackle this task here!

Let’s create a long audio file by concatenating sequential samples from the MLS dataset. Since the MLS dataset is curated by splitting long audiobook recordings into shorter segments, concatenating samples is one way of reconstructing longer audiobook passages. Consequently, the resulting audio should be coherent across the entire sample.

We’ll set our target audio length to 5 minutes, and stop concatenating samples once we hit this value:
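A sketch of that concatenation loop (the 5-minute target and the general shape mirror the description above):

```python
import numpy as np

target_length_in_m = 5
sampling_rate = pipe.feature_extractor.sampling_rate            # 16 kHz for Whisper
target_length_in_samples = target_length_in_m * 60 * sampling_rate

long_audio = []
for sample in dataset:
    long_audio.extend(sample["audio"]["array"])
    if len(long_audio) > target_length_in_samples:
        break

long_audio = np.asarray(long_audio, dtype=np.float32)
seconds = len(long_audio) / sampling_rate
print(f"{seconds // 60:.0f} minutes {seconds % 60:.2f} seconds of audio")
```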

Alright! 5 minutes and 17 seconds of audio to transcribe. There are two problems with forwarding this long audio sample directly to the model:

  • Whisper is inherently designed to work with 30 second samples: anything shorter than 30s is padded to 30s with silence, anything longer than 30s is truncated to 30s by cutting off the extra audio, so if we pass our audio directly we’ll only get the transcription for the first 30s
  • Memory in a transformer network scales with the sequence length squared: doubling the input length quadruples the memory requirement, so passing super long audio files is bound to lead to an out-of-memory (OOM) error

The way long-form transcription works in 🤗 Transformers is by chunking the input audio into smaller, more manageable segments. Each segment has a small amount of overlap with the previous one. This allows us to accurately stitch the segments back together at the boundaries, since we can find the overlap between segments and merge the transcriptions accordingly:

🤗 Transformers chunking algorithm. Source: https://huggingface.co/blog/asr-chunking.

The advantage of chunking the samples is that we don’t need the result of chunk i to transcribe the subsequent chunk i+1. The stitching is done after we have transcribed all the chunks at the chunk boundaries, so it doesn’t matter which order we transcribe chunks in. The algorithm is entirely stateless , so we can even do chunk i+1 at the same time as chunk i! This allows us to batch the chunks and run them through the model in parallel, providing a large computational speed-up compared to transcribing them sequentially. To read more about chunking in 🤗 Transformers, you can refer to this blog post .

To activate long-form transcriptions, we have to add one additional argument when we call the pipeline. This argument, chunk_length_s , controls the length of the chunked segments in seconds. For Whisper, 30 second chunks are optimal, since this matches the input length Whisper expects.

To activate batching, we need to pass the argument batch_size to the pipeline. Putting it all together, we can transcribe the long audio sample with chunking and batching as follows:
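For example, with 30-second chunks and a batch of eight chunks processed in parallel (the batch size is a hardware-dependent choice):

```python
pipe(
    long_audio,
    max_new_tokens=256,
    generate_kwargs={"task": "transcribe"},
    chunk_length_s=30,   # matches Whisper's expected 30-second input window
    batch_size=8,        # transcribe several chunks in parallel
)
```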

We won’t print the entire output here since it’s pretty long (312 words total)! On a 16GB V100 GPU, you can expect the above line to take approximately 3.45 seconds to run, which is pretty good for a 317 second audio sample. On a CPU, expect closer to 30 seconds.

Whisper is also able to predict segment-level timestamps for the audio data. These timestamps indicate the start and end time for a short passage of audio, and are particularly useful for aligning a transcription with the input audio. Suppose we want to provide closed captions for a video - we need these timestamps to know which part of the transcription corresponds to a certain segment of video, in order to display the correct transcription for that time.

Activating timestamp prediction is straightforward, we just need to set the argument return_timestamps=True . Timestamps are compatible with both the chunking and batching methods we used previously, so we can simply append the timestamp argument to our previous call:
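Appending it to the previous call looks like this (the output keys follow the pipeline's chunked-output format):

```python
output = pipe(
    long_audio,
    max_new_tokens=256,
    generate_kwargs={"task": "transcribe"},
    chunk_length_s=30,
    batch_size=8,
    return_timestamps=True,
)
print(output["chunks"])   # a list of {"timestamp": (start, end), "text": ...} segments
```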

And voila! We have our predicted text as well as corresponding timestamps.

Whisper is a strong pre-trained model for speech recognition and translation. Compared to Wav2Vec2, it has higher transcription accuracy, with outputs that contain punctuation and casing. It can be used to transcribe speech in English as well as 96 other languages, both on short audio segments and longer ones through chunking . These attributes make it a viable model for many speech recognition and translation tasks without the need for fine-tuning. The pipeline() method provides an easy way of running inference in one-line API calls with control over the generated predictions.

While the Whisper model performs extremely well on many high-resource languages, it has lower transcription and translation accuracy on low-resource languages, i.e. those with less readily available training data. There is also varying performance across different accents and dialects of certain languages, including lower accuracy for speakers of different genders, races, ages or other demographic criteria ( c.f. Whisper paper ).

To boost the performance on low-resource languages, accents or dialects, we can take the pre-trained Whisper model and train it on a small corpus of appropriately selected data, in a process called fine-tuning . We’ll show that with as little as ten hours of additional data, we can improve the performance of the Whisper model by over 100% on a low-resource language. In the next section, we’ll cover the process behind selecting a dataset for fine-tuning.

How to Build a Basic Speech Recognition Network with Tensorflow (Demo Video Included)


Master speech recognition with TensorFlow and learn to build a basic network for recognizing speech commands.


Table of contents

  • Introduction
  • A basic understanding of the techniques involved
  • Step 1: Import necessary modules and dependencies
  • Step 2: Download the dataset
  • Step 3: Data exploration and visualization
  • Step 4: Preprocessing
  • Step 5: Training
  • Step 6: Testing and prediction
  • Parting thoughts

This tutorial will show you how to build a basic speech recognition network that recognizes simple speech commands. Speech recognition is a subfield of computer science and linguistics that identifies spoken words and converts them into text.

When speech is recorded using a voice recording device like a microphone, the device converts physical sound into electrical energy. Then, using an analog-to-digital converter, this is converted into digital data, which can be fed to a neural network or hidden Markov model to be converted into text.

We are going to train such a neural network here, which, after training, will be able to recognize small speech commands.

Speech recognition is also known as:

Automatic Speech Recognition (ASR)

Computer Speech Recognition

Speech to Text (STT)

The steps involved are:

  • Import required libraries
  • Download the dataset
  • Data exploration and visualization
  • Preprocessing
  • Training
  • Testing and prediction

Here is a Colab notebook with all of the code if you want to follow along.

Let us start the implementation.

Download and extract the mini_speech_commands.zip file, which contains the smaller Speech Commands dataset.

The dataset's audio clips are stored in eight folders corresponding to each speech command: no , yes , down , go , left , up , right ,and stop.

Dataset: TensorFlow recently released the Speech Commands dataset. It includes 65,000 one-second-long utterances of 30 short words by thousands of different people. We will be working with a smaller version of it called the mini Speech Commands dataset.

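The download cell is not reproduced on this page; a sketch using tf.keras.utils.get_file with the archive URL from TensorFlow's public mini Speech Commands dataset:

```python
import pathlib
import tensorflow as tf

data_dir = pathlib.Path("data/mini_speech_commands")
if not data_dir.exists():
    tf.keras.utils.get_file(
        "mini_speech_commands.zip",
        origin="http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip",
        extract=True,
        cache_dir=".",
        cache_subdir="data",
    )
```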

Now that we have the dataset, let us understand and visualize it.

Data Exploration and Visualization is an approach that helps us understand what's in a dataset and the characteristics of the dataset. Let us visualize the audio signal in the time series domain.

Here is what the audio looks like as a waveform.


To listen to the above command up:

Check the list of commands for which we will be training our speech recognition model. These audio clips are stored in eight folders corresponding to each speech command: no , yes , down , go , left , up , right , and stop.


Remove unnecessary files:

Let us plot a bar graph to understand the number of recordings for each of the eight voice commands:


As we can see, we have almost the same number of recordings for each command.

Let us define these preprocessing steps in the code snippet below:

Convert the output labels to integer-encoded labels and then to one-hot vectors, since it is a multi-class classification problem. Then reshape the 2D array to 3D, since the input to Conv1D must be a 3D array:
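The snippet referred to above did not survive extraction; a sketch of these preprocessing steps, assuming librosa for loading/resampling and scikit-learn/Keras for the label encoding (it reuses data_dir from the download sketch, and all variable names are mine):

```python
import os
import numpy as np
import librosa
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

commands = ["no", "yes", "down", "go", "left", "up", "right", "stop"]
all_waves, all_labels = [], []

for command in commands:
    folder = data_dir / command
    for filename in os.listdir(folder):
        # Load each clip and resample it from 16 kHz down to 8 kHz.
        samples, sample_rate = librosa.load(folder / filename, sr=8000)
        if len(samples) == 8000:          # keep only clips that are exactly one second long
            all_waves.append(samples)
            all_labels.append(command)

le = LabelEncoder()
y = to_categorical(le.fit_transform(all_labels))     # integer labels -> one-hot vectors
X = np.array(all_waves).reshape(-1, 8000, 1)         # Conv1D expects a 3D input
```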

Train-test split - a train-test split is a model validation procedure that allows you to simulate how a model would perform on new/unseen data. We are doing an 80:20 split of the data for training and testing.

Create a model and compile it. Now, we define a model:

Define callbacks:

Start the training:

Plot the training loss vs validation loss:
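The individual snippets for the split, the model, the callbacks, the training call and the loss plot were all stripped from this page; a combined sketch with a small Conv1D classifier (the exact architecture and hyperparameters here are illustrative, not the author's):

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers, models, callbacks

# 80:20 split of the data for training and validation.
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = models.Sequential([
    layers.Input(shape=(8000, 1)),
    layers.Conv1D(16, 13, activation="relu"),
    layers.MaxPooling1D(3),
    layers.Conv1D(32, 11, activation="relu"),
    layers.MaxPooling1D(3),
    layers.Conv1D(64, 9, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(len(commands), activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Stop early when validation loss stalls and keep the best weights on disk.
cb = [
    callbacks.EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
    callbacks.ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),
]

history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                    epochs=50, batch_size=32, callbacks=cb)

plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.legend()
plt.show()
```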

Now, we have a trained model. We need to load it and use it to predict our commands.

Load the model for prediction:

Start predicting:
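A sketch of loading the saved model and classifying a clip (the helper function and the example file are placeholders):

```python
from tensorflow.keras.models import load_model

model = load_model("best_model.h5")


def predict_command(wav_path):
    samples, _ = librosa.load(wav_path, sr=8000)
    samples = np.pad(samples, (0, max(0, 8000 - len(samples))))[:8000]   # force one second
    probabilities = model.predict(samples.reshape(1, 8000, 1))[0]
    return le.inverse_transform([np.argmax(probabilities)])[0]


example = data_dir / "stop" / os.listdir(data_dir / "stop")[0]
print(predict_command(str(example)))
```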

You can always create your own dataset in creative ways, like clap sounds, whistles, or your own custom words, and train your model to recognize them.

You can check out the demo video for this experiment here.

We just completed a tutorial on building a speech recognition system! Here is a quick recap:

We began by exploring the dataset, giving us a good feel for what is inside. Then, we prepped the data, converting it into a format suitable for training. After a train-test split, we designed a Conv1D neural network for the task.

By following these steps, we have laid the foundation for a speech recognition system. With further tweaks and your own data, you can expand its capabilities. Keep exploring the world of speech recognition!

This article was written by Priyamvada, Software Engineer, for the GeekyAnts blog.


Frequently Asked Questions

Background

Speech Recognition Engines need Acoustic Models trained with speech audio that has the same sampling rate and bits per sample as the speech they will recognize. Different speech media have limitations that affect speech recognition.

Telephony Bandwidth Limitations 

For example, for telephony speech recognition, the limitation is the 64kbps bandwidth of a telephone line. This only permits a sampling rate of 8kHz and a sampling resolution of 8-bits per sample. Therefore, to perform speech recognition on a telephone line, you need Acoustic Models trained using audio recorded at an 8kHz sampling rate with 8-bits per sample. VoIP applications usually have the same limitations since they allow interconnection to the Public Switched Telephone Network (PSTN).

Desktop Sound Card and Processor Limitations 

For desktop Command and Control applications,  your PC's sound card determines your maximum sampling rate and bits per sample, and the power of your CPU determines what kinds of acoustic models your Speech Recognition Engine can process efficiently.

So why record at the highest sampling rate and bits per sample?

Speech Recognition Engines work best with Acoustic Models trained with audio recorded at a higher sampling rate and bits per sample. However, since current hardware (CPUs and/or sound cards) is not powerful enough to support Acoustic Models trained at higher sampling rates and bits per sample, and telephony applications have bandwidth limitations (as discussed above), a compromise is required. VoxForge has decided that the best approach (for now) is to collect speech recorded at the highest sampling rate your audio card supports, at 16-bits per sample, and then downsample the audio to sampling rates that can be supported by the target speech medium.

For example, for Command and Control applications on a desktop PC, you can downsample the 48kHz/16-bit audio to 16kHz/16-bit audio and create Acoustic Models from this. This approach permits us to be backward compatible with older sound cards that may not support the higher sampling rates/bits per sample, and also permits us to look to the future: any audio submitted at higher sampling rates/bits per sample will still be usable down the road, when sound cards that support these rates become more common and processing power increases.

For Telephony applications, to create Acoustic Models from audio recorded at a sample rate of 48kHz with 16-bits per sample, you must first downsample the audio to a sample rate of 8kHz/8-bit per sample, and then create an Acoustic Model from this.

Some VoIP PBXs, such as Asterisk, actually represent audio data internally at 8kHz/16-bit sampling rates, even though the codec used might only support 8kHz/8-bit sampling rates. Therefore, VoIP PBXs like Asterisk can use Acoustic Models trained on audio with 8kHz/16-bit sampling rates.
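As a concrete illustration of the downsampling workflow described above (not VoxForge's own tooling), here is a sketch using torchaudio; sox or ffmpeg would do the same job:

```python
import torchaudio
import torchaudio.functional as AF

waveform, orig_rate = torchaudio.load("recording_48k.wav")   # e.g. a 48kHz/16-bit source file

# Desktop Command and Control models: downsample to 16kHz/16-bit.
wav_16k = AF.resample(waveform, orig_freq=orig_rate, new_freq=16000)
torchaudio.save("recording_16k.wav", wav_16k, 16000, encoding="PCM_S", bits_per_sample=16)

# Telephony models: downsample to 8kHz (the 8-bit mu-law step is typically applied by the codec).
wav_8k = AF.resample(waveform, orig_freq=orig_rate, new_freq=8000)
torchaudio.save("recording_8k.wav", wav_8k, 8000, encoding="PCM_S", bits_per_sample=16)
```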


Retrain a speech recognition model with TensorFlow Lite Model Maker

In this colab notebook, you'll learn how to use the TensorFlow Lite Model Maker to train a speech recognition model that can classify spoken words or short phrases using one-second sound samples. The Model Maker library uses transfer learning to retrain an existing TensorFlow model with a new dataset, which reduces the amount of sample data and time required for training.

By default, this notebook retrains the model (BrowserFft, from the TFJS Speech Command Recognizer ) using a subset of words from the speech commands dataset (such as "up," "down," "left," and "right"). Then it exports a TFLite model that you can run on a mobile device or embedded system (such as a Raspberry Pi). It also exports the trained model as a TensorFlow SavedModel.

This notebook is also designed to accept a custom dataset of WAV files, uploaded to Colab in a ZIP file. The more samples you have for each class, the better your accuracy will be, but because the transfer learning process uses feature embeddings from the pre-trained model, you can still get a fairly accurate model with only a few dozen samples in each of your classes.

If you want to run the notebook with the default speech dataset, you can run the whole thing now by clicking Runtime > Run all in the Colab toolbar. However, if you want to use your own dataset, then continue down to Prepare the dataset and follow the instructions there.

Import the required packages

You'll need TensorFlow, TFLite Model Maker, and some modules for audio manipulation, playback, and visualizations.

Prepare the dataset

To train with the default speech dataset, just run all the code below as-is.

But if you want to train with your own speech dataset, follow these steps:

  • Be sure each sample in your dataset is in WAV file format, about one second long . Then create a ZIP file with all your WAV files, organized into separate subfolders for each classification. For example, each sample for a speech command "yes" should be in a subfolder named "yes". Even if you have only one class, the samples must be saved in a subdirectory with the class name as the directory name. (This script assumes your dataset is not split into train/validation/test sets and performs that split for you.)
  • Click the Files tab in the left panel and just drag-drop your ZIP file there to upload it.
  • Use the following drop-down option to set use_custom_dataset to True.
  • Then skip to Prepare a custom audio dataset to specify your ZIP filename and dataset directory name.


Generate a background noise dataset

Whether you're using the default speech dataset or a custom dataset, you should have a good set of background noises so your model can distinguish speech from other noises (including silence).

Because the following background samples are provided in WAV files that are a minute long or longer, we need to split them up into smaller one-second samples so we can reserve some for our test dataset. We'll also combine a couple different sample sources to build a comprehensive set of background noises and silence:

Prepare the speech commands dataset

We already downloaded the speech commands dataset, so now we just need to prune the number of classes for our model.

This dataset includes over 30 speech command classifications, and most of them have over 2,000 samples. But because we're using transfer learning, we don't need that many samples. So the following code does a few things:

  • Specify which classifications we want to use, and delete the rest.
  • Keep only 150 samples of each class for training (to prove that transfer learning works well with smaller datasets and simply to reduce the training time).
  • Create a separate directory for a test dataset so we can easily run inference with them later.

Prepare a custom dataset

If you want to train the model with your own speech dataset, you need to upload your samples as WAV files in a ZIP ( as described above ) and modify the following variables to specify your dataset:

After changing the filename and path name above, you're ready to train the model with your custom dataset. In the Colab toolbar, select Runtime > Run all to run the whole notebook.

The following code integrates our new background noise samples into your dataset and then separates a portion of all samples to create a test set.

Play a sample

To be sure the dataset looks correct, let's play a random sample from the test set:

Define the model

When using Model Maker to retrain any model, you have to start by defining a model spec. The spec defines the base model from which your new model will extract feature embeddings to begin learning new classes. The spec for this speech recognizer is based on the pre-trained BrowserFft model from TFJS .

The model expects input as an audio sample that's 44.1 kHz, and just under a second long: the exact sample length must be 44034 frames.

You don't need to do any resampling with your training dataset. Model Maker takes care of that for you. But when you later run inference, you must be sure that your input matches that expected format.

All you need to do here is instantiate the BrowserFftSpec :

Load your dataset

Now you need to load your dataset according to the model specifications. Model Maker includes the DataLoader API, which will load your dataset from a folder and ensure it's in the expected format for the model spec.

We already reserved some test files by moving them to a separate directory, which makes it easier to run inference with them later. Now we'll create a DataLoader for each split: the training set, the validation set, and the test set.

Load the speech commands dataset

Load a custom dataset

Train the model

Now we'll use the Model Maker create() function to create a model based on our model spec and training dataset, and begin training.

If you're using a custom dataset, you might want to change the batch size as appropriate for the number of samples in your train set.
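A condensed sketch of the Model Maker steps up to this point (spec, data loading, training); the folder paths and hyperparameters are placeholders:

```python
from tflite_model_maker import audio_classifier

# The BrowserFft spec expects 44.1 kHz clips that are exactly 44034 samples long.
spec = audio_classifier.BrowserFftSpec()

# Load training data from per-class subfolders and carve out a validation split.
train_data = audio_classifier.DataLoader.from_folder(spec, "dataset/train", cache=True)
train_data, validation_data = train_data.split(0.8)

model = audio_classifier.create(train_data, spec, validation_data, batch_size=25, epochs=25)
```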

Review the model performance

Even if the accuracy/loss looks good from the training output above, it's important to also run the model using test data that the model has not seen yet, which is what the evaluate() method does here:

View the confusion matrix

When training a classification model such as this one, it's also useful to inspect the confusion matrix . The confusion matrix gives you a detailed visual representation of how well your classifier performs for each classification in your test data.

Export the model

The last step is exporting your model into the TensorFlow Lite format for execution on mobile/embedded devices and into the SavedModel format for execution elsewhere.

When exporting a .tflite file from Model Maker, it includes model metadata that describes various details that can later help during inference. It even includes a copy of the classification labels file, so you don't need a separate labels.txt file. (In the next section, we show how to use this metadata to run an inference.)

Run inference with TF Lite model

Now your TFLite model can be deployed and run using any of the supported inferencing libraries or with the new TFLite AudioClassifier Task API . The following code shows how you can run inference with the .tflite model in Python.

To observe how well the model performs with real samples, run the following code block over and over. Each time, it will fetch a new test sample and run inference with it, and you can listen to the audio sample below.
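A sketch of that inference path using the TensorFlow Lite interpreter directly (the model path is a placeholder, and the zero array stands in for a real 44.1 kHz clip):

```python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="browser_fft.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# The input must match the shape the model expects (one 44034-sample clip at 44.1 kHz).
audio = np.zeros(input_details["shape"], dtype=np.float32)   # replace with real audio

interpreter.set_tensor(input_details["index"], audio)
interpreter.invoke()
scores = interpreter.get_tensor(output_details["index"])[0]
print("Predicted class index:", int(np.argmax(scores)))
```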

Download the TF Lite model

Now you can deploy the TF Lite model to your mobile or embedded device. You don't need to download the labels file because you can instead retrieve the labels from .tflite file metadata, as shown in the previous inferencing example.

Check out our end-to-end example apps that perform inferencing with TFLite audio models on Android and iOS .


A reweighting method for speech recognition with imbalanced data of Mandarin and sub-dialects

  • Original Research Paper
  • Published: 26 March 2024


  • Jiaju Wu 1 , 3 ,
  • Zhengchang Wen 1 , 3 ,
  • Haitian Huang 6 ,
  • Hanjing Su 5 ,
  • Fei Liu 1 ,
  • Huan Wang 7 ,
  • Yi Ding 2 &
  • Qingyao Wu   ORCID: orcid.org/0000-0002-8564-7289 1 , 3 , 4  


Automatic speech recognition (ASR) is an important technology in many fields like video-sharing services, online education and live broadcast. Most recent ASR methods are based on deep learning technology. A dataset containing training samples of standard Mandarin and its sub-dialects can be used to train a neural network-based ASR model that can recognize standard Mandarin and its sub-dialects. Usually, due to different costs of collecting different sub-dialects, the number of training samples of standard Mandarin in the dataset is much larger than the number of training samples of sub-dialects, resulting in the recognition performance of the model for standard Mandarin being much higher than that of sub-dialects. In this paper, to enhance the recognition performance for sub-dialects, we propose to reweight the recognition loss for different sub-dialects based on their similarity to standard Mandarin. The proposed reweighting method makes the model pay more attention to sub-dialects with larger loss weights, alleviating the problem of poor recognition performance for sub-dialects. Our model was trained and validated on an open-source dataset named KeSpeech, including standard Mandarin and its eight sub-dialects. Experimental results show that the proposed model is better at recognizing most sub-dialects than the baseline and is about 0.5 lower than the baseline in Character Error Rate.




This work was supported by the National Natural Science Foundation of China (NSFC) 62272172, Guangdong Basic and Applied Basic Research Foundation 2023A1515012920, Basic and Applied Basic Research Project of Guangzhou Basic Research Program with Grant No. 2023A04J1051.

Author information

Authors and Affiliations

School of Software Engineering, South China University of Technology, Guangzhou, China

Jiaju Wu, Zhengchang Wen, Fei Liu & Qingyao Wu

Hunan University of Arts and Science, Changde, China

Key Laboratory of Big Data and Intelligent Robot, Ministry of Education, Guangzhou, China

Jiaju Wu, Zhengchang Wen & Qingyao Wu

Pazhou Lab, Guangzhou, China

Tencent Wechat Department, Shenzhen, China

Shenzhen Zhenhua Microelectronics, Ltd. - ZHM, Shenzhen, China

Haitian Huang

Industrial Technology Research Center, Guangdong Institute of Scientific and Technical Information, Guangzhou, China


Contributions

Jiaju Wu and Zhengchang Wen developed the proposed method and drafted the manuscript. Yi Ding supervised the project, contributed to the discussion and analysis, and provided important suggestions for the paper. Haitian Huang, Hanjing Su, Fei Liu, Huan Wang and Qingyao Wu participated in the discussion about the proposed method. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Fei Liu or Yi Ding .

Ethics declarations

Conflict of interest

The authors declare no potential conflict of interest.



About this article

Wu, J., Wen, Z., Huang, H. et al. A reweighting method for speech recognition with imbalanced data of Mandarin and sub-dialects. SOCA (2024). https://doi.org/10.1007/s11761-024-00384-0


Received : 15 March 2023

Revised : 12 January 2024

Accepted : 22 January 2024

Published : 26 March 2024

DOI : https://doi.org/10.1007/s11761-024-00384-0

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Automatic speech recognition
  • Imbalanced data
  • Dialect recognition

Package google.cloud.speech.v1

  • Adaptation (interface)
  • Speech (interface)
  • CreateCustomClassRequest (message)
  • CreatePhraseSetRequest (message)
  • CustomClass (message)
  • CustomClass.ClassItem (message)
  • DeleteCustomClassRequest (message)
  • DeletePhraseSetRequest (message)
  • GetCustomClassRequest (message)
  • GetPhraseSetRequest (message)
  • ListCustomClassesRequest (message)
  • ListCustomClassesResponse (message)
  • ListPhraseSetRequest (message)
  • ListPhraseSetResponse (message)
  • LongRunningRecognizeMetadata (message)
  • LongRunningRecognizeRequest (message)
  • LongRunningRecognizeResponse (message)
  • PhraseSet (message)
  • PhraseSet.Phrase (message)
  • RecognitionAudio (message)
  • RecognitionConfig (message)
  • RecognitionConfig.AudioEncoding (enum)
  • RecognitionMetadata (message) (deprecated)
  • RecognitionMetadata.InteractionType (enum)
  • RecognitionMetadata.MicrophoneDistance (enum)
  • RecognitionMetadata.OriginalMediaType (enum)
  • RecognitionMetadata.RecordingDeviceType (enum)
  • RecognizeRequest (message)
  • RecognizeResponse (message)
  • SpeakerDiarizationConfig (message)
  • SpeechAdaptation (message)
  • SpeechAdaptation.ABNFGrammar (message)
  • SpeechAdaptationInfo (message)
  • SpeechContext (message)
  • SpeechRecognitionAlternative (message)
  • SpeechRecognitionResult (message)
  • StreamingRecognitionConfig (message)
  • StreamingRecognitionConfig.VoiceActivityTimeout (message)
  • StreamingRecognitionResult (message)
  • StreamingRecognizeRequest (message)
  • StreamingRecognizeResponse (message)
  • StreamingRecognizeResponse.SpeechEventType (enum)
  • TranscriptOutputConfig (message)
  • UpdateCustomClassRequest (message)
  • UpdatePhraseSetRequest (message)
  • WordInfo (message)

Service that implements Google Cloud Speech Adaptation API.

Service that implements Google Cloud Speech API.

CreateCustomClassRequest

Message sent by the client for the CreateCustomClass method.

CreatePhraseSetRequest

Message sent by the client for the CreatePhraseSet method.

CustomClass

A set of words or phrases that represents a common concept likely to appear in your audio, for example a list of passenger ship names. CustomClass items can be substituted into placeholders that you set in PhraseSet phrases.

An item of the class.

DeleteCustomClassRequest

Message sent by the client for the DeleteCustomClass method.

DeletePhraseSetRequest

Message sent by the client for the DeletePhraseSet method.

GetCustomClassRequest

Message sent by the client for the GetCustomClass method.

GetPhraseSetRequest

Message sent by the client for the GetPhraseSet method.

ListCustomClassesRequest

Message sent by the client for the ListCustomClasses method.

ListCustomClassesResponse

Message returned to the client by the ListCustomClasses method.

ListPhraseSetRequest

Message sent by the client for the ListPhraseSet method.

ListPhraseSetResponse

Message returned to the client by the ListPhraseSet method.

LongRunningRecognizeMetadata

Describes the progress of a long-running LongRunningRecognize call. It is included in the metadata field of the Operation returned by the GetOperation call of the google::longrunning::Operations service.

LongRunningRecognizeRequest

The top-level message sent by the client for the LongRunningRecognize method.

LongRunningRecognizeResponse

The only message returned to the client by the LongRunningRecognize method. It contains the result as zero or more sequential SpeechRecognitionResult messages. It is included in the result.response field of the Operation returned by the GetOperation call of the google::longrunning::Operations service.

Provides "hints" to the speech recognizer to favor specific words and phrases in the results.

A phrase containing words and phrase "hints" so that the speech recognition is more likely to recognize them. This can be used to improve the accuracy for specific words and phrases, for example, if specific commands are typically spoken by the user. This can also be used to add additional words to the vocabulary of the recognizer. See usage limits .

List items can also include pre-built or custom classes containing groups of words that represent common concepts that occur in natural language. For example, rather than providing a phrase hint for every month of the year (e.g. "i was born in january", "i was born in february", ...), using the pre-built $MONTH class improves the likelihood of correctly transcribing audio that includes months (e.g. "i was born in $month"). To refer to pre-built classes, use the class' symbol prepended with $ , e.g. $MONTH . To refer to custom classes that were defined inline in the request, set the class's custom_class_id to a string unique to all class resources and inline classes, then use the class' id wrapped in ${...} , e.g. "${my-months}" . To refer to custom class resources, use the class' id wrapped in ${} (e.g. ${my-months} ).

Speech-to-Text supports three locations: global , us (US North America), and eu (Europe). If you are calling the speech.googleapis.com endpoint, use the global location. To specify a region, use a regional endpoint with matching us or eu location value.

RecognitionAudio

Contains audio data in the encoding specified in the RecognitionConfig. Either content or uri must be supplied. Supplying both or neither returns google.rpc.Code.INVALID_ARGUMENT. See content limits.
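For example (a sketch; the file path and bucket URI are placeholders), the two mutually exclusive ways of supplying audio look like this:

    # Sketch: RecognitionAudio takes either inline bytes (content) or a
    # Cloud Storage URI (uri), never both.
    from google.cloud import speech

    with open("audio.wav", "rb") as f:  # placeholder local file
        inline_audio = speech.RecognitionAudio(content=f.read())

    gcs_audio = speech.RecognitionAudio(uri="gs://your-bucket/audio.wav")  # placeholder URI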

RecognitionConfig

Provides information to the recognizer that specifies how to process the request.

Set to true to use an enhanced model for speech recognition. If use_enhanced is set to true and the model field is not set, then an appropriate enhanced model is chosen if an enhanced model exists for the audio.

If use_enhanced is true and an enhanced version of the specified model does not exist, then the speech is recognized using the standard version of the specified model.
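A minimal sketch of such a config with the Python client, assuming a model (here phone_call) for which an enhanced variant exists:

    # Sketch: request an enhanced model; if no enhanced variant of the chosen
    # model exists, recognition falls back to the standard version.
    from google.cloud import speech

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        model="phone_call",
        use_enhanced=True,
    )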

AudioEncoding

The encoding of the audio data sent in the request.

All encodings support only 1 channel (mono) audio, unless the audio_channel_count and enable_separate_recognition_per_channel fields are set.

For best results, the audio source should be captured and transmitted using a lossless encoding (FLAC or LINEAR16). The accuracy of the speech recognition can be reduced if lossy codecs are used to capture or transmit audio, particularly if background noise is present. Lossy codecs include MULAW, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE, MP3, and WEBM_OPUS.

The FLAC and WAV audio file formats include a header that describes the included audio content. You can request recognition for WAV files that contain either LINEAR16 or MULAW encoded audio. If you send FLAC or WAV audio file format in your request, you do not need to specify an AudioEncoding; the audio encoding format is determined from the file header. If you specify an AudioEncoding when you send FLAC or WAV audio, the encoding configuration must match the encoding described in the audio header; otherwise the request returns a google.rpc.Code.INVALID_ARGUMENT error code.
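For instance, a request for a self-describing WAV or FLAC file can leave the encoding (and sample rate) unset and let the header speak for itself. A sketch, with a placeholder Cloud Storage URI:

    # Sketch: for FLAC/WAV input the encoding and sample rate can be read from
    # the file header, so they may be omitted from the RecognitionConfig.
    from google.cloud import speech

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(language_code="en-US")
    audio = speech.RecognitionAudio(uri="gs://your-bucket/audio.wav")  # placeholder
    response = client.recognize(config=config, audio=audio)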

RecognitionMetadata

Description of audio data to be recognized.

InteractionType

Use case categories that the audio recognition request can be described by.

MicrophoneDistance

Enumerates the types of capture settings describing an audio file.

OriginalMediaType

The original media the speech was recorded on.

RecordingDeviceType

The type of device the speech was recorded with.

RecognizeRequest

The top-level message sent by the client for the Recognize method.

RecognizeResponse

The only message returned to the client by the Recognize method. It contains the result as zero or more sequential SpeechRecognitionResult messages.

SpeakerDiarizationConfig

Config to enable speaker diarization.

SpeechAdaptation

Speech adaptation configuration.

ABNFGrammar

SpeechAdaptationInfo

Information on speech adaptation use in results.

SpeechContext

SpeechRecognitionAlternative

Alternative hypotheses (a.k.a. n-best list).

SpeechRecognitionResult

A speech recognition result corresponding to a portion of the audio.

StreamingRecognitionConfig

VoiceActivityTimeout

Events that a timeout can be set on for voice activity.

StreamingRecognitionResult

A streaming speech recognition result corresponding to a portion of the audio that is currently being processed.

StreamingRecognizeRequest

The top-level message sent by the client for the StreamingRecognize method. Multiple StreamingRecognizeRequest messages are sent. The first message must contain a streaming_config message and must not contain audio_content . All subsequent messages must contain audio_content and must not contain a streaming_config message.
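A minimal sketch of building such a request stream in Python; the audio chunks are assumed to come from a microphone or file reader elsewhere:

    # Sketch: the first StreamingRecognizeRequest carries only streaming_config;
    # every subsequent request carries only audio_content.
    from google.cloud import speech

    streaming_config = speech.StreamingRecognitionConfig(
        config=speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,
            language_code="en-US",
        ),
        interim_results=True,
    )

    def request_stream(audio_chunks):
        # audio_chunks: any iterable of raw audio byte strings (assumed source).
        yield speech.StreamingRecognizeRequest(streaming_config=streaming_config)
        for chunk in audio_chunks:
            yield speech.StreamingRecognizeRequest(audio_content=chunk)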

StreamingRecognizeResponse

StreamingRecognizeResponse is the only message returned to the client by StreamingRecognize . A series of zero or more StreamingRecognizeResponse messages are streamed back to the client. If there is no recognizable audio, and single_utterance is set to false, then no messages are streamed back to the client.

Here's an example of a series of StreamingRecognizeResponse messages that might be returned while processing audio:

  1. results { alternatives { transcript: "tube" } stability: 0.01 }

  2. results { alternatives { transcript: "to be a" } stability: 0.01 }

  3. results { alternatives { transcript: "to be" } stability: 0.9 } results { alternatives { transcript: " or not to be" } stability: 0.01 }

  4. results { alternatives { transcript: "to be or not to be" confidence: 0.92 } alternatives { transcript: "to bee or not to bee" } is_final: true }

  5. results { alternatives { transcript: " that's" } stability: 0.01 }

  6. results { alternatives { transcript: " that is" } stability: 0.9 } results { alternatives { transcript: " the question" } stability: 0.01 }

  7. results { alternatives { transcript: " that is the question" confidence: 0.98 } alternatives { transcript: " that was the question" } is_final: true }

Only two of the above responses, #4 and #7, contain final results; they are indicated by is_final: true. Concatenating these together generates the full transcript: "to be or not to be that is the question".

The others contain interim results. #3 and #6 contain two interim results each: the first portion has a high stability and is less likely to change; the second portion has a low stability and is very likely to change. A UI designer might choose to show only high-stability results.

The specific stability and confidence values shown above are only for illustrative purposes. Actual values may vary.

In each response, only one of these fields will be set: error , speech_event_type , or one or more (repeated) results .
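As a sketch of how a client might consume such a stream, showing high-stability interim text but committing only final results (responses stands for any iterable of StreamingRecognizeResponse messages):

    # Sketch: display high-stability interim results, keep only is_final text.
    final_transcript = []
    for response in responses:  # responses: iterable of StreamingRecognizeResponse
        for result in response.results:
            text = result.alternatives[0].transcript
            if result.is_final:
                final_transcript.append(text)
            elif result.stability >= 0.8:  # illustrative threshold
                print("interim:", text)

    print("".join(final_transcript))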

SpeechEventType

Indicates the type of speech event.

TranscriptOutputConfig

Specifies an optional destination for the recognition results.

UpdateCustomClassRequest

Message sent by the client for the UpdateCustomClass method.

UpdatePhraseSetRequest

Message sent by the client for the UpdatePhraseSet method.

WordInfo

Word-specific information for recognized words.



Is OpenAI Voice Engine Adding Value Or Creating More Societal Risks?



Innovative futuristic technology continues to burst from OpenAI research labs. Voice Engine, just announced, generates natural speech that resembles the original speaker from a fifteen-second audio sample. The tool can recreate voices in English, Spanish, French, or Chinese.

Although Voice Engine has been in its labs since 2022, OpenAI stated it is being cautious about the release and wants to start a dialogue on the responsible deployment of synthetic voices.

Voice Engine can help advance a number of use cases. One example is providing reading assistance to non-readers and children: natural-sounding voices can generate pre-scripted voice-over content automatically, allowing for more content development and more rapid deployment.

A second example is helping patients recover their voices after a sudden loss of speech or a degenerative speech condition. Brown University has been piloting Voice Engine to treat speech impairments in patients with oncologic or neurologic conditions.

The partners testing Voice Engine have agreed to OpenAI's usage policies, which prohibit the impersonation of another individual or organization without consent or legal right.

In addition, OpenAI's partners require explicit and informed consent from the original speaker, and the company does not allow developers to build ways for individual users to create their own voices. The partners must also disclose to their audience that the voices they're hearing are AI-generated. Perhaps most important, OpenAI is implementing watermarking to trace the origin of any audio generated by Voice Engine and is proactively monitoring how Voice Engine is being used.

Although not officially released, Voice Engine carries serious risks. The risks most often highlighted are to families and small businesses targeted with fraudulent extortion scams. False election and marketing campaigns are a boon to bad actors with access to Voice Engine technology. In addition, creative professionals, such as voice artists, could have their voices used in ways that jeopardize their reputation and ability to earn an income.

The company also made forward-looking recommendations on safety approaches for voice technologies:

  • phasing out voice-based authentication as a security measure for accessing bank accounts and other sensitive information,
  • exploring policies to protect the use of individuals' voices in AI,
  • educating the public in understanding the capabilities and limitations of AI technologies, including the possibility of deceptive AI content, and
  • accelerating the development and adoption of techniques for tracking the origin of audiovisual content, so it's always clear when you're interacting with a real person or with an AI.

OpenAI is wisely proceeding with more caution and safety positioning with Voice Engine and is withholding a formal public release over safety concerns, citing the election year as a factor.

Where is OpenAI heading with Voice Engine?

An obvious answer is direct competition with Amazon’s Alexa, as the company filed a trademark application on March 19, further signalling its market direction. No matter where OpenAI Voice Engine is heading, the reality is that voice cloning is here to stay.

Update on Voice Cloning FCC Legislation

The Federal Communications Commission (FCC) announced in early February, 2024 that calls made with voices generated with the help of Artificial Intelligence (AI) will be considered “artificial” under the Telephone Consumer Protection Act (TCPA).

This announcement makes robocalls that implement voice cloning technology and target consumers illegal.

Cindy Gordon

