Python Speech Recognition - a Step-by-Step Guide

Have you used Shazam, the app that identifies music that is playing around you?

If yes, how often have you wondered about the technology that shapes this application?

How about products like Google Home or Amazon Alexa or your digital assistant Siri?

Many modern IoT products use speech recognition. This both adds creative functionality to the product and improves its accessibility features.

Python supports speech recognition and is compatible with many open-source speech recognition packages.

In this tutorial, I will teach you how to write Python speech recognition applications using an existing speech recognition package available on PyPI.


How does speech recognition work?

Modern speech recognition software works on the Hidden Markov Model (HMM).

According to the Hidden Markov Model, a speech signal broken down into fragments as small as one-hundredth of a second can be treated as a stationary process, one whose statistical properties do not change over time.

Your computer goes through a series of complex steps during speech recognition as it converts your speech to an on-screen text.

When you speak, you create an analog wave in the form of vibrations. This analog wave is converted into a digital signal that the computer can understand using a converter.

This signal is then divided into segments that are as small as one-hundredth of a second. The small segments are then matched with predefined phonemes.

Phonemes are the smallest element of a language. Linguists believe that there are around 40 phonemes in the English language.

Though this process sounds very simple, the trickiest part here is that each speaker pronounces a word slightly differently. Therefore, the way a phoneme sounds varies from speaker-to-speaker. This difference becomes especially significant across speakers from different geographical locations.

As Python developers, we are lucky to have speech recognition services that can be easily accessed through an API. Said differently, we do not need to build the infrastructure to recognize these phonemes from scratch!

Let's now look at the different Python speech recognition packages available on PyPI.

Available Python Speech Recognition Packages

There are many Python speech recognition packages available today. Here are some of the most popular:

  • apiai
  • assemblyai
  • google-cloud-speech
  • google-speech-engine
  • IBM speech to text
  • Microsoft Bing voice recognition
  • pocketsphinx
  • SpeechRecognition
  • watson-developer-cloud

In this tutorial, we will use the SpeechRecognition package, which is open-source and available on PyPI.

Installing and Using the SpeechRecognition package

In this tutorial, I am assuming that you will be using Python 3.5 or above.

You can install the SpeechRecognition package inside a virtual environment created with pyenv, pipenv, or virtualenv. In this tutorial, we will install the package with pip from a terminal.

$ pip install SpeechRecognition

Verify the installation of the SpeechRecognition module by running the following import in a Python interpreter:

import speech_recognition as sr

Note: If you are using microphone input instead of audio files stored on your computer, you'll want to install the PyAudio package (version 0.2.11 or later) as well.

The Recognizer Class

The Recognizer class from the speech_recognition module is used to convert our speech to text. Based on the API the user selects, the Recognizer class offers seven methods, described in the following table:

API                           Recognizer Method
Google Cloud Speech API       recognize_google_cloud()
Google Speech API             recognize_google()
Houndify API by SoundHound    recognize_houndify()
IBM Speech to Text API        recognize_ibm()
PocketSphinx API              recognize_sphinx()
Microsoft Bing Speech API     recognize_bing()
Wit.ai API                    recognize_wit()


In this tutorial, we will use the Google Speech API. The Google Speech API ships with a default API key in SpeechRecognition, so it works out of the box. All the other APIs require you to obtain your own credentials, such as an API key or a username and password.

First, create a Recognizer instance.

r = sr.Recognizer()

AudioFile is a class in the speech_recognition module used to recognize speech from an audio file stored on your machine.

Create an object of the AudioFile class and pass the path of your audio file to the constructor of the AudioFile class. The following file formats are supported by SpeechRecognition:

  • wav
  • aiff
  • aiff-c
  • flac

Try the following script:

myaudio = sr.AudioFile("D:/Files/my_audio.wav")

In the above script, you'll want to replace D:/Files/my_audio.wav with the location of your audio file.

Now, let's use the recognize_google() method to transcribe our file. This method takes an AudioData object, another class from the speech_recognition module, as its parameter.

The Recognizer class has a record() method that converts an audio file into an AudioData object. Pass the audio source to the record() method as shown below:

with myaudio as source:
    audio = r.record(source)

Check the type of the audio variable. You will notice that the type is speech_recognition.AudioData.

Input:

type(audio)

Output:

speech_recognition.AudioData

Now, pass the audio object to recognize_google() to convert the audio file into text.

r.recognize_google(audio)

Output

the birch canoe slid on the smooth planks glue the sheet to the dark blue background its easy to tell the depth of a well these days a chicken leg is a rare dish rice is often served in round bowls the juice of lemons makes fine punch the box was thrown beside the parked truck the hogs were fed chopped corn and garbage four hours of steady work faced us large size in stockings is hard to sell

Now that you have converted your first audio file into text, let's see how we can take only a portion of the file and convert it into text. To do this, we first need to understand the offset and duration keywords in the record() method.

The duration keyword of the record() method sets how many seconds of audio to capture. That is, if you want the conversion to stop after the first 5 seconds of the file, specify a duration of 5. Let's see how this is done.

with myaudio as source:
    audio = r.record(source, duration=5)

r.recognize_google(audio)

The output will be as follows:

the birch canoe slid on the smooth planks glue the sheet to the dark blue background

It's important to note that inside a with block, the record() method moves ahead in the file stream. That is, if you record twice, say once for five seconds and then again for four seconds, the second recording will return the four seconds of audio immediately after the first five seconds.

with myaudio as source:
    audio1 = r.record(source, duration=5)
    audio2 = r.record(source, duration=4)

r.recognize_google(audio1)

r.recognize_google(audio2)

What if we want the audio to start a few seconds into the file and run for a set duration?

This is where the offset keyword of the record() method comes to our aid. Here's how to use offset to skip the first four seconds of the file and then print the text for the next five seconds.

with myaudio as source:
    audio = r.record(source, offset=4, duration=5)

r.recognize_google(audio)

The output is as follows:

the dark blue background its easy to tell the depth of a well

To get the exact phrase from the audio file that you are looking for, use precise values for both the offset and duration keyword arguments.

Removing Noise

The file we used in this example had very little background noise that disrupted our conversion process. However, in reality, you will encounter a lot of background noise in your speech files.

Fortunately, you can use the adjust_for_ambient_noise() method of the Recognizer class to compensate for unwanted noise. This method takes the audio source as its parameter and listens to the opening stretch of the stream to calibrate the recognizer to the ambient noise level.

Let's see how this works:

myaudio = sr.AudioFile("D:/Files/my_audio.wav")

with myaudio as source:
    r.adjust_for_ambient_noise(source)
    audio = r.record(source)

r.recognize_google(audio)

Output

the birch canoe slid on the smooth planks glue the sheet to the dark blue background its easy to tell the depth of a well these days a chicken leg is a rare dish rice is often served in round bowls the juice of lemons makes fine punch the box was thrown beside the parked truck the hogs were fed chopped corn and garbage four hours of steady work faced us large size in stockings is hard to sell

As mentioned above, our file did not have much noise. This means that the output looks very similar to what we got earlier.

Speech Recognition from a Live Microphone Recording

Now that we have seen speech recognition from an audio file, let's see how to perform the same function when the input is provided via a microphone. As mentioned earlier, you will have to install the PyAudio library to use your microphone.

$ pip install PyAudio

After installing the PyAudio library, create an object of the Microphone class of the speech_recognition module.

We also need an instance of the Recognizer class, just as we created for the audio file.

import speech_recognition as sr

r = sr.Recognizer()

Now, instead of specifying the input from a file, let us use the default microphone of the system. Access the microphone by creating an instance of the Microphone class.

micph = sr.Microphone()

Similar to the record() method, you can use the listen() method of the Recognizer class to capture input from your microphone. The first argument of the listen() method is the audio source. It records input from the microphone until it detects silence.

with micph as source:
    print("You can now speak")
    r.adjust_for_ambient_noise(source)
    audio = r.listen(source)

print("Translating your speech...")

print("You said: " + r.recognize_google(audio))

Execute the script and try speaking into the microphone.

The system is ready to translate your speech once it displays the You can now speak message. The program will begin translating as soon as you stop speaking. If you do not see this message, the system has failed to detect your microphone.

Final Thoughts

Python speech recognition is steadily gaining importance and will soon become an integral part of human-computer interaction.

This article discussed speech recognition briefly and discussed the basics of using the Python SpeechRecognition library.

If you enjoyed this article, be sure to join my Developer Monthly newsletter, where I send out the latest news from the world of Python and JavaScript:


Written on June 9th, 2020