Recognizer

class danspeech.Recognizer(model=None, lm=None, with_gpu=False, **kwargs)

Recognizer class, which represents a collection of speech recognition functionality.

None of the parameters are required, but you need to update the Recognizer with a valid model before being able to perform speech recognition.

Parameters
  • model (DeepSpeech) – A valid DanSpeech model (danspeech.deepspeech.model.DeepSpeech). See Pre-trained DanSpeech Models for more information. This can also be your custom DanSpeech trained model.

  • lm (str) – A path (str) to a valid .klm language model. See Language Models for a list of pretrained available models.

  • with_gpu (bool) – Whether you want to run the Recognizer with a GPU.

    Note: Requires a GPU.

  • **kwargs – Additional decoder arguments. See Recognizer.update_decoder() for more information.

Example

recognizer = Recognizer()
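A Recognizer can also be constructed directly with a model and a language model. The lines below are a sketch; the pretrained model name is an assumption, see Pre-trained DanSpeech Models and Language Models for what is actually available.

from danspeech import Recognizer
from danspeech.pretrained_models import DanSpeechPrimary  # assumed pretrained model name

model = DanSpeechPrimary()

# lm expects a path to a .klm language model file; the path below is a placeholder.
recognizer = Recognizer(model=model, lm="/path/to/language_model.klm", with_gpu=False)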

recognize(audio_data, show_all=False)

Performs speech recognition with the current initialized DanSpeech model (danspeech.deepspeech.model.DeepSpeech).

Parameters
  • audio_data (array) – Numpy array of audio data. Use audio.load_audio() to load your audio into a valid format.

  • show_all (bool) – Whether to return all beams from beam search, if decoding is performed with a language model.

Returns

The most likely transcription if show_all=False (the default). Otherwise, returns the most likely beams from beam search with the language model.

Return type

str or list[str] if show_all=True.
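Example (a minimal sketch; the load_audio import path is assumed from the audio.load_audio() reference above, verify it for your installation)

from danspeech.audio import load_audio  # assumed import path

# Load audio into the expected numpy format (placeholder path).
audio_data = load_audio("/path/to/recording.wav")

# Most likely transcription (greedy or best beam).
transcription = recognizer.recognize(audio_data)
print(transcription)

# With a language model loaded, show_all=True returns the most likely beams instead.
beams = recognizer.recognize(audio_data, show_all=True)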

update_model(model)

Updates the DanSpeech model being used by the Recognizer.

Parameters

model – A valid DanSpeech model (danspeech.deepspeech.model.DeepSpeech). See Pre-trained DanSpeech Models for a list of pretrained available models. This can also be your custom DanSpeech trained model.
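Example (sketch; the pretrained model name is an assumption, see Pre-trained DanSpeech Models)

from danspeech.pretrained_models import DanSpeechPrimary  # assumed pretrained model name

recognizer.update_model(DanSpeechPrimary())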

update_decoder(lm=None, alpha=None, beta=None, beam_width=None)

Updates the decoder being used by the Recognizer. By default, greedy decoding of the DanSpeech model will be performed.

If lm is None or lm="greedy", then decoding is performed greedily, and the alpha, beta and beam_width parameters are therefore ignored.

Warning: Language models require the ctc-decode Python package to work.

Parameters
  • lm (str) – A path to a valid .klm language model. See Language Models for a list of pretrained available models.

  • alpha (float) – Alpha parameter of beam search decoding. If None, then the default parameter of alpha=1.3 is used.

  • beta (float) – Beta parameter of beam search decoding. If None, then the default parameter of beta=0.2 is used.

  • beam_width (int) – Beam width of beam search decoding. If None, then the default parameter of beam_width=64 is used.
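Example (sketch; the .klm path is a placeholder for one of the pretrained language models)

# Switch to beam search decoding with a language model and explicit parameters.
recognizer.update_decoder(lm="/path/to/language_model.klm", alpha=1.3, beta=0.2, beam_width=64)

# Switch back to greedy decoding (alpha, beta and beam_width are ignored).
recognizer.update_decoder(lm="greedy")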

enable_streaming()

Adjusts the Recognizer to continuously transcribe a stream of audio input.

Use this before starting a stream.

Example
recognizer.enable_streaming()
disable_streaming()

Adjusts the Recognizer to stop expecting a stream of audio input.

Use this after cancelling a stream.

Example
recognizer.disable_streaming()
streaming(source)

Generator for a streaming audio source, i.e. a Microphone().

Spawns a background thread and uses the loaded model to continuously transcribe audio input between detected silences from the Microphone() stream.

Warning: Requires that Recognizer.enable_streaming() has been called.

Parameters

source (Microphone) – Source of audio.

Example
generator = recognizer.streaming(source=m)

# Runs for a long time. Insert your own stop condition.
for i in range(100000):
    trans = next(generator)
    print(trans)
enable_real_time_streaming(streaming_model, secondary_model=None, string_parts=True)

Adjusts the Recognizer to continuously transcribe a stream of audio input in real time.

Real-time audio streaming utilizes a uni-directional model to transcribe an utterance while it is being spoken, in contrast to Recognizer.streaming(), where the utterance is transcribed after a silence has been detected.

Use this before starting a Recognizer.real_time_streaming() stream.

Parameters
  • streaming_model (DeepSpeech) – The DanSpeech model to perform streaming. This model needs to be uni-directional; this is required for real-time streaming to work. The two available DanSpeech models are pretrained_models.CPUStreamingRNN() and pretrained_models.GPUStreamingRNN(), but you may use your own custom streaming model as well.

  • secondary_model (DeepSpeech) – A valid DanSpeech model (danspeech.deepspeech.model.DeepSpeech). The secondary model transcribes the output after a silence is detected. This is useful since the performance of uni-directional models is very poor compared to bi-directional models, which require the full utterance. See Pre-trained DanSpeech Models for more information on available models. This can also be your custom DanSpeech trained model.

  • string_parts (bool) – Boolean indicating whether you want the generator (Recognizer.real_time_streaming()) to yield parts of the string or the whole (current) iterating string. We recommend string_parts=True and keeping track of the accumulated transcription yourself.

Example
recognizer.enable_real_time_streaming(streaming_model=CPUStreamingRNN())
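A secondary model can be supplied to re-transcribe each full utterance once a silence is detected (sketch; DanSpeechPrimary is an assumed pretrained model name):

recognizer.enable_real_time_streaming(
    streaming_model=CPUStreamingRNN(),
    secondary_model=DanSpeechPrimary(),  # assumed bi-directional pretrained model
    string_parts=True
)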
disable_real_time_streaming(keep_secondary_model_loaded=False)

Adjusts the Recognizer to stop expecting a stream of audio input.

Use this after cancelling a stream.

Parameters

keep_secondary_model_loaded (bool) – Whether to keep the secondary model in memory or not. Generally, you do not want to keep it in memory, unless you want to perform Recognizer.real_time_streaming() again after disabling the real-time streaming.

Example
recognizer.disable_real_time_streaming()
real_time_streaming(source)

Generator for a real-time streaming audio source, i.e. a Microphone().

Spawns a background thread and uses the loaded model(s) to continuously transcribe an audio utterance while it is being spoken.

Warning: Requires that Recognizer.enable_real_time_streaming() has been called.

Parameters

source (Microphone) – Source of audio.

Example
generator = r.real_time_streaming(source=m)

iterating_transcript = ""
print("Speak!")
while True:
    is_last, trans = next(generator)

    # If the transcription is empty, it means that the energy level required for data
    # was passed, but nothing was predicted.
    if is_last and trans:
        print("Final: " + trans)
        iterating_transcript = ""
        continue

    if trans:
        iterating_transcript += trans
        print(iterating_transcript)
        continue

The generator yields a boolean (is_last), indicating whether the transcription is a full utterance (detected by silences in the audio input), together with the current/partial transcription. If is_last is True, the transcription is a full utterance delimited by a detected silence.

Warning: This method assumes that you use a model with default spectrogram/audio parameters, i.e. 20 ms of audio for each STFT window and 50% overlap.

adjust_for_speech(source, duration=4)

Adjusts the energy level threshold required for the audio.Microphone() to detect speech in the background.

Warning: You need to talk after calling this method! Otherwise, the energy level will be set too low. If talking to adjust the energy level is not an option, use Recognizer.adjust_for_ambient_noise() instead.

Only use if the default energy level does not match your use case.

Parameters
  • source (Microphone) – Source of audio.

  • duration (float) – Maximum duration of adjusting the energy threshold.
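Example (sketch; assumes m is a Microphone() instance used as the source, as in the streaming examples above)

# Speak while this call is running, so the threshold is set from actual speech energy.
recognizer.adjust_for_speech(m, duration=4)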

adjust_for_ambient_noise(source, duration=2)

Source: https://github.com/Uberi/speech_recognition/blob/master/speech_recognition/__init__.py (modified for DanSpeech).

Adjusts the energy level threshold required for the audio.Microphone() to detect speech in the background. The threshold is based on the background (ambient) energy level.

Warning: Do not talk while adjusting energy threshold with this method. This method generally sets the energy level very low. We recommend using Recognizer.adjust_for_speech() instead.

Only use if the default energy level does not match your use case.

Parameters
  • source (Microphone) – Source of audio.

  • duration (float) – Maximum duration of adjusting the energy threshold.

update_stream_parameters(energy_threshold=None, pause_threshold=None, phrase_threshold=None, non_speaing_duration=None)

Updates parameters for stream of audio. Only use if the default streaming from your microphone is working poorly.

Parameters
  • energy_threshold (float) – Minimum audio energy required for the stream to start detecting an utterance.

  • pause_threshold (float) – Seconds of non-speaking audio before a phrase is considered complete.

  • phrase_threshold (float) – Minimum seconds of speaking audio before we consider the speaking audio a phrase.

  • non_speaing_duration (float) – Seconds of non-speaking audio to keep on both sides of the recording.
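Example (sketch; the values are illustrative, only override the parameters your microphone setup needs)

recognizer.update_stream_parameters(
    energy_threshold=1000,
    pause_threshold=0.8,
    phrase_threshold=0.3,
    non_speaing_duration=0.5  # keyword name follows the signature documented above
)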