Recognizer¶

class danspeech.Recognizer(model=None, lm=None, with_gpu=False, **kwargs)¶
Recognizer class, which represents a collection of speech recognition functionality.
None of the parameters are required, but you need to update the Recognizer with a valid model before being able to perform speech recognition.
- Parameters
model (DeepSpeech) – A valid DanSpeech model (danspeech.deepspeech.model.DeepSpeech). See Pre-trained DanSpeech Models for more information. This can also be your own custom-trained DanSpeech model.
lm (str) – A path (str) to a valid .klm language model. See Language Models for a list of available pretrained models.
with_gpu (bool) – Whether to run the Recognizer on a GPU. Note: Requires a GPU.
**kwargs – Additional decoder arguments. See Recognizer.update_decoder() for more information.
- Example
recognizer = Recognizer()
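A fuller construction sketch is shown below; the model class name and the .klm path are assumptions, standing in for any entry from Pre-trained DanSpeech Models and Language Models:

from danspeech import Recognizer
from danspeech.pretrained_models import DanSpeechPrimary  # assumed model name

model = DanSpeechPrimary()
recognizer = Recognizer(model=model, lm="/path/to/dsl_3gram.klm")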
recognize(audio_data, show_all=False)¶
Performs speech recognition with the currently initialized DanSpeech model (danspeech.deepspeech.model.DeepSpeech).
- Parameters
audio_data (array) – Numpy array of audio data. Use audio.load_audio() to load your audio into a valid format.
show_all (bool) – Whether to return all beams from beam search, if decoding is performed with a language model.
- Returns
The most likely transcription if show_all=False (the default). Otherwise, returns the most likely beams from beam search with a language model.
- Return type
str, or list[str] if show_all=True.
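A minimal usage sketch; the wav path is a placeholder, and the exact import location of load_audio is an assumption:

from danspeech.audio import load_audio  # assumed import path

audio = load_audio("speech.wav")  # placeholder file
print(recognizer.recognize(audio, show_all=False))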
update_model(model)¶
Updates the DanSpeech model being used by the Recognizer.
- Parameters
model – A valid DanSpeech model (danspeech.deepspeech.model.DeepSpeech). See Pre-trained DanSpeech Models for a list of available pretrained models. This can also be your own custom-trained DanSpeech model.
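For instance, to swap models at runtime (the model name is the same assumed stand-in as above):

recognizer.update_model(DanSpeechPrimary())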
update_decoder(lm=None, alpha=None, beta=None, beam_width=None)¶
Updates the decoder being used by the Recognizer. By default, greedy decoding of the DanSpeech model output is performed.
If lm is None or lm="greedy", decoding is performed greedily, and the alpha, beta and beam_width parameters are ignored.
Warning: Language models require the ctc-decode python package to work.
- Parameters
lm (str) – A path to a valid .klm language model. See Language Models for a list of available pretrained models.
alpha (float) – Alpha parameter of beam search decoding. If None, the default alpha=1.3 is used.
beta (float) – Beta parameter of beam search decoding. If None, the default beta=0.2 is used.
beam_width (int) – Beam width of beam search decoding. If None, the default beam_width=64 is used.
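A sketch of switching from greedy decoding to beam search with a language model; the .klm path is a placeholder:

recognizer.update_decoder(lm="/path/to/dsl_3gram.klm", alpha=1.3, beta=0.2, beam_width=64)

# Revert to greedy decoding:
recognizer.update_decoder(lm="greedy")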
enable_streaming()¶
Adjusts the Recognizer to continuously transcribe a stream of audio input.
Use this before starting a stream.
- Example
recognizer.enable_streaming()
disable_streaming()¶
Adjusts the Recognizer to stop expecting a stream of audio input.
Use this after cancelling a stream.
- Example
recognizer.disable_streaming()
streaming(source)¶
Generator for a streaming audio source, e.g. a Microphone().
Spawns a background thread and uses the loaded model to continuously transcribe audio input between detected silences from the Microphone() stream.
Warning: Requires that Recognizer.enable_streaming() has been called.
- Parameters
source (Microphone) – Source of audio.
- Example
generator = recognizer.streaming(source=m)

# Runs for a long time. Insert your own stop condition.
for i in range(100000):
    trans = next(generator)
    print(trans)
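The example above assumes an existing Microphone instance m. A fuller setup sketch, where the import location and sampling rate argument are assumptions:

from danspeech.audio import Microphone  # assumed import path

m = Microphone(sampling_rate=16000)  # assumed constructor argument
recognizer.enable_streaming()
generator = recognizer.streaming(source=m)
print(next(generator))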
enable_real_time_streaming(streaming_model, secondary_model=None, string_parts=True)¶
Adjusts the Recognizer to continuously transcribe a stream of audio input in real time.
Real-time audio streaming utilizes a uni-directional model to transcribe an utterance while it is being spoken, in contrast to Recognizer.streaming(), where the utterance is transcribed after a silence has been detected.
Use this before starting a Recognizer.real_time_streaming() stream.
- Parameters
streaming_model (DeepSpeech) – The DanSpeech model to perform streaming. This model needs to be uni-directional; this is required for real-time streaming to work. The two available DanSpeech models are pretrained_models.CPUStreamingRNN() and pretrained_models.GPUStreamingRNN(), but you may use your own custom streaming model as well.
secondary_model (DeepSpeech) – A valid DanSpeech model (danspeech.deepspeech.model.DeepSpeech). The secondary model transcribes the output after a silence is detected. This is useful since the performance of uni-directional models is very poor compared to bi-directional models, which require the full utterance. See Pre-trained DanSpeech Models for more information on available models. This can also be your own custom-trained DanSpeech model.
string_parts (bool) – Whether you want the generator (Recognizer.real_time_streaming()) to yield new parts of the string or the whole (current) accumulated string. string_parts=True is recommended; then keep track of the transcription yourself.
- Example
recognizer.enable_real_time_streaming(streaming_model=CPUStreamingRNN())
disable_real_time_streaming(keep_secondary_model_loaded=False)¶
Adjusts the Recognizer to stop expecting a stream of audio input.
Use this after cancelling a stream.
- Parameters
keep_secondary_model_loaded (bool) – Whether to keep the secondary model in memory. Generally, you do not want to keep it in memory, unless you plan to perform Recognizer.real_time_streaming() again after disabling real-time streaming.
- Example
recognizer.disable_real_time_streaming()
real_time_streaming(source)¶
Generator for a real-time streaming audio source, e.g. a Microphone().
Spawns a background thread and uses the loaded model(s) to continuously transcribe an audio utterance while it is being spoken.
Warning: Requires that Recognizer.enable_real_time_streaming() has been called.
- Parameters
source (Microphone) – Source of audio.
- Example
generator = r.real_time_streaming(source=m)

iterating_transcript = ""
print("Speak!")
while True:
    is_last, trans = next(generator)

    # If the transcription is empty, it means that the energy level required for data
    # was passed, but nothing was predicted.
    if is_last and trans:
        print("Final: " + trans)
        iterating_transcript = ""
        continue

    if trans:
        iterating_transcript += trans
        print(iterating_transcript)
        continue
The generator yields both a boolean (is_last), indicating whether the yield is a full utterance (delimited by detected silences in the audio input), and the (current/partial) transcription. If is_last is true, the transcription is a full utterance determined by a silence.
Warning: This method assumes that you use a model with the default spectrogram/audio parameters, i.e. 20 ms of audio per STFT with 50% overlap.
adjust_for_speech(source, duration=4)¶
Adjusts the energy level threshold required for the audio.Microphone() to detect speech in the background.
Warning: You need to talk after calling this method! Otherwise, the energy level will be set too low. If talking to adjust the energy level is not an option, use Recognizer.adjust_for_ambient_noise() instead.
Only use this if the default energy level does not match your use case.
- Parameters
source (Microphone) – Source of audio.
duration (float) – Maximum duration of the energy threshold adjustment.
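A usage sketch; m is assumed to be a Microphone instance as in the streaming examples, and you should speak while the adjustment runs:

recognizer.adjust_for_speech(source=m, duration=4)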
adjust_for_ambient_noise(source, duration=2)¶
Source: https://github.com/Uberi/speech_recognition/blob/master/speech_recognition/__init__.py (modified for DanSpeech).
Adjusts the energy level threshold required for the audio.Microphone() to detect speech in the background, based on the energy level of the background noise.
Warning: Do not talk while adjusting the energy threshold with this method. This method generally sets the energy level very low, so we recommend using Recognizer.adjust_for_speech() instead.
Only use this if the default energy level does not match your use case.
- Parameters
source (Microphone) – Source of audio.
duration (float) – Maximum duration of the energy threshold adjustment.
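A usage sketch; m is again an assumed Microphone instance, and you should stay silent while the adjustment runs:

recognizer.adjust_for_ambient_noise(source=m, duration=2)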
update_stream_parameters(energy_threshold=None, pause_threshold=None, phrase_threshold=None, non_speaing_duration=None)¶
Updates parameters for the audio stream. Only use this if the default streaming from your microphone works poorly.
- Parameters
energy_threshold (float) – Minimum audio energy required for the stream to start detecting an utterance.
pause_threshold (float) – Seconds of non-speaking audio before a phrase is considered complete.
phrase_threshold (float) – Minimum seconds of speaking audio before we consider the speaking audio a phrase.
non_speaing_duration (float) – Seconds of non-speaking audio to keep on both sides of the recording.
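A sketch of adjusting the stream parameters; the values are illustrative placeholders, not recommendations:

recognizer.update_stream_parameters(energy_threshold=700, pause_threshold=0.8)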