Recognizer¶

class danspeech.Recognizer(model=None, lm=None, with_gpu=False, **kwargs)¶
Recognizer class, which represents a collection of speech recognition functionality.
None of the parameters are required, but you need to update the Recognizer with a valid model before being able to perform speech recognition.
- Parameters
model (DeepSpeech) – A valid DanSpeech model (danspeech.deepspeech.model.DeepSpeech). See Pre-trained DanSpeech Models for more information. This can also be your own custom-trained DanSpeech model.
lm (str) – A path (str) to a valid .klm language model. See Language Models for a list of available pretrained models.
with_gpu (bool) – Whether to run the Recognizer on a GPU. Note: Requires a GPU.
**kwargs – Additional decoder arguments. See Recognizer.update_decoder() for more information.
- Example
recognizer = Recognizer()
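A fuller construction sketch is shown below; the model class name and the .klm path are assumptions, standing in for any entry from Pre-trained DanSpeech Models and Language Models:

from danspeech import Recognizer
from danspeech.pretrained_models import DanSpeechPrimary  # assumed model name

model = DanSpeechPrimary()
recognizer = Recognizer(model=model, lm="/path/to/dsl_3gram.klm")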
recognize(audio_data, show_all=False)¶
Performs speech recognition with the currently initialized DanSpeech model (danspeech.deepspeech.model.DeepSpeech).
- Parameters
audio_data (array) – Numpy array of audio data. Use audio.load_audio() to load your audio into a valid format.
show_all (bool) – Whether to return all beams from beam search, if decoding is performed with a language model.
- Returns
The most likely transcription if show_all=False (the default). Otherwise, returns the most likely beams from beam search with a language model.
- Return type
str, or list[str] if show_all=True.
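A minimal usage sketch; the wav path is a placeholder, and the exact import location of load_audio is an assumption:

from danspeech.audio import load_audio  # assumed import path

audio = load_audio("speech.wav")  # placeholder file
print(recognizer.recognize(audio, show_all=False))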
update_model(model)¶
Updates the DanSpeech model being used by the Recognizer.
- Parameters
model – A valid DanSpeech model (danspeech.deepspeech.model.DeepSpeech). See Pre-trained DanSpeech Models for a list of available pretrained models. This can also be your own custom-trained DanSpeech model.
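For instance, to swap models at runtime (the model name is the same assumed stand-in as above):

recognizer.update_model(DanSpeechPrimary())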
update_decoder(lm=None, alpha=None, beta=None, beam_width=None)¶
Updates the decoder being used by the Recognizer. By default, greedy decoding of the DanSpeech model output is performed.
If lm is None or lm="greedy", decoding is performed greedily, and the alpha, beta and beam_width parameters are ignored.
Warning: Language models require the ctc-decode python package to work.
- Parameters
lm (str) – A path to a valid .klm language model. See Language Models for a list of available pretrained models.
alpha (float) – Alpha parameter of beam search decoding. If None, the default alpha=1.3 is used.
beta (float) – Beta parameter of beam search decoding. If None, the default beta=0.2 is used.
beam_width (int) – Beam width of beam search decoding. If None, the default beam_width=64 is used.
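A sketch of switching from greedy decoding to beam search with a language model; the .klm path is a placeholder:

recognizer.update_decoder(lm="/path/to/dsl_3gram.klm", alpha=1.3, beta=0.2, beam_width=64)

# Revert to greedy decoding:
recognizer.update_decoder(lm="greedy")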
enable_streaming()¶
Adjusts the Recognizer to continuously transcribe a stream of audio input.
Use this before starting a stream.
- Example
recognizer.enable_streaming()
disable_streaming()¶
Adjusts the Recognizer to stop expecting a stream of audio input.
Use this after cancelling a stream.
- Example
recognizer.disable_streaming()
streaming(source)¶
Generator for a streaming audio source, e.g. a Microphone().
Spawns a background thread and uses the loaded model to continuously transcribe audio input between detected silences from the Microphone() stream.
Warning: Requires that Recognizer.enable_streaming() has been called.
- Parameters
source (Microphone) – Source of audio.
- Example
generator = recognizer.streaming(source=m)

# Runs for a long time. Insert your own stop condition.
for i in range(100000):
    trans = next(generator)
    print(trans)
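The example above assumes an existing Microphone instance m. A fuller setup sketch, where the import location and sampling rate argument are assumptions:

from danspeech.audio import Microphone  # assumed import path

m = Microphone(sampling_rate=16000)  # assumed constructor argument
recognizer.enable_streaming()
generator = recognizer.streaming(source=m)
print(next(generator))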
enable_real_time_streaming(streaming_model, secondary_model=None, string_parts=True)¶
Adjusts the Recognizer to continuously transcribe a stream of audio input in real time.
Real-time audio streaming utilizes a uni-directional model to transcribe an utterance while it is being spoken, in contrast to Recognizer.streaming(), where the utterance is transcribed after a silence has been detected.
Use this before starting a Recognizer.real_time_streaming() stream.
- Parameters
streaming_model (DeepSpeech) – The DanSpeech model to perform streaming. This model needs to be uni-directional; this is required for real-time streaming to work. The two available DanSpeech models are pretrained_models.CPUStreamingRNN() and pretrained_models.GPUStreamingRNN(), but you may use your own custom streaming model as well.
secondary_model (DeepSpeech) – A valid DanSpeech model (danspeech.deepspeech.model.DeepSpeech). The secondary model transcribes the output after a silence is detected. This is useful since the performance of uni-directional models is very poor compared to bi-directional models, which require the full utterance. See Pre-trained DanSpeech Models for more information on available models. This can also be your own custom-trained DanSpeech model.
string_parts (bool) – Whether you want the generator (Recognizer.real_time_streaming()) to yield new parts of the string or the whole (current) accumulated string. string_parts=True is recommended; then keep track of the transcription yourself.
- Example
recognizer.enable_real_time_streaming(streaming_model=CPUStreamingRNN())
disable_real_time_streaming(keep_secondary_model_loaded=False)¶
Adjusts the Recognizer to stop expecting a stream of audio input.
Use this after cancelling a stream.
- Parameters
keep_secondary_model_loaded (bool) – Whether to keep the secondary model in memory. Generally, you do not want to keep it in memory, unless you plan to perform Recognizer.real_time_streaming() again after disabling real-time streaming.
- Example
recognizer.disable_real_time_streaming()
real_time_streaming(source)¶
Generator for a real-time streaming audio source, e.g. a Microphone().
Spawns a background thread and uses the loaded model(s) to continuously transcribe an audio utterance while it is being spoken.
Warning: Requires that Recognizer.enable_real_time_streaming() has been called.
- Parameters
source (Microphone) – Source of audio.
- Example
generator = r.real_time_streaming(source=m)

iterating_transcript = ""
print("Speak!")
while True:
    is_last, trans = next(generator)

    # If the transcription is empty, it means that the energy level required for data
    # was passed, but nothing was predicted.
    if is_last and trans:
        print("Final: " + trans)
        iterating_transcript = ""
        continue

    if trans:
        iterating_transcript += trans
        print(iterating_transcript)
        continue
The generator yields both a boolean (is_last), indicating whether the yield is a full utterance (delimited by detected silences in the audio input), and the (current/partial) transcription. If is_last is true, the transcription is a full utterance determined by a silence.
Warning: This method assumes that you use a model with the default spectrogram/audio parameters, i.e. 20 ms of audio per STFT with 50% overlap.
adjust_for_speech(source, duration=4)¶
Adjusts the energy level threshold required for the audio.Microphone() to detect speech in the background.
Warning: You need to talk after calling this method! Otherwise, the energy level will be set too low. If talking to adjust the energy level is not an option, use Recognizer.adjust_for_ambient_noise() instead.
Only use this if the default energy level does not match your use case.
- Parameters
source (Microphone) – Source of audio.
duration (float) – Maximum duration of the energy threshold adjustment.
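A usage sketch; m is assumed to be a Microphone instance as in the streaming examples, and you should speak while the adjustment runs:

recognizer.adjust_for_speech(source=m, duration=4)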
adjust_for_ambient_noise(source, duration=2)¶
Source: https://github.com/Uberi/speech_recognition/blob/master/speech_recognition/__init__.py (modified for DanSpeech).
Adjusts the energy level threshold required for the audio.Microphone() to detect speech in the background, based on the energy level of the background noise.
Warning: Do not talk while adjusting the energy threshold with this method. This method generally sets the energy level very low, so we recommend using Recognizer.adjust_for_speech() instead.
Only use this if the default energy level does not match your use case.
- Parameters
source (Microphone) – Source of audio.
duration (float) – Maximum duration of the energy threshold adjustment.
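A usage sketch; m is again an assumed Microphone instance, and you should stay silent while the adjustment runs:

recognizer.adjust_for_ambient_noise(source=m, duration=2)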
update_stream_parameters(energy_threshold=None, pause_threshold=None, phrase_threshold=None, non_speaing_duration=None)¶
Updates parameters for the audio stream. Only use this if the default streaming from your microphone works poorly.
- Parameters
energy_threshold (float) – Minimum audio energy required for the stream to start detecting an utterance.
pause_threshold (float) – Seconds of non-speaking audio before a phrase is considered complete.
phrase_threshold (float) – Minimum seconds of speaking audio before we consider the speaking audio a phrase.
non_speaing_duration (float) – Seconds of non-speaking audio to keep on both sides of the recording.
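A sketch of adjusting the stream parameters; the values are illustrative placeholders, not recommendations:

recognizer.update_stream_parameters(energy_threshold=700, pause_threshold=0.8)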