@info_app_icon_id@ CC0-1.0 @info_license_spdx@ @info_author@ @info_name@ Notes with offline Speech to Text, Text to Speech and Machine Translation

Speech Note lets you take, read and translate notes in multiple languages. It uses Speech to Text, Text to Speech and Machine Translation to do so. Text and voice processing take place entirely offline, locally on your computer, without using a network connection. Your privacy is always respected. No data is sent to the Internet.

#6b94ad #0f5d8a

AudioVideo

speech offline voice stt asr speech-to-text speech-recognition tts text-to-speech speech-synthesis machine-translation translator subtitles deepspeech vosk whisper rhvoice piper espeak mbrola coqui bergamot

Speech Note main window
https://gitlab.com/mkiol/dsnote/-/raw/5d8a190c182886f34a1a729bdbf09fd02acaf4e9/desktop/screenshots/speechnote-screenshot-light-notepad.png
Speech Note main window (dark theme)
https://gitlab.com/mkiol/dsnote/-/raw/5d8a190c182886f34a1a729bdbf09fd02acaf4e9/desktop/screenshots/speechnote-screenshot-dark-notepad.png
Translator mode
https://gitlab.com/mkiol/dsnote/-/raw/5d8a190c182886f34a1a729bdbf09fd02acaf4e9/desktop/screenshots/speechnote-screenshot-light-translator.png
Translator mode (dark theme)
https://gitlab.com/mkiol/dsnote/-/raw/5d8a190c182886f34a1a729bdbf09fd02acaf4e9/desktop/screenshots/speechnote-screenshot-dark-translator.png

https://github.com/mkiol/dsnote
https://github.com/mkiol/dsnote/issues
https://app.transifex.com/mkiol/dsnote
https://github.com/mkiol/dsnote#how-to-support
dsnote_AT_mkiol.net

@info_app_icon_id@.desktop

audio/aac audio/mpeg audio/x-mp3 audio/ogg audio/vorbis audio/x-vorbis audio/opus audio/x-speex audio/speex audio/wav audio/x-wav audio/flac audio/x-flac audio/mp4 audio/x-matroska video/mpeg video/mp4 video/ogg video/x-matroska video/webm text/plain application/x-subrip

User Interface:

  • Speech Note has been translated into Norwegian.
  • Grouped models. Models that provide multiple sub-models (for example, TTS models that provide different voices) are shown in groups. This makes it easier to find models in the model browser.

Speech to Text:

  • All Whisper models have been renamed to 'WhisperCpp' to better reflect the engine behind them. Whisper is currently supported by the 'WhisperCpp' and 'FasterWhisper' engines. Both engines are optimized to achieve the best performance.
  • Automatic language detection in STT. To automatically detect the language during STT, select one of the models that is in the 'Auto detected' category in the language list.
  • Separate settings for engines. The configuration of each engine has been separated in the settings. You can separately set the parameters for 'WhisperCpp' and 'FasterWhisper'. The new configuration parameters that have been added to the settings are: 'Number of simultaneous threads', 'Beam search width', 'Audio context size', 'Use Flash Attention'.
  • Quicker decoding with 'WhisperCpp'. Optimization for short sentences has been added to the 'WhisperCpp' engine. With it, the speed of STT has doubled!
  • Support for OpenVINO hardware acceleration in 'WhisperCpp'. With OpenVINO, decoding on CPU is much quicker. If you are not using GPU acceleration, it is recommended to enable OpenVINO in the 'WhisperCpp' engine settings. Currently, OpenVINO is enabled only for CPU acceleration.
  • Option for inserting processing statistics. A new settings option inserts processing-related information, such as processing time and audio length, into the text after decoding. This can be useful for comparing the performance of different models, engines and their parameters.
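
The processing time and audio length mentioned above can be combined into a single number for comparing models. The following is a minimal sketch, not part of Speech Note: the function names and the example figures are hypothetical. It computes the real-time factor (processing time divided by audio length), where a lower value means faster decoding.

```python
def real_time_factor(processing_time_s, audio_length_s):
    """Processing time divided by audio length; values below 1.0 are faster than real time."""
    if processing_time_s <= 0 or audio_length_s <= 0:
        raise ValueError("both durations must be positive")
    return processing_time_s / audio_length_s


def summarize(label, processing_time_s, audio_length_s):
    """Format one comparison line for a model or engine (label is free text)."""
    rtf = real_time_factor(processing_time_s, audio_length_s)
    return f"{label}: RTF {rtf:.2f} ({1 / rtf:.1f}x real time)"


if __name__ == "__main__":
    # Hypothetical numbers: 15 s of processing for 60 s of audio.
    print(summarize("model A", 15.0, 60.0))
```

Comparing the factor rather than raw processing time keeps results comparable across recordings of different lengths.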

Text to Speech:

  • Control tags for advanced TTS processing. Control tags allow you to dynamically change the speed of synthesized text or add silence between sentences. To use control tags, insert '{speed: 0.5}' or '{silence: 1s}' into the text. For convenience, you can also insert predefined control tags using the text context menu option 'Insert control tag'.
  • Welsh language. New language is enabled with 'Piper' voice.
  • New 'Piper' voices for Spanish, Italian and English
  • New 'RHVoice' voice for Slovak
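
To illustrate how text with control tags might be split into segments, here is a minimal sketch. It is not Speech Note's parser: the function name and the returned structure are hypothetical, and only the two tag forms quoted above ('{speed: 0.5}' and '{silence: 1s}') are assumed.

```python
import re

# Matches only the two tag forms quoted in the release note:
# {speed: <number>} and {silence: <number>s}. The trailing "s" is optional here.
TAG_RE = re.compile(r"\{\s*(speed|silence)\s*:\s*([0-9]*\.?[0-9]+)\s*(s?)\s*\}")


def split_control_tags(text):
    """Return a list of ("text", str), ("speed", float) and ("silence", float) items."""
    items = []
    pos = 0
    for match in TAG_RE.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            items.append(("text", chunk))
        name, value, _unit = match.groups()
        items.append((name, float(value)))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        items.append(("text", tail))
    return items


if __name__ == "__main__":
    for item in split_control_tags("Hello. {silence: 1s} {speed: 0.5} Slow part."):
        print(item)
```

A synthesizer could then walk the list in order, applying each speed value to the text that follows it and inserting a pause for each silence item.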

Translator:

  • Improved Translator UI. The 'Translate', 'Switch languages' and 'Add' buttons have been placed between the text areas, which is more convenient.
  • New models: English to Lithuanian, Croatian to English, Latvian to English, Danish to English
  • Updated models: Lithuanian to English, Slovenian to English

Flatpak:

  • New library: OpenVINO version 2024.1.0.15008
  • whisper.cpp update to version 1.6.2
  • CTranslate2 update to version 4.3.1

User Interface:

  • Import subtitles embedded in a video file. If your video file contains one or more subtitle streams, you can import the selected subtitles into the notebook.
  • Support for more subtitle formats. You can import and export subtitles in SRT, WebVTT and ASS formats.
  • Unified file importing and exporting. Text, subtitles, audio and video files can be imported or exported using a unified menu bar option.
  • Settings option to enable/disable remembering the last note. If the option is disabled, the last note will not be available after restarting the app.
  • Settings option for default action when importing note from a file.
  • Enhanced text editor font settings. You can set the font family, style and size of the font used in the text editor.
  • Text to Text repair options. With these options you can directly fix diacritical marks and punctuation in the text.
  • Text context menu with additional options: 'Read selection' and 'Translate selection'
  • New text appending style: 'After empty line'
  • System tray menu for changing active STT/TTS model
  • User friendly names for audio input devices
  • Simplified model filtering. It is now less flexible, but much easier to understand and use.
  • Speech Note has been translated into Ukrainian and Russian.
  • Fix: Cancellation was blocking the user interface.

Speech to Text:

  • Updated 'Distil' model for English: 'Distil Large-v3'. The new model is enabled for the WhisperCpp and FasterWhisper engines.
  • New Fine-Tuned WhisperCpp and FasterWhisper models for Slovenian and Polish.
  • Fix: Punctuation model could not be downloaded.

Text to Speech:

  • WhisperSpeech engine that generates voice with exceptional naturalness. The new engine comes with models for English and Polish languages. All models support voice cloning.
  • New voice cloning model for Vietnamese: 'viXTTS'
  • New Piper voices for English, Persian, Slovenian, Turkish, French and Spanish
  • New RHVoice voice for Czech
  • Settings option to enable/disable speech synchronization with subtitle timestamps. This may be useful for creating voice overs.
  • Mixing speech with audio from an existing file. When exporting to a file, you can overlay speech with audio from an existing media file.
  • Context menu option to read from cursor position or read only selected text
  • Speech audio is always normalized after TTS processing.
  • Fix: Mimic3 models could not be downloaded.

Translator:

  • New models: Greek to English, Maltese to English, Slovenian to English, Turkish to English, English to Catalan
  • Updated models: Czech and Lithuanian
  • Handy buttons to quickly add translated text to the note, replace it and switch languages.
  • Context menu option to translate from cursor position or translate only selected text

Accessibility:

  • New 'Actions' for STT/TTS models switching: 'switch-to-next-stt-model', 'switch-to-prev-stt-model', 'switch-to-next-tts-model', 'switch-to-prev-tts-model', 'set-stt-model', 'set-tts-model'
  • New global keyboard shortcuts for STT/TTS models switching (X11 only): 'Switch to next STT model', 'Switch to prev STT model', 'Switch to next TTS model', 'Switch to prev TTS model'
  • Toggle option for keyboard shortcuts (X11 only). When 'Toggle behavior' is enabled, Start listening/reading shortcuts will also stop listening/reading if they are triggered while listening/reading is active.
  • Fix: Accented characters (e.g.: ã, ê) were not transferred correctly to the active window.

Flatpak:

  • Flatpak runtime update to version 5.15-23.08
  • PyTorch update to version 2.2.1
  • CTranslate2 update to version 4.2.1
  • Faster-Whisper update to version 1.0.2
  • New library: WhisperSpeech

Flatpak:

  • Modular Flatpak package. The application package is divided into a base package 'Speech Note' (net.mkiol.SpeechNote) and two optional add-ons: 'Speech Note AMD' (net.mkiol.SpeechNote.amd) and 'Speech Note NVIDIA' (net.mkiol.SpeechNote.nvidia). The add-on packages provide a set of libraries for GPU acceleration with AMD and NVIDIA graphics cards. The new "modular" approach makes the base Flatpak package much smaller.
  • NVIDIA CUDA runtime update to version 12.2
  • AMD ROCm runtime update to version 5.6
  • PyTorch update to version 2.1.1

User Interface:

  • Improvements to the model browser. You can check various model properties such as size, license, and the URLs from which the model is downloaded.
  • Model filtering options. Models can be searched by various features such as: Processing speed, Quality, Additional capabilities.
  • Setting option to minimize to the system tray
  • Setting option to enable/disable the inclusion of recognized or read text in desktop notifications

Speech to Text:

  • Marathi language. New language is enabled with WhisperCpp and FasterWhisper models.
  • New version of FasterWhisper Large model: 'FasterWhisper Large-v3'
  • New 'Distil' FasterWhisper models for English. 'Distil' models are potentially faster than regular models.
  • WhisperCpp and FasterWhisper enabled for Chinese-Cantonese language
  • Support for Speex audio codec in 'Transcribe a file'
  • Translate to English option for WhisperCpp and FasterWhisper models
  • More effective GPU acceleration for WhisperCpp models with AMD graphics cards
  • Subtitles generation
  • Support for multiple audio streams in a video file

Text to Speech:

  • Marathi language. New language is enabled with Coqui MMS model.
  • Voice cloning with Coqui XTTS and YourTTS models. Coqui XTTS models are enabled for: Arabic, Brazilian Portuguese, Chinese, Czech, Dutch, English, French, German, Hungarian, Italian, Japanese, Korean, Polish, Russian, Spanish and Turkish. Coqui YourTTS model is enabled for: English, French and Brazilian Portuguese.
  • Voice sample creator. A reference voice sample is used for voice cloning. You can create a voice sample with a microphone or from an audio or video file. The sample creator is available on the main toolbar only if the selected TTS model supports voice cloning.
  • New voices for Serbian and Uzbek languages (RHVoice models)
  • GPU acceleration for Coqui models with AMD graphics cards
  • Speech synchronized with subtitle timestamps (e.g. useful for voice overs)

Translator:

  • New model: Lithuanian to English
  • Option to force text cleaning before translation. If the input text is incorrectly formatted, this option may improve the translation quality.
  • Text formatting support. The translation will preserve the formatting from the input text. Supported formats are: HTML, Markdown and SRT Subtitles.
  • Translation progress indicator
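
As an illustration of what preserving SRT formatting involves, the sketch below translates only the subtitle text lines and leaves cue numbers, timestamps and blank lines untouched. It is not the application's implementation: the function name is hypothetical, and the `translate` argument stands in for whatever translation function is available.

```python
import re

# SRT timing line, for example: 00:00:01,000 --> 00:00:02,500
TIMING_RE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def translate_srt(srt_text, translate):
    """Apply translate() to subtitle text lines only; keep the SRT structure as-is."""
    out = []
    for line in srt_text.splitlines():
        stripped = line.strip()
        # Blank separators, cue numbers and timing lines are structural.
        if not stripped or stripped.isdigit() or TIMING_RE.match(stripped):
            out.append(line)
        else:
            out.append(translate(line))
    return "\n".join(out)


if __name__ == "__main__":
    sample = "1\n00:00:01,000 --> 00:00:02,500\nhello world\n"
    # str.upper is a stand-in for a real translation function.
    print(translate_srt(sample, str.upper))
```

One known limitation of this simple approach: a subtitle line that consists only of digits is treated as a cue number and left untranslated.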

Other:

  • Setting option to override GPU version (AMD graphics cards)
  • Setting option to limit number of simultaneous CPU threads
  • Setting option to set the Python libraries directory (PYTHONPATH). This option may be useful if you use the 'venv' module to manage Python libraries.
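
If you manage libraries with 'venv', one way to find the directory to enter is to ask the virtual environment's own interpreter. This is a minimal sketch, assuming the setting expects the environment's site-packages directory (the release note does not say so explicitly); the function name is hypothetical.

```python
import sysconfig


def site_packages_dir():
    """Return the directory where the running interpreter installs pure-Python packages."""
    return sysconfig.get_paths()["purelib"]


if __name__ == "__main__":
    # Run this with the Python interpreter of the activated virtual environment.
    print(site_packages_dir())
```

Run with the interpreter from the virtual environment, the script prints that environment's package directory rather than the system-wide one.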

Accessibility:

  • Global keyboard shortcuts. Shortcuts allow you to start or stop listening and reading using the keyboard. Keyboard shortcuts function even when the application is not active (e.g. minimized or in the background).
  • Support for 'Actions'. This feature allows an external application to invoke certain operations when Speech Note is running. An action can be triggered via a DBus call or with a command-line option.

User Interface:

  • Desktop notifications. By default, when Speech Note is in the background, desktop notifications are shown to indicate starting or ending of listening and reading.
  • Opening files with Drag and Drop gesture.
  • Speech speed control option has been moved to the main application window.
  • Fix: Application did not use native widgets on some platforms.

Translator:

  • New model: English to Hungarian

Speech to Text:

  • New languages: Afrikaans, Gujarati, Hausa, Telugu, Tswana, Javanese, Hebrew
  • New engine: 'Faster Whisper'. It provides slightly better performance compared to the existing engine for Whisper models, especially on bigger models like Medium or Large.
  • New engine: 'April-asr'. It supports intermediate results and punctuation (only English). New engine comes with models for the following languages: English, French, Polish.
  • Inserting text into any active window. Using a global keyboard shortcut or 'action', you can directly insert the decoded text into any window that is currently in focus.
  • Copy text to the clipboard. Using a global keyboard shortcut or 'action', the decoded text can be copied to the clipboard instead of being inserted into the current note.
  • Stop listening button. Unlike Cancel, with this button you can stop listening but the already recorded voice will be decoded into text.
  • Support for Opus audio codec in 'Transcribe a file'
  • More effective GPU acceleration for Whisper models. Average decoding time has been reduced by a factor of 3.
  • New Whisper models for English: 'Distil-Whisper Medium' and 'Distil-Whisper Large-v2'. Both Distil models are faster than the currently enabled 'Whisper Medium' and 'Whisper Large'.
  • New version of Whisper Large model: 'Whisper Large-v3'
  • Fix: CUDA acceleration for Whisper models did not work on NVIDIA graphics cards with Maxwell architecture.

Text to Speech:

  • New languages: Afrikaans, Gujarati, Hausa, Telugu, Tswana, Javanese, Hebrew
  • New engine: 'Mimic 3' with voices for the following languages: Afrikaans, Bengali, German, Greek, English, Spanish, Persian, Finnish, French, Gujarati, Hausa, Hungarian, Italian, Javanese, Nepali, Dutch, Polish, Russian, Swedish, Telugu, Tswana, Ukrainian, Yoruba.
  • Reading text from the clipboard. Using a global keyboard shortcut or 'action', you can directly read text that is in the clipboard.
  • New Piper voices for the following languages: Arabic, English, Hungarian, Polish, Czech, German, Ukrainian, Vietnamese, Serbian, French, Spanish, Nepali.
  • More steps in the 'Speech speed' option. You can set the speed to a value from 0.1 to 2.0.
  • Diacritical marks restoration before speech synthesis for Arabic and Hebrew.
  • Support for GPU acceleration for 'Coqui' models.
  • Fix: Coqui Chinese MMS Hakka and Min Nan voices were broken.
  • Fix: Exporting to audio file was not possible when text was very long.

Other:

  • Setting option to disable support for certain graphics cards
  • Setting option 'Clear cache on close'
  • Cache compression. Temporary audio files are stored in Opus format instead of raw audio. This significantly reduces the required disk space.
  • Detecting the availability of the optional features. In the settings, you can check what optional features are available.

Speech to Text:

  • Improved AMD GPU acceleration support for Whisper models. The Whisper GPU accelerator for AMD cards uses the OpenCL interface. The OpenCL implementation shipped in the Flatpak runtime, 'Clover', does not support new AMD cards. To overcome this problem, the Speech Note package provides another OpenCL implementation, 'ROCm-OpenCL', which supports new AMD hardware.

Translator:

  • New models: Hungarian to English, Finnish to English

Speech to Text:

  • Support for video file transcription. With the 'Transcribe a file' menu option, you can convert an audio file or the audio from a video file to text. The following video formats are supported: MP4, MKV, Ogg.
  • Option 'Audio source' in 'Settings' to select the preferred audio source. The new option lets you choose the microphone (or other audio source) that is used in Speech to Text.
  • Whisper engine update. Library behind Whisper engine (whisper.cpp) has been updated resulting in an increase in performance. Processing time on CPU has been cut in half on average.
  • Improved NVIDIA GPU acceleration support for Whisper models. The following Whisper accelerators are currently enabled: OpenCL (for most NVIDIA cards, a few AMD cards and Intel GPUs), CUDA (for most NVIDIA cards). GPU hardware acceleration might not work well on your system, so it is not enabled by default. Use the option in 'Settings' to turn it on. Disable it if you observe any problems when using Speech to Text with Whisper models.

Text to Speech:

  • Save audio in compressed formats (MP3 or Ogg Vorbis). You can also save metadata tags to the audio file, such as track number, title, artist or album.
  • Pause option. Note reading can be paused and resumed.
  • New models from Massively Multilingual Speech (MMS) project: Hungarian, Catalan, German, Spanish, Romanian, Russian and Swedish. If you would like any other MMS model to be included, please let us know.
  • Update of RHVoice voice for Uzbek.
  • Fix: Many Coqui models couldn't read the numbers or the reading wasn't correct.

User Interface:

  • Menu options: 'Open a text file' and 'Save to a text file'
  • Command line option to open files. If you want to associate text, audio or video files with Speech Note, now it is possible. Your system may detect this new capability and show Speech Note under 'Open With' menu in the file manager. Please note that Flatpak app only has permission to access files in the following folders: Desktop, Documents, Downloads, Music and Videos.
  • Improved UI colors when app is running under GNOME dark theme.
  • Advanced settings option 'Graphical style'. This option lets you select any Qt interface style installed in your system. Changing the style might make the app look better under GNOME.

Speech to Text:

  • Support for GPU acceleration for Whisper models. If a suitable GPU device is found in the system, it will be used to accelerate processing. This significantly reduces the decoding time (usually by a factor of 2 or more). GPU hardware acceleration is not enabled by default. Use the option in 'Settings' to turn it on. Disable it if you observe any problems when using Speech to Text with Whisper models.
  • Fix: Whisper model wasn't able to decode short speech sentences.

Text to Speech:

  • Option 'Speech speed' in 'Settings' to make synthesized speech slower or faster.
  • New models from Massively Multilingual Speech (MMS) project. MMS project released models for 1100 languages, but only the following have been enabled: Albanian, Amharic, Arabic, Basque, Bengali, Bulgarian, Chinese, Greek, Hindi, Icelandic, Indonesian, Kazakh, Korean, Latin, Latvian, Malay, Mongolian, Polish, Portuguese, Swahili, Tagalog, Tatar, Thai, Turkish, Uzbek, Vietnamese and Yoruba. If you would like any other MMS model to be included, please let us know.
  • New Coqui voices for: Japanese, Turkish and Spanish.
  • New Piper voices for: Czech, German, Hungarian, Portuguese, Slovak and English.
  • Update of RHVoice voices for Slovak and Czech.
  • Fix: Splitting text into sentences was incorrect for: Georgian, Japanese, Bengali, Nepali and Hindi.

User Interface:

  • Option to change font size in text editor

Translator:

  • Support for offline translations between the following languages: Catalan, Bulgarian, Czech, Danish, English, Spanish, German, Estonian, French, Italian, Polish, Portuguese, Norwegian, Persian, Dutch, Russian, Ukrainian, Icelandic.
  • Translator uses models that were created as part of Bergamot project.
  • To switch between Notepad and Translator modes, use the toggle buttons in the upper right corner.

User Interface:

  • The user interface has been redesigned. It is handier and better supports portrait view on mobile.
  • Settings option to force specific interface style has been added. It is useful to overcome UI glitches when app is running under GNOME desktop environment.
  • Application has been translated to new languages: Dutch and Italian.

Text to Speech:

  • All existing Piper models have been updated.
  • New Piper voices for: English, Swedish, Turkish, Polish, German, Spanish, Finnish, French, Ukrainian, Russian, Swahili, Serbian, Romanian, Luxembourgish and Georgian.
  • New RHVoice model for Slovak language

Text to Speech:

  • New Coqui voice for English: Jenny

Speech to Text:

  • Quicker decoding when using DeepSpeech/Coqui models (especially on ARM CPU)

User Interface:

  • Option to show recent changes in the app (About -> Changes)
  • French translation update (Many thanks to L'Africain)

Text to Speech:

  • New Piper model for Chinese
  • New RHVoice model for Uzbek
  • Updated RHVoice models for Ukrainian
  • Piper and RHVoice engines updated to most recent versions

Speech to Text:

  • Whisper 'Large' models enabled for all languages
  • Whisper supported on older CPUs (i.e. without AVX/AVX2 extensions)
  • Whisper engine update (20% performance improvement, 50% less memory)

Text to Speech:

  • New Piper models for: Icelandic, Swedish and Russian

Speech to Text:

  • Whisper fine-tuned models for: Czech, Slovak, Slovenian, Romanian, Russian, Hungarian and Polish
  • Standard Whisper models enabled also for: Amharic, Arabic, Bengali, Danish, Estonian, Basque, Persian, Hindi, Croatian, Hungarian, Icelandic, Georgian, Kazakh, Korean, Lithuanian, Latvian, Mongolian, Maltese, Nepali, Romanian, Slovak, Slovenian, Albanian, Swahili, Tagalog, Tatar, Uzbek and Yoruba
  • Fix: Whisper STT didn't work for Chinese language
  • Text to Speech: New DeepSpeech model for Latvian
  • Minor UI improvements
  • Fix: Crash during Text to Speech on some systems
  • Fix: Downloaded models disappeared after app restart when Downloads directory was not set in the system