Voice Intelligence Layer
The Voice Intelligence Layer is AudioStack's solution for enhancing the naturalness of synthetic speech. It aims to solve the common pronunciation and formatting issues that affect most users of a synthetic voice: speech quality, pronunciation and contextual recognition.
Voice Intelligence consists of two main services that enhance content creation with the most natural and human-like results: Normalizer and Lexi (word dictionaries).
Normalizer
The goal of the Normalizer is for any TTS output to be pronounced consistently across multiple providers: acronyms should be read out letter by letter, and times, dates and numbers should be interpreted correctly (e.g. 1970 -> nineteen seventy, not one thousand nine hundred seventy). Our service applies over 50 linguistic and NLP rules.
Normalizer is currently available for German only.
Due to its specific challenges, we focused on German normalization first.
One of the functionalities of the normalizer is to convert Roman numerals into their verbalized forms.
The following example shows the normalization of a Roman numeral used as a cardinal number:
import audiostack
# Add your API key below
audiostack.api_key = "APIKEY"
text = "Hast du Rocky IV gesehen?"
script = audiostack.Content.Script.create(scriptText=text)
tts = audiostack.Speech.TTS.create(scriptItem=script, voice="lena", voiceIntelligence=True)
print(tts.data['sections'][0]['preview'])
This code returns the script as it should be pronounced: Hast du Rocky vier gesehen?
Normalizer is available for all German voices and is activated using a simple flag in the Speech section of your code:
voiceIntelligence=True
When Roman numerals are used to identify queens, kings or popes, they are converted into verbalized ordinal numbers:
import audiostack
# Add your API key below
audiostack.api_key = "APIKEY"
text = "Nach dem Tod Elisabeths II. steht nun die Krönung Charles III an."
script = audiostack.Content.Script.create(scriptText=text)
tts = audiostack.Speech.TTS.create(scriptItem=script, voice="lena", voiceIntelligence=True)
print(tts.data['sections'][0]['preview'])
Here, the output would be:
Nach dem Tod Elisabeths der zweiten steht nun die Krönung Charles des dritten an.
Here is an overview of all supported features of Normalizer (a combined example follows the list):
Dates
Disambiguation years/quantifiers
Telephone numbers
Ordinal numbers as adverbs
Ordinal numbers as adjectives
Decimals
Other symbols
Initialisms
Acronyms
Invariable abbreviations
Variable abbreviations
Times
Fractions
Measures, units
Currencies
URLs
Period disambiguation
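As a sketch combining several of these features, the example below follows the same pattern as the Roman numeral examples above. The German script text is illustrative, and the exact verbalization depends on the Normalizer's rules.
import audiostack
# Add your API key below
audiostack.api_key = "APIKEY"
# Illustrative script containing a date, a time, a currency amount and an initialism
text = "Der Termin am 03.05.2024 um 14:30 Uhr kostet 19,99 €, meldete die ARD."
script = audiostack.Content.Script.create(scriptText=text)
tts = audiostack.Speech.TTS.create(scriptItem=script, voice="lena", voiceIntelligence=True)
# Prints the script as it should be pronounced, with the date, time and amount expanded into words
print(tts.data['sections'][0]['preview'])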
Lexi (Word Dictionaries)
Lexi, our second layer, can also be thought of as a TTS spell checker. When working with TTS, models often fail to pronounce specific words accurately, for example brands, names and locations. To fix this, we have introduced our Lexi flag, which works in a similar way to SSML. It can replace words with either plain text or IPA phonemes. You can selectively replace words based on language, language dialect, provider or exact voice.
For example, adding <!peadar> instead of Peadar to your script will cause the model to produce an alternative pronunciation of this name. This is particularly useful where words can have multiple pronunciations, for example the cities "Reading" and "Nice". In this instance, placing <!reading> and <!nice> will ensure that these are pronounced correctly, given the script: "The city of <!nice> is a really nice place in the south of France."
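As a minimal sketch, the example below renders that script with the same API calls as the Normalizer examples above; the voice name is illustrative.
import audiostack
# Add your API key below
audiostack.api_key = "APIKEY"
# <!nice> tells Lexi to use the alternative (place-name) pronunciation
text = "The city of <!nice> is a really nice place in the south of France."
script = audiostack.Content.Script.create(scriptText=text)
tts = audiostack.Speech.TTS.create(scriptItem=script, voice="sara", voiceIntelligence=True)
print(tts.data['sections'][0]['preview'])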
If this solution does not work for you, you can instead make use of our custom (self-serve) Lexi feature (shown below). This can be used either to correct single words or to expand acronyms. For example, you can replace all occurrences of the word Aflorithmic with "af low rhythmic", or occurrences of the word "BMW" with "Bayerische Motoren Werke". Replacement words can be supplied as plain text or as an IPA phonemicization.
Lexi dictionaries are restricted to organizations in order to avoid spillover of incorrect words or user-specific pronunciations. Just like Normalizer, it's a simple feature flag in the Speech section of your code:
voiceIntelligence=True
Dictionaries within Lexi
A dictionary contains a list of words. A single word contains one or more inputs (normally just the word itself) and one or more replacements. In the vast majority of cases, a word entry contains a single input and a single replacement.
There are 8 dictionary types in total, split across two word types.
universal words = words that have only one pronunciation, e.g. 'because'
homographs = words that have more than one pronunciation, e.g. 'nice' (adjective) and 'Nice' (location)
Within these two categories, we maintain the following structure.
For Universal Words
a customer-specific dictionary containing global words (language-agnostic)
a customer-specific dictionary containing language-specific words (e.g. German, English)
an AFLR-specific dictionary containing global words (language-agnostic)
an AFLR-specific dictionary containing language-specific words (e.g. German, English)
For Homograph Words
a customer-specific dictionary containing homographic words (language-agnostic)
a customer-specific dictionary containing language-specific homographic words (e.g. German, English)
an AFLR-specific dictionary containing global homographic words (language-agnostic)
an AFLR-specific dictionary containing language-specific homographic words (e.g. German, English)
A customer only has one global language dictionary, but can have multiple language-specific dictionaries, e.g. one for English and one for German.
Here is a code example:
# Creates a default entry
entry_default = {
    "word": "sam",
    "lang": "en",
    "replacement": "dr sam",
}
# Creates a default entry with IPA
# content type can be either 'basic' (default) or 'ipa'
entry_with_ipa = {
    "word": "sam",
    "lang": "en",
    "replacement": "sæm",
    "contentType": "ipa",
}
# Creates a special edge case, i.e. apply this replacement only for the 'en-us' dialect on the 'azure' provider
entry_with_specialisation = {
    "word": "sam",
    "lang": "en",
    "replacement": "sææm",
    "contentType": "ipa",
    "specialization": "en-us",
    "provider": "azure",
}
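As an illustrative sketch only, the three entries above could be grouped into a single customer-level, language-specific dictionary as described earlier (a dictionary holds a list of words). The exact payload shape expected by the Lexi API is not documented here, so treat the field names below as assumptions.
# Assumed layout: a language-specific, customer-level dictionary holding word entries
customer_en_dictionary = {
    "lang": "en",  # assumed field: the language this dictionary applies to
    "words": [     # assumed field: the list of word entries
        entry_default,               # plain-text replacement
        entry_with_ipa,              # IPA replacement
        entry_with_specialisation,   # provider/dialect-specific replacement
    ],
}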
Front-End Demo Example:
This is what a frontend using Lexi, built on AudioStack infrastructure, can look like. Because AudioStack is designed around simple API calls, you can build any frontend on top of it or integrate it into existing systems.
Please Note
The Voice Intelligence Layer is also responsible for handling SSML tags. It is therefore automatically set to True if <as: ...> tags are present in the script.