Voice Intelligence Layer

The Voice Intelligence Layer is a solution that enhances the naturalness of speech

The Voice Intelligence Layer is AudioStackโ€™s solution that enhances the naturalness of speech. It aims to solve common pronunciation and formatting woes that apply to most users when using a synthetic Voice: quality of speech, pronunciation and contextual recognition.

๐Ÿ“˜

Voice Intelligence has two main features: Normalizer and Lexi

This feature consists of 2 main services to enhance content creation with the most natural and human-like results: Normalizer and Lexi (word dictionaries).

ย Normalizer

The goal of the Normalizer is that any TTS output to be pronounced consistently across multiple providers, wants acronyms to be read out letter by letter and times as well as dates and numbers to be interpreted correctly (e.g. 1970 -> nineteen seventy, not one thousand nine hundred seventy). Our service applies over 50 linguistic and NLP rules.

๐Ÿšง

Normalizer is German only at the moment

Due to its specific challenges, we focused on ๐Ÿ‡ฉ๐Ÿ‡ช normalization first.

One of the functionalities of the normalizer is to convert Roman numerals into their verbalized forms.

The following example shows the normalization of a Roman numeral used as a cardinal number:

import audiostack
import os

## Add your API Key below
audiostack.api_key = "APIKEY"

text = "Hast du Rocky IV gesehen?"

script = audiostack.Content.Script.create(scriptText=text)
tts = audiostack.Speech.TTS.create(scriptItem=script, voice="lena", voiceIntelligence= True)
print(tts.data['sections'][0]['preview'])

This code returns the script as it should be pronounced: Hast du Rocky vier gesehen?

๐Ÿ‘

Normalizer exists for all German voices and is activated using a simple flag in the Speech section of your code:

voiceIntelligence= True

When Roman numerals are used to identify queens, kings or popes, they are converted into verbalized ordinal numbers:

import audiostack
import os

## Add your API Key below
audiostack.api_key = "APIKEY"
text = "Nach dem Tod Elisabeths II. steht nun die Krรถnung Charles III an."   

script = audiostack.Content.Script.create(scriptText=text)
tts = audiostack.Speech.TTS.create(scriptItem=script, voice="lena", voiceIntelligence= True)
print(tts.data['sections'][0]['preview'])

Here, the output would be:
Nach dem Tod Elisabeths der zweiten steht nun die Krรถnung Charles des dritten an.

Here is an overview of all supported features of Normalizer:

๐Ÿ“… Dates

๐Ÿ“… Disambiguation years/quantifiers

โ˜Ž๏ธ Telephone numbers

๐Ÿ”ข Ordinal numbers as adverbs

๐Ÿ”ข Ordinal numbers as adjectives

๐Ÿ”ข Decimals

๐Ÿ”ฃ Other Symbols

๐Ÿ”ค Initialisms

๐Ÿ”ค Acronyms

๐Ÿ”ค Invariable abbreviations

๐Ÿ”ค Variable abbreviations

๐Ÿ• Times

โž— Fractions

โ™พ๏ธ Measures, units

๐Ÿ’ฐ Currencies

๐Ÿ•ธ๏ธ URLs

๐Ÿ†Ž Period disambiguation

Lexi (Word Dictionaries)

Lexi, our second layer, can be also thought of as a TTS spell checker for simplicity. Often when working with TTS, the models can fail to accurately pronounce specific words, for example, brands, names and locations. To fix this, we have introduced our lexi flag, which works in a similar way to SSML. This can replace words with either plain text or IPA phonemes. You can selectively replace words based on language, language dialect, provider or exact voice.

For example, adding <!peadar> instead of Peadar to your script will cause the model to produce an alternative pronunciation of this name. This is particularly useful in cases where words can have multiple pronunciations, for example, the cities โ€˜readingโ€™ and โ€˜niceโ€™. In this instance placing <!reading> and <!nice> will ensure that these are pronounced correctly, given the script: " The city of <!nice> is a really nice place in the south of France."

If this solution does not work for you, you can instead make use of our custom (self-serve) lexi feature (shown below). This can be used to either correct single words, or expand acronyms. For example, you can replace all occurrences of the word Aflorithmic with โ€œaf low rhythmicโ€ or occurrences of the word โ€˜BMWโ€™ with โ€œBayerische Motoren Werke''. Replacement words can be supplied as plain text or an IPA phonemicization.

๐Ÿ‘

Lexi dictionaries are restricted to organizations in order to avoid spillover of incorrect words or user-specific pronunciations. Just like Normalizer, it's a simple feature flag in the Speech section of your code:

voiceIntelligence=True

Dictionaries within Lexi

A dictionary contains a list of words. A single word contains one or more inputs (normally just the word), and one or more replacements. In the vast majority of cases, a single input and single output are contained within a word entry.

There are in total 8 dictionary types, split into two-word types.

universal words = words that have only one pronunciation. i.e. 'because'
homographs = words that have more than one pronunciation, i.e. 'nice' (verb) and 'Nice' (location)

Within these two categories, we maintain the following structure.

For Universal Words

a customer-specific dictionary containing global words (language-agnostic)
a customer-specific dictionary containing language-specific words (i.e. ๐Ÿ‡ฉ๐Ÿ‡ช, ๐Ÿ‡ฌ๐Ÿ‡ง)
an AFLR-specific dictionary containing global words (language-agnostic)
an AFLR-specific dictionary containing language-specific words (i.e. ๐Ÿ‡ฉ๐Ÿ‡ช, ๐Ÿ‡ฌ๐Ÿ‡ง)

For Homograph Words

a customer-specific dictionary containing homographic words (language-agnostic)
a customer-specific dictionary containing language-specific homographic words (i.e. ๐Ÿ‡ฉ๐Ÿ‡ช, ๐Ÿ‡ฌ๐Ÿ‡ง)
an AFLR-specific dictionary containing global homographic words (language-agnostic)
an AFLR-specific dictionary containing language-specific homographic words (i.e. ๐Ÿ‡ฉ๐Ÿ‡ช, ๐Ÿ‡ฌ๐Ÿ‡ง)

A customer only has one global language dictionary, but can have multiple specific language dictionaries, i.e. one for ๐Ÿ‡ฌ๐Ÿ‡ง and one for ๐Ÿ‡ฉ๐Ÿ‡ช'.

Here is a code example:

# Creates a default entry
entry_default = {
    "word" : "sam",
    "lang" : "en",
    "replacement" : "dr sam",
}

# creates a default entry with IPA
# content type can be either 'basic' (default) or 'ipa'
entry_with_ipa = {
    "word" : "sam",
    "lang" : "en",
    "replacement" : "sรฆm",
    "contentType" : "ipa"
}

# creates a special edge case, i.e. if the user uses voice 'sara' apply this replacement instead
entry_with_specialisation = {
    "word" : "sam",
    "lang" : "en",
    "replacement" : "sรฆรฆm",
    "contentType" : "ipa",
    "specialization", "en-us",
    "provider" : "azure"
  }
}

Front-End Demo Example:

This is what a frontend using Lexi built with AudioStack infrastructure can look like. Given that AudioStack is designed around simple API calls, you can design any frontend around it or integrate it into existing systems.

๐Ÿšง

Please Note

The Voice Intelligence Layer is also responsible for handling SSML tags. It is therefore automatically set to True if <as: ...> tags are present in the script.