Voice Cloner

Find out more about how to clone voices using AudioStack

AudioStack's Voice Cloner is an AI-powered solution that lets users record custom scripts to create a personalised synthetic voice model. The resulting model closely resembles the original voice in tone and prosody, and is available in 4 languages (English 🇬🇧🇺🇸, German 🇩🇪, Italian 🇮🇹 and Spanish 🇪🇸) and multiple accents. On our platform you can create voice clones in multiple languages, adapt speech to your specific requirements, and choose from three protocols: Speed Cloning for rapid results, Standard Cloning for higher quality, and Premium Cloning for the utmost precision. Our API integrates into your own applications and projects, which makes the Voice Cloner suitable for a wide range of uses, from voiceovers and virtual assistants to accessibility solutions and more. Once you've cloned your voice with the AudioStack Voice Cloner, you can use it to generate speech with the API, our SDKs or workflows such as Sonic Sell.


To get access to the voice cloner app, contact us at [email protected]

If you already have an account, you can log in here.

Types of Voice Cloning Available

Speed Cloning

Speed cloning is our fastest cloning protocol. It allows users to generate a voice clone from just 30 to 40 minutes of recording. Speed cloning is suitable for quick, low-complexity projects where a basic voice clone is needed urgently.

Standard Cloning

Standard cloning provides a more detailed and refined voice clone compared to the speed cloning protocol. Users can expect a higher level of voice accuracy and naturalness. The standard cloning process takes approximately 2 hours to complete.

Premium Cloning

Our premium cloning protocol offers the highest level of customisation and quality. To utilise the premium cloning service, users are required to contact our team directly. Premium cloning allows for in-depth customisation and fine-tuning of the voice clone, making it ideal for professional applications such as voiceovers, virtual assistants, and more. If you need this service, please contact us at [email protected].

| Type of Cloning | Description | Sentences required | Recording time required | Languages available |
| --- | --- | --- | --- | --- |
| Speed Cloning | This protocol is designed to create a reliable clone of your voice, with only 30 minutes of recording. | 100 | 30 - 40 mins | EN |
| Standard Cloning | This protocol is designed to create a state-of-the-art, high-quality voice that can be used to create content of any length and for any purpose. | 1000-2000 | 2 - 4 hours | EN, DE, IT, ES (more coming soon) |
| Premium Cloning | This protocol creates the industry's best voices, with premium quality and the highest naturalness score. | >2000 | 2 - 20 hours | Any |
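
If you are building your own interface around the cloner, the table above can also be captured as data, for example to suggest which protocols fit a user's language and available recording time. The following is only a minimal sketch: the `Protocol` class, the `PROTOCOLS` list and the `protocols_for` helper are hypothetical names for illustration, not part of the AudioStack API or SDKs, and the figures simply mirror the table.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Protocol:
    name: str
    sentences: str                          # as stated in the table
    recording_hours: Tuple[float, float]    # (shortest, longest) recording time
    languages: Optional[FrozenSet[str]]     # None means any language


# Values copied from the table above; the structure itself is an assumption.
PROTOCOLS = [
    Protocol("Speed", "100", (30 / 60, 40 / 60), frozenset({"EN"})),
    Protocol("Standard", "1000-2000", (2, 4), frozenset({"EN", "DE", "IT", "ES"})),
    Protocol("Premium", ">2000", (2, 20), None),
]


def protocols_for(language: str, hours_available: float) -> List[str]:
    """Names of protocols that support `language` and whose shortest
    recording time fits into `hours_available`."""
    code = language.upper()
    return [
        p.name
        for p in PROTOCOLS
        if (p.languages is None or code in p.languages)
        and p.recording_hours[0] <= hours_available
    ]
```

For instance, `protocols_for("de", 3)` keeps Standard and Premium, because Speed Cloning is listed for English only.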

How to Use the Voice Cloning Service

Using our Voice Cloner is a straightforward process. The simplest option is to access the service through AudioStack's Voice Cloning web app, where you can choose your preferred protocol (Speed, Standard, or Premium) and the target language for the voice clone. We also offer targeted verticals that you can select if you are cloning your voice for a specific industry:

Default: a multi-use script that covers all topics and is suitable for any voice cloning need. It includes all the phonemes required to create a balanced synthetic model.

Advertisement and Sales: a script aimed at users who need to clone a voice for use in advertising and sales.

Newscast: a script that focuses on broadcasting services, such as radio and television.

Conversational: a script with customised sentences suited to conversational use, such as IVR, virtual assistants, chatbots and more.

The above options are available in both English and German. All the scripts offer dynamic sentences so that the user can tailor the pronunciation to their business, title and more. If you need to customise a script for a different language or industry vertical, please get in touch.


Did you know?

If you want to integrate the voice cloner within your own user interface to offer this functionality to end users or people within your organisation, you can integrate directly with the API.

:microphone: Best Practices for cloning your voice :microphone:

To ensure quality and make good use of your time, we recommend the following for your recording environment and equipment. Input means output in voice cloning: the clearer your recordings are, and the closer they are to the desired speaking style, the more lifelike your synthetic voice will turn out.

:white-check-mark: Understanding the Nuances:

Before delving into the intricacies of voice cloning, it's crucial to comprehend the underlying principles. Voice cloning is exceptionally accurate in replicating the samples it's trained on, capturing both the nuances and imperfections of the provided audio. This includes background noise, room reverb, or any other unintended sounds present in the training samples.

:white-check-mark: Maintaining Clarity:

A fundamental aspect to ensure optimal results is maintaining clarity in the training data. The AI thrives when presented with a single, consistent speaking voice throughout the recordings. The introduction of multiple speakers or excessive noise can lead to confusion, hindering the AI's ability to discern the intended voice accurately.

:white-check-mark: Crafting the Desired Style:

The speaking style embedded in the training samples directly influences the output. Whether you aim to replicate your voice for an audiobook or any other purpose, align the training data with the desired style. Consistency in style within the uploaded samples is paramount for achieving the desired results. Think about what your voice will be used for, and make sure you consistently use that tone of voice, speaking style and energy.

:microphone: Best Practices for Recording Your Voice :microphone:

  1. Professional Recording Equipment:
    We strongly recommend that you do not use any type of lavalier, boom, laptop, desktop or mobile phone microphone. Instead, we recommend a microphone comparable to the Blue Yeti USB Microphone, the Shure SM57 (with a Focusrite 2i2 interface) or the Rode NT-USB Microphone. For best results, record on a MacBook using Google Chrome.

  2. Microphone Distance:
    Position yourself at the recommended distance, approximately two fists away from the microphone, adjusting based on the recording type.

  3. Pop-Filter Usage:
    Minimize plosives during recording by using a pop-filter if available, or try to soften your delivery.

  4. Noise-Free Recording:
    Ensure a clean audio input by eliminating interference such as background music or noise.

  5. Room Acoustics:
    Record in an acoustically treated room to reduce echoes and background noise. If this is not possible or you have a home setup, choose the quietest space in your house that does not have echoes. Turn off all air conditioning and fans, and make sure there are no other people in the room while recording.

  6. Recording quality:
    Record at 44.1 kHz, 16-bit or better. 48 kHz, 24-bit is common and desirable.

  7. Sufficient Audio Length:
    Provide at least 30 minutes of high-quality audio that adheres to these guidelines for optimal results. Do not stop before you have recorded the full number of utterances that your chosen protocol requires.

  8. Speaking Style:
    The quality of your voice model heavily depends on the quality of the recorded voice that is used for training. The volume, speaking rate, pitch, and expressive mannerisms must all remain consistent throughout the recording process. If you choose to do the recording over multiple sessions, it’s important that they sound like they were done on the same day in the same room. To avoid inconsistency we recommend doing it in one session, or as few as possible.
    Make sure you have access to some drinking water. It will help keep your voice nice and clear.
    You should not add distinct pauses between words except if there are punctuation marks.
    Think about what your voice will be used for, and make sure you consistently use that tone of voice, speaking style and energy.

  9. Script Expectations:
    The script consists of a series of sentences that are separated by full stops and displayed one at a time. Each script has a specific length, which allows us to cover all the phonemes that are present in the respective language, and contains a combination of general sentences, questions, exclamations, long sentences, and short sentences. There will be some acronyms, such as F B I and NASA. Please read these as they appear in the sentence. For example, if there is a space or full stop between the letters, read the acronym letter by letter, as in F B I; if there aren’t any spaces, read it as a whole word, as in NASA.
    The spoken words must match the text exactly. Every word of the script should be pronounced as it is written. Sounds should not be omitted or slurred together, as is common in casual speech, unless they have been written that way in the script. In particular, “I am” should not be spoken as “I’m”, nor “you will” as “you’ll”, and so on.
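
If you record outside the web app and want to sanity-check a file against the recording quality and audio length guidelines above, a short script can help. The sketch below is an illustration only, not an AudioStack tool: `check_recording` is a hypothetical helper name, the thresholds are the 44.1 kHz and 16-bit minimums from the list, and it relies on Python's standard `wave` module, so it assumes uncompressed PCM WAV files.

```python
import wave


def check_recording(path: str, min_rate: int = 44_100, min_bits: int = 16) -> dict:
    """Read a PCM WAV file and report its format, duration and any
    values that fall below the recommended minimums."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8       # sample width is reported in bytes
        seconds = wav.getnframes() / rate

    problems = []
    if rate < min_rate:
        problems.append(f"sample rate {rate} Hz is below {min_rate} Hz")
    if bits < min_bits:
        problems.append(f"bit depth {bits}-bit is below {min_bits}-bit")

    return {
        "sample_rate": rate,
        "bit_depth": bits,
        "seconds": seconds,
        "problems": problems,
    }
```

Summing the `seconds` values across all your files gives the total recorded length, which you can compare with the 30-minute minimum. A check like this cannot detect background noise, reverb or an inconsistent speaking style, so it complements listening to the recordings rather than replacing it.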

In conclusion, the success of voice cloning depends on the meticulous application of these best practices. For those eager to embark on the journey of creating lifelike voice clones, adherence to these guidelines ensures predictability and an authentic outcome. For further assistance and inquiries, do not hesitate to reach out to [email protected].

Embark on the exciting journey of voice cloning with confidence, armed with the knowledge to create stunningly authentic reproductions of your voice!