How to Generate a Subtitle (.SRT) File for YouTube

Create subtitles when auto-subtitle tools aren't transcribing words accurately enough

Once you've decided on the content you're going to use for your video, you can generate a timed transcript file using AudioStack, ready to upload to YouTube.

Dependencies

Follow our installation guide for developers if you haven't generated any audio using the API before!

You'll need to install Pandas to get started.

pip install pandas

Code Example

In this example, we generate a speech file for a video voiceover. We then format the word-level timestamps and export the transcript as a .txt file and as a .srt file that can be uploaded to YouTube as subtitles.
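For orientation, an .srt file is a sequence of numbered cues: an index line, a time range written as HH:MM:SS,mmm --> HH:MM:SS,mmm, the subtitle text, and a blank line. The following is a minimal sketch of that layout only. The helper names srt_timestamp and srt_cue are hypothetical and are not part of the AudioStack SDK, and the sketch assumes times are given in seconds.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Build one SRT cue: index line, time range line, text line."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"


# Hypothetical sample values, for illustration only
print(srt_cue(1, 0.0, 2.5, "Parrots are highly intelligent"))
```

Cues are separated by a blank line when written to the file, which is why the full example writes an extra newline after each entry.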

import audiostack
import pandas as pd

audiostack.api_key = "APIKEY"

scriptText = """
<as:section name="main" soundsegment="main">
Parrots are highly intelligent and social birds known for their vibrant plumage and remarkable ability to mimic sounds, including human speech. These colorful avian companions are found in tropical regions around the world and are known for their playful and affectionate nature, often forming strong bonds with their human caregivers. Parrots and generative audio both exhibit remarkable abilities for mimicry and creativity, with parrots mimicking sounds and voices, and generative audio systems producing compelling sound through imitation and adaptation. At AudioStack, we're big fans of parrots and take inspiration from them in everything from our voice cloning capabilities to our branding. Find out more at www.audiostack.ai
</as:section>"""

print("Generating your script...")
script = audiostack.Content.Script.create(scriptText=scriptText, scriptName="test")

print("Synthesizing speech...")
tts = audiostack.Speech.TTS.create(scriptItem=script, voice="cosmo", speed=1)
speechId = tts.speechId

print("Applying auto mixing and mastering")
mix = audiostack.Production.Mix.create(speechItem=tts, exportSettings={"ttsTrack": True}, masteringPreset="balanced")

print("Annotating with time stamps...")
tts = audiostack.Speech.TTS.annotate(speechId=speechId)

print("Preparing for download...")
encoder = audiostack.Delivery.Encoder.encode_mix(
    productionItem=mix,
    preset="custom",
    sampleRate=44100,
    bitDepth=16,
    public=False,
    format="wav",
    channels=2,
    loudnessPreset="podcast"
)
encoder.download(fileName="parrots_voiceover.wav")
print(encoder)

print("Formatting your timestamp data as a table...")
data_list = None
for key, value in tts['data'].items():
    if 'annotations_timestamps' in value:
        data_list = value['annotations_timestamps']
        break

df = pd.DataFrame(data_list)
df.rename(columns={'Offset': 'Timestamp'}, inplace=True)
# Scale offsets and durations; the rest of the script treats the results as seconds
df['Timestamp'] = df['Timestamp'] / 10000000
df['Duration'] = df['Duration'] / 10000000

# Print the formatted table
print(df)

print("Generating a transcript...")

# Function to format the transcript for a given group of rows
def format_transcript(row):
    start_time = row['Timestamp'].iloc[0]
    end_time = row['Timestamp'].iloc[-1] + row['Duration'].iloc[-1]
    words = ' '.join(row['Word'])
    return f"{start_time:.2f}-{end_time:.2f}\n\"{words}\""

# Group the data by continuous Timestamps (a new group starts wherever a word
# does not begin exactly where the previous one ended) and apply format_transcript.
# This result is not used below; the five-word segments are what get exported.
formatted_transcript = df.groupby((df['Timestamp'] != df['Timestamp'].shift(1) + df['Duration'].shift(1)).cumsum()).apply(format_transcript)

print("Your Timestamped Transcript:")

# Iterate through the data and create segments with at least 5 words
segments = []
current_rows = []
for _, row in df.iterrows():
    current_rows.append(row)
    words_in_segment = ' '.join(r['Word'] for r in current_rows).split()
    if len(words_in_segment) >= 5:  # This means that there are 5 words between each timestamp marker
        segments.append(pd.DataFrame(current_rows))
        current_rows = []

# If there are remaining words in the current segment, add it to the segments
if current_rows:
    segments.append(pd.DataFrame(current_rows))

# Format and print the segments, overwriting any existing transcript.txt
with open("transcript.txt", "w") as f:
    for segment in segments:
        formatted_segment = format_transcript(segment)
        print(formatted_segment)
        print(formatted_segment, file=f)

# Helper function to format time in SRT format (HH:MM:SS,sss)
def format_time(milliseconds):
    seconds, milliseconds = divmod(milliseconds, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"

# Helper function to format a segment to SRT text
def segment_to_text(segment):
    return ' '.join(segment['Word'])

# Offset in milliseconds added to every cue; change it if the voiceover
# does not start at the beginning of your video
offset_ms = 0

# Open the .srt file for writing, overwriting any existing content
with open("transcript.srt", "w") as srt_file:
    # Number the SRT entries from 1
    for srt_counter, segment in enumerate(segments, start=1):
        # Take start and end from the word timestamps so that pauses
        # between words do not make the subtitles drift ahead of the audio
        start_time = offset_ms + int(segment['Timestamp'].iloc[0] * 1000)
        end_time = offset_ms + int((segment['Timestamp'].iloc[-1] + segment['Duration'].iloc[-1]) * 1000)

        # Format SRT entry and write it to the file, followed by a blank line
        srt_entry = f"{srt_counter}\n{format_time(start_time)} --> {format_time(end_time)}\n{segment_to_text(segment)}\n"
        srt_file.write(srt_entry)
        srt_file.write('\n')
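The example above closes a subtitle segment after every five words, which can split a cue in the middle of a phrase. As one possible variation, the sketch below closes a cue at sentence-ending punctuation or at a maximum word count, whichever comes first. It uses plain dictionaries with the same Word, Timestamp and Duration keys as the table in the example. The function name group_words, the max_words default and the sample values are hypothetical, and the punctuation rule only helps if the word entries actually carry punctuation, which is an assumption here; otherwise it falls back to the word limit.

```python
def group_words(words, max_words=7):
    """Group word timings into cues.

    words: list of dicts with 'Word', 'Timestamp' and 'Duration' (seconds).
    A cue is closed at sentence-ending punctuation or after max_words words.
    """
    cues, current = [], []
    for word in words:
        current.append(word)
        ends_sentence = word["Word"].endswith((".", "!", "?"))
        if ends_sentence or len(current) >= max_words:
            cues.append(current)
            current = []
    if current:  # keep any trailing words as a final cue
        cues.append(current)

    # Each cue runs from its first word's start to its last word's end
    return [
        {
            "start": cue[0]["Timestamp"],
            "end": cue[-1]["Timestamp"] + cue[-1]["Duration"],
            "text": " ".join(w["Word"] for w in cue),
        }
        for cue in cues
    ]


# Hypothetical sample data, for illustration only
sample = [
    {"Word": "Parrots", "Timestamp": 0.0, "Duration": 0.5},
    {"Word": "are", "Timestamp": 0.5, "Duration": 0.25},
    {"Word": "social.", "Timestamp": 0.75, "Duration": 0.5},
    {"Word": "Find", "Timestamp": 2.0, "Duration": 0.25},
    {"Word": "out", "Timestamp": 2.25, "Duration": 0.25},
]
for cue in group_words(sample):
    print(cue)
```

Shorter cues are easier to read on screen but change more often, so the word limit is worth tuning against your own video.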

You should now have subtitles that match your script word for word, with none of the transcription errors that auto-subtitle tools can introduce!
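Before uploading, it can be worth checking that the file is well formed. The sketch below is one way to do that, under the assumption that the file follows the plain SRT layout (index line, time range line, text, blank line). The function name check_srt and the sample strings are hypothetical. It reports cues that are misnumbered, have a malformed time line, end before they start, or overlap the previous cue.

```python
import re

# Matches an SRT time range line such as "00:00:01,000 --> 00:00:02,500"
CUE_TIME = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def to_ms(hours, minutes, seconds, millis):
    """Convert the four captured time fields to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def check_srt(text):
    """Return a list of human-readable problems found in SRT text."""
    problems = []
    previous_end = 0
    blocks = [b for b in text.strip().split("\n\n") if b.strip()]
    for expected_index, block in enumerate(blocks, start=1):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            problems.append(f"cue {expected_index}: expected index, time and text lines")
            continue
        if lines[0].strip() != str(expected_index):
            problems.append(f"cue {expected_index}: index line is {lines[0]!r}")
        match = CUE_TIME.fullmatch(lines[1].strip())
        if not match:
            problems.append(f"cue {expected_index}: malformed time line {lines[1]!r}")
            continue
        fields = match.groups()
        start, end = to_ms(*fields[:4]), to_ms(*fields[4:])
        if end <= start:
            problems.append(f"cue {expected_index}: end is not after start")
        if start < previous_end:
            problems.append(f"cue {expected_index}: overlaps the previous cue")
        previous_end = end
    return problems


# Hypothetical sample content, for illustration only
good = "1\n00:00:00,000 --> 00:00:02,000\nParrots are social\n\n2\n00:00:02,000 --> 00:00:04,000\nFind out more\n"
bad = "1\n00:00:00,000 --> 00:00:03,000\nParrots are social\n\n2\n00:00:02,000 --> 00:00:04,000\nFind out more\n"
print(check_srt(good))
print(check_srt(bad))
```

To check the file produced by the example, read transcript.srt into a string and pass it to check_srt; an empty list means none of these problems were found.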


What's Next

Want to understand more about what each step is doing in the code example? Check out the detailed, step-by-step tutorial:
