How to Generate a Subtitle (.SRT) File for YouTube

Create subtitles for when auto-subtitle tools don't transcribe your words accurately enough

Once you've decided on the content you're going to use for your video, you can generate a timed transcript file using AudioStack, ready to upload to YouTube.



Follow our installation guide for developers if you haven't generated any audio using the API before!

You'll need to install Pandas to get started.

pip install pandas

Code Example

In this example, we generate a speech file for a video voiceover. We then format the word-level timestamps and export the results as both a .txt transcript and a .srt file suitable for uploading to YouTube as subtitles.

import audiostack
import os
import pandas as pd

audiostack.api_key = "APIKEY"  # Replace with your AudioStack API key

scriptText = """
    <as:section name="main" soundsegment="main">
    Parrots are highly intelligent and social birds known for their vibrant plumage and remarkable ability to mimic sounds, including human speech. These colorful avian companions are found in tropical regions around the world and are known for their playful and affectionate nature, often forming strong bonds with their human caregivers. Parrots and generative audio both exhibit remarkable abilities for mimicry and creativity, with parrots mimicking sounds and voices, and generative audio systems producing compelling sound through imitation and adaptation. At AudioStack, we're big fans of parrots and take inspiration from them in everything from our voice cloning capabilities to our branding. Find out more at
    </as:section>
"""

print("Generating your script...")
script = audiostack.Content.Script.create(scriptText=scriptText, scriptName="test")

print("Synthesizing speech...")
tts = audiostack.Speech.TTS.create(scriptItem=script, voice="cosmo", speed=1)
speechId = tts.speechId

print("Applying auto mixing and mastering")
mix = audiostack.Production.Mix.create(speechItem=tts, exportSettings={"ttsTrack" : True}, masteringPreset="balanced")

print("Annotating with time stamps...")
tts = audiostack.Speech.TTS.annotate(speechId=speechId)

print("Preparing for download...")
# Encode and download the final mix (the "mp3" preset here is an assumption; check the AudioStack docs for available presets)
encoder = audiostack.Delivery.Encoder.encode_mix(productionItem=mix, preset="mp3")
encoder.download()

print("Formatting your timestamp data as a table...")
# Extract the word-level timestamp annotations from the response
data_list = None
for key, value in tts['data'].items():
    if 'annotations_timestamps' in value:
        data_list = value['annotations_timestamps']

df = pd.DataFrame(data_list)

df.rename(columns={'Offset': 'Timestamp'}, inplace=True)
# Offsets and durations are reported in 100-nanosecond units; convert them to seconds
df['Timestamp'] = df['Timestamp'] / 10000000
df['Duration'] = df['Duration'] / 10000000

# Print the formatted table
print(df)

print("Generating a transcript...")
# Function to format the transcript for a group of rows (one contiguous run of words)
def format_transcript(group):
    start_time = group['Timestamp'].iloc[0]
    end_time = group['Timestamp'].iloc[-1] + group['Duration'].iloc[-1]
    words = ' '.join(group['Word'])
    return f"{start_time:.2f}-{end_time:.2f}\n\"{words}\""
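# Quick sanity check of format_transcript on a hypothetical two-word group:
# the segment starts at the first word's timestamp and ends at the last
# word's timestamp plus its duration.
demo_group = pd.DataFrame({'Word': ['Parrots', 'are'],
                           'Timestamp': [0.0, 0.5],
                           'Duration': [0.5, 0.75]})
assert format_transcript(demo_group) == '0.00-1.25\n"Parrots are"'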

# Group contiguous words: a new group starts whenever a word's start time
# is not the previous word's start time plus its duration
formatted_transcript = df.groupby((df['Timestamp'] != df['Timestamp'].shift(1) + df['Duration'].shift(1)).cumsum()).apply(format_transcript)

# Print the formatted transcript
print("Your Timestamped Transcript:")
for entry in formatted_transcript:
    print(entry)

# Initialize variables
segments = []
current_segment = pd.DataFrame(columns=['Word', 'Timestamp', 'Duration', 'Confidence'])

# Iterate through the data and build segments of at least 5 words
for _, row in df.iterrows():
    current_segment = pd.concat([current_segment, row.to_frame().T], ignore_index=True)
    words_in_segment = ' '.join(current_segment['Word']).split()
    if len(words_in_segment) >= 5:  # close the segment once it holds at least 5 words
        segments.append(current_segment)
        current_segment = pd.DataFrame(columns=['Word', 'Timestamp', 'Duration', 'Confidence'])

# If there are remaining words in the current segment, add it to the segments
if not current_segment.empty:
    segments.append(current_segment)

# Write the formatted segments to the transcript file, overwriting any existing content
with open("transcript.txt", "w") as f:
    for segment in segments:
        print(format_transcript(segment), file=f)

# Helper function to format time in SRT format (HH:MM:SS,sss)
def format_time(milliseconds):
    seconds, milliseconds = divmod(milliseconds, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"
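# Quick sanity check (hypothetical value): 83,456 ms is 1 minute, 23 seconds and 456 ms
assert format_time(83456) == "00:01:23,456"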

# Helper function to format a segment to SRT text
def segment_to_text(segment):
    return ' '.join(segment['Word'])

# Define the start time for the SRT
start_time = 0  # You can set the start time as needed

# Initialize a counter for SRT subtitle entries
srt_counter = 1

# Open the .srt file for writing ("subtitles.srt" is an example name);
# opening in "w" mode also clears any existing content
with open("subtitles.srt", "w") as srt_file:
    # Iterate through the segments
    for segment in segments:
        # Calculate end time for the current segment
        end_time = start_time + int(segment['Duration'].sum() * 1000)

        # Format the SRT entry and write it to the file
        srt_entry = f"{srt_counter}\n{format_time(start_time)} --> {format_time(end_time)}\n{segment_to_text(segment)}\n"
        srt_file.write(srt_entry + "\n")


        # Increment the counter and update the start time for the next segment
        srt_counter += 1
        start_time = end_time
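For reference, the resulting file follows the standard SubRip (.srt) layout: a sequence number, a start --> end time range, and the subtitle text, with entries separated by a blank line. With illustrative timings, the first entries look like this:

```
1
00:00:00,000 --> 00:00:01,930
Parrots are highly intelligent and

2
00:00:01,930 --> 00:00:03,470
social birds known for their
```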


You should now have accurate subtitles that match your script word for word, with no misinterpretations!

What’s Next

Want to understand more about what each step is doing in the code example? Check out the detailed, step-by-step tutorial: