Writing a news page audio summariser

Introduction

We are going to use Beautiful Soup, ChatGPT and AudioStack to build an audio news page summariser.

This consists of three parts:

:one: Writing a web scraper

:two: Writing a news page summariser

:three: Generating the audio file from the summary

The webscraper.py file runs first and asks for the URL of a web page. Using the BeautifulSoup class from the Beautiful Soup library, the program extracts all of the text between the body tags of the page's HTML. A for loop then cleans the text by removing or replacing characters that cause problems later on (such as "©" and "&"). The cleaned text is imported into the chatgpt.py file, where it is inserted into a prewritten ChatGPT prompt, and the response is saved. The main.py file imports that response and uses it as the script for AudioStack, which turns it into a text-to-speech (TTS) audio file saved as Summary.mp3.

File list:

main.py
chatgpt.py
webscraper.py
README.txt
requirements.txt

The audio file will be saved as Summary.mp3 after the code has been run.
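
The three stages hand a single string from one to the next. The sketch below shows only that data flow. The function names and their bodies are hypothetical offline stand-ins, not the code from the files in this guide, which do the real work with Beautiful Soup, the OpenAI client and AudioStack.

```python
# Hypothetical sketch of the data flow only; every body is an offline stand-in.

def scrape(html: str) -> str:
    """Stage 1 stand-in: webscraper.py extracts the body text of the page."""
    return html.replace("<body>", "").replace("</body>", "").strip()


def summarise(text: str, limit: int = 340) -> str:
    """Stage 2 stand-in: chatgpt.py asks ChatGPT for a short summary."""
    return text[:limit]


def to_audio_script(summary: str) -> dict:
    """Stage 3 stand-in: main.py passes the summary to AudioStack as a script."""
    return {"scriptText": summary, "fileName": "Summary"}


def run_pipeline(html: str) -> dict:
    # Each stage consumes the string produced by the previous one.
    return to_audio_script(summarise(scrape(html)))


print(run_pipeline("<body> Hello world </body>"))
```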

Writing the webscraper

We're going to write a simple web scraper. It doesn't handle many edge cases, but it should work for simple pages.

We've also added some simple handling of invalid characters.

🚧

This is a prototype

This web scraper is not production ready, and it would need a lot of work to handle complex web pages. It should, however, work for a simple HTML page.
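
As one example of that extra work: taking all of the body text also picks up navigation menus, footers and cookie banners, which then end up in the summary. A minimal sketch of a narrower extraction, assuming the story lives in paragraph tags and, where present, inside an article tag. That is common on news sites but not guaranteed. The function name extract_article_text and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_article_text(html: str) -> str:
    """Return the paragraph text of the article, or of the body if there is no article tag."""
    soup = BeautifulSoup(html, "html.parser")
    # Prefer an <article> element, then <body>, then the whole document.
    root = soup.find("article")
    if root is None:
        root = soup.body if soup.body is not None else soup
    # Collect the text of each paragraph, trimming whitespace inside it.
    paragraphs = [p.get_text(" ", strip=True) for p in root.find_all("p")]
    # Skip empty paragraphs and join the rest with single spaces.
    return " ".join(text for text in paragraphs if text)


sample = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<article><h1>Headline</h1><p>First paragraph.</p>"
    "<p>Second <b>bold</b> paragraph.</p></article>"
    "<footer><p>Cookie notice</p></footer></body></html>"
)
print(extract_article_text(sample))
```

Note that this drops headings as well as menus, so whether it suits a given site is something to check by hand.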

import requests
from bs4 import BeautifulSoup

# target URL, entered by the user
url = input("Enter the website URL: ")

# fetch the page; fail early on a timeout or an HTTP error status
reqs = requests.get(url, timeout=10)
reqs.raise_for_status()

# parse the HTML with Beautiful Soup
soup = BeautifulSoup(reqs.text, 'html.parser')

# store the text between the body tags of the page in newtext
# (falls back to the whole document if the page has no body tag)
body = soup.body
newtext = body.get_text() if body is not None else soup.get_text()

# characters to strip from the text, and what to replace each one with
content = newtext
originalchr = [chr(34), chr(39), chr(194), "©", chr(92), "&"]
newchr = ["", "", "", "", "", "and"]

# replace each invalid character with its alternative from the newchr list
for old, new in zip(originalchr, newchr):
    content = content.replace(old, new)




#for title in soup.find_all('title'):
	#title = title.get_text()
#print("The title is",title)

Make sure to save this as webscraper.py in your project folder.
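
The same replacement table is needed again in chatgpt.py, so it may be worth keeping it in one place. A minimal sketch using a dictionary: the names REPLACEMENTS and clean_text are hypothetical, and the table mirrors the originalchr and newchr lists, reading the fourth entry as the copyright sign.

```python
# Hypothetical shared helper; the table mirrors the originalchr/newchr lists.
REPLACEMENTS = {
    '"': "",        # chr(34), double quote
    "'": "",        # chr(39), single quote
    "\u00c2": "",   # chr(194), a stray character left by mis-decoded text
    "\u00a9": "",   # copyright sign (assumed meaning of the fourth entry)
    "\\": "",       # chr(92), backslash
    "&": "and",     # spell out the ampersand so it can be read aloud
}


def clean_text(text: str) -> str:
    """Apply every replacement in REPLACEMENTS to text, in order."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text


print(clean_text('Fish & chips: "the nation\'s favourite"'))
```

Both files could then call clean_text instead of repeating the loop.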

Creating the ChatGPT prompt

In chatgpt.py, create the following:

import os
from openai import OpenAI
from webscraper import content
# Imports the web page text from the webscraper.py file

# Creates the client using the API key stored in the OPENAI_API_KEY environment variable
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


def generate_prompt(content):
  return """You are an AI article summariser.
The summaries you provide will be synthesized using text to speech.

Here is an article: {}

Please give me an accurate summary of this article, including all important information, but keep it below 340 characters.""".format(content)
# Defines the general prompt given to ChatGPT
# The website's text is inserted in place of {}


completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful AI article summariser"},
    {"role": "user", "content": generate_prompt(content)}
  ]
)

# Prints the full API response, which is useful when debugging
print(completion)
final = completion.choices[0].message.content
# Stores the text portion of the response from ChatGPT in the variable final

# Removes the same invalid characters from the response as in webscraper.py
originalchr = [chr(34), chr(39), chr(194), "©", chr(92), "&"]
newchr = ["", "", "", "", "", "and"]

for old, new in zip(originalchr, newchr):
  final = final.replace(old, new)
print(final)

# final is later imported into the main.py file
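
The prompt asks for a summary below 340 characters, but a language model does not reliably respect a character limit, so the reply can come back longer. A minimal sketch of a guard that trims the reply, cutting at a sentence end where possible. The name fit_to_limit is hypothetical, and the default of 340 is taken from the prompt.

```python
def fit_to_limit(summary: str, limit: int = 340) -> str:
    """Trim summary to at most `limit` characters, preferring a sentence end."""
    summary = summary.strip()
    if len(summary) <= limit:
        return summary
    clipped = summary[:limit]
    # Cut just after the last full stop inside the limit, if there is one.
    last_stop = clipped.rfind(".")
    if last_stop != -1:
        return clipped[:last_stop + 1]
    # No sentence end found: fall back to the last complete word.
    return clipped.rsplit(" ", 1)[0]


print(fit_to_limit("One. Two. Three is longer.", limit=12))
```

In chatgpt.py this could be applied as final = fit_to_limit(final) before main.py imports the value.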




Let's generate the main file

The final step is to create the main.py file, which links everything together.

import audiostack
import os
from chatgpt import final
#imports the summary script from the chatgpt.py file

#Code below is edited from the audiostack API quickstart guide
"""
Hello! Welcome to the audiostack python SDK. 
"""

#gets the api key from secrets
audiostack.api_key = os.environ['APIKEY']


#adds the tts script
scriptText = final

script = audiostack.Content.Script.create(
  scriptText=scriptText,
  projectName="testingthings"
)

# Creates the speech from the script and sets the voice for the TTS
tts = audiostack.Speech.TTS.create(
  scriptItem=script,
  voice="bronson",
  speed=1.0
)

mix = audiostack.Production.Mix.create(
  speechItem=tts,
  soundTemplate="time_horizon_20",
  masteringPreset="balanced"
)
print(mix)

# Encodes the mix as an MP3 and downloads it as Summary.mp3
encoder = audiostack.Delivery.Encoder.encode_mix(productionItem=mix, preset="mp3")
encoder.download(fileName="Summary")
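
Both API keys are read from environment variables: OPENAI_API_KEY in chatgpt.py and APIKEY in main.py. If APIKEY is missing, os.environ['APIKEY'] raises a bare KeyError. A minimal sketch of a check that stops with a clearer message instead. The helper require_env is hypothetical and is not part of either SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or exit with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        # SystemExit stops the script and prints the message without a traceback.
        raise SystemExit(f"Environment variable {name} is not set; set it before running.")
    return value
```

In main.py this could be used as audiostack.api_key = require_env("APIKEY").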