Python: ChatGPT Voice Chatbot

In my last article, I detailed how to build a private Chatbot website, powered by the ChatGPT 3.5 API, that duplicates the user interface of ChatGPT, using Streamlit, a capable Python library for developing web applications without requiring any HTML expertise. If the topic interests you, the article is available here. However, you don't need to read it first to follow this guide's toolset and technical framework, since I will provide comprehensive guidance from scratch.

Today, the focus is on creating a private voice Chatbot web application using the ChatGPT API. The aim is to explore further potential use cases and business opportunities for AI. I will guide you through the development process step by step to make sure you understand it and can replicate it on your own.

Why a voice Chatbot?

In addition to the reasons for creating your own Chatbot application that I mentioned in the last article, there are more reasons to build a voice-driven Chatbot.

  • Not everyone is comfortable with a typing-based service: imagine children who are still mastering their writing skills, or seniors who can't read words on a screen easily. A voice-based AI Chatbot solves that issue, just like it helps my little one, who asks his voice Chatbot to read him a bedtime story.
  • Given the existing assistants such as Apple's Siri and Amazon's Alexa, adding voice interaction to GPT models opens up a wider range of possibilities. The ChatGPT API's superior ability to create coherent, contextually relevant responses, combined with voice-based smart-home connectivity, may provide a plethora of business opportunities. The voice assistant we create in this article can be the entry point.
Photo by BENCE BOROS on Unsplash

Enough theory; let's start.

0. Block diagram

In this application, we have three key modules, in order of processing:

  • Speech-to-Text by Bokeh & Web Speech API
  • Chat Completion by OpenAI GPT-3.5 API
  • Text-to-Speech by gTTS

And the web framework is built with Streamlit.

Block diagram

If you already know how to use the OpenAI APIs with the GPT-3.5 model and how to design a web application with Streamlit, feel free to skip chapters 1 & 2 to save reading time.

1. OpenAI GPT APIs

Acquire your API Key

If you’ve already got an OpenAI API Key, stick with it instead of creating a new one. But if you’re an OpenAI newcomer, sign up for a new account and find the page below in your account menu:

After generating the API key, remember that it will only be displayed once, so make sure to copy it somewhere secure for future use.
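By the way, rather than pasting the key directly into your code, a safer habit is to read it from an environment variable. Here is a minimal sketch; the variable name OPENAI_API_KEY is the commonly used convention:

import os
import openai

# Read the key from an environment variable instead of hard-coding it
openai.api_key = os.environ.get("OPENAI_API_KEY")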

Usage of ChatCompletion API

At the moment of writing, GPT-4 has just been released and its API has not been fully rolled out yet, so I am going to develop with the GPT-3.5 model, which is powerful enough for our AI voice Chatbot demonstration.

Now let's have a look at the simplest demo from OpenAI to understand the basic usage of the ChatCompletion API (also called the gpt-3.5 API or ChatGPT API):

Install the package:

!pip install openai

If you have previously developed with OpenAI's legacy GPT models, you may have to upgrade your package through pip:

!pip install --upgrade openai

Create and send the prompt:

import openai

complete = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)

Receive the text response:

message = complete.choices[0].message.content

Because the GPT-3.5 API is a chat-based text completion API, make sure the message body of your ChatCompletion request contains the conversation history as context, so that the model's response to your current request is more contextually relevant.

To achieve this, the list object in the message body should be organized in the following sequence (a minimal sketch of the resulting chat loop follows the list):

  • The system message sets the behavior of the chatbot by adding an instruction at the top of the message list. As mentioned in the introduction, this capability has not been fully released in gpt-3.5-turbo-0301 yet.
  • Each user message represents an input or inquiry from the user, while the assistant message is the corresponding response from the GPT-3.5 API. Such conversation pairs give the model its reference context.
  • The last user message is the prompt being requested at the current moment.
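To make this concrete, here is a minimal sketch (my own illustration with a hypothetical chat() helper, not part of the final app) of a loop that appends each turn to the message list so every request carries the full context:

import openai

# Hypothetical helper: 'history' accumulates the whole conversation,
# starting from the system instruction.
history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_text):
    history.append({"role": "user", "content": user_text})
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=history
    )
    reply = completion.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply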

2. Web development

As in the last project, we are going to keep using the powerful Streamlit library to build the web application.

Streamlit is an open-source framework that enables data scientists and developers to quickly build and share interactive web applications for machine learning and data science projects. It also offers a bunch of widgets that require only one line of Python code to create, like st.table(…). If you are not skilled in web development and, like me, have no intention of building a large commercial application, Streamlit is always one of your best choices, as it requires almost zero HTML expertise.

Let’s see a quick example of building a Streamlit Web application:

Install the package:

!pip install streamlit

Create a Python file “demo.py”:

import streamlit as st

st.write("""
# My First App
Hello *world!*
""")

Run on your local machine or remote server:

!python -m streamlit run demo.py

After this output is printed, you can visit your web app through the address and port listed:

You can now view your Streamlit app in your browser.

Network URL: http://xxx.xxx.xxx.xxx:8501
External URL: http://xxx.xxx.xxx.xxx:8501

The usage of all the widgets Streamlit provides can be found on its docs page: https://docs.streamlit.io/library/api-reference
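As an example of those one-line widgets, rendering a table takes a single st.table() call (the sample data below is made up purely for illustration):

import streamlit as st
import pandas as pd

# One line renders a whole static table from a DataFrame
df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [85, 92]})
st.table(df)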

3. Speech-to-text implementation

One of the key features of this AI voice Chatbot is its capability to recognize user speech and generate proper text that our ChatCompletion API can use as input.

The high-quality speech recognition provided by OpenAI's Whisper API is an excellent option, but it comes at a cost. Alternatively, the browser's free Web Speech API, accessed from JavaScript, offers reliable multi-language support with impressive performance. While a Python project may seem incompatible with custom JavaScript, fear not! I will introduce a simple technique for invoking JavaScript code within a Python program in the next section.

Anyway, let's see how to quickly develop a speech-to-text demo with the Web Speech API. You can find its documentation here.

The implementation of speech recognition can be easily done as below.

var recognition = new webkitSpeechRecognition();
recognition.continuous = false;
recognition.interimResults = true;
recognition.lang = 'en';

recognition.start();

After initializing the recognition object with webkitSpeechRecognition(), some useful attributes need to be set. The continuous attribute indicates whether the recognition should keep running after a single speech input has been successfully processed. I set it to false because I want the voice Chatbot to generate each answer from one user utterance at a time.

Setting the interimResults attribute to true generates interim results while the user is speaking, so users can see a dynamic transcript of their voice input.

The lang attribute sets the recognition language for the request. Note that if it is left unset, the default language comes from the HTML document root element and its hierarchy, so users with different language settings on their systems may have a different experience.

The recognition object exposes multiple events, from which we use the .onresult callback to collect the generated text, accumulating final transcripts into value and interim transcripts into value2.

recognition.onresult = function (e) {
    var value = "", value2 = "", rand = 0;
    for (var i = e.resultIndex; i < e.results.length; ++i) {
        if (e.results[i].isFinal) {
            value += e.results[i][0].transcript;   // final transcript
            rand = Math.random();                  // marks a new final result
        } else {
            value2 += e.results[i][0].transcript;  // interim transcript
        }
    }
}

4. Bokeh

For the user interface, we want a button that starts the voice recognition we implemented in JavaScript in the last section.

The Streamlit library does not support custom JS code, so we introduce Bokeh. The Bokeh library is another powerful data visualization tool in Python. The part that best supports our demo is its ability to embed custom JavaScript code, which means we can run our voice recognition script under Bokeh's button widget.

To achieve that, we should install the Bokeh package:

!pip install bokeh

Import the button and CustomJS:

from bokeh.models.widgets import Button
from bokeh.models import CustomJS

Create the button widget:

spk_button = Button(label='SPEAK', button_type='success')

Define the button click event:

spk_button.js_on_event("button_click", CustomJS(code="""
...js code...
"""))

The .js_on_event() method registers events on spk_button. In this case, we register the “button_click” event, which triggers the execution of the JS code block …js code… embedded by the CustomJS() method when the user clicks.

streamlit-bokeh-events

After implementing the SPEAK button and its callback, the next step is to connect the Bokeh event output (the recognized text) to the other function blocks, in order to dispatch the prompt text to the ChatGPT API.

Fortunately, there is an open-source project called “Streamlit Bokeh Events” designed for exactly this purpose, providing bi-directional communication with Bokeh widgets. You can find its GitHub page here.

The usage of this library is very simple. Install the package first:

!pip install streamlit-bokeh-events

Import and create the result object with the streamlit_bokeh_events method:

from streamlit_bokeh_events import streamlit_bokeh_events

result = streamlit_bokeh_events(
    bokeh_plot=spk_button,
    events="GET_TEXT,GET_ONREC,GET_INTRM",
    key="listen",
    refresh_on_update=False,
    override_height=75,
    debounce_time=0)

Use the bokeh_plot attribute to register the spk_button we created in the last section, and the events attribute to subscribe to several custom HTML document events:

  • GET_TEXT to receive final recognition text
  • GET_INTRM to receive interim recognition text
  • GET_ONREC to receive speech processing stage

We can use the JS function document.dispatchEvent(new CustomEvent(…)) to generate an event; for example, for the GET_TEXT and GET_INTRM events:

spk_button.js_on_event("button_click", CustomJS(code="""
    var value = "";
    var rand = 0;
    var recognition = new webkitSpeechRecognition();
    recognition.continuous = false;
    recognition.interimResults = true;
    recognition.lang = 'en';

    recognition.onresult = function (e) {
        var value2 = "";
        for (var i = e.resultIndex; i < e.results.length; ++i) {
            if (e.results[i].isFinal) {
                value += e.results[i][0].transcript;
                rand = Math.random();  // random id marking a new final result
            } else {
                value2 += e.results[i][0].transcript;
            }
        }
        document.dispatchEvent(new CustomEvent("GET_TEXT", {detail: {t: value, s: rand}}));
        document.dispatchEvent(new CustomEvent("GET_INTRM", {detail: value2}));
    }

    recognition.start();
"""))

Then, on the Python side, check result.get() to handle each event; for example, the GET_INTRM event:

tr = st.empty()
if result:
    if "GET_INTRM" in result:
        if result.get("GET_INTRM") != '':
            tr.text_area("**Your input**", result.get("GET_INTRM"))

The two code snippets above mean that while the user's speech is ongoing, any interim recognition text is shown in a Streamlit text_area widget. The final text delivered by GET_TEXT is handled in a similar way, as sketched next.
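Here is that sketch, adapted from the full app below (final is just a local shorthand of my own); the random number s generated in the JS code acts as a session id, so the same final result is not processed twice when Streamlit reruns the script:

if result:
    if "GET_TEXT" in result:
        final = result.get("GET_TEXT")
        # 's' is the random session id from the JS side: process each
        # final result only once across Streamlit reruns
        if final["t"] != '' and final["s"] != st.session_state['input']['session']:
            st.session_state['input']['text'] = final["t"]
            tr.text_area("**Your input**", value=final["t"])
            st.session_state['input']['session'] = final["s"]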

5. Text-to-speech implementation

After the prompt request completes and the GPT-3.5 model generates a response through the ChatGPT API, we display the response text directly on the web page with Streamlit's st.write() method.

However, we also need to convert the text to speech so that the bi-directional voice interaction of our AI Chatbot is complete.

There is a popular Python library called “gTTS” that does the job perfectly. It supports multiple output formats for the voice data, including mp3 and stdout, by interfacing with Google Translate's text-to-speech API. You can find its GitHub page here.

Only a few lines of code are needed for the conversion. Install the package first:

!pip install gTTS

In this demo, we don't want to save the voice data to a file, so we use BytesIO() to hold it in memory:

from io import BytesIO
from gtts import gTTS

sound = BytesIO()
tts = gTTS(output, lang='en', tld='com')
tts.write_to_fp(sound)

Here output is the text string to be converted; you can choose a different language via lang and a different accent via tld, which selects among Google domains, according to your users' preferences. For example, you could set tld='co.uk' to generate a British English accent. A couple of illustrative variations follow.
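Both the languages and the localized accents below come from gTTS's documented options; the sample strings are made up for illustration:

from gtts import gTTS

# British English accent via Google's UK domain
tts_uk = gTTS("Good evening!", lang='en', tld='co.uk')

# French speech from the default domain
tts_fr = gTTS("Bonsoir !", lang='fr')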

Then, create a decent audio player with a Streamlit widget:

st.audio(sound)

6. Entire voice Chatbot

To consolidate all the modules mentioned above, we need to complete the following features:

  1. Full interaction with the ChatCompletion API, appending the conversation history as user and assistant message blocks, and using Streamlit's st.session_state to store the running variables.
  2. Full event generation in the SPEAK button's CustomJS, covering additional recognition events such as .onspeechstart, .onsoundend, and .onerror along the recognition process.
  3. Full event handling for “GET_TEXT,GET_ONREC,GET_INTRM” to display proper information on the web interface and to manage text display and assembly during the user's speech.
  4. All the necessary Streamlit widgets.

Please find the entire demo code for your reference:

import streamlit as st
from bokeh.models.widgets import Button
from bokeh.models import CustomJS

from streamlit_bokeh_events import streamlit_bokeh_events

from gtts import gTTS
from io import BytesIO
import openai

openai.api_key = '{Your API Key}'

# Conversation history, seeded with the system instruction
if 'prompts' not in st.session_state:
    st.session_state['prompts'] = [{"role": "system", "content": "You are a helpful assistant. Answer as concisely as possible with a little humor expression."}]

def generate_response(prompt):
    # Send the stored history plus the new user message; the history itself
    # is appended to only after the reply arrives, to avoid duplicate entries.
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=st.session_state['prompts'] + [{"role": "user", "content": prompt}]
    )
    return completion.choices[0].message.content

sound = BytesIO()

placeholder = st.container()

placeholder.title("Yeyu's Voice ChatBot")
stt_button = Button(label='SPEAK', button_type='success', margin=(5, 5, 5, 5), width=200)

stt_button.js_on_event("button_click", CustomJS(code="""
    var value = "";
    var rand = 0;
    var recognition = new webkitSpeechRecognition();
    recognition.continuous = false;
    recognition.interimResults = true;
    recognition.lang = 'en';

    document.dispatchEvent(new CustomEvent("GET_ONREC", {detail: 'start'}));

    recognition.onspeechstart = function () {
        document.dispatchEvent(new CustomEvent("GET_ONREC", {detail: 'running'}));
    }
    recognition.onsoundend = function () {
        document.dispatchEvent(new CustomEvent("GET_ONREC", {detail: 'stop'}));
    }
    recognition.onresult = function (e) {
        var value2 = "";
        for (var i = e.resultIndex; i < e.results.length; ++i) {
            if (e.results[i].isFinal) {
                value += e.results[i][0].transcript;
                rand = Math.random();
            } else {
                value2 += e.results[i][0].transcript;
            }
        }
        document.dispatchEvent(new CustomEvent("GET_TEXT", {detail: {t: value, s: rand}}));
        document.dispatchEvent(new CustomEvent("GET_INTRM", {detail: value2}));
    }
    recognition.onerror = function (e) {
        document.dispatchEvent(new CustomEvent("GET_ONREC", {detail: 'stop'}));
    }
    recognition.start();
"""))

result = streamlit_bokeh_events(
    bokeh_plot=stt_button,
    events="GET_TEXT,GET_ONREC,GET_INTRM",
    key="listen",
    refresh_on_update=False,
    override_height=75,
    debounce_time=0)

tr = st.empty()

if 'input' not in st.session_state:
    st.session_state['input'] = dict(text='', session=0)

tr.text_area("**Your input**", value=st.session_state['input']['text'])

if result:
    if "GET_TEXT" in result:
        # Only accept a final result once, keyed by its random session id
        if result.get("GET_TEXT")["t"] != '' and result.get("GET_TEXT")["s"] != st.session_state['input']['session']:
            st.session_state['input']['text'] = result.get("GET_TEXT")["t"]
            tr.text_area("**Your input**", value=st.session_state['input']['text'])
            st.session_state['input']['session'] = result.get("GET_TEXT")["s"]

    if "GET_INTRM" in result:
        # Show the interim transcript appended to any confirmed text
        if result.get("GET_INTRM") != '':
            tr.text_area("**Your input**", value=st.session_state['input']['text'] + ' ' + result.get("GET_INTRM"))

    if "GET_ONREC" in result:
        if result.get("GET_ONREC") == 'start':
            placeholder.image("recon.gif")
            st.session_state['input']['text'] = ''
        elif result.get("GET_ONREC") == 'running':
            placeholder.image("recon.gif")
        elif result.get("GET_ONREC") == 'stop':
            placeholder.image("recon.jpg")
            if st.session_state['input']['text'] != '':
                input = st.session_state['input']['text']
                output = generate_response(input)
                st.write("**ChatBot:**")
                st.write(output)
                st.session_state['input']['text'] = ''

                # Convert the reply to speech and play it
                tts = gTTS(output, lang='en', tld='com')
                tts.write_to_fp(sound)
                st.audio(sound)

                # Store the completed turn in the conversation history
                st.session_state['prompts'].append({"role": "user", "content": input})
                st.session_state['prompts'].append({"role": "assistant", "content": output})

After typing:

!python -m streamlit run demo_voice.py

You will finally see a simple but smart voice Chatbot in your web browser. (Don't forget to allow the webpage to access your microphone and speaker when the permission request pops up.)

That’s it.

Hope you can find something useful in this article and thank you for reading!
