🤔 What Are We Building?
A full-stack Voice AI Agent with:
🎤 Voice Input — Captured via the Browser's Web Speech API
🧠 AI Brain — GPT-4o-mini processes your spoken query
🔊 Voice Output — OpenAI's TTS-1 model speaks the answer back
💅 Beautiful UI — Animated dark-mode orb with 4 visual states
🏗️ Project Structure
```
Voice-AI-Agent/
├── client/                      # React + Vite + TailwindCSS
│   └── src/
│       ├── components/
│       │   └── ChatInterface.jsx   # Main UI
│       ├── hooks/
│       │   ├── useVoiceInput.js    # Web Speech API hook
│       │   └── useAudioPlayer.js   # Audio playback hook
│       └── App.jsx
│
└── server/                      # Node.js + Express
    ├── server.js                # API endpoint
    └── .env                     # OPENAI_API_KEY
```
⚙️ Backend: Node.js + Express + OpenAI
The backend has a single POST /api/voice endpoint that does two things:
1. Sends the user's text to GPT-4o-mini for an intelligent reply
2. Converts that reply to speech using OpenAI TTS-1
```javascript
// server/server.js
const express = require('express');
const cors = require('cors');
const dotenv = require('dotenv');
const OpenAI = require('openai');

dotenv.config();

const app = express();
const port = process.env.PORT || 5000;

app.use(cors());
app.use(express.json());

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

app.post('/api/voice', async (req, res) => {
  try {
    const { text } = req.body;
    if (!text) return res.status(400).json({ error: 'Text is required' });

    // Step 1: Get AI response from GPT-4o-mini
    const completion = await openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [
        { role: "system", content: "You are a helpful voice assistant. Keep responses concise and conversational." },
        { role: "user", content: text }
      ],
    });
    const responseText = completion.choices[0].message.content;

    // Step 2: Convert response to speech (TTS-1)
    const mp3 = await openai.audio.speech.create({
      model: "tts-1",
      voice: "alloy",
      input: responseText,
    });
    const buffer = Buffer.from(await mp3.arrayBuffer());
    const audioBase64 = buffer.toString('base64');

    res.json({
      role: 'assistant',
      content: responseText,
      audio: `data:audio/mpeg;base64,${audioBase64}`
    });
  } catch (error) {
    console.error('Error:', error);
    res.status(500).json({ error: 'An error occurred.' });
  }
});

app.listen(port, () => console.log(`Server running on port ${port}`));
```
Key Points:
The audio is returned as a Base64-encoded MP3 data URI — no file storage needed!
We use the "alloy" voice, but OpenAI offers shimmer, echo, nova, onyx, and fable too.
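To make the Base64 step concrete, here is a tiny sketch of how that data URI gets assembled (the `toAudioDataUri` helper is illustrative, not part of the repo). Note that Base64 inflates the binary by roughly a third, which is the trade-off for skipping file storage entirely:

```javascript
// Hypothetical helper mirroring the encoding step in server.js
function toAudioDataUri(buffer) {
  return `data:audio/mpeg;base64,${buffer.toString('base64')}`;
}

// A few fake bytes stand in for the MP3 returned by TTS-1
const fakeMp3 = Buffer.from([0xff, 0xfb, 0x90, 0x00]);
console.log(toAudioDataUri(fakeMp3)); // data:audio/mpeg;base64,//uQAA==
```

The browser can hand this string straight to `new Audio(...)` with no extra decoding.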
🎤 Frontend Hook: useVoiceInput
This custom hook wraps the browser's SpeechRecognition API:
```javascript
// hooks/useVoiceInput.js
import { useState, useCallback, useRef } from 'react';

const useVoiceInput = () => {
  const [isListening, setIsListening] = useState(false);
  const [transcript, setTranscript] = useState('');
  const recognitionRef = useRef(null);

  const startListening = useCallback(() => {
    const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
    if (!SpeechRecognition) return alert('Speech Recognition not supported');

    const recognition = new SpeechRecognition();
    recognition.lang = 'en-US';
    recognition.interimResults = true;

    recognition.onresult = (event) => {
      const current = Array.from(event.results)
        .map(r => r[0].transcript)
        .join('');
      setTranscript(current);
    };
    recognition.onend = () => setIsListening(false);

    recognitionRef.current = recognition;
    recognition.start();
    setIsListening(true);
  }, []);

  const stopListening = useCallback(() => {
    recognitionRef.current?.stop();
  }, []);

  const resetTranscript = useCallback(() => setTranscript(''), []);

  return { isListening, transcript, startListening, stopListening, resetTranscript };
};

export default useVoiceInput;
```
🔊 Frontend Hook: useAudioPlayer
Plays the Base64 audio data URI returned from the backend:
```javascript
// hooks/useAudioPlayer.js
import { useState, useRef, useCallback } from 'react';

const useAudioPlayer = () => {
  const [isPlaying, setIsPlaying] = useState(false);
  const audioRef = useRef(null);

  const playAudio = useCallback((audioSrc) => {
    if (audioRef.current) audioRef.current.pause();

    const audio = new Audio(audioSrc);
    audioRef.current = audio;

    // Wire up onended before play() so a fast-finishing clip can't slip past it
    audio.onended = () => setIsPlaying(false);

    setIsPlaying(true);
    audio.play().catch(() => setIsPlaying(false)); // e.g. autoplay blocked
  }, []);

  return { isPlaying, playAudio };
};

export default useAudioPlayer;
```
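The two hooks meet in ChatInterface.jsx: when the user stops speaking, the transcript goes to the backend and the reply audio goes to the player. The component itself isn't shown above, but the glue logic can be sketched as a plain async function (names like `sendToAgent` and `fetchFn` are illustrative; dependencies are passed in so the logic stays testable outside a browser):

```javascript
// Hypothetical glue between useVoiceInput's transcript and useAudioPlayer
async function sendToAgent(text, { fetchFn, playAudio, apiUrl = '/api/voice' }) {
  const res = await fetchFn(apiUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text }),
  });
  if (!res.ok) throw new Error(`Request failed: ${res.status}`);

  const message = await res.json(); // { role, content, audio }
  playAudio(message.audio);         // hand the data URI to useAudioPlayer
  return message;                   // append to the chat transcript in state
}
```

In the real component you would call this with the browser's `fetch` and the `playAudio` returned by the hook.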
🚀 Setup in 3 Steps
1. Clone the Repo
```bash
git clone https://github.com/YOUR_USERNAME/voice-ai-agent.git
cd voice-ai-agent
```
2. Start the Backend
```bash
cd server
npm install

# Create .env file:
# PORT=5000
# OPENAI_API_KEY=sk-...

npm start
```
3. Start the Frontend
```bash
cd client
npm install
npm run dev
```
Visit http://localhost:5173, tap the orb, and speak! 🎙️
💡 Ideas to Extend This Project
🗣️ Add conversation history so the AI remembers context
🌍 Add language switching for multilingual support
🎭 Let users choose different voices (Shimmer, Nova, Echo...)
⌨️ Add text fallback input for environments without a mic
🌐 Deploy backend to Railway or Render, frontend to Vercel
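The first idea (conversation history) is a small change server-side: keep an array of prior turns and prepend it to each chat.completions call instead of sending the user's text alone. A minimal sketch, with hypothetical helper names:

```javascript
// Hypothetical helpers for the "conversation history" extension
const SYSTEM = { role: 'system', content: 'You are a helpful voice assistant.' };

// Build the messages array for the next chat.completions call
function buildMessages(history, userText) {
  return [SYSTEM, ...history, { role: 'user', content: userText }];
}

// After each reply, record both turns so the next request has context
function recordTurn(history, userText, assistantText) {
  return [
    ...history,
    { role: 'user', content: userText },
    { role: 'assistant', content: assistantText },
  ];
}
```

In practice you'd store `history` per session (or send it from the client) and cap its length so the prompt doesn't grow without bound.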
🔑 Key Takeaways
Web Speech API is built into Chrome/Edge — no external library needed for STT
GPT-4o-mini gives fast, cost-effective AI responses
TTS-1 produces natural-sounding speech, and returning it as Base64 keeps everything in a single JSON response
React's custom hooks keep voice and audio logic cleanly separated
State-driven UI (4 states) is the secret to making the app feel alive
📂 GitHub Repo
Drop a ⭐ if you find it useful!
This is Day 13 of my #30DaysOfAI challenge. Follow along for a new AI project every day!
