NavTalk Real-time Digital Human API enables you to build interactive conversational experiences with digital avatars that respond in real-time over WebSocket connections. With sub-500ms latency and seamless interruption handling, you can create natural, human-like interactions that feel truly conversational.

Core Features

Ultra-Low Latency

Sub-500ms end-to-end response times for real-time conversations

Real-time Rendering

Frame-accurate lip sync and emotion-driven facial expressions

Natural Conversations

Human-like dialogue with emotional intelligence and empathetic responses

Multilingual Support

Over 50 languages with 95%+ recognition accuracy and seamless language switching

Context Management

Maintain conversation history and context across sessions

Preset Characters

Ready-to-use digital avatars for various use cases and industries

Custom Characters

Create and deploy your own custom digital character avatars

Knowledge Base Integration

Connect enterprise or personal knowledge bases for expert-level, context-aware responses

Function Calling

Integrate external APIs and execute custom functions during conversations
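Function calling typically works by registering tool schemas the model can invoke mid-conversation, then executing the matching local function and returning its result. The sketch below assumes an OpenAI-style tool definition and a hypothetical function-call event shape; NavTalk's actual schema may differ, so consult the API reference.

```python
import json

# Hypothetical tool schema (OpenAI-style); NavTalk's actual format may differ.
weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Look up current weather for a city during the conversation.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def handle_function_call(event: dict, registry: dict) -> str:
    """When the model emits a function-call event, run the matching local
    function and return its result as a JSON string to send back over
    the WebSocket."""
    fn = registry[event["name"]]
    args = json.loads(event["arguments"])
    return json.dumps(fn(**args))
```

A registry maps tool names to plain Python callables, so the dispatch step stays decoupled from any particular external API.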

Unified WebSocket Architecture

NavTalk uses a single unified WebSocket connection that handles all real-time communication, including:
  • Real-time API communication - Audio input streaming and text/audio responses
  • WebRTC signaling - Video stream setup and ICE candidate exchange
  • Session management - Configuration and conversation history
This unified approach provides several key advantages:
  • Simpler integration - Only one WebSocket connection to manage, with no need to coordinate multiple connections or handle a separate WebRTC signaling channel.
  • Lower overhead - Reduced connection overhead and improved reliability, since all communication flows through one persistent, optimized channel.
  • Easier debugging - All events and messages flow through a single connection, making them easier to monitor, log, and debug in your application.
  • Built-in synchronization - WebRTC signaling is automatically synchronized with the audio stream, eliminating timing issues and ensuring smooth audio-video synchronization.
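Because everything rides on one connection, client code usually reduces to building typed JSON frames and routing inbound frames by type. The message types, field names, and event naming below are illustrative assumptions, not NavTalk's documented schema:

```python
import base64
import json

# Hypothetical frame builders -- NavTalk's actual event names may differ.

def make_session_update(character: str, language: str) -> str:
    """Session-management frame: configure the conversation."""
    return json.dumps({"type": "session.update",
                       "session": {"character": character, "language": language}})

def make_audio_append(pcm_chunk: bytes) -> str:
    """Realtime API frame: stream a chunk of user audio, base64-encoded."""
    return json.dumps({"type": "input_audio.append",
                       "audio": base64.b64encode(pcm_chunk).decode("ascii")})

def make_ice_candidate(candidate: str, sdp_mid: str) -> str:
    """WebRTC signaling frame: forward an ICE candidate for the video stream."""
    return json.dumps({"type": "webrtc.ice_candidate",
                       "candidate": {"candidate": candidate, "sdpMid": sdp_mid}})

def dispatch(raw: str, handlers: dict) -> None:
    """Route an inbound frame from the single connection by its type prefix
    (e.g. "session", "webrtc", "response")."""
    msg = json.loads(raw)
    kind = msg["type"].split(".")[0]
    handler = handlers.get(kind)
    if handler:
        handler(msg)
```

One dispatcher keyed on a type prefix is what makes the single-connection design easy to monitor and log: every category of traffic passes through the same choke point.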

How It Works

The Real-time Digital Human API uses a direct audio-to-audio processing pipeline that eliminates traditional text conversion steps (STT and TTS), delivering unprecedented speed and natural conversation flow.
Your application captures user audio input and sends audio streams through WebSocket connections. This layer handles real-time audio streaming and ensures continuous bidirectional communication for seamless dialogue.
GPT-realtime processes audio signals directly without text conversion steps. By eliminating Speech-to-Text (STT) and Text-to-Speech (TTS) transformations, the system achieves sub-500ms latency and enables natural interruption handling.
The processed audio response is generated in real-time and synchronized with video rendering. This layer delivers high-quality audio output with preserved fidelity, maintaining natural voice tone and emotional nuances.
Frame-accurate lip sync and emotion-driven facial expressions are rendered in real-time, creating a lifelike visual presence synchronized with the audio output.
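On the capture side, the first step of this pipeline is usually splitting the microphone's PCM stream into small fixed-duration frames before streaming them over the WebSocket. The sample rate and frame length below are common real-time-audio defaults, not values specified by NavTalk:

```python
def chunk_pcm(pcm: bytes, sample_rate: int = 24000,
              frame_ms: int = 20, bytes_per_sample: int = 2) -> list[bytes]:
    """Split a mono PCM byte stream into fixed-duration frames for streaming.

    Assumes 16-bit samples at 24 kHz (an assumption for illustration);
    the last frame may be shorter than the rest.
    """
    frame_bytes = sample_rate * frame_ms // 1000 * bytes_per_sample
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]
```

Small frames (10-40 ms) keep the audio flowing continuously, which is what lets the server detect and handle interruptions with low latency.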

Try Our Demo

We provide simple, single-page demos in multiple languages and platforms that you can clone and run with one click. To get started:
  1. Register for an account and obtain your API key from the dashboard
  2. Clone the Samples repository: git clone https://github.com/navtalk/Samples.git
  3. Configure your API key in the demo files
  4. Run the demo — each demo is a single-page application that works immediately
The Samples repository includes ready-to-run examples for Web, Python, JavaScript, and other platforms to help you get started quickly.
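For step 3, demos commonly read the API key from an environment variable rather than hard-coding it. The endpoint URL, environment-variable name, and query-parameter auth scheme below are assumptions for illustration; use the values from your dashboard and the demo's README.

```python
import os

# Hypothetical endpoint and auth scheme -- check the NavTalk dashboard
# and docs for the real WebSocket URL and how the key is actually passed.
def build_ws_url(api_key: str,
                 base: str = "wss://api.navtalk.ai/v1/realtime") -> str:
    """Build the connection URL for the unified WebSocket."""
    return f"{base}?api_key={api_key}"

api_key = os.environ.get("NAVTALK_API_KEY", "your-api-key-here")
url = build_ws_url(api_key)
```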