We are excited to share a significant technical update from our labs team: the Amity Real-Time Voice API. In our continuous pursuit of innovation, we have developed an optimizer module that integrates traditional speech recognition systems with modern large language model (LLM) technology. This development not only replicates the functionality of leading real-time voice APIs, such as OpenAI's, but delivers it at roughly a quarter of the cost.
Our new optimizer module is designed to bridge the gap between conventional speech recognition and LLM-powered processing. By leveraging advanced techniques, our system delivers near real-time performance with robust support for call interruptions and dynamic function call management. The core objective is to ensure that even during high-demand scenarios, the system maintains responsiveness and accuracy.
Recognizing the complexity of real-world interactions, our module supports call interruptions and effectively manages waiting states. This ensures that ongoing conversations remain fluid and adaptive, even when unexpected pauses occur.
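One way to picture interruption handling is as a small turn-taking state machine: when the user starts speaking while the assistant is still talking (a "barge-in"), playback is cancelled and the system returns to listening. The sketch below is an illustrative assumption about how such logic might look, not the actual Amity implementation; the state names and `cancel_tts` callback are hypothetical.

```python
from enum import Enum, auto

class CallState(Enum):
    LISTENING = auto()  # capturing user audio
    THINKING = auto()   # waiting for the LLM response
    SPEAKING = auto()   # playing synthesized speech

class TurnManager:
    """Minimal sketch of barge-in handling (illustrative, not Amity's code)."""

    def __init__(self, cancel_tts):
        self.state = CallState.LISTENING
        self.cancel_tts = cancel_tts  # callback that stops audio playback

    def on_user_speech_start(self):
        # Barge-in: if the assistant is talking, stop playback immediately
        if self.state is CallState.SPEAKING:
            self.cancel_tts()
        self.state = CallState.LISTENING

    def on_user_speech_end(self):
        # User finished; enter a waiting state while the LLM responds
        self.state = CallState.THINKING

    def on_response_audio_start(self):
        self.state = CallState.SPEAKING
```

Keeping the turn logic in one place like this makes it straightforward to handle unexpected pauses: the manager simply stays in THINKING or LISTENING until the next event arrives.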
When receiving streaming chunks from a large language model, our module detects when a chunk is complete by comparing it against natural sentence boundaries. This balances response speed with completeness of output.
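The idea of emitting output at natural sentence boundaries can be sketched as a small buffer that accumulates streamed deltas and releases text only once sentence-ending punctuation appears. This is a simplified illustration under our own assumptions, not the production module:

```python
import re

# Sentence-ending punctuation followed by whitespace or end of buffer
SENTENCE_END = re.compile(r"[.!?。？！](\s|$)")

class SentenceChunker:
    """Sketch: accumulate LLM stream deltas, emit complete sentences."""

    def __init__(self):
        self.buffer = ""

    def feed(self, delta: str) -> list[str]:
        """Add a streamed delta; return any sentences completed by it."""
        self.buffer += delta
        sentences = []
        while True:
            m = SENTENCE_END.search(self.buffer)
            if not m:
                break
            sentences.append(self.buffer[: m.end()].strip())
            self.buffer = self.buffer[m.end():]
        return sentences
```

For example, feeding the deltas `"Hello wor"`, `"ld. How"`, `" are you?"` yields nothing, then `["Hello world."]`, then `["How are you?"]` — each sentence can be sent to speech synthesis as soon as it is complete, rather than waiting for the full response.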
By rethinking the conventional architecture and streamlining the integration process, our optimizer reduces operational costs to roughly a quarter of those of traditional solutions. This cost efficiency does not come at the expense of performance, as our solution is fully scalable and enterprise-ready.
The diagram above provides a visual breakdown of our system architecture, illustrating how the optimizer module interfaces between traditional speech recognition and LLM functionality to deliver real-time API performance.
The design of the optimizer is based on modular principles, enabling seamless integration into existing enterprise systems. This approach not only enhances adaptability but also simplifies the upgrade path as new technologies and enhancements become available.
Our recent collaboration with the Accentix Team has validated the technical robustness of our solution. The deployment has showcased remarkable improvements in both processing speed and cost efficiency, reinforcing our confidence in the optimizer’s potential to transform real-time voice interactions at scale.
We conducted a benchmark using GPT-4o as a judge to evaluate the performance of Amity Real-Time Voice API against OpenAI's 4o-mini Realtime API. The evaluation focused on two key factors: (1) accuracy and relevance of responses compared to the given questions and expected answers, and (2) average response time to assess efficiency. To ensure a fair comparison, both models were tested under the same configuration and prompt, allowing for an objective measurement of accuracy, relevance, and response time under identical conditions.
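Aggregating an LLM-as-judge benchmark of this kind reduces to tallying pairwise verdicts into win rates. The helper below is a hedged sketch of that aggregation step; the verdict format (`'A'` for Amity, `'B'` for OpenAI) is an assumption, and the actual judging prompt sent to GPT-4o is not shown.

```python
def win_rates(verdicts: list[str]) -> tuple[float, float]:
    """Turn a list of pairwise judge verdicts ('A' or 'B') into
    win percentages (model A %, model B %), rounded to 2 decimals.

    Illustrative sketch of benchmark aggregation, not the exact harness.
    """
    total = len(verdicts)
    a_pct = sum(1 for v in verdicts if v == "A") / total * 100
    return round(a_pct, 2), round(100 - a_pct, 2)
```

For instance, 11 wins for model A out of 15 test cases yields `(73.33, 26.67)`, matching the headline figures below.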
The results are based on a head-to-head comparison between each model's answers and the expected answers across all test cases:
• Amity Real-Time Voice API: 73.33%
• OpenAI: 26.67%
This means that Amity Real-Time Voice API's responses were judged more accurate and relevant in 73.33% of the test cases, versus 26.67% for OpenAI. Amity's answers aligned more closely with the expected responses, demonstrating a better grasp of context and providing more precise information.
We also analyzed the average response times for both models, defined as the time from when the user finishes speaking to the first token being returned by the AI model. The results are:
• Amity Real-Time Voice API Average Duration: 2.84 sec
• OpenAI Realtime API (4o-mini) Average Duration: 1.36 sec
This indicates that OpenAI responds to the user about 1.5 seconds faster than Amity Real-Time Voice API on average.
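Measuring this latency amounts to timing the gap between the end of user speech and the first streamed token. A minimal sketch, assuming the response is exposed as any Python iterator of deltas (the `stream` interface here is an assumption, not a specific SDK):

```python
import time

def first_token_latency(stream):
    """Return (first_token, seconds_until_it_arrived).

    `stream` is any iterator yielding response deltas; timing starts
    when this function is called, i.e. right after the user stops
    speaking in a real pipeline. Illustrative sketch only.
    """
    start = time.monotonic()
    first = next(stream)  # blocks until the first token arrives
    return first, time.monotonic() - start
```

Averaging this value over all test cases gives the durations reported above.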
The benchmark demonstrates that while Amity Real-Time Voice API has a longer response time compared to OpenAI's Real-Time API, it consistently provides more accurate and contextually relevant responses. On the other hand, OpenAI's faster response speed appears to come at the cost of accuracy and relevance, as it tends to answer fully in one go without matching the expected context as effectively.
Our latest optimizer module delivers a real-time voice API at roughly 4× lower cost. Looking ahead, we are focused on reducing latency for smoother, more natural conversations and on enhancing function-calling support so that every voice command can directly trigger smart, automated actions.
In real-world scenarios, detecting when the user has finished speaking is critical. For example, audio that trails off with an "hhhmmm..." or "ummmmm…" can act as noise that causes an acoustic voice activity detection (VAD) model to end the turn inaccurately. In the next version, we plan to train and integrate a small language model that performs this detection at the semantic level.
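To make the semantic-endpointing idea concrete, here is a deliberately simple heuristic stand-in for the planned small language model: if the transcript tail is a filler word, treat the user as still speaking. The filler list and function name are hypothetical; a trained model would replace this lookup entirely.

```python
# Hypothetical filler vocabulary; a real system would learn this
FILLERS = {"um", "uh", "hmm", "hhhmmm", "ummmmm", "er", "like"}

def likely_finished(transcript: str) -> bool:
    """Heuristic placeholder for semantic endpointing: a trailing
    filler word suggests the user has not finished their turn."""
    words = transcript.lower().rstrip(".!?… ").split()
    if not words:
        return False  # nothing said yet
    return words[-1].strip(",") not in FILLERS
```

Here `likely_finished("I'd like a, hmm")` returns `False`, so the pipeline would keep listening instead of cutting the user off mid-thought.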
Voice requests like “Order a medium latte from Starbucks” will automatically invoke pre-defined functions (e.g., an orderCoffee API) to process orders instantly.
With improved error handling and multi-step function support, complex tasks (such as booking flights or scheduling meetings) will be handled seamlessly—all while keeping the conversation fluid and responsive.
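A common pattern for this kind of function calling is a tool registry that the LLM's structured output is dispatched against; multi-step tasks then just loop over a list of calls. The sketch below uses the `orderCoffee` example from the text, but the registry shape and call format (`{"name": ..., "arguments": {...}}`) are assumptions for illustration:

```python
# Hypothetical tool registry; only orderCoffee comes from the text above
TOOLS = {}

def tool(fn):
    """Register a function so the LLM can invoke it by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def orderCoffee(size: str, drink: str, vendor: str) -> str:
    # Placeholder: a real handler would call the vendor's ordering API
    return f"Ordered a {size} {drink} from {vendor}"

def dispatch(call: dict) -> str:
    """Execute one LLM-emitted function call of the assumed shape
    {'name': ..., 'arguments': {...}}, with basic error handling."""
    fn = TOOLS.get(call["name"])
    if fn is None:
        raise ValueError(f"Unknown tool: {call['name']}")
    return fn(**call["arguments"])
```

So a voice request like "Order a medium latte from Starbucks" would, once the LLM emits the corresponding call, resolve via `dispatch({"name": "orderCoffee", "arguments": {"size": "medium", "drink": "latte", "vendor": "Starbucks"}})`; a flight booking would chain several such calls in sequence.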
By combining reduced latency with advanced function calling, we’re paving the way for intelligent, voice-driven enterprise applications that not only cut costs but also deliver a smooth, engaging user experience.