In today's rapidly evolving digital communication landscape, contact centers play a crucial role in shaping customer experiences. A particularly labor-intensive task in these settings is the manual analysis of call conversations to determine outcomes and assess customer emotions. Traditionally, human agents have had to listen meticulously to entire conversations, a process that is not only time-consuming but also prone to human error and subjective interpretation. As businesses strive for efficiency, it is increasingly clear that a systematic, reliable method is needed to streamline these tasks, motivating research into automated AI solutions.
To tackle the challenge of reducing the human effort involved in determining call outcomes, our research introduces an AI-driven solution that automates emotion analysis during call conversations. By breaking voice files into segments based on voice activity, our system performs emotion analysis on each segment, offering finer-grained insight into the call. This method not only supports a comprehensive understanding of customer emotions throughout the conversation but also aims for higher accuracy in outcome determination by focusing on smaller, more manageable data segments. Segmentation also allows the segments to be processed in parallel, improving the efficiency and effectiveness of the emotion analysis.
(The architecture of Amity Voice Emotion Analysis)
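To make the segmentation step concrete, here is a minimal sketch of voice-activity-based splitting. It is illustrative rather than our production pipeline: it assumes a 16 kHz, 16-bit mono PCM WAV file and uses the open-source webrtcvad package, and the frame size and aggressiveness settings are arbitrary choices.

```python
import wave

import webrtcvad


def voice_segments(path, frame_ms=30, aggressiveness=2):
    """Split a 16 kHz, 16-bit mono WAV file into (start_s, end_s) speech segments."""
    vad = webrtcvad.Vad(aggressiveness)  # 0 (least aggressive) .. 3 (most)
    with wave.open(path, "rb") as wf:
        sample_rate = wf.getframerate()  # assumed to be 16000 Hz
        frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
        pcm = wf.readframes(wf.getnframes())

    segments, start = [], None
    for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        t = i / 2 / sample_rate  # frame start time in seconds
        if vad.is_speech(pcm[i:i + frame_bytes], sample_rate):
            if start is None:
                start = t
        elif start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(pcm) / 2 / sample_rate))
    return segments
```

Each resulting segment can then be sent to the emotion models independently, for example via a process pool, which is what makes the parallel processing described above straightforward.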
Our research included several experimental rounds to refine and test the proposed solution. Initially, a dataset of 14 voice conversations containing 30 labeled emotions was used to evaluate a conventional acoustic emotion detection model, the popular "speechbrain/emotion-recognition-wav2vec2-IEMOCAP", which achieved a modest accuracy of 32%.
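For reference, this baseline can be reproduced roughly as follows, mirroring the usage shown on the model's Hugging Face card. Note that the `foreign_class` import path has moved between SpeechBrain releases (older versions expose it under `speechbrain.pretrained.interfaces`), so treat this as a sketch against a recent version.

```python
from speechbrain.inference.interfaces import foreign_class

# Load the pretrained wav2vec2 emotion classifier (IEMOCAP labels:
# neutral, angry, happy, sad).
classifier = foreign_class(
    source="speechbrain/emotion-recognition-wav2vec2-IEMOCAP",
    pymodule_file="custom_interface.py",
    classname="CustomEncoderWav2vec2Classifier",
)

# Classify one voice-activity segment; text_lab holds the predicted label.
out_prob, score, index, text_lab = classifier.classify_file("segment_001.wav")
print(text_lab)
```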
In the first major improvement, we introduced a more sophisticated system architecture that processes acoustic and textual emotion data independently and separates customer emotions into five distinct categories: Confusion, Anger, Happy, Sadness, and Excited. This categorization provided clearer, more actionable insight into customer sentiment, helped refine the system further, and significantly boosted accuracy to 85% with our Mix-Emo-Detector-EN model.
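To illustrate how independent acoustic and textual emotion predictions can be combined, the sketch below applies a simple weighted late fusion over the five categories. The function name and the equal-weight fusion rule are hypothetical stand-ins; they are not the actual fusion logic inside Mix-Emo-Detector-EN.

```python
import numpy as np

CATEGORIES = ["Confusion", "Anger", "Happy", "Sadness", "Excited"]


def fuse_emotions(acoustic_probs, textual_probs, acoustic_weight=0.5):
    """Late fusion of two class-probability vectors over the five categories.

    Both inputs are arrays of shape (5,) summing to 1, produced by the
    (hypothetical) acoustic and textual emotion models for one segment.
    """
    fused = (acoustic_weight * np.asarray(acoustic_probs)
             + (1 - acoustic_weight) * np.asarray(textual_probs))
    return CATEGORIES[int(np.argmax(fused))]


# Example with made-up model outputs for a single segment:
acoustic = [0.10, 0.55, 0.10, 0.15, 0.10]
textual = [0.05, 0.40, 0.20, 0.25, 0.10]
print(fuse_emotions(acoustic, textual))  # -> "Anger"
```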
However, upon expanding the evaluation to 200 data points, the accuracy dropped to 64%, prompting a deeper feature analysis. It revealed that emotion predictions alone were insufficient to discriminate between the classes.
(The distribution of data across emotions based on textual × acoustic emotion data only)
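One quick way to perform this kind of feature analysis is to project the per-segment feature vectors to two dimensions and color them by label; heavily overlapping clusters suggest the features cannot discriminate between classes. The snippet below is a self-contained sketch with random placeholder data standing in for the real features and labels.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))    # placeholder for the real feature matrix
y = rng.integers(0, 5, size=200)  # placeholder for the five emotion labels

# Project to 2-D; with uninformative features, the classes overlap heavily.
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=12)
plt.title("Per-segment features projected with t-SNE, colored by emotion")
plt.show()
```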
These investigations pointed to issues with the feature representation itself, leading us to seek more effective pre-trained models for integrating acoustic and textual data (see the Emo-Box leaderboard: https://emo-box.github.io/leaderboard1.html). The combination of acoustic and textual embeddings delivered results well suited to our use case.
(The distribution of data across emotions based on textual × acoustic embeddings)
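The sketch below shows one common way to build such combined features: mean-pooled hidden states from a pretrained wav2vec 2.0 model as the acoustic embedding, concatenated with a sentence-transformer vector for the segment's transcript. The specific checkpoints are illustrative stand-ins, not the models we ultimately selected from the leaderboard.

```python
import torch
import torchaudio
from sentence_transformers import SentenceTransformer
from transformers import AutoFeatureExtractor, Wav2Vec2Model

# Illustrative checkpoints only.
extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
acoustic_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
text_model = SentenceTransformer("all-MiniLM-L6-v2")


def segment_embedding(wav_path: str, transcript: str) -> torch.Tensor:
    """Concatenated acoustic + textual embedding for one voice segment."""
    waveform, sr = torchaudio.load(wav_path)
    mono = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)

    inputs = extractor(mono.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = acoustic_model(**inputs).last_hidden_state  # (1, T, 768)
    acoustic_emb = hidden.mean(dim=1).squeeze(0)  # mean-pool over time: (768,)

    text_emb = torch.from_numpy(text_model.encode(transcript))  # (384,)
    return torch.cat([acoustic_emb, text_emb])  # (1152,)
```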
Further validation involved training a simple feed-forward model on both private and public datasets across major emotions to ensure generalization, yielding the following results:
Best weighted accuracy (WA): 71.51% | Best unweighted accuracy (UA): 70.08% | Best weighted F1 (WF1): 72.12%
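A minimal version of such a classifier, together with the three reported metrics, is sketched below. The layer sizes and input dimension (the concatenated embedding from the previous snippet) are assumptions; WA is overall accuracy, UA is the macro average of per-class recalls (balanced accuracy), and WF1 is the support-weighted F1 score.

```python
import torch.nn as nn
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score


class EmotionFFN(nn.Module):
    """Simple feed-forward classifier over concatenated embeddings."""

    def __init__(self, in_dim=1152, n_classes=5):  # 768 acoustic + 384 textual
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, n_classes),
        )

    def forward(self, x):
        return self.net(x)  # raw logits; pair with cross-entropy loss


def report(y_true, y_pred):
    """Compute the three metrics reported above."""
    wa = accuracy_score(y_true, y_pred)           # weighted accuracy
    ua = balanced_accuracy_score(y_true, y_pred)  # unweighted (per-class) accuracy
    wf1 = f1_score(y_true, y_pred, average="weighted")
    print(f"WA: {wa:.2%} | UA: {ua:.2%} | WF1: {wf1:.2%}")
```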
The evaluation confirmed the impact of the improved feature engineering, positioning our approach at rank #2 on the Emo-Box leaderboard with a model roughly 70% smaller than the leading competitor's.
(Emo-Box Leaderboard, Last updated: 2025-02-26)
The integration of our AI-driven solution for outcome determination in voice processes offers numerous benefits. Firstly, it drastically reduces the amount of manual effort required from human operators, allowing them to focus on more strategic tasks rather than routine monitoring. This not only enhances resource allocation but also opens opportunities for increased organizational efficiency and productivity. Moreover, the improved accuracy in emotion detection leads to consistent and reliable outcome determinations, boosting customer satisfaction through more responsive and empathetic customer service.
In conclusion, the successful development and testing of our Amity Voice Emotion Analysis solution represent a significant advancement in automating outcome determination in call center operations. By reducing the reliance on human analysts through the deployment of advanced AI models, businesses can achieve faster, more accurate, and consistent analysis of customer sentiment. This research holds the potential to transform the way organizations interact with their customers, leading to enhanced customer experience and operational efficiency, ultimately positioning businesses to better meet the challenges of today's dynamic and competitive marketplace.