When we experiment with Large Language Models (LLMs) in environments like Jupyter Notebooks or Google Colab, it’s easy to believe that if it works there, it should work the same in a production setting. However, the reality is a bit more complex. The non-deterministic nature of LLMs, one of their biggest hurdles, means that while a solution might work perfectly once, ensuring consistent performance over time—like a diesel engine running smoothly for years—is a whole different ball game.
This is particularly true for applications where accuracy and reliability are critical, and outputs must be in the expected range more than 99% of the time. Achieving this level of consistency requires additional effort, including more checkpoints and workflows, to ensure even a simple use case is effectively implemented in a production environment.
In this blog post, I’ll guide you through what worked for us and what we learned while implementing a customer support agent with LLMs like GPT-4 or Claude. This agent can handle various levels of support tickets, such as those for IT, HR, or procurement helpdesks. We’ll also explore how the agent categorizes tickets by their criticality and complexity, offering a glimpse into the potential of LLMs in practical, real-world applications.
Streamlining LLMs for Effective Ticket Handling
In our journey to harness the power of Large Language Models (LLMs) for customer support, we start not with a complex model like GPT-4, but with a simpler approach: a neural network-based classifier. This might seem a bit old-fashioned, especially given the rapid advances in LLMs, but it’s a crucial step toward more deterministic outcomes.
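To make this concrete, here is a minimal sketch of such an intent classifier using scikit-learn. The ticket texts, labels, and hyperparameters are illustrative assumptions, not our production setup (which used the Classifier SDK listed at the end of this post):

```python
# Minimal sketch of a neural-network intent classifier for support tickets.
# Illustrative only: tickets, labels, and hyperparameters are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set: (ticket text, support level)
tickets = [
    "Password reset link not arriving in my inbox",
    "VPN drops every few minutes when on home network",
    "Need a new laptop provisioned for a contractor",
    "Production database failing over repeatedly",
]
levels = ["L1", "L2", "L1", "L3"]

# TF-IDF features feeding a small feed-forward network
classifier = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
)
classifier.fit(tickets, levels)

print(classifier.predict(["Cannot log in to the HR portal"]))  # e.g. ['L1']
```

With a real historical ticket corpus behind it, a lightweight model like this gives a stable, repeatable routing decision before any LLM is involved.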
Initially, we use this classifier to discern the intent of support tickets, categorizing them into levels such as L1 and L2. Once a ticket is categorized, the real work begins: integrating LLMs to generate responses. Our initial experiments paired a non-finetuned, general-purpose LLM like GPT-4 with a basic reference table served through a simple RAG (Retrieval-Augmented Generation) pipeline, with little success. Response relevance fell short of expectations, hallucinations were frequent, and accuracy was below 40% when tested on a sample set of 50 queries. To improve this, we expanded the dataset the RAG function could reference. Accuracy rose, but still stayed below 60%.
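For reference, that first iteration looked roughly like the sketch below: embed a small reference table, retrieve the closest entries by cosine similarity, and stuff them into a GPT-4 prompt. It uses the OpenAI Python client; the reference table, model names, and retrieval depth are assumptions for illustration:

```python
# Naive RAG sketch: cosine-similarity retrieval over a reference table,
# then a single GPT-4 call. All data and model names are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

reference_table = [
    "To reset a password, direct the user to the self-service portal.",
    "VPN issues: verify client version, then escalate to network team.",
    "Hardware requests require a manager-approved procurement ticket.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vectors = embed(reference_table)

def answer(query, top_k=2):
    q = embed([query])[0]
    # Cosine similarity of the query against every reference entry
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    context = "\n".join(reference_table[i] for i in np.argsort(sims)[::-1][:top_k])
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content

print(answer("User cannot connect to VPN from home"))
```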
The game-changer came when we started fine-tuning the LLMs with historical ticket data (prompt-completion datasets). Fine-tuning lifted accuracy past 70%, but we encountered a new challenge: L2 responses were sometimes suggested for L1 tickets, and vice versa. To address this, we created specialized fine-tuned LLM instances for each ticket level, each supplemented with rich, level-specific RAG content. This strategy dramatically boosted the accuracy of responses, often surpassing 90%.
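Once the classifier and the per-level models exist, the routing layer is simple. The sketch below shows the idea; the prompt-completion JSONL example and the fine-tuned model IDs are placeholders, not real model names:

```python
# Routing sketch: one fine-tuned model per ticket level, each paired with
# level-specific retrieval. Model IDs below are placeholders.
import json

# One historical ticket as a prompt-completion training example;
# the fine-tuning file contains one such JSON object per line.
example = {
    "prompt": "VPN drops every few minutes on home wifi",
    "completion": "Ask the user to update the VPN client, then retest the connection.",
}
print(json.dumps(example))

# Hypothetical fine-tuned model per support level
LEVEL_MODELS = {
    "L1": "ft:placeholder-l1-support",
    "L2": "ft:placeholder-l2-support",
    "L3": "ft:placeholder-l3-support",
}

def route(ticket_text, classify, retrieve, generate):
    level = classify(ticket_text)            # e.g. "L2" from the classifier
    context = retrieve(ticket_text, level)   # level-specific RAG content
    return generate(LEVEL_MODELS[level], ticket_text, context)
```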
Vector search, the retrieval backbone of RAG, enriched the responses further, allowing us to pull in more detailed instructions and information so the final response was both accurate and comprehensive. Through this multi-layered approach, we’re able to leverage the strengths of LLMs while taming their non-deterministic nature, ensuring that our customer support responses are both precise and helpful.
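Each of these iterations was judged against small labeled query sets, like the 50-query sample mentioned above. A bare-bones harness of the following shape is enough to track accuracy between iterations; `answer` and `is_acceptable` are placeholders for whatever generation and grading logic you use:

```python
# Minimal accuracy harness over a labeled sample of support queries.
# `answer` and `is_acceptable` are placeholders for your own pipeline.

def evaluate(samples, answer, is_acceptable):
    """samples: list of (query, expected) pairs; returns fraction passing."""
    hits = 0
    for query, expected in samples:
        response = answer(query)
        if is_acceptable(response, expected):
            hits += 1
    return hits / len(samples)

samples = [("VPN keeps dropping", "escalate to network team")]  # toy example
accuracy = evaluate(samples,
                    answer=lambda q: "Escalate to network team.",
                    is_acceptable=lambda r, e: e in r.lower())
print(f"accuracy: {accuracy:.0%}")
```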
RLHF and RLAIF in improving the output accuracy
The culmination of our journey with non-deterministic Large Language Models (LLMs) in customer support leads to a versatile system where responses generated by the AI can be utilized by either human agents or further processed in AI-driven workflows. This flexibility is crucial, as it allows for manual intervention or automated handling, depending on the scenario. The key to seamless integration lies in the subtle formatting of the output, a task easily managed at the prompt level.
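Because the same response may go to a human agent or straight into an automated workflow, it helps to request a structured envelope directly in the prompt. A minimal sketch of that idea, where the JSON schema itself is an assumption for illustration:

```python
# Prompt-level output formatting: ask for a fixed JSON envelope so both
# humans and downstream automations can consume the response. The schema
# here is illustrative.
import json

SYSTEM_PROMPT = """You are an L1 IT support agent. Respond ONLY with JSON:
{"summary": "<one line>", "steps": ["<step>", ...], "escalate": true|false}"""

def parse_response(raw_text):
    """Parse the model output, falling back to manual review on bad JSON."""
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError:
        # Fail safe: route unparseable output to a human agent
        return {"summary": raw_text, "steps": [], "escalate": True}

print(parse_response('{"summary": "Reset password", "steps": ["Open portal"], "escalate": false}'))
```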
However, what truly sets this system apart is the dynamic feedback mechanism. Users can rate responses with a thumbs up or down. We also explored presenting users with multiple response options, letting them choose the most effective one. This continuous stream of user feedback is funneled back into the system, enriching the training dataset through Reinforcement Learning from Human Feedback (RLHF). This iterative process drives ongoing improvement of the fine-tuned LLMs.
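The feedback plumbing itself can stay simple. The sketch below records thumbs up/down ratings, and, when multiple candidates were shown, which one the user picked, as preference records for later fine-tuning. Field names and the JSONL store are assumptions:

```python
# Capturing user feedback as preference records for RLHF-style retraining.
# Field names and the JSONL store are illustrative.
import json
import time

FEEDBACK_LOG = "feedback.jsonl"

def record_feedback(ticket_id, prompt, chosen, rejected=None, rating=None):
    """rating: +1 thumbs up / -1 thumbs down; rejected: unpicked candidates."""
    record = {
        "ticket_id": ticket_id,
        "prompt": prompt,
        "chosen": chosen,            # the response the user approved or picked
        "rejected": rejected or [],  # alternatives the user passed over
        "rating": rating,
        "ts": time.time(),
    }
    with open(FEEDBACK_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")

record_feedback("T-1042", "VPN keeps dropping",
                chosen="Update the VPN client, then retest.", rating=1)
```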
In parallel, we introduced Reinforcement Learning from AI Feedback (RLAIF), which further refines our approach. This feedback comes in two forms. The first is error codes generated by AI agents when a response fails in downstream applications. The second, more proactive method runs the AI-generated responses through predefined test cases, generating feedback automatically. Both streams are then used to further refine the dataset and improve the fine-tuned models.
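A sketch of how both forms of AI feedback can be logged: each response is graded by predefined test cases, and any downstream error code is appended as an extra failure. The specific checks and error-code format are illustrative assumptions:

```python
# RLAIF sketch: auto-grade responses with predefined test cases and log
# the verdicts as AI feedback. Checks and error codes are illustrative.
import json

def check_has_steps(response):
    """Crude check that the response contains actionable steps."""
    return "step" in response.lower() or any(c.isdigit() for c in response)

def check_no_secrets(response):
    """Crude check that no credentials are echoed back."""
    return "password:" not in response.lower()

TEST_CASES = [("must include actionable steps", check_has_steps),
              ("must not leak credentials", check_no_secrets)]

def ai_feedback(prompt, response, downstream_error=None):
    failures = [name for name, check in TEST_CASES if not check(response)]
    if downstream_error:                 # e.g. error code from a failing AI agent
        failures.append(f"downstream:{downstream_error}")
    record = {"prompt": prompt, "response": response,
              "passed": not failures, "failures": failures}
    with open("rlaif_feedback.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

print(ai_feedback("VPN drops", "1. Update client 2. Reconnect"))
```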
After numerous experiments and iterations, we continued to enhance the system through error correction and feedback-loop improvements over the next six months. Our journey illustrates how a customer support agent, powered by non-deterministic LLMs, can be both accurate and useful.
For those interested in exploring this technology, we provide a list of powerful yet simple-to-integrate LLM SDKs.
Docs – https://docs.lyzr.ai/homepage
The Lyzr.ai SDKs used for the customer support agent were:
- ChatBot SDK
- Finetuning SDK (in private preview)
- COTFS Prompt SDK
- SearchBot SDK (for retrieval)
- RLHF SDK (in private preview)
- RLAIF SDK (in private preview)
- Classifier SDK (in private preview)
We encourage you to build a customer support agent for your use case and share your experiences. For enterprises, we offer feature-rich, more powerful, and versatile LLM SDKs designed for enterprise-grade deployments. The Enterprise SDKs of Lyzr come with the Lyzr Enterprise Hub, a comprehensive monitoring tool for your AI applications covering Generative AI application performance, LLM requests, LLM spending, queries, and logs. The added security layer of secret-key activation for these SDKs ensures enhanced control over your enterprise workloads.
Read more about Lyzr – https://www.lyzr.ai/
Book Demo – https://www.lyzr.ai/#BookDemo