Winner of the Accenture Gen AI challenge.🏆

Mixture of Expert Agents – Lyzr AI

Mixture of Expert Agents for Multi-Persona Chatbots

Abstract

Generative AI-powered chatbots have become the most widely adopted application of Large Language Models (LLMs) by organizations across various industries. These organization-specific chatbots leverage a technique known as Retrieval Augmented Generation (RAG), which retrieves relevant passages from the organization’s own data and supplies them to the LLM, enabling the chatbot to respond more dynamically to user inputs.

Despite the advanced capabilities offered by RAG, it has been observed that most chatbots are still designed with a general-purpose focus. This means they may not be able to effectively address the increasingly complex and specialized needs of the customer.

Recognizing this gap in the capabilities of existing AI chatbots, the research team at Lyzr has proposed a new chatbot architecture, named the ‘Mixture of Expert Agents.’ This architecture allows a single chatbot to embody multiple personas, with each persona acting as an expert in a distinct field. The name echoes the Mixture of Experts (MoE) architecture in machine learning, which uses specialized neural network models, termed ‘experts,’ together with a gating network that selects among them, to improve model capacity and efficiency.

The concept behind this is simple yet powerful: by having multiple expert agents within a single chatbot, it can handle a wide range of customer queries across different topics. Each agent specializes in a certain area, allowing the chatbot to provide tailored responses and solutions to a variety of customer needs.

In essence, the ‘Mixture of Expert Agents’ architecture empowers chatbots to deliver a more comprehensive and personalized customer service experience, capable of addressing complex queries with expert precision.

The Problem

Chatbots are the easiest entry point for adopting Generative AI in an organization. However, general-purpose chatbots with RAG often fail to provide accurate responses to all customer queries. Let’s examine an example.

  • Chatbot Persona: Insurance Claims Assistant
  • Base Prompt: You are an expert claims assistant designed to support in claims filing process. Answer claims related queries using the claims processing details provided. Answer policy related queries using the policy related details provided.
  • Customer Query: ‘How to file a claim?’
  • Chatbot Response (Correct Response): “1. Obtain all original hospitalization documents such as discharge summary, diagnostic and laboratory reports, prescriptions, payment receipts, etc. 2. Download the health insurance claim form from the website and fill in all the required details. 3. Ensure the hospital attests all the documents with seal and signature. 4. Submit all the hospitalization-related documents along with the health insurance claim form.”

If a customer decides to ask a different question, the chatbot may not have the correct context. Even if it does, it might struggle to retrieve the most relevant top-k results to answer the query. You might wonder why a customer would ask a question that seems out of context. Well, who are we to control customer behavior? Customers tend to pose queries that they believe the chatbot can answer.

  • Chatbot Persona: Insurance Claims Assistant
  • Base Prompt: You are an expert claims assistant designed to support in claims filing process. Answer claims related queries using the claims processing details provided. Answer policy related queries using the policy related details provided.
  • Customer Query: ‘How do I switch policy’
  • Chatbot Response (Wrong Response): “To switch policy, you will have to obtain all original hospitalization documents such as discharge summary, diagnostic and laboratory reports, prescriptions, payment receipts, etc.”

When a context switch occurs, the instructions in the prompt may not extract the correct data due to inaccurate retrieval from the customer data. While this behavior may not pose much of a problem in a prototype, it is unreliable in a production scenario where accuracy is paramount.

So, how is a multi-agent chatbot architecture different from a single-agent chatbot? By pairing specialized expert agents with a gating step that decides which expert receives each query, multi-agent systems can better handle variations in customer queries, improving stability and computational efficiency.

Single Agent Chatbot Architecture
  • Customer data is passed to the chatbot through RAG
  • Customer query is interpreted by the retrieval engine of the chatbot
  • The top-k results are fetched, reranked, and passed to the LLM, which uses them to generate the response
  • Agent memory holds the chat history and passes it to the LLM on every call
  • Features like RLHF and RLAIF can enhance the chatbot’s performance over time
  • Single-agent chatbots often struggle with complex queries, since a single prompt and retrieval setup has to cover every topic
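
The single-agent flow listed above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Lyzr's implementation: `retrieve_top_k` ranks chunks by plain word overlap in place of an embedding search and reranker, `generate` is a hypothetical stand-in for an LLM call, and the class and sample chunks are invented for the example. RLHF and RLAIF are left out.

```python
import re
from typing import Callable, List


def words(text: str) -> set:
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve_top_k(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Rank chunks by word overlap with the query and keep the best k."""
    q = words(query)
    ranked = sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)
    return ranked[:k]


class SingleAgentChatbot:
    def __init__(self, base_prompt: str, chunks: List[str],
                 generate: Callable[[str], str]):
        self.base_prompt = base_prompt
        self.chunks = chunks          # customer data made available through RAG
        self.generate = generate      # hypothetical LLM call: prompt -> text
        self.memory: List[str] = []   # chat memory, sent on every call

    def ask(self, query: str) -> str:
        context = retrieve_top_k(query, self.chunks)
        prompt = "\n".join([self.base_prompt, "Context:", *context,
                            "History:", *self.memory, f"Query: {query}"])
        answer = self.generate(prompt)
        self.memory += [f"user: {query}", f"assistant: {answer}"]
        return answer


# Illustrative data only; the chunks are made up for this sketch.
chunks = [
    "To file a claim, download the claim form and submit it with the hospital documents.",
    "Policies can be switched at renewal by contacting your insurer.",
]
bot = SingleAgentChatbot("You are an expert claims assistant.", chunks,
                         generate=lambda prompt: "(LLM answer goes here)")
print(bot.ask("How to file a claim?"))
```

In this toy scorer, the query ‘How do I switch policy’ shares no exact word with either chunk, so the claims chunk still comes back first. A real embedding search would likely do better, but the sketch shows how one base prompt and one index must absorb every topic.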

Multi Agent Chatbot Architecture
  • The customer query is interpreted by a query resolver
  • The resolver holds the metadata of the individual expert agents
  • The resolver analyzes the input query using a gating network to quickly decide on which expert agent to call
  • The right expert agent is invoked and it answers the query accurately
  • Each agent has its own memory and the entire architecture holds a universal memory

How do the prompts look in a multi-agent chatbot architecture?

Expert Agent and Persona in Base Prompt
  • Claims Filing Agent: You are an expert claims filing agent designed to help customers with filing claims successfully.
  • Claims Status Check Agent: You are an expert claims status check agent designed to help customers check the status of their claims.
  • Policy Information Agent: You are an expert policy information agent designed to help customers with policy related information including policy comparison, feature explanation.

Query Router and Load Balancing

The major roadblock we encountered was with the query router algorithm. While we introduced the ‘decorator’ way of calling the agent, the router cannot follow a simple ‘round robin’ model of calling agents. After many experiments, we figured out that the best way to call the agents is to follow one of the two methods.

  1. Manual agent calling
  2. LLM Router

Another aspect of the query router is load balancing. Where more than one agent can serve a query, spreading the work keeps queries roughly evenly distributed and prevents the inefficiencies that arise when a few agents are overloaded while others are underutilized.

In the manual agent calling model, the agents themselves hold the agent selection logic. Each agent has rules on when to select another agent. This makes agent calling more deterministic, so the model suits workflows that require more precision.
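
A minimal sketch of manual agent calling, assuming the rules are simple keyword triggers. The `RuleBasedAgent` class, the `route_manually` helper, and the rule tables are hypothetical and exist only to show agents carrying their own handoff logic.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RuleBasedAgent:
    name: str
    # Hand-written rules: trigger phrase -> agent that should take over.
    handoff_rules: Dict[str, str] = field(default_factory=dict)

    def next_agent(self, query: str) -> Optional[str]:
        """Return the agent to hand off to, or None to keep the query."""
        lowered = query.lower()
        for phrase, target in self.handoff_rules.items():
            if phrase in lowered:
                return target
        return None


def route_manually(agents: Dict[str, RuleBasedAgent], start: str,
                   query: str, max_hops: int = 3) -> str:
    """Follow handoff rules from the starting agent; cap hops to avoid loops."""
    current = start
    for _ in range(max_hops):
        target = agents[current].next_agent(query)
        if target is None or target == current:
            return current
        current = target
    return current


# Hypothetical rule tables for the three insurance agents.
agents = {
    "claims_filing": RuleBasedAgent(
        "claims_filing", {"status": "claims_status", "policy": "policy_information"}),
    "claims_status": RuleBasedAgent(
        "claims_status", {"file a claim": "claims_filing", "policy": "policy_information"}),
    "policy_information": RuleBasedAgent(
        "policy_information", {"file a claim": "claims_filing", "status": "claims_status"}),
}
print(route_manually(agents, "claims_filing", "How do I switch policy"))
```

Because every handoff is an explicit rule, the same query always follows the same path, which is the deterministic behavior this model is chosen for.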

In the LLM router model, the router holds the metadata of the agents (in Lyzr’s case, the decorators) and decides dynamically which agent to call. The LLM picks the agent based on the customer’s query and the available metadata.
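
The LLM router idea can be sketched as follows. The `expert_agent` decorator and `REGISTRY` are hypothetical and are not Lyzr's actual decorators; `llm` is any callable that takes a prompt and returns text, standing in for a real model call. The sketch also assumes the router should fall back to a default agent when the model replies with a name it does not know.

```python
from typing import Callable, Dict, Tuple

# name -> (description, handler): the router's view of the available agents
REGISTRY: Dict[str, Tuple[str, Callable[[str], str]]] = {}


def expert_agent(name: str, description: str):
    """Hypothetical decorator that registers a handler and its metadata."""
    def register(handler: Callable[[str], str]):
        REGISTRY[name] = (description, handler)
        return handler
    return register


@expert_agent("claims_filing", "Helps customers file claims successfully.")
def claims_filing(query: str) -> str:
    return f"[claims_filing] {query}"


@expert_agent("policy_information", "Explains and compares policies.")
def policy_information(query: str) -> str:
    return f"[policy_information] {query}"


def llm_route(query: str, llm: Callable[[str], str], default: str) -> str:
    """Ask the LLM for an agent name; fall back if the reply is unknown."""
    catalog = "\n".join(f"- {n}: {d}" for n, (d, _) in REGISTRY.items())
    prompt = ("Choose the best agent for the query. Reply with the name only.\n"
              f"Agents:\n{catalog}\nQuery: {query}")
    choice = llm(prompt).strip()
    return choice if choice in REGISTRY else default


def answer(query: str, llm: Callable[[str], str],
           default: str = "claims_filing") -> str:
    _, handler = REGISTRY[llm_route(query, llm, default)]
    return handler(query)


# A fake LLM that always names the policy agent, for demonstration only.
print(answer("How do I switch policy", llm=lambda prompt: "policy_information"))
```

Validating the model's reply against the registry keeps a malformed or invented agent name from breaking the call.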

Introducing the ‘hold and pass’ model for agent interactions and training process

Another significant challenge was ensuring consistency in the agent conversations when a user is actively engaging with the meta-agent. Here, the term ‘meta-agent’ refers to the entire multi-agent chatbot architecture. To ensure consistent interactions and smooth transfer of control between agents, we developed the ‘hold and pass’ model, similar to the exchange of the ball in basketball.

The ‘hold and pass’ model significantly improves the training process by allowing each agent to specialize in specific tasks or data subsets, leading to better performance and stability.

Without ‘hold and pass’: The query resolver has to go through each agent to determine which one should take up the incoming query. The LLM then generates a response to the user query using the agent’s own memory and the universal memory of the meta-agent.

With ‘hold and pass’: The agent that handled the last known user query continues to hold the chat until it decides to pass the ball back to the query resolver. It passes control when it realizes that the incoming query is better answered by another expert agent within the meta-agent group.
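
A small sketch of the ‘hold and pass’ idea, under simplifying assumptions: each agent decides to pass by checking for a few trigger terms (`pass_signals`), whereas a real agent would more likely make that judgment with an LLM. The class names, the `resolve` function, and the trigger terms are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set


@dataclass
class Agent:
    name: str
    pass_signals: Set[str]  # terms suggesting another expert fits better

    def should_pass(self, query: str) -> bool:
        lowered = query.lower()
        return any(term in lowered for term in self.pass_signals)


class HoldAndPassChat:
    def __init__(self, agents: Dict[str, Agent], resolve: Callable[[str], str]):
        self.agents = agents
        self.resolve = resolve              # query resolver: query -> agent name
        self.holder: Optional[str] = None   # agent currently holding the chat
        self.resolver_calls = 0

    def handle(self, query: str) -> str:
        """Return the agent that takes this query."""
        if self.holder is None or self.agents[self.holder].should_pass(query):
            # No holder yet, or the holder passed the ball back to the resolver.
            self.holder = self.resolve(query)
            self.resolver_calls += 1
        return self.holder


def resolve(query: str) -> str:
    """Toy resolver used only for this example."""
    lowered = query.lower()
    if "policy" in lowered:
        return "policy_information"
    if "status" in lowered:
        return "claims_status"
    return "claims_filing"


agents = {
    "claims_filing": Agent("claims_filing", {"policy", "status"}),
    "claims_status": Agent("claims_status", {"policy", "file a claim"}),
    "policy_information": Agent("policy_information", {"claim"}),
}
chat = HoldAndPassChat(agents, resolve)
turns = ["How to file a claim?", "How long does it take?", "How do I switch policy"]
print([chat.handle(q) for q in turns])
```

The follow-up question stays with the claims filing agent without a resolver call, and the resolver is consulted again only when the topic switches to policies.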

Agent Memory, Sparse Models & Universal Memory

The mixture of expert agents also introduces another interesting concept in how the meta-agent handles agent memory. With multiple agents, you are free to give each individual agent its own mini-memory. This multi-agent architecture enhances the system’s capacity, allowing it to understand and express more complex patterns. The combination of all these mini-memories, along with the logs of non-agent components (like the resolver and router), becomes the universal memory of the meta-agent.
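
One way to picture this memory layout is sketched below, assuming every entry is stamped with a shared sequence number so that the per-agent mini-memories and the component logs can be merged into one ordered view. The `MetaAgentMemory` class and its method names are hypothetical.

```python
from collections import defaultdict
from itertools import count
from typing import Dict, List, Tuple

Entry = Tuple[int, str]  # (sequence number, text)


class MetaAgentMemory:
    def __init__(self) -> None:
        self._seq = count(1)  # shared counter that orders all entries
        self.agent_memories: Dict[str, List[Entry]] = defaultdict(list)
        self.component_logs: Dict[str, List[Entry]] = defaultdict(list)

    def remember(self, agent: str, text: str) -> None:
        """Add an entry to one agent's mini-memory."""
        self.agent_memories[agent].append((next(self._seq), text))

    def log(self, component: str, text: str) -> None:
        """Add a log entry for a non-agent component such as the router."""
        self.component_logs[component].append((next(self._seq), text))

    def universal_memory(self) -> List[str]:
        """Merge all mini-memories and component logs in the order recorded."""
        merged = []
        for store in (self.agent_memories, self.component_logs):
            for source, entries in store.items():
                merged.extend((seq, source, text) for seq, text in entries)
        return [f"{source}: {text}" for _, source, text in sorted(merged)]


memory = MetaAgentMemory()
memory.log("router", "selected claims_filing")
memory.remember("claims_filing", "user asked how to file a claim")
memory.log("router", "selected policy_information")
memory.remember("policy_information", "user asked how to switch policy")
print(memory.universal_memory())
```

Each agent still reads only its own list when answering, while the merged view is available to the meta-agent as a whole.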
