Mixture of Expert Agents – Lyzr AI

Mixture of Expert Agents for Multi-Persona Chatbots

Abstract

Generative AI-powered chatbots have become the most widely adopted application of Large Language Models (LLMs) by organizations across various industries. These organization-specific chatbots rely on a technique known as Retrieval Augmented Generation (RAG), which lets them ground their responses in the organization's own data instead of relying only on what the model learned during training.

Despite the advanced capabilities offered by RAG, it has been observed that most chatbots are still designed with a general-purpose focus. This means they may not be able to effectively address the increasingly complex and specialized needs of the customer.

Recognizing this gap in the capabilities of existing AI chatbots, the research team at Lyzr has proposed a new chatbot architecture named the ‘Mixture of Expert Agents.’ This architecture allows a single chatbot to embody multiple personas, with each persona acting as an expert in a distinct field. The name echoes the Mixture of Experts (MoE) architecture in machine learning, which uses specialized neural network models, termed ‘experts,’ to improve model capacity and efficiency.

The concept behind this is simple yet powerful: with multiple expert agents inside a single chatbot, the chatbot can handle a wide range of customer queries across different topics. Each agent specializes in a certain area, allowing the chatbot to provide tailored responses and solutions to a variety of customer needs.

In essence, the ‘Mixture of Expert Agents’ architecture empowers chatbots to deliver a more comprehensive and personalized customer service experience, capable of addressing complex queries with expert precision.

The Problem

Chatbots are the easiest entry point for adopting Generative AI in an organization. However, general-purpose chatbots with RAG often fail to provide accurate responses to all customer queries. Let’s examine an example.

Chatbot Persona: Insurance Claims Assistant
Base Prompt: You are an expert claims assistant designed to support in claims filing process. Answer claims related queries using the claims processing details provided. Answer policy related queries using the policy related details provided.
Customer Query: ‘How to file a claim?’
Chatbot Response (correct): “1. Obtain all original hospitalization documents such as discharge summary, diagnostic and laboratory reports, prescriptions, payment receipts, etc. 2. Download the health insurance claim form from the website and fill in all the required details. 3. Ensure the hospital attests all the documents with seal and signature. 4. Submit all the hospitalization-related documents along with the health insurance claim form.”

If a customer decides to ask a different question, the chatbot may not have the correct context. Even if it does, it might struggle to retrieve the most relevant top-k results to answer the query. You might wonder why a customer would ask a question that seems out of context. Well, who are we to control customer behavior? Customers tend to pose queries that they believe the chatbot can answer.

Chatbot Persona: Insurance Claims Assistant
Base Prompt: You are an expert claims assistant designed to support in claims filing process. Answer claims related queries using the claims processing details provided. Answer policy related queries using the policy related details provided.
Customer Query: ‘How do I switch policy’
Chatbot Response (wrong): “To switch policy, you will have to obtain all original hospitalization documents such as discharge summary, diagnostic and laboratory reports, prescriptions, payment receipts, etc.”

When a context switch occurs, the instructions in the prompt may not extract the correct data due to inaccurate retrieval from the customer data. While this behavior may not pose much of a problem in a prototype, it is unreliable in a production scenario where accuracy is paramount.

So, how is a multi-agent chatbot architecture different from a single-agent chatbot? By pairing specialized expert agents with a gating step that decides which expert should take each query, multi-agent systems can better handle variations in customer queries, improving stability and computational efficiency.

Single-Agent Chatbot Architecture

Customer data is passed to the chatbot through RAG
Customer query is interpreted by the retrieval engine of the chatbot
The top-k results are fetched, reranked, and passed to the LLM as context to generate the response
Single-agent chatbots often face limitations in handling complex queries due to the constraints in model parameters.
Agent memory holds the chat memory and passes it to LLMs during every call
Features like RLHF and RLAIF can enhance the chatbot performance over time
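
The flow listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Lyzr's implementation: the word-overlap score stands in for a real retrieval and reranking engine, and `SingleAgentChatbot`, `retrieve_top_k`, and `stub_llm` are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def overlap_score(query: str, passage: str) -> int:
    """Toy relevance score: how many query words also appear in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def retrieve_top_k(query: str, passages: List[str], k: int = 2) -> List[str]:
    """Rank passages by the toy score and keep the k best (stand-in for retrieve + rerank)."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:k]


@dataclass
class SingleAgentChatbot:
    base_prompt: str
    passages: List[str]                 # customer data made available through RAG
    llm: Callable[[str], str]           # assumption: any function mapping a prompt to text
    memory: List[str] = field(default_factory=list)

    def ask(self, query: str, k: int = 2) -> str:
        context = retrieve_top_k(query, self.passages, k)
        # One persona, one prompt: chat memory and retrieved context are sent on every call.
        prompt = "\n".join([self.base_prompt, *self.memory, *context, f"User: {query}"])
        answer = self.llm(prompt)
        self.memory += [f"User: {query}", f"Assistant: {answer}"]
        return answer


def stub_llm(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    return "stubbed answer"
```

Because there is a single base prompt and a single retrieval index, every query is answered from the same persona, whichever passages happen to score highest.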

Multi-Agent Chatbot MoE Architecture

The customer query is interpreted by a query resolver
The resolver holds the metadata of the individual expert agents
The resolver analyzes the input query using a gating network to quickly decide on which expert agent to call
The right expert agent is invoked and it answers the query accurately
Each agent has its own memory and the entire architecture holds a universal memory
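
A minimal sketch of this resolver, assuming a keyword-overlap score in place of a real gating network. The `keywords` metadata, the class names, and the scoring rule are all hypothetical; the agent names follow the insurance example used in this post.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ExpertAgent:
    name: str
    persona: str
    keywords: Set[str]                       # hypothetical metadata held by the resolver
    memory: List[str] = field(default_factory=list)

    def answer(self, query: str) -> str:
        self.memory.append(query)            # each agent keeps its own memory
        return f"[{self.name}] handling: {query}"


class QueryResolver:
    """Scores every agent's metadata against the query and invokes the best match."""

    def __init__(self, agents: List[ExpertAgent]) -> None:
        self.agents = agents

    def select(self, query: str) -> ExpertAgent:
        words = set(query.lower().replace("?", "").split())
        # Stand-in for the gating step; ties fall back to the first agent listed.
        return max(self.agents, key=lambda agent: len(words & agent.keywords))

    def handle(self, query: str) -> str:
        return self.select(query).answer(query)


agents = [
    ExpertAgent("claims_filing", "You are an expert claims filing agent.", {"file", "filing", "submit"}),
    ExpertAgent("claims_status", "You are an expert claims status check agent.", {"status", "track"}),
    ExpertAgent("policy_information", "You are an expert policy information agent.", {"policy", "switch", "compare"}),
]
resolver = QueryResolver(agents)
```

With this setup, `resolver.select("How do I switch policy?")` picks the policy information agent, which is the query the single-agent chatbot answered incorrectly in the earlier example.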

How do the prompts look in a multi-agent chatbot architecture?

Expert Agent: Claims Filing Agent
Persona in Base Prompt: You are an expert claims filing agent designed to help customers with filing claims successfully.

Expert Agent: Claims Status Check Agent
Persona in Base Prompt: You are an expert claims status check agent designed to help customers check the status of their claims.

Expert Agent: Policy Information Agent
Persona in Base Prompt: You are an expert policy information agent designed to help customers with policy related information including policy comparison, feature explanation.

Query Router and Load Balancing

The major roadblock we encountered was with the query router algorithm. Although we introduced the ‘decorator’ way of calling an agent, the router cannot simply call agents in a ‘round robin’ fashion, because the right agent depends on the content of the query. After many experiments, we found that the best way to call the agents is to follow one of two methods:

Manual agent calling
LLM Router

An important aspect of the query router is load balancing, which ensures that all agents handle a roughly equal share of queries. This prevents inefficiencies that arise when a few agents are overloaded while others are underutilized.

In the manual agent calling model, the agents themselves hold the agent selection logic. Each agent has rules for when to hand over to another agent. This makes agent calling more deterministic, so the model suits workflows that require more precision.
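
One way to picture manual agent calling is a rule table owned by each agent. The sketch below is an assumption about how such rules might be written, not Lyzr's actual rule format; `HANDOFF_RULES`, `next_agent`, and the trigger phrases are hypothetical.

```python
from typing import Dict, List, Tuple

# For each agent: an ordered list of (trigger phrase, agent to hand over to).
# These rules are illustrative only.
HANDOFF_RULES: Dict[str, List[Tuple[str, str]]] = {
    "claims_filing": [("status", "claims_status"), ("policy", "policy_information")],
    "claims_status": [("file a claim", "claims_filing"), ("policy", "policy_information")],
    "policy_information": [("file a claim", "claims_filing"), ("status", "claims_status")],
}


def next_agent(current: str, query: str) -> str:
    """Apply the current agent's own rules; the first matching trigger wins."""
    text = query.lower()
    for trigger, target in HANDOFF_RULES.get(current, []):
        if trigger in text:
            return target
    return current  # no rule fired, so the current agent keeps the query
```

Since no LLM is involved in the selection, the same query always takes the same path, which is where the deterministic behavior comes from.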

In the LLM router model, the router holds the metadata of the agents (in Lyzr’s case, the decorators) and decides dynamically which agent to call. The LLM chooses the agent based on the customer’s query and the available metadata.
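
The following sketch shows the general shape of an LLM router that reads agent metadata from decorators. It does not reproduce Lyzr's decorator API: `expert_agent`, `AGENT_REGISTRY`, `build_router_prompt`, and `route` are hypothetical names, and `llm` is assumed to be any callable that takes a prompt and returns text.

```python
from typing import Any, Callable, Dict

AGENT_REGISTRY: Dict[str, Dict[str, Any]] = {}


def expert_agent(name: str, description: str):
    """Hypothetical decorator: registers an agent function together with its metadata."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        AGENT_REGISTRY[name] = {"description": description, "handler": func}
        return func
    return register


@expert_agent("claims_filing", "Helps customers file claims successfully.")
def claims_filing(query: str) -> str:
    return f"claims_filing: {query}"


@expert_agent("policy_information", "Explains and compares policies.")
def policy_information(query: str) -> str:
    return f"policy_information: {query}"


def build_router_prompt(query: str) -> str:
    """Turn the registered metadata into a routing prompt for the LLM."""
    lines = [f"- {name}: {meta['description']}" for name, meta in AGENT_REGISTRY.items()]
    return (
        "Pick the single best agent for the customer query.\n"
        "Agents:\n" + "\n".join(lines) + f"\nQuery: {query}\n"
        "Reply with the agent name only."
    )


def route(query: str, llm: Callable[[str], str], default: str = "claims_filing") -> str:
    choice = llm(build_router_prompt(query)).strip()
    if choice not in AGENT_REGISTRY:      # guard against a reply outside the registry
        choice = default
    return AGENT_REGISTRY[choice]["handler"](query)
```

Validating the LLM's reply against the registry is a design choice made for this sketch: it keeps a malformed or unexpected answer from crashing the router, at the cost of silently falling back to a default agent.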

Introducing ‘hold and pass’ model for agent interactions and training process

Another significant challenge was ensuring consistency in the agent conversations when a user is actively engaging with the meta agent. Here, the term ‘meta agent’ refers to the entire multi-agent chatbot architecture. To ensure consistent interactions and smooth transfer of control between agents, we developed the ‘hold and pass’ model, similar to the exchange of the ball in basketball.

The ‘hold and pass’ model significantly improves the training process by allowing each agent to specialize in specific tasks or data subsets, leading to better performance and stability.

Without ‘hold and pass’: The query resolver has to go through each agent to determine which agent should take up the incoming query. Then, using the agent’s own memory and the universal memory of the meta-agent, the LLM generates a response to the user query.

With ‘hold and pass’: The agent that handled the last known user query continues to hold the chat until it decides to pass the ball back to the query resolver. When does it decide to pass control to the query resolver? When the agent realizes that the incoming query is better answered by another expert agent within the meta-agent group.
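
A minimal sketch of the ‘hold and pass’ control flow, under a simplifying assumption: an agent ‘realizes’ it should pass when none of its topic words appear in the query. The `MetaAgent` class, the topic sets, and the `resolver_calls` counter are hypothetical and exist only to make the behavior visible.

```python
from typing import Dict, Optional, Set


class MetaAgent:
    def __init__(self, agent_topics: Dict[str, Set[str]]) -> None:
        self.agent_topics = agent_topics
        self.holder: Optional[str] = None   # agent currently holding the chat
        self.resolver_calls = 0             # how often the resolver had to step in

    def _resolve(self, words: Set[str]) -> str:
        """Query resolver: compare the query against every agent's topics."""
        self.resolver_calls += 1
        return max(self.agent_topics, key=lambda name: len(words & self.agent_topics[name]))

    def _can_answer(self, agent: str, words: Set[str]) -> bool:
        """Simplified stand-in for the agent judging whether the query is still its own."""
        return bool(words & self.agent_topics[agent])

    def handle(self, query: str) -> str:
        words = set(query.lower().replace("?", "").split())
        if self.holder is None or not self._can_answer(self.holder, words):
            # Pass: control goes back to the resolver, which picks the next holder.
            self.holder = self._resolve(words)
        # Hold: otherwise the previous agent keeps the conversation.
        return self.holder


meta = MetaAgent({
    "claims_filing": {"file", "documents", "form"},
    "claims_status": {"status", "track"},
    "policy_information": {"policy", "switch", "compare"},
})
first = meta.handle("How do I file a claim?")       # resolver picks an agent
second = meta.handle("Which documents do I need?")  # same agent holds the chat
third = meta.handle("How do I switch policy?")      # holder passes, resolver reroutes
```

In this run the resolver is consulted twice for three queries, because the follow-up question stays with the agent that was already holding the chat.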

Agent Memory, Sparse Models & Universal Memory

The mixture of expert agents also introduces another interesting concept: how the meta-agent handles agent memory. With multiple agents, you can give each individual agent its own mini-memory. This multi-agent architecture enhances model capacity by allowing the system to understand and express more complex patterns. The combination of all these mini-memories, along with the logs of non-agent components (like the resolver and router), becomes the universal memory of the meta-agent.
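
The relationship between the mini-memories and the universal memory can be sketched as a small data structure. This is an illustrative layout only; the `UniversalMemory` class and its method names are hypothetical, and a real system would likely persist and summarize these records instead of keeping plain lists.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict, List, Tuple


@dataclass
class UniversalMemory:
    # One mini-memory per expert agent.
    agent_memories: DefaultDict[str, List[str]] = field(
        default_factory=lambda: defaultdict(list)
    )
    # Ordered record of everything: agent messages plus resolver/router logs.
    timeline: List[Tuple[str, str]] = field(default_factory=list)

    def remember(self, agent: str, message: str) -> None:
        """Store a message in the agent's mini-memory and in the universal timeline."""
        self.agent_memories[agent].append(message)
        self.timeline.append((agent, message))

    def log(self, component: str, event: str) -> None:
        """Record an event from a non-agent component such as the resolver or router."""
        self.timeline.append((component, event))

    def for_agent(self, agent: str) -> List[str]:
        """What a single expert agent sees: only its own history."""
        return list(self.agent_memories[agent])

    def universal(self) -> List[Tuple[str, str]]:
        """What the meta-agent sees: every agent's memory plus component logs."""
        return list(self.timeline)
```

Keeping the per-agent view separate means an expert agent's prompt carries only its own history, while the meta-agent can still reconstruct the whole conversation.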
