How Chatbots Work
Three Generations of Chatbots
Rule-Based Chatbots
The simplest chatbots follow pre-defined conversation scripts. They match user input against patterns or keywords and respond with pre-written messages. If the user types anything containing "hours," the bot responds with business hours. If the input contains "return" and "policy," the bot provides the return policy text. These systems are essentially interactive FAQ pages with a conversational interface. They excel at narrow, well-defined tasks: answering predictable questions, collecting structured information (name, order number, issue type), and routing users to the right department.
Rule-based bots are built using decision trees or flow diagrams. Each node in the tree represents a bot message or question, and edges represent possible user responses. Platforms like Dialogflow, Amazon Lex, and Microsoft Bot Framework provide visual tools for designing these conversation flows. The advantage is complete control: the bot never says anything unexpected, never hallucinates, and every response is reviewed by a human before deployment. The disadvantage is rigidity: users who deviate from the expected conversation flow get stuck, the system cannot handle paraphrases or unusual phrasing, and maintaining large decision trees as products and policies change becomes expensive.
Retrieval-Based Chatbots
Retrieval-based chatbots select the best response from a pre-built database of possible responses rather than generating new text. Given the user's message and conversation history, the system ranks all candidate responses and returns the highest-scoring one. The candidate set can be as small as 100 responses for a focused customer service bot or as large as millions of response examples mined from human conversations.
The ranking mechanism has evolved from keyword matching to semantic matching. Modern retrieval chatbots encode the user's message and each candidate response into dense vectors using transformer models, then select the candidate whose vector is most similar to the user's message vector. This semantic matching handles paraphrases gracefully: "I want to cancel my subscription" and "how do I stop my membership" both match the same cancellation response even though they share few words. Retrieval-based systems avoid hallucination because every response was written by a human, but they cannot combine information from multiple responses or adapt their language to specific user contexts.
Generative Chatbots
Generative chatbots produce novel responses using language models. The system takes the conversation history as input and generates a response token by token. This approach can handle any topic, adapt to any conversational style, and produce responses tailored to the specific context of the conversation. The dramatic improvement in large language models since 2022 has made generative chatbots the dominant paradigm for new deployments. ChatGPT, Claude, Gemini, and similar systems are all generative chatbots built on transformer language models with billions of parameters.
The generative approach introduces risks that rule-based and retrieval-based systems do not have. The model can hallucinate facts, generate inappropriate content, reveal information from its training data, or produce responses that contradict its instructions. Managing these risks requires alignment techniques that train the model to follow instructions, refuse harmful requests, acknowledge uncertainty, and stay within the boundaries set by the system deployer.
The Architecture of a Modern AI Chatbot
The Base Language Model
The foundation is a large language model pre-trained on vast quantities of text. GPT-4, Claude, Gemini, and LLaMA are all transformer models trained to predict the next token in a sequence. This pre-training gives the model broad knowledge of language, facts, reasoning patterns, and conversational conventions. The pre-trained model is capable of generating text but is not yet aligned with human expectations for a helpful assistant: it might continue user prompts as if they were documents rather than conversations, produce toxic content reflecting patterns in its training data, or refuse to engage with legitimate questions because similar questions appeared in problematic contexts.
Supervised Fine-Tuning (SFT)
The first alignment step is supervised fine-tuning on high-quality conversation examples. Human annotators write example conversations demonstrating how the chatbot should behave: helpful, accurate, polite, appropriately cautious, and following instructions. The model is trained on these examples using standard language modeling loss (predict the next token). This step teaches the model the format and style of chatbot responses, transitioning it from a general text predictor to a conversational assistant. Typical SFT datasets contain 10,000 to 100,000 conversation examples covering diverse topics, question types, and edge cases.
Reinforcement Learning from Human Feedback (RLHF)
RLHF further refines the model's behavior based on human preferences. The process works in three stages. First, the SFT model generates multiple responses to each prompt. Second, human raters rank these responses from best to worst based on helpfulness, accuracy, safety, and following instructions. Third, a reward model is trained on these rankings, learning to predict which responses humans prefer. Finally, the language model is further trained using reinforcement learning (typically Proximal Policy Optimization, or PPO) to maximize the reward model's score while staying close to the SFT model's behavior.
RLHF produces models that are substantially more helpful and less harmful than SFT alone. The reward model captures subtle preferences that are difficult to encode in training data: being concise when a question has a simple answer but thorough when it requires detail, admitting uncertainty rather than guessing, and providing balanced perspectives on controversial topics. Constitutional AI (CAI), developed by Anthropic, extends this approach by having the model critique and revise its own responses according to a set of principles, reducing reliance on human raters for certain safety-related behaviors.
System Prompts and Conversation Management
The system prompt is a set of instructions prepended to every conversation that defines the chatbot's persona, capabilities, limitations, and behavioral guidelines. A customer service chatbot's system prompt might specify: "You are a helpful customer service agent for Acme Corp. Answer questions about our products and policies. If asked about competitor products, politely decline. If the customer is angry, acknowledge their frustration before addressing their issue. Never share internal pricing formulas." The system prompt shapes every response without being visible to the user.
Conversation management tracks the state of the dialogue across multiple turns. In the simplest implementation, the entire conversation history is included in the model's context window for every new response. This allows the model to reference earlier messages and maintain coherence across the conversation. For long conversations that exceed the context window, summarization techniques compress earlier turns into a shorter summary that preserves key facts and commitments. More sophisticated systems maintain structured state (customer name, order number, issue category, resolution status) alongside the conversation history.
Tool Use and External Integration
Modern chatbots extend beyond pure conversation by calling external tools and APIs. When a user asks "What is the weather in Tokyo?", the chatbot does not rely on its training data (which has a cutoff date), but instead calls a weather API, retrieves the current conditions, and incorporates the result into its response. Tool use transforms chatbots from knowledgeable conversationalists into functional agents that can take actions: looking up order status, scheduling appointments, searching databases, executing code, and processing transactions.
The technical mechanism for tool use involves the model generating a structured function call rather than natural language text. When the model determines that a tool is needed, it outputs a JSON object specifying the tool name and parameters. The chatbot platform executes the function, returns the result to the model, and the model generates a natural language response incorporating the result. This creates a loop: the model reasons about what information or action is needed, calls the appropriate tool, receives the result, and communicates the outcome to the user. Models are specifically trained to generate well-formatted function calls and to use tool results appropriately in their responses.
Retrieval-augmented generation (RAG) is the most common form of tool use, where the chatbot searches a knowledge base before responding. A company's support chatbot might search the help center documentation for relevant articles, include those articles in its context, and generate a response grounded in the retrieved information. This ensures the chatbot provides accurate, up-to-date information specific to the company's products and policies rather than relying on general knowledge from pre-training.
Measuring Chatbot Quality
Chatbot evaluation is multidimensional. Task completion rate measures how often the chatbot successfully resolves the user's request without escalation to a human agent. For customer service bots, this ranges from 30% for complex service scenarios to 80% for simple informational queries. Response accuracy measures whether the chatbot's statements are factually correct, critical for applications where wrong information has real consequences. User satisfaction, typically measured through post-conversation surveys, captures the overall experience including response quality, speed, and perceived empathy.
Conversation-level metrics include average conversation length (shorter is often better for task-oriented bots), escalation rate (how often the bot transfers to a human), and retention rate (how often users return to the bot for future questions). Turn-level metrics include response relevance (does the response address the user's actual question), coherence (does the response make sense given the conversation history), and latency (how quickly the response appears). Automated evaluation using LLM judges (where a separate language model rates the quality of chatbot responses) is increasingly used for continuous monitoring, though human evaluation remains the gold standard for periodic quality assessments.
Modern chatbots combine large language models (for generating contextually appropriate responses), alignment training (for safety and helpfulness), and tool integration (for accessing external information and taking actions) into systems that conduct natural, multi-turn conversations across virtually any topic.