
CCAI (Contact Center AI)

--> to the BOTwiki - The Chatbot Wiki

Contact Center AI, or CCAI for short, refers to the use of artificial intelligence to automate and support customer service processes across voice, chat, and email channels. At its core is an AI agent that understands and classifies incoming inquiries and either resolves them independently or transfers them to human agents in a structured manner.

 

Components of Contact Center AI

A comprehensive CCAI solution consists of several interconnected functional components. At the forefront is the virtual agent, which conducts incoming voice and text conversations. Behind the scenes, modules provide real-time support to employees, while analytical functions organize conversation data and make it available for continuous optimization.

  • Virtual Agent: Automates customer inquiries via phone and chat using a voicebot or chatbot.
  • Agent Assist: Provides support to service representatives during calls or chats by offering context-relevant documents and suggested responses in real time.
  • Knowledge integration: Integration of internal knowledge sources such as FAQs, Confluence, or product databases via a knowledge AI layer.
  • Routing logic: Data-driven forwarding to the appropriate department when a request cannot be answered automatically.

 

Typical Architecture and Integration

Most projects already have a telephone system, a CRM, a ticketing system, and an existing contact center system in place. A modern CCAI architecture integrates with these systems via open interfaces rather than replacing them. The AI agent handles the conversation logic, while the existing systems continue to provide call routing, queues, and agent workstations.

At the data level, call data, transcripts, and intent matches are fed into a central layer where models can be retrained. For companies in the DACH region, issues such as data residency, data processing on behalf of clients, and data deletion policies are of central importance, as the service involves the processing of personal data and voice recordings.

 

Implications for Voice and Chat

Contact Center AI has the greatest impact in the voice channel. Traditional IVR trees are rigid, rule-based, and often frustrating for callers. AI-native voice with multi-agent orchestration changes this approach. A Phonebot understands the request in natural language, passes structured data to follow-up agents, and guides users through complex processes such as scheduling appointments, checking statuses, or making simple contract changes.

In the chat and email channels, the same platform handles inquiries via web chat, messaging apps, and service inboxes.

 

CCAI in multi-agent setups

In complex service situations, a single AI agent is rarely sufficient. Multi-agent orchestration means that specialized agents work together. A triage agent identifies the issue, a specialized agent handles authentication, and a subject-matter expert resolves the actual task in the order system or CRM. This creates a division-of-labor architecture that smoothly manages individual conversations while treating the handoff to human employees as an equally valid path.
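
A minimal sketch can illustrate this handoff pattern. All intents, checks, and handler names below are hypothetical placeholders and do not correspond to any specific platform API; the point is only the triage, authentication, and specialist steps, with a human handoff as an equally valid outcome.

```python
# Minimal sketch of the triage -> authentication -> specialist pattern.
# All intents, checks, and handlers are hypothetical placeholders.

def triage(utterance: str) -> str:
    """Very naive intent triage; a real system uses an NLU model here."""
    text = utterance.lower()
    if "address" in text:
        return "change_address"
    if "invoice" in text:
        return "invoice_status"
    return "unknown"

def authenticate(session: dict) -> bool:
    """Placeholder check, e.g. customer number plus date of birth."""
    return session.get("customer_id") is not None

def handle_request(utterance: str, session: dict) -> str:
    intent = triage(utterance)
    if intent == "unknown":
        return "handoff_to_human: unclear request"
    if not authenticate(session):
        return "handoff_to_human: authentication failed"
    # A specialized agent would now resolve the task in the CRM or order system.
    return f"resolved_by_specialist: {intent}"

print(handle_request("I want to change my address", {"customer_id": "4711"}))
```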

 

Frequently Asked Questions (FAQ)

Contact Center AI is an umbrella term for AI capabilities that automate or support customer service processes across voice, chat, and email channels. At its core is an AI agent that understands inquiries and either resolves them on its own or provides agents with relevant information. The solution complements existing contact center systems rather than completely replacing them.

Traditional IVR systems use fixed menus and DTMF inputs via the telephone keypad. Contact Center AI, on the other hand, understands natural language, recognizes inquiries in context, and can handle multiple steps in a single conversation. This reduces the average handling time, and callers are less likely to get stuck in endless loops.

A modern CCAI platform connects to the phone system, CRM, ticketing system, and knowledge base via open interfaces. This preserves existing investments and adds a conversational AI layer.

CCAI is particularly relevant for companies with high call volumes, recurring standard inquiries, and multiple parallel service channels. Typical industries include insurance, banking, utilities, retail, and healthcare. In these sectors, automated voice and chat channels, combined with Agent Assist, deliver measurable improvements in reachability and first-contact resolution.



--> Back to BOTwiki - The Chatbot Wiki



confidence score

--> to the BOTwiki - The Chatbot Wiki

The confidence score is a numerical metric used in AI-based agents, chatbots, and voicebots to indicate how certain a system is that it has correctly understood an input, assigned it to the right intent, or generated an appropriate response. It forms the probabilistic basis for the system’s decisions: the higher the value, the more reliable the model’s classification and the lower the risk of an incorrect response or hallucination.

 

How the Confidence Score is Calculated

The confidence score is derived from the interaction of several components of the AI model. Essentially, the underlying language model or classifier calculates a probability distribution for each possible interpretation of the user input. In intent classification, for example, the model compares the incoming utterance with trained patterns and assigns a probability value to each potential intent. The highest value in this distribution is typically reported as the Confidence Score, often normalized to a scale of 0 to 1 or 0 to 100 percent.

In modern transformer-based models, several factors come into play: the semantic similarity of the input to training examples, the clarity of the wording, the conversational context from previous turns, and, in the case of voicebots, the acoustic recognition quality from automatic speech recognition (ASR). Ambiguous, colloquial, or very short inputs typically result in lower scores because the model oscillates between multiple interpretations. Well-structured, clear formulations, on the other hand, lead to high, concentrated probability masses on a single intent and thus to a high confidence score.
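
As a minimal sketch of this idea, the snippet below turns hypothetical raw intent scores into a probability distribution via softmax and reports the highest value as the confidence score; the intents and score values are invented for illustration.

```python
import math

def softmax(scores):
    """Convert raw scores into a probability distribution that sums to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw model scores (logits) for three candidate intents
logits = {"report_damage": 4.1, "track_shipment": 1.3, "cancel_contract": 0.2}

probabilities = dict(zip(logits, softmax(list(logits.values()))))
top_intent = max(probabilities, key=probabilities.get)
confidence = probabilities[top_intent]          # value between 0 and 1

print(top_intent, round(confidence, 2))         # report_damage 0.93
```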

 

Threshold values and fallback behavior

To make the confidence score usable in practice, developers and conversational designers define thresholds. If the score exceeds the upper threshold, the system responds directly; if it falls below the lower threshold, a fallback mechanism is triggered. Scores between the two thresholds typically prompt a targeted clarifying question.

Typical fallback strategies include asking specific comprehension questions, offering choices to narrow down the user’s intent, or escalating the matter to a human agent. In critical use cases such as medical information systems or financial advisory bots, thresholds are deliberately set high to minimize errors. In less high-risk scenarios, a lower threshold can increase the automation rate without significantly compromising user satisfaction.

Calibrating these thresholds is an iterative process based on the evaluation of real-world interaction data and the analysis of misclassifications. A threshold set too high leads to frequent, unnecessary follow-up questions and frustrates users; a threshold set too low increases the rate of incorrect answers and undermines trust in the system.
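
A simplified decision function shows how such thresholds might be applied in a dialogue flow; the threshold values of 0.8 and 0.5 are purely illustrative and would be calibrated per use case as described above.

```python
# Illustrative two-threshold decision; 0.8 and 0.5 are example values only.
UPPER_THRESHOLD = 0.8
LOWER_THRESHOLD = 0.5

def decide(confidence: float, intent: str) -> str:
    if confidence >= UPPER_THRESHOLD:
        return f"answer directly for intent '{intent}'"
    if confidence >= LOWER_THRESHOLD:
        return f"ask a clarifying question about '{intent}'"
    return "fallback: offer options or escalate to a human agent"

print(decide(0.91, "track_shipment"))   # answer directly
print(decide(0.63, "track_shipment"))   # clarifying question
print(decide(0.32, "track_shipment"))   # fallback / escalation
```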

 

Implications for Voice and Chat

In the context of chatbots, the confidence score primarily influences the control of dialogue flows and the selection of response modules. Since text inputs are generally more precise and better structured than spoken language, confidence scores in chat often fall within higher ranges. Nevertheless, typos, abbreviations, code-switching between languages, or very short inputs such as single keywords pose a challenge and can significantly lower the score.

In the voicebot domain, the confidence score plays an even more central role, as two error-prone stages are linked in sequence: first, speech recognition, which converts spoken words into text, and then the NLU (Natural Language Understanding) model, which interprets the text. Both stages provide their own confidence scores, which are often combined. Background noise, dialects, speech rate, and telephone bandwidth degrade ASR quality and lower the overall confidence score.
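
One simple way to combine the two stage scores is multiplication, as in the sketch below; weighted or learned combinations are also common in practice, and the values shown are invented.

```python
# Illustrative combination of ASR and NLU confidence by multiplication.
asr_confidence = 0.90   # speech recognition quality
nlu_confidence = 0.85   # intent classification certainty
combined = asr_confidence * nlu_confidence
print(combined)  # lower than either stage on its own; may fall below a 0.8 threshold
```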

Voicebots must therefore implement particularly robust fallback strategies, as a failure or an incorrect response is perceived as significantly more disruptive in a spoken dialogue than in a chat. Overall, the confidence score is a key tool in both channels for managing the balance between the level of automation and the quality of interaction, and for continuously improving the user experience.

 

Frequently Asked Questions (FAQ)

The confidence score is a probability value between 0 and 1 that indicates how confident an NLU model is that it has correctly understood an input, assigned it to the right intent, or found a suitable response. It determines whether an AI agent responds directly, asks a follow-up question, or branches to a fallback path.

The model calculates a probability value for every possible interpretation of the input. Modern methods use vector spaces in which meaning and context are represented. The highest of these values is output as a confidence score, normalized on a scale from 0 to 1.

If the score falls below the defined threshold, the system falls back on a contingency plan: it asks a follow-up question, offers options, or escalates the conversation to a human agent.

In practice, threshold values between 0.7 and 0.85 have proven effective; the more risk-sensitive the application, the higher the threshold should be set. Regular calibration based on real-world interaction data is recommended.



--> Back to BOTwiki - The Chatbot Wiki



Phonebots / Voicebots

--> to the BOTwiki - The Chatbot Wiki

Phonebots are AI-based voice solutions that automatically answer, understand, and—in many cases—completely handle phone calls. They represent the AI-native evolution of traditional IVR hotlines (“Press 1 for…”), with the key difference being that callers no longer have to select rigid menu options but can instead speak freely. A modern phonebot understands the request, classifies it, authenticates the caller if necessary, and handles the process end-to-end, including backend integration with CRM, ERP, payment, or industry-specific systems.

For many companies, voice is the most important channel for customer interaction. High call volumes, overloaded hotlines, staff shortages, and frustration with hold times are a daily reality in customer service. Phonebots address precisely this pain point—not as a replacement for human agents, but as a scalable first point of contact that automates simple and moderately complex tasks and smoothly hands off complex issues to humans. 

 

Body vs. Brain: Why Traditional Voice Solutions Fail

Some telephony platforms are strong on the connectivity side, that is, on SIP, PSTN, and call center telephony, but they use AI merely as an add-on to legacy IVR trees. As a result, they fail when faced with ambiguity, context shifts, and natural language. Despite the “AI voicebot” label, callers end up on hold anyway because the system escalates at the first sign of unclear phrasing.

On the other hand, there are simple single-prompt tools and wrappers that can respond in natural language but consistently fail in real-world business processes involving authentication, database access, and multi-step workflows, resulting in hallucinations, tool-calling errors, and context contamination. AI-native phonebots need both: solid telephony integration and an intelligent, process-stable architecture.

 

What Matters When Implementing Phonebots

Three key success factors are common to nearly all Phonebot projects. The focus on use cases is crucial: instead of automating “the entire hotline,” the first step is to identify the truly frequent, clearly definable processes, that is, the typical top 3 or top 5 issues per industry. The backend integration must be seamless; a Phonebot that doesn’t integrate with CRM, ERP, or industry-specific systems remains a FAQ bot with a phone number. And the voice experience must be convincing: voice, tempo, pause fillers, escalation logic, and warm transfer to a human agent all go hand in hand.

 

Phonebots by BOTfriends

BOTfriends X picks up right where traditional voice solutions leave off: AI-native voice with multi-agent orchestration. The result: Callers speak freely, the agent understands their request, authenticates them, accesses backend systems, and completes the process end-to-end. No waiting on hold and no rigid menu structure.

The platform offers full telephony integration via SIP and PSTN, more than 500 voices in over 100 languages, and ElevenLabs Voices for a natural-sounding conversation experience. Hallucinations and tool-calling errors are structurally prevented through hybrid intelligence: natural language (LLM) is combined with deterministic rule logic, ensuring that even backend writing processes remain brand-safe and factually accurate.

 

Frequently Asked Questions (FAQ)

In practice, the terms are often used interchangeably. “Phonebot” places greater emphasis on the telephone channel (traditional phone numbers, hotlines, PSTN/SIP), while “Voicebot” is a broader term that can also include in-app voice or web-based voice. At BOTfriends, both terms refer to an AI-native voice solution with a multi-agent architecture.

Common use cases include damage reports, meter readings, shipment tracking with authentication, hotline triage, scheduling appointments, status inquiries, and simple contract or order processes. More complex or sensitive issues can be forwarded to human agents with all relevant context.

A wide selection of voices in many languages, including high-quality neural voices, is available. Tone, pause fillers, and tempo are configured in collaboration with the customer to ensure consistency with the brand’s tone of voice. Our phonebots can be tested live in our Demo Hub.

Time-to-value depends on the use case. In clearly defined scenarios, initial production setups can be achieved in just a few weeks, including backend integration, test loops, and the hypercare phase. “Live in 5 minutes” is just marketing speak and not a realistic claim for truly efficient enterprise voice projects.



--> Back to BOTwiki - The Chatbot Wiki



Embeddings

--> to the BOTwiki - The Chatbot Wiki

Embeddings are numerical representations of text, images, or other data in a high-dimensional vector space. They translate meaning into numbers. Content with similar meanings is located close together in the vector space, regardless of the specific wording. Embeddings thus make possible what traditional keyword matching cannot achieve: semantic search, in which “reading the electricity meter” and “submitting the meter reading” are recognized as related.

In modern AI agents, embeddings form the basis of semantic language processing in applications such as initial intent recognition, Retrieval Augmented Generation (RAG) in knowledge bases, and many other functions.

 

How embeddings work technically

An embedding model—usually a specially trained neural network—takes an input text (e.g., “How do I report water damage?”) and converts it into a vector that typically has several hundred or thousand dimensions. Similar content generates similar vectors. Using distance metrics such as cosine similarity, the texts most relevant to the query can be efficiently identified from a large volume of content, such as a knowledge base or a product catalog.
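
The following sketch shows the core idea with tiny, made-up four-dimensional vectors; real embedding models return vectors with hundreds or thousands of dimensions, but the cosine similarity ranking works the same way.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors; real embeddings come from a trained model and are much larger.
query_vector = [0.9, 0.1, 0.0, 0.3]             # "How do I submit my meter reading?"
chunks = {
    "Submitting the meter reading": [0.8, 0.2, 0.1, 0.4],
    "Opening hours of our branches": [0.1, 0.9, 0.7, 0.0],
}

ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vector, chunks[c]), reverse=True)
print(ranked[0])   # "Submitting the meter reading" is the semantically closest chunk
```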

In the RAG setup, the semantically relevant information is first retrieved from the knowledge base and provided to the LLM as context. Instead of letting the model “guess,” it responds based on verified sources. This is one of the most effective ways to reduce hallucinations and a key component that BOTfriends uses to ensure factual accuracy.
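
A rough sketch of that flow might look as follows; `retrieve_top_chunks` and `call_llm` are hypothetical stand-ins for the vector search and whichever model API is actually used, so both are stubbed here.

```python
# Sketch of a RAG step: retrieve relevant chunks, then let the model answer
# strictly on that basis. Both helper functions are hypothetical stubs.

def retrieve_top_chunks(question: str, k: int = 3) -> list:
    # Real setup: embed the question, run a vector search, return the top k chunks.
    return ["Meter readings can be submitted online or by phone ..."]

def call_llm(prompt: str) -> str:
    return "stubbed model response"

def answer_with_rag(question: str) -> str:
    context = "\n\n".join(retrieve_top_chunks(question))
    prompt = (
        "Answer the question using only the sources below. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)

print(answer_with_rag("How do I submit my meter reading?"))
```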

 

Best Practices for Using Embeddings

In enterprise projects, the quality of the knowledge stored in the bot is the main lever you control, and the chunking quality of knowledge base entries plays a key role in determining the quality of search results. Chunks that are too small lose context, while those that are too large dilute semantic accuracy. BOTfriends uses various mechanisms to optimize the chunks and ensure that the most relevant information is always provided. Well-prepared chunks make the difference between an agent that “just answers” and one that retrieves the right information from the right source, even when dealing with extensive, multilingual knowledge bases.

 

Frequently Asked Questions (FAQ)

Embeddings can be relevant under data protection law. Depending on the content, they may contain personal information or make it traceable. That is why proper data management, EU-based hosting, and a clear authorization and deletion policy are mandatory. BOTfriends addresses these requirements by default during setup and operation.



--> Back to BOTwiki - The Chatbot Wiki



AI Instructions

--> to the BOTwiki - The Chatbot Wiki

AI instructions are the core guidelines that tell an AI agent, such as a chatbot or voicebot, how to behave in order to perform a task. They define how the agent should proceed, which steps it must follow, what it needs to pay attention to, and how it should handle special cases. AI instructions thus serve as the agent’s manual, acting as the interface between brand strategy and model behavior.

 

Creating Effective AI Instructions

A good instruction works like a good prompt and therefore stands or falls on its structure and formatting. Using the Markdown formatting language helps the LLM follow the instruction reliably. You can find more information on this in our Prompting Guide for Agentic AI.

The individual steps or decision-making processes should be described in a clearly structured manner and, where appropriate, illustrated with examples.

Then there is the use of tools: which tools are called, when, in what order, and with which required parameters?

Finally, escalation and fallback rules are needed: What happens if the agent is unsure, does not understand a request, or exceeds a security-critical threshold? These building blocks ensure that the agent does not improvise, but instead operates according to a clear, auditable logic.
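
As an illustration of these building blocks, a shortened, hypothetical instruction for a triage agent might look like the following, stored here as a plain Python string and structured with Markdown headings as recommended above.

```python
# Shortened, hypothetical instruction for a triage agent; the sections mirror
# the building blocks above (role, steps, tool use, escalation).
TRIAGE_AGENT_INSTRUCTIONS = """
# Role
You triage incoming service requests for an energy provider.

# Steps
1. Greet the caller briefly and ask for their concern.
2. Classify the request as meter_reading, billing, contract_change, or other.
3. For meter_reading and billing, hand over to the matching process agent.

# Tools
- Call `classify_request(text)` before any handover.
- Never call backend tools for requests classified as `other`.

# Escalation
If the request is still unclear after one clarifying question, transfer the
caller to a human agent and summarize the conversation so far.
"""
```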

 

AI Instructions in Multi-Agent Architecture

Many problems associated with traditional AI agents, such as hallucinations, rule violations, or tool-calling errors, arise because a single system prompt attempts to cover all behaviors, behavioral rules, and tasks simultaneously. BOTfriends solves this by setting different configurations in different places and clearly distributing tasks through multi-agent orchestration. Each agent (Triage, Authentication, Processes, Knowledge) has its own focused AI instructions, tailored to its specific area of responsibility.

This approach is not only more stable but also easier to maintain. Changes to the triage agent’s workflow do not necessarily affect the authentication logic in the authentication agent. Updates to compliance requirements can be applied directly to the knowledge agent. This ensures that voice and chat setups remain easy to maintain, even months later and across multiple releases.

 

Best Practices for AI Instructions in Practice

In production environments, three best practices have proven effective. AI instructions must be concrete rather than abstract. Instead of “Be friendly,” try “Start responses by acknowledging the user’s request, followed by the solution step, and then a follow-up question.” They should provide examples—short positive examples of ideal responses and, if necessary, a negative example for clarification. And they must be tested regularly, as updates to the LLM system can affect how the instructions are executed. AI instructions belong in a test suite with real-world use-case dialogs, evaluated automatically, with clear KPIs.

 

Frequently Asked Questions (FAQ)

AI instructions specify at the task level exactly what an agent is supposed to do, while prompt engineering is a technique for making the prompt as clear and structured as possible.

As short as possible, as long as necessary. Long, monolithic instructions often lead to poorer results because the model overlooks important details. In multi-agent setups, concise instructions for each agent are usually more successful than a single, lengthy prompt.

AI instructions are a living asset, not a one-time setup. Changes are planned, tested, and rolled out smoothly without requiring operational teams to rely on external consultants every time.



--> Back to BOTwiki - The Chatbot Wiki



AI Knowledge Base

--> to the BOTwiki - The Chatbot Wiki

An AI knowledge base is the structured repository of information from which an AI agent draws its responses. Unlike the training data of a Large Language Model (LLM), the knowledge base is company-specific, up-to-date, and versionable. It contains product manuals, websites, FAQs, process descriptions, pricing plans, terms and conditions, service guides, and everything the agent needs to know reliably and accurately when interacting with customers.

The knowledge base thus serves as the counterpart to "creative model intuition." While the LLM contributes language understanding and response generation, the knowledge base ensures factual accuracy. In combination with RAG (Retrieval Augmented Generation), this creates a system that responds naturally while remaining brand-safe and compliant.

 

Building an AI Knowledge Base

A knowledge base that can be used effectively isn’t created by simply dumping all available documents into a vector database. Three steps are standard in BOTfriends projects.

Once the team has decided which documents, wikis, CMS content, FAQs, and backend data are reliable and necessary for the bot, all knowledge sources are uploaded to the knowledge base.

The platform breaks down the uploaded content into semantically meaningful units (so-called text chunks). The chunks are transferred to a vector space via embeddings so that they can be found later.

Tip: The better the content is structured and formatted (e.g., using Markdown), the more accurate the bot's information will be and the higher the quality of its responses.
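
A naive chunking routine illustrates the basic mechanics; the paragraph-based splitting and the 800-character limit are simplifications, and production platforms use more sophisticated, semantically aware strategies.

```python
# Naive paragraph-based chunking with a size limit (illustrative values only).
# Each resulting chunk would then be embedded and stored in the vector space.

def chunk_document(text: str, max_chars: int = 800) -> list:
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

document = "## Tariffs\n\nThe basic tariff includes ...\n\n## Cancellation\n\nContracts can be cancelled ..."
for chunk in chunk_document(document):
    print(chunk[:40])
```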

If you do your research thoroughly, choose your sources carefully, and keep them up to date, you’ll lay the groundwork for consistent answer quality. At BOTfriends, we’re happy to help you with this process.

 

Knowledge Base and Multi-Agent Orchestration

In single-prompt architectures, the entire knowledge base—or an overly large portion of it—is often included in every prompt. This leads to context contamination, higher costs, and poorer response quality. BOTfriends, on the other hand, works with dedicated AI agents within a multi-agent orchestration framework. They have access only to the parts of the knowledge base that they need for their specific tasks. 

 

Knowledge Base and RAG

The technical mechanism that connects the knowledge base and the AI model is called Retrieval-Augmented Generation—RAG for short. Instead of having the language model generate a response based on static knowledge, the knowledge base is first searched for every user query. The text chunks that are most semantically relevant are identified and provided to the model as context—only then does it generate a response.

An additional fact check compares the generated response with the user's query once more before it is displayed. 

RAG thus provides the foundation that enables a bot to deliver accurate, source-based answers rather than making things up or repeating outdated information.

 

Frequently Asked Questions (FAQ)

Ideally, the knowledge base is updated on an ongoing basis. When it comes to pricing plans, terms and conditions, or product data, “once a quarter” is rarely enough. BOTfriends X supports automated sync workflows from CMS, DAM systems, and backend data sources, ensuring that updates are automatically reflected in the knowledge base without any manual effort.

Made-up or outdated answers are prevented by having the AI agent use only the verified sources contained in the knowledge base to generate responses. A fact-checking layer further ensures that, in cases of uncertainty, the model communicates transparently rather than speculating.

Multiple knowledge bases can be used in parallel. In BOTfriends projects, they are created to establish clear thematic boundaries. Using routing logic in the multi-agent orchestration, each agent accesses the knowledge base that is appropriate for it.



--> Back to BOTwiki - The Chatbot Wiki



AI KPIs

--> to the BOTwiki - The Chatbot Wiki

AI KPIs (Key Performance Indicators) are the metrics companies use to objectively evaluate the success of AI agents, voicebots, and chat solutions. Strong AI KPIs combine technical quality, business results, and customer experience. Weak AI KPIs measure activity rather than impact—such as the “total number of bot responses”—and thus obscure whether the system is actually delivering business value.

In enterprise settings, AI KPIs are not just reporting metrics but management tools. They show where voice or chat agents can reliably handle tasks automatically, where human intervention is needed, and where use cases still need to be optimized. Those who implement AI without KPIs are essentially managing based on gut instinct—a costly approach—and only realize too late that the system isn’t delivering what’s needed operationally and financially.

 

An Overview of the Most Important AI KPIs

In enterprise projects, these KPI categories have proven to be essential:

  • The automation rate indicates the percentage of processes that an AI agent resolves end-to-end without human intervention.
  • The resolution rate measures the percentage of issues that are actually resolved, as opposed to the simple response rate.
  • The containment rate describes the percentage of interactions that are completed within the bot channel without being transferred to other channels.
  • Customer Satisfaction (CSAT) and NPS complement this perspective with results-oriented quality metrics.

These are supplemented by operational KPIs such as Average Handling Time (AHT), Cost per Contact, Hand-Off Quality (i.e., how smoothly transfers to human agents are handled), and latency, which is particularly critical in voice interactions. To ensure brand safety, any reputable set of KPIs should also include the hallucination rate, insult rate, and compliance-related incident rates.
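
A small calculation over logged conversations shows how some of these rates could be derived; the data structure and values are invented for illustration.

```python
# Illustrative KPI calculation over a tiny set of logged conversations.
conversations = [
    {"resolved_by_bot": True,  "handed_over": False, "csat": 5},
    {"resolved_by_bot": False, "handed_over": True,  "csat": 3},
    {"resolved_by_bot": True,  "handed_over": False, "csat": 4},
    {"resolved_by_bot": False, "handed_over": False, "csat": 2},  # abandoned
]

total = len(conversations)
automation_rate  = sum(c["resolved_by_bot"] for c in conversations) / total
containment_rate = sum(not c["handed_over"] for c in conversations) / total
average_csat     = sum(c["csat"] for c in conversations) / total

print(f"Automation rate:  {automation_rate:.0%}")    # 50%
print(f"Containment rate: {containment_rate:.0%}")   # 75%
print(f"Average CSAT:     {average_csat:.1f}")       # 3.5
```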

 

Which KPIs are actually meaningful for voice and chat agents

For voicebots, the automation rate per use case often provides the most accurate picture. What matters is not the number of calls themselves, but the percentage of them that are successfully completed without human assistance, including the correct backend action. Equally important is handover quality—that is, how reliably complex or escalated cases are transferred to human agents with full context.

In the chat channel, resolution rate, containment rate, and self-service rate are the key metrics.

 

Frequently Asked Questions (FAQ)

In most cases, the most important metrics are the automation rate per use case, CSAT or NPS in bot interactions, and the quality of handoffs during escalations. These three metrics indicate whether the bot is truly automating interactions, whether customers are satisfied, and whether the handoffs to human agents are working smoothly.

The total number of bot responses says little on its own. It shows activity, not results. A system can generate many responses without actually resolving the original issue. Resolution rate and containment rate are much more meaningful metrics in this context.

Essentially, voice and chat agents share the same KPIs, but not the same priorities. Voice is more sensitive to latency and audio quality, while chat is more sensitive to length and navigation. Containment rate and self-service rate play a greater role in chat, while average handling time and audio quality dominate in voice.



--> Back to BOTwiki - The Chatbot Wiki



Rich Media Elements

--> to the BOTwiki - The Chatbot Wiki

Rich media elements are interactive content components used in chat- and messenger-based AI agents that go beyond simple text responses. These include images, videos, buttons, quick replies, carousels, cards, and lists. They help convey complex information in an understandable way, speed up decision-making processes, and create a more professional user experience. Unlike text-only messages, rich media elements significantly reduce the amount of typing and reading required by the user.

 

Common rich media elements and when they are appropriate

Buttons and quick replies are suitable for clear-cut questions with a manageable number of options, such as “Report a claim,” “Track a shipment,” or “Book an appointment.” Carousels are ideal for product recommendations, contract options, or case studies where the user wants to compare several equally valid alternatives. Images, videos, and PDFs often explain complex topics more quickly than text, such as step-by-step self-help instructions or a visualization of shipment status. Cards and lists organize answers with multiple data points, such as available appointments, locations, or rates.

A well-designed AI agent seamlessly switches between free-form conversation and rich media elements, depending on the context and the capabilities of the channel.
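
As a sketch, a channel-agnostic quick-reply message could be represented as a simple data structure like the one below; actual payload schemas differ per channel and per platform, so the field names here are hypothetical.

```python
# Hypothetical, channel-agnostic quick-reply payload; real channel APIs
# (web chat, WhatsApp templates, Messenger) each use their own schema.
quick_reply_message = {
    "type": "quick_replies",
    "text": "What would you like to do?",
    "options": [
        {"title": "Report a claim", "payload": "report_claim"},
        {"title": "Track a shipment", "payload": "track_shipment"},
        {"title": "Book an appointment", "payload": "book_appointment"},
    ],
}
print(quick_reply_message["options"][0]["title"])
```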

 

Best Practices for Using Rich Media Elements

Three principles have proven effective in practice.

First, a single dialogue step shouldn’t be overloaded. Too many buttons or carousel cards overwhelm the user and distract from the actual purpose. Two to five clear options are ideal in most cases.

Second, free-form input must still be supported. Rich media elements complement, but do not replace, natural language understanding. Customers should always be able to type or speak freely.

Third, brand consistency is essential. Color schemes, visual language, and wording are all part of the brand’s tone of voice; rich media elements must not deviate from it.

In practice, rich media elements are most effective for recurring use cases with clear decision paths, such as shipment tracking, appointment booking, or contract options. They measurably reduce the time to resolution and increase the self-service rate.

 

Frequently Asked Questions (FAQ)

Not every channel supports the same elements. Web chat and in-app chat offer the widest variety of interactive elements, while WhatsApp and Facebook Messenger use predefined formats (templates, list messages), and voice and email require a customized presentation. BOTfriends X handles this channel adaptation, ensuring that content is managed centrally and delivered in a format tailored to each channel.

Provided they are properly configured, rich media elements can be used in compliance with data protection requirements. It is particularly important that embedded content, such as videos or tracking elements, does not send data to third parties without verification. BOTfriends is hosted in the EU, is GDPR- and EU AI Act-compliant, and configures rich media setups accordingly.

Plain text is sufficient in some simple FAQ scenarios. For more complex business processes, such as shipment tracking with authentication, contract changes, or damage reports, rich media elements are demonstrably more effective. They reduce misunderstandings, speed up the dialogue, and increase the conversion rate.



--> Back to BOTwiki - The Chatbot Wiki



Session Initiation Protocol (SIP)

--> to the BOTwiki - The Chatbot Wiki

The Session Initiation Protocol (SIP) is an open standard for managing real-time communication sessions over IP networks, primarily telephone calls. SIP governs how a call is established, put on hold, transferred, and terminated, regardless of whether the endpoints are traditional telephones, softphones, PBX systems, or AI-based voicebots.

SIP is indispensable for AI-native voice agents. It serves as the bridge between the traditional telephony world (PSTN, mobile networks, legacy ISDN) and modern AI logic. Without seamless SIP integration, even the most intelligent AI agent remains cut off from the channel where the majority of truly valuable customer inquiries take place—namely, the telephone.

 

How SIP works technically

SIP functions as a signaling protocol. It does not manage the audio transport itself, but rather the establishment and termination of sessions. The actual voice stream typically runs over RTP (Real-time Transport Protocol). SIP messages such as INVITE, ACK, BYE, and REGISTER define who is calling whom, whether the call is accepted, and when it ends.
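
A simplified, hypothetical SIP INVITE illustrates this signaling; all addresses, tags, and header values are placeholders, and a real message would carry additional headers plus an SDP body describing the RTP audio stream.

```
INVITE sip:voicebot@example.com SIP/2.0
Via: SIP/2.0/UDP pbx.example.org:5060;branch=z9hG4bK776asdhds
From: "Caller" <sip:caller@example.org>;tag=1928301774
To: <sip:voicebot@example.com>
Call-ID: a84b4c76e66710@pbx.example.org
CSeq: 314159 INVITE
Contact: <sip:caller@pbx.example.org>
Content-Type: application/sdp
Content-Length: 142

(SDP body describing the RTP audio stream follows here)
```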

For voicebots, this means: as soon as a caller dials a hotline, the telephony infrastructure establishes a session with the voice agent endpoint via SIP. The agent receives the audio stream, processes it using speech-to-text, LLM, and text-to-speech, and sends the response back. If necessary, the agent can initiate a warm transfer via SIP, i.e., hand the call—including the context—over to a human agent.

 

Body vs. Brain: Why SIP Alone Isn't Enough

Traditional telephony platforms are robust in terms of connectivity—specifically, their SIP and PSTN connections—but rigid in their logic. They treat AI as an add-on to legacy IVR structures (“Press 1 for …”) and consequently struggle with ambiguity, changes in context, and natural language. Despite the “AI voicebot,” callers still end up on hold.

BOTfriends takes a different approach. It’s AI-native voice from the ground up—meaning multi-agent orchestration combined with full-featured telephony integration via SIP and PSTN. The caller speaks freely; a triage agent classifies the request; and a process agent resolves it end-to-end, including authentication, CRM/ERP access, and documentation. SIP remains the reliable “body” component, while the AI architecture serves as the “brain.”

 

Frequently Asked Questions (FAQ)

In most enterprise scenarios, SIP is needed. It is the de facto standard for modern telephony. Web-only voice applications do not require SIP. However, as soon as traditional phone numbers, hotlines, or PBX integrations come into play, SIP is the natural connectivity standard.

WebRTC is primarily designed for browser-to-browser communication and does not require traditional telephony infrastructure. SIP, on the other hand, is deeply integrated into PSTN, PBX, and mobile networks. In modern setups, the two are often combined, such as web chat using WebRTC and hotline calls via SIP.

Existing phone numbers and phone service contracts can be kept. With SIP trunking, they are seamlessly continued, and the voice agent acts as an additional endpoint that handles specific numbers or skill groups without disrupting the customer experience.

SIP supports encryption via TLS and SRTP for audio transmission. BOTfriends uses these mechanisms by default, supplemented by EU-based hosting, role-based permissions, and audit-proof logging. This allows us to effectively serve even sensitive industries such as insurance, healthcare, and energy.



--> Back to BOTwiki - The Chatbot Wiki



Speech-to-Speech

--> to the BOTwiki - The Chatbot Wiki

Speech-to-Speech (S2S) refers to a technology that translates or processes spoken language directly into spoken language without the traditional detour through text. While conventional voice pipelines go through three stages (speech-to-text, then LLM, then text-to-speech), a speech-to-speech model processes audio end-to-end in a single neural network.

This way, even paralinguistic information—such as emotion, tone of voice, laughter, or hesitation—is preserved, details that are typically lost when transcribing speech into text.
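
The contrast between the two approaches can be sketched as follows; all four processing functions are stubbed placeholders rather than a concrete vendor API.

```python
# Classic three-stage pipeline vs. a direct speech-to-speech call.
# All functions are stubs standing in for real STT, LLM, TTS, and S2S models.

def speech_to_text(audio: bytes) -> str:
    return "stub transcript"          # paralinguistic cues are lost here

def generate_reply(text: str) -> str:
    return "stub reply"               # the LLM works on text only

def text_to_speech(text: str) -> bytes:
    return b"stub audio"

def speech_to_speech(audio: bytes) -> bytes:
    return b"stub audio"              # one model, audio in and audio out

def classic_pipeline(audio_in: bytes) -> bytes:
    return text_to_speech(generate_reply(speech_to_text(audio_in)))

def end_to_end(audio_in: bytes) -> bytes:
    return speech_to_speech(audio_in)
```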

 

Where speech-to-speech excels and where it has its limitations

S2S models excel at short, conversational interactions that require a high degree of naturalness, such as small talk, simple inquiries, or topics similar to those covered in FAQs. They currently perform less well in complex, business-critical processes involving multi-step tool calls, authentication, and backend write operations. In these scenarios, single-model architectures quickly fail due to tool-calling errors or a lack of adherence to rules.

 

Frequently Asked Questions (FAQ)

Speech-to-speech does not replace the classic pipeline in general. It is superior in terms of latency and naturalness, but currently has weaknesses when it comes to complex tool invocation, adherence to rules, and auditability.

While text-to-speech (TTS) and speech-to-text (STT) simply convert between written and spoken language, speech-to-speech (S2S) directly converts an audio input into a new audio output. In the process, characteristics such as the speaker’s voice, emotions, and intonation can be preserved or translated into another language without passing through the intermediate step of visible text.



--> Back to BOTwiki - The Chatbot Wiki