Tools — overview
A tool is a function the AI can call to do something on the visitor’s behalf — scroll a page, click a button, search your content, capture a lead, end the call. Every tool has a name, a JSON schema for its parameters, and a server-side handler.
Spelo’s widget ships with 17 built-in tools organized into 4 groups. The AI picks which one to call based on what the visitor said.
How tool-calling works end-to-end
1. The visitor speaks: "Show me the pricing page".
2. The OpenAI Realtime API (over WebRTC, no Spelo middleman) transcribes the audio and decides which tool to call.
3. The tool call returns to the browser as a data-channel event: { name: "navigate", arguments: { url: "/pricing" } }
4. The widget runs the local handler in spelo.js: navigate("/pricing") triggers in-page navigation and returns "Navigated to /pricing".
5. The tool result is sent back to OpenAI, and the model speaks: "Done, you're on the pricing page now."

Audio never touches Spelo servers. Tool handlers run in the visitor’s browser (most tools) or on Spelo’s API (only search_knowledge_base, read_section, and submit_lead, which need server credentials).
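The browser/server split described above can be pictured as a simple lookup. This is a minimal sketch, not Spelo's actual code: the names SERVER_SIDE_TOOLS and handlerLocation are hypothetical, and only the three server-side tool names come from the text.

```typescript
// Tools whose work needs server credentials, as listed in the paragraph above.
const SERVER_SIDE_TOOLS: ReadonlySet<string> = new Set([
  "search_knowledge_base",
  "read_section",
  "submit_lead",
]);

type HandlerLocation = "browser" | "spelo-api";

// Hypothetical helper: reports where a tool's work is carried out.
// Anything not in the server-side set is assumed to run in the browser.
function handlerLocation(toolName: string): HandlerLocation {
  return SERVER_SIDE_TOOLS.has(toolName) ? "spelo-api" : "browser";
}

console.log(handlerLocation("navigate"));    // browser
console.log(handlerLocation("submit_lead")); // spelo-api
```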
Two protocol generations: v2 (see/act) and legacy
Spelo’s tool surface evolved in two phases. Both are live and the LLM picks between them per task.
v2 — namespaced see.* / act.* (preferred)
The v2 protocol is snapshot-driven: the AI first asks for a structured grid of every interactive/textual element on the page (see.snapshot), gets back stable ids, then acts on those ids (act.click, act.fill, act.scroll_to).
Why this is better:
- Stable element addressing — no fuzzy text matching, no “I see two ‘Submit’ buttons” ambiguity
- Survives DOM rewrites mid-conversation
- Snapshot tells the AI exactly what’s visible vs. below the fold
- Icon-only buttons (where click_element by text fails) become reliably targetable
The flow: see.snapshot → AI picks an id from the result → act.click({ id: "sp-12" }).
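To make the id-based flow concrete, here is a rough sketch of what a snapshot and the resulting act.click call might look like. The SnapshotElement fields other than id (role, label, visible) and the clickById helper are assumptions for illustration. In practice the model chooses the id, not your code.

```typescript
// Hypothetical shape of one entry in a see.snapshot result.
interface SnapshotElement {
  id: string;       // stable id, e.g. "sp-12"
  role: string;     // assumed field: "button", "link", "input", ...
  label: string;    // assumed field: visible text or accessible name
  visible: boolean; // assumed field: in the viewport vs. below the fold
}

interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Two buttons share the same text but have distinct ids,
// so addressing by id is unambiguous where text matching is not.
const snapshot: SnapshotElement[] = [
  { id: "sp-11", role: "button", label: "Submit", visible: true },
  { id: "sp-12", role: "button", label: "Submit", visible: false },
];

// Builds an act.click call, rejecting ids that are not in the snapshot.
function clickById(elements: SnapshotElement[], id: string): ToolCall {
  if (!elements.some((el) => el.id === id)) {
    throw new Error(`Unknown element id: ${id}`);
  }
  return { name: "act.click", arguments: { id } };
}

console.log(JSON.stringify(clickById(snapshot, "sp-12")));
// {"name":"act.click","arguments":{"id":"sp-12"}}
```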
Legacy flat tools (still wired as fallback)
The legacy tools (navigate, scroll_to, scroll_by, click_element, fill_field) address elements by visible text or label. They still work, and the LLM uses them when:
- No snapshot has been taken yet for the current page
- The target is unambiguously named (a single “Submit” button on a small form)
- The legacy tool is more efficient (e.g. scroll_by 50% is just a viewport math hint, so no snapshot is needed)
You don’t have to configure anything — the AI picks the right path automatically per the instructions in its system prompt.
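As an illustration of why scroll_by needs no snapshot, the "viewport math" can be as small as the sketch below. The helper name scrollDeltaPx and the convention that negative percentages scroll up are assumptions, not the widget's documented behaviour.

```typescript
// Hypothetical helper: converts a percentage of the viewport height
// into a pixel offset. Negative values are assumed to mean "scroll up".
function scrollDeltaPx(viewportHeightPx: number, percent: number): number {
  return Math.round((viewportHeightPx * percent) / 100);
}

console.log(scrollDeltaPx(800, 50));  // 400
console.log(scrollDeltaPx(800, -25)); // -200
```

In a browser, the result could be passed to window.scrollBy(0, delta), with window.innerHeight as the viewport height.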
The wire format
Every tool follows the OpenAI Realtime function-calling schema:
    {
      "type": "function",
      "name": "navigate",
      "description": "Navigate to a different page on the website. Use the href paths from the NAVIGATION section of the page context...",
      "parameters": {
        "type": "object",
        "properties": {
          "url": { "type": "string", "description": "..." },
          "external": { "type": "boolean", "description": "..." }
        },
        "required": ["url"]
      }
    }

When the AI invokes a tool, the data channel delivers:

    {
      "type": "response.function_call_arguments.done",
      "call_id": "call_xyz",
      "name": "navigate",
      "arguments": "{ \"url\": \"/pricing\" }"
    }

The widget runs the handler and replies:

    {
      "type": "conversation.item.create",
      "item": {
        "type": "function_call_output",
        "call_id": "call_xyz",
        "output": "Navigated to /pricing"
      }
    }

The model then continues its turn, usually with a short spoken confirmation.
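The request/reply pair above can be sketched as a small dispatcher. This is an illustrative sketch only: the handlers map and the handleFunctionCall function are hypothetical names, and the stand-in navigate handler just returns the result string rather than navigating. Note that arguments arrives as a JSON-encoded string and has to be parsed before use.

```typescript
interface FunctionCallDone {
  type: "response.function_call_arguments.done";
  call_id: string;
  name: string;
  arguments: string; // JSON-encoded string, not an object
}

interface FunctionCallOutput {
  type: "conversation.item.create";
  item: { type: "function_call_output"; call_id: string; output: string };
}

type Handler = (args: Record<string, unknown>) => string;

// Hypothetical handler table. A real navigate handler would also
// trigger in-page navigation; this stand-in only builds the result text.
const handlers = new Map<string, Handler>([
  ["navigate", (args) => `Navigated to ${String(args["url"])}`],
]);

// Runs the matching handler and wraps its result in the reply event.
// Failures are reported back as text so the model can recover.
function handleFunctionCall(event: FunctionCallDone): FunctionCallOutput {
  const handler = handlers.get(event.name);
  let output: string;
  if (handler === undefined) {
    output = `Unknown tool: ${event.name}`;
  } else {
    try {
      output = handler(JSON.parse(event.arguments) as Record<string, unknown>);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      output = `Tool ${event.name} failed: ${reason}`;
    }
  }
  return {
    type: "conversation.item.create",
    item: { type: "function_call_output", call_id: event.call_id, output },
  };
}

const reply = handleFunctionCall({
  type: "response.function_call_arguments.done",
  call_id: "call_xyz",
  name: "navigate",
  arguments: '{ "url": "/pricing" }',
});
console.log(reply.item.output); // Navigated to /pricing
```

The call_id is copied through unchanged so the model can match the result to its original request. With the OpenAI Realtime API, clients typically also send a response.create event after the function output so the model produces its next response. Whether the widget does exactly this is not shown here.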
Where tools live in the codebase
| Concern | File |
|---|---|
| LLM-visible tool schemas + browser handlers | packages/spelo-system/src/tools.ts |
| Server-side voice-relay tool handlers | apps/voice-relay/src/tools.ts |
| Server endpoints called by browser handlers | apps/api/src/routes/{search,read-section,leads}.ts |
When customers want to know “what can Spelo do on my site?” — the answer is whatever tools are in voiceTools[] in packages/spelo-system/src/tools.ts at bundle build time.
Tools by transport
Spelo has two voice transports, and the LLM-visible toolset differs slightly between them:
| Transport | When used | search_knowledge_base | search_database | submit_lead |
|---|---|---|---|---|
| openai-direct (browser ↔ OpenAI WebRTC) | Most customer sites, by default | Yes (RAG over crawled content) | Hidden (DFY tier) | Yes |
| voice-relay (browser ↔ our managed voice infrastructure ↔ Gemini/OpenAI) | Sites configured for Gemini, or free-tier overflow | Yes | Yes (full DB adapter) | Yes |
If your site uses connected database adapters and you want the AI to query them directly, see Plans and limits for the DFY tier requirements.
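The table above can be read as a per-transport visibility map. The sketch below encodes it that way. The names TOOL_AVAILABILITY and visibleTools are hypothetical, and "Hidden (DFY tier)" is modelled simply as not visible on openai-direct.

```typescript
type Transport = "openai-direct" | "voice-relay";

// Visibility of the three transport-sensitive tools, copied from the table.
// Assumption: "Hidden (DFY tier)" is treated as false for openai-direct.
const TOOL_AVAILABILITY: Record<Transport, Record<string, boolean>> = {
  "openai-direct": {
    search_knowledge_base: true,
    search_database: false,
    submit_lead: true,
  },
  "voice-relay": {
    search_knowledge_base: true,
    search_database: true,
    submit_lead: true,
  },
};

// Returns the names of the tools the LLM can see on a given transport.
function visibleTools(transport: Transport): string[] {
  return Object.entries(TOOL_AVAILABILITY[transport])
    .filter(([, enabled]) => enabled)
    .map(([name]) => name);
}

console.log(visibleTools("openai-direct"));
// [ 'search_knowledge_base', 'submit_lead' ]
```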
How the AI decides which tool to call
The AI’s choice is driven entirely by the tool descriptions in the schema (the description field shown in the wire-format example above). These descriptions are part of the system prompt. The AI doesn’t know how the tool is implemented — it picks based on what the description says it does.
This is why Spelo’s tool descriptions are precise and behavioural. For example:
scroll_to: Scroll to a specific NAMED section, heading, or element on the page. Use when the user names a destination (“scroll to the pricing section”, “go to the contact form”, “show me the FAQ”). For vague directional moves like “scroll down a little” use scroll_by instead.
The “use when … but use X for Y” phrasing steers the model toward the right tool when two tools could plausibly apply.
What customers should and shouldn’t worry about
You don’t need to:
- Define tools yourself (they’re built into the bundle)
- Wire up handlers (the widget does this)
- Pick which tool the AI uses (the AI does this from descriptions)
You can influence tool behaviour via:
- Personality + custom instructions: change tone and add domain-specific guidance
- Restricted topics: hard-block topics the AI shouldn’t answer
- Enabled / disabled pages: keep the orb off /checkout etc.
- Connect a database: unlock the search_database capability (DFY tier)
- Webhooks: be notified when the AI captures a lead
See also
- Site intelligence endpoint: the metadata the widget loads on session start (pages, schema, personality)
- Query endpoint: the REST endpoint behind search_database (still callable directly from your backend even if the LLM tool is gated)
- Conversations API: fetch tool-call transcripts after the fact