How Does an LLM Call a Function?

Language models generate text. That's all they do at a fundamental level: predict the next token given a sequence of previous tokens. But functions require something very different. A function like get_weather(city="London") needs structured input — specific arguments in a specific format — not a stream of natural language. So how do we bridge the gap between a model that produces free-form text and a function that expects precisely structured arguments?

The answer: train the model to output structured JSON (JavaScript Object Notation, a lightweight data format that uses key-value pairs like {"city": "London"}) that specifies which function to call and with what arguments. The model doesn't execute anything. It just produces a structured call specification — a message that says "I'd like to call this function with these parameters." A separate runtime (your application code, the API server, an orchestration framework) reads that specification, executes the actual function, and feeds the result back to the model so it can continue generating.

This separation is crucial. The model is a text generator with no ability to run code, access databases, or make HTTP requests. The runtime is the execution layer that actually performs those actions. The model's job is to decide when a function call is needed, which function to call, and what arguments to pass. The runtime's job is to execute and return results.

💡 Function calling (also called tool use) is the single most important capability that separates agents from chatbots. A chatbot can only talk. An agent can act — checking the weather, querying a database, sending an email, writing a file — because it can express its intent as a structured function call that the runtime executes on its behalf.

The Function Calling Protocol

How does the model know which functions are available? The API provides a list of tool definitions alongside the conversation. Each definition describes a function using JSON Schema (a standard for describing the structure of JSON data): its name, a natural language description of what it does, and the parameters it accepts with their types and constraints. Here's what a weather tool definition looks like:

{
  "name": "get_weather",
  "description": "Get the current weather for a city",
  "parameters": {
    "type": "object",
    "properties": {
      "city": {
        "type": "string",
        "description": "The city name, e.g. 'London'"
      },
      "unit": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"],
        "description": "Temperature unit"
      }
    },
    "required": ["city"]
  }
}

The model sees these tool definitions as part of its context (they're injected into the prompt, usually in a system message or a dedicated tools section). When the model decides that answering the user's question requires calling a function, it outputs a structured JSON object specifying the function name and arguments:

{
  "name": "get_weather",
  "arguments": {
    "city": "London",
    "unit": "celsius"
  }
}

The runtime intercepts this output, executes the actual get_weather function, obtains the result (e.g., {"temp": 12, "condition": "rain", "humidity": 85}), and sends it back to the model as a new message in the conversation. The model then uses that result to formulate its final response to the user.

Different providers use slightly different API shapes, but the core idea is identical. OpenAI uses a tools parameter in the Chat Completions API. Anthropic uses a tools parameter in the Messages API. Google uses function_declarations in the Gemini API. In every case, you provide tool definitions, the model outputs a structured call, and your code executes it.

Here's what the full flow looks like as pseudocode:

import json

# 1. Define available tools
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["city"]
        }
    }
]

# 2. Send user message + tools to the model
response = llm.chat(
    messages=[{"role": "user", "content": "What's the weather in London?"}],
    tools=tools
)

# 3. Model responds with a tool call (not a text answer)
# response.tool_calls = [{"name": "get_weather", "arguments": {"city": "London"}}]

# 4. Your code executes the function
result = get_weather(city="London")  # => {"temp": 12, "condition": "rain"}

# 5. Send the result back to the model
final = llm.chat(
    messages=[
        {"role": "user", "content": "What's the weather in London?"},
        {"role": "assistant", "tool_calls": response.tool_calls},
        {"role": "tool", "content": json.dumps(result)}
    ],
    tools=tools
)
# 6. Model now responds with natural language:
# "It's 12°C and rainy in London right now."

And here's a pure-Python simulation that runs through the same flow end-to-end, using a mock model and a mock weather function, so you can see the data at each step:

import json

# ---- Mock weather function (the "tool") ----
def get_weather(city, unit="celsius"):
    """Simulate a weather API."""
    db = {
        "london":  {"temp": 12, "condition": "rain",   "humidity": 85},
        "paris":   {"temp": 18, "condition": "cloudy", "humidity": 60},
        "tokyo":   {"temp": 24, "condition": "sunny",  "humidity": 45},
    }
    data = db.get(city.lower(), {"temp": 0, "condition": "unknown", "humidity": 0})
    if unit == "fahrenheit":
        data = {**data, "temp": round(data["temp"] * 9/5 + 32)}
    return data

# ---- Tool registry (what the model sees) ----
tools = {
    "get_weather": {
        "function": get_weather,
        "schema": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": ["city", "unit"]
        }
    }
}

# ---- Simulate model deciding to call a tool ----
user_query = "What's the weather in London?"

# Step 1: Model "decides" to call get_weather (simulated)
model_tool_call = {"name": "get_weather", "arguments": {"city": "London", "unit": "celsius"}}
print(f"User:       {user_query}")
print(f"Model call: {json.dumps(model_tool_call)}")

# Step 2: Runtime executes the function
func = tools[model_tool_call["name"]]["function"]
result = func(**model_tool_call["arguments"])
print(f"Tool result: {json.dumps(result)}")

# Step 3: Model uses the result to respond (simulated)
model_response = (
    f"It's currently {result['temp']}°C and {result['condition']} "
    f"in London, with {result['humidity']}% humidity."
)
print(f"Model reply: {model_response}")

How Models Learn to Use Tools

A base language model (one that has only been pre-trained on text prediction) doesn't know how to output structured function calls. That capability is added during post-training, specifically through supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) — the same stages that teach the model to follow instructions and be helpful. (For more on how SFT works, see Why Fine-tune?) The training data includes thousands of examples showing the pattern: user asks a question → model produces a structured tool call → tool returns a result → model incorporates the result into a natural language answer.
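Concretely, a single training example of that pattern might look like the transcript below (a hedged sketch reusing the generic chat format from the pseudocode above; real training pipelines use provider-specific message schemas). The model is trained to reproduce the assistant turns given everything that precedes them:

# A hypothetical SFT example teaching the tool-use pattern (illustrative only).
# The model learns to emit the assistant turns, conditioned on the earlier turns.
sft_example = [
    {"role": "user", "content": "What's the weather in London?"},
    {"role": "assistant",
     "tool_calls": [{"name": "get_weather", "arguments": {"city": "London"}}]},
    {"role": "tool", "content": '{"temp": 12, "condition": "rain", "humidity": 85}'},
    {"role": "assistant", "content": "It's 12°C and rainy in London right now."},
]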

But where do those training examples come from? Creating them manually is expensive. Schick et al. (2023) introduced Toolformer, a method that lets language models teach themselves to use tools. The key insight is elegant: start with a pre-trained model, let it insert candidate tool calls at various positions in its training text, execute those calls, and then filter — keep only the examples where the tool call actually reduced the model's perplexity (i.e., the model's next-word prediction improved with the tool result compared to without it). This creates a high-quality dataset of tool-use examples without human annotation.

For example, given the sentence "The Eiffel Tower is 330 metres tall", Toolformer might insert a calculator call to verify the arithmetic, or a Wikipedia lookup to confirm the fact. If the tool result makes the model more confident in the continuation, the example is kept. If not, it's discarded. The resulting model learns not just how to call tools but when calling them is actually useful — a critical distinction.
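The filtering rule itself is just a comparison of losses. Here's a toy sketch of a simplified version of that rule (the loss values are invented for illustration; in the paper the comparison uses a weighted cross-entropy loss over the tokens that follow the insertion point):

def keep_example(loss_without_tool, loss_with_tool_and_result, threshold=0.2):
    """Simplified Toolformer-style filter: keep the augmented example only if
    the tool call plus its result lowers the loss on the continuation by at
    least `threshold` compared to having no tool call at all."""
    return loss_without_tool - loss_with_tool_and_result >= threshold

# Invented numbers for the sentence "The Eiffel Tower is 330 metres tall":
loss_without_tool = 2.9          # predicting "330" from text alone is hard
loss_with_tool_and_result = 1.4  # a lookup returning "330 m" makes it easy

print(keep_example(loss_without_tool, loss_with_tool_and_result))  # True -> keep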

Modern function-calling models support two additional capabilities worth knowing about:

  • Parallel tool calling: when a query requires multiple independent function calls (e.g., "compare the weather in London and Paris"), the model can emit both calls simultaneously rather than waiting for one to complete before issuing the next. This reduces latency by executing the calls concurrently.
  • Forced tool use: the API can require the model to call a specific tool regardless of whether it "wants" to. This is useful for structured data extraction — if you always need the model to output data through a particular tool schema, forced tool use guarantees it won't skip the call and respond in plain text instead.
💡 Parallel tool calling is a significant latency optimisation. If a user asks "What's the weather in London, Paris, and Tokyo?", a model with parallel calling emits three get_weather calls at once. The runtime executes all three concurrently, and the total wall-clock time is roughly the latency of the slowest single call rather than the sum of all three.
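Here's a minimal sketch of how a runtime might execute several independent calls concurrently using a thread pool. The get_weather function is a stand-in for a real API call, and a production runtime would also handle per-call timeouts and errors:

import json
from concurrent.futures import ThreadPoolExecutor

def get_weather(city, unit="celsius"):
    """Stand-in for a real weather API call (which would have network latency)."""
    db = {"london": {"temp": 12, "condition": "rain"},
          "paris":  {"temp": 18, "condition": "cloudy"},
          "tokyo":  {"temp": 24, "condition": "sunny"}}
    return {"city": city, **db.get(city.lower(), {"temp": 0, "condition": "unknown"})}

# Three independent calls emitted by the model in a single response
tool_calls = [
    {"name": "get_weather", "arguments": {"city": "London"}},
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "get_weather", "arguments": {"city": "Tokyo"}},
]

# Execute them concurrently: wall-clock time is roughly the slowest single call
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda call: get_weather(**call["arguments"]), tool_calls))

for result in results:
    print(json.dumps(result))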

Structured Outputs and JSON Mode

Tool calling depends on the model outputting valid, well-formed JSON. But language models are probabilistic token generators — they don't inherently guarantee syntactic correctness. A missing closing brace, a trailing comma, an unescaped quote character — any of these breaks the JSON parser, and a broken tool call breaks the entire agent loop. If your agent calls three tools in sequence and the second call produces malformed JSON, everything downstream fails.

This is why major API providers have introduced structured output modes (sometimes called JSON mode) that constrain the model to produce valid JSON matching a given schema. Rather than hoping the model formats things correctly, the system guarantees it.

How does this work under the hood? The technique is called constrained decoding. At each token-generation step, the model produces a probability distribution over its entire vocabulary (typically 50,000–100,000+ tokens). Normally, any token can be sampled. With constrained decoding, the system masks out (sets the probability to zero for) every token that would make the output invalid JSON at that point. If the output so far ends with {"city": (a brace, a key, and a colon), only tokens that begin a valid JSON value (a quote for a string, a digit for a number, etc.) are allowed. Tokens like } or , that would create a syntax error are blocked.

This means the model's output follows a valid JSON path at every single step. It's not a post-hoc fix ("generate freely, then try to parse") — it's a generative constraint that makes invalid JSON literally impossible to produce. The tradeoff is a small amount of additional computation per token to maintain the constraint state machine, but the reliability gain is enormous.
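As a toy illustration of the masking step, here's a sketch with a five-token vocabulary and a hand-written allowed set (real systems have vocabularies of tens of thousands of tokens and derive the allowed set from a grammar or schema state machine rather than by hand):

import math

# Toy vocabulary and raw model scores (logits) at one decoding step.
vocab  = ['"', '}', ',', '12', 'null']
logits = [1.2, 2.5, 0.3, 0.9, -0.4]   # unconstrained, '}' would be most likely

# Suppose the output so far is '{"city": ' and the schema says city is a string:
# only a quote can legally start the next value.
allowed = {'"'}

# Mask: set the logit of every illegal token to -infinity before the softmax.
masked = [l if tok in allowed else float("-inf") for tok, l in zip(vocab, logits)]

# Softmax over the masked logits: illegal tokens end up with probability 0.
exps  = [math.exp(l) for l in masked]
probs = [e / sum(exps) for e in exps]

for tok, p in zip(vocab, probs):
    print(f"{tok!r:7} -> {p:.2f}")
# Only '"' has non-zero probability, so whatever is sampled is guaranteed legal.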

In practice, you interact with this feature through API parameters. OpenAI provides response_format: { type: "json_schema", json_schema: {...} } in the Chat Completions API, which constrains the model to output JSON matching a specific schema. Anthropic's tool use system automatically constrains the input field of tool calls to match the input_schema you provide in each tool definition, giving you schema-validated outputs without a separate mode.

Here's a simulation showing how constrained decoding limits the model's vocabulary at each generation step:

import json

# Simulate constrained decoding for JSON generation
# At each step, show which token types are ALLOWED vs BLOCKED

schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temp": {"type": "number"}
    },
    "required": ["city", "temp"]
}

# Walk through a constrained decoding trace
steps = [
    {
        "generated_so_far": "",
        "next_token": "{",
        "allowed": ["{ (object start)"],
        "blocked": ["any letter", "[ (array)", "number", "null"]
    },
    {
        "generated_so_far": "{",
        "next_token": '"city"',
        "allowed": ['"city"', '"temp"'],
        "blocked": ["any non-required key", "} (object needs required keys)", "number"]
    },
    {
        "generated_so_far": '{"city"',
        "next_token": ":",
        "allowed": [": (key-value separator)"],
        "blocked": [", (need value first)", "} (need value first)", "any letter"]
    },
    {
        "generated_so_far": '{"city":',
        "next_token": '"London"',
        "allowed": ['"any string" (schema says type=string)'],
        "blocked": ["number", "true/false", "null", "{ (not an object)"]
    },
    {
        "generated_so_far": '{"city":"London"',
        "next_token": ",",
        "allowed": [", (more required keys remain)"],
        "blocked": ["} (temp is required but missing)"]
    },
    {
        "generated_so_far": '{"city":"London",',
        "next_token": '"temp"',
        "allowed": ['"temp" (still required)'],
        "blocked": ['"city" (already present)', "} (temp still missing)"]
    },
    {
        "generated_so_far": '{"city":"London","temp":',
        "next_token": "12",
        "allowed": ["any number (schema says type=number)"],
        "blocked": ['"string"', "true/false", "null"]
    },
    {
        "generated_so_far": '{"city":"London","temp":12',
        "next_token": "}",
        "allowed": ["} (all required keys present)"],
        "blocked": ["any letter", "number (already complete)"]
    },
]

print("Constrained Decoding Trace")
print("=" * 55)
for i, step in enumerate(steps):
    print(f"\nStep {i+1}: next token = {step['next_token']}")
    print(f"  So far: {step['generated_so_far'] + step['next_token']}")
    print(f"  Allowed:  {', '.join(step['allowed'])}")
    print(f"  Blocked:  {', '.join(step['blocked'])}")

final = '{"city":"London","temp":12}'
parsed = json.loads(final)  # would raise a JSONDecodeError if the output were malformed
print(f"\nFinal output: {final}")
print(f"Valid JSON:   {isinstance(parsed, dict)}")
print(f"Parsed:       {parsed}")

Tool Design: What Makes a Good Tool?

A model's ability to use a tool correctly depends almost entirely on how well that tool is described. Remember: the model has never seen your function's source code. All it knows is the name, description, and parameter schema you provided. The tool description is the documentation — if the description is vague, the model will use the tool vaguely. If it's precise, the model will use it precisely.

What separates a well-designed tool from a poorly designed one? Here are the principles that matter most in practice:

  • Clear, descriptive names: search_web tells the model exactly what this tool does. fn_042 tells it nothing. The model selects tools partly based on name matching to the user's intent, so a good name is a strong signal.
  • Rich parameter descriptions: don't just specify types — include what each parameter means, example values, and constraints. "city: The city name, e.g. 'London', 'New York'" is far more useful than "city: string". The model uses these descriptions to figure out what value to pass.
  • Atomic actions: one tool should do one thing well. A tool that searches the web and summarises the results combines two distinct operations. If the model only needs to search (not summarise), it has no way to use half of a tool. Break it into search_web and summarise_text — the model can compose them when needed.
  • Informative return values: return enough context for the model to reason about the result. A weather tool returning {"temp": 12, "condition": "rain", "humidity": 85} is far more useful than one returning just "12" . The model needs context to formulate a helpful response.
  • Clear error messages: when a tool call fails, return a human-readable error like "City 'Londno' not found. Did you mean 'London'?" rather than a raw stack trace. The model will see this error and can either retry with corrected arguments or explain the issue to the user.

There's one more practical concern that's easy to overlook: tool count . Every tool definition consumes context window tokens. More importantly, models struggle with selection when presented with too many options. Research and practical experience show that performance degrades noticeably with 50+ tools — the model starts picking the wrong tool or hallucinating tool names. Keep the active tool set focused on what's needed for the current task. If you have 200 tools, use a routing layer that selects a relevant subset (say, 5–10) based on the user's query before passing them to the model.
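A routing layer can be as simple as scoring each tool's description against the query and keeping the top matches. Below is a minimal keyword-overlap sketch (the registry and tool names are hypothetical; production routers more often use embedding similarity or a small classifier):

def route_tools(query, tool_registry, k=5):
    """Score each tool by word overlap between its description and the query,
    then return the k best-matching tool definitions."""
    query_words = set(query.lower().split())
    scored = []
    for tool in tool_registry:
        overlap = len(query_words & set(tool["description"].lower().split()))
        scored.append((overlap, tool))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for score, tool in scored[:k] if score > 0]

# Hypothetical registry: only the relevant subset is passed to the model.
registry = [
    {"name": "get_weather", "description": "get the current weather for a city"},
    {"name": "send_email",  "description": "send an email to a recipient"},
    {"name": "search_docs", "description": "search internal documents for a query"},
]
selected = route_tools("what is the weather in London", registry, k=2)
print([t["name"] for t in selected])  # ['get_weather']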

Here's an example contrasting a poor tool definition with a good one:

import json

# ---- BAD tool definition ----
bad_tool = {
    "name": "fn_042",
    "description": "does stuff",
    "parameters": {
        "type": "object",
        "properties": {
            "q": {"type": "string"},
            "n": {"type": "integer"}
        }
    }
}

# ---- GOOD tool definition ----
good_tool = {
    "name": "search_knowledge_base",
    "description": (
        "Search the internal knowledge base for documents matching a query. "
        "Returns the top-n most relevant documents with titles and snippets. "
        "Use this when the user asks about company policies, product specs, "
        "or internal procedures."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural language search query, e.g. 'vacation policy for remote employees'"
            },
            "max_results": {
                "type": "integer",
                "description": "Number of results to return (1-20, default 5)",
                "minimum": 1,
                "maximum": 20
            }
        },
        "required": ["query"]
    }
}

print("BAD tool definition:")
print(json.dumps(bad_tool, indent=2))
print()
print("Problems:")
print("  - Name 'fn_042' gives the model no hint about what it does")
print("  - Description 'does stuff' is useless for tool selection")
print("  - Parameters 'q' and 'n' are cryptic abbreviations")
print("  - No parameter descriptions or constraints")
print("  - No required fields specified")
print()
print("GOOD tool definition:")
print(json.dumps(good_tool, indent=2))
print()
print("Improvements:")
print("  - Name clearly describes the action: search_knowledge_base")
print("  - Description explains what, when, and what it returns")
print("  - Parameters have full names and descriptions with examples")
print("  - Constraints (min/max) prevent invalid arguments")
print("  - Required fields are explicit")

The Tool Use Loop in Practice

Now let's put it all together by walking through a complete tool-use interaction, step by step. Consider a user who asks: "What's the weather in London and should I bring an umbrella?"

The interaction unfolds in four stages:

  • Stage 1 — User query: the user sends their message. The model receives it alongside the tool definitions.
  • Stage 2 — Model decides to call a tool: the model determines it needs current weather data to answer the question. It outputs a structured call: get_weather(city="London").
  • Stage 3 — Runtime executes: your code calls the real weather API. It returns {"temp": 12, "condition": "rain", "humidity": 85}.
  • Stage 4 — Model responds: the model reads the tool result, reasons that rain means yes to the umbrella question, and generates: "It's 12°C and rainy in London — definitely bring an umbrella!"

Now consider a more complex case: "Compare the weather in London and Paris." The model needs data from two cities. With parallel tool calling, it can emit both calls simultaneously:

  • Stage 1: user asks for a comparison.
  • Stage 2: model emits two tool calls in one response: get_weather(city="London") and get_weather(city="Paris").
  • Stage 3: runtime executes both concurrently (since they're independent). Both results return.
  • Stage 4: model compares the two results and responds: "London is 12°C and rainy while Paris is 18°C and cloudy — Paris is warmer and drier today."

The code below is a complete, runnable simulation of this tool-use loop. It implements a mock model (which selects tools based on simple keyword matching) and a mock weather API, then runs through both the single-call and parallel-call scenarios:

import json

# ---- Mock weather API ----
def get_weather(city, unit="celsius"):
    db = {
        "london": {"temp": 12, "condition": "rain",   "humidity": 85},
        "paris":  {"temp": 18, "condition": "cloudy", "humidity": 60},
        "tokyo":  {"temp": 24, "condition": "sunny",  "humidity": 45},
    }
    data = db.get(city.lower(), {"temp": 0, "condition": "unknown", "humidity": 0})
    if unit == "fahrenheit":
        data = {**data, "temp": round(data["temp"] * 9/5 + 32)}
    data["city"] = city
    return data

# ---- Tool registry ----
TOOLS = {"get_weather": get_weather}

# ---- Mock model: decides which tools to call based on query ----
def mock_model_plan(query, available_tools):
    """Simulate a model deciding which tool calls to make."""
    calls = []
    query_lower = query.lower()
    cities = []
    for city in ["london", "paris", "tokyo", "new york"]:
        if city in query_lower:
            cities.append(city.title())
    if cities and "get_weather" in available_tools:
        for city in cities:
            calls.append({"name": "get_weather", "arguments": {"city": city}})
    return calls

def mock_model_respond(query, tool_results):
    """Simulate a model generating a response from tool results."""
    if len(tool_results) == 1:
        r = tool_results[0]
        umbrella = " Bring an umbrella!" if r["condition"] == "rain" else ""
        return (
            f"It's {r['temp']}C and {r['condition']} in {r['city']}, "
            f"with {r['humidity']}% humidity.{umbrella}"
        )
    else:
        parts = []
        for r in tool_results:
            parts.append(f"{r['city']}: {r['temp']}C, {r['condition']}")
        comparison = " vs ".join(parts)
        warmest = max(tool_results, key=lambda r: r["temp"])
        return f"{comparison}. {warmest['city']} is the warmest."

# ---- The tool-use loop ----
def run_agent(query):
    print(f"User: {query}")
    print("-" * 50)

    # Step 1: Model decides on tool calls
    tool_calls = mock_model_plan(query, TOOLS)

    if not tool_calls:
        print("Model: (no tools needed, respond directly)")
        return

    # Step 2: Show the calls (parallel if multiple)
    parallel = len(tool_calls) > 1
    print(f"Model decides to call {len(tool_calls)} tool(s)" +
          (" in parallel:" if parallel else ":"))
    for call in tool_calls:
        print(f"  -> {call['name']}({call['arguments']})")

    # Step 3: Execute all calls (concurrently in real systems)
    results = []
    for call in tool_calls:
        func = TOOLS[call["name"]]
        result = func(**call["arguments"])
        results.append(result)
        print(f"  <- {call['name']} returned: {json.dumps(result)}")

    # Step 4: Model generates final response
    response = mock_model_respond(query, results)
    print(f"\nModel: {response}")
    print()

# ---- Run both scenarios ----
print("=== Scenario 1: Single tool call ===")
print()
run_agent("What's the weather in London and should I bring an umbrella?")

print("=== Scenario 2: Parallel tool calls ===")
print()
run_agent("Compare the weather in London and Paris")

In production systems, the mock model is replaced by an actual LLM API call, and the mock functions are replaced by real HTTP requests, database queries, or any other side-effecting operation. But the loop structure is exactly the same: query → plan → execute → respond. This is the fundamental pattern behind every tool-using agent, from simple single-tool chatbots to complex multi-step systems that chain dozens of tools together.

Quiz

Test your understanding of function calling and tool use.

When a model 'calls a function', what does it actually produce?

What is the purpose of constrained decoding in structured output mode?

Why is having too many tools (50+) problematic for a model?

In the Toolformer approach, how are good tool-calling examples selected for training?