“Brain and Limbs” Analogy

  • LLM - the brain
    • complex reasoning
    • understand user intent
  • Tools - the limbs
    • act on the external world (search, code execution, APIs)

How Do Models Call Tools?

  • LLMs don’t execute tools directly
  • LLMs “learn” to generate tool call syntax
  • A system orchestration layer intercepts the call and executes the tool (see the sketch after this list)
  • Three primary methods for LLMs to know about tools
    • Instruction Tuning (Fine-Tuning)
    • System Prompt (e.g. ReAct)
    • In-Context Learning (Few-Shot)
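
A minimal sketch of that orchestration layer, assuming the hypothetical <toolcall>…</toolcall> syntax used in the few-shot example later in these notes and a stub weather tool (real systems use model-specific formats):

```python
import json
import re

# Hypothetical tool registry; a real system maps names to vetted functions.
TOOLS = {"weather": lambda location: json.dumps({"temp": 65})}

def run_turn(model_output: str):
    """Intercept a tool call in the model's output and execute it."""
    match = re.search(r"<toolcall>(\w+)\((.*?)\)</toolcall>", model_output)
    if match is None:
        return None  # plain text: the model did not request a tool
    name, raw_arg = match.group(1), match.group(2).strip("'\" ")
    result = TOOLS[name](raw_arg)  # executed by the system, not the LLM
    return result                  # fed back to the LLM as an observation

print(run_turn("<toolcall>weather('SF')</toolcall>"))  # -> {"temp": 65}
```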

Instruction Tuning

  • An LLM explicitly fine-tuned on a dataset of (Instruction, Tool Call, Tool Output, Final Answer) tuples (an example record appears after this list)
    • Early works (2023):
      • Toolformer
      • TALM
  • Fine-tuned to recognize when an instruction requires a tool and to emit the exact syntax
  • At inference time, no extra tool information needs to be passed to the LLM
  • Knowledge of tool use is baked into the model weights
    • Pro: Very fast and reliable for known tools.
    • Con: Not flexible; cannot easily add new tools without re-tuning.
  • Early work; does not scale as new tools appear
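
A hypothetical record in that (Instruction, Tool Call, Tool Output, Final Answer) format; the actual schemas used by Toolformer and TALM differ:

```python
# One hypothetical fine-tuning example; the exact schema varies by project.
example = {
    "instruction": "What's the weather in SF?",
    "tool_call": "weather('SF')",     # syntax the model learns to emit
    "tool_output": '{"temp": 65}',    # injected by the executor at training time
    "final_answer": "The weather in SF is 65.",
}
```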

System Prompt

The system prompt instructs the LLM to call tools

  • e.g., the ReAct loop:

  1. Reason: the LLM thinks step by step about what it needs.
  2. Tool Call: the LLM decides to call a tool and generates the tool call syntax.
  3. Tool Execution: outside the LLM, the system executes the tool, gets the result, and feeds it back to the LLM.
  4. Repeat this loop until the LLM has enough information to answer the user.

  • Pro: Flexible; new tools can be added just by describing them in the prompt
  • Con: Can be less reliable than fine-tuning; the agent can get “stuck in a loop”, making errors or incurring many LLM calls.

Not as fast or as reliable as fine-tuning, but more flexible.
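
A compact sketch of the ReAct loop, assuming a hypothetical llm() completion function and the run_turn executor from the earlier sketch:

```python
SYSTEM_PROMPT = """Answer the user's question. You may call tools with
<toolcall>tool_name(args)</toolcall>. Available tools: weather(city).
Think step by step before acting."""

def react(question: str, llm, max_steps: int = 5) -> str:
    transcript = SYSTEM_PROMPT + "\nUser: " + question
    for _ in range(max_steps):           # cap steps so the agent can't loop forever
        output = llm(transcript)         # Reason (+ maybe a Tool Call)
        observation = run_turn(output)   # Tool Execution, outside the LLM
        if observation is None:
            return output                # no tool call means a final answer
        transcript += f"\n{output}\nObservation: {observation}"
    return "Stopped: step budget exhausted."
```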

In-Context Learning

Provide a few examples (few-shot prompting):

  User: What's the weather in SF?
  Model: <toolcall>weather('SF')</toolcall>
  System: {"temp": 65}
  Model: The weather in SF is 65.
  --- (New query) ---
  User: What about Seoul?
  Model: <toolcall>weather('Seoul')</toolcall>

  • Pros: Flexible; new tools can be added just by demonstrating them in the prompt
  • Cons: Longer context; can be less reliable than fine-tuning; needs good examples
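
A sketch of how the few-shot prompt above might be assembled; the demonstration text comes straight from the transcript, everything else is illustrative:

```python
FEW_SHOT = """User: What's the weather in SF?
Model: <toolcall>weather('SF')</toolcall>
System: {"temp": 65}
Model: The weather in SF is 65."""

def build_prompt(query: str) -> str:
    # New tools are taught purely by adding demonstrations like FEW_SHOT,
    # at the cost of a longer context on every request.
    return f"{FEW_SHOT}\n--- (New query) ---\nUser: {query}\nModel:"

print(build_prompt("What about Seoul?"))
```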

Common Agent Use Cases

  • Data Science: Writing and executing Python code to analyze data, plot charts, and check statistical results.
  • Search & Retrieval: Answering questions with up-to-date information by fetching and summarizing web content.
  • Productivity: Managing calendars, drafting emails, and summarizing documents.
  • DevOps: Reading logs, diagnosing errors, suggesting code fixes, or managing cloud infrastructure via CLI tools.
  • Creative Work: Generating images with tools like DALL-E, or composing music snippets via a MIDI API.

Agent Tooling Challenges

Standardization

  • The Problem: An “N x M” integration nightmare.
  • N models (GPT-4, Claude 3, Gemini, Llama 3) all have different fine-tuning data and preferred tool-call syntax.
  • M tools (Google, Stripe, Slack, JIRA, internal APIs) all have different authentication, schemas, and endpoints.
  • The Result: Developers must write custom, brittle “glue code” for every single model-tool combination. This is not scalable.

The Solution: MCP (Model Context Protocol)

  • Connecting N LLMs to M external tools/resources used to be an N×M problem
  • MCP standardizes LLM-tool communication into an N→1→M process
  • Built on a client-server model (a minimal server sketch follows)
    • MCP client: the agent that needs to call tools/data
    • MCP server: a service that exposes external tools and data sources
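
A minimal MCP server sketch using the FastMCP helper from the official MCP Python SDK (the decorator-based API shown in the SDK's documentation; details may change between releases):

```python
# pip install mcp
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather-demo")

@mcp.tool()
def weather(city: str) -> dict:
    """Return a (stubbed) weather report for the given city."""
    return {"city": city, "temp": 65}  # a real server would call a weather API

if __name__ == "__main__":
    mcp.run()  # serve the tool to any MCP client (stdio transport by default)
```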

MCP Security Vulnerability

  • Malicious MCP server: tool descriptions and tool outputs come from an untrusted third party, so a hostile server can use them to inject instructions into the agent (tool poisoning / prompt injection)

MCP Tradeoffs

The good:

  • Unified tool protocol
  • Language and framework agnostic
  • Unifies function calling and data access

Not so good:

  • Adds complexity for programmers
    • Typically unnecessary unless integrating a large number of tools
  • Security vulnerabilities
  • Performance overhead
  • The MCP codebase is not consistently maintained by Anthropic

Performance Problem

The Problem: Sequential tool calls are slow!

  • Query: “What’s the weather in San Diego, and what’s the top news story on arXiv today?”
  • Sequential Agent:
    1. call(weather_api, location='SD') → (waits 1 second)
    2. call(arxiv_api, query='top') → (waits 1.5 seconds)
    3. LLM synthesizes response → (waits 0.5 seconds)
  • Total Latency: 3 seconds

The solution: Parallel Tool Call

  • When tool calls are independent, they can be executed in parallel
  • The LLM’s reasoning step must be sophisticated enough to recognize that the calls are independent
  • Instead of emitting one tool call, it emits a list of calls
  • The system layer executes all API calls concurrently (see the asyncio sketch after this list)
    • [call(weather_api), call(arxiv_api)] → (waits 1.5 seconds)
    • LLM synthesizes response → (waits 0.5 seconds)
  • Total Latency: 2 seconds
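
A sketch of the concurrent execution step using asyncio, with stub tools standing in for the latencies from the example:

```python
import asyncio

async def weather_api(location: str) -> dict:
    await asyncio.sleep(1.0)   # stand-in for the 1 s weather API call
    return {"temp": 72, "location": location}

async def arxiv_api(query: str) -> dict:
    await asyncio.sleep(1.5)   # stand-in for the 1.5 s arXiv API call
    return {"top": "Some paper title"}

async def main():
    # The LLM emitted a *list* of independent calls; run them concurrently.
    results = await asyncio.gather(weather_api("SD"), arxiv_api("top"))
    print(results)  # total wait ~1.5 s (the slowest call), not 2.5 s

asyncio.run(main())
```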

Current Research: LLMCompiler Plans for Parallel Tool Calls

Deep Dive: Web Search Tool

  • An AI web search “tool” is not a single API call
  • It’s a complex workflow or even an agent
  • Steps:
    1. Query Rewrite
      • An LLM rewrites the user query into clearer, more precise queries.
    2. Search
      • Call a search API (e.g., Google, DuckDuckGo) with the rewritten query to retrieve a list of URLs
    3. Crawling
      • Use a crawler (e.g., Crawl4AI) to fetch the raw HTML from the URLs returned by the search.
      • This is fast but often fails on JavaScript-heavy, client-side rendered (CSR) websites.
    4. Browser (as needed)
      • If the raw HTML from crawling is empty or lacks content, the system can trigger a headless browser (e.g., Puppeteer, Playwright).
      • The browser can perform clicks (e.g., to bypass CAPTCHAs) and render JavaScript to capture dynamic content.
    5. Chunking
      • Split each page into smaller, semantic units (e.g., paragraphs or sections).
    6. Embedding
      • Convert text chunks into vector embeddings
      • Can save/cache these embeddings for future use
    7. Reranking & Dedup (see the sketch after this list)
      • Rerank chunks based on the semantic similarity between the original user query vector and each chunk vector
      • During the process, remove identical or near-identical chunks
    8. Summarization & Aggregation
      • Feed the top-k, reranked, and deduplicated chunks into an LLM for summarization
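
A sketch of steps 6-7, assuming a hypothetical embed() function that returns numpy vectors; cosine similarity drives both the reranking and the near-duplicate filter:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rerank(query_vec, chunks, embed, top_k=5, dedup_threshold=0.95):
    """Rank chunks by query similarity, dropping near-identical ones."""
    scored = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    kept, kept_vecs = [], []
    for chunk in scored:
        vec = embed(chunk)  # in practice, cache embeddings from step 6
        if all(cosine(vec, v) < dedup_threshold for v in kept_vecs):
            kept.append(chunk)
            kept_vecs.append(vec)
        if len(kept) == top_k:
            break
    return kept  # fed to the LLM for summarization (step 8)
```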

Case Study: Deep Research Agent

  • AI web search is good for getting information; LLMs are good at answering questions with short answers
  • Need something else to conduct thorough research and generate comprehensive reports ⇒ Deep Research
  • Common use cases:
    • Literature review, survey writing, other academic research
    • Financial reports and analysis
    • Lead generation, sales, marketing research
    • Competitor research
    • Product comparison
  • Key capabilities of a deep research agent
    • Adaptive Long-Horizon Planning: creates and adjusts a complex, multi-step plan.
    • Multi-Hop Information Retrieval: follows a series of searches across multiple sources.
    • Iterative Tool Use: repeatedly calls tools (search, browse, code) to refine knowledge.
    • Structured Report Generation: the final output is a structured document.
  • There are many different ways to build deep-research agents; one possible skeleton follows.
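
One possible skeleton for such an agent, with hypothetical llm() and web_search() functions; real systems add structure (sub-agents, citations, verification) on top of this loop:

```python
def deep_research(topic: str, llm, web_search, max_hops: int = 10) -> str:
    plan = llm(f"Draft a research plan for: {topic}")      # long-horizon planning
    notes = []
    for _ in range(max_hops):                              # multi-hop retrieval
        query = llm(f"Plan:\n{plan}\nNotes so far:\n{notes}\nNext search query?")
        notes.append(web_search(query))                    # iterative tool use
        plan = llm(f"Revise the plan given the newest note:\n{plan}\n{notes[-1]}")
        if "PLAN COMPLETE" in plan:                        # model signals coverage
            break
    return llm(f"Write a structured report on '{topic}' using:\n{notes}")
```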