Lec 1 - LLMs - Gen AI

What is an LLM?

language models
- text to text generation
that are large
- Massive number of model parameters (weights)
- Trained on huge amounts of data

How do LLMs learn?

Via a model architecture (giant graph of interconnected nodes)
- transformer, mamba
Model weights “trained” with massive dataset
- internet, books, code

What can LLMs do?

Text Gen
Translation
Summarization
Question Answering
Conversational AI

What can’t LLMs do?

Hallucination: can generate plausible-sounding but incorrect information
Bias: can reflect biases present in their training data
Real-time info: no access to data not present in their training data (e.g. events after the training cutoff)
Privacy: your personal info can be part of the training data
Security: LLMs can be manipulated to give/not give certain answers

What is Infrastructure (systems)?

the backbone that powers applications
Traditional infra (systems)
- OS, database, networking, hardware, compiler, security, virtualization, DevOps, etc
- for data centers, cloud computing

What is LLM infra?

LLM training
- Pre-training (foundational model training)
- Post-training (fine-tuning, RLHF)
LLM serving
- Single-GPU/CPU LLM inference
- Distributed model serving
LLMOps
- Training data collection, preparation, and synthesis
- Experiment tracking, model registry
- Monitoring and logging of LLM serving

What is an AI Agent?

From outside:
- autonomous entity that can operate on its own
From inside:
- program involving model calls and tool calls
- diff from LLM: can perceive env, make decisions, take actions
- diff from AI workflows: autonomous and can morph
Reference: The Rise and Potential of Large Language Model Based Agents: A Survey

Categories of AI Agents

Single-agent applications
Multi-agent systems
- LLM agents interacting with each other in a collaborative or competitive manner
Human-Agent cooperation
- LLM agents can interact with humans, providing them with assistance and performing tasks more efficiently and safely

Agent-to-Agent (A2A) Protocol

A protocol enabling standardized communication across agents
- Support agents using different frameworks to communicate
- A client (local) agent can discover agents by fetching “Agent Card” of available remote agents
- And then delegate a task to the chosen remote agent
Supports streaming and asynchronous push notifications for long tasks
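
The discover-then-delegate flow above can be sketched in memory. This is only an illustration: it does no networking, and `AgentCard`, `RemoteAgent`, `discover`, `delegate`, and the `skills` field are simplified, hypothetical names rather than the actual A2A schema or SDK.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class AgentCard:
    """Simplified stand-in for an agent's self-description (hypothetical fields)."""
    name: str
    skills: List[str]


@dataclass
class RemoteAgent:
    card: AgentCard
    handler: Callable[[str], str]  # stands in for the remote agent's task endpoint


def discover(agents: List[RemoteAgent], skill: str) -> Optional[RemoteAgent]:
    """Client agent inspects each card and picks the first agent with the skill."""
    for agent in agents:
        if skill in agent.card.skills:
            return agent
    return None


def delegate(agent: RemoteAgent, task: str) -> str:
    """Hand the task to the chosen remote agent and return its result."""
    return agent.handler(task)


# Two toy remote agents, possibly built with different frameworks.
REGISTRY = [
    RemoteAgent(AgentCard("summarizer", ["summarize"]), lambda t: f"summary of: {t}"),
    RemoteAgent(AgentCard("translator", ["translate"]), lambda t: f"translated: {t}"),
]

chosen = discover(REGISTRY, "translate")
if chosen is not None:
    print(chosen.card.name, "->", delegate(chosen, "hello"))
```

The point of the card is that the client never needs to know how the remote agent is implemented, only what it advertises.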

What are LLM Agents?

Planning

Subgoal and decomposition: the agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks
Reflection and refinement: the agent can criticize and reflect on its own past actions, learn from mistakes, and refine its approach for future steps, thereby improving the quality of final results

Task Decomposition: Chain of Thought (CoT)

CoT has become a standard prompting technique for enhancing model performance on complex tasks

The model is instructed to “think step by step”, spending more test-time computation to decompose hard tasks into smaller and simpler steps.

Transforms big tasks into multiple manageable tasks and exposes the model’s thinking process
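
A minimal sketch of how a CoT prompt might be assembled and its answer parsed. No model is called; `build_cot_prompt`, `extract_answer`, the `Answer:` convention, and the hand-written completion are all illustrative assumptions.

```python
def build_cot_prompt(question: str) -> str:
    """Wrap a question with a step-by-step instruction (zero-shot CoT style)."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then give the final answer on a line "
        "starting with 'Answer:'.\n"
    )


def extract_answer(completion: str) -> str:
    """Pull the final answer out of a completion that shows its reasoning."""
    for line in reversed(completion.splitlines()):
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return completion.strip()  # fall back to the raw text


# A hand-written completion standing in for real model output.
fake_completion = "17 apples minus 5 is 12.\n12 plus 3 is 15.\nAnswer: 15"

print(build_cot_prompt("I had 17 apples, ate 5, bought 3. How many now?"))
print(extract_answer(fake_completion))
```

The intermediate lines are the visible "thinking process"; only the last line is treated as the result.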

Task Decomposition: Tree of Thoughts

Tree of Thoughts extends CoT by exploring multiple reasoning possibilities at each step, creating a tree structure.

Design choices involved:

DFS or BFS search
how each state is evaluated by a classifier (via prompt or majority vote)
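
The two design choices above (search strategy and state evaluation) can be sketched as a BFS with a beam. In a real Tree of Thoughts setup `propose` and `evaluate` would be LLM calls; here they are toy functions on a made-up number puzzle, and every name in the block is a hypothetical chosen for illustration.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]  # a partial "chain of thoughts"; here, numbers picked so far


def tot_bfs(
    root: State,
    propose: Callable[[State], List[State]],
    evaluate: Callable[[State], float],
    is_goal: Callable[[State], bool],
    beam_width: int = 2,
    max_depth: int = 5,
) -> Optional[State]:
    """Breadth-first search that keeps only the best `beam_width` states per level."""
    frontier = [root]
    for _ in range(max_depth):
        # Expand every state in the frontier into several candidate next thoughts.
        candidates = [child for state in frontier for child in propose(state)]
        for candidate in candidates:
            if is_goal(candidate):
                return candidate
        # Score candidates and prune to the most promising ones.
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
        if not frontier:
            break
    return None


# Toy task: pick numbers from {1, 2, 3} whose sum is exactly TARGET.
TARGET = 7


def propose(state: State) -> List[State]:
    return [state + (n,) for n in (1, 2, 3)]


def evaluate(state: State) -> float:
    return -abs(TARGET - sum(state))  # closer to the target scores higher


def is_goal(state: State) -> bool:
    return sum(state) == TARGET


print(tot_bfs((), propose, evaluate, is_goal))
```

Swapping the list-based frontier for a stack would give the DFS variant; the evaluator is where a prompt-based classifier or majority vote would plug in.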

Self-Reflection: ReAct

ReAct integrates reasoning and acting within the LLM by extending the action space to combine task-specific discrete actions with the language space

Figure: example format of the explicit steps for an LLM to think
Figure: example of ReAct differences
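
ReAct interleaves Thought, Action, and Observation steps. The loop below is a rough sketch of that cycle under stated assumptions: the `Action: name[input]` text format, the `finish` action, the `calc` tool, and the scripted stand-in for the model are all hypothetical, and a real agent would call an LLM where `scripted_llm` is used.

```python
from typing import Callable, Dict, Tuple


def parse_action(step: str) -> Tuple[str, str]:
    """Split the last line of a step, 'Action: name[input]', into (name, input)."""
    action_line = step.strip().splitlines()[-1]
    body = action_line[len("Action: "):]
    name, _, rest = body.partition("[")
    return name, rest.rstrip("]")


def react_loop(
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 5,
) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)           # model emits a Thought and an Action
        transcript += step + "\n"
        name, arg = parse_action(step)
        if name == "finish":             # language-space action that ends the episode
            return arg
        observation = tools[name](arg)   # task-specific discrete action
        transcript += f"Observation: {observation}\n"
    return "no answer within step budget"


# Scripted stand-in for a model: picks its reply by counting observations so far.
SCRIPT = [
    "Thought: I should compute the sum with the calculator.\nAction: calc[2+3]",
    "Thought: The calculator returned 5.\nAction: finish[5]",
]


def scripted_llm(transcript: str) -> str:
    return SCRIPT[transcript.count("Observation:")]


TOOLS = {"calc": lambda expr: str(sum(int(p) for p in expr.split("+")))}

print(react_loop(scripted_llm, TOOLS, "What is 2 + 3?"))
```

Each observation is appended to the transcript, so the next Thought can react to what the tool returned.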

Memory

Sensory memory: learning embedding representations for raw inputs, including text, image, or other modalities (vision encoder / speech encoder)
Short-term memory: in-context learning uses the model's short-term memory. It is short and finite, restricted by the context window length of the transformer (prompt engineering)
Long-term memory: lets the agent retain and recall an effectively unbounded amount of information over extended periods, often via an external vector store with fast retrieval that the agent queries (RAG)
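
To make the "short and finite" constraint concrete, here is a small sketch of a short-term memory that evicts the oldest messages once a token budget is exceeded. `ShortTermMemory` is a hypothetical class, and counting whitespace-separated words is a crude assumption standing in for a real tokenizer.

```python
from collections import deque


class ShortTermMemory:
    """Keeps the most recent messages that fit inside a fixed token budget."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.messages: deque = deque()

    @staticmethod
    def _tokens(text: str) -> int:
        return len(text.split())  # crude whitespace count, not a real tokenizer

    def add(self, message: str) -> None:
        self.messages.append(message)
        # Evict from the oldest end until the window fits (always keep the newest).
        while (
            sum(self._tokens(m) for m in self.messages) > self.max_tokens
            and len(self.messages) > 1
        ):
            self.messages.popleft()

    def context(self) -> str:
        return "\n".join(self.messages)


mem = ShortTermMemory(max_tokens=5)
for msg in ["a b c", "d e", "f g"]:
    mem.add(msg)
print(mem.context())
```

Anything evicted here is simply forgotten, which is the gap that long-term memory backed by an external store is meant to fill.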

Agent Memory Types

Semantic: facts and general knowledge
Episodic: past requests/actions
Procedural: skills defined in programs or fine-tuned
Short-term: context within a request execution

RAG

Why RAG?

LLM’s knowledge is static at time of training
No proprietary info
Non-attributable: difficult to trace the source of an LLM generation
Hallucination: easier to hallucinate when asked about knowledge outside of training set
LLMs are large and expensive to train and run, so proprietary info is kept outside the model

Components:
Data source
- preprocessed by chunking and embedding into vector DB
Retrieve
- convert query into embedding and look up the most similar chunks in the vector DB
Augment and generate
- combine retrieved chunks with existing context
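
The three components can be sketched end to end with toy pieces. The assumptions are significant: `embed` is a bag-of-words counter rather than a learned embedding model, the "vector DB" is a Python list, and all function names and sample documents are made up for illustration.

```python
import math
from collections import Counter
from typing import List, Tuple


def chunk(text: str, size: int = 8) -> List[str]:
    """Data source step: split a document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system uses a learned model)."""
    return Counter(w.strip(".,?!").lower() for w in text.split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(docs: List[str]) -> List[Tuple[str, Counter]]:
    """Stand-in for a vector DB: a list of (chunk text, embedding) pairs."""
    return [(c, embed(c)) for doc in docs for c in chunk(doc)]


def retrieve(query: str, index: List[Tuple[str, Counter]], k: int = 1) -> List[str]:
    """Retrieve step: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def augment(query: str, chunks: List[str]) -> str:
    """Augment step: combine retrieved chunks with the question into one prompt."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Use only this context:\n{context}\n\nQuestion: {query}\nAnswer:"


DOCS = [
    "The cache is flushed every night at midnight.",
    "Deployments go out on Tuesdays after review.",
]
INDEX = build_index(DOCS)
question = "When is the cache flushed?"
print(augment(question, retrieve(question, INDEX)))
```

The resulting prompt is what would be sent to the LLM for generation; because the retrieved chunks are explicit, the answer can be traced back to its source.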

Tool Use

Agent learns to call external APIs for extra information missing from model weights

eg:
- web search + crawling
- browser
- social media, email hooks
- code + CLI execution
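
On the program side, tool use usually comes down to mapping a structured request from the model onto a real function. The sketch below assumes the model emits a JSON object with `name` and `arguments` keys; that format, the `tool` decorator, and the two toy tools are hypothetical, not any specific vendor's function-calling API.

```python
import json
from typing import Any, Callable, Dict

TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so the agent can call it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def word_count(text: str) -> int:
    return len(text.split())


@tool
def to_upper(text: str) -> str:
    return text.upper()


def dispatch(tool_call_json: str) -> Any:
    """Parse a model-emitted tool call and run the matching registered tool."""
    call = json.loads(tool_call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        return f"error: unknown tool {call['name']!r}"
    return fn(**call["arguments"])


# A hand-written tool call standing in for model output.
print(dispatch('{"name": "word_count", "arguments": {"text": "call external APIs"}}'))
```

Returning an error string for unknown tools, rather than raising, lets the result be fed back to the model as an observation so it can retry.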

What makes LLM agents stand out?

Language Mastery: comprehend and produce language
Decision-making: LLMs are equipped to reason and decide
Flexibility: adaptability ensures they can be molded for diverse applications
Collaborative Interactions: collaborate with humans or other agents → multifaceted interactions

Examples

HuggingGPT - LLM + APIs
- LLM uses Hugging Face models as tools
TALM (tool augmented language model)
Toolformer
