What is LLM reasoning?

Example

Overall design

How “normal” LLMs are created

DeepSeek’s Process

What is different with R1?

They used a different post-training recipe on the base model!

The score of each candidate solution is fed back into the model as the reward signal for training
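The R1 recipe relies on rule-based rewards (answer correctness plus output format) rather than a learned reward model. Below is a minimal sketch of what such a scoring function could look like; the tag structure, checks, and weights are illustrative assumptions, not DeepSeek's actual implementation.

    import re

    def rule_based_reward(completion: str, reference_answer: str) -> float:
        """Illustrative rule-based reward: format check plus accuracy check."""
        reward = 0.0

        # Format reward: did the model wrap its output in the requested tags?
        if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.DOTALL):
            reward += 0.5

        # Accuracy reward: does the extracted answer match the reference?
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if match and match.group(1).strip() == reference_answer.strip():
            reward += 1.0

        return reward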

DeepSeek-R1-Zero’s post-training is entirely reinforcement-learning-based, with no supervised fine-tuning.
GRPO allows them to:

  • Scale efficiently across massive reasoning models,
  • Avoid instability from learning value functions,
  • And use relative, group-based rewards to encourage better reasoning behaviors.

In short:

GRPO = PPO without the value model, replacing learned baselines with group-based comparisons to stabilize and speed up RL for large language models.
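To make the group-based comparison concrete, here is a small sketch of the advantage computation: sample a group of completions for the same prompt, score each one, and standardize every reward against the group mean and standard deviation. The function names are mine, and the full algorithm also includes a PPO-style clipped policy update and a KL penalty that this sketch omits.

    import torch

    def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        """GRPO-style advantages for one prompt.

        rewards: shape (G,), one scalar reward per sampled completion.
        Standardizing against the group replaces the learned value-function
        baseline that PPO would otherwise need.
        """
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    # Example: four completions sampled for the same prompt, scored by the reward function.
    rewards = torch.tensor([1.5, 0.5, 0.0, 1.5])
    advantages = group_relative_advantages(rewards)
    # Completions scoring above the group mean get positive advantages (reinforced);
    # those below the mean get negative advantages (suppressed).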

Prompt Used for Reasoning RL Training
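The training prompt asks the model to put its chain of thought and its final answer inside dedicated tags. The template below is a paraphrase of the kind of prompt described in the R1 paper, not a verbatim copy:

    # Paraphrased, approximate version of the R1-style reasoning prompt template.
    REASONING_TEMPLATE = (
        "A conversation between User and Assistant. The user asks a question, "
        "and the Assistant solves it. The Assistant first thinks through the "
        "reasoning process and then provides the answer. The reasoning process "
        "and answer are enclosed within <think> </think> and <answer> </answer> "
        "tags, respectively.\n"
        "User: {question}\n"
        "Assistant:"
    )

    prompt = REASONING_TEMPLATE.format(question="What is 17 * 24?")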

It learns to think!

The result of this RL process is that the model starts to spend more time reasoning as training progresses
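One way this shows up is in the average response length per RL step, which climbs steadily over training. A trivial logging helper (hypothetical, just to illustrate the metric):

    def mean_completion_length(completions: list[list[int]]) -> float:
        """Average number of generated tokens per completion at one RL step.

        Plotted over training steps, this curve rises as the model learns to
        spend more tokens reasoning before answering.
        """
        return sum(len(tokens) for tokens in completions) / len(completions)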

Pure RL is not enough (DeepSeek-R1-Zero)

DeepSeek-R1 adds some SFT data to address the readability and language-mixing problems seen in R1-Zero
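Mechanically, that SFT step is plain next-token prediction on well-formatted reasoning examples. A minimal sketch is below; the checkpoint ("gpt2" as a stand-in base model), the example, and the prompt masking are illustrative, not DeepSeek's pipeline.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "gpt2"  # stand-in for the base checkpoint, purely for illustration
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # One hand-written, well-formatted reasoning example (hypothetical cold-start data).
    prompt = "User: What is 17 * 24?\nAssistant:"
    target = " <think> 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408 </think> <answer>408</answer>"

    # Standard cross-entropy on the next token, with the loss masked out on the
    # prompt so the model only imitates the readable reasoning/answer text.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + target, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignored positions in the loss

    loss = model(input_ids=full_ids, labels=labels).loss
    loss.backward()
    optimizer.step()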

Demystifying Delays in Reasoning: A Pilot Temporal and Token Analysis of Reasoning Systems

Reasoning and deep research study

Implications

  • Tool calling, especially web search, accounts for most of the latency for GPT-5 and Open Deep Research
  • By contrast, tool-calling latency is insignificant for o3
  • Tool calls can often be parallelized (see the sketch after this list)
  • Higher token counts (e.g., longer reasoning chains) can improve generation quality, but come with higher latency and cost
  • More on tools and agents next Monday
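Because independent tool calls don't need to wait on each other, issuing them concurrently cuts total latency from the sum of the call times to roughly the slowest single call. A sketch with asyncio; web_search here is a hypothetical async wrapper around a search tool:

    import asyncio

    async def web_search(query: str) -> str:
        """Hypothetical async wrapper around a web-search tool call."""
        await asyncio.sleep(1.0)  # stand-in for network latency
        return f"results for: {query}"

    async def sequential(queries: list[str]) -> list[str]:
        # Total latency ~= sum of the individual call latencies.
        return [await web_search(q) for q in queries]

    async def parallel(queries: list[str]) -> list[str]:
        # Total latency ~= latency of the slowest single call.
        return await asyncio.gather(*(web_search(q) for q in queries))

    queries = ["GRPO paper", "DeepSeek-R1 benchmarks", "o3 latency analysis"]
    results = asyncio.run(parallel(queries))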