Scheduler Activations

| Kernel-level threads | User-level threads |
| --- | --- |
| OS is aware of them and manages them | Purpose: express concurrency in applications |
| Purpose: leverage physical parallelism | Managed by user-level thread libraries |
| Misconception: they run only in kernel space. In fact they execute user code in user space; the kernel just manages them | Example: pthreads (exit, yield, etc.) |
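
To make the kernel-level side concrete, a minimal sketch using Python's `threading` module (in CPython each `Thread` is backed by a kernel thread that the OS schedules, yet all of the code it runs executes in user space):

```python
import threading

# In CPython, each threading.Thread maps 1:1 to a kernel thread:
# the OS is aware of it and schedules it, but the function it runs
# executes in user space. The kernel just manages the thread.
counter = 0
lock = threading.Lock()

def work(n):
    global counter
    for _ in range(n):
        with lock:          # user-space code, kernel-managed thread
            counter += 1

threads = [threading.Thread(target=work, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                # the kernel tells us when each thread exits
print(counter)  # 4000
```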

User vs Kernel Thread

(Figure: user vs. kernel view of threads. State for a user-level thread lives in user space; state for a kernel thread lives in the kernel. When a thread calls read() and blocks in the kernel, the user-level library is unaware of the block.)

Threading Models

  • M:N threading - multiplex M user-level threads over N kernel threads
    • where M = user-level threads
    • where N = kernel threads
    • typically M >> N
  • M:M threading - map each user-level thread directly to a kernel thread
  • M:1 threading - one kernel thread multiplexes all user-level threads, so the app runs on only one core (not used much anymore)
    • no physical parallelism
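
The M:1 model can be sketched with generators standing in for user-level threads: many "threads" multiplexed onto the single kernel thread running the interpreter, with `yield` playing the role of a user-level yield(). The names here are ours, not a real thread library:

```python
import collections

# Toy M:1 user-level "thread library": a round-robin scheduler over
# generators. Switching threads is just a yield; the kernel never sees it.
def run(tasks):
    ready = collections.deque(tasks)   # the user-level runqueue
    trace = []
    while ready:
        task = ready.popleft()
        try:
            trace.append(next(task))   # run the thread until it yields
            ready.append(task)         # round-robin: back of the queue
        except StopIteration:
            pass                       # user-level thread exited
    return trace

def worker(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"

print(run([worker("A", 2), worker("B", 2)]))
# ['A:0', 'B:0', 'A:1', 'B:1']
```

Note what's missing: if any generator made a blocking syscall, the whole process would stall, which is exactly the M:1 "no kernel awareness" problem.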

Tradeoffs between user-level and kernel-level threads

| User-level threads | Kernel-level threads |
| --- | --- |
| Lower overhead | Higher overhead |
| Kernel events are hidden from the user level: threads can block in the kernel in many ways (I/O, page faults) and the library doesn't see it | OS awareness of blocking; OS can preempt kernel-level threads |
| Customizable, specialized | General purpose |
| | Hardware parallelism |
| | More isolation |

Scheduler Activations

(Figure: scheduler activations. The kernel decides how to allocate cores (C1, C2, C3); the user level decides how to schedule threads among its allocated cores via scheduler activations (SA1, SA2, SA3). When SA1 blocks in the kernel on a read(), the kernel creates a new activation SA4 and makes an upcall into the user-level scheduler: "SA1 blocked, use SA4 instead". On the upcall, the user scheduler sees that U1 and U2 are runnable and schedules them on the available activations.)

  • OS just allocates cores
  • User-level schedules threads across cores
  • This is called 2-level scheduling
  • invariant: # of active SAs = # of cores
  • OS adds a core to an app: create a new sched. activation
  • OS revokes a core:
    • the kernel preempts two SAs and uses one of them to make the upcall that notifies the user-level scheduler
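
The upcall protocol above can be simulated in a few lines. This is a toy sketch; the class and method names are our invention, not the real scheduler-activations interface. The kernel's only decisions are which cores the app holds and when to upcall; the user-level scheduler decides what runs on each activation:

```python
# Toy simulation of scheduler activations (all names are ours, not a real API).
class UserScheduler:
    def __init__(self, threads):
        self.runnable = list(threads)
        self.activations = {}            # SA id -> user thread running on it

    # --- upcalls: the kernel notifies the user level of its decisions ---
    def upcall_new_activation(self, sa):
        # kernel granted a core: schedule a runnable thread on the new SA
        if self.runnable:
            self.activations[sa] = self.runnable.pop(0)

    def upcall_blocked(self, sa, new_sa):
        # an SA blocked in the kernel (e.g. read()); the kernel created a
        # fresh SA so the user level keeps its full allocation of cores
        del self.activations[sa]
        self.upcall_new_activation(new_sa)

sched = UserScheduler(["U1", "U2", "U3"])
sched.upcall_new_activation("SA1")   # kernel adds a core
sched.upcall_new_activation("SA2")   # and another
sched.upcall_blocked("SA1", "SA3")   # SA1 blocks; "use SA3 instead"
print(sched.activations)             # {'SA2': 'U2', 'SA3': 'U3'}
```

The invariant from the notes holds throughout: the number of active SAs equals the number of cores the kernel has granted.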

In Go

  • the runtime tries to detect when a kernel thread is blocked (e.g. in a syscall) and then wakes up or creates another kernel-level thread to keep goroutines running
  • this is not scheduler activations: there are no kernel upcalls, just detection inside the runtime

Summary of Scheduler Activations

  • Pros/Cons of user-level and kernel-level threads
  • Use of upcalls to notify the user level whenever scheduling decisions have been made

Decade of Wasted Cores

  • Linus Torvalds, what do you know?
    • his temper
    • his response to commits
    • ego
    • Git, Linux
    • advocate for monolithic kernels
    • advocate for open source software

Scheduler Goals

  • Fair allocation of compute - policy
  • Prioritize important tasks - policy
  • Work conserving - no core is idle if there are runnable threads
  • Performance - cache locality
  • Efficiency - power usage, locality
  • Scaling to many cores
  • General purpose

Single-Core Scheduling

(Figure: single-core scheduling. Threads wait on a runqueue in FIFO order; CFS, the Completely Fair Scheduler, decides which thread runs next.)

CFS: Completely Fair Scheduler

  • Preemptive - timeslice (e.g. a few ms)
  • Weighted fairness - each task accumulates vruntime (virtual runtime) scaled by its weight; the task with the smallest vruntime runs next
  • Implemented using a red-black tree ordered by vruntime (the leftmost node is next to run)
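
The vruntime mechanism can be sketched in a few lines. This is a simplification (we assume each run charges timeslice/weight); Linux keeps tasks in a red-black tree keyed by vruntime, and a heap gives the same pick-the-leftmost behavior for this sketch:

```python
import heapq

# CFS-style sketch: always run the task with the smallest vruntime, and
# charge it timeslice/weight, so heavier tasks accumulate vruntime slower
# and therefore get picked more often.
def cfs(tasks, timeslice, rounds):
    # tasks: {name: weight}; all vruntimes start at 0
    tree = [(0.0, name) for name in tasks]
    heapq.heapify(tree)                        # stand-in for the RB tree
    schedule = []
    for _ in range(rounds):
        vruntime, name = heapq.heappop(tree)   # leftmost = min vruntime
        schedule.append(name)
        vruntime += timeslice / tasks[name]    # weighted charge
        heapq.heappush(tree, (vruntime, name))
    return schedule

# A task with weight 2 runs about twice as often as one with weight 1:
print(cfs({"hi": 2, "lo": 1}, timeslice=1.0, rounds=6))
```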

Multi-Core Scheduling

  • one runqueue per core
  • (Figure: multi-core scheduling - each core has its own runqueue.)
  • Why one runqueue per core?
    • avoids contention on a single shared queue
      • so the runqueue isn’t the bottleneck

Challenges

  • load imbalance
    • expensive to remedy
  • complex cache hierarchies
  • hyperthreads (SMT)

    (Figure: cache hierarchy. Hyperthreads (SMT) share a physical core; each physical core has its own L1 and L2 caches; all cores on a socket (NUMA node) share the L3 cache / LLC; other sockets are further away.)

Load Balancing in Linux

  • periodic load balancing
  • “emergency” load balancing - when a core goes idle, it tries to steal work from busier runqueues
  • on thread creation/wakeup
    • try to place the thread on an idle core
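
The "emergency" (idle-core) balancing idea can be sketched with one deque per core; the function name and structure here are ours, not the Linux code path:

```python
from collections import deque

# Work-conserving sketch: an idle core steals from the busiest runqueue
# rather than sitting idle while runnable threads wait elsewhere.
def steal_if_idle(runqueues, core):
    # runqueues: one deque per core, as in Linux's per-core runqueues
    if runqueues[core]:
        return runqueues[core].popleft()        # local work: no stealing
    victim = max(range(len(runqueues)), key=lambda c: len(runqueues[c]))
    if runqueues[victim]:
        return runqueues[victim].pop()          # steal from the busiest core
    return None                                 # nothing runnable anywhere

rqs = [deque(), deque(["t1", "t2", "t3"]), deque(["t4"])]
stolen = steal_if_idle(rqs, 0)
print(stolen)  # t3  (core 0 was idle, so it stole from core 1)
```

Stealing keeps the scheduler work-conserving, at the cost of moving a thread away from its cached state, which is exactly the tension the hierarchical balancing tries to manage.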

Metrics

  • In Linux:
    • number of threads
    • sum of weights (priorities)
    • load - combines weights with average CPU utilization
    • groups - balancing compares scheduling groups of cores, not individual cores

Load-Balancing Algorithm

  • Idea: if we didn’t care about cache locality, we would want to perfectly balance cores, but we do care!
    • hierarchical approach
      • 1 core per group balances the group’s load
      • avoids needing to move threads across sockets
      • (Figure: load balancing works bottom-up through the hierarchy - hyperthreads within a core, then cores within a socket, then across sockets.)

Group Imbalance Bug

  • 8 NUMA nodes, 8 cores each
  • 64 compilation threads, each with load 1
  • 2 R threads, each with load 64
  • Each R thread ran on its own node; its high load made that node’s average load look busy, so the 64 compile threads were balanced across the other nodes while the remaining 7 cores on each R node sat idle
    • FIX: compare groups by their minimum core load rather than their average
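
The fix can be seen with plain arithmetic on a toy version of the workload (the numbers below are illustrative):

```python
# Group-imbalance bug in miniature: a group's load was its *average* core
# load, so one heavyweight thread made a whole node look busy and the
# balancer refused to hand its idle cores any work.
node_a = [64, 0, 0, 0, 0, 0, 0, 0]   # one R thread of load 64, 7 idle cores
node_b = [1, 1, 1, 1, 1, 1, 1, 1]    # eight compile threads of load 1

avg = lambda cores: sum(cores) / len(cores)

# By average, node_a (8.0) looks busier than node_b (1.0), so the
# balancer never pushes threads toward node_a's idle cores:
print(avg(node_a), avg(node_b))      # 8.0 1.0

# By minimum core load, node_a (0) is clearly where work should go:
print(min(node_a), min(node_b))      # 0 1
```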

Overload on Wakeup Bug

  • thread wakeups
  • threads were woken onto the busy local core rather than onto another, idle core
  • fix: favor idle cores for wakeups
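
A sketch of the wakeup-placement fix (the helper is hypothetical, not the actual Linux code path):

```python
# On wakeup, prefer an idle core over the possibly-busy local core.
def place_on_wakeup(local_core, queue_lengths):
    # queue_lengths[c] = runnable threads already queued on core c
    if queue_lengths[local_core] == 0:
        return local_core                      # local core is idle: best case
    idle = [c for c, n in enumerate(queue_lengths) if n == 0]
    return idle[0] if idle else local_core     # else favor any idle core

print(place_on_wakeup(0, [3, 0, 2]))  # 1: core 0 is busy, core 1 is idle
```

Preferring the local core is a locality heuristic; the bug was applying it even when other cores sat completely idle, which violates work conservation.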

Summary of Wasted Cores

  • Design of scalable CPU scheduler with per-core runqueues
  • Approaches to balancing load
  • CPU scheduling is complex