RCU
RCU Goals
- concurrent reads, even during updates
- low overhead - memory, execution time
- deterministic completion time
Example: linked list
Approaches for mutual exclusion
(Figure: RCU linked list — Head → "out" → "dog" → "Fish" → x)
Approach #1: spin lock
- all threads acquire and release
Approach #2: read write lock
- Multiple readers OR 1 writer, can’t have both at the same time
- Benefit:
- multiple reads in parallel
- Drawbacks:
- cannot read while write is happening
- could cause starvation, depends on implementation
- e.g., readers can starve a writer
- execution time overhead
- struct rwlock { int n; }
  - n = 0 → unlocked
  - n = -1 → locked by 1 writer
  - n > 0 → locked by n readers
  - read_lock - increment n
  - write_lock - decrement n (0 → -1)
- to just read data, readers still do lots of writes to n
- space overhead
Thought experiment: readers skip the lock
- no writer: ok
- yes writer: not ok, could see inconsistent state
Read-Copy-Update (RCU)
- rules, patterns, mechanisms
RCU idea #1: writers make a copy
To update E2:
(Figure: head → E1 → E2 → E3; Enew is a copy of E2, then E1→next is redirected to Enew)
1. acquire lock
2. Enew = alloc()
3. Enew->next = E2->next
4. strcpy(Enew, "new string")
5. E1->next = Enew
6. release lock

- readers arriving during steps 1-5 see the old version; readers from step 6 onwards see the new version
- Readers cannot see partially complete updates
problem:
- Compilers and CPUs can reorder instructions
RCU idea #2: memory barriers to enforce ordering
To update E2:
- acquire lock
1. acquire lock
2. Enew = alloc()
3. Enew->next = E2->next
4. strcpy(Enew, "new string")
5. ---- memory barrier ----
6. E1->next = Enew
7. release lock
problem:
- use-after-free
- readers might still have a pointer to memory that we’re trying to free
RCU idea #3: grace period
- writers - wait until all CPUs have context switched
- readers - can’t hold RCU pointers through context switches
Linux’s RCU API
readers:
- rcu_read_lock / rcu_read_unlock - disable/enable timer interrupts
  - per-core operation, no need to synchronize with other cores
- rcu_dereference(pointer) - includes a memory barrier

writers:
- synchronize_rcu - wait for all CPUs to context switch
- call_rcu(callback, arg) - async version
- rcu_assign_pointer(pointer_addr, pointer) - updates the pointer, includes a memory barrier
RCU vs Read - write locks
readers:
- RCU imposes almost no overhead
- reads can happen during writes

writers:
- writes can take longer with RCU
Drawbacks of RCU
- only helpful if reads >> writes
- relies on context switches
- complex
- different model of consistency
- readers can observe stale data
- only works for data structures with single committing write
Summary of RCU
- multicore CPUs - need scalable way to do synchronization
- Solution is RCU - readers don’t have to wait and can run in parallel with writers
- used widely in Linux
Linux Scalability
- Corey - new OS from scratch
- This paper - scalability of Linux
- Scalable commutativity rule - principles
Scalability bottlenecks
- App
- Kernel
- Hardware
Approach:
- set of applications
- find/fix bottlenecks
- Iterate until OS is not bottleneck
Cache Coherence
Needs a cache coherence protocol → keeps caches consistent
(Figure: cores with local caches connected by an interconnect; reads and writes require the cache coherence protocol)
Scalability Bottlenecks
- writing to shared variables/memory
- lock on shared data
- competition for caches, memory bandwidth
- not enough concurrency
Techniques for Improving Scalability
- sloppy counters - partition counter across cores
Sloppy Counter
(Figure: sloppy counter — global: 0; per-core local counts on C1, C2, C3, going from 0, 0, 0 to 1, 2, 1)
- how do we know the true value of the counter?
  - true value of counter = global value + sum of the per-core local values
- avoid lock contention
- fine-grained locks
- avoid data contention
- per-core data structures
- wait-free sync
- e.g., RCU
- avoid false sharing
- cache lines, 64B
- two threads can access different variables that happen to sit in the same cache line
- the line ping-pongs between cores even though the underlying variables aren't actually shared between threads
- fix: pad/align the variables so each sits on its own cache line
Summary
- general techniques for scalability
- can apply to applications
- Linux is actually pretty scalable