  • Interfaces - exposed vs transparent
  • Centralization vs distribution
  • Implementation

Dist OS

[Drawing: diskless workstations and file servers connected by a local network]

Sprite

Technology trends are good for motivating system design and research problems, e.g., when one metric keeps scaling but another doesn't.

  • larger memories - increasing DRAM, more storage capacity
  • rise of networking - interconnected machines
    • workstations connected by a network
    • sharing
      • finding files (harder than on a single time-shared machine)
      • how to utilize idle machines
    • management/administration
    • transparency - users should not notice that they are running on a distributed platform
  • Multiprocessors
    • parallelism and sharing
    • OS multiprocessors
    • the designers were thinking about how to better support parallelism within the operating system itself

Features of Sprite

  • Distributed file system
    • utilize file caching
      • feasible thanks to larger memories
  • RPCs
  • Process migration:
    • utilizes idle machines
  • Single global namespace
    • helps with finding files
  • Subsystem + Monitors (RCU)
    • OS multiprocessor
  • Shared Fork
    • parallelism and sharing

Caching

Sprite caches file data on both the client and the server.

Sprite OS Caching

[Drawing: Clients A and B cache 4KB file blocks from the file server; client caches can become mutually inconsistent]

  • concurrent (write) sharing
    • disable caching; reads and writes for that file go through the server
  • sequential (write) sharing
    • version numbers - when a file changes, its version number is incremented, so a client can detect that its cached blocks are stale (see the sketch below)
  • System design principle: keep the common case fast and worry less about the uncommon cases
  • backing store - regular files
    • everything is represented as a file, which helps with caching because everything is the same type
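A minimal sketch of that version check in C, assuming hypothetical structure and field names; the real logic lives in the Sprite kernel's file open path:

```c
/* Detecting stale cached blocks via version numbers (hypothetical names). */
#include <stdint.h>
#include <stdbool.h>

struct cached_file {
    uint64_t file_id;
    uint64_t version;      /* version when our blocks were cached */
    bool     has_blocks;   /* any 4KB blocks cached locally? */
};

struct open_reply {
    uint64_t version;          /* server's current version of the file */
    bool     caching_disabled; /* set when concurrent write sharing seen */
};

void client_open(struct cached_file *f, const struct open_reply *r)
{
    if (f->has_blocks && f->version != r->version) {
        /* Sequential sharing: another client modified the file since we
           cached it, so our blocks are stale. Discard them and refetch
           from the server on demand. */
        f->has_blocks = false;
    }
    f->version = r->version;
    /* If r->caching_disabled, bypass the cache entirely: all reads and
       writes for this file go through the server. */
}
```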

Physical Memory Allocation

  • file buffer cache vs virtual memory
  • Sprite storage

    [Drawing: physical memory dynamically split between the file buffer cache and virtual memory, based on page age]
  • the split adjusts dynamically based on workload requirements (see the sketch below)
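A minimal sketch of the age-based split, assuming hypothetical pool structures; the real policy is inside the Sprite kernel:

```c
/* The file buffer cache and virtual memory each track the age (time
   since last reference) of their oldest page; a needed frame is taken
   from whichever pool's oldest page is older. */
#include <stdint.h>

struct pool {
    const char *name;
    uint64_t oldest_page_age;   /* ticks since last reference */
    uint64_t frames;            /* frames currently owned */
};

struct pool fs_cache = { "file buffer cache", 0, 0 };
struct pool vm       = { "virtual memory",    0, 0 };

/* Steal a frame from whichever pool holds the older (colder) page;
   over time this lets the split track the workload. */
struct pool *steal_frame_from(void)
{
    struct pool *victim =
        (fs_cache.oldest_page_age >= vm.oldest_page_age) ? &fs_cache : &vm;
    victim->frames--;
    return victim;
}
```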

Process Migration

  • Allow processes to migrate across workstations
  • Goals: transparency - don’t want it visible to the users
  • pmake - parallel make tool
  • Freeze the process, then transfer its state to the target machine
  • Kernel calls:
    • Different kernels hold different state, so the same call can return different results
    • How can we do this transparently?
      • forward location-dependent calls back to the home machine (see the sketch below)
    • Today - migrate VMs instead, because a VM encapsulates the kernel state
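A rough sketch of the forwarding idea, with hypothetical names (`is_location_dependent`, `rpc_to_home`) standing in for Sprite's kernel-to-kernel RPC machinery:

```c
/* Transparent kernel-call handling for a migrated process: calls that
   depend on home-machine kernel state run at home via RPC. */
#include <stdint.h>

struct process {
    int      migrated;       /* nonzero once the process has moved */
    uint32_t home_machine;   /* where it originally ran */
};

static int is_location_dependent(int nr)
{
    /* e.g., calls touching per-kernel state (process table, local
       device state) must execute on the home machine. */
    return nr == 42;   /* placeholder predicate */
}

static long rpc_to_home(uint32_t home, int nr, void *args)
{
    (void)home; (void)nr; (void)args;
    return 0;   /* stand-in for a kernel-to-kernel RPC */
}

static long local_syscall(int nr, void *args)
{
    (void)nr; (void)args;
    return 0;   /* stand-in for the normal local syscall path */
}

long dispatch_syscall(struct process *p, int nr, void *args)
{
    if (p->migrated && is_location_dependent(nr))
        return rpc_to_home(p->home_machine, nr, args);
    return local_syscall(nr, args);
}
```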

Shared Fork

“Processes in a shared address space” - threads.

  • Unix fork: child shares code (text) with the parent
  • Sprite shared_fork: shares both code and data (heap)

  • Why not share stack?
      • threads need separate (diverging) execution states, which requires a separate stack per thread
  • Today
    • pthread_create - threads share code and data (see the example below)
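A runnable example of the modern equivalent: `pthread_create` gives every thread the same code and heap but its own stack (the race on `shared_counter` is left unsynchronized for brevity):

```c
/* Compile with: cc demo.c -pthread */
#include <pthread.h>
#include <stdio.h>

int shared_counter = 0;            /* shared data: visible to all threads */

void *worker(void *arg)
{
    int local = 0;                 /* stack variable: private per thread */
    local++;
    shared_counter++;              /* shared heap/global: all threads see it */
    printf("thread %ld: local=%d shared=%d\n",
           (long)arg, local, shared_counter);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Both threads updated the same shared_counter, but each had its
       own `local` on its own stack. */
    return 0;
}
```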

Sprite Summary

  • file caching protocol
  • When thinking about research/system design:
    • Consider technology trends in system design

LegoOS

Monolithic Servers - CPU + Memory + Storage

Monolithic Server

[Drawing: a monolithic server, “all in one box”: CPUs and DRAM on a memory bus; NIC and GPU on a PCIe bus]
Properties:

  • Cache-coherent interconnect: a write by one CPU to DRAM is visible to all CPUs
  • High bandwidth and low latency within the same server
  • A network card (NIC) connects the server to the data center network

Hardware Disaggregation

Separate the components of a monolithic server and connect each of them to the data center network individually instead.

HW Disaggregation

[Drawing: CPU and DRAM components attached individually to the data center network]

  • organize HW into blades

Benefits

  • resource elasticity
    • scale memory or CPU resources up independently of each other
  • specialized hardware
    • Easier to adopt new specialty hardware
  • more fine-grained failure domains
    • improve reliability of system overall
  • improve resource utilization
    • users often overestimate their resource needs; disaggregation gives more flexibility
    • bin-packing problem - the scheduler has many different jobs and must find a server for each one
      • how to pack jobs onto servers as efficiently as possible…
      • imperfect packing strands resources, e.g., a server with free cores but too little free memory for any waiting job

Performance is a non-goal, because disaggregation adds overhead.

  • the benefits outweigh the overhead
  • as long as performance is close enough to Linux on a monolithic server

Challenges

  • within-server interconnect:
    • ~500 Gbps bandwidth, ~50 ns latency
  • data center network (then): 40 Gbps, ~6 µs latency (over 100× the in-server latency)
  • data center network (today): 100+ Gbps, a few µs latency

LegoOS

LegoOS needs to implement a significant part of Linux's interface so that existing applications can run.

Split Kernel

  • three component types: pComp (processing), mComp (memory), sComp (storage)
  • monitor - the manager running on each component

    LegoOS Split Kernel

    [Drawing: a pComp (CPUs + 4GB ExCache + NIC), an mComp (memory + NIC), and storage (disks + NIC), connected by the data center network]

    • needed some memory in the pComp (the ExCache) for caching to get reasonable performance (sketch below)

    In the mComp:

    • page tables
    • TLBs
    • file buffer cache (key idea: it takes up a lot of memory, so it belongs in the mComp)
    • virtual → physical translation

    In the pComp:

    • virtual addresses only
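A sketch of what a pComp memory access might look like with the ExCache in front of remote memory; all helper names are hypothetical:

```c
/* The pComp works only with virtual addresses; virtual-to-physical
   translation happens at the mComp, so an ExCache miss becomes a
   page fetch over the data center network. */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096UL

static void *excache_lookup(uint64_t vpage) { (void)vpage; return 0; }
static void *excache_insert(uint64_t vpage, const void *data)
{
    (void)vpage; (void)data; return 0;   /* stand-in for cache fill */
}
static void rpc_fetch_page(uint32_t mcomp, uint64_t vpage, void *buf)
{
    (void)mcomp; (void)vpage;
    memset(buf, 0, PAGE_SIZE);           /* stand-in for network fetch */
}

void *access_page(uint32_t mcomp_id, uint64_t vaddr)
{
    uint64_t vpage = vaddr & ~(PAGE_SIZE - 1);
    void *page = excache_lookup(vpage);
    if (page)
        return page;                     /* hit: local-DRAM speed */

    char buf[PAGE_SIZE];                 /* miss: cross-network RPC */
    rpc_fetch_page(mcomp_id, vpage, buf);
    return excache_insert(vpage, buf);
}
```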

2-level memory management

  • vRegions - coarse-grained virtual address ranges assigned to memory components
  • vma trees - fine-grained memory allocation within a vRegion
    • live entirely within one memory component (see the sketch below)
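A sketch of the two levels, with hypothetical structures (LegoOS uses trees for vmas; a linked list stands in here):

```c
/* Level 1: a coarse vRegion table maps big chunks of the virtual
   address space to owning memory components. Level 2: fine-grained
   vma lookup happens only inside the owning mComp. */
#include <stdint.h>
#include <stddef.h>

#define VREGION_SIZE (1UL << 30)         /* assume 1GB vRegions */
#define NUM_VREGIONS 256

struct vma { uint64_t start, end; struct vma *next; };

struct vregion {
    uint32_t    owner_mcomp;             /* which mComp manages it */
    struct vma *vmas;                    /* populated only on that mComp */
};

struct vregion vregion_table[NUM_VREGIONS];

/* Level 1: coarse routing; any component can answer this. */
uint32_t owning_mcomp(uint64_t vaddr)
{
    return vregion_table[(vaddr / VREGION_SIZE) % NUM_VREGIONS].owner_mcomp;
}

/* Level 2: fine-grained lookup, local to the owning mComp. */
struct vma *find_vma(uint64_t vaddr)
{
    struct vregion *vr = &vregion_table[(vaddr / VREGION_SIZE) % NUM_VREGIONS];
    for (struct vma *v = vr->vmas; v; v = v->next)
        if (vaddr >= v->start && vaddr < v->end)
            return v;
    return NULL;
}
```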

Implementation

  • emulated on commodity hardware by enabling and disabling different pieces of HW on each machine

Evaluation

  • small working sets - performance a little worse, within 2x of a standard Linux server
  • large working sets - better performance
    • accessing remote memory over the network beats swapping to local storage

Adoption

  • disaggregated GPUs
  • disaggregated storage - blob storage
  • disaggregated memory - still active area of research
  • machines have grown larger, which makes the bin-packing problem easier

LegoOS Summary

  • Splitkernel approach: a disaggregated OS
  • Exploring an extreme point in the design space
    • you learn a lot about the limits, and how to back off from them and create something valuable