Datacenters

  • Applications must scale
  • Network: at 100–400 Gbit/s, a new packet can arrive every ~120 ns
  • Storage: high-speed flash, NVM
  • Must minimize overheads
    • A system call costs more than 120 ns, so trapping into the kernel for every packet is too expensive (worked out below)
      • The kernel is too expensive → kernel bypass
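
A quick sanity check on that arrival rate (assuming full-size 1500-byte frames at 100 Gbit/s): 1500 B × 8 = 12,000 bits, and 12,000 bits ÷ 100 Gbit/s = 120 ns per packet, so the per-packet budget is roughly the cost of a single system call.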

        Low latency networking

        [Diagram: two architectures. Monolithic OS with POSIX API: apps call send()/recv(), crossing from user space into the kernel's network stack (OS), which drives the NIC. Kernel bypass (LibOS): the app links a user-level library that talks to the NIC and DRAM directly, skipping the kernel. Also noted: dataplane vs. control plane. The dataplane sends and receives packets (optimize this part!); the control plane handles management.]

Challenge: isolation between apps

If we bypass the kernel, how can we still provide isolation between apps? Modern NIC hardware provides:

  • virtual queues
    • An application can only interact with its own queue, which isolates network traffic; it is up to the NIC to dispatch the appropriate traffic to each queue
  • IOMMU (input-output memory management unit)
    • The NIC can access the specific memory regions shared with it, but nothing else (sketched below)
    • Provides memory isolation
    • Allows NICs to use virtual addresses
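
A minimal sketch of the per-app view this gives. The descriptor layout, flag names, and 2048-byte buffer stride are all assumptions for illustration, not any real NIC's format:

```c
#include <stdint.h>

#define DESC_DONE 0x1   /* hypothetical "packet has landed" flag */
#define RING_SIZE 256
#define BUF_STRIDE 2048 /* assumed fixed per-packet buffer size */

/* Hypothetical RX descriptor for one per-app virtual queue. The NIC
 * DMA-writes packets into buffers inside the region the app registered;
 * the IOMMU blocks DMA to any other memory. */
struct rx_desc {
    uint64_t buf_iova;  /* I/O virtual address, translated by the IOMMU */
    uint16_t len;       /* filled in by the NIC on completion */
    uint16_t flags;     /* DESC_DONE once the packet is in memory */
};

struct rx_queue {
    struct rx_desc ring[RING_SIZE];
    uint32_t head;      /* next descriptor the app will check */
};

/* Poll the queue: returns the packet length and a pointer into the
 * shared buffer pool, or -1 if nothing has arrived. Note there is no
 * system call anywhere on this path. */
int poll_rx(struct rx_queue *q, void *bufs, void **pkt)
{
    struct rx_desc *d = &q->ring[q->head % RING_SIZE];
    if (!(d->flags & DESC_DONE))
        return -1;                                   /* queue empty */
    *pkt = (char *)bufs + (q->head % RING_SIZE) * BUF_STRIDE;
    int len = d->len;
    d->flags = 0;                /* return the descriptor to the NIC */
    q->head++;
    return len;
}
```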

IX

4 Challenges

  • protection
    • isolate between apps
    • isolation for network stack
  • microsecond tail latency
    • large fanout
      • once a request enters the datacenter, a large number of servers may be needed to answer it
    • tail latency (99th percentile): the time within which 99% of requests complete (a small worked example follows this list)
  • high packet rates
    • often very small packets
  • resource efficiency
    • adjust amount of resources used by each application
    • previous kernel-bypass systems allocated all resources to a single application

IX’s goal is to address all four challenges at the same time.
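
To make the 99th-percentile metric concrete, a small sketch (the sample latencies are made up, and this is one simple definition of the percentile; real measurement pipelines typically use streaming estimators rather than sorting):

```c
#include <stdio.h>
#include <stdlib.h>

/* Comparator for qsort over latency samples (microseconds). */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* 99th-percentile latency: sort the samples and take the value
 * 99% of the way through (no interpolation). */
static double p99(double *samples, size_t n)
{
    qsort(samples, n, sizeof(double), cmp_double);
    size_t idx = (size_t)(0.99 * (n - 1));
    return samples[idx];
}

int main(void)
{
    /* Made-up samples: most requests are fast, one straggler. */
    double lat[] = { 10, 11, 9, 12, 10, 11, 10, 9, 250, 10 };
    printf("p99 = %.1f us\n", p99(lat, sizeof lat / sizeof lat[0]));
    return 0;
}
```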

IX overview

The previous libOS approach doesn’t provide the protection required.

IX uses hardware support for virtualization for protection: the dataplane runs at a privilege level separate from both the kernel and the applications.

Run to completion with adaptive batching

  • Run to completion - process a packet from start to finish
    • avoid intermediate layers of queuing
    • no need to worry that a packet will be descheduled or dropped at an intermediate queue
  • Adaptive batching
    • at low rates, the system does not wait around for more packets to arrive
    • at high rates, the system batches packets for better efficiency
    • the batch size varies dynamically with load (a sketch of the loop follows this list)
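
A minimal sketch of such a loop, assuming a hypothetical nic_poll()/process_to_completion() API (not IX's actual interfaces):

```c
#define MAX_BATCH 64   /* upper bound on the adaptive batch size */

/* Hypothetical helpers: poll the NIC ring for one packet, and run
 * the full network stack plus app handler on a packet. */
extern int  nic_poll(void **pkt);             /* 0 = nothing pending */
extern void process_to_completion(void *pkt); /* TCP/IP + app logic */

void dataplane_loop(void)
{
    for (;;) {
        void *batch[MAX_BATCH];
        int n = 0;

        /* Adaptive batching: take whatever is already queued, up to
         * MAX_BATCH, but never wait for more packets to arrive. */
        while (n < MAX_BATCH && nic_poll(&batch[n]))
            n++;

        /* Run to completion: each packet goes through the whole
         * stack before the next batch is pulled, so there are no
         * intermediate queues where it could be delayed or dropped. */
        for (int i = 0; i < n; i++)
            process_to_completion(batch[i]);
    }
}
```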

IX network flow

[Diagram: the IX dataplane loop. Packets arrive at a per-core NIC queue, pass through the TCP/IP stack, and are delivered to the app as event conditions via libIX; the app issues batched syscalls back, which drive TCP/IP timers and transmit through the NIC queue.]

Synchronization-free processing

  • partition resources across cores
  • separate per-core NIC queues
    • RSS (Receive-Side Scaling) - distributes packets across cores
      • Hashes packet metadata (source, destination) so that all packets of a connection are assigned to the same core (sketched below)
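
A toy version of that dispatch decision. Real NICs use a Toeplitz hash plus an indirection table; the mixing function here is only illustrative:

```c
#include <stdint.h>

/* Connection 4-tuple used as the RSS hash input. */
struct flow { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; };

static uint32_t flow_hash(const struct flow *f)
{
    uint32_t h = f->src_ip * 2654435761u;  /* multiplicative mixing */
    h ^= f->dst_ip * 2246822519u;
    h ^= ((uint32_t)f->src_port << 16) | f->dst_port;
    h ^= h >> 16;
    return h;
}

/* Same flow always hashes to the same core, so a connection's
 * packets never need cross-core synchronization. */
static unsigned pick_core(const struct flow *f, unsigned ncores)
{
    return flow_hash(f) % ncores;
}
```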

Zero Copy

  • POSIX sockets involve copies between kernel and user buffers
  • Zero copy: the NIC writes packets directly into app-accessible memory (contrasted below)
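
Side by side, the two receive paths might look like this (reusing the hypothetical poll_rx() queue from the isolation sketch above):

```c
#include <sys/socket.h>

/* From the virtual-queue sketch earlier (hypothetical API). */
struct rx_queue;
int poll_rx(struct rx_queue *q, void *bufs, void **pkt);

void compare_receive_paths(int sock, struct rx_queue *rxq, void *pool)
{
    /* POSIX path: trap into the kernel, which copies the packet
     * out of its own buffers into ours. */
    char buf[2048];
    ssize_t n = recv(sock, buf, sizeof buf, 0);
    (void)n;

    /* Zero-copy path: no trap and no copy; pkt points directly at
     * the DMA buffer the NIC wrote, inside the IOMMU-protected pool. */
    void *pkt;
    int len = poll_rx(rxq, pool, &pkt);
    (void)len;
}
```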

Evaluation

  • Looked primarily at throughput and latency
  • Compared to Linux, IX wins at both low load and high load
    • Avoids interrupt overheads
    • more efficient overall (zero copy and adaptive batching)

Summary of IX

  • Protected dataplane using virtualization
  • run-to-completion, batching, RSS, zero copy

Snap

Goals

  • easier deployment by moving the network stack out of the kernel
    • deploying a new kernel takes 1 to 2 months
      • so kernel-resident code rolls out slowly
  • easier development at user level
  • more room for optimization
    • enabled by moving the network stack to user level

      Snap

      [Diagram: Snap as a microkernel-style design. Apps in user space connect through queues to the Snap network module, which also runs in user space; the OS kernel and the NIC sit below.]
  • Note: the network module and the NIC are not directly linked together, in contrast to IX

| LibOS (Exokernel) | Snap (Microkernel) |
| --- | --- |
| Pro: tailored network stack for a specific use case | Pro: centralized management, control of scheduling |
| Pro: better latency; application and network stack are not separated | Pro: ease of deployment, modular |
| Pro: better isolation, since apps don’t share one network module | Con: potentially higher overhead |
|  | Pro: decouples scheduling of apps from the network stack |

Threading - Kernel-level threads

  • Applications don’t have to be modified much to use Snap; no “new” user-level threading is required

IPC - communication between apps and Snap

  • run apps and Snap on different cores so that few interrupts are involved (see the queue sketch below)
    • no need to context-switch back and forth on a single core
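
A minimal sketch of the kind of shared-memory queue this IPC could be built on: a single-producer/single-consumer ring. The details are assumptions for illustration, not Snap's actual implementation:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define QSIZE 1024   /* power of two, so index masking works */

/* SPSC ring in memory shared between an app core and a Snap core;
 * the fast path needs no syscalls and no interrupts. */
struct spsc_ring {
    _Atomic unsigned head;     /* advanced by the consumer */
    _Atomic unsigned tail;     /* advanced by the producer */
    void *slots[QSIZE];
};

static bool ring_push(struct spsc_ring *r, void *msg)
{
    unsigned t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (t - h == QSIZE)
        return false;                         /* ring is full */
    r->slots[t & (QSIZE - 1)] = msg;
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return true;
}

static bool ring_pop(struct spsc_ring *r, void **msg)
{
    unsigned h = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h == t)
        return false;                         /* ring is empty */
    *msg = r->slots[h & (QSIZE - 1)];
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return true;
}
```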

Separation of control and dataplane

CPU Scheduling

The dataplane is split into engines: packet-processing loops that Snap can schedule independently.

[Diagram: the three engine-scheduling modes: dedicating cores, spreading cores, and compacting cores, each shown with the apps they serve.]

  • Best efficiency?
    • compacting cores
  • Best tail latency?
    • dedicating cores
    • spreading also does well: each engine can get time to run as soon as work arrives

Transparent Upgrades

  • Migrate engines one at a time
  • Steps
    • Start up the new Snap
    • For each engine (sketched after this list):
      • write the old Snap’s state out to memory (brownout period)
      • the new Snap reads in that state (brownout period)
      • stop execution on the old Snap (blackout period)
      • transfer any state modified in the meantime (blackout period)
      • restart the engine on the new Snap (blackout period ~250 ms)
    • Kill the old Snap
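
As a sketch, the per-engine migration might look like the following. All helper functions and types are hypothetical stand-ins for Snap's actual mechanisms:

```c
struct engine;
struct snap;

/* Hypothetical helpers standing in for Snap's real machinery. */
extern void snapshot_state(struct snap *from, struct engine *e);
extern void load_state(struct snap *to, struct engine *e);
extern void stop_engine(struct snap *s, struct engine *e);
extern void transfer_dirty_state(struct snap *from, struct snap *to,
                                 struct engine *e);
extern void start_engine(struct snap *s, struct engine *e);

/* Migrate one engine; repeat per engine, then kill the old Snap. */
void migrate_engine(struct snap *old_snap, struct snap *new_snap,
                    struct engine *e)
{
    /* Brownout: the engine keeps serving while its state is copied. */
    snapshot_state(old_snap, e);  /* old Snap writes state to memory */
    load_state(new_snap, e);      /* new Snap reads that state in    */

    /* Blackout (~250 ms): the engine briefly stops serving. */
    stop_engine(old_snap, e);
    transfer_dirty_state(old_snap, new_snap, e); /* catch up on changes */
    start_engine(new_snap, e);
}
```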

Snap summary

  • microkernel-based design - easier to upgrade and deploy
  • scheduling modes for engines
  • used widely within Google