The Linux Block Layer
Built for Fast Storage
Light up your cloud!
Sagi Grimberg KernelTLV
27/6/18
1
First off, happy 1st birthday, Roni!
2
Who am I?
• Co-founder and Principal Architect @ Lightbits Labs
• LightBits Labs is a stealth-mode startup pushing the software and hardware technology
boundaries in cloud-scale storage.
• We are looking for excellent people who enjoy a challenge for a variety of positions, including
both software and hardware. More information on our website at
http://www.lightbitslabs.com/#join or talk with me after the talk.
• Active contributor to Linux I/O and RDMA stack
• I am the maintainer of the iSCSI over RDMA (iSER) drivers
• I co-maintain the Linux NVMe subsystem
• Used to work for Mellanox on Storage, Networking, RDMA and
pretty much everything in between…
3
Where were we 10 years ago?
• Only rotating storage devices existed
• Devices were limited to hundreds of IOPS
• Device access latency was in the millisecond ballpark
• The Linux block layer was sufficient to handle these devices
• High performance applications found clever ways to avoid storage
access as much as possible
4
What happened? (hint: HW)
• Flash SSDs started appearing in the datacenter
• IOPS went from hundreds to hundreds of thousands to millions
• Latency went from milliseconds to microseconds
• Fast interfaces evolved: PCIe (NVMe)
• Processor core counts increased a lot!
• And NUMA...
5
I/O Stack
6
What are the issues?
• Existing I/O stack had a lot of data sharing
• Between different applications (running on different cores)
• Between submission and completion
• Locking for synchronization
• Zero NUMA awareness
• All stack heuristics and optimizations were centered around slow storage
• The result: very bad (even negative) scaling, lots of wasted CPU cycles, and much higher latencies
7
I/O Stack - Little deeper
8
I/O Stack - Little deeper
9
Hmmm...
- Requests are serialized
- Placed for staging
- Retrieved by the drivers
⇒ Lots of shared state!
I/O Stack - Performance
10
I/O Stack - Performance
11
Workaround: Bypass the request layer
12
Problems with bypass:
● Give up flow control
● Give up error handling
● Give up statistics
● Give up tagging and indexing
● Give up I/O deadlines
● Give up I/O scheduling
● Crazy code duplication -
mistakes are copied because
people get stuff wrong...
Most importantly, this is not the
Linux design approach!
Enter Block Multiqueue
• The old stack’s problem is not just “one serialization point” that could be fixed
• The stack needed a complete rewrite from the ground up
• What do we do:
• Look at the networking stack, which solved the exact same issue 10+ years ago
• But build it from scratch for storage devices
13
Block Multiqueue - Goals
• Linear Scaling with CPU cores
• Split shared state between applications and
submission/completion
• Careful locality awareness: Cachelines, NUMA
• Pre-allocate resources as much as possible
• Provide full helper functionality - ease of implementation
• Support all existing HW
• Become THE queueing mode, not a “third one”
14
Block Multiqueue - Architecture
15
Block Multiqueue - Features
• Efficient tagging
• Locality of submissions and completions
• Extreme care taken to minimize cache pollution
• Smart error handling - minimum intrusion into the hot path
• Smart CPU <-> queue mappings
• Clean API
• Easy conversion (usually just cleaning up old cruft) - see the driver skeleton below
16
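To make the “clean API / easy conversion” point concrete, here is a minimal, hedged sketch of the driver-facing side: a converted driver mostly fills in a blk_mq_ops table whose queue_rq callback hands requests to the hardware. The mydrv_* names are hypothetical and exact signatures vary between kernel versions.

/* Hypothetical, minimal blk-mq driver skeleton; mydrv_* names are made up
 * for illustration and signatures vary by kernel version. */
#include <linux/blk-mq.h>

static blk_status_t mydrv_queue_rq(struct blk_mq_hw_ctx *hctx,
                                   const struct blk_mq_queue_data *bd)
{
        struct request *rq = bd->rq;

        blk_mq_start_request(rq);
        /* hand the request to the hardware queue backing hctx here */
        return BLK_STS_OK;
}

static const struct blk_mq_ops mydrv_mq_ops = {
        .queue_rq = mydrv_queue_rq,     /* per-request submission path */
};

The rest of a conversion is mostly wiring this table into a blk_mq_tag_set (see the pre-allocations slide) and deleting the driver’s home-grown queueing code.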
Block Multiqueue - I/O Flow
17
Block Multiqueue - Completions
18
● Applications are usually “CPU-sticky”
● If the I/O completion arrives on the “correct” CPU, complete it there
● Else, IPI the “correct” CPU (see the sketch below)
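From the driver’s point of view this steering comes for free; a hedged sketch: the interrupt handler just calls blk_mq_complete_request(), and blk-mq decides whether to complete locally or IPI the submitting CPU. The handler and the mydrv_fetch_completed_request() helper are hypothetical stand-ins for however the hardware reports finished commands.

#include <linux/blk-mq.h>
#include <linux/interrupt.h>

static struct request *mydrv_fetch_completed_request(void *data); /* hypothetical */

static irqreturn_t mydrv_irq(int irq, void *data)
{
        struct request *rq = mydrv_fetch_completed_request(data);

        if (!rq)
                return IRQ_NONE;

        /* blk-mq completes here if we are on the submitting CPU,
         * otherwise it bounces the completion there via IPI */
        blk_mq_complete_request(rq);
        return IRQ_HANDLED;
}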
Block Multiqueue - Tagging
19
• Almost all modern HW supports queueing
• Tags are used to identify individual I/Os in the presence of out-of-order completions
• Tags are limited by the capabilities of the HW, so the driver needs to apply flow control
Block Multiqueue - Tagging
20
• Per-CPU, cacheline-aware scalable bitmaps (illustrated below)
• Efficient at near-exhaustion
• Rolling wake-ups
• Maps 1:1 with HW usage - no driver-specific tagging
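This is not the kernel’s sbitmap code, just a small userspace illustration of the idea: spread the tag bitmap over cacheline-sized words and start each CPU’s search at a different word (the hint), so concurrent allocators rarely contend on the same cacheline.

#include <stdatomic.h>
#include <stdint.h>

#define TAG_WORDS     8
#define TAGS_PER_WORD 64

struct scalable_bitmap {
        /* each word padded out to a full cacheline to avoid false sharing */
        struct { _Atomic uint64_t bits; char pad[56]; } word[TAG_WORDS];
};

/* Allocate a free tag, starting at a per-CPU hint so different CPUs tend
 * to touch different cachelines. Returns -1 when all tags are in use and
 * the caller must flow-control (wait for completions) before retrying. */
static int tag_alloc(struct scalable_bitmap *sb, unsigned int cpu_hint)
{
        for (unsigned int i = 0; i < TAG_WORDS; i++) {
                unsigned int w = (cpu_hint + i) % TAG_WORDS;
                uint64_t old = atomic_load(&sb->word[w].bits);

                while (old != UINT64_MAX) {             /* a zero bit exists */
                        int bit = __builtin_ctzll(~old);
                        uint64_t val = old | (1ULL << bit);

                        if (atomic_compare_exchange_weak(&sb->word[w].bits,
                                                         &old, val))
                                return w * TAGS_PER_WORD + bit;
                        /* CAS failed: old now holds the fresh value, retry */
                }
        }
        return -1;
}

static void tag_free(struct scalable_bitmap *sb, int tag)
{
        atomic_fetch_and(&sb->word[tag / TAGS_PER_WORD].bits,
                         ~(1ULL << (tag % TAGS_PER_WORD)));
}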
Block Multiqueue - Pre-allocations
21
• Eliminate hot path allocations
• Allocate all request memory at initialization time
• Tag and request allocations are combined (no two step allocation)
• No driver per-request allocation
• Driver context and SG lists are placed in “slack space” behind the request
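A hedged example of how this looks to a driver: declare cmd_size in the blk_mq_tag_set and blk-mq lays the driver’s per-request structure out right behind each struct request, reachable via blk_mq_rq_to_pdu(); blk_mq_alloc_tag_set() then allocates everything once at init time. Everything except the blk-mq symbols is hypothetical.

#include <linux/blk-mq.h>
#include <linux/numa.h>

struct mydrv_cmd {                      /* hypothetical per-request context */
        int hw_slot;
        /* SG list, DMA addresses, etc. would live here as well */
};

static struct blk_mq_tag_set mydrv_tag_set = {
        .ops          = &mydrv_mq_ops,                 /* from the skeleton above */
        .nr_hw_queues = 4,
        .queue_depth  = 128,
        .cmd_size     = sizeof(struct mydrv_cmd),      /* per-request slack space */
        .numa_node    = NUMA_NO_NODE,
};

/* wherever a request is in hand, the driver context sits just behind it */
static struct mydrv_cmd *mydrv_cmd(struct request *rq)
{
        return blk_mq_rq_to_pdu(rq);
}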
Block Multiqueue - Performance
22
Test-Case:
- null_blk driver
- fio
- 4K sync random read
- Dual socket system
Block Multiqueue - perf profiling
23
• Locking time is drastically reduced
• FIO reports much less “system time”
• Average and tail latencies are much lower and consistent
Next on the Agenda: SCSI, NVMe and friends
• NVMe started as a bypass driver - converted to blk-mq
• mtip32xx (Micron)
• virtio_blk, xen
• rbd (ceph)
• loop
• more...
• SCSI midlayer was a bigger project..
24
SCSI multiqueue
• Needed the concept of “shared tag sets”
• Tags are now a property of the HBA and not the storage device
• Needed a chunking of scatter-gather lists
• SCSI HBAs support huge SG lists, too much to allocate up front
• Needed “head of queue” insertion
• For SCSI’s complex error handling
• Removed the “big scsi host_busy lock”
• Reduced the huge contention on the scsi target “busy” atomic
• Needed partial completion support
• Needed BIDI support (yuck..)
• Hardened the stack a lot, thanks to lots of user bug reports.
25
Block multiqueue - MSI(X) based queue mapping
26
● Motivation: Eliminate the IPI case
● Expose MSI(X) vector affinity
mappings to the block layer
● Map the HW context mappings via
the underlying device IRQ mappings
● Offer MSI(X) allocation and correct
affinity spreading via the PCI
subsystem
● Take advantage of it in PCI-based drivers (nvme, rdma, fc, hpsa, etc.) - sketched below
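A hedged sketch of the PCI-driver side (roughly what nvme-pci does): ask the PCI core to spread MSI-X vector affinity across CPUs, then let blk-mq derive its CPU-to-hw-queue map from those IRQ affinities. The driver_data stash and mydrv_* names are assumptions, and the blk_mq_pci_map_queues() signature differs between kernel versions (older ones have no queue-offset argument).

#include <linux/blk-mq.h>
#include <linux/blk-mq-pci.h>
#include <linux/pci.h>

/* one MSI-X vector per hardware queue, affinity spread by the PCI core */
static int mydrv_setup_irqs(struct pci_dev *pdev, unsigned int nr_queues)
{
        return pci_alloc_irq_vectors(pdev, 1, nr_queues,
                                     PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
}

/* blk_mq_ops.map_queues callback: build the CPU -> hw queue map from the
 * device's IRQ affinity rather than a generic topology guess */
static int mydrv_map_queues(struct blk_mq_tag_set *set)
{
        struct pci_dev *pdev = set->driver_data;       /* hypothetical stash */

        return blk_mq_pci_map_queues(set, pdev, 0);
}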
But wait, what about I/O schedulers?
• What we didn’t mention was that block multiqueue lacked a
proper I/O scheduler for approximately 3 years!
• A fundamental part of the I/O stack functionality is scheduling
• To optimize I/O sequentiality - the elevator algorithm
• Prevent write vs. read starvation (e.g. the deadline scheduler)
• Fairness enforcement (e.g. CFQ)
• One can argue that I/O scheduling was designed for rotating media
• Optimized for reducing actuator seek time
NOT NECESSARILY TRUE - Flash can benefit from scheduling!
27
Start from ground up: WriteBack Throttling
• Linux has sucked at buffered I/O since the dawn of time
• Writes are naturally buffered and committed to disk in the
background
• Needs to have little or no impact on foreground activity
• What was needed:
• Plumb I/O stats for submitted reads and writes
• Track average latency at window granularity, along with what is currently enqueued
• Scale the queue depth accordingly (see the sketch below)
• Prefer reads over non-direct-I/O (buffered) writes
28
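Not the actual blk-wbt code, just a toy model of the feedback loop described above: once per window, compare the measured read latency against a target and scale the depth allowed for background writeback up or down.

/* Toy model of writeback throttling; structure and numbers are made up. */
struct wb_throttle {
        unsigned int max_depth;         /* full device queue depth */
        unsigned int wb_depth;          /* current allowance for writeback */
        unsigned int target_usec;       /* read latency target */
};

static void wb_window_elapsed(struct wb_throttle *wt,
                              unsigned int mean_read_usec)
{
        if (mean_read_usec > wt->target_usec) {
                /* reads are suffering: halve the writeback allowance */
                wt->wb_depth = wt->wb_depth > 1 ? wt->wb_depth / 2 : 1;
        } else if (wt->wb_depth < wt->max_depth) {
                /* reads look healthy: slowly give writeback more depth */
                wt->wb_depth++;
        }
}

/* On the submission side, a buffered write is only issued while fewer than
 * wb_depth writeback requests are in flight; otherwise the writer sleeps. */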
WriteBack Throttling - Performance
29
Before... After...
Now we are ready for I/O schedulers - MQ-Deadline
• Added request interception hooks so schedulers can be built on top
• First MQ conversion was the deadline scheduler
• Pretty easy and straightforward
• Writes go into a FIFO and are held back until their deadline hits (sketched below)
• The read FIFO is essentially pass-through
• All per-CPU context - a tradeoff?
• Remember: an I/O scheduler can hurt synthetic workloads, but what matters is the impact on real-life workloads.
30
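A simplified model of that dispatch decision (the real mq-deadline also sorts by sector, merges and batches): serve reads unless the write at the head of the write FIFO has blown its deadline.

#include <stddef.h>

struct dl_request {
        unsigned long deadline;         /* time by which it must be issued */
        struct dl_request *next;
};

struct dl_sched {
        struct dl_request *read_fifo;   /* reads: essentially pass-through */
        struct dl_request *write_fifo;  /* writes: held back until deadline */
};

static struct dl_request *fifo_pop(struct dl_request **fifo)
{
        struct dl_request *rq = *fifo;

        if (rq)
                *fifo = rq->next;
        return rq;
}

/* pick the next request to hand to the hardware queue */
static struct dl_request *dl_dispatch(struct dl_sched *dl, unsigned long now)
{
        /* an expired write must not starve behind a stream of reads */
        if (dl->write_fifo && now >= dl->write_fifo->deadline)
                return fifo_pop(&dl->write_fifo);

        if (dl->read_fifo)                      /* otherwise prefer reads */
                return fifo_pop(&dl->read_fifo);

        return fifo_pop(&dl->write_fifo);       /* nothing else: drain writes */
}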
Next: Kyber I/O Scheduler
• Targeted for fast multi-queue devices
• Lightweight
• Prefers reads over writes
• All I/Os are split into two queues (reads and writes)
• Reads are typically preferred
• Writes are throttled but not to a point of starvation
• The key is to keep submission queues short to guarantee latency
targets
• Kyber tracks I/O latency stats and adjusts queue sizes accordingly
• Aware of flash background operations.
31
Next: BFQ I/O Scheduler
• Budget fair queueing scheduler
• A lot heavier
• Maintains a per-process I/O budget
• Maintains a bunch of per-process heuristics
• Yields the “best” I/O to queue at any given time
• A better fit for slower storage, especially rotating media and cheap &
deep SSDs.
32
But wait #2: What about ultra-low latency devices?
• New media is emerging with ultra-low latency (1-2 µs)
• 3D XPoint
• Z-NAND
• Even with block MQ, the Linux I/O stack still has issues providing these
latencies
• It starts with IRQ (interrupt handling)
• If I/O is so fast, we might want to poll for completion and avoid paying the cost of an MSI(X) interrupt
33
Interrupt based I/O completion model
34
Polling based I/O completion model
35
IRQ vs. Polling
36
• Polling can remove the extra context switch from the completion
handling
So we should support polling!
37
• Add a selective polling syscall interface:
• Use preadv2()/pwritev2() with the RWF_HIPRI flag (example below)
• Saves roughly 25% of added latency
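A minimal userspace example of the polled path, assuming a device at /dev/nvme0n1 (adjust to taste), a glibc/kernel new enough for preadv2() and RWF_HIPRI, and polling enabled on the device (the io_poll queue attribute); O_DIRECT is required for polling to apply.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
        void *buf;
        struct iovec iov;
        int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);

        if (fd < 0 || posix_memalign(&buf, 4096, 4096))
                return 1;

        iov.iov_base = buf;
        iov.iov_len  = 4096;

        /* RWF_HIPRI asks the block layer to poll for this I/O's completion
         * instead of sleeping until the MSI(X) interrupt arrives */
        ssize_t ret = preadv2(fd, &iov, 1, 0, RWF_HIPRI);

        printf("read returned %zd\n", ret);
        free(buf);
        close(fd);
        return ret == 4096 ? 0 : 1;
}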
But what about CPU% - can we go hybrid?
38
• Yes!
• We have the whole statistics framework already in place - let’s use it for hybrid polling!
• Sleep, then wake the poller up after ½ of the mean completion latency (see the helper below)
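As a sketch, the mode is selected per device through the io_poll_delay queue attribute: -1 selects classic busy polling, 0 the adaptive hybrid mode (sleep for roughly half the mean completion time), and a positive value a fixed sleep in microseconds. The device name below is an assumption.

#include <stdio.h>

/* enable adaptive hybrid polling on a block device, e.g. "nvme0n1" */
static int enable_hybrid_polling(const char *dev)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/io_poll_delay", dev);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "0\n");              /* 0 = hybrid; -1 = classic; >0 = usec */
        return fclose(f);
}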
Hybrid polling - Performance
39
Hybrid polling - Adjust to I/O size
40
• Block layer sees I/Os of different sizes.
• Some are 4k, some are 256k, and some are 1-2MB
• We need to account for that when tracking stats for polling decisions
• Simple solution: bucketize the stats (see below)...
• 0-4k
• 4-16k
• 16k-64k
• >64k
• Now Hybrid polling has good QoS!
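A trivial illustration of the bucketing: map each request size to one of the four buckets above, so the hybrid-poll sleep time is derived only from the latency of similarly sized I/Os.

/* bucket boundaries as listed on the slide */
enum io_bucket { BUCKET_0_4K, BUCKET_4_16K, BUCKET_16_64K, BUCKET_64K_PLUS };

static enum io_bucket io_size_bucket(unsigned int bytes)
{
        if (bytes <= 4 * 1024)
                return BUCKET_0_4K;
        if (bytes <= 16 * 1024)
                return BUCKET_4_16K;
        if (bytes <= 64 * 1024)
                return BUCKET_16_64K;
        return BUCKET_64K_PLUS;
}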
To Conclude
41
• Lots of interesting stuff happening in Linux
• Linux belongs to everyone - get involved!
• We always welcome patches and bug reports :)
42
LIGHT UP YOUR CLOUD!
