The zswap compressed swap cache
Swapping is one of the biggest threats to performance. The latency gap between RAM and swap, even on a fast SSD, can be four orders of magnitude. The throughput gap is two orders of magnitude. In addition to the speed gap, storage on which a swap area resides is becoming more shared and virtualized, which can cause additional I/O latency and nondeterministic workload performance. The zswap subsystem exists to mitigate these undesirable effects of swapping through a reduction in I/O activity.
Zswap is a lightweight, write-behind compressed cache for swap pages. It takes pages that are in the process of being swapped out and attempts to compress them into a dynamically allocated RAM-based memory pool. If this process is successful, the writeback to the swap device is deferred and, in many cases, avoided completely. This results in a significant I/O reduction and performance gains for systems that are swapping.
Zswap basics
Zswap intercepts pages in the middle of swap writeback and caches them using the frontswap API. Frontswap has been in the kernel since v3.5 and has been covered by LWN before. It allows a backend driver, like zswap, to intercept both swap page writeback and the page faults for swapped out pages. Zswap also makes use of the "zsmalloc" allocator (discussed below) for compressed page storage.
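As a point of reference, a frontswap backend hooks in by filling a frontswap_ops structure with its callbacks and registering it. The sketch below shows that pattern using zswap-style handler names; the exact callback and frontswap_register_ops() signatures have varied across kernel versions, so treat this as illustrative rather than definitive:

    #include <linux/frontswap.h>
    #include <linux/swap.h>

    /* handlers implemented elsewhere; prototypes follow the v3.5-era
       frontswap callback signatures */
    int zswap_frontswap_store(unsigned type, pgoff_t offset, struct page *page);
    int zswap_frontswap_load(unsigned type, pgoff_t offset, struct page *page);
    void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset);
    void zswap_frontswap_invalidate_area(unsigned type);
    void zswap_frontswap_init(unsigned type);

    static struct frontswap_ops zswap_frontswap_ops = {
            .store = zswap_frontswap_store,                 /* intercept swap-out */
            .load = zswap_frontswap_load,                   /* intercept swap-in faults */
            .invalidate_page = zswap_frontswap_invalidate_page,
            .invalidate_area = zswap_frontswap_invalidate_area,
            .init = zswap_frontswap_init,                   /* per-swap-device setup */
    };

    static int __init zswap_init(void)
    {
            /* ... allocate per-CPU buffers, compressor, entry cache ... */
            frontswap_register_ops(&zswap_frontswap_ops);
            return 0;
    }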
Zswap seeks to be as simple as possible in its structure and operation. There are two primary data structures. The first is the zswap_entry structure, which contains information about a single compressed page stored in zswap:
    struct zswap_entry {
            struct rb_node rbnode;
            int refcount;
            pgoff_t offset;
            unsigned long handle;   /* zsmalloc allocation */
            unsigned int length;
            /* ... */
    };
The second is the zswap_tree structure which contains a red-black tree of zswap entries indexed by the offset value:
    struct zswap_tree {
            struct rb_root rbroot;
            struct list_head lru;
            spinlock_t lock;
            struct zs_pool *pool;
    };
At the highest level, there is an array of zswap_tree structures indexed by the swap device number.
There is a single lock per zswap_tree to protect the tree structure during lookups and modifications. The higher-level swap code provides certain protections that simplify the zswap implementation by not having to design for concurrent store, load, and invalidate operations on the same swap entry. While this single-lock design might seem like a likely source of contention, in practice the swap path is largely bottlenecked by other locks at higher levels, such as the anon_vma mutex or swap_lock; by comparison, the zswap_tree lock is very lightly contended. Writeback support, covered in the next section, also led to this single-lock design.
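A lookup in a device's tree is an ordinary red-black tree walk keyed on the swap offset, performed with the tree lock held. A minimal sketch (not a verbatim copy of the zswap source) looks like this:

    #include <linux/rbtree.h>

    /* find the zswap_entry for a given swap offset; caller holds tree->lock */
    static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
    {
            struct rb_node *node = root->rb_node;
            struct zswap_entry *entry;

            while (node) {
                    entry = rb_entry(node, struct zswap_entry, rbnode);
                    if (entry->offset > offset)
                            node = node->rb_left;
                    else if (entry->offset < offset)
                            node = node->rb_right;
                    else
                            return entry;
            }
            return NULL;
    }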
For page compression, zswap uses compressor modules provided by the kernel's cryptographic API. This allows users to select the compressor dynamically at boot time, and gives easy access to hardware compression accelerators or any other future compression engines.
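As a rough illustration, the synchronous side of that interface looks like the sketch below. The names are simplified assumptions: zswap actually keeps per-CPU transforms, and the algorithm string ("lzo" by default) is whatever was selected at boot:

    #include <linux/crypto.h>

    static struct crypto_comp *zswap_comp;      /* simplification: zswap uses per-CPU tfms */

    static int example_comp_init(const char *alg)       /* e.g. "lzo" */
    {
            zswap_comp = crypto_alloc_comp(alg, 0, 0);
            return IS_ERR(zswap_comp) ? PTR_ERR(zswap_comp) : 0;
    }

    /* compress one page into dst (assumed to be at least PAGE_SIZE bytes);
       on success, *dlen holds the compressed length */
    static int example_compress_page(const u8 *src, u8 *dst, unsigned int *dlen)
    {
            *dlen = PAGE_SIZE;
            return crypto_comp_compress(zswap_comp, src, PAGE_SIZE, dst, dlen);
    }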
A zswap store operation occurs when a page is selected for swapping by the reclaim system and frontswap intercepts the page in swap_writepage(). The operation begins by compressing the page into a per-CPU temporary buffer. Compressing into the temporary buffer is required because the compressed size, and thus the size of the permanent allocation needed to hold it, isn't known until the compression is actually done. Once the compressed size is known, an object is allocated and the temporary buffer is copied into the object. Lastly, a zswap_entry structure is allocated, populated, and inserted into the tree for that swap device.
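Putting those pieces together, a condensed sketch of the store path follows, reusing the zswap_comp transform from the previous sketch. Error handling is trimmed, names such as zswap_dstmem and zswap_rb_insert() are illustrative, and the zsmalloc function signatures shown here have changed over kernel versions:

    #include <linux/slab.h>
    #include <linux/percpu.h>
    #include <linux/highmem.h>
    #include <linux/zsmalloc.h>

    static DEFINE_PER_CPU(u8 *, zswap_dstmem);  /* page-sized buffers, set up at init */

    static int zswap_store_sketch(struct zswap_tree *tree, pgoff_t offset,
                                  struct page *page)
    {
            struct zswap_entry *entry;
            unsigned int dlen = PAGE_SIZE;
            unsigned long handle;
            u8 *src, *buf, *obj;
            int ret;

            entry = kzalloc(sizeof(*entry), GFP_KERNEL);
            if (!entry)
                    return -ENOMEM;

            /* compress into the per-CPU temporary buffer */
            buf = get_cpu_var(zswap_dstmem);
            src = kmap_atomic(page);
            ret = crypto_comp_compress(zswap_comp, src, PAGE_SIZE, buf, &dlen);
            kunmap_atomic(src);
            if (ret)
                    goto out;

            /* the compressed size is known only now; allocate the real object */
            handle = zs_malloc(tree->pool, dlen);
            if (!handle) {
                    ret = -ENOMEM;          /* caller falls back to real swap-out */
                    goto out;
            }
            obj = zs_map_object(tree->pool, handle, ZS_MM_WO);
            memcpy(obj, buf, dlen);
            zs_unmap_object(tree->pool, handle);

            /* record the entry and insert it into this device's tree */
            entry->offset = offset;
            entry->handle = handle;
            entry->length = dlen;
            entry->refcount = 1;
            spin_lock(&tree->lock);
            zswap_rb_insert(&tree->rbroot, entry);      /* illustrative insert helper */
            spin_unlock(&tree->lock);
    out:
            put_cpu_var(zswap_dstmem);
            if (ret)
                    kfree(entry);
            return ret;
    }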
If the store fails for any reason, most likely because of an object allocation failure, zswap returns an error which is propagated up through frontswap into swap_writepage(). The page is then swapped out to the swap device as usual.
A load operation occurs when a program page faults on a page table entry (PTE) that contains a swap entry and is intercepted by frontswap in swap_readpage(). The swap entry contains the device and offset information needed to look up the zswap entry in the appropriate tree. Once the entry is located, the data is decompressed directly into the page allocated by the page fault code. The entry is not removed from the tree during a load; it remains up-to-date until the entry is invalidated.
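A correspondingly condensed sketch of the load path, again with illustrative names and trimmed error handling (the real code also takes a reference on the entry so that writeback cannot free it mid-decompression, as described later):

    static int zswap_load_sketch(struct zswap_tree *tree, pgoff_t offset,
                                 struct page *page)
    {
            struct zswap_entry *entry;
            unsigned int dlen = PAGE_SIZE;
            u8 *src, *dst;
            int ret;

            spin_lock(&tree->lock);
            entry = zswap_rb_search(&tree->rbroot, offset);
            spin_unlock(&tree->lock);
            if (!entry)
                    return -1;      /* not in zswap; read from the swap device */

            /* decompress straight into the page allocated by the fault path */
            src = zs_map_object(tree->pool, entry->handle, ZS_MM_RO);
            dst = kmap_atomic(page);
            ret = crypto_comp_decompress(zswap_comp, src, entry->length, dst, &dlen);
            kunmap_atomic(dst);
            zs_unmap_object(tree->pool, entry->handle);
            return ret;
    }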
An invalidate operation occurs when the reference count for a particular swap offset becomes zero in swap_entry_free(). In this case, the zswap entry is removed from the appropriate tree, and the entry and the zsmalloc allocation that it references are freed.
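In the absence of writeback, the invalidate path is simply the reverse of a store; once writeback is added (below), the direct free becomes a reference-count drop. A minimal sketch:

    static void zswap_invalidate_sketch(struct zswap_tree *tree, pgoff_t offset)
    {
            struct zswap_entry *entry;

            spin_lock(&tree->lock);
            entry = zswap_rb_search(&tree->rbroot, offset);
            if (entry)
                    rb_erase(&entry->rbnode, &tree->rbroot);
            spin_unlock(&tree->lock);

            if (entry) {
                    zs_free(tree->pool, entry->handle);     /* free the compressed object */
                    kfree(entry);
            }
    }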
To remain preemption-friendly, zswap never disables interrupts; preemption is disabled only during compression, while accessing the per-CPU temporary buffer page, and during decompression, while accessing a mapped zsmalloc allocation.
Zswap writeback
To operate optimally as a cache, zswap should hold the most recently used pages. With frontswap, there is, unfortunately, a real potential for an inverse least recently used (LRU) condition in which the cache fills with older pages, and newer pages are forced out to the slower swap device. To address this, zswap is designed with "resumed" writeback in mind.
As background, the process for swapping pages follows these steps:
- First, an anonymous memory page is selected for swapping and a slot is
allocated in the swap device.
- Next, the page is unmapped from all processes using that page. The
PTEs referencing that page are filled with the swap entry that consists of
the swap type and offset where the page can be found.
- Lastly, the page is scheduled for writeback to the swap device.
When frontswap_store() in swap_writepage() is successful, the writeback step is not performed. However, the slot in the swap device has been allocated and is still reserved for the page even though the page only resides in the frontswap backend. Resumed writeback in zswap forces pages out of the compressed cache into their previously reserved swap slots in the swap device. Currently, the policy is basic and forces pages out of the cache in two cases: (1) when the cache has reached its maximum size, according to the max_pool_percent sysfs tunable, or (2) when zswap is unable to allocate new space for the compressed pool.
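The first of those triggers reduces to a simple check against the tunable; a minimal sketch (the variable names and the 20 percent default shown here are illustrative):

    static unsigned int zswap_max_pool_percent = 20;   /* the max_pool_percent tunable */
    static unsigned long zswap_pool_pages;             /* pages currently held by the pool */

    /* writeback is triggered once the compressed pool exceeds its share of RAM */
    static bool zswap_is_full(void)
    {
            return zswap_pool_pages > totalram_pages * zswap_max_pool_percent / 100;
    }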
During resumed writeback, zswap decompresses the page, adds it back to the swap cache, and schedules writeback into the swap slot that was previously reserved. By splitting swap_writepage() into two functions after frontswap_store() is called, zswap can resume writeback from the point where the initial writeback terminated in frontswap. The new function is called __swap_writepage().
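The resulting swap_writepage() is short enough to sketch in full; this mirrors the structure of the patched function, though details such as the bio completion callback passed to __swap_writepage() differ between kernel versions:

    int swap_writepage(struct page *page, struct writeback_control *wbc)
    {
            if (try_to_free_swap(page)) {
                    unlock_page(page);
                    return 0;
            }
            if (frontswap_store(page) == 0) {
                    /* zswap took the page: report it written without block I/O */
                    set_page_writeback(page);
                    unlock_page(page);
                    end_page_writeback(page);
                    return 0;
            }
            /* the real block I/O path; resumed writeback calls this directly */
            return __swap_writepage(page, wbc, end_swap_bio_write);
    }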
Freeing zswap entries becomes more complex with writeback. Without writeback, pages would only be freed during invalidate operations (zswap_frontswap_invalidate_page()). With writeback, pages can also be freed in zswap_writeback_pages(). These invalidate and writeback functions can run concurrently for the same zswap entry. To guarantee that entries are not freed while being accessed by another thread, a reference count field (called refcount) is used in the zswap_entry structure.
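A minimal sketch of that protocol: both paths drop their reference under the tree lock, and whichever thread drops the last reference frees the entry and its compressed object (helper names are illustrative):

    /* caller holds tree->lock for both helpers */
    static void zswap_entry_get(struct zswap_entry *entry)
    {
            entry->refcount++;
    }

    static void zswap_entry_put(struct zswap_tree *tree, struct zswap_entry *entry)
    {
            /* the entry has already been removed from the tree by whichever
               path (invalidate or writeback) got there first */
            if (--entry->refcount <= 0) {
                    zs_free(tree->pool, entry->handle);
                    kfree(entry);
            }
    }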
Zsmalloc rationale
One really can't talk about zswap without mentioning zsmalloc, the allocator it uses for compressed page storage, which currently resides in the Linux Staging tree.
Zsmalloc is a slab-based allocator used by zswap; it provides more reliable allocation of large objects in a memory constrained environment than does the kernel slab allocator. Zsmalloc has already been discussed on LWN, so this section will focus more on the need for zsmalloc in the presence of the kernel slab allocator.
The objects that zswap stores are compressed pages. The default compressor is lzo1x-1, which is known for speed, but not so much for high compression. As a result, zswap objects can frequently be large relative to typical slab objects (>1/8th PAGE_SIZE). This is a problem for the kernel slab allocator under memory pressure.
The kernel slab allocator requires high-order page allocations to back slabs for large objects. For example, on a system with a 4K page size, the kmalloc-512 cache has slabs that are backed by two contiguous pages. kmalloc-2048 requires eight contiguous pages per slab. These high-order page allocations are very likely to fail when the system is under memory pressure.
Zsmalloc addresses this problem by allowing the pages backing a slab (or “size class” in zsmalloc terms) to be both non-contiguous and variable in number. They are variable in number because zsmalloc allows a slab to be composed of less than the target number of backing pages. A set of non-contiguous pages backing a slab are stitched together using fields of struct page to create a “zspage”. This allows zsmalloc to service large object allocations, up to PAGE_SIZE, without requiring high-order page allocations.
Additionally, the kernel slab allocator does not allow objects that are less than a page in size to span a page boundary. This means that if an object is PAGE_SIZE/2 + 1 bytes in size, it effectively uses an entire page, resulting in ~50% waste. Hence there are no kmalloc() cache sizes between PAGE_SIZE/2 and PAGE_SIZE. Zswap frequently needs allocations in this range, however. Using the kernel slab allocator causes the memory savings achieved through compression to be lost in fragmentation.
In order to satisfy these larger allocations while not wasting an entire page, zsmalloc allows objects to span page boundaries at the cost of having to map the allocations before accessing them. This mapping is needed because the object might be contained in two non-contiguous pages. For example, in a zsmalloc size class for objects that are 2/3 of PAGE_SIZE, three objects could be stored in a zspage with two non-contiguous backing pages with no waste. The object stored in the second of the three object positions in the zspage would be split between two different pages.
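In practice this just means every access to a zsmalloc object is bracketed by a map/unmap pair; a hedged usage sketch (the signatures shown have varied across kernel versions):

    static int example_zs_write(struct zs_pool *pool, const void *data, size_t len)
    {
            unsigned long handle = zs_malloc(pool, len);
            void *obj;

            if (!handle)
                    return -ENOMEM;
            /* the object may straddle two non-contiguous pages, so it must be
               mapped before it can be touched */
            obj = zs_map_object(pool, handle, ZS_MM_RW);
            memcpy(obj, data, len);
            zs_unmap_object(pool, handle);
            return 0;
    }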
Zsmalloc is a good fit for zswap. Zswap was evaluated using the kernel slab allocator and these issues did have a significant impact on the frontswap_store() success rate. This was due to kmalloc() allocation failures and a need to reject pages that compressed to sizes greater than PAGE_SIZE/2.
Performance
In order to produce a performance comparison, kernel builds were conducted with an increasing number of threads per run in a constant and constrained amount of memory. The results indicate a runtime reduction of 53% and an I/O reduction of 76% with zswap compared to normal swapping. The testing system was configured with:
- Gentoo running v3.7-rc7
- Quad-core i5-2500 @ 3.3GHz
- 512MB DDR3 1600MHz (limited with mem=512m on boot)
- Filesystem and swap on 80GB HDD (about 58MB/s with hdparm -t)
The table below summarizes the test runs.
                         Baseline                              zswap                 Change
     N    pswpin  pswpout  majflt  I/O sum   pswpin  pswpout  majflt  I/O sum    %I/O     MB
     8         1      335     291      627        0        0     249      249    -60%      1
    12      3688    14315    5290    23293      123      860    5954     6937    -70%     64
    16     12711    46179   16803    75693     2936     7390   46092    56418    -25%     75
    20     42178   133781   49898   225857     9460    28382   92951   130793    -42%    371
    24     96079   357280  105242   558601     7719    18484  109309   135512    -76%   1653
The 'N' column indicates the maximum number of concurrent threads for the kernel build (make -jN) for each run. The next four columns are the statistics for the baseline run without zswap, followed by the same for the zswap run. The I/O sum column for each run is a sum of pswpin (pages swapped in), pswpout (pages swapped out), and majflt (major page faults). The difference between the baseline and zswap runs is shown both in relative terms, as a percentage of I/O reduction (the %I/O column), and in absolute terms, as a reduction in megabytes of swap-related I/O (the MB column).
A compressed swap cache reduces the efficiency of the page reclaim process. For any store operation, the cache may allocate some pages to store the compressed page, which reduces overall page reclaim efficiency. That reduction in efficiency puts additional shrinking pressure on the page cache, causing an increase in major page faults, where pages must be re-read from disk. In order to have a complete picture of the I/O impact, the major page faults must be included in the I/O sum.
The next table shows the total runtime of the kernel builds:
    Runtime (in seconds)
     N     base    zswap   %change
     8      107      107        0%
    12      128      110      -14%
    16      191      179       -6%
    20      371      240      -35%
    24      570      267      -53%
Comparing runs with the same number of threads, zswap reduces the runtime impact of swap activity; the rate of degradation as the runs become increasingly memory-constrained is also lower with zswap than in the baseline.
The measurements of average CPU utilization during the builds are:
    %CPU utilization (out of 400% on 4 CPUs)
     N     base    zswap   %change
     8      317      319        1%
    12      267      311       16%
    16      179      191        7%
    20       94      143       52%
    24       60      128      113%
The CPU utilization table shows that with zswap, the kernel build is able to make more productive use of the CPUs, as is expected from the runtime results.
Additional performance testing was performed using SPECjbb. Metrics regarding the performance improvements and I/O reductions that can be achieved using zswap on both x86 and Power7+ (with and without hardware compression acceleration) can be found on this page.
Conclusion
Zswap is a compressed swap cache, able to evict pages from the compressed cache, on an LRU basis, to the backing swap device when the compressed pool reaches its size limit or the pool is unable to obtain additional pages from the buddy allocator. Its approach trades CPU cycles for reduced swap I/O. This trade-off can result in a significant performance improvement, as reads from and writes to the compressed cache are almost always faster than reading from a swap device, which incurs the latency of an asynchronous block I/O read.
Index entries for this article
    Kernel          Memory management/Swapping
    Kernel          Transcendent memory
    Kernel          zswap
    GuestArticles   Jennings, Seth
Comments
Compared with zcache, it is the same problem with a different solution: zswap focuses solely on compressed swap caching, which removes the need for an abstraction layer and results in a smaller code base.
Both can be built into the same kernel, but you would only enable one at a time. Both zswap and zcache are enabled through boot parameters for now.
Neither is in mainline yet, and discussions are (or will be) ongoing on LKML about what to do going forward. An LSF/MM topic has been suggested for this very discussion (http://marc.info/?l=linux-mm&m=135923138220901)
It might be worth adding one of these word-orientated, recency-based compressors (from the 1999 Wilson et al. paper discussed below) to the kernel and trying it with zswap. (One problem: Wilson's compressor design assumes a 32-bit machine.)
They assume that most stuff is more or less aligned on AT LEAST 32 bit boundaries, or can be passably approximated that way.
So for example, they'll notice if a 32 bit word is all zeroes, or if it's zeroes in the upper 22 bits; many integers are.
For a 64-bit int that's zeroes in all but the low 10 bits, what will happen is that the first 32 bits will be encoded as a big zero, in two bits, and the second 32 bits will be encoded as an "almost zero," in 2 + 10 = 12 bits, so you get 64 bits crammed into 2 + 12 = 14 bits, for a compression ratio greater than 4.
Similar things happen with 64-bit patterns that aren't mostly zeroes, but are similar to other 64-bit patterns nearby in memory, as happens with many integers and most pointers. Often integers are numerically similar to other integers nearby (e.g., in arrays that represent discretized smooth functions) and pointers are similar to other pointers nearby (e.g., other pointers into the same few kilobytes, megabytes or gigabytes holding related elements of some data structure).
When you have one of those similar-to-something-nearby matches, a 32-bit word gets compressed to 2 tag bits, plus 4 selector bits saying which recently-seen pattern it's similar to, plus the 10 low bits that differ. For a 32-bit partial match, that gives you a compression ratio of 2 (2 + 4 + 10 = 16 bits for 32 bits of data).
But if you're actually looking at 64-bit pointers or ints, you're likely to do much better in the case of partial all-but-low-bits matches like that: the upper 4 bytes will compress to 6 bits, and the lower 4 bytes will compress to 16, for a compression ratio of 64 / (6 + 16 = 22), or roughly 3.
The short version is that if you guess 32-bit alignment, you get pretty good compression very, very fast on 32-bit machines, and even better compression "for free" (and a shade faster still) on 64-bit machines, because 64-bit ints and pointers typically have even less information in their upper bytes than 32-bit ints and pointers.
Actually, to operate optimally as a cache, zswap should hold the soonest-reused pages, not the most recently used ones. Looking for the most recently used pages is just one cheap way to approximate that. As we know from decades of research, there's usually a high correlation between the two properties, but there are notable cases where they are anti-correlated.
Regarding the measurements: doing the measurement on a machine with crippled memory but a fast CPU seems unfair. A machine with 512MB of RAM is unlikely to have such a fast CPU, and a slower CPU makes compression less attractive since compression then takes more time. In other words, measurements on a real 512MB machine (with something like a 1GHz Pentium, or an Allwinner A10) would seem more realistic.
Avi Miller gave a talk on transcendent memory at linux.conf.au last month and the recording of that talk can be found here: http://mirror.linux.org.au/linux.conf.au/2013/ogv/Transce...
Some academics tried this technique back in 1999 using three compression methods, including one designed for memory contents rather than file contents. See "The Case for Compressed Caching in Virtual Memory Systems" by Paul R. Wilson, Scott F. Kaplan, Yannis Smaragdakis, Proc. 1999 Usenix Annual Tech. Conf. (PDF, CiteSeerX), especially section 2.
zswap frontswap load: freeentry
In my understanding, if the zpool entry is not freed after a load, the zpool will be used up quickly. Checking the zswap code:
    freeentry:
            spin_lock(&tree->lock);
            /* this is a local refcnt put; the zswap entry refcnt will be 1
               after the put, and it will not be freed here */
            zswap_entry_put(tree, entry);
            spin_unlock(&tree->lock);

            return ret;
    }
So should the zswap entry be freed here with an extra, global entry_put, just as in zswap writeback?