The zswap compressed swap cache
Swapping is one of the biggest threats to performance. The latency gap between RAM and swap, even on a fast SSD, can be four orders of magnitude. The throughput gap is two orders of magnitude. In addition to the speed gap, storage on which a swap area resides is becoming more shared and virtualized, which can cause additional I/O latency and nondeterministic workload performance. The zswap subsystem exists to mitigate these undesirable effects of swapping through a reduction in I/O activity.
Zswap is a lightweight, write-behind compressed cache for swap pages. It takes pages that are in the process of being swapped out and attempts to compress them into a dynamically allocated RAM-based memory pool. If this process is successful, the writeback to the swap device is deferred and, in many cases, avoided completely. This results in a significant I/O reduction and performance gains for systems that are swapping.
Zswap basics
Zswap intercepts pages in the middle of swap writeback and caches them using the frontswap API. Frontswap has been in the kernel since v3.5 and has been covered by LWN before. It allows a backend driver, like zswap, to intercept both swap page writeback and the page faults for swapped out pages. Zswap also makes use of the "zsmalloc" allocator (discussed below) for compressed page storage.
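As a point of reference, a frontswap backend hooks in by filling a frontswap_ops structure with its callbacks and registering it. The sketch below shows that pattern using zswap-style handler names; the exact callback and frontswap_register_ops() signatures have varied across kernel versions, so treat this as illustrative rather than definitive:

    #include <linux/frontswap.h>
    #include <linux/swap.h>

    /* handlers implemented elsewhere; prototypes follow the v3.5-era
       frontswap callback signatures */
    int zswap_frontswap_store(unsigned type, pgoff_t offset, struct page *page);
    int zswap_frontswap_load(unsigned type, pgoff_t offset, struct page *page);
    void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset);
    void zswap_frontswap_invalidate_area(unsigned type);
    void zswap_frontswap_init(unsigned type);

    static struct frontswap_ops zswap_frontswap_ops = {
            .store = zswap_frontswap_store,                 /* intercept swap-out */
            .load = zswap_frontswap_load,                   /* intercept swap-in faults */
            .invalidate_page = zswap_frontswap_invalidate_page,
            .invalidate_area = zswap_frontswap_invalidate_area,
            .init = zswap_frontswap_init,                   /* per-swap-device setup */
    };

    static int __init zswap_init(void)
    {
            /* ... allocate per-CPU buffers, compressor, entry cache ... */
            frontswap_register_ops(&zswap_frontswap_ops);
            return 0;
    }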
Zswap seeks to be as simple as possible in its structure and operation. There are two primary data structures. The first is the zswap_entry structure, which contains information about a single compressed page stored in zswap:
    struct zswap_entry {
            struct rb_node rbnode;
            int refcount;
            pgoff_t offset;
            unsigned long handle;   /* zsmalloc allocation */
            unsigned int length;
            /* ... */
    };
The second is the zswap_tree structure which contains a red-black tree of zswap entries indexed by the offset value:
    struct zswap_tree {
            struct rb_root rbroot;
            struct list_head lru;
            spinlock_t lock;
            struct zs_pool *pool;
    };
At the highest level, there is an array of zswap_tree structures indexed by the swap device number.
There is a single lock per zswap_tree to protect the tree structure during lookups and modifications. The higher-level swap code provides certain protections that simplify the zswap implementation by not having to design for concurrent store, load, and invalidate operations on the same swap entry. While this single-lock design might seem like a likely source of contention, in practice the swap path is largely bottlenecked by other locks at higher levels, such as the anon_vma mutex or swap_lock; by comparison, the zswap_tree lock is very lightly contended. Writeback support, covered in the next section, also led to this single-lock design.
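A lookup in a device's tree is an ordinary red-black tree walk keyed on the swap offset, performed with the tree lock held. A minimal sketch (not a verbatim copy of the zswap source) looks like this:

    #include <linux/rbtree.h>

    /* find the zswap_entry for a given swap offset; caller holds tree->lock */
    static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
    {
            struct rb_node *node = root->rb_node;
            struct zswap_entry *entry;

            while (node) {
                    entry = rb_entry(node, struct zswap_entry, rbnode);
                    if (entry->offset > offset)
                            node = node->rb_left;
                    else if (entry->offset < offset)
                            node = node->rb_right;
                    else
                            return entry;
            }
            return NULL;
    }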
For page compression, zswap uses compressor modules provided by the kernel's cryptographic API. This allows users to select the compressor dynamically at boot time, and gives easy access to hardware compression accelerators or any other future compression engines.
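As a rough illustration, the synchronous side of that interface looks like the sketch below. The names are simplified assumptions: zswap actually keeps per-CPU transforms, and the algorithm string ("lzo" by default) is whatever was selected at boot:

    #include <linux/crypto.h>

    static struct crypto_comp *zswap_comp;      /* simplification: zswap uses per-CPU tfms */

    static int example_comp_init(const char *alg)       /* e.g. "lzo" */
    {
            zswap_comp = crypto_alloc_comp(alg, 0, 0);
            return IS_ERR(zswap_comp) ? PTR_ERR(zswap_comp) : 0;
    }

    /* compress one page into dst (assumed to be at least PAGE_SIZE bytes);
       on success, *dlen holds the compressed length */
    static int example_compress_page(const u8 *src, u8 *dst, unsigned int *dlen)
    {
            *dlen = PAGE_SIZE;
            return crypto_comp_compress(zswap_comp, src, PAGE_SIZE, dst, dlen);
    }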
A zswap store operation occurs when a page is selected for swapping by the reclaim system and frontswap intercepts the page in swap_writepage(). The operation begins by compressing the page into a per-CPU temporary buffer. Compressing into the temporary buffer is required because the compressed size, and thus the size of the permanent allocation needed to hold it, isn't known until the compression is actually done. Once the compressed size is known, an object is allocated and the temporary buffer is copied into the object. Lastly, a zswap_entry structure is allocated, populated, and inserted into the tree for that swap device.
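Putting those pieces together, a condensed sketch of the store path follows, reusing the zswap_comp transform from the previous sketch. Error handling is trimmed, names such as zswap_dstmem and zswap_rb_insert() are illustrative, and the zsmalloc function signatures shown here have changed over kernel versions:

    #include <linux/slab.h>
    #include <linux/percpu.h>
    #include <linux/highmem.h>
    #include <linux/zsmalloc.h>

    static DEFINE_PER_CPU(u8 *, zswap_dstmem);  /* page-sized buffers, set up at init */

    static int zswap_store_sketch(struct zswap_tree *tree, pgoff_t offset,
                                  struct page *page)
    {
            struct zswap_entry *entry;
            unsigned int dlen = PAGE_SIZE;
            unsigned long handle;
            u8 *src, *buf, *obj;
            int ret;

            entry = kzalloc(sizeof(*entry), GFP_KERNEL);
            if (!entry)
                    return -ENOMEM;

            /* compress into the per-CPU temporary buffer */
            buf = get_cpu_var(zswap_dstmem);
            src = kmap_atomic(page);
            ret = crypto_comp_compress(zswap_comp, src, PAGE_SIZE, buf, &dlen);
            kunmap_atomic(src);
            if (ret)
                    goto out;

            /* the compressed size is known only now; allocate the real object */
            handle = zs_malloc(tree->pool, dlen);
            if (!handle) {
                    ret = -ENOMEM;          /* caller falls back to real swap-out */
                    goto out;
            }
            obj = zs_map_object(tree->pool, handle, ZS_MM_WO);
            memcpy(obj, buf, dlen);
            zs_unmap_object(tree->pool, handle);

            /* record the entry and insert it into this device's tree */
            entry->offset = offset;
            entry->handle = handle;
            entry->length = dlen;
            entry->refcount = 1;
            spin_lock(&tree->lock);
            zswap_rb_insert(&tree->rbroot, entry);      /* illustrative insert helper */
            spin_unlock(&tree->lock);
    out:
            put_cpu_var(zswap_dstmem);
            if (ret)
                    kfree(entry);
            return ret;
    }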
If the store fails for any reason, most likely because of an object allocation failure, zswap returns an error which is propagated up through frontswap into swap_writepage(). The page is then swapped out to the swap device as usual.
A load operation occurs when a program page faults on a page table entry (PTE) that contains a swap entry and is intercepted by frontswap in swap_readpage(). The swap entry contains the device and offset information needed to look up the zswap entry in the appropriate tree. Once the entry is located, the data is decompressed directly into the page allocated by the page fault code. The entry is not removed from the tree during a load; it remains up-to-date until the entry is invalidated.
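A correspondingly condensed sketch of the load path, again with illustrative names and trimmed error handling (the real code also takes a reference on the entry so that writeback cannot free it mid-decompression, as described later):

    static int zswap_load_sketch(struct zswap_tree *tree, pgoff_t offset,
                                 struct page *page)
    {
            struct zswap_entry *entry;
            unsigned int dlen = PAGE_SIZE;
            u8 *src, *dst;
            int ret;

            spin_lock(&tree->lock);
            entry = zswap_rb_search(&tree->rbroot, offset);
            spin_unlock(&tree->lock);
            if (!entry)
                    return -1;      /* not in zswap; read from the swap device */

            /* decompress straight into the page allocated by the fault path */
            src = zs_map_object(tree->pool, entry->handle, ZS_MM_RO);
            dst = kmap_atomic(page);
            ret = crypto_comp_decompress(zswap_comp, src, entry->length, dst, &dlen);
            kunmap_atomic(dst);
            zs_unmap_object(tree->pool, entry->handle);
            return ret;
    }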
An invalidate operation occurs when the reference count for a particular swap offset becomes zero in swap_entry_free(). In this case, the zswap entry is removed from the appropriate tree, and the entry and the zsmalloc allocation that it references are freed.
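In the absence of writeback, the invalidate path is simply the reverse of a store; once writeback is added (below), the direct free becomes a reference-count drop. A minimal sketch:

    static void zswap_invalidate_sketch(struct zswap_tree *tree, pgoff_t offset)
    {
            struct zswap_entry *entry;

            spin_lock(&tree->lock);
            entry = zswap_rb_search(&tree->rbroot, offset);
            if (entry)
                    rb_erase(&entry->rbnode, &tree->rbroot);
            spin_unlock(&tree->lock);

            if (entry) {
                    zs_free(tree->pool, entry->handle);     /* free the compressed object */
                    kfree(entry);
            }
    }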
To remain preemption-friendly, zswap never disables interrupts; preemption is disabled only during compression, while accessing the per-CPU temporary buffer page, and during decompression, while accessing a mapped zsmalloc allocation.
Zswap writeback
To operate optimally as a cache, zswap should hold the most recently used pages. With frontswap, there is, unfortunately, a real potential for an inverse least recently used (LRU) condition in which the cache fills with older pages, and newer pages are forced out to the slower swap device. To address this, zswap is designed with "resumed" writeback in mind.
As background, the process for swapping pages follows these steps:
- First, an anonymous memory page is selected for swapping and a slot is
allocated in the swap device.
- Next, the page is unmapped from all processes using that page. The
PTEs referencing that page are filled with the swap entry that consists of
the swap type and offset where the page can be found.
- Lastly, the page is scheduled for writeback to the swap device.
When frontswap_store() in swap_writepage() is successful, the writeback step is not performed. However, the slot in the swap device has been allocated and is still reserved for the page even though the page only resides in the frontswap backend. Resumed writeback in zswap forces pages out of the compressed cache into their previously reserved swap slots in the swap device. Currently, the policy is basic and forces pages out of the cache in two cases: (1) when the cache has reached its maximum size, according to the max_pool_percent sysfs tunable, or (2) when zswap is unable to allocate new space for the compressed pool.
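The first of those triggers reduces to a simple check against the tunable; a minimal sketch (the variable names and the 20 percent default shown here are illustrative):

    static unsigned int zswap_max_pool_percent = 20;   /* the max_pool_percent tunable */
    static unsigned long zswap_pool_pages;             /* pages currently held by the pool */

    /* writeback is triggered once the compressed pool exceeds its share of RAM */
    static bool zswap_is_full(void)
    {
            return zswap_pool_pages > totalram_pages * zswap_max_pool_percent / 100;
    }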
During resumed writeback, zswap decompresses the page, adds it back to the swap cache, and schedules writeback into the swap slot that was previously reserved. By splitting swap_writepage() into two functions after frontswap_store() is called, zswap can resume writeback from the point where the initial writeback terminated in frontswap. The new function is called __swap_writepage().
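The resulting swap_writepage() is short enough to sketch in full; this mirrors the structure of the patched function, though details such as the bio completion callback passed to __swap_writepage() differ between kernel versions:

    int swap_writepage(struct page *page, struct writeback_control *wbc)
    {
            if (try_to_free_swap(page)) {
                    unlock_page(page);
                    return 0;
            }
            if (frontswap_store(page) == 0) {
                    /* zswap took the page: report it written without block I/O */
                    set_page_writeback(page);
                    unlock_page(page);
                    end_page_writeback(page);
                    return 0;
            }
            /* the real block I/O path; resumed writeback calls this directly */
            return __swap_writepage(page, wbc, end_swap_bio_write);
    }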
Freeing zswap entries becomes more complex with writeback. Without writeback, pages would only be freed during invalidate operations (zswap_frontswap_invalidate_page()). With writeback, pages can also be freed in zswap_writeback_pages(). These invalidate and writeback functions can run concurrently for the same zswap entry. To guarantee that entries are not freed while being accessed by another thread, a reference count field (called refcount) is used in the zswap_entry structure.
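A minimal sketch of that protocol: both paths drop their reference under the tree lock, and whichever thread drops the last reference frees the entry and its compressed object (helper names are illustrative):

    /* caller holds tree->lock for both helpers */
    static void zswap_entry_get(struct zswap_entry *entry)
    {
            entry->refcount++;
    }

    static void zswap_entry_put(struct zswap_tree *tree, struct zswap_entry *entry)
    {
            /* the entry has already been removed from the tree by whichever
               path (invalidate or writeback) got there first */
            if (--entry->refcount <= 0) {
                    zs_free(tree->pool, entry->handle);
                    kfree(entry);
            }
    }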
Zsmalloc rationale
One really can't talk about zswap without mentioning zsmalloc, the allocator it uses for compressed page storage, which currently resides in the Linux Staging tree.
Zsmalloc is a slab-based allocator used by zswap; it provides more reliable allocation of large objects in a memory constrained environment than does the kernel slab allocator. Zsmalloc has already been discussed on LWN, so this section will focus more on the need for zsmalloc in the presence of the kernel slab allocator.
The objects that zswap stores are compressed pages. The default compressor is lzo1x-1, which is known for speed, but not so much for high compression. As a result, zswap objects can frequently be large relative to typical slab objects (>1/8th PAGE_SIZE). This is a problem for the kernel slab allocator under memory pressure.
The kernel slab allocator requires high-order page allocations to back slabs for large objects. For example, on a system with a 4K page size, the kmalloc-512 cache has slabs that are backed by two contiguous pages. kmalloc-2048 requires eight contiguous pages per slab. These high-order page allocations are very likely to fail when the system is under memory pressure.
Zsmalloc addresses this problem by allowing the pages backing a slab (or “size class” in zsmalloc terms) to be both non-contiguous and variable in number. They are variable in number because zsmalloc allows a slab to be composed of less than the target number of backing pages. A set of non-contiguous pages backing a slab are stitched together using fields of struct page to create a “zspage”. This allows zsmalloc to service large object allocations, up to PAGE_SIZE, without requiring high-order page allocations.
Additionally, the kernel slab allocator does not allow objects that are less than a page in size to span a page boundary. This means that if an object is PAGE_SIZE/2 + 1 bytes in size, it effectively uses an entire page, resulting in ~50% waste. Hence there are no kmalloc() cache sizes between PAGE_SIZE/2 and PAGE_SIZE. Zswap frequently needs allocations in this range, however. Using the kernel slab allocator causes the memory savings achieved through compression to be lost in fragmentation.
In order to satisfy these larger allocations while not wasting an entire page, zsmalloc allows objects to span page boundaries at the cost of having to map the allocations before accessing them. This mapping is needed because the object might be contained in two non-contiguous pages. For example, in a zsmalloc size class for objects that are 2/3 of PAGE_SIZE, three objects could be stored in a zspage with two non-contiguous backing pages with no waste. The object stored in the second of the three object positions in the zspage would be split between two different pages.
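In practice this just means every access to a zsmalloc object is bracketed by a map/unmap pair; a hedged usage sketch (the signatures shown have varied across kernel versions):

    static int example_zs_write(struct zs_pool *pool, const void *data, size_t len)
    {
            unsigned long handle = zs_malloc(pool, len);
            void *obj;

            if (!handle)
                    return -ENOMEM;
            /* the object may straddle two non-contiguous pages, so it must be
               mapped before it can be touched */
            obj = zs_map_object(pool, handle, ZS_MM_RW);
            memcpy(obj, data, len);
            zs_unmap_object(pool, handle);
            return 0;
    }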
Zsmalloc is a good fit for zswap. Zswap was evaluated using the kernel slab allocator and these issues did have a significant impact on the frontswap_store() success rate. This was due to kmalloc() allocation failures and a need to reject pages that compressed to sizes greater than PAGE_SIZE/2.
Performance
In order to produce a performance comparison, kernel builds were conducted with an increasing number of threads per run in a constant and constrained amount of memory. The results indicate a runtime reduction of 53% and an I/O reduction of 76% with zswap compared to normal swapping. The testing system was configured with:
- Gentoo running v3.7-rc7
- Quad-core i5-2500 @ 3.3GHz
- 512MB DDR3 1600MHz (limited with mem=512m on boot)
- Filesystem and swap on 80GB HDD (about 58MB/s with hdparm -t)
The table below summarizes the test runs.
                         Baseline                              zswap                 Change
     N    pswpin  pswpout  majflt  I/O sum   pswpin  pswpout  majflt  I/O sum    %I/O     MB
     8         1      335     291      627        0        0     249      249    -60%      1
    12      3688    14315    5290    23293      123      860    5954     6937    -70%     64
    16     12711    46179   16803    75693     2936     7390   46092    56418    -25%     75
    20     42178   133781   49898   225857     9460    28382   92951   130793    -42%    371
    24     96079   357280  105242   558601     7719    18484  109309   135512    -76%   1653
The 'N' column indicates the maximum number of concurrent threads for the kernel build (make -jN) for each run. The next four columns are the statistics for the baseline run without zswap, followed by the same for the zswap run. The I/O sum column for each run is a sum of pswpin (pages swapped in), pswpout (pages swapped out), and majflt (major page faults). The difference between the baseline and zswap runs is shown both in relative terms, as a percentage of I/O reduction (the %I/O column), and in absolute terms, as a reduction in megabytes of swap-related I/O (the MB column).
A compressed swap cache reduces the efficiency of the page reclaim process. For any store operation, the cache may allocate some pages to store the compressed page, which reduces overall page reclaim efficiency. That reduction in efficiency puts additional shrinking pressure on the page cache, causing an increase in major page faults, where pages must be re-read from disk. In order to have a complete picture of the I/O impact, the major page faults must be included in the I/O sum.
The next table shows the total runtime of the kernel builds:
    Runtime (in seconds)
     N     base    zswap   %change
     8      107      107        0%
    12      128      110      -14%
    16      191      179       -6%
    20      371      240      -35%
    24      570      267      -53%
Comparing runs with the same number of threads, zswap reduces the runtime impact of swap activity; the rate of degradation as the runs become increasingly memory-constrained is also lower with zswap than in the baseline.
The measurements of average CPU utilization during the builds are:
    %CPU utilization (out of 400% on 4 CPUs)
     N     base    zswap   %change
     8      317      319        1%
    12      267      311       16%
    16      179      191        7%
    20       94      143       52%
    24       60      128      113%
The CPU utilization table shows that with zswap, the kernel build is able to make more productive use of the CPUs, as is expected from the runtime results.
Additional performance testing was performed using SPECjbb. Metrics regarding the performance improvements and I/O reductions that can be achieved using zswap on both x86 and Power7+ (with and without hardware compression acceleration) can be found on this page.
Conclusion
Zswap is a compressed swap cache, able to evict pages from the compressed cache, on an LRU basis, to the backing swap device when the compressed pool reaches its size limit or the pool is unable to obtain additional pages from the buddy allocator. Its approach trades CPU cycles for reduced swap I/O. This trade-off can result in a significant performance improvement, as reads from and writes to the compressed cache are almost always faster than reading from a swap device, which incurs the latency of an asynchronous block I/O read.
Index entries for this article
    Kernel          Memory management/Swapping
    Kernel          Transcendent memory
    Kernel          zswap
    GuestArticles   Jennings, Seth
Comments
Compared with zcache, it is the same problem with a different solution: zswap focuses solely on compressed swap caching, which removes the need for an abstraction layer and results in a smaller code base.
Both can be built into the same kernel, but you would only enable one at a time. Both zswap and zcache are enabled through boot parameters for now.
Neither is in mainline yet, and discussions are (or will be) ongoing on LKML about what to do going forward. An LSF/MM topic has been suggested for this very discussion (http://marc.info/?l=linux-mm&m=135923138220901)
It might be worth adding one of these word-orientated, recency-based compressors (from the 1999 Wilson et al. paper discussed below) to the kernel and trying it with zswap. (One problem: Wilson's compressor design assumes a 32-bit machine.)
They assume that most stuff is more or less aligned on AT LEAST 32 bit boundaries, or can be passably approximated that way.
So for example, they'll notice if a 32 bit word is all zeroes, or if it's zeroes in the upper 22 bits; many integers are.
For a 64-bit int that's zeroes in all but the low 10 bits, what will happen is that the first 32 bits will be encoded as a big zero, in two bits, and the second 32 bits will be encoded as an "almost zero," in 2 + 10 = 12 bits, so you get 64 bits crammed into 2 + 12 = 14 bits, for a compression ratio greater than 4.
Similar things happen with 64-bit patterns that aren't mostly zeroes, but are similar to other 64-bit patterns nearby in memory, as happens with many integers and most pointers. Often integers are numerically similar to other integers nearby (e.g., in arrays that represent discretized smooth functions) and pointers are similar to other pointers nearby (e.g., other pointers into the same few kilobytes, megabytes or gigabytes holding related elements of some data structure).
When you have one of those similar-to-something-nearby matches, a 32-bit word gets compressed to 2 tag bits, plus 4 selector bits saying which recently-seen pattern it's similar to, plus the 10 low bits that differ. For a 32-bit partial match, that gives you a compression ratio of 2 (2 + 4 + 10 = 16 bits for 32 bits of data).
But if you're actually looking at 64-bit pointers or ints, you're likely to do much better in the case of partial all-but-low-bits matches like that: the upper 4 bytes will compress to 6 bits, and the lower 4 bytes will compress to 16, for a compression ratio of 64 / (6 + 16 = 22), or roughly 3.
The short version is that if you guess 32-bit alignment, you get pretty good compression very, very fast on 32-bit machines, and even better compression "for free" (and a shade faster still) on 64-bit machines, because 64-bit ints and pointers typically have even less information in their upper bytes than 32-bit ints and pointers.
Actually, to operate optimally as a cache, zswap should hold the soonest-reused pages, not the most recently used ones. Looking for the most recently used pages is just one cheap way to approximate that. As we know from decades of research, there's usually a high correlation between the two properties, but there are notable cases where they are anti-correlated.
Regarding the measurements: doing the measurement on a machine with crippled memory but a fast CPU seems unfair. A machine with 512MB of RAM is unlikely to have such a fast CPU, and a slower CPU makes compression less attractive since compression then takes more time. In other words, measurements on a real 512MB machine (with something like a 1GHz Pentium, or an Allwinner A10) would seem more realistic.
Avi Miller gave a talk on transcendent memory at linux.conf.au last month and the recording of that talk can be found here: http://mirror.linux.org.au/linux.conf.au/2013/ogv/Transce...
Some academics tried this technique back in 1999 using three compression methods, including one designed for memory contents rather than file contents. See "The Case for Compressed Caching in Virtual Memory Systems" by Paul R. Wilson, Scott F. Kaplan, Yannis Smaragdakis, Proc. 1999 Usenix Annual Tech. Conf. (PDF, CiteSeerX), especially section 2.
zswap frontswap load: freeentry
In my understanding, if the zpool entry is not freed after a load, the zpool will be used up quickly. Checking the zswap code:
    freeentry:
            spin_lock(&tree->lock);
            /* this is a local refcnt put; the zswap entry refcnt will be 1
               after the put, and it will not be freed here */
            zswap_entry_put(tree, entry);
            spin_unlock(&tree->lock);

            return ret;
    }
So should the zswap entry be freed here with an extra, global entry_put, just as in zswap writeback?