Add hotkeys detection #14680

minchopaskal · 2026-01-09T09:21:11Z

Description

Introducing a new method for identifying hotkeys inside a redis server during a tracking time period.

Hotkeys in this context are defined by two metrics:

Percentage of time spend by cpu on the key from the total time during the tracking period
Percentage of network bytes (input+output) used for the key from the total network bytes used by redis during the tracking period

Usage

Although the API is subject to change the general idea is for the user to initiate a hotkeys tracking process which should run for some time. The keys' metrics are recorded inside a probabilistic structure and after that the user is able to fetch the top K of them.

Current API

HOTKEYS START	[COUNT k] 
			    [DURATION duration]
			    [SAMPLE ratio]
			    [SLOTS count slot…]

HOTKEYS STOP
HOTKEYS GET 	[MINCPU mincpu] 
			    [MINNET minnet]

HOTKEYS GET

127.0.0.1:6379> hotkeys get
1) "collection-start-time
2) <start-time-unix-timestamp-in-seconds>
3) "collection-duration"
4) <duration-in-seconds>
5) "by-cpu-time"
6) 1) key-1_1
   2) <%-from-total-cpu-time>
   ...
   19) key-10_1
   20) <%-from-total-cpu-time>
7) 1) "by-net-bytes"
8) 1) key-1_2
   2) <%-from-total-net-bytes>
   ...
   19) key-10_2
   	20) <%-from-total-net-bytes>

Other ideas

Instead of introducing a new command we can instead use new configurations similar to this PR

Implementation

Independent of API, implementation is based on a probabilistic structure - Cuckoo Heavy Keeper structure with added min-heap to keep track of top K hotkey's names. CHK is an loosely based on HeavyKeeper which is used in RedisBloom's TopK but has higher throughput.

Random fixed probability sampling via the HOTKEYS start sample <ratio> param. Each key is sampled with probability 1/ratio.

Performance implications

With low enough sample rate (controlled by HOTKEYS start sample <ratio>) there is negligible performance hit. Tracking every key though can incur up to 15% hit in the worst case after running the tests in this bench.

Add new TopK structure based on CuckooHeavyKeeper algorithm Add new HOTKEYS (START|STOP|GET) command

sggeorgiev · 2026-01-09T15:02:08Z

src/server.c

+        setDeferredArrayLen(c, replylen, cpu_toshow);
+
+        addReplyBulkCString(c, "by-net-bytes");
+        chkHeapBucket *net = chkTopKList(server.hotkeys.net);


Is it possible chkTopKList to return NULL here?

sggeorgiev · 2026-01-09T15:02:21Z

src/server.c

+
+        addReplyBulkCString(c, "by-cpu-time");
+        /* Sorted list of top K elements in the chkTopK structure */
+        chkHeapBucket *cpu = chkTopKList(server.hotkeys.cpu);


Is it possible chkTopKList to return NULL here?

sggeorgiev · 2026-01-09T15:07:13Z

src/chk.c

+            return NULL;
+        }
+
+        memset(topk->tables[i], 0, sizeof(chkBucket) * numbuckets);


Can you check ztrycalloc implementation. I think it initialize the array with 0.

sggeorgiev · 2026-01-09T15:08:50Z

src/chk.c

+        zfree(topk);
+        return NULL;
+    }
+    memset(topk->heap, 0, sizeof(chkHeapBucket) * k);


Do we need this if ztrycalloc initialize it with 0.

sggeorgiev · 2026-01-09T15:46:25Z

src/server.c

+    if (!server.hotkeys.cpu) return;
+
+    server.hotkeys.net = chkTopKCreate(count * 10, nearestNextPowerOf2((unsigned)count * 100), 1.08);
+    if (!server.hotkeys.net) return;


If we fail here. Do we need to destroy server.hotkeys.cpu.

sggeorgiev · 2026-01-09T15:49:26Z

src/server.c

+     * enough) again for better accuracy. Note the CHK implementation uses a
+     * power of 2 numbuckets for better cache locality. */
+    server.hotkeys.cpu = chkTopKCreate(count * 10, nearestNextPowerOf2((unsigned)count * 100), 1.08);
+    if (!server.hotkeys.cpu) return;


What happens with the slots if we fail here.

sggeorgiev · 2026-01-09T15:58:04Z

src/chk.c

+        return NULL;
+
+    /* Heap uses different hash than the cuckoo tables */
+    uint64_t fp = MurmurHash64A(item, itemlen, HEAP_SEED);


What is the purpose of the HEAP_SEED? Can we keep in fpAndIdx the hash calculated in generateItemFpAndIdxs and use it here?

Different seed values can be used when you need independent (non-correlated) hash functions. The specific value doesn't matter.
Anyway, as above, I suggest replacing with XXH64, which should be faster.

LiorKogan · 2026-01-09T18:52:56Z

src/server.c

+     * sample_ratio was 1 we add all keys.
+     * Note, for multi-key commands we use the same probability for all the keys. */
+    if (server.hotkeys.sample_ratio > 1 &&
+        (double)rand() / RAND_MAX >= 1. / server.hotkeys.sample_ratio)


Probably faster:

(double)rand() * server.hotkeys.sample_ratio >= RAND_MAX

Agree. This is a micro-optimization but worthwhile for a hot path.

LiorKogan · 2026-01-09T19:16:28Z

src/chk.c

+        double prob = (new_count - LOBBY_PROMOTION_THRESHOLD) /
+                      (double)(min - LOBBY_PROMOTION_THRESHOLD);
+
+        if ((rand() / (double)RAND_MAX) >= prob) return -1;


Replacing division with multiplication should be faster.
(+2 places below)

LiorKogan · 2026-01-09T19:18:42Z

src/chk.h

+
+#include "stdint.h"
+
+#define CHK_LUT_SIZE 256


Maybe add documentation for these constants?

LiorKogan · 2026-01-09T19:19:54Z

src/commands/hotkeys-get.json

+                        "collection-start-time": {
+                            "type": "integer"
+                        },
+                        "collection-duration": {


Add units in the description

LiorKogan · 2026-01-09T19:22:04Z

src/chk.c

+              "Lobby promotion threshold should be less then the LUT size to "
+              "ensure constant operations during decayCounter!");
+
+#define MAX_KICKS 16


Maybe add documentation for these constants?

LiorKogan · 2026-01-09T19:25:47Z

src/chk.c

+}
+
+fpAndIdx generateItemFpAndIdxs(chkTopK *topk, char *item, int itemlen) {
+    uint64_t hash = MurmurHash64A(item, itemlen, 0);


Consider using XXH64 instead (should be much faster)

LiorKogan · 2026-01-09T19:51:59Z

src/commands.def

+
+/* HOTKEYS command table */
+struct COMMAND_STRUCT HOTKEYS_Subcommands[] = {
+{MAKE_CMD("get","Returns two lists of top K hotkeys - one by cpu time and one by network bytes.","O(K) where K is the number of hotkeys to return.","8.0.0",CMD_DOC_NONE,NULL,NULL,"server",COMMAND_GROUP_SERVER,HOTKEYS_GET_History,0,HOTKEYS_GET_Tips,0,hotkeysCommand,-2,CMD_ADMIN|CMD_NOSCRIPT,0,HOTKEYS_GET_Keyspecs,0,NULL,2),.args=HOTKEYS_GET_Args},


LiorKogan · 2026-01-09T19:54:25Z

src/commands.def

+
+/* HOTKEYS command table */
+struct COMMAND_STRUCT HOTKEYS_Subcommands[] = {
+{MAKE_CMD("get","Returns two lists of top K hotkeys - one by cpu time and one by network bytes.","O(K) where K is the number of hotkeys to return.","8.0.0",CMD_DOC_NONE,NULL,NULL,"server",COMMAND_GROUP_SERVER,HOTKEYS_GET_History,0,HOTKEYS_GET_Tips,0,hotkeysCommand,-2,CMD_ADMIN|CMD_NOSCRIPT,0,HOTKEYS_GET_Keyspecs,0,NULL,2),.args=HOTKEYS_GET_Args},


Maybe "where K is the number of hotkeys returned" instead of "where K is the number of hotkeys to return"

(we may return a smaller number)

LiorKogan · 2026-01-09T20:01:20Z

src/chk.c

+    return res;
+}
+
+double getDecayProb(chkTopK *topk, counter_t cnt) {


Maybe mark the following 3 functions as static inline?

shahsb

Scalability in Clusters: How does this behave in Redis Cluster? Slot filtering helps, but aggregating across nodes might need future work.

Comparison to Alternatives: How does CHK compare to other sketches like Count-Min or HyperLogLog in this context? A brief rationale would strengthen the case.

shahsb · 2026-01-10T08:50:40Z

src/murmurhash.h

+/* Our hash function is MurmurHash2, 64 bit version.
+ * It was modified for Redis in order to provide the same result in
+ * big and little endian archs (endian neutral). */
+uint64_t MurmurHash64A(const void * key, size_t len, unsigned int seed);


Replace MurmurHash64A with XXH64 as suggested -- it's faster and already in Redis' toolkit.

shahsb · 2026-01-10T09:00:33Z

src/server.c

+     * sample_ratio was 1 we add all keys.
+     * Note, for multi-key commands we use the same probability for all the keys. */
+    if (server.hotkeys.sample_ratio > 1 &&
+        (double)rand() / RAND_MAX >= 1. / server.hotkeys.sample_ratio)


Agree. This is a micro-optimization but worthwhile for a hot path.

shahsb · 2026-01-10T09:04:04Z

src/commands.def

+{MAKE_ARG("minnet",ARG_TYPE_INTEGER,-1,"MINNET",NULL,NULL,CMD_ARG_OPTIONAL,0,NULL)},
+};
+
+/********** HOTKEYS START ********************/


What happens if HOTKEYS START is called while tracking is already active? Should it error or reset?

shahsb · 2026-01-10T09:08:13Z

tests/unit/hotkeys.tcl

+        # Check that all returned keys (based on cpu time) are from our hot_keys list
+        set num_returned_cpu [llength $returned_cpu_keys]
+        assert_lessthan_equal $num_returned_cpu 10
+        assert_morethan $num_returned_cpu 0


Test with very low/high sample ratios -- e.g., ratio=1 (track everything) vs. ratio=1000 (sparse sampling).

shahsb · 2026-01-10T09:08:44Z

tests/unit/hotkeys.tcl

+        assert_lessthan_equal $num_returned_cpu 10
+        assert_morethan $num_returned_cpu 0
+
+        foreach key $returned_cpu_keys {


We could add more unit tests for edge cases like zero-duration tracking, invalid params, or concurrent commands.

shahsb · 2026-01-10T09:09:05Z

tests/unit/hotkeys.tcl

+        # Same as cpu-time but for net-bytes
+        set num_returned_net [llength $returned_net_keys]
+        assert_lessthan_equal $num_returned_net 10
+        assert_morethan $num_returned_net 0


Integration tests with real workloads (e.g., simulating hot keys via benchmarks) would validate the accuracy of CHK.

tezc · 2026-01-10T12:51:25Z

src/chk.c

+    for (int32_t i = topk->k - 1; i >= 0; --i)
+        if (fp == (runner + i)->fp && itemlen == (runner + i)->itemlen &&
+            memcmp((runner + i)->item, item, itemlen) == 0) {
+            return runner + i;
+        }
+    return NULL;


Suggested change

for (int32_t i = topk->k - 1; i >= 0; --i)

if (fp == (runner + i)->fp && itemlen == (runner + i)->itemlen &&

memcmp((runner + i)->item, item, itemlen) == 0) {

return runner + i;

}

return NULL;

for (int32_t i = topk->k - 1; i >= 0; --i) {

chkHeapBucket *bucket = runner + i;

if (bucket->fp == fp &&

bucket->itemlen == itemlen &&

memcmp(bucket->item, item, itemlen) == 0)

{

return bucket;

}

}

return NULL;

not sure about variable name but maybe we can make this part little bit easier on eyes.

tezc · 2026-01-10T12:52:26Z

src/chk.c

+    }
+
+    chkHeapBucket top = {0};
+    memcpy(&top, &array[start], sizeof(chkHeapBucket));


Suggested change

memcpy(&top, &array[start], sizeof(chkHeapBucket));

top = array[start];

maybe you can just assign instead of mempcy calls in this function?

tezc · 2026-01-10T12:57:23Z

src/chk.c

+/* chkTopK operations */
+
+chkTopK *chkTopKCreate(int k, int numbuckets, double decay) {
+    /* Number of buckets need to be a power of 2 for better performance - we


do we need some input validation for this function? e.g. k >= 0
I didn't review rest of the PR yet, maybe it is done externally, then it might be okay (and if we have enough tests for it)

tezc · 2026-01-10T13:02:07Z

src/chk.c

+        return NULL;
+    }
+
+    chkTopK *topk = ztrymalloc(sizeof(chkTopK));


I think we generally use trymalloc when we are making a huge allocation (potentially several GBs) and we don't want to abort the process if it fails. For small allocations, we generally user malloc() family and never expect them to fail (zmalloc() will abort on OOM). Are we expecting a huge memory allocation for this structure? Otherwise, maybe we should not bother with calling trymalloc functions.

tezc · 2026-01-10T13:08:20Z

src/commands/hotkeys-get.json

+        "summary": "Returns two lists of top K hotkeys - one by cpu time and one by network bytes.",
+        "complexity": "O(K) where K is the number of hotkeys to return.",
+        "group": "server",
+        "since": "8.0.0",


Suggested change

"since": "8.0.0",

"since": "8.6.0",

similar issue in other commands

tezc · 2026-01-10T13:12:23Z

src/chk.c

+
+        topk->heap[0].count = entry->count;
+        topk->heap[0].fp = fp;
+        topk->heap[0].item = chkStrndup(item, itemlen);


are you storing a byte array or a null terminated string? I think you never reach this as a string? In any case, I feel like a separate chkStrndup() function might be unnecessary as it is used once. If you think otherwise, maybe chkStrndup() should be in zmalloc as zstrndup().

tezc · 2026-01-10T13:18:47Z

src/chk.h

+
+/* Update weighted item. If another one was expelled from the topK list -
+ * return its name. Caller is responsible for releasing it */
+char *chkTopKUpdate(chkTopK *topk, char *item, int itemlen, counter_t weight);


I think I commented above, I feel like we are mixing byte array and null terminated strings. This function accepts new item as a byte array but returns previous item as a null terminated string. Maybe it is better not to mix these two. Maybe function should accept int *outlen to return item len.

tezc · 2026-01-10T13:23:07Z

src/server.c

+                dict *seen_slots = dictCreate(&hashDictType);
+                for (int i = 0; i < slots_count; i++) {
+                    long slot_val;
+                    if (getRangeLongFromObjectOrReply(c, c->argv[j+2+i], 0,


you can use getSlotOrReply() here

tezc · 2026-01-10T13:37:06Z

@minchopaskal I had a quick pass over the PR and left a few minor comments. I assume the API is not final yet so we don't have more details about the commands in the top comment. e.g. what is SLOTS or SAMPLE. I assume we also don't have more tests because of this.

Regarding cuckoo impl, maybe it is a good idea to add some comments over each function (only to important ones) to describe what it is doing. So, people who have no idea about the algorithm will be a chance to understand what is going on.

Add hotkeys detection

c8f7b20

Add new TopK structure based on CuckooHeavyKeeper algorithm Add new HOTKEYS (START|STOP|GET) command

minchopaskal requested review from sggeorgiev, skaslev and sundb January 9, 2026 09:23

sggeorgiev reviewed Jan 9, 2026

View reviewed changes

LiorKogan reviewed Jan 9, 2026

View reviewed changes

src/chk.h

#include "stdint.h"

#define CHK_LUT_SIZE 256

Copy link

Member

LiorKogan Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add documentation for these constants?

LiorKogan reviewed Jan 9, 2026

View reviewed changes

src/chk.c

return res;

}

double getDecayProb(chkTopK *topk, counter_t cnt) {

Copy link

Member

LiorKogan Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe mark the following 3 functions as static inline?

shahsb suggested changes Jan 10, 2026

View reviewed changes

tezc reviewed Jan 10, 2026

View reviewed changes

	memcpy(&top, &array[start], sizeof(chkHeapBucket));
	top = array[start];

Add hotkeys detection #14680

Are you sure you want to change the base?

Add hotkeys detection #14680

Conversation

minchopaskal commented Jan 9, 2026

Description

Usage

Current API

HOTKEYS GET

Other ideas

Implementation

Performance implications

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

LiorKogan Jan 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

shahsb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

LiorKogan Jan 9, 2026 •

edited

Loading