How I Started the Journey

A while ago, I was browsing GitHub for issues I could help with. While glancing through the issues in Valkey, I spotted one titled “[NEW] Accelerate CRC16 with SIMD” 1. Since I had already contributed a CRC32 optimization to sse2neon, and given the fame of Valkey (yes, I always like fame so that I can use it to earn money!), I decided to take on this issue.

The Kind Community of Valkey

Thanks to the kind community and maintainers, I was assigned this issue. Furthermore, to let the maintainers track my progress, I listed the following candidate solutions:

  1. half-byte lookup table
  2. Barrett reduction
  3. multi-byte lookup table

OK, What Is Valkey's CRC16 Hash Used For?

Valkey uses CRC16 (formally, CRC16/CCITT 2) to hash keys into slots (modulo 16384) when clustering mode is enabled 3.

What’s more, according to the maintainers, the length of a key is at most 20 bytes, and the maintainers are quite concerned about L1 cache usage (i.e., avoiding eviction of data already in the L1 cache) 4.
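To make the slot mapping concrete, here is a minimal sketch: a bit-at-a-time reference for the same CRC16 variant (init 0, polynomial 0x1021), plus the modulo. Since 16384 is a power of two, the modulo reduces to masking the low 14 bits. The helper names are mine, not Valkey's.

```c
#include <stdint.h>
#include <string.h>

/* Bit-at-a-time CRC16/CCITT (XModem variant: init 0, poly 0x1021),
 * the same CRC that Valkey's lookup table encodes. */
uint16_t crc16_bitwise(const char *buf, int len) {
    uint16_t crc = 0;
    for (int i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint8_t)buf[i] << 8);
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Key -> slot: modulo 16384 is just a mask of the low 14 bits. */
unsigned key_to_slot(const char *key) {
    return crc16_bitwise(key, (int)strlen(key)) & 16383;
}
```

(The real cluster implementation also honors {hash tag} substrings before hashing; that part is omitted here.)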

Get My Hands Dirty

Let’s look at the original implementation (with some modifications):

/* We are going to optimize this function. */
static inline uint16_t crc16_base(uint16_t crc, uint8_t tmp) {
    crc = (crc << 8) ^ crc16tab[((crc >> 8) ^ tmp) & 0x00FF];
    return crc;
}

uint16_t crc16(const char *buf, int len) {
    int counter;
    uint16_t crc = 0;
    for (counter = 0; counter < len; counter++) {
        uint8_t tmp = (uint8_t)buf[counter];
        crc = crc16_base(crc, tmp);
    }
    return crc;
}

To prove that any of my solutions is better, we need benchmarks. There are two benchmark scenarios: single cluster and multi-cluster (3 primary plus 3 replica nodes).

I list my benchmark environment as follows:

  • CPU: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
  • RAM: 16 GB
  • OS: Linux 6.8.0-101-generic

benchmark procedures

single cluster 5

# in the root directory of Valkey

$ cd src
$ rm dump.rdb nodes.conf # clean up any old data files (if they exist)
$ ./valkey-server --cluster-enabled yes --save '' &
(...)
2622423:M 07 Oct 2025 17:48:58.872 * Server initialized
2622423:M 07 Oct 2025 17:48:58.873 * Ready to accept connections tcp

$ ./valkey-cli cluster addslotsrange 0 16383
OK
$ ./valkey-cli cluster info | head -3
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384

multi-cluster

# in the root directory of Valkey

$ ./utils/create-cluster/create-cluster start
$ ./utils/create-cluster/create-cluster create
$ ./src/valkey-server --cluster -p <port of one of primary nodes>
$ ./src/valkey-benchmark --cluster -p <port of one of primary nodes> --threads 3 -P 10 -n 10000000 -r 1000000 -t get -q
$ ./utils/create-cluster/create-cluster stop
$ ./utils/create-cluster/create-cluster clean

benchmark metrics

  1. RPS (Requests Per Second). Higher is better.
  2. Latency (measured in milliseconds). Lower is better.

half-byte lookup table

View the PR here: https://github.com/valkey-io/valkey/pull/2300

As explained in my CRC32-C contribution to sse2neon, this method shrinks the lookup table, which may be beneficial because it reduces pressure on the L1 cache:

const uint16_t crc16_tbl[16] = {
    0x0000, 0x1021, 0x2042, 0x3063, 0x4084, 0x50a5, 0x60c6, 0x70e7,
    0x8108, 0x9129, 0xa14a, 0xb16b, 0xc18c, 0xd1ad, 0xe1ce, 0xf1ef,
};

static inline uint16_t crc16_base(uint16_t crc, uint8_t v) {
    crc ^= v << 8;
    crc = (crc << 4) ^ crc16_tbl[crc >> (16 - 4)];
    crc = (crc << 4) ^ crc16_tbl[crc >> (16 - 4)];
    return crc;
}
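The 16-entry table above is not magic: entry v is what you get by placing nibble v in the top 4 bits and running four shift-and-conditionally-xor steps of the bitwise CRC (polynomial 0x1021). A sketch of a generator (the function name is mine):

```c
#include <stdint.h>

/* Generate the half-byte LUT entry for nibble v: start with v in the
 * top 4 bits and apply 4 steps of the bitwise CRC16 (poly 0x1021). */
uint16_t crc16_nibble_entry(uint8_t v) {
    uint16_t crc = (uint16_t)(v << 12);
    for (int b = 0; b < 4; b++)
        crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                             : (uint16_t)(crc << 1);
    return crc;
}
```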

Nevertheless, after benchmarking, I got much worse results (about a 10% regression) on RPS and latency, because this method needs an extra lookup step per input byte (two nibble lookups instead of one byte lookup) to compute the full CRC contribution of that byte.

Barrett reduction

View the PR here: https://github.com/valkey-io/valkey/pull/2691

As mentioned in my sse2neon contribution, this technique requires no lookup table, and carry-less multiplication has a moderate cycle count (e.g., on Ivy Bridge, my test environment, it takes 8 cycles).

What’s more, the Valkey community was excited about this approach because it utilizes SIMD instructions, which might unlock further performance.

So let’s look at my implementation:

/* Requires <immintrin.h> (SSE2 + PCLMUL). */
static inline uint16_t crc16_base(uint16_t crc, uint8_t v) {
    crc ^= v << 8;

    uint64_t p = 0x11021;          /* CRC16/CCITT polynomial */
    uint64_t mu = 0x111303471a041; /* x^64 div p over GF(2) */

    /* Pack 'p' and 'mu' into the same xmm register to save register usage. */
    __m128i mul = _mm_set_epi64x(p, mu);

    /* Shift left on purpose so that the intermediate result after
     * multiplying with 'mu' lands in the upper half of the xmm register. */
    __m128i orig = _mm_set_epi64x(0x0, (uint64_t)crc << 8);
    __m128i tmp = orig;

    /* Multiply with 'mu'. Recall that the intermediate result (the
     * quotient) appears in the upper half of the xmm register. */
    tmp = _mm_clmulepi64_si128(tmp, mul, 0x00);

    /* Multiply the quotient with 'p'. */
    tmp = _mm_clmulepi64_si128(tmp, mul, 0x11);

    /* Subtract (xor) with 'orig'. */
    tmp = _mm_xor_si128(tmp, orig);

    /* The resulting CRC is in bits [0..15]. */
    return (uint16_t)_mm_extract_epi16(tmp, 0x0);
}

However, I still got inferior benchmark results. The reason is that this technique usually pays off when the input message is at least 128 bytes long, whereas the input in Valkey is at most 20 bytes. As a result, it is not beneficial, and is in fact worse than the original method.
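As an aside, the Barrett step can be understood without SSE at all. The sketch below emulates the carry-less multiply with a schoolbook loop and applies the same reduction: for a 24-bit polynomial orig, the quotient q = floor(orig·mu / x^64) satisfies orig mod p = orig ^ q·p. The helper names are mine, and this is for illustration only, not for speed.

```c
#include <stdint.h>

/* Schoolbook carry-less (GF(2)) multiply: 64x64 -> 128-bit (hi:lo). */
void clmul64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
    uint64_t h = 0, l = 0;
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            l ^= a << i;
            if (i) h ^= a >> (64 - i);
        }
    }
    *hi = h;
    *lo = l;
}

/* One-byte CRC16 update via Barrett reduction. */
uint16_t crc16_barrett(uint16_t crc, uint8_t v) {
    const uint64_t p = 0x11021;          /* x^16 + x^12 + x^5 + 1 */
    const uint64_t mu = 0x111303471a041; /* x^64 div p over GF(2) */
    uint64_t orig = (uint64_t)(crc ^ (uint16_t)(v << 8)) << 8;
    uint64_t q_hi, q_lo, t_hi, t_lo;
    clmul64(orig, mu, &q_hi, &q_lo); /* high half = quotient q */
    clmul64(q_hi, p, &t_hi, &t_lo);  /* q * p */
    return (uint16_t)(t_lo ^ orig);  /* remainder is in bits [0..15] */
}
```

Chaining this per-byte update over a message reproduces the table-based CRC exactly; only the throughput differs.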

multi-byte lookup table

View the implementation here: https://github.com/valkey-io/valkey/pull/2790

Adopted from 6, processing more bytes at a time boosts performance, but it requires more memory. As a consequence, I needed to benchmark different LUT sizes to check whether they trigger L1 cache eviction.

After benchmarking, LUTs that consume more than 2 input bytes at a time showed no significant improvement, so I only present the 2-byte LUT implementation:

static const uint16_t crc16tab[]= {
    /* zeroth byte LUT */
    0x0000,0x1021,0x2042,0x3063,0x4084,0x50a5,0x60c6,0x70e7,
    0x8108,0x9129,0xa14a,0xb16b,0xc18c,0xd1ad,0xe1ce,0xf1ef,
    0x1231,0x0210,0x3273,0x2252,0x52b5,0x4294,0x72f7,0x62d6,
    0x9339,0x8318,0xb37b,0xa35a,0xd3bd,0xc39c,0xf3ff,0xe3de,
    0x2462,0x3443,0x0420,0x1401,0x64e6,0x74c7,0x44a4,0x5485,
    0xa56a,0xb54b,0x8528,0x9509,0xe5ee,0xf5cf,0xc5ac,0xd58d,
    0x3653,0x2672,0x1611,0x0630,0x76d7,0x66f6,0x5695,0x46b4,
    0xb75b,0xa77a,0x9719,0x8738,0xf7df,0xe7fe,0xd79d,0xc7bc,
    0x48c4,0x58e5,0x6886,0x78a7,0x0840,0x1861,0x2802,0x3823,
    0xc9cc,0xd9ed,0xe98e,0xf9af,0x8948,0x9969,0xa90a,0xb92b,
    0x5af5,0x4ad4,0x7ab7,0x6a96,0x1a71,0x0a50,0x3a33,0x2a12,
    0xdbfd,0xcbdc,0xfbbf,0xeb9e,0x9b79,0x8b58,0xbb3b,0xab1a,
    0x6ca6,0x7c87,0x4ce4,0x5cc5,0x2c22,0x3c03,0x0c60,0x1c41,
    0xedae,0xfd8f,0xcdec,0xddcd,0xad2a,0xbd0b,0x8d68,0x9d49,
    0x7e97,0x6eb6,0x5ed5,0x4ef4,0x3e13,0x2e32,0x1e51,0x0e70,
    0xff9f,0xefbe,0xdfdd,0xcffc,0xbf1b,0xaf3a,0x9f59,0x8f78,
    0x9188,0x81a9,0xb1ca,0xa1eb,0xd10c,0xc12d,0xf14e,0xe16f,
    0x1080,0x00a1,0x30c2,0x20e3,0x5004,0x4025,0x7046,0x6067,
    0x83b9,0x9398,0xa3fb,0xb3da,0xc33d,0xd31c,0xe37f,0xf35e,
    0x02b1,0x1290,0x22f3,0x32d2,0x4235,0x5214,0x6277,0x7256,
    0xb5ea,0xa5cb,0x95a8,0x8589,0xf56e,0xe54f,0xd52c,0xc50d,
    0x34e2,0x24c3,0x14a0,0x0481,0x7466,0x6447,0x5424,0x4405,
    0xa7db,0xb7fa,0x8799,0x97b8,0xe75f,0xf77e,0xc71d,0xd73c,
    0x26d3,0x36f2,0x0691,0x16b0,0x6657,0x7676,0x4615,0x5634,
    0xd94c,0xc96d,0xf90e,0xe92f,0x99c8,0x89e9,0xb98a,0xa9ab,
    0x5844,0x4865,0x7806,0x6827,0x18c0,0x08e1,0x3882,0x28a3,
    0xcb7d,0xdb5c,0xeb3f,0xfb1e,0x8bf9,0x9bd8,0xabbb,0xbb9a,
    0x4a75,0x5a54,0x6a37,0x7a16,0x0af1,0x1ad0,0x2ab3,0x3a92,
    0xfd2e,0xed0f,0xdd6c,0xcd4d,0xbdaa,0xad8b,0x9de8,0x8dc9,
    0x7c26,0x6c07,0x5c64,0x4c45,0x3ca2,0x2c83,0x1ce0,0x0cc1,
    0xef1f,0xff3e,0xcf5d,0xdf7c,0xaf9b,0xbfba,0x8fd9,0x9ff8,
    0x6e17,0x7e36,0x4e55,0x5e74,0x2e93,0x3eb2,0x0ed1,0x1ef0,

    /* first byte LUT */
    0x0000,0x3331,0x6662,0x5553,0xccc4,0xfff5,0xaaa6,0x9997,
    0x89a9,0xba98,0xefcb,0xdcfa,0x456d,0x765c,0x230f,0x103e,
    0x0373,0x3042,0x6511,0x5620,0xcfb7,0xfc86,0xa9d5,0x9ae4,
    0x8ada,0xb9eb,0xecb8,0xdf89,0x461e,0x752f,0x207c,0x134d,
    0x06e6,0x35d7,0x6084,0x53b5,0xca22,0xf913,0xac40,0x9f71,
    0x8f4f,0xbc7e,0xe92d,0xda1c,0x438b,0x70ba,0x25e9,0x16d8,
    0x0595,0x36a4,0x63f7,0x50c6,0xc951,0xfa60,0xaf33,0x9c02,
    0x8c3c,0xbf0d,0xea5e,0xd96f,0x40f8,0x73c9,0x269a,0x15ab,
    0x0dcc,0x3efd,0x6bae,0x589f,0xc108,0xf239,0xa76a,0x945b,
    0x8465,0xb754,0xe207,0xd136,0x48a1,0x7b90,0x2ec3,0x1df2,
    0x0ebf,0x3d8e,0x68dd,0x5bec,0xc27b,0xf14a,0xa419,0x9728,
    0x8716,0xb427,0xe174,0xd245,0x4bd2,0x78e3,0x2db0,0x1e81,
    0x0b2a,0x381b,0x6d48,0x5e79,0xc7ee,0xf4df,0xa18c,0x92bd,
    0x8283,0xb1b2,0xe4e1,0xd7d0,0x4e47,0x7d76,0x2825,0x1b14,
    0x0859,0x3b68,0x6e3b,0x5d0a,0xc49d,0xf7ac,0xa2ff,0x91ce,
    0x81f0,0xb2c1,0xe792,0xd4a3,0x4d34,0x7e05,0x2b56,0x1867,
    0x1b98,0x28a9,0x7dfa,0x4ecb,0xd75c,0xe46d,0xb13e,0x820f,
    0x9231,0xa100,0xf453,0xc762,0x5ef5,0x6dc4,0x3897,0x0ba6,
    0x18eb,0x2bda,0x7e89,0x4db8,0xd42f,0xe71e,0xb24d,0x817c,
    0x9142,0xa273,0xf720,0xc411,0x5d86,0x6eb7,0x3be4,0x08d5,
    0x1d7e,0x2e4f,0x7b1c,0x482d,0xd1ba,0xe28b,0xb7d8,0x84e9,
    0x94d7,0xa7e6,0xf2b5,0xc184,0x5813,0x6b22,0x3e71,0x0d40,
    0x1e0d,0x2d3c,0x786f,0x4b5e,0xd2c9,0xe1f8,0xb4ab,0x879a,
    0x97a4,0xa495,0xf1c6,0xc2f7,0x5b60,0x6851,0x3d02,0x0e33,
    0x1654,0x2565,0x7036,0x4307,0xda90,0xe9a1,0xbcf2,0x8fc3,
    0x9ffd,0xaccc,0xf99f,0xcaae,0x5339,0x6008,0x355b,0x066a,
    0x1527,0x2616,0x7345,0x4074,0xd9e3,0xead2,0xbf81,0x8cb0,
    0x9c8e,0xafbf,0xfaec,0xc9dd,0x504a,0x637b,0x3628,0x0519,
    0x10b2,0x2383,0x76d0,0x45e1,0xdc76,0xef47,0xba14,0x8925,
    0x991b,0xaa2a,0xff79,0xcc48,0x55df,0x66ee,0x33bd,0x008c,
    0x13c1,0x20f0,0x75a3,0x4692,0xdf05,0xec34,0xb967,0x8a56,
    0x9a68,0xa959,0xfc0a,0xcf3b,0x56ac,0x659d,0x30ce,0x03ff,
};

/* crc16 main function */
uint16_t crc16(const char *buf, int len) {
    int counter = 0;
    uint16_t crc = 0;

    for(; counter + 1 < len; counter += 2) {
        /* explicitly get two bytes */
        uint16_t a = (uint8_t)buf[counter];
        uint16_t b = (uint8_t)buf[counter + 1];
        uint16_t tmp = ((a << 8) | b);

        crc ^= tmp;
        crc = crc16tab[1 * 256 + (uint8_t)(crc >> 8)] ^ crc16tab[0 * 256 + (uint8_t)(crc >> 0)];
    }

    /* deal with leftover */
    for(; counter < len; counter++)
        crc = (crc << 8) ^ crc16tab[((crc >> 8) ^ (uint8_t)buf[counter]) & 0x00FF];
    return crc;
}
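The two 256-entry sub-tables above can be generated rather than typed by hand: entry i of sub-table k is simply the CRC16 of byte i followed by k zero bytes (so the "first byte LUT" accounts for the extra 8-bit shift its byte undergoes). A sketch of a generator (the function name is mine):

```c
#include <stdint.h>

/* Entry i of sub-table k (k = 0 or 1) is the CRC16 of byte i followed
 * by k zero bytes, computed bit by bit with poly 0x1021. */
uint16_t crc16_lut_entry(int k, uint8_t i) {
    uint16_t crc = 0;
    for (int byte = 0; byte <= k; byte++) {
        crc ^= (uint16_t)((byte == 0 ? i : 0) << 8);
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}
```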

Now let’s look at the benchmark results:

single cluster

RPS

  Base              This PR      Performance gain
  362040.906666667  351379.29    -2.9448652%

latency

  Base              This PR      Performance gain
  1.161666667       1.124333333  3.2137734%

multi-cluster

RPS

  Base              This PR      Performance gain
  479107.686666667  484595.01    1.1453215%

latency

  Base              This PR      Performance gain
  0.737666667       0.719        2.5305016%

Though the results look decent (recall that this change would eventually run worldwide), my PR was still rejected.

The Multi-byte Lookup Table Seems Good, So Why Wasn't It Accepted?

According to the reviewers' report 7, the multi-byte lookup table does not perform well in their testing environment; it has almost the same performance as the original implementation.

Unable to come up with any further ideas, I decided to end this optimization journey.

What I Get When the Journey Ends

  • Tuning performance is just like quality assurance: you implement a set of methods to prove that the original method performs worse than your implementations. But sometimes it turns out that the original method is the best.
  • Ask out loud. From my prior experience in big tech and in large-scale open source projects, you usually have no idea how to run and test the project at first. Therefore, always ask (politely) whenever you run into trouble testing the project. Not only does the community help you, but you also help the community refresh its memory and improve the project (e.g., the Valkey maintainers later added a clustering-mode benchmark pipeline 8).
  • Design a test scenario. At first, I used perf for profiling 9 and claimed my solution performed well on the cache-miss metric. However, the reviewers suggested designing a test scenario that reflects actual use cases, and I agree with this opinion, because actual use cases are the only way to truly serve users.
  • Get more familiar with profiling tools (especially FlameGraph). Each profiling tool has its own usage, but they always help you uncover the truth about your code's performance. Moreover, these tools help you find further opportunities for optimization.

Special Thanks

I would like to express my deep gratitude to Viktor Söderqvist 10, who not only gave me a lot of practical advice but also provided great mentoring on the Valkey codebase, especially the test scripts. Moreover, Viktor taught me how to benchmark in realistic scenarios and how to use FlameGraph.

Without Viktor's help, I would not have gone so far, not to mention understood how Valkey's clustering mode works.

References

  1. https://github.com/valkey-io/valkey/issues/2031 

  2. https://srecord.sourceforge.net/crc16-ccitt.html 

  3. https://valkey.io/topics/cluster-tutorial/ 

  4. https://github.com/valkey-io/valkey/pull/2691#issuecomment-3402032831 

  5. https://github.com/valkey-io/valkey/pull/2691#issuecomment-3377506364 

  6. https://github.com/komrad36/CRC#option-7-2-byte-tabular 

  7. https://github.com/valkey-io/valkey/pull/2790#issuecomment-4012948361 

  8. https://github.com/valkey-io/valkey/pull/3338 

  9. https://github.com/valkey-io/valkey/pull/2300#issuecomment-3093207349 

  10. https://github.com/zuiderkwast