Skip to content

goodform/valkey-datasketches

Repository files navigation

Valkey DataSketches

valkey-datasketches is a Valkey module that brings the Apache DataSketches family of probabilistic data structures to Valkey.

It is designed for workloads where exact answers are expensive, but fast approximate answers are good enough: distinct counting, set analytics, quantiles, heavy hitters, point-frequency estimation, and stream sampling.

Most sketch implementations are backed by the official apache/datasketches-cpp library (v5.2.0). Reservoir sampling remains the in-tree implementation.

Why use it

  • Estimate distinct users, devices, or sessions without storing every ID.
  • Run set operations such as union, intersection, and difference on approximate sets.
  • Query p50 / p95 / p99 latencies from large streams.
  • Track heavy hitters and approximate per-item frequencies.
  • Keep uniform or weighted samples of streaming data.
  • Serialize sketches with GETRAW / SETRAW for persistence, replication, and transport.

Supported sketch families

All commands are namespaced under SKETCH.*.

Family Command prefix Typical use Status Backend
CPC SKETCH.CPC.* Compact distinct counting Supported apache/datasketches-cpp
HLL SKETCH.HLL.* Distinct counting with tunable precision Supported apache/datasketches-cpp
Theta SKETCH.THETA.* Set operations on approximate distinct sets Supported apache/datasketches-cpp
Tuple SKETCH.TUPLE.* Distinct keys with associated double summaries Supported apache/datasketches-cpp
KLL SKETCH.KLL.* General-purpose quantiles Supported apache/datasketches-cpp
REQ SKETCH.REQ.* Relative-error quantiles Supported apache/datasketches-cpp
t-digest SKETCH.TDIGEST.* Tail-friendly quantiles Supported apache/datasketches-cpp
ItemSketch SKETCH.ITEMSKETCH.* Frequent items for binary-safe string keys Supported apache/datasketches-cpp
LongSketch SKETCH.LONGSKETCH.* Frequent items for int64 keys Supported apache/datasketches-cpp
Count-Min Sketch SKETCH.CMS.* Approximate point-frequency estimation Supported apache/datasketches-cpp
Reservoir sampling SKETCH.RESERVOIR.* Uniform sampling from a stream Supported in-tree implementation
VarOpt sampling SKETCH.VAROPT.* Weighted / variance-optimal sampling Supported apache/datasketches-cpp

The module-level SKETCH.INFO command currently reports these algorithms:

cpc, theta, kll, hll, tdigest, tuple, reservoir, req, varopt, longsketch, itemsketch, countmin

Build

The project uses CMake + Ninja and targets C++11.

Build the module

./build.sh

Build and run unit tests

./build.sh --unit

Build and run integration tests

SERVER_VERSION=8.1 ./build.sh --integration

Reuse an existing Valkey source tree

If you already have a Valkey checkout locally, set VALKEY_SOURCE_DIR to avoid downloading it again during configure:

VALKEY_SOURCE_DIR=/path/to/valkey ./build.sh

Show build options

./build.sh --help

Load the module

On Linux:

valkey-server --loadmodule ./build/src/libdatasketches.so

On macOS the output artifact is ./build/src/libdatasketches.dylib.

You can also load it at runtime:

MODULE LOAD /absolute/path/to/libdatasketches.so

Quick start

After loading the module, verify it is available:

127.0.0.1:6379> SKETCH.INFO
1) "module_name"
2) "datasketches"
3) "version"
4) (integer) ...
5) "algorithms"
6) 1) "cpc"
   2) "theta"
   ...

Distinct counting with HLL

127.0.0.1:6379> SKETCH.HLL.ADD visits user:1 user:2 user:3 user:2
(integer) 1
127.0.0.1:6379> SKETCH.HLL.EST visits
"3.00"

Set analytics with Theta

127.0.0.1:6379> SKETCH.THETA.ADD set:a id:1 id:2 id:3
(integer) 1
127.0.0.1:6379> SKETCH.THETA.ADD set:b id:3 id:4 id:5
(integer) 1
127.0.0.1:6379> SKETCH.THETA.INTER overlap 2 set:a set:b
"OK"
127.0.0.1:6379> SKETCH.THETA.EST overlap
"1.00"

Quantiles with KLL

127.0.0.1:6379> SKETCH.KLL.ADD latency 10 20 30 40 50
(integer) 1
127.0.0.1:6379> SKETCH.KLL.QUANTILE latency 0.5 0.9
1) "30"
2) "50"

SKETCH.REQ.* and SKETCH.TDIGEST.* expose similar workflows for quantiles.

Tuple sketch for keyed summaries

127.0.0.1:6379> SKETCH.TUPLE.ADD campaign:a user:1 1.5
(integer) 1
127.0.0.1:6379> SKETCH.TUPLE.ADD campaign:a user:2 2.0
(integer) 1
127.0.0.1:6379> SKETCH.TUPLE.ADD campaign:a user:1 3.0
(integer) 1
127.0.0.1:6379> SKETCH.TUPLE.EST campaign:a
"2"
127.0.0.1:6379> SKETCH.TUPLE.ESTSUM campaign:a
"6.5"

Frequent items with ItemSketch

127.0.0.1:6379> SKETCH.ITEMSKETCH.ADD hotkeys hot hot hot warm warm cold
(integer) 1
127.0.0.1:6379> SKETCH.ITEMSKETCH.TOPK hotkeys 2
1) "hot"
2) (integer) 3
3) (integer) 3
4) (integer) 3
5) "warm"
6) (integer) 2
7) (integer) 2
8) (integer) 2

Point-frequency estimation with Count-Min Sketch

127.0.0.1:6379> SKETCH.CMS.ADD freqs apple apple banana
(integer) 3
127.0.0.1:6379> SKETCH.CMS.EST freqs apple banana missing
1) (integer) 2
2) (integer) 1
3) (integer) 0

Uniform and weighted sampling

Reservoir sampling keeps a uniform sample:

127.0.0.1:6379> SKETCH.RESERVOIR.CREATE sample:uniform 3
"OK"
127.0.0.1:6379> SKETCH.RESERVOIR.ADD sample:uniform a b c d e
(integer) 1
127.0.0.1:6379> SKETCH.RESERVOIR.SAMPLE sample:uniform 10
1) ...
2) ...
3) ...

VarOpt sampling keeps a weighted sample:

127.0.0.1:6379> SKETCH.VAROPT.CREATE sample:weighted 2
"OK"
127.0.0.1:6379> SKETCH.VAROPT.ADD sample:weighted item:1 1.0
(integer) 1
127.0.0.1:6379> SKETCH.VAROPT.ADD sample:weighted item:2 2.0
(integer) 1
127.0.0.1:6379> SKETCH.VAROPT.ADD sample:weighted item:3 3.0
(integer) 1
127.0.0.1:6379> SKETCH.VAROPT.SAMPLE sample:weighted 10
1) ...
2) ...

Common command patterns

Most sketch families follow the same lifecycle:

  1. CREATE a sketch explicitly, or let ADD create it lazily.
  2. ADD items or values.
  3. Query the sketch with commands such as EST, QUANTILE, RANK, FREQ, TOPK, or SAMPLE.
  4. Merge or combine sketches where supported.
  5. Export/import binary snapshots with GETRAW and SETRAW.
  6. Inspect metadata with INFO.

Examples:

  • Distinct counters: CREATE, ADD, EST, MERGE, GETRAW, SETRAW, INFO
  • Quantile sketches: CREATE, ADD, QUANTILE, RANK, MERGE, GETRAW, SETRAW, INFO
  • Sampling sketches: CREATE, ADD, SAMPLE, MERGE (VarOpt), GETRAW, SETRAW, INFO

Accuracy notes

These data structures are approximate by design.

  • Distinct counters (CPC, HLL, Theta) return estimates that converge as more data arrives.
  • Quantile sketches (KLL, REQ, TDigest) trade exact ordering for bounded memory.
  • Frequency sketches (ItemSketch, LongSketch, CMS) are intended for heavy hitters or approximate counts, not exact transactional accounting.
  • Sampling sketches (Reservoir, VarOpt) preserve representative subsets, not the full stream.

If you need exact answers, these sketches are usually the wrong tool. If you need small memory, mergeability, and fast online queries, they are often a good fit.

Testing

Unit tests

./build.sh --unit

Integration tests

SERVER_VERSION=8.1 ./build.sh --integration

Integration tests use pytest and the valkey-test-framework under tst/integration/.

Project layout

src/
├── module.cc            # Valkey module entry point
├── commands/            # Command metadata JSON files
├── modules/             # Valkey command bindings and module data types
└── sketches/            # Sketch adapters / algorithm implementations

The split keeps Valkey-facing command code separate from sketch implementations, which makes it easier to evolve command surfaces and algorithm backends independently.

About

a Valkey module that brings the Apache DataSketches family of probabilistic data structures to Valkey.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors