valkey-datasketches is a Valkey module that brings the Apache DataSketches family of probabilistic data structures to Valkey.
It is designed for workloads where exact answers are expensive, but fast approximate answers are good enough: distinct counting, set analytics, quantiles, heavy hitters, point-frequency estimation, and stream sampling.
Most sketch implementations are backed by the official apache/datasketches-cpp library (v5.2.0). Reservoir sampling remains the in-tree implementation.
- Estimate distinct users, devices, or sessions without storing every ID.
- Run set operations such as union, intersection, and difference on approximate sets.
- Query p50 / p95 / p99 latencies from large streams.
- Track heavy hitters and approximate per-item frequencies.
- Keep uniform or weighted samples of streaming data.
- Serialize sketches with
GETRAW/SETRAWfor persistence, replication, and transport.
All commands are namespaced under SKETCH.*.
| Family | Command prefix | Typical use | Status | Backend |
|---|---|---|---|---|
| CPC | SKETCH.CPC.* |
Compact distinct counting | Supported | apache/datasketches-cpp |
| HLL | SKETCH.HLL.* |
Distinct counting with tunable precision | Supported | apache/datasketches-cpp |
| Theta | SKETCH.THETA.* |
Set operations on approximate distinct sets | Supported | apache/datasketches-cpp |
| Tuple | SKETCH.TUPLE.* |
Distinct keys with associated double summaries | Supported | apache/datasketches-cpp |
| KLL | SKETCH.KLL.* |
General-purpose quantiles | Supported | apache/datasketches-cpp |
| REQ | SKETCH.REQ.* |
Relative-error quantiles | Supported | apache/datasketches-cpp |
| t-digest | SKETCH.TDIGEST.* |
Tail-friendly quantiles | Supported | apache/datasketches-cpp |
| ItemSketch | SKETCH.ITEMSKETCH.* |
Frequent items for binary-safe string keys | Supported | apache/datasketches-cpp |
| LongSketch | SKETCH.LONGSKETCH.* |
Frequent items for int64 keys |
Supported | apache/datasketches-cpp |
| Count-Min Sketch | SKETCH.CMS.* |
Approximate point-frequency estimation | Supported | apache/datasketches-cpp |
| Reservoir sampling | SKETCH.RESERVOIR.* |
Uniform sampling from a stream | Supported | in-tree implementation |
| VarOpt sampling | SKETCH.VAROPT.* |
Weighted / variance-optimal sampling | Supported | apache/datasketches-cpp |
The module-level SKETCH.INFO command currently reports these algorithms:
cpc, theta, kll, hll, tdigest, tuple, reservoir, req, varopt, longsketch, itemsketch, countmin
The project uses CMake + Ninja and targets C++11.
./build.sh./build.sh --unitSERVER_VERSION=8.1 ./build.sh --integrationIf you already have a Valkey checkout locally, set VALKEY_SOURCE_DIR to avoid downloading it again during configure:
VALKEY_SOURCE_DIR=/path/to/valkey ./build.sh./build.sh --helpOn Linux:
valkey-server --loadmodule ./build/src/libdatasketches.soOn macOS the output artifact is ./build/src/libdatasketches.dylib.
You can also load it at runtime:
MODULE LOAD /absolute/path/to/libdatasketches.so
After loading the module, verify it is available:
127.0.0.1:6379> SKETCH.INFO
1) "module_name"
2) "datasketches"
3) "version"
4) (integer) ...
5) "algorithms"
6) 1) "cpc"
2) "theta"
...
127.0.0.1:6379> SKETCH.HLL.ADD visits user:1 user:2 user:3 user:2
(integer) 1
127.0.0.1:6379> SKETCH.HLL.EST visits
"3.00"
127.0.0.1:6379> SKETCH.THETA.ADD set:a id:1 id:2 id:3
(integer) 1
127.0.0.1:6379> SKETCH.THETA.ADD set:b id:3 id:4 id:5
(integer) 1
127.0.0.1:6379> SKETCH.THETA.INTER overlap 2 set:a set:b
"OK"
127.0.0.1:6379> SKETCH.THETA.EST overlap
"1.00"
127.0.0.1:6379> SKETCH.KLL.ADD latency 10 20 30 40 50
(integer) 1
127.0.0.1:6379> SKETCH.KLL.QUANTILE latency 0.5 0.9
1) "30"
2) "50"
SKETCH.REQ.* and SKETCH.TDIGEST.* expose similar workflows for quantiles.
127.0.0.1:6379> SKETCH.TUPLE.ADD campaign:a user:1 1.5
(integer) 1
127.0.0.1:6379> SKETCH.TUPLE.ADD campaign:a user:2 2.0
(integer) 1
127.0.0.1:6379> SKETCH.TUPLE.ADD campaign:a user:1 3.0
(integer) 1
127.0.0.1:6379> SKETCH.TUPLE.EST campaign:a
"2"
127.0.0.1:6379> SKETCH.TUPLE.ESTSUM campaign:a
"6.5"
127.0.0.1:6379> SKETCH.ITEMSKETCH.ADD hotkeys hot hot hot warm warm cold
(integer) 1
127.0.0.1:6379> SKETCH.ITEMSKETCH.TOPK hotkeys 2
1) "hot"
2) (integer) 3
3) (integer) 3
4) (integer) 3
5) "warm"
6) (integer) 2
7) (integer) 2
8) (integer) 2
127.0.0.1:6379> SKETCH.CMS.ADD freqs apple apple banana
(integer) 3
127.0.0.1:6379> SKETCH.CMS.EST freqs apple banana missing
1) (integer) 2
2) (integer) 1
3) (integer) 0
Reservoir sampling keeps a uniform sample:
127.0.0.1:6379> SKETCH.RESERVOIR.CREATE sample:uniform 3
"OK"
127.0.0.1:6379> SKETCH.RESERVOIR.ADD sample:uniform a b c d e
(integer) 1
127.0.0.1:6379> SKETCH.RESERVOIR.SAMPLE sample:uniform 10
1) ...
2) ...
3) ...
VarOpt sampling keeps a weighted sample:
127.0.0.1:6379> SKETCH.VAROPT.CREATE sample:weighted 2
"OK"
127.0.0.1:6379> SKETCH.VAROPT.ADD sample:weighted item:1 1.0
(integer) 1
127.0.0.1:6379> SKETCH.VAROPT.ADD sample:weighted item:2 2.0
(integer) 1
127.0.0.1:6379> SKETCH.VAROPT.ADD sample:weighted item:3 3.0
(integer) 1
127.0.0.1:6379> SKETCH.VAROPT.SAMPLE sample:weighted 10
1) ...
2) ...
Most sketch families follow the same lifecycle:
CREATEa sketch explicitly, or letADDcreate it lazily.ADDitems or values.- Query the sketch with commands such as
EST,QUANTILE,RANK,FREQ,TOPK, orSAMPLE. - Merge or combine sketches where supported.
- Export/import binary snapshots with
GETRAWandSETRAW. - Inspect metadata with
INFO.
Examples:
- Distinct counters:
CREATE,ADD,EST,MERGE,GETRAW,SETRAW,INFO - Quantile sketches:
CREATE,ADD,QUANTILE,RANK,MERGE,GETRAW,SETRAW,INFO - Sampling sketches:
CREATE,ADD,SAMPLE,MERGE(VarOpt),GETRAW,SETRAW,INFO
These data structures are approximate by design.
- Distinct counters (
CPC,HLL,Theta) return estimates that converge as more data arrives. - Quantile sketches (
KLL,REQ,TDigest) trade exact ordering for bounded memory. - Frequency sketches (
ItemSketch,LongSketch,CMS) are intended for heavy hitters or approximate counts, not exact transactional accounting. - Sampling sketches (
Reservoir,VarOpt) preserve representative subsets, not the full stream.
If you need exact answers, these sketches are usually the wrong tool. If you need small memory, mergeability, and fast online queries, they are often a good fit.
./build.sh --unitSERVER_VERSION=8.1 ./build.sh --integrationIntegration tests use pytest and the valkey-test-framework under tst/integration/.
src/
├── module.cc # Valkey module entry point
├── commands/ # Command metadata JSON files
├── modules/ # Valkey command bindings and module data types
└── sketches/ # Sketch adapters / algorithm implementations
The split keeps Valkey-facing command code separate from sketch implementations, which makes it easier to evolve command surfaces and algorithm backends independently.