Valkey DataSketches

valkey-datasketches is a Valkey module that brings the Apache DataSketches family of probabilistic data structures to Valkey.

It is designed for workloads where exact answers are expensive, but fast approximate answers are good enough: distinct counting, set analytics, quantiles, heavy hitters, point-frequency estimation, and stream sampling.

Most sketch implementations are backed by the official apache/datasketches-cpp library (v5.2.0). Reservoir sampling remains the in-tree implementation.

Why use it

Estimate distinct users, devices, or sessions without storing every ID.
Run set operations such as union, intersection, and difference on approximate sets.
Query p50 / p95 / p99 latencies from large streams.
Track heavy hitters and approximate per-item frequencies.
Keep uniform or weighted samples of streaming data.
Serialize sketches with GETRAW / SETRAW for persistence, replication, and transport.

Supported sketch families

All commands are namespaced under SKETCH.*.

Family	Command prefix	Typical use	Status	Backend
CPC	`SKETCH.CPC.*`	Compact distinct counting	Supported	`apache/datasketches-cpp`
HLL	`SKETCH.HLL.*`	Distinct counting with tunable precision	Supported	`apache/datasketches-cpp`
Theta	`SKETCH.THETA.*`	Set operations on approximate distinct sets	Supported	`apache/datasketches-cpp`
Tuple	`SKETCH.TUPLE.*`	Distinct keys with associated double summaries	Supported	`apache/datasketches-cpp`
KLL	`SKETCH.KLL.*`	General-purpose quantiles	Supported	`apache/datasketches-cpp`
REQ	`SKETCH.REQ.*`	Relative-error quantiles	Supported	`apache/datasketches-cpp`
t-digest	`SKETCH.TDIGEST.*`	Tail-friendly quantiles	Supported	`apache/datasketches-cpp`
ItemSketch	`SKETCH.ITEMSKETCH.*`	Frequent items for binary-safe string keys	Supported	`apache/datasketches-cpp`
LongSketch	`SKETCH.LONGSKETCH.*`	Frequent items for `int64` keys	Supported	`apache/datasketches-cpp`
Count-Min Sketch	`SKETCH.CMS.*`	Approximate point-frequency estimation	Supported	`apache/datasketches-cpp`
Reservoir sampling	`SKETCH.RESERVOIR.*`	Uniform sampling from a stream	Supported	in-tree implementation
VarOpt sampling	`SKETCH.VAROPT.*`	Weighted / variance-optimal sampling	Supported	`apache/datasketches-cpp`

The module-level SKETCH.INFO command currently reports these algorithms:

cpc, theta, kll, hll, tdigest, tuple, reservoir, req, varopt, longsketch, itemsketch, countmin

Build

The project uses CMake + Ninja and targets C++11.

Build the module

./build.sh

Build and run unit tests

./build.sh --unit

Build and run integration tests

SERVER_VERSION=8.1 ./build.sh --integration

Reuse an existing Valkey source tree

If you already have a Valkey checkout locally, set VALKEY_SOURCE_DIR to avoid downloading it again during configure:

VALKEY_SOURCE_DIR=/path/to/valkey ./build.sh

Show build options

./build.sh --help

Load the module

On Linux:

valkey-server --loadmodule ./build/src/libdatasketches.so

On macOS the output artifact is ./build/src/libdatasketches.dylib.

You can also load it at runtime:

MODULE LOAD /absolute/path/to/libdatasketches.so

Quick start

After loading the module, verify it is available:

127.0.0.1:6379> SKETCH.INFO
1) "module_name"
2) "datasketches"
3) "version"
4) (integer) ...
5) "algorithms"
6) 1) "cpc"
   2) "theta"
   ...

Distinct counting with HLL

127.0.0.1:6379> SKETCH.HLL.ADD visits user:1 user:2 user:3 user:2
(integer) 1
127.0.0.1:6379> SKETCH.HLL.EST visits
"3.00"

Set analytics with Theta

127.0.0.1:6379> SKETCH.THETA.ADD set:a id:1 id:2 id:3
(integer) 1
127.0.0.1:6379> SKETCH.THETA.ADD set:b id:3 id:4 id:5
(integer) 1
127.0.0.1:6379> SKETCH.THETA.INTER overlap 2 set:a set:b
"OK"
127.0.0.1:6379> SKETCH.THETA.EST overlap
"1.00"

Quantiles with KLL

127.0.0.1:6379> SKETCH.KLL.ADD latency 10 20 30 40 50
(integer) 1
127.0.0.1:6379> SKETCH.KLL.QUANTILE latency 0.5 0.9
1) "30"
2) "50"

SKETCH.REQ.* and SKETCH.TDIGEST.* expose similar workflows for quantiles.

Tuple sketch for keyed summaries

127.0.0.1:6379> SKETCH.TUPLE.ADD campaign:a user:1 1.5
(integer) 1
127.0.0.1:6379> SKETCH.TUPLE.ADD campaign:a user:2 2.0
(integer) 1
127.0.0.1:6379> SKETCH.TUPLE.ADD campaign:a user:1 3.0
(integer) 1
127.0.0.1:6379> SKETCH.TUPLE.EST campaign:a
"2"
127.0.0.1:6379> SKETCH.TUPLE.ESTSUM campaign:a
"6.5"

Frequent items with ItemSketch

127.0.0.1:6379> SKETCH.ITEMSKETCH.ADD hotkeys hot hot hot warm warm cold
(integer) 1
127.0.0.1:6379> SKETCH.ITEMSKETCH.TOPK hotkeys 2
1) "hot"
2) (integer) 3
3) (integer) 3
4) (integer) 3
5) "warm"
6) (integer) 2
7) (integer) 2
8) (integer) 2

Point-frequency estimation with Count-Min Sketch

127.0.0.1:6379> SKETCH.CMS.ADD freqs apple apple banana
(integer) 3
127.0.0.1:6379> SKETCH.CMS.EST freqs apple banana missing
1) (integer) 2
2) (integer) 1
3) (integer) 0

Uniform and weighted sampling

Reservoir sampling keeps a uniform sample:

127.0.0.1:6379> SKETCH.RESERVOIR.CREATE sample:uniform 3
"OK"
127.0.0.1:6379> SKETCH.RESERVOIR.ADD sample:uniform a b c d e
(integer) 1
127.0.0.1:6379> SKETCH.RESERVOIR.SAMPLE sample:uniform 10
1) ...
2) ...
3) ...

VarOpt sampling keeps a weighted sample:

127.0.0.1:6379> SKETCH.VAROPT.CREATE sample:weighted 2
"OK"
127.0.0.1:6379> SKETCH.VAROPT.ADD sample:weighted item:1 1.0
(integer) 1
127.0.0.1:6379> SKETCH.VAROPT.ADD sample:weighted item:2 2.0
(integer) 1
127.0.0.1:6379> SKETCH.VAROPT.ADD sample:weighted item:3 3.0
(integer) 1
127.0.0.1:6379> SKETCH.VAROPT.SAMPLE sample:weighted 10
1) ...
2) ...

Common command patterns

Most sketch families follow the same lifecycle:

CREATE a sketch explicitly, or let ADD create it lazily.
ADD items or values.
Query the sketch with commands such as EST, QUANTILE, RANK, FREQ, TOPK, or SAMPLE.
Merge or combine sketches where supported.
Export/import binary snapshots with GETRAW and SETRAW.
Inspect metadata with INFO.

Examples:

Distinct counters: CREATE, ADD, EST, MERGE, GETRAW, SETRAW, INFO
Quantile sketches: CREATE, ADD, QUANTILE, RANK, MERGE, GETRAW, SETRAW, INFO
Sampling sketches: CREATE, ADD, SAMPLE, MERGE (VarOpt), GETRAW, SETRAW, INFO

Accuracy notes

These data structures are approximate by design.

Distinct counters (CPC, HLL, Theta) return estimates that converge as more data arrives.
Quantile sketches (KLL, REQ, TDigest) trade exact ordering for bounded memory.
Frequency sketches (ItemSketch, LongSketch, CMS) are intended for heavy hitters or approximate counts, not exact transactional accounting.
Sampling sketches (Reservoir, VarOpt) preserve representative subsets, not the full stream.

If you need exact answers, these sketches are usually the wrong tool. If you need small memory, mergeability, and fast online queries, they are often a good fit.

Testing

Unit tests

./build.sh --unit

Integration tests

SERVER_VERSION=8.1 ./build.sh --integration

Integration tests use pytest and the valkey-test-framework under tst/integration/.

Project layout

src/
├── module.cc            # Valkey module entry point
├── commands/            # Command metadata JSON files
├── modules/             # Valkey command bindings and module data types
└── sketches/            # Sketch adapters / algorithm implementations

The split keeps Valkey-facing command code separate from sketch implementations, which makes it easier to evolve command surfaces and algorithm backends independently.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.config		.config
.github/workflows		.github/workflows
scripts		scripts
src		src
tst		tst
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CMakeLists.txt		CMakeLists.txt
CPPLINT.cfg		CPPLINT.cfg
LICENSE		LICENSE
README.md		README.md
build.sh		build.sh
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Valkey DataSketches

Why use it

Supported sketch families

Build

Build the module

Build and run unit tests

Build and run integration tests

Reuse an existing Valkey source tree

Show build options

Load the module

Quick start

Distinct counting with HLL

Set analytics with Theta

Quantiles with KLL

Tuple sketch for keyed summaries

Frequent items with ItemSketch

Point-frequency estimation with Count-Min Sketch

Uniform and weighted sampling

Common command patterns

Accuracy notes

Testing

Unit tests

Integration tests

Project layout

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Valkey DataSketches

Why use it

Supported sketch families

Build

Build the module

Build and run unit tests

Build and run integration tests

Reuse an existing Valkey source tree

Show build options

Load the module

Quick start

Distinct counting with HLL

Set analytics with Theta

Quantiles with KLL

Tuple sketch for keyed summaries

Frequent items with ItemSketch

Point-frequency estimation with Count-Min Sketch

Uniform and weighted sampling

Common command patterns

Accuracy notes

Testing

Unit tests

Integration tests

Project layout

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages