Skip to content

benchmark compactly::ans#131

Merged
mumbleskates merged 10 commits into
djkoloski:masterfrom
droundy:support-compactly
Apr 17, 2026
Merged

benchmark compactly::ans#131
mumbleskates merged 10 commits into
djkoloski:masterfrom
droundy:support-compactly

Conversation

@droundy
Copy link
Copy Markdown
Contributor

@droundy droundy commented Apr 13, 2026

This adds support for the compactly crate but I haven't been able to compile it due to #130 so I am doubtful that it is correct.

Technically this benchmarks compactly::Ans which is supposed to be faster at the cost of sometimes requiring one more byte of storage. If you wanted to support two versions of compactly we could also test the arithmetic coding version, but that seems excessive.

Copy link
Copy Markdown
Collaborator

@mumbleskates mumbleskates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't compile right now. derives for the required trait are missing on a number of structs.

there are also stray changes made to the capnp generated code, which should not be included and will cause our build checks to fail.

@droundy droundy requested a review from mumbleskates April 14, 2026 13:13
@droundy
Copy link
Copy Markdown
Contributor Author

droundy commented Apr 14, 2026

As I look more closely at the code... are the tests run on randomly generated data? That is sort of a pathological case for compactly, which is focused on size rather than time.

@mumbleskates
Copy link
Copy Markdown
Collaborator

are the tests run on randomly generated data?

i believe they all run against a fixed seed? or all against the same data, anyway.

there are test columns reflective of the resulting encoded size (as well as encoded data compressed with generalized compression algorithms), but if there is any kind of complicated compression going on then yeah it's going to be a very nuanced thing to measure. that said given that every other library has to represent the exact same data, if you're able to uniquely represent it in fewer bytes and eliminate that redundancy that's great. this is a good suite for general speed/size tradeoffs, not quite as much for showing the ability to compress a compressible dataset.

@mumbleskates
Copy link
Copy Markdown
Collaborator

currently the pr is failing cargo fmt. The two main things we want to be able to run for the build to pass are cargo fmt --check and cargo test --benches --no-default-features --features default-encoding-set

Copy link
Copy Markdown
Collaborator

@mumbleskates mumbleskates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs cargo fmt

@mumbleskates
Copy link
Copy Markdown
Collaborator

note that your ability to build & test this might also be affected by the same problem #132 is trying to address right now if you don't have a locally cached version of core2

@droundy
Copy link
Copy Markdown
Contributor Author

droundy commented Apr 15, 2026 via email

@droundy
Copy link
Copy Markdown
Contributor Author

droundy commented Apr 15, 2026

are the tests run on randomly generated data?

i believe they all run against a fixed seed? or all against the same data, anyway.

there are test columns reflective of the resulting encoded size (as well as encoded data compressed with generalized compression algorithms), but if there is any kind of complicated compression going on then yeah it's going to be a very nuanced thing to measure. that said given that every other library has to represent the exact same data, if you're able to uniquely represent it in fewer bytes and eliminate that redundancy that's great. this is a good suite for general speed/size tradeoffs, not quite as much for showing the ability to compress a compressible dataset.

Yeah, I see that the logs are not completely random, so presumably there is some compression to be had there. But the details of how easily it compresses is obviously a function of how the random data is constructed...

It'll be fun to see. And also depressing to see how slow compactly currently is! :)

@mumbleskates
Copy link
Copy Markdown
Collaborator

you should be able to see both a) how compressible compactly's encoded data is altogether and b) how much compression it captures in its own processes vs how much is captured after the fact by zlib/zstd. it may make the most sense for its usecase to compare compactly's encoding time to encoding+compression time for other formats

Copy link
Copy Markdown
Collaborator

@mumbleskates mumbleskates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this looks fine to me! 👍 i think some of the fields now annotated as "low cardinality" maybe aren't actually low cardinality in terms of their intent, but i'm also not an expert on what that actually does nor do i want to put down overly precious rules about the spirit of the benchmark or whatever. This library already has different aims than a lot of the others we benchmark so we'll see what it looks like.

@droundy
Copy link
Copy Markdown
Contributor Author

droundy commented Apr 16, 2026

you should be able to see both a) how compressible compactly's encoded data is altogether and b) how much compression it captures in its own processes vs how much is captured after the fact by zlib/zstd. it may make the most sense for its usecase to compare compactly's encoding time to encoding+compression time for other formats

Yes that sounds like the right way to look at it.

@droundy
Copy link
Copy Markdown
Contributor Author

droundy commented Apr 16, 2026

ok this looks fine to me! 👍 i think some of the fields now annotated as "low cardinality" maybe aren't actually low cardinality in terms of their intent, but i'm also not an expert on what that actually does nor do i want to put down overly precious rules about the spirit of the benchmark or whatever. This library already has different aims than a lot of the others we benchmark so we'll see what it looks like.

I'm curious which fields you're thinking maybe aren't low cardinality? Roughly speaking low cardinality in compactly just means low enough cardinality that it's worth holding copies of all values in memory on case we get repeats.

@mumbleskates
Copy link
Copy Markdown
Collaborator

if that's what it does then in this case it probably doesn't matter. i was just thinking that things like "user id" are often relatively high cardinality compared to enum-like strings.

@droundy
Copy link
Copy Markdown
Contributor Author

droundy commented Apr 17, 2026

if that's what it does then in this case it probably doesn't matter. i was just thinking that things like "user id" are often relatively high cardinality compared to enum-like strings.

Ah that makes sense.

@mumbleskates mumbleskates merged commit 2065c46 into djkoloski:master Apr 17, 2026
1 check passed
@mumbleskates
Copy link
Copy Markdown
Collaborator

well hey: it never encoded further-compressible data and handily won every size benchmark, so that looks like success to me :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants