
Commit 3a83081

Speed up Stim Sampling with Faster Ref Sample (#1036)
This PR speeds up `stim sample` by switching the reference sample calculation from the `TableauSimulator` to the `ReferenceSampleTree`. Calculating the reference sample takes a large portion of the time for larger codes.

Performance for larger codes (distance 25 at 1M rounds) was tested by building stim with `bazel build :stim`, then running the following CLI command:

`time bazel-bin/stim --gen surface_code --task rotated_memory_x --distance 25 --rounds 1000000 --after_clifford_depolarization 0.001 | bazel-bin/stim sample --shots 10 --out_format=r8 > ./debug.r8`

Metrics given are from my machine (linux), so they should only be compared relative to each other. The time taken to generate the circuit is considered trivial (< 0.1s). Before this change, this sample took ~7m 23s. With this change, this sample took ~2m 12s, a ~3.4x speedup (about as fast as not calculating a reference sample at all).

I also looked into `FrameSimulator`'s logic for more speedup opportunities. The only real opportunity seen is multi-threading with worker threads. In particular, any of the overloads for `simd_bits_range_ref::for_each_word()` could likely benefit from being run in parallel across multiple worker threads. Async file IO (either using native `<aio.h>`/`OVERLAPPED`/etc, or hand-rolling queued writes where `putc()` is called from another thread) could also possibly help to bring down total sample duration. However, any multi-threaded work can be handled/discussed in another PR.

Changes:
* Added an overload for `ReferenceSampleTree::decompress_into()` that works with `simd_bits`.
  * It uses the `vector` overload (instead of using `operator[]` on the tree directly in the loop), as it is roughly the same speed when built normally but much faster in debug (from what I saw).
* Updated `stim::command_sample()` to use `ReferenceSampleTree` instead of `TableauSimulator` for calculating the reference sample.
  * The output sample is still fully expanded out into a flat `simd_bits` for use with the compare / file writing logic.
* Added a `--skip_loop_folding` CLI flag to disable `ReferenceSampleTree`, falling back to `TableauSimulator`.
* Updated `command_sample_help()` to document this new flag.
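The idea of storing a folded reference sample and then expanding it into flat bits can be sketched with a toy structure. `ToyFoldedSample` and its fields are assumptions made for illustration; this is not Stim's `ReferenceSampleTree`, which may differ in detail.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for a folded reference sample (hypothetical, for illustration
// only; not Stim's ReferenceSampleTree).
struct ToyFoldedSample {
    std::vector<bool> prefix_bits;          // bits emitted first in each repetition
    std::vector<ToyFoldedSample> children;  // nested blocks emitted after the prefix
    size_t repetitions = 1;                 // how many times prefix + children repeat

    // Appends the fully expanded bits to `output`.
    void decompress_into(std::vector<bool> &output) const {
        for (size_t rep = 0; rep < repetitions; rep++) {
            output.insert(output.end(), prefix_bits.begin(), prefix_bits.end());
            for (const ToyFoldedSample &child : children) {
                child.decompress_into(output);
            }
        }
    }

    // Number of bits in the expansion, computed without expanding.
    size_t size() const {
        size_t per_rep = prefix_bits.size();
        for (const ToyFoldedSample &child : children) {
            per_rep += child.size();
        }
        return per_rep * repetitions;
    }
};
```

A loop body that repeats a million times is stored once with `repetitions = 1000000`, so the folded form stays small even though the expanded output is large.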
1 parent 626c473 commit 3a83081

4 files changed

Lines changed: 83 additions & 2 deletions


doc/usage_command_line.md

Lines changed: 25 additions & 0 deletions
@@ -1676,6 +1676,7 @@ SYNOPSIS
         [--out_format 01|b8|r8|ptb64|hits|dets] \
         [--seed int] \
         [--shots int] \
+        [--skip_loop_folding] \
         [--skip_reference_sample]
 
 DESCRIPTION
@@ -1762,6 +1763,30 @@ OPTIONS
         Must be an integer between 0 and a quintillion (10^18).
 
 
+    --skip_loop_folding
+        Skips loop folding logic on the reference sample calculation.
+
+        When this argument is specified, the reference sample (that is used
+        to convert measurement flip data from frame simulations into actual
+        measurement data) is generated by iterating through the entire
+        flattened circuit with no loop detection.
+
+        Loop folding can enormously improve performance for circuits
+        containing REPEAT blocks with large repeat counts, by detecting
+        periodicity in loops and fast-forwarding across them when computing
+        the reference sample for the circuit. However, in some cases the
+        analysis is not able to detect the periodicity that is present. For
+        example, this has been observed in honeycomb code circuits. When
+        this happens, the folding-capable analysis is slower than simply
+        analyzing the flattened circuit without any specialized loop logic.
+        The `--skip_loop_folding` flag can be used to just analyze the
+        flattened circuit, bypassing this slowdown for circuits such as
+        honeycomb code circuits.
+
+        By default, loop detection is enabled. Pass this flag to disable
+        it (when appropriate by use case).
+
+
     --skip_reference_sample
         Asserts the circuit can produce a noiseless sample that is just 0s.
 
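The help text above says the reference sample converts measurement flip data into actual measurement data. A minimal sketch of that conversion, assuming it is a bitwise XOR of the noiseless reference with each shot's flips; `apply_reference_sample` is a hypothetical helper, not a Stim function.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Stim): combines a noiseless reference
// sample with one shot's flip bits to recover that shot's measurements.
std::vector<bool> apply_reference_sample(
        const std::vector<bool> &reference, const std::vector<bool> &flips) {
    std::vector<bool> measurements(flips.size());
    for (size_t k = 0; k < flips.size(); k++) {
        // Assumption: a missing reference bit is treated as 0, which mirrors
        // the empty reference used when the reference sample is skipped.
        bool ref_bit = k < reference.size() ? reference[k] : false;
        // XOR: a flip inverts the noiseless measurement result.
        measurements[k] = ref_bit != flips[k];
    }
    return measurements;
}
```

With an all-zero (or empty) reference the flips pass through unchanged, which is why asserting that the noiseless sample is all 0s lets the reference calculation be skipped entirely.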

src/stim/cmd/command_sample.cc

Lines changed: 41 additions & 2 deletions
@@ -21,18 +21,20 @@
 #include "stim/simulators/tableau_simulator.h"
 #include "stim/util_bot/arg_parse.h"
 #include "stim/util_bot/probability_util.h"
+#include "stim/util_top/reference_sample_tree.h"
 
 using namespace stim;
 
 int stim::command_sample(int argc, const char **argv) {
     check_for_unknown_arguments(
-        {"--seed", "--skip_reference_sample", "--out_format", "--out", "--in", "--shots"},
+        {"--seed", "--skip_reference_sample", "--skip_loop_folding", "--out_format", "--out", "--in", "--shots"},
         {"--sample", "--frame0"},
         "sample",
         argc,
         argv);
     const auto &out_format = find_enum_argument("--out_format", "01", format_name_to_enum_map(), argc, argv);
     bool skip_reference_sample = find_bool_argument("--skip_reference_sample", argc, argv);
+    bool skip_loop_folding = find_bool_argument("--skip_loop_folding", argc, argv);
     uint64_t num_shots =
         find_argument("--shots", argc, argv) ? (uint64_t)find_int64_argument("--shots", 1, 0, INT64_MAX, argc, argv)
         : find_argument("--sample", argc, argv) ? (uint64_t)find_int64_argument("--sample", 1, 0, INT64_MAX, argc, argv)
@@ -56,7 +58,13 @@ int stim::command_sample(int argc, const char **argv) {
         auto circuit = Circuit::from_file(in);
         simd_bits<MAX_BITWORD_WIDTH> ref(0);
         if (!skip_reference_sample) {
-            ref = TableauSimulator<MAX_BITWORD_WIDTH>::reference_sample_circuit(circuit);
+            if (skip_loop_folding) {
+                ref = TableauSimulator<MAX_BITWORD_WIDTH>::reference_sample_circuit(circuit);
+            } else {
+                ReferenceSampleTree reference_sample_measurement_bits =
+                    ReferenceSampleTree::from_circuit_reference_sample(circuit.aliased_noiseless_circuit());
+                reference_sample_measurement_bits.decompress_into(ref);
+            }
         }
         sample_batch_measurements_writing_results_to_disk(circuit, ref, num_shots, out, out_format.id, rng);
     }
@@ -128,6 +136,37 @@ SubCommandHelp stim::command_sample_help() {
             )PARAGRAPH"),
         });
 
+    result.flags.push_back(
+        SubCommandHelpFlag{
+            "--skip_loop_folding",
+            "bool",
+            "false",
+            {"[none]", "[switch]"},
+            clean_doc_string(R"PARAGRAPH(
+                Skips loop folding logic on the reference sample calculation.
+
+                When this argument is specified, the reference sample (that is used
+                to convert measurement flip data from frame simulations into actual
+                measurement data) is generated by iterating through the entire
+                flattened circuit with no loop detection.
+
+                Loop folding can enormously improve performance for circuits
+                containing REPEAT blocks with large repeat counts, by detecting
+                periodicity in loops and fast-forwarding across them when computing
+                the reference sample for the circuit. However, in some cases the
+                analysis is not able to detect the periodicity that is present. For
+                example, this has been observed in honeycomb code circuits. When
+                this happens, the folding-capable analysis is slower than simply
+                analyzing the flattened circuit without any specialized loop logic.
+                The `--skip_loop_folding` flag can be used to just analyze the
+                flattened circuit, bypassing this slowdown for circuits such as
+                honeycomb code circuits.
+
+                By default, loop detection is enabled. Pass this flag to disable
+                it (when appropriate by use case).
+            )PARAGRAPH"),
+        });
+
     result.flags.push_back(
         SubCommandHelpFlag{
             "--out_format",

src/stim/util_top/reference_sample_tree.h

Lines changed: 4 additions & 0 deletions
@@ -37,6 +37,10 @@ struct ReferenceSampleTree {
     /// Writes the contents of the tree into the given output vector.
     void decompress_into(std::vector<bool> &output) const;
 
+    /// Writes the contents of the tree into the given output simd_bits.
+    template <size_t W>
+    void decompress_into(simd_bits<W> &output) const;
+
     /// Folds redundant children into the repetition count, if they repeat this many times.
     ///
     /// For example, if the tree's children are [A, B, C, A, B, C] and the tree has no

src/stim/util_top/reference_sample_tree.inl

Lines changed: 13 additions & 0 deletions
@@ -2,6 +2,19 @@
 
 namespace stim {
 
+template <size_t W>
+void ReferenceSampleTree::decompress_into(simd_bits<W> &output) const {
+    std::vector<bool> v;
+    this->decompress_into(v);
+
+    simd_bits<W> result(v.size());
+    for (size_t k = 0; k < v.size(); k++) {
+        result[k] ^= v[k];
+    }
+
+    output = std::move(result);
+}
+
 template <size_t W>
 ReferenceSampleTree CompressedReferenceSampleHelper<W>::do_loop_with_no_folding(const Circuit &loop, uint64_t reps) {
     ReferenceSampleTree result;
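The new overload first expands the tree into a `std::vector<bool>` and then copies each bit into a packed bit buffer. The copy step can be sketched with plain 64-bit words; `pack_bits` is a hypothetical helper and does not model `simd_bits` itself (for example, its padding to wider SIMD words).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not Stim's simd_bits): packs bits into 64-bit words,
// least significant bit first, leaving unused bits of the last word as zero.
std::vector<uint64_t> pack_bits(const std::vector<bool> &bits) {
    std::vector<uint64_t> words((bits.size() + 63) / 64, 0);
    for (size_t k = 0; k < bits.size(); k++) {
        if (bits[k]) {
            // Set bit (k mod 64) of word (k div 64).
            words[k / 64] |= uint64_t{1} << (k % 64);
        }
    }
    return words;
}
```

Because the destination starts zeroed, setting bits with `|=` here has the same effect as the `^=` in the diff above.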
