Repository for Stochastic Optimization Solvers Benchmark implementation of the Window Sticker framework.
The benchmarking approach is described in Benchmarking the Operation of Quantum Heuristics and Ising Machines: Scoring Parameter Setting Strategies on Optimization Applications, with the arXiv preprint also available.
This document describes the implementation and walks through an illustrative example on Wishart instances.
- Background
- Methodology Reliability
- References
- Installation
- Examples
- Documentation
- Testing
- Contributors
- Acknowledgements
- License
This code was created to produce a set of plots that characterize the performance of parameterized stochastic optimization solvers on a well-established family of optimization problems. The plots are built from experimental data obtained by running such solvers on seen instances of the problem family, and are then evaluated on an unseen subset of problems. More details of the methodology have been presented at the APS March Meeting and the INFORMS Annual Meeting. A manuscript explaining the methodology is in preparation.

The performance plot, which we call the Window Sticker, is a graphical representation of the expected performance of a solution method or parameter setting strategy on an unseen instance from the same problem family used to generate it. It aims to answer the question: "With X% confidence, will we find a solution with Y quality after using R resource?" The quality metric and the resource values can be arbitrary functions of the parameters and performance of the given solver, which makes the plot a flexible analysis tool.
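A minimal sketch of one way to read the Window Sticker question, assuming a higher-is-better quality metric that has already been evaluated on a set of unseen instances at each resource level. The function name `quality_at_confidence`, the dictionary layout, and the numbers are illustrative assumptions, not part of this package's API. The package builds its curves from bootstrapped and interpolated data, whereas this sketch uses a plain empirical quantile.

```python
import numpy as np


def quality_at_confidence(perf_by_resource, confidence):
    """Map each resource level to the quality met or exceeded by roughly
    a `confidence` share of the evaluated instances (higher is better)."""
    curve = {}
    for resource, perf in perf_by_resource.items():
        values = np.asarray(perf, dtype=float)
        # The (1 - confidence) quantile leaves about `confidence` of the
        # instances at or above the returned quality.
        curve[resource] = float(np.quantile(values, 1.0 - confidence))
    return curve


# Hypothetical performance ratios of five unseen instances at two resource levels.
perf_by_resource = {
    100: [0.60, 0.70, 0.65, 0.80, 0.75],
    1000: [0.85, 0.90, 0.88, 0.95, 0.92],
}
curve = quality_at_confidence(perf_by_resource, confidence=0.8)
for resource, quality in curve.items():
    print(f"R={resource}: with ~80% confidence, quality >= {quality:.3f}")
```

Reading the result as a curve of Y against R for a fixed X gives the basic shape of a Window Sticker for a single strategy.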
The current package implements the following functionality:
- Parse result files from parameterized stochastic solvers such as PySA and the D-Wave Ocean tools.
- Bootstrap and downsample the results to simulate the performance of such solvers when less data is available.
- Compute best-recommended parameters based on aggregated statistics and individual results for each parameter setting.
- Compute optimistic bound performance, known as virtual best performance, based on the provided experiments.
- Perform an exploration-exploitation parameter setting strategy, where the fraction of the allocated resources used in the exploration round is optimized. The exploration procedure is implemented either as a random search over the seen parameter settings or, when the generation dependencies are installed, as a Bayesian method known as the Tree-structured Parzen Estimator (TPE), implemented in the Hyperopt package.
- Plot the Window Sticker, comparing the performance curves corresponding to the virtual best, recommended parameters, and exploration-exploitation parameter setting strategies.
- Plot the values of the parameters and their best values with respect to the resource considered, a plot we call the Strategy plot. These plots can show the actual solver parameter values or the meta-parameters associated with parameter-setting strategies.
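Two of the ideas in the list above can be sketched in a few lines of NumPy: bootstrapped downsampling of solver reads, and the difference between the virtual best and a single recommended parameter setting. The function names, array layout, and data are hypothetical and do not reflect this package's API. The sketch assumes lower energy is better for the reads and higher is better for the performance table.

```python
import numpy as np


def bootstrap_best_of_k(energies, k, n_boot=1000, seed=0):
    """Bootstrap the best (lowest) energy found in k reads drawn from `energies`."""
    rng = np.random.default_rng(seed)
    energies = np.asarray(energies, dtype=float)
    # Resample k reads with replacement n_boot times; keep the minimum of each resample.
    samples = rng.choice(energies, size=(n_boot, k), replace=True)
    best = samples.min(axis=1)
    low, high = np.percentile(best, [2.5, 97.5])
    return float(best.mean()), (float(low), float(high))


def virtual_best_and_recommended(perf):
    """perf[i, j]: performance (higher is better) of setting j on instance i."""
    perf = np.asarray(perf, dtype=float)
    # Virtual best: the best setting picked per instance, in hindsight.
    virtual_best = perf.max(axis=1)
    # Recommended: one setting picked from statistics aggregated over instances.
    recommended = int(perf.mean(axis=0).argmax())
    return virtual_best, recommended, perf[:, recommended]


# Hypothetical reads from one solver run, downsampled to 2 and 5 reads.
energies = [-10.0, -9.5, -9.0, -8.0, -7.5, -7.0]
mean_k2, ci_k2 = bootstrap_best_of_k(energies, k=2)
mean_k5, ci_k5 = bootstrap_best_of_k(energies, k=5)
print(f"best of 2 reads: {mean_k2:.2f}, best of 5 reads: {mean_k5:.2f}")

# Hypothetical table: three instances (rows) by two parameter settings (columns).
perf = [[0.90, 0.70], [0.60, 0.80], [0.95, 0.50]]
virtual_best, recommended, recommended_perf = virtual_best_and_recommended(perf)
print("recommended setting:", recommended)
print("gap to virtual best per instance:", virtual_best - recommended_perf)
```

The virtual best can never be worse than the recommended setting on any instance, which is why it serves as an optimistic bound rather than as an attainable strategy.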
The Window Sticker workflow combines two different uncertainty questions that should be interpreted separately. The cross-instance Window Sticker uncertainty comes from train/test splits, interpolation, aggregation, and bootstrap resampling across a problem family. It answers how a parameter-setting strategy is expected to perform on unseen instances from that family. Per-instance repeat-count reliability asks whether each solver, parameter setting, and resource level has enough repeated stochastic runs to support its own success probability, repeat count, and time-to-solution estimates.
Repeat reliability follows Noori et al. 2026:
Noori, Moslem, Elisabetta Valiante, Ignacio Rozada, Thomas Van Vaerenbergh, and Masoud Mohseni. "Statistical analysis for per-instance evaluation of stochastic optimizers: Avoiding unreliable conclusions." Physical Review Applied 25, no. 3 (2026): 034081. The related arXiv preprint is titled "A Statistical Analysis for Per-Instance Evaluation of Stochastic Optimizers: How Many Repeats Are Enough?"
The analytic guarantees implemented here apply to Bernoulli success events and
metrics derived from their success probabilities: R_c, RTT/TTS, CETS, and
thresholded continuous metrics. For example, a continuous energy, Response, or
PerfRatio value can be analyzed with repeat reliability only after the workflow
defines a threshold that turns each run into success or failure.
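A minimal sketch of the thresholding step described above, assuming lower energy is better. The Wilson score interval and the standard repeats-to-confidence formula R_c = ln(1 - c) / ln(1 - p) are used here as generic, textbook choices; they are assumptions and may differ from the exact expressions in Noori et al. or in this package. The function names, threshold, energies, and runtime are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a Bernoulli success probability (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def repeats_to_confidence(p_success, confidence=0.99):
    """Repeats needed to observe at least one success with the given confidence."""
    if p_success <= 0.0:
        return math.inf
    if p_success >= 1.0:
        return 1
    return max(1, math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_success)))


# Hypothetical final energies from ten repeated runs of one parameter setting.
energies = [-10.0, -9.8, -10.0, -9.1, -10.0, -9.9, -8.7, -10.0, -9.5, -10.0]
threshold = -9.9  # hypothetical success threshold: a run succeeds if energy <= threshold
successes = sum(e <= threshold for e in energies)
trials = len(energies)

p_low, p_high = wilson_interval(successes, trials)
r_point = repeats_to_confidence(successes / trials)
r_pessimistic = repeats_to_confidence(p_low)   # uses the lower end of the interval
r_optimistic = repeats_to_confidence(p_high)   # uses the upper end of the interval

runtime_per_repeat = 0.5  # hypothetical seconds per run
print(f"p in [{p_low:.3f}, {p_high:.3f}]")
print(f"R_99 between {r_optimistic} and {r_pessimistic} (point estimate {r_point})")
print(f"TTS point estimate: {runtime_per_repeat * r_point:.1f} s")
```

With only ten repeats the interval on the success probability is wide, so the derived repeat count and time-to-solution inherit that spread. This is the kind of situation the repeat-count checks are meant to expose.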
The original Window Sticker bootstrap remains the uncertainty model for continuous Response curves and for continuous PerfRatio curves that are not converted to success thresholds. Those bootstrap intervals are useful for cross-instance benchmarking, but they are not a substitute for the analytic repeat-count checks above.
Repeat reliability addresses several methodology criticisms directly:
| Criticism | Documentation and package response |
|---|---|
| Repeat-count sufficiency | Report required_trials, additional_trials_required, reliable, and reliability_status so users can tell whether more stochastic runs are needed. |
| Bootstrap-only uncertainty | Keep bootstrap intervals for cross-instance continuous curves, and use analytic Bernoulli intervals for per-instance success probabilities and derived repeat-count metrics. |
| Noisy HPO choices | Surface repeat reliability before treating a parameter choice as stable, especially when success rates are small or intervals are wide. |
| CI-overlap ambiguity | Flag ci_overlaps_best and statistically_unresolved comparisons instead of implying that overlapping intervals identify a clear winner. |
| Virtual-best optimism | Document virtual best as an optimistic, unattainable reference and pair it with reliability checks when judging whether observed gaps are meaningful. |
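
The CI-overlap row can be illustrated with a short sketch. The dictionary keys reuse the `ci_overlaps_best` and `statistically_unresolved` names from the table, but the logic below is only an assumed, simplified interpretation (choose the best candidate by interval midpoint, then test interval overlap), not the package's implementation. The function name and the intervals are hypothetical.

```python
def flag_ci_overlap(intervals):
    """intervals: {name: (low, high)} for a higher-is-better metric,
    for example a success probability with its confidence interval."""
    # Pick the apparent best candidate by interval midpoint.
    best = max(intervals, key=lambda name: sum(intervals[name]) / 2)
    best_low, best_high = intervals[best]
    report = {}
    for name, (low, high) in intervals.items():
        # Two closed intervals overlap when each starts before the other ends.
        overlaps = name != best and low <= best_high and high >= best_low
        report[name] = {"ci_overlaps_best": overlaps}
    # The comparison is unresolved if any competitor overlaps the apparent best.
    statistically_unresolved = any(row["ci_overlaps_best"] for row in report.values())
    return best, statistically_unresolved, report


# Hypothetical success-probability intervals for three parameter settings.
intervals = {"A": (0.55, 0.75), "B": (0.50, 0.70), "C": (0.10, 0.30)}
best, unresolved, report = flag_ci_overlap(intervals)
print("apparent best:", best, "| statistically unresolved:", unresolved)
for name, row in report.items():
    print(name, row)
```

Here setting A looks best, but B overlaps it, so the comparison between A and B should be reported as unresolved rather than as a win for A.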
Use docs/passive_repeat_reliability_reports.md to add passive repeat-reliability reports to existing benchmark data. Use docs/noori_repeat_reliability_validation.md for the equation map and validation boundaries.
- Window Sticker methodology: published Quantum Machine Intelligence article and arXiv preprint.
- Repeat reliability: published Physical Review Applied article and arXiv preprint.
- QAOA benchmark case: published ACM Transactions on Quantum Computing article and arXiv preprint.
- Clone the Repository:

      git clone https://github.com/usra-riacs/stochastic-benchmark.git
      cd stochastic-benchmark

- Set up a Virtual Environment (Recommended):

      python3 -m venv venv
      source venv/bin/activate  # On Windows use `.\venv\Scripts\activate`

- Install Dependencies:

      pip install -r requirements.txt

- Optional: Install Example and Notebook Dependencies (needed for self-contained runnable example notebooks):

      pip install -r requirements-examples.txt

- Optional: Install Data-Generation Dependencies (needed for workflows that run Hyperopt-based data generation):

      pip install -r requirements-generation.txt

Alternatively, download the repository as a ZIP archive instead of cloning it, then follow the same setup steps:

- Download the Repository:
  - Navigate to the stochastic-benchmark GitHub page.
  - Click on the `Code` button.
  - Choose `Download ZIP`.
  - Once downloaded, extract the ZIP archive and navigate to the extracted folder in your terminal or command prompt.

- Set up a Virtual Environment (Recommended):

      python3 -m venv venv
      source venv/bin/activate  # On Windows use `.\venv\Scripts\activate`

- Install Dependencies:

      pip install -r requirements.txt

- Optional: Install Example and Notebook Dependencies (needed for self-contained runnable example notebooks):

      pip install -r requirements-examples.txt

- Optional: Install Data-Generation Dependencies (needed for workflows that run Hyperopt-based data generation):

      pip install -r requirements-generation.txt

Core installs use requirements.txt and do not include notebook-only or data-generation-only packages. Use these optional sets as needed:
- Example analysis notebooks: `pip install -r requirements-examples.txt`
- Command-line notebook execution tools: included in `requirements-examples.txt` via `nbconvert` and `ipykernel`
- Hyperopt-based data generation: `pip install -r requirements-generation.txt`
When installing the package in editable mode, the equivalent extras are:

    pip install -e ".[examples,notebooks]"
    pip install -e ".[generation]"

The generation requirements include a temporary `setuptools<81` compatibility pin because Hyperopt 0.2.7 imports `pkg_resources`. This workaround is generation-specific and is not needed for analysis-only notebooks.
For a full demonstration of the stochastic-benchmark analysis in action, refer to the example notebooks located in the examples folder of this repository.
After installing `requirements-examples.txt`, a notebook can be checked from the command line with:

    python -m jupyter nbconvert --to notebook --execute --inplace path/to/notebook.ipynb

To run the same self-contained tutorial notebook smoke check used in CI:

    python scripts/verify_tutorials.py --output-dir executed-notebooks

The manifest at `examples/tutorials.json` lists runnable notebooks and documents notebooks that are intentionally skipped because they require external setup or are too slow for the lightweight CI smoke job.
The smoke check executes copied notebook directories under the output directory, so generated plots, summaries, and caches are kept with the executed notebooks instead of modifying the source examples.
Use the root README as the entry point, then follow the focused documents for the part of the workflow you need:
- examples/general_workflow.md for the end-to-end benchmark flow
- CI-TESTING.md for local CI reproduction and environment setup
- TESTING.md for the test suite overview
- docs/passive_repeat_reliability_reports.md for passive repeat-reliability report inputs, outputs, and joins
- docs/noori_repeat_reliability_validation.md for validation of Noori et al. repeat-reliability formulas
- examples/wishart_n_50_alpha_0.5/README.md for the Wishart example details
Tests can be executed using the helper script run_tests.py. Specify the type of
tests to run along with any optional flags:

    python run_tests.py [unit|integration|smoke|all|coverage] [--verbose] [--fast]

Example commands:

- Run the unit test suite:

      python run_tests.py unit

- Generate a coverage report:

      python run_tests.py coverage

For additional details see TESTING.md.
- @robinabrown Robin Brown
- @PratikSathe Pratik Sathe
- @bernalde David Bernal Neira
This code was developed under the NSF Expeditions Program, NSF award CCF-1918549 on Coherent Ising Machines.