Window Sticker - Stochastic Benchmark

Repository for Stochastic Optimization Solvers Benchmark implementation of the Window Sticker framework.

The benchmarking approach is described in Benchmarking the Operation of Quantum Heuristics and Ising Machines: Scoring Parameter Setting Strategies on Optimization Applications, with the arXiv preprint also available.

Details of the implementation and an illustrative example for Wishart instances found here are given in this document.

Background

This code produces a set of plots that characterize the performance of parameterized stochastic optimization solvers on a well-established family of optimization problems. The plots are built from experimental data obtained by running the solvers on seen instances of the problem family, and are then evaluated on an unseen subset of problems. More details of the methodology have been presented at the APS March Meeting and the INFORMS Annual Meeting. A manuscript explaining the methodology is in preparation.

The performance plot, which we call the Window Sticker, is a graphical representation of the expected performance of a solution method or parameter-setting strategy on an unseen instance from the same problem family. It is designed to answer the question: "With X% confidence, will we find a solution with Y quality after using R resource?" The quality metric and the resource value can be arbitrary functions of the solver's parameters and performance, which makes the plot a flexible analysis tool.
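
To make the Window Sticker question concrete, the sketch below computes, for each resource level R, the quality Y that at least X% of unseen instances reach. It is only an illustration of the final quantile step: the function name `quality_at_confidence` and all numbers are hypothetical, and the package's actual pipeline (train/test splits, interpolation, bootstrapping) is not reproduced here.

```python
import math

def quality_at_confidence(per_instance_quality, confidence):
    """Return the quality level that at least `confidence` of the instances reach.

    Higher quality is assumed to be better (for example, a performance ratio).
    """
    ordered = sorted(per_instance_quality, reverse=True)
    # Smallest count k with k / n >= confidence (the epsilon guards float noise).
    k = max(1, math.ceil(confidence * len(ordered) - 1e-12))
    return ordered[k - 1]

# Hypothetical performance ratios on 10 unseen instances, keyed by resource R.
quality_by_resource = {
    100: [0.62, 0.70, 0.55, 0.81, 0.66, 0.74, 0.59, 0.68, 0.77, 0.63],
    1000: [0.85, 0.90, 0.78, 0.95, 0.88, 0.92, 0.80, 0.86, 0.93, 0.83],
}

for resource, qualities in quality_by_resource.items():
    y = quality_at_confidence(qualities, confidence=0.8)
    print(f"R={resource}: 80% of instances reach quality >= {y}")
```

Reading the output across resource levels gives one curve of the plot: a fixed confidence X, a resource R on the horizontal axis, and the guaranteed quality Y on the vertical axis.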

The current package implements the following functionality:

- Parse result files from parameterized stochastic solvers such as PySA and the D-Wave Ocean tools.
- Simulate solver performance with less data through bootstrapping and downsampling.
- Compute best-recommended parameters based on aggregated statistics and individual results for each parameter setting.
- Compute an optimistic performance bound, known as the virtual best performance, based on the provided experiments.
- Perform an exploration-exploitation parameter-setting strategy, where the fraction of the allocated resources used in the exploration round is optimized. The exploration procedure is implemented either as a random search over the seen parameter settings or, when the generation dependencies are installed, as a Bayesian method known as the Tree-structured Parzen Estimator, as implemented in the Hyperopt package.
- Plot the Window Sticker, comparing the performance curves of the virtual best, recommended parameters, and exploration-exploitation parameter-setting strategies.
- Plot the parameter values and their best values with respect to the resource considered, a plot we call the Strategy plot. These plots can show the actual solver parameter values or the meta-parameters associated with parameter-setting strategies.
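
The bootstrapping and downsampling step in the list above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the package's implementation: the function name `bootstrap_best_of`, the percentile-interval choice, and the synthetic energies are all assumptions made for the example.

```python
import random
import statistics

def bootstrap_best_of(energies, n_reads, n_boots=1000, seed=0):
    """Estimate the expected best (lowest) energy seen in `n_reads` runs.

    Resamples `n_reads` values with replacement from the recorded energies,
    keeps the minimum of each resample, and summarizes the minima.
    """
    rng = random.Random(seed)
    minima = sorted(
        min(rng.choices(energies, k=n_reads)) for _ in range(n_boots)
    )
    # Simple percentile interval over the bootstrap minima.
    lower = minima[int(0.025 * (n_boots - 1))]
    upper = minima[int(0.975 * (n_boots - 1))]
    return statistics.mean(minima), (lower, upper)

# Hypothetical energies from 200 recorded runs of one parameter setting.
data_rng = random.Random(42)
energies = [data_rng.gauss(-10.0, 1.0) for _ in range(200)]

summary = {}
for n_reads in (1, 10, 100):
    mean_best, (lo, hi) = bootstrap_best_of(energies, n_reads)
    summary[n_reads] = (mean_best, lo, hi)
    print(f"reads={n_reads:4d}  mean best={mean_best:.3f}  "
          f"95% interval=({lo:.3f}, {hi:.3f})")
```

Downsampling to a smaller `n_reads` shows how the solver would have looked if fewer runs had been collected, which is what allows performance to be expressed as a function of the resource spent.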

Methodology Reliability

The Window Sticker workflow combines two different uncertainty questions that should be interpreted separately. The cross-instance Window Sticker uncertainty comes from train/test splits, interpolation, aggregation, and bootstrap resampling across a problem family. It answers how a parameter-setting strategy is expected to perform on unseen instances from that family. Per-instance repeat-count reliability asks whether each solver, parameter setting, and resource level has enough repeated stochastic runs to support its own success probability, repeat count, and time-to-solution estimates.

Repeat reliability follows Noori et al. 2026:

Noori, Moslem, Elisabetta Valiante, Ignacio Rozada, Thomas Van Vaerenbergh, and Masoud Mohseni. "Statistical analysis for per-instance evaluation of stochastic optimizers: Avoiding unreliable conclusions." Physical Review Applied 25, no. 3 (2026): 034081. The related arXiv preprint is titled "A Statistical Analysis for Per-Instance Evaluation of Stochastic Optimizers: How Many Repeats Are Enough?"

The analytic guarantees implemented here apply to Bernoulli success events and metrics derived from their success probabilities: R_c, RTT/TTS, CETS, and thresholded continuous metrics. For example, a continuous energy, Response, or PerfRatio value can be analyzed with repeat reliability only after the workflow defines a threshold that turns each run into success or failure.
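
As an illustration of the thresholding step, the sketch below turns continuous energies into Bernoulli outcomes and derives a repeat count and a time-to-solution from the resulting success probability. It uses the common convention R_c = ln(1 - c) / ln(1 - p), clamped to at least one repeat; the function names, the threshold, and the per-run time are hypothetical, and the package's exact definitions may differ.

```python
import math

def success_probability(values, threshold):
    """Fraction of runs whose value is at or below `threshold` (minimization)."""
    successes = sum(1 for v in values if v <= threshold)
    return successes / len(values)

def repeats_to_confidence(p_success, confidence=0.99):
    """Repeats needed so that at least one run succeeds with prob. `confidence`."""
    if p_success <= 0.0:
        return math.inf  # no observed success: the target is never reached
    if p_success >= 1.0:
        return 1.0       # every run succeeds, so one repeat is enough
    repeats = math.log(1.0 - confidence) / math.log(1.0 - p_success)
    return max(1.0, repeats)

# Hypothetical energies from 10 runs; a run "succeeds" if energy <= -9.5.
energies = [-10, -9, -10, -7, -8, -10, -9, -6, -10, -9]
p = success_probability(energies, threshold=-9.5)
r99 = repeats_to_confidence(p, confidence=0.99)
time_per_run = 0.5  # seconds, assumed for illustration
tts = time_per_run * r99

print(f"p = {p:.2f}, R_99 = {r99:.2f} repeats, TTS = {tts:.2f} s")
```

With only 10 runs the estimate of p is itself uncertain, which is the situation the repeat-reliability checks are meant to flag.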

The original Window Sticker bootstrap remains the uncertainty model for continuous Response curves and for continuous PerfRatio curves that are not converted to success thresholds. Those bootstrap intervals are useful for cross-instance benchmarking, but they are not a substitute for the analytic repeat-count checks above.

Repeat reliability addresses several methodology criticisms directly:

| Criticism | Documentation and package response |
| --- | --- |
| Repeat-count sufficiency | Report `required_trials`, `additional_trials_required`, `reliable`, and `reliability_status` so users can tell whether more stochastic runs are needed. |
| Bootstrap-only uncertainty | Keep bootstrap intervals for cross-instance continuous curves, and use analytic Bernoulli intervals for per-instance success probabilities and derived repeat-count metrics. |
| Noisy HPO choices | Surface repeat reliability before treating a parameter choice as stable, especially when success rates are small or intervals are wide. |
| CI-overlap ambiguity | Flag `ci_overlaps_best` and `statistically_unresolved` comparisons instead of implying that overlapping intervals identify a clear winner. |
| Virtual-best optimism | Document virtual best as an optimistic, unattainable reference and pair it with reliability checks when judging whether observed gaps are meaningful. |
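
To show what an overlap flag of this kind captures, the sketch below compares two parameter settings using the Wilson score interval, one common analytic interval for a Bernoulli success probability. The source does not state which interval the package uses, so the interval choice, the function names, and the success counts are assumptions for illustration only.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a Bernoulli success probability."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    ) / denom
    return max(0.0, center - half), min(1.0, center + half)

def intervals_overlap(a, b):
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical success counts out of 50 runs for two parameter settings.
best = wilson_interval(12, 50)
other = wilson_interval(8, 50)
overlaps = intervals_overlap(best, other)

print(f"best:  ({best[0]:.3f}, {best[1]:.3f})")
print(f"other: ({other[0]:.3f}, {other[1]:.3f})")
print("overlap with best:", overlaps)
```

When the intervals overlap, the observed ranking of the two settings may be an artifact of too few repeats, so the comparison is better reported as unresolved than as a win.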

Use docs/passive_repeat_reliability_reports.md to add passive repeat-reliability reports to existing benchmark data. Use docs/noori_repeat_reliability_validation.md for the equation map and validation boundaries.

References

Window Sticker methodology: published Quantum Machine Intelligence article and arXiv preprint.
Repeat reliability: published Physical Review Applied article and arXiv preprint.
QAOA benchmark case: published ACM Transactions on Quantum Computing article and arXiv preprint.

Installation

Method 1: Cloning the Repository

Clone the Repository:

```
git clone https://github.com/usra-riacs/stochastic-benchmark.git
cd stochastic-benchmark
```

Set up a Virtual Environment (Recommended):

```
python3 -m venv venv
source venv/bin/activate  # On Windows use `.\venv\Scripts\activate`
```

Install Dependencies:
```
pip install -r requirements.txt
```
Optional: Install Example and Notebook Dependencies (needed for self-contained runnable example notebooks):
```
pip install -r requirements-examples.txt
```
Optional: Install Data-Generation Dependencies (needed for workflows that run Hyperopt-based data generation):
```
pip install -r requirements-generation.txt
```

Method 2: Downloading as a Zip Archive

Download the Repository:
- Navigate to the stochastic-benchmark GitHub page.
- Click on the Code button.
- Choose Download ZIP.
- Once downloaded, extract the ZIP archive and navigate to the extracted folder in your terminal or command prompt.

Set up a Virtual Environment (Recommended):

```
python3 -m venv venv
source venv/bin/activate  # On Windows use `.\venv\Scripts\activate`
```

Install Dependencies:
```
pip install -r requirements.txt
```
Optional: Install Example and Notebook Dependencies (needed for self-contained runnable example notebooks):
```
pip install -r requirements-examples.txt
```
Optional: Install Data-Generation Dependencies (needed for workflows that run Hyperopt-based data generation):
```
pip install -r requirements-generation.txt
```

Optional Dependency Sets

Core installs use requirements.txt and do not include notebook-only or data-generation-only packages. Use these optional sets as needed:

- Example analysis notebooks: `pip install -r requirements-examples.txt`
- Command-line notebook execution tools: included in `requirements-examples.txt` via `nbconvert` and `ipykernel`
- Hyperopt-based data generation: `pip install -r requirements-generation.txt`

When installing the package in editable mode, the equivalent extras are:

```
pip install -e ".[examples,notebooks]"
pip install -e ".[generation]"
```

The generation requirements include a temporary setuptools<81 compatibility pin because Hyperopt 0.2.7 imports pkg_resources. This workaround is generation-specific and is not needed for analysis-only notebooks.

Examples

For a full demonstration of the stochastic-benchmark analysis in action, refer to the example notebooks located in the examples folder of this repository.

After installing requirements-examples.txt, a notebook can be checked from the command line with:

```
python -m jupyter nbconvert --to notebook --execute --inplace path/to/notebook.ipynb
```

To run the same self-contained tutorial notebook smoke check used in CI:

```
python scripts/verify_tutorials.py --output-dir executed-notebooks
```

The manifest at examples/tutorials.json lists runnable notebooks and documents notebooks that are intentionally skipped because they require external setup or are too slow for the lightweight CI smoke job. The smoke check executes copied notebook directories under the output directory, so generated plots, summaries, and caches are kept with the executed notebooks instead of modifying the source examples.

Documentation

Use the root README as the entry point, then follow the focused documents for the part of the workflow you need:

- `examples/general_workflow.md` for the end-to-end benchmark flow
- `CI-TESTING.md` for local CI reproduction and environment setup
- `TESTING.md` for the test suite overview
- `docs/passive_repeat_reliability_reports.md` for passive repeat-reliability report inputs, outputs, and joins
- `docs/noori_repeat_reliability_validation.md` for validation of the Noori et al. repeat-reliability formulas
- `examples/wishart_n_50_alpha_0.5/README.md` for the Wishart example details

Testing

Tests can be executed using the helper script run_tests.py. Specify the type of tests to run along with any optional flags:

```
python run_tests.py [unit|integration|smoke|all|coverage] [--verbose] [--fast]
```

Example commands:

Run the unit test suite:
```
python run_tests.py unit
```
Generate a coverage report:
```
python run_tests.py coverage
```

For additional details see TESTING.md.

Contributors

@robinabrown Robin Brown
@PratikSathe Pratik Sathe
@bernalde David Bernal Neira

Acknowledgements

This code was developed under the NSF Expeditions Program, NSF award CCF-1918549, on Coherent Ising Machines.

License

Apache 2.0
