CensuScope is a tool for rapid taxonomic profiling of NGS metagenomic data using census-based sampling and BLAST-based alignment. It supports local CLI execution, containerized execution via Docker builds, or execution via a prebuilt Docker image.
CensuScope is a tool for estimating the taxonomic composition of metagenomic sequencing data using census-based sampling and alignment. Instead of analyzing every read in a dataset, CensuScope repeatedly samples subsets of reads and aligns them against reference databases to infer which organisms are present and at what relative levels.
This approach is intentionally stochastic. Reads are selected at random from the input FASTQ file, and alignment results depend on which reads are sampled in a given run. As a result, two runs with the same inputs can produce different outputs. This behavior is expected and is a core feature of the method rather than a limitation. CensuScope is designed to produce statistically meaningful estimates through aggregation and repeated analysis, not bit-for-bit reproducible results.
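The run-to-run variability described above can be illustrated with a toy simulation. This is a minimal sketch, not CensuScope's code: the organism labels, dataset proportions, sample size, and function names are all hypothetical, and each "read" is reduced to the label of the organism it came from. It shows that single samples differ from one another while an average over repeated samples settles near the underlying proportions.

```python
import random
from collections import Counter, defaultdict


def sample_fractions(reads, sample_size, rng):
    """Draw one random sample without replacement and return label fractions."""
    sample = rng.sample(reads, sample_size)
    counts = Counter(sample)
    return {label: n / sample_size for label, n in counts.items()}


def averaged_fractions(reads, sample_size, iterations, rng):
    """Average label fractions over several independent random samples."""
    totals = defaultdict(float)
    for _ in range(iterations):
        for label, frac in sample_fractions(reads, sample_size, rng).items():
            totals[label] += frac
    return {label: total / iterations for label, total in totals.items()}


# Hypothetical dataset: 70% organism_A, 25% organism_B, 5% organism_C.
reads = ["organism_A"] * 7000 + ["organism_B"] * 2500 + ["organism_C"] * 500

# Two single samples drawn with different random states usually disagree slightly.
print(sample_fractions(reads, 100, random.Random(1)))
print(sample_fractions(reads, 100, random.Random(2)))

# Averaging over repeated samples gives a steadier estimate.
print(averaged_fractions(reads, 100, 50, random.Random(3)))
```

Seeded `random.Random` instances are used here only so the illustration can be rerun; nothing in this sketch implies that CensuScope exposes a seed option.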
The original CensuScope algorithm was implemented as a UNIX-based pipeline, with its core logic captured in the unix.php script included in this repository. The results reported in the original CensuScope publication were generated using that implementation. The current Python version of CensuScope is a direct continuation of that work, preserving the same sampling-based logic while updating the execution model to support modern workflows.
In the current implementation, alignment results are combined across sampled reads to produce taxonomic summaries at species or higher taxonomic levels. Individual read assignments are not treated as definitive classifications. Instead, taxonomic profiles emerge from aggregation across many probabilistic observations.
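As an illustration of aggregating read-level evidence at a chosen rank, the sketch below rolls per-read hits up to species, genus, or family. The `LINEAGES` table, the `summarize` function, and the hit list are hypothetical stand-ins; in CensuScope the lineage information would come from the taxonomy database rather than a hard-coded dictionary.

```python
from collections import Counter

# Hypothetical, hand-written lineage table: tax_id -> {rank: name}.
LINEAGES = {
    562: {"species": "Escherichia coli", "genus": "Escherichia",
          "family": "Enterobacteriaceae"},
    623: {"species": "Shigella flexneri", "genus": "Shigella",
          "family": "Enterobacteriaceae"},
    1280: {"species": "Staphylococcus aureus", "genus": "Staphylococcus",
           "family": "Staphylococcaceae"},
}


def summarize(hit_tax_ids, rank):
    """Count hits per taxon name at the requested rank, with relative fractions."""
    counts = Counter()
    for tax_id in hit_tax_ids:
        # Hits with no known lineage are kept visible instead of being dropped.
        name = LINEAGES.get(tax_id, {}).get(rank, "unclassified")
        counts[name] += 1
    total = sum(counts.values())
    return {name: (n, n / total) for name, n in counts.most_common()}


# One tax_id per sampled read that produced a hit (made-up values).
hits = [562, 562, 623, 1280, 562, 9999]

print(summarize(hits, "species"))
print(summarize(hits, "family"))
```

Summarizing the same hits at family level merges the two Enterobacteriaceae species into one row, which is the sense in which a profile emerges from aggregation and not from any single read assignment.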
The Python-based CensuScope engine improves portability, supports containerized execution, and standardizes output formats, while maintaining the original algorithmic behavior and assumptions of the census-based approach.
CensuScope is built with a clear separation between static infrastructure, dynamic inputs, and generated outputs. This separation supports flexible execution (local or containerized), minimizes runtime assumptions, and makes the behavior of the system explicit.
- Note: a containerized version of CensuScope will become available to users in a future release.
At a high level, a CensuScope run consists of:
- Randomly sampling reads from an input sequencing file
- Aligning sampled reads against a nucleotide BLAST database
- Mapping alignments to taxonomic identifiers
- Aggregating alignment evidence into taxonomic summaries
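The first of these steps, random read sampling, can be sketched as follows. This example uses reservoir sampling, a standard technique for drawing a fixed-size uniform sample in one pass over a stream; it is chosen here for illustration and is not necessarily how CensuScope samples reads. The function names and the in-memory FASTQ data are hypothetical, and the parser assumes the simple four-lines-per-record FASTQ layout.

```python
import io
import random


def read_fastq(handle):
    """Yield (header, sequence, separator, quality) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or trailing blank line)
        seq = handle.readline().rstrip("\n")
        sep = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, sep, qual


def reservoir_sample(records, k, rng):
    """Return k records chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


# Small in-memory FASTQ with ten made-up reads.
fastq_text = "".join(f"@read{i}\nACGTACGT\n+\nIIIIIIII\n" for i in range(10))
sampled = reservoir_sample(read_fastq(io.StringIO(fastq_text)), 3, random.Random())

for header, seq, _, _ in sampled:
    print(header, seq)
```

Because the file is read once and only `k` records are held in memory, this pattern scales to large inputs, which is one reason it is a common choice for sampling sequencing data.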
- User-provided sequencing data in FASTQ (or FASTA) format
- Treated as read-only input
- May vary between runs and users
- Randomly sampled during execution
The FASTQ file is the primary source of variability between runs. FASTA input is also accepted; the code converts it to FASTQ.
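A FASTA-to-FASTQ conversion can be sketched as below. FASTA carries no base qualities, so any converter has to supply placeholder values; the choice of `I` (Phred 40 in the Phred+33 encoding) and the function names are assumptions made for this example, not a description of what CensuScope writes.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs, joining multi-line sequences."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_fastq(lines, placeholder_quality="I"):
    """Return FASTQ text with a constant placeholder quality for every base."""
    records = []
    for header, seq in parse_fasta(lines):
        quality = placeholder_quality * len(seq)
        records.append(f"@{header}\n{seq}\n+\n{quality}\n")
    return "".join(records)


fasta_text = ">read1 example\nACGT\nAC\n>read2\nGG\n"
print(fasta_to_fastq(fasta_text.splitlines()))
```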
- User-provided nucleotide BLAST database
- Built externally using standard BLAST tooling
- Not modified by CensuScope
- May be shared across multiple runs and analyses
CensuScope treats the BLAST database as a fixed reference against which sampled reads are aligned. Common database options include:
- NCBI NT database (standard, comprehensive)
  - Use NCBI's FTP BLAST Site to locate and download BLAST database files, including NT.
- SlimNT (curated, reduced database)
  - The GitHub repo to create or download slimNT.fa can be found here.
- Filtered NT (lab-specific filtered database)
  - The GitHub repo to create or download filter_nt.fa can be found here.
The README blast_database.md in the docs folder provides more information and step-by-step directions.
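To make the role of the BLAST database concrete, the sketch below assembles a `blastn` command line against a database prefix and tallies subject accessions from tabular output. The flags shown (`-query`, `-db`, `-out`, `-outfmt`, `-max_target_seqs`, `-num_threads`) are standard BLAST+ options, but the particular values, the output columns, the file paths, and the function names are assumptions for illustration and may differ from what CensuScope actually invokes. The sketch also assumes the sampled reads have been written as FASTA for the query.

```python
import subprocess
from collections import Counter

# Tabular output with an explicit column list (an assumed selection).
OUTFMT = "6 qseqid sacc pident length evalue bitscore"


def build_blastn_command(query_fasta, db_prefix, out_path, threads=4):
    """Return the blastn argument list; the database is only read, never modified."""
    return [
        "blastn",
        "-query", query_fasta,
        "-db", db_prefix,          # path prefix of the BLAST database files
        "-out", out_path,
        "-outfmt", OUTFMT,
        "-max_target_seqs", "1",
        "-num_threads", str(threads),
    ]


def run_blastn(cmd):
    """Run blastn; requires BLAST+ to be installed and on PATH."""
    subprocess.run(cmd, check=True)


def count_accessions(tabular_lines):
    """Count hits per subject accession (second column of the assumed format)."""
    counts = Counter()
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts


cmd = build_blastn_command("sample.fa", "db/slimNT", "hits.tsv")
print(" ".join(cmd))
```

Passing the command as a list avoids shell quoting problems with the multi-word `-outfmt` value.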
- SQLite database derived from the NCBI taxonomy
- Built ahead of time and treated as immutable at runtime
- Provides taxonomic hierarchy and canonical names
- Uses a reduced schema tailored to CensuScope’s lookup and aggregation needs
The user must create the taxonomy.db file before runtime. See the taxonomydb.md README in docs.
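A lineage lookup against a reduced taxonomy table might look like the sketch below. The schema (`nodes` with `tax_id`, `parent_id`, `tax_rank`, `name`) is hypothetical and is not the actual layout of taxonomy.db; see taxonomydb.md for the real one. The rows are a heavily truncated toy lineage held in an in-memory database so the example runs on its own.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical reduced schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nodes ("
        "tax_id INTEGER PRIMARY KEY, parent_id INTEGER, tax_rank TEXT, name TEXT)"
    )
    conn.executemany(
        "INSERT INTO nodes VALUES (?, ?, ?, ?)",
        [
            (1, 1, "no rank", "root"),  # the root is its own parent
            (2, 1, "superkingdom", "Bacteria"),
            (561, 2, "genus", "Escherichia"),
            (562, 561, "species", "Escherichia coli"),
        ],
    )
    return conn


def lineage(conn, tax_id):
    """Walk parent links up to the root and return [(rank, name), ...] root-first."""
    path = []
    while True:
        row = conn.execute(
            "SELECT parent_id, tax_rank, name FROM nodes WHERE tax_id = ?",
            (tax_id,),
        ).fetchone()
        if row is None:
            break  # unknown tax_id
        parent_id, tax_rank, name = row
        path.append((tax_rank, name))
        if parent_id == tax_id:
            break  # reached the root
        tax_id = parent_id
    return list(reversed(path))


conn = build_demo_db()
print(lineage(conn, 562))
```

For a real, prebuilt file, opening it with `sqlite3.connect("file:taxonomy.db?mode=ro", uri=True)` is one way to enforce the read-only treatment at the connection level.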
- Python-based execution engine
- Responsible for:
  - read sampling
  - BLAST invocation
  - taxonomy lookup
  - aggregation and reporting
The engine is stateless across runs aside from generated outputs. No persistent state is carried between executions.
- Per-run output files produced during execution
- May include:
  - taxonomic summary tables
  - lineage representations
  - logs and intermediate artifacts
Outputs are written to a designated output directory and may differ between runs even when inputs are identical.
The outputs are described in the dockerDeployment.md README in docs. They are also summarized below.
CensuScope requires two reference resources to be prepared prior to execution:
- a taxonomy database (taxonomy.db)
- a nucleotide BLAST database (such as SlimNT, filtered NT, or NCBI NT)
These files are treated as read-only inputs during runtime and must be available before running the workflow. Once both databases are prepared, CensuScope can be executed using the Docker deployment guide.
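A small pre-flight check can confirm that both resources are in place before a run starts. This is a sketch under stated assumptions: the function name is hypothetical, and the BLAST check only looks for a nucleotide sequence file (`.nsq`) or an alias file (`.nal`) next to the given database prefix, which is a heuristic and not a full validation of the database.

```python
from pathlib import Path


def missing_references(taxonomy_db, blast_db_prefix):
    """Return a list of human-readable problems; an empty list means both were found."""
    problems = []

    if not Path(taxonomy_db).is_file():
        problems.append(f"taxonomy database not found: {taxonomy_db}")

    prefix = Path(blast_db_prefix)
    # Single-volume databases have <prefix>.nsq; multi-volume ones have <prefix>.nal.
    candidates = [prefix.with_name(prefix.name + ext) for ext in (".nsq", ".nal")]
    if not any(path.is_file() for path in candidates):
        problems.append(f"no nucleotide BLAST database at prefix: {blast_db_prefix}")

    return problems
```

A caller could run `missing_references("taxonomy.db", "db/slimNT")` (example paths) and stop with a clear message if the returned list is not empty, instead of failing partway through sampling or alignment.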
For step-by-step setup instructions in order of workflow, follow the documentation in this order:
- Build the taxonomy database. See taxonomydb.md.
- Prepare or obtain a BLAST database. See blast_database.md.
- Build and run CensuScope with Docker. See dockerDeployment.md.
This documentation structure follows the same order as the CensuScope workflow: first prepare the taxonomy reference, then prepare the sequence reference database, and finally run the containerized pipeline.
Note: At the time of this release, only a CLI CensuScope Docker is available. We are working on providing a containerized version for users.
CensuScope writes outputs to a run-specific directory within temp_dirs/. Each run generates a timestamped folder containing intermediate files and a results/ directory with the final outputs.
The results/ folder contains the main output files:
- taxonomy_table.tsv: aggregated taxonomic counts
- tax_tree.json: hierarchical taxonomy structure
- accession_table.tsv: accession-level hit counts
- censuscope.log: execution log
These files represent the final results of a CensuScope run. Additional intermediate files (e.g., random samples and BLAST outputs) are stored in separate subdirectories but are not typically used for downstream analysis.
For more details, see dockerDeployment.md.
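For downstream use, the most recent run's results can be located and loaded with a few lines of Python. This sketch assumes that run folders sit directly under temp_dirs/, that the newest one can be identified by modification time, and that the TSV files begin with a header row; the function names are hypothetical and the column names are whatever the file provides, so none are hard-coded here.

```python
import csv
from pathlib import Path


def latest_results_dir(base="temp_dirs"):
    """Return the results/ folder of the most recently modified run directory."""
    run_dirs = [p for p in Path(base).iterdir() if (p / "results").is_dir()]
    if not run_dirs:
        raise FileNotFoundError(f"no run directories with results/ under {base}")
    newest = max(run_dirs, key=lambda p: p.stat().st_mtime)
    return newest / "results"


def load_tsv(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, `load_tsv(latest_results_dir() / "taxonomy_table.tsv")` would return one dictionary per row, keyed by the header names found in the file.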