CensuScope


CensuScope is a tool for rapid taxonomic profiling of NGS metagenomic data using census-based sampling and BLAST-based alignment. It supports local CLI execution, containerized execution via Docker builds, or execution via a prebuilt Docker image.


Table of Contents

  1. Overview
  2. System Architecture
  3. Deployment
  4. Output Files

Overview

CensuScope is a tool for estimating the taxonomic composition of metagenomic sequencing data using census-based sampling and alignment. Instead of analyzing every read in a dataset, CensuScope repeatedly samples subsets of reads and aligns them against reference databases to infer which organisms are present and at what relative levels.

This approach is intentionally stochastic. Reads are selected at random from the input FASTQ file, and alignment results depend on which reads are sampled in a given run. As a result, two runs with the same inputs can produce different outputs. This behavior is expected and is a core feature of the method rather than a limitation. CensuScope is designed to produce statistically meaningful estimates through aggregation and repeated analysis, not bit-for-bit reproducible results.

The original CensuScope algorithm was implemented as a UNIX-based pipeline, with its core logic captured in the unix.php script included in this repository. The results reported in the original CensuScope publication were generated using that implementation. The current Python version of CensuScope is a direct continuation of that work, preserving the same sampling-based logic while updating the execution model to support modern workflows.

In the current implementation, alignment results are combined across sampled reads to produce taxonomic summaries at species or higher taxonomic levels. Individual read assignments are not treated as definitive classifications. Instead, taxonomic profiles emerge from aggregation across many probabilistic observations.
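The aggregation step can be illustrated with a minimal sketch. The taxon IDs, counts, and function name below are hypothetical examples, not CensuScope internals: each sampling iteration yields per-taxon hit counts, and the profile is the mean fraction of sampled reads assigned to each taxon across iterations.

```python
# Sketch: averaging per-iteration hit counts into a taxonomic profile.
# Taxon IDs and counts are illustrative, not real CensuScope output.
from collections import Counter

def aggregate_iterations(iterations, sample_size):
    """Mean relative abundance of each taxon across sampling iterations."""
    totals = Counter()
    for hits in iterations:            # hits: {taxon_id: reads assigned}
        for taxon, count in hits.items():
            totals[taxon] += count / sample_size
    n = len(iterations)
    return {taxon: frac / n for taxon, frac in totals.items()}

# Three hypothetical iterations of 100 sampled reads each:
runs = [
    {"9606": 40, "562": 10},
    {"9606": 35, "562": 15},
    {"9606": 45, "562": 5},
]
profile = aggregate_iterations(runs, sample_size=100)
# profile["9606"] is the mean sampled-read fraction for taxon 9606 (~0.4)
```

Because each iteration samples reads at random, the per-iteration counts fluctuate; averaging across iterations is what makes the final profile statistically meaningful.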

The Python-based CensuScope engine improves portability, supports containerized execution, and standardizes output formats, while maintaining the original algorithmic behavior and assumptions of the census-based approach.


System Architecture

CensuScope is built with a clear separation between static infrastructure, dynamic inputs, and generated outputs. This separation supports flexible execution (local or containerized), minimizes runtime assumptions, and makes the behavior of the system explicit.

  • Note: a containerized version of CensuScope will become available to users in a future release.

At a high level, a CensuScope run consists of:

  1. Randomly sampling reads from an input sequencing file
  2. Aligning sampled reads against a nucleotide BLAST database
  3. Mapping alignments to taxonomic identifiers
  4. Aggregating alignment evidence into taxonomic summaries
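The four steps above can be sketched in Python. The function names here are illustrative, not the actual CensuScope API, and the blastn flags shown are standard BLAST+ options rather than necessarily the exact ones CensuScope uses:

```python
# Sketch of steps 1-2: random read sampling plus a blastn invocation.
# Reservoir sampling keeps a uniform random subset in one pass.
import random

def sample_reads(fastq_path, n, seed=None):
    """Reservoir-sample n reads (4-line FASTQ records) from a file."""
    rng = random.Random(seed)
    reservoir = []
    with open(fastq_path) as fh:
        for i, record in enumerate(zip(fh, fh, fh, fh)):
            if i < n:
                reservoir.append(record)
            else:
                j = rng.randrange(i + 1)
                if j < n:
                    reservoir[j] = record
    return reservoir

def blast_command(query_fasta, db, out_tsv):
    """Build (but do not run) a blastn call producing taxid-annotated hits."""
    return ["blastn", "-query", query_fasta, "-db", db,
            "-outfmt", "6 qseqid sacc staxids bitscore",
            "-max_target_seqs", "1", "-out", out_tsv]

# To execute (requires BLAST+ installed):
#   import subprocess
#   subprocess.run(blast_command("sample.fa", "slimnt", "hits.tsv"), check=True)
```

Steps 3 and 4 then map the `staxids` column of the tabular BLAST output through the taxonomy database and aggregate the counts.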

Core components

🧬 FASTQ input (dynamic, user-provided)

  • User-provided sequencing data in FASTQ (or FASTA) format
  • Treated as read-only input
  • May vary between runs and users
  • Randomly sampled during execution

The FASTQ file is the primary source of variability between runs. FASTA input is also accepted and is converted to FASTQ internally.
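As a hedged illustration of such a conversion (not the actual CensuScope code), a FASTA record can be rewritten as a FASTQ record by assigning a placeholder quality string of the same length as the sequence:

```python
# Sketch: convert FASTA to FASTQ with a uniform placeholder quality.
# The quality character "I" (Phred 40 in Sanger encoding) is an
# illustrative choice, not necessarily what CensuScope uses.
def fasta_to_fastq(fasta_path, fastq_path, quality_char="I"):
    with open(fasta_path) as src, open(fastq_path, "w") as dst:
        header, seq = None, []

        def flush():
            if header is not None:
                s = "".join(seq)
                dst.write(f"@{header}\n{s}\n+\n{quality_char * len(s)}\n")

        for line in src:
            line = line.rstrip("\n")
            if line.startswith(">"):
                flush()                 # emit the previous record
                header, seq = line[1:], []
            else:
                seq.append(line)        # sequences may span multiple lines
        flush()                         # emit the final record
```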

💻 BLAST database (external, static, user-provided)

  • User-provided nucleotide BLAST database
  • Built externally using standard BLAST tooling
  • Not modified by CensuScope
  • May be shared across multiple runs and analyses

CensuScope treats the BLAST database as a fixed reference against which sampled reads are aligned. Common database options include:

  • NCBI NT database (standard, comprehensive)
    • Use NCBI's FTP BLAST Site to locate and download BLAST database files, including NT.
  • SlimNT (curated, reduced database)
    • The GitHub Repo to create or download slimNT.fa can be found here.
  • Filtered NT (lab-specific filtered database)
    • The GitHub Repo to create or download filter_nt.fa can be found here.

See the blast_database.md README in the docs folder for more information and step-by-step directions.
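As a sketch, building a nucleotide database from one of these FASTA files with the standard BLAST+ `makeblastdb` tool might look like the following; the file name `slimNT.fa` and output name `slimnt` are illustrative:

```python
# Sketch: constructing a makeblastdb invocation for a downloaded FASTA
# file. Executing the command requires BLAST+ to be installed; see
# blast_database.md for the actual, authoritative instructions.
def makeblastdb_command(fasta, db_name):
    return ["makeblastdb", "-in", fasta, "-dbtype", "nucl",
            "-parse_seqids", "-out", db_name]

cmd = makeblastdb_command("slimNT.fa", "slimnt")
# To execute:
#   import subprocess
#   subprocess.run(cmd, check=True)
```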

🦒 Taxonomy database (internal, static)

  • SQLite database derived from the NCBI taxonomy
  • Built ahead of time and treated as immutable at runtime
  • Provides taxonomic hierarchy and canonical names
  • Uses a reduced schema tailored to CensuScope’s lookup and aggregation needs

The taxonomy.db file must be created by the user prior to runtime. See the taxonomydb.md README in docs.
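The kind of lookup such a database supports can be sketched with SQLite. The table and column names below are assumptions for illustration only; the real taxonomy.db schema is described in taxonomydb.md. The example walks parent links upward to recover a lineage, which is the core operation behind mapping hits to higher taxonomic levels:

```python
# Hedged sketch of a taxonomy lineage lookup. Schema and data are
# illustrative stand-ins for the NCBI-derived taxonomy database.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE nodes "
           "(tax_id INTEGER PRIMARY KEY, parent_id INTEGER, rank TEXT, name TEXT)")
db.executemany("INSERT INTO nodes VALUES (?,?,?,?)", [
    (1,   1,   "no rank",      "root"),
    (2,   1,   "superkingdom", "Bacteria"),
    (561, 2,   "genus",        "Escherichia"),
    (562, 561, "species",      "Escherichia coli"),
])

def lineage(tax_id):
    """Walk parent links up to the root, collecting (rank, name) pairs."""
    out = []
    while True:
        row = db.execute(
            "SELECT parent_id, rank, name FROM nodes WHERE tax_id=?",
            (tax_id,)).fetchone()
        if row is None:
            break
        out.append((row[1], row[2]))
        if row[0] == tax_id:   # the root node points to itself
            break
        tax_id = row[0]
    return out

# lineage(562) walks species -> genus -> superkingdom -> root
```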


CensuScope Engine (Runtime)

  • Python-based execution engine
  • Responsible for:
    • read sampling
    • BLAST invocation
    • taxonomy lookup
    • aggregation and reporting

The engine is stateless across runs aside from generated outputs. No persistent state is carried between executions.

Output artifacts (generated per run)

  • Per-run output files produced during execution
  • May include:
    • taxonomic summary tables
    • lineage representations
    • logs and intermediate artifacts

Outputs are written to a designated output directory and may differ between runs even when inputs are identical.

The outputs are described in the dockerDeployment.md README in docs. They are also summarized below.


Deployment

CensuScope requires two reference resources to be prepared prior to execution:

  1. a taxonomy database (taxonomy.db)
  2. a nucleotide BLAST database (such as SlimNT, filtered NT, or NCBI NT)

These files are treated as read-only inputs during runtime and must be available before running the workflow. Once both databases are prepared, CensuScope can be executed using the Docker deployment guide.

For step-by-step setup instructions, follow the documentation in this order:

  1. Build the taxonomy database
    See taxonomydb.md

  2. Prepare or obtain a BLAST database
    See blast_database.md

  3. Build and run CensuScope with Docker
    See dockerDeployment.md

This documentation structure follows the same order as the CensuScope workflow: first prepare the taxonomy reference, then prepare the sequence reference database, and finally run the containerized pipeline.

Note: At the time of this release, only a CLI CensuScope Docker image is available. We are currently working on a fully containerized version for users.


Output Files

CensuScope writes outputs to a run-specific directory within temp_dirs/. Each run generates a timestamped folder containing intermediate files and a results/ directory with the final outputs.

The results/ folder contains the main output files:

  • taxonomy_table.tsv – aggregated taxonomic counts
  • tax_tree.json – hierarchical taxonomy structure
  • accession_table.tsv – accession-level hit counts
  • censuscope.log – execution log

These files represent the final results of a CensuScope run. Additional intermediate files (e.g., random samples and BLAST outputs) are stored in separate subdirectories but are not typically used for downstream analysis.
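For downstream analysis, the tab-separated summary can be loaded with standard tooling. The column names used below (`tax_id`, `name`, `count`) are assumptions for illustration; see dockerDeployment.md for the actual file layout:

```python
# Sketch: loading taxonomy_table.tsv and computing relative abundances.
# The sample data and column names are hypothetical.
import csv, io

sample_tsv = ("tax_id\tname\tcount\n"
              "562\tEscherichia coli\t120\n"
              "9606\tHomo sapiens\t30\n")

def load_taxonomy_table(handle):
    """Parse a TSV of per-taxon counts and annotate each row's fraction."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    total = sum(int(r["count"]) for r in rows)
    for r in rows:
        r["fraction"] = int(r["count"]) / total
    return rows

rows = load_taxonomy_table(io.StringIO(sample_tsv))
```

In real use, `io.StringIO(sample_tsv)` would be replaced by an open handle to `results/taxonomy_table.tsv` in the run's output directory.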

For more details, see dockerDeployment.md.

About

CensuScope is designed and optimized for quick detection of the components of a given NGS metagenomic dataset, reporting them in a standard format at species or higher taxonomic resolution.
