CensuScope


CensuScope is a tool for rapid taxonomic profiling of NGS metagenomic data using census-based sampling and BLAST-based alignment. It supports local CLI execution, containerized execution via Docker builds, or execution via a prebuilt Docker image.


Table of Contents

  1. Overview
  2. System Architecture
  3. Deployment
  4. Output Files

Overview

CensuScope is a tool for estimating the taxonomic composition of metagenomic sequencing data using census-based sampling and alignment. Instead of analyzing every read in a dataset, CensuScope repeatedly samples subsets of reads and aligns them against reference databases to infer which organisms are present and at what relative levels.

This approach is intentionally stochastic. Reads are selected at random from the input FASTQ file, and alignment results depend on which reads are sampled in a given run. As a result, two runs with the same inputs can produce different outputs. This behavior is expected and is a core feature of the method rather than a limitation. CensuScope is designed to produce statistically meaningful estimates through aggregation and repeated analysis, not bit-for-bit reproducible results.

The original CensuScope algorithm was implemented as a UNIX-based pipeline, with its core logic captured in the unix.php script included in this repository. The results reported in the original CensuScope publication were generated using that implementation. The current Python version of CensuScope is a direct continuation of that work, preserving the same sampling-based logic while updating the execution model to support modern workflows.

In the current implementation, alignment results are combined across sampled reads to produce taxonomic summaries at species or higher taxonomic levels. Individual read assignments are not treated as definitive classifications. Instead, taxonomic profiles emerge from aggregation across many probabilistic observations.
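The aggregation step can be illustrated with a minimal sketch. The taxon IDs, counts, and function name below are hypothetical examples, not CensuScope internals: each sampling iteration yields per-taxon hit counts, and the profile is the mean fraction of sampled reads assigned to each taxon across iterations.

```python
# Sketch: averaging per-iteration hit counts into a taxonomic profile.
# Taxon IDs and counts are illustrative, not real CensuScope output.
from collections import Counter

def aggregate_iterations(iterations, sample_size):
    """Mean relative abundance of each taxon across sampling iterations."""
    totals = Counter()
    for hits in iterations:            # hits: {taxon_id: reads assigned}
        for taxon, count in hits.items():
            totals[taxon] += count / sample_size
    n = len(iterations)
    return {taxon: frac / n for taxon, frac in totals.items()}

# Three hypothetical iterations of 100 sampled reads each:
runs = [
    {"9606": 40, "562": 10},
    {"9606": 35, "562": 15},
    {"9606": 45, "562": 5},
]
profile = aggregate_iterations(runs, sample_size=100)
# profile["9606"] is the mean sampled-read fraction for taxon 9606 (~0.4)
```

Because each iteration samples reads at random, the per-iteration counts fluctuate; averaging across iterations is what makes the final profile statistically meaningful.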

The Python-based CensuScope engine improves portability, supports containerized execution, and standardizes output formats, while maintaining the original algorithmic behavior and assumptions of the census-based approach.


System Architecture

CensuScope is built with a clear separation between static infrastructure, dynamic inputs, and generated outputs. This separation supports flexible execution (local or containerized), minimizes runtime assumptions, and makes the behavior of the system explicit.

  • Note: a containerized version of CensuScope will become available to users in a future release.

At a high level, a CensuScope run consists of:

  1. Randomly sampling reads from an input sequencing file
  2. Aligning sampled reads against a nucleotide BLAST database
  3. Mapping alignments to taxonomic identifiers
  4. Aggregating alignment evidence into taxonomic summaries
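The four steps above can be sketched in Python. The function names here are illustrative, not the actual CensuScope API, and the blastn flags shown are standard BLAST+ options rather than necessarily the exact ones CensuScope uses:

```python
# Sketch of steps 1-2: random read sampling plus a blastn invocation.
# Reservoir sampling keeps a uniform random subset in one pass.
import random

def sample_reads(fastq_path, n, seed=None):
    """Reservoir-sample n reads (4-line FASTQ records) from a file."""
    rng = random.Random(seed)
    reservoir = []
    with open(fastq_path) as fh:
        for i, record in enumerate(zip(fh, fh, fh, fh)):
            if i < n:
                reservoir.append(record)
            else:
                j = rng.randrange(i + 1)
                if j < n:
                    reservoir[j] = record
    return reservoir

def blast_command(query_fasta, db, out_tsv):
    """Build (but do not run) a blastn call producing taxid-annotated hits."""
    return ["blastn", "-query", query_fasta, "-db", db,
            "-outfmt", "6 qseqid sacc staxids bitscore",
            "-max_target_seqs", "1", "-out", out_tsv]

# To execute (requires BLAST+ installed):
#   import subprocess
#   subprocess.run(blast_command("sample.fa", "slimnt", "hits.tsv"), check=True)
```

Steps 3 and 4 then map the `staxids` column of the tabular BLAST output through the taxonomy database and aggregate the counts.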

Core components

🧬 FASTQ input (dynamic, user-provided)

  • User-provided sequencing data in FASTQ (or FASTA) format
  • Treated as read-only input
  • May vary between runs and users
  • Randomly sampled during execution

The FASTQ file is the primary source of variability between runs. FASTA input is also accepted and is converted to FASTQ internally.
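As a hedged illustration of such a conversion (not the actual CensuScope code), a FASTA record can be rewritten as a FASTQ record by assigning a placeholder quality string of the same length as the sequence:

```python
# Sketch: convert FASTA to FASTQ with a uniform placeholder quality.
# The quality character "I" (Phred 40 in Sanger encoding) is an
# illustrative choice, not necessarily what CensuScope uses.
def fasta_to_fastq(fasta_path, fastq_path, quality_char="I"):
    with open(fasta_path) as src, open(fastq_path, "w") as dst:
        header, seq = None, []

        def flush():
            if header is not None:
                s = "".join(seq)
                dst.write(f"@{header}\n{s}\n+\n{quality_char * len(s)}\n")

        for line in src:
            line = line.rstrip("\n")
            if line.startswith(">"):
                flush()                 # emit the previous record
                header, seq = line[1:], []
            else:
                seq.append(line)        # sequences may span multiple lines
        flush()                         # emit the final record
```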

💻 BLAST database (external, static, user-provided)

  • User-provided nucleotide BLAST database
  • Built externally using standard BLAST tooling
  • Not modified by CensuScope
  • May be shared across multiple runs and analyses

CensuScope treats the BLAST database as a fixed reference against which sampled reads are aligned. Common database options include:

  • NCBI NT database (standard, comprehensive)
    • Use NCBI's FTP BLAST Site to locate and download BLAST database files, including NT.
  • SlimNT (curated, reduced database)
    • The GitHub Repo to create or download slimNT.fa can be found here.
  • Filtered NT (lab-specific filtered database)
    • The GitHub Repo to create or download filter_nt.fa can be found here.

See the blast_database.md README in the docs folder for more information and step-by-step directions.
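As a sketch, building a nucleotide database from one of these FASTA files with the standard BLAST+ `makeblastdb` tool might look like the following; the file name `slimNT.fa` and output name `slimnt` are illustrative:

```python
# Sketch: constructing a makeblastdb invocation for a downloaded FASTA
# file. Executing the command requires BLAST+ to be installed; see
# blast_database.md for the actual, authoritative instructions.
def makeblastdb_command(fasta, db_name):
    return ["makeblastdb", "-in", fasta, "-dbtype", "nucl",
            "-parse_seqids", "-out", db_name]

cmd = makeblastdb_command("slimNT.fa", "slimnt")
# To execute:
#   import subprocess
#   subprocess.run(cmd, check=True)
```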

🦒 Taxonomy database (internal, static)

  • SQLite database derived from the NCBI taxonomy
  • Built ahead of time and treated as immutable at runtime
  • Provides taxonomic hierarchy and canonical names
  • Uses a reduced schema tailored to CensuScope’s lookup and aggregation needs

The taxonomy.db file must be created by the user prior to runtime. See the taxonomydb.md README in docs.
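The kind of lookup such a database supports can be sketched with SQLite. The table and column names below are assumptions for illustration only; the real taxonomy.db schema is described in taxonomydb.md. The example walks parent links upward to recover a lineage, which is the core operation behind mapping hits to higher taxonomic levels:

```python
# Hedged sketch of a taxonomy lineage lookup. Schema and data are
# illustrative stand-ins for the NCBI-derived taxonomy database.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE nodes "
           "(tax_id INTEGER PRIMARY KEY, parent_id INTEGER, rank TEXT, name TEXT)")
db.executemany("INSERT INTO nodes VALUES (?,?,?,?)", [
    (1,   1,   "no rank",      "root"),
    (2,   1,   "superkingdom", "Bacteria"),
    (561, 2,   "genus",        "Escherichia"),
    (562, 561, "species",      "Escherichia coli"),
])

def lineage(tax_id):
    """Walk parent links up to the root, collecting (rank, name) pairs."""
    out = []
    while True:
        row = db.execute(
            "SELECT parent_id, rank, name FROM nodes WHERE tax_id=?",
            (tax_id,)).fetchone()
        if row is None:
            break
        out.append((row[1], row[2]))
        if row[0] == tax_id:   # the root node points to itself
            break
        tax_id = row[0]
    return out

# lineage(562) walks species -> genus -> superkingdom -> root
```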


CensuScope Engine (Runtime)

  • Python-based execution engine
  • Responsible for:
    • read sampling
    • BLAST invocation
    • taxonomy lookup
    • aggregation and reporting

The engine is stateless across runs aside from generated outputs. No persistent state is carried between executions.

Output artifacts (generated per run)

  • Per-run output files produced during execution
  • May include:
    • taxonomic summary tables
    • lineage representations
    • logs and intermediate artifacts

Outputs are written to a designated output directory and may differ between runs even when inputs are identical.

The outputs are described in the dockerDeployment.md README in docs. They are also summarized below.


Deployment

CensuScope requires two reference resources to be prepared prior to execution:

  1. a taxonomy database (taxonomy.db)
  2. a nucleotide BLAST database (such as SlimNT, filtered NT, or NCBI NT)

These files are treated as read-only inputs during runtime and must be available before running the workflow. Once both databases are prepared, CensuScope can be executed using the Docker deployment guide.

For step-by-step setup instructions, follow the documentation in this order:

  1. Build the taxonomy database
    See taxonomydb.md

  2. Prepare or obtain a BLAST database
    See blast_database.md

  3. Build and run CensuScope with Docker
    See dockerDeployment.md

This documentation structure follows the same order as the CensuScope workflow: first prepare the taxonomy reference, then prepare the sequence reference database, and finally run the containerized pipeline.

Note: At the time of this release, only a CLI CensuScope Docker image is available. We are currently working on a fully containerized version for users.


Output Files

CensuScope writes outputs to a run-specific directory within temp_dirs/. Each run generates a timestamped folder containing intermediate files and a results/ directory with the final outputs.

The results/ folder contains the main output files:

  • taxonomy_table.tsv – aggregated taxonomic counts
  • tax_tree.json – hierarchical taxonomy structure
  • accession_table.tsv – accession-level hit counts
  • censuscope.log – execution log

These files represent the final results of a CensuScope run. Additional intermediate files (e.g., random samples and BLAST outputs) are stored in separate subdirectories but are not typically used for downstream analysis.
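For downstream analysis, the tab-separated summary can be loaded with standard tooling. The column names used below (`tax_id`, `name`, `count`) are assumptions for illustration; see dockerDeployment.md for the actual file layout:

```python
# Sketch: loading taxonomy_table.tsv and computing relative abundances.
# The sample data and column names are hypothetical.
import csv, io

sample_tsv = ("tax_id\tname\tcount\n"
              "562\tEscherichia coli\t120\n"
              "9606\tHomo sapiens\t30\n")

def load_taxonomy_table(handle):
    """Parse a TSV of per-taxon counts and annotate each row's fraction."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    total = sum(int(r["count"]) for r in rows)
    for r in rows:
        r["fraction"] = int(r["count"]) / total
    return rows

rows = load_taxonomy_table(io.StringIO(sample_tsv))
```

In real use, `io.StringIO(sample_tsv)` would be replaced by an open handle to `results/taxonomy_table.tsv` in the run's output directory.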

For more details, see dockerDeployment.md.

About

CensuScope is designed and optimized for quick detection of the components of a given NGS metagenomic dataset, reporting them in a standard format at species or higher taxonomic resolution.
