Fix tximport gene-level crash with mismatched transcript FASTA/GTF#11141

Open
pinin4fjords wants to merge 5 commits into master from fix/tximport-unmapped-transcripts


pinin4fjords commented Apr 9, 2026

Problem

When users provide a --transcript_fasta that contains transcripts not present in the GTF, tximport's read_transcript_info() function adds these unmatched transcripts as "extras" with gene_id = transcript_id (i.e., the transcript ID is used as a fake gene ID). After summarizeToGene(), these fake gene entries end up in the gene-level count matrices alongside real genes.

The downstream SummarizedExperiment process then receives these gene count matrices (containing a mix of real ENSG gene IDs and fake ENST "gene" IDs) along with the original tx2gene.tsv as row metadata. findColumnWithAllEntries() tries to find a single column in the tx2gene that contains ALL gene IDs from the count matrix, but no column matches: the gene_id column has only real gene IDs, and the transcript_id column has only real transcript IDs. This produces the error:

Error in findColumnWithAllEntries(ids, metadata) :
  No column contains all vector entries ENSG00000000003.14, ...

This commonly happens when users provide --transcript_fasta from a different source or version than their --gtf.
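
A minimal sketch of why the lookup fails, written in Python purely for illustration (the module itself is R). `find_column_with_all_entries` is a hypothetical stand-in for `findColumnWithAllEntries()`, not its actual implementation, and the shortened IDs are made up:

```python
# Hypothetical stand-in for findColumnWithAllEntries(): return the name of the
# first metadata column whose values cover every ID, or raise if none does.
def find_column_with_all_entries(ids, metadata):
    for column, values in metadata.items():
        if set(ids) <= set(values):
            return column
    raise ValueError("No column contains all vector entries " + ", ".join(ids))


# tx2gene as derived from the GTF: it only knows GTF transcripts and genes.
tx2gene = {
    "transcript_id": ["ENST0001", "ENST0002"],
    "gene_id": ["ENSG0001", "ENSG0001"],
}

# Gene-level row names when the FASTA carried one extra transcript (ENST0099)
# that was assigned gene_id = transcript_id.
gene_rows = ["ENSG0001", "ENST0099"]

try:
    find_column_with_all_entries(gene_rows, tx2gene)
except ValueError as err:
    print(err)
```

Neither column can cover a row set that mixes real gene IDs with a transcript ID absent from the GTF, so a single unmatched transcript is enough to trigger the error.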

Fix

  1. Warning: When unmatched transcripts are detected, emit a warning with the count and first 5 IDs, so users understand the FASTA/GTF inconsistency.

  2. Filter fake genes: After summarizeToGene(), remove the fake gene entries (where gene_id was set to the transcript ID) from the gene-level output. The extras are still needed as input to summarizeToGene() (it errors if transcripts are missing from the tx2gene), but their resulting fake gene rows are stripped before writing output files.

Existing behavior is unchanged when FASTA and GTF are consistent (no extras, no filtering).
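
The two steps can be sketched as follows. This is a Python illustration of the logic under stated assumptions, not the R code in this PR: the function names, the warning wording, and the dict-based stand-in for the `summarizeToGene()` result are all hypothetical. It also reflects the detail from a later commit that the result contains a scalar string (`countsFromAbundance`) next to the matrices, so only matrix-like elements are row-filtered.

```python
import sys


def find_extras(transcript_ids, tx2gene):
    """Transcripts quantified from the FASTA but absent from the GTF-derived mapping."""
    return [t for t in transcript_ids if t not in tx2gene]


def warn_unmatched(extras, limit=5):
    """Emit a warning with the count and the first few IDs (wording is illustrative)."""
    if extras:
        shown = ", ".join(extras[:limit])
        print(
            f"Warning: {len(extras)} transcript(s) not found in the GTF; first IDs: {shown}",
            file=sys.stderr,
        )


def drop_fake_genes(summary, fake_ids):
    """Remove fake gene rows from matrix-like elements; pass scalars through untouched."""
    fake = set(fake_ids)
    filtered = {}
    for name, value in summary.items():
        if isinstance(value, dict):  # stand-in for a matrix: row name -> values
            filtered[name] = {row: v for row, v in value.items() if row not in fake}
        else:  # e.g. the countsFromAbundance string
            filtered[name] = value
    return filtered


# Made-up example: one FASTA transcript (ENST0099) is missing from the GTF.
tx2gene = {"ENST0001": "ENSG0001", "ENST0002": "ENSG0001"}
quantified = ["ENST0001", "ENST0002", "ENST0099"]

extras = find_extras(quantified, tx2gene)
warn_unmatched(extras)

summary = {
    "counts": {"ENSG0001": [10.0], "ENST0099": [3.0]},
    "abundance": {"ENSG0001": [5.0], "ENST0099": [1.5]},
    "countsFromAbundance": "no",
}
filtered = drop_fake_genes(summary, extras)
```

When FASTA and GTF agree, `extras` is empty, no warning is printed, and the filter returns the summary unchanged, which matches the unchanged behavior described above.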

Fixes nf-core/rnaseq#1773

Test plan

  • All 8 existing tximport tests pass unchanged
  • New mismatched_transcripts test: truncates tx2gene to simulate FASTA/GTF mismatch, asserts:
    • Process succeeds (previously would crash downstream)
    • Gene count rows < transcript count rows (proves fake genes were filtered)
    • Warning message appears in .command.err
  • Verified the mismatch test fails against unfixed code (gene count == transcript count, no warning)

🤖 Generated with Claude Code

…tched

When users provide a --transcript_fasta with transcripts not present in
the GTF, tximport creates fake gene entries (gene_id = transcript_id)
that break downstream SummarizedExperiment construction. This filters
those entries from gene-level outputs and emits a warning so users can
identify the FASTA/GTF inconsistency.

Fixes nf-core/rnaseq#1773

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the size/s label Apr 9, 2026
pinin4fjords and others added 2 commits April 9, 2026 12:04
summarizeToGene returns a list containing both matrices and a scalar
string (countsFromAbundance). Only filter rows on matrix elements.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the size/m label Apr 9, 2026
Check that unmapped transcripts are filtered from gene-level output
(gene rows < transcript rows) and that the warning appears in stderr.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -138,11 +138,11 @@
]
Member

Unless I'm being blind, I don't see the new test represented in this snapshot?

Member Author

The new test doesn't use snapshots; it asserts on behavior directly, e.g. that we get the warning message we expect.


Development

Successfully merging this pull request may close these issues.

Error with findColumnWithAllEntries(ids, metadata) in summarizedExperiment
