Fix tximport gene-level crash with mismatched transcript FASTA/GTF#11141
Open
pinin4fjords wants to merge 5 commits into master
…tched When users provide a --transcript_fasta with transcripts not present in the GTF, tximport creates fake gene entries (gene_id = transcript_id) that break downstream SummarizedExperiment construction. This filters those entries from gene-level outputs and emits a warning so users can identify the FASTA/GTF inconsistency. Fixes nf-core/rnaseq#1773 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
summarizeToGene returns a list containing both matrices and a scalar string (countsFromAbundance). Only filter rows on matrix elements. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Check that unmapped transcripts are filtered from gene-level output (gene rows < transcript rows) and that the warning appears in stderr. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jfy133
reviewed
Apr 9, 2026
@@ -138,11 +138,11 @@
]
Member
Unless I'm being blind, I don't see the new test represented in this snapshot?
Member
Author
The new test doesn't use snapshots; it asserts that, e.g., we get the warning message we expect.
Problem
When users provide a --transcript_fasta that contains transcripts not present in the GTF, tximport's read_transcript_info() function adds these unmatched transcripts as "extras" with gene_id = transcript_id (i.e., the transcript ID is used as a fake gene ID). After summarizeToGene(), these fake gene entries end up in the gene-level count matrices alongside real genes.
The downstream SummarizedExperiment process then receives these gene count matrices (containing a mix of real ENSG gene IDs and fake ENST "gene" IDs) along with the original tx2gene.tsv as row metadata. findColumnWithAllEntries() tries to find a single column in the tx2gene that contains ALL gene IDs from the count matrix, but no column matches: the gene_id column has only real gene IDs, and the transcript_id column has only real transcript IDs. With no matching column, the process fails with an error.
This commonly happens when users provide --transcript_fasta from a different source or version than their --gtf.
Fix
Warning: When unmatched transcripts are detected, emit a warning with the count and the first 5 IDs, so users understand the FASTA/GTF inconsistency.
Filter fake genes: After summarizeToGene(), remove the fake gene entries (where gene_id was set to the transcript ID) from the gene-level output. The extras are still needed as input to summarizeToGene() (it errors if transcripts are missing from the tx2gene), but their resulting fake gene rows are stripped before writing output files.
Existing behavior is unchanged when FASTA and GTF are consistent (no extras, no filtering).
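The two steps of the fix can be sketched in Python. This is an illustration of the logic only, not the module's R code: the function names, the warning wording, and the use of a dict as a stand-in for a row-named matrix are all assumptions made for the example.

```python
import sys


def find_extras(quant_tx_ids, tx2gene):
    """Return transcripts that were quantified but are missing from tx2gene."""
    return [tx for tx in quant_tx_ids if tx not in tx2gene]


def format_warning(extras, n_show=5):
    """Build a warning with the count and the first few IDs (hypothetical wording)."""
    if not extras:
        return None
    shown = ", ".join(extras[:n_show])
    return (f"WARNING: {len(extras)} transcript(s) in the FASTA are absent "
            f"from the GTF, e.g. {shown}")


def drop_fake_genes(gene_result, extras):
    """Strip rows keyed by an 'extra' transcript ID from matrix-like elements.

    gene_result mimics the list returned by gene-level summarization:
    matrices (here, dicts of row ID -> values) plus a scalar string,
    which must pass through untouched.
    """
    fake = set(extras)
    filtered = {}
    for name, value in gene_result.items():
        if isinstance(value, dict):  # matrix stand-in: filter its rows
            filtered[name] = {g: row for g, row in value.items() if g not in fake}
        else:  # scalar such as countsFromAbundance: keep as-is
            filtered[name] = value
    return filtered


# Usage with made-up IDs: ENST9 has no entry in tx2gene, so it becomes a fake gene.
tx2gene = {"ENST1": "ENSG1", "ENST2": "ENSG1"}
extras = find_extras(["ENST1", "ENST2", "ENST9"], tx2gene)
message = format_warning(extras)
if message:
    print(message, file=sys.stderr)
gene_level = drop_fake_genes(
    {"counts": {"ENSG1": [10.0], "ENST9": [3.0]}, "countsFromAbundance": "no"},
    extras,
)
```

Filtering only the matrix-like elements mirrors the second commit above: the gene-level result also carries a scalar string, and row filtering does not apply to it.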
Fixes nf-core/rnaseq#1773
Test plan
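mismatched_transcripts test: truncates tx2gene to simulate a FASTA/GTF mismatch, then asserts that unmapped transcripts are filtered from the gene-level output (gene rows < transcript rows) and that the warning appears in .command.err
🤖 Generated with Claude Code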
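As a rough illustration of what the test plan checks, the Python sketch below truncates a toy tx2gene table and shows that a single-column lookup fails while fake gene rows are present and succeeds once they are removed. The lookup function is a guess at the behavior the description attributes to findColumnWithAllEntries(), not its actual implementation, and all IDs are made up.

```python
def find_column_with_all_entries(ids, table):
    """Return the first column whose values cover every ID, else None (assumed behavior)."""
    columns = list(table[0].keys()) if table else []
    for column in columns:
        values = {row[column] for row in table}
        if all(i in values for i in ids):
            return column
    return None


tx2gene_full = [
    {"transcript_id": "ENST1", "gene_id": "ENSG1"},
    {"transcript_id": "ENST2", "gene_id": "ENSG1"},
    {"transcript_id": "ENST3", "gene_id": "ENSG2"},
]

# Simulate the FASTA/GTF mismatch by dropping the last mapping row.
tx2gene = tx2gene_full[:-1]
quantified = ["ENST1", "ENST2", "ENST3"]

known = {row["transcript_id"] for row in tx2gene}
extras = [tx for tx in quantified if tx not in known]

# Gene-level row IDs: the real gene plus one fake gene per extra transcript.
gene_rows_before = ["ENSG1"] + extras
gene_rows_after = [g for g in gene_rows_before if g not in set(extras)]

# Before filtering, no single column covers every row ID; afterwards gene_id does.
lookup_before = find_column_with_all_entries(gene_rows_before, tx2gene)
lookup_after = find_column_with_all_entries(gene_rows_after, tx2gene)
```

The final comparison in the test plan (gene rows < transcript rows) corresponds here to `len(gene_rows_after) < len(quantified)`.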