Background
I'm the maintainer of atropos, another adapter-trimming tool that started as a cutadapt fork. I'm winding atropos down in favor of actively-maintained tools like fastp and wanted to surface a few capabilities that users might miss, in case they're interesting for fastp.
Proposal
Add a pre-processing pass that estimates the empirical per-base sequencing error rate from the FASTQ input and surfaces that estimate to the user (and optionally feeds it into downstream thresholds such as -n/--n_base_limit, quality cutoffs, or adapter-match error tolerance).
Two methods are worth considering:
- Quality-based: sum per-base 10^(-Q/10) and divide by base count — cheap, streamable, no calibration required. Useful as a sanity check but inflated by any quality-score miscalibration.
- Wang et al. 2012 "shadow regression": regress the number of mismatching reads against the number of unique reads across a range of read-length prefixes, then solve for the per-base error rate. Works on any set of reads without requiring alignment to a reference.
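The quality-based method is simple enough to sketch directly. A minimal illustration of the formula above (function and variable names are mine, not from any existing tool):

```python
def mean_error_from_quals(qual_strings, offset=33):
    """Mean per-base error implied by ASCII-encoded Phred quality strings.

    Sums 10^(-Q/10) over every base and divides by the base count,
    as described in the quality-based method above. Assumes Phred+33
    encoding by default.
    """
    total_err = 0.0
    n_bases = 0
    for qual in qual_strings:
        for ch in qual:
            q = ord(ch) - offset
            total_err += 10.0 ** (-q / 10.0)
            n_bases += 1
    return total_err / n_bases if n_bases else 0.0
```

Because it only accumulates two running totals, it streams over a FASTQ file in one pass with constant memory.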
Why this is useful
Users tuning adapter-match stringency (--adapter_fasta tolerance, insert-match diff_limit, etc.) currently guess. An empirical baseline lets them pick thresholds that sit a defined distance above the platform's actual error floor — and flags obviously-degraded runs that would otherwise appear as "clean" just because Qs are high.
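To make "a defined distance above the error floor" concrete: the standard Phred relationship Q = -10·log10(e) converts an empirical error rate into a quality-scale floor the user can reason against. A minimal sketch (the specific margins in the comment are illustrative choices, not recommended defaults):

```python
import math

def phred_from_error(err):
    """Standard Phred relationship: Q = -10 * log10(error rate)."""
    return -10.0 * math.log10(err)

# An empirical floor of 0.1% error corresponds to Q30. A user could then
# place a quality cutoff a few Phred units below that floor, or set an
# adapter-match mismatch tolerance at some multiple of the floor — the
# margin itself is a per-dataset choice.
floor_q = phred_from_error(0.001)  # 30.0
```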
Prior art
atropos has an error subcommand implementing both methods: https://github.com/jdidion/atropos/blob/master/atropos/commands/error/__init__.py (quality-based + shadow regression; the latter currently shells out to R).
Happy to help with tests/data if you decide to pursue this.