Feature request: empirical sequencing-error-rate estimation

## Background

I'm the maintainer of [atropos](https://github.com/jdidion/atropos), another adapter-trimming tool that started as a cutadapt fork. I'm winding atropos down in favor of actively maintained tools like fastp, and I wanted to surface a few capabilities that users might miss in case they're of interest for fastp.

## Proposal

Add a pre-processing pass that estimates the empirical per-base sequencing error rate from the FASTQ input and surfaces that estimate to the user (and optionally feeds it into downstream thresholds such as `-n`/`--n_base_limit`, quality cutoffs, or adapter-match error tolerance).

Two methods are worth considering:

1. **Quality-based**: sum per-base `10^(-Q/10)` and divide by the base count. This is cheap, streams, and needs no calibration. It is useful as a sanity check, but it inherits any miscalibration in the quality scores.
2. **[Wang et al. 2012](https://doi.org/10.1186/1471-2105-13-185) "shadow regression"**: regress the number of mismatching reads against the number of unique reads across a range of read-length prefixes, then solve for the per-base error rate. Works on any set of reads without requiring alignment to a reference.
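
A minimal sketch of the quality-based method, to make the arithmetic concrete. It assumes Phred+33 encoding and strict four-line FASTQ records (no wrapped sequences). The function name `mean_error_rate` and its `phred_offset` parameter are made up for illustration; this is not atropos's or fastp's code.

```python
import io


def mean_error_rate(fastq_lines, phred_offset=33):
    """Estimate the per-base error rate as the mean of 10^(-Q/10).

    Assumes four-line FASTQ records and Phred+33 qualities by default.
    """
    total_error = 0.0
    total_bases = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 != 3:  # the quality string is the 4th line of each record
            continue
        for ch in line.rstrip("\r\n"):
            q = ord(ch) - phred_offset
            total_error += 10 ** (-q / 10)
            total_bases += 1
    if total_bases == 0:
        raise ValueError("no quality values found")
    return total_error / total_bases


# Two toy reads: one at Q40 ('I') and one at Q20 ('5').
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n5555\n")
rate = mean_error_rate(example)
print(f"{rate:.5f}")
```

Because the input is consumed one line at a time, the same loop works on a streamed file handle without holding reads in memory.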

### Why this is useful

Users tuning adapter-match stringency (`--adapter_fasta` tolerance, insert-match `diff_limit`, etc.) currently guess. An empirical baseline lets them pick thresholds that sit a defined distance above the platform's actual error floor, and it flags obviously degraded runs that would otherwise appear "clean" just because quality scores are high.
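
As a rough illustration of "a defined distance above the error floor", here is one way an estimated rate could be turned into a mismatch tolerance. It assumes errors are independent substitutions at a uniform per-base rate, so the mismatch count over a match of length `n` is binomial. The helper `mismatch_tolerance` and its `coverage` parameter are hypothetical and do not correspond to any existing fastp option.

```python
from math import comb


def mismatch_tolerance(error_rate, match_length, coverage=0.999):
    """Smallest k such that P(Binomial(match_length, error_rate) <= k) >= coverage.

    Assumes independent substitution errors at a uniform per-base rate.
    """
    if not 0.0 <= error_rate < 1.0:
        raise ValueError("error_rate must be in [0, 1)")
    cumulative = 0.0
    for k in range(match_length + 1):
        cumulative += (
            comb(match_length, k)
            * error_rate ** k
            * (1.0 - error_rate) ** (match_length - k)
        )
        if cumulative >= coverage:
            return k
    return match_length


# With a 1% per-base error rate and a 30 bp match:
print(mismatch_tolerance(0.01, 30, coverage=0.99))
print(mismatch_tolerance(0.01, 30, coverage=0.999))
```

A tolerance chosen this way accepts the stated fraction of true matches that carry only sequencing errors. It says nothing about false-positive matches, which would still need a separate check.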

### Prior art

- atropos `error` subcommand: https://github.com/jdidion/atropos/blob/master/atropos/commands/error/__init__.py (quality-based + shadow-regression; latter currently shells out to R)
- Wang 2012 paper above for the algorithm
- A pure-C++ reimplementation would be cleaner than atropos's current R dependency

Happy to help with tests/data if you decide to pursue this.
