Feature request: empirical sequencing-error-rate estimation #688

@jdidion

Background

I'm the maintainer of atropos, another adapter-trimming tool that started as a cutadapt fork. I'm winding atropos down in favor of actively-maintained tools like fastp and wanted to surface a few capabilities that users might miss, in case they're interesting for fastp.

Proposal

Add a pre-processing pass that estimates the empirical per-base sequencing error rate from the FASTQ input and surfaces that estimate to the user (and optionally feeds it into downstream thresholds such as -n/--n_base_limit, quality cutoffs, or adapter-match error tolerance).

Two methods are worth considering:

  1. Quality-based: sum 10^(-Q/10) over all bases and divide by the base count. Cheap, streaming, and needs no calibration. Useful as a sanity check, but it inherits any miscalibration in the reported quality scores.
  2. Wang et al. 2012 "shadow regression": regress the number of mismatching reads against the number of unique reads across a range of read-length prefixes, then solve for the per-base error rate. Works on any set of reads without requiring alignment to a reference.
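
To make method 1 concrete, here is a minimal Python sketch, not a proposal for how fastp should implement it. It assumes uncompressed 4-line FASTQ records and Phred+33 encoding; the function name `mean_error_rate` and the `phred_offset` parameter are made up for illustration.

```python
import io
from typing import TextIO


def mean_error_rate(fastq: TextIO, phred_offset: int = 33) -> float:
    """Mean per-base error probability implied by the quality strings."""
    total_error = 0.0
    total_bases = 0
    for line_number, line in enumerate(fastq):
        # Assumes strict 4-line records: the quality string is the 4th line.
        if line_number % 4 != 3:
            continue
        for char in line.rstrip("\r\n"):
            q = ord(char) - phred_offset
            total_error += 10 ** (-q / 10)  # Phred score -> error probability
            total_bases += 1
    if total_bases == 0:
        raise ValueError("no bases found in input")
    return total_error / total_bases


# One toy read with qualities Q40, Q40, Q20, Q10.
example = io.StringIO("@r1\nACGT\n+\nII5+\n")
print(mean_error_rate(example))
```

This streams the input once and keeps only two accumulators, which is why it could run alongside existing per-read QC at little extra cost.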

Why this is useful

Users tuning adapter-match stringency (--adapter_fasta tolerance, insert-match diff_limit, etc.) currently have to guess. An empirical baseline lets them pick thresholds that sit a defined distance above the platform's actual error floor. It also flags degraded runs that would otherwise look "clean" only because the reported quality scores are high.
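
As one possible reading of "a defined distance above the error floor", the sketch below turns an estimated per-base error rate into a mismatch allowance. It assumes errors are independent and uniform across the matched region (a simple binomial model), which real data will only approximate. `min_mismatch_allowance` and its `coverage` parameter are hypothetical names, and nothing here reflects how fastp currently derives its thresholds.

```python
from math import comb


def min_mismatch_allowance(error_rate: float, match_length: int,
                           coverage: float = 0.99) -> int:
    """Smallest k with P(at most k errors in match_length bases) >= coverage."""
    cumulative = 0.0
    for k in range(match_length + 1):
        # Binomial probability of exactly k errors among match_length bases.
        cumulative += (comb(match_length, k)
                       * error_rate ** k
                       * (1 - error_rate) ** (match_length - k))
        if cumulative >= coverage:
            return k
    return match_length


# Allowance for a 30-base match at two assumed error rates.
for rate in (0.001, 0.01):
    print(rate, min_mismatch_allowance(rate, 30))
```

A user would then set the tolerance at or slightly above the returned value, so that true adapter matches are rarely rejected because of sequencing errors alone.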

Prior art

Happy to help with tests/data if you decide to pursue this.
