Update pipeline for the reference segment DB for FluPipe (https://github.com/rki-mf1/FluPipe)
The pipeline is written in Nextflow and runs on any POSIX-compatible system (Linux, macOS, etc.); Windows is supported through WSL. You need Nextflow and conda installed to run the steps of the pipeline:
- Install Nextflow (bash one-liner):

  ```bash
  wget -qO- https://get.nextflow.io | bash
  # In case you don't have wget:
  # curl -s https://get.nextflow.io | bash
  ```
- Install conda (bash two-liner for Miniconda3 Linux 64-bit):

  ```bash
  wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  bash Miniconda3-latest-Linux-x86_64.sh
  ```
OR
- Install conda (bash two-liner for Miniconda3 Linux 64-bit):

  ```bash
  wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  bash Miniconda3-latest-Linux-x86_64.sh
  ```
- Install Nextflow via conda:

  ```bash
  conda create -n nextflow -c bioconda nextflow
  conda activate nextflow
  ```
All other dependencies and tools will be installed within the pipeline via conda.
```bash
# conda activate nextflow
nextflow run rki-mf1/inv-update-segment-db \
    -profile local,mamba \
    --input_segments '/path/to/fastas/*.fasta' \
    --input_metadata '/path/to/metadata_tables/*.xls' \
    --references '/path/to/segment/references/*.fasta' \
    --intermediate
```

The assumptions about the input FASTA headers and input metadata are quite tailored to GISAID/EpiFlu data.
Assumed header: a variable number of fields separated by `|`, where
- one field has to be one of `["HA", "MP", "NA", "NP", "NS", "PA", "PB1", "PB2"]`, and
- one field has to match an isolate ID: `^EPI_ISL_\d+$`
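As a quick sanity check, the two header assumptions above can be tested with standard POSIX tools. This is only an illustrative sketch, not part of the pipeline, and the header value is hypothetical:

```shell
#!/bin/sh
# Hypothetical FASTA header used only for illustration
header='A/Hypothetical/1/2024|EPI_ISL_123456|HA'

# One field must be a known segment name ...
seg=$(printf '%s\n' "$header" | tr '|' '\n' | grep -cxE 'HA|MP|NA|NP|NS|PA|PB1|PB2')
# ... and one field must match ^EPI_ISL_\d+$
iso=$(printf '%s\n' "$header" | tr '|' '\n' | grep -cxE 'EPI_ISL_[0-9]+')

if [ "$seg" -ge 1 ] && [ "$iso" -ge 1 ]; then
    echo "header OK"
else
    echo "header invalid"
fi
```

`grep -x` forces whole-field matches, so a field like `NAME` is not mistaken for the segment `NA`.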
- Required fields in `--input_metadata`: `Isolate_Id`, `Isolate_Name`, `Subtype`, `Lineage`
- `Isolate_Id` in the table needs to match the isolate ID in the `--input_segments` FASTA header (see above)
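A minimal sketch of that matching requirement, assuming the metadata has already been exported to TSV (the pipeline itself reads the Excel files); all IDs and file names below are hypothetical:

```shell
#!/bin/sh
# Hypothetical inputs for illustration only
printf 'Isolate_Id\tIsolate_Name\tSubtype\tLineage\n' > metadata.tsv
printf 'EPI_ISL_123456\tA/Hypothetical/1/2024\tH3N2\tNone\n' >> metadata.tsv
cat > segments.fasta <<'EOF'
>A/Hypothetical/1/2024|EPI_ISL_123456|HA
ATGAAA
EOF

# Every isolate ID found in a FASTA header must have a metadata row
missing=$(grep '^>' segments.fasta \
    | grep -oE 'EPI_ISL_[0-9]+' \
    | while read -r id; do
          grep -q "^$id" metadata.tsv || echo "$id"
      done)
[ -z "$missing" ] && echo "all isolate IDs have metadata"
```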
- Required prefix for `--references` files: `segment_`, where segment is one of `["HA", "MP", "NA", "NP", "NS", "PA", "PB1", "PB2"]`
- e.g. `HA_reference.fasta`
The output FASTA headers are renamed in the following pattern:

```
>{kraken}_{isolate_name}|{lineage}|{isolate_id}|{direction}|{subtype}|{segment}
```

where `kraken` is `kraken:taxid|11320` in the case of Influenza A and `kraken:taxid|11520` in the case of Influenza B; and `direction` is `rc` (reverse complement) or `f` (forward) compared to the respective reference (`--references`).
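The renaming pattern can be illustrated like this; every field value below is a hypothetical example, not output from the pipeline:

```shell
#!/bin/sh
# Hypothetical field values illustrating the output header pattern
kraken='kraken:taxid|11320'          # Influenza A; 11520 would mean Influenza B
isolate_name='A/Hypothetical/1/2024'
lineage='None'
isolate_id='EPI_ISL_123456'
direction='f'                        # forward relative to the segment reference
subtype='H3N2'
segment='HA'

header=$(printf '>%s_%s|%s|%s|%s|%s|%s' \
    "$kraken" "$isolate_name" "$lineage" "$isolate_id" "$direction" "$subtype" "$segment")
echo "$header"
```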
1. `--input_metadata` files are concatenated.
2. `--input_segments` are
   1. concatenated,
   2. name duplicates are removed,
   3. split into 8 FASTA files, one for each segment (based on the FASTA header), and
   4. filtered based on `seqkit fx2tab` stats.
3. Each reference file corresponding to one segment is
   1. aligned with MAFFT, and
   2. the filtered segments from step 2.4 are added to the MAFFT alignment.
4. Reverse complementary segments are reverse complemented.
5. Filtered and reverse complemented segments are renamed.
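The per-segment split (step 2.3) can be sketched with plain awk; the input records and output file names here are hypothetical, and the pipeline's actual implementation may differ:

```shell
#!/bin/sh
# Hypothetical concatenated input
cat > all_segments.fasta <<'EOF'
>A/Hypothetical/1/2024|EPI_ISL_111111|HA
ATGAAA
>A/Hypothetical/2/2024|EPI_ISL_222222|NA
ATGCCC
EOF

# Route each record to <SEGMENT>.fasta based on the segment field in its header
awk -F'|' '
    /^>/ {
        out = "unknown.fasta"
        for (i = 1; i <= NF; i++)
            if ($i ~ /^(HA|MP|NA|NP|NS|PA|PB1|PB2)$/) out = $i ".fasta"
    }
    { print > out }
' all_segments.fasta
```

Each sequence line is written to the file chosen at its preceding header, so records stay intact.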
The complete help message:
```
Usage example:
nextflow run rki-mf1/inv-update-segment-db \
    --input_segments '/path/to/fastas/*.fasta' \
    --input_metadata '/path/to/metadata_tables/*.xls' \
    --references '/path/to/segment/references/*.fasta'

Required parameters:
--input_segments    (Multi) FASTA file(s)
                    Assumed header: fields separated by `|`, where
                    - one field has to be one of ["HA", "MP", "NA", "NP", "NS", "PA", "PB1", "PB2"], AND
                    - one field has to match an isolate ID: `^EPI_ISL_\d+$`
--input_metadata    Excel table(s) containing metadata for --input_segments
                    Required fields: `Isolate_Id`, `Isolate_Name`, `Subtype`, `Lineage`
                    `Isolate_Id` in the table needs to match `isolate_id` in the input_segments FASTA header
--references        8 (multi) FASTA files containing reference sequences for each segment. The references are used
                    to check if the input_segments are reverse complementary compared to the references.
                    Required prefix: `segment_`, where segment is one of [HA, MP, NA, NP, NS, PA, PB1, PB2],
                    e.g. HA_reference.fasta

Optional parameters:
--output            The output directory where the results will be saved [default: results]
--intermediate      Also publish intermediate results [default: false]

Computing options:
--max_cpus          Maximum number of CPUs that can be requested for any single job [default: 20]
--max_memory        Maximum amount of memory that can be requested for any single job [default: 128.GB]
--max_time          Maximum amount of time that can be requested for any single job [default: 16.h]

For Nextflow options, see https://www.nextflow.io/docs/latest/cli.html#options and https://www.nextflow.io/docs/latest/cli.html#run

Execution/engine profiles:
The pipeline supports profiles to run via different executors and engines, e.g.: -profile local,mamba

Executor (choose one):
  local
  slurm

Engines (choose one):
  conda
  mamba

By default, -profile slurm,mamba is executed.
```
A list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This project was supported by co-funding from the European Union’s EU4Health programme under project no. 101113012 (IMS-HERA2).