Nextflow pipeline to group genes on polyploid phased assemblies that are orthologous and syntelogous based on GENESPACE results. This pipeline is part of the LongPolyASE framework for long-read RNA-seq allele-specific expressin analysis in polyploid organisms.
Requirements:
-
nextflow
-
conda
The following packages are not in bioconda/pip so need to be installed manually if running with --profile conda (for singularity this is not necessary):
- McxScan (follow instructions here and provide path to installation to --mcscanx_path)
- GENESPACE (instructions)(inside the conda environment genespace-env (syntelogfinder/modules/local/genespace/genespace_run/environment.yml))
- parameter file (params.json)
- genome fasta of phased reference (chromosome names like this: >chr[]01_1, >chr[]01_2 Where the _suffix is the haplotype )
- gff or gtf with CDS corresponding to the reference (same chromosome names!)
The gff file should look like this https://agat.readthedocs.io/en/latest/gff_to_gtf.html#the-gff-file-to-convert with the following features:
- gene
- mRNA/transcript
- exon
- CDS
Or a gtf file with the following features:
- gene
- mRNA/transcript
- exon
- CDS
Mandatory Attributes
- gene_id - must be present on ALL lines
- transcript_id - required for transcript, exon, CDS features
- Parent - links child features to parent
The parms.json should look like this
{
"reference_fasta": "genome.fa",
"reference_gff": "annotation.gff",
"ploidy": 3,
"outdir": "output_path"
}
Run like this:
git clone https://github.com/NIB-SI/syntelogfinder.git
cd syntelogfinder
nextflow run main.nf -params-file params/params.json \
-profile singularity \
-resume
or with conda
nextflow run main.nf -params-file params/params.json \
-profile singularity \
--mcscanx_path [path to McScaX installation]
-resume
A test dataset is available for testing and demonstration purposes. This dataset contains a phased genome assembly and annotation for chromosome 1 across all haplotypes of the tetraploid potato cultivar Atlantic.
Then the params_test.json should look like this:
{
"reference_fasta": "{download_dir}/ATL_v3.asm.chr01_all_haplotypes.fa",
"reference_gff": "{download_dir}/ATL_unitato_liftoff.chr01_all_haplotypes.gtf",
"ploidy": 4,
"outdir": "output_path"
}After downloading the fasta and gtf file and preperation of the parameter file the pipeline can be run like this:
git clone https://github.com/NIB-SI/syntelogfinder.git --branch v1.0.0
cd syntelogfinder
conda create -n nextflow -c bioconda nextflow
conda activate nextflow
nextflow run main.nf \
-params-file params/params_test.json \
-profile singularity \
--run_blast \
-resume Expected runtime: 10 minutes (if all singularity images are already pulled)
- Running syntelogfinder on phased reference of diplpid rice
- Running syntelogfinder on phased reference of haxaploid wheat
Here's the formatted sample output for your README:
The pipeline generates a tab-separated file with the following columns:
| Column | Description |
|---|---|
gene_id |
Gene identifier |
transcript_id |
Transcript identifier |
Synt_id |
Synteny group identifier |
synteny_category |
Summary of syntenic gene distribution across haplotypes |
syntenic_genes |
Comma-separated list of all syntenic genes |
haplotype |
Haplotype assignment |
CDS_length_category |
CDS length classification (if applicable) |
CDS_haplotype_with_longest_annotation |
Haplotype with the longest CDS annotation (if applicable) |
gene_id transcript_id Synt_id synteny_category syntenic_genes haplotype CDS_length_category CDS_haplotype_with_longest_annotation
TraesAK58CH7A01G122800 TraesAK58CH7A01G122800.1 Synt_id_0 1hapA_3hapB_1hapD_no_s TraesAK58CH7A01G122800.1,TraesAK58CH1B01G017800.1,TraesAK58CH4B01G024800.1,TraesAK58CH2B01G118200.1,TraesAK58CH2D01G119400.1 hapA
TraesAK58CH1A01G005100 TraesAK58CH1A01G005100.1 Synt_id_1 2hapA_1hapB_2hapD_no_s TraesAK58CH1A01G005100.1,TraesAK58CH3A01G490400.1,TraesAK58CH1B01G017500.1,TraesAK58CH1D01G000500.1,TraesAK58CH7D01G525700.1 hapA
TraesAK58CH3A01G490400 TraesAK58CH3A01G490400.1 Synt_id_1 2hapA_1hapB_2hapD_no_s TraesAK58CH1A01G005100.1,TraesAK58CH3A01G490400.1,TraesAK58CH1B01G017500.1,TraesAK58CH1D01G000500.1,TraesAK58CH7D01G525700.1 hapA
TraesAK58CH3A01G236000 TraesAK58CH3A01G236000.1 Synt_id_2 1hapA_1hapB_2hapD_no_s TraesAK58CH3A01G236000.1,TraesAK58CH1B01G017200.1,TraesAK58CH1D01G000900.1,TraesAK58CH5D01G521100.1 hapA
TraesAK58CH1A01G006000 TraesAK58CH1A01G006000.1 Synt_id_3 3hapA_0hapB_1hapD_no_s TraesAK58CH1A01G006000.1,TraesAK58CH3A01G436000.1,TraesAK58CH5A01G002300.1,<NA>,TraesAK58CH1D01G365400.1 hapA
TraesAK58CH3A01G436000 TraesAK58CH3A01G436000.1 Synt_id_3 3hapA_0hapB_1hapD_no_s TraesAK58CH1A01G006000.1,TraesAK58CH3A01G436000.1,TraesAK58CH5A01G002300.1,<NA>,TraesAK58CH1D01G365400.1 hapAKey Features:
- Each gene is assigned to a synteny group (
Synt_id) - The
synteny_categoryshows the distribution pattern (e.g.,2hapA_1hapB_2hapD_no_smeans 2 genes in hapA, 1 in hapB, 2 in hapD, with no specific pattern) - Missing syntenic genes are indicated with
<NA> - All syntenic gene members are listed in the
syntenic_genescolumn
- if GENESPACE process is interuped, running with -resume flag will fail. To chace the other processes, delete the genespace work dir before resuming
