syntelogfinder

Nextflow pipeline that groups orthologous and syntelogous genes on phased polyploid assemblies, based on GENESPACE results. This pipeline is part of the LongPolyASE framework for long-read RNA-seq allele-specific expression analysis in polyploid organisms.

Getting started

Requirements:

nextflow
conda

The following packages are not available from bioconda or pip, so they need to be installed manually when running with -profile conda (this is not necessary with singularity):

MCScanX (follow instructions here and provide the path to the installation via --mcscanx_path)
GENESPACE (instructions), installed inside the conda environment genespace-env (syntelogfinder/modules/local/genespace/genespace_run/environment.yml)

minimal input:

parameter file (params.json)
genome FASTA of the phased reference (chromosome names like this: >chr[]01_1, >chr[]01_2, where the _suffix is the haplotype)
GFF or GTF with CDS features corresponding to the reference (same chromosome names!)
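
The naming convention above can be checked before starting a run. The following is a minimal sketch, not part of the pipeline: the function names are hypothetical, and it assumes that the haplotype is whatever follows the last underscore of each sequence name and that every chromosome is expected once per haplotype.

```python
from collections import defaultdict


def haplotypes_per_chromosome(fasta_lines):
    """Group FASTA sequence names by base chromosome and haplotype suffix.

    Assumption: every sequence name ends in `_<haplotype>` (e.g. chr01_1).
    Names that do not fit this shape are returned separately.
    """
    groups = defaultdict(set)
    bad_names = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if not fields:
            continue
        # The sequence name is the first whitespace-delimited token of the header.
        name = fields[0]
        base, sep, hap = name.rpartition("_")
        if not sep or not base or not hap:
            bad_names.append(name)
        else:
            groups[base].add(hap)
    return dict(groups), bad_names


def check_ploidy(groups, ploidy):
    """Return chromosomes whose number of haplotype copies differs from ploidy."""
    return {base: sorted(haps) for base, haps in groups.items() if len(haps) != ploidy}


# Toy headers for illustration only; in practice pass an open FASTA file handle.
example = [">chr01_1", "ACGT", ">chr01_2 some description", "ACGT", ">scaffold7", "ACGT"]
groups, bad = haplotypes_per_chromosome(example)
print(groups)
print(bad)
print(check_ploidy(groups, 4))
```

The same check can be run on the first column of the GFF or GTF file to confirm that both files use identical chromosome names.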

The GFF file should look like the example at https://agat.readthedocs.io/en/latest/gff_to_gtf.html#the-gff-file-to-convert and contain the following features:

gene
mRNA/transcript
exon
CDS

Alternatively, provide a GTF file with the following features:

gene
mRNA/transcript
exon
CDS

Mandatory Attributes

gene_id - must be present on ALL lines
transcript_id - required for transcript, exon, CDS features
Parent - links child features to parent
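
The attribute requirements listed above can be sketched as a small pre-flight check. This is an illustration under stated assumptions, not the pipeline's own validation: the helper names are hypothetical, the parser handles only simple GTF (`key "value";`) and GFF3 (`key=value;`) attribute columns without semicolons inside values, and the Parent relationship is not verified.

```python
import re

# Features that need a transcript_id according to the list above.
NEEDS_TRANSCRIPT_ID = {"mRNA", "transcript", "exon", "CDS"}


def parse_attributes(column):
    """Parse column 9 in either GTF (key "value";) or GFF3 (key=value;) style."""
    attrs = {}
    for field in column.strip().strip(";").split(";"):
        field = field.strip()
        if not field:
            continue
        if "=" in field and not re.match(r'^\S+\s+"', field):
            key, value = field.split("=", 1)      # GFF3 style
        else:
            key, _, value = field.partition(" ")  # GTF style
        attrs[key.strip()] = value.strip().strip('"')
    return attrs


def missing_attributes(gff_lines):
    """Return (line_number, problem) pairs for lines lacking required attributes."""
    problems = []
    for number, line in enumerate(gff_lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            problems.append((number, "fewer than 9 columns"))
            continue
        feature, attrs = columns[2], parse_attributes(columns[8])
        if "gene_id" not in attrs:
            problems.append((number, "gene_id"))
        if feature in NEEDS_TRANSCRIPT_ID and "transcript_id" not in attrs:
            problems.append((number, "transcript_id"))
    return problems


# Made-up annotation lines for illustration only.
lines = [
    'chr01_1\tsrc\tgene\t1\t900\t.\t+\t.\tgene_id "G1";',
    'chr01_1\tsrc\ttranscript\t1\t900\t.\t+\t.\tgene_id "G1"; transcript_id "G1.1";',
    'chr01_1\tsrc\tCDS\t1\t300\t.\t+\t0\tgene_id "G1";',
    'chr01_1\tsrc\texon\t1\t300\t.\t+\t.\tID=e1;Parent=G1.1;transcript_id=G1.1',
]
problems = missing_attributes(lines)
print(problems)
```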

The params.json should look like this:

{
    "reference_fasta": "genome.fa",
    "reference_gff": "annotation.gff",
    "ploidy": 3,
    "outdir": "output_path"
}
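
A parameter file of this shape can be sanity-checked before launching Nextflow. The sketch below is a hypothetical helper, not part of the pipeline; it assumes that the four keys shown above are all required and that ploidy must be a positive integer.

```python
import json

# Keys taken from the example above; treating all four as required is an assumption.
REQUIRED_KEYS = ("reference_fasta", "reference_gff", "ploidy", "outdir")


def check_params(text):
    """Return a list of problems found in a params.json document."""
    params = json.loads(text)
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in params]
    ploidy = params.get("ploidy")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if "ploidy" in params and (
        isinstance(ploidy, bool) or not isinstance(ploidy, int) or ploidy < 1
    ):
        problems.append("ploidy should be a positive integer")
    return problems


good = '{"reference_fasta": "genome.fa", "reference_gff": "annotation.gff", "ploidy": 3, "outdir": "output_path"}'
bad = '{"reference_fasta": "genome.fa", "ploidy": "3"}'
print(check_params(good))
print(check_params(bad))
```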

Run like this:

git clone https://github.com/NIB-SI/syntelogfinder.git

cd syntelogfinder

nextflow run main.nf -params-file params/params.json \
                     -profile singularity \
                     -resume

or with conda:

nextflow run main.nf -params-file params/params.json \
                     -profile conda \
                     --mcscanx_path [path to MCScanX installation] \
                     -resume

Test data

A test dataset is available for testing and demonstration purposes. This dataset contains a phased genome assembly and annotation for chromosome 1 across all haplotypes of the tetraploid potato cultivar Atlantic.

fasta
gtf

Then the params_test.json should look like this:

{
    "reference_fasta": "{download_dir}/ATL_v3.asm.chr01_all_haplotypes.fa",
    "reference_gff": "{download_dir}/ATL_unitato_liftoff.chr01_all_haplotypes.gtf",
    "ploidy": 4,
    "outdir": "output_path"
}

Running Syntelogfinder on test data

After downloading the FASTA and GTF files and preparing the parameter file, the pipeline can be run like this:

git clone https://github.com/NIB-SI/syntelogfinder.git --branch v1.0.0
cd syntelogfinder
conda create -n nextflow -c bioconda nextflow 
conda activate nextflow
nextflow run main.nf \
  -params-file params/params_test.json \
  -profile singularity \
  --run_blast \
  -resume

Expected runtime: 10 minutes (if all singularity images are already pulled)

Tutorial

Output


Sample Output

The pipeline generates a tab-separated file with the following columns:

Column	Description
`gene_id`	Gene identifier
`transcript_id`	Transcript identifier
`Synt_id`	Synteny group identifier
`synteny_category`	Summary of syntenic gene distribution across haplotypes
`syntenic_genes`	Comma-separated list of all syntenic genes
`haplotype`	Haplotype assignment
`CDS_length_category`	CDS length classification (if applicable)
`CDS_haplotype_with_longest_annotation`	Haplotype with the longest CDS annotation (if applicable)

Example Output

gene_id	transcript_id	Synt_id	synteny_category	syntenic_genes	haplotype	CDS_length_category	CDS_haplotype_with_longest_annotation
TraesAK58CH7A01G122800	TraesAK58CH7A01G122800.1	Synt_id_0	1hapA_3hapB_1hapD_no_s	TraesAK58CH7A01G122800.1,TraesAK58CH1B01G017800.1,TraesAK58CH4B01G024800.1,TraesAK58CH2B01G118200.1,TraesAK58CH2D01G119400.1	hapA
TraesAK58CH1A01G005100	TraesAK58CH1A01G005100.1	Synt_id_1	2hapA_1hapB_2hapD_no_s	TraesAK58CH1A01G005100.1,TraesAK58CH3A01G490400.1,TraesAK58CH1B01G017500.1,TraesAK58CH1D01G000500.1,TraesAK58CH7D01G525700.1	hapA
TraesAK58CH3A01G490400	TraesAK58CH3A01G490400.1	Synt_id_1	2hapA_1hapB_2hapD_no_s	TraesAK58CH1A01G005100.1,TraesAK58CH3A01G490400.1,TraesAK58CH1B01G017500.1,TraesAK58CH1D01G000500.1,TraesAK58CH7D01G525700.1	hapA
TraesAK58CH3A01G236000	TraesAK58CH3A01G236000.1	Synt_id_2	1hapA_1hapB_2hapD_no_s	TraesAK58CH3A01G236000.1,TraesAK58CH1B01G017200.1,TraesAK58CH1D01G000900.1,TraesAK58CH5D01G521100.1	hapA
TraesAK58CH1A01G006000	TraesAK58CH1A01G006000.1	Synt_id_3	3hapA_0hapB_1hapD_no_s	TraesAK58CH1A01G006000.1,TraesAK58CH3A01G436000.1,TraesAK58CH5A01G002300.1,<NA>,TraesAK58CH1D01G365400.1	hapA
TraesAK58CH3A01G436000	TraesAK58CH3A01G436000.1	Synt_id_3	3hapA_0hapB_1hapD_no_s	TraesAK58CH1A01G006000.1,TraesAK58CH3A01G436000.1,TraesAK58CH5A01G002300.1,<NA>,TraesAK58CH1D01G365400.1	hapA
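
A table like the one above can be loaded with Python's csv module. The sketch below assumes the file is tab-separated with the header shown, and that rows may omit the trailing CDS columns, as in the example; the function name is hypothetical and the rows are copied from the example output.

```python
import csv
import io
from collections import defaultdict

HEADER = [
    "gene_id", "transcript_id", "Synt_id", "synteny_category", "syntenic_genes",
    "haplotype", "CDS_length_category", "CDS_haplotype_with_longest_annotation",
]


def genes_by_synteny_group(handle):
    """Map each Synt_id to the gene_ids assigned to it."""
    # restval="" fills columns that are absent from short rows.
    reader = csv.DictReader(handle, delimiter="\t", restval="")
    groups = defaultdict(list)
    for row in reader:
        groups[row["Synt_id"]].append(row["gene_id"])
    return dict(groups)


# Three rows from the example output above (the CDS columns are absent there too).
members_1 = ("TraesAK58CH1A01G005100.1,TraesAK58CH3A01G490400.1,TraesAK58CH1B01G017500.1,"
             "TraesAK58CH1D01G000500.1,TraesAK58CH7D01G525700.1")
members_2 = ("TraesAK58CH3A01G236000.1,TraesAK58CH1B01G017200.1,"
             "TraesAK58CH1D01G000900.1,TraesAK58CH5D01G521100.1")
rows = [
    ["TraesAK58CH1A01G005100", "TraesAK58CH1A01G005100.1", "Synt_id_1",
     "2hapA_1hapB_2hapD_no_s", members_1, "hapA"],
    ["TraesAK58CH3A01G490400", "TraesAK58CH3A01G490400.1", "Synt_id_1",
     "2hapA_1hapB_2hapD_no_s", members_1, "hapA"],
    ["TraesAK58CH3A01G236000", "TraesAK58CH3A01G236000.1", "Synt_id_2",
     "1hapA_1hapB_2hapD_no_s", members_2, "hapA"],
]
text = "\n".join("\t".join(fields) for fields in [HEADER] + rows)
groups = genes_by_synteny_group(io.StringIO(text))
print(groups)
```

For a real run, replace the `io.StringIO` object with an open handle on the pipeline's output file.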

Key Features:

Each gene is assigned to a synteny group (Synt_id)
The synteny_category shows the distribution pattern (e.g., 2hapA_1hapB_2hapD_no_s means 2 genes in hapA, 1 in hapB, 2 in hapD, with no specific pattern)
Missing syntenic genes are indicated with <NA>
All syntenic gene members are listed in the syntenic_genes column
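
The two encoded columns described above can be unpacked with a few lines of Python. This is a sketch based only on the example rows: it assumes that synteny_category always starts with `<count>hap<name>` tokens joined by underscores, followed by a trailing label such as `no_s`, and that missing members appear literally as `<NA>`. Both function names are hypothetical.

```python
import re


def parse_synteny_category(category):
    """Return ({haplotype: gene_count}, trailing_label) for a synteny_category."""
    counts = {}
    rest = category
    # Consume leading "<count>hap<name>_" tokens; whatever remains is the label.
    while True:
        match = re.match(r"(\d+)(hap[^_]+)(?:_|$)", rest)
        if not match:
            break
        counts[match.group(2)] = int(match.group(1))
        rest = rest[match.end():]
    return counts, rest


def split_syntenic_genes(field):
    """Return (members that are present, number of <NA> placeholders)."""
    members = field.split(",")
    present = [m for m in members if m != "<NA>"]
    return present, len(members) - len(present)


# Values copied from the Synt_id_3 rows of the example output.
counts, label = parse_synteny_category("3hapA_0hapB_1hapD_no_s")
present, n_missing = split_syntenic_genes(
    "TraesAK58CH1A01G006000.1,TraesAK58CH3A01G436000.1,"
    "TraesAK58CH5A01G002300.1,<NA>,TraesAK58CH1D01G365400.1"
)
print(counts, label)
print(len(present), n_missing)
```

In the example rows, the per-haplotype counts add up to the number of members that are not `<NA>`, which gives a simple consistency check on parsed output.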

Plots

Troubleshooting

If the GENESPACE process is interrupted, running with the -resume flag will fail. To keep the cached results of the other processes, delete the GENESPACE work directory before resuming.
