Feature issue in a record after cdna_alignment_orf_to_genome_orf.pl #196

@SalvadorGJ

Hello @brianjohnhaas,

I was trying to annotate a StringTie transcriptome assembly, following this pipeline: gtf_to_alignment_gff3.pl -> TransDecoder.LongOrfs -> TransDecoder.Predict -> cdna_alignment_orf_to_genome_orf.pl

Unfortunately, one transcript has a problem in the feature field of the GFF3 output of cdna_alignment_orf_to_genome_orf.pl. The transcript ID is "AMEX231106C057529.2.p1", one of the isoforms of the gene "AMEX231106C057529". At the end of the GFF3 record for that transcript there are two exons without a CDS, followed by three consecutive three_prime_UTR entries. No other isoform of this gene, and no transcript from any other gene in my assembly, shows this problem. Here are the steps of the pipeline that I followed:

  1. First I executed:
gtf_to_alignment_gff3.pl stringtie.merge.processed.gtf > AmexT_v231106C_FullOnly.stringtie.merge.raw.gff3
  • "AMEX231106C057529.2" entries in the input stringtie.merge.processed.gtf are:
ptg000056l	StringTie	transcript	24490808	32418568	1000	+	.	gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; 
ptg000056l	StringTie	exon	24490808	24491053	1000	+	.	gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "1"; 
ptg000056l	StringTie	exon	25718856	25718950	1000	+	.	gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "2"; 
ptg000056l	StringTie	exon	26892333	26892549	1000	+	.	gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "3"; 
ptg000056l	StringTie	exon	28209504	28209543	1000	+	.	gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "4"; 
ptg000056l	StringTie	exon	29692485	29692574	1000	+	.	gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "5"; 
ptg000056l	StringTie	exon	29977070	29977184	1000	+	.	gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "6"; 
ptg000056l	StringTie	exon	30279073	30279184	1000	+	.	gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "7"; 
ptg000056l	StringTie	exon	31138727	31138879	1000	+	.	gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "8"; 
ptg000056l	StringTie	exon	31484095	31484257	1000	+	.	gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "9"; 
ptg000056l	StringTie	exon	32226903	32227002	1000	+	.	gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "10"; 
ptg000056l	StringTie	exon	32227351	32227454	1000	+	.	gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "11"; 
ptg000056l	StringTie	exon	32415663	32415698	1000	+	.	gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "12"; 
ptg000056l	StringTie	exon	32418037	32418568	1000	+	.	gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "13";
  • "AMEX231106C057529.2" entries in the output AmexT_v231106C_FullOnly.stringtie.merge.raw.gff3 are:
ptg000056l	Cufflinks	match	24490808	24491053	100	+	.	ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 1 246 +
ptg000056l	Cufflinks	match	25718856	25718950	100	+	.	ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 247 341 +
ptg000056l	Cufflinks	match	26892333	26892549	100	+	.	ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 342 558 +
ptg000056l	Cufflinks	match	28209504	28209543	100	+	.	ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 559 598 +
ptg000056l	Cufflinks	match	29692485	29692574	100	+	.	ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 599 688 +
ptg000056l	Cufflinks	match	29977070	29977184	100	+	.	ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 689 803 +
ptg000056l	Cufflinks	match	30279073	30279184	100	+	.	ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 804 915 +
ptg000056l	Cufflinks	match	31138727	31138879	100	+	.	ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 916 1068 +
ptg000056l	Cufflinks	match	31484095	31484257	100	+	.	ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 1069 1231 +
ptg000056l	Cufflinks	match	32226903	32227002	100	+	.	ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 1232 1331 +
ptg000056l	Cufflinks	match	32227351	32227454	100	+	.	ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 1332 1435 +
ptg000056l	Cufflinks	match	32415663	32415698	100	+	.	ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 1436 1471 +
ptg000056l	Cufflinks	match	32418037	32418568	100	+	.	ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 1472 2003 +
  2. Then I extracted the FASTA sequence of each transcript and stored them in AmexT_v231106C_FullOnly.stringtie.merge.raw.fasta. There is only one sequence with the ID "AMEX231106C057529.2" in the FASTA file. Then I resumed the pipeline:
TransDecoder.LongOrfs -t AmexT_v231106C_FullOnly.stringtie.merge.raw.fasta -S -m 30
TransDecoder.Predict -t AmexT_v231106C_FullOnly.stringtie.merge.raw.fasta --single_best_only
  • "AMEX231106C057529.2" entries in the output AmexT_v231106C_FullOnly.stringtie.merge.raw.fasta.transdecoder.gff3 are:
AMEX231106C057529.2	transdecoder	gene	1	2003	.	+	.	ID=GENE.AMEX231106C057529.2~~AMEX231106C057529.2.p1;Name="ORF type:5prime_partial (+),score=76.51"
AMEX231106C057529.2	transdecoder	mRNA	1	2003	.	+	.	ID=AMEX231106C057529.2.p1;Parent=GENE.AMEX231106C057529.2~~AMEX231106C057529.2.p1;Name="ORF type:5prime_partial (+),score=76.51"
AMEX231106C057529.2	transdecoder	exon	1	2003	.	+	.	ID=AMEX231106C057529.2.p1.exon1;Parent=AMEX231106C057529.2.p1
AMEX231106C057529.2	transdecoder	CDS	1	1401	.	+	0	ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
AMEX231106C057529.2	transdecoder	three_prime_UTR	1402	2003	.	+	.	ID=AMEX231106C057529.2.p1.utr3p1;Parent=AMEX231106C057529.2.p1
  3. Finally, I tried to merge the alignment information with the prediction results:
cdna_alignment_orf_to_genome_orf.pl AmexT_v231106C_FullOnly.stringtie.merge.raw.fasta.transdecoder.gff3 AmexT_v231106C_FullOnly.stringtie.merge.raw.gff3 AmexT_v231106C_FullOnly.stringtie.merge.raw.fasta > stringtie.transdecoder.annotated.gff
  • "AMEX231106C057529.2" entries in the output stringtie.transdecoder.annotated.gff:
ptg000056l	transdecoder	mRNA	24490808	32418568	.	+	.	ID=AMEX231106C057529.2.p1;Parent=AMEX231106C057529^ptg000056l^+;Name="ORF type:3prime_partial (+),score=41.53"
ptg000056l	transdecoder	exon	24490808	24491053	.	+	.	ID=AMEX231106C057529.2.p1.exon1;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	CDS	24490808	24491053	.	+	0	ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	exon	25718856	25718950	.	+	.	ID=AMEX231106C057529.2.p1.exon2;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	CDS	25718856	25718950	.	+	0	ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	exon	26892333	26892549	.	+	.	ID=AMEX231106C057529.2.p1.exon3;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	CDS	26892333	26892549	.	+	1	ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	exon	28209504	28209543	.	+	.	ID=AMEX231106C057529.2.p1.exon4;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	CDS	28209504	28209543	.	+	0	ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	exon	29692485	29692574	.	+	.	ID=AMEX231106C057529.2.p1.exon5;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	CDS	29692485	29692574	.	+	2	ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	exon	29977070	29977184	.	+	.	ID=AMEX231106C057529.2.p1.exon6;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	CDS	29977070	29977184	.	+	2	ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	exon	30279073	30279184	.	+	.	ID=AMEX231106C057529.2.p1.exon7;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	CDS	30279073	30279184	.	+	1	ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	exon	31138727	31138879	.	+	.	ID=AMEX231106C057529.2.p1.exon8;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	CDS	31138727	31138879	.	+	0	ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	exon	31484095	31484257	.	+	.	ID=AMEX231106C057529.2.p1.exon9;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	CDS	31484095	31484257	.	+	0	ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	exon	32226903	32227002	.	+	.	ID=AMEX231106C057529.2.p1.exon10;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	CDS	32226903	32227002	.	+	2	ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	exon	32227351	32227454	.	+	.	ID=AMEX231106C057529.2.p1.exon11;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	CDS	32227351	32227420	.	+	1	ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	exon	32415663	32415698	.	+	.	ID=AMEX231106C057529.2.p1.exon12;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	exon	32418037	32418568	.	+	.	ID=AMEX231106C057529.2.p1.exon13;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	three_prime_UTR	32227421	32227454	.	+	.	ID=AMEX231106C057529.2.p1.utr3p1;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	three_prime_UTR	32415663	32415698	.	+	.	ID=AMEX231106C057529.2.p1.utr3p2;Parent=AMEX231106C057529.2.p1
ptg000056l	transdecoder	three_prime_UTR	32418037	32418568	.	+	.	ID=AMEX231106C057529.2.p1.utr3p3;Parent=AMEX231106C057529.2.p1

As you can see in the last lines of the block above, two exons are not followed by a CDS entry, and after them come three three_prime_UTR entries. I have no idea how to debug this (there was no warning in the log from cdna_alignment_orf_to_genome_orf.pl), or why it happened only with this transcript. I would appreciate your help.
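
One way to sanity-check the coordinates is to project the transcript-space features from the TransDecoder GFF3 (CDS 1-1401, three_prime_UTR 1402-2003) onto the genomic exon blocks from the alignment GFF3. The Python sketch below does that by hand. It assumes a + strand transcript and 1-based inclusive coordinates; `project_to_genome` is a hypothetical helper written for illustration and is not part of TransDecoder.

```python
# Genomic exon blocks of AMEX231106C057529.2 on the + strand, copied from the
# alignment GFF3 above as (start, end), 1-based inclusive.
EXONS = [
    (24490808, 24491053),
    (25718856, 25718950),
    (26892333, 26892549),
    (28209504, 28209543),
    (29692485, 29692574),
    (29977070, 29977184),
    (30279073, 30279184),
    (31138727, 31138879),
    (31484095, 31484257),
    (32226903, 32227002),
    (32227351, 32227454),
    (32415663, 32415698),
    (32418037, 32418568),
]


def project_to_genome(exons, t_start, t_end):
    """Map a 1-based inclusive transcript interval onto + strand exon blocks.

    Hypothetical helper for illustration only (assumption: + strand,
    exons sorted by genomic position).
    """
    segments = []
    offset = 0  # transcript bases consumed by earlier exons
    for g_start, g_end in exons:
        length = g_end - g_start + 1
        # Part of the requested transcript interval that falls in this exon.
        lo = max(t_start, offset + 1)
        hi = min(t_end, offset + length)
        if lo <= hi:
            segments.append((g_start + lo - offset - 1,
                             g_start + hi - offset - 1))
        offset += length
    return segments


transcript_length = sum(end - start + 1 for start, end in EXONS)
cds = project_to_genome(EXONS, 1, 1401)
utr3 = project_to_genome(EXONS, 1402, transcript_length)

print("transcript length:", transcript_length)
print("number of CDS segments:", len(cds))
print("last CDS segment:", cds[-1])
print("3' UTR segments:", utr3)
```

Under these assumptions the projection gives a transcript length of 2003, a last CDS segment of 32227351-32227420 in exon 11, and three 3' UTR segments (32227421-32227454, 32415663-32415698, 32418037-32418568), which are the same coordinates as in the record above. This only checks the coordinate arithmetic; it says nothing about how cdna_alignment_orf_to_genome_orf.pl orders or labels its output lines.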

Best,
Salvador
