Hello @brianjohnhaas,
I was trying to annotate a StringTie transcriptome assembly, following this pipeline: gtf_to_alignment_gff3.pl -> TransDecoder.LongOrfs -> TransDecoder.Predict -> cdna_alignment_orf_to_genome_orf.pl
Unfortunately, one transcript has a problem in the feature field of the GFF3 output from cdna_alignment_orf_to_genome_orf.pl. The transcript ID is "AMEX231106C057529.2.p1", and it is one of the isoforms of the gene "AMEX231106C057529". At the end of the GFF3 record for that transcript there are two exons without a CDS entry, followed by three consecutive three_prime_UTR entries. No other isoform of this gene, and no transcript of any other gene in my assembly, shows this problem. Here are the steps of the pipeline I followed:
- First I executed:
gtf_to_alignment_gff3.pl stringtie.merge.processed.gtf > AmexT_v231106C_FullOnly.stringtie.merge.raw.gff3
- "AMEX231106C057529.2" entries in the input stringtie.merge.processed.gtf are:
ptg000056l StringTie transcript 24490808 32418568 1000 + . gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2";
ptg000056l StringTie exon 24490808 24491053 1000 + . gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "1";
ptg000056l StringTie exon 25718856 25718950 1000 + . gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "2";
ptg000056l StringTie exon 26892333 26892549 1000 + . gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "3";
ptg000056l StringTie exon 28209504 28209543 1000 + . gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "4";
ptg000056l StringTie exon 29692485 29692574 1000 + . gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "5";
ptg000056l StringTie exon 29977070 29977184 1000 + . gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "6";
ptg000056l StringTie exon 30279073 30279184 1000 + . gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "7";
ptg000056l StringTie exon 31138727 31138879 1000 + . gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "8";
ptg000056l StringTie exon 31484095 31484257 1000 + . gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "9";
ptg000056l StringTie exon 32226903 32227002 1000 + . gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "10";
ptg000056l StringTie exon 32227351 32227454 1000 + . gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "11";
ptg000056l StringTie exon 32415663 32415698 1000 + . gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "12";
ptg000056l StringTie exon 32418037 32418568 1000 + . gene_id "AMEX231106C057529"; transcript_id "AMEX231106C057529.2"; exon_number "13";
- "AMEX231106C057529.2" entries in the output AmexT_v231106C_FullOnly.stringtie.merge.raw.gff3 are:
ptg000056l Cufflinks match 24490808 24491053 100 + . ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 1 246 +
ptg000056l Cufflinks match 25718856 25718950 100 + . ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 247 341 +
ptg000056l Cufflinks match 26892333 26892549 100 + . ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 342 558 +
ptg000056l Cufflinks match 28209504 28209543 100 + . ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 559 598 +
ptg000056l Cufflinks match 29692485 29692574 100 + . ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 599 688 +
ptg000056l Cufflinks match 29977070 29977184 100 + . ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 689 803 +
ptg000056l Cufflinks match 30279073 30279184 100 + . ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 804 915 +
ptg000056l Cufflinks match 31138727 31138879 100 + . ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 916 1068 +
ptg000056l Cufflinks match 31484095 31484257 100 + . ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 1069 1231 +
ptg000056l Cufflinks match 32226903 32227002 100 + . ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 1232 1331 +
ptg000056l Cufflinks match 32227351 32227454 100 + . ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 1332 1435 +
ptg000056l Cufflinks match 32415663 32415698 100 + . ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 1436 1471 +
ptg000056l Cufflinks match 32418037 32418568 100 + . ID=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2;Target=GENE^AMEX231106C057529,TRANS^AMEX231106C057529.2 1472 2003 +
- Then I extracted the FASTA sequence of each transcript and stored the sequences in AmexT_v231106C_FullOnly.stringtie.merge.raw.fasta. There is only one sequence with the ID "AMEX231106C057529.2" in the FASTA file. Then I resumed the pipeline:
TransDecoder.LongOrfs -t AmexT_v231106C_FullOnly.stringtie.merge.raw.fasta -S -m 30
TransDecoder.Predict -t AmexT_v231106C_FullOnly.stringtie.merge.raw.fasta --single_best_only
- "AMEX231106C057529.2" entries in the output AmexT_v231106C_FullOnly.stringtie.merge.raw.fasta.transdecoder.gff3 are:
AMEX231106C057529.2 transdecoder gene 1 2003 . + . ID=GENE.AMEX231106C057529.2~~AMEX231106C057529.2.p1;Name="ORF type:5prime_partial (+),score=76.51"
AMEX231106C057529.2 transdecoder mRNA 1 2003 . + . ID=AMEX231106C057529.2.p1;Parent=GENE.AMEX231106C057529.2~~AMEX231106C057529.2.p1;Name="ORF type:5prime_partial (+),score=76.51"
AMEX231106C057529.2 transdecoder exon 1 2003 . + . ID=AMEX231106C057529.2.p1.exon1;Parent=AMEX231106C057529.2.p1
AMEX231106C057529.2 transdecoder CDS 1 1401 . + 0 ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
AMEX231106C057529.2 transdecoder three_prime_UTR 1402 2003 . + . ID=AMEX231106C057529.2.p1.utr3p1;Parent=AMEX231106C057529.2.p1
- Finally, I tried to merge the alignment information with the prediction results:
cdna_alignment_orf_to_genome_orf.pl AmexT_v231106C_FullOnly.stringtie.merge.raw.fasta.transdecoder.gff3 AmexT_v231106C_FullOnly.stringtie.merge.raw.gff3 AmexT_v231106C_FullOnly.stringtie.merge.raw.fasta > stringtie.transdecoder.annotated.gff
- "AMEX231106C057529.2" entries in the output stringtie.transdecoder.annotated.gff:
ptg000056l transdecoder mRNA 24490808 32418568 . + . ID=AMEX231106C057529.2.p1;Parent=AMEX231106C057529^ptg000056l^+;Name="ORF type:3prime_partial (+),score=41.53"
ptg000056l transdecoder exon 24490808 24491053 . + . ID=AMEX231106C057529.2.p1.exon1;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder CDS 24490808 24491053 . + 0 ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder exon 25718856 25718950 . + . ID=AMEX231106C057529.2.p1.exon2;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder CDS 25718856 25718950 . + 0 ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder exon 26892333 26892549 . + . ID=AMEX231106C057529.2.p1.exon3;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder CDS 26892333 26892549 . + 1 ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder exon 28209504 28209543 . + . ID=AMEX231106C057529.2.p1.exon4;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder CDS 28209504 28209543 . + 0 ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder exon 29692485 29692574 . + . ID=AMEX231106C057529.2.p1.exon5;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder CDS 29692485 29692574 . + 2 ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder exon 29977070 29977184 . + . ID=AMEX231106C057529.2.p1.exon6;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder CDS 29977070 29977184 . + 2 ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder exon 30279073 30279184 . + . ID=AMEX231106C057529.2.p1.exon7;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder CDS 30279073 30279184 . + 1 ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder exon 31138727 31138879 . + . ID=AMEX231106C057529.2.p1.exon8;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder CDS 31138727 31138879 . + 0 ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder exon 31484095 31484257 . + . ID=AMEX231106C057529.2.p1.exon9;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder CDS 31484095 31484257 . + 0 ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder exon 32226903 32227002 . + . ID=AMEX231106C057529.2.p1.exon10;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder CDS 32226903 32227002 . + 2 ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder exon 32227351 32227454 . + . ID=AMEX231106C057529.2.p1.exon11;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder CDS 32227351 32227420 . + 1 ID=cds.AMEX231106C057529.2.p1;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder exon 32415663 32415698 . + . ID=AMEX231106C057529.2.p1.exon12;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder exon 32418037 32418568 . + . ID=AMEX231106C057529.2.p1.exon13;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder three_prime_UTR 32227421 32227454 . + . ID=AMEX231106C057529.2.p1.utr3p1;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder three_prime_UTR 32415663 32415698 . + . ID=AMEX231106C057529.2.p1.utr3p2;Parent=AMEX231106C057529.2.p1
ptg000056l transdecoder three_prime_UTR 32418037 32418568 . + . ID=AMEX231106C057529.2.p1.utr3p3;Parent=AMEX231106C057529.2.p1
As you can see in the last lines of the block above, there are two exons that are not followed by a CDS entry, and after them come three three_prime_UTR entries. I have no idea how to debug this (there was no warning in the log from cdna_alignment_orf_to_genome_orf.pl), or why it happened only with this transcript. I would appreciate your help.
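As a cross-check, the transcript-level CDS end (position 1401 in the transdecoder.gff3) can be projected onto the genome by hand using the Target coordinates from the alignment GFF3. The following is only a minimal sketch of that arithmetic for a plus-strand transcript; the function name `transcript_to_genome` and the `BLOCKS` list are made up for illustration, and this is not the logic that cdna_alignment_orf_to_genome_orf.pl itself uses.

```python
# Each tuple: (genome_start, genome_end, transcript_start, transcript_end),
# copied from the 13 "match" lines of the alignment GFF3 above (plus strand).
BLOCKS = [
    (24490808, 24491053, 1, 246),
    (25718856, 25718950, 247, 341),
    (26892333, 26892549, 342, 558),
    (28209504, 28209543, 559, 598),
    (29692485, 29692574, 599, 688),
    (29977070, 29977184, 689, 803),
    (30279073, 30279184, 804, 915),
    (31138727, 31138879, 916, 1068),
    (31484095, 31484257, 1069, 1231),
    (32226903, 32227002, 1232, 1331),
    (32227351, 32227454, 1332, 1435),
    (32415663, 32415698, 1436, 1471),
    (32418037, 32418568, 1472, 2003),
]


def transcript_to_genome(pos, blocks):
    """Map a 1-based transcript position to (exon_number, genome_position).

    Assumes a plus-strand transcript whose blocks are ungapped and
    listed in transcript order.
    """
    for exon_number, (g_start, g_end, t_start, t_end) in enumerate(blocks, start=1):
        if t_start <= pos <= t_end:
            return exon_number, g_start + (pos - t_start)
    raise ValueError(f"position {pos} is outside the aligned transcript")


# Last CDS base and first 3' UTR base, taken from the transdecoder.gff3 record.
print(transcript_to_genome(1401, BLOCKS))
print(transcript_to_genome(1402, BLOCKS))
```

Under these assumptions, transcript position 1401 lands in exon 11 at genomic position 32227420, and position 1402 at 32227421. These are the same coordinates as the end of the last CDS line and the start of the first three_prime_UTR line in the output, and exons 12 and 13 lie entirely downstream of that point.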
Best,
Salvador