Bio::SeqFeature::Tools::Unflattener does not convert pseudogene correctly #124

sufenhu9 · 2015-10-06T20:39:02Z

When I use Bio::SeqFeature::Tools::Unflattener to convert GenBank flat-feature-list to containment hierarchy,

$unflattener->unflatten_seq(-seq=>$seq, -use_magic=>1);

Everything is fine except these genes has pseudo tag.

Gene and CDS with pseudo tag have been convert to pseudogene and pseudoCDS, but they are flat. That is means the pseudoCDS is not the child of the pseudogene.
Gene and tRNA with pseudo tag have been convert to pseudogene and pseudotRNA, but they are not nested either. The converted pseudotRNA are not the child of the pseudogene.

Like to know if there is any other parameter, or any other method that I can use to convert both gene and pseudogene correctly.

sample file

http://www.ncbi.nlm.nih.gov/nuccore/FR823391
http://www.ncbi.nlm.nih.gov/nuccore/GL636509

sample codes

#!/usr/bin/perl
use strict;
use Bio::SeqIO;
use Bio::Seq::SeqBuilder;
use Bio::Species;
use Bio::Annotation::SimpleValue;
use Bio::SeqFeature::Tools::Unflattener;

my $seq_in = Bio::SeqIO->new( '-file' => "<$ARGV[0]", '-format' => 'Genbank' );
my $unflattener = Bio::SeqFeature::Tools::Unflattener->new;

while ( my $bioperlSeq = $seq_in->next_seq() ) {
    if ( !( $bioperlSeq->molecule =~ /rna/i ) ) {
        $unflattener->error_threshold(1);
        $unflattener->report_problems(*STDERR);
        $unflattener->unflatten_seq(
            -seq       => $bioperlSeq,
            -use_magic => 1
        );

        my @topSeqFeatures = $bioperlSeq->get_SeqFeatures;

        $bioperlSeq->remove_SeqFeatures;

        foreach my $bioperlFeatureTree (@topSeqFeatures) {
            my $type = $bioperlFeatureTree->primary_tag();
            if ( ( $bioperlFeatureTree->has_tag("locus_tag") ) ) {
                my ($cID) = $bioperlFeatureTree->get_tag_values("locus_tag");
                print STDERR "\nprocessing $cID...\n";
            }

            if (   $type =~ /pseudogene/i
                || $type =~ /pseudotRNA/i
                || $type =~ /pseudoCDS/i )
            {
                foreach my $subPseuFeat ( $bioperlFeatureTree->get_SeqFeatures )
                {
                    my $subPseuType = $subPseuFeat->primary_tag();
                    print STDERR "  subPseuType = $subPseuType\n";
                    my @pseuLocs = $subPseuFeat->location;
                    foreach my $pseuLoc (@pseuLocs) {
                        my $pseuLocStart = $pseuLoc->start();
                        my $pseuLocEnd   = $pseuLoc->end();
                        print "    pseuLoc location : $pseuLocStart ... $pseuLocEnd\n";
                    }
                }
            }
            print STDERR "\$type = $type\n";
            foreach my $subFeat ( $bioperlFeatureTree->get_SeqFeatures ) {
                my $subType = $subFeat->primary_tag();
                print STDERR "  subType = $subType\n";
                foreach my $subSubFeat ( $subFeat->get_SeqFeatures ) {
                    my $subSubType = $subSubFeat->primary_tag();
                    print STDERR "      subSubType = $subSubType\n";
                    if ( $subSubType eq "CDS" || $subSubType eq "exon" ) {
                        my @CDSLocs = $subSubFeat->location;
                        foreach my $CDSLoc (@CDSLocs) {
                            my $subSubStart = $CDSLoc->start();
                            my $subSubEnd   = $CDSLoc->end();
                            print STDERR
"            location: $subSubStart .... $subSubEnd\n";
                        }
                    }
                }
            }
        }
    }
}

I am using bioperl_live/1.6.9,
$ perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION,"\n"'
1.0069

Everything is fine when I use bioperl version 1.4. Not sure what is changed for Bio::SeqFeature::Tools::Unflattener between 1.4 and 1.6.9.

The different outputs with above code in different bioperl version 1.6.9 and 1.4.

####### BioPerl 1.6.9 #########
processing NCLIV_047911...
$type = pseudogene
processing NCLIV_047911...
$type = pseudoCDS

####### BioPerl 1.4 #########
processing NCLIV_047911...
$type = gene
subType = mRNA
subSubType = CDS
location: 2628722 .... 2629917
subSubType = exon
location: 2628722 .... 2629792
subSubType = exon
location: 2629795 .... 2629917

Thanks.

The text was updated successfully, but these errors were encountered:

cjfields · 2015-10-17T14:40:11Z

Relevant feature in the GenBank file is:

...
     gene            complement(2628722..2629917)
                     /locus_tag="NCLIV_047911"
                     /pseudo
     CDS             complement(join(2628722..2629792,2629795..2629917))
                     /locus_tag="NCLIV_047911"
                     /pseudo
                     /codon_start=1
                     /product="hypothetical protein"
                     /db_xref="PSEUDO:CBZ54361.1"
...

not sure why it has changed, @cmungall any ideas?

cjfields · 2015-10-24T14:44:56Z

May try running git bisect to see where this changed. My feeling is a conversion to add the pseudo tag to the feature, though it's doing the right thing, actually clobbers the unflattening.

cjfields · 2015-10-24T15:45:55Z

git bisect indicates the change occurred way back in 2004 in 3a404df from @cmungall

I'll open a branch for testing fixes.

cmungall · 2015-10-26T15:06:46Z

I'm afraid the exact rationale escapes me at the moment. I assume it was to do with an asymmetry in how pseudogene models were typically encoded in genbank records at the time.

For this particular example, the pseudogene mirrors the structure of a gene precisely, even having a CDS. It looks like the code should be simplified here so that pseudogenes are structured symmetrically to genes. But it's not clear what the consequences of making this change would be for other pseudogene records in genbank. Also, there is a secondary issue that the pseudogene hierarchy doesnt mirror the gene one entirely in SO (e.g. no pseudoCDS)

sufenhu9 · 2015-10-30T16:13:07Z

Thanks for working on this issue. Yes, there is no pseudoCDS in SO so far. There is pseudogenic_exon instead.

For the change in BioPerl 1.6.9

   # PSEUDOGENES, PSEUDOEXONS AND PSEUDOINTRONS
   # these are indicated with the /pseudo tag
   # these are mapped to a different type; they should NOT
   # be treated as normal genes
   foreach my $sf (@all_seq_features) {
       if ($sf->has_tag('pseudo')) {
           my $type = $sf->primary_tag;
           # SO type is typically the same as the normal
           # type but preceeded by "pseudo"
           if ($type eq 'misc_RNA' || $type eq 'mRNA') { 
            # dgg: see TypeMapper; both pseudo mRNA,misc_RNA should be pseudogenic_transcript
               $sf->primary_tag("pseudotranscript");
           }
           else {
               $sf->primary_tag("pseudo$type");
           }
       }
   }

I propose the following,

   foreach my $sf (@all_seq_features) {
       if ($sf->has_tag('pseudo')) {
           my $type = $sf->primary_tag;
           if ($type eq 'gene') { 
               $sf->primary_tag("pseudogene");
           } elsif ($type eq 'CDS') {
               $sf->primary_tag("pseudogenic_exon");
           } elsif ($type eq 'mRNA') {
               $sf->primary_tag("pseudogenic_transcript");
           }
           else {
               $sf->primary_tag("pseudogenic_$type");
           }
       }
   }

After this, you can unflat the structure as
pseudogene -> pseudogenic_transcript -> pseudogenic_exon
pseudogene -> pseudogenic_tRNA -> pseudogenic_exon

and so on.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bio::SeqFeature::Tools::Unflattener does not convert pseudogene correctly #124

Bio::SeqFeature::Tools::Unflattener does not convert pseudogene correctly #124

sufenhu9 commented Oct 6, 2015

cjfields commented Oct 17, 2015

cjfields commented Oct 24, 2015

cjfields commented Oct 24, 2015

cmungall commented Oct 26, 2015

sufenhu9 commented Oct 30, 2015

Bio::SeqFeature::Tools::Unflattener does not convert pseudogene correctly #124

Bio::SeqFeature::Tools::Unflattener does not convert pseudogene correctly #124

Comments

sufenhu9 commented Oct 6, 2015

cjfields commented Oct 17, 2015

cjfields commented Oct 24, 2015

cjfields commented Oct 24, 2015

cmungall commented Oct 26, 2015

sufenhu9 commented Oct 30, 2015