Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bio::SeqFeature::Tools::Unflattener does not convert pseudogene correctly #124

Open
sufenhu9 opened this issue Oct 6, 2015 · 5 comments

Comments

@sufenhu9
Copy link

sufenhu9 commented Oct 6, 2015

When I use Bio::SeqFeature::Tools::Unflattener to convert GenBank flat-feature-list to containment hierarchy,

$unflattener->unflatten_seq(-seq=>$seq, -use_magic=>1);

Everything is fine except these genes has pseudo tag.

  1. Gene and CDS with pseudo tag have been convert to pseudogene and pseudoCDS, but they are flat. That is means the pseudoCDS is not the child of the pseudogene.
  2. Gene and tRNA with pseudo tag have been convert to pseudogene and pseudotRNA, but they are not nested either. The converted pseudotRNA are not the child of the pseudogene.

Like to know if there is any other parameter, or any other method that I can use to convert both gene and pseudogene correctly.

sample file

http://www.ncbi.nlm.nih.gov/nuccore/FR823391
http://www.ncbi.nlm.nih.gov/nuccore/GL636509

sample codes

#!/usr/bin/perl
use strict;
use Bio::SeqIO;
use Bio::Seq::SeqBuilder;
use Bio::Species;
use Bio::Annotation::SimpleValue;
use Bio::SeqFeature::Tools::Unflattener;

my $seq_in = Bio::SeqIO->new( '-file' => "<$ARGV[0]", '-format' => 'Genbank' );
my $unflattener = Bio::SeqFeature::Tools::Unflattener->new;

while ( my $bioperlSeq = $seq_in->next_seq() ) {
    if ( !( $bioperlSeq->molecule =~ /rna/i ) ) {
        $unflattener->error_threshold(1);
        $unflattener->report_problems(*STDERR);
        $unflattener->unflatten_seq(
            -seq       => $bioperlSeq,
            -use_magic => 1
        );

        my @topSeqFeatures = $bioperlSeq->get_SeqFeatures;

        $bioperlSeq->remove_SeqFeatures;

        foreach my $bioperlFeatureTree (@topSeqFeatures) {
            my $type = $bioperlFeatureTree->primary_tag();
            if ( ( $bioperlFeatureTree->has_tag("locus_tag") ) ) {
                my ($cID) = $bioperlFeatureTree->get_tag_values("locus_tag");
                print STDERR "\nprocessing $cID...\n";
            }

            if (   $type =~ /pseudogene/i
                || $type =~ /pseudotRNA/i
                || $type =~ /pseudoCDS/i )
            {
                foreach my $subPseuFeat ( $bioperlFeatureTree->get_SeqFeatures )
                {
                    my $subPseuType = $subPseuFeat->primary_tag();
                    print STDERR "  subPseuType = $subPseuType\n";
                    my @pseuLocs = $subPseuFeat->location;
                    foreach my $pseuLoc (@pseuLocs) {
                        my $pseuLocStart = $pseuLoc->start();
                        my $pseuLocEnd   = $pseuLoc->end();
                        print "    pseuLoc location : $pseuLocStart ... $pseuLocEnd\n";
                    }
                }
            }
            print STDERR "\$type = $type\n";
            foreach my $subFeat ( $bioperlFeatureTree->get_SeqFeatures ) {
                my $subType = $subFeat->primary_tag();
                print STDERR "  subType = $subType\n";
                foreach my $subSubFeat ( $subFeat->get_SeqFeatures ) {
                    my $subSubType = $subSubFeat->primary_tag();
                    print STDERR "      subSubType = $subSubType\n";
                    if ( $subSubType eq "CDS" || $subSubType eq "exon" ) {
                        my @CDSLocs = $subSubFeat->location;
                        foreach my $CDSLoc (@CDSLocs) {
                            my $subSubStart = $CDSLoc->start();
                            my $subSubEnd   = $CDSLoc->end();
                            print STDERR
"            location: $subSubStart .... $subSubEnd\n";
                        }
                    }
                }
            }
        }
    }
}

I am using bioperl_live/1.6.9,
$ perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION,"\n"'
1.0069

Everything is fine when I use bioperl version 1.4. Not sure what is changed for Bio::SeqFeature::Tools::Unflattener between 1.4 and 1.6.9.

The different outputs with above code in different bioperl version 1.6.9 and 1.4.

####### BioPerl 1.6.9 #########
processing NCLIV_047911...
$type = pseudogene
processing NCLIV_047911...
$type = pseudoCDS

####### BioPerl 1.4 #########
processing NCLIV_047911...
$type = gene
subType = mRNA
subSubType = CDS
location: 2628722 .... 2629917
subSubType = exon
location: 2628722 .... 2629792
subSubType = exon
location: 2629795 .... 2629917

Thanks.

@cjfields
Copy link
Member

Relevant feature in the GenBank file is:

...
     gene            complement(2628722..2629917)
                     /locus_tag="NCLIV_047911"
                     /pseudo
     CDS             complement(join(2628722..2629792,2629795..2629917))
                     /locus_tag="NCLIV_047911"
                     /pseudo
                     /codon_start=1
                     /product="hypothetical protein"
                     /db_xref="PSEUDO:CBZ54361.1"
...

not sure why it has changed, @cmungall any ideas?

@cjfields
Copy link
Member

May try running git bisect to see where this changed. My feeling is a conversion to add the pseudo tag to the feature, though it's doing the right thing, actually clobbers the unflattening.

@cjfields
Copy link
Member

git bisect indicates the change occurred way back in 2004 in 3a404df from @cmungall

I'll open a branch for testing fixes.

@cmungall
Copy link
Member

I'm afraid the exact rationale escapes me at the moment. I assume it was to do with an asymmetry in how pseudogene models were typically encoded in genbank records at the time.

For this particular example, the pseudogene mirrors the structure of a gene precisely, even having a CDS. It looks like the code should be simplified here so that pseudogenes are structured symmetrically to genes. But it's not clear what the consequences of making this change would be for other pseudogene records in genbank. Also, there is a secondary issue that the pseudogene hierarchy doesnt mirror the gene one entirely in SO (e.g. no pseudoCDS)

@sufenhu9
Copy link
Author

Thanks for working on this issue. Yes, there is no pseudoCDS in SO so far. There is pseudogenic_exon instead.

For the change in BioPerl 1.6.9

   # PSEUDOGENES, PSEUDOEXONS AND PSEUDOINTRONS
   # these are indicated with the /pseudo tag
   # these are mapped to a different type; they should NOT
   # be treated as normal genes
   foreach my $sf (@all_seq_features) {
       if ($sf->has_tag('pseudo')) {
           my $type = $sf->primary_tag;
           # SO type is typically the same as the normal
           # type but preceeded by "pseudo"
           if ($type eq 'misc_RNA' || $type eq 'mRNA') { 
            # dgg: see TypeMapper; both pseudo mRNA,misc_RNA should be pseudogenic_transcript
               $sf->primary_tag("pseudotranscript");
           }
           else {
               $sf->primary_tag("pseudo$type");
           }
       }
   }

I propose the following,

   foreach my $sf (@all_seq_features) {
       if ($sf->has_tag('pseudo')) {
           my $type = $sf->primary_tag;
           if ($type eq 'gene') { 
               $sf->primary_tag("pseudogene");
           } elsif ($type eq 'CDS') {
               $sf->primary_tag("pseudogenic_exon");
           } elsif ($type eq 'mRNA') {
               $sf->primary_tag("pseudogenic_transcript");
           }
           else {
               $sf->primary_tag("pseudogenic_$type");
           }
       }
   }

After this, you can unflat the structure as
pseudogene -> pseudogenic_transcript -> pseudogenic_exon
pseudogene -> pseudogenic_tRNA -> pseudogenic_exon

and so on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants