2 # bioperl module for Bio::SeqFeature::Tools::Unflattener
4 # Please direct questions and support issues to <bioperl-l@bioperl.org>
6 # Cared for by Chris Mungall <cjm@fruitfly.org>
8 # Copyright Chris Mungall
10 # You may distribute this module under the same terms as perl itself
12 # POD documentation - main docs before the code
16 Bio::SeqFeature::Tools::Unflattener - turns flat list of genbank-sourced features into a nested SeqFeatureI hierarchy
20 # standard / generic use - unflatten a genbank record
22 use Bio::SeqFeature::Tools::Unflattener;
24 # generate an Unflattener object
25 $unflattener = Bio::SeqFeature::Tools::Unflattener->new;
27 # first fetch a genbank SeqI object
29 Bio::SeqIO->new(-file=>'AE003644.gbk',
32 Bio::SeqIO->new(-format=>'asciitree');
33 while ($seq = $seqio->next_seq()) {
35 # get top level unflattended SeqFeatureI objects
36 $unflattener->unflatten_seq(-seq=>$seq,
38 $out->write_seq($seq);
40 @top_sfs = $seq->get_SeqFeatures;
41 foreach my $sf (@top_sfs) {
42 # do something with top-level features (eg genes)
49 Most GenBank entries for annotated genomic DNA contain a B<flat> list
50 of features. These features can be parsed into an equivalent flat list
51 of L<Bio::SeqFeatureI> objects using the standard L<Bio::SeqIO>
52 classes. However, it is often desirable to B<unflatten> this list into
53 something resembling actual B<gene models>, in which genes, mRNAs and CDSs
54 are B<nested> according to the nature of the gene model.
56 The BioPerl object model allows us to store these kind of associations
57 between SeqFeatures in B<containment hierarchies> -- any SeqFeatureI
58 object can contain nested SeqFeatureI objects. The
59 Bio::SeqFeature::Tools::Unflattener object facilitates construction of
60 these hierarchies from the underlying GenBank flat-feature-list
63 For example, if you were to look at a typical GenBank DNA entry, say,
64 B<AE003644>, you would see a flat list of features:
83 These features have sequence locations, but it is not immediately
84 clear how to write code such that each mRNA is linked to the
85 appropriate CDS (other than relying on IDs which is very bad)
87 We would like to convert the above list into the B<containment
88 hierarchy>, shown below:
113 Where each feature is nested underneath its container. Note that exons
114 have been automatically inferred (even for tRNA genes).
116 We do this using a call on a L<Bio::SeqFeature::Tools::Unflattener>
119 @sfs = $unflattener->unflatten_seq(-seq=>$seq);
121 This would return a list of the B<top level> (i.e. container)
122 SeqFeatureI objects - in this case, genes. Other top level features
123 are possible; for instance, the B<source> feature which is always
124 present, and other features such as B<variation> or B<misc_feature>
127 The containment hierarchy can be accessed using the get_SeqFeature()
128 call on any feature object - see L<Bio::SeqFeature::FeatureHolderI>.
129 The following code will traverse the containment hierarchy for a
133 $sf = shift; # $sf isa Bio::SeqfeatureI
135 # ...do something with $sf!
137 # depth first traversal of containment tree
138 @contained_sfs = $sf->get_SeqFeatures;
139 traverse($_) foreach @contained_sfs;
142 Once you have built the hierarchy, you can do neat stuff like turn the
143 features into 'rich' feature objects (eg
144 L<Bio::SeqFeature::Gene::GeneStructure>) or convert to a suitable
145 format such as GFF3 or chadoxml (after mapping to the Sequence
146 Ontology); this step is not described here.
150 Due to the quixotic nature of how features are stored in
151 GenBank/EMBL/DDBJ, there is no guarantee that the default behaviour of
152 this module will produce perfect results. Sometimes it is hard or
153 impossible to build a correct containment hierarchy if the information
154 provided is simply too lossy, as is often the case. If you care deeply
155 about your data, you should always manually inspect the resulting
156 containment hierarchy; you may have to customise the algorithm for
157 building the hierarchy, or even manually tweak the resulting
158 hierarchy. This is explained in more detail further on in the document.
160 However, if you are satisfied with the default behaviour, then you do
161 not need to read any further. Just make sure you set the parameter
162 B<use_magic> - this will invoke incantations which will magically
163 produce good results no matter what the idiosyncracies of the
164 particular GenBank record in question.
168 $unflattener->unflatten_seq(-seq=>$seq,
171 The success of this depends on the phase of the moon at the time the
172 entry was submitted to GenBank. Note that the magical recipe is being
173 constantly improved, so the results of invoking magic may vary
174 depending on the bioperl release.
176 If you are skeptical of magic, or you wish to exact fine grained
177 control over how the entry is unflattened, or you simply wish to
178 understand more about how this crazy stuff works, then read on!
180 =head1 PROBLEMATIC DATA AND INCONSISTENCIES
182 Occasionally the Unflattener will have problems with certain
183 records. For example, the record may contain inconsistent data - maybe
184 there is an B<exon> entry that has no corresponding B<mRNA> location.
186 The default behaviour is to throw an exception reporting the problem,
187 if the problem is relatively serious - for example, inconsistent data.
189 You can exert more fine grained control over this - perhaps you want
190 the Unflattener to do the best it can, and report any problems. This
191 can be done - refer to the methods.
203 This is the default algorithm; you should be able to override any part
206 The core of the algorithm is in two parts
210 =item Partitioning the flat feature list into groups
212 =item Resolving the feature containment hierarchy for each group
216 There are other optional steps after the completion of these two
217 steps, such as B<inferring exons>; we now describe in more detail what
220 =head2 Partitioning into groups
222 First of all the flat feature list is partitioned into B<group>s.
224 The default way of doing this is to use the B<gene> attribute; if we
225 look at two features from GenBank accession AE003644.3:
230 /note="last curated on Thu Dec 13 16:51:32 PST 2001"
232 /db_xref="FLYBASE:FBgn0005771"
233 mRNA join(20111..20584,20887..23268)
237 /db_xref="FLYBASE:FBgn0005771"
239 Both these features share the same /gene tag which is "noc", so they
240 correspond to the same gene model (the CDS feature is not shown, but
241 this also has a tag-value /gene="noc").
243 Not all groups need to correspond to gene models, but this is the most
244 common use case; later on we shall describe how to customise the
247 Sometimes other tags have to be used; for instance, if you look at the
248 entire record for AE003644.3 you will see you actually need the use the
249 /locus_tag attribute. This attribute is actually B<not present> in
252 You can override this:
254 $collection->unflatten_seq(-seq=>$seq, -group_tag=>'locus_tag');
256 Alternatively, if you B<-use_magic>, the object will try and make a
257 guess as to what the correct group_tag should be.
259 At the end of this step, we should have a list of groups - there is no
260 structure within a group; the group just serves to partition the flat
261 features. For the example data above, we would have the following groups.
267 [ gene mRNA mRNA mRNA CDS CDS CDS ]
269 =head3 Multicopy Genes
271 Multicopy genes are usually rRNAs or tRNAs that are duplicated across
272 the genome. Because they are functionally equivalent, and usually have
273 the same sequence, they usually have the same group_tag (ie gene
274 symbol); they often have a /note tag giving copy number. This means
275 they will end up in the same group. This is undesirable, because they
276 are spatially disconnected.
278 There is another step, which involves splitting spatially disconnected
279 groups into distinct groups
283 [gene-rrn3 rRNA-rrn3 gene-rrn3 rRNA-rrn3]
287 [gene-rrn3 rRNA-rrn3] [gene-rrn3 rRNA-rrn3]
289 based on the coordinates
293 The next step is to add some structure to each group, by making
294 B<containment hierarchies>, trees that represent how the features
297 =head2 Resolving the containment hierarchy
299 After the grouping is done, we end up with a list of groups which
300 probably contain features of type 'gene', 'mRNA', 'CDS' and so on.
302 Singleton groups (eg the 'source' feature) are ignored at this stage.
304 Each group is itself flat; we need to add an extra level of
305 organisation. Usually this is because different spliceforms
306 (represented by the 'mRNA' feature) can give rise to different
307 protein products (indicated by the 'CDS' feature). We want to correctly
308 associate mRNAs to CDSs.
310 We want to go from a group like this:
312 [ gene mRNA mRNA mRNA CDS CDS CDS ]
314 to a containment hierarchy like this:
324 In which each CDS is nested underneath the correct corresponding mRNA.
326 For entries that contain no alternate splicing, this is simple; we
331 Must resolve to the tree
337 How can we do this in entries with alternate splicing? The bad
338 news is that there is no guaranteed way of doing this correctly for
339 any GenBank entry. Occasionally the submission will have been done in
340 such a way as to reconstruct the containment hierarchy. However, this
341 is not consistent across databank entries, so no generic solution can
342 be provided by this object. This module does provide the framework
343 within which you can customise a solution for the particular dataset
344 you are interested in - see later.
346 The good news is that there is an inference we can do that should
347 produce pretty good results the vast majority of the time. It uses
348 splice coordinate data - this is the default behaviour of this module,
349 and is described in detail below.
351 =head2 Using splice site coordinates to infer containment
353 If an mRNA is to be the container for a CDS, then the splice site
354 coordinates (or intron coordinates, depending on how you look at it)
355 of the CDS must fit inside the splice site coordinates of the mRNA.
357 Ambiguities can still arise, but the results produced should still be
358 reasonable and consistent at the sequence level. Look at this fake
361 mRNA XXX---XX--XXXXXX--XXXX join(1..3,7..8,11..16,19..23)
362 mRNA XXX-------XXXXXX--XXXX join(1..3,11..16,19..23)
363 CDS XXXX--XX join(13..16,19..20)
364 CDS XXXX--XX join(13..16,19..20)
366 [obviously the positions have been scaled down]
368 We cannot unambiguously match mRNA with CDS based on splice sites,
369 since both CDS share the splice site locations 16^17 and
370 18^19. However, the consequences of making a wrong match are probably
371 not very severe. Any annotation data attached to the first CDS is
372 probably identical to the seconds CDS, other than identifiers.
374 The default behaviour of this module is to make an arbitrary call
375 where it is ambiguous (the mapping will always be bijective; i.e. one
376 mRNA -E<gt> one CDS).
378 [TODO: NOTE: not tested on EMBL data, which may not be bijective; ie two
379 mRNAs can share the same CDS??]
381 This completes the building of the containment hierarchy; other
384 =head1 POST-GROUPING STEPS
386 =head2 Inferring exons from mRNAs
388 This step always occurs if B<-use_magic> is invoked.
390 In a typical GenBank entry, the exons are B<implicit>. That is they
391 can be inferred from the mRNA location.
395 mRNA join(20111..20584,20887..23268)
397 This tells us that this particular transcript has two exons. In
398 bioperl, the mRNA feature will have a 'split location'.
402 $unflattener->feature_from_splitloc(-seq=>$seq);
404 This will generate the necessary exon features, and nest them under
405 the appropriate mRNAs. Note that the mRNAs will no longer have split
406 locations - they will have simple locations spanning the extent of the
407 exons. This is intentional, to avoid redundancy.
409 Occasionally a GenBank entry will have both implicit exons (from the
410 mRNA location) B<and> explicit exon features.
412 In this case, exons will still be transferred. Tag-value data from the
413 explicit exon will be transferred to the implicit exon. If exons are
414 shared between mRNAs these will be represented by different
415 objects. Any inconsistencies between implicit and explicit will be
418 =head3 tRNAs and other noncoding RNAs
420 exons will also be generated from these features
422 =head2 Inferring mRNAs from CDS
424 Some GenBank entries represent gene models using features of type
425 gene, mRNA and CDS; some entries just use gene and CDS.
427 If we only have gene and CDS, then the containment hierarchies will
433 If we want the containment hierarchies to be uniform, like this
439 Then we must create an mRNA feature. This will have identical
440 coordinates to the CDS. The assumption is that there is either no
441 untranslated region, or it is unknown.
443 To do this, we can call
445 $unflattener->infer_mRNA_from_CDS(-seq=>$seq);
447 This is taken care of automatically, if B<-use_magic> is invoked.
451 =head2 Customising the grouping of features
453 The default behaviour is suited mostly to building models of protein
454 coding genes and noncoding genes from genbank genomic DNA submissions.
456 You can change the tag used to partition the feature by passing in a
457 different group_tag argument - see the unflatten_seq() method
459 Other behaviour may be desirable. For example, even though SNPs
460 (features of type 'variation' in GenBank) are not actually part of the
461 gene model, it may be desirable to group SNPs that overlap or are
464 It should certainly be possible to extend this module to do
465 this. However, I have yet to code this part!!! If anyone would find
466 this useful let me know.
468 In the meantime, you could write your own grouping subroutine, and
469 feed the results into unflatten_groups() [see the method documentation
472 =head2 Customising the resolution of the containment hierarchy
474 Once the flat list of features has been partitioned into groups, the
475 method unflatten_group() is called on each group to build a tree.
477 The algorithm for doing this is described above; ambiguities are
478 resolved by using splice coordinates. As discussed, this can be
481 Some submissions may contain information in tags/attributes that hint
482 as to the mapping that needs to be made between the features.
484 For example, with the Drosophila Melanogaster release 3 submission, we
485 see that CDS features in alternately spliced mRNAs have a form like
488 CDS join(145588..145686,145752..146156,146227..146493)
490 /note="CG32954 gene product from transcript CG32954-RA"
491 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
493 /product="CG32954-PA"
494 /protein_id="AAF53403.1"
495 /db_xref="GI:7298167"
496 /db_xref="FLYBASE:FBgn0052954"
497 /translation="MSFTLTNKNVIFVAGLGGIGLDTSKELLKRDLKNLVILDRIENP..."
499 Here the /note tag provides the clue we need to link CDS to mRNA
500 (highlighted with ^^^^). We just need to find the mRNA with the tag
502 /product="CG32954-RA"
504 I have no idea how consistent this practice is across submissions; it
505 is consistent for the fruitfly genome submission.
507 We can customise the behaviour of unflatten_group() by providing our
508 own resolver method. This obviously requires a bit of extra
509 programming, but there is no way to get around this.
511 Here is an example of how to pass in your own resolver; this example
512 basically checks the parent (container) /product tag to see if it
513 matches the required string in the child (contained) /note tag.
515 $unflattener->unflatten_seq(-seq=>$seq,
516 -group_tag=>'locus_tag',
517 -resolver_method=>sub {
519 my ($sf, @candidate_container_sfs) = @_;
520 if ($sf->has_tag('note')) {
521 my @notes = $sf->get_tag_values('note');
522 my @trnames = map {/from transcript\s+(.*)/;
524 @trnames = grep {$_} @trnames;
527 $self->throw("UNRESOLVABLE");
529 elsif (@trnames == 1) {
530 $trname = $trnames[0];
533 $self->throw("AMBIGUOUS: @trnames");
538 $_->has_tag('product') ?
539 $_->get_tag_values('product') :
542 } @candidate_container_sfs;
543 if (@container_sfs == 0) {
544 $self->throw("UNRESOLVABLE");
546 elsif (@container_sfs == 1) {
548 return $container_sfs[0];
551 $self->throw("AMBIGUOUS");
556 the resolver method is only called when there is more than one spliceform.
558 =head2 Parsing mRNA records
560 Some of the entries in sequence databanks are for mRNA sequences as
561 well as genomic DNA. We may want to build models from these too.
563 NOT YET DONE - IN PROGRESS!!!
565 Open question - what would these look like?
567 Ideally we would like a way of combining a mRNA record with the
568 corresponding SeFeature entry from the appropriate genomic DNA
569 record. This could be problemmatic in some cases - for example, the
570 mRNA sequences may not match 100% (due to differences in strain,
571 assembly problems, sequencing problems, etc). What then...?
575 Feature table description
577 http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
583 User feedback is an integral part of the evolution of this and other
584 Bioperl modules. Send your comments and suggestions preferably to the
585 Bioperl mailing lists Your participation is much appreciated.
587 bioperl-l@bioperl.org - General discussion
588 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
592 Please direct usage questions or support issues to the mailing list:
594 I<bioperl-l@bioperl.org>
596 rather than to the module maintainer directly. Many experienced and
597 reponsive experts will be able look at the problem and quickly
598 address it. Please include a thorough description of the problem
599 with code and data examples if at all possible.
601 =head2 Reporting Bugs
603 report bugs to the Bioperl bug tracking system to help us keep track
604 the bugs and their resolution. Bug reports can be submitted via the
607 https://github.com/bioperl/bioperl-live/issues
609 =head1 AUTHOR - Chris Mungall
611 Email: cjm@fruitfly.org
615 The rest of the documentation details each of the object
616 methods. Internal methods are usually preceded with a _
621 # Let the code begin...
623 package Bio
::SeqFeature
::Tools
::Unflattener
;
626 # Object preamble - inherits from Bio::Root::Root
627 use Bio
::Location
::Simple
;
628 use Bio
::SeqFeature
::Generic
;
632 use base
qw(Bio::Root::Root);
637 Usage : $unflattener = Bio::SeqFeature::Tools::Unflattener->new();
638 $unflattener->unflatten_seq(-seq=>$seq);
639 Function: constructor
641 Returns : a new Bio::SeqFeature::Tools::Unflattener
646 -seq : A L<Bio::SeqI> object (optional)
647 the sequence to unflatten; this can also be passed in
648 when we call unflatten_seq()
650 -group_tag : a string representing the /tag used to partition flat features
651 (see discussion above)
657 my($class,@args) = @_;
658 my $self = $class->SUPER::new
(@args);
660 my($seq, $group_tag, $trust_grouptag) =
661 $self->_rearrange([qw(SEQ
667 $seq && $self->seq($seq);
668 $group_tag && $self->group_tag($group_tag);
669 # $self->{'trust_grouptag'}= $trust_grouptag if($trust_grouptag); #dgg suggestion
670 return $self; # success - we hope!
675 return if $self->{_reported_problems
};
676 return if $self->{_ignore_problems
};
677 my @probs = $self->get_problems;
678 if (!$self->{_problems_reported
} &&
681 "WARNING: There are UNREPORTED PROBLEMS.\n".
682 "You may wish to use the method report_problems(), \n",
683 "or ignore_problems() on the Unflattener object\n");
691 Usage : $unflattener->seq($newval)
694 Returns : value of seq (a Bio::SeqI)
695 Args : on set, new value (a Bio::SeqI, optional)
697 The Bio::SeqI object should hold a flat list of Bio::SeqFeatureI
698 objects; this is the list that will be unflattened.
700 The sequence object can also be set when we call unflatten_seq()
707 return $self->{'seq'} = shift if @_;
708 return $self->{'seq'};
714 Usage : $unflattener->group_tag($newval)
717 Returns : value of group_tag (a scalar)
718 Args : on set, new value (a scalar or undef, optional)
720 This is the tag that will be used to collect elements from the flat
721 feature list into groups; for instance, if we look at two typical
727 /note="last curated on Thu Dec 13 16:51:32 PST 2001"
729 /db_xref="FLYBASE:FBgn0005771"
730 mRNA join(20111..20584,20887..23268)
734 /db_xref="FLYBASE:FBgn0005771"
736 We can see that these comprise the same gene model because they share
737 the same /gene attribute; we want to collect these together in groups.
739 Setting group_tag is optional. The default is to use 'gene'. In the
740 example above, we could also use /locus_tag
747 return $self->{'group_tag'} = shift if @_;
748 return $self->{'group_tag'};
754 Usage : $unflattener->partonomy({mRNA=>'gene', CDS=>'mRNA')
757 Returns : value of partonomy (a scalar)
758 Args : on set, new value (a scalar or undef, optional)
760 A hash representing the containment structure that the seq_feature
761 nesting should conform to; each key represents the contained (child)
762 type; each value represents the container (parent) type.
769 return $self->{'partonomy'} = shift if @_;
770 if (!$self->{'partonomy'}) {
771 $self->{'partonomy'} = $self->_default_partonomy;
773 return $self->{'partonomy'};
776 sub _default_partonomy
{
789 pseudoexon
=> 'pseudogene',
790 pseudointron
=> 'pseudogene',
791 pseudotranscript
=> 'pseudogene',
795 =head2 structure_type
797 Title : structure_type
798 Usage : $unflattener->structure_type($newval)
801 Returns : value of structure_type (a scalar)
802 Args : on set, new value (an int or undef, optional)
804 GenBank entries conform to different flavours, or B<structure
805 types>. Some have mRNAs, some do not.
807 Right now there are only two base structure types defined. If you set
808 the structure type, then appropriate unflattening action will be
809 taken. The presence or absence of explicit exons does not affect the
812 If you invoke B<-use_magic> then this will be set automatically, based
813 on the content of the record.
817 =item Type 0 (DEFAULT)
826 with this structure type, we want the seq_features to be nested like this
833 exons and introns are implicit from the mRNA 'join' location
835 to get exons from the mRNAs, you will need this call (see below)
837 $unflattener->feature_from_splitloc(-seq=>$seq);
849 there are no mRNA features
851 with this structure type, we want the seq_features to be nested like this
858 exon and intron may or may not be present; they may be implicit from
859 the CDS 'join' location
868 return $self->{'structure_type'} = shift if @_;
869 return $self->{'structure_type'};
875 Usage : @probs = get_problems()
876 Function: Get the list of problem(s) for this object.
878 Returns : An array of [severity, description] pairs
881 In the course of unflattening a record, problems may occur. Some of
882 these problems are non-fatal, and can be ignored.
884 Problems are represented as arrayrefs containing a pair [severity,
887 severity is a number, the higher, the more severe the problem
889 the description is a text string
896 return @
{$self->{'_problems'}} if exists($self->{'_problems'});
900 =head2 clear_problems
902 Title : clear_problems
904 Function: resets the problem list to empty
913 my ($self,@args) = @_;
914 $self->{'_problems'} = [];
924 $self->{'_problems'} = [] unless exists($self->{'_problems'});
925 if ($self->verbose > 0) {
926 warn( "PROBLEM: $_\n") foreach @_;
928 push(@
{$self->{'_problems'}}, @_);
935 my ($severity, $desc, @sfs) = @_;
937 foreach my $sf (@sfs) {
939 sprintf("\nSF [$sf]: ". $sf->location->to_FTstring . "; %s\n",
944 $sf->get_tag_values($_) : ()
945 } qw(locus_tag gene product label)));
948 my $thresh = $self->error_threshold;
949 if ($severity > $thresh) {
950 $self->{_problems_reported
} = 1;
951 $self->throw("PROBLEM, SEVERITY==$severity\n$desc");
953 $self->add_problem([$severity, $desc]);
957 =head2 report_problems
959 Title : report_problems
960 Usage : $unflattener->report_problems(\*STDERR);
964 Args : FileHandle (defaults to STDERR)
970 my ($self, $fh) = @_;
975 foreach my $problem ($self->get_problems) {
976 my ($sev, $desc) = @
$problem;
977 printf $fh "PROBLEM, SEVERITY==$sev\n$desc\n";
979 $self->{_problems_reported
} = 1;
983 =head2 ignore_problems
985 Title : ignore_problems
986 Usage : $obj->ignore_problems();
992 Unflattener is very particular about problems it finds along the
993 way. If you have set the error_threshold such that less severe
994 problems do not cause exceptions, Unflattener still expects you to
995 report_problems() at the end, so that the user of the module is aware
996 of any inconsistencies or problems with the data. In fact, a warning
997 will be produced if there are unreported problems. To silence, this
998 warning, call the ignore_problems() method before the Unflattener
1003 sub ignore_problems
{
1005 $self->{_ignore_problems
} = 1;
1010 =head2 error_threshold
1012 Title : error_threshold
1013 Usage : $obj->error_threshold($severity)
1016 Returns : value of error_threshold (a scalar)
1017 Args : on set, new value (an integer)
1019 Sets the threshold above which errors cause this module to throw an
1020 exception. The default is 0; all problems with a severity E<gt> 0 will
1023 If you raise the threshold to 1, then the unflattening process will be
1024 more lax; problems of severity==1 are generally non-fatal, but may
1025 indicate that the results should be inspected, for example, to make
1026 sure there is no data loss.
1030 sub error_threshold
{
1033 return $self->{'error_threshold'} = shift if @_;
1034 return $self->{'error_threshold'} || 0;
1041 # given a type (eg mRNA), will return the container type (eg gene)
1042 sub get_container_type
{
1043 my ($self,$type) = @_;
1044 my @roots = $self->_get_partonomy_roots;
1045 if (grep {$_ eq $type} @roots) {
1046 # it is a root - no parents/containers
1049 my $ch = $self->partonomy;
1050 my $ctype = $ch->{$type};
1052 # asterix acts as a wild card
1053 $ctype = $ch->{'*'};
1058 # get root node of partonomy hierarchy (usually gene)
1059 sub _get_partonomy_roots
{
1061 my $ch = $self->partonomy;
1062 my @parents = values %$ch;
1063 # find parents that do not have parents themselves
1064 return grep {!$ch->{$_}} @parents;
1069 =head2 unflatten_seq
1071 Title : unflatten_seq
1072 Usage : @sfs = $unflattener->unflatten_seq($seq);
1073 Function: turns a flat list of features into a list of holder features
1075 Returns : list of Bio::SeqFeatureI objects
1078 partitions a list of features then arranges them in a nested tree; see
1079 above for full explanation.
1081 note - the Bio::SeqI object passed in will be modified
1085 -seq : a Bio::SeqI object; must contain Bio::SeqFeatureI objects
1086 (this is optional if seq has already been set)
1088 -use_magic: if TRUE (ie non-zero) then magic will be invoked;
1089 see discussion above.
1091 -resolver_method: a CODE reference
1092 see the documentation above for an example of
1093 a subroutine that can be used to resolve hierarchies
1096 this is optional - if nothing is supplied, a default
1097 subroutine will be used (see below)
1099 -group_tag: a string
1100 [ see the group_tag() method ]
1101 this overrides the default group_tag which is 'gene'
1108 my ($self,@args) = @_;
1110 my($seq, $resolver_method, $group_tag, $partonomy,
1111 $structure_type, $resolver_tag, $use_magic, $noinfer) =
1112 $self->_rearrange([qw(SEQ
1123 # seq we want to unflatten
1124 $seq = $seq || $self->seq;
1130 # prevent bad argument combinations
1132 defined($structure_type)) {
1133 $self->throw("You cannot set both -partonomy and -structure_type\n".
1134 "(the former is implied by the latter)");
1137 # remember the current value of partonomy, to reset later
1138 my $old_partonomy = $self->partonomy;
1139 $self->partonomy($partonomy) if defined $partonomy;
1141 # remember old structure_type
1142 my $old_structure_type = $self->structure_type;
1143 $self->structure_type($structure_type) if defined $structure_type;
1145 # if we are sourcing our data from genbank, all the
1146 # features should be flat (eq no sub_SeqFeatures)
1147 my @flat_seq_features = $seq->get_SeqFeatures;
1148 my @all_seq_features = $seq->get_all_SeqFeatures;
1151 if (@all_seq_features > @flat_seq_features) {
1152 $self->throw("It looks as if this sequence has already been unflattened");
1154 if (@all_seq_features < @flat_seq_features) {
1155 $self->throw("ASSERTION ERROR: something is seriously wrong with your features");
1158 # tag for ungrouping; usually /gene or /locus_tag
1159 # for example: /gene="foo"
1160 $group_tag = $group_tag || $self->group_tag;
1162 # use magic to guess the group tag
1163 my @sfs_with_locus_tag =
1164 grep {$_->has_tag("locus_tag")} @flat_seq_features;
1165 my @sfs_with_gene_tag =
1166 grep {$_->has_tag("gene")} @flat_seq_features;
1167 my @sfs_with_product_tag =
1168 grep {$_->has_tag("product")} @flat_seq_features;
1170 # if ($group_tag && $self->{'trust_grouptag'}) { # dgg suggestion
1174 if (@sfs_with_locus_tag) {
1175 # dgg note: would like to -use_magic with -group_tag = 'gene' for ensembl genomes
1176 # where ensembl gene FT have both /locus_tag and /gene, but mRNA, CDS have /gene only
1177 if ($group_tag && $group_tag ne 'locus_tag') {
1178 $self->throw("You have explicitly set group_tag to be '$group_tag'\n".
1179 "However, I detect that some features use /locus_tag\n".
1180 "I believe that this is the correct group_tag to use\n".
1181 "You can resolve this by either NOT setting -group_tag\n".
1182 "OR you can unset -use_magic to regain control");
1185 # use /locus_tag instead of /gene tag for grouping
1186 # see GenBank entry AE003677 (version 3) for an example
1187 $group_tag = 'locus_tag';
1188 if ($self->verbose > 0) {
1189 warn "Set group tag to: $group_tag\n";
1193 # on rare occasions, records will have no /gene or /locus_tag
1194 # but it WILL have /product tags. These serve the same purpose
1195 # for grouping. For an example, see AY763288 (also in t/data)
1196 if (@sfs_with_locus_tag==0 &&
1197 @sfs_with_gene_tag==0 &&
1198 @sfs_with_product_tag>0 &&
1200 $group_tag = 'product';
1201 if ($self->verbose > 0) {
1202 warn "Set group tag to: $group_tag\n";
1208 $group_tag = 'gene';
1211 # ------------------------------
1212 # GROUP FEATURES using $group_tag
1213 # collect features into unstructured groups
1214 # ------------------------------
1217 # we want to generate a list of groups;
1218 # each group is a list of SeqFeatures; this
1219 # group probably (but not necessarily)
1220 # corresponds to a gene model.
1222 # this array will look something like this:
1223 # ([$f1], [$f2, $f3, $f4], ...., [$f97, $f98, $f99])
1225 # there are also 'singleton' groups, with one member.
1226 # for instance, the 'source' feature is in a singleton group;
1227 # the same with others such as 'misc_feature'
1231 # --------------------
1232 # we hope that the genbank record allows us to group by some grouping
1234 # for instance, most of the time a gene model can be grouped using
1235 # the gene tag - that is where you see
1237 # in a genbank record
1238 # --------------------
1240 # keep an index of groups by their
1242 my %group_by_tag = ();
1245 # iterate through all features, putting them into groups
1246 foreach my $sf (@flat_seq_features) {
1247 if (!$sf->has_tag($group_tag)) {
1249 # this is an ungroupable feature;
1250 # add it to a group of its own
1251 push(@groups, [$sf]);
1255 my @group_tagvals = $sf->get_tag_values($group_tag);
1256 if (@group_tagvals > 1) {
1258 # currently something can only belong to one group
1260 ">1 value for /$group_tag: @group_tagvals\n".
1261 "At this time this module is not equipped to handle this adequately", $sf);
1263 # get value of group tag
1264 my $gtv = shift @group_tagvals;
1265 $gtv || $self->throw("Empty /$group_tag vals not allowed!");
1267 # is this a new group?
1268 my $group = $group_by_tag{$gtv};
1270 # this group has been encountered before - add current
1271 # sf to the end of the group
1275 # new group; add to index and create new group
1276 $group = [$sf]; # currently one member; probably more to come
1277 $group_by_tag{$gtv} = $group;
1278 push(@groups, $group);
1283 # as well as having the same group_tag, a group should be spatially
1284 # connected. if not, then the group should be split into subgroups.
1285 # this turns out to be necessary in the case of multicopy genes.
1286 # the standard way to represent these is as spatially disconnected
1287 # gene models (usually a 'gene' feature and some kind of RNA feature)
1288 # with the same group tag; the code below will split these into
1289 # seperate groups, one per copy.
1290 @groups = map { $self->_split_group_if_disconnected($_) } @groups;
1292 # remove any duplicates; most of the time the method below has
1293 # no effect. there are some unusual genbank records for which
1294 # duplicate removal is necessary. see the comments in the
1295 # _remove_duplicates_from_group() method if you want to know
1297 foreach my $group (@groups) {
1298 $self->_remove_duplicates_from_group($group);
1303 # PSEUDOGENES, PSEUDOEXONS AND PSEUDOINTRONS
1304 # these are indicated with the /pseudo tag
1305 # these are mapped to a different type; they should NOT
1306 # be treated as normal genes
1307 foreach my $sf (@all_seq_features) {
1308 if ($sf->has_tag('pseudo')) {
1309 my $type = $sf->primary_tag;
1310 # SO type is typically the same as the normal
1311 # type but preceeded by "pseudo"
1312 if ($type eq 'misc_RNA' || $type eq 'mRNA') {
1313 # dgg: see TypeMapper; both pseudo mRNA,misc_RNA should be pseudogenic_transcript
1314 $sf->primary_tag("pseudotranscript");
1317 $sf->primary_tag("pseudo$type");
1321 # now some of the post-processing that follows which applies to
1322 # genes will NOT be applied to pseudogenes; this is deliberate
1323 # for example, gene models are normalised to be gene-transcript-exon
1324 # for pseudogenes we leave them as pseudogene-pseudoexon
1327 my $need_to_infer_exons = 0;
1328 my $need_to_infer_mRNAs = 0;
1329 my @removed_exons = ();
1331 if (defined($structure_type)) {
1332 $self->throw("Can't combine use_magic AND setting structure_type");
1335 scalar(grep {$_->primary_tag eq 'exon'} @flat_seq_features);
1337 scalar(grep {$_->primary_tag eq 'exon'} @flat_seq_features);
1339 scalar(grep {$_->primary_tag eq 'mRNA'} @flat_seq_features);
1340 my $n_mrnas_attached_to_gene =
1341 scalar(grep {$_->primary_tag eq 'mRNA' &&
1342 $_->has_tag($group_tag)} @flat_seq_features);
1344 scalar(grep {$_->primary_tag eq 'CDS'} @flat_seq_features);
1346 scalar(grep {$_->primary_tag =~ /RNA/} @flat_seq_features);
1347 # Are there any CDS features in the record?
1351 # - a pc gene model should contain at the least a CDS
1353 # Are there any mRNA features in the record?
1354 if ($n_mrnas == 0) {
1356 # looks like structure_type == 1
1357 $structure_type = 1;
1358 $need_to_infer_mRNAs = 1;
1360 elsif ($n_mrnas_attached_to_gene == 0) {
1362 # $n_mrnas_attached_to_gene = 0
1364 # The entries _do_ contain mRNA features,
1365 # but none of them are part of a group/gene, i.e. they
1368 # this is an annoying weird file that has some floating
1370 # eg ftp.ncbi.nih.gov/genomes/Schizosaccharomyces_pombe/
1372 if ($self->verbose) {
1373 my @floating_mrnas =
1374 grep {$_->primary_tag eq 'mRNA' &&
1375 !$_->has_tag($group_tag)} @flat_seq_features;
1376 printf STDERR
"Unattached mRNAs:\n";
1377 foreach my $mrna (@floating_mrnas) {
1378 $self->_write_sf_detail($mrna);
1380 printf STDERR
"Don't know how to deal with these; filter at source?\n";
1383 foreach (@flat_seq_features) {
1384 if ($_->primary_tag eq 'mRNA') {
1385 # what should we do??
1387 # I think for pombe we just have to filter
1388 # out bogus mRNAs prior to starting
1392 # looks like structure_type == 2
1393 $structure_type = 2;
1394 $need_to_infer_mRNAs = 1;
1399 # we always infer exons in magic mode
1400 $need_to_infer_exons = 1;
1403 # this doesn't seem to be any kind of protein coding gene model
1404 if ( $n_rnas > 0 ) {
1405 $need_to_infer_exons = 1;
1409 $need_to_infer_exons = 0 if $noinfer; #NML
1411 if ($need_to_infer_exons) {
1412 # remove exons and introns from group -
1413 # we will infer exons later, and we
1414 # can always infer introns from exons
1415 foreach my $group (@groups) {
1418 my $type = $_->primary_tag();
1419 if ($type eq 'exon') {
1420 # keep track of all removed exons,
1421 # so we can do a sanity check later
1422 push(@removed_exons, $_);
1424 $type ne 'exon' && $type ne 'intron'
1427 # get rid of any groups that have zero members
1428 @groups = grep {scalar(@
$_)} @groups;
1431 # --- END OF MAGIC ---
1434 if (grep {!scalar(@
$_)} @groups) {
1435 $self->throw("ASSERTION ERROR: empty group");
1439 if ($self->verbose > 0) {
1440 printf STDERR
"GROUPS:\n";
1441 foreach my $group (@groups) {
1442 $self->_write_group($group, $group_tag);
1447 # --------- FINISHED GROUPING -------------
1450 # TYPE CONTAINMENT HIERARCHY (aka partonomy)
1451 # set the containment hierarchy if desired
1452 # see docs for structure_type() method
1453 if ($structure_type) {
1454 if ($structure_type == 1) {
1463 $self->throw("structure_type $structure_type is currently unknown");
1467 # see if we have an obvious resolver_tag
1469 foreach my $sf (@all_seq_features) {
1470 if ($sf->has_tag('derived_from')) {
1471 $resolver_tag = 'derived_from';
1477 # point all feature types without a container type to the root type.
1479 # for example, if we have an unanticipated feature_type, say
1480 # 'aberration', this should by default point to the parent 'gene'
1481 foreach my $group (@groups) {
1484 foreach my $sf (@sfs) {
1485 my $type = $sf->primary_tag;
1486 next if $type eq 'gene';
1487 my $container_type = $self->get_container_type($type);
1488 if (!$container_type) {
1489 $self->partonomy->{$type} = 'gene';
1496 # we have done the first part of the unflattening.
1497 # we now have a list of groups; each group is a list of seqfeatures.
1498 # the actual group itself is flat; we may want to unflatten this further;
1499 # for instance, a gene model can contain multiple mRNAs and CDSs. We may want
1500 # to link the correct mRNA to the correct CDS via the bioperl sub_SeqFeature tree.
1502 # what we would end up with would be
1508 my @top_sfs = $self->unflatten_groups(-groups
=>\
@groups,
1509 -resolver_method
=>$resolver_method,
1510 -resolver_tag
=>$resolver_tag);
1513 $self->partonomy($old_partonomy);
1516 $self->structure_type($old_structure_type);
1518 # modify the original Seq object - the top seqfeatures are now
1519 # the top features from each group
1520 $seq->remove_SeqFeatures;
1521 $seq->add_SeqFeature($_) foreach @top_sfs;
1523 # --------- FINISHED UNFLATTENING -------------
1525 # lets see if there are any post-unflattening tasks we need to do
1530 if ($need_to_infer_mRNAs) {
1531 if ($self->verbose > 0) {
1532 printf STDERR
"** INFERRING mRNA from CDS\n";
1534 $self->infer_mRNA_from_CDS(-seq
=>$seq, -noinfer
=>$noinfer);
1538 if ($need_to_infer_exons) {
1540 # infer exons, one group/gene at a time
1541 foreach my $sf (@top_sfs) {
1542 my @sub_sfs = ($sf, $sf->get_all_SeqFeatures);
1543 $self->feature_from_splitloc(-features
=>\
@sub_sfs);
1546 # some exons are stated explicitly; ie there is an "exon" feature
1547 # most exons are inferred; ie there is a "mRNA" feature with
1550 # if there were exons explicitly stated in the entry, we need to
1553 # make sure these exons are consistent with the inferred exons
1556 # transfer annotation (tag-vals) from the explicit exon to the
1558 if (@removed_exons) {
1559 my @allfeats = $seq->get_all_SeqFeatures;
1561 # find all the inferred exons that are children of mRNA
1562 my @mrnas = grep {$_->primary_tag eq 'mRNA'} @allfeats;
1564 grep {$_->primary_tag eq 'exon'}
1565 map {$_->get_SeqFeatures} @mrnas;
1567 my %exon_h = (); # index of exons by location;
1569 # there CAN be >1 exon at a location; we can represent these redundantly
1570 # (ie as a tree, not a graph)
1571 push(@
{$exon_h{$self->_locstr($_)}}, $_) foreach @exons;
1572 my @problems = (); # list of problems;
1574 # [$severity, $description] pair
1576 my ($n_exons, $n_removed_exons) =
1577 (scalar(keys %exon_h), scalar(@removed_exons));
1578 foreach my $removed_exon (@removed_exons) {
1579 my $locstr = $self->_locstr($removed_exon);
1580 my $inferred_exons = $exon_h{$locstr};
1581 delete $exon_h{$locstr};
1582 if ($inferred_exons) {
1583 my %exons_done = ();
1584 foreach my $exon (@
$inferred_exons) {
1586 # make sure we don't move stuff twice
1587 next if $exons_done{$exon};
1588 $exons_done{$exon} = 1;
1590 # we need to tranfer any tag-values from the explicit
1591 # exon to the implicit exon
1592 foreach my $tag ($removed_exon->get_all_tags) {
1593 my @vals = $removed_exon->get_tag_values($tag);
1594 if (!$exon->can("add_tag_value")) {
1595 # I'm puzzled as to what should be done here;
1596 # SeqFeatureIs are not necessarily mutable,
1597 # but we know that in practice the implementing
1599 $self->throw("The SeqFeature object does not ".
1600 "implement add_tag_value()");
1602 $exon->add_tag_value($tag, @vals);
1607 # no exons inferred at $locstr
1610 "there is a conflict with exons; there was an explicitly ".
1611 "stated exon with location $locstr, yet I cannot generate ".
1612 "this exon from the supplied mRNA locations\n"]);
1615 # do we have any inferred exons left over, that were not
1616 # covered in the explicit exons?
1618 # TODO - we ignore this problem for now
1621 sprintf("There are some inferred exons that are not in the ".
1622 "explicit exon list; they are the exons at locations:\n".
1623 join("\n", keys %exon_h)."\n")]);
1626 # report any problems
1628 my $thresh = $self->error_threshold;
1629 my @bad_problems = grep {$_->[0] > $thresh} @problems;
1630 if (@bad_problems) {
1631 printf STDERR
"PROBLEM:\n";
1632 $self->_write_hier(\
@top_sfs);
1633 # TODO - allow more fine grained control over this
1634 $self->{_problems_reported
} = 1;
1635 $self->throw(join("\n",
1636 map {"@$_"} @bad_problems));
1638 $self->problem(@
$_) foreach @problems;
1642 # --- end of inferring exons --
1644 # return new top level features; this can also
1646 # $seq->get_SeqFeatures();
1648 return $seq->get_SeqFeatures;
1651 # _split_group_if_disconnected([@sfs])
1653 # as well as having the same group_tag, a group should be spatially
1654 # connected. if not, then the group should be split into subgroups.
1655 # this turns out to be necessary in the case of multicopy genes.
1656 # the standard way to represent these is as spatially disconnected
1657 # gene models (usually a 'gene' feature and some kind of RNA feature)
1658 # with the same group tag; the code below will split these into
1659 # seperate groups, one per copy.
1661 sub _split_group_if_disconnected
{
1666 Bio
::Range
->disconnected_ranges(@sfs);
1669 $self->throw("ASSERTION ERROR");
1671 elsif (@ranges == 1) {
1672 # no need to split the group
1677 # split the group into disconnected ranges
1678 if ($self->verbose > 0) {
1679 printf STDERR
"GROUP PRE-SPLIT:\n";
1680 $self->_write_group($group, $self->group_tag);
1686 $_->intersection($range);
1689 if ($self->verbose > 0) {
1690 printf STDERR
"SPLIT GROUPS:\n";
1691 $self->_write_group($_, $self->group_tag) foreach @groups;
1697 sub _remove_duplicates_from_group
{
1701 # ::: WEIRD BOUNDARY CASE CODE :::
1702 # for some reason, there are some gb records with two gene
1703 # features for one gene; for example, see ATF14F8.gbk
1704 # in the t/data directory
1706 # in this case, we get rid of one of the genes
1708 my @genes = grep {$_->primary_tag eq 'gene'} @
$group;
1710 # OK, if we look at ATF14F8.gbk we see that some genes
1711 # just exist as a single location, some exist as a multisplit location;
1718 # gene complement(join(16790..19855,20136..20912,21378..21497,
1719 # 21654..21876,22204..22400,22527..23158,23335..23448,
1720 # 23538..23938,24175..24536,24604..24715,24889..24984,
1721 # 25114..25171,25257..25329,25544..25589,25900..26018,
1725 # the former is the 'standard' way of representing the gene in genbank;
1726 # the latter is redundant with the CDS entry. So we shall get rid of
1727 # the latter with the following filter
1729 if ($self->verbose > 0) {
1730 printf STDERR
"REMOVING DUPLICATES:\n";
1735 my $loc = $_->location;
1736 if ($loc->isa("Bio::Location::SplitLocationI")) {
1737 my @locs = $loc->each_Location;
1751 # OK, that didn't work. Our only resort is to just pick one at random
1752 @genes = ($genes[0]);
1755 @genes == 1 || $self->throw("ASSERTION ERROR");
1757 ($genes[0], grep {$_->primary_tag ne 'gene'} @
$group);
1760 # its a dirty job but someone's gotta do it
1765 =head2 unflatten_groups
1767 Title : unflatten_groups
1769 Function: iterates over groups, calling unflatten_group() [see below]
1771 Returns : list of Bio::SeqFeatureI objects that are holders
1776 -groups: list of list references; inner list is of Bio::SeqFeatureI objects
1777 e.g. ( [$sf1], [$sf2, $sf3, $sf4], [$sf5, ...], ...)
1779 -resolver_method: a CODE reference
1780 see the documentation above for an example of
1781 a subroutine that can be used to resolve hierarchies
1784 this is optional - a default subroutine will be used
1787 NOTE: You should not need to call this method, unless you want fine
1788 grained control over how the unflattening process.
1792 sub unflatten_groups
{
1793 my ($self,@args) = @_;
1794 my($groups, $resolver_method, $resolver_tag) =
1795 $self->_rearrange([qw(GROUPS
1801 # this is just a simple wrapper for unflatten_group()
1804 $self->unflatten_group(-group
=>$_,
1805 -resolver_method
=>$resolver_method,
1806 -resolver_tag
=>$resolver_tag)
1810 =head2 unflatten_group
1812 Title : unflatten_group
1814 Function: nests a group of features into a feature containment hierarchy
1816 Returns : Bio::SeqFeatureI objects that holds other features
1821 -group: reference to list of Bio::SeqFeatureI objects
1823 -resolver_method: a CODE reference
1824 see the documentation above for an example of
1825 a subroutine that can be used to resolve hierarchies
1828 this is optional - a default subroutine will be used
1831 NOTE: You should not need to call this method, unless you want fine
1832 grained control over how the unflattening process.
1836 sub unflatten_group
{
1837 my ($self,@args) = @_;
1839 my($group, $resolver_method, $resolver_tag) =
1840 $self->_rearrange([qw(GROUP
1846 if ($self->verbose > 0) {
1847 printf STDERR
"UNFLATTENING GROUP:\n";
1848 $self->_write_group($group, $self->group_tag);
1853 # we can safely ignore singletons (e.g. [source])
1854 return $sfs[0] if @sfs == 1;
1856 my $partonomy = $self->partonomy;
1858 # $resolver_method is a reference to a SUB that will resolve
1859 # ambiguous parent/child containment; for example, determining
1860 # which mRNAs go with which CDSs
1861 $resolver_method = $resolver_method || \
&_resolve_container_for_sf
;
1863 # TAG BASED RESOLVING OF HIERARCHIES
1865 # if the user specifies $resolver_tag, then we use this tag
1866 # to pair up ambiguous parents and children;
1868 # for example, the CDS feature may have a resolver tag of /derives_from
1869 # which is a 'foreign key' into the /label tag of the mRNA feature
1871 # this kind of tag-based resolution is possible for a certain subset
1872 # of genbank records
1874 # if no resolver tag is specified, we revert to the normal
1876 if ($resolver_tag) {
1877 my $backup_resolver_method = $resolver_method;
1878 # closure: $resolver_tag is remembered by this sub
1881 my ($self, $sf, @possible_container_sfs) = @_;
1882 my @container_sfs = ();
1883 if ($sf->has_tag($resolver_tag)) {
1884 my ($resolver_tagval) = $sf->get_tag_values($resolver_tag);
1885 # if a feature has a resolver_tag (e.g. /derives_from)
1886 # this specifies the /product, /symbol or /label for the
1891 $self->_write_sf($_) if $self->verbose > 0;
1892 foreach my $tag (qw(product symbol label)) {
1893 if ($_->has_tag($tag)) {
1895 $_->get_tag_values($tag);
1896 if (grep {$_ eq $resolver_tagval} @vals) {
1903 } @possible_container_sfs;
1906 return $backup_resolver_method->($sf, @possible_container_sfs);
1908 return map {$_=>0} @container_sfs;
1910 $resolver_method = $sub;
1913 # CONDITION: $resolver_tag is NOT set
1914 $self->throw("assertion error") if $resolver_tag;
1916 # we have now set $resolver_method to a subroutine for
1917 # disambiguatimng parent/child relationships. we will
1918 # now build the whole containment hierarchy for this group
1921 # FIND TOP/ROOT SEQFEATURES
1923 # find all the features for which there is no
1924 # containing feature type (eg genes)
1927 !$self->get_container_type($_->primary_tag);
1930 # CONDITION: there must be at most one root
1932 $self->_write_group($group, $self->group_tag);
1933 printf STDERR
"TOP SFS:\n";
1934 $self->_write_sf($_) foreach @top_sfs;
1935 $self->throw("multiple top-sfs in group");
1937 my $top_sf = $top_sfs[0];
1939 # CREATE INDEX OF SEQFEATURES BY TYPE
1940 my %sfs_by_type = ();
1941 foreach my $sf (@sfs) {
1942 push(@
{$sfs_by_type{$sf->primary_tag}}, $sf);
1945 # containment index; keyed by child; lookup parent
1946 # note: this index uses the stringified object reference of
1947 # the object as a surrogate lookup key
1949 my %container = (); # child -> parent
1951 # ALGORITHM: build containment graph
1953 # find all possible containers for each SF;
1954 # for instance, for a CDS, the possible containers are all
1955 # the mRNAs in the same group. For a mRNA, the possible
1956 # containers are any SFs of type 'gene' (should only be 1).
1957 # (these container-type mappings can be overridden)
1959 # contention is resolved by checking coordinates of splice sites
1960 # (this is the default, but can be overridden)
1962 # most of the time, there is no problem identifying a unique
1963 # parent for every child; this can be ambiguous when constructing
1964 # CDS to mRNA relationships with lots of alternate splicing
1966 # a hash of child->parent relationships is constructed (%container)
1967 # any mappings that need further resolution (eg CDS to mRNA) are
1968 # placed in %unresolved
1971 # (keyed by stringified object reference of child seqfeature)
1972 my %unresolved = (); # child -> [parent,score] to be resolved
1974 # index of seqfeatures by their stringified object reference;
1975 # this is essentially a way of 'reviving' an object from its stringified
1977 # (see NOTE ON USING OBJECTS AS KEYS IN HASHES, below)
1978 my %idxsf = map {$_=>$_} @sfs;
1980 foreach my $sf (@sfs) {
1981 my $type = $sf->primary_tag;
1983 # container type (e.g. the container type for CDS is usually mRNA)
1984 my $container_type =
1985 $self->get_container_type($type);
1986 if ($container_type) {
1988 my @possible_container_sfs =
1989 @
{$sfs_by_type{$container_type} || []};
1990 # we now have a list of possible containers
1991 # (eg for a CDS in an alternately spliced gene, this
1992 # would be a list of all the mRNAs for this gene)
1994 if (!@possible_container_sfs) {
1998 if (@possible_container_sfs == 1) {
1999 # this is the easy situation, whereby the containment
2000 # hierarchy is unambiguous. this will probably be the
2001 # case if the genbank record has no alternate splicing
2004 # ONE OPTION ONLY - resolved!
2005 $container{$sf} = $possible_container_sfs[0];
2009 # MULTIPLE CONTAINER CHOICES
2010 $self->throw("ASSERTION ERROR") unless @possible_container_sfs > 1;
2012 # push this onto the %unresolved graph, and deal with it
2015 # for now we hardcode things such that the only type
2016 # with ambiguous parents is a CDS; if this is violated,
2017 # it has a weak problem class of '1' so the API user
2018 # can easily set things to ignore these
2019 if ($sf->primary_tag ne 'CDS') {
2021 "multiple container choice for non-CDS; ".
2022 "CDS to mRNA should be the only ".
2023 "relationships requiring resolving",
2027 # previously we set the SUB $resolver_method
2028 $self->throw("ASSERTION ERROR")
2029 unless $resolver_method;
2031 # $resolver_method will assign scores to
2032 # parent/child combinations; later on we
2033 # will use these scores to find the optimal
2034 # parent/child pairings
2036 # the default $resolver_method uses splice sites to
2037 # score possible parent/child matches
2040 $resolver_method->($self, $sf, @possible_container_sfs);
2041 if (!%container_sfh) {
2043 "no containers possible for SeqFeature of ".
2044 "type: $type; this SF is being placed at ".
2047 # RESOLVED! (sort of - placed at root/gene level)
2048 $container{$sf} = $top_sf;
2050 # this sort of thing happens if the record is
2051 # badly messed up and there is absolutely no indication
2052 # of where to put the CDS. Perhaps we should just
2053 # place it with a random mRNA?
2055 foreach my $jsf (keys %container_sfh) {
2057 # add [score, parent] pairs to the %unresolved
2058 # lookup table/graph
2059 push(@
{$unresolved{$sf}},
2060 [$idxsf{$jsf}, $container_sfh{$jsf} || 0]);
2067 # not container type for $sf->primary_tag
2070 # $sf must be a root/top node (eg gene)
2076 # CODE CURRENTLY DISABLED
2078 # we require a 1:1 mapping between mRNAs and CDSs;
2079 # create artificial duplicates if we can't do this...
2081 my %childh = map {$_=>1} keys %unresolved;
2082 my %parenth = map {$_->[0]=>1} map {@
$_} values %unresolved;
2083 if ($self->verbose > 0) {
2084 printf STDERR
"MATCHING %d CHILDREN TO %d PARENTS\n",
2085 scalar(keys %childh), scalar(keys %parenth);
2087 # 99.99% of the time in genbank genomic record of structure type 0, we
2088 # see one CDS for every mRNA; one exception is the S Pombe
2089 # genome, which is all CDS, bar a few spurious mRNAs; we have to
2090 # filter out the spurious mRNAs in this case
2092 # another strange case is in the mouse genome, NT_078847.1
2093 # for Pcdh13 you will notice there is 4 mRNAs and 5 CDSs.
2095 # I'm at a loss for a really clever thing to do here. I think the
2096 # best thing is to create duplicate features to preserve the 1:1 mapping
2097 # my $suffix_id = 1;
2098 # while (keys %childh > keys %parenth) {
2105 if ($self->verbose > 0 && scalar(keys %unresolved)) {
2106 printf STDERR
"UNRESOLVED PAIRS:\n";
2107 foreach my $childsf (keys %unresolved) {
2108 my @poss = @
{$unresolved{$childsf}};
2109 foreach my $p (@poss) {
2110 my $parentsf = $p->[0];
2111 $childsf = $idxsf{$childsf};
2112 my @clabels = ($childsf->get_tagset_values(qw(protein_id label product)), "?");
2113 my @plabels = ($parentsf->get_tagset_values(qw(transcript_id label product)), "?");
2115 (" PAIR: $clabels[0] => $plabels[0] (of %d)\n",
2119 } # -- end of verbose
2121 # Now we have to fully resolve the containment hierarchy; remember,
2122 # the graph %container has the fully resolved child->parent links;
2124 # the graph %unresolved is keyed by children missing parents; we
2125 # need to put all these orphans in the %container graph
2127 # we do this using the scores in %unresolved, with the
2128 # find_best_matches() algorithm
2129 my $unresolved_problem_reported = 0;
2132 $self->find_best_matches(\
%unresolved, []);
2134 my ($g) = $sfs[0]->get_tagset_values($self->group_tag || 'gene');
2136 "Could not resolve hierarchy for $g");
2138 $unresolved_problem_reported = 1;
2140 foreach my $pair (@
$new_pairs) {
2141 if ($self->verbose > 0) {
2142 printf STDERR
" resolved pair @$pair\n";
2144 $container{$pair->[0]} = $pair->[1];
2145 delete $unresolved{$pair->[0]};
2149 # CONDITION: containment hierarchy resolved
2151 $self->throw("UNRESOLVED: %unresolved")
2152 unless $unresolved_problem_reported;
2155 # make nested SeqFeature hierarchy from @containment_pairs
2156 # ie put child SeqFeatures into parent SeqFeatures
2158 foreach my $sf (@sfs) {
2159 my $container_sf = $container{$sf};
2160 if ($container_sf) {
2161 # make $sf nested inside $container_sf
2163 # first check if the container spatially contains the containee
2164 if ($container_sf->contains($sf)) {
2166 $container_sf->add_SeqFeature($sf);
2169 # weird case - the container does NOT spatially
2170 # contain the containee;
2171 # we expand and throw a warning
2173 # for an example of this see ZFP91-CNTF dicistronic gene
2174 # in NCBI chrom 11 build 34.3
2176 "Container feature does not spatially contain ".
2177 "subfeature. Perhaps this is a dicistronic gene? ".
2178 "I am expanding the parent feature",
2181 $container_sf->add_SeqFeature($sf, 'EXPAND');
2189 } # -- end of unflatten_group
2192 # A NOTE ON USING OBJECTS AS KEYS IN HASHES (stringified objects)
2194 # Often we with to use seqfeatures as keys in a hashtable; because seqfeatures
2195 # in bioperl have no unique ID, we use a surrogate ID in the form of the
2196 # stringified object references - this is just what you get if you say
2200 # this is guaranteed to be unique (within a particular perl execution)
2202 # often we want to 'revive' the objects used as keys in a hash - once the
2203 # objects are used as keys, remember it is the *strings* used as keys and
2204 # not the object itself, so the object needs to be revived using another
2205 # hashtable that looks like this
2207 # %sfidx = map { $_ => $_ } @sfs
2212 # recursively finds the best set of pairings from a matrix of possible pairings
2214 # tries to make sure nothing is unpaired
2216 # given a matrix of POSSIBLE matches
2217 # (matrix expressed as hash/lookup; keyed by child object; val = [parent, score]
2220 sub find_best_matches
{
2223 my $pairs = shift; # [child,parent] pairs already selected
2225 my $verbose = $self->verbose;
2226 #################################print "I";
2228 printf STDERR
"find_best_matches: (/%d)\n", scalar(@
$pairs);
2231 my %selected_children = map {($_->[0]=>1)} @
$pairs;
2232 my %selected_parents = map {($_->[1]=>1)} @
$pairs;
2234 # make a copy of the matrix with the portions still to be
2236 my %unresolved_parents = ();
2240 printf STDERR
" $_ : %s\n", join("; ", map {"[@$_]"} @
{$matrix->{$_}});
2242 if ($selected_children{$_}) {
2248 !$selected_parents{$_->[0]}
2250 $unresolved_parents{$_} = 1 foreach @parents;
2256 my @I = keys %unresolved;
2258 return $pairs if !scalar(keys %unresolved_parents);
2259 # NECESSARY CONDITION:
2260 # all possible parents have a child match
2262 return $pairs if !scalar(@I);
2263 # NECESSARY CONDITION:
2264 # all possible children have a parent match
2266 # give those with fewest choices highest priority
2268 # n possible parents
2269 scalar(@
{$unresolved{$a}})
2271 scalar(@
{$unresolved{$b}}) ;
2276 my @J = @
{$unresolved{$csf}}; # array of [parent, score]
2278 # sort by score, highest first
2284 # select pair(s) from remaining matrix of possible pairs
2285 # by iterating through possible parents
2287 my $successful_pairs;
2288 foreach my $j (@J) {
2289 my ($psf, $score) = @
$j;
2290 # would selecting $csf, $psf as a pair
2291 # remove all choices from another?
2293 foreach my $sf (@I) {
2294 if (!grep {$_->[0] ne $psf} @
{$unresolved{$sf}}) {
2295 # $psf was the only parent choice for $sf
2301 my $pair = [$csf, $psf];
2302 my $new_pairs = [@
$pairs, $pair];
2303 my $set = $self->find_best_matches($matrix, $new_pairs);
2305 $successful_pairs = $set;
2311 return $successful_pairs if $successful_pairs;
2316 # ----------------------------------------------
2317 # writes a group to stdout
2319 # mostly for logging/debugging
2320 # ----------------------------------------------
2324 my $group_tag = shift || 'gene';
2326 my $f = $group->[0];
2328 if ($f->has_tag($group_tag)) {
2329 ($label) = $f->get_tag_values($group_tag);
2331 if( $self->verbose > 0 ) {
2332 printf STDERR
(" GROUP [%s]:%s\n",
2335 map { $_->primary_tag } @
$group));
2343 printf STDERR
"TYPE:%s\n", $sf->primary_tag;
2347 sub _write_sf_detail
{
2350 printf STDERR
"TYPE:%s\n", $sf->primary_tag;
2351 my @locs = $sf->location->each_Location;
2352 printf STDERR
" %s,%s [%s]\n", $_->start, $_->end, $_->strand foreach @locs;
2358 my @sfs = @
{shift || []};
2359 my $indent = shift || 0;
2360 if( $self->verbose > 0 ) {
2361 foreach my $sf (@sfs) {
2363 if ($sf->has_tag('product')) {
2364 ($label) = $sf->get_tag_values('product');
2366 printf STDERR
"%s%s $label\n", ' ' x
$indent, $sf->primary_tag;
2367 my @sub_sfs = $sf->sub_SeqFeature;
2368 $self->_write_hier(\
@sub_sfs, $indent+1);
2373 # -----------------------------------------------
2375 # returns all possible containers for an SF based
2376 # on splice site coordinates; splice site coords
2378 # -----------------------------------------------
2379 sub _resolve_container_for_sf
{
2380 my ($self, $sf, @possible_container_sfs) = @_;
2382 my @coords = $self->_get_splice_coords_for_sf($sf);
2383 my $start = $sf->start;
2385 my $splice_uniq_str = "@coords";
2387 my @sf_score_pairs = ();
2388 # a CDS is contained by a mRNA if the locations of the splice
2389 # coordinates are identical
2390 foreach (@possible_container_sfs) {
2391 my @container_coords = $self->_get_splice_coords_for_sf($_);
2393 !$splice_uniq_str ||
2394 index("@container_coords", $splice_uniq_str) > -1;
2396 # the container cannot be smaller than the thing contained
2397 if ($_->start > $start || $_->end < $end) {
2403 # SPECIAL CASE FOR /ribosomal_slippage
2404 # See: https://www.ncbi.nlm.nih.gov/collab/FT/
2405 if (!$inside && $sf->has_tag('ribosomal_slippage')) {
2406 if ($self->verbose > 0) {
2407 printf STDERR
" Checking for ribosomal_slippage\n";
2410 # TODO: rewrite this to match introns;
2411 # each slippage will be a "fake" small CDS exon
2412 my @transcript_splice_sites = @container_coords;
2413 my @cds_splice_sites = @coords;
2414 ##printf STDERR "xxTR SSs: @transcript_splice_sites :: %s\n", $_->get_tag_values('product');
2415 ##printf STDERR "xxCD SSs: @cds_splice_sites :: %s\n\n", $sf->get_tag_values('product');
2417 # find the the first splice site within the CDS
2418 while (scalar(@transcript_splice_sites) &&
2419 $transcript_splice_sites[0] < $cds_splice_sites[0]) {
2420 shift @transcript_splice_sites;
2423 ##print STDERR "TR SSs: @transcript_splice_sites\n";
2424 ##print STDERR "CD SSs: @cds_splice_sites\n\n";
2426 if (!(scalar(@transcript_splice_sites)) ||
2427 $transcript_splice_sites[0] == $cds_splice_sites[0]) {
2429 # we will now try and align all splice remaining sites in the transcript and CDS;
2430 # any splice site that can't be aligned is assumed to be a ribosomal slippage
2434 $inside = 1; # innocent until proven guilty..
2435 while (@cds_splice_sites) {
2436 if (!@transcript_splice_sites) {
2438 # ribosomal slippage is after the last transcript splice site
2439 # Example: (NC_00007, isoform 3 of PEG10)
2440 # mRNA join(85682..85903,92646..99007)
2441 # mRNA join(85682..85903,92646..99007)
2442 # CDS join(85899..85903,92646..93825,93825..94994)
2444 # OR: None of the splice sites align;
2445 # may be a single CDS exon with one slippage inside it.
2446 # Example: (NC_00007, isoform 4 of PEG10)
2447 # mRNA join(85637..85892,92646..99007)
2448 # CDS join(92767..93825,93825..94994)
2450 # Yes, this code is repeated below...
2451 my $p1 = shift @cds_splice_sites;
2452 my $p2 = shift @cds_splice_sites;
2453 if ($self->verbose > 0) {
2454 printf STDERR
" Found the ribosomal_slippage: $p1..$p2\n";
2456 push(@slips, ($p2-$p1)-1);
2458 elsif ($cds_splice_sites[0] == $transcript_splice_sites[0]) {
2459 # splice sites align: this is not the slippage
2460 shift @cds_splice_sites;
2461 shift @transcript_splice_sites;
2462 ##print STDERR "MATCH\n";
2466 if ($cds_splice_sites[0] < $transcript_splice_sites[0]) {
2467 # potential slippage
2473 my $p1 = shift @cds_splice_sites;
2474 my $p2 = shift @cds_splice_sites;
2475 if ($self->verbose > 0) {
2476 printf STDERR
" Found the ribosomal_slippage: $p1..$p2\n";
2478 push(@slips, ($p2-$p1)-1);
2481 # not a potential ribosomal slippage
2482 $inside = 0; # guilty!
2483 ##print STDERR "FAIL\n";
2489 # TODO: this is currently completely arbitrary. How many ribosomal slippages do we allow?
2490 # perhaps we need some mini-statistical model here....?
2494 # TODO: this is currently completely arbitrary. What is the maximum size of a ribosomal slippage?
2495 # perhaps we need some mini-statistical model here....?
2496 if (grep {$_ > 2} @slips) {
2502 # not a ribosomal_slippage, sorry
2505 if ($self->verbose > 0) {
2506 printf STDERR
" Checking containment:[$inside] (@container_coords) IN ($splice_uniq_str)\n";
2509 # SCORE: matching (ss-scoords+2)/(n-container-ss-coords+2)
2511 (scalar(@coords)+2)/(scalar(@container_coords)+2);
2512 push(@sf_score_pairs,
2516 return @sf_score_pairs;
2519 sub _get_splice_coords_for_sf
{
2523 my @locs = $sf->location;
2524 if ($sf->location->isa("Bio::Location::SplitLocationI")) {
2525 @locs = $sf->location->each_Location;
2528 # get an ordered list of (start, end) positions
2532 # $_->strand > 0 ? ($_->start, $_->end) : ($_->end, $_->start)
2535 my @coords = map {($_->start, $_->end)} @locs;
2537 # remove first and last leaving only splice sites
2543 =head2 feature_from_splitloc
2545 Title : feature_from_splitloc
2546 Usage : $unflattener->feature_from_splitloc(-features=>$sfs);
2552 At this time all this method does is generate exons for mRNA or other RNA features
2556 -feature: a Bio::SeqFeatureI object (that conforms to Bio::FeatureHolderI)
2557 -seq: a Bio::SeqI object that contains Bio::SeqFeatureI objects
2558 -features: an arrayref of Bio::SeqFeatureI object
2563 sub feature_from_splitloc
{
2564 my ($self,@args) = @_;
2566 my($sf, $seq, $sfs) =
2567 $self->_rearrange([qw(FEATURE
2572 my @sfs = (@
{$sfs || []});
2573 push(@sfs, $sf) if $sf;
2575 $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2576 @sfs = $seq->get_all_SeqFeatures;
2578 my @exons = grep {$_->primary_tag eq 'exon'} @sfs;
2581 "There are already exons, so I will not infer exons");
2584 # index of features by type+location
2587 # infer for every feature
2588 foreach my $sf (@sfs) {
2590 $sf->isa("Bio::SeqFeatureI") || $self->throw("$sf NOT A SeqFeatureI");
2591 $sf->isa("Bio::FeatureHolderI") || $self->throw("$sf NOT A FeatureHolderI");
2593 my $type = $sf->primary_tag;
2594 next unless $type eq 'mRNA' or $type =~ /RNA/;
2596 # an mRNA from genbank will have a discontinuous location,
2597 # with each sub-location being equivalent to an exon
2598 my @locs = $sf->location;
2600 if ($sf->location->isa("Bio::Location::SplitLocationI")) {
2601 @locs = $sf->location->each_Location;
2607 $self->throw("ASSERTION ERROR: sf has no location objects");
2610 # make exons from locations
2613 my $subsf = Bio
::SeqFeature
::Generic
->new(-location
=>$_,
2614 -primary_tag
=>'exon');
2615 ## Provide seq_id to new feature:
2616 $subsf->seq_id($sf->seq_id) if $sf->seq_id;
2617 $subsf->source_tag($sf->source_tag) if $sf->source_tag;
2618 ## Transfer /locus_tag and /gene tag values to inferred
2619 ## features. TODO: Perhaps? this should not be done
2620 ## indiscriminantly but rather by virtue of the setting
2622 foreach my $tag (grep /gene|locus_tag/, $sf->get_all_tags) {
2623 my @vals = $sf->get_tag_values($tag);
2624 $subsf->add_tag_value($tag, @vals);
2627 my $locstr = 'exon::'.$self->_locstr($subsf);
2629 # re-use feature if type and location the same
2630 if ($loc_h{$locstr}) {
2631 $subsf = $loc_h{$locstr};
2634 $loc_h{$locstr} = $subsf;
2640 $self->_check_order_is_consistent($sf->location->strand,@subsfs);
2643 $sf->location(Bio
::Location
::Simple
->new());
2645 # we allow the exons to define the boundaries of the transcript
2646 $sf->add_SeqFeature($_, 'EXPAND') foreach @subsfs;
2649 if (!$sf->location->strand) {
2650 # correct weird bioperl bug in previous versions;
2651 # strand was not being set correctly
2652 $sf->location->strand($subsfs[0]->location->strand);
2660 #sub merge_features_with_same_loc {
2661 # my ($self,@args) = @_;
2664 # $self->_rearrange([qw(FEATURES
2668 # my @sfs = (@$sfs);
2670 # $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2671 # @sfs = $seq->get_all_SeqFeatures;
2676 # foreach my $sf (@sfs) {
2677 # my $type = $sf->primary_tag;
2678 # my $locstr = $self->_locstr($sf);
2679 ## $loc_h{$type.$locstr}
2680 # push(@{$exon_h{$self->_locstr($_)}}, $_) foreach @exons;
2684 =head2 infer_mRNA_from_CDS
2686 Title : infer_mRNA_from_CDS
2693 given a "type 1" containment hierarchy
2699 this will infer the uniform "type 0" containment hierarchy
2706 all the children of the CDS will be moved to the mRNA
2708 a "type 2" containment hierarchy is mixed type "0" and "1" (for
2709 example, see ftp.ncbi.nih.gov/genomes/Schizosaccharomyces_pombe/)
2713 sub infer_mRNA_from_CDS
{
2714 my ($self,@args) = @_;
2716 my($sf, $seq, $noinfer) =
2717 $self->_rearrange([qw(FEATURE
2724 $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2725 @sfs = $seq->get_all_SeqFeatures;
2728 foreach my $sf (@sfs) {
2730 $sf->isa("Bio::SeqFeatureI") || $self->throw("$sf NOT A SeqFeatureI");
2731 $sf->isa("Bio::FeatureHolderI") || $self->throw("$sf NOT A FeatureHolderI");
2732 if ($self->verbose > 0) {
2733 printf STDERR
" Checking $sf %s\n", $sf->primary_tag;
2736 if ($sf->primary_tag eq 'mRNA') {
2738 "Inferring mRNAs when there are already mRNAs present");
2741 my @cdsl = grep {$_->primary_tag eq 'CDS' } $sf->get_SeqFeatures;
2743 my @children = grep {$_->primary_tag ne 'CDS'} $sf->get_SeqFeatures;
2747 foreach my $cds (@cdsl) {
2749 if ($self->verbose > 0) {
2750 print " Inferring mRNA from CDS $cds\n";
2752 $self->_check_order_is_consistent($cds->location->strand,$cds->location->each_Location);
2754 my $loc = Bio
::Location
::Split
->new;
2755 foreach my $cdsexonloc ($cds->location->each_Location) {
2757 Bio
::Location
::Simple
->new(-start
=>$cdsexonloc->start,
2758 -end
=>$cdsexonloc->end,
2759 -strand
=>$cdsexonloc->strand);
2760 $loc->add_sub_Location($subloc);
2766 # share the same location
2768 Bio
::SeqFeature
::Generic
->new(-location
=>$loc,
2769 -primary_tag
=>'mRNA');
2771 ## Provide seq_id to new feature:
2772 $mrna->seq_id($cds->seq_id) if $cds->seq_id;
2773 $mrna->source_tag($cds->source_tag) if $cds->source_tag;
2775 $self->_check_order_is_consistent($mrna->location->strand,$mrna->location->each_Location);
2777 # make the mRNA hold the CDS; no EXPAND option,
2778 # the CDS cannot be wider than the mRNA
2779 $mrna->add_SeqFeature($cds);
2781 # mRNA steals children of CDS
2782 foreach my $subsf ($cds->get_SeqFeatures) {
2783 $mrna->add_SeqFeature($subsf);
2785 $cds->remove_SeqFeatures;
2786 push(@mrnas, $mrna);
2789 # change gene/CDS to gene/mRNA
2790 $sf->remove_SeqFeatures;
2791 $sf->add_SeqFeature($_) foreach (@mrnas, @children);
2801 Title : remove_types
2802 Usage : $unf->remove_types(-seq=>$seq, -types=>["mRNA"]);
2808 removes features of a set type
2810 useful for pre-filtering a genbank record; eg to get rid of STSs
2812 also, there is no way to unflatten
2813 ftp.ncbi.nih.gov/genomes/Schizosaccharomyces_pombe/ UNLESS the bogus
2814 mRNAs in these records are removed (or changed to a different type) -
2815 they just confuse things too much
2820 my ($self,@args) = @_;
2823 $self->_rearrange([qw(
2828 $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2829 my @sfs = $seq->get_all_SeqFeatures;
2830 my %rh = map {$_=>1} @
$types;
2831 @sfs = grep {!$rh{$_->primary_tag}} @sfs;
2832 $seq->remove_SeqFeatures;
2833 $seq->add_SeqFeature($_) foreach @sfs;
2838 # _check_order_is_consistent($strand,$ranges) RETURNS BOOL
2840 # note: the value of this test is moot - there are many valid,
2841 # if unusual cases where it would flag an anomaly. for example
2842 # transpliced genes such as mod(mdg4) in dmel on AE003744, and
2843 # the following spliced gene on NC_001284:
2845 # mRNA complement(join(20571..20717,21692..22086,190740..190761,
2846 # 140724..141939,142769..142998))
2848 # /note="trans-splicing, RNA editing"
2849 # /db_xref="GeneID:814567"
2851 # note how the exons are not in order
2852 # this will flag a level-3 warning, the user of this module
2853 # can ignore this and deal appropriately with the resulting
2855 sub _check_order_is_consistent
{
2858 my $parent_strand = shift; # this does nothing..?
2860 return unless @ranges;
2862 join(" ",map{sprintf("[%s,%s]",$_->start,$_->end)} @ranges);
2863 my $strand = $ranges[0]->strand;
2864 for (my $i=1; $i<@ranges;$i++) {
2865 if ($ranges[$i]->strand != $strand) {
2866 $self->problem(1,"inconsistent strands. Trans-spliced gene? Range: $rangestr");
2868 # mixed ranges - autopass
2869 # some mRNAs have exons on both strands; for
2870 # example, the dmel mod(mdg4) gene which is
2871 # trans-spliced (in actual fact two mRNAs)
2875 for (my $i=1; $i<@ranges;$i++) {
2876 my $rangeP = $ranges[$i-1];
2877 my $range = $ranges[$i];
2878 if ($rangeP->start > $range->end) {
2879 if ($self->seq->is_circular) {
2880 # see for example NC_006578.gbk
2881 # we make exceptions for circular genomes here.
2882 # see Re: [Gmod-ajax] flatfile-to-json.pl error with GFF
2886 # failed - but still get one more chance..
2888 $self->problem(2,"Ranges not in correct order. Strange ensembl genbank entry? Range: $rangestr");
2895 # sometimes (eg ensembl flavour genbank files)
2896 # exons on reverse strand listed in reverse order
2897 # eg join(complement(R1),...,complement(Rn))
2899 for (my $i=1; $i<@ranges;$i++) {
2900 my $rangeP = $ranges[$i-1];
2901 my $range = $ranges[$i];
2902 if ($rangeP->end < $range->start) {
2903 $self->problem(3,"inconsistent order. Range: $rangestr");
2911 # PRIVATE METHOD: _locstr($sf)
2913 # returns a location string for a feature; just the outer boundaries
2918 sprintf("%d..%d", $sf->start, $sf->end);
2921 sub iterate_containment_tree
{
2923 my $feature_holder = shift;
2925 $sub->($feature_holder);
2926 my @sfs = $feature_holder->get_SeqFeatures;
2927 $self->iterate_containment_tree($_) foreach @sfs;
2930 sub find_best_pairs
{
2935 for (my $j=0; $j < $size; $j++) {
2936 my $score = $matrix->[$i][$j];
2937 if (!defined($score)) {