t/AlignIO/AlignIO.t: fix number of tests in plan (fixup c523e6bed866)
[bioperl-live.git] / Bio / SeqFeature / Tools / Unflattener.pm
blob5cfc7fb859848612c3633b63917e776834edaa01
2 # bioperl module for Bio::SeqFeature::Tools::Unflattener
4 # Please direct questions and support issues to <bioperl-l@bioperl.org>
6 # Cared for by Chris Mungall <cjm@fruitfly.org>
8 # Copyright Chris Mungall
10 # You may distribute this module under the same terms as perl itself
12 # POD documentation - main docs before the code
14 =head1 NAME
16 Bio::SeqFeature::Tools::Unflattener - turns flat list of genbank-sourced features into a nested SeqFeatureI hierarchy
18 =head1 SYNOPSIS
20 # standard / generic use - unflatten a genbank record
21 use Bio::SeqIO;
22 use Bio::SeqFeature::Tools::Unflattener;
24 # generate an Unflattener object
25 $unflattener = Bio::SeqFeature::Tools::Unflattener->new;
27 # first fetch a genbank SeqI object
28 $seqio =
29 Bio::SeqIO->new(-file=>'AE003644.gbk',
30 -format=>'GenBank');
31 my $out =
32 Bio::SeqIO->new(-format=>'asciitree');
33 while ($seq = $seqio->next_seq()) {
35 # get top level unflattended SeqFeatureI objects
36 $unflattener->unflatten_seq(-seq=>$seq,
37 -use_magic=>1);
38 $out->write_seq($seq);
40 @top_sfs = $seq->get_SeqFeatures;
41 foreach my $sf (@top_sfs) {
42 # do something with top-level features (eg genes)
47 =head1 DESCRIPTION
49 Most GenBank entries for annotated genomic DNA contain a B<flat> list
50 of features. These features can be parsed into an equivalent flat list
51 of L<Bio::SeqFeatureI> objects using the standard L<Bio::SeqIO>
52 classes. However, it is often desirable to B<unflatten> this list into
53 something resembling actual B<gene models>, in which genes, mRNAs and CDSs
54 are B<nested> according to the nature of the gene model.
56 The BioPerl object model allows us to store these kind of associations
57 between SeqFeatures in B<containment hierarchies> -- any SeqFeatureI
58 object can contain nested SeqFeatureI objects. The
59 Bio::SeqFeature::Tools::Unflattener object facilitates construction of
60 these hierarchies from the underlying GenBank flat-feature-list
61 representation.
63 For example, if you were to look at a typical GenBank DNA entry, say,
64 B<AE003644>, you would see a flat list of features:
66 source
68 gene CG4491
69 mRNA CG4491-RA
70 CDS CG4491-PA
72 gene tRNA-Pro
73 tRNA tRNA-Pro
75 gene CG32954
76 mRNA CG32954-RA
77 mRNA CG32954-RC
78 mRNA CG32954-RB
79 CDS CG32954-PA
80 CDS CG32954-PB
81 CDS CG32954-PC
83 These features have sequence locations, but it is not immediately
84 clear how to write code such that each mRNA is linked to the
85 appropriate CDS (other than relying on IDs which is very bad)
87 We would like to convert the above list into the B<containment
88 hierarchy>, shown below:
90 source
91 gene
92 mRNA CG4491-RA
93 CDS CG4491-PA
94 exon
95 exon
96 gene
97 tRNA tRNA-Pro
98 exon
99 gene
100 mRNA CG32954-RA
101 CDS CG32954-PA
102 exon
103 exon
104 mRNA CG32954-RC
105 CDS CG32954-PC
106 exon
107 exon
108 mRNA CG32954-RB
109 CDS CG32954-PB
110 exon
111 exon
113 Where each feature is nested underneath its container. Note that exons
114 have been automatically inferred (even for tRNA genes).
116 We do this using a call on a L<Bio::SeqFeature::Tools::Unflattener>
117 object
119 @sfs = $unflattener->unflatten_seq(-seq=>$seq);
121 This would return a list of the B<top level> (i.e. container)
122 SeqFeatureI objects - in this case, genes. Other top level features
123 are possible; for instance, the B<source> feature which is always
124 present, and other features such as B<variation> or B<misc_feature>
125 types.
127 The containment hierarchy can be accessed using the get_SeqFeature()
128 call on any feature object - see L<Bio::SeqFeature::FeatureHolderI>.
129 The following code will traverse the containment hierarchy for a
130 feature:
132 sub traverse {
133 $sf = shift; # $sf isa Bio::SeqfeatureI
135 # ...do something with $sf!
137 # depth first traversal of containment tree
138 @contained_sfs = $sf->get_SeqFeatures;
139 traverse($_) foreach @contained_sfs;
142 Once you have built the hierarchy, you can do neat stuff like turn the
143 features into 'rich' feature objects (eg
144 L<Bio::SeqFeature::Gene::GeneStructure>) or convert to a suitable
145 format such as GFF3 or chadoxml (after mapping to the Sequence
146 Ontology); this step is not described here.
148 =head1 USING MAGIC
150 Due to the quixotic nature of how features are stored in
151 GenBank/EMBL/DDBJ, there is no guarantee that the default behaviour of
152 this module will produce perfect results. Sometimes it is hard or
153 impossible to build a correct containment hierarchy if the information
154 provided is simply too lossy, as is often the case. If you care deeply
155 about your data, you should always manually inspect the resulting
156 containment hierarchy; you may have to customise the algorithm for
157 building the hierarchy, or even manually tweak the resulting
158 hierarchy. This is explained in more detail further on in the document.
160 However, if you are satisfied with the default behaviour, then you do
161 not need to read any further. Just make sure you set the parameter
162 B<use_magic> - this will invoke incantations which will magically
163 produce good results no matter what the idiosyncracies of the
164 particular GenBank record in question.
166 For example
168 $unflattener->unflatten_seq(-seq=>$seq,
169 -use_magic=>1);
171 The success of this depends on the phase of the moon at the time the
172 entry was submitted to GenBank. Note that the magical recipe is being
173 constantly improved, so the results of invoking magic may vary
174 depending on the bioperl release.
176 If you are skeptical of magic, or you wish to exact fine grained
177 control over how the entry is unflattened, or you simply wish to
178 understand more about how this crazy stuff works, then read on!
180 =head1 PROBLEMATIC DATA AND INCONSISTENCIES
182 Occasionally the Unflattener will have problems with certain
183 records. For example, the record may contain inconsistent data - maybe
184 there is an B<exon> entry that has no corresponding B<mRNA> location.
186 The default behaviour is to throw an exception reporting the problem,
187 if the problem is relatively serious - for example, inconsistent data.
189 You can exert more fine grained control over this - perhaps you want
190 the Unflattener to do the best it can, and report any problems. This
191 can be done - refer to the methods.
193 error_threshold()
195 get_problems()
197 report_problems()
199 ignore_problems()
201 =head1 ALGORITHM
203 This is the default algorithm; you should be able to override any part
204 of it to customise.
206 The core of the algorithm is in two parts
208 =over
210 =item Partitioning the flat feature list into groups
212 =item Resolving the feature containment hierarchy for each group
214 =back
216 There are other optional steps after the completion of these two
217 steps, such as B<inferring exons>; we now describe in more detail what
218 is going on.
220 =head2 Partitioning into groups
222 First of all the flat feature list is partitioned into B<group>s.
224 The default way of doing this is to use the B<gene> attribute; if we
225 look at two features from GenBank accession AE003644.3:
227 gene 20111..23268
228 /gene="noc"
229 /locus_tag="CG4491"
230 /note="last curated on Thu Dec 13 16:51:32 PST 2001"
231 /map="35B2-35B2"
232 /db_xref="FLYBASE:FBgn0005771"
233 mRNA join(20111..20584,20887..23268)
234 /gene="noc"
235 /locus_tag="CG4491"
236 /product="CG4491-RA"
237 /db_xref="FLYBASE:FBgn0005771"
239 Both these features share the same /gene tag which is "noc", so they
240 correspond to the same gene model (the CDS feature is not shown, but
241 this also has a tag-value /gene="noc").
243 Not all groups need to correspond to gene models, but this is the most
244 common use case; later on we shall describe how to customise the
245 grouping.
247 Sometimes other tags have to be used; for instance, if you look at the
248 entire record for AE003644.3 you will see you actually need the use the
249 /locus_tag attribute. This attribute is actually B<not present> in
250 most records!
252 You can override this:
254 $collection->unflatten_seq(-seq=>$seq, -group_tag=>'locus_tag');
256 Alternatively, if you B<-use_magic>, the object will try and make a
257 guess as to what the correct group_tag should be.
259 At the end of this step, we should have a list of groups - there is no
260 structure within a group; the group just serves to partition the flat
261 features. For the example data above, we would have the following groups.
263 [ source ]
264 [ gene mRNA CDS ]
265 [ gene mRNA CDS ]
266 [ gene mRNA CDS ]
267 [ gene mRNA mRNA mRNA CDS CDS CDS ]
269 =head3 Multicopy Genes
271 Multicopy genes are usually rRNAs or tRNAs that are duplicated across
272 the genome. Because they are functionally equivalent, and usually have
273 the same sequence, they usually have the same group_tag (ie gene
274 symbol); they often have a /note tag giving copy number. This means
275 they will end up in the same group. This is undesirable, because they
276 are spatially disconnected.
278 There is another step, which involves splitting spatially disconnected
279 groups into distinct groups
281 this would turn this
283 [gene-rrn3 rRNA-rrn3 gene-rrn3 rRNA-rrn3]
285 into this
287 [gene-rrn3 rRNA-rrn3] [gene-rrn3 rRNA-rrn3]
289 based on the coordinates
291 =head3 What next?
293 The next step is to add some structure to each group, by making
294 B<containment hierarchies>, trees that represent how the features
295 interrelate
297 =head2 Resolving the containment hierarchy
299 After the grouping is done, we end up with a list of groups which
300 probably contain features of type 'gene', 'mRNA', 'CDS' and so on.
302 Singleton groups (eg the 'source' feature) are ignored at this stage.
304 Each group is itself flat; we need to add an extra level of
305 organisation. Usually this is because different spliceforms
306 (represented by the 'mRNA' feature) can give rise to different
307 protein products (indicated by the 'CDS' feature). We want to correctly
308 associate mRNAs to CDSs.
310 We want to go from a group like this:
312 [ gene mRNA mRNA mRNA CDS CDS CDS ]
314 to a containment hierarchy like this:
316 gene
317 mRNA
319 mRNA
321 mRNA
324 In which each CDS is nested underneath the correct corresponding mRNA.
326 For entries that contain no alternate splicing, this is simple; we
327 know that the group
329 [ gene mRNA CDS ]
331 Must resolve to the tree
333 gene
334 mRNA
337 How can we do this in entries with alternate splicing? The bad
338 news is that there is no guaranteed way of doing this correctly for
339 any GenBank entry. Occasionally the submission will have been done in
340 such a way as to reconstruct the containment hierarchy. However, this
341 is not consistent across databank entries, so no generic solution can
342 be provided by this object. This module does provide the framework
343 within which you can customise a solution for the particular dataset
344 you are interested in - see later.
346 The good news is that there is an inference we can do that should
347 produce pretty good results the vast majority of the time. It uses
348 splice coordinate data - this is the default behaviour of this module,
349 and is described in detail below.
351 =head2 Using splice site coordinates to infer containment
353 If an mRNA is to be the container for a CDS, then the splice site
354 coordinates (or intron coordinates, depending on how you look at it)
355 of the CDS must fit inside the splice site coordinates of the mRNA.
357 Ambiguities can still arise, but the results produced should still be
358 reasonable and consistent at the sequence level. Look at this fake
359 example:
361 mRNA XXX---XX--XXXXXX--XXXX join(1..3,7..8,11..16,19..23)
362 mRNA XXX-------XXXXXX--XXXX join(1..3,11..16,19..23)
363 CDS XXXX--XX join(13..16,19..20)
364 CDS XXXX--XX join(13..16,19..20)
366 [obviously the positions have been scaled down]
368 We cannot unambiguously match mRNA with CDS based on splice sites,
369 since both CDS share the splice site locations 16^17 and
370 18^19. However, the consequences of making a wrong match are probably
371 not very severe. Any annotation data attached to the first CDS is
372 probably identical to the seconds CDS, other than identifiers.
374 The default behaviour of this module is to make an arbitrary call
375 where it is ambiguous (the mapping will always be bijective; i.e. one
376 mRNA -E<gt> one CDS).
378 [TODO: NOTE: not tested on EMBL data, which may not be bijective; ie two
379 mRNAs can share the same CDS??]
381 This completes the building of the containment hierarchy; other
382 optional step follow
384 =head1 POST-GROUPING STEPS
386 =head2 Inferring exons from mRNAs
388 This step always occurs if B<-use_magic> is invoked.
390 In a typical GenBank entry, the exons are B<implicit>. That is they
391 can be inferred from the mRNA location.
393 For example:
395 mRNA join(20111..20584,20887..23268)
397 This tells us that this particular transcript has two exons. In
398 bioperl, the mRNA feature will have a 'split location'.
400 If we call
402 $unflattener->feature_from_splitloc(-seq=>$seq);
404 This will generate the necessary exon features, and nest them under
405 the appropriate mRNAs. Note that the mRNAs will no longer have split
406 locations - they will have simple locations spanning the extent of the
407 exons. This is intentional, to avoid redundancy.
409 Occasionally a GenBank entry will have both implicit exons (from the
410 mRNA location) B<and> explicit exon features.
412 In this case, exons will still be transferred. Tag-value data from the
413 explicit exon will be transferred to the implicit exon. If exons are
414 shared between mRNAs these will be represented by different
415 objects. Any inconsistencies between implicit and explicit will be
416 reported.
418 =head3 tRNAs and other noncoding RNAs
420 exons will also be generated from these features
422 =head2 Inferring mRNAs from CDS
424 Some GenBank entries represent gene models using features of type
425 gene, mRNA and CDS; some entries just use gene and CDS.
427 If we only have gene and CDS, then the containment hierarchies will
428 look like this:
430 gene
433 If we want the containment hierarchies to be uniform, like this
435 gene
436 mRNA
439 Then we must create an mRNA feature. This will have identical
440 coordinates to the CDS. The assumption is that there is either no
441 untranslated region, or it is unknown.
443 To do this, we can call
445 $unflattener->infer_mRNA_from_CDS(-seq=>$seq);
447 This is taken care of automatically, if B<-use_magic> is invoked.
449 =head1 ADVANCED
451 =head2 Customising the grouping of features
453 The default behaviour is suited mostly to building models of protein
454 coding genes and noncoding genes from genbank genomic DNA submissions.
456 You can change the tag used to partition the feature by passing in a
457 different group_tag argument - see the unflatten_seq() method
459 Other behaviour may be desirable. For example, even though SNPs
460 (features of type 'variation' in GenBank) are not actually part of the
461 gene model, it may be desirable to group SNPs that overlap or are
462 nearby gene models.
464 It should certainly be possible to extend this module to do
465 this. However, I have yet to code this part!!! If anyone would find
466 this useful let me know.
468 In the meantime, you could write your own grouping subroutine, and
469 feed the results into unflatten_groups() [see the method documentation
470 below]
472 =head2 Customising the resolution of the containment hierarchy
474 Once the flat list of features has been partitioned into groups, the
475 method unflatten_group() is called on each group to build a tree.
477 The algorithm for doing this is described above; ambiguities are
478 resolved by using splice coordinates. As discussed, this can be
479 ambiguous.
481 Some submissions may contain information in tags/attributes that hint
482 as to the mapping that needs to be made between the features.
484 For example, with the Drosophila Melanogaster release 3 submission, we
485 see that CDS features in alternately spliced mRNAs have a form like
486 this:
488 CDS join(145588..145686,145752..146156,146227..146493)
489 /locus_tag="CG32954"
490 /note="CG32954 gene product from transcript CG32954-RA"
491 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
492 /codon_start=1
493 /product="CG32954-PA"
494 /protein_id="AAF53403.1"
495 /db_xref="GI:7298167"
496 /db_xref="FLYBASE:FBgn0052954"
497 /translation="MSFTLTNKNVIFVAGLGGIGLDTSKELLKRDLKNLVILDRIENP..."
499 Here the /note tag provides the clue we need to link CDS to mRNA
500 (highlighted with ^^^^). We just need to find the mRNA with the tag
502 /product="CG32954-RA"
504 I have no idea how consistent this practice is across submissions; it
505 is consistent for the fruitfly genome submission.
507 We can customise the behaviour of unflatten_group() by providing our
508 own resolver method. This obviously requires a bit of extra
509 programming, but there is no way to get around this.
511 Here is an example of how to pass in your own resolver; this example
512 basically checks the parent (container) /product tag to see if it
513 matches the required string in the child (contained) /note tag.
515 $unflattener->unflatten_seq(-seq=>$seq,
516 -group_tag=>'locus_tag',
517 -resolver_method=>sub {
518 my $self = shift;
519 my ($sf, @candidate_container_sfs) = @_;
520 if ($sf->has_tag('note')) {
521 my @notes = $sf->get_tag_values('note');
522 my @trnames = map {/from transcript\s+(.*)/;
523 $1} @notes;
524 @trnames = grep {$_} @trnames;
525 my $trname;
526 if (@trnames == 0) {
527 $self->throw("UNRESOLVABLE");
529 elsif (@trnames == 1) {
530 $trname = $trnames[0];
532 else {
533 $self->throw("AMBIGUOUS: @trnames");
535 my @container_sfs =
536 grep {
537 my ($product) =
538 $_->has_tag('product') ?
539 $_->get_tag_values('product') :
540 ('');
541 $product eq $trname;
542 } @candidate_container_sfs;
543 if (@container_sfs == 0) {
544 $self->throw("UNRESOLVABLE");
546 elsif (@container_sfs == 1) {
547 # we got it!
548 return $container_sfs[0];
550 else {
551 $self->throw("AMBIGUOUS");
556 the resolver method is only called when there is more than one spliceform.
558 =head2 Parsing mRNA records
560 Some of the entries in sequence databanks are for mRNA sequences as
561 well as genomic DNA. We may want to build models from these too.
563 NOT YET DONE - IN PROGRESS!!!
565 Open question - what would these look like?
567 Ideally we would like a way of combining a mRNA record with the
568 corresponding SeFeature entry from the appropriate genomic DNA
569 record. This could be problemmatic in some cases - for example, the
570 mRNA sequences may not match 100% (due to differences in strain,
571 assembly problems, sequencing problems, etc). What then...?
573 =head1 SEE ALSO
575 Feature table description
577 http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
579 =head1 FEEDBACK
581 =head2 Mailing Lists
583 User feedback is an integral part of the evolution of this and other
584 Bioperl modules. Send your comments and suggestions preferably to the
585 Bioperl mailing lists Your participation is much appreciated.
587 bioperl-l@bioperl.org - General discussion
588 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
590 =head2 Support
592 Please direct usage questions or support issues to the mailing list:
594 I<bioperl-l@bioperl.org>
596 rather than to the module maintainer directly. Many experienced and
597 reponsive experts will be able look at the problem and quickly
598 address it. Please include a thorough description of the problem
599 with code and data examples if at all possible.
601 =head2 Reporting Bugs
603 report bugs to the Bioperl bug tracking system to help us keep track
604 the bugs and their resolution. Bug reports can be submitted via the
605 web:
607 https://github.com/bioperl/bioperl-live/issues
609 =head1 AUTHOR - Chris Mungall
611 Email: cjm@fruitfly.org
613 =head1 APPENDIX
615 The rest of the documentation details each of the object
616 methods. Internal methods are usually preceded with a _
618 =cut
621 # Let the code begin...
623 package Bio::SeqFeature::Tools::Unflattener;
624 use strict;
626 # Object preamble - inherits from Bio::Root::Root
627 use Bio::Location::Simple;
628 use Bio::SeqFeature::Generic;
629 use Bio::Range;
632 use base qw(Bio::Root::Root);
634 =head2 new
636 Title : new
637 Usage : $unflattener = Bio::SeqFeature::Tools::Unflattener->new();
638 $unflattener->unflatten_seq(-seq=>$seq);
639 Function: constructor
640 Example :
641 Returns : a new Bio::SeqFeature::Tools::Unflattener
642 Args : see below
644 Arguments
646 -seq : A L<Bio::SeqI> object (optional)
647 the sequence to unflatten; this can also be passed in
648 when we call unflatten_seq()
650 -group_tag : a string representing the /tag used to partition flat features
651 (see discussion above)
653 =cut
656 sub new {
657 my($class,@args) = @_;
658 my $self = $class->SUPER::new(@args);
660 my($seq, $group_tag, $trust_grouptag) =
661 $self->_rearrange([qw(SEQ
662 GROUP_TAG
663 TRUST_GROUPTAG
665 @args);
667 $seq && $self->seq($seq);
668 $group_tag && $self->group_tag($group_tag);
669 # $self->{'trust_grouptag'}= $trust_grouptag if($trust_grouptag); #dgg suggestion
670 return $self; # success - we hope!
673 sub DESTROY {
674 my $self = shift;
675 return if $self->{_reported_problems};
676 return if $self->{_ignore_problems};
677 my @probs = $self->get_problems;
678 if (!$self->{_problems_reported} &&
679 scalar(@probs)) {
680 $self->warn(
681 "WARNING: There are UNREPORTED PROBLEMS.\n".
682 "You may wish to use the method report_problems(), \n",
683 "or ignore_problems() on the Unflattener object\n");
685 return;
688 =head2 seq
690 Title : seq
691 Usage : $unflattener->seq($newval)
692 Function:
693 Example :
694 Returns : value of seq (a Bio::SeqI)
695 Args : on set, new value (a Bio::SeqI, optional)
697 The Bio::SeqI object should hold a flat list of Bio::SeqFeatureI
698 objects; this is the list that will be unflattened.
700 The sequence object can also be set when we call unflatten_seq()
702 =cut
704 sub seq{
705 my $self = shift;
707 return $self->{'seq'} = shift if @_;
708 return $self->{'seq'};
711 =head2 group_tag
713 Title : group_tag
714 Usage : $unflattener->group_tag($newval)
715 Function:
716 Example :
717 Returns : value of group_tag (a scalar)
718 Args : on set, new value (a scalar or undef, optional)
720 This is the tag that will be used to collect elements from the flat
721 feature list into groups; for instance, if we look at two typical
722 GenBank features:
724 gene 20111..23268
725 /gene="noc"
726 /locus_tag="CG4491"
727 /note="last curated on Thu Dec 13 16:51:32 PST 2001"
728 /map="35B2-35B2"
729 /db_xref="FLYBASE:FBgn0005771"
730 mRNA join(20111..20584,20887..23268)
731 /gene="noc"
732 /locus_tag="CG4491"
733 /product="CG4491-RA"
734 /db_xref="FLYBASE:FBgn0005771"
736 We can see that these comprise the same gene model because they share
737 the same /gene attribute; we want to collect these together in groups.
739 Setting group_tag is optional. The default is to use 'gene'. In the
740 example above, we could also use /locus_tag
742 =cut
744 sub group_tag{
745 my $self = shift;
747 return $self->{'group_tag'} = shift if @_;
748 return $self->{'group_tag'};
751 =head2 partonomy
753 Title : partonomy
754 Usage : $unflattener->partonomy({mRNA=>'gene', CDS=>'mRNA')
755 Function:
756 Example :
757 Returns : value of partonomy (a scalar)
758 Args : on set, new value (a scalar or undef, optional)
760 A hash representing the containment structure that the seq_feature
761 nesting should conform to; each key represents the contained (child)
762 type; each value represents the container (parent) type.
764 =cut
766 sub partonomy{
767 my $self = shift;
769 return $self->{'partonomy'} = shift if @_;
770 if (!$self->{'partonomy'}) {
771 $self->{'partonomy'} = $self->_default_partonomy;
773 return $self->{'partonomy'};
776 sub _default_partonomy{
777 return {
778 mRNA => 'gene',
779 tRNA => 'gene',
780 rRNA => 'gene',
781 scRNA => 'gene',
782 snRNA => 'gene',
783 snoRNA => 'gene',
784 misc_RNA => 'gene',
785 CDS => 'mRNA',
786 exon => 'mRNA',
787 intron => 'mRNA',
789 pseudoexon => 'pseudogene',
790 pseudointron => 'pseudogene',
791 pseudotranscript => 'pseudogene',
795 =head2 structure_type
797 Title : structure_type
798 Usage : $unflattener->structure_type($newval)
799 Function:
800 Example :
801 Returns : value of structure_type (a scalar)
802 Args : on set, new value (an int or undef, optional)
804 GenBank entries conform to different flavours, or B<structure
805 types>. Some have mRNAs, some do not.
807 Right now there are only two base structure types defined. If you set
808 the structure type, then appropriate unflattening action will be
809 taken. The presence or absence of explicit exons does not affect the
810 structure type.
812 If you invoke B<-use_magic> then this will be set automatically, based
813 on the content of the record.
815 =over
817 =item Type 0 (DEFAULT)
819 typically contains
821 source
822 gene
823 mRNA
826 with this structure type, we want the seq_features to be nested like this
828 gene
829 mRNA
831 exon
833 exons and introns are implicit from the mRNA 'join' location
835 to get exons from the mRNAs, you will need this call (see below)
837 $unflattener->feature_from_splitloc(-seq=>$seq);
839 =item Type 1
841 typically contains
843 source
844 gene
846 exon [optional]
847 intron [optional]
849 there are no mRNA features
851 with this structure type, we want the seq_features to be nested like this
853 gene
855 exon
856 intron
858 exon and intron may or may not be present; they may be implicit from
859 the CDS 'join' location
861 =back
863 =cut
865 sub structure_type{
866 my $self = shift;
868 return $self->{'structure_type'} = shift if @_;
869 return $self->{'structure_type'};
872 =head2 get_problems
874 Title : get_problems
875 Usage : @probs = get_problems()
876 Function: Get the list of problem(s) for this object.
877 Example :
878 Returns : An array of [severity, description] pairs
879 Args :
881 In the course of unflattening a record, problems may occur. Some of
882 these problems are non-fatal, and can be ignored.
884 Problems are represented as arrayrefs containing a pair [severity,
885 description]
887 severity is a number, the higher, the more severe the problem
889 the description is a text string
891 =cut
893 sub get_problems{
894 my $self = shift;
896 return @{$self->{'_problems'}} if exists($self->{'_problems'});
897 return ();
900 =head2 clear_problems
902 Title : clear_problems
903 Usage :
904 Function: resets the problem list to empty
905 Example :
906 Returns :
907 Args :
910 =cut
912 sub clear_problems{
913 my ($self,@args) = @_;
914 $self->{'_problems'} = [];
915 return;
919 # PRIVATE
920 # see get_problems
921 sub add_problem{
922 my $self = shift;
924 $self->{'_problems'} = [] unless exists($self->{'_problems'});
925 if ($self->verbose > 0) {
926 warn( "PROBLEM: $_\n") foreach @_;
928 push(@{$self->{'_problems'}}, @_);
931 # PRIVATE
932 # see get_problems
933 sub problem {
934 my $self = shift;
935 my ($severity, $desc, @sfs) = @_;
936 if (@sfs) {
937 foreach my $sf (@sfs) {
938 $desc .=
939 sprintf("\nSF [$sf]: ". $sf->location->to_FTstring . "; %s\n",
940 join('; ',
941 $sf->primary_tag,
942 map {
943 $sf->has_tag($_) ?
944 $sf->get_tag_values($_) : ()
945 } qw(locus_tag gene product label)));
948 my $thresh = $self->error_threshold;
949 if ($severity > $thresh) {
950 $self->{_problems_reported} = 1;
951 $self->throw("PROBLEM, SEVERITY==$severity\n$desc");
953 $self->add_problem([$severity, $desc]);
954 return;
957 =head2 report_problems
959 Title : report_problems
960 Usage : $unflattener->report_problems(\*STDERR);
961 Function:
962 Example :
963 Returns :
964 Args : FileHandle (defaults to STDERR)
967 =cut
969 sub report_problems{
970 my ($self, $fh) = @_;
972 if (!$fh) {
973 $fh = \*STDERR;
975 foreach my $problem ($self->get_problems) {
976 my ($sev, $desc) = @$problem;
977 printf $fh "PROBLEM, SEVERITY==$sev\n$desc\n";
979 $self->{_problems_reported} = 1;
980 return;
983 =head2 ignore_problems
985 Title : ignore_problems
986 Usage : $obj->ignore_problems();
987 Function:
988 Example :
989 Returns :
990 Args :
992 Unflattener is very particular about problems it finds along the
993 way. If you have set the error_threshold such that less severe
994 problems do not cause exceptions, Unflattener still expects you to
995 report_problems() at the end, so that the user of the module is aware
996 of any inconsistencies or problems with the data. In fact, a warning
997 will be produced if there are unreported problems. To silence, this
998 warning, call the ignore_problems() method before the Unflattener
999 object is destroyed.
1001 =cut
1003 sub ignore_problems{
1004 my ($self) = @_;
1005 $self->{_ignore_problems} = 1;
1006 return;
1010 =head2 error_threshold
1012 Title : error_threshold
1013 Usage : $obj->error_threshold($severity)
1014 Function:
1015 Example :
1016 Returns : value of error_threshold (a scalar)
1017 Args : on set, new value (an integer)
1019 Sets the threshold above which errors cause this module to throw an
1020 exception. The default is 0; all problems with a severity E<gt> 0 will
1021 cause an exception.
1023 If you raise the threshold to 1, then the unflattening process will be
1024 more lax; problems of severity==1 are generally non-fatal, but may
1025 indicate that the results should be inspected, for example, to make
1026 sure there is no data loss.
1028 =cut
1030 sub error_threshold{
1031 my $self = shift;
1033 return $self->{'error_threshold'} = shift if @_;
1034 return $self->{'error_threshold'} || 0;
1039 # PRIVATE
1041 # given a type (eg mRNA), will return the container type (eg gene)
1042 sub get_container_type{
1043 my ($self,$type) = @_;
1044 my @roots = $self->_get_partonomy_roots;
1045 if (grep {$_ eq $type} @roots) {
1046 # it is a root - no parents/containers
1047 return;
1049 my $ch = $self->partonomy;
1050 my $ctype = $ch->{$type};
1051 if (!$ctype) {
1052 # asterix acts as a wild card
1053 $ctype = $ch->{'*'};
1055 return $ctype;
1058 # get root node of partonomy hierarchy (usually gene)
1059 sub _get_partonomy_roots {
1060 my $self = shift;
1061 my $ch = $self->partonomy;
1062 my @parents = values %$ch;
1063 # find parents that do not have parents themselves
1064 return grep {!$ch->{$_}} @parents;
1069 =head2 unflatten_seq
1071 Title : unflatten_seq
1072 Usage : @sfs = $unflattener->unflatten_seq($seq);
1073 Function: turns a flat list of features into a list of holder features
1074 Example :
1075 Returns : list of Bio::SeqFeatureI objects
1076 Args : see below
1078 partitions a list of features then arranges them in a nested tree; see
1079 above for full explanation.
1081 note - the Bio::SeqI object passed in will be modified
1083 Arguments
1085 -seq : a Bio::SeqI object; must contain Bio::SeqFeatureI objects
1086 (this is optional if seq has already been set)
1088 -use_magic: if TRUE (ie non-zero) then magic will be invoked;
1089 see discussion above.
1091 -resolver_method: a CODE reference
1092 see the documentation above for an example of
1093 a subroutine that can be used to resolve hierarchies
1094 within groups.
1096 this is optional - if nothing is supplied, a default
1097 subroutine will be used (see below)
1099 -group_tag: a string
1100 [ see the group_tag() method ]
1101 this overrides the default group_tag which is 'gene'
1105 =cut
1107 sub unflatten_seq{
1108 my ($self,@args) = @_;
1110 my($seq, $resolver_method, $group_tag, $partonomy,
1111 $structure_type, $resolver_tag, $use_magic, $noinfer) =
1112 $self->_rearrange([qw(SEQ
1113 RESOLVER_METHOD
1114 GROUP_TAG
1115 PARTONOMY
1116 STRUCTURE_TYPE
1117 RESOLVER_TAG
1118 USE_MAGIC
1119 NOINFER
1121 @args);
1123 # seq we want to unflatten
1124 $seq = $seq || $self->seq;
1125 if (!$self->seq) {
1126 $self->seq($seq);
1130 # prevent bad argument combinations
1131 if ($partonomy &&
1132 defined($structure_type)) {
1133 $self->throw("You cannot set both -partonomy and -structure_type\n".
1134 "(the former is implied by the latter)");
1137 # remember the current value of partonomy, to reset later
1138 my $old_partonomy = $self->partonomy;
1139 $self->partonomy($partonomy) if defined $partonomy;
1141 # remember old structure_type
1142 my $old_structure_type = $self->structure_type;
1143 $self->structure_type($structure_type) if defined $structure_type;
1145 # if we are sourcing our data from genbank, all the
1146 # features should be flat (eq no sub_SeqFeatures)
1147 my @flat_seq_features = $seq->get_SeqFeatures;
1148 my @all_seq_features = $seq->get_all_SeqFeatures;
1150 # sanity checks
1151 if (@all_seq_features > @flat_seq_features) {
1152 $self->throw("It looks as if this sequence has already been unflattened");
1154 if (@all_seq_features < @flat_seq_features) {
1155 $self->throw("ASSERTION ERROR: something is seriously wrong with your features");
1158 # tag for ungrouping; usually /gene or /locus_tag
1159 # for example: /gene="foo"
1160 $group_tag = $group_tag || $self->group_tag;
1161 if ($use_magic) {
1162 # use magic to guess the group tag
1163 my @sfs_with_locus_tag =
1164 grep {$_->has_tag("locus_tag")} @flat_seq_features;
1165 my @sfs_with_gene_tag =
1166 grep {$_->has_tag("gene")} @flat_seq_features;
1167 my @sfs_with_product_tag =
1168 grep {$_->has_tag("product")} @flat_seq_features;
1170 # if ($group_tag && $self->{'trust_grouptag'}) { # dgg suggestion
1173 # elsif
1174 if (@sfs_with_locus_tag) {
1175 # dgg note: would like to -use_magic with -group_tag = 'gene' for ensembl genomes
1176 # where ensembl gene FT have both /locus_tag and /gene, but mRNA, CDS have /gene only
1177 if ($group_tag && $group_tag ne 'locus_tag') {
1178 $self->throw("You have explicitly set group_tag to be '$group_tag'\n".
1179 "However, I detect that some features use /locus_tag\n".
1180 "I believe that this is the correct group_tag to use\n".
1181 "You can resolve this by either NOT setting -group_tag\n".
1182 "OR you can unset -use_magic to regain control");
1185 # use /locus_tag instead of /gene tag for grouping
1186 # see GenBank entry AE003677 (version 3) for an example
1187 $group_tag = 'locus_tag';
1188 if ($self->verbose > 0) {
1189 warn "Set group tag to: $group_tag\n";
1193 # on rare occasions, records will have no /gene or /locus_tag
1194 # but it WILL have /product tags. These serve the same purpose
1195 # for grouping. For an example, see AY763288 (also in t/data)
1196 if (@sfs_with_locus_tag==0 &&
1197 @sfs_with_gene_tag==0 &&
1198 @sfs_with_product_tag>0 &&
1199 !$group_tag) {
1200 $group_tag = 'product';
1201 if ($self->verbose > 0) {
1202 warn "Set group tag to: $group_tag\n";
1207 if (!$group_tag) {
1208 $group_tag = 'gene';
1211 # ------------------------------
1212 # GROUP FEATURES using $group_tag
1213 # collect features into unstructured groups
1214 # ------------------------------
1216 # -------------
1217 # we want to generate a list of groups;
1218 # each group is a list of SeqFeatures; this
1219 # group probably (but not necessarily)
1220 # corresponds to a gene model.
1222 # this array will look something like this:
1223 # ([$f1], [$f2, $f3, $f4], ...., [$f97, $f98, $f99])
1225 # there are also 'singleton' groups, with one member.
1226 # for instance, the 'source' feature is in a singleton group;
1227 # the same with others such as 'misc_feature'
1228 my @groups = ();
1229 # -------------
1231 # --------------------
1232 # we hope that the genbank record allows us to group by some grouping
1233 # tag.
1234 # for instance, most of the time a gene model can be grouped using
1235 # the gene tag - that is where you see
1236 # /gene="foo"
1237 # in a genbank record
1238 # --------------------
1240 # keep an index of groups by their
1241 # grouping tag
1242 my %group_by_tag = ();
1245 # iterate through all features, putting them into groups
1246 foreach my $sf (@flat_seq_features) {
1247 if (!$sf->has_tag($group_tag)) {
1248 # SINGLETON
1249 # this is an ungroupable feature;
1250 # add it to a group of its own
1251 push(@groups, [$sf]);
1253 else {
1254 # NON-SINGLETON
1255 my @group_tagvals = $sf->get_tag_values($group_tag);
1256 if (@group_tagvals > 1) {
1257 # sanity check:
1258 # currently something can only belong to one group
1259 $self->problem(2,
1260 ">1 value for /$group_tag: @group_tagvals\n".
1261 "At this time this module is not equipped to handle this adequately", $sf);
1263 # get value of group tag
1264 my $gtv = shift @group_tagvals;
1265 $gtv || $self->throw("Empty /$group_tag vals not allowed!");
1267 # is this a new group?
1268 my $group = $group_by_tag{$gtv};
1269 if ($group) {
1270 # this group has been encountered before - add current
1271 # sf to the end of the group
1272 push(@$group, $sf);
1274 else {
1275 # new group; add to index and create new group
1276 $group = [$sf]; # currently one member; probably more to come
1277 $group_by_tag{$gtv} = $group;
1278 push(@groups, $group);
1283 # as well as having the same group_tag, a group should be spatially
1284 # connected. if not, then the group should be split into subgroups.
1285 # this turns out to be necessary in the case of multicopy genes.
1286 # the standard way to represent these is as spatially disconnected
1287 # gene models (usually a 'gene' feature and some kind of RNA feature)
1288 # with the same group tag; the code below will split these into
1289 # seperate groups, one per copy.
1290 @groups = map { $self->_split_group_if_disconnected($_) } @groups;
1292 # remove any duplicates; most of the time the method below has
1293 # no effect. there are some unusual genbank records for which
1294 # duplicate removal is necessary. see the comments in the
1295 # _remove_duplicates_from_group() method if you want to know
1296 # the ugly details
1297 foreach my $group (@groups) {
1298 $self->_remove_duplicates_from_group($group);
1303 # PSEUDOGENES, PSEUDOEXONS AND PSEUDOINTRONS
1304 # these are indicated with the /pseudo tag
1305 # these are mapped to a different type; they should NOT
1306 # be treated as normal genes
1307 foreach my $sf (@all_seq_features) {
1308 if ($sf->has_tag('pseudo')) {
1309 my $type = $sf->primary_tag;
1310 # SO type is typically the same as the normal
1311 # type but preceeded by "pseudo"
1312 if ($type eq 'misc_RNA' || $type eq 'mRNA') {
1313 # dgg: see TypeMapper; both pseudo mRNA,misc_RNA should be pseudogenic_transcript
1314 $sf->primary_tag("pseudotranscript");
1316 else {
1317 $sf->primary_tag("pseudo$type");
1321 # now some of the post-processing that follows which applies to
1322 # genes will NOT be applied to pseudogenes; this is deliberate
1323 # for example, gene models are normalised to be gene-transcript-exon
1324 # for pseudogenes we leave them as pseudogene-pseudoexon
1326 # --- MAGIC ---
1327 my $need_to_infer_exons = 0;
1328 my $need_to_infer_mRNAs = 0;
1329 my @removed_exons = ();
1330 if ($use_magic) {
1331 if (defined($structure_type)) {
1332 $self->throw("Can't combine use_magic AND setting structure_type");
1334 my $n_introns =
1335 scalar(grep {$_->primary_tag eq 'exon'} @flat_seq_features);
1336 my $n_exons =
1337 scalar(grep {$_->primary_tag eq 'exon'} @flat_seq_features);
1338 my $n_mrnas =
1339 scalar(grep {$_->primary_tag eq 'mRNA'} @flat_seq_features);
1340 my $n_mrnas_attached_to_gene =
1341 scalar(grep {$_->primary_tag eq 'mRNA' &&
1342 $_->has_tag($group_tag)} @flat_seq_features);
1343 my $n_cdss =
1344 scalar(grep {$_->primary_tag eq 'CDS'} @flat_seq_features);
1345 my $n_rnas =
1346 scalar(grep {$_->primary_tag =~ /RNA/} @flat_seq_features);
1347 # Are there any CDS features in the record?
1348 if ($n_cdss > 0) {
1349 # YES
1351 # - a pc gene model should contain at the least a CDS
1353 # Are there any mRNA features in the record?
1354 if ($n_mrnas == 0) {
1355 # NO mRNAs:
1356 # looks like structure_type == 1
1357 $structure_type = 1;
1358 $need_to_infer_mRNAs = 1;
1360 elsif ($n_mrnas_attached_to_gene == 0) {
1361 # $n_mrnas > 0
1362 # $n_mrnas_attached_to_gene = 0
1364 # The entries _do_ contain mRNA features,
1365 # but none of them are part of a group/gene, i.e. they
1366 # are 'floating'
1368 # this is an annoying weird file that has some floating
1369 # mRNA features;
1370 # eg ftp.ncbi.nih.gov/genomes/Schizosaccharomyces_pombe/
1372 if ($self->verbose) {
1373 my @floating_mrnas =
1374 grep {$_->primary_tag eq 'mRNA' &&
1375 !$_->has_tag($group_tag)} @flat_seq_features;
1376 printf STDERR "Unattached mRNAs:\n";
1377 foreach my $mrna (@floating_mrnas) {
1378 $self->_write_sf_detail($mrna);
1380 printf STDERR "Don't know how to deal with these; filter at source?\n";
1383 foreach (@flat_seq_features) {
1384 if ($_->primary_tag eq 'mRNA') {
1385 # what should we do??
1387 # I think for pombe we just have to filter
1388 # out bogus mRNAs prior to starting
1392 # looks like structure_type == 2
1393 $structure_type = 2;
1394 $need_to_infer_mRNAs = 1;
1396 else {
1399 # we always infer exons in magic mode
1400 $need_to_infer_exons = 1;
1402 else {
1403 # this doesn't seem to be any kind of protein coding gene model
1404 if ( $n_rnas > 0 ) {
1405 $need_to_infer_exons = 1;
1409 $need_to_infer_exons = 0 if $noinfer; #NML
1411 if ($need_to_infer_exons) {
1412 # remove exons and introns from group -
1413 # we will infer exons later, and we
1414 # can always infer introns from exons
1415 foreach my $group (@groups) {
1416 @$group =
1417 grep {
1418 my $type = $_->primary_tag();
1419 if ($type eq 'exon') {
1420 # keep track of all removed exons,
1421 # so we can do a sanity check later
1422 push(@removed_exons, $_);
1424 $type ne 'exon' && $type ne 'intron'
1425 } @$group;
1427 # get rid of any groups that have zero members
1428 @groups = grep {scalar(@$_)} @groups;
1431 # --- END OF MAGIC ---
1433 # LOGICAL ASSERTION
1434 if (grep {!scalar(@$_)} @groups) {
1435 $self->throw("ASSERTION ERROR: empty group");
1438 # LOGGING
1439 if ($self->verbose > 0) {
1440 printf STDERR "GROUPS:\n";
1441 foreach my $group (@groups) {
1442 $self->_write_group($group, $group_tag);
1447 # --------- FINISHED GROUPING -------------
1450 # TYPE CONTAINMENT HIERARCHY (aka partonomy)
1451 # set the containment hierarchy if desired
1452 # see docs for structure_type() method
1453 if ($structure_type) {
1454 if ($structure_type == 1) {
1455 $self->partonomy(
1456 {CDS => 'gene',
1457 exon => 'CDS',
1458 intron => 'CDS',
1462 else {
1463 $self->throw("structure_type $structure_type is currently unknown");
1467 # see if we have an obvious resolver_tag
1468 if ($use_magic) {
1469 foreach my $sf (@all_seq_features) {
1470 if ($sf->has_tag('derived_from')) {
1471 $resolver_tag = 'derived_from';
1476 if ($use_magic) {
1477 # point all feature types without a container type to the root type.
1479 # for example, if we have an unanticipated feature_type, say
1480 # 'aberration', this should by default point to the parent 'gene'
1481 foreach my $group (@groups) {
1482 my @sfs = @$group;
1483 if (@sfs > 1) {
1484 foreach my $sf (@sfs) {
1485 my $type = $sf->primary_tag;
1486 next if $type eq 'gene';
1487 my $container_type = $self->get_container_type($type);
1488 if (!$container_type) {
1489 $self->partonomy->{$type} = 'gene';
1496 # we have done the first part of the unflattening.
1497 # we now have a list of groups; each group is a list of seqfeatures.
1498 # the actual group itself is flat; we may want to unflatten this further;
1499 # for instance, a gene model can contain multiple mRNAs and CDSs. We may want
1500 # to link the correct mRNA to the correct CDS via the bioperl sub_SeqFeature tree.
1502 # what we would end up with would be
1503 # gene1
1504 # mRNA-a
1505 # CDS-a
1506 # mRNA-b
1507 # CDS-b
1508 my @top_sfs = $self->unflatten_groups(-groups=>\@groups,
1509 -resolver_method=>$resolver_method,
1510 -resolver_tag=>$resolver_tag);
1512 # restore settings
1513 $self->partonomy($old_partonomy);
1515 # restore settings
1516 $self->structure_type($old_structure_type);
1518 # modify the original Seq object - the top seqfeatures are now
1519 # the top features from each group
1520 $seq->remove_SeqFeatures;
1521 $seq->add_SeqFeature($_) foreach @top_sfs;
1523 # --------- FINISHED UNFLATTENING -------------
1525 # lets see if there are any post-unflattening tasks we need to do
1529 # INFERRING mRNAs
1530 if ($need_to_infer_mRNAs) {
1531 if ($self->verbose > 0) {
1532 printf STDERR "** INFERRING mRNA from CDS\n";
1534 $self->infer_mRNA_from_CDS(-seq=>$seq, -noinfer=>$noinfer);
1537 # INFERRING exons
1538 if ($need_to_infer_exons) {
1540 # infer exons, one group/gene at a time
1541 foreach my $sf (@top_sfs) {
1542 my @sub_sfs = ($sf, $sf->get_all_SeqFeatures);
1543 $self->feature_from_splitloc(-features=>\@sub_sfs);
1546 # some exons are stated explicitly; ie there is an "exon" feature
1547 # most exons are inferred; ie there is a "mRNA" feature with
1548 # split locations
1550 # if there were exons explicitly stated in the entry, we need to
1551 # do two things:
1553 # make sure these exons are consistent with the inferred exons
1554 # (you never know)
1556 # transfer annotation (tag-vals) from the explicit exon to the
1557 # new inferred exon
1558 if (@removed_exons) {
1559 my @allfeats = $seq->get_all_SeqFeatures;
1561 # find all the inferred exons that are children of mRNA
1562 my @mrnas = grep {$_->primary_tag eq 'mRNA'} @allfeats;
1563 my @exons =
1564 grep {$_->primary_tag eq 'exon'}
1565 map {$_->get_SeqFeatures} @mrnas;
1567 my %exon_h = (); # index of exons by location;
1569 # there CAN be >1 exon at a location; we can represent these redundantly
1570 # (ie as a tree, not a graph)
1571 push(@{$exon_h{$self->_locstr($_)}}, $_) foreach @exons;
1572 my @problems = (); # list of problems;
1573 # each problem is a
1574 # [$severity, $description] pair
1575 my $problem = '';
1576 my ($n_exons, $n_removed_exons) =
1577 (scalar(keys %exon_h), scalar(@removed_exons));
1578 foreach my $removed_exon (@removed_exons) {
1579 my $locstr = $self->_locstr($removed_exon);
1580 my $inferred_exons = $exon_h{$locstr};
1581 delete $exon_h{$locstr};
1582 if ($inferred_exons) {
1583 my %exons_done = ();
1584 foreach my $exon (@$inferred_exons) {
1586 # make sure we don't move stuff twice
1587 next if $exons_done{$exon};
1588 $exons_done{$exon} = 1;
1590 # we need to tranfer any tag-values from the explicit
1591 # exon to the implicit exon
1592 foreach my $tag ($removed_exon->get_all_tags) {
1593 my @vals = $removed_exon->get_tag_values($tag);
1594 if (!$exon->can("add_tag_value")) {
1595 # I'm puzzled as to what should be done here;
1596 # SeqFeatureIs are not necessarily mutable,
1597 # but we know that in practice the implementing
1598 # class is mutable
1599 $self->throw("The SeqFeature object does not ".
1600 "implement add_tag_value()");
1602 $exon->add_tag_value($tag, @vals);
1606 else {
1607 # no exons inferred at $locstr
1608 push(@problems,
1609 [1,
1610 "there is a conflict with exons; there was an explicitly ".
1611 "stated exon with location $locstr, yet I cannot generate ".
1612 "this exon from the supplied mRNA locations\n"]);
1615 # do we have any inferred exons left over, that were not
1616 # covered in the explicit exons?
1617 if (keys %exon_h) {
1618 # TODO - we ignore this problem for now
1619 push(@problems,
1621 sprintf("There are some inferred exons that are not in the ".
1622 "explicit exon list; they are the exons at locations:\n".
1623 join("\n", keys %exon_h)."\n")]);
1626 # report any problems
1627 if (@problems) {
1628 my $thresh = $self->error_threshold;
1629 my @bad_problems = grep {$_->[0] > $thresh} @problems;
1630 if (@bad_problems) {
1631 printf STDERR "PROBLEM:\n";
1632 $self->_write_hier(\@top_sfs);
1633 # TODO - allow more fine grained control over this
1634 $self->{_problems_reported} = 1;
1635 $self->throw(join("\n",
1636 map {"@$_"} @bad_problems));
1638 $self->problem(@$_) foreach @problems;
1642 # --- end of inferring exons --
1644 # return new top level features; this can also
1645 # be retrieved via
1646 # $seq->get_SeqFeatures();
1647 # return @top_sfs;
1648 return $seq->get_SeqFeatures;
1651 # _split_group_if_disconnected([@sfs])
1653 # as well as having the same group_tag, a group should be spatially
1654 # connected. if not, then the group should be split into subgroups.
1655 # this turns out to be necessary in the case of multicopy genes.
1656 # the standard way to represent these is as spatially disconnected
1657 # gene models (usually a 'gene' feature and some kind of RNA feature)
1658 # with the same group tag; the code below will split these into
1659 # seperate groups, one per copy.
1661 sub _split_group_if_disconnected {
1662 my $self = shift;
1663 my $group = shift;
1664 my @sfs = @$group;
1665 my @ranges =
1666 Bio::Range->disconnected_ranges(@sfs);
1667 my @groups;
1668 if (@ranges == 0) {
1669 $self->throw("ASSERTION ERROR");
1671 elsif (@ranges == 1) {
1672 # no need to split the group
1673 @groups = ($group);
1675 else {
1676 # @ranges > 1
1677 # split the group into disconnected ranges
1678 if ($self->verbose > 0) {
1679 printf STDERR "GROUP PRE-SPLIT:\n";
1680 $self->_write_group($group, $self->group_tag);
1682 @groups =
1683 map {
1684 my $range = $_;
1685 [grep {
1686 $_->intersection($range);
1687 } @sfs]
1688 } @ranges;
1689 if ($self->verbose > 0) {
1690 printf STDERR "SPLIT GROUPS:\n";
1691 $self->_write_group($_, $self->group_tag) foreach @groups;
1694 return @groups;
1697 sub _remove_duplicates_from_group {
1698 my $self = shift;
1699 my $group = shift;
1701 # ::: WEIRD BOUNDARY CASE CODE :::
1702 # for some reason, there are some gb records with two gene
1703 # features for one gene; for example, see ATF14F8.gbk
1704 # in the t/data directory
1706 # in this case, we get rid of one of the genes
1708 my @genes = grep {$_->primary_tag eq 'gene'} @$group;
1709 if (@genes > 1) {
1710 # OK, if we look at ATF14F8.gbk we see that some genes
1711 # just exist as a single location, some exist as a multisplit location;
1713 # eg
1715 # gene 16790..26395
1716 # /gene="F14F8_60"
1717 # ...
1718 # gene complement(join(16790..19855,20136..20912,21378..21497,
1719 # 21654..21876,22204..22400,22527..23158,23335..23448,
1720 # 23538..23938,24175..24536,24604..24715,24889..24984,
1721 # 25114..25171,25257..25329,25544..25589,25900..26018,
1722 # 26300..26395))
1723 # /gene="F14F8_60"
1725 # the former is the 'standard' way of representing the gene in genbank;
1726 # the latter is redundant with the CDS entry. So we shall get rid of
1727 # the latter with the following filter
1729 if ($self->verbose > 0) {
1730 printf STDERR "REMOVING DUPLICATES:\n";
1733 @genes =
1734 grep {
1735 my $loc = $_->location;
1736 if ($loc->isa("Bio::Location::SplitLocationI")) {
1737 my @locs = $loc->each_Location;
1738 if (@locs > 1) {
1741 else {
1745 else {
1748 } @genes;
1750 if (@genes > 1) {
1751 # OK, that didn't work. Our only resort is to just pick one at random
1752 @genes = ($genes[0]);
1754 if (@genes) {
1755 @genes == 1 || $self->throw("ASSERTION ERROR");
1756 @$group =
1757 ($genes[0], grep {$_->primary_tag ne 'gene'} @$group);
1760 # its a dirty job but someone's gotta do it
1761 return;
1765 =head2 unflatten_groups
1767 Title : unflatten_groups
1768 Usage :
1769 Function: iterates over groups, calling unflatten_group() [see below]
1770 Example :
1771 Returns : list of Bio::SeqFeatureI objects that are holders
1772 Args : see below
1774 Arguments
1776 -groups: list of list references; inner list is of Bio::SeqFeatureI objects
1777 e.g. ( [$sf1], [$sf2, $sf3, $sf4], [$sf5, ...], ...)
1779 -resolver_method: a CODE reference
1780 see the documentation above for an example of
1781 a subroutine that can be used to resolve hierarchies
1782 within groups.
1784 this is optional - a default subroutine will be used
1787 NOTE: You should not need to call this method, unless you want fine
1788 grained control over how the unflattening process.
1790 =cut
1792 sub unflatten_groups{
1793 my ($self,@args) = @_;
1794 my($groups, $resolver_method, $resolver_tag) =
1795 $self->_rearrange([qw(GROUPS
1796 RESOLVER_METHOD
1797 RESOLVER_TAG
1799 @args);
1801 # this is just a simple wrapper for unflatten_group()
1802 return
1803 map {
1804 $self->unflatten_group(-group=>$_,
1805 -resolver_method=>$resolver_method,
1806 -resolver_tag=>$resolver_tag)
1807 } @$groups;
1810 =head2 unflatten_group
1812 Title : unflatten_group
1813 Usage :
1814 Function: nests a group of features into a feature containment hierarchy
1815 Example :
1816 Returns : Bio::SeqFeatureI objects that holds other features
1817 Args : see below
1819 Arguments
1821 -group: reference to list of Bio::SeqFeatureI objects
1823 -resolver_method: a CODE reference
1824 see the documentation above for an example of
1825 a subroutine that can be used to resolve hierarchies
1826 within groups
1828 this is optional - a default subroutine will be used
1831 NOTE: You should not need to call this method, unless you want fine
1832 grained control over how the unflattening process.
1834 =cut
1836 sub unflatten_group{
1837 my ($self,@args) = @_;
1839 my($group, $resolver_method, $resolver_tag) =
1840 $self->_rearrange([qw(GROUP
1841 RESOLVER_METHOD
1842 RESOLVER_TAG
1844 @args);
1846 if ($self->verbose > 0) {
1847 printf STDERR "UNFLATTENING GROUP:\n";
1848 $self->_write_group($group, $self->group_tag);
1851 my @sfs = @$group;
1853 # we can safely ignore singletons (e.g. [source])
1854 return $sfs[0] if @sfs == 1;
1856 my $partonomy = $self->partonomy;
1858 # $resolver_method is a reference to a SUB that will resolve
1859 # ambiguous parent/child containment; for example, determining
1860 # which mRNAs go with which CDSs
1861 $resolver_method = $resolver_method || \&_resolve_container_for_sf;
1863 # TAG BASED RESOLVING OF HIERARCHIES
1865 # if the user specifies $resolver_tag, then we use this tag
1866 # to pair up ambiguous parents and children;
1868 # for example, the CDS feature may have a resolver tag of /derives_from
1869 # which is a 'foreign key' into the /label tag of the mRNA feature
1871 # this kind of tag-based resolution is possible for a certain subset
1872 # of genbank records
1874 # if no resolver tag is specified, we revert to the normal
1875 # resolver_method
1876 if ($resolver_tag) {
1877 my $backup_resolver_method = $resolver_method;
1878 # closure: $resolver_tag is remembered by this sub
1879 my $sub =
1880 sub {
1881 my ($self, $sf, @possible_container_sfs) = @_;
1882 my @container_sfs = ();
1883 if ($sf->has_tag($resolver_tag)) {
1884 my ($resolver_tagval) = $sf->get_tag_values($resolver_tag);
1885 # if a feature has a resolver_tag (e.g. /derives_from)
1886 # this specifies the /product, /symbol or /label for the
1887 # parent feature
1888 @container_sfs =
1889 grep {
1890 my $match = 0;
1891 $self->_write_sf($_) if $self->verbose > 0;
1892 foreach my $tag (qw(product symbol label)) {
1893 if ($_->has_tag($tag)) {
1894 my @vals =
1895 $_->get_tag_values($tag);
1896 if (grep {$_ eq $resolver_tagval} @vals) {
1897 $match = 1;
1898 last;
1902 $match;
1903 } @possible_container_sfs;
1905 else {
1906 return $backup_resolver_method->($sf, @possible_container_sfs);
1908 return map {$_=>0} @container_sfs;
1910 $resolver_method = $sub;
1912 else {
1913 # CONDITION: $resolver_tag is NOT set
1914 $self->throw("assertion error") if $resolver_tag;
1916 # we have now set $resolver_method to a subroutine for
1917 # disambiguatimng parent/child relationships. we will
1918 # now build the whole containment hierarchy for this group
1921 # FIND TOP/ROOT SEQFEATURES
1923 # find all the features for which there is no
1924 # containing feature type (eg genes)
1925 my @top_sfs =
1926 grep {
1927 !$self->get_container_type($_->primary_tag);
1928 } @sfs;
1930 # CONDITION: there must be at most one root
1931 if (@top_sfs > 1) {
1932 $self->_write_group($group, $self->group_tag);
1933 printf STDERR "TOP SFS:\n";
1934 $self->_write_sf($_) foreach @top_sfs;
1935 $self->throw("multiple top-sfs in group");
1937 my $top_sf = $top_sfs[0];
1939 # CREATE INDEX OF SEQFEATURES BY TYPE
1940 my %sfs_by_type = ();
1941 foreach my $sf (@sfs) {
1942 push(@{$sfs_by_type{$sf->primary_tag}}, $sf);
1945 # containment index; keyed by child; lookup parent
1946 # note: this index uses the stringified object reference of
1947 # the object as a surrogate lookup key
1949 my %container = (); # child -> parent
1951 # ALGORITHM: build containment graph
1953 # find all possible containers for each SF;
1954 # for instance, for a CDS, the possible containers are all
1955 # the mRNAs in the same group. For a mRNA, the possible
1956 # containers are any SFs of type 'gene' (should only be 1).
1957 # (these container-type mappings can be overridden)
1959 # contention is resolved by checking coordinates of splice sites
1960 # (this is the default, but can be overridden)
1962 # most of the time, there is no problem identifying a unique
1963 # parent for every child; this can be ambiguous when constructing
1964 # CDS to mRNA relationships with lots of alternate splicing
1966 # a hash of child->parent relationships is constructed (%container)
1967 # any mappings that need further resolution (eg CDS to mRNA) are
1968 # placed in %unresolved
1970 # %unresolved index
1971 # (keyed by stringified object reference of child seqfeature)
1972 my %unresolved = (); # child -> [parent,score] to be resolved
1974 # index of seqfeatures by their stringified object reference;
1975 # this is essentially a way of 'reviving' an object from its stringified
1976 # reference
1977 # (see NOTE ON USING OBJECTS AS KEYS IN HASHES, below)
1978 my %idxsf = map {$_=>$_} @sfs;
1980 foreach my $sf (@sfs) {
1981 my $type = $sf->primary_tag;
1983 # container type (e.g. the container type for CDS is usually mRNA)
1984 my $container_type =
1985 $self->get_container_type($type);
1986 if ($container_type) {
1988 my @possible_container_sfs =
1989 @{$sfs_by_type{$container_type} || []};
1990 # we now have a list of possible containers
1991 # (eg for a CDS in an alternately spliced gene, this
1992 # would be a list of all the mRNAs for this gene)
1994 if (!@possible_container_sfs) {
1995 # root of hierarchy
1997 else {
1998 if (@possible_container_sfs == 1) {
1999 # this is the easy situation, whereby the containment
2000 # hierarchy is unambiguous. this will probably be the
2001 # case if the genbank record has no alternate splicing
2002 # within it
2004 # ONE OPTION ONLY - resolved!
2005 $container{$sf} = $possible_container_sfs[0];
2008 else {
2009 # MULTIPLE CONTAINER CHOICES
2010 $self->throw("ASSERTION ERROR") unless @possible_container_sfs > 1;
2012 # push this onto the %unresolved graph, and deal with it
2013 # later
2015 # for now we hardcode things such that the only type
2016 # with ambiguous parents is a CDS; if this is violated,
2017 # it has a weak problem class of '1' so the API user
2018 # can easily set things to ignore these
2019 if ($sf->primary_tag ne 'CDS') {
2020 $self->problem(1,
2021 "multiple container choice for non-CDS; ".
2022 "CDS to mRNA should be the only ".
2023 "relationships requiring resolving",
2024 $sf);
2027 # previously we set the SUB $resolver_method
2028 $self->throw("ASSERTION ERROR")
2029 unless $resolver_method;
2031 # $resolver_method will assign scores to
2032 # parent/child combinations; later on we
2033 # will use these scores to find the optimal
2034 # parent/child pairings
2036 # the default $resolver_method uses splice sites to
2037 # score possible parent/child matches
2039 my %container_sfh =
2040 $resolver_method->($self, $sf, @possible_container_sfs);
2041 if (!%container_sfh) {
2042 $self->problem(2,
2043 "no containers possible for SeqFeature of ".
2044 "type: $type; this SF is being placed at ".
2045 "root level",
2046 $sf);
2047 # RESOLVED! (sort of - placed at root/gene level)
2048 $container{$sf} = $top_sf;
2050 # this sort of thing happens if the record is
2051 # badly messed up and there is absolutely no indication
2052 # of where to put the CDS. Perhaps we should just
2053 # place it with a random mRNA?
2055 foreach my $jsf (keys %container_sfh) {
2057 # add [score, parent] pairs to the %unresolved
2058 # lookup table/graph
2059 push(@{$unresolved{$sf}},
2060 [$idxsf{$jsf}, $container_sfh{$jsf} || 0]);
2065 else {
2066 # CONDITION:
2067 # not container type for $sf->primary_tag
2069 # CONDITION:
2070 # $sf must be a root/top node (eg gene)
2074 if (0) {
2076 # CODE CURRENTLY DISABLED
2078 # we require a 1:1 mapping between mRNAs and CDSs;
2079 # create artificial duplicates if we can't do this...
2080 if (%unresolved) {
2081 my %childh = map {$_=>1} keys %unresolved;
2082 my %parenth = map {$_->[0]=>1} map {@$_} values %unresolved;
2083 if ($self->verbose > 0) {
2084 printf STDERR "MATCHING %d CHILDREN TO %d PARENTS\n",
2085 scalar(keys %childh), scalar(keys %parenth);
2087 # 99.99% of the time in genbank genomic record of structure type 0, we
2088 # see one CDS for every mRNA; one exception is the S Pombe
2089 # genome, which is all CDS, bar a few spurious mRNAs; we have to
2090 # filter out the spurious mRNAs in this case
2092 # another strange case is in the mouse genome, NT_078847.1
2093 # for Pcdh13 you will notice there is 4 mRNAs and 5 CDSs.
2094 # most unusual!
2095 # I'm at a loss for a really clever thing to do here. I think the
2096 # best thing is to create duplicate features to preserve the 1:1 mapping
2097 # my $suffix_id = 1;
2098 # while (keys %childh > keys %parenth) {
2104 # DEBUGGING CODE
2105 if ($self->verbose > 0 && scalar(keys %unresolved)) {
2106 printf STDERR "UNRESOLVED PAIRS:\n";
2107 foreach my $childsf (keys %unresolved) {
2108 my @poss = @{$unresolved{$childsf}};
2109 foreach my $p (@poss) {
2110 my $parentsf = $p->[0];
2111 $childsf = $idxsf{$childsf};
2112 my @clabels = ($childsf->get_tagset_values(qw(protein_id label product)), "?");
2113 my @plabels = ($parentsf->get_tagset_values(qw(transcript_id label product)), "?");
2114 printf STDERR
2115 (" PAIR: $clabels[0] => $plabels[0] (of %d)\n",
2116 scalar(@poss));
2119 } # -- end of verbose
2121 # Now we have to fully resolve the containment hierarchy; remember,
2122 # the graph %container has the fully resolved child->parent links;
2124 # the graph %unresolved is keyed by children missing parents; we
2125 # need to put all these orphans in the %container graph
2127 # we do this using the scores in %unresolved, with the
2128 # find_best_matches() algorithm
2129 my $unresolved_problem_reported = 0;
2130 if (%unresolved) {
2131 my $new_pairs =
2132 $self->find_best_matches(\%unresolved, []);
2133 if (!$new_pairs) {
2134 my ($g) = $sfs[0]->get_tagset_values($self->group_tag || 'gene');
2135 $self->problem(2,
2136 "Could not resolve hierarchy for $g");
2137 $new_pairs = [];
2138 $unresolved_problem_reported = 1;
2140 foreach my $pair (@$new_pairs) {
2141 if ($self->verbose > 0) {
2142 printf STDERR " resolved pair @$pair\n";
2144 $container{$pair->[0]} = $pair->[1];
2145 delete $unresolved{$pair->[0]};
2149 # CONDITION: containment hierarchy resolved
2150 if (%unresolved) {
2151 $self->throw("UNRESOLVED: %unresolved")
2152 unless $unresolved_problem_reported;
2155 # make nested SeqFeature hierarchy from @containment_pairs
2156 # ie put child SeqFeatures into parent SeqFeatures
2157 my @top = ();
2158 foreach my $sf (@sfs) {
2159 my $container_sf = $container{$sf};
2160 if ($container_sf) {
2161 # make $sf nested inside $container_sf
2163 # first check if the container spatially contains the containee
2164 if ($container_sf->contains($sf)) {
2165 # add containee
2166 $container_sf->add_SeqFeature($sf);
2168 else {
2169 # weird case - the container does NOT spatially
2170 # contain the containee;
2171 # we expand and throw a warning
2173 # for an example of this see ZFP91-CNTF dicistronic gene
2174 # in NCBI chrom 11 build 34.3
2175 $self->problem(1,
2176 "Container feature does not spatially contain ".
2177 "subfeature. Perhaps this is a dicistronic gene? ".
2178 "I am expanding the parent feature",
2179 $container_sf,
2180 $sf);
2181 $container_sf->add_SeqFeature($sf, 'EXPAND');
2184 else {
2185 push(@top, $sf);
2188 return @top;
2189 } # -- end of unflatten_group
2191 # -------
2192 # A NOTE ON USING OBJECTS AS KEYS IN HASHES (stringified objects)
2194 # Often we with to use seqfeatures as keys in a hashtable; because seqfeatures
2195 # in bioperl have no unique ID, we use a surrogate ID in the form of the
2196 # stringified object references - this is just what you get if you say
2198 # print "$sf\n";
2200 # this is guaranteed to be unique (within a particular perl execution)
2202 # often we want to 'revive' the objects used as keys in a hash - once the
2203 # objects are used as keys, remember it is the *strings* used as keys and
2204 # not the object itself, so the object needs to be revived using another
2205 # hashtable that looks like this
2207 # %sfidx = map { $_ => $_ } @sfs
2209 # -------
2212 # recursively finds the best set of pairings from a matrix of possible pairings
2214 # tries to make sure nothing is unpaired
2216 # given a matrix of POSSIBLE matches
2217 # (matrix expressed as hash/lookup; keyed by child object; val = [parent, score]
2220 sub find_best_matches {
2221 my $self = shift;
2222 my $matrix = shift;
2223 my $pairs = shift; # [child,parent] pairs already selected
2225 my $verbose = $self->verbose;
2226 #################################print "I";
2227 if ($verbose > 0) {
2228 printf STDERR "find_best_matches: (/%d)\n", scalar(@$pairs);
2231 my %selected_children = map {($_->[0]=>1)} @$pairs;
2232 my %selected_parents = map {($_->[1]=>1)} @$pairs;
2234 # make a copy of the matrix with the portions still to be
2235 # resolved
2236 my %unresolved_parents = ();
2237 my %unresolved =
2238 map {
2239 if ($verbose > 0) {
2240 printf STDERR " $_ : %s\n", join("; ", map {"[@$_]"} @{$matrix->{$_}});
2242 if ($selected_children{$_}) {
2245 else {
2246 my @parents =
2247 grep {
2248 !$selected_parents{$_->[0]}
2249 } @{$matrix->{$_}};
2250 $unresolved_parents{$_} = 1 foreach @parents;
2251 # new parents
2252 ($_ => [@parents]);
2254 } keys %$matrix;
2256 my @I = keys %unresolved;
2258 return $pairs if !scalar(keys %unresolved_parents);
2259 # NECESSARY CONDITION:
2260 # all possible parents have a child match
2262 return $pairs if !scalar(@I);
2263 # NECESSARY CONDITION:
2264 # all possible children have a parent match
2266 # give those with fewest choices highest priority
2267 @I = sort {
2268 # n possible parents
2269 scalar(@{$unresolved{$a}})
2271 scalar(@{$unresolved{$b}}) ;
2272 } @I;
2274 my $csf = shift @I;
2276 my @J = @{$unresolved{$csf}}; # array of [parent, score]
2278 # sort by score, highest first
2279 @J =
2280 sort {
2281 $b->[1] <=> $a->[1]
2282 } @J;
2284 # select pair(s) from remaining matrix of possible pairs
2285 # by iterating through possible parents
2287 my $successful_pairs;
2288 foreach my $j (@J) {
2289 my ($psf, $score) = @$j;
2290 # would selecting $csf, $psf as a pair
2291 # remove all choices from another?
2292 my $bad = 0;
2293 foreach my $sf (@I) {
2294 if (!grep {$_->[0] ne $psf} @{$unresolved{$sf}}) {
2295 # $psf was the only parent choice for $sf
2296 $bad = 1;
2297 last;
2300 if (!$bad) {
2301 my $pair = [$csf, $psf];
2302 my $new_pairs = [@$pairs, $pair];
2303 my $set = $self->find_best_matches($matrix, $new_pairs);
2304 if ($set) {
2305 $successful_pairs = $set;
2306 last;
2310 # success
2311 return $successful_pairs if $successful_pairs;
2312 # fail
2313 return 0;
2316 # ----------------------------------------------
2317 # writes a group to stdout
2319 # mostly for logging/debugging
2320 # ----------------------------------------------
2321 sub _write_group {
2322 my $self = shift;
2323 my $group = shift;
2324 my $group_tag = shift || 'gene';
2326 my $f = $group->[0];
2327 my $label = '?';
2328 if ($f->has_tag($group_tag)) {
2329 ($label) = $f->get_tag_values($group_tag);
2331 if( $self->verbose > 0 ) {
2332 printf STDERR (" GROUP [%s]:%s\n",
2333 $label,
2334 join(' ',
2335 map { $_->primary_tag } @$group));
2340 sub _write_sf {
2341 my $self = shift;
2342 my $sf = shift;
2343 printf STDERR "TYPE:%s\n", $sf->primary_tag;
2344 return;
2347 sub _write_sf_detail {
2348 my $self = shift;
2349 my $sf = shift;
2350 printf STDERR "TYPE:%s\n", $sf->primary_tag;
2351 my @locs = $sf->location->each_Location;
2352 printf STDERR " %s,%s [%s]\n", $_->start, $_->end, $_->strand foreach @locs;
2353 return;
2356 sub _write_hier {
2357 my $self = shift;
2358 my @sfs = @{shift || []};
2359 my $indent = shift || 0;
2360 if( $self->verbose > 0 ) {
2361 foreach my $sf (@sfs) {
2362 my $label = '?';
2363 if ($sf->has_tag('product')) {
2364 ($label) = $sf->get_tag_values('product');
2366 printf STDERR "%s%s $label\n", ' ' x $indent, $sf->primary_tag;
2367 my @sub_sfs = $sf->sub_SeqFeature;
2368 $self->_write_hier(\@sub_sfs, $indent+1);
2373 # -----------------------------------------------
2375 # returns all possible containers for an SF based
2376 # on splice site coordinates; splice site coords
2377 # must be contained
2378 # -----------------------------------------------
2379 sub _resolve_container_for_sf{
2380 my ($self, $sf, @possible_container_sfs) = @_;
2382 my @coords = $self->_get_splice_coords_for_sf($sf);
2383 my $start = $sf->start;
2384 my $end = $sf->end;
2385 my $splice_uniq_str = "@coords";
2387 my @sf_score_pairs = ();
2388 # a CDS is contained by a mRNA if the locations of the splice
2389 # coordinates are identical
2390 foreach (@possible_container_sfs) {
2391 my @container_coords = $self->_get_splice_coords_for_sf($_);
2392 my $inside =
2393 !$splice_uniq_str ||
2394 index("@container_coords", $splice_uniq_str) > -1;
2395 if ($inside) {
2396 # the container cannot be smaller than the thing contained
2397 if ($_->start > $start || $_->end < $end) {
2398 $inside = 0;
2403 # SPECIAL CASE FOR /ribosomal_slippage
2404 # See: https://www.ncbi.nlm.nih.gov/collab/FT/
2405 if (!$inside && $sf->has_tag('ribosomal_slippage')) {
2406 if ($self->verbose > 0) {
2407 printf STDERR " Checking for ribosomal_slippage\n";
2410 # TODO: rewrite this to match introns;
2411 # each slippage will be a "fake" small CDS exon
2412 my @transcript_splice_sites = @container_coords;
2413 my @cds_splice_sites = @coords;
2414 ##printf STDERR "xxTR SSs: @transcript_splice_sites :: %s\n", $_->get_tag_values('product');
2415 ##printf STDERR "xxCD SSs: @cds_splice_sites :: %s\n\n", $sf->get_tag_values('product');
2417 # find the the first splice site within the CDS
2418 while (scalar(@transcript_splice_sites) &&
2419 $transcript_splice_sites[0] < $cds_splice_sites[0]) {
2420 shift @transcript_splice_sites;
2423 ##print STDERR "TR SSs: @transcript_splice_sites\n";
2424 ##print STDERR "CD SSs: @cds_splice_sites\n\n";
2426 if (!(scalar(@transcript_splice_sites)) ||
2427 $transcript_splice_sites[0] == $cds_splice_sites[0]) {
2429 # we will now try and align all splice remaining sites in the transcript and CDS;
2430 # any splice site that can't be aligned is assumed to be a ribosomal slippage
2432 my @slips = ();
2433 my $in_exon = 1;
2434 $inside = 1; # innocent until proven guilty..
2435 while (@cds_splice_sites) {
2436 if (!@transcript_splice_sites) {
2438 # ribosomal slippage is after the last transcript splice site
2439 # Example: (NC_00007, isoform 3 of PEG10)
2440 # mRNA join(85682..85903,92646..99007)
2441 # mRNA join(85682..85903,92646..99007)
2442 # CDS join(85899..85903,92646..93825,93825..94994)
2444 # OR: None of the splice sites align;
2445 # may be a single CDS exon with one slippage inside it.
2446 # Example: (NC_00007, isoform 4 of PEG10)
2447 # mRNA join(85637..85892,92646..99007)
2448 # CDS join(92767..93825,93825..94994)
2450 # Yes, this code is repeated below...
2451 my $p1 = shift @cds_splice_sites;
2452 my $p2 = shift @cds_splice_sites;
2453 if ($self->verbose > 0) {
2454 printf STDERR " Found the ribosomal_slippage: $p1..$p2\n";
2456 push(@slips, ($p2-$p1)-1);
2458 elsif ($cds_splice_sites[0] == $transcript_splice_sites[0]) {
2459 # splice sites align: this is not the slippage
2460 shift @cds_splice_sites;
2461 shift @transcript_splice_sites;
2462 ##print STDERR "MATCH\n";
2464 else {
2465 # mismatch
2466 if ($cds_splice_sites[0] < $transcript_splice_sites[0]) {
2467 # potential slippage
2469 # ---TTTTTTTTTT----
2470 # ---CCCC--CCCC----
2473 my $p1 = shift @cds_splice_sites;
2474 my $p2 = shift @cds_splice_sites;
2475 if ($self->verbose > 0) {
2476 printf STDERR " Found the ribosomal_slippage: $p1..$p2\n";
2478 push(@slips, ($p2-$p1)-1);
2480 else {
2481 # not a potential ribosomal slippage
2482 $inside = 0; # guilty!
2483 ##print STDERR "FAIL\n";
2484 last;
2488 if ($inside) {
2489 # TODO: this is currently completely arbitrary. How many ribosomal slippages do we allow?
2490 # perhaps we need some mini-statistical model here....?
2491 if (@slips > 1) {
2492 $inside = 0;
2494 # TODO: this is currently completely arbitrary. What is the maximum size of a ribosomal slippage?
2495 # perhaps we need some mini-statistical model here....?
2496 if (grep {$_ > 2} @slips) {
2497 $inside = 0;
2501 else {
2502 # not a ribosomal_slippage, sorry
2505 if ($self->verbose > 0) {
2506 printf STDERR " Checking containment:[$inside] (@container_coords) IN ($splice_uniq_str)\n";
2508 if ($inside) {
2509 # SCORE: matching (ss-scoords+2)/(n-container-ss-coords+2)
2510 my $score =
2511 (scalar(@coords)+2)/(scalar(@container_coords)+2);
2512 push(@sf_score_pairs,
2513 $_=>$score);
2516 return @sf_score_pairs;
2519 sub _get_splice_coords_for_sf {
2520 my $self = shift;
2521 my $sf = shift;
2523 my @locs = $sf->location;
2524 if ($sf->location->isa("Bio::Location::SplitLocationI")) {
2525 @locs = $sf->location->each_Location;
2528 # get an ordered list of (start, end) positions
2530 # my @coords =
2531 # map {
2532 # $_->strand > 0 ? ($_->start, $_->end) : ($_->end, $_->start)
2533 # } @locs;
2535 my @coords = map {($_->start, $_->end)} @locs;
2537 # remove first and last leaving only splice sites
2538 pop @coords;
2539 shift @coords;
2540 return @coords;
2543 =head2 feature_from_splitloc
2545 Title : feature_from_splitloc
2546 Usage : $unflattener->feature_from_splitloc(-features=>$sfs);
2547 Function:
2548 Example :
2549 Returns :
2550 Args : see below
2552 At this time all this method does is generate exons for mRNA or other RNA features
2554 Arguments:
2556 -feature: a Bio::SeqFeatureI object (that conforms to Bio::FeatureHolderI)
2557 -seq: a Bio::SeqI object that contains Bio::SeqFeatureI objects
2558 -features: an arrayref of Bio::SeqFeatureI object
2561 =cut
2563 sub feature_from_splitloc{
2564 my ($self,@args) = @_;
2566 my($sf, $seq, $sfs) =
2567 $self->_rearrange([qw(FEATURE
2569 FEATURES
2571 @args);
2572 my @sfs = (@{$sfs || []});
2573 push(@sfs, $sf) if $sf;
2574 if ($seq) {
2575 $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2576 @sfs = $seq->get_all_SeqFeatures;
2578 my @exons = grep {$_->primary_tag eq 'exon'} @sfs;
2579 if (@exons) {
2580 $self->problem(2,
2581 "There are already exons, so I will not infer exons");
2584 # index of features by type+location
2585 my %loc_h = ();
2587 # infer for every feature
2588 foreach my $sf (@sfs) {
2590 $sf->isa("Bio::SeqFeatureI") || $self->throw("$sf NOT A SeqFeatureI");
2591 $sf->isa("Bio::FeatureHolderI") || $self->throw("$sf NOT A FeatureHolderI");
2593 my $type = $sf->primary_tag;
2594 next unless $type eq 'mRNA' or $type =~ /RNA/;
2596 # an mRNA from genbank will have a discontinuous location,
2597 # with each sub-location being equivalent to an exon
2598 my @locs = $sf->location;
2600 if ($sf->location->isa("Bio::Location::SplitLocationI")) {
2601 @locs = $sf->location->each_Location;
2604 if (!@locs) {
2605 use Data::Dumper;
2606 print Dumper $sf;
2607 $self->throw("ASSERTION ERROR: sf has no location objects");
2610 # make exons from locations
2611 my @subsfs =
2612 map {
2613 my $subsf = Bio::SeqFeature::Generic->new(-location=>$_,
2614 -primary_tag=>'exon');
2615 ## Provide seq_id to new feature:
2616 $subsf->seq_id($sf->seq_id) if $sf->seq_id;
2617 $subsf->source_tag($sf->source_tag) if $sf->source_tag;
2618 ## Transfer /locus_tag and /gene tag values to inferred
2619 ## features. TODO: Perhaps? this should not be done
2620 ## indiscriminantly but rather by virtue of the setting
2621 ## of group_tag.
2622 foreach my $tag (grep /gene|locus_tag/, $sf->get_all_tags) {
2623 my @vals = $sf->get_tag_values($tag);
2624 $subsf->add_tag_value($tag, @vals);
2627 my $locstr = 'exon::'.$self->_locstr($subsf);
2629 # re-use feature if type and location the same
2630 if ($loc_h{$locstr}) {
2631 $subsf = $loc_h{$locstr};
2633 else {
2634 $loc_h{$locstr} = $subsf;
2636 $subsf;
2637 } @locs;
2639 # PARANOID CHECK
2640 $self->_check_order_is_consistent($sf->location->strand,@subsfs);
2641 #----
2643 $sf->location(Bio::Location::Simple->new());
2645 # we allow the exons to define the boundaries of the transcript
2646 $sf->add_SeqFeature($_, 'EXPAND') foreach @subsfs;
2649 if (!$sf->location->strand) {
2650 # correct weird bioperl bug in previous versions;
2651 # strand was not being set correctly
2652 $sf->location->strand($subsfs[0]->location->strand);
2657 return;
2660 #sub merge_features_with_same_loc {
2661 # my ($self,@args) = @_;
2663 # my($sfs, $seq) =
2664 # $self->_rearrange([qw(FEATURES
2665 # SEQ
2666 # )],
2667 # @args);
2668 # my @sfs = (@$sfs);
2669 # if ($seq) {
2670 # $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2671 # @sfs = $seq->get_all_SeqFeatures;
2675 # my %loc_h = ();
2676 # foreach my $sf (@sfs) {
2677 # my $type = $sf->primary_tag;
2678 # my $locstr = $self->_locstr($sf);
2679 ## $loc_h{$type.$locstr}
2680 # push(@{$exon_h{$self->_locstr($_)}}, $_) foreach @exons;
2684 =head2 infer_mRNA_from_CDS
2686 Title : infer_mRNA_from_CDS
2687 Usage :
2688 Function:
2689 Example :
2690 Returns :
2691 Args :
2693 given a "type 1" containment hierarchy
2695 gene
2697 exon
2699 this will infer the uniform "type 0" containment hierarchy
2701 gene
2702 mRNA
2704 exon
2706 all the children of the CDS will be moved to the mRNA
2708 a "type 2" containment hierarchy is mixed type "0" and "1" (for
2709 example, see ftp.ncbi.nih.gov/genomes/Schizosaccharomyces_pombe/)
2711 =cut
2713 sub infer_mRNA_from_CDS{
2714 my ($self,@args) = @_;
2716 my($sf, $seq, $noinfer) =
2717 $self->_rearrange([qw(FEATURE
2719 NOINFER
2721 @args);
2722 my @sfs = ($sf);
2723 if ($seq) {
2724 $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2725 @sfs = $seq->get_all_SeqFeatures;
2728 foreach my $sf (@sfs) {
2730 $sf->isa("Bio::SeqFeatureI") || $self->throw("$sf NOT A SeqFeatureI");
2731 $sf->isa("Bio::FeatureHolderI") || $self->throw("$sf NOT A FeatureHolderI");
2732 if ($self->verbose > 0) {
2733 printf STDERR " Checking $sf %s\n", $sf->primary_tag;
2736 if ($sf->primary_tag eq 'mRNA') {
2737 $self->problem(2,
2738 "Inferring mRNAs when there are already mRNAs present");
2741 my @cdsl = grep {$_->primary_tag eq 'CDS' } $sf->get_SeqFeatures;
2742 if (@cdsl) {
2743 my @children = grep {$_->primary_tag ne 'CDS'} $sf->get_SeqFeatures;
2744 my @mrnas = ();
2747 foreach my $cds (@cdsl) {
2749 if ($self->verbose > 0) {
2750 print " Inferring mRNA from CDS $cds\n";
2752 $self->_check_order_is_consistent($cds->location->strand,$cds->location->each_Location);
2754 my $loc = Bio::Location::Split->new;
2755 foreach my $cdsexonloc ($cds->location->each_Location) {
2756 my $subloc =
2757 Bio::Location::Simple->new(-start=>$cdsexonloc->start,
2758 -end=>$cdsexonloc->end,
2759 -strand=>$cdsexonloc->strand);
2760 $loc->add_sub_Location($subloc);
2762 if ($noinfer) {
2763 push(@mrnas, $cds);
2765 else {
2766 # share the same location
2767 my $mrna =
2768 Bio::SeqFeature::Generic->new(-location=>$loc,
2769 -primary_tag=>'mRNA');
2771 ## Provide seq_id to new feature:
2772 $mrna->seq_id($cds->seq_id) if $cds->seq_id;
2773 $mrna->source_tag($cds->source_tag) if $cds->source_tag;
2775 $self->_check_order_is_consistent($mrna->location->strand,$mrna->location->each_Location);
2777 # make the mRNA hold the CDS; no EXPAND option,
2778 # the CDS cannot be wider than the mRNA
2779 $mrna->add_SeqFeature($cds);
2781 # mRNA steals children of CDS
2782 foreach my $subsf ($cds->get_SeqFeatures) {
2783 $mrna->add_SeqFeature($subsf);
2785 $cds->remove_SeqFeatures;
2786 push(@mrnas, $mrna);
2789 # change gene/CDS to gene/mRNA
2790 $sf->remove_SeqFeatures;
2791 $sf->add_SeqFeature($_) foreach (@mrnas, @children);
2794 return;
2799 =head2 remove_types
2801 Title : remove_types
2802 Usage : $unf->remove_types(-seq=>$seq, -types=>["mRNA"]);
2803 Function:
2804 Example :
2805 Returns :
2806 Args :
2808 removes features of a set type
2810 useful for pre-filtering a genbank record; eg to get rid of STSs
2812 also, there is no way to unflatten
2813 ftp.ncbi.nih.gov/genomes/Schizosaccharomyces_pombe/ UNLESS the bogus
2814 mRNAs in these records are removed (or changed to a different type) -
2815 they just confuse things too much
2817 =cut
2819 sub remove_types{
2820 my ($self,@args) = @_;
2822 my($seq, $types) =
2823 $self->_rearrange([qw(
2825 TYPES
2827 @args);
2828 $seq->isa("Bio::SeqI") || $self->throw("$seq NOT A SeqI");
2829 my @sfs = $seq->get_all_SeqFeatures;
2830 my %rh = map {$_=>1} @$types;
2831 @sfs = grep {!$rh{$_->primary_tag}} @sfs;
2832 $seq->remove_SeqFeatures;
2833 $seq->add_SeqFeature($_) foreach @sfs;
2834 return;
2838 # _check_order_is_consistent($strand,$ranges) RETURNS BOOL
2840 # note: the value of this test is moot - there are many valid,
2841 # if unusual cases where it would flag an anomaly. for example
2842 # transpliced genes such as mod(mdg4) in dmel on AE003744, and
2843 # the following spliced gene on NC_001284:
2845 # mRNA complement(join(20571..20717,21692..22086,190740..190761,
2846 # 140724..141939,142769..142998))
2847 # /gene="nad5"
2848 # /note="trans-splicing, RNA editing"
2849 # /db_xref="GeneID:814567"
2851 # note how the exons are not in order
2852 # this will flag a level-3 warning, the user of this module
2853 # can ignore this and deal appropriately with the resulting
2854 # unordered exons
2855 sub _check_order_is_consistent {
2856 my $self = shift;
2858 my $parent_strand = shift; # this does nothing..?
2859 my @ranges = @_;
2860 return unless @ranges;
2861 my $rangestr =
2862 join(" ",map{sprintf("[%s,%s]",$_->start,$_->end)} @ranges);
2863 my $strand = $ranges[0]->strand;
2864 for (my $i=1; $i<@ranges;$i++) {
2865 if ($ranges[$i]->strand != $strand) {
2866 $self->problem(1,"inconsistent strands. Trans-spliced gene? Range: $rangestr");
2867 return 1;
2868 # mixed ranges - autopass
2869 # some mRNAs have exons on both strands; for
2870 # example, the dmel mod(mdg4) gene which is
2871 # trans-spliced (in actual fact two mRNAs)
2874 my $pass = 1;
2875 for (my $i=1; $i<@ranges;$i++) {
2876 my $rangeP = $ranges[$i-1];
2877 my $range = $ranges[$i];
2878 if ($rangeP->start > $range->end) {
2879 if ($self->seq->is_circular) {
2880 # see for example NC_006578.gbk
2881 # we make exceptions for circular genomes here.
2882 # see Re: [Gmod-ajax] flatfile-to-json.pl error with GFF
2883 # 2010-07-26
2885 else {
2886 # failed - but still get one more chance..
2887 $pass = 0;
2888 $self->problem(2,"Ranges not in correct order. Strange ensembl genbank entry? Range: $rangestr");
2889 last;
2894 if (!$pass) {
2895 # sometimes (eg ensembl flavour genbank files)
2896 # exons on reverse strand listed in reverse order
2897 # eg join(complement(R1),...,complement(Rn))
2898 # where R1 > R2
2899 for (my $i=1; $i<@ranges;$i++) {
2900 my $rangeP = $ranges[$i-1];
2901 my $range = $ranges[$i];
2902 if ($rangeP->end < $range->start) {
2903 $self->problem(3,"inconsistent order. Range: $rangestr");
2904 return 0;
2908 return 1; # pass
2911 # PRIVATE METHOD: _locstr($sf)
2913 # returns a location string for a feature; just the outer boundaries
2914 sub _locstr {
2915 my $self = shift;
2916 my $sf = shift;
2917 return
2918 sprintf("%d..%d", $sf->start, $sf->end);
2921 sub iterate_containment_tree {
2922 my $self = shift;
2923 my $feature_holder = shift;
2924 my $sub = shift;
2925 $sub->($feature_holder);
2926 my @sfs = $feature_holder->get_SeqFeatures;
2927 $self->iterate_containment_tree($_) foreach @sfs;
2930 sub find_best_pairs {
2931 my $matrix = shift;
2932 my $size = shift;
2933 my $i = shift || 0;
2935 for (my $j=0; $j < $size; $j++) {
2936 my $score = $matrix->[$i][$j];
2937 if (!defined($score)) {
2938 next;