2 # BioPerl module for Bio::Tree::Statistics
4 # Please direct questions and support issues to <bioperl-l@bioperl.org>
6 # Cared for by Jason Stajich <jason@bioperl.org>
8 # Copyright Jason Stajich
10 # You may distribute this module under the same terms as perl itself
12 # POD documentation - main docs before the code
16 Bio::Tree::Statistics - Calculate certain statistics for a Tree
20 use Bio::Tree::Statistics;
24 This should be where Tree statistics are calculated. It was
25 previously where statistics from a Coalescent simulation.
27 It now contains several methods for calculating L<Tree-Trait
34 User feedback is an integral part of the evolution of this and other
35 Bioperl modules. Send your comments and suggestions preferably to
36 the Bioperl mailing list. Your participation is much appreciated.
38 bioperl-l@bioperl.org - General discussion
39 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
43 Please direct usage questions or support issues to the mailing list:
45 I<bioperl-l@bioperl.org>
47 rather than to the module maintainer directly. Many experienced and
48 reponsive experts will be able look at the problem and quickly
49 address it. Please include a thorough description of the problem
50 with code and data examples if at all possible.
54 Report bugs to the Bioperl bug tracking system to help us keep track
55 of the bugs and their resolution. Bug reports can be submitted via
58 https://github.com/bioperl/bioperl-live/issues
60 =head1 AUTHOR - Jason Stajich
62 Email jason AT bioperl.org
66 Heikki Lehvaslaiho, heikki at bioperl dot org
70 The rest of the documentation details each of the object methods.
71 Internal methods are usually preceded with a _
76 # Let the code begin...
79 package Bio
::Tree
::Statistics
;
83 use base
qw(Bio::Root::Root);
88 Usage : my $obj = Bio::Tree::Statistics->new();
89 Function: Builds a new Bio::Tree::Statistics object
90 Returns : Bio::Tree::Statistics
93 =head2 assess_bootstrap
95 Title : assess_bootstrap
96 Usage : my $tree_with_bs = $stats->assess_bootstrap(\@bs_trees,$guide_tree);
97 Function: Calculates the bootstrap for internal nodes based on the percentage
98 of times \@bs_trees agree with each internal node
99 Returns : L<Bio::Tree::TreeI>
100 Args : Arrayref of L<Bio::Tree::TreeI>s
101 Guide tree, L<Bio::Tree::TreeI>s
105 sub assess_bootstrap
{
106 my ($self,$bs_trees,$guide_tree) = @_;
109 # internal nodes are defined by their children
111 my (%lookup,%internal);
113 for my $tree ( $guide_tree, @
$bs_trees ) {
114 # Do this as a top down approach, can probably be
115 # improved by caching internal node states, but not going
116 # to worry about it right now.
118 my @allnodes = $tree->get_nodes;
119 my @internalnodes = grep { ! $_->is_Leaf } @allnodes;
120 for my $node ( @internalnodes ) {
121 my @tips = sort map { $_->id }
122 grep { $_->is_Leaf() } $node->get_all_Descendents;
123 my $id = "(".join(",", @tips).")";
125 $internal{$id} = $node->internal_id;
133 for my $l ( keys %lookup ) {
134 if( defined $internal{$l} ) {#&& $lookup{$l} > $min_seen ) {
135 my $intnode = $guide_tree->find_node(-internal_id
=> $internal{$l});
136 $intnode->bootstrap(sprintf("%d",100 * $lookup{$l} / $i));
145 Example : cherries($tree, $node);
146 Description: Count number of paired leaf nodes
150 Args : 1. Bio::Tree::TreeI object
151 2. Bio::Tree::NodeI object within the tree, optional
153 Commonly used statistics assume a binary tree, but this methods
154 returns a value even for trees with polytomies.
161 my $node = shift || $tree->get_root_node;
164 my @descs = $node->each_Descendent;
166 if ($descs[0]->is_Leaf and $descs[1]->is_Leaf) {
167 if ($descs[3]) { #polytomy at leaf level
174 foreach my $desc (@descs) {
175 $cherries += $self->cherries($tree, $desc);
182 =head2 Tree-Trait statistics
184 The following methods produce descriptors of trait distribution among
185 leaf nodes within the trees. They require that a trait has been set
186 for each leaf node. The tag methods of Bio::Tree::Node are used to
187 store them as key/value pairs. In this way, one tree can store more
190 Trees have method add_traits() to set trait values from a file. See the
191 add_trait() method in L<Bio::Tree::TreeFunctionsI>.
195 Example : fitch($tree, $key, $node);
196 Description: Calculates Parsimony Score (PS) and internal trait
197 values using the Fitch 1971 parsimony algorithm for
198 the subtree a defined by the (internal) node.
199 Node defaults to the root.
200 Returns : true on success
201 Exceptions : leaf nodes have to have the trait defined
202 Args : 1. Bio::Tree::TreeI object
204 3. Bio::Tree::NodeI object within the tree, optional
206 Runs first L<fitch_up> that calculates parsimony scores and then
207 L<fitch_down> that should resolve most of the trait/character state
210 Fitch, W.M., 1971. Toward defining the course of evolution: minimal
211 change for a specific tree topology. Syst. Zool. 20, 406-416.
213 You can access calculated parsimony values using:
215 $score = $node->->get_tag_values('ps_score');
217 and the trait value with:
219 $traitvalue = $node->->get_tag_values('ps_trait'); # only the first
220 @traitvalues = $node->->get_tag_values('ps_trait');
222 Note that there can be more that one trait value, especially for the
230 my $key = shift || $self->throw("Trait name is needed");
231 my $node = shift || $tree->get_root_node;
233 $self->fitch_up($tree, $key, $node);
234 $self->fitch_down($tree, $node);
240 Example : ps($tree, $key, $node);
241 Description: Calculates Parsimony Score (PS) from Fitch 1971
242 parsimony algorithm for the subtree as defined
243 by the (internal) node.
244 Node defaults to the root.
245 Returns : integer, 1 < PS < n, where n is number of branches
246 Exceptions : leaf nodes have to have the trait defined
247 Args : 1. Bio::Tree::TreeI object
249 3. Bio::Tree::NodeI object within the tree, optional
251 This is the first half of the Fitch algorithm that is enough for
252 calculating the resolved parsimony values. The trait/chararacter
253 states are commonly left in ambiguous state. To resolve them, run
258 sub ps
{ shift->fitch_up(@_) }
263 Example : fitch_up($tree, $key, $node);
264 Description: Calculates Parsimony Score (PS) from the Fitch 1971
265 parsimony algorithm for the subtree as defined
266 by the (internal) node.
267 Node defaults to the root.
268 Returns : integer, 1< PS < n, where n is number of branches
269 Exceptions : leaf nodes have to have the trait defined
270 Args : 1. Bio::Tree::TreeI object
272 3. Bio::Tree::NodeI object within the tree, optional
274 This is a more generic name for L<ps> and indicates that it performs
275 the first bottom-up tree traversal that calculates the parsimony score
276 but usually leaves trait/character states ambiguous. If you are
277 interested in internal trait states, running L<fitch_down> should
278 resolve most of the ambiguities.
285 my $key = shift || $self->throw("Trait name is needed");
286 my $node = shift || $tree->get_root_node;
288 if ($node->is_Leaf) {
289 $self->throw ("ERROR: ". $node->internal_id. " needs a value for trait $key")
290 unless $node->has_tag($key);
291 $node->set_tag_value('ps_trait', $node->get_tag_values($key) );
292 $node->set_tag_value('ps_score', 0 );
293 return; # end of recursion
296 foreach my $child ($node->each_Descendent) {
297 $self->fitch_up($tree, $key, $child);
304 foreach my $child ($node->each_Descendent) {
305 foreach my $trait ($child->get_tag_values('ps_trait') ) {
306 $intersection{$trait}++ if $union{$trait};
309 $score += $child->get_tag_values('ps_score');
312 if (keys %intersection) {
313 $node->set_tag_value('ps_trait', keys %intersection);
314 $node->set_tag_value('ps_score', $score);
316 $node->set_tag_value('ps_trait', keys %union);
317 $node->set_tag_value('ps_score', $score+1);
320 if ($self->verbose) {
321 print "-- node --------------------------\n";
322 print "iID: ", $node->internal_id, " (", $node->id, ")\n";
323 print "Trait: ", join (', ', $node->get_tag_values('ps_trait') ), "\n";
324 print "length :", scalar($node->get_tag_values('ps_score')) , "\n";
326 return scalar $node->get_tag_values('ps_score');
332 Example : fitch_down($tree, $node);
333 Description: Runs the second pass from Fitch 1971
334 parsimony algorithm to resolve ambiguous
335 trait states left by first pass.
336 by the (internal) node.
337 Node defaults to the root.
339 Exceptions : dies unless the trait is defined in all nodes
340 Args : 1. Bio::Tree::TreeI object
341 2. Bio::Tree::NodeI object within the tree, optional
343 Before running this method you should have ran L<fitch_up> (alias to
344 L<ps> ). Note that it is not guaranteed that all states are completely
352 my $node = shift || $tree->get_root_node;
354 my $key = 'ps_trait';
355 $self->throw ("ERROR: ". $node->internal_id. " needs a value for $key")
356 unless $node->has_tag($key);
359 foreach my $trait ($node->get_tag_values($key) ) {
363 foreach my $child ($node->each_Descendent) {
364 next if $child->is_Leaf; # end of recursion
367 foreach my $trait ($child->get_tag_values($key) ) {
368 $intersection->{$trait}++ if $nodev->{$trait};
371 $self->fitch_down($tree, $child);
372 $child->set_tag_value($key, keys %$intersection);
380 Example : persistence($tree, $node);
381 Description: Calculates the persistence
382 for node in the subtree defined by the (internal)
383 node. Node defaults to the root.
384 Returns : int, number of generations trait value has to remain same
385 Exceptions : all the nodes need to have the trait defined
386 Args : 1. Bio::Tree::TreeI object
387 2. Bio::Tree::NodeI object within the tree, optional
389 Persistence measures the stability that the trait value has in a
390 tree. It expresses the number of generations the trait value remains
391 the same. All the decendants of the root in the same generation have
392 to share the same value.
394 Depends on Fitch's parsimony score (PS).
402 my $value = shift || $self->throw("Value is needed");
405 my $key = 'ps_trait';
407 $self->throw("Node is needed") unless $node->isa('Bio::Tree::NodeI');
409 return 0 unless $node->get_tag_values($key) eq $value; # wrong value
410 return 1 if $node->is_Leaf; # end of recursion
412 my $persistence = 10000000; # an arbitrarily large number
413 foreach my $child ($node->each_Descendent) {
414 my $pers = $self->_persistence($tree, $child, $value);
415 $persistence = $pers if $pers < $persistence;
417 return $persistence + 1;
423 my $node = shift || $tree->get_root_node;
424 $self->throw("Node is needed") unless $node->isa('Bio::Tree::NodeI');
426 my $key = 'ps_trait';
427 my $value = $node->get_tag_values($key);
430 my $persistence = $self->_persistence($tree, $node, $value);
431 $node->set_tag_value('persistance', $persistence);
436 =head2 count_subclusters
438 Example : count_clusters($tree, $node);
439 Description: Calculates the number of sub-clusters
440 in the subtree defined by the (internal)
441 node. Node defaults to the root.
443 Exceptions : all the nodes need to have the trait defined
444 Args : 1. Bio::Tree::TreeI object
445 2. Bio::Tree::NodeI object within the tree, optional
447 Depends on Fitch's parsimony score (PS).
451 sub _count_subclusters
{
455 my $value = shift || $self->throw("Value is needed");
457 my $key = 'ps_trait';
459 $self->throw ("ERROR: ". $node->internal_id. " needs a value for trait $key")
460 unless $node->has_tag($key);
462 if ($node->get_tag_values($key) eq $value) {
463 if ($node->get_tag_values('ps_score') == 0) {
467 foreach my $child ($node->each_Descendent) {
468 $count += $self->_count_subclusters($tree, $child, $value);
476 sub count_subclusters
{
479 my $node = shift || $tree->get_root_node;
480 $self->throw("Node is needed") unless $node->isa('Bio::Tree::NodeI');
482 my $key = 'ps_trait';
483 my $value = $node->get_tag_values($key);
485 return $self->_count_subclusters($tree, $node, $value);
491 Example : count_leaves($tree, $node);
492 Description: Calculates the number of leaves with same trait
493 value as root in the subtree defined by the (internal)
494 node. Requires an unbroken line of identical trait values.
495 Node defaults to the root.
496 Returns : int, number of leaves with this trait value
497 Exceptions : all the nodes need to have the trait defined
498 Args : 1. Bio::Tree::TreeI object
499 2. Bio::Tree::NodeI object within the tree, optional
501 Depends on Fitch's parsimony score (PS).
508 my $node = shift || $tree->get_root_node;
511 my $key = 'ps_trait';
513 $self->throw ("ERROR: ". $node->internal_id. " needs a value for trait $key")
514 unless $node->has_tag($key);
516 if ($node->get_tag_values($key) eq $value) {
517 #print $node->id, ": ", $node->get_tag_values($key), "\n";
518 return 1 if $node->is_Leaf; # end of recursion
521 foreach my $child ($node->each_Descendent) {
522 $count += $self->_count_leaves($tree, $child, $value);
532 my $node = shift || $tree->get_root_node;
533 $self->throw("Node is needed") unless $node->isa('Bio::Tree::NodeI');
535 my $key = 'ps_trait';
536 my $value = $node->get_tag_values($key);
538 return $self->_count_leaves($tree, $node, $value);
542 =head2 phylotype_length
544 Example : phylotype_length($tree, $node);
545 Description: Sums up the branch lengths within phylotype
546 exluding the subclusters where the trait values
548 Returns : float, length
549 Exceptions : all the nodes need to have the trait defined
550 Args : 1. Bio::Tree::TreeI object
551 2. Bio::Tree::NodeI object within the tree, optional
553 Depends on Fitch's parsimony score (PS).
557 sub _phylotype_length
{
563 my $key = 'ps_trait';
565 $self->throw ("ERROR: ". $node->internal_id. " needs a value for trait $key")
566 unless $node->has_tag($key);
568 return 0 if $node->get_tag_values($key) ne $value;
569 return $node->branch_length if $node->is_Leaf; # end of recursion
572 foreach my $child ($node->each_Descendent) {
573 my $sub_len = $self->_phylotype_length($tree, $child, $value);
575 $length += $child->branch_length if not $child->is_Leaf and $sub_len;
580 sub phylotype_length
{
583 my $node = shift || $tree->get_root_node;
585 my $key = 'ps_trait';
586 my $value = $node->get_tag_values($key);
588 return $self->_phylotype_length($tree, $node, $value);
592 =head2 sum_of_leaf_distances
594 Example : sum_of_leaf_distances($tree, $node);
595 Description: Sums up the branch lengths from root to leaf
596 exluding the subclusters where the trait values
598 Returns : float, length
599 Exceptions : all the nodes need to have the trait defined
600 Args : 1. Bio::Tree::TreeI object
601 2. Bio::Tree::NodeI object within the tree, optional
603 Depends on Fitch's parsimony score (PS).
607 sub _sum_of_leaf_distances
{
613 my $key = 'ps_trait';
615 $self->throw ("ERROR: ". $node->internal_id. " needs a value for trait $key")
616 unless $node->has_tag($key);
617 return 0 if $node->get_tag_values($key) ne $value;
618 #return $node->branch_length if $node->is_Leaf; # end of recursion
619 return 0 if $node->is_Leaf; # end of recursion
622 foreach my $child ($node->each_Descendent) {
623 $length += $self->_count_leaves($tree, $child, $value) * $child->branch_length +
624 $self->_sum_of_leaf_distances($tree, $child, $value);
629 sub sum_of_leaf_distances
{
632 my $node = shift || $tree->get_root_node;
634 my $key = 'ps_trait';
635 my $value = $node->get_tag_values($key);
637 return $self->_sum_of_leaf_distances($tree, $node, $value);
641 =head2 genetic_diversity
643 Example : genetic_diversity($tree, $node);
644 Description: Diversity is the sum of root to leaf distances
645 within the phylotype normalised by number of leaf
647 Returns : float, value of genetic diversity
648 Exceptions : all the nodes need to have the trait defined
649 Args : 1. Bio::Tree::TreeI object
650 2. Bio::Tree::NodeI object within the tree, optional
652 Depends on Fitch's parsimony score (PS).
656 sub genetic_diversity
{
659 my $node = shift || $tree->get_root_node;
661 return $self->sum_of_leaf_distances($tree, $node) /
662 $self->count_leaves($tree, $node);
668 Example : statratio($tree, $node);
669 Description: Ratio of the stem length and the genetic diversity of the
670 phylotype L<genetic_diversity>
671 Returns : float, separation score
672 Exceptions : all the nodes need to have the trait defined
673 Args : 1. Bio::Tree::TreeI object
674 2. Bio::Tree::NodeI object within the tree, optional
676 Statratio gives a measure of separation and variability within the phylotype.
677 Larger values identify more rapidly evolving and recent phylotypes.
679 Depends on Fitch's parsimony score (PS).
686 my $node = shift || $tree->get_root_node;
688 my $div = $self->genetic_diversity($tree, $node);
689 return 0 if $div == 0;
690 return $node->branch_length / $div;
697 Example : ai($tree, $key, $node);
698 Description: Calculates the Association Index (AI) of Whang et
699 al. 2001 for the subtree defined by the (internal)
700 node. Node defaults to the root.
702 Exceptions : leaf nodes have to have the trait defined
703 Args : 1. Bio::Tree::TreeI object
705 3. Bio::Tree::NodeI object within the tree, optional
707 Association index (AI) gives a more fine grained results than PS since
708 the result is a real number. ~0 E<lt>= AI.
710 Wang, T.H., Donaldson, Y.K., Brettle, R.P., Bell, J.E., Simmonds, P.,
711 2001. Identification of shared populations of human immunodeficiency
712 Virus Type 1 infecting microglia and tissue macrophages outside the
713 central nervous system. J. Virol. 75 (23), 11686-11699.
724 for my $desc ( $node->get_all_Descendents ) {
725 next unless $desc->is_Leaf;
727 $self->throw ("Node ". $desc->id. " needs a value for trait [$key]")
728 unless $desc->has_tag($key);
729 my $trait = $desc->get_tag_values($key);
733 foreach ( keys %$traits) {
734 $most_common = $traits->{$_} if $traits->{$_} > $most_common;
736 return sprintf "%1.6f", (1 - ($most_common/$leaf_count) ) / (2**($leaf_count-1) );
742 my $key = shift || $self->throw("Trait name is needed");
743 my $start_node = shift || $tree->get_root_node;
744 return unless $start_node;
747 for my $node ( $start_node->get_all_Descendents ) {
748 next if $node->is_Leaf;
749 $sum += $self->_node_ai($node, $key);
757 Example : mc($tree, $key, $node);
758 Description: Calculates the Monophyletic Clade (MC) size statistics
759 for the subtree a defined by the (internal) node.
760 Node defaults to the root;
761 Returns : hashref with trait values as keys
762 Exceptions : leaf nodes have to have the trait defined
763 Args : 1. Bio::Tree::TreeI object
765 3. Bio::Tree::NodeI object within the tree, optional
767 Monophyletic Clade (MC) size statistics by Salemi at al 2005. It is
768 calculated for each trait value. 1 E<lt>= MC E<lt>= nx, where nx is the
769 number of tips with value x:
771 pick the internal node with maximim value for
772 number of of tips with only trait x
774 MC was defined by Parker et al 2008.
776 Salemi, M., Lamers, S.L., Yu, S., de Oliveira, T., Fitch, W.M., McGrath, M.S.,
777 2005. Phylodynamic analysis of Human Immunodeficiency Virus Type 1 in
778 distinct brain compartments provides a model for the neuropathogenesis of
779 AIDS. J. Virol. 79 (17), 11343-11352.
781 Parker, J., Rambaut A., Pybus O., 2008. Correlating viral phenotypes
782 with phylogeny: Accounting for phylogenetic uncertainty Infection,
783 Genetics and Evolution 8 (2008), 239-246.
794 for my $node2 ( $node->get_all_Descendents ) {
795 next unless $node2->is_Leaf;
797 my $trait = $node2->get_tag_values($key);
806 my $key = shift || die "Trait name is needed";
807 my $start_node = shift || $tree->get_root_node;
808 return unless $start_node;
810 my $sum; # hashref, keys are trait values
811 my $keys; # hashref, keys are trait values
812 foreach my $node ( $start_node->get_all_Descendents ) {
813 next if $node->is_Leaf;
814 my $traits = $self->_node_mc($node, $key);
815 if (scalar keys %$traits == 1) {
816 my ($value) = keys %$traits;
818 $sum->{$value} = $traits->{$value}
819 if $sum->{$value} < $traits->{$value};
821 map { $keys->{$_} = 1 } keys %$traits;
824 # check for cases where there are no clusters
825 foreach my $value (keys %$keys) {
826 $sum->{$value} = 1 unless defined $sum->{$value};