ide/bioperl-mode: into its own repository to be developed separately
[bioperl-live.git] / Bio / Search / BlastUtils.pm
blob6ae384cf89f9fa332866db0600fbc35992e6089a
1 =head1 NAME
3 Bio::Search::BlastUtils - Utility functions for Bio::Search:: BLAST objects
5 =head1 SYNOPSIS
7 # This module is just a collection of subroutines, not an object.
9 See L<Bio::Search::Hit::BlastHit>.
11 =head1 DESCRIPTION
13 The BlastUtils.pm module is a collection of subroutines used primarily by
14 Bio::Search::Hit::BlastHit objects for some of the additional
15 functionality, such as HSP tiling. Right now, the BlastUtils is just a
16 collection of methods, not an object, and it's tightly coupled to
17 Bio::Search::Hit::BlastHit. A goal for the future is to generalize it
18 to work based on the Bio::Search interfaces, then it can work with any
19 objects that implements them.
21 =head1 AUTHOR
23 Steve Chervitz E<lt>sac@bioperl.orgE<gt>
25 =cut
29 package Bio::Search::BlastUtils;
30 use Bio::Root::Version;
33 =head2 tile_hsps
35 Usage : tile_hsps( $sbjct );
36 : This is called automatically by Bio::Search::Hit::BlastHit
37 : during object construction or
38 : as needed by methods that rely on having tiled data.
39 Purpose : Collect statistics about the aligned sequences in a set of HSPs.
40 : Calculates the following data across all HSPs:
41 : -- total alignment length
42 : -- total identical residues
43 : -- total conserved residues
44 Returns : n/a
45 Argument : A Bio::Search::Hit::BlastHit object
46 Throws : n/a
47 Comments :
48 : This method is *strongly* coupled to Bio::Search::Hit::BlastHit
49 : (it accesses BlastHit data members directly).
50 : TODO: Re-write this to the Bio::Search::Hit::HitI interface.
52 : This method performs more careful summing of data across
53 : all HSPs in the Sbjct object. Only HSPs that are in the same strand
54 : and frame are tiled. Simply summing the data from all HSPs
55 : in the same strand and frame will overestimate the actual
56 : length of the alignment if there is overlap between different HSPs
57 : (often the case).
59 : The strategy is to tile the HSPs and sum over the
60 : contigs, collecting data separately from overlapping and
61 : non-overlapping regions of each HSP. To facilitate this, the
62 : HSP.pm object now permits extraction of data from sub-sections
63 : of an HSP.
65 : Additional useful information is collected from the results
66 : of the tiling. It is possible that sub-sequences in
67 : different HSPs will overlap significantly. In this case, it
68 : is impossible to create a single unambiguous alignment by
69 : concatenating the HSPs. The ambiguity may indicate the
70 : presence of multiple, similar domains in one or both of the
71 : aligned sequences. This ambiguity is recorded using the
72 : ambiguous_aln() method.
74 : This method does not attempt to discern biologically
75 : significant vs. insignificant overlaps. The allowable amount of
76 : overlap can be set with the overlap() method or with the -OVERLAP
77 : parameter used when constructing the Blast & Sbjct objects.
79 : For a given hit, both the query and the sbjct sequences are
80 : tiled independently.
82 : -- If only query sequence HSPs overlap,
83 : this may suggest multiple domains in the sbjct.
84 : -- If only sbjct sequence HSPs overlap,
85 : this may suggest multiple domains in the query.
86 : -- If both query & sbjct sequence HSPs overlap,
87 : this suggests multiple domains in both.
88 : -- If neither query & sbjct sequence HSPs overlap,
89 : this suggests either no multiple domains in either
90 : sequence OR that both sequences have the same
91 : distribution of multiple similar domains.
93 : This method can deal with the special case of when multiple
94 : HSPs exactly overlap.
96 : Efficiency concerns:
97 : Speed will be an issue for sequences with numerous HSPs.
99 Bugs : Currently, tile_hsps() does not properly account for
100 : the number of non-tiled but overlapping HSPs, which becomes a problem
101 : as overlap() grows. Large values overlap() may thus lead to
102 : incorrect statistics for some hits. For best results, keep overlap()
103 : below 5 (DEFAULT IS 2). For more about this, see the "HSP Tiling and
104 : Ambiguous Alignments" section in L<Bio::Search::Hit::BlastHit>.
106 See Also : L<_adjust_contigs>(), L<Bio::Search::Hit::BlastHit|Bio::Search::Hit::BlastHit>
108 =cut
110 #--------------
111 sub tile_hsps {
112 #--------------
113 my $sbjct = shift;
115 $sbjct->{'_tile_hsps'} = 1;
116 $sbjct->{'_gaps_query'} = 0;
117 $sbjct->{'_gaps_sbjct'} = 0;
119 ## Simple summation scheme. Valid if there is only one HSP.
120 if((defined($sbjct->{'_n'}) and $sbjct->{'_n'} == 1) or $sbjct->num_hsps == 1) {
121 my $hsp = $sbjct->hsp;
122 $sbjct->{'_length_aln_query'} = $hsp->length('query');
123 $sbjct->{'_length_aln_sbjct'} = $hsp->length('sbjct');
124 $sbjct->{'_length_aln_total'} = $hsp->length('total');
125 ($sbjct->{'_totalIdentical'},$sbjct->{'_totalConserved'}) = $hsp->matches();
126 $sbjct->{'_gaps_query'} = $hsp->gaps('query');
127 $sbjct->{'_gaps_sbjct'} = $hsp->gaps('sbjct');
129 # print "_tile_hsps(): single HSP, easy stats.\n";
130 return;
131 } else {
132 # print STDERR "Sbjct: _tile_hsps: summing multiple HSPs\n";
133 $sbjct->{'_length_aln_query'} = 0;
134 $sbjct->{'_length_aln_sbjct'} = 0;
135 $sbjct->{'_length_aln_total'} = 0;
136 $sbjct->{'_totalIdentical'} = 0;
137 $sbjct->{'_totalConserved'} = 0;
140 ## More than one HSP. Must tile HSPs.
141 # print "\nTiling HSPs for $sbjct\n";
142 my($hsp, $qstart, $qstop, $sstart, $sstop);
143 my($frame, $strand, $qstrand, $sstrand);
144 my(@qcontigs, @scontigs);
145 my $qoverlap = 0;
146 my $soverlap = 0;
147 my $max_overlap = $sbjct->{'_overlap'};
149 foreach $hsp ($sbjct->hsps()) {
150 # printf " HSP: %s\n%s\n",$hsp->name, $hsp->str('query');
151 # printf " Length = %d; Identical = %d; Conserved = %d; Conserved(1-10): %d",$hsp->length, $hsp->length(-TYPE=>'iden'), $hsp->length(-TYPE=>'cons'), $hsp->length(-TYPE=>'cons',-START=>0,-STOP=>10);
152 ($qstart, $qstop) = $hsp->range('query');
153 ($sstart, $sstop) = $hsp->range('sbjct');
154 $frame = $hsp->frame('hit');
155 $frame = -1 unless defined $frame;
156 ($qstrand, $sstrand) = $hsp->strand;
158 my ($qgaps, $sgaps) = $hsp->gaps();
159 $sbjct->{'_gaps_query'} += $qgaps;
160 $sbjct->{'_gaps_sbjct'} += $sgaps;
162 $sbjct->{'_length_aln_total'} += $hsp->length;
163 ## Collect contigs in the query sequence.
164 $qoverlap = &_adjust_contigs('query', $hsp, $qstart, $qstop, \@qcontigs, $max_overlap, $frame, $qstrand);
166 ## Collect contigs in the sbjct sequence (needed for domain data and gapped Blast).
167 $soverlap = &_adjust_contigs('sbjct', $hsp, $sstart, $sstop, \@scontigs, $max_overlap, $frame, $sstrand);
169 ## Collect overall start and stop data for query and sbjct over all HSPs.
170 if(not defined $sbjct->{'_queryStart'}) {
171 $sbjct->{'_queryStart'} = $qstart;
172 $sbjct->{'_queryStop'} = $qstop;
173 $sbjct->{'_sbjctStart'} = $sstart;
174 $sbjct->{'_sbjctStop'} = $sstop;
175 } else {
176 $sbjct->{'_queryStart'} = ($qstart < $sbjct->{'_queryStart'} ? $qstart : $sbjct->{'_queryStart'});
177 $sbjct->{'_queryStop'} = ($qstop > $sbjct->{'_queryStop'} ? $qstop : $sbjct->{'_queryStop'});
178 $sbjct->{'_sbjctStart'} = ($sstart < $sbjct->{'_sbjctStart'} ? $sstart : $sbjct->{'_sbjctStart'});
179 $sbjct->{'_sbjctStop'} = ($sstop > $sbjct->{'_sbjctStop'} ? $sstop : $sbjct->{'_sbjctStop'});
183 ## Collect data across the collected contigs.
185 # print "\nQUERY CONTIGS:\n";
186 # print " gaps = $sbjct->{'_gaps_query'}\n";
188 # TODO: Account for strand/frame issue!
189 # Strategy: collect data on a per strand+frame basis and save the most significant one.
190 my (%qctg_dat);
191 foreach(@qcontigs) {
192 # print " query contig: $_->{'start'} - $_->{'stop'}\n";
193 # print " iden = $_->{'iden'}; cons = $_->{'cons'}\n";
194 ($frame, $strand) = ($_->{'frame'}, $_->{'strand'});
195 $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_->{'stop'} - $_->{'start'} + 1;
196 $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_->{'iden'};
197 $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_->{'cons'};
198 $qctg_dat{ "$frame$strand" }->{'qstrand'} = $strand;
201 # Find longest contig.
202 my @sortedkeys = reverse sort { $qctg_dat{ $a }->{'length_aln_query'} <=> $qctg_dat{ $b }->{'length_aln_query'} } keys %qctg_dat;
204 # Save the largest to the sbjct:
205 my $longest = $sortedkeys[0];
206 $sbjct->{'_length_aln_query'} = $qctg_dat{ $longest }->{'length_aln_query'};
207 $sbjct->{'_totalIdentical'} = $qctg_dat{ $longest }->{'totalIdentical'};
208 $sbjct->{'_totalConserved'} = $qctg_dat{ $longest }->{'totalConserved'};
209 $sbjct->{'_qstrand'} = $qctg_dat{ $longest }->{'qstrand'};
211 ## Collect data for sbjct contigs. Important for gapped Blast.
212 ## The totalIdentical and totalConserved numbers will be the same
213 ## as determined for the query contigs.
215 # print "\nSBJCT CONTIGS:\n";
216 # print " gaps = $sbjct->{'_gaps_sbjct'}\n";
218 my (%sctg_dat);
219 foreach(@scontigs) {
220 # print " sbjct contig: $_->{'start'} - $_->{'stop'}\n";
221 # print " iden = $_->{'iden'}; cons = $_->{'cons'}\n";
222 ($frame, $strand) = ($_->{'frame'}, $_->{'strand'});
223 $sctg_dat{ "$frame$strand" }->{'length_aln_sbjct'} += $_->{'stop'} - $_->{'start'} + 1;
224 $sctg_dat{ "$frame$strand" }->{'frame'} = $frame;
225 $sctg_dat{ "$frame$strand" }->{'sstrand'} = $strand;
228 @sortedkeys = reverse sort { $sctg_dat{ $a }->{'length_aln_sbjct'} <=> $sctg_dat{ $b }->{'length_aln_sbjct'} } keys %sctg_dat;
230 # Save the largest to the sbjct:
231 $longest = $sortedkeys[0];
233 $sbjct->{'_length_aln_sbjct'} = $sctg_dat{ $longest }->{'length_aln_sbjct'};
234 $sbjct->{'_frame'} = $sctg_dat{ $longest }->{'frame'};
235 $sbjct->{'_sstrand'} = $sctg_dat{ $longest }->{'sstrand'};
237 if($qoverlap) {
238 if($soverlap) { $sbjct->ambiguous_aln('qs');
239 # print "\n*** AMBIGUOUS ALIGNMENT: Query and Sbjct\n\n";
241 else { $sbjct->ambiguous_aln('q');
242 # print "\n*** AMBIGUOUS ALIGNMENT: Query\n\n";
244 } elsif($soverlap) {
245 $sbjct->ambiguous_aln('s');
246 # print "\n*** AMBIGUOUS ALIGNMENT: Sbjct\n\n";
249 # Adjust length based on BLAST flavor.
250 my $prog = $sbjct->algorithm;
251 if($prog eq 'TBLASTN') {
252 $sbjct->{'_length_aln_sbjct'} /= 3;
253 } elsif($prog eq 'BLASTX' ) {
254 $sbjct->{'_length_aln_query'} /= 3;
255 } elsif($prog eq 'TBLASTX') {
256 $sbjct->{'_length_aln_query'} /= 3;
257 $sbjct->{'_length_aln_sbjct'} /= 3;
263 =head2 _adjust_contigs
265 Usage : n/a; called automatically during object construction.
266 Purpose : Builds HSP contigs for a given BLAST hit.
267 : Utility method called by _tile_hsps()
268 Returns :
269 Argument :
270 Throws : Exceptions propagated from Bio::Search::Hit::BlastHSP::matches()
271 : for invalid sub-sequence ranges.
272 Status : Experimental
273 Comments : This method does not currently support gapped alignments.
274 : Also, it does not keep track of the number of HSPs that
275 : overlap within the amount specified by overlap().
276 : This will lead to significant tracking errors for large
277 : overlap values.
279 See Also : L<tile_hsps>(), L<Bio::Search::Hit::BlastHSP::matches|Bio::Search::Hit::BlastHSP>
281 =cut
283 #-------------------
284 sub _adjust_contigs {
285 #-------------------
286 my ($seqType, $hsp, $start, $stop, $contigs_ref, $max_overlap, $frame, $strand) = @_;
288 my $overlap = 0;
289 my ($numID, $numCons);
291 # print STDERR "Testing $seqType data: HSP (${\$hsp->name}); $start, $stop, strand=$strand, frame=$frame\n";
292 foreach(@$contigs_ref) {
293 # print STDERR " Contig: $_->{'start'} - $_->{'stop'}, strand=$_->{'strand'}, frame=$_->{'frame'}, iden= $_->{'iden'}, cons= $_->{'cons'}\n";
295 # Don't merge things unless they have matching strand/frame.
296 next unless ($_->{'frame'} == $frame and $_->{'strand'} == $strand);
298 ## Test special case of a nested HSP. Skip it.
299 if($start >= $_->{'start'} and $stop <= $_->{'stop'}) {
300 # print STDERR "----> Nested HSP. Skipping.\n";
301 $overlap = 1;
302 next;
305 ## Test for overlap at beginning of contig.
306 if($start < $_->{'start'} and $stop > ($_->{'start'} + $max_overlap)) {
307 # print STDERR "----> Overlaps beg: existing beg,end: $_->{'start'},$_->{'stop'}, new beg,end: $start,$stop\n";
308 # Collect stats over the non-overlapping region.
309 eval {
310 ($numID, $numCons) = $hsp->matches(-SEQ =>$seqType,
311 -START =>$start,
312 -STOP =>$_->{'start'}-1);
314 if($@) { warn "\a\n$@\n"; }
315 else {
316 $_->{'start'} = $start; # Assign a new start coordinate to the contig
317 $_->{'iden'} += $numID; # and add new data to #identical, #conserved.
318 $_->{'cons'} += $numCons;
319 $overlap = 1;
323 ## Test for overlap at end of contig.
324 if($stop > $_->{'stop'} and $start < ($_->{'stop'} - $max_overlap)) {
325 # print STDERR "----> Overlaps end: existing beg,end: $_->{'start'},$_->{'stop'}, new beg,end: $start,$stop\n";
326 # Collect stats over the non-overlapping region.
327 eval {
328 ($numID,$numCons) = $hsp->matches(-SEQ =>$seqType,
329 -START =>$_->{'stop'},
330 -STOP =>$stop);
332 if($@) { warn "\a\n$@\n"; }
333 else {
334 $_->{'stop'} = $stop; # Assign a new stop coordinate to the contig
335 $_->{'iden'} += $numID; # and add new data to #identical, #conserved.
336 $_->{'cons'} += $numCons;
337 $overlap = 1;
340 $overlap && do {
341 # print STDERR " New Contig data:\n";
342 # print STDERR " Contig: $_->{'start'} - $_->{'stop'}, iden= $_->{'iden'}, cons= $_->{'cons'}\n";
343 last;
346 ## If there is no overlap, add the complete HSP data.
347 !$overlap && do {
348 # print STDERR "No overlap. Adding new contig.\n";
349 ($numID,$numCons) = $hsp->matches(-SEQ=>$seqType);
350 push @$contigs_ref, {'start'=>$start, 'stop'=>$stop,
351 'iden'=>$numID, 'cons'=>$numCons,
352 'strand'=>$strand, 'frame'=>$frame};
354 $overlap;
357 =head2 get_exponent
359 Usage : &get_exponent( number );
360 Purpose : Determines the power of 10 exponent of an integer, float,
361 : or scientific notation number.
362 Example : &get_exponent("4.0e-206");
363 : &get_exponent("0.00032");
364 : &get_exponent("10.");
365 : &get_exponent("1000.0");
366 : &get_exponent("e+83");
367 Argument : Float, Integer, or scientific notation number
368 Returns : Integer representing the exponent part of the number (+ or -).
369 : If argument == 0 (zero), return value is "-999".
370 Comments : Exponents are rounded up (less negative) if the mantissa is >= 5.
371 : Exponents are rounded down (more negative) if the mantissa is <= -5.
373 =cut
375 #------------------
376 sub get_exponent {
377 #------------------
378 my $data = shift;
380 my($num, $exp) = split /[eE]/, $data;
382 if( defined $exp) {
383 $num = 1 if not $num;
384 $num >= 5 and $exp++;
385 $num <= -5 and $exp--;
386 } elsif( $num == 0) {
387 $exp = -999;
388 } elsif( not $num =~ /\./) {
389 $exp = CORE::length($num) -1;
390 } else {
391 $exp = 0;
392 $num .= '0' if $num =~ /\.$/;
393 my ($c);
394 my $rev = 0;
395 if($num !~ /^0/) {
396 $num = reverse($num);
397 $rev = 1;
399 do { $c = chop($num);
400 $c == 0 && $exp++;
401 } while( $c ne '.');
403 $exp = -$exp if $num == 0 and not $rev;
404 $exp -= 1 if $rev;
406 return $exp;
409 =head2 collapse_nums
411 Usage : @cnums = collapse_nums( @numbers );
412 Purpose : Collapses a list of numbers into a set of ranges of consecutive terms:
413 : Useful for condensing long lists of consecutive numbers.
414 : EXPANDED:
415 : 1 2 3 4 5 6 10 12 13 14 15 17 18 20 21 22 24 26 30 31 32
416 : COLLAPSED:
417 : 1-6 10 12-15 17 18 20-22 24 26 30-32
418 Argument : List of numbers sorted numerically.
419 Returns : List of numbers mixed with ranges of numbers (see above).
420 Throws : n/a
422 See Also : L<Bio::Search::Hit::BlastHit::seq_inds()|Bio::Search::Hit::BlastHit>
424 =cut
426 #------------------
427 sub collapse_nums {
428 #------------------
429 # This is probably not the slickest connectivity algorithm, but will do for now.
430 my @a = @_;
431 my ($from, $to, $i, @ca, $consec);
433 $consec = 0;
434 for($i=0; $i < @a; $i++) {
435 not $from and do{ $from = $a[$i]; next; };
436 if($a[$i] == $a[$i-1]+1) {
437 $to = $a[$i];
438 $consec++;
439 } else {
440 if($consec == 1) { $from .= ",$to"; }
441 else { $from .= $consec>1 ? "\-$to" : ""; }
442 push @ca, split(',', $from);
443 $from = $a[$i];
444 $consec = 0;
445 $to = undef;
448 if(defined $to) {
449 if($consec == 1) { $from .= ",$to"; }
450 else { $from .= $consec>1 ? "\-$to" : ""; }
452 push @ca, split(',', $from) if $from;
454 @ca;
458 =head2 strip_blast_html
460 Usage : $boolean = &strip_blast_html( string_ref );
461 : This method is exported.
462 Purpose : Removes HTML formatting from a supplied string.
463 : Attempts to restore the Blast report to enable
464 : parsing by Bio::SearchIO::blast.pm
465 Returns : Boolean: true if string was stripped, false if not.
466 Argument : string_ref = reference to a string containing the whole Blast
467 : report containing HTML formatting.
468 Throws : Croaks if the argument is not a scalar reference.
469 Comments : Based on code originally written by Alex Dong Li
470 : (ali@genet.sickkids.on.ca).
471 : This method does some Blast-specific stripping
472 : (adds back a '>' character in front of each HSP
473 : alignment listing).
475 : THIS METHOD IS VERY SENSITIVE TO BLAST FORMATTING CHANGES!
477 : Removal of the HTML tags and accurate reconstitution of the
478 : non-HTML-formatted report is highly dependent on structure of
479 : the HTML-formatted version. For example, it assumes that first
480 : line of each alignment section (HSP listing) starts with a
481 : <a name=..> anchor tag. This permits the reconstruction of the
482 : original report in which these lines begin with a ">".
483 : This is required for parsing.
485 : If the structure of the Blast report itself is not intended to
486 : be a standard, the structure of the HTML-formatted version
487 : is even less so. Therefore, the use of this method to
488 : reconstitute parsable Blast reports from HTML-format versions
489 : should be considered a temporary solution.
491 =cut
493 #--------------------
494 sub strip_blast_html {
495 #--------------------
496 # This may not best way to remove html tags. However, it is simple.
497 # it won't work under following conditions:
498 # 1) if quoted > appears in a tag (does this ever happen?)
499 # 2) if a tag is split over multiple lines and this method is
500 # used to process one line at a time.
502 my ($string_ref) = shift;
504 ref $string_ref eq 'SCALAR' or
505 croak ("Can't strip HTML: ".
506 "Argument is should be a SCALAR reference not a ${\ref $string_ref}\n");
508 my $str = $$string_ref;
509 my $stripped = 0;
511 # Removing "<a name =...>" and adding the '>' character for
512 # HSP alignment listings.
513 $str =~ s/(\A|\n)<a name ?=[^>]+> ?/>/sgi and $stripped = 1;
515 # Removing all "<>" tags.
516 $str =~ s/<[^>]+>|&nbsp//sgi and $stripped = 1;
518 # Re-uniting any lone '>' characters.
519 $str =~ s/(\A|\n)>\s+/\n\n>/sgi and $stripped = 1;
521 $$string_ref = $str;
522 $stripped;