t/SeqFeature/Generic.t: fix typo on required module for testing
[bioperl-live.git] / lib / Bio / Search / HSP / BlastHSP.pm
blobfad7a0631c29f53c9d90d4a855fa9147d2742123
1 #-----------------------------------------------------------------
3 # BioPerl module Bio::Search::HSP::BlastHSP
5 # (This module was originally called Bio::Tools::Blast::HSP)
7 # Please direct questions and support issues to <bioperl-l@bioperl.org>
9 # Cared for by Steve Chervitz <sac@bioperl.org>
11 # You may distribute this module under the same terms as perl itself
12 #-----------------------------------------------------------------
14 ## POD Documentation:
16 =head1 NAME
18 Bio::Search::HSP::BlastHSP - Bioperl BLAST High-Scoring Pair object
20 =head1 SYNOPSIS
22 See L<Bio::Search::Hit::BlastHit>.
24 =head1 DESCRIPTION
26 A Bio::Search::HSP::BlastHSP object provides an interface to data
27 obtained in a single alignment section of a Blast report (known as a
28 "High-scoring Segment Pair"). This is essentially a pairwise
29 alignment with score information.
31 BlastHSP objects are accessed via L<Bio::Search::Hit::BlastHit>
32 objects after parsing a BLAST report using the L<Bio::SearchIO>
33 system.
35 The construction of BlastHSP objects is performed by
36 Bio::Factory::BlastHitFactory in a process that is
37 orchestrated by the Blast parser (L<Bio::SearchIO::psiblast>).
38 The resulting BlastHSPs are then accessed via
39 L<Bio::Search::Hit::BlastHit>). Therefore, you do not need to
40 use L<Bio::Search::HSP::BlastHSP>) directly. If you need to construct
41 BlastHSPs directly, see the new() function for details.
43 For L<Bio::SearchIO> BLAST parsing usage examples, see the
44 C<examples/searchio> directory of the Bioperl distribution.
46 =head2 Start and End coordinates
48 Sequence endpoints are swapped so that start is always less than
49 end. This affects For TBLASTN/X hits on the minus strand. Strand
50 information can be recovered using the strand() method. This
51 normalization step is standard Bioperl practice. It also facilitates
52 use of range information by methods such as match().
54 =over 1
56 =item * Supports BLAST versions 1.x and 2.x, gapped and ungapped.
58 =back
60 Bio::Search::HSP::BlastHSP.pm has the ability to extract a list of all
61 residue indices for identical and conservative matches along both
62 query and sbjct sequences. Since this degree of detail is not always
63 needed, this behavior does not occur during construction of the BlastHSP
64 object. These data will automatically be collected as necessary as
65 the BlastHSP.pm object is used.
67 =head1 DEPENDENCIES
69 Bio::Search::HSP::BlastHSP.pm is a concrete class that inherits from
70 L<Bio::SeqFeature::SimilarityPair> and L<Bio::Search::HSP::HSPI>.
71 L<Bio::Seq> and L<Bio::SimpleAlign> are employed for creating
72 sequence and alignment objects, respectively.
74 =head2 Relationship to SimpleAlign.pm & Seq.pm
76 BlastHSP.pm can provide the query or sbjct sequence as a L<Bio::Seq>
77 object via the L<seq()|seq> method. The BlastHSP.pm object can also create a
78 two-sequence L<Bio::SimpleAlign> alignment object using the the query
79 and sbjct sequences via the L<get_aln()|get_aln> method. Creation of alignment
80 objects is not automatic when constructing the BlastHSP.pm object since
81 this level of functionality is not always required and would generate
82 a lot of extra overhead when crunching many reports.
85 =head1 FEEDBACK
87 =head2 Mailing Lists
89 User feedback is an integral part of the evolution of this and other
90 Bioperl modules. Send your comments and suggestions preferably to one
91 of the Bioperl mailing lists. Your participation is much appreciated.
93 bioperl-l@bioperl.org - General discussion
94 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
96 =head2 Support
98 Please direct usage questions or support issues to the mailing list:
100 I<bioperl-l@bioperl.org>
102 rather than to the module maintainer directly. Many experienced and
103 reponsive experts will be able look at the problem and quickly
104 address it. Please include a thorough description of the problem
105 with code and data examples if at all possible.
107 =head2 Reporting Bugs
109 Report bugs to the Bioperl bug tracking system to help us keep track
110 the bugs and their resolution. Bug reports can be submitted via the
111 web:
113 https://github.com/bioperl/bioperl-live/issues
115 =head1 AUTHOR
117 Steve Chervitz E<lt>sac-at-bioperl.orgE<gt>
119 See L<the FEEDBACK section | FEEDBACK> for where to send bug reports and comments.
121 =head1 ACKNOWLEDGEMENTS
123 This software was originally developed in the Department of Genetics
124 at Stanford University. I would also like to acknowledge my
125 colleagues at Affymetrix for useful feedback.
127 =head1 SEE ALSO
129 Bio::Search::Hit::BlastHit.pm - Blast hit object.
130 Bio::Search::Result::BlastResult.pm - Blast Result object.
131 Bio::Seq.pm - Biosequence object
133 =head2 Links:
135 http://bio.perl.org/ - Bioperl Project Homepage
137 =head1 COPYRIGHT
139 Copyright (c) 1996-2001 Steve Chervitz. All Rights Reserved.
141 =head1 DISCLAIMER
143 This software is provided "as is" without warranty of any kind.
145 =cut
148 # END of main POD documentation.
150 =head1 APPENDIX
152 The rest of the documentation details each of the object methods.
153 Internal methods are usually preceded with a _
155 =cut
157 # Let the code begin...
159 package Bio::Search::HSP::BlastHSP;
161 use strict;
162 use Bio::SeqFeature::Similarity;
164 use vars qw($GAP_SYMBOL %STRAND_SYMBOL);
166 use overload
167 '""' => \&to_string;
169 use base qw(Bio::SeqFeature::SimilarityPair Bio::Search::HSP::HSPI);
171 $GAP_SYMBOL = '-'; # Need a more general way to handle gap symbols.
172 %STRAND_SYMBOL = ('Plus' => 1, 'Minus' => -1 );
175 =head2 new
177 Usage : $hsp = Bio::Search::HSP::BlastHSP->new( %named_params );
178 : Bio::Search::HSP::BlastHSP objects are constructed
179 : automatically by Bio::SearchIO::BlastHitFactory,
180 : so there is no need for direct instantiation.
181 Purpose : Constructs a new BlastHSP object and Initializes key variables
182 : for the HSP.
183 Returns : A Bio::Search::HSP::BlastHSP object
184 Argument : Named parameters:
185 : Parameter keys are case-insensitive.
186 : -RAW_DATA => array ref containing raw BLAST report data for
187 : for a single HSP. This includes all lines
188 : of the HSP alignment from a traditional BLAST
189 or PSI-BLAST (non-XML) report,
190 : -RANK => integer (1..n).
191 : -PROGRAM => string ('TBLASTN', 'BLASTP', etc.).
192 : -QUERY_NAME => string, id of query sequence
193 : -HIT_NAME => string, id of hit sequence
195 Comments : Having the raw data allows this object to do lazy parsing of
196 : the raw HSP data (i.e., not parsed until needed).
198 : Note that there is a fair amount of basic parsing that is
199 : currently performed in this module that would be more appropriate
200 : to do within a separate factory object.
201 : This parsing code will likely be relocated and more initialization
202 : parameters will be added to new().
204 See Also : L<Bio::SeqFeature::SimilarityPair::new()>, L<Bio::SeqFeature::Similarity::new()>
206 =cut
208 #----------------
209 sub new {
210 #----------------
211 my ($class, @args ) = @_;
213 my $self = $class->SUPER::new( @args );
214 # Initialize placeholders
215 $self->{'_queryGaps'} = $self->{'_sbjctGaps'} = 0;
216 my ($raw_data, $qname, $hname, $qlen, $hlen);
218 ($self->{'_prog'}, $self->{'_rank'}, $raw_data,
219 $qname, $hname) =
220 $self->_rearrange([qw( PROGRAM
221 RANK
222 RAW_DATA
223 QUERY_NAME
224 HIT_NAME
225 )], @args );
227 # _set_data() does a fair amount of parsing.
228 # This will likely change (see comment above.)
229 $self->_set_data( @{$raw_data} );
230 # Store the aligned query as sequence feature
231 my ($qb, $hb) = ($self->start());
232 my ($qe, $he) = ($self->end());
233 my ($qs, $hs) = ($self->strand());
234 my ($qf,$hf) = ($self->query->frame(),
235 $self->hit->frame);
237 $self->query( Bio::SeqFeature::Similarity->new (-start =>$qb,
238 -end =>$qe,
239 -strand =>$qs,
240 -bits =>$self->bits,
241 -score =>$self->score,
242 -frame =>$qf,
243 -seq_id => $qname,
244 -source =>$self->{'_prog'} ));
246 $self->hit( Bio::SeqFeature::Similarity->new (-start =>$hb,
247 -end =>$he,
248 -strand =>$hs,
249 -bits =>$self->bits,
250 -score =>$self->score,
251 -frame =>$hf,
252 -seq_id => $hname,
253 -source =>$self->{'_prog'} ));
255 # set lengths
256 $self->query->seqlength($qlen); # query
257 $self->hit->seqlength($hlen); # subject
259 $self->query->frac_identical($self->frac_identical('query'));
260 $self->hit->frac_identical($self->frac_identical('hit'));
261 return $self;
265 # Title : _id_str;
266 # Purpose : Intended for internal use only to provide a string for use
267 # within exception messages to help users figure out which
268 # query/hit caused the problem.
269 # Returns : Short string with name of query and hit seq
270 sub _id_str {
271 my $self = shift;
272 if( not defined $self->{'_id_str'}) {
273 my $qname = $self->query->seq_id;
274 my $hname = $self->hit->seq_id;
275 $self->{'_id_str'} = "QUERY=\"$qname\" HIT=\"$hname\"";
277 return $self->{'_id_str'};
280 #=================================================
281 # Begin Bio::Search::HSP::HSPI implementation
282 #=================================================
284 =head2 algorithm
286 Title : algorithm
287 Usage : $alg = $hsp->algorithm();
288 Function: Gets the algorithm specification that was used to obtain the hsp
289 For BLAST, the algorithm denotes what type of sequence was aligned
290 against what (BLASTN: dna-dna, BLASTP prt-prt, BLASTX translated
291 dna-prt, TBLASTN prt-translated dna, TBLASTX translated
292 dna-translated dna).
293 Returns : a scalar string
294 Args : none
296 =cut
298 #----------------
299 sub algorithm {
300 #----------------
301 my ($self,@args) = @_;
302 return $self->{'_prog'};
308 =head2 signif()
310 Usage : $hsp_obj->signif()
311 Purpose : Get the P-value or Expect value for the HSP.
312 Returns : Float (0.001 or 1.3e-43)
313 : Returns P-value if it is defined, otherwise, Expect value.
314 Argument : n/a
315 Throws : n/a
316 Comments : Provided for consistency with BlastHit::signif()
317 : Support for returning the significance data in different
318 : formats (e.g., exponent only), is not provided for HSP objects.
319 : This is only available for the BlastHit or Blast object.
321 See Also : L</p>, L</expect>, L<Bio::Search::Hit::BlastHit::signif()|Bio::Search::Hit::BlastHit>
323 =cut
325 #-----------
326 sub signif {
327 #-----------
328 my $self = shift;
329 my $val ||= defined($self->{'_p'}) ? $self->{'_p'} : $self->{'_expect'};
330 $val;
335 =head2 evalue
337 Usage : $hsp_obj->evalue()
338 Purpose : Get the Expect value for the HSP.
339 Returns : Float (0.001 or 1.3e-43)
340 Argument : n/a
341 Throws : n/a
342 Comments : Support for returning the expectation data in different
343 : formats (e.g., exponent only), is not provided for HSP objects.
344 : This is only available for the BlastHit or Blast object.
346 See Also : L</p>
348 =cut
350 #----------
351 sub evalue { shift->{'_expect'} }
352 #----------
355 =head2 p
357 Usage : $hsp_obj->p()
358 Purpose : Get the P-value for the HSP.
359 Returns : Float (0.001 or 1.3e-43) or undef if not defined.
360 Argument : n/a
361 Throws : n/a
362 Comments : P-value is not defined with NCBI Blast2 reports.
363 : Support for returning the expectation data in different
364 : formats (e.g., exponent only) is not provided for HSP objects.
365 : This is only available for the BlastHit or Blast object.
367 See Also : L<expect()|expect>
369 =cut
371 #-----
372 sub p { my $self = shift; $self->{'_p'}; }
373 #-----
375 # alias
376 sub pvalue { shift->p(@_); }
378 =head2 length
380 Usage : $hsp->length( [seq_type] )
381 Purpose : Get the length of the aligned portion of the query or sbjct.
382 Example : $hsp->length('query')
383 Returns : integer
384 Argument : seq_type: 'query' | 'hit' or 'sbjct' | 'total' (default = 'total')
385 ('sbjct' is synonymous with 'hit')
386 Throws : n/a
387 Comments : 'total' length is the full length of the alignment
388 : as reported in the denominators in the alignment section:
389 : "Identical = 34/120 Positives = 67/120".
391 See Also : L</gaps>
393 =cut
395 #-----------
396 sub length {
397 #-----------
398 ## Developer note: when using the built-in length function within
399 ## this module, call it as CORE::length().
400 my( $self, $seqType,$data ) = @_;
401 $seqType ||= 'total';
402 $seqType = 'sbjct' if $seqType eq 'hit';
404 $seqType ne 'total' and $self->_set_seq_data() unless $self->{'_set_seq_data'};
406 ## Sensitive to member name format.
407 $seqType = "_\L$seqType\E";
408 if( defined $data ) {
409 $self->{$seqType.'Length'} = $data;
411 $self->{$seqType.'Length'};
416 =head2 gaps
418 Usage : $hsp->gaps( [seq_type] )
419 Purpose : Get the number of gap characters in the query, sbjct, or total alignment.
420 : Also can return query gap chars and sbjct gap chars as a two-element list
421 : when in array context.
422 Example : $total_gaps = $hsp->gaps();
423 : ($qgaps, $sgaps) = $hsp->gaps();
424 : $qgaps = $hsp->gaps('query');
425 Returns : scalar context: integer
426 : array context without args: (int, int) = ('queryGaps', 'sbjctGaps')
427 Argument : seq_type: 'query' or 'hit' or 'sbjct' or 'total'
428 : ('sbjct' is synonymous with 'hit')
429 : (default = 'total', scalar context)
430 : Array context can be "induced" by providing an argument of 'list' or 'array'.
431 Throws : n/a
433 See Also : L</length>, L</matches>
435 =cut
437 #---------
438 sub gaps {
439 #---------
440 my( $self, $seqType ) = @_;
442 $self->_set_seq_data() unless $self->{'_set_seq_data'};
444 $seqType ||= (wantarray ? 'list' : 'total');
445 $seqType = 'sbjct' if $seqType eq 'hit';
447 if($seqType =~ /list|array/i) {
448 return (($self->{'_queryGaps'} || 0), ($self->{'_sbjctGaps'} || 0));
451 if($seqType eq 'total') {
452 return ($self->{'_queryGaps'} + $self->{'_sbjctGaps'}) || 0;
453 } else {
454 ## Sensitive to member name format.
455 $seqType = "_\L$seqType\E";
456 return $self->{$seqType.'Gaps'} || 0;
461 =head2 frac_identical
463 Usage : $hsp_object->frac_identical( [seq_type] );
464 Purpose : Get the fraction of identical positions within the given HSP.
465 Example : $frac_iden = $hsp_object->frac_identical('query');
466 Returns : Float (2-decimal precision, e.g., 0.75).
467 Argument : seq_type: 'query' or 'hit' or 'sbjct' or 'total'
468 : ('sbjct' is synonymous with 'hit')
469 : default = 'total' (but see comments below).
470 Throws : n/a
471 Comments : Different versions of Blast report different values for the total
472 : length of the alignment. This is the number reported in the
473 : denominators in the stats section:
474 : "Identical = 34/120 Positives = 67/120".
475 : NCBI-BLAST uses the total length of the alignment (with gaps)
476 : WU-BLAST uses the length of the query sequence (without gaps).
477 : Therefore, when called without an argument or an argument of 'total',
478 : this method will report different values depending on the
479 : version of BLAST used.
481 : To get the fraction identical among only the aligned residues,
482 : ignoring the gaps, call this method with an argument of 'query'
483 : or 'sbjct' ('sbjct' is synonymous with 'hit').
485 See Also : L</frac_conserved>, L</num_identical>, L</matches>
487 =cut
489 #-------------------
490 sub frac_identical {
491 #-------------------
492 # The value is calculated as opposed to storing it from the parsed results.
493 # This saves storage and also permits flexibility in determining for which
494 # sequence (query or sbjct) the figure is to be calculated.
496 my( $self, $seqType ) = @_;
497 $seqType ||= 'total';
498 $seqType = 'sbjct' if $seqType eq 'hit';
500 if($seqType ne 'total') {
501 $self->_set_seq_data() unless $self->{'_set_seq_data'};
503 ## Sensitive to member name format.
504 $seqType = "_\L$seqType\E";
506 sprintf( "%.2f", $self->{'_numIdentical'}/$self->{$seqType.'Length'});
510 =head2 frac_conserved
512 Usage : $hsp_object->frac_conserved( [seq_type] );
513 Purpose : Get the fraction of conserved positions within the given HSP.
514 : (Note: 'conservative' positions are called 'positives' in the
515 : Blast report.)
516 Example : $frac_cons = $hsp_object->frac_conserved('query');
517 Returns : Float (2-decimal precision, e.g., 0.75).
518 Argument : seq_type: 'query' or 'hit' or 'sbjct' or 'total'
519 : ('sbjct' is synonymous with 'hit')
520 : default = 'total' (but see comments below).
521 Throws : n/a
522 Comments : Different versions of Blast report different values for the total
523 : length of the alignment. This is the number reported in the
524 : denominators in the stats section:
525 : "Identical = 34/120 Positives = 67/120".
526 : NCBI-BLAST uses the total length of the alignment (with gaps)
527 : WU-BLAST uses the length of the query sequence (without gaps).
528 : Therefore, when called without an argument or an argument of 'total',
529 : this method will report different values depending on the
530 : version of BLAST used.
532 : To get the fraction conserved among only the aligned residues,
533 : ignoring the gaps, call this method with an argument of 'query'
534 : or 'sbjct'.
536 See Also : L</frac_conserved>, L</num_conserved>, L</matches>
538 =cut
540 #--------------------
541 sub frac_conserved {
542 #--------------------
543 # The value is calculated as opposed to storing it from the parsed results.
544 # This saves storage and also permits flexibility in determining for which
545 # sequence (query or sbjct) the figure is to be calculated.
547 my( $self, $seqType ) = @_;
548 $seqType ||= 'total';
549 $seqType = 'sbjct' if $seqType eq 'hit';
551 if($seqType ne 'total') {
552 $self->_set_seq_data() unless $self->{'_set_seq_data'};
555 ## Sensitive to member name format.
556 $seqType = "_\L$seqType\E";
558 sprintf( "%.2f", $self->{'_numConserved'}/$self->{$seqType.'Length'});
561 =head2 query_string
563 Title : query_string
564 Usage : my $qseq = $hsp->query_string;
565 Function: Retrieves the query sequence of this HSP as a string
566 Returns : string
567 Args : none
570 =cut
572 #----------------
573 sub query_string{ shift->seq_str('query'); }
574 #----------------
576 =head2 hit_string
578 Title : hit_string
579 Usage : my $hseq = $hsp->hit_string;
580 Function: Retrieves the hit sequence of this HSP as a string
581 Returns : string
582 Args : none
585 =cut
587 #----------------
588 sub hit_string{ shift->seq_str('hit'); }
589 #----------------
592 =head2 homology_string
594 Title : homology_string
595 Usage : my $homo_string = $hsp->homology_string;
596 Function: Retrieves the homology sequence for this HSP as a string.
597 : The homology sequence is the string of symbols in between the
598 : query and hit sequences in the alignment indicating the degree
599 : of conservation (e.g., identical, similar, not similar).
600 Returns : string
601 Args : none
603 =cut
605 #----------------
606 sub homology_string{ shift->seq_str('match'); }
607 #----------------
609 #=================================================
610 # End Bio::Search::HSP::HSPI implementation
611 #=================================================
613 # Older method delegating to method defined in HSPI.
615 =head2 expect
617 See L<Bio::Search::HSP::HSPI::expect()|Bio::Search::HSP::HSPI>
619 =cut
621 #----------
622 sub expect { shift->evalue( @_ ); }
623 #----------
626 =head2 rank
628 Usage : $hsp->rank( [string] );
629 Purpose : Get the rank of the HSP within a given Blast hit.
630 Example : $rank = $hsp->rank;
631 Returns : Integer (1..n) corresponding to the order in which the HSP
632 appears in the BLAST report.
634 =cut
638 #----------
639 sub rank { shift->{'_rank'} }
640 #----------
642 # For backward compatibility
643 #----------
644 sub name { shift->rank }
645 #----------
647 =head2 to_string
649 Title : to_string
650 Usage : print $hsp->to_string;
651 Function: Returns a string representation for the Blast HSP.
652 Primarily intended for debugging purposes.
653 Example : see usage
654 Returns : A string of the form:
655 [BlastHSP] <rank>
656 e.g.:
657 [BlastHit] 1
658 Args : None
660 =cut
662 #----------
663 sub to_string {
664 #----------
665 my $self = shift;
666 return "[BlastHSP] " . $self->rank();
670 =head2 _set_data
672 Usage : called automatically during object construction.
673 Purpose : Parses the raw HSP section from a flat BLAST report and
674 sets the query sequence, sbjct sequence, and the "match" data
675 : which consists of the symbols between the query and sbjct lines
676 : in the alignment.
677 Argument : Array (all lines for a single, complete HSP, from a raw,
678 flat (i.e., non-XML) BLAST report)
679 Throws : Propagates any exceptions from the methods called ("See Also")
681 See Also : L</_set_seq>, L</_set_score_stats>, L</_set_match_stats>
683 =cut
685 #--------------
686 sub _set_data {
687 #--------------
688 my $self = shift;
689 my @data = @_;
690 my @queryList = (); # 'Query' = SEQUENCE USED TO QUERY THE DATABASE.
691 my @sbjctList = (); # 'Sbjct' = HOMOLOGOUS SEQUENCE FOUND IN THE DATABASE.
692 my @matchList = ();
693 my $matchLine = 0; # Alternating boolean: when true, load 'match' data.
694 my @linedat = ();
696 #print STDERR "BlastHSP: set_data()\n";
698 my($line, $aln_row_len, $length_diff);
699 $length_diff = 0;
701 # Collecting data for all lines in the alignment
702 # and then storing the collections for possible processing later.
704 # Note that "match" lines may not be properly padded with spaces.
705 # This loop now properly handles such cases:
706 # Query: 1141 PSLVELTIRDCPRLEVGPMIRSLPKFPMLKKLDLAVANIIEEDLDVIGSLEELVIXXXXX 1200
707 # PSLVELTIRDCPRLEVGPMIRSLPKFPMLKKLDLAVANIIEEDLDVIGSLEELVI
708 # Sbjct: 1141 PSLVELTIRDCPRLEVGPMIRSLPKFPMLKKLDLAVANIIEEDLDVIGSLEELVILSLKL 1200
710 foreach $line( @data ) {
711 next if $line =~ /^\s*$/;
713 if( $line =~ /^ ?Score/ ) {
714 $self->_set_score_stats( $line );
715 } elsif( $line =~ /^ ?(Identities|Positives|Strand)/ ) {
716 $self->_set_match_stats( $line );
717 } elsif( $line =~ /^ ?Frame = ([\d+-]+)/ ) {
718 # Version 2.0.8 has Frame information on a separate line.
719 # Storing frame according to SeqFeature::Generic::frame()
720 # which does not contain strand info (use strand()).
721 my $frame = abs($1) - 1;
722 $self->frame( $frame );
723 } elsif( $line =~ /^(Query:?[\s\d]+)([^\s\d]+)/ ) {
724 push @queryList, $line;
725 $self->{'_match_indent'} = CORE::length $1;
726 $aln_row_len = (CORE::length $1) + (CORE::length $2);
727 $matchLine = 1;
728 } elsif( $matchLine ) {
729 # Pad the match line with spaces if necessary.
730 $length_diff = $aln_row_len - CORE::length $line;
731 $length_diff and $line .= ' 'x $length_diff;
732 push @matchList, $line;
733 $matchLine = 0;
734 } elsif( $line =~ /^Sbjct/ ) {
735 push @sbjctList, $line;
738 # Storing the query and sbjct lists in case they are needed later.
739 # We could make this conditional to save memory.
740 $self->{'_queryList'} = \@queryList;
741 $self->{'_sbjctList'} = \@sbjctList;
743 # Storing the match list in case it is needed later.
744 $self->{'_matchList'} = \@matchList;
746 if(not defined ($self->{'_numIdentical'})) {
747 my $id_str = $self->_id_str;
748 $self->throw( -text => "Can't parse match statistics. Possibly a new or unrecognized Blast format. ($id_str)");
751 if(!scalar @queryList or !scalar @sbjctList) {
752 my $id_str = $self->_id_str;
753 $self->throw( "Can't find query or sbjct alignment lines. Possibly unrecognized Blast format. ($id_str)");
758 =head2 _set_score_stats
760 Usage : called automatically by _set_data()
761 Purpose : Sets various score statistics obtained from the HSP listing.
762 Argument : String with any of the following formats:
763 : blast2: Score = 30.1 bits (66), Expect = 9.2
764 : blast2: Score = 158.2 bits (544), Expect(2) = e-110
765 : blast1: Score = 410 (144.3 bits), Expect = 1.7e-40, P = 1.7e-40
766 : blast1: Score = 55 (19.4 bits), Expect = 5.3, Sum P(3) = 0.99
767 Throws : Exception if the stats cannot be parsed, probably due to a change
768 : in the Blast report format.
770 See Also : L</_set_data>
772 =cut
774 #--------------------
775 sub _set_score_stats {
776 #--------------------
777 my ($self, $data) = @_;
779 my ($expect, $p);
781 if($data =~ /Score = +([\d.e+-]+) bits \(([\d.e+-]+)\), +Expect = +([\d.e+-]+)/) {
782 # blast2 format n = 1
783 $self->bits($1);
784 $self->score($2);
785 $expect = $3;
786 } elsif($data =~ /Score = +([\d.e+-]+) bits \(([\d.e+-]+)\), +Expect\((\d+)\) = +([\d.e+-]+)/) {
787 # blast2 format n > 1
788 $self->bits($1);
789 $self->score($2);
790 $self->{'_n'} = $3;
791 $expect = $4;
793 } elsif($data =~ /Score = +([\d.e+-]+) \(([\d.e+-]+) bits\), +Expect = +([\d.e+-]+), P = +([\d.e-]+)/) {
794 # blast1 format, n = 1
795 $self->score($1);
796 $self->bits($2);
797 $expect = $3;
798 $p = $4;
800 } elsif($data =~ /Score = +([\d.e+-]+) \(([\d.e+-]+) bits\), +Expect = +([\d.e+-]+), +Sum P\((\d+)\) = +([\d.e-]+)/) {
801 # blast1 format, n > 1
802 $self->score($1);
803 $self->bits($2);
804 $expect = $3;
805 $self->{'_n'} = $4;
806 $p = $5;
808 } else {
809 my $id_str = $self->_id_str;
810 $self->throw(-class => 'Bio::Root::Exception',
811 -text => "Can't parse score statistics: unrecognized format. ($id_str)",
812 -value => $data);
814 $expect = "1$expect" if $expect =~ /^e/i;
815 $p = "1$p" if defined $p and $p=~ /^e/i;
817 $self->{'_expect'} = $expect;
818 $self->{'_p'} = $p || undef;
819 $self->significance( $p || $expect );
823 =head2 _set_match_stats
825 Usage : Private method; called automatically by _set_data()
826 Purpose : Sets various matching statistics obtained from the HSP listing.
827 Argument : blast2: Identities = 23/74 (31%), Positives = 29/74 (39%), Gaps = 17/74 (22%)
828 : blast2: Identities = 57/98 (58%), Positives = 74/98 (75%)
829 : blast1: Identities = 87/204 (42%), Positives = 126/204 (61%)
830 : blast1: Identities = 87/204 (42%), Positives = 126/204 (61%), Frame = -3
831 : WU-blast: Identities = 310/553 (56%), Positives = 310/553 (56%), Strand = Minus / Plus
832 Throws : Exception if the stats cannot be parsed, probably due to a change
833 : in the Blast report format.
834 Comments : The "Gaps = " data in the HSP header has a different meaning depending
835 : on the type of Blast: for BLASTP, this number is the total number of
836 : gaps in query+sbjct; for TBLASTN, it is the number of gaps in the
837 : query sequence only. Thus, it is safer to collect the data
838 : separately by examining the actual sequence strings as is done
839 : in _set_seq().
841 See Also : L</_set_data>, L</_set_seq>
843 =cut
845 #--------------------
846 sub _set_match_stats {
847 #--------------------
848 my ($self, $data) = @_;
850 if($data =~ m!Identities = (\d+)/(\d+)!) {
851 # blast1 or 2 format
852 $self->{'_numIdentical'} = $1;
853 $self->{'_totalLength'} = $2;
856 if($data =~ m!Positives = (\d+)/(\d+)!) {
857 # blast1 or 2 format
858 $self->{'_numConserved'} = $1;
859 $self->{'_totalLength'} = $2;
862 if($data =~ m!Frame = ([\d+-]+)!) {
863 $self->frame($1);
866 # Strand data is not always present in this line.
867 # _set_seq() will also set strand information.
868 if($data =~ m!Strand = (\w+) / (\w+)!) {
869 $self->{'_queryStrand'} = $1;
870 $self->{'_sbjctStrand'} = $2;
873 # if($data =~ m!Gaps = (\d+)/(\d+)!) {
874 # $self->{'_totalGaps'} = $1;
875 # } else {
876 # $self->{'_totalGaps'} = 0;
882 =head2 _set_seq_data
884 Usage : called automatically when sequence data is requested.
885 Purpose : Sets the HSP sequence data for both query and sbjct sequences.
886 : Includes: start, stop, length, gaps, and raw sequence.
887 Argument : n/a
888 Throws : Propagates any exception thrown by _set_match_seq()
889 Comments : Uses raw data stored by _set_data() during object construction.
890 : These data are not always needed, so it is conditionally
891 : executed only upon demand by methods such as gaps(), _set_residues(),
892 : etc. _set_seq() does the dirty work.
894 See Also : L</_set_seq>
896 =cut
898 #-----------------
899 sub _set_seq_data {
900 #-----------------
901 my $self = shift;
903 $self->_set_seq('query', @{$self->{'_queryList'}});
904 $self->_set_seq('sbjct', @{$self->{'_sbjctList'}});
906 # Liberate some memory.
907 @{$self->{'_queryList'}} = @{$self->{'_sbjctList'}} = ();
908 undef $self->{'_queryList'};
909 undef $self->{'_sbjctList'};
911 $self->{'_set_seq_data'} = 1;
916 =head2 _set_seq
918 Usage : called automatically by _set_seq_data()
919 : $hsp_obj->($seq_type, @data);
920 Purpose : Sets sequence information for both the query and sbjct sequences.
921 : Directly counts the number of gaps in each sequence (if gapped Blast).
922 Argument : $seq_type = 'query' or 'sbjct'
923 : @data = all seq lines with the form:
924 : Query: 61 SPHNVKDRKEQNGSINNAISPTATANTSGSQQINIDSALRDRSSNVAAQPSLSDASSGSN 120
925 Throws : Exception if data strings cannot be parsed, probably due to a change
926 : in the Blast report format.
927 Comments : Uses first argument to determine which data members to set
928 : making this method sensitive data member name changes.
929 : Behavior is dependent on the type of BLAST analysis (TBLASTN, BLASTP, etc).
930 Warning : Sequence endpoints are normalized so that start < end. This affects HSPs
931 : for TBLASTN/X hits on the minus strand. Normalization facilitates use
932 : of range information by methods such as match().
934 See Also : L</_set_seq_data>, L</matches>, L</range>, L</start>, L</end>
936 =cut
938 #-------------
939 sub _set_seq {
940 #-------------
941 my $self = shift;
942 my $seqType = shift;
943 my @data = @_;
944 my @ranges = ();
945 my @sequence = ();
946 my $numGaps = 0;
948 foreach( @data ) {
949 if( m/(\d+) *([^\d\s]+) *(\d+)/) {
950 push @ranges, ( $1, $3 ) ;
951 push @sequence, $2;
952 #print STDERR "_set_seq found sequence \"$2\"\n";
953 } else {
954 $self->warn("Bad sequence data: $_");
958 if( !(scalar(@sequence) and scalar(@ranges))) {
959 my $id_str = $self->_id_str;
960 $self->throw("Can't set sequence: missing data. Possibly unrecognized Blast format. ($id_str) $seqType");
963 # Sensitive to member name changes.
964 $seqType = "_\L$seqType\E";
965 $self->{$seqType.'Start'} = $ranges[0];
966 $self->{$seqType.'Stop'} = $ranges[ $#ranges ];
967 $self->{$seqType.'Seq'} = \@sequence;
969 $self->{$seqType.'Length'} = abs($ranges[ $#ranges ] - $ranges[0]) + 1;
971 # Adjust lengths for BLASTX, TBLASTN, TBLASTX sequences
972 # Converting nucl coords to amino acid coords.
974 my $prog = $self->algorithm;
975 if($prog eq 'TBLASTN' and $seqType eq '_sbjct') {
976 $self->{$seqType.'Length'} /= 3;
977 } elsif($prog eq 'BLASTX' and $seqType eq '_query') {
978 $self->{$seqType.'Length'} /= 3;
979 } elsif($prog eq 'TBLASTX') {
980 $self->{$seqType.'Length'} /= 3;
983 if( $prog ne 'BLASTP' ) {
984 $self->{$seqType.'Strand'} = 'Plus' if $prog =~ /BLASTN/;
985 $self->{$seqType.'Strand'} = 'Plus' if ($prog =~ /BLASTX/ and $seqType eq '_query');
986 # Normalize sequence endpoints so that start < end.
987 # Reverse complement or 'minus strand' HSPs get flipped here.
988 if($self->{$seqType.'Start'} > $self->{$seqType.'Stop'}) {
989 ($self->{$seqType.'Start'}, $self->{$seqType.'Stop'}) =
990 ($self->{$seqType.'Stop'}, $self->{$seqType.'Start'});
991 $self->{$seqType.'Strand'} = 'Minus';
995 ## Count number of gaps in each seq. Only need to do this for gapped Blasts.
996 # if($self->{'_gapped'}) {
997 my $seqstr = join('', @sequence);
998 $seqstr =~ s/\s//g;
999 my $num_gaps = CORE::length($seqstr) - $self->{$seqType.'Length'};
1000 $self->{$seqType.'Gaps'} = $num_gaps if $num_gaps > 0;
1005 =head2 _set_residues
1007 Usage : called automatically when residue data is requested.
1008 Purpose : Sets the residue numbers representing the identical and
1009 : conserved positions. These data are obtained by analyzing the
1010 : symbols between query and sbjct lines of the alignments.
1011 Argument : n/a
1012 Throws : Propagates any exception thrown by _set_seq_data() and _set_match_seq().
1013 Comments : These data are not always needed, so it is conditionally
1014 : executed only upon demand by methods such as seq_inds().
1015 : Behavior is dependent on the type of BLAST analysis (TBLASTN, BLASTP, etc).
1017 See Also : L</_set_seq_data>, L</_set_match_seq>, L</seq_inds>
1019 =cut
1021 #------------------
1022 sub _set_residues {
1023 #------------------
1024 my $self = shift;
1025 my @sequence = ();
1027 $self->_set_seq_data() unless $self->{'_set_seq_data'};
1029 # Using hashes to avoid saving duplicate residue numbers.
1030 my %identicalList_query = ();
1031 my %identicalList_sbjct = ();
1032 my %conservedList_query = ();
1033 my %conservedList_sbjct = ();
1035 my $aref = $self->_set_match_seq() if not ref $self->{'_matchSeq'};
1036 $aref ||= $self->{'_matchSeq'};
1037 my $seqString = join('', @$aref );
1039 my $qseq = join('',@{$self->{'_querySeq'}});
1040 my $sseq = join('',@{$self->{'_sbjctSeq'}});
1041 my $resCount_query = $self->{'_queryStop'} || 0;
1042 my $resCount_sbjct = $self->{'_sbjctStop'} || 0;
1044 my $prog = $self->algorithm;
1045 if($prog !~ /^BLASTP|^BLASTN/) {
1046 if($prog eq 'TBLASTN') {
1047 $resCount_sbjct /= 3;
1048 } elsif($prog eq 'BLASTX') {
1049 $resCount_query /= 3;
1050 } elsif($prog eq 'TBLASTX') {
1051 $resCount_query /= 3;
1052 $resCount_sbjct /= 3;
1056 my ($mchar, $schar, $qchar);
1057 while( $mchar = chop($seqString) ) {
1058 ($qchar, $schar) = (chop($qseq), chop($sseq));
1059 if( $mchar eq '+' ) {
1060 $conservedList_query{ $resCount_query } = 1;
1061 $conservedList_sbjct{ $resCount_sbjct } = 1;
1062 } elsif( $mchar ne ' ' ) {
1063 $identicalList_query{ $resCount_query } = 1;
1064 $identicalList_sbjct{ $resCount_sbjct } = 1;
1066 $resCount_query-- if $qchar ne $GAP_SYMBOL;
1067 $resCount_sbjct-- if $schar ne $GAP_SYMBOL;
1069 $self->{'_identicalRes_query'} = \%identicalList_query;
1070 $self->{'_conservedRes_query'} = \%conservedList_query;
1071 $self->{'_identicalRes_sbjct'} = \%identicalList_sbjct;
1072 $self->{'_conservedRes_sbjct'} = \%conservedList_sbjct;
1076 =head2 _set_match_seq
1078 Usage : $hsp_obj->_set_match_seq()
1079 Purpose : Set the 'match' sequence for the current HSP (symbols in between
1080 : the query and sbjct lines.)
1081 Returns : Array reference holding the match sequences lines.
1082 Argument : n/a
1083 Throws : Exception if the _matchList field is not set.
1084 Comments : The match information is not always necessary. This method
1085 : allows it to be conditionally prepared.
1086 : Called by _set_residues>() and seq_str().
1088 See Also : L</_set_residues>, L</seq_str>
1090 =cut
1092 #-------------------
1093 sub _set_match_seq {
1094 #-------------------
1095 my $self = shift;
1097 if( ! ref($self->{'_matchList'}) ) {
1098 my $id_str = $self->_id_str;
1099 $self->throw("Can't set HSP match sequence: No data ($id_str)");
1102 my @data = @{$self->{'_matchList'}};
1104 my(@sequence);
1105 foreach( @data ) {
1106 chomp($_);
1107 ## Remove leading spaces; (note: aln may begin with a space
1108 ## which is why we can't use s/^ +//).
1109 s/^ {$self->{'_match_indent'}}//;
1110 push @sequence, $_;
1112 # Liberate some memory.
1113 @{$self->{'_matchList'}} = undef;
1114 $self->{'_matchList'} = undef;
1116 $self->{'_matchSeq'} = \@sequence;
1118 return $self->{'_matchSeq'};
1122 =head2 n
1124 Usage : $hsp_obj->n()
1125 Purpose : Get the N value (num HSPs on which P/Expect is based).
1126 : This value is not defined with NCBI Blast2 with gapping.
1127 Returns : Integer or null string if not defined.
1128 Argument : n/a
1129 Throws : n/a
1130 Comments : The 'N' value is listed in parenthesis with P/Expect value:
1131 : e.g., P(3) = 1.2e-30 ---> (N = 3).
1132 : Not defined in NCBI Blast2 with gaps.
1133 : This typically is equal to the number of HSPs but not always.
1134 : To obtain the number of HSPs, use Bio::Search::Hit::BlastHit::num_hsps().
1136 See Also : L<Bio::SeqFeature::SimilarityPair::score()|Bio::SeqFeature::SimilarityPair>
1138 =cut
1140 #-----
1141 sub n { my $self = shift; $self->{'N'} || ''; }
1142 #-----
1145 =head2 matches
1147 Usage : $hsp->matches([seq_type], [start], [stop]);
1148 Purpose : Get the total number of identical and conservative matches
1149 : in the query or sbjct sequence for the given HSP. Optionally can
1150 : report data within a defined interval along the seq.
1151 : (Note: 'conservative' matches are called 'positives' in the
1152 : Blast report.)
1153 Example : ($id,$cons) = $hsp_object->matches('hit');
1154 : ($id,$cons) = $hsp_object->matches('query',300,400);
1155 Returns : 2-element array of integers
1156 Argument : (1) seq_type = 'query' or 'hit' or 'sbjct' (default = query)
1157 : ('sbjct' is synonymous with 'hit')
1158 : (2) start = Starting coordinate (optional)
1159 : (3) stop = Ending coordinate (optional)
1160 Throws : Exception if the supplied coordinates are out of range.
1161 Comments : Relies on seq_str('match') to get the string of alignment symbols
1162 : between the query and sbjct lines which are used for determining
1163 : the number of identical and conservative matches.
1165 See Also : L</length>, L</gaps>, L</seq_str>, L<Bio::Search::Hit::BlastHit::_adjust_contigs()|Bio::Search::Hit::BlastHit>
1167 =cut
1169 #-----------
1170 sub matches {
1171 #-----------
1172 my( $self, %param ) = @_;
1173 my(@data);
1174 my($seqType, $beg, $end) = ($param{-SEQ}, $param{-START}, $param{-STOP});
1175 $seqType ||= 'query';
1176 $seqType = 'sbjct' if $seqType eq 'hit';
1178 my($start,$stop);
1180 if(!defined $beg && !defined $end) {
1181 ## Get data for the whole alignment.
1182 push @data, ($self->{'_numIdentical'}, $self->{'_numConserved'});
1183 } else {
1184 ## Get the substring representing the desired sub-section of aln.
1185 $beg ||= 0;
1186 $end ||= 0;
1187 ($start,$stop) = $self->range($seqType);
1188 if($beg == 0) { $beg = $start; $end = $beg+$end; }
1189 elsif($end == 0) { $end = $stop; $beg = $end-$beg; }
1191 if($end >= $stop) { $end = $stop; } ##ML changed from if (end >stop)
1192 else { $end += 1;} ##ML moved from commented position below, makes
1193 ##more sense here
1194 # if($end > $stop) { $end = $stop; }
1195 if($beg < $start) { $beg = $start; }
1196 # else { $end += 1;}
1198 # my $seq = substr($self->seq_str('match'), $beg-$start, ($end-$beg));
1200 ## ML: START fix for substr out of range error ------------------
1201 my $seq = "";
1202 my $prog = $self->algorithm;
1203 if (($prog eq 'TBLASTN') and ($seqType eq 'sbjct'))
1205 $seq = substr($self->seq_str('match'),
1206 int(($beg-$start)/3), int(($end-$beg+1)/3));
1208 } elsif (($prog eq 'BLASTX') and ($seqType eq 'query'))
1210 $seq = substr($self->seq_str('match'),
1211 int(($beg-$start)/3), int(($end-$beg+1)/3));
1212 } else {
1213 $seq = substr($self->seq_str('match'),
1214 $beg-$start, ($end-$beg));
1216 ## ML: End of fix for substr out of range error -----------------
1219 ## ML: debugging code
1220 ## This is where we get our exception. Try printing out the values going
1221 ## into this:
1223 # print STDERR
1224 # qq(*------------MY EXCEPTION --------------------\nSeq: ") ,
1225 # $self->seq_str("$seqType"), qq("\n),$self->rank,",( index:";
1226 # print STDERR $beg-$start, ", len: ", $end-$beg," ), (HSPRealLen:",
1227 # CORE::length $self->seq_str("$seqType");
1228 # print STDERR ", HSPCalcLen: ", $stop - $start +1 ," ),
1229 # ( beg: $beg, end: $end ), ( start: $start, stop: stop )\n";
1230 ## ML: END DEBUGGING CODE----------
1232 if(!CORE::length $seq) {
1233 my $id_str = $self->_id_str;
1234 $self->throw("Undefined $seqType sub-sequence ($beg,$end). Valid range = $start - $stop ($id_str)");
1236 ## Get data for a substring.
1237 # printf "Collecting HSP subsection data: beg,end = %d,%d; start,stop = %d,%d\n%s<---\n", $beg, $end, $start, $stop, $seq;
1238 # printf "Original match seq:\n%s\n",$self->seq_str('match');
1239 $seq =~ s/ //g; # remove space (no info).
1240 my $len_cons = CORE::length $seq;
1241 $seq =~ s/\+//g; # remove '+' characters (conservative substitutions)
1242 my $len_id = CORE::length $seq;
1243 push @data, ($len_id, $len_cons);
1244 # printf " HSP = %s\n id = %d; cons = %d\n", $self->rank, $len_id, $len_cons; <STDIN>;
1246 @data;
1250 =head2 num_identical
1252 Usage : $hsp_object->num_identical();
1253 Purpose : Get the number of identical positions within the given HSP.
1254 Example : $num_iden = $hsp_object->num_identical();
1255 Returns : integer
1256 Argument : n/a
1257 Throws : n/a
1259 See Also : L</num_conserved>, L</frac_identical>
1261 =cut
1263 #-------------------
1264 sub num_identical {
1265 #-------------------
1266 my( $self) = shift;
1268 $self->{'_numIdentical'};
1272 =head2 num_conserved
1274 Usage : $hsp_object->num_conserved();
1275 Purpose : Get the number of conserved positions within the given HSP.
1276 Example : $num_iden = $hsp_object->num_conserved();
1277 Returns : integer
1278 Argument : n/a
1279 Throws : n/a
1281 See Also : L</num_identical>, L</frac_conserved>
1283 =cut
1285 #-------------------
1286 sub num_conserved {
1287 #-------------------
1288 my( $self) = shift;
1290 $self->{'_numConserved'};
1295 =head2 range
1297 Usage : $hsp->range( [seq_type] );
1298 Purpose : Gets the (start, end) coordinates for the query or sbjct sequence
1299 : in the HSP alignment.
1300 Example : ($query_beg, $query_end) = $hsp->range('query');
1301 : ($hit_beg, $hit_end) = $hsp->range('hit');
1302 Returns : Two-element array of integers
1303 Argument : seq_type = string, 'query' or 'hit' or 'sbjct' (default = 'query')
1304 : ('sbjct' is synonymous with 'hit')
1305 Throws : n/a
1307 See Also : L</start>, L</end>
1309 =cut
1311 #----------
1312 sub range {
1313 #----------
1314 my ($self, $seqType) = @_;
1316 $self->_set_seq_data() unless $self->{'_set_seq_data'};
1318 $seqType ||= 'query';
1319 $seqType = 'sbjct' if $seqType eq 'hit';
1321 ## Sensitive to member name changes.
1322 $seqType = "_\L$seqType\E";
1324 return ($self->{$seqType.'Start'},$self->{$seqType.'Stop'});
1327 =head2 start
1329 Usage : $hsp->start( [seq_type] );
1330 Purpose : Gets the start coordinate for the query, sbjct, or both sequences
1331 : in the HSP alignment.
1332 : NOTE: Start will always be less than end.
1333 : To determine strand, use $hsp->strand()
1334 Example : $query_beg = $hsp->start('query');
1335 : $hit_beg = $hsp->start('hit');
1336 : ($query_beg, $hit_beg) = $hsp->start();
1337 Returns : scalar context: integer
1338 : array context without args: list of two integers
1339 Argument : In scalar context: seq_type = 'query' or 'hit' or 'sbjct' (default= 'query')
1340 : ('sbjct' is synonymous with 'hit')
1341 : Array context can be "induced" by providing an argument of 'list' or 'array'.
1342 Throws : n/a
1344 See Also : L</end>, L</range>
1346 =cut
1348 #----------
1349 sub start {
1350 #----------
1351 my ($self, $seqType) = @_;
1353 $seqType ||= (wantarray ? 'list' : 'query');
1354 $seqType = 'sbjct' if $seqType eq 'hit';
1356 $self->_set_seq_data() unless $self->{'_set_seq_data'};
1358 if($seqType =~ /list|array/i) {
1359 return ($self->{'_queryStart'}, $self->{'_sbjctStart'});
1360 } else {
1361 ## Sensitive to member name changes.
1362 $seqType = "_\L$seqType\E";
1363 return $self->{$seqType.'Start'};
1367 =head2 end
1369 Usage : $hsp->end( [seq_type] );
1370 Purpose : Gets the end coordinate for the query, sbjct, or both sequences
1371 : in the HSP alignment.
1372 : NOTE: Start will always be less than end.
1373 : To determine strand, use $hsp->strand()
1374 Example : $query_end = $hsp->end('query');
1375 : $hit_end = $hsp->end('hit');
1376 : ($query_end, $hit_end) = $hsp->end();
1377 Returns : scalar context: integer
1378 : array context without args: list of two integers
1379 Argument : In scalar context: seq_type = 'query' or 'hit' or 'sbjct' (default= 'query')
1380 : ('sbjct' is synonymous with 'hit')
1381 : Array context can be "induced" by providing an argument of 'list' or 'array'.
1382 Throws : n/a
1384 See Also : L</start>, L</range>, L</strand>
1386 =cut
1388 #----------
1389 sub end {
1390 #----------
1391 my ($self, $seqType) = @_;
1393 $seqType ||= (wantarray ? 'list' : 'query');
1394 $seqType = 'sbjct' if $seqType eq 'hit';
1396 $self->_set_seq_data() unless $self->{'_set_seq_data'};
1398 if($seqType =~ /list|array/i) {
1399 return ($self->{'_queryStop'}, $self->{'_sbjctStop'});
1400 } else {
1401 ## Sensitive to member name changes.
1402 $seqType = "_\L$seqType\E";
1403 return $self->{$seqType.'Stop'};
1409 =head2 strand
1411 Usage : $hsp_object->strand( [seq_type] )
1412 Purpose : Get the strand of the query or sbjct sequence.
1413 Example : print $hsp->strand('query');
1414 : ($query_strand, $hit_strand) = $hsp->strand();
1415 Returns : -1, 0, or 1
1416 : -1 = Minus strand, +1 = Plus strand
1417 : Returns 0 if strand is not defined, which occurs
1418 : for BLASTP reports, and the query of TBLASTN
1419 : as well as the hit if BLASTX reports.
1420 : In scalar context without arguments, returns queryStrand value.
1421 : In array context without arguments, returns a two-element list
1422 : of strings (queryStrand, sbjctStrand).
1423 : Array context can be "induced" by providing an argument of 'list' or 'array'.
1424 Argument : seq_type: 'query' or 'hit' or 'sbjct' or undef
1425 : ('sbjct' is synonymous with 'hit')
1426 Throws : n/a
1428 See Also : L</_set_seq>, L</_set_match_stats>
1430 =cut
1432 #-----------
1433 sub strand {
1434 #-----------
1435 my( $self, $seqType ) = @_;
1437 $seqType ||= (wantarray ? 'list' : 'query');
1438 $seqType = 'sbjct' if $seqType eq 'hit';
1440 ## Sensitive to member name format.
1441 $seqType = "_\L$seqType\E";
1443 # $seqType could be '_list'.
1444 $self->{'_queryStrand'} or $self->_set_seq_data() unless $self->{'_set_seq_data'};
1446 my $prog = $self->algorithm;
1448 if($seqType =~ /list|array/i) {
1449 my ($qstr, $hstr);
1450 if( $prog eq 'BLASTP') {
1451 $qstr = 0;
1452 $hstr = 0;
1454 elsif( $prog eq 'TBLASTN') {
1455 $qstr = 0;
1456 $hstr = $STRAND_SYMBOL{$self->{'_sbjctStrand'}};
1458 elsif( $prog eq 'BLASTX') {
1459 $qstr = $STRAND_SYMBOL{$self->{'_queryStrand'}};
1460 $hstr = 0;
1462 else {
1463 $qstr = $STRAND_SYMBOL{$self->{'_queryStrand'}} if defined $self->{'_queryStrand'};
1464 $hstr = $STRAND_SYMBOL{$self->{'_sbjctStrand'}} if defined $self->{'_sbjctStrand'};
1466 $qstr ||= 0;
1467 $hstr ||= 0;
1468 return ($qstr, $hstr);
1470 local $^W = 0;
1471 $STRAND_SYMBOL{$self->{$seqType.'Strand'}} || 0;
1475 =head2 seq
1477 Usage : $hsp->seq( [seq_type] );
1478 Purpose : Get the query or sbjct sequence as a Bio::Seq.pm object.
1479 Example : $seqObj = $hsp->seq('query');
1480 Returns : Object reference for a Bio::Seq.pm object.
1481 Argument : seq_type = 'query' or 'hit' or 'sbjct' (default = 'query').
1482 : ('sbjct' is synonymous with 'hit')
1483 Throws : Propagates any exception that occurs during construction
1484 : of the Bio::Seq.pm object.
1485 Comments : The sequence is returned in an array of strings corresponding
1486 : to the strings in the original format of the Blast alignment.
1487 : (i.e., same spacing).
1489 See Also : L</seq_str>, L</seq_inds>, L<Bio::Seq>
1491 =cut
1493 #-------
1494 sub seq {
1495 #-------
1496 my($self,$seqType) = @_;
1497 $seqType ||= 'query';
1498 $seqType = 'sbjct' if $seqType eq 'hit';
1499 my $str = $self->seq_str($seqType);
1501 require Bio::Seq;
1503 Bio::Seq->new(-ID => $self->to_string,
1504 -SEQ => $str,
1505 -DESC => "$seqType sequence",
1509 =head2 seq_str
1511 Usage : $hsp->seq_str( seq_type );
1512 Purpose : Get the full query, sbjct, or 'match' sequence as a string.
1513 : The 'match' sequence is the string of symbols in between the
1514 : query and sbjct sequences.
1515 Example : $str = $hsp->seq_str('query');
1516 Returns : String
1517 Argument : seq_Type = 'query' or 'hit' or 'sbjct' or 'match'
1518 : ('sbjct' is synonymous with 'hit')
1519 Throws : Exception if the argument does not match an accepted seq_type.
1520 Comments : Calls _set_seq_data() to set the 'match' sequence if it has
1521 : not been set already.
1523 See Also : L</seq>, L</seq_inds>, L</_set_match_seq>
1525 =cut
1527 #------------
1528 sub seq_str {
1529 #------------
1530 my($self,$seqType) = @_;
1532 $seqType ||= 'query';
1533 $seqType = 'sbjct' if $seqType eq 'hit';
1534 ## Sensitive to member name changes.
1535 $seqType = "_\L$seqType\E";
1537 $self->_set_seq_data() unless $self->{'_set_seq_data'};
1539 if($seqType =~ /sbjct|query/) {
1540 my $seq = join('',@{$self->{$seqType.'Seq'}});
1541 $seq =~ s/\s+//g;
1542 return $seq;
1544 } elsif( $seqType =~ /match/i) {
1545 # Only need to call _set_match_seq() if the match seq is requested.
1546 my $aref = $self->_set_match_seq() unless ref $self->{'_matchSeq'};
1547 $aref = $self->{'_matchSeq'};
1549 return join('',@$aref);
1551 } else {
1552 my $id_str = $self->_id_str;
1553 $self->throw(-class => 'Bio::Root::BadParameter',
1554 -text => "Invalid or undefined sequence type: $seqType ($id_str)\n" .
1555 "Valid types: query, sbjct, match",
1556 -value => $seqType);
1560 =head2 seq_inds
1562 Usage : $hsp->seq_inds( seq_type, class, collapse );
1563 Purpose : Get a list of residue positions (indices) for all identical
1564 : or conserved residues in the query or sbjct sequence.
1565 Example : @s_ind = $hsp->seq_inds('query', 'identical');
1566 : @h_ind = $hsp->seq_inds('hit', 'conserved');
1567 : @h_ind = $hsp->seq_inds('hit', 'conserved', 1);
1568 Returns : List of integers
1569 : May include ranges if collapse is true.
1570 Argument : seq_type = 'query' or 'hit' or 'sbjct' (default = query)
1571 : ('sbjct' is synonymous with 'hit')
1572 : class = 'identical' or 'conserved' (default = identical)
1573 : (can be shortened to 'id' or 'cons')
1574 : (actually, anything not 'id' will evaluate to 'conserved').
1575 : collapse = boolean, if true, consecutive positions are merged
1576 : using a range notation, e.g., "1 2 3 4 5 7 9 10 11"
1577 : collapses to "1-5 7 9-11". This is useful for
1578 : consolidating long lists. Default = no collapse.
1579 Throws : n/a.
1580 Comments : Calls _set_residues() to set the 'match' sequence if it has
1581 : not been set already.
1583 See Also : L</seq>, L</_set_residues>, L<Bio::Search::BlastUtils::collapse_nums()|Bio::Search::BlastUtils>, L<Bio::Search::Hit::BlastHit::seq_inds()|Bio::Search::Hit::BlastHit>
1585 =cut
1587 #---------------
1588 sub seq_inds {
1589 #---------------
1590 my ($self, $seqType, $class, $collapse) = @_;
1592 $seqType ||= 'query';
1593 $class ||= 'identical';
1594 $collapse ||= 0;
1595 $seqType = 'sbjct' if $seqType eq 'hit';
1597 $self->_set_residues() unless defined $self->{'_identicalRes_query'};
1599 $seqType = ($seqType !~ /^q/i ? 'sbjct' : 'query');
1600 $class = ($class !~ /^id/i ? 'conserved' : 'identical');
1602 ## Sensitive to member name changes.
1603 $seqType = "_\L$seqType\E";
1604 $class = "_\L$class\E";
1606 my @ary = sort { $a <=> $b } keys %{ $self->{"${class}Res$seqType"}};
1608 require Bio::Search::BlastUtils if $collapse;
1610 return $collapse ? &Bio::Search::BlastUtils::collapse_nums(@ary) : @ary;
1614 =head2 get_aln
1616 Usage : $hsp->get_aln()
1617 Purpose : Get a Bio::SimpleAlign object constructed from the query + sbjct
1618 : sequences of the present HSP object.
1619 Example : $aln_obj = $hsp->get_aln();
1620 Returns : Object reference for a Bio::SimpleAlign.pm object.
1621 Argument : n/a.
1622 Throws : Propagates any exception ocurring during the construction of
1623 : the Bio::SimpleAlign object.
1624 Comments : Requires Bio::SimpleAlign.
1625 : The Bio::SimpleAlign object is constructed from the query + sbjct
1626 : sequence objects obtained by calling seq().
1627 : Gap residues are included (see $GAP_SYMBOL).
1629 See Also : L</seq>, L<Bio::SimpleAlign>
1631 =cut
1633 #------------
1634 sub get_aln {
1635 #------------
1636 my $self = shift;
1638 require Bio::SimpleAlign;
1639 require Bio::LocatableSeq;
1640 my $qseq = $self->seq('query');
1641 my $sseq = $self->seq('sbjct');
1643 my $type = $self->algorithm =~ /P$|^T/ ? 'amino' : 'dna';
1644 my $aln = Bio::SimpleAlign->new();
1645 $aln->add_seq(Bio::LocatableSeq->new(-seq => $qseq->seq(),
1646 -id => 'query_'.$qseq->display_id(),
1647 -start => 1,
1648 -end => CORE::length($qseq)));
1650 $aln->add_seq(Bio::LocatableSeq->new(-seq => $sseq->seq(),
1651 -id => 'hit_'.$sseq->display_id(),
1652 -start => 1,
1653 -end => CORE::length($sseq)));
1655 return $aln;
1660 __END__
1663 =head1 FOR DEVELOPERS ONLY
1665 =head2 Data Members
1667 Information about the various data members of this module is provided for those
1668 wishing to modify or understand the code. Two things to bear in mind:
1670 =over 4
1672 =item 1 Do NOT rely on these in any code outside of this module.
1674 All data members are prefixed with an underscore to signify that they are private.
1675 Always use accessor methods. If the accessor doesn't exist or is inadequate,
1676 create or modify an accessor (and let me know, too!).
1678 =item 2 This documentation may be incomplete and out of date.
1680 It is easy for these data member descriptions to become obsolete as
1681 this module is still evolving. Always double check this info and search
1682 for members not described here.
1684 =back
1686 An instance of Bio::Search::HSP::BlastHSP.pm is a blessed reference to a hash containing
1687 all or some of the following fields:
1689 FIELD VALUE
1690 --------------------------------------------------------------
1691 (member names are mostly self-explanatory)
1693 _score :
1694 _bits :
1695 _p :
1696 _n : Integer. The 'N' value listed in parenthesis with P/Expect value:
1697 : e.g., P(3) = 1.2e-30 ---> (N = 3).
1698 : Not defined in NCBI Blast2 with gaps.
1699 : To obtain the number of HSPs, use Bio::Search::Hit::BlastHit::num_hsps().
1700 _expect :
1701 _queryLength :
1702 _queryGaps :
1703 _queryStart :
1704 _queryStop :
1705 _querySeq :
1706 _sbjctLength :
1707 _sbjctGaps :
1708 _sbjctStart :
1709 _sbjctStop :
1710 _sbjctSeq :
1711 _matchSeq : String. Contains the symbols between the query and sbjct lines
1712 which indicate identical (letter) and conserved ('+') matches
1713 or a mismatch (' ').
1714 _numIdentical :
1715 _numConserved :
1716 _identicalRes_query :
1717 _identicalRes_sbjct :
1718 _conservedRes_query :
1719 _conservedRes_sbjct :
1720 _match_indent : The number of leading space characters on each line containing
1721 the match symbols. _match_indent is 13 in this example:
1722 Query: 285 QNSAPWGLARISHRERLNLGSFNKYLYDDDAG
1723 Q +APWGLARIS G+ + Y YD+ AG
1724 ^^^^^^^^^^^^^
1727 =cut