t/AlignIO/AlignIO.t: fix number of tests in plan (fixup c523e6bed866)
[bioperl-live.git] / Bio / Search / HSP / PsiBlastHSP.pm
blob2ebaabe655b9011824f6cd09f0603be239cafb84
1 #-----------------------------------------------------------------
3 # BioPerl module Bio::Search::HSP::PsiBlastHSP
5 # (This module was originally called Bio::Tools::Blast::HSP)
7 # Please direct questions and support issues to <bioperl-l@bioperl.org>
9 # Cared for by Steve Chervitz <sac@bioperl.org>
11 # You may distribute this module under the same terms as perl itself
12 #-----------------------------------------------------------------
14 ## POD Documentation:
16 =head1 NAME
18 Bio::Search::HSP::PsiBlastHSP - Bioperl BLAST High-Scoring Pair object
20 =head1 SYNOPSIS
22 See L<Bio::Search::Hit::BlastHit>.
24 =head1 DESCRIPTION
26 A Bio::Search::HSP::PsiBlastHSP object provides an interface to data
27 obtained in a single alignment section of a Blast report (known as a
28 "High-scoring Segment Pair"). This is essentially a pairwise
29 alignment with score information.
31 PsiBlastHSP objects are accessed via L<Bio::Search::Hit::BlastHit>
32 objects after parsing a BLAST report using the L<Bio::SearchIO>
33 system.
35 The construction of PsiBlastHSP objects is performed by
36 Bio::Factory::BlastHitFactory in a process that is
37 orchestrated by the Blast parser (L<Bio::SearchIO::psiblast>).
38 The resulting PsiBlastHSPs are then accessed via
39 L<Bio::Search::Hit::BlastHit>). Therefore, you do not need to
40 use L<Bio::Search::HSP::PsiBlastHSP>) directly. If you need to construct
41 PsiBlastHSPs directly, see the new() function for details.
43 For L<Bio::SearchIO> BLAST parsing usage examples, see the
44 C<examples/search-blast> directory of the Bioperl distribution.
47 =head2 Start and End coordinates
49 Sequence endpoints are swapped so that start is always less than
50 end. This affects For TBLASTN/X hits on the minus strand. Strand
51 information can be recovered using the strand() method. This
52 normalization step is standard Bioperl practice. It also facilitates
53 use of range information by methods such as match().
55 =over 1
57 =item * Supports BLAST versions 1.x and 2.x, gapped and ungapped.
59 =back
61 Bio::Search::HSP::PsiBlastHSP.pm has the ability to extract a list of all
62 residue indices for identical and conservative matches along both
63 query and sbjct sequences. Since this degree of detail is not always
64 needed, this behavior does not occur during construction of the PsiBlastHSP
65 object. These data will automatically be collected as necessary as
66 the PsiBlastHSP.pm object is used.
68 =head1 DEPENDENCIES
70 Bio::Search::HSP::PsiBlastHSP.pm is a concrete class that inherits from
71 L<Bio::SeqFeature::SimilarityPair> and L<Bio::Search::HSP::HSPI>.
72 L<Bio::Seq> and L<Bio::SimpleAlign> are employed for creating
73 sequence and alignment objects, respectively.
75 =head2 Relationship to L<Bio::SimpleAlign> and L<Bio::Seq>
77 PsiBlastHSP.pm can provide the query or sbjct sequence as a L<Bio::Seq>
78 object via the L<seq()|seq> method. The PsiBlastHSP.pm object can also create a
79 two-sequence L<Bio::SimpleAlign> alignment object using the the query
80 and sbjct sequences via the L<get_aln()|get_aln> method. Creation of alignment
81 objects is not automatic when constructing the PsiBlastHSP.pm object since
82 this level of functionality is not always required and would generate
83 a lot of extra overhead when crunching many reports.
86 =head1 FEEDBACK
88 =head2 Mailing Lists
90 User feedback is an integral part of the evolution of this and other
91 Bioperl modules. Send your comments and suggestions preferably to one
92 of the Bioperl mailing lists. Your participation is much appreciated.
94 bioperl-l@bioperl.org - General discussion
95 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
97 =head2 Support
99 Please direct usage questions or support issues to the mailing list:
101 I<bioperl-l@bioperl.org>
103 rather than to the module maintainer directly. Many experienced and
104 reponsive experts will be able look at the problem and quickly
105 address it. Please include a thorough description of the problem
106 with code and data examples if at all possible.
108 =head2 Reporting Bugs
110 Report bugs to the Bioperl bug tracking system to help us keep track
111 the bugs and their resolution. Bug reports can be submitted via the
112 web:
114 https://github.com/bioperl/bioperl-live/issues
116 =head1 AUTHOR
118 Steve Chervitz E<lt>sac-at-bioperl.orgE<gt>
120 See L<the FEEDBACK section | FEEDBACK> for where to send bug reports and comments.
122 =head1 ACKNOWLEDGEMENTS
124 This software was originally developed in the Department of Genetics
125 at Stanford University. I would also like to acknowledge my
126 colleagues at Affymetrix for useful feedback.
128 =head1 SEE ALSO
130 Bio::Search::Hit::BlastHit.pm - Blast hit object.
131 Bio::Search::Result::BlastResult.pm - Blast Result object.
132 Bio::Seq.pm - Biosequence object
134 =head2 Links:
136 http://bio.perl.org/ - Bioperl Project Homepage
138 =head1 COPYRIGHT
140 Copyright (c) 1996-2001 Steve Chervitz. All Rights Reserved.
142 =head1 DISCLAIMER
144 This software is provided "as is" without warranty of any kind.
146 =cut
149 # END of main POD documentation.
151 =head1 APPENDIX
153 The rest of the documentation details each of the object methods.
154 Internal methods are usually preceded with a _
156 =cut
158 # Let the code begin...
160 package Bio::Search::HSP::PsiBlastHSP;
162 use strict;
163 use Bio::SeqFeature::Similarity;
165 use vars qw($GAP_SYMBOL %STRAND_SYMBOL);
167 use overload
168 '""' => \&to_string;
170 use base qw(Bio::SeqFeature::SimilarityPair Bio::Search::HSP::HSPI);
172 $GAP_SYMBOL = '-'; # Need a more general way to handle gap symbols.
173 %STRAND_SYMBOL = ('Plus' => 1, 'Minus' => -1 );
176 =head2 new
178 Usage : $hsp = Bio::Search::HSP::PsiBlastHSP->new( %named_params );
179 : Bio::Search::HSP::PsiBlastHSP.pm objects are constructed
180 : automatically by Bio::SearchIO::BlastHitFactory.pm,
181 : so there is no need for direct instantiation.
182 Purpose : Constructs a new PsiBlastHSP object and Initializes key variables
183 : for the HSP.
184 Returns : A Bio::Search::HSP::PsiBlastHSP object
185 Argument : Named parameters:
186 : Parameter keys are case-insensitive.
187 : -RAW_DATA => array ref containing raw BLAST report data for
188 : for a single HSP. This includes all lines
189 : of the HSP alignment from a traditional BLAST
190 or PSI-BLAST (non-XML) report,
191 : -RANK => integer (1..n).
192 : -PROGRAM => string ('TBLASTN', 'BLASTP', etc.).
193 : -QUERY_NAME => string, id of query sequence
194 : -HIT_NAME => string, id of hit sequence
196 Comments : Having the raw data allows this object to do lazy parsing of
197 : the raw HSP data (i.e., not parsed until needed).
199 : Note that there is a fair amount of basic parsing that is
200 : currently performed in this module that would be more appropriate
201 : to do within a separate factory object.
202 : This parsing code will likely be relocated and more initialization
203 : parameters will be added to new().
205 See Also : L<Bio::SeqFeature::SimilarityPair::new()>, L<Bio::SeqFeature::Similarity::new()>
207 =cut
209 #----------------
210 sub new {
211 #----------------
212 my ($class, @args ) = @_;
214 my $self = $class->SUPER::new( @args );
215 # Initialize placeholders
216 $self->{'_queryGaps'} = $self->{'_sbjctGaps'} = 0;
217 my ($raw_data, $qname, $hname, $qlen, $hlen);
219 ($self->{'_prog'}, $self->{'_rank'}, $raw_data,
220 $qname, $hname) =
221 $self->_rearrange([qw( PROGRAM
222 RANK
223 RAW_DATA
224 QUERY_NAME
225 HIT_NAME
226 )], @args );
228 # _set_data() does a fair amount of parsing.
229 # This will likely change (see comment above.)
230 $self->_set_data( @{$raw_data} );
231 # Store the aligned query as sequence feature
232 my ($qb, $hb) = ($self->start());
233 my ($qe, $he) = ($self->end());
234 my ($qs, $hs) = ($self->strand());
235 my ($qf,$hf) = ($self->query->frame(),
236 $self->hit->frame);
238 $self->query( Bio::SeqFeature::Similarity->new (-start =>$qb,
239 -end =>$qe,
240 -strand =>$qs,
241 -bits =>$self->bits,
242 -score =>$self->score,
243 -frame =>$qf,
244 -seq_id => $qname,
245 -source =>$self->{'_prog'} ));
247 $self->hit( Bio::SeqFeature::Similarity->new (-start =>$hb,
248 -end =>$he,
249 -strand =>$hs,
250 -bits =>$self->bits,
251 -score =>$self->score,
252 -frame =>$hf,
253 -seq_id => $hname,
254 -source =>$self->{'_prog'} ));
256 # set lengths
257 $self->query->seqlength($qlen); # query
258 $self->hit->seqlength($hlen); # subject
260 $self->query->frac_identical($self->frac_identical('query'));
261 $self->hit->frac_identical($self->frac_identical('hit'));
262 return $self;
265 #sub DESTROY {
266 # my $self = shift;
267 # #print STDERR "--->DESTROYING $self\n";
271 # Title : _id_str;
272 # Purpose : Intended for internal use only to provide a string for use
273 # within exception messages to help users figure out which
274 # query/hit caused the problem.
275 # Returns : Short string with name of query and hit seq
276 sub _id_str {
277 my $self = shift;
278 if( not defined $self->{'_id_str'}) {
279 my $qname = $self->query->seqname;
280 my $hname = $self->hit->seqname;
281 $self->{'_id_str'} = "QUERY=\"$qname\" HIT=\"$hname\"";
283 return $self->{'_id_str'};
286 #=================================================
287 # Begin Bio::Search::HSP::HSPI implementation
288 #=================================================
290 =head2 algorithm
292 Title : algorithm
293 Usage : $alg = $hsp->algorithm();
294 Function: Gets the algorithm specification that was used to obtain the hsp
295 For BLAST, the algorithm denotes what type of sequence was aligned
296 against what (BLASTN: dna-dna, BLASTP prt-prt, BLASTX translated
297 dna-prt, TBLASTN prt-translated dna, TBLASTX translated
298 dna-translated dna).
299 Returns : a scalar string
300 Args : none
302 =cut
304 #----------------
305 sub algorithm {
306 #----------------
307 my ($self,@args) = @_;
308 return $self->{'_prog'};
314 =head2 signif()
316 Usage : $hsp_obj->signif()
317 Purpose : Get the P-value or Expect value for the HSP.
318 Returns : Float (0.001 or 1.3e-43)
319 : Returns P-value if it is defined, otherwise, Expect value.
320 Argument : n/a
321 Throws : n/a
322 Comments : Provided for consistency with BlastHit::signif()
323 : Support for returning the significance data in different
324 : formats (e.g., exponent only), is not provided for HSP objects.
325 : This is only available for the BlastHit or Blast object.
327 See Also : L</p>, L</expect>, L<Bio::Search::Hit::BlastHit::signif()|Bio::Search::Hit::BlastHit>
329 =cut
331 #-----------
332 sub signif {
333 #-----------
334 my $self = shift;
335 my $val ||= defined($self->{'_p'}) ? $self->{'_p'} : $self->{'_expect'};
336 $val;
341 =head2 evalue
343 Usage : $hsp_obj->evalue()
344 Purpose : Get the Expect value for the HSP.
345 Returns : Float (0.001 or 1.3e-43)
346 Argument : n/a
347 Throws : n/a
348 Comments : Support for returning the expectation data in different
349 : formats (e.g., exponent only), is not provided for HSP objects.
350 : This is only available for the BlastHit or Blast object.
352 See Also : L</p>
354 =cut
356 #----------
357 sub evalue { shift->{'_expect'} }
358 #----------
361 =head2 p
363 Usage : $hsp_obj->p()
364 Purpose : Get the P-value for the HSP.
365 Returns : Float (0.001 or 1.3e-43) or undef if not defined.
366 Argument : n/a
367 Throws : n/a
368 Comments : P-value is not defined with NCBI Blast2 reports.
369 : Support for returning the expectation data in different
370 : formats (e.g., exponent only) is not provided for HSP objects.
371 : This is only available for the BlastHit or Blast object.
373 See Also : L</expect>
375 =cut
377 #-----
378 sub p { my $self = shift; $self->{'_p'}; }
379 #-----
381 # alias
382 sub pvalue { shift->p(@_); }
384 =head2 length
386 Usage : $hsp->length( [seq_type] )
387 Purpose : Get the length of the aligned portion of the query or sbjct.
388 Example : $hsp->length('query')
389 Returns : integer
390 Argument : seq_type: 'query' | 'hit' or 'sbjct' | 'total' (default = 'total')
391 ('sbjct' is synonymous with 'hit')
392 Throws : n/a
393 Comments : 'total' length is the full length of the alignment
394 : as reported in the denominators in the alignment section:
395 : "Identical = 34/120 Positives = 67/120".
397 See Also : L</gaps>
399 =cut
401 #-----------
402 sub length {
403 #-----------
404 ## Developer note: when using the built-in length function within
405 ## this module, call it as CORE::length().
406 my( $self, $seqType ) = @_;
407 $seqType ||= 'total';
408 $seqType = 'sbjct' if $seqType eq 'hit';
410 $seqType ne 'total' and $self->_set_seq_data() unless $self->{'_set_seq_data'};
412 ## Sensitive to member name format.
413 $seqType = "_\L$seqType\E";
414 $self->{$seqType.'Length'};
419 =head2 gaps
421 Usage : $hsp->gaps( [seq_type] )
422 Purpose : Get the number of gap characters in the query, sbjct, or total alignment.
423 : Also can return query gap chars and sbjct gap chars as a two-element list
424 : when in array context.
425 Example : $total_gaps = $hsp->gaps();
426 : ($qgaps, $sgaps) = $hsp->gaps();
427 : $qgaps = $hsp->gaps('query');
428 Returns : scalar context: integer
429 : array context without args: (int, int) = ('queryGaps', 'sbjctGaps')
430 Argument : seq_type: 'query' or 'hit' or 'sbjct' or 'total'
431 : ('sbjct' is synonymous with 'hit')
432 : (default = 'total', scalar context)
433 : Array context can be "induced" by providing an argument of 'list' or 'array'.
434 Throws : n/a
436 See Also : L</length>, L</matches>
438 =cut
440 #---------
441 sub gaps {
442 #---------
443 my( $self, $seqType ) = @_;
445 $self->_set_seq_data() unless $self->{'_set_seq_data'};
447 $seqType ||= (wantarray ? 'list' : 'total');
448 $seqType = 'sbjct' if $seqType eq 'hit';
450 if($seqType =~ /list|array/i) {
451 return (($self->{'_queryGaps'} || 0), ($self->{'_sbjctGaps'} || 0));
454 if($seqType eq 'total') {
455 return ($self->{'_queryGaps'} + $self->{'_sbjctGaps'}) || 0;
456 } else {
457 ## Sensitive to member name format.
458 $seqType = "_\L$seqType\E";
459 return $self->{$seqType.'Gaps'} || 0;
464 =head2 frac_identical
466 Usage : $hsp_object->frac_identical( [seq_type] );
467 Purpose : Get the fraction of identical positions within the given HSP.
468 Example : $frac_iden = $hsp_object->frac_identical('query');
469 Returns : Float (2-decimal precision, e.g., 0.75).
470 Argument : seq_type: 'query' or 'hit' or 'sbjct' or 'total'
471 : ('sbjct' is synonymous with 'hit')
472 : default = 'total' (but see comments below).
473 Throws : n/a
474 Comments : Different versions of Blast report different values for the total
475 : length of the alignment. This is the number reported in the
476 : denominators in the stats section:
477 : "Identical = 34/120 Positives = 67/120".
478 : NCBI-BLAST uses the total length of the alignment (with gaps)
479 : WU-BLAST uses the length of the query sequence (without gaps).
480 : Therefore, when called without an argument or an argument of 'total',
481 : this method will report different values depending on the
482 : version of BLAST used.
484 : To get the fraction identical among only the aligned residues,
485 : ignoring the gaps, call this method with an argument of 'query'
486 : or 'sbjct' ('sbjct' is synonymous with 'hit').
488 See Also : L</frac_conserved>, L</num_identical>, L</matches>
490 =cut
492 #-------------------
493 sub frac_identical {
494 #-------------------
495 # The value is calculated as opposed to storing it from the parsed results.
496 # This saves storage and also permits flexibility in determining for which
497 # sequence (query or sbjct) the figure is to be calculated.
499 my( $self, $seqType ) = @_;
500 $seqType ||= 'total';
501 $seqType = 'sbjct' if $seqType eq 'hit';
503 if($seqType ne 'total') {
504 $self->_set_seq_data() unless $self->{'_set_seq_data'};
506 ## Sensitive to member name format.
507 $seqType = "_\L$seqType\E";
509 sprintf( "%.2f", $self->{'_numIdentical'}/$self->{$seqType.'Length'});
513 =head2 frac_conserved
515 Usage : $hsp_object->frac_conserved( [seq_type] );
516 Purpose : Get the fraction of conserved positions within the given HSP.
517 : (Note: 'conservative' positions are called 'positives' in the
518 : Blast report.)
519 Example : $frac_cons = $hsp_object->frac_conserved('query');
520 Returns : Float (2-decimal precision, e.g., 0.75).
521 Argument : seq_type: 'query' or 'hit' or 'sbjct' or 'total'
522 : ('sbjct' is synonymous with 'hit')
523 : default = 'total' (but see comments below).
524 Throws : n/a
525 Comments : Different versions of Blast report different values for the total
526 : length of the alignment. This is the number reported in the
527 : denominators in the stats section:
528 : "Identical = 34/120 Positives = 67/120".
529 : NCBI-BLAST uses the total length of the alignment (with gaps)
530 : WU-BLAST uses the length of the query sequence (without gaps).
531 : Therefore, when called without an argument or an argument of 'total',
532 : this method will report different values depending on the
533 : version of BLAST used.
535 : To get the fraction conserved among only the aligned residues,
536 : ignoring the gaps, call this method with an argument of 'query'
537 : or 'sbjct'.
539 See Also : L</frac_conserved>, L</num_conserved>, L</matches>
541 =cut
543 #--------------------
544 sub frac_conserved {
545 #--------------------
546 # The value is calculated as opposed to storing it from the parsed results.
547 # This saves storage and also permits flexibility in determining for which
548 # sequence (query or sbjct) the figure is to be calculated.
550 my( $self, $seqType ) = @_;
551 $seqType ||= 'total';
552 $seqType = 'sbjct' if $seqType eq 'hit';
554 if($seqType ne 'total') {
555 $self->_set_seq_data() unless $self->{'_set_seq_data'};
558 ## Sensitive to member name format.
559 $seqType = "_\L$seqType\E";
561 sprintf( "%.2f", $self->{'_numConserved'}/$self->{$seqType.'Length'});
564 =head2 query_string
566 Title : query_string
567 Usage : my $qseq = $hsp->query_string;
568 Function: Retrieves the query sequence of this HSP as a string
569 Returns : string
570 Args : none
573 =cut
575 #----------------
576 sub query_string{ shift->seq_str('query'); }
577 #----------------
579 =head2 hit_string
581 Title : hit_string
582 Usage : my $hseq = $hsp->hit_string;
583 Function: Retrieves the hit sequence of this HSP as a string
584 Returns : string
585 Args : none
588 =cut
590 #----------------
591 sub hit_string{ shift->seq_str('hit'); }
592 #----------------
595 =head2 homology_string
597 Title : homology_string
598 Usage : my $homo_string = $hsp->homology_string;
599 Function: Retrieves the homology sequence for this HSP as a string.
600 : The homology sequence is the string of symbols in between the
601 : query and hit sequences in the alignment indicating the degree
602 : of conservation (e.g., identical, similar, not similar).
603 Returns : string
604 Args : none
606 =cut
608 #----------------
609 sub homology_string{ shift->seq_str('match'); }
610 #----------------
612 #=================================================
613 # End Bio::Search::HSP::HSPI implementation
614 #=================================================
616 # Older method delegating to method defined in HSPI.
618 =head2 expect
620 See L<Bio::Search::HSP::HSPI::expect()|Bio::Search::HSP::HSPI>
622 =cut
624 #----------
625 sub expect { shift->evalue( @_ ); }
626 #----------
629 =head2 rank
631 Usage : $hsp->rank( [string] );
632 Purpose : Get the rank of the HSP within a given Blast hit.
633 Example : $rank = $hsp->rank;
634 Returns : Integer (1..n) corresponding to the order in which the HSP
635 appears in the BLAST report.
637 =cut
641 #----------
642 sub rank { shift->{'_rank'} }
643 #----------
645 # For backward compatibility
646 #----------
647 sub name { shift->rank }
648 #----------
650 =head2 to_string
652 Title : to_string
653 Usage : print $hsp->to_string;
654 Function: Returns a string representation for the Blast HSP.
655 Primarily intended for debugging purposes.
656 Example : see usage
657 Returns : A string of the form:
658 [PsiBlastHSP] <rank>
659 e.g.:
660 [BlastHit] 1
661 Args : None
663 =cut
665 #----------
666 sub to_string {
667 #----------
668 my $self = shift;
669 return "[PsiBlastHSP] " . $self->rank();
673 =head2 _set_data
675 Usage : called automatically during object construction.
676 Purpose : Parses the raw HSP section from a flat BLAST report and
677 sets the query sequence, sbjct sequence, and the "match" data
678 : which consists of the symbols between the query and sbjct lines
679 : in the alignment.
680 Argument : Array (all lines for a single, complete HSP, from a raw,
681 flat (i.e., non-XML) BLAST report)
682 Throws : Propagates any exceptions from the methods called ("See Also")
684 See Also : L</_set_seq>, L</_set_score_stats>, L</_set_match_stats>
686 =cut
688 #--------------
689 sub _set_data {
690 #--------------
691 my $self = shift;
692 my @data = @_;
693 my @queryList = (); # 'Query' = SEQUENCE USED TO QUERY THE DATABASE.
694 my @sbjctList = (); # 'Sbjct' = HOMOLOGOUS SEQUENCE FOUND IN THE DATABASE.
695 my @matchList = ();
696 my $matchLine = 0; # Alternating boolean: when true, load 'match' data.
697 my @linedat = ();
699 #print STDERR "PsiBlastHSP: set_data()\n";
701 my($line, $aln_row_len, $length_diff);
702 $length_diff = 0;
704 # Collecting data for all lines in the alignment
705 # and then storing the collections for possible processing later.
707 # Note that "match" lines may not be properly padded with spaces.
708 # This loop now properly handles such cases:
709 # Query: 1141 PSLVELTIRDCPRLEVGPMIRSLPKFPMLKKLDLAVANIIEEDLDVIGSLEELVIXXXXX 1200
710 # PSLVELTIRDCPRLEVGPMIRSLPKFPMLKKLDLAVANIIEEDLDVIGSLEELVI
711 # Sbjct: 1141 PSLVELTIRDCPRLEVGPMIRSLPKFPMLKKLDLAVANIIEEDLDVIGSLEELVILSLKL 1200
713 foreach $line( @data ) {
714 next if $line =~ /^\s*$/;
716 if( $line =~ /^ ?Score/ ) {
717 $self->_set_score_stats( $line );
718 } elsif( $line =~ /^ ?(Identities|Positives|Strand)/ ) {
719 $self->_set_match_stats( $line );
720 } elsif( $line =~ /^ ?Frame = ([\d+-]+)/ ) {
721 # Version 2.0.8 has Frame information on a separate line.
722 # Storing frame according to SeqFeature::Generic::frame()
723 # which does not contain strand info (use strand()).
724 my $frame = abs($1) - 1;
725 $self->frame( $frame );
726 } elsif( $line =~ /^(Query:?[\s\d]+)([^\s\d]+)/ ) {
727 push @queryList, $line;
728 $self->{'_match_indent'} = CORE::length $1;
729 $aln_row_len = (CORE::length $1) + (CORE::length $2);
730 $matchLine = 1;
731 } elsif( $matchLine ) {
732 # Pad the match line with spaces if necessary.
733 $length_diff = $aln_row_len - CORE::length $line;
734 $length_diff and $line .= ' 'x $length_diff;
735 push @matchList, $line;
736 $matchLine = 0;
737 } elsif( $line =~ /^Sbjct/ ) {
738 push @sbjctList, $line;
741 # Storing the query and sbjct lists in case they are needed later.
742 # We could make this conditional to save memory.
743 $self->{'_queryList'} = \@queryList;
744 $self->{'_sbjctList'} = \@sbjctList;
746 # Storing the match list in case it is needed later.
747 $self->{'_matchList'} = \@matchList;
749 if(not defined ($self->{'_numIdentical'})) {
750 my $id_str = $self->_id_str;
751 $self->throw( -text => "Can't parse match statistics. Possibly a new or unrecognized Blast format. ($id_str)");
754 if(!scalar @queryList or !scalar @sbjctList) {
755 my $id_str = $self->_id_str;
756 $self->throw( "Can't find query or sbjct alignment lines. Possibly unrecognized Blast format. ($id_str)");
761 =head2 _set_score_stats
763 Usage : called automatically by _set_data()
764 Purpose : Sets various score statistics obtained from the HSP listing.
765 Argument : String with any of the following formats:
766 : blast2: Score = 30.1 bits (66), Expect = 9.2
767 : blast2: Score = 158.2 bits (544), Expect(2) = e-110
768 : blast1: Score = 410 (144.3 bits), Expect = 1.7e-40, P = 1.7e-40
769 : blast1: Score = 55 (19.4 bits), Expect = 5.3, Sum P(3) = 0.99
770 Throws : Exception if the stats cannot be parsed, probably due to a change
771 : in the Blast report format.
773 See Also : L</_set_data>
775 =cut
777 #--------------------
778 sub _set_score_stats {
779 #--------------------
780 my ($self, $data) = @_;
782 my ($expect, $p);
784 if($data =~ /Score = +([\d.e+-]+) bits \(([\d.e+-]+)\), +Expect = +([\d.e+-]+)/) {
785 # blast2 format n = 1
786 $self->bits($1);
787 $self->score($2);
788 $expect = $3;
789 } elsif($data =~ /Score = +([\d.e+-]+) bits \(([\d.e+-]+)\), +Expect\((\d+)\) = +([\d.e+-]+)/) {
790 # blast2 format n > 1
791 $self->bits($1);
792 $self->score($2);
793 $self->{'_n'} = $3;
794 $expect = $4;
796 } elsif($data =~ /Score = +([\d.e+-]+) \(([\d.e+-]+) bits\), +Expect = +([\d.e+-]+), P = +([\d.e-]+)/) {
797 # blast1 format, n = 1
798 $self->score($1);
799 $self->bits($2);
800 $expect = $3;
801 $p = $4;
803 } elsif($data =~ /Score = +([\d.e+-]+) \(([\d.e+-]+) bits\), +Expect = +([\d.e+-]+), +Sum P\((\d+)\) = +([\d.e-]+)/) {
804 # blast1 format, n > 1
805 $self->score($1);
806 $self->bits($2);
807 $expect = $3;
808 $self->{'_n'} = $4;
809 $p = $5;
811 } else {
812 my $id_str = $self->_id_str;
813 $self->throw(-class => 'Bio::Root::Exception',
814 -text => "Can't parse score statistics: unrecognized format. ($id_str)",
815 -value => $data);
817 $expect = "1$expect" if $expect =~ /^e/i;
818 $p = "1$p" if defined $p and $p=~ /^e/i;
820 $self->{'_expect'} = $expect;
821 $self->{'_p'} = $p || undef;
822 $self->significance( $p || $expect );
826 =head2 _set_match_stats
828 Usage : Private method; called automatically by _set_data()
829 Purpose : Sets various matching statistics obtained from the HSP listing.
830 Argument : blast2: Identities = 23/74 (31%), Positives = 29/74 (39%), Gaps = 17/74 (22%)
831 : blast2: Identities = 57/98 (58%), Positives = 74/98 (75%)
832 : blast1: Identities = 87/204 (42%), Positives = 126/204 (61%)
833 : blast1: Identities = 87/204 (42%), Positives = 126/204 (61%), Frame = -3
834 : WU-blast: Identities = 310/553 (56%), Positives = 310/553 (56%), Strand = Minus / Plus
835 Throws : Exception if the stats cannot be parsed, probably due to a change
836 : in the Blast report format.
837 Comments : The "Gaps = " data in the HSP header has a different meaning depending
838 : on the type of Blast: for BLASTP, this number is the total number of
839 : gaps in query+sbjct; for TBLASTN, it is the number of gaps in the
840 : query sequence only. Thus, it is safer to collect the data
841 : separately by examining the actual sequence strings as is done
842 : in _set_seq().
844 See Also : L</_set_data>, L</_set_seq>
846 =cut
848 #--------------------
849 sub _set_match_stats {
850 #--------------------
851 my ($self, $data) = @_;
853 if($data =~ m!Identities = (\d+)/(\d+)!) {
854 # blast1 or 2 format
855 $self->{'_numIdentical'} = $1;
856 $self->{'_totalLength'} = $2;
859 if($data =~ m!Positives = (\d+)/(\d+)!) {
860 # blast1 or 2 format
861 $self->{'_numConserved'} = $1;
862 $self->{'_totalLength'} = $2;
865 if($data =~ m!Frame = ([\d+-]+)!) {
866 $self->frame($1);
869 # Strand data is not always present in this line.
870 # _set_seq() will also set strand information.
871 if($data =~ m!Strand = (\w+) / (\w+)!) {
872 $self->{'_queryStrand'} = $1;
873 $self->{'_sbjctStrand'} = $2;
876 # if($data =~ m!Gaps = (\d+)/(\d+)!) {
877 # $self->{'_totalGaps'} = $1;
878 # } else {
879 # $self->{'_totalGaps'} = 0;
885 =head2 _set_seq_data
887 Usage : called automatically when sequence data is requested.
888 Purpose : Sets the HSP sequence data for both query and sbjct sequences.
889 : Includes: start, stop, length, gaps, and raw sequence.
890 Argument : n/a
891 Throws : Propagates any exception thrown by _set_match_seq()
892 Comments : Uses raw data stored by _set_data() during object construction.
893 : These data are not always needed, so it is conditionally
894 : executed only upon demand by methods such as gaps(), _set_residues(),
895 : etc. _set_seq() does the dirty work.
897 See Also : L</_set_seq>
899 =cut
901 #-----------------
902 sub _set_seq_data {
903 #-----------------
904 my $self = shift;
906 $self->_set_seq('query', @{$self->{'_queryList'}});
907 $self->_set_seq('sbjct', @{$self->{'_sbjctList'}});
909 # Liberate some memory.
910 @{$self->{'_queryList'}} = @{$self->{'_sbjctList'}} = ();
911 undef $self->{'_queryList'};
912 undef $self->{'_sbjctList'};
914 $self->{'_set_seq_data'} = 1;
919 =head2 _set_seq
921 Usage : called automatically by _set_seq_data()
922 : $hsp_obj->($seq_type, @data);
923 Purpose : Sets sequence information for both the query and sbjct sequences.
924 : Directly counts the number of gaps in each sequence (if gapped Blast).
925 Argument : $seq_type = 'query' or 'sbjct'
926 : @data = all seq lines with the form:
927 : Query: 61 SPHNVKDRKEQNGSINNAISPTATANTSGSQQINIDSALRDRSSNVAAQPSLSDASSGSN 120
928 Throws : Exception if data strings cannot be parsed, probably due to a change
929 : in the Blast report format.
930 Comments : Uses first argument to determine which data members to set
931 : making this method sensitive data member name changes.
932 : Behavior is dependent on the type of BLAST analysis (TBLASTN, BLASTP, etc).
933 Warning : Sequence endpoints are normalized so that start < end. This affects HSPs
934 : for TBLASTN/X hits on the minus strand. Normalization facilitates use
935 : of range information by methods such as match().
937 See Also : L</_set_seq_data>, L</matches>, L</range>, L</start>, L</end>
939 =cut
941 #-------------
942 sub _set_seq {
943 #-------------
944 my $self = shift;
945 my $seqType = shift;
946 my @data = @_;
947 my @ranges = ();
948 my @sequence = ();
949 my $numGaps = 0;
951 foreach( @data ) {
952 if( m/(\d+) *([^\d\s]+) *(\d+)/) {
953 push @ranges, ( $1, $3 ) ;
954 push @sequence, $2;
955 #print STDERR "_set_seq found sequence \"$2\"\n";
956 } else {
957 $self->warn("Bad sequence data: $_");
961 if( !(scalar(@sequence) and scalar(@ranges))) {
962 my $id_str = $self->_id_str;
963 $self->throw("Can't set sequence: missing data. Possibly unrecognized Blast format. ($id_str)");
966 # Sensitive to member name changes.
967 $seqType = "_\L$seqType\E";
968 $self->{$seqType.'Start'} = $ranges[0];
969 $self->{$seqType.'Stop'} = $ranges[ $#ranges ];
970 $self->{$seqType.'Seq'} = \@sequence;
972 $self->{$seqType.'Length'} = abs($ranges[ $#ranges ] - $ranges[0]) + 1;
974 # Adjust lengths for BLASTX, TBLASTN, TBLASTX sequences
975 # Converting nucl coords to amino acid coords.
977 my $prog = $self->algorithm;
978 if($prog eq 'TBLASTN' and $seqType eq '_sbjct') {
979 $self->{$seqType.'Length'} /= 3;
980 } elsif($prog eq 'BLASTX' and $seqType eq '_query') {
981 $self->{$seqType.'Length'} /= 3;
982 } elsif($prog eq 'TBLASTX') {
983 $self->{$seqType.'Length'} /= 3;
986 if( $prog ne 'BLASTP' ) {
987 $self->{$seqType.'Strand'} = 'Plus' if $prog =~ /BLASTN/;
988 $self->{$seqType.'Strand'} = 'Plus' if ($prog =~ /BLASTX/ and $seqType eq '_query');
989 # Normalize sequence endpoints so that start < end.
990 # Reverse complement or 'minus strand' HSPs get flipped here.
991 if($self->{$seqType.'Start'} > $self->{$seqType.'Stop'}) {
992 ($self->{$seqType.'Start'}, $self->{$seqType.'Stop'}) =
993 ($self->{$seqType.'Stop'}, $self->{$seqType.'Start'});
994 $self->{$seqType.'Strand'} = 'Minus';
998 ## Count number of gaps in each seq. Only need to do this for gapped Blasts.
999 # if($self->{'_gapped'}) {
1000 my $seqstr = join('', @sequence);
1001 $seqstr =~ s/\s//g;
1002 my $num_gaps = CORE::length($seqstr) - $self->{$seqType.'Length'};
1003 $self->{$seqType.'Gaps'} = $num_gaps if $num_gaps > 0;
1008 =head2 _set_residues
1010 Usage : called automatically when residue data is requested.
1011 Purpose : Sets the residue numbers representing the identical and
1012 : conserved positions. These data are obtained by analyzing the
1013 : symbols between query and sbjct lines of the alignments.
1014 Argument : n/a
1015 Throws : Propagates any exception thrown by _set_seq_data() and _set_match_seq().
1016 Comments : These data are not always needed, so it is conditionally
1017 : executed only upon demand by methods such as seq_inds().
1018 : Behavior is dependent on the type of BLAST analysis (TBLASTN, BLASTP, etc).
1020 See Also : L</_set_seq_data>, L</_set_match_seq>, L</seq_inds>
1022 =cut
1024 #------------------
1025 sub _set_residues {
1026 #------------------
1027 my $self = shift;
1028 my @sequence = ();
1030 $self->_set_seq_data() unless $self->{'_set_seq_data'};
1032 # Using hashes to avoid saving duplicate residue numbers.
1033 my %identicalList_query = ();
1034 my %identicalList_sbjct = ();
1035 my %conservedList_query = ();
1036 my %conservedList_sbjct = ();
1038 my $aref = $self->_set_match_seq() if not ref $self->{'_matchSeq'};
1039 $aref ||= $self->{'_matchSeq'};
1040 my $seqString = join('', @$aref );
1042 my $qseq = join('',@{$self->{'_querySeq'}});
1043 my $sseq = join('',@{$self->{'_sbjctSeq'}});
1044 my $resCount_query = $self->{'_queryStop'} || 0;
1045 my $resCount_sbjct = $self->{'_sbjctStop'} || 0;
1047 my $prog = $self->algorithm;
1048 if($prog !~ /^BLASTP|^BLASTN/) {
1049 if($prog eq 'TBLASTN') {
1050 $resCount_sbjct /= 3;
1051 } elsif($prog eq 'BLASTX') {
1052 $resCount_query /= 3;
1053 } elsif($prog eq 'TBLASTX') {
1054 $resCount_query /= 3;
1055 $resCount_sbjct /= 3;
1059 my ($mchar, $schar, $qchar);
1060 while( $mchar = chop($seqString) ) {
1061 ($qchar, $schar) = (chop($qseq), chop($sseq));
1062 if( $mchar eq '+' ) {
1063 $conservedList_query{ $resCount_query } = 1;
1064 $conservedList_sbjct{ $resCount_sbjct } = 1;
1065 } elsif( $mchar ne ' ' ) {
1066 $identicalList_query{ $resCount_query } = 1;
1067 $identicalList_sbjct{ $resCount_sbjct } = 1;
1069 $resCount_query-- if $qchar ne $GAP_SYMBOL;
1070 $resCount_sbjct-- if $schar ne $GAP_SYMBOL;
1072 $self->{'_identicalRes_query'} = \%identicalList_query;
1073 $self->{'_conservedRes_query'} = \%conservedList_query;
1074 $self->{'_identicalRes_sbjct'} = \%identicalList_sbjct;
1075 $self->{'_conservedRes_sbjct'} = \%conservedList_sbjct;
1082 =head2 _set_match_seq
1084 Usage : $hsp_obj->_set_match_seq()
1085 Purpose : Set the 'match' sequence for the current HSP (symbols in between
1086 : the query and sbjct lines.)
1087 Returns : Array reference holding the match sequences lines.
1088 Argument : n/a
1089 Throws : Exception if the _matchList field is not set.
1090 Comments : The match information is not always necessary. This method
1091 : allows it to be conditionally prepared.
1092 : Called by _set_residues>() and seq_str().
1094 See Also : L</_set_residues>, L</seq_str>
1096 =cut
1098 #-------------------
1099 sub _set_match_seq {
1100 #-------------------
1101 my $self = shift;
1103 if( ! ref($self->{'_matchList'}) ) {
1104 my $id_str = $self->_id_str;
1105 $self->throw("Can't set HSP match sequence: No data ($id_str)");
1108 my @data = @{$self->{'_matchList'}};
1110 my(@sequence);
1111 foreach( @data ) {
1112 chomp($_);
1113 ## Remove leading spaces; (note: aln may begin with a space
1114 ## which is why we can't use s/^ +//).
1115 s/^ {$self->{'_match_indent'}}//;
1116 push @sequence, $_;
1118 # Liberate some memory.
1119 @{$self->{'_matchList'}} = undef;
1120 $self->{'_matchList'} = undef;
1122 $self->{'_matchSeq'} = \@sequence;
1124 return $self->{'_matchSeq'};
1128 =head2 n
1130 Usage : $hsp_obj->n()
1131 Purpose : Get the N value (num HSPs on which P/Expect is based).
1132 : This value is not defined with NCBI Blast2 with gapping.
1133 Returns : Integer or null string if not defined.
1134 Argument : n/a
1135 Throws : n/a
1136 Comments : The 'N' value is listed in parenthesis with P/Expect value:
1137 : e.g., P(3) = 1.2e-30 ---> (N = 3).
1138 : Not defined in NCBI Blast2 with gaps.
1139 : This typically is equal to the number of HSPs but not always.
1140 : To obtain the number of HSPs, use Bio::Search::Hit::BlastHit::num_hsps().
1142 See Also : L<Bio::SeqFeature::SimilarityPair::score()|Bio::SeqFeature::SimilarityPair>
1144 =cut
1146 #-----
1147 sub n { my $self = shift; $self->{'_n'} || ''; }
1148 #-----
1151 =head2 matches
1153 Usage : $hsp->matches([seq_type], [start], [stop]);
1154 Purpose : Get the total number of identical and conservative matches
1155 : in the query or sbjct sequence for the given HSP. Optionally can
1156 : report data within a defined interval along the seq.
1157 : (Note: 'conservative' matches are called 'positives' in the
1158 : Blast report.)
1159 Example : ($id,$cons) = $hsp_object->matches('hit');
1160 : ($id,$cons) = $hsp_object->matches('query',300,400);
1161 Returns : 2-element array of integers
1162 Argument : (1) seq_type = 'query' or 'hit' or 'sbjct' (default = query)
1163 : ('sbjct' is synonymous with 'hit')
1164 : (2) start = Starting coordinate (optional)
1165 : (3) stop = Ending coordinate (optional)
1166 Throws : Exception if the supplied coordinates are out of range.
1167 Comments : Relies on seq_str('match') to get the string of alignment symbols
1168 : between the query and sbjct lines which are used for determining
1169 : the number of identical and conservative matches.
1171 See Also : L</length>, L</gaps>, L</seq_str>, L<Bio::Search::Hit::BlastHit::_adjust_contigs()|Bio::Search::Hit::BlastHit>
1173 =cut
1175 #-----------
1176 sub matches {
1177 #-----------
1178 my( $self, %param ) = @_;
1179 my(@data);
1180 my($seqType, $beg, $end) = ($param{-SEQ}, $param{-START}, $param{-STOP});
1181 $seqType ||= 'query';
1182 $seqType = 'sbjct' if $seqType eq 'hit';
1184 my($start,$stop);
1186 if(!defined $beg && !defined $end) {
1187 ## Get data for the whole alignment.
1188 push @data, ($self->{'_numIdentical'}, $self->{'_numConserved'});
1189 } else {
1190 ## Get the substring representing the desired sub-section of aln.
1191 $beg ||= 0;
1192 $end ||= 0;
1193 ($start,$stop) = $self->range($seqType);
1194 if($beg == 0) { $beg = $start; $end = $beg+$end; }
1195 elsif($end == 0) { $end = $stop; $beg = $end-$beg; }
1197 if($end >= $stop) { $end = $stop; } ##ML changed from if (end >stop)
1198 else { $end += 1;} ##ML moved from commented position below, makes
1199 ##more sense here
1200 # if($end > $stop) { $end = $stop; }
1201 if($beg < $start) { $beg = $start; }
1202 # else { $end += 1;}
1204 # my $seq = substr($self->seq_str('match'), $beg-$start, ($end-$beg));
1206 ## ML: START fix for substr out of range error ------------------
1207 my $seq = "";
1208 my $prog = $self->algorithm;
1209 if (($prog eq 'TBLASTN') and ($seqType eq 'sbjct'))
1211 $seq = substr($self->seq_str('match'),
1212 int(($beg-$start)/3), int(($end-$beg+1)/3));
1214 } elsif (($prog eq 'BLASTX') and ($seqType eq 'query'))
1216 $seq = substr($self->seq_str('match'),
1217 int(($beg-$start)/3), int(($end-$beg+1)/3));
1218 } else {
1219 $seq = substr($self->seq_str('match'),
1220 $beg-$start, ($end-$beg));
1222 ## ML: End of fix for substr out of range error -----------------
1225 ## ML: debugging code
1226 ## This is where we get our exception. Try printing out the values going
1227 ## into this:
1229 # print STDERR
1230 # qq(*------------MY EXCEPTION --------------------\nSeq: ") ,
1231 # $self->seq_str("$seqType"), qq("\n),$self->rank,",( index:";
1232 # print STDERR $beg-$start, ", len: ", $end-$beg," ), (HSPRealLen:",
1233 # CORE::length $self->seq_str("$seqType");
1234 # print STDERR ", HSPCalcLen: ", $stop - $start +1 ," ),
1235 # ( beg: $beg, end: $end ), ( start: $start, stop: stop )\n";
1236 ## ML: END DEBUGGING CODE----------
1238 if(!CORE::length $seq) {
1239 my $id_str = $self->_id_str;
1240 $self->throw("Undefined $seqType sub-sequence ($beg,$end). Valid range = $start - $stop ($id_str)");
1242 ## Get data for a substring.
1243 # printf "Collecting HSP subsection data: beg,end = %d,%d; start,stop = %d,%d\n%s<---\n", $beg, $end, $start, $stop, $seq;
1244 # printf "Original match seq:\n%s\n",$self->seq_str('match');
1245 $seq =~ s/ //g; # remove space (no info).
1246 my $len_cons = CORE::length $seq;
1247 $seq =~ s/\+//g; # remove '+' characters (conservative substitutions)
1248 my $len_id = CORE::length $seq;
1249 push @data, ($len_id, $len_cons);
1250 # printf " HSP = %s\n id = %d; cons = %d\n", $self->rank, $len_id, $len_cons; <STDIN>;
1252 @data;
1256 =head2 num_identical
1258 Usage : $hsp_object->num_identical();
1259 Purpose : Get the number of identical positions within the given HSP.
1260 Example : $num_iden = $hsp_object->num_identical();
1261 Returns : integer
1262 Argument : n/a
1263 Throws : n/a
1265 See Also : L</num_conserved>, L</frac_identical>
1267 =cut
1269 #-------------------
1270 sub num_identical {
1271 #-------------------
1272 my( $self) = shift;
1274 $self->{'_numIdentical'};
1278 =head2 num_conserved
1280 Usage : $hsp_object->num_conserved();
1281 Purpose : Get the number of conserved positions within the given HSP.
1282 Example : $num_iden = $hsp_object->num_conserved();
1283 Returns : integer
1284 Argument : n/a
1285 Throws : n/a
1287 See Also : L</num_identical>, L</frac_conserved>
1289 =cut
1291 #-------------------
1292 sub num_conserved {
1293 #-------------------
1294 my( $self) = shift;
1296 $self->{'_numConserved'};
1301 =head2 range
1303 Usage : $hsp->range( [seq_type] );
1304 Purpose : Gets the (start, end) coordinates for the query or sbjct sequence
1305 : in the HSP alignment.
1306 Example : ($query_beg, $query_end) = $hsp->range('query');
1307 : ($hit_beg, $hit_end) = $hsp->range('hit');
1308 Returns : Two-element array of integers
1309 Argument : seq_type = string, 'query' or 'hit' or 'sbjct' (default = 'query')
1310 : ('sbjct' is synonymous with 'hit')
1311 Throws : n/a
1313 See Also : L</start>, L</end>
1315 =cut
1317 #----------
1318 sub range {
1319 #----------
1320 my ($self, $seqType) = @_;
1322 $self->_set_seq_data() unless $self->{'_set_seq_data'};
1324 $seqType ||= 'query';
1325 $seqType = 'sbjct' if $seqType eq 'hit';
1327 ## Sensitive to member name changes.
1328 $seqType = "_\L$seqType\E";
1330 return ($self->{$seqType.'Start'},$self->{$seqType.'Stop'});
1333 =head2 start
1335 Usage : $hsp->start( [seq_type] );
1336 Purpose : Gets the start coordinate for the query, sbjct, or both sequences
1337 : in the HSP alignment.
1338 : NOTE: Start will always be less than end.
1339 : To determine strand, use $hsp->strand()
1340 Example : $query_beg = $hsp->start('query');
1341 : $hit_beg = $hsp->start('hit');
1342 : ($query_beg, $hit_beg) = $hsp->start();
1343 Returns : scalar context: integer
1344 : array context without args: list of two integers
1345 Argument : In scalar context: seq_type = 'query' or 'hit' or 'sbjct' (default= 'query')
1346 : ('sbjct' is synonymous with 'hit')
1347 : Array context can be "induced" by providing an argument of 'list' or 'array'.
1348 Throws : n/a
1350 See Also : L</end>, L</range>
1352 =cut
1354 #----------
1355 sub start {
1356 #----------
1357 my ($self, $seqType) = @_;
1359 $seqType ||= (wantarray ? 'list' : 'query');
1360 $seqType = 'sbjct' if $seqType eq 'hit';
1362 $self->_set_seq_data() unless $self->{'_set_seq_data'};
1364 if($seqType =~ /list|array/i) {
1365 return ($self->{'_queryStart'}, $self->{'_sbjctStart'});
1366 } else {
1367 ## Sensitive to member name changes.
1368 $seqType = "_\L$seqType\E";
1369 return $self->{$seqType.'Start'};
1373 =head2 end
1375 Usage : $hsp->end( [seq_type] );
1376 Purpose : Gets the end coordinate for the query, sbjct, or both sequences
1377 : in the HSP alignment.
1378 : NOTE: Start will always be less than end.
1379 : To determine strand, use $hsp->strand()
1380 Example : $query_end = $hsp->end('query');
1381 : $hit_end = $hsp->end('hit');
1382 : ($query_end, $hit_end) = $hsp->end();
1383 Returns : scalar context: integer
1384 : array context without args: list of two integers
1385 Argument : In scalar context: seq_type = 'query' or 'hit' or 'sbjct' (default= 'query')
1386 : ('sbjct' is synonymous with 'hit')
1387 : Array context can be "induced" by providing an argument of 'list' or 'array'.
1388 Throws : n/a
1390 See Also : L</start>, L</range>, L</strand>
1392 =cut
1394 #----------
1395 sub end {
1396 #----------
1397 my ($self, $seqType) = @_;
1399 $seqType ||= (wantarray ? 'list' : 'query');
1400 $seqType = 'sbjct' if $seqType eq 'hit';
1402 $self->_set_seq_data() unless $self->{'_set_seq_data'};
1404 if($seqType =~ /list|array/i) {
1405 return ($self->{'_queryStop'}, $self->{'_sbjctStop'});
1406 } else {
1407 ## Sensitive to member name changes.
1408 $seqType = "_\L$seqType\E";
1409 return $self->{$seqType.'Stop'};
1415 =head2 strand
1417 Usage : $hsp_object->strand( [seq_type] )
1418 Purpose : Get the strand of the query or sbjct sequence.
1419 Example : print $hsp->strand('query');
1420 : ($query_strand, $hit_strand) = $hsp->strand();
1421 Returns : -1, 0, or 1
1422 : -1 = Minus strand, +1 = Plus strand
1423 : Returns 0 if strand is not defined, which occurs
1424 : for BLASTP reports, and the query of TBLASTN
1425 : as well as the hit if BLASTX reports.
1426 : In scalar context without arguments, returns queryStrand value.
1427 : In array context without arguments, returns a two-element list
1428 : of strings (queryStrand, sbjctStrand).
1429 : Array context can be "induced" by providing an argument of 'list' or 'array'.
1430 Argument : seq_type: 'query' or 'hit' or 'sbjct' or undef
1431 : ('sbjct' is synonymous with 'hit')
1432 Throws : n/a
1434 See Also : L</_set_seq>, L</_set_match_stats>
1436 =cut
1438 #-----------
1439 sub strand {
1440 #-----------
1441 my( $self, $seqType ) = @_;
1443 $seqType ||= (wantarray ? 'list' : 'query');
1444 $seqType = 'sbjct' if $seqType eq 'hit';
1446 ## Sensitive to member name format.
1447 $seqType = "_\L$seqType\E";
1449 # $seqType could be '_list'.
1450 $self->{'_queryStrand'} or $self->_set_seq_data() unless $self->{'_set_seq_data'};
1452 my $prog = $self->algorithm;
1454 if($seqType =~ /list|array/i) {
1455 my ($qstr, $hstr);
1456 if( $prog eq 'BLASTP') {
1457 $qstr = 0;
1458 $hstr = 0;
1460 elsif( $prog eq 'TBLASTN') {
1461 $qstr = 0;
1462 $hstr = $STRAND_SYMBOL{$self->{'_sbjctStrand'}};
1464 elsif( $prog eq 'BLASTX') {
1465 $qstr = $STRAND_SYMBOL{$self->{'_queryStrand'}};
1466 $hstr = 0;
1468 else {
1469 $qstr = $STRAND_SYMBOL{$self->{'_queryStrand'}} if defined $self->{'_queryStrand'};
1470 $hstr = $STRAND_SYMBOL{$self->{'_sbjctStrand'}} if defined $self->{'_sbjctStrand'};
1472 $qstr ||= 0;
1473 $hstr ||= 0;
1474 return ($qstr, $hstr);
1476 local $^W = 0;
1477 $STRAND_SYMBOL{$self->{$seqType.'Strand'}} || 0;
1481 =head2 seq
1483 Usage : $hsp->seq( [seq_type] );
1484 Purpose : Get the query or sbjct sequence as a Bio::Seq.pm object.
1485 Example : $seqObj = $hsp->seq('query');
1486 Returns : Object reference for a Bio::Seq.pm object.
1487 Argument : seq_type = 'query' or 'hit' or 'sbjct' (default = 'query').
1488 : ('sbjct' is synonymous with 'hit')
1489 Throws : Propagates any exception that occurs during construction
1490 : of the Bio::Seq.pm object.
1491 Comments : The sequence is returned in an array of strings corresponding
1492 : to the strings in the original format of the Blast alignment.
1493 : (i.e., same spacing).
1495 See Also : L</seq_str>, L</seq_inds>, L<Bio::Seq>
1497 =cut
1499 #-------
1500 sub seq {
1501 #-------
1502 my($self,$seqType) = @_;
1503 $seqType ||= 'query';
1504 $seqType = 'sbjct' if $seqType eq 'hit';
1505 my $str = $self->seq_str($seqType);
1507 require Bio::Seq;
1509 Bio::Seq->new(-ID => $self->to_string,
1510 -SEQ => $str,
1511 -DESC => "$seqType sequence",
1515 =head2 seq_str
1517 Usage : $hsp->seq_str( seq_type );
1518 Purpose : Get the full query, sbjct, or 'match' sequence as a string.
1519 : The 'match' sequence is the string of symbols in between the
1520 : query and sbjct sequences.
1521 Example : $str = $hsp->seq_str('query');
1522 Returns : String
1523 Argument : seq_Type = 'query' or 'hit' or 'sbjct' or 'match'
1524 : ('sbjct' is synonymous with 'hit')
1525 Throws : Exception if the argument does not match an accepted seq_type.
1526 Comments : Calls _set_seq_data() to set the 'match' sequence if it has
1527 : not been set already.
1529 See Also : L</seq>, L</seq_inds>, L</_set_match_seq>
1531 =cut
1533 #------------
1534 sub seq_str {
1535 #------------
1536 my($self,$seqType) = @_;
1538 $seqType ||= 'query';
1539 $seqType = 'sbjct' if $seqType eq 'hit';
1540 ## Sensitive to member name changes.
1541 $seqType = "_\L$seqType\E";
1543 $self->_set_seq_data() unless $self->{'_set_seq_data'};
1545 if($seqType =~ /sbjct|query/) {
1546 my $seq = join('',@{$self->{$seqType.'Seq'}});
1547 $seq =~ s/\s+//g;
1548 return $seq;
1550 } elsif( $seqType =~ /match/i) {
1551 # Only need to call _set_match_seq() if the match seq is requested.
1552 my $aref = $self->_set_match_seq() unless ref $self->{'_matchSeq'};
1553 $aref = $self->{'_matchSeq'};
1555 return join('',@$aref);
1557 } else {
1558 my $id_str = $self->_id_str;
1559 $self->throw(-class => 'Bio::Root::BadParameter',
1560 -text => "Invalid or undefined sequence type: $seqType ($id_str)\n" .
1561 "Valid types: query, sbjct, match",
1562 -value => $seqType);
1566 =head2 seq_inds
1568 Usage : $hsp->seq_inds( seq_type, class, collapse );
1569 Purpose : Get a list of residue positions (indices) for all identical
1570 : or conserved residues in the query or sbjct sequence.
1571 Example : @s_ind = $hsp->seq_inds('query', 'identical');
1572 : @h_ind = $hsp->seq_inds('hit', 'conserved');
1573 : @h_ind = $hsp->seq_inds('hit', 'conserved', 1);
1574 Returns : List of integers
1575 : May include ranges if collapse is true.
1576 Argument : seq_type = 'query' or 'hit' or 'sbjct' (default = query)
1577 : ('sbjct' is synonymous with 'hit')
1578 : class = 'identical' or 'conserved' (default = identical)
1579 : (can be shortened to 'id' or 'cons')
1580 : (actually, anything not 'id' will evaluate to 'conserved').
1581 : collapse = boolean, if true, consecutive positions are merged
1582 : using a range notation, e.g., "1 2 3 4 5 7 9 10 11"
1583 : collapses to "1-5 7 9-11". This is useful for
1584 : consolidating long lists. Default = no collapse.
1585 Throws : n/a.
1586 Comments : Calls _set_residues() to set the 'match' sequence if it has
1587 : not been set already.
1589 See Also : L</seq>, L</_set_residues>, L<Bio::Search::BlastUtils::collapse_nums()|Bio::Search::BlastUtils>, L<Bio::Search::Hit::BlastHit::seq_inds()|Bio::Search::Hit::BlastHit>
1591 =cut
1593 #---------------
1594 sub seq_inds {
1595 #---------------
1596 my ($self, $seqType, $class, $collapse) = @_;
1598 $seqType ||= 'query';
1599 $class ||= 'identical';
1600 $collapse ||= 0;
1601 $seqType = 'sbjct' if $seqType eq 'hit';
1603 $self->_set_residues() unless defined $self->{'_identicalRes_query'};
1605 $seqType = ($seqType !~ /^q/i ? 'sbjct' : 'query');
1606 $class = ($class !~ /^id/i ? 'conserved' : 'identical');
1608 ## Sensitive to member name changes.
1609 $seqType = "_\L$seqType\E";
1610 $class = "_\L$class\E";
1612 my @ary = sort { $a <=> $b } keys %{ $self->{"${class}Res$seqType"}};
1614 require Bio::Search::BlastUtils if $collapse;
1616 return $collapse ? &Bio::Search::BlastUtils::collapse_nums(@ary) : @ary;
1620 =head2 get_aln
1622 Usage : $hsp->get_aln()
1623 Purpose : Get a Bio::SimpleAlign object constructed from the query + sbjct
1624 : sequences of the present HSP object.
1625 Example : $aln_obj = $hsp->get_aln();
1626 Returns : Object reference for a Bio::SimpleAlign.pm object.
1627 Argument : n/a.
1628 Throws : Propagates any exception ocurring during the construction of
1629 : the Bio::SimpleAlign object.
1630 Comments : Requires Bio::SimpleAlign.
1631 : The Bio::SimpleAlign object is constructed from the query + sbjct
1632 : sequence objects obtained by calling seq().
1633 : Gap residues are included (see $GAP_SYMBOL).
1635 See Also : L</seq>, L<Bio::SimpleAlign>
1637 =cut
1639 #------------
1640 sub get_aln {
1641 #------------
1642 my $self = shift;
1644 require Bio::SimpleAlign;
1645 require Bio::LocatableSeq;
1646 my $qseq = $self->seq('query');
1647 my $sseq = $self->seq('sbjct');
1649 my $type = $self->algorithm =~ /P$|^T/ ? 'amino' : 'dna';
1650 my $aln = Bio::SimpleAlign->new();
1651 $aln->add_seq(Bio::LocatableSeq->new(-seq => $qseq->seq(),
1652 -id => 'query_'.$qseq->display_id(),
1653 -start => 1,
1654 -end => CORE::length($qseq)));
1656 $aln->add_seq(Bio::LocatableSeq->new(-seq => $sseq->seq(),
1657 -id => 'hit_'.$sseq->display_id(),
1658 -start => 1,
1659 -end => CORE::length($sseq)));
1661 return $aln;
1666 __END__
1669 =head1 FOR DEVELOPERS ONLY
1671 =head2 Data Members
1673 Information about the various data members of this module is provided for those
1674 wishing to modify or understand the code. Two things to bear in mind:
1676 =over 4
1678 =item 1 Do NOT rely on these in any code outside of this module.
1680 All data members are prefixed with an underscore to signify that they are private.
1681 Always use accessor methods. If the accessor doesn't exist or is inadequate,
1682 create or modify an accessor (and let me know, too!).
1684 =item 2 This documentation may be incomplete and out of date.
1686 It is easy for these data member descriptions to become obsolete as
1687 this module is still evolving. Always double check this info and search
1688 for members not described here.
1690 =back
1692 An instance of Bio::Search::HSP::PsiBlastHSP.pm is a blessed reference to a hash containing
1693 all or some of the following fields:
1695 FIELD VALUE
1696 --------------------------------------------------------------
1697 (member names are mostly self-explanatory)
1699 _score :
1700 _bits :
1701 _p :
1702 _n : Integer. The 'N' value listed in parenthesis with P/Expect value:
1703 : e.g., P(3) = 1.2e-30 ---> (N = 3).
1704 : Not defined in NCBI Blast2 with gaps.
1705 : To obtain the number of HSPs, use Bio::Search::Hit::BlastHit::num_hsps().
1706 _expect :
1707 _queryLength :
1708 _queryGaps :
1709 _queryStart :
1710 _queryStop :
1711 _querySeq :
1712 _sbjctLength :
1713 _sbjctGaps :
1714 _sbjctStart :
1715 _sbjctStop :
1716 _sbjctSeq :
1717 _matchSeq : String. Contains the symbols between the query and sbjct lines
1718 which indicate identical (letter) and conserved ('+') matches
1719 or a mismatch (' ').
1720 _numIdentical :
1721 _numConserved :
1722 _identicalRes_query :
1723 _identicalRes_sbjct :
1724 _conservedRes_query :
1725 _conservedRes_sbjct :
1726 _match_indent : The number of leading space characters on each line containing
1727 the match symbols. _match_indent is 13 in this example:
1728 Query: 285 QNSAPWGLARISHRERLNLGSFNKYLYDDDAG
1729 Q +APWGLARIS G+ + Y YD+ AG
1730 ^^^^^^^^^^^^^
1733 =cut