Bio::DB::TFBS namespace has been moved to its own distribution named after itself
[bioperl-live.git] / Bio / Search / Hit / PsiBlastHit.pm
blobf7b87c9c77222f256f471d5d403c29d8c688ebfa
1 #-----------------------------------------------------------------
3 # BioPerl module Bio::Search::Hit::PsiBlastHit
5 # (This module was originally called Bio::Tools::Blast::Sbjct)
7 # Please direct questions and support issues to <bioperl-l@bioperl.org>
9 # Cared for by Steve Chervitz <sac@bioperl.org>
11 # You may distribute this module under the same terms as perl itself
12 #-----------------------------------------------------------------
14 ## POD Documentation:
16 =head1 NAME
18 Bio::Search::Hit::PsiBlastHit - Bioperl BLAST Hit object
20 =head1 SYNOPSIS
22 See L<Bio::Search::Result::BlastResult>.
24 =head1 DESCRIPTION
26 The Bio::Search::Hit::PsiBlastHit.pm module encapsulates data and
27 methods for manipulating "hits" from a BLAST report. A BLAST hit is a
28 collection of HSPs along with other metadata such as sequence name and
29 score information. Hit objects are accessed via
30 L<Bio::Search::Result::BlastResult> objects after parsing a BLAST
31 report using the L<Bio::SearchIO> system.
33 In Blast lingo, the "sbjct" sequences are all the sequences in a
34 target database which were compared against a "query" sequence. The
35 terms "sbjct" and "hit" will be used interchangeably in this module.
36 All methods that take 'sbjct' as an argument also support 'hit' as a
37 synonym.
39 This module supports BLAST versions 1.x and 2.x, gapped and ungapped,
40 and PSI-BLAST.
42 The construction of PsiBlastHit objects is performed by
43 Bio::SearchIO::blast::PsiBlastHitFactory in a process that is
44 orchestrated by the Blast parser (L<Bio::SearchIO::blast>).
45 The resulting PsiBlastHits are then accessed via
46 L<Bio::Search::Result::BlastResult>). Therefore, you do not need to
47 use L<Bio::Search::Hit::PsiBlastHit>) directly. If you need to
48 construct PsiBlastHits directly, see the C<new()> function for details.
50 For L<Bio::SearchIO> BLAST parsing usage examples, see the
51 C<examples/search-blast> directory of the Bioperl distribution.
54 =head2 HSP Tiling and Ambiguous Alignments
56 If a Blast hit has more than one HSP, the Bio::Search::Hit::PsiBlastHit.pm
57 object has the ability to merge overlapping HSPs into contiguous
58 blocks. This permits the PsiBlastHit object to sum data across all HSPs
59 without counting data in the overlapping regions multiple times, which
60 would happen if data from each overlapping HSP are simply summed. HSP
61 tiling is performed automatically when methods of the PsiBlastHit object
62 that rely on tiled data are invoked. These include
63 L<frac_identical()|frac_identical>, L<frac_conserved()|frac_conserved>, L<gaps()|gaps>,
64 L<frac_aligned_query()|frac_aligned_query>, L<frac_aligned_hit()|frac_aligned_hit>,
65 L<num_unaligned_query()|num_unaligned_query>, L<num_unaligned_hit()|num_unaligned_hit>.
67 It also permits the assessment of an "ambiguous alignment" if the
68 query (or sbjct) sequences from different HSPs overlap
69 (see L<ambiguous_aln()|ambiguous_aln>). The existence
70 of an overlap could indicate a biologically interesting region in the
71 sequence, such as a repeated domain. The PsiBlastHit object uses the
72 C<-OVERLAP> parameter to determine when two sequences overlap; if this is
73 set to 2 -- the default -- then any two sbjct or query HSP sequences
74 must overlap by more than two residues to get merged into the same
75 contig and counted as an overlap. See the L<BUGS | BUGS> section below for
76 "issues" with HSP tiling.
79 The results of the HSP tiling is reported with the following ambiguity codes:
81 'q' = Query sequence contains multiple sub-sequences matching
82 a single region in the sbjct sequence.
84 's' = Subject (PsiBlastHit) sequence contains multiple sub-sequences matching
85 a single region in the query sequence.
87 'qs' = Both query and sbjct sequences contain more than one
88 sub-sequence with similarity to the other sequence.
91 For addition information about ambiguous BLAST alignments, see
92 L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>
94 =head1 DEPENDENCIES
96 Bio::Search::Hit::PsiBlastHit.pm is a concrete class that inherits from
97 L<Bio::Root::Root> and L<Bio::Search::Hit::HitI>. and relies on
98 L<Bio::Search::HSP::BlastHSP>.
101 =head1 BUGS
103 One consequence of the HSP tiling is that methods that rely on HSP
104 tiling such as L<frac_identical()|frac_identical>, L<frac_conserved()|frac_conserved>, L<gaps()|gaps>
105 etc. may report misleading numbers when C<-OVERLAP> is set to a large
106 number. For example, say we have two HSPs and the query sequence tile
107 as follows:
109 1 8 22 30 40 60
110 Full seq: ------------------------------------------------------------
111 * ** * **
112 HSP1: --------------- (6 identical matches)
113 ** ** **
114 HSP2: ------------- (6 identical matches)
117 If C<-OVERLAP> is set to some number over 4, HSP1 and HSP2 will not be
118 tiled into a single contig and their numbers of identical matches will
119 be added, giving a total of 12, not 10 if they had be combined into
120 one contig. This can lead to number greater than 1.0 for methods
121 L<frac_identical()|frac_identical> and L<frac_conserved()|frac_conserved>. This is less of an issue
122 with gapped Blast since it tends to combine HSPs that would be listed
123 separately without gapping. (Fractions E<gt>1.0 can be viewed as a
124 signal for an interesting alignment that warrants further inspection,
125 thus turning this bug into a feature :-).
127 Using large values for C<-OVERLAP> can lead to incorrect numbers
128 reported by methods that rely on HSP tiling but can be useful if you
129 care more about detecting ambiguous alignments. Setting C<-OVERLAP>
130 to zero will lead to the most accurate numbers for the
131 tiling-dependent methods but will be useless for detecting overlapping
132 HSPs since all HSPs will appear to overlap.
135 =head1 SEE ALSO
137 Bio::Search::HSP::BlastHSP.pm - Blast HSP object.
138 Bio::Search::Result::BlastResult.pm - Blast Result object.
139 Bio::Search::Hit::HitI.pm - Interface implemented by PsiBlastHit.pm
140 Bio::Root::Root.pm - Base class for PsiBlastHit.pm
142 Links:
144 http://bio.perl.org/ - Bioperl Project Homepage
147 =head1 FEEDBACK
149 =head2 Mailing Lists
151 User feedback is an integral part of the evolution of this and other
152 Bioperl modules. Send your comments and suggestions preferably to one
153 of the Bioperl mailing lists. Your participation is much appreciated.
155 bioperl-l@bioperl.org - General discussion
156 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
158 =head2 Support
160 Please direct usage questions or support issues to the mailing list:
162 I<bioperl-l@bioperl.org>
164 rather than to the module maintainer directly. Many experienced and
165 reponsive experts will be able look at the problem and quickly
166 address it. Please include a thorough description of the problem
167 with code and data examples if at all possible.
169 =head2 Reporting Bugs
171 Report bugs to the Bioperl bug tracking system to help us keep track
172 the bugs and their resolution. Bug reports can be submitted via the
173 web:
175 https://github.com/bioperl/bioperl-live/issues
177 =head1 AUTHOR
179 Steve Chervitz E<lt>sac@bioperl.orgE<gt>
181 See L<the FEEDBACK section | FEEDBACK> for where to send bug reports and comments.
183 =head1 ACKNOWLEDGEMENTS
185 This software was originally developed in the Department of Genetics
186 at Stanford University. I would also like to acknowledge my
187 colleagues at Affymetrix for useful feedback.
189 =head1 COPYRIGHT
191 Copyright (c) 1996-2001 Steve Chervitz. All Rights Reserved.
193 =head1 DISCLAIMER
195 This software is provided "as is" without warranty of any kind.
197 =head1 APPENDIX
199 The rest of the documentation details each of the object methods.
200 Internal methods are usually preceded with a _
202 =cut
205 # Let the code begin...
207 package Bio::Search::Hit::PsiBlastHit;
209 use strict;
210 use Bio::Search::BlastUtils;
211 use vars qw(%SUMMARY_OFFSET);
213 use overload
214 '""' => \&to_string;
216 use base qw(Bio::Root::Root Bio::Search::Hit::HitI);
219 =head2 new
221 Usage : $hit = Bio::Search::Hit::PsiBlastHit->new( %named_params );
222 : Bio::Search::Hit::PsiBlastHit.pm objects are constructed
223 : automatically by Bio::SearchIO::PsiBlastHitFactory.pm,
224 : so there is no need for direct instantiation.
225 Purpose : Constructs a new PsiBlastHit object and Initializes key variables
226 : for the hit.
227 Returns : A Bio::Search::Hit::PsiBlastHit object
228 Argument : Named Parameters:
229 : Parameter keys are case-insensitive.
230 : -RAW_DATA => array reference holding raw BLAST report data
231 : for a single hit. This includes all lines
232 : within the HSP alignment listing section of a
233 : traditional BLAST or PSI-BLAST (non-XML) report,
234 : starting at (or just after) the leading '>'.
235 : -HOLD_RAW_DATA => boolean, should -RAW_DATA be saved within the object.
236 : -QUERY_LEN => Length of the query sequence
237 : -ITERATION => integer (PSI-BLAST iteration number in which hit was found)
238 : -OVERLAP => integer (maximum overlap between adjacent
239 : HSPs when tiling)
240 : -PROGRAM => string (type of Blast: BLASTP, BLASTN, etc)
241 : -SIGNIF => significance
242 : -IS_PVAL => boolean, true if -SIGNIF contains a P-value
243 : -SCORE => raw BLAST score
244 : -FOUND_AGAIN => boolean, true if this was a hit from the
245 : section of a PSI-BLAST with iteration > 1
246 : containing sequences that were also found
247 : in iteration 1.
248 Comments : This object accepts raw Blast report data not because it
249 : is required for parsing, but in order to retrieve it
250 : (only available if -HOLD_RAW_DATA is set to true).
252 See Also : L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>, L<Bio::Root::Root::new()|Bio::Root::Root>
254 =cut
256 #-------------------
257 sub new {
258 #-------------------
259 my ($class, @args ) = @_;
260 my $self = $class->SUPER::new( @args );
262 my ($raw_data, $signif, $is_pval, $hold_raw);
264 ($self->{'_blast_program'}, $self->{'_query_length'}, $raw_data, $hold_raw,
265 $self->{'_overlap'}, $self->{'_iteration'}, $signif, $is_pval,
266 $self->{'_score'}, $self->{'_found_again'} ) =
267 $self->_rearrange( [qw(PROGRAM
268 QUERY_LEN
269 RAW_DATA
270 HOLD_RAW_DATA
271 OVERLAP
272 ITERATION
273 SIGNIF
274 IS_PVAL
275 SCORE
276 FOUND_AGAIN )], @args );
278 # TODO: Handle this in parser. Just pass in name parameter.
279 $self->_set_id( $raw_data->[0] );
281 if($is_pval) {
282 $self->{'_p'} = $signif;
283 } else {
284 $self->{'_expect'} = $signif;
287 if( $hold_raw ) {
288 $self->{'_hit_data'} = $raw_data;
291 return $self;
294 sub DESTROY {
295 my $self=shift;
296 #print STDERR "-->DESTROYING $self\n";
300 #=================================================
301 # Begin Bio::Search::Hit::HitI implementation
302 #=================================================
304 =head2 algorithm
306 Title : algorithm
307 Usage : $alg = $hit->algorithm();
308 Function: Gets the algorithm specification that was used to obtain the hit
309 For BLAST, the algorithm denotes what type of sequence was aligned
310 against what (BLASTN: dna-dna, BLASTP prt-prt, BLASTX translated
311 dna-prt, TBLASTN prt-translated dna, TBLASTX translated
312 dna-translated dna).
313 Returns : a scalar string
314 Args : none
316 =cut
318 #----------------
319 sub algorithm {
320 #----------------
321 my ($self,@args) = @_;
322 return $self->{'_blast_program'};
325 =head2 name
327 Usage : $hit->name([string]);
328 Purpose : Set/Get a string to identify the hit.
329 Example : $name = $hit->name;
330 : $hit->name('M81707');
331 Returns : String consisting of the hit's name or undef if not set.
332 Comments : The name is parsed out of the "Query=" line as the first chunk of
333 non-whitespace text. If you want the rest of the line, use
334 $hit->description().
336 See Also: L<accession()|accession>
338 =cut
342 #----------------
343 sub name {
344 #----------------
345 my $self = shift;
346 if (@_) {
347 my $name = shift;
348 $name =~ s/^\s+|(\s+|,)$//g;
349 $self->{'_name'} = $name;
351 return $self->{'_name'};
354 =head2 description
356 Usage : $hit_object->description( [integer] );
357 Purpose : Set/Get a description string for the hit.
358 This is parsed out of the "Query=" line as everything after
359 the first chunk of non-whitespace text. Use $hit->name()
360 to get the first chunk (the ID of the sequence).
361 Example : $description = $hit->description;
362 : $desc_60char = $hit->description(60);
363 Argument : Integer (optional) indicating the desired length of the
364 : description string to be returned.
365 Returns : String consisting of the hit's description or undef if not set.
367 =cut
371 #----------------
372 sub description {
373 #----------------
374 my( $self, $len ) = @_;
375 $len = (defined $len) ? $len : (CORE::length $self->{'_description'});
376 return substr( $self->{'_description'}, 0 ,$len );
379 =head2 accession
381 Title : accession
382 Usage : $acc = $hit->accession();
383 Function: Retrieve the accession (if available) for the hit
384 Returns : a scalar string (empty string if not set)
385 Args : none
386 Comments: Accession numbers are extracted based on the assumption that they
387 are delimited by | characters (NCBI-style). If this is not the case,
388 use the name() method and parse it as necessary.
390 See Also: L<name()|name>
392 =cut
394 #--------------------
395 sub accession {
396 #--------------------
397 my $self = shift;
398 if(@_) { $self->{'_accession'} = shift; }
399 $self->{'_accession'} || '';
402 =head2 raw_score
404 Usage : $hit_object->raw_score();
405 Purpose : Gets the BLAST score of the best HSP for the current Blast hit.
406 Example : $score = $hit_object->raw_score();
407 Returns : Integer
408 Argument : n/a
409 Throws : n/a
411 See Also : L<bits()|bits>
413 =cut
415 #----------
416 sub raw_score {
417 #----------
418 my $self = shift;
420 # The check for $self->{'_score'} is a remnant from the 'query' mode days
421 # in which the sbjct object would collect data from the description line only.
423 my ($score);
424 if(not defined($self->{'_score'})) {
425 $score = $self->hsp->score;
426 } else {
427 $score = $self->{'_score'};
429 return $score;
433 =head2 length
435 Usage : $hit_object->length();
436 Purpose : Get the total length of the hit sequence.
437 Example : $len = $hit_object->length();
438 Returns : Integer
439 Argument : n/a
440 Throws : n/a
441 Comments : Developer note: when using the built-in length function within
442 : this module, call it as CORE::length().
444 See Also : L<logical_length()|logical_length>, L<length_aln()|length_aln>
446 =cut
448 #-----------
449 sub length {
450 #-----------
451 my $self = shift;
452 return $self->{'_length'};
455 =head2 significance
457 Equivalent to L<signif()|signif>
459 =cut
461 #----------------
462 sub significance { shift->signif( @_ ); }
463 #----------------
466 =head2 next_hsp
468 Title : next_hsp
469 Usage : $hsp = $obj->next_hsp();
470 Function : returns the next available High Scoring Pair object
471 Example :
472 Returns : Bio::Search::HSP::BlastHSP or undef if finished
473 Args : none
475 =cut
477 #----------------
478 sub next_hsp {
479 #----------------
480 my $self = shift;
482 unless($self->{'_hsp_queue_started'}) {
483 $self->{'_hsp_queue'} = [$self->hsps()];
484 $self->{'_hsp_queue_started'} = 1;
486 pop @{$self->{'_hsp_queue'}};
489 #=================================================
490 # End Bio::Search::Hit::HitI implementation
491 #=================================================
494 # Providing a more explicit method for getting name of hit
495 # (corresponds with column name in HitTableWriter)
496 #----------------
497 sub hit_name {
498 #----------------
499 my $self = shift;
500 $self->name( @_ );
503 # Older method Delegates to description()
504 #----------------
505 sub desc {
506 #----------------
507 my $self = shift;
508 return $self->description( @_ );
511 # Providing a more explicit method for getting description of hit
512 # (corresponds with column name in HitTableWriter)
513 #----------------
514 sub hit_description {
515 #----------------
516 my $self = shift;
517 return $self->description( @_ );
520 =head2 score
522 Equivalent to L<raw_score()|raw_score>
524 =cut
526 #----------------
527 sub score { shift->raw_score( @_ ); }
528 #----------------
531 =head2 hit_length
533 Equivalent to L<length()|length>
535 =cut
537 # Providing a more explicit method for getting length of hit
538 #----------------
539 sub hit_length { shift->length( @_ ); }
540 #----------------
543 =head2 signif
545 Usage : $hit_object->signif( [format] );
546 Purpose : Get the P or Expect value for the best HSP of the given BLAST hit.
547 : The value returned is the one which is reported in the description
548 : section of the Blast report. For Blast1 and WU-Blast2, this
549 : is a P-value, for Blast2, it is an Expect value.
550 Example : $obj->signif() # returns 1.3e-34
551 : $obj->signif('exp') # returns -34
552 : $obj->signif('parts') # returns (1.3, -34)
553 Returns : Float or scientific notation number (the raw P/Expect value, DEFAULT).
554 : Integer if format == 'exp' (the magnitude of the base 10 exponent).
555 : 2-element list (float, int) if format == 'parts' and P/Expect value
556 : is in scientific notation (see Comments).
557 Argument : format: string of 'raw' | 'exp' | 'parts'
558 : 'raw' returns value given in report. Default. (1.2e-34)
559 : 'exp' returns exponent value only (34)
560 : 'parts' returns the decimal and exponent as a
561 : 2-element list (1.2, -34) (see Comments).
562 Throws : n/a
563 Comments : The signif() method provides a way to deal with the fact that
564 : Blast1 and Blast2 formats (and WU- vs. NCBI-BLAST) differ in
565 : what is reported in the description lines of each hit in the
566 : Blast report. The signif() method frees any client code from
567 : having to know if this is a P-value or an Expect value,
568 : making it easier to write code that can process both
569 : Blast1 and Blast2 reports. This is not necessarily a good thing,
570 : since one should always know when one is working with P-values or
571 : Expect values (hence the deprecated status).
572 : Use of expect() is recommended since all hits will have an Expect value.
574 : Using the 'parts' argument is not recommended since it will not
575 : work as expected if the expect value is not in scientific notation.
576 : That is, floats are not converted into sci notation before
577 : splitting into parts.
579 See Also : L<p()|p>, L<expect()|expect>, L<Bio::Search::BlastUtils::get_exponent()|Bio::Search::BlastUtils>
581 =cut
583 #-------------
584 sub signif {
585 #-------------
586 # Some duplication of logic for p(), expect() and signif() for the sake of performance.
587 my ($self, $fmt) = @_;
589 my $val = defined($self->{'_p'}) ? $self->{'_p'} : $self->{'_expect'};
591 # $val can be zero.
592 defined($val) or $self->throw("Can't get P- or Expect value: HSPs may not have been set.");
594 return $val if not $fmt or $fmt =~ /^raw/i;
595 ## Special formats: exponent-only or as list.
596 return &Bio::Search::BlastUtils::get_exponent($val) if $fmt =~ /^exp/i;
597 return (split (/eE/, $val)) if $fmt =~ /^parts/i;
599 ## Default: return the raw P/Expect-value.
600 return $val;
603 #----------------
604 sub raw_hit_data {
605 #----------------
606 my $self = shift;
607 my $data = '>';
608 # Need to add blank lines where we've removed them.
609 foreach( @{$self->{'_hit_data'}} ) {
610 if( $_ eq 'end') {
611 $data .= "\n";
613 else {
614 $data .= /^\s*(Score|Query)/ ? "\n$_" : $_;
617 return $data;
621 #=head2 _set_length
623 # Usage : $hit_object->_set_length( "233" );
624 # Purpose : Set the total length of the hit sequence.
625 # Example : $hit_object->_set_length( $len );
626 # Returns : n/a
627 # Argument : Integer (only when setting). Any commas will be stripped out.
628 # Throws : n/a
630 #=cut
632 #-----------
633 sub _set_length {
634 #-----------
635 my ($self, $len) = @_;
636 $len =~ s/,//g; # get rid of commas
637 $self->{'_length'} = $len;
640 #=head2 _set_description
642 # Usage : Private method; called automatically during construction
643 # Purpose : Sets the description of the hit sequence.
644 # : For sequence without descriptions, does not set any description.
645 # Argument : Array containing description (multiple lines).
646 # Comments : Processes the supplied description:
647 # 1. Join all lines into one string.
648 # 2. Remove sequence id at the beginning of description.
649 # 3. Removes junk charactes at begin and end of description.
651 #=cut
653 #--------------
654 sub _set_description {
655 #--------------
656 my( $self, @desc ) = @_;
657 my( $desc);
659 # print STDERR "PsiBlastHit: RAW DESC:\n@desc\n";
661 $desc = join(" ", @desc);
663 my $name = $self->name;
665 if($desc) {
666 $desc =~ s/^\s*\S+\s+//; # remove the sequence ID(s)
667 # This won't work if there's no description.
668 $desc =~ s/^\s*$name//; # ...but this should.
669 $desc =~ s/^[\s!]+//;
670 $desc =~ s/ \d+$//;
671 $desc =~ s/\.+$//;
672 $self->{'_description'} = $desc;
675 # print STDERR "PsiBlastHit: _set_description = $desc\n";
678 =head2 to_string
680 Title : to_string
681 Usage : print $hit->to_string;
682 Function: Returns a string representation for the Blast Hit.
683 Primarily intended for debugging purposes.
684 Example : see usage
685 Returns : A string of the form:
686 [PsiBlastHit] <name> <description>
687 e.g.:
688 [PsiBlastHit] emb|Z46660|SC9725 S.cerevisiae chromosome XIII cosmid
689 Args : None
691 =cut
693 #----------------
694 sub to_string {
695 #----------------
696 my $self = shift;
697 return "[PsiBlastHit] " . $self->name . " " . $self->description;
701 #=head2 _set_id
703 # Usage : Private method; automatically called by new()
704 # Purpose : Sets the name of the PsiBlastHit sequence from the BLAST summary line.
705 # : The identifier is assumed to be the first
706 # : chunk of non-whitespace characters in the description line
707 # : Does not assume any semantics in the structure of the identifier
708 # : (Formerly, this method attempted to extract database name from
709 # : the seq identifiers, but this was prone to break).
710 # Returns : n/a
711 # Argument : String containing description line of the hit from Blast report
712 # : or first line of an alignment section (with or without the leading '>').
713 # Throws : Warning if cannot locate sequence ID.
715 #See Also : L<new()|new>, L<accession()|accession>
717 #=cut
719 #---------------
720 sub _set_id {
721 #---------------
722 my( $self, $desc ) = @_;
724 # New strategy: Assume only that the ID is the first white space
725 # delimited chunk. Not attempting to extract accession & database name.
726 # Clients will have to interpret it as necessary.
727 if($desc =~ /^>?(\S+)\s*(.*)/) {
728 my ($name, $desc) = ($1, $2);
729 $self->name($name);
730 $self->{'_description'} = $desc;
731 # Note that this description comes from the summary section of the
732 # BLAST report and so may be truncated. The full description will be
733 # set from the alignment section. We're setting description here in case
734 # the alignment section isn't being parsed.
736 # Assuming accession is delimited with | symbols (NCBI-style)
737 my @pieces = split(/\|/,$name);
738 my $acc = pop @pieces;
739 $self->accession( $acc );
741 else {
742 $self->warn("Can't locate sequence identifier in summary line.", "Line = $desc");
743 $desc = 'Unknown sequence ID' if not $desc;
744 $self->name($desc);
749 =head2 ambiguous_aln
751 Usage : $ambig_code = $hit_object->ambiguous_aln();
752 Purpose : Sets/Gets ambiguity code data member.
753 Example : (see usage)
754 Returns : String = 'q', 's', 'qs', '-'
755 : 'q' = query sequence contains overlapping sub-sequences
756 : while sbjct does not.
757 : 's' = sbjct sequence contains overlapping sub-sequences
758 : while query does not.
759 : 'qs' = query and sbjct sequence contains overlapping sub-sequences
760 : relative to each other.
761 : '-' = query and sbjct sequence do not contains multiple domains
762 : relative to each other OR both contain the same distribution
763 : of similar domains.
764 Argument : n/a
765 Throws : n/a
766 Status : Experimental
768 See Also : L<Bio::Search::BlastUtils::tile_hsps>, L<HSP Tiling and Ambiguous Alignments>
770 =cut
772 #--------------------
773 sub ambiguous_aln {
774 #--------------------
775 my $self = shift;
776 if(@_) { $self->{'_ambiguous_aln'} = shift; }
777 $self->{'_ambiguous_aln'} || '-';
782 =head2 overlap
784 Usage : $blast_object->overlap( [integer] );
785 Purpose : Gets/Sets the allowable amount overlap between different HSP sequences.
786 Example : $blast_object->overlap(5);
787 : $overlap = $blast_object->overlap;
788 Returns : Integer.
789 Argument : integer.
790 Throws : n/a
791 Status : Experimental
792 Comments : Any two HSPs whose sequences overlap by less than or equal
793 : to the overlap() number of resides will be considered separate HSPs
794 : and will not get tiled by Bio::Search::BlastUtils::_adjust_contigs().
796 See Also : L<Bio::Search::BlastUtils::_adjust_contigs()|Bio::Search::BlastUtils>, L<BUGS | BUGS>
798 =cut
800 #-------------
801 sub overlap {
802 #-------------
803 my $self = shift;
804 if(@_) { $self->{'_overlap'} = shift; }
805 defined $self->{'_overlap'} ? $self->{'_overlap'} : 0;
813 =head2 bits
815 Usage : $hit_object->bits();
816 Purpose : Gets the BLAST bit score of the best HSP for the current Blast hit.
817 Example : $bits = $hit_object->bits();
818 Returns : Integer
819 Argument : n/a
820 Throws : Exception if bit score is not set.
821 Comments : For BLAST1, the non-bit score is listed in the summary line.
823 See Also : L<score()|score>
825 =cut
827 #---------
828 sub bits {
829 #---------
830 my $self = shift;
832 # The check for $self->{'_bits'} is a remnant from the 'query' mode days
833 # in which the sbjct object would collect data from the description line only.
835 my ($bits);
836 if(not defined($self->{'_bits'})) {
837 $bits = $self->hsp->bits;
838 } else {
839 $bits = $self->{'_bits'};
841 return $bits;
846 =head2 n
848 Usage : $hit_object->n();
849 Purpose : Gets the N number for the current Blast hit.
850 : This is the number of HSPs in the set which was ascribed
851 : the lowest P-value (listed on the description line).
852 : This number is not the same as the total number of HSPs.
853 : To get the total number of HSPs, use num_hsps().
854 Example : $n = $hit_object->n();
855 Returns : Integer
856 Argument : n/a
857 Throws : Exception if HSPs have not been set (BLAST2 reports).
858 Comments : Note that the N parameter is not reported in gapped BLAST2.
859 : Calling n() on such reports will result in a call to num_hsps().
860 : The num_hsps() method will count the actual number of
861 : HSPs in the alignment listing, which may exceed N in
862 : some cases.
864 See Also : L<num_hsps()|num_hsps>
866 =cut
868 #-----
869 sub n {
870 #-----
871 my $self = shift;
873 # The check for $self->{'_n'} is a remnant from the 'query' mode days
874 # in which the sbjct object would collect data from the description line only.
876 my ($n);
877 if(not defined($self->{'_n'})) {
878 $n = $self->hsp->n;
879 } else {
880 $n = $self->{'_n'};
882 $n ||= $self->num_hsps;
884 return $n;
889 =head2 frame
891 Usage : $hit_object->frame();
892 Purpose : Gets the reading frame for the best HSP after HSP tiling.
893 : This is only valid for BLASTX and TBLASTN/X reports.
894 Example : $frame = $hit_object->frame();
895 Returns : Integer (-2 .. +2)
896 Argument : n/a
897 Throws : Exception if HSPs have not been set (BLAST2 reports).
898 Comments : This method requires that all HSPs be tiled. If they have not
899 : already been tiled, they will be tiled first automatically..
900 : If you don't want the tiled data, iterate through each HSP
901 : calling frame() on each (use hsps() to get all HSPs).
903 See Also : L<hsps()|hsps>
905 =cut
907 #----------'
908 sub frame {
909 #----------
910 my $self = shift;
912 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
914 # The check for $self->{'_frame'} is a remnant from the 'query' mode days
915 # in which the sbjct object would collect data from the description line only.
917 my ($frame);
918 if(not defined($self->{'_frame'})) {
919 $frame = $self->hsp->frame('hit');
920 } else {
921 $frame = $self->{'_frame'};
923 return $frame;
930 =head2 p
932 Usage : $hit_object->p( [format] );
933 Purpose : Get the P-value for the best HSP of the given BLAST hit.
934 : (Note that P-values are not provided with NCBI Blast2 reports).
935 Example : $p = $sbjct->p;
936 : $p = $sbjct->p('exp'); # get exponent only.
937 : ($num, $exp) = $sbjct->p('parts'); # split sci notation into parts
938 Returns : Float or scientific notation number (the raw P-value, DEFAULT).
939 : Integer if format == 'exp' (the magnitude of the base 10 exponent).
940 : 2-element list (float, int) if format == 'parts' and P-value
941 : is in scientific notation (See Comments).
942 Argument : format: string of 'raw' | 'exp' | 'parts'
943 : 'raw' returns value given in report. Default. (1.2e-34)
944 : 'exp' returns exponent value only (34)
945 : 'parts' returns the decimal and exponent as a
946 : 2-element list (1.2, -34) (See Comments).
947 Throws : Warns if no P-value is defined. Uses expect instead.
948 Comments : Using the 'parts' argument is not recommended since it will not
949 : work as expected if the P-value is not in scientific notation.
950 : That is, floats are not converted into sci notation before
951 : splitting into parts.
953 See Also : L<expect()|expect>, L<signif()|signif>, L<Bio::Search::BlastUtils::get_exponent()|Bio::Search::BlastUtils>
955 =cut
957 #--------
958 sub p {
959 #--------
960 # Some duplication of logic for p(), expect() and signif() for the sake of performance.
961 my ($self, $fmt) = @_;
963 my $val = $self->{'_p'};
965 # $val can be zero.
966 if(not defined $val) {
967 # P-value not defined, must be a NCBI Blast2 report.
968 # Use expect instead.
969 $self->warn( "P-value not defined. Using expect() instead.");
970 $val = $self->{'_expect'};
973 return $val if not $fmt or $fmt =~ /^raw/i;
974 ## Special formats: exponent-only or as list.
975 return &Bio::Search::BlastUtils::get_exponent($val) if $fmt =~ /^exp/i;
976 return (split (/eE/, $val)) if $fmt =~ /^parts/i;
978 ## Default: return the raw P-value.
979 return $val;
984 =head2 expect
986 Usage : $hit_object->expect( [format] );
987 Purpose : Get the Expect value for the best HSP of the given BLAST hit.
988 Example : $e = $sbjct->expect;
989 : $e = $sbjct->expect('exp'); # get exponent only.
990 : ($num, $exp) = $sbjct->expect('parts'); # split sci notation into parts
991 Returns : Float or scientific notation number (the raw expect value, DEFAULT).
992 : Integer if format == 'exp' (the magnitude of the base 10 exponent).
993 : 2-element list (float, int) if format == 'parts' and Expect
994 : is in scientific notation (see Comments).
995 Argument : format: string of 'raw' | 'exp' | 'parts'
996 : 'raw' returns value given in report. Default. (1.2e-34)
997 : 'exp' returns exponent value only (34)
998 : 'parts' returns the decimal and exponent as a
999 : 2-element list (1.2, -34) (see Comments).
1000 Throws : Exception if the Expect value is not defined.
1001 Comments : Using the 'parts' argument is not recommended since it will not
1002 : work as expected if the expect value is not in scientific notation.
1003 : That is, floats are not converted into sci notation before
1004 : splitting into parts.
1006 See Also : L<p()|p>, L<signif()|signif>, L<Bio::Search::BlastUtils::get_exponent()|Bio::Search::BlastUtils>
1008 =cut
1010 #-----------
1011 sub expect {
1012 #-----------
1013 # Some duplication of logic for p(), expect() and signif() for the sake of performance.
1014 my ($self, $fmt) = @_;
1016 my $val;
1018 # For Blast reports that list the P value on the description line,
1019 # getting the expect value requires fully parsing the HSP data.
1020 # For NCBI blast, there's no problem.
1021 if(not defined($self->{'_expect'})) {
1022 if( defined $self->{'_hsps'}) {
1023 $self->{'_expect'} = $val = $self->hsp->expect;
1024 } else {
1025 # If _expect is not set and _hsps are not set,
1026 # then this must be a P-value-based report that was
1027 # run without setting the HSPs (shallow parsing).
1028 $self->throw("Can't get expect value. HSPs have not been set.");
1030 } else {
1031 $val = $self->{'_expect'};
1034 # $val can be zero.
1035 defined($val) or $self->throw("Can't get Expect value.");
1037 return $val if not $fmt or $fmt =~ /^raw/i;
1038 ## Special formats: exponent-only or as list.
1039 return &Bio::Search::BlastUtils::get_exponent($val) if $fmt =~ /^exp/i;
1040 return (split (/eE/, $val)) if $fmt =~ /^parts/i;
1042 ## Default: return the raw Expect-value.
1043 return $val;
1047 =head2 hsps
1049 Usage : $hit_object->hsps();
1050 Purpose : Get a list containing all HSP objects.
1051 : Get the numbers of HSPs for the current hit.
1052 Example : @hsps = $hit_object->hsps();
1053 : $num = $hit_object->hsps(); # alternatively, use num_hsps()
1054 Returns : Array context : list of Bio::Search::HSP::BlastHSP.pm objects.
1055 : Scalar context: integer (number of HSPs).
1056 : (Equivalent to num_hsps()).
1057 Argument : n/a. Relies on wantarray
1058 Throws : Exception if the HSPs have not been collected.
1060 See Also : L<hsp()|hsp>, L<num_hsps()|num_hsps>
1062 =cut
1064 #---------
1065 sub hsps {
1066 #---------
1067 my $self = shift;
1069 if (not ref $self->{'_hsps'}) {
1070 $self->throw("Can't get HSPs: data not collected.");
1073 return wantarray
1074 # returning list containing all HSPs.
1075 ? @{$self->{'_hsps'}}
1076 # returning number of HSPs.
1077 : scalar(@{$self->{'_hsps'}});
1082 =head2 hsp
1084 Usage : $hit_object->hsp( [string] );
1085 Purpose : Get a single BlastHSP.pm object for the present PsiBlastHit.pm object.
1086 Example : $hspObj = $hit_object->hsp; # same as 'best'
1087 : $hspObj = $hit_object->hsp('best');
1088 : $hspObj = $hit_object->hsp('worst');
1089 Returns : Object reference for a Bio::Search::HSP::BlastHSP.pm object.
1090 Argument : String (or no argument).
1091 : No argument (default) = highest scoring HSP (same as 'best').
1092 : 'best' or 'first' = highest scoring HSP.
1093 : 'worst' or 'last' = lowest scoring HSP.
1094 Throws : Exception if the HSPs have not been collected.
1095 : Exception if an unrecognized argument is used.
1097 See Also : L<hsps()|hsps>, L<num_hsps>()
1099 =cut
1101 #----------
1102 sub hsp {
1103 #----------
1104 my( $self, $option ) = @_;
1105 $option ||= 'best';
1107 if (not ref $self->{'_hsps'}) {
1108 $self->throw("Can't get HSPs: data not collected.");
1111 my @hsps = @{$self->{'_hsps'}};
1113 return $hsps[0] if $option =~ /best|first|1/i;
1114 return $hsps[$#hsps] if $option =~ /worst|last/i;
1116 $self->throw("Can't get HSP for: $option\n" .
1117 "Valid arguments: 'best', 'worst'");
1122 =head2 num_hsps
1124 Usage : $hit_object->num_hsps();
1125 Purpose : Get the number of HSPs for the present Blast hit.
1126 Example : $nhsps = $hit_object->num_hsps();
1127 Returns : Integer
1128 Argument : n/a
1129 Throws : Exception if the HSPs have not been collected.
1131 See Also : L<hsps()|hsps>
1133 =cut
1135 #-------------
1136 sub num_hsps {
1137 #-------------
1138 my $self = shift;
1140 if (not defined $self->{'_hsps'}) {
1141 $self->throw("Can't get HSPs: data not collected.");
1144 return scalar(@{$self->{'_hsps'}});
1149 =head2 logical_length
1151 Usage : $hit_object->logical_length( [seq_type] );
1152 : (mostly intended for internal use).
1153 Purpose : Get the logical length of the hit sequence.
1154 : For query sequence of BLASTX and TBLASTX reports and the hit
1155 : sequence of TBLASTN and TBLASTX reports, the returned length
1156 : is the length of the would-be amino acid sequence (length/3).
1157 : For all other BLAST flavors, this function is the same as length().
1158 Example : $len = $hit_object->logical_length();
1159 Returns : Integer
1160 Argument : seq_type = 'query' or 'hit' or 'sbjct' (default = 'query')
1161 ('sbjct' is synonymous with 'hit')
1162 Throws : n/a
1163 Comments : This is important for functions like frac_aligned_query()
1164 : which need to operate in amino acid coordinate space when dealing
1165 : with T?BLASTX type reports.
1167 See Also : L<length()|length>, L<frac_aligned_query()|frac_aligned_query>, L<frac_aligned_hit()|frac_aligned_hit>
1169 =cut
1171 #--------------------
1172 sub logical_length {
1173 #--------------------
1174 my $self = shift;
1175 my $seqType = shift || 'query';
1176 $seqType = 'sbjct' if $seqType eq 'hit';
1178 my $length;
1180 # For the sbjct, return logical sbjct length
1181 if( $seqType eq 'sbjct' ) {
1182 $length = $self->{'_logical_length'} || $self->{'_length'};
1184 else {
1185 # Otherwise, return logical query length
1186 $length = $self->{'_query_length'};
1188 # Adjust length based on BLAST flavor.
1189 if($self->{'_blast_program'} =~ /T?BLASTX/ ) {
1190 $length /= 3;
1193 return $length;
1197 =head2 length_aln
1199 Usage : $hit_object->length_aln( [seq_type] );
1200 Purpose : Get the total length of the aligned region for query or sbjct seq.
1201 : This number will include all HSPs
1202 Example : $len = $hit_object->length_aln(); # default = query
1203 : $lenAln = $hit_object->length_aln('query');
1204 Returns : Integer
1205 Argument : seq_Type = 'query' or 'hit' or 'sbjct' (Default = 'query')
1206 ('sbjct' is synonymous with 'hit')
1207 Throws : Exception if the argument is not recognized.
1208 Comments : This method will report the logical length of the alignment,
1209 : meaning that for TBLAST[NX] reports, the length is reported
1210 : using amino acid coordinate space (i.e., nucleotides / 3).
1212 : This method requires that all HSPs be tiled. If they have not
1213 : already been tiled, they will be tiled first automatically..
1214 : If you don't want the tiled data, iterate through each HSP
1215 : calling length() on each (use hsps() to get all HSPs).
1217 See Also : L<length()|length>, L<frac_aligned_query()|frac_aligned_query>, L<frac_aligned_hit()|frac_aligned_hit>, L<gaps()|gaps>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>, L<Bio::Search::HSP::BlastHSP::length()|Bio::Search::HSP::BlastHSP>
1219 =cut
1221 #---------------'
1222 sub length_aln {
1223 #---------------
1224 my( $self, $seqType ) = @_;
1226 $seqType ||= 'query';
1227 $seqType = 'sbjct' if $seqType eq 'hit';
1229 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1231 my $data = $self->{'_length_aln_'.$seqType};
1233 ## If we don't have data, figure out what went wrong.
1234 if(!$data) {
1235 $self->throw("Can't get length aln for sequence type \"$seqType\"" .
1236 "Valid types are 'query', 'hit', 'sbjct' ('sbjct' = 'hit')");
1238 $data;
1242 =head2 gaps
1244 Usage : $hit_object->gaps( [seq_type] );
1245 Purpose : Get the number of gaps in the aligned query, sbjct, or both sequences.
1246 : Data is summed across all HSPs.
1247 Example : $qgaps = $hit_object->gaps('query');
1248 : $hgaps = $hit_object->gaps('hit');
1249 : $tgaps = $hit_object->gaps(); # default = total (query + hit)
1250 Returns : scalar context: integer
1251 : array context without args: two-element list of integers
1252 : (queryGaps, sbjctGaps)
1253 : Array context can be forced by providing an argument of 'list' or 'array'.
1255 : CAUTION: Calling this method within printf or sprintf is arrray context.
1256 : So this function may not give you what you expect. For example:
1257 : printf "Total gaps: %d", $hit->gaps();
1258 : Actually returns a two-element array, so what gets printed
1259 : is the number of gaps in the query, not the total
1261 Argument : seq_type: 'query' | 'hit' or 'sbjct' | 'total' | 'list' (default = 'total')
1262 ('sbjct' is synonymous with 'hit')
1263 Throws : n/a
1264 Comments : If you need data for each HSP, use hsps() and then interate
1265 : through each HSP object.
1266 : This method requires that all HSPs be tiled. If they have not
1267 : already been tiled, they will be tiled first automatically..
1268 : Not relying on wantarray since that will fail in situations
1269 : such as printf "%d", $hit->gaps() in which you might expect to
1270 : be printing the total gaps, but evaluates to array context.
1272 See Also : L<length_aln()|length_aln>
1274 =cut
1276 #----------
1277 sub gaps {
1278 #----------
1279 my( $self, $seqType ) = @_;
1281 $seqType ||= (wantarray ? 'list' : 'total');
1282 $seqType = 'sbjct' if $seqType eq 'hit';
1284 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1286 $seqType = lc($seqType);
1288 if($seqType =~ /list|array/i) {
1289 return ($self->{'_gaps_query'}, $self->{'_gaps_sbjct'});
1292 if($seqType eq 'total') {
1293 return ($self->{'_gaps_query'} + $self->{'_gaps_sbjct'}) || 0;
1294 } else {
1295 return $self->{'_gaps_'.$seqType} || 0;
1301 =head2 matches
1303 Usage : $hit_object->matches( [class] );
1304 Purpose : Get the total number of identical or conserved matches
1305 : (or both) across all HSPs.
1306 : (Note: 'conservative' matches are indicated as 'positives'
1307 : in the Blast report.)
1308 Example : ($id,$cons) = $hit_object->matches(); # no argument
1309 : $id = $hit_object->matches('id');
1310 : $cons = $hit_object->matches('cons');
1311 Returns : Integer or a 2-element array of integers
1312 Argument : class = 'id' | 'cons' OR none.
1313 : If no argument is provided, both identical and conservative
1314 : numbers are returned in a two element list.
1315 : (Other terms can be used to refer to the conservative
1316 : matches, e.g., 'positive'. All that is checked is whether or
1317 : not the supplied string starts with 'id'. If not, the
1318 : conservative matches are returned.)
1319 Throws : Exception if the requested data cannot be obtained.
1320 Comments : If you need data for each HSP, use hsps() and then interate
1321 : through the HSP objects.
1322 : Does not rely on wantarray to return a list. Only checks for
1323 : the presence of an argument (no arg = return list).
1325 See Also : L<Bio::Search::HSP::BlastHSP::matches()|Bio::Search::HSP::BlastHSP>, L<hsps()|hsps>
1327 =cut
1329 #---------------
1330 sub matches {
1331 #---------------
1332 my( $self, $arg) = @_;
1333 my(@data,$data);
1335 if(!$arg) {
1336 @data = ($self->{'_totalIdentical'}, $self->{'_totalConserved'});
1338 return @data if @data;
1340 } else {
1342 if($arg =~ /^id/i) {
1343 $data = $self->{'_totalIdentical'};
1344 } else {
1345 $data = $self->{'_totalConserved'};
1347 return $data if $data;
1350 ## Something went wrong if we make it to here.
1351 $self->throw("Can't get identical or conserved data: no data.");
1355 =head2 start
1357 Usage : $sbjct->start( [seq_type] );
1358 Purpose : Gets the start coordinate for the query, sbjct, or both sequences
1359 : in the PsiBlastHit object. If there is more than one HSP, the lowest start
1360 : value of all HSPs is returned.
1361 Example : $qbeg = $sbjct->start('query');
1362 : $sbeg = $sbjct->start('hit');
1363 : ($qbeg, $sbeg) = $sbjct->start();
1364 Returns : scalar context: integer
1365 : array context without args: list of two integers (queryStart, sbjctStart)
1366 : Array context can be "induced" by providing an argument of 'list' or 'array'.
1367 Argument : In scalar context: seq_type = 'query' or 'hit' or 'sbjct' (default = 'query')
1368 ('sbjct' is synonymous with 'hit')
1369 Throws : n/a
1370 Comments : This method requires that all HSPs be tiled. If there is more than one
1371 : HSP and they have not already been tiled, they will be tiled first automatically..
1372 : Remember that the start and end coordinates of all HSPs are
1373 : normalized so that start < end. Strand information can be
1374 : obtained by calling $hit->strand().
1376 See Also : L<end()|end>, L<range()|range>, L<strand()|strand>, L<HSP Tiling and Ambiguous Alignments>, L<Bio::Search::HSP::BlastHSP::start|Bio::Search::HSP::BlastHSP>
1378 =cut
1380 #----------
1381 sub start {
1382 #----------
1383 my ($self, $seqType) = @_;
1385 $seqType ||= (wantarray ? 'list' : 'query');
1386 $seqType = 'sbjct' if $seqType eq 'hit';
1388 # If there is only one HSP, defer this call to the solitary HSP.
1389 if($self->num_hsps == 1) {
1390 return $self->hsp->start($seqType);
1391 } else {
1392 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1393 if($seqType =~ /list|array/i) {
1394 return ($self->{'_queryStart'}, $self->{'_sbjctStart'});
1395 } else {
1396 ## Sensitive to member name changes.
1397 $seqType = "_\L$seqType\E";
1398 return $self->{$seqType.'Start'};
1404 =head2 end
1406 Usage : $sbjct->end( [seq_type] );
1407 Purpose : Gets the end coordinate for the query, sbjct, or both sequences
1408 : in the PsiBlastHit object. If there is more than one HSP, the largest end
1409 : value of all HSPs is returned.
1410 Example : $qend = $sbjct->end('query');
1411 : $send = $sbjct->end('hit');
1412 : ($qend, $send) = $sbjct->end();
1413 Returns : scalar context: integer
1414 : array context without args: list of two integers (queryEnd, sbjctEnd)
1415 : Array context can be "induced" by providing an argument of 'list' or 'array'.
1416 Argument : In scalar context: seq_type = 'query' or 'sbjct'
1417 : (case insensitive). If not supplied, 'query' is used.
1418 Throws : n/a
1419 Comments : This method requires that all HSPs be tiled. If there is more than one
1420 : HSP and they have not already been tiled, they will be tiled first automatically..
1421 : Remember that the start and end coordinates of all HSPs are
1422 : normalized so that start < end. Strand information can be
1423 : obtained by calling $hit->strand().
1425 See Also : L<start()|start>, L<range()|range>, L<strand()|strand>, L<HSP Tiling and Ambiguous Alignments>, L<Bio::Search::HSP::BlastHSP::end|Bio::Search::HSP::BlastHSP>
1427 =cut
1429 #----------
1430 sub end {
1431 #----------
1432 my ($self, $seqType) = @_;
1434 $seqType ||= (wantarray ? 'list' : 'query');
1435 $seqType = 'sbjct' if $seqType eq 'hit';
1437 # If there is only one HSP, defer this call to the solitary HSP.
1438 if($self->num_hsps == 1) {
1439 return $self->hsp->end($seqType);
1440 } else {
1441 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1442 if($seqType =~ /list|array/i) {
1443 return ($self->{'_queryStop'}, $self->{'_sbjctStop'});
1444 } else {
1445 ## Sensitive to member name changes.
1446 $seqType = "_\L$seqType\E";
1447 return $self->{$seqType.'Stop'};
1452 =head2 range
1454 Usage : $sbjct->range( [seq_type] );
1455 Purpose : Gets the (start, end) coordinates for the query or sbjct sequence
1456 : in the HSP alignment.
1457 Example : ($qbeg, $qend) = $sbjct->range('query');
1458 : ($sbeg, $send) = $sbjct->range('hit');
1459 Returns : Two-element array of integers
1460 Argument : seq_type = string, 'query' or 'hit' or 'sbjct' (default = 'query')
1461 ('sbjct' is synonymous with 'hit')
1462 Throws : n/a
1464 See Also : L<start()|start>, L<end()|end>
1466 =cut
1468 #----------
1469 sub range {
1470 #----------
1471 my ($self, $seqType) = @_;
1472 $seqType ||= 'query';
1473 $seqType = 'sbjct' if $seqType eq 'hit';
1474 return ($self->start($seqType), $self->end($seqType));
1478 =head2 frac_identical
1480 Usage : $hit_object->frac_identical( [seq_type] );
1481 Purpose : Get the overall fraction of identical positions across all HSPs.
1482 : The number refers to only the aligned regions and does not
1483 : account for unaligned regions in between the HSPs, if any.
1484 Example : $frac_iden = $hit_object->frac_identical('query');
1485 Returns : Float (2-decimal precision, e.g., 0.75).
1486 Argument : seq_type: 'query' | 'hit' or 'sbjct' | 'total'
1487 : default = 'query' (but see comments below).
1488 : ('sbjct' is synonymous with 'hit')
1489 Throws : n/a
1490 Comments : Different versions of Blast report different values for the total
1491 : length of the alignment. This is the number reported in the
1492 : denominators in the stats section:
1493 : "Identical = 34/120 Positives = 67/120".
1494 : NCBI BLAST uses the total length of the alignment (with gaps)
1495 : WU-BLAST uses the length of the query sequence (without gaps).
1497 : Therefore, when called with an argument of 'total',
1498 : this method will report different values depending on the
1499 : version of BLAST used. Total does NOT take into account HSP
1500 : tiling, so it should not be used.
1502 : To get the fraction identical among only the aligned residues,
1503 : ignoring the gaps, call this method without an argument or
1504 : with an argument of 'query' or 'hit'.
1506 : If you need data for each HSP, use hsps() and then iterate
1507 : through the HSP objects.
1508 : This method requires that all HSPs be tiled. If they have not
1509 : already been tiled, they will be tiled first automatically.
1511 See Also : L<frac_conserved()|frac_conserved>, L<frac_aligned_query()|frac_aligned_query>, L<matches()|matches>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>
1513 =cut
1515 #------------------
1516 sub frac_identical {
1517 #------------------
1518 my ($self, $seqType) = @_;
1519 $seqType ||= 'query';
1520 $seqType = 'sbjct' if $seqType eq 'hit';
1522 ## Sensitive to member name format.
1523 $seqType = lc($seqType);
1525 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1527 sprintf( "%.2f", $self->{'_totalIdentical'}/$self->{'_length_aln_'.$seqType});
1532 =head2 frac_conserved
1534 Usage : $hit_object->frac_conserved( [seq_type] );
1535 Purpose : Get the overall fraction of conserved positions across all HSPs.
1536 : The number refers to only the aligned regions and does not
1537 : account for unaligned regions in between the HSPs, if any.
1538 Example : $frac_cons = $hit_object->frac_conserved('hit');
1539 Returns : Float (2-decimal precision, e.g., 0.75).
1540 Argument : seq_type: 'query' | 'hit' or 'sbjct' | 'total'
1541 : default = 'query' (but see comments below).
1542 : ('sbjct' is synonymous with 'hit')
1543 Throws : n/a
1544 Comments : Different versions of Blast report different values for the total
1545 : length of the alignment. This is the number reported in the
1546 : denominators in the stats section:
1547 : "Positives = 34/120 Positives = 67/120".
1548 : NCBI BLAST uses the total length of the alignment (with gaps)
1549 : WU-BLAST uses the length of the query sequence (without gaps).
1551 : Therefore, when called with an argument of 'total',
1552 : this method will report different values depending on the
1553 : version of BLAST used. Total does NOT take into account HSP
1554 : tiling, so it should not be used.
1556 : To get the fraction conserved among only the aligned residues,
1557 : ignoring the gaps, call this method without an argument or
1558 : with an argument of 'query' or 'hit'.
1560 : If you need data for each HSP, use hsps() and then interate
1561 : through the HSP objects.
1562 : This method requires that all HSPs be tiled. If they have not
1563 : already been tiled, they will be tiled first automatically.
1565 See Also : L<frac_identical()|frac_identical>, L<matches()|matches>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>
1567 =cut
1569 #--------------------
1570 sub frac_conserved {
1571 #--------------------
1572 my ($self, $seqType) = @_;
1573 $seqType ||= 'query';
1574 $seqType = 'sbjct' if $seqType eq 'hit';
1576 ## Sensitive to member name format.
1577 $seqType = lc($seqType);
1579 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1581 sprintf( "%.2f", $self->{'_totalConserved'}/$self->{'_length_aln_'.$seqType});
1587 =head2 frac_aligned_query
1589 Usage : $hit_object->frac_aligned_query();
1590 Purpose : Get the fraction of the query sequence which has been aligned
1591 : across all HSPs (not including intervals between non-overlapping
1592 : HSPs).
1593 Example : $frac_alnq = $hit_object->frac_aligned_query();
1594 Returns : Float (2-decimal precision, e.g., 0.75).
1595 Argument : n/a
1596 Throws : n/a
1597 Comments : If you need data for each HSP, use hsps() and then interate
1598 : through the HSP objects.
1599 : To compute the fraction aligned, the logical length of the query
1600 : sequence is used, meaning that for [T]BLASTX reports, the
1601 : full length of the query sequence is converted into amino acids
1602 : by dividing by 3. This is necessary because of the way
1603 : the lengths of aligned sequences are computed.
1604 : This method requires that all HSPs be tiled. If they have not
1605 : already been tiled, they will be tiled first automatically.
1607 See Also : L<frac_aligned_hit()|frac_aligned_hit>, L<logical_length()|logical_length>, L<length_aln()|length_aln>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>
1609 =cut
1611 #----------------------
1612 sub frac_aligned_query {
1613 #----------------------
1614 my $self = shift;
1616 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1618 sprintf( "%.2f", $self->{'_length_aln_query'}/$self->logical_length('query'));
1623 =head2 frac_aligned_hit
1625 Usage : $hit_object->frac_aligned_hit();
1626 Purpose : Get the fraction of the hit (sbjct) sequence which has been aligned
1627 : across all HSPs (not including intervals between non-overlapping
1628 : HSPs).
1629 Example : $frac_alnq = $hit_object->frac_aligned_hit();
1630 Returns : Float (2-decimal precision, e.g., 0.75).
1631 Argument : n/a
1632 Throws : n/a
1633 Comments : If you need data for each HSP, use hsps() and then interate
1634 : through the HSP objects.
1635 : To compute the fraction aligned, the logical length of the sbjct
1636 : sequence is used, meaning that for TBLAST[NX] reports, the
1637 : full length of the sbjct sequence is converted into amino acids
1638 : by dividing by 3. This is necessary because of the way
1639 : the lengths of aligned sequences are computed.
1640 : This method requires that all HSPs be tiled. If they have not
1641 : already been tiled, they will be tiled first automatically.
1643 See Also : L<frac_aligned_query()|frac_aligned_query>, L<matches()|matches>, , L<logical_length()|logical_length>, L<length_aln()|length_aln>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>
1645 =cut
1647 #--------------------
1648 sub frac_aligned_hit {
1649 #--------------------
1650 my $self = shift;
1652 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1654 sprintf( "%.2f", $self->{'_length_aln_sbjct'}/$self->logical_length('sbjct'));
1658 ## These methods are being maintained for backward compatibility.
1660 =head2 frac_aligned_sbjct
1662 Same as L<frac_aligned_hit()|frac_aligned_hit>
1664 =cut
1666 #----------------
1667 sub frac_aligned_sbjct { my $self=shift; $self->frac_aligned_hit(@_); }
1668 #----------------
1670 =head2 num_unaligned_sbjct
1672 Same as L<num_unaligned_hit()|num_unaligned_hit>
1674 =cut
1676 #----------------
1677 sub num_unaligned_sbjct { my $self=shift; $self->num_unaligned_hit(@_); }
1678 #----------------
1682 =head2 num_unaligned_hit
1684 Usage : $hit_object->num_unaligned_hit();
1685 Purpose : Get the number of the unaligned residues in the hit sequence.
1686 : Sums across all all HSPs.
1687 Example : $num_unaln = $hit_object->num_unaligned_hit();
1688 Returns : Integer
1689 Argument : n/a
1690 Throws : n/a
1691 Comments : See notes regarding logical lengths in the comments for frac_aligned_hit().
1692 : They apply here as well.
1693 : If you need data for each HSP, use hsps() and then interate
1694 : through the HSP objects.
1695 : This method requires that all HSPs be tiled. If they have not
1696 : already been tiled, they will be tiled first automatically..
1698 See Also : L<num_unaligned_query()|num_unaligned_query>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>, L<frac_aligned_hit()|frac_aligned_hit>
1700 =cut
1702 #---------------------
1703 sub num_unaligned_hit {
1704 #---------------------
1705 my $self = shift;
1707 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1709 my $num = $self->logical_length('sbjct') - $self->{'_length_aln_sbjct'};
1710 ($num < 0 ? 0 : $num );
1714 =head2 num_unaligned_query
1716 Usage : $hit_object->num_unaligned_query();
1717 Purpose : Get the number of the unaligned residues in the query sequence.
1718 : Sums across all all HSPs.
1719 Example : $num_unaln = $hit_object->num_unaligned_query();
1720 Returns : Integer
1721 Argument : n/a
1722 Throws : n/a
1723 Comments : See notes regarding logical lengths in the comments for frac_aligned_query().
1724 : They apply here as well.
1725 : If you need data for each HSP, use hsps() and then interate
1726 : through the HSP objects.
1727 : This method requires that all HSPs be tiled. If they have not
1728 : already been tiled, they will be tiled first automatically..
1730 See Also : L<num_unaligned_hit()|num_unaligned_hit>, L<frac_aligned_query()|frac_aligned_query>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>
1732 =cut
1734 #-----------------------
1735 sub num_unaligned_query {
1736 #-----------------------
1737 my $self = shift;
1739 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1741 my $num = $self->logical_length('query') - $self->{'_length_aln_query'};
1742 ($num < 0 ? 0 : $num );
1747 =head2 seq_inds
1749 Usage : $hit->seq_inds( seq_type, class, collapse );
1750 Purpose : Get a list of residue positions (indices) across all HSPs
1751 : for identical or conserved residues in the query or sbjct sequence.
1752 Example : @s_ind = $hit->seq_inds('query', 'identical');
1753 : @h_ind = $hit->seq_inds('hit', 'conserved');
1754 : @h_ind = $hit->seq_inds('hit', 'conserved', 1);
1755 Returns : Array of integers
1756 : May include ranges if collapse is non-zero.
1757 Argument : [0] seq_type = 'query' or 'hit' or 'sbjct' (default = 'query')
1758 : ('sbjct' is synonymous with 'hit')
1759 : [1] class = 'identical' or 'conserved' (default = 'identical')
1760 : (can be shortened to 'id' or 'cons')
1761 : (actually, anything not 'id' will evaluate to 'conserved').
1762 : [2] collapse = boolean, if non-zero, consecutive positions are merged
1763 : using a range notation, e.g., "1 2 3 4 5 7 9 10 11"
1764 : collapses to "1-5 7 9-11". This is useful for
1765 : consolidating long lists. Default = no collapse.
1766 Throws : n/a.
1767 Comments : Note that HSPs are not tiled for this. This could be a problem
1768 : for hits containing mutually exclusive HSPs.
1769 : TODO: Consider tiling and then reporting seq_inds for the
1770 : best HSP contig.
1772 See Also : L<Bio::Search::HSP::BlastHSP::seq_inds()|Bio::Search::HSP::BlastHSP>
1774 =cut
1776 #-------------
1777 sub seq_inds {
1778 #-------------
1779 my ($self, $seqType, $class, $collapse) = @_;
1781 $seqType ||= 'query';
1782 $class ||= 'identical';
1783 $collapse ||= 0;
1785 $seqType = 'sbjct' if $seqType eq 'hit';
1787 my (@inds, $hsp);
1788 foreach $hsp ($self->hsps) {
1789 # This will merge data for all HSPs together.
1790 push @inds, $hsp->seq_inds($seqType, $class);
1793 # Need to remove duplicates and sort the merged positions.
1794 if(@inds) {
1795 my %tmp = map { $_, 1 } @inds;
1796 @inds = sort {$a <=> $b} keys %tmp;
1799 $collapse ? &Bio::Search::BlastUtils::collapse_nums(@inds) : @inds;
1803 =head2 iteration
1805 Usage : $sbjct->iteration( );
1806 Purpose : Gets the iteration number in which the Hit was found.
1807 Example : $iteration_num = $sbjct->iteration();
1808 Returns : Integer greater than or equal to 1
1809 Non-PSI-BLAST reports will report iteration as 1, but this number
1810 is only meaningful for PSI-BLAST reports.
1811 Argument : none
1812 Throws : none
1814 See Also : L<found_again()|found_again>
1816 =cut
1818 #----------------
1819 sub iteration { shift->{'_iteration'} }
1820 #----------------
1823 =head2 found_again
1825 Usage : $sbjct->found_again;
1826 Purpose : Gets a boolean indicator whether or not the hit has
1827 been found in a previous iteration.
1828 This is only applicable to PSI-BLAST reports.
1830 This method indicates if the hit was reported in the
1831 "Sequences used in model and found again" section of the
1832 PSI-BLAST report or if it was reported in the
1833 "Sequences not found previously or not previously below threshold"
1834 section of the PSI-BLAST report. Only for hits in iteration > 1.
1836 Example : if( $sbjct->found_again()) { ... };
1837 Returns : Boolean (1 or 0) for PSI-BLAST report iterations greater than 1.
1838 Returns undef for PSI-BLAST report iteration 1 and non PSI_BLAST
1839 reports.
1840 Argument : none
1841 Throws : none
1843 See Also : L<found_again()|found_again>
1845 =cut
1847 #----------------
1848 sub found_again { shift->{'_found_again'} }
1849 #----------------
1852 =head2 strand
1854 Usage : $sbjct->strand( [seq_type] );
1855 Purpose : Gets the strand(s) for the query, sbjct, or both sequences
1856 : in the best HSP of the PsiBlastHit object after HSP tiling.
1857 : Only valid for BLASTN, TBLASTX, BLASTX-query, TBLASTN-hit.
1858 Example : $qstrand = $sbjct->strand('query');
1859 : $sstrand = $sbjct->strand('hit');
1860 : ($qstrand, $sstrand) = $sbjct->strand();
1861 Returns : scalar context: integer '1', '-1', or '0'
1862 : array context without args: list of two strings (queryStrand, sbjctStrand)
1863 : Array context can be "induced" by providing an argument of 'list' or 'array'.
1864 Argument : In scalar context: seq_type = 'query' or 'hit' or 'sbjct' (default = 'query')
1865 ('sbjct' is synonymous with 'hit')
1866 Throws : n/a
1867 Comments : This method requires that all HSPs be tiled. If they have not
1868 : already been tiled, they will be tiled first automatically..
1869 : If you don't want the tiled data, iterate through each HSP
1870 : calling strand() on each (use hsps() to get all HSPs).
1872 : Formerly (prior to 10/21/02), this method would return the
1873 : string "-1/1" for hits with HSPs on both strands.
1874 : However, now that strand and frame is properly being accounted
1875 : for during HSP tiling, it makes more sense for strand()
1876 : to return the strand data for the best HSP after tiling.
1878 : If you really want to know about hits on opposite strands,
1879 : you should be iterating through the HSPs using methods on the
1880 : HSP objects.
1882 : A possible use case where knowing whether a hit has HSPs
1883 : on both strands would be when filtering via SearchIO for hits with
1884 : this property. However, in this case it would be better to have a
1885 : dedicated method such as $hit->hsps_on_both_strands(). Similarly
1886 : for frame. This could be provided if there is interest.
1888 See Also : L<Bio::Search::HSP::BlastHSP::strand>()
1890 =cut
1892 #----------'
1893 sub strand {
1894 #----------
1895 my ($self, $seqType) = @_;
1897 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
1899 $seqType ||= (wantarray ? 'list' : 'query');
1900 $seqType = 'sbjct' if $seqType eq 'hit';
1902 my ($qstr, $hstr);
1903 # If there is only one HSP, defer this call to the solitary HSP.
1904 if($self->num_hsps == 1) {
1905 return $self->hsp->strand($seqType);
1907 elsif( defined $self->{'_qstrand'}) {
1908 # Get the data computed during hsp tiling.
1909 $qstr = $self->{'_qstrand'};
1910 $hstr = $self->{'_sstrand'};
1912 else {
1913 # otherwise, iterate through all HSPs collecting strand info.
1914 # This will return the string "-1/1" if there are HSPs on different strands.
1915 # NOTE: This was the pre-10/21/02 procedure which will no longer be used,
1916 # (unless the above elsif{} is commented out).
1917 my (%qstr, %hstr);
1918 foreach my $hsp( $self->hsps ) {
1919 my ( $q, $h ) = $hsp->strand();
1920 $qstr{ $q }++;
1921 $hstr{ $h }++;
1923 $qstr = join( '/', sort keys %qstr);
1924 $hstr = join( '/', sort keys %hstr);
1927 if($seqType =~ /list|array/i) {
1928 return ($qstr, $hstr);
1929 } elsif( $seqType eq 'query' ) {
1930 return $qstr;
1931 } else {
1932 return $hstr;
1938 __END__
1940 #####################################################################################
1941 # END OF CLASS #
1942 #####################################################################################
1945 =head1 FOR DEVELOPERS ONLY
1947 =head2 Data Members
1949 Information about the various data members of this module is provided for those
1950 wishing to modify or understand the code. Two things to bear in mind:
1952 =over 4
1954 =item 1 Do NOT rely on these in any code outside of this module.
1956 All data members are prefixed with an underscore to signify that they are private.
1957 Always use accessor methods. If the accessor doesn't exist or is inadequate,
1958 create or modify an accessor (and let me know, too!). (An exception to this might
1959 be for BlastHSP.pm which is more tightly coupled to PsiBlastHit.pm and
1960 may access PsiBlastHit data members directly for efficiency purposes, but probably
1961 should not).
1963 =item 2 This documentation may be incomplete and out of date.
1965 It is easy for these data member descriptions to become obsolete as
1966 this module is still evolving. Always double check this info and search
1967 for members not described here.
1969 =back
1971 An instance of Bio::Search::Hit::PsiBlastHit.pm is a blessed reference to a hash containing
1972 all or some of the following fields:
1974 FIELD VALUE
1975 --------------------------------------------------------------
1976 _hsps : Array ref for a list of Bio::Search::HSP::BlastHSP.pm objects.
1978 _db : Database identifier from the summary line.
1980 _desc : Description data for the hit from the summary line.
1982 _length : Total length of the hit sequence.
1984 _score : BLAST score.
1986 _bits : BLAST score (in bits). Matrix-independent.
1988 _p : BLAST P value. Obtained from summary section. (Blast1/WU-Blast only)
1990 _expect : BLAST Expect value. Obtained from summary section.
1992 _n : BLAST N value (number of HSPs) (Blast1/WU-Blast2 only)
1994 _frame : Reading frame for TBLASTN and TBLASTX analyses.
1996 _totalIdentical: Total number of identical aligned monomers.
1998 _totalConserved: Total number of conserved aligned monomers (a.k.a. "positives").
2000 _overlap : Maximum number of overlapping residues between adjacent HSPs
2001 : before considering the alignment to be ambiguous.
2003 _ambiguous_aln : Boolean. True if the alignment of all HSPs is ambiguous.
2005 _length_aln_query : Length of the aligned region of the query sequence.
2007 _length_aln_sbjct : Length of the aligned region of the sbjct sequence.
2010 =cut