2 # BioPerl module for Bio::SearchIO::blast
4 # Please direct questions and support issues to <bioperl-l@bioperl.org>
6 # Cared for by Jason Stajich <jason@bioperl.org>
8 # Copyright Jason Stajich
10 # You may distribute this module under the same terms as perl itself
12 # POD documentation - main docs before the code
15 # PSI-BLAST full parsing support. Rollout of new
16 # model which will remove Steve's old psiblast driver
18 # Megablast parsing fix as reported by Neil Saunders
20 # Support bl2seq parsing
22 # Parse more blast statistics, lambda, entropy, etc
23 # from WU-BLAST in frame-specific manner
24 # 20060216 - cjf - fixed blast parsing for BLAST v2.2.13 output
25 # 20071104 - dmessina - added support for WUBLAST -echofilter
26 # 20071121 - cjf - fixed several bugs (bugs 2391, 2399, 2409)
30 Bio::SearchIO::blast - Event generator for event based parsing of
35 # Do not use this object directly - it is used as part of the
36 # Bio::SearchIO system.
39 my $searchio = Bio::SearchIO->new(-format => 'blast',
40 -file => 't/data/ecolitst.bls');
41 while( my $result = $searchio->next_result ) {
42 while( my $hit = $result->next_hit ) {
43 while( my $hsp = $hit->next_hsp ) {
51 This object encapsulated the necessary methods for generating events
52 suitable for building Bio::Search objects from a BLAST report file.
53 Read the L<Bio::SearchIO> for more information about how to use this.
55 This driver can parse:
61 NCBI produced plain text BLAST reports from blastall, this also
62 includes PSIBLAST, PSITBLASTN, RPSBLAST, and bl2seq reports. NCBI XML
63 BLAST output is parsed with the blastxml SearchIO driver
71 Jim Kent's BLAST-like output from his programs (BLASTZ, BLAT)
75 BLAST-like output from Paracel BTK output
81 Since I cannot differentiate between BLASTX and TBLASTN since bl2seq
82 doesn't report the algorithm used - I assume it is BLASTX by default -
83 you can supply the program type with -report_type in the SearchIO
86 my $parser = Bio::SearchIO->new(-format => 'blast',
87 -file => 'bl2seq.tblastn.report',
88 -report_type => 'tblastn');
90 This only really affects where the frame and strand information are
91 put - they will always be on the $hsp-E<gt>query instead of on the
92 $hsp-E<gt>hit part of the feature pair for blastx and tblastn bl2seq
93 produced reports. Hope that's clear...
99 User feedback is an integral part of the evolution of this and other
100 Bioperl modules. Send your comments and suggestions preferably to
101 the Bioperl mailing list. Your participation is much appreciated.
103 bioperl-l@bioperl.org - General discussion
104 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
108 Please direct usage questions or support issues to the mailing list:
110 I<bioperl-l@bioperl.org>
112 rather than to the module maintainer directly. Many experienced and
113 reponsive experts will be able look at the problem and quickly
114 address it. Please include a thorough description of the problem
115 with code and data examples if at all possible.
117 =head2 Reporting Bugs
119 Report bugs to the Bioperl bug tracking system to help us keep track
120 of the bugs and their resolution. Bug reports can be submitted via the
123 https://github.com/bioperl/bioperl-live/issues
125 =head1 AUTHOR - Jason Stajich
127 Email Jason Stajich jason-at-bioperl.org
131 Steve Chervitz sac-at-bioperl.org
135 The rest of the documentation details each of the object methods.
136 Internal methods are usually preceded with a _
140 # Let the code begin...'
142 package Bio
::SearchIO
::blast
;
144 use Bio
::SearchIO
::IteratedSearchResultEventBuilder
;
146 use vars
qw(%MAPPING %MODEMAP
147 $DEFAULT_BLAST_WRITER_CLASS
155 use base qw(Bio::SearchIO);
160 # mapping of NCBI Blast terms to Bioperl hash keys
162 'BlastOutput' => 'result',
163 'Iteration' => 'iteration',
168 # This should really be done more intelligently, like with
172 'Hsp_bit-score' => 'HSP-bits',
173 'Hsp_score' => 'HSP-score',
174 'Hsp_evalue' => 'HSP-evalue',
176 'Hsp_pvalue' => 'HSP-pvalue',
177 'Hsp_query-from' => 'HSP-query_start',
178 'Hsp_query-to' => 'HSP-query_end',
179 'Hsp_hit-from' => 'HSP-hit_start',
180 'Hsp_hit-to' => 'HSP-hit_end',
181 'Hsp_positive' => 'HSP-conserved',
182 'Hsp_identity' => 'HSP-identical',
183 'Hsp_gaps' => 'HSP-hsp_gaps',
184 'Hsp_hitgaps' => 'HSP-hit_gaps',
185 'Hsp_querygaps' => 'HSP-query_gaps',
186 'Hsp_qseq' => 'HSP-query_seq',
187 'Hsp_hseq' => 'HSP-hit_seq',
188 'Hsp_midline' => 'HSP-homology_seq',
189 'Hsp_align-len' => 'HSP-hsp_length',
190 'Hsp_query-frame' => 'HSP-query_frame',
191 'Hsp_hit-frame' => 'HSP-hit_frame',
192 'Hsp_links' => 'HSP-links',
193 'Hsp_group' => 'HSP-hsp_group',
194 'Hsp_features' => 'HSP-hit_features',
196 'Hit_id' => 'HIT-name',
197 'Hit_len' => 'HIT-length',
198 'Hit_accession' => 'HIT-accession',
199 'Hit_def' => 'HIT-description',
200 'Hit_signif' => 'HIT-significance',
201 # For NCBI blast, the description line contains bits.
202 # For WU-blast, the description line contains score.
203 'Hit_score' => 'HIT-score',
204 'Hit_bits' => 'HIT-bits',
206 'Iteration_iter-num' => 'ITERATION-number',
207 'Iteration_converged' => 'ITERATION-converged',
209 'BlastOutput_program' => 'RESULT-algorithm_name',
210 'BlastOutput_version' => 'RESULT-algorithm_version',
211 'BlastOutput_algorithm-reference' => 'RESULT-algorithm_reference',
212 'BlastOutput_rid' => 'RESULT-rid',
213 'BlastOutput_query-def' => 'RESULT-query_name',
214 'BlastOutput_query-len' => 'RESULT-query_length',
215 'BlastOutput_query-acc' => 'RESULT-query_accession',
216 'BlastOutput_query-gi' => 'RESULT-query_gi',
217 'BlastOutput_querydesc' => 'RESULT-query_description',
218 'BlastOutput_db' => 'RESULT-database_name',
219 'BlastOutput_db-len' => 'RESULT-database_entries',
220 'BlastOutput_db-let' => 'RESULT-database_letters',
221 'BlastOutput_inclusion-threshold' => 'RESULT-inclusion_threshold',
223 'Parameters_matrix' => { 'RESULT-parameters' => 'matrix' },
224 'Parameters_expect' => { 'RESULT-parameters' => 'expect' },
225 'Parameters_include' => { 'RESULT-parameters' => 'include' },
226 'Parameters_sc-match' => { 'RESULT-parameters' => 'match' },
227 'Parameters_sc-mismatch' => { 'RESULT-parameters' => 'mismatch' },
228 'Parameters_gap-open' => { 'RESULT-parameters' => 'gapopen' },
229 'Parameters_gap-extend' => { 'RESULT-parameters' => 'gapext' },
230 'Parameters_filter' => { 'RESULT-parameters' => 'filter' },
231 'Parameters_allowgaps' => { 'RESULT-parameters' => 'allowgaps' },
232 'Parameters_full_dbpath' => { 'RESULT-parameters' => 'full_dbpath' },
233 'Statistics_db-len' => { 'RESULT-statistics' => 'dbentries' },
234 'Statistics_db-let' => { 'RESULT-statistics' => 'dbletters' },
235 'Statistics_hsp-len' =>
236 { 'RESULT-statistics' => 'effective_hsplength' },
237 'Statistics_query-len' => { 'RESULT-statistics' => 'querylength' },
238 'Statistics_eff-space' => { 'RESULT-statistics' => 'effectivespace' },
239 'Statistics_eff-spaceused' =>
240 { 'RESULT-statistics' => 'effectivespaceused' },
241 'Statistics_eff-dblen' =>
242 { 'RESULT-statistics' => 'effectivedblength' },
243 'Statistics_kappa' => { 'RESULT-statistics' => 'kappa' },
244 'Statistics_lambda' => { 'RESULT-statistics' => 'lambda' },
245 'Statistics_entropy' => { 'RESULT-statistics' => 'entropy' },
246 'Statistics_gapped_kappa' => { 'RESULT-statistics' => 'kappa_gapped' },
247 'Statistics_gapped_lambda' =>
248 { 'RESULT-statistics' => 'lambda_gapped' },
249 'Statistics_gapped_entropy' =>
250 { 'RESULT-statistics' => 'entropy_gapped' },
252 'Statistics_framewindow' =>
253 { 'RESULT-statistics' => 'frameshiftwindow' },
254 'Statistics_decay' => { 'RESULT-statistics' => 'decayconst' },
256 'Statistics_hit_to_db' => { 'RESULT-statistics' => 'Hits_to_DB' },
257 'Statistics_num_suc_extensions' =>
258 { 'RESULT-statistics' => 'num_successful_extensions' },
259 'Statistics_length_adjustment' => { 'RESULT-statistics' => 'length_adjustment' },
260 'Statistics_number_of_hsps_better_than_expect_value_cutoff_without_gapping' =>
261 { 'RESULT-statistics' => 'number_of_hsps_better_than_expect_value_cutoff_without_gapping' },
262 'Statistics_number_of_hsps_gapped' => { 'RESULT-statistics' => 'number_of_hsps_gapped' },
263 'Statistics_number_of_hsps_successfully_gapped' => { 'RESULT-statistics' => 'number_of_hsps_successfully_gapped' },
266 'Statistics_DFA_states' => { 'RESULT-statistics' => 'num_dfa_states' },
267 'Statistics_DFA_size' => { 'RESULT-statistics' => 'dfa_size' },
268 'Statistics_noprocessors' =>
269 { 'RESULT-statistics' => 'no_of_processors' },
270 'Statistics_neighbortime' =>
271 { 'RESULT-statistics' => 'neighborhood_generate_time' },
272 'Statistics_starttime' => { 'RESULT-statistics' => 'start_time' },
273 'Statistics_endtime' => { 'RESULT-statistics' => 'end_time' },
276 # add WU-BLAST Frame-Based Statistics
277 for my $frame ( 0 .. 3 ) {
278 for my $strand ( '+', '-' ) {
280 qw(length efflength E S W T X X_gapped E2
284 $MAPPING{"Statistics_frame$strand$frame\_$ind"} =
285 { 'RESULT-statistics' => "Frame$strand$frame\_$ind" };
287 for my $val (qw(lambda kappa entropy )) {
288 for my $type (qw(used computed gapped)) {
289 my $key = "Statistics_frame$strand$frame\_$val\_$type";
291 { 'RESULT-statistics' =>
292 "Frame$strand$frame\_$val\_$type" };
293 $MAPPING{$key} = $val;
301 qw(T A X1 X2 X3 S1 S2 X1_bits X2_bits X3_bits
302 S1_bits S2_bits num_extensions
303 num_successful_extensions
304 seqs_better_than_cutoff
306 search_cputime total_cputime
307 search_actualtime total_actualtime
308 no_of_processors ctxfactor)
311 my $key = "Statistics_$stats";
312 my $val = { 'RESULT-statistics' => $stats };
313 $MAPPING{$key} = $val;
316 # add WU-BLAST Parameters
318 qw(span span1 span2 links warnings notes hspsepsmax
319 hspsepqmax topcomboN topcomboE postsw cpus wordmask
320 filter sort_by_pvalue sort_by_count sort_by_highscore
321 sort_by_totalscore sort_by_subjectlength noseqs gi qtype
325 my $key = "Parameters_$param";
326 my $val = { 'RESULT-parameters' => $param };
327 $MAPPING{$key} = $val;
330 $DEFAULT_BLAST_WRITER_CLASS = 'Bio::SearchIO::Writer::HitTableWriter';
331 $MAX_HSP_OVERLAP = 2; # Used when tiling multiple HSPs.
332 $DEFAULTREPORTTYPE = 'BLASTP'; # for bl2seq
338 Usage : my $obj = Bio::SearchIO::blast->new(%args);
339 Function: Builds a new Bio::SearchIO::blast object
340 Returns : Bio::SearchIO::blast
341 Args : Key-value pairs:
342 -fh/-file => filehandle/filename to BLAST file
344 -report_type => 'blastx', 'tblastn', etc -- only for bl2seq
345 reports when you want to distinguish between
346 tblastn and blastx reports (this only controls
347 where the frame information is put - on the query
349 -inclusion_threshold => e-value threshold for inclusion in the
350 PSI-BLAST score matrix model (blastpgp)
351 -signif => float or scientific notation number to be used
352 as a P- or Expect value cutoff
353 -score => integer or scientific notation number to be used
354 as a blast score value cutoff
355 -bits => integer or scientific notation number to be used
356 as a bit score value cutoff
357 -hit_filter => reference to a function to be used for
358 filtering hits based on arbitrary criteria.
359 All hits of each BLAST report must satisfy
360 this criteria to be retained.
361 If a hit fails this test, it is ignored.
362 This function should take a
363 Bio::Search::Hit::BlastHit.pm object as its first
364 argument and return true
365 if the hit should be retained.
366 Sample filter function:
367 -hit_filter => sub { $hit = shift;
369 (Note: -filt_func is synonymous with -hit_filter)
370 -overlap => integer. The amount of overlap to permit between
371 adjacent HSPs when tiling HSPs. A reasonable value is 2.
372 Default = $Bio::SearchIO::blast::MAX_HSP_OVERLAP.
374 The following criteria are not yet supported:
375 (these are probably best applied within this module rather than in the
376 event handler since they would permit the parser to take some shortcuts.)
378 -check_all_hits => boolean. Check all hits for significance against
379 significance criteria. Default = false.
380 If false, stops processing hits after the first
381 non-significant hit or the first hit that fails
382 the hit_filter call. This speeds parsing,
383 taking advantage of the fact that the hits
384 are processed in the order they appear in the report.
385 -min_query_len => integer to be used as a minimum for query sequence length.
386 Reports with query sequences below this length will
387 not be processed. Default = no minimum length.
388 -best => boolean. Only process the best hit of each report;
394 my ( $self, @args ) = @_;
395 $self->SUPER::_initialize
(@args);
397 # Blast reports require a specialized version of the SREB due to the
398 # possibility of iterations (PSI-BLAST). Forwarding all arguments to it. An
399 # issue here is that we want to set new default object factories if none are
402 my $handler = Bio
::SearchIO
::IteratedSearchResultEventBuilder
->new(@args);
403 $self->attach_EventHandler($handler);
405 # 2006-04-26 move this to the attach_handler function in this module so we
406 # can really reset the handler
407 # Optimization: caching
408 # the EventHandler since it is used a lot during the parse.
410 # $self->{'_handler_cache'} = $handler;
412 my ($rpttype ) = $self->_rearrange(
419 defined $rpttype && ( $self->{'_reporttype'} = $rpttype );
422 sub attach_EventHandler
{
423 my ($self,$handler) = @_;
425 $self->SUPER::attach_EventHandler
($handler);
427 # Optimization: caching the EventHandler since it is used a lot
430 $self->{'_handler_cache'} = $handler;
437 Usage : my $hit = $searchio->next_result;
438 Function: Returns the next Result from a search
439 Returns : Bio::Search::Result::ResultI object
446 my $v = $self->verbose;
449 $self->{'_seentop'} = 0; # start next report at top
451 my ( $reporttype, $seenquery, $reportline, $reportversion );
452 my ( $seeniteration, $found_again );
453 my $incl_threshold = $self->inclusion_threshold;
455 $self->start_document(); # let the fun begin...
457 my $gapped_stats = 0; # for switching between gapped/ungapped
459 local $_ = "\n"; #consistency
461 while ( defined( $_ = $self->_readline ) ) {
462 next if (/^\s+$/); # skip empty lines
463 next if (/CPU time:/);
465 next if (/[*]+\s+No hits found\s+[*]+/);
467 /^((?:\S+?)?BLAST[NPX]?)\s+(.+)$/i # NCBI BLAST, PSIBLAST
468 # RPSBLAST, MEGABLAST
469 || /^(P?GENEWISE|HFRAME|SWN|TSWN)\s+(.+)/i #Paracel BTK
472 ($reporttype, $reportversion) = ($1, $2);
473 # need to keep track of whether this is WU-BLAST
474 if ($reportversion && $reportversion =~ m{WashU$}) {
475 $self->{'_wublast'}++;
477 $self->debug("blast.pm: Start of new report: $reporttype, $reportversion\n");
478 if ( $self->{'_seentop'} ) {
479 # This handles multi-result input streams
480 $self->_pushback($_);
483 $self->_start_blastoutput;
484 if ($reporttype =~ /RPS-BLAST/) {
485 $reporttype .= '(BLASTP)'; # default RPS-BLAST type
487 $reportline = $_; # to fix the fact that RPS-BLAST output is wrong
490 'Name' => 'BlastOutput_program',
491 'Data' => $reporttype
497 'Name' => 'BlastOutput_version',
498 'Data' => $reportversion
503 'Name' => 'BlastOutput_inclusion-threshold',
504 'Data' => $incl_threshold
508 # parse the BLAST algorithm reference
509 elsif(/^Reference:\s+(.*)$/) {
510 # want to preserve newlines for the BLAST algorithm reference
511 my $algorithm_reference = "$1\n";
512 $_ = $self->_readline;
513 # while the current line, does not match an empty line, a RID:, a
514 # Database:, or a query definition line (Query=) we are still
515 # looking at the algorithm_reference, append it to what we parsed so
517 while($_ !~ /^$/ && $_ !~ /^RID:/ && $_ !~ /^Database:/ && $_ !~ /^Query=/) {
518 $algorithm_reference .= "$_";
519 $_ = $self->_readline;
521 # if we exited the while loop, we saw an empty line, a RID:, or a
522 # Database:, so push it back
523 $self->_pushback($_);
526 'Name' => 'BlastOutput_algorithm-reference',
527 'Data' => $algorithm_reference
531 # parse BLAST RID (Request ID)
532 elsif(/^RID:\s+(.*)$/) {
536 'Name' => 'BlastOutput_rid',
541 # added Windows workaround for bug 1985
542 elsif (/^(Searching|Results from round)/) {
543 next unless $1 =~ /Results from round/;
544 $self->debug("blast.pm: Possible psi blast iterations found...\n");
546 $self->in_element('hsp')
547 && $self->end_element( { 'Name' => 'Hsp' } );
548 $self->in_element('hit')
549 && $self->end_element( { 'Name' => 'Hit' } );
550 if ( defined $seeniteration ) {
551 $self->within_element('iteration')
552 && $self->end_element( { 'Name' => 'Iteration' } );
553 $self->start_element( { 'Name' => 'Iteration' } );
556 $self->start_element( { 'Name' => 'Iteration' } );
560 elsif (/^Query=\s*(.*)$/) {
562 $self->debug("blast.pm: Query= found...$_\n");
564 if ( defined $seenquery ) {
565 $self->_pushback($_);
566 $self->_pushback($reportline) if $reportline;
569 if ( !defined $reporttype ) {
570 $self->_start_blastoutput;
571 if ( defined $seeniteration ) {
572 $self->in_element('iteration')
573 && $self->end_element( { 'Name' => 'Iteration' } );
574 $self->start_element( { 'Name' => 'Iteration' } );
577 $self->start_element( { 'Name' => 'Iteration' } );
582 $_ = $self->_readline;
583 while ( defined($_) ) {
585 $self->_pushback($_);
588 # below line fixes length issue with BLAST v2.2.13; still works
590 if ( /\((\-?[\d,]+)\s+letters.*\)/ || /^Length=(\-?[\d,]+)/ ) {
597 $q .= ($q =~ /\w$/ && $_ =~ /^\w/) ?
" $_" : $_;
598 $q =~ s/\s+/ /g; # this catches the newline as well
602 $_ = $self->_readline;
605 my ( $nm, $desc ) = split( /\s+/, $q, 2 );
608 'Name' => 'BlastOutput_query-def',
614 'Name' => 'BlastOutput_query-len',
618 defined $desc && $desc =~ s/\s+$//;
621 'Name' => 'BlastOutput_querydesc',
625 my ( $gi, $acc, $version ) = $self->_get_seq_identifiers($nm);
626 $version = defined($version) && length($version) ?
".$version" : "";
629 'Name' => 'BlastOutput_query-acc',
630 'Data' => "$acc$version"
634 # these elements are dropped with some multiquery reports; add
638 'Name' => 'BlastOutput_db-len',
639 'Data' => $self->{'_blsdb_length'}
641 ) if $self->{'_blsdb_length'};
644 'Name' => 'BlastOutput_db-let',
645 'Data' => $self->{'_blsdb_letters'}
647 ) if $self->{'_blsdb_letters'};
650 'Name' => 'BlastOutput_db',
651 'Data' => $self->{'_blsdb'}
653 ) if $self->{'_blsdb_letters'};
655 # added check for WU-BLAST -echofilter option (bug 2388)
656 elsif (/^>Unfiltered[+-]1$/) {
657 # skip all of the lines of unfiltered sequence
658 while($_ !~ /^Database:/) {
659 $self->debug("Bypassing features line: $_");
660 $_ = $self->_readline;
662 $self->_pushback($_);
664 elsif (/Sequences producing significant alignments:/) {
665 $self->debug("blast.pm: Processing NCBI-BLAST descriptions\n");
668 # PSI-BLAST parsing needs to be fixed to specifically look
669 # for old vs new per iteration, as sorting based on duplication
670 # leads to bugs, see bug 1986
672 # The next line is not necessarily whitespace in psiblast reports.
673 # Also note that we must look for the end of this section by testing
674 # for a line with a leading >. Blank lines occur with this section
676 if ( !$self->in_element('iteration') ) {
677 $self->start_element( { 'Name' => 'Iteration' } );
680 # changed 8/28/2008 to exit hit table if blank line is found after an
685 while ( defined( my $descline = $self->_readline() ) ) {
686 if ($descline =~ m{^\s*$}) {
687 last DESCLINE
if $seen_block;
690 # any text match is part of block...
692 # GCG multiline oddness...
693 if ($descline =~ /^(\S+)\s+Begin:\s\d+\s+End:\s+\d+/xms) {
694 my ($id, $nextline) = ($1, $self->_readline);
695 $nextline =~ s{^!}{};
696 $descline = "$id $nextline";
698 # NCBI style hit table (no N)
699 if ($descline =~ /(?
<!cor
) # negative lookahead
700 (\d
*\
.?
(?
:[\
+\
-eE
]+)?\d
+) # number (float or scientific notation)
702 (\d
*\
.?
(?
:[\
+\
-eE
]+)?\d
+) # number (float or scientific notation)
705 my ( $score, $evalue ) = ($1, $2);
707 # Some data clean-up so e-value will appear numeric to perl
708 $evalue =~ s/^e/1e/i;
710 # This to handle no-HSP case
711 my @line = split ' ',$descline;
713 # we want to throw away the score, evalue
714 pop @line, pop @line;
716 # and N if it is present (of course they are not
717 # really in that order, but it doesn't matter
718 if ($3) { pop @line }
720 # add the last 2 entries s.t. we can reconstruct
721 # a minimal Hit object at the end of the day
722 push @hit_signifs, [ $evalue, $score, shift @line, join( ' ', @line ) ];
723 } elsif ($descline =~ /^CONVERGED/i) {
726 'Name' => 'Iteration_converged',
731 $self->_pushback($descline); # Catch leading > (end of section)
736 elsif (/Sequences producing High-scoring Segment Pairs:/) {
738 # This block is for WU-BLAST, so we don't have to check for psi-blast stuff
740 $self->debug("blast.pm: Processing WU-BLAST descriptions\n");
741 $_ = $self->_readline();
744 if ( !$self->in_element('iteration') ) {
745 $self->start_element( { 'Name' => 'Iteration' } );
748 while ( defined( $_ = $self->_readline() )
752 pop @line; # throw away first number which is for 'N'col
754 # add the last 2 entries to array s.t. we can reconstruct
755 # a minimal Hit object at the end of the day
757 [ pop @line, pop @line, shift @line, join( ' ', @line ) ];
761 elsif (/^Database:\s*(.+?)\s*$/) {
763 $self->debug("blast.pm: Database: $1\n");
765 while ( defined( $_ = $self->_readline ) ) {
767 /^\s
+(\
-?
[\d\
,]+|\S
+)\s
+sequences\
;
768 \s
+(\
-?
[\d
,]+|\S
+)\s
+ # Deal with NCBI 2.2.8 OSX problems
772 my ( $s, $l ) = ( $1, $2 );
777 'Name' => 'BlastOutput_db-len',
783 'Name' => 'BlastOutput_db-let',
787 # cache for next round in cases with multiple queries
788 $self->{'_blsdb'} = $db;
789 $self->{'_blsdb_length'} = $s;
790 $self->{'_blsdb_letters'} = $l;
800 'Name' => 'BlastOutput_db',
806 # move inside of a hit
807 elsif (/^(?:Subject=|>)\s*(\S+)\s*(.*)?/) {
810 $self->debug("blast.pm: Hit: $1\n");
811 $self->in_element('hsp')
812 && $self->end_element( { 'Name' => 'Hsp' } );
813 $self->in_element('hit')
814 && $self->end_element( { 'Name' => 'Hit' } );
816 # special case when bl2seq reports don't have a leading
818 if ( !$self->within_element('result') ) {
819 $self->_start_blastoutput;
820 $self->start_element( { 'Name' => 'Iteration' } );
822 elsif ( !$self->within_element('iteration') ) {
823 $self->start_element( { 'Name' => 'Iteration' } );
825 $self->start_element( { 'Name' => 'Hit' } );
829 $self->debug("Starting a hit: $1 $2\n");
836 my ($gi, $acc, $version ) = $self->_get_seq_identifiers($id);
839 'Name' => 'Hit_accession',
843 # add hit significance (from the hit table)
844 # this is where Bug 1986 went awry
846 # Changed for Bug2409; hit->significance and hit->score/bits derived
847 # from HSPs, not hit table unless necessary
850 while (my $v = shift @hit_signifs) {
851 my $tableid = $v->[2];
852 if ($tableid !~ m{\Q$id\E}) {
853 $self->debug("Hit table ID $tableid doesn't match current hit id $id, checking next hit table entry...\n");
859 while ( defined( $_ = $self->_readline() ) ) {
862 if (/Length\s*=\s*([\d,]+)/) {
874 if ($restofline !~ /\s$/) { # bug #3235
875 s/^\s(?!\s)/\x01/; #new line to concatenate desc lines with <soh>
877 $restofline .= ($restofline =~ /\w$/ && $_ =~ /^\w/) ?
" $_" : $_;
878 $restofline =~ s/\s+/ /g; # this catches the newline as well
879 $restofline =~ s/^ | $//g;
882 $restofline =~ s/\s+/ /g;
886 'Data' => $restofline
890 elsif (/\s+(Plus|Minus) Strand HSPs:/i) {
894 ( $self->in_element('hit') || $self->in_element('hsp') )
895 && # paracel genewise BTK
896 m
/Score\s
*=\s
*(\S
+)\s
*bits\s
* # Bit score
897 (?
:\
((\d
+)\
))?
, # Raw score
898 \s
+Log\
-Length\sScore\s
*=\s
*(\d
+) # Log-Length score
902 $self->in_element('hsp')
903 && $self->end_element( { 'Name' => 'Hsp' } );
904 $self->start_element( { 'Name' => 'Hsp' } );
906 $self->debug( "Got paracel genewise HSP score=$1\n");
908 # Some data clean-up so e-value will appear numeric to perl
909 my ( $bits, $score, $evalue ) = ( $1, $2, $3 );
910 $evalue =~ s/^e/1e/i;
913 'Name' => 'Hsp_score',
919 'Name' => 'Hsp_bit-score',
925 'Name' => 'Hsp_evalue',
931 ( $self->in_element('hit') || $self->in_element('hsp') )
932 && # paracel hframe BTK
933 m
/Score\s
*=\s
*([^,\s
]+), # Raw score
934 \s
*Expect\s
*=\s
*([^,\s
]+), # E-value
935 \s
*P
(?
:\
(\S
+\
))?\s
*=\s
*([^,\s
]+) # P-value
939 $self->in_element('hsp')
940 && $self->end_element( { 'Name' => 'Hsp' } );
941 $self->start_element( { 'Name' => 'Hsp' } );
943 $self->debug( "Got paracel hframe HSP score=$1\n");
945 # Some data clean-up so e-value will appear numeric to perl
946 my ( $score, $evalue, $pvalue ) = ( $1, $2, $3 );
947 $evalue = "1$evalue" if $evalue =~ /^e/;
948 $pvalue = "1$pvalue" if $pvalue =~ /^e/;
952 'Name' => 'Hsp_score',
958 'Name' => 'Hsp_evalue',
964 'Name' => 'Hsp_pvalue',
970 ( $self->in_element('hit') || $self->in_element('hsp') )
972 m
/Score\s
*=\s
*(\S
+)\s
* # Bit score
973 \
(([\d\
.]+)\s
*bits\
), # Raw score
974 \s
*Expect\s
*=\s
*([^,\s
]+), # E-value
976 P
(?
:\
(\d
+\
))?\s
*=\s
*([^,\s
]+) # P-value
977 (?
:\s
*,\s
+Group\s
*\
=\s
*(\d
+))?
# HSP Group
980 { # wu-blast HSP parse
981 $self->in_element('hsp')
982 && $self->end_element( { 'Name' => 'Hsp' } );
983 $self->start_element( { 'Name' => 'Hsp' } );
985 # Some data clean-up so e-value will appear numeric to perl
986 my ( $score, $bits, $evalue, $pvalue, $group ) =
987 ( $1, $2, $3, $4, $5 );
988 $evalue =~ s/^e/1e/i;
989 $pvalue =~ s/^e/1e/i;
993 'Name' => 'Hsp_score',
999 'Name' => 'Hsp_bit-score',
1005 'Name' => 'Hsp_evalue',
1011 'Name' => 'Hsp_pvalue',
1016 if ( defined $group ) {
1019 'Name' => 'Hsp_group',
1027 ( $self->in_element('hit') || $self->in_element('hsp') )
1028 && # ncbi blast, works with 2.2.17
1029 m/^\sFeatures\s\w+\sthis\spart/xmso
1030 # If the line begins with "Features in/flanking this part of subject sequence:"
1033 $self->in_element('hsp')
1034 && $self->end_element( { 'Name' => 'Hsp' } );
1036 $_ = $self->_readline;
1037 while($_ !~ /^\s*$/) {
1040 $_ = $self->_readline;
1042 $self->_pushback($_);
1043 $featline =~ s{(?:^\s+|\s+^)}{}g;
1044 $featline =~ s{\n}{;}g;
1045 $self->start_element( { 'Name' => 'Hsp' } );
1048 'Name' => 'Hsp_features',
1052 $self->{'_seen_hsp_features'} = 1;
1055 ( $self->in_element('hit') || $self->in_element('hsp') )
1056 && # ncbi blast, works with 2.2.17
1057 m
/Score\s
*=\s
*(\S
+)\s
*bits\s
* # Bit score
1058 (?
:\
((\d
+)\
))?
, # Missing for BLAT pseudo-BLAST fmt
1059 \s
*Expect
(?
:\
((\d
+\
+?
)\
))?\s
*=\s
*([^,\s
]+) # E-value
1062 { # parse NCBI blast HSP
1063 if( !$self->{'_seen_hsp_features'} ) {
1064 $self->in_element('hsp')
1065 && $self->end_element( { 'Name' => 'Hsp' } );
1066 $self->start_element( { 'Name' => 'Hsp' } );
1069 # Some data clean-up so e-value will appear numeric to perl
1070 my ( $bits, $score, $n, $evalue ) = ( $1, $2, $3, $4 );
1071 $evalue =~ s/^e/1e/i;
1074 'Name' => 'Hsp_score',
1080 'Name' => 'Hsp_bit-score',
1086 'Name' => 'Hsp_evalue',
1096 $score = '' unless defined $score; # deal with BLAT which
1097 # has no score only bits
1098 $self->debug("Got NCBI HSP score=$score, evalue $evalue\n");
1101 $self->in_element('hsp')
1102 && m/Identities\s*=\s*(\d+)\s*\/\s
*(\d
+)\s
*[\d\
%\
(\
)]+\s
*
1103 (?
:,\s
*Positives\s
*=\s
*(\d
+)\
/(\d
+)\s
*[\d\
%\
(\
)]+\s
*)?
# pos only valid for Protein alignments
1104 (?
:\
,\s
*Gaps\s
*=\s
*(\d
+)\
/(\d
+))?
# Gaps
1110 'Name' => 'Hsp_identity',
1116 'Name' => 'Hsp_align-len',
1123 'Name' => 'Hsp_positive',
1131 'Name' => 'Hsp_positive',
1139 'Name' => 'Hsp_gaps',
1145 $self->{'_Query'} = { 'begin' => 0, 'end' => 0 };
1146 $self->{'_Sbjct'} = { 'begin' => 0, 'end' => 0 };
1148 if (/(Frame\s*=\s*.+)$/) {
1150 # handle wu-blast Frame listing on same line
1151 $self->_pushback($1);
1154 elsif ( $self->in_element('hsp')
1155 && /Strand\s*=\s*(Plus|Minus)\s*\/\s
*(Plus
|Minus
)/i
)
1158 # consume this event ( we infer strand from start/end)
1159 if (!defined($reporttype)) {
1160 $self->{'_reporttype'} = $reporttype = 'BLASTN';
1161 $bl2seq_fix = 1; # special case to resubmit the algorithm
1166 elsif ( $self->in_element('hsp')
1167 && /Links\s*=\s*(\S+)/ox )
1171 'Name' => 'Hsp_links',
1176 elsif ( $self->in_element('hsp')
1177 && /Frame\s*=\s*([\+\-][1-3])\s*(\/\s
*([\
+\
-][1-3]))?
/ )
1179 my $frame1 = $1 || 0;
1180 my $frame2 = $2 || 0;
1181 # this is for bl2seq only
1182 if ( not defined $reporttype ) {
1184 if ( $frame1 && $frame2 ) {
1185 $reporttype = 'TBLASTX'
1188 # We can distinguish between BLASTX and TBLASTN from the report
1189 # (and assign $frame1 properly) by using the start/end from query.
1190 # If the report is BLASTX, the coordinates distance from query
1191 # will be 3 times the length of the alignment shown (coordinates in nt,
1192 # alignment in aa); if not then subject is the nucleotide sequence (TBLASTN).
1193 # Will have to fast-forward to query alignment line and then go back.
1194 my $fh = $self->_fh;
1195 my $file_pos = tell $fh;
1197 my $a_position = '';
1198 my $ali_length = '';
1199 my $b_position = '';
1200 while (my $line = <$fh>) {
1201 if ($line =~ m/^(?:Query|Sbjct):?\s+(\-?\d+)?\s*(\S+)\s+(\-?\d+)?/) {
1206 use Bio
::LocatableSeq
;
1207 my $gap_symbols = $Bio::LocatableSeq
::GAP_SYMBOLS
;
1208 $alignment =~ s/[$gap_symbols]//g;
1209 $ali_length = length($alignment);
1213 my $coord_length = ($a_position < $b_position) ?
($b_position - $a_position + 1)
1214 : ($a_position - $b_position + 1);
1215 ($coord_length == ($ali_length * 3)) ?
($reporttype = 'BLASTX') : ($reporttype = 'TBLASTN');
1217 # Rewind filehandle to its original position to continue parsing
1218 seek $fh, $file_pos, 0;
1220 $self->{'_reporttype'} = $reporttype;
1223 my ( $queryframe, $hitframe );
1224 if ( $reporttype eq 'TBLASTX' ) {
1225 ( $queryframe, $hitframe ) = ( $frame1, $frame2 );
1226 $hitframe =~ s/\/\s*//g
;
1228 elsif ( $reporttype eq 'TBLASTN' || $reporttype eq 'PSITBLASTN') {
1229 ( $hitframe, $queryframe ) = ( $frame1, 0 );
1231 elsif ( $reporttype eq 'BLASTX' || $reporttype eq 'RPS-BLAST(BLASTP)') {
1232 ( $queryframe, $hitframe ) = ( $frame1, 0 );
1233 # though NCBI doesn't report it, this is a special BLASTX-like
1234 # RPS-BLAST; should be handled differently
1235 if ($reporttype eq 'RPS-BLAST(BLASTP)') {
1238 'Name' => 'BlastOutput_program',
1239 'Data' => 'RPS-BLAST(BLASTX)'
1246 'Name' => 'Hsp_query-frame',
1247 'Data' => $queryframe
1253 'Name' => 'Hsp_hit-frame',
1258 elsif (/^Parameters:/
1259 || /^\s+Database:\s+?/
1263 || ( $self->in_element('hsp') && /WARNING|NOTE/ ) )
1266 # Note: Lambda check was necessary to parse
1267 # t/data/ecoli_domains.rpsblast AND to parse bl2seq
1268 $self->debug("blast.pm: found parameters section \n");
1270 $self->in_element('hsp')
1271 && $self->end_element( { 'Name' => 'Hsp' } );
1272 $self->in_element('hit')
1273 && $self->end_element( { 'Name' => 'Hit' } );
1275 # This is for the case when we specify -b 0 (or B=0 for WU-BLAST)
1276 # and still want to construct minimal Hit objects
1277 $self->_cleanup_hits(\
@hit_signifs) if scalar(@hit_signifs);
1278 $self->within_element('iteration')
1279 && $self->end_element( { 'Name' => 'Iteration' } );
1281 next if /^\s+Subset/;
1282 my $blast = (/^(\s+Database\:)|(\s*Lambda)/) ?
'ncbi' : 'wublast';
1283 if (/^\s*Histogram/) {
1289 # default is that gaps are allowed
1292 'Name' => 'Parameters_allowgaps',
1296 while ( defined( $_ = $self->_readline ) ) {
1297 # If Lambda/Kappa/Entropy numbers appear first at this point,
1298 # pushback and add the header line to process it correctly
1299 if (/^\s+[\d+\.]+\s+[\d+\.]+\s+[\d+\.]/ and $last eq '') {
1300 $self->_pushback($_);
1301 $self->_pushback("Lambda K H\n");
1305 /^((?:\S+)?BLAST[NPX]?)\s+(.+)$/i # NCBI BLAST, PSIBLAST
1306 # RPSBLAST, MEGABLAST
1307 || /^(P?GENEWISE|HFRAME|SWN|TSWN)\s+(.+)/i #Paracel BTK
1310 $self->_pushback($_);
1312 # let's handle this in the loop
1316 $self->_pushback($_);
1317 $self->_pushback($reportline) if $reportline;
1321 # here is where difference between wublast and ncbiblast
1322 # is better handled by different logic
1323 if ( /Number of Sequences:\s+([\d\,]+)/i
1324 || /of sequences in database:\s+(\-?[\d,]+)/i )
1330 'Name' => 'Statistics_db-len',
1335 elsif (/letters in database:\s+(\-?[\d,]+)/i) {
1340 'Name' => 'Statistics_db-let',
1345 elsif ( $blast eq 'btk' ) {
1348 elsif ( $blast eq 'wublast' ) {
1354 'Name' => 'Parameters_expect',
1362 'Name' => 'Parameters_allowgaps',
1367 elsif (/ctxfactor=(\S+)/) {
1370 'Name' => 'Statistics_ctxfactor',
1376 /(postsw|links|span[12]?|warnings|notes|gi|noseqs|qres|qype)/
1381 'Name' => "Parameters_$1",
1386 elsif (/(\S+)=(\S+)/) {
1389 'Name' => "Parameters_$1",
1394 elsif ( $last =~ /(Frame|Strand)\s+MatID\s+Matrix name/i ) {
1395 my $firstgapinfo = 1;
1397 while ( defined($_) && !/^\s+$/ ) {
1401 && s/Q=(\d+),R=(\d+)\s+//x )
1407 'Name' => 'Parameters_gap-open',
1413 'Name' => 'Parameters_gap-extend',
1425 next if $type eq 'n/a';
1427 warn "fields is empty for $type\n";
1433 "Statistics_frame$frame\_$type",
1434 'Data' => shift @fields
1440 my ( $frameo, $matid, $matrix, @fields ) =
1442 if ( !defined $frame ) {
1444 # keep some sort of default feature I guess
1445 # even though this is sort of wrong
1448 'Name' => 'Parameters_matrix',
1454 'Name' => 'Statistics_lambda',
1455 'Data' => $fields[0]
1460 'Name' => 'Statistics_kappa',
1461 'Data' => $fields[1]
1466 'Name' => 'Statistics_entropy',
1467 'Data' => $fields[2]
1482 my $f = $fields[$ii];
1483 next unless defined $f; # deal with n/a
1484 if ( $f eq 'same' ) {
1485 $f = $fields[ $ii - 3 ];
1491 "Statistics_frame$frame\_$type",
1500 $_ = $self->_readline;
1504 elsif ( $last =~ /(Frame|Strand)\s+MatID\s+Length/i ) {
1506 while ( defined($_) && !/^\s+/ ) {
1510 if ( @fields <= 3 ) {
1511 for my $type (qw(X_gapped E2_gapped S2)) {
1512 last unless @fields;
1516 "Statistics_frame$frame\_$type",
1517 'Data' => shift @fields
1533 "Statistics_frame$frame\_$type",
1534 'Data' => shift @fields
1539 $_ = $self->_readline;
1543 elsif (/(\S+\s+\S+)\s+DFA:\s+(\S+)\s+\((.+)\)/) {
1544 if ( $1 eq 'states in' ) {
1547 'Name' => 'Statistics_DFA_states',
1552 elsif ( $1 eq 'size of' ) {
1555 'Name' => 'Statistics_DFA_size',
1562 m
/^\s
+Time to generate neighborhood
:\s
+
1568 'Name' => 'Statistics_neighbortime',
1573 elsif (/processors\s+used:\s+(\d+)/) {
1576 'Name' => 'Statistics_noprocessors',
1582 m
/^\s
+(\S
+)\s
+cpu\s
+time:\s
+ # cputype
1583 (\S
+\s
+\S
+\s
+\S
+) # cputime
1584 \s
+Elapsed
:\s
+(\S
+)/x
1587 my $cputype = lc($1);
1590 'Name' => "Statistics_$cputype\_cputime",
1596 'Name' => "Statistics_$cputype\_actualtime",
1601 elsif (/^\s+Start:/) {
1602 my ( $junk, $start, $stime, $end, $etime ) =
1603 split( /\s+(Start|End)\:\s+/, $_ );
1607 'Name' => 'Statistics_starttime',
1614 'Name' => 'Statistics_endtime',
1619 elsif (/^\s+Database:\s+(.+)$/) {
1622 'Name' => 'Parameters_full_dbpath',
1628 elsif (/^\s+Posted:\s+(.+)/) {
1633 'Name' => 'Statistics_posted_date',
1639 elsif ( $blast eq 'ncbi' ) {
1641 if (m/^Matrix:\s+(.+)\s*$/oxi) {
1644 'Name' => 'Parameters_matrix',
1653 $_ = $self->_readline;
1655 my ( $lambda, $kappa, $entropy ) = split;
1656 if ($gapped_stats) {
1659 'Name' => "Statistics_gapped_lambda",
1665 'Name' => "Statistics_gapped_kappa",
1671 'Name' => "Statistics_gapped_entropy",
1679 'Name' => "Statistics_lambda",
1685 'Name' => "Statistics_kappa",
1691 'Name' => "Statistics_entropy",
1697 elsif (m/effective\s+search\s+space\s+used:\s+(\d+)/oxi) {
1700 'Name' => 'Statistics_eff-spaceused',
1705 elsif (m/effective\s+search\s+space:\s+(\d+)/oxi) {
1708 'Name' => 'Statistics_eff-space',
1714 m
/Gap\s
+Penalties
:\s
+Existence
:\s
+(\d
+)\
,
1715 \s
+Extension
:\s
+(\d
+)/ox
1720 'Name' => 'Parameters_gap-open',
1726 'Name' => 'Parameters_gap-extend',
1731 elsif (/effective\s+HSP\s+length:\s+(\d+)/) {
1734 'Name' => 'Statistics_hsp-len',
1739 elsif (/Number\s+of\s+HSP's\s+better\s+than\s+(\S+)\s+without\s+gapping:\s+(\d+)/) {
1742 'Name' => 'Statistics_number_of_hsps_better_than_expect_value_cutoff_without_gapping',
1747 elsif (/Number\s+of\s+HSP's\s+gapped:\s+(\d+)/) {
1750 'Name' => 'Statistics_number_of_hsps_gapped',
1755 elsif (/Number\s+of\s+HSP's\s+successfully\s+gapped:\s+(\d+)/) {
1758 'Name' => 'Statistics_number_of_hsps_successfully_gapped',
1763 elsif (/Length\s+adjustment:\s+(\d+)/) {
1766 'Name' => 'Statistics_length_adjustment',
1771 elsif (/effective\s+length\s+of\s+query:\s+([\d\,]+)/i) {
1776 'Name' => 'Statistics_query-len',
1781 elsif (/effective\s+length\s+of\s+database:\s+([\d\,]+)/i) {
1786 'Name' => 'Statistics_eff-dblen',
1792 /^(T|A|X1|X2|X3|S1|S2):\s+(\d+(\.\d+)?)\s+(?:\(\s*(\d+\.\d+) bits\))?/
1799 'Name' => "Statistics_$1",
1806 'Name' => "Statistics_$1_bits",
1813 m
/frameshift\s
+window\
,
1814 \s
+decay\s
+const
:\s
+(\d
+)\
,\s
+([\
.\d
]+)/x
1819 'Name' => 'Statistics_framewindow',
1825 'Name' => 'Statistics_decay',
1830 elsif (m/^Number\s+of\s+Hits\s+to\s+DB:\s+(\S+)/ox) {
1833 'Name' => 'Statistics_hit_to_db',
1838 elsif (m/^Number\s+of\s+extensions:\s+(\S+)/ox) {
1841 'Name' => 'Statistics_num_extensions',
1847 m
/^Number\s
+of\s
+successful\s
+extensions
:\s
+
1853 'Name' => 'Statistics_num_suc_extensions',
1859 m
/^Number\s
+of\s
+sequences\s
+better\s
+than\s
+
1865 'Name' => 'Parameters_expect',
1871 'Name' => 'Statistics_seqs_better_than_cutoff',
1876 elsif (/^\s+Posted\s+date:\s+(.+)/) {
1881 'Name' => 'Statistics_posted_date',
1886 elsif ( !/^\s+$/ ) {
1887 #$self->debug( "unmatched stat $_");
1892 } elsif ( $self->in_element('hsp') ) {
1893 $self->debug("blast.pm: Processing HSP\n");
1894 # let's read 3 lines at a time;
1895 # bl2seq hackiness... Not sure I like
1896 $self->{'_reporttype'} ||= $DEFAULTREPORTTYPE;
1903 for ( my $i = 0 ; defined($_) && $i < 3 ; $i++ ) {
1904 # $self->debug("$i: $_") if $v;
1905 if ( ( $i == 0 && /^\s+$/) ||
1906 /^\s*(?:Lambda|Minus|Plus|Score)/i
1908 $self->_pushback($_) if defined $_;
1909 $self->end_element( { 'Name' => 'Hsp' } );
1913 if (/^((Query|Sbjct):?\s+(\-?\d+)?\s*)(\S+)\s+(\-?\d+)?/) {
1914 my ( $full, $type, $start, $str, $end ) =
1915 ( $1, $2, $3, $4, $5 );
1917 if ( $str eq '-' ) {
1918 $i = 3 if $type eq 'Sbjct';
1921 $data{$type} = $str;
1923 $len = length($full);
1924 $self->{"\_$type"}->{'begin'} = $start
1925 unless $self->{"_$type"}->{'begin'};
1926 $self->{"\_$type"}->{'end'} = $end;
1927 } elsif (/^((Query|Sbjct):?\s+(\-?0+)\s*)/) {
1928 # Bug from NCBI's BLAST: unaligned output
1929 $_ = $self->_readline() for 0..1;
1932 $self->throw("no data for midline $_")
1933 unless ( defined $_ && defined $len );
1934 $data{'Mid'} = substr( $_, $len );
1936 $_ = $self->_readline();
1940 'Name' => 'Hsp_qseq',
1941 'Data' => $data{'Query'}
1946 'Name' => 'Hsp_hseq',
1947 'Data' => $data{'Sbjct'}
1952 'Name' => 'Hsp_midline',
1953 'Data' => $data{'Mid'}
1958 #$self->debug("blast.pm: unrecognized line $_");
1962 $self->debug("blast.pm: End of BlastOutput\n");
1963 if ( $self->{'_seentop'} ) {
1964 $self->within_element('hsp')
1965 && $self->end_element( { 'Name' => 'Hsp' } );
1966 $self->within_element('hit')
1967 && $self->end_element( { 'Name' => 'Hit' } );
1968 # cleanup extra hits
1969 $self->_cleanup_hits(\
@hit_signifs) if scalar(@hit_signifs);
1970 $self->within_element('iteration')
1971 && $self->end_element( { 'Name' => 'Iteration' } );
1975 'Name' => 'BlastOutput_program',
1976 'Data' => $reporttype
1980 $self->end_element( { 'Name' => 'BlastOutput' } );
1982 return $self->end_document();
1985 # Private method for internal use only.
1986 sub _start_blastoutput
{
1988 $self->start_element( { 'Name' => 'BlastOutput' } );
1989 $self->{'_seentop'} = 1;
1990 $self->{'_result_count'}++;
1991 $self->{'_handler_rc'} = undef;
1996 Title : _will_handle
1997 Usage : Private method. For internal use only.
1998 if( $self->_will_handle($type) ) { ... }
1999 Function: Provides an optimized way to check whether or not an element of a
2000 given type is to be handled.
2001 Returns : Reference to EventHandler object if the element type is to be handled.
2002 undef if the element type is not to be handled.
2003 Args : string containing type of element.
2011 Using the cached pointer to the EventHandler to minimize repeated
2016 Caching the will_handle status for each type that is encountered so
2017 that it only need be checked by calling
2018 handler-E<gt>will_handle($type) once.
2022 This does not lead to a major savings by itself (only 5-10%). In
2023 combination with other optimizations, or for large parse jobs, the
2024 savings good be significant.
2026 To test against the unoptimized version, remove the parentheses from
2027 around the third term in the ternary " ? : " operator and add two
2028 calls to $self-E<gt>_eventHandler().
2033 my ( $self, $type ) = @_;
2034 my $handler = $self->{'_handler_cache'};
2036 defined( $self->{'_will_handle_cache'}->{$type} )
2037 ?
$self->{'_will_handle_cache'}->{$type}
2038 : ( $self->{'_will_handle_cache'}->{$type} =
2039 $handler->will_handle($type) );
2041 return $will_handle ?
$handler : undef;
2044 =head2 start_element
2046 Title : start_element
2047 Usage : $eventgenerator->start_element
2048 Function: Handles a start element event
2050 Args : hashref with at least 2 keys 'Data' and 'Name'
2055 my ( $self, $data ) = @_;
2057 # we currently don't care about attributes
2058 my $nm = $data->{'Name'};
2059 my $type = $MODEMAP{$nm};
2061 my $handler = $self->_will_handle($type);
2063 my $func = sprintf( "start_%s", lc $type );
2064 $self->{'_handler_rc'} = $handler->$func( $data->{'Attributes'} );
2067 #$self->debug( # changed 4/29/2006 to play nice with other event handlers
2068 # "Bio::SearchIO::InternalParserError ".
2069 # "\nCan't handle elements of type \'$type.\'"
2072 unshift @
{ $self->{'_elements'} }, $type;
2073 if ( $type eq 'result' ) {
2074 $self->{'_values'} = {};
2075 $self->{'_result'} = undef;
2077 # cleanup some things
2078 if ( defined $self->{'_values'} ) {
2080 grep { /^\U$type\-/ }
2081 keys %{ $self->{'_values'} }
2084 delete $self->{'_values'}->{$k};
2094 Usage : $eventgenerator->end_element
2095 Function: Handles an end element event
2096 Returns : hashref with an element's worth of data
2097 Args : hashref with at least 2 keys 'Data' and 'Name'
2103 my ( $self, $data ) = @_;
2105 my $nm = $data->{'Name'};
2108 # cache these (TODO: we should probably cache all cross-report data)
2109 if ( $nm eq 'BlastOutput_program' ) {
2110 if ( $self->{'_last_data'} =~ /(t?blast[npx])/i ) {
2111 $self->{'_reporttype'} = uc $1;
2113 $self->{'_reporttype'} ||= $DEFAULTREPORTTYPE;
2116 if ( $nm eq 'BlastOutput_version' ) {
2117 $self->{'_reportversion'} = $self->{'_last_data'};
2120 # Hsps are sort of weird, in that they end when another
2121 # object begins so have to detect this in end_element for now
2122 if ( $nm eq 'Hsp' ) {
2123 foreach (qw(Hsp_qseq Hsp_midline Hsp_hseq Hsp_features)) {
2127 'Data' => $self->{'_last_hspdata'}->{$_}
2129 ) if defined $self->{'_last_hspdata'}->{$_};
2131 $self->{'_last_hspdata'} = {};
2134 'Name' => 'Hsp_query-from',
2135 'Data' => $self->{'_Query'}->{'begin'}
2140 'Name' => 'Hsp_query-to',
2141 'Data' => $self->{'_Query'}->{'end'}
2147 'Name' => 'Hsp_hit-from',
2148 'Data' => $self->{'_Sbjct'}->{'begin'}
2153 'Name' => 'Hsp_hit-to',
2154 'Data' => $self->{'_Sbjct'}->{'end'}
2158 # } elsif( $nm eq 'Iteration' ) {
2159 # Nothing special needs to be done here.
2161 if ( $type = $MODEMAP{$nm} ) {
2162 my $handler = $self->_will_handle($type);
2164 my $func = sprintf( "end_%s", lc $type );
2165 $rc = $handler->$func( $self->{'_reporttype'}, $self->{'_values'} );
2167 shift @
{ $self->{'_elements'} };
2170 elsif ( $MAPPING{$nm} ) {
2171 if ( ref( $MAPPING{$nm} ) =~ /hash/i ) {
2173 # this is where we shove in the data from the
2174 # hashref info about params or statistics
2175 my $key = ( keys %{ $MAPPING{$nm} } )[0];
2176 $self->{'_values'}->{$key}->{ $MAPPING{$nm}->{$key} } =
2177 $self->{'_last_data'};
2180 $self->{'_values'}->{ $MAPPING{$nm} } = $self->{'_last_data'};
2184 #$self->debug("blast.pm: unknown nm $nm, ignoring\n");
2186 $self->{'_last_data'} = ''; # remove read data if we are at
2188 $self->{'_result'} = $rc if ( defined $type && $type eq 'result' );
2189 $self->{'_seen_hsp_features'} = 0;
2196 Usage : $eventhandler->element({'Name' => $name, 'Data' => $str});
2197 Function: Convenience method that calls start_element, characters, end_element
2199 Args : Hash ref with the keys 'Name' and 'Data'
2205 my ( $self, $data ) = @_;
2206 # Note that start element isn't needed for character data
2207 # Not too SAX-y, though
2208 #$self->start_element($data);
2209 $self->characters($data);
2210 $self->end_element($data);
2216 Usage : $eventgenerator->characters($str)
2217 Function: Send a character events
2225 my ( $self, $data ) = @_;
2226 if ( $self->in_element('hsp')
2227 && $data->{'Name'} =~ /^Hsp\_(qseq|hseq|midline)$/ )
2229 $self->{'_last_hspdata'}->{ $data->{'Name'} } .= $data->{'Data'}
2230 if defined $data->{'Data'};
2232 return unless ( defined $data->{'Data'} && $data->{'Data'} !~ /^\s+$/ );
2233 $self->{'_last_data'} = $data->{'Data'};
2236 =head2 within_element
2238 Title : within_element
2239 Usage : if( $eventgenerator->within_element($element) ) {}
2240 Function: Test if we are within a particular element
2241 This is different than 'in' because within can be tested
2244 Args : string element name
2246 See Also: L<in_element>
2250 sub within_element
{
2251 my ( $self, $name ) = @_;
2253 if ( !defined $name && !defined $self->{'_elements'}
2254 || scalar @
{ $self->{'_elements'} } == 0 );
2255 foreach ( @
{ $self->{'_elements'} } ) {
2256 if ( $_ eq $name ) {
2266 Usage : if( $eventgenerator->in_element($element) ) {}
2267 Function: Test if we are in a particular element
2268 This is different than 'within_element' because within
2269 can be tested for a whole block.
2271 Args : string element name
2273 See Also: L<within_element>
2278 my ( $self, $name ) = @_;
2279 return 0 if !defined $self->{'_elements'}->[0];
2280 return ( $self->{'_elements'}->[0] eq $name );
2283 =head2 start_document
2285 Title : start_document
2286 Usage : $eventgenerator->start_document
2287 Function: Handle a start document event
2294 sub start_document
{
2296 $self->{'_lasttype'} = '';
2297 $self->{'_values'} = {};
2298 $self->{'_result'} = undef;
2299 $self->{'_elements'} = [];
2304 Title : end_document
2305 Usage : $eventgenerator->end_document
2306 Function: Handles an end document event
2307 Returns : Bio::Search::Result::ResultI object
2314 my ( $self, @args ) = @_;
2316 #$self->debug("blast.pm: end_document\n");
2317 return $self->{'_result'};
2321 my ( $self, $blast, @args ) = @_;
2323 if ( not defined( $self->writer ) ) {
2324 $self->warn("Writer not defined. Using a $DEFAULT_BLAST_WRITER_CLASS");
2325 $self->writer( $DEFAULT_BLAST_WRITER_CLASS->new() );
2327 $self->SUPER::write_result
( $blast, @args );
2332 return $self->{'_result_count'};
2335 sub report_count
{ shift->result_count }
2337 =head2 inclusion_threshold
2339 Title : inclusion_threshold
2340 Usage : my $incl_thresh = $isreb->inclusion_threshold;
2341 : $isreb->inclusion_threshold(1e-5);
2342 Function: Get/Set the e-value threshold for inclusion in the PSI-BLAST
2343 score matrix model (blastpgp) that was used for generating the reports
2345 Returns : number (real)
2346 Default value: $Bio::SearchIO::IteratedSearchResultEventBuilder::DEFAULT_INCLUSION_THRESHOLD
2347 Args : number (real) (e.g., 0.0001 or 1e-4 )
2351 =head2 max_significance
2353 Usage : $obj->max_significance();
2354 Purpose : Set/Get the P or Expect value used as significance screening cutoff.
2355 This is the value of the -signif parameter supplied to new().
2356 Hits with P or E-value above this are skipped.
2357 Returns : Scientific notation number with this format: 1.0e-05.
2358 Argument : Scientific notation number or float (when setting)
2359 Comments : Screening of significant hits uses the data provided on the
2360 : description line. For NCBI BLAST1 and WU-BLAST, this data
2361 : is P-value. for NCBI BLAST2 it is an Expect value.
2367 Synonym for L<max_significance()|max_significance>
2373 Usage : $obj->min_score();
2374 Purpose : Set/Get the Blast score used as screening cutoff.
2375 This is the value of the -score parameter supplied to new().
2376 Hits with scores below this are skipped.
2377 Returns : Integer or scientific notation number.
2378 Argument : Integer or scientific notation number (when setting)
2379 Comments : Screening of significant hits uses the data provided on the
2384 =head2 min_query_length
2386 Usage : $obj->min_query_length();
2387 Purpose : Gets the query sequence length used as screening criteria.
2388 This is the value of the -min_query_len parameter supplied to new().
2389 Hits with sequence length below this are skipped.
2395 =head2 best_hit_only
2397 Title : best_hit_only
2398 Usage : print "only getting best hit.\n" if $obj->best_hit_only;
2399 Purpose : Set/Get the indicator for whether or not to process only
2400 : the best BlastHit.
2401 Returns : Boolean (1 | 0)
2402 Argument : Boolean (1 | 0) (when setting)
2406 =head2 check_all_hits
2408 Title : check_all_hits
2409 Usage : print "checking all hits.\n" if $obj->check_all_hits;
2410 Purpose : Set/Get the indicator for whether or not to process all hits.
2411 : If false, the parser will stop processing hits after the
2412 : the first non-significance hit or the first hit that fails
2414 Returns : Boolean (1 | 0)
2415 Argument : Boolean (1 | 0) (when setting)
2419 # general private method used to make minimal hits from leftover
2420 # data in the hit table
2423 my ($self, $hits) = @_;
2424 while ( my $v = shift @
{ $hits }) {
2425 next unless defined $v;
2426 $self->start_element( { 'Name' => 'Hit' } );
2435 my ($gi, $acc, $version ) = $self->_get_seq_identifiers($id);
2438 'Name' => 'Hit_accession',
2445 'Name' => 'Hit_signif',
2449 if (exists $self->{'_wublast'}) {
2452 'Name' => 'Hit_score',
2459 'Name' => 'Hit_bits',
2468 'Name' => 'Hit_def',
2472 $self->end_element( { 'Name' => 'Hit' } );
2484 The following information is added
in hopes of increasing the
2485 maintainability of this code
. It runs the risk of becoming obsolete as
2486 the code gets updated
. As always
, double check against the actual
2487 source
. If you find any discrepencies
, please correct them
.
2488 [ This documentation added on
3 Jun
2003. ]
2490 The logic is the brainchild of Jason Stajich
, documented by Steve
2491 Chervitz
. Jason
: please check it over
and modify as you see fit
.
2494 Elmo wants to know
: How does this module unmarshall data from the input stream?
2495 (i
.e
., how does information from a raw input file get added to
2496 the correct Bioperl object?
)
2500 This answer is specific to SearchIO
::blast
, but may apply to other
2501 SearchIO
.pm subclasses as well
. The following description gives the
2502 basic idea
. The actual processing is a little more complex
for
2503 certain types of data
(HSP
, Report Parameters
).
2505 You can think of blast
::next_result
() as faking a SAX XML parser
,
2506 making a non
-XML document behave like its XML
. The overhead to
do this
2507 is quite substantial
(~650 lines of code instead of
~80 in
2510 0. First
, add a key
=> value pair
for the datum of interest to
%MAPPING
2512 'Foo_bar' => 'Foo-bar',
2514 1. next_result
() collects the datum of interest from the input stream
,
2515 and calls element
().
2517 $self->element({ 'Name' => 'Foo_bar',
2518 'Data' => $foobar});
2520 2. The element
() method is a convenience method that calls start_element
(),
2521 characters
(), and end_element
().
2523 3. start_element
() checks to see
if the event handler can handle a start_xxx
(),
2524 where xxx
= the
'Name' parameter passed into element
(), and calls start_xxx
()
2525 if so
. Otherwise
, start_element
() does
not do anything
.
2527 Data that will have such an event handler are
defined in %MODEMAP.
2528 Typically
, there are only handler methods
for the main parts of
2529 the search result
(e
.g
., Result
, Iteration
, Hit
, HSP
),
2530 which have corresponding Bioperl modules
. So
in this example
,
2531 there was an earlier call such as
$self->element({'Name'=>'Foo'})
2532 and the Foo_bar datum is meant to ultimately go into a Foo object
.
2534 The start_foo
() method
in the handler will typically
do any
2535 data initialization necessary to prepare
for creating a new Foo object
.
2536 Example
: SearchResultEventBuilder
::start_result
()
2538 4. characters
() takes the value of the
'Data' key from the hashref argument
in
2539 the elements
() call
and saves it
in a
local data member
:
2541 $self->{'_last_data'} = $data->{'Data'};
2543 5. end_element
() is like start_element
() in that it does the check
for whether
2544 the event handler can handle end_xxx
() and if so
, calls it
, passing
in
2545 the data collected from all of the characters
() calls that occurred
2546 since the start_xxx
() call
.
2548 If there isn
't any special handler for the data type specified by 'Name
',
2549 end_element() will place the data saved by characters() into another
2550 local data member that saves it in a hash with a key defined by %MAPPING.
2552 $nm = $data->{'Name
'};
2553 $self->{'_values
'}->{$MAPPING{$nm}} = $self->{'_last_data
'};
2555 In this case, $MAPPING{$nm} is 'Foo
-bar
'.
2557 end_element() finishes by resetting the local data member used by
2558 characters(). (i.e., $self->{'_last_data
'} = '';)
2560 6. When the next_result() method encounters the end of the Foo element in the
2561 input stream. It will invoke $self->end_element({'Name
'=>'Foo
'}).
2562 end_element() then sends all of the data in the $self->{'_values
'} hash.
2563 Note that $self->{'_values
'} is cleaned out during start_element(),
2564 keeping it at a resonable size.
2566 In the event handler, the end_foo() method takes the hash from end_element()
2567 and creates a new hash containing the same data, but having keys lacking
2568 the 'Foo
' prefix (e.g., 'Foo
-bar
' becomes '-bar
'). The handler's end_foo
()
2569 method then creates the Foo object
, passing
in this new hash as an argument
.
2570 Example
: SearchResultEventBuilder
::end_result
()
2572 7. Objects created from the data
in the search result are managed by
2573 the event handler which adds them to a ResultI object
(using API methods
2574 for that object
). The ResultI object gets passed back to
2575 SearchIO
::end_element
() when it calls end_result
().
2577 The ResultI object is then saved
in an internal data member of the
2578 SearchIO object
, which returns it at the end of next_result
()
2579 by calling end_document
().
2581 (Technical Note
: All objects created by end_xxx
() methods
in the event
2582 handler are returned to SearchIO
::end_element
(), but the SearchIO object
2583 only cares about the ResultI objects
.)
2585 (Sesame Street aficionados note
: This answer was NOT
given by Mr
. Noodle
;-P
)