Bio::DB::Universal: move into its own distribution
[bioperl-live.git] / Bio / Tools / SeqPattern.pm
blob3e130464dcffcecec9b39fcfe94b38435d0a2307
2 # bioperl module for Bio::Tools::SeqPattern
4 # Please direct questions and support issues to <bioperl-l@bioperl.org>
6 # Cared for by Steve Chervitz (sac-at-bioperl.org)
8 # Copyright Steve Chervitz
10 # You may distribute this module under the same terms as perl itself
12 # POD documentation - main docs before the code
14 =head1 NAME
16 Bio::Tools::SeqPattern - represent a sequence pattern or motif
18 =head1 SYNOPSIS
20 use Bio::Tools::SeqPattern;
22 my $pat1 = 'T[GA]AA...TAAT';
23 my $pattern1 = Bio::Tools::SeqPattern->new(-SEQ =>$pat1, -TYPE =>'Dna');
25 my $pat2 = '[VILM]R(GXX){3,2}...[^PG]';
26 my $pattern2 = Bio::Tools::SeqPattern->new(-SEQ =>$pat2, -TYPE =>'Amino');
28 =head1 DESCRIPTION
30 L<Bio::Tools::SeqPattern> module encapsulates generic data and
31 methods for manipulating regular expressions describing nucleic or
32 amino acid sequence patterns (a.k.a, "motifs"), such as the ones produced by
33 L<Bio::Tools::IUPAC>.
35 L<Bio::Tools::SeqPattern> is a concrete class that inherits from L<Bio::Seq>.
37 This class grew out of a need to have a standard module for doing routine
38 tasks with sequence patterns such as:
40 -- Forming a reverse-complement version of a nucleotide sequence pattern
41 -- Expanding patterns containing ambiguity codes
42 -- Checking for invalid regexp characters
43 -- Untainting yet preserving special characters in the pattern
45 Other features to look for in the future:
47 -- Full pattern syntax checking
48 -- Conversion between expanded and condensed forms of the pattern
50 =head1 MOTIVATIONS
52 A key motivation for L<Bio::Tools::SeqPattern> is to have a way to
53 generate a reverse complement of a nucleotide sequence pattern.
54 This makes possible simultaneous pattern matching on both sense and
55 anti-sense strands of a query sequence.
57 In principle, one could do such a search more inefficiently by testing
58 against both sense and anti-sense versions of a sequence.
59 It is entirely equivalent to test a regexp containing both sense and
60 anti-sense versions of the *pattern* against one copy of the sequence.
61 The latter approach is much more efficient since:
63 1) You need only one copy of the sequence.
64 2) Only one regexp is executed.
65 3) Regexp patterns are typically much smaller than sequences.
67 Patterns can be quite complex and it is often difficult to
68 generate the reverse complement pattern. The Bioperl SeqPattern.pm
69 addresses this problem, providing a convenient set of tools
70 for working with biological sequence regular expressions.
72 Not all patterns have been tested. If you discover a pattern that
73 is not handled properly by Bio::Tools::SeqPattern.pm, please
74 send me some email (sac@bioperl.org). Thanks.
76 =head1 OTHER FEATURES
78 =head2 Extended Alphabet Support
80 This module supports the same set of ambiguity codes for nucleotide
81 sequences as supported by L<Bio::Seq>. These ambiguity codes
82 define the behavior or the L<expand> method.
84 ------------------------------------------
85 Symbol Meaning Nucleic Acid
86 ------------------------------------------
87 A A (A)denine
88 C C (C)ytosine
89 G G (G)uanine
90 T T (T)hymine
91 U U (U)racil
92 M A or C a(M)ino group
93 R A or G pu(R)ine
94 W A or T (W)eak bond
95 S C or G (S)trong bond
96 Y C or T p(Y)rimidine
97 K G or T (K)eto group
98 V A or C or G
99 H A or C or T
100 D A or G or T
101 B C or G or T
102 X G or A or T or C
103 N G or A or T or C
104 . G or A or T or C
108 ------------------------------------------
109 Symbol Meaning
110 ------------------------------------------
111 A Alanine
112 C Cysteine
113 D Aspartic Acid
114 E Glutamic Acid
115 F Phenylalanine
116 G Glycine
117 H Histidine
118 I Isoleucine
119 K Lysine
120 L Leucine
121 M Methionine
122 N Asparagine
123 P Proline
124 Q Glutamine
125 R Arginine
126 S Serine
127 T Threonine
128 V Valine
129 W Tryptophan
130 Y Tyrosine
132 B Aspartic Acid, Asparagine
133 Z Glutamic Acid, Glutamine
134 X Any amino acid
135 . Any amino acid
138 =head2 Multiple Format Support
140 Ultimately, this module should be able to build SeqPattern.pm objects
141 using a variety of pattern formats such as ProSite, Blocks, Prints, GCG, etc.
142 Currently, this module only supports patterns using a grep-like syntax.
144 =head1 USAGE
146 A simple demo script called seq_pattern.pl is included in the examples/
147 directory of the central Bioperl distribution.
149 =head1 SEE ALSO
151 L<Bio::Seq> - Lightweight sequence object.
153 L<Bio::Tools::IUPAC> - The IUPAC code for degenerate residues and their
154 conversion to a regular expression.
156 =head1 FEEDBACK
158 =head2 Mailing Lists
160 User feedback is an integral part of the evolution of this and other
161 Bioperl modules. Send your comments and suggestions preferably to one
162 of the Bioperl mailing lists. Your participation is much appreciated.
164 bioperl-l@bioperl.org - General discussion
165 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
167 =head2 Support
169 Please direct usage questions or support issues to the mailing list:
171 I<bioperl-l@bioperl.org>
173 rather than to the module maintainer directly. Many experienced and
174 reponsive experts will be able look at the problem and quickly
175 address it. Please include a thorough description of the problem
176 with code and data examples if at all possible.
178 =head2 Reporting Bugs
180 Report bugs to the Bioperl bug tracking system to help us keep track
181 the bugs and their resolution. Bug reports can be submitted via the
182 web:
184 https://github.com/bioperl/bioperl-live/issues
186 =head1 AUTHOR
188 Steve Chervitz, sac-at-bioperl.org
190 =head1 COPYRIGHT
192 Copyright (c) 1997-8 Steve Chervitz. All Rights Reserved.
193 This module is free software; you can redistribute it and/or
194 modify it under the same terms as Perl itself.
196 =cut
201 #### END of main POD documentation.
205 # CREATED : 28 Aug 1997
208 package Bio::Tools::SeqPattern;
210 use base qw(Bio::Root::Root);
211 use strict;
212 use vars qw ($ID);
213 $ID = 'Bio::Tools::SeqPattern';
215 ## These constants may be more appropriate in a Bio::Dictionary.pm
216 ## type of class.
217 my $PURINES = 'AG';
218 my $PYRIMIDINES = 'CT';
219 my $BEE = 'DN';
220 my $ZED = 'EQ';
221 my $Regexp_chars = '\w,.\*()\[\]<>\{\}^\$'; # quoted for use in regexps
223 ## Package variables used in reverse complementing.
224 my (%Processed_braces, %Processed_asterics);
226 #####################################################################################
227 ## CONSTRUCTOR ##
228 #####################################################################################
230 =head1 new
232 Title : new
233 Usage : my $seqpat = Bio::Tools::SeqPattern->new();
234 Purpose : Verifies that the type is correct for superclass (Bio::Seq.pm)
235 : and calls superclass constructor last.
236 Returns : n/a
237 Argument : Parameters passed to new()
238 Throws : Exception if the pattern string (seq) is empty.
239 Comments : The process of creating a new SeqPattern.pm object
240 : ensures that the pattern string is untained.
242 See Also : L<Bio::Root::Root::new>,
243 L<Bio::Seq::_initialize>
245 =cut
247 #----------------
248 sub new {
249 #----------------
250 my($class, %param) = @_;
252 my $self = $class->SUPER::new(%param);
253 my ($seq,$type) = $self->_rearrange([qw(SEQ TYPE)], %param);
255 $seq || $self->throw("Empty pattern.");
256 my $t;
257 # Get the type ready for Bio::Seq.pm
258 if ($type =~ /nuc|[dr]na/i) {
259 $t = 'Dna';
260 } elsif ($type =~ /amino|pep|prot/i) {
261 $t = 'Amino';
263 $seq =~ tr/a-z/A-Z/; #ps 8/8/00 Canonicalize to upper case
264 $self->str($seq);
265 $self->type($t);
267 return $self;
271 =head1 alphabet_ok
273 Title : alphabet_ok
274 Usage : $mypat->alphabet_ok;
275 Purpose : Checks for invalid regexp characters.
276 : Overrides Bio::Seq::alphabet_ok() to allow
277 : additional regexp characters ,.*()[]<>{}^$
278 : in addition to the standard genetic alphabet.
279 : Also untaints the pattern and sets the sequence
280 : object's sequence to the untained string.
281 Returns : Boolean (1 | 0)
282 Argument : n/a
283 Throws : Exception if the pattern contains invalid characters.
284 Comments : Does not call the superclass method.
285 : Actually permits any alphanumeric, not just the
286 : standard genetic alphabet.
288 =cut
290 #----------------'
291 sub alphabet_ok {
292 #----------------
293 my( $self) = @_;
295 return 1 if $self->{'_alphabet_checked'};
297 $self->{'_alphabet_checked'} = 1;
299 my $pat = $self->seq();
301 if($pat =~ /[^$Regexp_chars]/io) {
302 $self->throw("Pattern contains invalid characters: $pat",
303 'Legal characters: a-z,A-Z,0-9,,.*()[]<>{}^$ ');
306 # Untaint pattern (makes code taint-safe).
307 $pat =~ /([$Regexp_chars]+)/io;
308 $self->setseq(uc($1));
309 # print STDERR "\npattern ok: $pat\n";
313 =head1 expand
315 Title : expand
316 Usage : $seqpat_object->expand();
317 Purpose : Expands the sequence pattern using special ambiguity codes.
318 Example : $pat = $seq_pat->expand();
319 Returns : String containing fully expanded sequence pattern
320 Argument : n/a
321 Throws : Exception if sequence type is not recognized
322 : (i.e., is not one of [DR]NA, Amino)
324 See Also : L<Extended Alphabet Support>, L<_expand_pep>(), L<_expand_nuc>()
326 =cut
328 #----------
329 sub expand {
330 #----------
331 my $self = shift;
333 if($self->type =~ /[DR]na/i) { $self->_expand_nuc(); }
334 elsif($self->type =~ /Amino/i) { $self->_expand_pep(); }
335 else{
336 $self->throw("Don't know how to expand ${\$self->type} patterns.\n");
341 =head1 _expand_pep
343 Title : _expand_pep
344 Usage : n/a; automatically called by expand()
345 Purpose : Expands peptide patterns
346 Returns : String (the expanded pattern)
347 Argument : String (the unexpanded pattern)
348 Throws : n/a
350 See Also : L<expand>(), L<_expand_nuc>()
352 =cut
354 #----------------
355 sub _expand_pep {
356 #----------------
357 my ($self,$pat) = @_;
358 $pat ||= $self->str;
359 $pat =~ s/X/./g;
360 $pat =~ s/^</\^/;
361 $pat =~ s/>$/\$/;
363 ## Avoid nested situations: [bmnq] --/--> [[$ZED]mnq]
364 ## Yet correctly deal with: fze[bmnq] ---> f[$BEE]e[$ZEDmnq]
365 if($pat =~ /\[\w*[BZ]\w*\]/) {
366 $pat =~ s/\[(\w*)B(\w*)\]/\[$1$ZED$2\]/g;
367 $pat =~ s/\[(\w*)Z(\w*)\]/\[$1$BEE$2\]/g;
368 $pat =~ s/B/\[$ZED\]/g;
369 $pat =~ s/Z/\[$BEE\]/g;
370 } else {
371 $pat =~ s/B/\[$ZED\]/g;
372 $pat =~ s/Z/\[$BEE\]/g;
374 $pat =~ s/\((.)\)/$1/g; ## Doing these last since:
375 $pat =~ s/\[(.)\]/$1/g; ## Pattern could contain [B] (for example)
377 return $pat;
382 =head1 _expand_nuc
384 Title : _expand_nuc
385 Purpose : Expands nucleotide patterns
386 Returns : String (the expanded pattern)
387 Argument : String (the unexpanded pattern)
388 Throws : n/a
390 See Also : L<expand>(), L<_expand_pep>()
392 =cut
394 #---------------
395 sub _expand_nuc {
396 #---------------
397 my ($self,$pat) = @_;
399 $pat ||= $self->str;
400 $pat =~ s/N|X/./g;
401 $pat =~ s/pu/R/ig;
402 $pat =~ s/py/Y/ig;
403 $pat =~ s/U/T/g;
404 $pat =~ s/^</\^/;
405 $pat =~ s/>$/\$/;
407 ## Avoid nested situations: [ya] --/--> [[ct]a]
408 ## Yet correctly deal with: sg[ya] ---> [gc]g[cta]
409 if($pat =~ /\[\w*[RYSWMK]\w*\]/) {
410 $pat =~ s/\[(\w*)R(\w*)\]/\[$1$PURINES$2\]/g;
411 $pat =~ s/\[(\w*)Y(\w*)\]/\[$1$PYRIMIDINES$2\]/g;
412 $pat =~ s/\[(\w*)S(\w*)\]/\[$1GC$2\]/g;
413 $pat =~ s/\[(\w*)W(\w*)\]/\[$1AT$2\]/g;
414 $pat =~ s/\[(\w*)M(\w*)\]/\[$1AC$2\]/g;
415 $pat =~ s/\[(\w*)K(\w*)\]/\[$1GT$2\]/g;
416 $pat =~ s/\[(\w*)V(\w*)\]/\[$1ACG$2\]/g;
417 $pat =~ s/\[(\w*)H(\w*)\]/\[$1ACT$2\]/g;
418 $pat =~ s/\[(\w*)D(\w*)\]/\[$1AGT$2\]/g;
419 $pat =~ s/\[(\w*)B(\w*)\]/\[$1CGT$2\]/g;
420 $pat =~ s/R/\[$PURINES\]/g;
421 $pat =~ s/Y/\[$PYRIMIDINES\]/g;
422 $pat =~ s/S/\[GC\]/g;
423 $pat =~ s/W/\[AT\]/g;
424 $pat =~ s/M/\[AC\]/g;
425 $pat =~ s/K/\[GT\]/g;
426 $pat =~ s/V/\[ACG\]/g;
427 $pat =~ s/H/\[ACT\]/g;
428 $pat =~ s/D/\[AGT\]/g;
429 $pat =~ s/B/\[CGT\]/g;
430 } else {
431 $pat =~ s/R/\[$PURINES\]/g;
432 $pat =~ s/Y/\[$PYRIMIDINES\]/g;
433 $pat =~ s/S/\[GC\]/g;
434 $pat =~ s/W/\[AT\]/g;
435 $pat =~ s/M/\[AC\]/g;
436 $pat =~ s/K/\[GT\]/g;
437 $pat =~ s/V/\[ACG\]/g;
438 $pat =~ s/H/\[ACT\]/g;
439 $pat =~ s/D/\[AGT\]/g;
440 $pat =~ s/B/\[CGT\]/g;
442 $pat =~ s/\((.)\)/$1/g; ## Doing thses last since:
443 $pat =~ s/\[(.)\]/$1/g; ## Pattern could contain [y] (for example)
445 return $pat;
450 =head1 revcom
452 Title : revcom
453 Usage : revcom([1]);
454 Purpose : Forms a pattern capable of recognizing the reverse complement
455 : version of a nucleotide sequence pattern.
456 Example : $pattern_object->revcom();
457 : $pattern_object->revcom(1); ## returns expanded rev complement pattern.
458 Returns : Object reference for a new Bio::Tools::SeqPattern containing
459 : the revcom of the current pattern as its sequence.
460 Argument : (1) boolean (optional) (default= false)
461 : true : expand the pattern before rev-complementing.
462 : false: don't expand pattern before or after rev-complementing.
463 Throws : Exception if called for amino acid sequence pattern.
464 Comments : This method permits the simultaneous searching of both
465 : sense and anti-sense versions of a nucleotide pattern
466 : by means of a grep-type of functionality in which any
467 : number of patterns may be or-ed into the recognition
468 : pattern.
469 : Overrides Bio::Seq::revcom() and calls it first thing.
470 : The order of _fixpat() calls is critical.
472 See Also : L<Bio::Seq::revcom>, L</_fixpat_1>, L</_fixpat_2>, L</_fixpat_3>, L</_fixpat_4>, L</_fixpat_5>
474 =cut
476 #-----------'
477 sub revcom {
478 #-----------
479 my($self,$expand) = @_;
481 if ($self->type !~ /Dna|Rna/i) {
482 $self->throw("Can't get revcom for ${\$self->type} sequence types.\n");
484 # return $self->{'_rev'} if defined $self->{'_rev'};
486 $expand ||= 0;
487 my $str = $self->str;
488 $str =~ tr/acgtrymkswhbvdnxACGTRYMKSWHBVDNX/tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX/;
489 my $rev = CORE::reverse $str;
490 $rev =~ tr/[](){}<>/][)(}{></;
492 if($expand) {
493 $rev = $self->_expand_nuc($rev);
494 # print "\nExpanded: $rev\n";
497 %Processed_braces = ();
498 %Processed_asterics = ();
500 my $fixrev = _fixpat_1($rev);
501 # print "FIX 1: $fixrev";<STDIN>;
503 $fixrev = _fixpat_2($fixrev);
504 # print "FIX 2: $fixrev";<STDIN>;
506 $fixrev = _fixpat_3($fixrev);
507 # print "FIX 3: $fixrev";<STDIN>;
509 $fixrev = _fixpat_4($fixrev);
510 # print "FIX 4: $fixrev";<STDIN>;
512 $fixrev = _fixpat_5($fixrev);
513 # print "FIX 5: $fixrev";<STDIN>;
515 ##### Added by ps 8/7/00 to allow non-greedy matching
516 $fixrev = _fixpat_6($fixrev);
517 # print "FIX 6: $fixrev";<STDIN>;
519 # $self->{'_rev'} = $fixrev;
521 return Bio::Tools::SeqPattern->new(-seq =>$fixrev, -type =>$self->type);
524 =head1 backtranslate
526 Title : backtranslate
527 Usage : backtranslate();
528 Purpose : Produce a degenerate oligonucleotide whose translation would produce
529 : the original protein motif.
530 Example : $pattern_object->backtranslate();
531 Returns : Object reference for a new Bio::Tools::SeqPattern containing
532 : the reverse translation of the current pattern as its sequence.
533 Throws : Exception if called for nucleotide sequence pattern.
535 =cut
537 sub backtranslate {
538 my $self = shift;
540 # _load_module loads dynamically, caches call if successful
541 $self->_load_module('Bio::Tools::SeqPattern::Backtranslate');
542 Bio::Tools::SeqPattern::Backtranslate->import("_reverse_translate_motif");
544 if ($self->type ne 'Amino') {
545 $self->throw(
546 "Can't get backtranslate for ${\$self->type} sequence types.\n"
550 return __PACKAGE__->new(
551 -SEQ => _reverse_translate_motif($self->str),
552 -TYPE => 'Dna',
556 =head1 _fixpat_1
558 Title : _fixpat_1
559 Usage : n/a; called automatically by revcom()
560 Purpose : Utility method for revcom()
561 : Converts all {7,5} --> {5,7} (Part I)
562 : and [T^] --> [^T] (Part II)
563 : and *N --> N* (Part III)
564 Returns : String (the new, partially reversed pattern)
565 Argument : String (the expanded pattern)
566 Throws : n/a
568 See Also : L<revcom>()
570 =cut
572 #--------------
573 sub _fixpat_1 {
574 #--------------
575 my $pat = shift;
577 ## Part I:
578 my (@done,@parts);
579 while(1) {
580 $pat =~ /(.*)\{(\S+?)\}(.*)/ or do{ push @done, $pat; last; };
581 $pat = $1.'#{'.reverse($2).'}'.$3;
582 # print "1: $1\n2: $2\n3: $3\n";
583 # print "modified pat: $pat";<STDIN>;
584 @parts = split '#', $pat;
585 push @done, $parts[1];
586 $pat = $parts[0];
587 # print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
588 last if not $pat;
590 $pat = join('', reverse @done);
592 ## Part II:
593 @done = ();
594 while(1) {
595 $pat =~ /(.*)\[(\S+?)\](.*)/ or do{ push @done, $pat; last; };
596 $pat = $1.'#['.reverse($2).']'.$3;
597 # print "1: $1\n2: $2\n3: $3\n";
598 # print "modified pat: $pat";<STDIN>;
599 @parts = split '#', $pat;
600 push @done, $parts[1];
601 $pat = $parts[0];
602 # print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
603 last if not $pat;
605 $pat = join('', reverse @done);
607 ## Part III:
608 @done = ();
609 while(1) {
610 $pat =~ /(.*)\*([\w.])(.*)/ or do{ push @done, $pat; last; };
611 $pat = $1.'#'.$2.'*'.$3;
612 $Processed_asterics{$2}++;
613 # print "1: $1\n2: $2\n3: $3\n";
614 # print "modified pat: $pat";<STDIN>;
615 @parts = split '#', $pat;
616 push @done, $parts[1];
617 $pat = $parts[0];
618 # print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
619 last if not $pat;
621 return join('', reverse @done);
626 =head1 _fixpat_2
628 Title : _fixpat_2
629 Usage : n/a; called automatically by revcom()
630 Purpose : Utility method for revcom()
631 : Converts all {5,7}Y ---> Y{5,7}
632 : and {10,}. ---> .{10,}
633 Returns : String (the new, partially reversed pattern)
634 Argument : String (the expanded, partially reversed pattern)
635 Throws : n/a
637 See Also : L<revcom>()
639 =cut
641 #--------------
642 sub _fixpat_2 {
643 #--------------
644 my $pat = shift;
646 local($^W) = 0;
647 my (@done,@parts,$braces);
648 while(1) {
649 # $pat =~ s/(.*)([^])])(\{\S+?\})([\w.])(.*)/$1$2#$4$3$5/ or do{ push @done, $pat; last; };
650 $pat =~ s/(.*)(\{\S+?\})([\w.])(.*)/$1#$3$2$4/ or do{ push @done, $pat; last; };
651 $braces = $2;
652 $braces =~ s/[{}]//g;
653 $Processed_braces{"$3$braces"}++;
654 # print "modified pat: $pat";<STDIN>;
655 @parts = split '#', $pat;
656 push @done, $parts[1];
657 $pat = $parts[0];
658 # print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
659 last if not $pat;
661 return join('', reverse @done);
665 =head1 _fixpat_3
667 Title : _fixpat_3
668 Usage : n/a; called automatically by revcom()
669 Purpose : Utility method for revcom()
670 : Converts all {5,7}(XXX) ---> (XXX){5,7}
671 Returns : String (the new, partially reversed pattern)
672 Argument : String (the expanded, partially reversed pattern)
673 Throws : n/a
675 See Also : L<revcom>()
677 =cut
679 #-------------
680 sub _fixpat_3 {
681 #-------------
682 my $pat = shift;
684 my (@done,@parts,$braces,$newpat,$oldpat);
685 while(1) {
686 # $pat =~ s/(.+)(\{\S+\})(\(\w+\))(.*)/$1#$3$2$4/ or do{ push @done, $pat; last; };
687 if( $pat =~ /(.*)(.)(\{\S+\})(\(\w+\))(.*)/) {
688 $newpat = "$1#$2$4$3$5";
689 ##ps $oldpat = "$1#$2$3$4$5";
690 # print "1: $1\n2: $2\n3: $3\n4: $4\n5: $5\n";
691 ##ps $braces = $3;
692 ##ps $braces =~ s/[{}]//g;
693 ##ps if( exists $Processed_braces{"$2$braces"} || exists $Processed_asterics{$2}) {
694 ##ps $pat = $oldpat; # Don't change it. Already processed.
695 # print "saved pat: $pat";<STDIN>;
696 ##ps } else {
697 # print "new pat: $newpat";<STDIN>;
698 $pat = $newpat; # Change it.
699 ##ps }
700 } elsif( $pat =~ /^(\{\S+\})(\(\w+\))(.*)/) {
701 $pat = "#$2$1$3";
702 } else {
703 push @done, $pat; last;
705 @parts = split '#', $pat;
706 push @done, $parts[1];
707 $pat = $parts[0];
708 # print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
709 last if not $pat;
711 return join('', reverse @done);
715 =head1 _fixpat_4
717 Title : _fixpat_4
718 Usage : n/a; called automatically by revcom()
719 Purpose : Utility method for revcom()
720 : Converts all {5,7}[XXX] ---> [XXX]{5,7}
721 Returns : String (the new, partially reversed pattern)
722 Argument : String (the expanded, partially reversed pattern)
723 Throws : n/a
725 See Also : L<revcom>()
727 =cut
729 #---------------
730 sub _fixpat_4 {
731 #---------------
732 my $pat = shift;
734 my (@done,@parts,$braces,$newpat,$oldpat);
735 while(1) {
736 # $pat =~ s/(.*)(\{\S+\})(\[\w+\])(.*)/$1#$3$2$4/ or do{ push @done, $pat; last; };
737 # $pat =~ s/(.*)([^\w.])(\{\S+\})(\[\w+\])(.*)/$1$2#$4$3$5/ or do{ push @done, $pat; last; };
738 if( $pat =~ /(.*)(.)(\{\S+\})(\[\w+\])(.*)/) {
739 $newpat = "$1#$2$4$3$5";
740 $oldpat = "$1#$2$3$4$5";
741 # print "1: $1\n2: $2\n3: $3\n4: $4\n5: $5\n";
742 $braces = $3;
743 $braces =~ s/[{}]//g;
744 if( (defined $braces and defined $2) and
745 exists $Processed_braces{"$2$braces"} || exists $Processed_asterics{$2}) {
746 $pat = $oldpat; # Don't change it. Already processed.
747 # print "saved pat: $pat";<STDIN>;
748 } else {
749 $pat = $newpat; # Change it.
750 # print "new pat: $pat";<STDIN>;
752 } elsif( $pat =~ /^(\{\S+\})(\[\w+\])(.*)/) {
753 $pat = "#$2$1$3";
754 } else {
755 push @done, $pat; last;
758 @parts = split '#', $pat;
759 push @done, $parts[1];
760 $pat = $parts[0];
761 # print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
762 last if not $pat;
764 return join('', reverse @done);
768 =head1 _fixpat_5
770 Title : _fixpat_5
771 Usage : n/a; called automatically by revcom()
772 Purpose : Utility method for revcom()
773 : Converts all *[XXX] ---> [XXX]*
774 : and *(XXX) ---> (XXX)*
775 Returns : String (the new, partially reversed pattern)
776 Argument : String (the expanded, partially reversed pattern)
777 Throws : n/a
779 See Also : L<revcom>()
781 =cut
783 #--------------
784 sub _fixpat_5 {
785 #--------------
786 my $pat = shift;
788 my (@done,@parts,$newpat,$oldpat);
789 while(1) {
790 # $pat =~ s/(.*)(\{\S+\})(\[\w+\])(.*)/$1#$3$2$4/ or do{ push @done, $pat; last; };
791 # $pat =~ s/(.*)([^\w.])(\{\S+\})(\[\w+\])(.*)/$1$2#$4$3$5/ or do{ push @done, $pat; last; };
792 if( $pat =~ /(.*)(.)\*(\[\w+\]|\(\w+\))(.*)/) {
793 $newpat = "$1#$2$3*$4";
794 $oldpat = "$1#$2*$3$4";
795 # print "1: $1\n2: $2\n3: $3\n4: $4\n";
796 if( exists $Processed_asterics{$2}) {
797 $pat = $oldpat; # Don't change it. Already processed.
798 # print "saved pat: $pat";<STDIN>;
799 } else {
800 $pat = $newpat; # Change it.
801 # print "new pat: $pat";<STDIN>;
803 } elsif( $pat =~ /^\*(\[\w+\]|\(\w+\))(.*)/) {
804 $pat = "#$1*$3";
805 } else {
806 push @done, $pat; last;
809 @parts = split '#', $pat;
810 push @done, $parts[1];
811 $pat = $parts[0];
812 # print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
813 last if not $pat;
815 return join('', reverse @done);
822 ############################
824 # PS: Added 8/7/00 to allow non-greedy matching patterns
826 ######################################
828 =head1 _fixpat_6
830 Title : _fixpat_6
831 Usage : n/a; called automatically by revcom()
832 Purpose : Utility method for revcom()
833 : Converts all ?Y{5,7} ---> Y{5,7}?
834 : and ?(XXX){5,7} ---> (XXX){5,7}?
835 : and ?[XYZ]{5,7} ---> [XYZ]{5,7}?
836 Returns : String (the new, partially reversed pattern)
837 Argument : String (the expanded, partially reversed pattern)
838 Throws : n/a
840 See Also : L<revcom>()
842 =cut
844 #--------------
845 sub _fixpat_6 {
846 #--------------
847 my $pat = shift;
848 my (@done,@parts);
850 @done = ();
851 while(1) {
852 $pat =~ /(.*)\?(\[\w+\]|\(\w+\)|\w)(\{\S+?\})?(.*)/ or do{ push @done, $pat; last; };
853 my $quantifier = $3 ? $3 : ""; # Shut up warning if no explicit quantifier
854 $pat = $1.'#'.$2.$quantifier.'?'.$4;
855 # $pat = $1.'#'.$2.$3.'?'.$4;
857 # print "1: $1\n2: $2\n3: $3\n";
858 # print "modified pat: $pat";<STDIN>;
859 @parts = split '#', $pat;
860 push @done, $parts[1];
861 $pat = $parts[0];
862 # print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
863 last if not $pat;
865 return join('', reverse @done);
869 =head2 str
871 Title : str
872 Usage : $obj->str($newval)
873 Function:
874 Returns : value of str
875 Args : newvalue (optional)
878 =cut
880 sub str{
881 my $obj = shift;
882 if( @_ ) {
883 my $value = shift;
884 $obj->{'str'} = $value;
886 return $obj->{'str'};
890 =head2 type
892 Title : type
893 Usage : $obj->type($newval)
894 Function:
895 Returns : value of type
896 Args : newvalue (optional)
899 =cut
901 sub type{
902 my $obj = shift;
903 if( @_ ) {
904 my $value = shift;
905 $obj->{'type'} = $value;
907 return $obj->{'type'};
913 __END__
915 #########################################################################
916 # End of class
917 #########################################################################
919 =head1 FOR DEVELOPERS ONLY
921 =head2 Data Members
923 Information about the various data members of this module is provided
924 for those wishing to modify or understand the code. Two things to bear
925 in mind:
927 =over 2
929 =item 1 Do NOT rely on these in any code outside of this module.
931 All data members are prefixed with an underscore to signify that they
932 are private. Always use accessor methods. If the accessor doesn't
933 exist or is inadequate, create or modify an accessor (and let me know,
934 too!).
936 =item 2 This documentation may be incomplete and out of date.
938 It is easy for this documentation to become obsolete as this module is
939 still evolving. Always double check this info and search for members
940 not described here.
942 =back
944 An instance of Bio::Tools::RestrictionEnzyme.pm is a blessed reference
945 to a hash containing all or some of the following fields:
947 FIELD VALUE
948 ------------------------------------------------------------------------
949 _rev : The corrected reverse complement of the fully expanded pattern.
951 INHERITED DATA MEMBERS:
953 _seq : (From Bio::Seq.pm) The original, unexpanded input sequence after untainting.
954 _type : (From Bio::Seq.pm) 'Dna' or 'Amino'
957 =cut