1 #-----------------------------------------------------------------------------
2 # PACKAGE : Bio::Tools::Sigcleave
3 # AUTHOR : Chris Dagdigian, dag@sonsorol.org
4 # CREATED : Jan 28 1999
6 # Copyright (c) 1997-9 bioperl, Chris Dagdigian and others. All Rights Reserved.
7 # This module is free software; you can redistribute it and/or
8 # modify it under the same terms as Perl itself.
12 # Object framework ripped from Steve Chervits's SeqPattern.pm
14 # Core EGCG Sigcleave emulation from perl code developed by
15 # Danh Nguyen & Kamalakar Gulukota which itself was based
16 # loosely on Colgrove's signal.c program.
18 # The overall idea is to replicate the output of the sigcleave
19 # program which was distributed with the EGCG extension to the GCG sequence
20 # analysis package. There is also an accessor method for just getting at
23 #-----------------------------------------------------------------------------
27 Bio::Tools::Sigcleave - Bioperl object for sigcleave analysis
31 =head2 Object Creation
33 use Bio::Tools::Sigcleave ();
35 # to keep the module backwar compatible, you can pass it a sequence string, but
36 # there recommended say is to pass it a Seq object
39 $seq = "MVLLLILSVLLLKEDVRGSAQSSERRVVAHMPGDIIIGALFSVHHQPTVDKVHERKCGAVREQYGI";
40 $sig = Bio::Tools::Sigcleave->new(-seq => $seq,
45 $seqobj = Bio::PrimarySeq->new(-seq => $seq);
47 $sig = Bio::Tools::Sigcleave->new(-seq => $seqobj,
51 # now you can detect procaryotic signal sequences as well as eucaryotic
52 $sig->matrix('eucaryotic'); # or 'procaryotic'
55 =head2 Object Methods & Accessors
57 # you can use this method to fine tune the threshod before printing out the results
60 %raw_results = $sig->signals;
61 $formatted_output = $sig->pretty_print;
65 "Sigcleave" was a program distributed as part of the free EGCG add-on
66 to earlier versions of the GCG Sequence Analysis package. A new
67 implementation of the algorithm is now part of EMBOSS package.
69 From the EGCG documentation:
71 SigCleave uses the von Heijne method to locate signal sequences, and
72 to identify the cleavage site. The method is 95% accurate in
73 resolving signal sequences from non-signal sequences with a cutoff
74 score of 3.5, and 75-80% accurate in identifying the cleavage
75 site. The program reports all hits above a minimum value.
77 The EGCG Sigcleave program was written by Peter Rice (E-mail:
78 pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre,
79 Wellcome Trust Genome Campus, Hinxton, Cambs, CB10 1SA, UK).
81 Since EGCG is no longer distributed for the latest versions of GCG,
82 this code was developed to emulate the output of the original program
83 as much as possible for those who lost access to sigcleave when
84 upgrading to newer versions of GCG.
86 There are 2 accessor methods for this object. "signals" will return a
87 perl associative array containing the sigcleave scores keyed by amino
88 acid position. "pretty_print" returns a formatted string similar to
89 the output of the original sigcleave utility.
91 In both cases, the "threshold" setting controls the score reporting
92 level. If no value for threshold is passed in by the user, the code
93 defaults to a reporting value of 3.5.
95 In this implementation the accessor will never return any
96 score/position pair which does not meet the threshold limit. This is
97 the slightly different from the behaviour of the 8.1 EGCG sigcleave
98 program which will report the highest of the under-threshold results
99 if nothing else is found.
102 Example of pretty_print output:
104 SIGCLEAVE of sigtest from: 1 to 146
106 Report scores over 3.5
107 Maximum score 4.9 at residue 131
109 Sequence: FVILAAMSIQGSA-NLQTQWKSTASLALET
110 | (signal) | (mature peptide)
113 Other entries above 3.5
115 Maximum score 3.7 at residue 112
117 Sequence: CSRQLFGWLFCKV-HPGAIVFVILAAMSIQGSANLQTQWKSTASLALET
118 | (signal) | (mature peptide)
124 When updating and maintaining a module, it helps to know that people
125 are actually using it. Let us know if you find a bug, think this code
126 is useful or have any improvements/features to suggest.
130 Please direct usage questions or support issues to the mailing list:
132 I<bioperl-l@bioperl.org>
134 rather than to the module maintainer directly. Many experienced and
135 reponsive experts will be able look at the problem and quickly
136 address it. Please include a thorough description of the problem
137 with code and data examples if at all possible.
139 =head2 Reporting Bugs
141 Report bugs to the Bioperl bug tracking system to help us keep track
142 the bugs and their resolution. Bug reports can be submitted via the
145 https://github.com/bioperl/bioperl-live/issues
149 Chris Dagdigian, dag-at-sonsorol.org & others
153 Heikki Lehvaslaiho, heikki-at-bioperl-dot-org
157 Bio::Tools::Sigcleave, $Id$
161 Copyright (c) 1999 Chris Dagdigian & others. All Rights Reserved.
162 This module is free software; you can redistribute it and/or modify it
163 under the same terms as Perl itself.
165 =head1 REFERENCES / SEE ALSO
167 von Heijne G. (1986) "A new method for predicting signal sequences
168 cleavage sites." Nucleic Acids Res. 14, 4683-4690.
170 von Heijne G. (1987) in "Sequence Analysis in Molecular Biology:
171 Treasure Trove or Trivial Pursuit" (Acad. Press, (1987), 113-117).
176 The following documentation describes the various functions
177 contained in this module. Some functions are for internal
178 use and are not meant to be called by the user; they are
179 preceded by an underscore ("_").
186 #### END of main POD documentation.
191 package Bio
::Tools
::Sigcleave
;
195 use base
qw(Bio::Root::Root);
197 use vars qw
($ID %WeightTable_euc %WeightTable_pro );
198 $ID = 'Bio::Tools::Sigcleave';
201 #Sample: 161 aligned sequences
202 # R -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 +1 +2 Expect
203 'A' => [16, 13, 14, 15, 20, 18, 18, 17, 25, 15, 47, 6, 80, 18, 6, 14.5],
204 'C' => [ 3, 6, 9, 7, 9, 14, 6, 8, 5, 6, 19, 3, 9, 8, 3, 4.5],
205 'D' => [ 0, 0, 0, 0, 0, 0, 0, 0, 5, 3, 0, 5, 0, 10, 11, 8.9],
206 'E' => [ 0, 0, 0, 1, 0, 0, 0, 0, 3, 7, 0, 7, 0, 13, 14, 10.0],
207 'F' => [13, 9, 11, 11, 6, 7, 18, 13, 4, 5, 0, 13, 0, 6, 4, 5.6],
208 'G' => [ 4, 4, 3, 6, 3, 13, 3, 2, 19, 34, 5, 7, 39, 10, 7, 12.1],
209 'H' => [ 0, 0, 0, 0, 0, 1, 1, 0, 5, 0, 0, 6, 0, 4, 2, 3.4],
210 'I' => [15, 15, 8, 6, 11, 5, 4, 8, 5, 1, 10, 5, 0, 8, 7, 7.4],
211 'K' => [ 0, 0, 0, 1, 0, 0, 1, 0, 0, 4, 0, 2, 0, 11, 9, 11.3],
212 'L' => [71, 68, 72, 79, 78, 45, 64, 49, 10, 23, 8, 20, 1, 8, 4, 12.1],
213 'M' => [ 0, 3, 7, 4, 1, 6, 2, 2, 0, 0, 0, 1, 0, 1, 2, 2.7],
214 'N' => [ 0, 1, 0, 1, 1, 0, 0, 0, 3, 3, 0, 10, 0, 4, 7, 7.1],
215 'P' => [ 2, 0, 2, 0, 0, 4, 1, 8, 20, 14, 0, 1, 3, 0, 22, 7.4],
216 'Q' => [ 0, 0, 0, 1, 0, 6, 1, 0, 10, 8, 0, 18, 3, 19, 10, 6.3],
217 'R' => [ 2, 0, 0, 0, 0, 1, 0, 0, 7, 4, 0, 15, 0, 12, 9, 7.6],
218 'S' => [ 9, 3, 8, 6, 13, 10, 15, 16, 26, 11, 23, 17, 20, 15, 10, 11.4],
219 'T' => [ 2, 10, 5, 4, 5, 13, 7, 7, 12, 6, 17, 8, 6, 3, 10, 9.7],
220 'V' => [20, 25, 15, 18, 13, 15, 11, 27, 0, 12, 32, 3, 0, 8, 17, 11.1],
221 'W' => [ 4, 3, 3, 1, 1, 2, 6, 3, 1, 3, 0, 9, 0, 2, 0, 1.8],
222 'Y' => [ 0, 1, 4, 0, 0, 1, 3, 1, 1, 2, 0, 5, 0, 1, 7, 5.6]
226 #Sample: 36 aligned sequences
227 # R -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 +1 +2 Expect
228 'A' => [0, 8, 8, 9, 6, 7, 5, 6, 7, 7, 24, 2, 31, 18, 4, 3.2],
229 'C' => [1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1.0],
230 'D' => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 8, 2.0],
231 'E' => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 4, 8, 2.2],
232 'F' => [2, 4, 3, 4, 1, 1, 8, 0, 4, 1, 0, 7, 0, 1, 0, 1.3],
233 'G' => [4, 2, 2, 2, 3, 5, 2, 4, 2, 2, 0, 2, 2, 1, 0, 2.7],
234 'H' => [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 7, 0, 1, 0, 0.8],
235 'I' => [3, 1, 5, 1, 5, 0, 1, 3, 0, 0, 0, 0, 0, 0, 2, 1.7],
236 'K' => [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 2.5],
237 'L' => [8, 11, 9, 8, 9, 13, 1, 0, 2, 2, 1, 2, 0, 0, 1, 2.7],
238 'M' => [0, 2, 1, 1, 3, 2, 3, 0, 1, 2, 0, 4, 0, 0, 1, 0.6],
239 'N' => [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 3, 0, 1, 4, 1.6],
240 'P' => [0, 1, 1, 1, 1, 1, 2, 3, 5, 2, 0, 0, 0, 0, 5, 1.7],
241 'Q' => [0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 3, 0, 0, 1, 1.4],
242 'R' => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1.7],
243 'S' => [1, 0, 1, 4, 4, 1, 5, 15, 5, 8, 5, 2, 2, 0, 0, 2.6],
244 'T' => [2, 0, 4, 2, 2, 2, 2, 2, 5, 1, 3, 0, 1, 1, 2, 2.2],
245 'V' => [5, 7, 1, 3, 1, 4, 7, 0, 0, 4, 3, 0, 0, 2, 0, 2.5],
246 'W' => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0.4],
247 'Y' => [0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 1, 0, 0, 0, 1.3]
252 ## Now we calculate the _real_ values for the weight tables
255 ## yeah yeah yeah there is lots of math here that gets repeated
256 ## every single time a sigcleave object gets created. This is
257 ## a quick hack to make sure that we get the scores as accurate as
258 ## possible. Need all those significant digits....
260 ## suggestions for speedup aproaches welcome
264 foreach my $i (keys %WeightTable_euc) {
265 my $expected = $WeightTable_euc{$i}[15];
267 for (my $j=0; $j<16; $j++) {
268 if ($WeightTable_euc{$i}[$j] == 0) {
269 $WeightTable_euc{$i}[$j] = 1;
270 if ($j == 10 || $j == 12) {
271 $WeightTable_euc{$i}[$j] = 1.e
-10;
274 $WeightTable_euc{$i}[$j] = log($WeightTable_euc{$i}[$j]/$expected);
280 foreach my $i (keys %WeightTable_pro) {
281 my $expected = $WeightTable_pro{$i}[15];
283 for (my $j=0; $j<16; $j++) {
284 if ($WeightTable_pro{$i}[$j] == 0) {
285 $WeightTable_pro{$i}[$j] = 1;
286 if ($j == 10 || $j == 12) {
287 $WeightTable_pro{$i}[$j] = 1.e
-10;
290 $WeightTable_pro{$i}[$j] = log($WeightTable_pro{$i}[$j]/$expected);
295 #####################################################################################
297 #####################################################################################
301 my ($class, @args) = @_;
303 my $self = $class->SUPER::new
(@args);
304 #my $self = Bio::Seq->new(@args);
306 my ($seq, $threshold, $matrix) = $self->_rearrange([qw(SEQ THRESHOLD MATRIX)],@args);
308 defined $threshold && $self->threshold($threshold);
309 $matrix && $self->matrix($matrix);
310 $seq && $self->seq($seq);
320 Usage : $value = $self->threshold
321 Purpose : Read/write method sigcleave score reporting threshold.
323 Argument : new value, float
324 Throws : on non-number argument
325 Comments : defaults to 3.5
333 my ($self, $value) = @_;
334 if( defined $value) {
335 $self->throw("I need a number, not [$value]")
336 if $value !~ /^[+-]?[\d\.]+$/;
337 $self->{'_threshold'} = $value;
339 return $self->{'_threshold'} || 3.5 ;
345 Usage : $value = $self->matrix('procaryotic')
346 Purpose : Read/write method sigcleave matrix.
348 Argument : new value: 'eucaryotic' or 'procaryotic'
349 Throws : on non-number argument
350 Comments : defaults to 3.5
358 my ($self, $value) = @_;
359 if( defined $value) {
360 $self->throw("I need 'eucaryotic' or 'procaryotic', not [$value]")
361 unless $value eq 'eucaryotic' or $value eq 'procaryotic';
362 $self->{'_matrix'} = $value;
364 return $self->{'_matrix'} || 'eucaryotic' ;
370 Usage : $value = $self->seq($seq_object)
371 Purpose : set the Seq object to be used
373 Argument : protein sequence or Seq object
381 my ($self, $value) = @_;
382 if( defined $value) {
383 if ($value->isa('Bio::PrimarySeqI')) {
384 $self->{'_seq'} = $value;
386 $self->{'_seq'} = Bio
::PrimarySeq
->new(-seq
=> $value,
387 -alphabet
=> 'protein');
390 return $self->{'_seq'};
396 Usage : N/A This is an internal method. Not meant to be called from outside
399 Purpose : calculates sigcleave score and amino acid position for the
400 : given protein sequence. The score reporting threshold can
401 : be adjusted by passing in the "threshold" parameter during
402 : object construction. If no threshold is passed in, the code
403 : defaults to reporting any scores equal to or above 3.5
405 Returns : nothing. results are added to the object
435 my $seqEnd = $self->seq->length;
436 my $pep = $self->seq->seq;
437 my $minWeight = $self->threshold;
438 my $matrix = $self->matrix;
440 ## The weight table is keyed by UPPERCASE letters so we uppercase
441 ## the pep string because we don't want to alter the actual object
446 for ($seqPos = $seqBegin; $seqPos < $seqEnd; $seqPos++) {
447 $istart = (0 > $seqPos + $pVal)?
0 : $seqPos + $pVal;
448 $iend = ($seqPos + $nVal - 1 < $seqEnd)?
$seqPos + $nVal - 1 : $seqEnd;
449 $icol= $iend - $istart + 1;
451 for ($k=0; $k<$icol; $k++) {
452 $c = substr($pep, $istart + $k, 1);
454 ## CD: The if(defined) stuff was put in here because Sigcleave.pm
455 ## CD: kept getting warnings about undefined vals during 'make test' ...
456 if ($matrix eq 'eucaryotic') {
457 $weight += $WeightTable_euc{$c}[$k] if defined $WeightTable_euc{$c}[$k];
459 $weight += $WeightTable_pro{$c}[$k] if defined $WeightTable_pro{$c}[$k];
462 $signals{$seqPos+1} = sprintf ("%.1f", $weight) if $weight >= $minWeight;
464 $self->{"_signal_scores"} = { %signals };
471 Usage : %sigcleave_results = $sig->signals;
473 Purpose : Accessor method for sigcleave results
475 Returns : Associative array. The key value represents the amino acid position
476 : and the value represents the score. Only scores that
477 : are greater than or equal to the THRESHOLD value are reported.
493 # do the calculations
496 foreach $position ( sort keys %{ $self->{'_signal_scores'} } ) {
497 $results{$position} = $self->{'_signal_scores'}{$position};
506 Usage : $count = $sig->result_count;
508 Purpose : Accessor method for sigcleave results
510 Returns : Integer, number of results above the threshold
525 return keys %{ $self->{'_signal_scores'} };
532 Usage : $output = $sig->pretty_print;
533 : print $sig->pretty_print;
535 Purpose : Emulates the output of the EGCG Sigcleave
538 Returns : A formatted string.
553 my %results = $self->signals;
554 my @hits = keys %results;
555 my $hitcount = $#hits; $hitcount++;
556 my $thresh = $self->threshold;
557 my $seqlen = $self->seq->length || 0;
558 my $name = $self->seq->id || 'NONAME';
559 my $pep = $self->seq->seq;
562 $output = "SIGCLEAVE of $name from: 1 to $seqlen\n\n";
565 $output .= "Report scores over $thresh\n";
566 foreach $pos ((sort { $results{$b} cmp $results{$a} } keys %results)) {
567 my $start = $pos - 15;
568 $start = 1 if $start < 1;
569 my $sig = substr($pep,$start -1,$pos-$start );
571 $output .= sprintf ("Maximum score %1.1f at residue %3d\n",$results{$pos},$pos);
573 $output .= " Sequence: ";
575 $output .= "-" x
(15- length($sig));
577 $output .= substr($pep,$pos-1,50);
580 $output .= "| \(signal\) | \(mature peptide\)\n";
581 $output .= sprintf(" %3d %3d\n\n",$start,$pos);
583 if (($hitcount > 1) && ($cnt == 1)) {
584 $output .= " Other entries above $thresh\n\n";
597 #########################################################################
599 #########################################################################