Bio::Matrix::PSM::SiteMatrixI - SiteMatrixI implementation, holds a
position scoring matrix (or position weight matrix) and log-odds

  # You cannot use this module directly; see Bio::Matrix::PSM::SiteMatrix
  # for an example implementation

SiteMatrix is designed to provide some basic methods for working with position
scoring (weight) matrices, such as those describing transcription factor binding
sites. A DNA PSM consists of four vectors with frequencies {A,C,G,T}. This is
the minimum information you should provide to construct a PSM object. The
vectors can be provided as strings in which each character is the frequency
multiplied by 10 and rounded to an integer, written as {0..9,a}, where 'a'
represents the maximum (10). This is like MEME's compressed representation of a
matrix and it is quite useful when working with a relational database. If arrays
are provided as input (references to arrays, actually), the values can be any
numbers, real or integer (frequencies or counts).
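
The string encoding described above can be illustrated with a minimal sketch.
The helper names encode_freqs and decode_freqs are hypothetical and not part of
this module, and rounding to the nearest integer is an assumption about how
"rounded" is meant.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::Matrix::PSM): encode frequencies in
# the range 0..1 as one character per position, '0'..'9', or 'a' for 10.
sub encode_freqs {
    my @freqs = @_;
    return join '', map {
        my $n = int($_ * 10 + 0.5);    # frequency x 10, rounded to nearest int (assumed)
        $n == 10 ? 'a' : $n;
    } @freqs;
}

# Hypothetical inverse: decode the string back to approximate frequencies.
sub decode_freqs {
    my ($str) = @_;
    return map { ($_ eq 'a' ? 10 : $_) / 10 } split //, $str;
}

my $a_vector = encode_freqs(0.1, 0.8, 0, 1);    # one character per position
my @approx   = decode_freqs($a_vector);         # precision is limited to 0.1
```

Each position costs one character, at the price of reducing every frequency to
the nearest tenth.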

When creating the object you can ask the constructor to make a simple
pseudocount correction by adding a number (typically 1) to all positions (with
the -correction option). The frequencies are calculated after the number has
been added. Only use the correction when you supply counts, not frequencies.
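
The effect of the correction can be sketched as follows. This is an
illustration under stated assumptions, not the constructor's code: the helper
counts_to_freqs and its data layout (a hash of array references keyed by base)
are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: turn per-position counts into frequencies after
# adding the same pseudocount to every cell.
sub counts_to_freqs {
    my ($counts, $correction) = @_;    # $counts = { A => [...], C => [...], ... }
    my @bases = qw(A C G T);
    my $width = scalar @{ $counts->{A} };
    my %freq;
    for my $i (0 .. $width - 1) {
        my $total = 0;
        $total += $counts->{$_}[$i] + $correction for @bases;
        die "Position $i is (0,0,0,0)\n" if $total == 0;
        $freq{$_}[$i] = ($counts->{$_}[$i] + $correction) / $total for @bases;
    }
    return \%freq;
}

my $freq = counts_to_freqs(
    { A => [8, 0], C => [0, 4], G => [0, 4], T => [0, 0] },
    1,
);
# Position 0: A = (8+1)/(8+4) = 0.75; C, G and T = 1/12 each.
```

Applying the same step to values that are already frequencies would distort
them, which is why the correction is meant for counts only.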

Throws an exception if: you mix arrays and strings as input (for example, the A
vector is given as an array and the C vector as a string); a position vector is
(0,0,0,0); or one of the probability vectors is shorter than the rest.

Summary of the methods I use most frequently (details below):

  iupac - return IUPAC compliant consensus as a string
  score - Returns the score as a real number
  IC - information content. Returns a real number
  id - identifier. Returns a string
  accession - accession number. Returns a string
  next_pos - return the sequence probability for each letter, IUPAC
        symbol, IUPAC probability and simple sequence
        consensus letter for this position. Rewind at the end. Returns a hash.
  pos - current position get/set. Returns an integer.
  regexp - construct a regular expression based on IUPAC consensus.
        For example AGWV will be [Aa][Gg][AaTt][AaCcGg]

  get_string - gets the probability vector for a single base as a string.
  get_array - gets the probability vector for a single base as an array.
  get_logs_array - gets the log-odds vector for a single base as an array.
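
The regexp conversion in the summary above (AGWV becoming
[Aa][Gg][AaTt][AaCcGg]) can be sketched with a lookup table of the standard
IUPAC nucleotide codes. The helper iupac_to_regexp is hypothetical, and mapping
N to all four bases is a simplification made for this sketch; it does not
reproduce how an implementing class treats N.

```perl
use strict;
use warnings;

# Standard IUPAC nucleotide ambiguity codes.
my %IUPAC = (
    A => 'A',   C => 'C',   G => 'G',   T => 'T',
    R => 'AG',  Y => 'CT',  S => 'CG',  W => 'AT',  K => 'GT', M => 'AC',
    B => 'CGT', D => 'AGT', H => 'ACT', V => 'ACG', N => 'ACGT',
);

# Hypothetical helper: build a case-insensitive character class per symbol.
sub iupac_to_regexp {
    my ($consensus) = @_;
    my $pattern = '';
    for my $sym (split //, uc $consensus) {
        my $bases = $IUPAC{$sym} or die "Unknown IUPAC symbol '$sym'\n";
        # Emit each base in upper and lower case, e.g. W -> [AaTt].
        $pattern .= '[' . join('', map { $_ . lc($_) } split //, $bases) . ']';
    }
    return $pattern;
}

my $pattern = iupac_to_regexp('AGWV');
print "match\n" if 'agtc' =~ /^$pattern$/;
```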

Methods that might be of interest to anyone who wants to store a PSM in a
relational database without creating an entry for each position make it possible
to compress a PSM vector into a string, usually losing less than 1% of the data.
This can be done with:

  my $str=$matrix->get_compressed_freq('A');

or

  my $str=$matrix->get_compressed_logs('A');

Loading from a database should be done with new, but this is not yet implemented.
However, you can still uncompress such a string with:

  my @arr=Bio::Matrix::PSM::_uncompress_string ($str,1,1);    # for PSM

or

  my @arr=Bio::Matrix::PSM::_uncompress_string ($str,1000,2); # for log odds
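
To show why one byte per position loses so little, here is a minimal sketch of
a byte-per-value scheme. It is not the module's code: the helpers
compress_values and uncompress_values are hypothetical, they take a boolean
signed flag rather than the direction argument used above, and the exact
scaling and rounding are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: scale each value into one byte. Unsigned values map
# 0..$max to 0..255; signed values map -$max..+$max to 0..254.
sub compress_values {
    my ($values, $max, $signed) = @_;
    my $scale = ($signed ? 127 : 255) / $max;
    my $min   = $signed ? -$max : 0;
    return join '', map {
        my $v = $_;
        $v = $max if $v > $max;    # clamp out-of-range values
        $v = $min if $v < $min;
        my $byte = int($v * $scale + ($v < 0 ? -0.5 : 0.5));    # round to nearest
        chr($signed ? $byte + 127 : $byte);
    } @$values;
}

# Hypothetical inverse: one character back to one approximate value.
sub uncompress_values {
    my ($str, $max, $signed) = @_;
    my $scale = ($signed ? 127 : 255) / $max;
    return map {
        my $byte = ord($_);
        ($signed ? $byte - 127 : $byte) / $scale;
    } split //, $str;
}

my $packed   = compress_values([0, 0.25, 0.5, 1], 1, 0);    # 4 bytes
my @restored = uncompress_values($packed, 1, 0);
```

With 256 levels for a frequency between 0 and 1, the rounding error per value
in this sketch stays below half a level, that is, under 0.2% of the maximum.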

User feedback is an integral part of the evolution of this and other
Bioperl modules. Send your comments and suggestions preferably to one
of the Bioperl mailing lists. Your participation is much appreciated.

  bioperl-l@bioperl.org                  - General discussion
  http://bioperl.org/wiki/Mailing_lists  - About the mailing lists

Please direct usage questions or support issues to the mailing list:

I<bioperl-l@bioperl.org>

rather than to the module maintainer directly. Many experienced and
responsive experts will be able to look at the problem and quickly
address it. Please include a thorough description of the problem
with code and data examples if at all possible.

Report bugs to the Bioperl bug tracking system to help us keep track
of the bugs and their resolution. Bug reports can be submitted via the
web:

  https://github.com/bioperl/bioperl-live/issues

=head1 AUTHOR - Stefan Kirov

# Let the code begin...

package Bio::Matrix::PSM::SiteMatrixI;

use base qw(Bio::Root::RootI);

 Usage   : $self->calc_weight({A=>0.2562,C=>0.2438,G=>0.2432,T=>0.2568});
 Function: Recalculates the PSM (or weights) based on the PFM (the frequency matrix)
           and user supplied background model.
 Throws  : if no model is supplied
 Args    : reference to a hash with background frequencies for A,C,G and T
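
One common way to derive weights from a frequency matrix and a background
model is the log-odds ratio of the observed frequency to the background
frequency. The sketch below assumes that definition. The helper name log_odds,
the base-2 logarithm and the data layout are assumptions made for illustration;
an implementing class may calculate its weights differently.

```perl
use strict;
use warnings;

# Hypothetical helper: log2(frequency / background) for every base and position.
sub log_odds {
    my ($freq, $bg) = @_;    # both are hash references keyed by A, C, G, T
    die "No background model supplied\n" unless $bg && %$bg;
    my %logs;
    for my $base (qw(A C G T)) {
        for my $p (@{ $freq->{$base} }) {
            # log(0) is undefined, so zero frequencies need a pseudocount first.
            die "Zero frequency for $base; apply a pseudocount first\n" if $p <= 0;
            push @{ $logs{$base} }, log($p / $bg->{$base}) / log(2);
        }
    }
    return \%logs;
}

my $weights = log_odds(
    { A => [0.5], C => [0.25], G => [0.125], T => [0.125] },
    { A => 0.25, C => 0.25, G => 0.25, T => 0.25 },
);
# A is twice as frequent as background, so its weight is log2(2) = 1.
```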
    $self->throw_not_implemented();

 Usage   : my %base=$site->next_pos;
           Retrieves the next position features: frequencies and weights for
           A,C,G,T, the main letter (as in consensus) and the
           probability for this letter to occur at this position and
 Returns : hash (pA,pC,pG,pT,lA,lC,lG,lT,base,prob,rel)
    $self->throw_not_implemented();

 Usage   : my $pos=$site->curpos;
 Function: Gets/sets the current position. Converts to 0 if the argument is
           negative and to width if it is greater than width
    $self->throw_not_implemented();

 Usage   : my $score=$site->e_val;
 Function: Gets/sets the e-value
 Returns : real number
    $self->throw_not_implemented();

 Function: Returns the consensus
 Args    : (optional) threshold value 1 to 10, default 5
           '5' means the returned characters had a 50% or higher presence at
    $self->throw_not_implemented();

=head2 accession_number

 Title   : accession_number
 Function: accession number; this will be a unique id for the SiteMatrix object
           as well as for any other object inheriting from SiteMatrix

=cut

sub accession_number {
    my $self = shift;
    $self->throw_not_implemented();
}

 Usage   : my $width=$site->width;
 Function: Returns the length of the site
    $self->throw_not_implemented();

 Usage   : my $iupac_consensus=$site->IUPAC;
 Function: Returns IUPAC compliant consensus
    $self->throw_not_implemented();

 Usage   : my $ic=$site->IC;
 Function: Information content
 Returns : real number
    $self->throw_not_implemented();

 Usage   : my $freq_A=$site->get_string('A');
 Function: Returns the given probability vector as a string. Useful if you want to
           store things in a relational database, where arrays are not the first choice
 Throws  : If the argument is outside {A,C,G,T}
 Args    : character {A,C,G,T}
    $self->throw_not_implemented();

 Usage   : my $id=$site->id;
 Function: Gets/sets the site id
    $self->throw_not_implemented();

 Usage   : my $regexp=$site->regexp;
 Function: Returns a regular expression which matches the IUPAC convention.
           N will match X, N, - and .
    $self->throw_not_implemented();

 Usage   : my @regexp=$site->regexp;
 Function: Returns a regular expression which matches the IUPAC convention.
           N will match X, N, - and .
 To do   : I have separated regexp and regexp_array, but
           maybe they can be rewritten as one - just check what
    $self->throw_not_implemented();

 Usage   : my @freq_A=$site->get_array('A');
 Function: Returns an array with frequencies for a specified base
    $self->throw_not_implemented();

 Function: Converts a single position to IUPAC compliant symbol and
           returns its probability. For rules see the implementation.
 Returns : char, real number
 Args    : real numbers for A,C,G,T (positional)
    $self->throw_not_implemented();

 Function: Converts a single position to simple consensus character and
           returns its probability. For rules see the implementation,
 Returns : char, real number
 Args    : real numbers for A,C,G,T (positional)
    $self->throw_not_implemented();

=head2 _calculate_consensus

 Title   : _calculate_consensus
 Function: Internal stuff

=cut

sub _calculate_consensus {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 _compress_array

 Title   : _compress_array
 Function: Will compress an array of real signed numbers to a string (i.e. a vector of bytes),
           -127 to +127 for bi-directional (signed) and 0..255 for unsigned
 Example : Internal stuff
 Args    : array reference, followed by a max value and
           direction (optional, default 1 - unsigned); 1 is unsigned, any other value is signed.

=cut

sub _compress_array {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 _uncompress_string

 Title   : _uncompress_string
 Function: Will uncompress a string (vector of bytes) to create an array of real
           signed numbers (opposite to _compress_array)
 Example : Internal stuff
 Args    : string, followed by a max value and
           direction (optional, default 1 - unsigned); 1 is unsigned, any other value is signed.

=cut

sub _uncompress_string {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 get_compressed_freq

 Title   : get_compressed_freq
 Function: A method to provide a compressed frequency vector. It uses one byte to
           code the frequency for one of the probability vectors for one position.
           Useful for relational databases. Improvement of the previous 0..a coding.
 Example : my $strA=$self->get_compressed_freq('A');

=cut

sub get_compressed_freq {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 get_compressed_logs

 Title   : get_compressed_logs
 Function: A method to provide a compressed log-odds vector. It uses one byte to
           code the log value for one of the log-odds vectors for one position.
 Example : my $strA=$self->get_compressed_logs('A');

=cut

sub get_compressed_logs {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 sequence_match_weight

 Title   : sequence_match_weight
 Function: This method will calculate the score of a match, based on the PWM
           if such is associated with the matrix object. Returns undef if no
           PWM data is available.
 Throws  : if the length of the sequence is different from the matrix width
 Example : my $score=$matrix->sequence_match_weight('ACGGATAG');
 Returns : Floating point
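
A match score of this kind is typically the sum, over all positions, of the
weight of the letter observed at that position. The sketch below assumes that
definition. The helper match_weight and its hash-of-arrays input are
hypothetical and do not represent the code of an implementing class.

```perl
use strict;
use warnings;

# Hypothetical helper: sum the per-position weights of the letters in $seq.
sub match_weight {
    my ($logs, $seq) = @_;    # $logs = { A => [...], C => [...], G => [...], T => [...] }
    return unless $logs && $logs->{A};    # no PWM data available
    my $width = scalar @{ $logs->{A} };
    die "Sequence length differs from matrix width\n"
        unless length($seq) == $width;
    my @letters = split //, uc $seq;
    my $score   = 0;
    for my $i (0 .. $width - 1) {
        my $column = $logs->{ $letters[$i] }
            or die "Unexpected letter '$letters[$i]'\n";
        $score += $column->[$i];
    }
    return $score;
}

my $pwm = {
    A => [ 1, -1 ], C => [ 0, 2 ], G => [ -1, 0 ], T => [ -2, -3 ],
};
my $match_score = match_weight($pwm, 'AC');    # weight of A at 0 plus C at 1
```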

=cut

sub sequence_match_weight {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 get_all_vectors

 Title   : get_all_vectors
 Function: returns all possible sequence vectors to satisfy the PFM under
 Throws  : If threshold outside of 0..1 (no sense to do that)
 Example : my @vectors=$self->get_all_vectors(4);
 Returns : Array of strings
 Args    : (optional) floating

=cut

sub get_all_vectors {
    my $self = shift;
    $self->throw_not_implemented();
}