cgi-bin/methods/unigene/unigene-precluster.pl

use strict;
use CXGN::Page;

# Static documentation page: the HTML body below is printed verbatim
# between the site header and footer.
my $page=CXGN::Page->new('unigene-precluster.html','html2pl converter');
$page->header('Sol Genomics Network');
print<<END_HEREDOC;

  <center>

    <table summary="" width="720" cellpadding="0" cellspacing="0"
    border="0">
      <tr>
        <td>

          <h3>Preclustering</h3>

          <p>Preclustering is a technique used to partition the
          input data into groups small enough for the assembly
          program to process. Even on powerful computers, an
          assembler such as CAP3 or PHRAP cannot effectively run
          with more than 20,000 input sequences. Either the memory
          requirements are too large or the runtime is unacceptably
          long.</p>

          <p>By preclustering, we partition the input into
          disjoint groups of sequences which are not at all similar
          to any of the sequences in other groups. This limits the
          assembler's work by excluding sequences which are
          obviously not transcripts from the same gene. Thus, the
          assembly program is used to decide the number of unique
          transcripts in a "cluster" of similar ESTs (and to
          assemble them into contigs), while preclustering is used
          to partition the input set into disjoint clusters of
          similar sequences which are small enough to allow the
          assembler to run efficiently.</p>

          <h3>Transitive Closure Clustering</h3>

          <p>There are many general methods of clustering data. For
          the purpose of partitioning data into disjoint sets for
          unigene assembly, we use a simple method which we call
          "transitive closure clustering." The same methodology has
          been described elsewhere as "single-linkage
          clustering."</p>

          <p>Pairwise scores are found for all pairs of sequences.
          If the score for a pair of sequences is higher than some
          given threshold, the pair is considered linked. If A is
          linked to B, and B is linked to C, then A, B, and C are
          clustered together, even if A is not considered linked
          with C. Hence, cluster membership is transitive even
          though the linkage relationship itself need not be, and a
          cluster is found by computing the transitive closure of
          the linkage relationship.</p>

          <p>In the context of unigene assembly, this effectively
          yields disjoint clusters of sequences for which no
          sequence in a given cluster has a detectable coarse
          overlap with any sequence in any other cluster. Thus,
          there is no possibility for contig assembly of two
          sequences which are in different clusters, so the
          exclusion does not in theory alter the outcome of the
          assembly step. Since the pairwise comparisons used in
          preclustering are coarse approximations that are much
          cheaper to compute than the assembler's full alignments,
          the overall runtime and resource consumption of the
          unigene build become manageable.</p>
        </td>
      </tr>

    </table>
  </center>
END_HEREDOC
$page->footer();
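
The transitive closure clustering described in the page text above can be sketched with a union-find (disjoint-set) structure. This is only an illustrative sketch, not the code used for the actual unigene builds: the function name transitive_closure_clusters, the [id, id, score] input format, and the example scores and threshold are all assumptions made for the example. It assumes the pairwise scores have already been computed by some coarse comparison step.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CXGN): single-linkage clustering.
#   $ids       - array ref of sequence identifiers
#   $pairs     - array ref of [id_a, id_b, score] triples (assumed format)
#   $threshold - pairs scoring above this value are considered linked
sub transitive_closure_clusters {
    my ($ids, $pairs, $threshold) = @_;

    # Each sequence starts as the representative of its own cluster.
    my %parent = map { $_ => $_ } @$ids;

    # Find the representative of a sequence's cluster, compressing the path.
    my $find = sub {
        my ($x) = @_;
        $parent{$x} //= $x;    # register IDs seen only in the pair list
        my $root = $x;
        $root = $parent{$root} while $parent{$root} ne $root;
        while ($parent{$x} ne $root) {
            my $next = $parent{$x};
            $parent{$x} = $root;
            $x = $next;
        }
        return $root;
    };

    # Merge the clusters of every pair whose score exceeds the threshold.
    for my $pair (@$pairs) {
        my ($id_a, $id_b, $score) = @$pair;
        next unless $score > $threshold;
        my ($root_a, $root_b) = ($find->($id_a), $find->($id_b));
        $parent{$root_a} = $root_b if $root_a ne $root_b;
    }

    # Collect the members of each cluster under its representative.
    my %members;
    push @{ $members{ $find->($_) } }, $_ for keys %parent;

    # Sort members within clusters, and clusters by first member.
    return [ sort { $a->[0] cmp $b->[0] } map { [ sort @$_ ] } values %members ];
}

# Made-up example: A-B and B-C are linked, A-C is not, D is linked to nothing.
my $clusters = transitive_closure_clusters(
    [qw(A B C D)],
    [ ['A', 'B', 0.9], ['B', 'C', 0.8], ['A', 'C', 0.1], ['C', 'D', 0.2] ],
    0.5,
);
print join(' ', @$_), "\n" for @$clusters;    # prints "A B C" then "D"
```

A, B, and C end up in one cluster even though the A-C score is below the threshold, which is the transitive behaviour the text describes. Union-find is one convenient choice here because each link is processed once and the full linkage graph never has to be held in memory; a graph traversal over the linked pairs would produce the same clusters.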