use strict;
use CXGN::Page;
my $page=CXGN::Page->new('unigene-precluster.html','html2pl converter');
$page->header('Sol Genomics Network');
print<<END_HEREDOC;
<center>
<table summary="" width="720" cellpadding="0" cellspacing="0"
border="0">
<tr>
<td>
<h3>Preclustering</h3>
<p>Preclustering is a technique used to partition the
input data into groups small enough for the assembly
program to process. Even on powerful computers, an
assembler such as CAP3 or PHRAP cannot run effectively
with more than 20,000 input sequences: either the memory
requirements are too large or the runtime is unacceptably
long.</p>
<p>By preclustering, we divide the input into disjoint
groups of sequences that share no detectable similarity
with the sequences in other groups. This limits the
work of the assembler by excluding sequences which are
obviously not transcripts from the same gene. Thus, the
assembly program decides how many unique transcripts a
"cluster" of similar ESTs contains (and assembles them
into contigs), while preclustering partitions the
input set into disjoint clusters of similar sequences
which are small enough to allow the assembler to run
efficiently.</p>
<h3>Transitive Closure Clustering</h3>
<p>There are many general methods of clustering data. For
the purpose of partitioning data into disjoint sets for
unigene assembly, we use a simple method which we call
"transitive closure clustering." The same methodology has
been described elsewhere as "single-linkage
clustering."</p>
<p>Pairwise scores are found for all pairs of sequences.
If the score for a pair of sequences is higher than some
given threshold, the pair is considered linked. If A is
linked to B, and B is linked to C, then A, B, and C are
clustered together, even if A is not considered linked
with C. Hence, cluster membership is treated as transitive,
and a cluster is found by taking the transitive closure
of the linkage relationship.</p>
<p>In the context of unigene assembly, this effectively
yields disjoint clusters of sequences for which no
sequence in a given cluster has a detectable coarse
overlap with any sequence in any other cluster. Thus,
there is no possibility for contig assembly of two
sequences which are in different clusters, so the
exclusion does not in theory alter the outcome of the
assembly step. Since the preclustering pairwise
comparisons are much more efficient coarse approximations
than the assembler's full alignments, the overall runtime
and resource consumption of the unigene build become
manageable.</p>
</td>
</tr>
</table>
</center>
END_HEREDOC
$page->footer();
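
# Illustrative sketch only, not part of the SGN unigene pipeline: one possible
# way to compute the transitive closure clustering described in the page text
# is a union-find (disjoint-set) pass over the linked pairs. The names
# cluster_by_links and _find_root, and the input layout, are hypothetical.
#
# Arguments: an array ref of sequence IDs, and an array ref of [id_a, id_b]
# pairs whose pairwise score already exceeded the threshold. Assumes every
# ID used in a pair also appears in the ID list.
# Returns: a list of array refs, one per cluster (in no particular order).
sub cluster_by_links {
    my ($ids, $links) = @_;
    my %parent = map { $_ => $_ } @$ids;    # each sequence starts alone
    for my $link (@$links) {
        my $root_a = _find_root(\%parent, $link->[0]);
        my $root_b = _find_root(\%parent, $link->[1]);
        $parent{$root_a} = $root_b if $root_a ne $root_b;    # merge clusters
    }
    my %members;
    push @{ $members{ _find_root(\%parent, $_) } }, $_ for @$ids;
    return values %members;
}

# Follow parent pointers to the cluster representative, halving the path
# along the way so that later lookups are shorter.
sub _find_root {
    my ($parent, $id) = @_;
    while ($parent->{$id} ne $id) {
        $parent->{$id} = $parent->{ $parent->{$id} };
        $id = $parent->{$id};
    }
    return $id;
}

# Example call: links A-B and B-C put A, B and C in one cluster even though
# A and C are not directly linked; D has no links and stays alone.
# my @clusters = cluster_by_links([qw(A B C D)], [[qw(A B)], [qw(B C)]]);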