[sgn.git] / cgi-bin / methods / unigene / unigene-process-2.pl
use strict;
use CXGN::Page;
my $page = CXGN::Page->new('unigene-process-2.html', 'html2pl converter');
$page->header('SGN Assembly Process Version 2');
print <<END_HEREDOC;
<center>
<table summary="" width="720" cellpadding="0" cellspacing="0"
border="0">
<tr>
<td>
<h3>SGN Assembly Process Version 2</h3>
<p>ESTs are preclustered using a custom-developed tool to
coarsely identify strong sequence overlaps. (<a href=
"unigene-precluster.pl">Why precluster?</a>) This
produces a set of pairwise scores to be used in <a href=
"unigene-precluster.pl">transitive closure
clustering</a>, implemented as a graph algorithm using
depth-first search.</p>
<p>In graph-theoretic terms, the sequences are the
nodes of a graph. An undirected edge between two nodes
indicates a detected overlap between the sequences they
represent. Edges may be weighted to indicate the strength
of the overlap. The connected components of the graph are
discovered by depth-first search, yielding a depth-first
"forest" in which each tree is one sequence cluster.</p>
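<p>The clustering step described above can be sketched in a few
lines of Python. This is an illustration only, not the code used
in the SGN pipeline: the function name, the EST identifiers, and
the overlap list are hypothetical, and edge weights are ignored.</p>
<pre>
```python
def connected_components(nodes, edges):
    """Group nodes into clusters, one cluster per connected component."""
    # Build an undirected adjacency list; isolated nodes get an empty list.
    adjacency = {node: [] for node in nodes}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)

    seen = set()
    clusters = []
    for start in nodes:
        if start in seen:
            continue
        # Iterative depth-first search from a node not yet visited.
        stack = [start]
        component = []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters


# Hypothetical EST identifiers and detected pairwise overlaps.
ests = ["est1", "est2", "est3", "est4", "est5"]
overlaps = [("est1", "est2"), ("est2", "est3"), ("est4", "est5")]
clusters = connected_components(ests, overlaps)
print(clusters)
```
</pre>
<p>In this toy input, est1 and est3 never overlap directly, yet they
land in the same cluster because both overlap est2. That is the
transitive closure behaviour the text refers to.</p>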
<p>Articulation points in the graph are discovered by
analyzing the "tree edge" and "back edge" classification
of edges from the depth-first search. Nodes identified as
articulation points are potentially chimeric sequences,
and their overlaps are analyzed further for adjacent but
distinct homology regions. Sequences with such regions
are considered likely to be chimeric and are discarded.
Because a discarded sequence is an articulation point,
removing it breaks its cluster into two or more separate
clusters, as expected.</p>
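<p>The standard way to find articulation points from tree edges
and back edges is to record, for each node, its discovery time and
the earliest discovery time reachable from its subtree. The Python
sketch below shows that textbook technique under the assumption
that it matches the analysis described above; the function name
and the example graph are hypothetical, and the follow-up check
for adjacent homology regions is not modelled.</p>
<pre>
```python
def articulation_points(adjacency):
    """Return the nodes whose removal disconnects their component."""
    disc = {}      # discovery time of each node in the depth-first search
    low = {}       # earliest discovery time reachable from the node's subtree
    found = set()
    counter = [0]

    def visit(node, parent):
        disc[node] = low[node] = counter[0]
        counter[0] += 1
        children = 0
        for neighbour in adjacency[node]:
            if neighbour not in disc:
                # Tree edge: neighbour is first reached from node.
                children += 1
                visit(neighbour, node)
                low[node] = min(low[node], low[neighbour])
                # A non-root node is an articulation point when some child
                # subtree has no back edge reaching above the node.
                if parent is not None and low[neighbour] >= disc[node]:
                    found.add(node)
            elif neighbour != parent:
                # Back edge to a node discovered earlier.
                low[node] = min(low[node], disc[neighbour])
        # The root is an articulation point when it has two or more tree children.
        if parent is None and children >= 2:
            found.add(node)

    for node in adjacency:
        if node not in disc:
            visit(node, None)
    return found


# Hypothetical cluster: "chimera" is the only link between the a and b groups.
overlaps = {
    "a1": ["a2", "chimera"],
    "a2": ["a1", "chimera"],
    "chimera": ["a1", "a2", "b1", "b2"],
    "b1": ["chimera", "b2"],
    "b2": ["chimera", "b1"],
}
points = articulation_points(overlaps)
print(points)
```
</pre>
<p>The recursive form keeps the sketch short. For clusters with many
thousands of sequences an explicit stack would be needed to stay
within Python's recursion limit.</p>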
<p>The resulting clusters are supplied as input, together with
base-calling quality scores, to the <a href=
"http://genome.cs.mtu.edu/cap/cap3.html">CAP3 assembly
program</a>. We used the following parameters for the
Lycopersicon combined build:</p>
<table summary="" border="1">
<tr>
<td>CAP3 option</td>
<td>default value</td>
<td>value used</td>
<td>description</td>
</tr>
<tr>
<td>-e</td>
<td>30</td>
<td>5000</td>
<td>"extra" number of observed differences</td>
</tr>
<tr>
<td>-s</td>
<td>900</td>
<td>401</td>
<td>minimum similarity score for an overlap</td>
</tr>
<tr>
<td>-p</td>
<td>75</td>
<td>90</td>
<td>percent identity required for overlap</td>
</tr>
<tr>
<td>-d</td>
<td>200</td>
<td>10000</td>
<td>maximum allowed sum of quality scores of
mismatched bases in overlaps</td>
</tr>
<tr>
<td>-b</td>
<td>20</td>
<td>60</td>
<td>quality score threshold for scoring a base
mismatch</td>
</tr>
</table>
<p>Please see the CAP3 documentation for complete
descriptions of the options above and for information on
the other parameters, which are left at their default
values.</p>
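<p>For illustration, the values in the table translate into a
command line roughly as follows. The Python sketch only builds the
argument list and does not run anything. It assumes the usual
invocation form, with the reads file first and the options after
it, and an executable named cap3 on the search path; the helper
name and the file name are hypothetical.</p>
<pre>
```python
# Values from the "value used" column of the table above.
CAP3_OPTIONS = {"-e": 5000, "-s": 401, "-p": 90, "-d": 10000, "-b": 60}


def cap3_command(reads_file, executable="cap3"):
    """Build the argument list for one CAP3 run on a cluster's FASTA file."""
    command = [executable, reads_file]
    for flag, value in CAP3_OPTIONS.items():
        command.extend([flag, str(value)])
    return command


# "cluster_0001.fasta" is a hypothetical file name.
command = cap3_command("cluster_0001.fasta")
print(" ".join(command))
```
</pre>
<p>Returning a list rather than a single string means the result can
be handed to a process launcher without any shell quoting.</p>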
<p>The point here is to restrict or eliminate the effect
of the "-e", "-s", "-d", and "-b" options, leaving "-p" in
the driver's seat. This makes each decision to assemble or
not assemble easily interpretable. The other parameters
are attempts to introduce more sensitive discriminations
than the percent identity of a detected overlap alone.
However, in our experience these parameters (at default
or similar settings) yield arbitrary assemblies, and their
effects dominate the most intuitive measure, the percent
identity in an overlap. Preliminary experiments indicate
that "-p" is the most useful option for controlling CAP3's
behavior, but its effects are noticeable only when the
other overlap assessment options are effectively
disabled.</p>
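<p>To make the "-p" criterion concrete, percent identity can be
thought of as the share of aligned columns in which both sequences
carry the same base. The Python sketch below uses that simplified
definition on a hypothetical, already aligned overlap; CAP3's own
overlap evaluation may count gaps and overlap ends differently.</p>
<pre>
```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of aligned columns holding the same base in both sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as a match only when both bases agree and are not gaps.
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


THRESHOLD = 90  # the -p value from the table above

# Hypothetical 20-column overlap containing a single mismatch.
identity = percent_identity("ACGTACGTACGTACGTACGT", "ACGTACGTACTTACGTACGT")
accept = identity >= THRESHOLD
print(identity, accept)
```
</pre>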
</td>
</tr>
</table>
</center>
END_HEREDOC
$page->footer();