[sgn.git] / cgi-bin / methods / unigene / unigene-validation.pl
use strict;
use CXGN::Page;
my $page = CXGN::Page->new('unigene-validation.html', 'html2pl converter');
$page->header('Assembly Process Validation');
print <<END_HEREDOC;
<center>
<table summary="" width="720" cellpadding="0" cellspacing="0"
border="0">
<tr>
<td>
<h3>Assembly Process Validation</h3>
<p>In an effort to validate SGN's unigene assembly
process, we have attempted to compare our combined
Lycopersicon build with <a href=
"http://www.tigr.org/tdb/tgi/">TIGR's tomato gene
index</a>. These comparisons are based on the latest TIGR
tomato gene index available at the time, published on
June 1, 2002. It is noted here that neither SGN's unigene
nor TIGR's gene index builds are supported by
experimental evidence, and thus both remain
approximations of the true nature of the genomes
represented.</p>
<p>Due to differences in input data, such as EST
sequences not common to both builds, and differences in
chromatogram processing, direct comparison of the two
builds exposes mostly "noisy" differences that lead to
inconclusive results in attempts to characterize or
manually curate the observed differences.</p>
<p>Thus, the data presented below serves to indicate the
observed similarity between the builds and to demonstrate
that neither build differs from the other to an extent
that would indicate a suspicious assembly process. See <a href=
"unigene-methods.pl">this page</a> for a discussion of
the assembly process.</p>
<table summary="" border="1">
<tr>
<td></td>
<td>SGN Lycopersicon combined build #1</td>
<td>TIGR Tomato Gene Index</td>
</tr>
<tr>
<td>Total # of output sequences</td>
<td>31278</td>
<td>31102</td>
</tr>
<tr>
<td>Contigs (TCs)</td>
<td>16200</td>
<td>15211</td>
</tr>
<tr>
<td>Singlets</td>
<td>15078</td>
<td>15891</td>
</tr>
<tr>
<td>Censored inputs</td>
<td>14310</td>
<td>11054</td>
</tr>
<tr>
<td>Exclusive Contigs</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Exclusive Singlets</td>
<td>2044</td>
<td>707</td>
</tr>
</table>
<p><strong>Contigs</strong> are unigenes or gene index
sequences which are composed of the consensus of an
alignment of two or more EST sequences.
<strong>Singlets</strong> are sequences which have been
determined not to overlap sufficiently with any other
sequence in the input data set. <strong>Censored
inputs</strong> are input sequences which are not common
to both sets. <strong>Exclusive contigs</strong> are
contigs composed entirely of input sequences which are
not common to both builds. <strong>Exclusive
singlets</strong> are singlets found only in the
indicated build. The absence of exclusive contigs
indicates that every contig in SGN's build, and
every TC in TIGR's tomato gene index, contains at
least one input sequence common to both
builds.</p>
<p>After normalizing the unigene membership data to
compare solely in terms of input sequences common to both
builds, we find:</p>
<table summary="" border="1">
<tr>
<td></td>
<td>SGN</td>
<td>TIGR</td>
</tr>
<tr>
<td>Total # of output sequences</td>
<td>29234</td>
<td>30395</td>
</tr>
<tr>
<td>Contigs (TCs)</td>
<td>15034</td>
<td>14432</td>
</tr>
<tr>
<td>Singlets</td>
<td>14200</td>
<td>15963</td>
</tr>
</table><br />
<p>Since the input sequences have been normalized to a
common set at this point, and output sequences which
result exclusively from non-common sequences have been removed
from consideration, this data suggests that SGN's
assembly process is slightly more lenient, allowing the
assembly of more sequences into contigs. We find here
that 74.5\% of SGN's unigene build is identical to TIGR's
gene index. Most of the remaining differences turn out to
be cases where a contig in SGN is represented in TIGR as
one contig and one or more singlets, or vice versa.
Investigation of these cases is consistent with the claim
above, that SGN's build is biased slightly toward
inclusion of sequences into contigs. Although the first table
indicates that 2044 singlets are exclusive to SGN, the
number of singlets has not dropped by 2044 (it fell from
15078 to 14200), because some contigs have become singlets
after censoring non-common input sequences from
consideration. The same is true for TIGR's build.</p>
<p>Since the Lycopersicon combined build and TIGR's
tomato gene index contain data from three different
Lycopersicon species, it is useful to look at the number of
unigenes specific to <em>Lycopersicon hirsutum</em> and
<em>Lycopersicon pennellii</em>, which ought to show
substantial allelic variation with the species dominantly
represented in the input data, <em>Lycopersicon
esculentum</em>.</p>
<table summary="" border="1">
<tr>
<td></td>
<td>SGN</td>
<td>TIGR</td>
</tr>
<tr>
<td><em>hirsutum</em> specific contigs</td>
<td>94</td>
<td>157</td>
</tr>
<tr>
<td><em>pennellii</em> specific contigs</td>
<td>147</td>
<td>113</td>
</tr>
<tr>
<td><em>hirsutum/esculentum</em> mixed contigs</td>
<td>1908</td>
<td>1863</td>
</tr>
<tr>
<td><em>pennellii/esculentum</em> mixed contigs</td>
<td>6552</td>
<td>6624</td>
</tr>
</table>
<p>This data suggests that both TIGR's and SGN's assembly
processes allow the contig assembly of sequences which
contain small evolutionary divergence as well as
sequencing errors. It is not clear from this data whether
or not orthologs are specifically isolated in the
assembly. Neither assembly process at this time contains
specific steps for isolating orthologs from paralogs in
cross-species assemblies. This question cannot be
completely settled <em>in silico</em>.</p>
<p>In conclusion, comparing TIGR's gene index with SGN's
Lycopersicon combined unigene build indicates that each
procedure confirms the predictions of the other in most cases.
Differences are observed, but most are attributable to
differences in inputs to the processes. The reader is
reminded that the above data attempts to characterize the
differences in outputs of two separate processes, while
<strong>not</strong> being able to control the
differences in inputs. Thus, the conclusive power of the
analysis is limited.</p>
</td>
</tr>
</table>
</center>
END_HEREDOC
$page->footer();