use strict;
use warnings;

use CXGN::Page;
use CXGN::Page::FormattingHelpers qw/info_section_html/;

my $page = CXGN::Page->new('tomato sequencing scope','Robert Buels');

$page->header(('Tomato Sequencing Scope and Completion Criteria') x 2);

print <<EOH;
<p>This page explains what parts of the tomato genome will be sequenced by the <a href="/about/tomato_project_overview.pl">International Tomato Sequencing Project</a>, and when the project will be considered complete.</p>

<div style="text-align: right; margin-bottom: 1em"><b>Download sequencing scope presentation:</b> <a href="/documents/about/tomato_sequencing_scope.ppt">[ppt]</a></div>
EOH

print info_section_html( title => '1. Sequencing scope',
                         contents =>
                         info_section_html( title => '1.1 Estimate of euchromatin size and number of BACs to sequence',
                                            is_subsection => 1,
                                            contents => <<EOH)
<p>We have developed estimates of the physical distance to be covered in sequencing the euchromatin gene space of tomato centromeric arms. More accurate estimates will emerge as the project proceeds and more sequence is generated, but the current estimates already agree closely with one another.</p>

<dl>
<dt>A. Cytologically Based Measurement of Euchromatin Content</dt>

<dd>We previously determined the amount of DNA in euchromatin and
heterochromatin of tomato chromosomes (Peterson et al. 1996).
First, tomato pachytene chromosomes were spread on glass slides
using a technique that did not stretch (deform) the chromosomes.
We stained the chromosomes using the Feulgen technique, which has
been shown to be a reliable, quantitative stain for DNA (see Price
1988). Relative density (absorbance) of Feulgen-stained
euchromatin and heterochromatin was determined in ten different
spreads. Using twenty unstretched tomato pachytene chromosomes,
the average width of the chromosomes in euchromatin was determined
from fifty separate measurements, and the average width of the
chromosomes in heterochromatin was determined from fifty
additional measurements. Transverse measurements for diameter
were made only in relatively straight parts of chromosomes.
Lengths of pachytene chromosomes were taken from Sherman and Stack
(1992), who carefully measured tomato pachytene chromosome lengths,
arm ratios, and fractions of arms in euchromatin and
heterochromatin on electron micrographs. This information was used
to calculate the total fraction of the genome in euchromatin and
heterochromatin.
<center>
<table>
<tr><td></td><th>Heterochromatin</th><th>Euchromatin</th></tr>
<tr>
  <td align="right">Relative chromosome length</td>
  <td align="right">0.36</td>
  <td align="right">0.64</td>
</tr>
<tr>
  <td align="right">Relative bivalent diameter</td>
  <td align="right">&times; 1.23 <hr style="padding: 0; margin: 0 0 0 3em; border: 1px solid black" /></td>
  <td align="right">&times; 1.00 <hr style="padding: 0; margin: 0 0 0 3em; border: 1px solid black" /></td>
</tr>
<tr>
  <td align="right">Relative area</td>
  <td align="right">0.44</td>
  <td align="right">0.64</td>
</tr>
<tr>
  <td align="right">Relative optical density</td>
  <td align="right">&times; 4.78 <hr style="padding: 0; margin: 0 0 0 3em; border: 1px solid black" /></td>
  <td align="right">&times; 1.00 <hr style="padding: 0; margin: 0 0 0 3em; border: 1px solid black" /></td>
</tr>
<tr>
  <td align="right">Relative OD &times; relative area</td>
  <td align="right">2.10</td>
  <td align="right">0.64</td>
</tr>
<tr>
  <td align="right">Total OD &times; area</td>
  <td align="right">&divide; 2.74 <hr style="padding: 0; margin: 0 0 0 3em; border: 1px solid black" /></td>
  <td align="right">&divide; 2.74 <hr style="padding: 0; margin: 0 0 0 3em; border: 1px solid black" /></td>
</tr>
<tr>
  <td align="right">Fraction of genome</td>
  <td align="right">0.77</td>
  <td align="right">0.23</td>
</tr>
</table>
</center>
Estimates of the absolute size (1C amount) of the tomato genome
are in general agreement at approximately 0.95 pg of DNA, e.g.,
Michaelson et al. (1991). Thus, the amount of DNA in euchromatin
in one tomato genome is (0.23 &times; 0.95 pg =) 0.22 pg, and the
amount in heterochromatin is (0.77 &times; 0.95 pg =) 0.73 pg.
Converting the DNA amount in euchromatin to base pairs (Bennett and
Smith 1976), there are [0.22 pg &times; (965 &times; 10<sup>6</sup> bp/pg) =]
2.12 &times; 10<sup>8</sup> bp (212 Mb) of DNA in the euchromatin of one
tomato genome (= 1C amount). Converting the DNA amount in
heterochromatin in the same way, there are
[0.73 pg &times; (965 &times; 10<sup>6</sup> bp/pg) =] 7.05 &times; 10<sup>8</sup> bp
(705 Mb) of DNA in the heterochromatin of one tomato genome.
</dd>
<dt>B. Estimating Euchromatin Arm Size Based on Available Genome and EST Sequence</dt>
<dd>
As of the summer of 2006, a total of 15.5 Mb of non-overlapping
tomato genomic sequence had been submitted to SGN by the US team
and our international sequencing partners. A test set of
high-quality tomato gene sequences was created by combining 1) all
published tomato gene sequences in GenBank, 2) 2,898 redundantly
sequenced full-length tomato cDNAs available through TIGR, and 3)
6,742 tomato contigs containing five or more overlapping EST
sequences. After correcting for redundancy, 8,097 high-quality
unigene sequences remained. This set of tomato unigenes was then
searched against the available tomato genome sequence with
stringency criteria of 90% or greater identity and coverage. Of the
8,097 unigenes, 456 were identified in the genome sequence. Assuming
this gene set is representative of the gene space in terms of
localization throughout the tomato genome, we estimate that
456/8,097 = 5.6% of the gene space has been covered. Correcting
for the percentage of gene space present in the euchromatin arms
(85%), we can calculate that 5.6/0.85 = 6.6% of the target gene
space has been covered. If 15.5 Mb represents 6.6% of the
euchromatin arms, then 15.5/0.066 = 234 Mb of genomic DNA would be
calculated to represent the target non-overlapping genome space for
the international genome sequencing project.
C) In a separate analysis, the 15.5 Mb of available tomato genomic
DNA was searched for homologies to gene sequences, and 2,100
non-redundant gene models were identified following removal of
transposon, viral and other repetitive sequences. 2,100 genes out of
a predicted 35,000 corresponds to 6% of the predicted gene space.
Correcting for the percentage of gene space present in the
euchromatin arms (85%), we can calculate that 6.0/0.85 = 7.05% of
the target gene space has been covered. If 15.5 Mb represents 7.05%
of the euchromatin arms, then 15.5/0.0705 = 220 Mb of genomic DNA
would be calculated to represent the target genome space for the
international genome sequencing project.

<center>
<table>
<tr><th>Method</th><th>Sequencing Target</th></tr>
<tr><td>Cytology</td><td>212 Mb</td></tr>
<tr><td>Available sequence and percent high-quality gene models</td><td>234 Mb</td></tr>
<tr><td>Available sequence and total gene models</td><td>220 Mb</td></tr>
</table>
</center>
</dd>
</dl>
<h4>Additional Information</h4>

<p>Once the sequencing project has advanced to the stage where BAC
contigs can be assayed for both total non-redundant sequence length
and physical distance based on in situ hybridization, we will be able
to develop an additional estimate of euchromatin physical size by
validating the cytological measurements against actual sequence data.
At present there is no data available to make such an estimate, though
the UK group has developed large BAC contigs covering most of
chromosome 4 that will move into their sequencing pipeline in the
coming months. Based on BAC FPC data alone, they have reported that
their physical size estimate for chromosome 4 is consistent with the
original cytological estimates used in planning the international
sequencing effort (C. Nicholson, personal communication). In
addition, the Korean group has completed more BAC sequencing than any
other group in the consortium to date, with 49 finished BACs
representing approximately 20% of their projected total for chromosome
2. In line with project plans, they have started from BACs anchored to
the genetic map and spaced along chromosome 2. As a result, they still
have few and short contigs, which form representative sequence islands
across chromosome 2 rather than continuous coverage. Nevertheless,
based on the physical distances between mapped marker sequences found
in their sequenced BACs, they have estimated that the BACs sequenced
to date represent approximately 20% of the genetic map for chromosome
2. While genetic to physical distance ratios can vary widely, and
these numbers could change dramatically (for example in an area of
suppressed recombination), at present their available data is
consistent with the original cytological results on which the project
was based.
</p>

<p>In summary, the data described above is consistent with a sequencing target of 212 - 234 Mb for completion of the objectives of the international tomato genome sequencing project. At present we propose using the larger estimate, 234 Mb, to guide our project plan, as it is likely more accurate and more conservative (in terms of justifying budget and activity for completion of project goals).
</p>
EOH
                         .info_section_html(title => '1.2 Sequencing standards',
                                            is_subsection => 1,
                                            contents => <<EOH)
<p>A "finished BAC" is defined as one:</p>
<ul>
<li>that has an error rate of less than 1 in 10,000 bases and continuous sequence across the entire BAC (HTGS phase 3)</li>
<li>that has an average of 8-fold redundancy in sequencing coverage, with a minimum of one high-quality read in both directions at any given location</li>
<li>that is as gap-free as possible, given all reasonable state-of-the-art gap-filling approaches available at the time of sequencing</li>
</ul>

<p>
Regarding the euchromatin pseudomolecule, a small number of recalcitrant gaps, which will be physically defined by in situ hybridization, will be tolerated. Based on the degree of completion of the rice genome and excluding gaps defined by centromeres, this would mean approximately 4 - 6 gaps per tomato chromosome on average. Once all BACs in the minimal tiling path have been through two rounds of finishing, "difficult" BACs (those that cannot be finished within two rounds) will be set aside and finished to the degree resources allow. Similar strategies have been employed for rice and Medicago.
</p>
EOH
                       );

print info_section_html( title => '2. Completion criteria',
                         contents => <<EOH,
<p>
Two guiding principles define our targeted sequencing goals: 1) complete sequencing of the major euchromatin "arms" of each of the 12 tomato chromosomes, and 2) a degree of completion comparable to the standards used to guide the international rice genome sequencing project (IRGSP, 2005) and enumerated above. We further define our objectives to include sequencing to at least the closest mapped marker to the visible euchromatin/heterochromatin borders of each chromosome arm. In situ hybridization will be used to determine whether these borders define the true euchromatin/heterochromatin borders or a gap that will be at minimum physically defined and at maximum walked via the above strategy until characteristic heterochromatin repeats are reached (at which time FISH will be performed with the closest low-copy BAC or internal BAC sequence).
</p>

<p>
Estimation of gene space missed in this approach. Extrapolating from data obtained in rice, we can calculate the number of genes that we might expect to miss in an approach that focuses on just the gene-dense tomato euchromatin. For example, sequencing of rice chromosome 8 revealed 86 active genes in the centromere proper and distal non-recombinant regions (Yan et al., 2005). 86 genes/centromere &times; 12 tomato chromosomes = 1032 centromeric genes. Prior to initiation of the international tomato sequencing effort, Exelexsis Biosciences sequenced and deposited two random BACs from heterochromatin with highly repetitive DNA, which together covered more than 200 kb and harbored one gene. While this is clearly limited data, we can make a further rough estimate that we might lose an additional (705,000 kb of DNA in heterochromatin divided by 200 kb per gene =) 3525 genes in heterochromatin, or a total of approximately 4500 genes that could be missed by focusing solely on the euchromatin arms (see above for the 705,000 kb estimate of the heterochromatin). The estimated gene content of tomato is 35,000 genes (Van der Hoeven et al., 2002), suggesting that approximately 35,000 - 4,500 = 30,500 genes (87%) might be anticipated to be recovered through the euchromatin-only approach. Correcting further for the fact that non-centromere gaps represented approximately 3% of the targeted sequence space in rice, we would estimate recovery of 85% of the tomato gene space (approximately 30,000 genes) under the international tomato sequencing effort. In summary, the target of the international genome sequencing effort is sequencing of the euchromatin arms of all twelve tomato chromosomes, which we estimate will represent approximately 85% of the tomato gene space.
</p>
EOH
                       );

$page->footer;
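
# ---------------------------------------------------------------------
# Illustrative sketch only; nothing on this page calls these subs.
# They re-derive the estimates quoted in the page text so the arithmetic
# can be checked. The sub names and returned hash keys are hypothetical
# (they are not part of CXGN); every input figure is copied from the
# page text. Intermediate values are not rounded here, so results can
# differ slightly from the rounded figures shown on the page.
# ---------------------------------------------------------------------

# Section 1.1: the three sequencing-target estimates, in Mb.
sub _sketch_sequencing_targets {
    # A. Cytology: relative length x diameter x optical density gives
    #    the relative DNA content of each chromatin type.
    my $dna_het  = 0.36 * 1.23 * 4.78;
    my $dna_eu   = 0.64 * 1.00 * 1.00;
    my $frac_eu  = $dna_eu / ( $dna_het + $dna_eu );
    my $cytology = $frac_eu * 0.95 * 965;    # 0.95 pg per 1C, 965 Mb per pg

    # B and C. Sequence based: sequenced Mb divided by the fraction of
    #    the euchromatic gene space already found in that sequence.
    my ( $sequenced_mb, $euchromatin_gene_share ) = ( 15.5, 0.85 );
    my $unigenes    = $sequenced_mb / ( ( 456 / 8097 )   / $euchromatin_gene_share );
    my $gene_models = $sequenced_mb / ( ( 2100 / 35000 ) / $euchromatin_gene_share );

    return ( cytology => $cytology, unigenes => $unigenes, gene_models => $gene_models );
}

# Section 2: genes expected to be missed by a euchromatin-only approach.
sub _sketch_gene_recovery {
    my $centromeric     = 86 * 12;          # rice chromosome 8 count x 12 chromosomes
    my $heterochromatic = 705_000 / 200;    # kb of heterochromatin / kb per gene
    my $missed          = $centromeric + $heterochromatic;
    my $total_genes     = 35_000;
    return ( missed             => $missed,
             recovered_fraction => ( $total_genes - $missed ) / $total_genes );
}

# With the figures above, the targets come to roughly 213, 234 and 220 Mb.
# The cytological value lands about 1 Mb above the 212 Mb quoted on the
# page because the page rounds intermediate results; the unrounded gene
# count missed is 4557, which the page rounds to approximately 4500.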