cgi-bin/about/tomato_sequencing_scope.pl

use strict;
use warnings;

use CXGN::Page;
use CXGN::Page::FormattingHelpers qw/info_section_html/;

my $page = CXGN::Page->new('tomato sequencing scope','Robert Buels');

# list repetition: the same string is passed twice to header()
$page->header(('Tomato Sequencing Scope and Completion Criteria') x 2);

print <<EOH;
<p>This page explains what parts of the tomato genome will be sequenced by the <a href="/about/tomato_project_overview.pl">International Tomato Sequencing Project</a>, and when the project will be considered complete.</p>

<div style="text-align: right; margin-bottom: 1em"><b>Download sequencing scope presentation:</b> <a href="/documents/about/tomato_sequencing_scope.ppt">[ppt]</a></div>
EOH

print info_section_html( title => '1. Sequencing scope',
                         contents =>
                         info_section_html( title => '1.1 Estimate of euchromatin size and number of BACs to sequence',
                                            is_subsection => 1,
                                            contents => <<EOH)
<p>We have developed estimates of the physical distance to be covered in sequencing the euchromatin gene space of tomato centromeric arms.  While more accurate estimates will develop as the project proceeds and more sequence is generated, we note that the current estimates are similar to each other.</p>

<dl>
<dt>A. Cytologically Based Measurement of Euchromatin Content</dt>

<dd>We previously determined the amount of DNA in euchromatin and
    heterochromatin of tomato chromosomes (Peterson et al. 1996).
    First, tomato pachytene chromosomes were spread on glass slides
    using a technique that did not stretch (deform) the chromosomes.
    We stained the chromosomes by the Feulgen technique, which has been
    proven to be a reliable, quantitative stain for DNA (see Price
    1988). Relative density (absorbance) of Feulgen-stained
    euchromatin and heterochromatin was determined in ten different
    spreads.  Using twenty unstretched tomato pachytene chromosomes,
    the average width of the chromosomes in euchromatin was determined
    from fifty separate measurements, and the average width of the
    chromosomes in heterochromatin was determined from fifty
    additional measurements.  Transverse measurements for diameter
    were made only in relatively straight parts of chromosomes.
    Lengths of pachytene chromosomes were taken from Sherman and Stack
    (1992), who carefully measured tomato pachytene chromosome lengths,
    arm ratios, and fractions of arms in euchromatin and
    heterochromatin on electron micrographs. This information was used
    to calculate the total fraction of the genome in euchromatin and
    heterochromatin.
    <center>
    <table>
    <tr><td></td><th>Heterochromatin</th><th>Euchromatin</th></tr>
    <tr>
        <td align="right">Relative chromosome length</td>
        <td align="right">0.36</td>
        <td align="right">0.64</td>
    </tr>
    <tr>
        <td align="right">Relative bivalent diameter</td>
        <td align="right">&times;  1.23 <hr style="padding: 0; margin: 0 0 0 3em; border: 1px solid black" /></td>
        <td align="right">&times;  1.00 <hr style="padding: 0; margin: 0 0 0 3em; border: 1px solid black" /></td>
    </tr>
    <tr>
        <td align="right">Relative area</td>
        <td align="right">0.44</td>
        <td align="right">0.64</td>
    </tr>
    <tr>
        <td align="right">Relative optical density (OD)</td>
        <td align="right">&times;  4.78 <hr style="padding: 0; margin: 0 0 0 3em; border: 1px solid black" /></td>
        <td align="right">&times;  1.00 <hr style="padding: 0; margin: 0 0 0 3em; border: 1px solid black" /></td>
    </tr>
    <tr>
        <td align="right">Relative OD &times; relative area</td>
        <td align="right">2.10</td>
        <td align="right">0.64</td>
    </tr>
    <tr>
        <td align="right">Total OD &times; area</td>
        <td align="right">&divide; 2.74 <hr style="padding: 0; margin: 0 0 0 3em; border: 1px solid black" /></td>
        <td align="right">&divide; 2.74 <hr style="padding: 0; margin: 0 0 0 3em; border: 1px solid black" /></td>
    </tr>
    <tr>
        <td align="right">Fraction of genome</td>
        <td align="right">0.77</td>
        <td align="right">0.23</td>
    </tr>
    </table>
    </center>

    Estimates of the absolute size (1C amount) of the tomato genome
    are in general agreement at approximately 0.95 pg of DNA, e.g.,
    Michaelson et al. (1991).  Thus, the amount of DNA in euchromatin
    in one tomato genome is 0.23 &times; 0.95 pg = 0.22 pg.  Converting
    the DNA amount in euchromatin to base pairs (Bennett and Smith
    1976), there are 0.22 pg &times; (965 &times; 10<sup>6</sup> bp/pg) =
    2.12 &times; 10<sup>8</sup> bp (212 Mb) of DNA in the euchromatin of
    one tomato genome (= 1C amount).  Likewise, the amount of DNA in
    heterochromatin is 0.77 &times; 0.95 pg = 0.73 pg, and converting
    to base pairs, there are 0.73 pg &times; (965 &times; 10<sup>6</sup> bp/pg) =
    7.05 &times; 10<sup>8</sup> bp (705 Mb) of DNA in the heterochromatin
    of one tomato genome.
</dd>
<dt>B. Estimating Euchromatin Arm Size Based on Available Genome and EST Sequence</dt>
<dd>
   <p>
   As of the summer of 2006, a total of 15.5 Mb of non-overlapping
   tomato genomic sequence had been submitted to SGN by the US team
   and our international sequencing partners.  A test set of
   high-quality tomato gene sequences was created by combining 1) all
   published tomato gene sequences in GenBank, 2) 2898 redundantly
   sequenced full-length tomato cDNAs available through TIGR, and 3)
   6742 tomato contigs containing five or more overlapping EST
   sequences.  After correcting for redundancy, 8,097 high-quality
   unigene sequences remained.  This set of tomato unigenes was then
   searched against the available tomato genome sequence with
   stringency criteria of 90% or greater identity and coverage.  Of the
   8,097 unigenes, 456 were identified in the genome sequence.  Assuming
   this gene set is representative of the gene space in terms of
   localization throughout the tomato genome, we estimate that
   456/8,097 = 5.6% of the gene space has been covered.  Correcting
   for the percentage of gene space present in the euchromatin arms
   (85%), we can calculate that 5.6/0.85 = 6.6% of the target gene
   space has been covered.  If 15.5 Mb represents 6.6% of the
   euchromatin arms, then 15.5/0.066 = 234 Mb of genomic DNA would be
   calculated to represent the target non-overlapping genome space for
   the international genome sequencing project.
   </p>
   <p>
   C. In a separate
   analysis, the 15.5 Mb of available tomato genomic DNA was searched
   for homologies to gene sequences, and 2100 non-redundant gene models
   were identified following removal of transposon, viral and other
   repetitive sequences.  2100 genes out of 35,000 corresponds to 6%
   of the predicted gene space.  Correcting for the percentage of gene
   space present in the euchromatin arms (85%), we can calculate that
   6.0/0.85 = 7.05% of the target gene space has been covered.  If
   15.5 Mb represents 7.05% of the euchromatin arms, then 15.5/0.0705 =
   220 Mb of genomic DNA would be calculated to represent the target
   genome space for the international genome sequencing project.
   </p>

   <center>
   <table>
   <tr><th>Method</th><th>Sequencing Target</th></tr>
   <tr><td>Cytology</td><td>212 Mb</td></tr>
   <tr><td>Available sequence and percent high-quality gene models</td><td>234 Mb</td></tr>
   <tr><td>Available sequence and total gene models</td><td>220 Mb</td></tr>
   </table>
   </center>
</dd>
</dl>
<h4>Additional Information</h4>

<p> When the sequencing project has advanced to the stage where BAC
contigs can be assayed for both total non-redundant sequence length
and physical distance based on in situ hybridization, we will be able
to develop an additional estimate of euchromatin physical size through
validation of the cytological measurements with actual sequence data.
At present no data are available to make such estimates, though
the UK group has developed large BAC contigs covering most of
chromosome 4 that will move into their sequencing pipeline in coming
months.  Based on BAC FPC data alone, they have reported that their
physical size estimate for chromosome 4 is consistent with the
original cytological estimates used in planning the international
sequencing effort (C. Nicholson, personal communication).  In
addition, the Korean group has completed more BAC sequencing than any
other group in the consortium to date, with 49 finished BACs
representing approximately 20% of their projected total for chromosome
2.  In line with project plans, they have started from BACs anchored to
the genetic map and spaced along chromosome 2.  As such, they still
have few and short contigs, which instead form representative sequence
islands across chromosome 2.  Nevertheless, based on the physical distances
between mapped marker sequences found in their sequenced BACs, they
have estimated that the BACs sequenced to date represent approximately
20% of the genetic map for chromosome 2.  While genetic to physical
distance ratios can vary widely, and these numbers could change
dramatically (for example in an area of suppressed recombination), at
present their available data are consistent with the original
cytological results on which the project was based.
</p>

<p>In summary, the data described above are consistent with a sequencing target of 212-234 Mb for completion of the objectives of the international tomato genome sequencing project. At present we propose using the larger estimate, 234 Mb, to guide our project plan, as it is likely more accurate and more conservative (in terms of justifying budget and activity for completion of project goals).
</p>
EOH
                         .info_section_html(title => '1.2 Sequencing standards',
                                            is_subsection => 1,
                                            contents => <<EOH)
<p>A "finished BAC" is defined as one:</p>
<ul>
<li>that has an error rate of less than 1 in 10,000 bases and continuous sequence across the entire BAC (HTGS phase 3)</li>
<li>that has an average of 8-fold redundancy in sequencing coverage with a minimum of one high-quality read in both directions at any given location</li>
<li>that is as gap-free as possible, given all reasonable state-of-the-art gap-filling approaches available at the time of sequencing</li>
</ul>

<p>
Regarding the euchromatin pseudomolecule, a small number of recalcitrant gaps, which will be physically defined by in situ hybridization, will be tolerated. Based on the degree of completion of the rice genome and excluding gaps defined by centromeres, this would mean approximately 4-6 gaps per tomato chromosome on average. Once all BACs in the minimal tiling path have been sequenced through two rounds of finishing, "difficult" BACs (those that cannot be finished within two rounds of finishing) will be set aside and finished to the degree resources allow.  Similar strategies have been employed for rice and Medicago.
</p>
EOH
                       );

print info_section_html( title => '2. Completion criteria',
                         contents => <<EOH,
<p>
We shall use as our targeted sequencing goals two guiding principles: 1) complete sequencing of the major euchromatin "arms" flanking each of the 12 tomato chromosomes, and 2) sequencing to a degree of completion comparable to the standards of completion used to guide the international rice genome sequencing project (IRGSP, 2005) and enumerated above. We further define our objectives to include sequencing to at least the closest mapped marker to the visible euchromatin/heterochromatin borders of each chromosome arm.  In situ hybridization will be used to determine whether these borders define the true euchromatin/heterochromatin borders or a gap that will be at minimum physically defined and at maximum walked via the above strategy until characteristic heterochromatin repeats are reached (at which time FISH will be performed with the closest low-copy BAC or internal BAC sequence).
</p>

<p>
<b>Estimation of gene space missed in this approach.</b>  Extrapolating from data obtained in rice, we can calculate the number of genes that we might expect to miss in an approach that focuses on just the gene-dense tomato euchromatin.  For example, sequencing of rice chromosome 8 revealed 86 active genes in the centromere proper and distal non-recombinant regions (Yan et al., 2005).  86 genes/centromere &times; 12 tomato chromosomes = 1032 centromeric genes. Prior to initiation of the international tomato sequencing effort, Exelexsis Biosciences sequenced and deposited two random BACs from heterochromatin with highly repetitive DNA, which together covered greater than 200 kb and harbored one gene.  While these data are clearly limited, we can make a further rough estimate that we might lose an additional (705,000 kb of DNA in heterochromatin divided by 200 kb per gene =) 3525 genes in heterochromatin, or a total of approximately 4500 genes that could be missed by focusing solely on the euchromatin arms (see above for the 705,000 kb estimate of the heterochromatin).  The estimated gene content of tomato is 35,000 genes (Van der Hoeven et al., 2002), suggesting that approximately 35,000 - 4,500 = 30,500 genes (87%) might be anticipated to be recovered through the euchromatin-only approach. Correcting further for the fact that non-centromere gaps represented approximately 3% of the targeted sequence space in rice, we would estimate recovery of 85% of the tomato gene space (approx. 30,000 genes) under the efforts of the international tomato sequencing effort.  In summary, the target of the international genome sequencing effort is sequencing of the euchromatin arms of all twelve tomato chromosomes, which we estimate will represent approximately 85% of the tomato gene space.
</p>
EOH
                       );

$page->footer;
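
The arithmetic behind the estimates quoted in sections 1.1 and 2 can be re-derived with a short standalone Perl sketch. It is not part of the page script above: the sub names are hypothetical, and the inputs are the rounded figures quoted in the page text, so the results agree with the published values only to within rounding.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CXGN): cytological estimate of
# euchromatin size in Mb, from the rounded values in section 1.1 A.
sub cytology_estimate_mb {
    my %hetero = ( length => 0.36, diameter => 1.23, od => 4.78 );
    my %eu     = ( length => 0.64, diameter => 1.00, od => 1.00 );

    # relative DNA content = length x diameter x optical density
    my $hetero_dna  = $hetero{length} * $hetero{diameter} * $hetero{od};
    my $eu_dna      = $eu{length} * $eu{diameter} * $eu{od};
    my $eu_fraction = $eu_dna / ( $hetero_dna + $eu_dna );

    my $genome_pg = 0.95;    # 1C genome size in pg
    my $mb_per_pg = 965;     # 965 x 10^6 bp per pg
    return $eu_fraction * $genome_pg * $mb_per_pg;
}

# Hypothetical helper: scale the sequenced length by the fraction of the
# target (euchromatic) gene space it is estimated to cover.
sub sequence_estimate_mb {
    my ( $sequenced_mb, $genes_found, $genes_total, $euchromatin_share ) = @_;
    my $covered = ( $genes_found / $genes_total ) / $euchromatin_share;
    return $sequenced_mb / $covered;
}

# Hypothetical helper: genes expected to be missed by sequencing only
# the euchromatin arms (section 2), before the page's rounding to 4500.
sub missed_genes {
    my $centromeric     = 86 * 12;         # 86 genes/centromere x 12 chromosomes
    my $heterochromatic = 705_000 / 200;   # one gene per 200 kb of heterochromatin
    return $centromeric + $heterochromatic;
}

printf "Cytology:            %.0f Mb\n", cytology_estimate_mb();
printf "High-quality genes:  %.0f Mb\n", sequence_estimate_mb( 15.5, 456, 8097, 0.85 );
printf "Total gene models:   %.0f Mb\n", sequence_estimate_mb( 15.5, 2100, 35_000, 0.85 );
printf "Genes missed:        %.0f\n",    missed_genes();
```

Because this sketch carries unrounded intermediates, the cytological figure comes out about 1 Mb above the page's 212 Mb, which is computed from values rounded at each step. The two sequence-based estimates land on the quoted 234 Mb and 220 Mb.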