cgi-bin/methods/unigene/unigene-methods.pl

use strict;
use CXGN::Page;

# Static methods page: the HTML below is printed between the page header
# and footer produced by CXGN::Page.
my $page = CXGN::Page->new('unigene-methods.html', 'html2pl converter');
$page->header('Unigene Assembly Process Overview');
print <<END_HEREDOC;

  <center>

    <table summary="" width="720" cellpadding="0" cellspacing="0"
    border="0">
      <tr>
        <td>


          <h3>Unigene Assembly Process Overview</h3>

          <p>The "unigene problem" consists of two fundamental
          questions:</p>

          <ol>
            <li>Are these two sequences from the same
            gene/transcript?</li>

            <li>Where are the sequencing errors in this
            sequence?</li>
          </ol>

          <p>The ability to answer either question correctly and
          consistently enables an algorithm for precise assembly of
          a unigene build from EST sequences. If the answer to (1)
          is yes, then the answer to (2) follows: the errors are
          where the sequences differ, barring allelic variation.
          Conversely, if (2) were determined, then (1) is easily
          settled by examining an alignment of the sequences for
          true differences in the overlapping region.</p>

          <p>Constructing a unigene build requires attempting to
          answer both questions simultaneously. This differs from
          genomic DNA assembly for two important reasons:</p>

          <ol>
            <li><p>EST sequencing methodology does not yield an
            expectation of stochastic oversampling of each DNA
            base. In genomic sequencing, with 8X expected coverage
            for example, answering (2) above becomes easier because
            there are several observations of each base once the
            proper alignment is determined.</p></li>

            <li><p>The optimal outcome of assembling a BAC is exactly
            one contig. The implied answer to question (1) above is
            then always yes: all subclones belong in the same
            contig.</p></li>
          </ol>

          <p>There are no widely used, freely available assemblers
          designed for EST data, so we do the next best thing: use a
          genomic assembler such as <a href=
          "http://www.phrap.org/">phrap</a> (P. Green) or <a href=
          "http://genome.cs.mtu.edu/cap/cap3.html">CAP3</a> (X.
          Huang [1]). CAP3 is typically preferred
          for EST assembly (see [2] for a
          discussion), being less aggressive at splitting apart
          contigs.</p>

          <p>In general, deciding whether to assemble two
          sequences together is easy as long as the observed
          differences between the sequences are significant. When
          the observed differences between two sequences approach
          the rate of sequencing error, determining whether two
          different genes are represented by the sequences becomes
          theoretically impossible without collecting more data.
          Since error rates in a collection of sequences appear as
          a distribution, there is a range of observed differences
          in which actual differences and sequencing errors cannot
          be told apart, and assembly decisions within that range
          are arbitrary.</p>

          <p>The likely result is the over-representation or
          under-representation of gene families that contain
          recently diverged paralogs. Similar results may occur if
          the organism sequenced is heterozygous at many loci with
          significant allelic variation.</p>

          <p>This may be controlled by selecting the threshold
          parameters that govern the assembly process, but there is
          no "one size fits all" threshold that accurately decides
          all cases. A particular choice of thresholds may
          (a) promote false detection of distinct but similar genes,
          (b) promote false detection of alleles (by assembling
          close paralogs together), or (c) do both (a neutral choice
          of parameters). For SGN's assembly, we have decided to
          proceed with option (b), to attempt to minimize the
          number of false isolations of unique
          transcripts.</p>

          <p>Future versions of SGN's unigene build process will
          include the option for the user to inspect an assembly's
          multiple sequence alignment (MSA) as well as view the
          major alternatives incorporated in any given
          assembly.</p>
          <hr />

          <p>References:</p>

          <ol>
            <li>Huang, X. and Madan, A. (1999) CAP3: A DNA Sequence
            Assembly Program. Genome Research 9,
            868-877</li>

            <li>Liang, F., et al. (2000) <a href="http://www.tigr.org/tdb/tgi/publications/NAR_Assembly.pdf">An optimized
            protocol for analysis of EST sequences</a>. Nucleic
            Acids Research 28, 3657-3665</li>
          </ol>
        </td>
      </tr>

    </table>
  </center>
END_HEREDOC
$page->footer();
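
The page text argues that merge decisions become arbitrary once the observed differences between two ESTs approach the sequencing error rate, and that a threshold only shifts which mistake is made. The Python sketch below illustrates that argument with a toy binomial model. It is not how CAP3, phrap, or SGN's pipeline decide; the function names, the per-base error rate, and the `alpha` cutoff are all assumptions introduced here for illustration.

```python
import math


def mismatch_probability(error_rate):
    """Chance that one aligned column disagrees when both reads come from the
    same transcript and each base is wrong independently with probability
    error_rate. Assumption: two errors never coincide on the same wrong base."""
    return 1.0 - (1.0 - error_rate) ** 2


def tail_probability(overlap_len, mismatches, error_rate):
    """P(at least `mismatches` disagreements in the overlap) under the
    same-transcript hypothesis, computed as a binomial upper tail."""
    p = mismatch_probability(error_rate)
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(mismatches, overlap_len + 1):
        # Binomial pmf evaluated in log space to avoid overflow.
        log_pmf = (math.lgamma(overlap_len + 1) - math.lgamma(i + 1)
                   - math.lgamma(overlap_len - i + 1)
                   + i * log_p + (overlap_len - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def should_merge(overlap_len, mismatches, error_rate, alpha):
    """Merge unless the differences are too many to blame on sequencing error.
    A smaller alpha merges more readily, which corresponds to option (b) in
    the page text; a larger alpha splits more readily, as in option (a)."""
    return tail_probability(overlap_len, mismatches, error_rate) >= alpha


if __name__ == "__main__":
    # Hypothetical case: 400-base overlap, 1% per-base error rate.
    for k in (8, 15, 40):
        print(k, tail_probability(400, k, 0.01), should_merge(400, k, 0.01, 0.01))
```

With a 400-base overlap and a 1% per-base error rate, roughly 8 mismatches are expected from error alone. Under this model, 8 observed differences give no grounds to split and 40 clearly do, while the verdict for counts in between depends on the chosen `alpha`, which is the arbitrary range the page describes.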