my $page=CXGN::Page->new('unigene-methods.html','html2pl converter');
$page->header('Unigene Assembly Process Overview');
<table summary="" width="720" cellpadding="0" cellspacing="0"
<h3>Unigene Assembly Process Overview</h3>
<p>The "unigene problem" consists of two fundamental
<li>Are these two sequences from the same
<li>Where are the sequencing errors in this
<p>The ability to answer either question correctly and
consistently enables an algorithm for precise assembly of
a unigene build from EST sequences. If the answer to (1)
is yes, then the answers to (2) are known (errors are
where the sequences differ, barring allelic variation).
Conversely, if (2) were determined, then (1) is easily
settled by examining an alignment of the sequences for
true differences in the overlapping region.</p>
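<p>As a minimal sketch of the second implication, assume the
two sequences are already aligned over their overlap and that
the positions of sequencing errors are known. Question (1)
then reduces to a scan for mismatches that are not explained
by an error. The function names and the set of error positions
are hypothetical illustrations, not part of SGN's pipeline.</p>

```python
def true_differences(seq_a, seq_b, error_positions):
    """Return overlap positions where two aligned sequences differ
    and the mismatch is not attributed to a sequencing error."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return [
        i
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b and i not in error_positions
    ]


def same_gene(seq_a, seq_b, error_positions):
    """Answer question (1): no unexplained difference means same gene."""
    return not true_differences(seq_a, seq_b, error_positions)
```

<p>In practice the error positions are exactly what is not
known in advance, which is why both questions have to be
addressed together.</p>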
<p>Constructing a unigene build requires attempting to solve
both questions simultaneously. This differs from
genomic DNA assembly for the following important
<li><p>EST sequencing methodology does not provide the
stochastic oversampling of each DNA base that genomic
sequencing is expected to yield. In genomic sequencing,
with 8X expected coverage for example, answering (2)
above becomes easier because there are several observations
of each base once proper alignment is determined.</p></li>
<li><p>The optimal outcome of assembling a BAC is exactly
one contig. The implied answer to question (1) above is
then always yes: all subclones belong in the same
<p>There are no widely used, freely available assemblers
for EST data, so we do the next best thing: use a genomic
assembler such as <a href="http://www.phrap.org/">phrap</a>
(P. Green) or <a href="http://genome.cs.mtu.edu/cap/cap3.html">CAP3</a>
(X. Huang [1]). CAP3 is typically preferred
for EST assembly (see [2] for a
discussion), being less aggressive at splitting apart
<p>In general, deciding whether or not to assemble two
sequences together is easy as long as the observed
differences between the sequences are significant. When
the observed differences between two sequences approach
the rate of sequencing error, determining whether or not
two different genes are represented by the sequences
becomes theoretically impossible without collecting more
data. Since error rates in a collection of sequences
appear as a distribution, the result is a range of
observed differences within which actual differences and
sequencing errors cannot be told apart, making assembly
decisions arbitrary.</p>
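<p>The ambiguity can be illustrated with a simple binomial
model. This is a sketch under stated assumptions, not the
decision rule used by CAP3 or phrap: mismatches in the overlap
are treated as independent, the per-base mismatch rate for two
reads of the same gene is taken as twice the per-read error
rate, and true divergence is assumed to add to that rate. The
function names and all parameter values are hypothetical.</p>

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def likelihood_ratio(mismatches, overlap, error_rate, divergence):
    """Likelihood of the observed mismatch count under 'two genes'
    divided by its likelihood under 'one gene'."""
    # Assumption: either read may carry the error at a given base.
    p_same = 2 * error_rate
    # Assumption: true differences simply add to the error-induced rate.
    p_diff = p_same + divergence
    return binom_pmf(mismatches, overlap, p_diff) / binom_pmf(
        mismatches, overlap, p_same
    )


# Divergence far above the error rate: the data clearly favor two genes.
clear_case = likelihood_ratio(40, 400, error_rate=0.01, divergence=0.10)

# Divergence below the error rate: the ratio stays close to 1.
ambiguous_case = likelihood_ratio(9, 400, error_rate=0.01, divergence=0.005)
```

<p>With these illustrative numbers the first ratio is very large,
while the second is close to 1, so the observed mismatch count
gives essentially no evidence either way. Only more data (a longer
overlap or additional reads) would separate the two hypotheses.</p>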
<p>The likely result is the over-representation or
under-representation of gene families that contain
recently diverged paralogs. Similar results may occur if
the sequenced organism is heterozygous at many loci with
significant allelic variation.</p>
<p>This effect can be controlled by selecting the threshold
parameters that govern the assembly process, but there is
no "one size fits all" threshold that accurately decides
all cases. A particular choice of thresholds may
(a) promote false detection of distinct but similar genes,
(b) promote false detection of alleles (by assembling
close paralogs together), or (c) do both (a neutral choice
of parameters). For SGN's assembly, we have decided to
proceed with option (b), to attempt to minimize the
number of false isolations of unique
<p>Future versions of SGN's unigene build process will
include the option for the user to inspect an assembly's
multiple sequence alignment (MSA) as well as view the
major alternatives incorporated in any given
<li>Huang, X. and Madan, A. (1999) CAP3: A DNA Sequence
Assembly Program. Genome Research, 9:
<li>Liang, F. et al. (2000) <a href="http://www.tigr.org/tdb/tgi/publications/NAR_Assembly.pdf">An optimized
protocol for analysis of EST sequences</a>. Nucleic
Acids Research, 28: 3657-3665</li>