cgi-bin/solanaceae-project/sol-bioinformatics/ghent2006_report.pl

use strict;
use CXGN::Page;

my $page = CXGN::Page->new('Ghent 2006 Meeting Report','Robert Buels');
# double quotes so that \n is a real newline; the @ is escaped to prevent interpolation
$page->add_style(text => "\@page { size: 8.5in 11in; margin: 0.79in }\n" . <<EOS);
p { margin-bottom: 0.08in }
h1 { margin-bottom: 0.08in }
h1.western { font-family: "Arial", sans-serif; font-size: 16pt }
h1.cjk { font-family: "MS Mincho"; font-size: 16pt }
h1.ctl { font-family: "Tahoma"; font-size: 16pt }
h2 { margin-bottom: 0.08in }
h2.western { font-family: "Arial", sans-serif; font-size: 14pt; font-style: italic }
h2.cjk { font-family: "MS Mincho"; font-size: 14pt; font-style: italic }
h2.ctl { font-family: "Tahoma"; font-size: 14pt; font-style: italic }
h3 { margin-bottom: 0.08in }
h3.western { font-family: "Arial", sans-serif }
h3.cjk { font-family: "MS Mincho" }
h3.ctl { font-family: "Tahoma" }
EOS
$page->header();

print <<EOHTML;
<h1 class="western">Meeting Report: Tomato Annotation Meeting</h1>
<h2 class="western">Ghent, Belgium October 23-25<sup>th</sup>, 2006</h2>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">In Attendance</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in"><i>Belgium</i>:</p>
<ul>
        <li><p style="margin-bottom: 0in">St&eacute;phane Rombauts</p></li>
        <li><p style="margin-bottom: 0in">Pierre Rouz&eacute;</p></li>
        <li><p style="margin-bottom: 0in">Yves Van de Peer (off and on)</p></li>
</ul>
<p style="margin-bottom: 0in"><i>Netherlands</i>:</p>
<ul>
        <li><p style="margin-bottom: 0in">Erwin Datema</p></li>
        <li><p style="margin-bottom: 0in">Mark Fiers</p></li>
        <li><p style="margin-bottom: 0in">Roeland van Ham</p></li>
</ul>
<p style="margin-bottom: 0in"><i>India</i>:</p>
<ul>
        <li><p style="margin-bottom: 0in">Saloni Mathur</p></li>
        <li><p style="margin-bottom: 0in">Saurabh Raghuvanshi</p></li>
</ul>
<p style="margin-bottom: 0in"><i>Korea</i>:</p>
<ul>
        <li><p style="margin-bottom: 0in">Kyoo-Yeol Lee</p></li>
</ul>
<p style="margin-bottom: 0in"><i>Spain</i>:</p>
<ul>
        <li><p style="margin-bottom: 0in">Francisco Camara</p></li>
        <li><p style="margin-bottom: 0in">Roderic Guigo</p></li>
</ul>
<p style="margin-bottom: 0in"><i>France</i>:</p>
<ul>
        <li><p style="margin-bottom: 0in">Thomas Schiex</p></li>
</ul>
<p style="margin-bottom: 0in"><i>USA</i>:</p>
<ul>
        <li><p style="margin-bottom: 0in">Robert Buels</p></li>
        <li><p style="margin-bottom: 0in">Lukas Mueller</p></li>
</ul>
<p style="margin-bottom: 0in"><i>Italy</i>:</p>
<ul>
        <li><p style="margin-bottom: 0in">Maria Luisa Chiusano</p></li>
        <li><p style="margin-bottom: 0in">Alessandra Traini</p></li>
</ul>
<p style="margin-bottom: 0in"><i>Germany</i>:</p>
<ul>
        <li><p style="margin-bottom: 0in">Heiko Schoof (day 2)</p></li>
        <li><p style="margin-bottom: 0in">Anika Joecker</p></li>
</ul>
<p style="margin-bottom: 0in"><br /></p>
<h3 class="western">Purpose</h3>
<p style="margin-bottom: 0in">The purpose of the meeting was to
discuss the quality of a previously generated gene finder training
data set, discuss the performance of already-trained, tomato-specific
gene finders, define a distributed annotation pipeline for the tomato
genome sequences that are currently being generated, and review
the data submission procedures. Representatives from 9 countries
involved in tomato sequencing and tomato annotation (through the
EU-SOL project) attended the meeting.</p>
<p style="margin-bottom: 0in"><br /></p>
<h3 class="western">Day 1 &ndash; October 23, 2006</h3>
<p style="margin-bottom: 0in">First, a representative of every
country present gave a brief overview of the sequencing progress.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">India discussed how overlapping
sequences were generated by sequencing seed BACs as far apart as 6 cM.
Thus, the seed BACs need to be carefully analyzed for potential
overlaps, using the FPC fingerprint data. However, the FPC data is
not available for all BACs.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">Mark Fiers and Erwin Datema reported on
trials with 454 sequencing of full BACs. The feasibility of such an
approach is presently not clear. They also presented a demonstration
of Cyrille2, an interactive annotation pipeline system developed at
their site in the Netherlands.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">Maria Luisa Chiusano gave an overview
of the EST alignment work and associated web resources that have been
developed in her lab at the University of Naples.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">Daniel Buchan gave a brief overview of
the state of chromosome 4 sequencing. Most of the BACs should be
available in early 2007.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">Thomas, although not himself directly
involved on the sequencing side of things, presented progress on the
sequencing of the French project and mentioned a technology called
DAC, which allows many unfinished BACs to be finished in parallel.
The technology is being actively developed. He also mentioned that
enough resources may be available to also sequence the
heterochromatic portion of chromosome 7. He then gave a brief
overview of Eugene.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">Francisco presented an overview of
GeneID and a first tomato-specific matrix that was developed.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">Next, Remy presented an analysis of the
training set that was manually generated by some members of the
group.  A total of 108 BACs were hand-annotated for complete and
clean gene models. However, the resulting dataset was not homogeneous
in quality, and some low-quality and/or incomplete gene models were
retained.  A discussion followed, in which St&eacute;phane Rombauts
explained that the poplar annotation project had used a very rigorous
automated method to generate a training set, and he suggested that we
try the same, letting the automated set supersede the hand-annotated
set.  A general agreement was reached to try this course, with the
automated generation performed by St&eacute;phane, using the same
methods from poplar.  St&eacute;phane also agreed to perform a trial
run of his training set generator during the meeting for evaluation.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">The cornerstone of the training set
generation method is identifying annotations that are very
well-supported by EST alignments, and that match at least 75% of
their entire predicted protein to a known protein from Arabidopsis.
With the number of sequenced BACs available, the trial run of his
training set generator produced only 100 very confident gene models
with the required level of EST support and Arabidopsis homology.  The
general conclusion was that this number was too low, but that the
method was quite promising, and that final evaluation of the method
should be deferred until more finished sequence is available.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">The next discussion focused on the
submission process of the BAC sequences and annotations.  Currently,
all project partners are supposed to submit to Genbank and SGN
independently, which can lead to inconsistencies between the
repositories when the two submission events are far apart in
time.  Daniel and Remy suggested submission to Genbank only, from
which SGN could pull in the sequences to feed into the annotation
pipelines. However, under this approach the actual assembly data
would not be carried by SGN.  A number of
attendees asserted that this assembly data is valuable for final
assembly and should continue to be rigorously warehoused.  Genbank
accepts the full BAC sequence and the chromatograms for the
individual sequence reads, but not the actual assembly data.  After a
long discussion, agreement was reached on the following protocol:
First, the finished BAC sequence must be submitted to Genbank, and a
Genbank accession obtained. Then, the sequences, including the
chromatograms and assembly information, are submitted to SGN, using
essentially the same submission format as now, but with an additional
file specifying the Genbank accession of the submission. SGN will
determine the Genbank accessions of the currently submitted sequences
and update them accordingly on the SGN FTP site.  In addition, the
following tags should be embedded in the comments field of each
submission to Genbank: &ldquo;ITAG&rdquo; (for International Tomato
Annotation Group) and &ldquo;TOMGEN&rdquo; (for Tomato Genome
Sequencing Project). This will allow all BACs that were
sequenced (TOMGEN) or annotated (ITAG) to be downloaded by searching
Genbank for these keywords. A quick search of Genbank determined that
these keywords are not presently in use by any other sequences.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">In addition, Mark and Erwin at
Wageningen will set up a central wiki site for use by the annotation
project for documenting the stages and interchange formats required
by the pipeline. (<b>update:</b><span style="font-weight: medium">
the wiki is up at <a href="http://www.ab.wur.nl/TomatoWiki">http://www.ab.wur.nl/TomatoWiki</a>
)</span></p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">A discussion on data formats concluded
that for most things, GFF3 should be sufficient. GAME XML, the native
format of Apollo, is richer but not as well specified.
Artemis is also a very viable gene editor program, and it is capable
of using GFF3 as a native save format.  It was agreed that, for the
present at least, both GFF3 and GAME XML formats will be used, since
fairly well-developed conversion scripts exist at several of the
sites involved.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in"><br /></p>
<h3 class="western">Day 2 &ndash; October 24, 2006</h3>
<p style="margin-bottom: 0in">The main focus of day 2 was
establishing a high-level design of the ITAG annotation pipeline. An
important aspect of the pipeline is that it is distributed, with many
annotation centers participating in the process and each site doing
what it does best.  The pipeline is based on BAC sequences,
and whole pseudomolecule assemblies will also be run once they are
available (in a format to be determined later by the ITAG and TOMGEN
groups).</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">In summary, the complete pipeline is as
follows:</p>
<p style="margin-bottom: 0in"><br /></p>
<ol>
        <li><p style="margin-bottom: 0in">BAC sequences are uploaded to
        Genbank, and a Genbank accession is obtained.</p></li>
        <li><p style="margin-bottom: 0in">The BAC is uploaded to SGN.
        </p></li>
        <li><p style="margin-bottom: 0in">SGN runs vector screens and
        contamination screens (chloroplast, mitochondrial and human
        sequences), and does other quality control, such as comparison of <i>in
        vitro</i> (from FPC data) vs <i>in silico</i> restriction fragment
        sizes. The actual submission to Genbank will also be quality
        checked, with the sequences compared and the presence of the
        keywords (ITAG and TOMGEN) verified.</p></li>
        <li><p style="margin-bottom: 0in">SGN runs RepeatMasker with
        tomato-derived and other repeat databases.  This comes before the
        other pipeline steps so that some of them have the option of using
        the repeat-masked BAC sequence.</p></li>
        <li><p style="margin-bottom: 0in">In parallel:</p>
        <ul>
                <li><p style="margin-bottom: 0in">TBLASTX versus mimulus and potato
                sequences</p></li>
                <li><p style="margin-bottom: 0in">BLASTF (script from WUR) versus
                protein data sets</p>
                <ul>
                        <li><p style="margin-bottom: 0in">arabidopsis, swissprot,
                        solanaceae combined &ndash; SGN/Korea</p></li>
                        <li><p style="margin-bottom: 0in">other plants (rice, maize,
                        medicago, poplar) &ndash; PSB</p></li>
                        <li><p style="margin-bottom: 0in">swissprot</p></li>
                        <li><p style="margin-bottom: 0in">uniprot</p></li>
                        <li><p style="margin-bottom: 0in">pfam-B</p></li>
                        <li><p style="margin-bottom: 0in">SPTG</p></li>
                        <li><p style="margin-bottom: 0in">solanaceae</p></li>
                </ul>
                </li>
                <li><p style="margin-bottom: 0in">BLASTN</p>
                <ul>
                        <li><p style="margin-bottom: 0in">vector</p></li>
                        <li><p style="margin-bottom: 0in"><i>E. coli</i></p></li>
                        <li><p style="margin-bottom: 0in">chloroplast</p></li>
                        <li><p style="margin-bottom: 0in">mitochondria (when available)</p></li>
                        <li><p style="margin-bottom: 0in"><i>H. sapiens</i></p></li>
                </ul>
                </li>
                <li><p style="margin-bottom: 0in">transcript sequence alignments
                (CAB Napoli)</p>
                <ul>
                        <li><p style="margin-bottom: 0in">tomato &ndash; 98% identity, 90%
                        coverage</p></li>
                        <li><p style="margin-bottom: 0in">solanaceae &ndash; 90% identity,
                        75% coverage</p></li>
                </ul>
                </li>
                <li><p style="margin-bottom: 0in">ab-initio gene finders</p>
                <ul>
                        <li><p style="margin-bottom: 0in">fgenesh (SGN)</p></li>
                        <li><p style="margin-bottom: 0in">genemark (Remy)</p></li>
                        <li><p style="margin-bottom: 0in">glimmerhmm (Erwin)</p></li>
                        <li><p style="margin-bottom: 0in">genscan - ?
                        </p></li>
                        <li><p style="margin-bottom: 0in">genemark - ?</p></li>
                        <li><p style="margin-bottom: 0in">geneid (Francisco/SGN)</p></li>
                        <li><p style="margin-bottom: 0in">SNAP (Erwin)</p></li>
                </ul>
                </li>
                <li><p style="margin-bottom: 0in">RFAM &ndash; blastn/infernal(?)</p></li>
                <li><p style="margin-bottom: 0in">tRNAscan-SE (SGN)</p></li>
        </ul>
        </li>
</ol>
<ol start="6">
        <li><p style="margin-bottom: 0in">All predictions, alignments and
        BLASTs are downloaded by U. Ghent and fed into Eugene.</p></li>
        <li><p style="margin-bottom: 0in">Proteins from the Eugene
        predictions are then functionally annotated with</p>
        <ul>
                <li><p style="margin-bottom: 0in">BLASTP vs. Arabidopsis and rice
                proteins, against SwissProt</p></li>
                <li><p style="margin-bottom: 0in">Interpro &ndash; Imperial</p></li>
                <li><p style="margin-bottom: 0in">GO &ndash; MPIZ?</p></li>
                <li><p style="margin-bottom: 0in">TargetP, signalP, etc. &ndash; SGN</p></li>
                <li><p style="margin-bottom: 0in">RPSblast &ndash; MPIZ</p></li>
                <li><p style="margin-bottom: 0in">TmHMM &ndash; SGN</p></li>
                <li><p style="margin-bottom: 0in">SGN Genes DB &ndash; SGN</p></li>
        </ul>
        </li>
        <li><p style="margin-bottom: 0in">SGN produces downloadable files
        and publishes them on FTP</p>
        <ul>
                <li><p style="margin-bottom: 0in">protein sequences</p></li>
                <li><p style="margin-bottom: 0in">cds/cdna sequences</p></li>
                <li><p style="margin-bottom: 0in">non-redundant protein sequences</p></li>
        </ul>
        </li>
</ol>
<p style="margin-bottom: 0in">Following the establishment of the
pipeline steps, a discussion began on data flow between the stages.
Early on, it was agreed that an implementation using a central server
as a pipeline coordinator would be simpler and more robust.  The bulk
of the discussion was devoted to whether this central server would
call on each remote pipeline stage to perform the analysis as soon as
a sequence is available (a &ldquo;push&rdquo; model), or whether the
central server would make the data available and wait for each
analysis to retrieve its input and upload its output (a &ldquo;pull&rdquo;
model).  The &ldquo;push&rdquo; model has the advantage of allowing
more rigorous flow control, since the central server has more
knowledge of the running status of each analysis, but requires more
from the remote servers, such as availability for external
connections and the capability to run the analyses in a highly
automated way.  The &ldquo;pull&rdquo; model does not require
external availability or complete automation from the remote pipeline
stages, since they only have to download their input from and upload
their output to the central pipeline server.  Flow control in the
pull model would be by means of pipeline status information made
available by the central pipeline server, tracking what analysis
results are available, and for each analysis, whether its required
inputs are ready for download.</p>
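<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">The flow control described for the pull
model can be illustrated with a minimal Python sketch. The analysis names
and the dependency table below are hypothetical and were not agreed at the
meeting; the sketch only shows how status information could tell a remote
site which analyses have their required inputs ready.</p>

```python
# Hypothetical sketch of pull-model flow control; the analysis names and
# dependencies below are illustrative, not the agreed ITAG configuration.

# Each analysis lists the analyses whose output it needs as input.
DEPENDENCIES = {
    "repeatmasker": [],
    "geneid": ["repeatmasker"],
    "est_alignment": ["repeatmasker"],
    "eugene": ["geneid", "est_alignment"],
}


def ready_analyses(completed):
    """Return analyses that are not done yet but whose inputs are all available."""
    done = set(completed)
    return sorted(
        name
        for name, needs in DEPENDENCIES.items()
        if name not in done and all(dep in done for dep in needs)
    )


# A remote site polls the status and only starts work that is ready.
print(ready_analyses([]))                # ['repeatmasker']
print(ready_analyses(["repeatmasker"]))  # ['est_alignment', 'geneid']
```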
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">Since the &ldquo;pull&rdquo; model
places less of a burden on each remote pipeline stage, it was decided
that (like the <i>Medicago</i> annotation project), the tomato distributed
annotation pipeline would be pull-driven.  To simplify
administration, it was also decided that the pipeline should be run
on batches of BACs, rather than individual BACs.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">Next, a discussion began on the
structure and location of the central annotation result repository.
It was decided that SGN would house the central repository, and
transfer to and from the repository would be accomplished either with
scp or sftp running over an encrypted ssh2 channel.  An encrypted
transfer scheme was preferred over non-encrypted FTP because it
offers more secure and flexible authentication mechanisms, greater
assurance of data integrity, and acceptable transfer bandwidth
requirements.  The repository will be configured such that all ITAG
participants have accounts and can upload, download, and if necessary
delete files from their assigned parts of the repository.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">Next, the discussion turned to file
naming conventions. The general conclusion was that BACs in the
annotation pipeline should be referenced by their <b><span style="font-style: normal">unversioned
</span></b>Genbank accession, which is less ambiguous than their
well plate, row, and column designations, since wells can be
contaminated with other BAC sequences.  The unversioned Genbank
accession is used so that locus names stay more stable
when the BAC sequence changes.  File names and locus names should also
be based on these Genbank accessions.  Genbank accession-based naming
also has the advantage that the accession tends to be shorter than
the clone name.  Annotation pipeline gene identifiers should thus
start with the Genbank accession, followed by an underscore and a
numeric index number, unique on that BAC.  For alternative splicing,
the splice variants are denoted with a parenthesized letter following
the numeric index number.  This can be followed by a dot and a
version number to denote slightly differently annotated versions of
the same locus.  Version numbers are increased if the underlying BAC
sequence changes.  For example, for the third locus to be annotated
on a fictional BAC AC12310, the second of two alternative
transcripts, and the first version, its identifier might be
&ldquo;AC12310_3(b).1&rdquo;.   This scheme is similar to the one
used in <i>Medicago</i> annotation.</p>
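<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">As an illustration of the identifier
scheme, the following Python sketch parses and builds such identifiers.
The function names are hypothetical, and the pattern assumes that an
accession is a run of capital letters followed by digits and that the
splice-variant letter is optional; neither detail was fixed at the
meeting.</p>

```python
import re

# Pattern for identifiers such as "AC12310_3(b).1". The accession shape
# (capital letters then digits) and the optional splice-variant letter
# are assumptions made for this sketch.
GENE_ID = re.compile("([A-Z]+[0-9]+)_([0-9]+)(?:[(]([a-z])[)])?[.]([0-9]+)")


def parse_gene_id(identifier):
    """Split an identifier into (accession, index, splice variant, version)."""
    match = GENE_ID.fullmatch(identifier)
    if match is None:
        raise ValueError("not a valid gene identifier: " + identifier)
    accession, index, variant, version = match.groups()
    return accession, int(index), variant, int(version)


def format_gene_id(accession, index, version, variant=None):
    """Build an identifier from its parts."""
    splice = "({})".format(variant) if variant else ""
    return "{}_{}{}.{}".format(accession, index, splice, version)


print(parse_gene_id("AC12310_3(b).1"))        # ('AC12310', 3, 'b', 1)
print(format_gene_id("AC12310", 3, 1, "b"))   # AC12310_3(b).1
```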
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">The numeric index does not specify a
position on the BAC, but reflects the order in which the gene models
were created. When a new locus is annotated, a new numeric index is
chosen for it that is one greater than the previous highest index
number.  If a gene model is created by merging two older gene models,
the two old gene model identifiers are retired from use and a new
identifier is generated for the merged gene model.  For example, if
AC12310_7.1 is merged with AC12310_11.1, the resulting locus might be
named AC12310_42.1.</p>
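<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">The index allocation and merge rule can be
sketched in Python as follows. The class and method names are
hypothetical, and starting the index at 1 is an assumption; the sketch
only shows that indexes grow monotonically and that merged models get a
fresh index while the old ones are retired.</p>

```python
class LocusIndex:
    """Hands out per-BAC numeric indexes in order of creation (a sketch)."""

    def __init__(self, accession):
        self.accession = accession
        self.highest = 0      # highest index issued so far (assumed to start at 1)
        self.retired = set()  # indexes that must never be reused

    def new_locus(self):
        """Issue the next index: one greater than the previous highest."""
        self.highest += 1
        return self.highest

    def merge(self, first, second):
        """Retire two indexes and issue a fresh one for the merged model."""
        self.retired.update([first, second])
        return self.new_locus()


# Reproduces the example in the text: with 41 loci already annotated,
# merging loci 7 and 11 yields locus 42.
index = LocusIndex("AC12310")
for _ in range(41):
    index.new_locus()
print(index.merge(7, 11))  # 42
```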
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">Thus, adjacent gene models on the
genome will not necessarily have numerically adjacent identifiers,
depending on the order in which loci have been added, removed,
merged, and so forth since the initial assignment of locus names.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">A predictable file naming scheme is
critical for a pull-based pipeline mechanism.  The following file
naming convention for pipeline result files was formulated and agreed
upon:</p>
<p style="margin-bottom: 0in"><br /></p>
<p align="center" style="margin-bottom: 0in">&lt;versioned
acc.&gt;.&lt;analysis&gt;.itag&lt;pipeline ver.&gt;.v&lt;file
ver.&gt;.&lt;file type&gt;</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">For example,
&ldquo;AC12310.1.repeatmasker_TIGRRepbase.itag12.v3.gff&rdquo; would
be the third version of the file containing the results of running
the analysis 'repeatmasker_TIGRRepbase' on the BAC sequence
AC12310.1, as part of version 12 of the ITAG pipeline.</p>
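<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">A small Python sketch of this file naming
convention is given below. The function names are hypothetical, and the
parser assumes that analysis tags and file types contain no dots, which
the meeting did not specify.</p>

```python
def result_file_name(accession, analysis, pipeline_version, file_version, file_type):
    """Assemble a pipeline result file name from its five components."""
    return "{}.{}.itag{}.v{}.{}".format(
        accession, analysis, pipeline_version, file_version, file_type
    )


def parse_result_file_name(name):
    """Split a result file name back into its components.

    The versioned accession itself contains a dot, so the name is split
    from the right; this assumes the analysis tag contains no dots.
    """
    rest, file_type = name.rsplit(".", 1)
    rest, file_version = rest.rsplit(".", 1)
    rest, pipeline = rest.rsplit(".", 1)
    accession, analysis = rest.rsplit(".", 1)
    if not (pipeline.startswith("itag") and file_version.startswith("v")):
        raise ValueError("unexpected file name: " + name)
    return accession, analysis, int(pipeline[4:]), int(file_version[1:]), file_type


name = result_file_name("AC12310.1", "repeatmasker_TIGRRepbase", 12, 3, "gff")
print(name)  # AC12310.1.repeatmasker_TIGRRepbase.itag12.v3.gff
print(parse_result_file_name(name))
```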
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">The analysis tags (e.g.
'repeatmasker_TIGRRepbase' or 'eugene') will be determined and
assigned by ITAG in the coming weeks.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">The ITAG pipeline version is a
particularly important part of the file name.  Since many analyses in
the pipeline depend on the output of other analyses, any change in
the methods used at any step (such as updating reference databases or
changing output formats) will usually require re-running some or
all of the analyses in the pipeline to ensure that all analysis
results remain directly comparable and consistent with each other.
Therefore, it will be essential to make these changes in a controlled
and coordinated manner.  It was agreed that each static snapshot of
the analyses and reference datasets used in the pipeline will be
given a pipeline version number, starting from 0 and incrementing by
1 each time <i>any</i> change is
made to the pipeline that may affect any analysis's output.  Pipeline
version increments must be agreed upon beforehand, and will not be
allowed while an annotation batch is in
progress.  It was also agreed that pipeline version 0 should be a
special development version.  While the pipeline is at version 0,
developers are free to change and/or update their pipeline stages
without a pipeline increment.  When the pipeline is considered to be
working and producing good results, the pipeline version will be
incremented to 1 and rigorous pipeline version control will begin.</p>
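<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">The versioning rules above can be
summarised in a short Python sketch. The class and its method names are
hypothetical; it only models the rules stated in this report (free changes
at version 0, an increment for every later change, and no increment while
a batch is in progress).</p>

```python
class PipelineVersion:
    """Models the agreed pipeline versioning rules (illustrative only)."""

    def __init__(self):
        self.version = 0            # version 0 is the free-development version
        self.batch_running = False  # set to True while an annotation batch runs

    def change_stage(self):
        """Record a change to any analysis or reference dataset."""
        if self.version == 0:
            return self.version     # development: no increment required
        return self.increment()

    def increment(self):
        """Move to the next pipeline version, unless a batch is in progress."""
        if self.batch_running:
            raise RuntimeError("cannot increment while a batch is in progress")
        self.version += 1
        return self.version


pipeline = PipelineVersion()
pipeline.change_stage()         # still version 0
pipeline.increment()            # pipeline declared working: version 1
print(pipeline.change_stage())  # 2
```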
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">How often should the pipeline be run?
It was felt that running the pipeline on single BACs would be a waste
of time and a minimum batch size of 10 should be set.  In addition,
to avoid putting too much of a computational burden on our sites, we
also agreed on an initial maximum batch size of 100 BACs.  However,
these limits should be revisited once the pipeline is running and its
performance characteristics are better established.</p>
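<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">A minimal Python sketch of the batch size
limits follows. The function name and the policy of simply waiting until
enough BACs have accumulated are assumptions made for illustration; only
the limits of 10 and 100 come from the meeting.</p>

```python
MIN_BATCH = 10   # initial minimum batch size agreed at the meeting
MAX_BATCH = 100  # initial maximum batch size agreed at the meeting


def next_batch(pending):
    """Return the next batch of BACs to run, or an empty list to keep waiting."""
    if MIN_BATCH > len(pending):
        return []  # too few BACs: wait for more submissions
    return list(pending[:MAX_BATCH])


waiting = ["BAC{}".format(n) for n in range(250)]
print(len(next_batch(waiting)))      # 100
print(len(next_batch(waiting[:9])))  # 0
```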
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">Final gene annotations will be
published primarily in the form of several fasta-format files
containing:</p>
<ul>
        <li><p style="margin-bottom: 0in">protein sequences</p></li>
        <li><p style="margin-bottom: 0in">cds/cdna sequences</p></li>
        <li><p style="margin-bottom: 0in">non-redundant proteins</p></li>
</ul>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">Fasta files will use the following
format for the description lines:</p>
<p style="margin-bottom: 0in"> &gt;&lt;locus name&gt; &lt;functional
description&gt; &lt;versioned seq. acc.&gt;  &lt;evidence codes&gt;
&lt;location on seq&gt; &lt;timestamp&gt;</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in"><b>Locus name:</b><span style="font-weight: medium">
 properly formatted locus name as set out above</span></p>
<p style="margin-bottom: 0in"><b>Functional description:</b><span style="font-weight: medium">
a draft functional description of the locus (obtained from functional
analysis stages of the pipeline)</span></p>
<p style="margin-bottom: 0in"><b>Versioned sequence accession: </b><span style="font-weight: medium">
the versioned Genbank accession of the BAC sequence (e.g. AC12312.1)</span></p>
<p style="margin-bottom: 0in"><b>Evidence codes:</b><span style="font-weight: medium">
string encoding the evidence supporting this annotated locus,
composed of one or more of the following letters:</span></p>
<p style="margin-bottom: 0in">  F - Full length cDNA aligned</p>
<p style="margin-bottom: 0in">  E - EST coverage</p>
<p style="margin-bottom: 0in">  H - homology to an annotation in
another sequenced species</p>
<p style="margin-bottom: 0in">  I - ab initio prediction</p>
<p style="margin-bottom: 0in"><b>Location on sequence:</b><span style="font-weight: medium">
1-based nucleotide coordinate range on the BAC sequence, formatted as
&lt;start&gt;-&lt;finish&gt;.  e.g. 41223-48128</span></p>
<p style="margin-bottom: 0in; font-weight: medium"><br /></p>
<p style="margin-bottom: 0in; font-weight: medium">Therefore, an
example of a properly-formatted description line would be:</p>
<p style="margin-bottom: 0in; font-weight: medium">&gt;AC21353_4(a).2
 putative x-ray vision protein AC21353.1 FEHI 12931-18446
2006-10-31/14:36:22</p>
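<p style="margin-bottom: 0in; font-weight: medium"><br /></p>
<p style="margin-bottom: 0in">The description line format can be
sketched in Python as below. The function name is hypothetical, fields
are assumed to be separated by single spaces, and the timestamp is passed
through as a ready-made string in the style of the example, since its
exact format was not defined in this report.</p>

```python
def description_line(locus, function, accession, evidence, start, finish, timestamp):
    """Build a fasta description line from its six fields (a sketch)."""
    allowed = set("FEHI")  # the four evidence code letters listed above
    if not evidence or not set(evidence).issubset(allowed):
        raise ValueError("evidence codes must be drawn from F, E, H, I")
    return ">{} {} {} {} {}-{} {}".format(
        locus, function, accession, evidence, start, finish, timestamp
    )


print(description_line(
    "AC21353_4(a).2", "putative x-ray vision protein", "AC21353.1",
    "FEHI", 12931, 18446, "2006-10-31/14:36:22",
))
```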
<p style="margin-bottom: 0in; font-weight: medium"><br /></p>
<p style="margin-bottom: 0in; font-weight: medium"><br /></p>
<p style="margin-bottom: 0in">The annotation of pseudogenes will be
worked out at a later date.</p>
<p style="margin-bottom: 0in">The format for the pseudomolecules to
be used will be worked out at a later date.</p>
<p style="margin-bottom: 0in"><br /></p>
<h3 class="western">Day 3 &ndash; October 25, 2006</h3>
<p style="margin-bottom: 0in">This was a half-day meeting, and was
mostly devoted to clarifications and additions to the decisions made
in the preceding two days.  Minimum and maximum BAC batch sizes were
discussed again briefly, and the initial minimum and maximum
batch sizes of 10 and 100 BACs respectively were confirmed.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">Additionally, a request by Lincoln
Stein for permission to do a genome-wide annotation using the
Ensembl annotation pipeline was discussed.  The decision was made
not to grant permission for him to publish an annotation at this
time, since his analysis pipeline will not be specifically tailored
to tomato, leading to a lower-quality annotation, and it would lead
to confusion about which genome annotation is the &ldquo;official&rdquo;
one.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">Also, there was a discussion of the
need for a note to be attached to our BAC sequences in Genbank,
asking that people defer genome-wide analyses until our official
annotation comes out.  A consensus was reached that the text of this
note should be discussed and agreed upon at the upcoming SOL project
meeting in November.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">Next, some clarifications to the
pipeline versioning scheme were made.  The idea of a free-development
pipeline version 0 was introduced (already covered above).  The
mechanics of pipeline synchronization were briefly discussed, with
Rob clarifying that SGN intended to provide both a human-readable web
page showing pipeline status and a machine-readable pipeline status
web service, as described above.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in">Next came a discussion of arrangements
for further tomato annotation meetings.  An agreement was
reached to hold a tomato annotation meeting at PAG in San Diego in
January.  Also, an agreement was made to try to have a phone
conference of tomato annotators every two weeks.  St&eacute;phane
introduced the VRVS service (<a href="http://www.vrvs.org/">http://www.vrvs.org</a>),
a non-commercial internet conferencing service, as a possible
mechanism for doing this without the cost of international phone
calls.</p>
<p style="margin-bottom: 0in"><br /></p>
<p style="margin-bottom: 0in"><br /></p>

EOHTML

$page->footer;