lib/Bio/SeqIO/swiss.pm

   1 #
   2 # BioPerl module for Bio::SeqIO::swiss
   3 #
   4 # Copyright Elia Stupka
   5 #
   6 # You may distribute this module under the same terms as perl itself
   7
   8 # POD documentation - main docs before the code
   9
  10 =head1 NAME
  11
  12 Bio::SeqIO::swiss - Swissprot sequence input/output stream
  13
  14 =head1 SYNOPSIS
  15
  16 It is probably best not to use this object directly, but
  17 rather go through the SeqIO handler system:
  18
  19     use Bio::SeqIO;
  20
  21     $stream = Bio::SeqIO->new(-file => $filename,
  22                               -format => 'swiss');
  23
  24     while ( my $seq = $stream->next_seq() ) {
  25        # do something with $seq
  26     }
  27
  28 =head1 DESCRIPTION
  29
  30 This object can transform Bio::Seq objects to and from Swiss-Pprot flat
  31 file databases.
  32
  33 There is a lot of flexibility here about how to dump things which needs
  34 to be documented.
  35
  36 =head2 GN (Gene name) line management details
  37
  38 A Uniprot/Swiss-Prot entry holds information on one protein
  39 sequence. If that sequence is identical across genes and species, they
  40 are all merged into one entry. This creates complex needs for several
  41 annotation fields in swiss-prot format.
  42
  43 The latest syntax for GN line is described in the user manual:
  44
  45   http://www.expasy.ch/sprot/userman.html#GN_line
  46
  47 Each of the possibly multiple genes in an entry can have Name,
  48 Synonyms (only if there is a name), OrderedLocusNames (names from
  49 genomic sequences) and ORFNames (temporary or cosmid names). "Name"
  50 here really means "symbol". This complexity is now dealt with the
  51 following way:
  52
  53 A new Bio::AnnotationI class was created in order to store the
  54 data in tag-value pairs.  This class (Bio::Annotation::TagTree)
  55 is stored in the Bio::Annotation::Collection object and is
  56 accessed like all other annotations. The tag name is 'gene_name'.
  57
  58 There is a single Bio::Annotation::TagTree per sequence record, which
  59 corresponds to the original class that stored this data
  60 (Bio::Annotation::StructuredValue).  Depending on how we progress
  61 this may change to represent each group of gene names.
  62
  63 For now, to access the gene name tree annotation, one uses the below method:
  64
  65    my ($gene) = $seq->annotation->get_Annotations('gene_name');
  66
  67 If you are only interested in displaying the values, value() returns a
  68 string with similar formatting.
  69
  70 There are several ways to get directly at the information you want if you
  71 know the element (tag) for the data.  For gene names all data is stored with
  72 the element-tag pairs:
  73
  74   "element1=tag1, tag2, tag3; element2=tag4, tag5;"
  75
  76 This normally means the element will be 'Name', 'Synonyms', etc. and the
  77 gene names the values.  Using findval(), you can do the following:
  78
  79   # grab a flattened list of all gene names
  80   my @names = $ann->findval('Name');
  81
  82   # or iterated through the nodes and grab the name for each group
  83   for my $node ($ann->findnode('gene_name')) {
  84      my @names = $node->findval('Name');
  85   }
  86
  87 The current method for parsing gene name data (and reconstructing gene name
  88 output) is very generic. This is somewhat preemptive if, for instance, UniProt
  89 decides to update and add another element name to the current ones using the
  90 same formatting layout. Under those circumstances, one can iterate through the
  91 tag tree in a safe way and retrieve all node data like so.
  92
  93   # retrieve the gene name nodes (groups like names, synonyms, etc).
  94   for my $ann ($seq->annotation->get_Annotations('gene_name')) {
  95
  96       # each gene name group
  97       for my $node ($ann->findnode('gene_name')) {
  98           print "Gene name:\n";
  99
 100           # each gene name node (tag => value pair)
 101           for my $n ($node->children) {
 102               print "\t".$n->element.": ".$n->children."\n";
 103           }
 104       }
 105   }
 106
 107 For more uses see Bio::Annotation::TagTree.
 108
 109 Since Uniprot/Swiss-Prot format have been around for quite some time, the
 110 parser is also able to read in the older GN line syntax where genes
 111 are separated by AND and various symbols by OR. The first symbol is
 112 taken to be the 'Name' and the remaining ones are stored as 'Synonyms'.
 113
 114 Also, for UniProt output we support using other Bio::AnnotationI, but in this
 115 case we only use the stringified version of the annotation. This is to allow for
 116 backwards compatibility with code that previously used
 117 Bio::Annotation::SimpleValue or other Bio::AnnotationI classes.
 118
 119 =head2 Optional functions
 120
 121 =over 3
 122
 123 =item _show_dna()
 124
 125 (output only) shows the dna or not
 126
 127 =item _post_sort()
 128
 129 (output only) provides a sorting func which is applied to the FTHelpers
 130 before printing
 131
 132 =item _id_generation_func()
 133
 134 This is function which is called as
 135
 136    print "ID   ", $func($seq), "\n";
 137
 138 To generate the ID line. If it is not there, it generates a sensible ID
 139 line using a number of tools.
 140
 141 If you want to output annotations in Swissprot format they need to be
 142 stored in a Bio::Annotation::Collection object which is accessible
 143 through the Bio::SeqI interface method L<annotation()|annotation>.
 144
 145 The following are the names of the keys which are polled from a
 146 L<Bio::Annotation::Collection> object.
 147
 148  reference   - Should contain Bio::Annotation::Reference objects
 149  comment     - Should contain Bio::Annotation::Comment objects
 150  dblink      - Should contain Bio::Annotation::DBLink objects
 151  gene_name   - Should contain Bio::Annotation::SimpleValue object
 152
 153 =back
 154
 155 =head1 FEEDBACK
 156
 157 =head2 Mailing Lists
 158
 159 User feedback is an integral part of the evolution of this
 160 and other Bioperl modules. Send your comments and suggestions,
 161 preferably to one of the Bioperl mailing lists.
 162 Your participation is much appreciated.
 163
 164   bioperl-l@bioperl.org                  - General discussion
 165   http://bioperl.org/wiki/Mailing_lists  - About the mailing lists
 166
 167 =head2 Support
 168
 169 Please direct usage questions or support issues to the mailing list:
 170
 171 I<bioperl-l@bioperl.org>
 172
 173 rather than to the module maintainer directly. Many experienced and
 174 reponsive experts will be able look at the problem and quickly
 175 address it. Please include a thorough description of the problem
 176 with code and data examples if at all possible.
 177
 178 =head2 Reporting Bugs
 179
 180 Report bugs to the Bioperl bug tracking system to help us keep track
 181 the bugs and their resolution.
 182 Bug reports can be submitted via the web:
 183
 184   https://github.com/bioperl/bioperl-live/issues
 185
 186 =head1 AUTHOR - Elia Stupka
 187
 188 Email elia@tll.org.sg
 189
 190 =head1 APPENDIX
 191
 192 The rest of the documentation details each of the object methods.
 193 Internal methods are usually preceded with a _
 194
 195 =cut
 196
 197 # Let the code begin...
 198
 199 package Bio::SeqIO::swiss;
 200
 201 use vars qw(@Unknown_names @Unknown_genus);
 202 use strict;
 203 use Bio::SeqIO::FTHelper;
 204 use Bio::SeqFeature::Generic;
 205 use Bio::Species;
 206 use Bio::Tools::SeqStats;
 207 use Bio::Seq::SeqFactory;
 208 use Bio::Annotation::Collection;
 209 use Bio::Annotation::Comment;
 210 use Bio::Annotation::Reference;
 211 use Bio::Annotation::DBLink;
 212 use Bio::Annotation::SimpleValue;
 213 use Bio::Annotation::TagTree;
 214
 215 use base qw(Bio::SeqIO);
 216
 217 our $LINE_LENGTH = 76;
 218
 219 # this is for doing species name parsing
 220 @Unknown_names=('other', 'unidentified',
 221                 'unknown organism', 'not specified',
 222                 'not shown', 'Unspecified', 'Unknown',
 223                 'None', 'unclassified', 'unidentified organism',
 224                 'not supplied'
 225                );
 226 # dictionary of synonyms for taxid 32644
 227 # all above can be part of valid species name
 228 @Unknown_genus = qw(unknown unclassified uncultured unidentified);
 229
 230 # if there are any other gene name tags, they are added to the end
 231 our @GENE_NAME_ORDER = qw(Name Synonyms OrderedLocusNames ORFNames);
 232
 233 sub _initialize {
 234     my($self,@args) = @_;
 235     $self->SUPER::_initialize(@args);
 236     # hash for functions for decoding keys.
 237     $self->{'_func_ftunit_hash'} = {};
 238     # sets this to one by default. People can change it
 239     $self->_show_dna(1);
 240     if ( ! defined $self->sequence_factory ) {
 241         $self->sequence_factory(Bio::Seq::SeqFactory->new
 242                                 (-verbose => $self->verbose(),
 243                                  -type => 'Bio::Seq::RichSeq'));
 244     }
 245 }
 246
 247 =head2 next_seq
 248
 249  Title   : next_seq
 250  Usage   : $seq = $stream->next_seq()
 251  Function: returns the next sequence in the stream
 252  Returns : Bio::Seq object
 253  Args    :
 254
 255 =cut
 256
 257 sub next_seq {
 258     my ($self,@args) = @_;
 259     my ($pseq,$c,$line,$name,$desc,$acc,$seqc,$mol,$div, $sptr,$seq_div,
 260         $date,$comment,@date_arr);
 261     my $genename = "";
 262     my ($annotation, %params, @features) = ( Bio::Annotation::Collection->new());
 263
 264     local $_;
 265
 266     1 while defined($_ = $self->_readline) && /^\s+$/;
 267     return unless defined $_ && /^ID\s/;
 268
 269     # fixed to allow _DIVISION to be optional for bug #946
 270     # see bug report for more information
 271     #
 272     # 9/6/06 Note: Swiss/TrEMBL sequences have no division acc. to UniProt
 273     # release notes; this is fixed to simplify the regex parsing
 274     # STANDARD (SwissProt) and PRELIMINARY (TrEMBL) added to namespace()
 275     unless(  m{^
 276                ID              \s+     #
 277                (\S+)           \s+     #  $1  entryname
 278                ([^\s;]+);      \s+     #  $2  DataClass
 279                (?:PRT;)?       \s+     #  Molecule Type (optional)
 280                [0-9]+[ ]AA     \.      #  Sequencelength (capture?)
 281                $
 282            }ox ) {
 283             # I couldn't find any new current UniProt sequences
 284             # that matched this format:
 285             # || m/^ID\s+(\S+)\s+(_([^\s_]+))? /ox ) {
 286         $self->throw("swissprot stream with no ID. Not swissprot in my book");
 287     }
 288     ($name, $seq_div) = ($1, $2);
 289     $params{'-namespace'} =
 290         ($seq_div eq 'Reviewed'   || $seq_div eq 'STANDARD')     ? 'Swiss-Prot' :
 291         ($seq_div eq 'Unreviewed' || $seq_div eq 'PRELIMINARY')  ? 'TrEMBL'     :
 292          $seq_div;
 293     # we shouldn't be setting the division, but for now...
 294     my ($junk, $division) = split q(_), $name;
 295     $params{'-division'} = $division;
 296     $params{'-alphabet'} = 'protein';
 297     # this is important to have the id for display in e.g. FTHelper, otherwise
 298     # you won't know which entry caused an error
 299     $params{'-display_id'} = $name;
 300
 301     BEFORE_FEATURE_TABLE :
 302           while ( defined($_ = $self->_readline) ) {
 303               # Exit at start of Feature table and at the sequence at the
 304               # latest HL 05/11/2000
 305               last if( /^(FT|SQ)/ );
 306
 307               # Description line(s)
 308               if (/^DE\s+(\S.*\S)/) {
 309                   $desc .= $desc ? " $1" : $1;
 310               }
 311               #Gene name
 312               elsif (/^GN\s+(.*)/) {
 313                   $genename .= " " if $genename;
 314                   $genename .= $1;
 315               }
 316               #accession number(s)
 317               elsif ( /^AC\s+(.+)/) {
 318                   my @accs = split(/[; ]+/, $1); # allow space in addition
 319                   $params{'-accession_number'} = shift @accs
 320                       unless defined $params{'-accession_number'};
 321                   push @{$params{'-secondary_accessions'}}, @accs;
 322               }
 323               #date and sequence version
 324               elsif ( /^DT\s+(.*)/ ) {
 325                   my $line = $1;
 326                   my ($date, $version) = split(' ', $line, 2);
 327                   $date =~ tr/,//d; # remove comma if new version
 328                   if ($version =~ /\(Rel\. (\d+), Last sequence update\)/ || # old
 329                       /sequence version (\d+)/) { #new
 330                       my $update = Bio::Annotation::SimpleValue->new
 331                           (-tagname    => 'seq_update',
 332                            -value      => $1
 333                           );
 334                       $annotation->add_Annotation($update);
 335                   } elsif ($version =~ /\(Rel\. (\d+), Last annotation update\)/ || #old
 336                            /entry version (\d+)/) { #new
 337                       $params{'-version'} = $1;
 338                   }
 339                   push @{$params{'-dates'}}, $date;
 340               }
 341               # Evidence level
 342               elsif ( /^PE\s+(.*)/ ) {
 343                   my $line = $1;
 344                   $line =~ s/;\s*//; # trim trailing semicolon and any spaces at the end of the line
 345                   my $evidence = Bio::Annotation::SimpleValue->new
 346                       (-tagname   => 'evidence',
 347                        -value     => $line
 348                       );
 349                   $annotation->add_Annotation($evidence);
 350               }
 351               # Organism name and phylogenetic information
 352               elsif (/^O[SCG]/) {
 353                   my $species = $self->_read_swissprot_Species($_);
 354                   $params{'-species'}= $species;
 355                   # now we are one line ahead -- so continue without reading the next
 356                   # line   HL 05/11/2000
 357               }
 358               # References
 359               elsif (/^R/) {
 360                   my $refs = $self->_read_swissprot_References($_);
 361                   foreach my $r (@$refs) {
 362                       $annotation->add_Annotation('reference',$r);
 363                   }
 364               }
 365               # Comments
 366               elsif (/^CC\s{3}(.*)/) {
 367                   $comment .= $1;
 368                   $comment .= "\n";
 369                   while (defined ($_ = $self->_readline) && /^CC\s{3}(.*)/ ) {
 370                       $comment .= $1 . "\n";
 371                   }
 372
 373                   # note: don't try to process comments here -- they may contain
 374                   # structure. LP 07/30/2000
 375
 376                   my $commobj = Bio::Annotation::Comment->new(-tagname => 'comment',
 377                                                               -text    => $comment);
 378                   $annotation->add_Annotation('comment',$commobj);
 379                   $comment = "";
 380                   $self->_pushback($_);
 381               }
 382               #DBLinks
 383               # old regexp
 384               # /^DR\s+(\S+)\;\s+(\S+)\;\s+(\S+)[\;\.](.*)$/) {
 385               # new regexp from Andreas Kahari  bug #1584
 386               elsif (/^DR\s+(\S+)\;\s+(\S+)\;\s+([^;]+)[\;\.](.*)$/) {
 387                   my ($database,$primaryid,$optional,$comment) = ($1,$2,$3,$4);
 388
 389                   # drop leading and training spaces and trailing .
 390                   $comment =~ s/\.\s*$//;
 391                   $comment =~ s/^\s+//;
 392
 393                   my $dblinkobj =  Bio::Annotation::DBLink->new
 394                       (-database    => $database,
 395                        -primary_id  => $primaryid,
 396                        -optional_id => $optional,
 397                        -comment     => $comment,
 398                        -tagname     => 'dblink',
 399                       );
 400
 401                   $annotation->add_Annotation('dblink',$dblinkobj);
 402               }
 403               #keywords
 404               elsif ( /^KW\s+(.*)$/ ) {
 405                   my @kw = split(/\s*\;\s*/,$1);
 406                   defined $kw[-1] && $kw[-1] =~ s/\.$//;
 407                   push @{$params{'-keywords'}}, @kw;
 408               }
 409           }
 410     # process and parse the gene name line if there was one (note: we
 411     # can't do this above b/c GN may be multi-line and we can't
 412     # unequivocally determine whether we've seen the last GN line in
 413     # the new format)
 414     if ($genename) {
 415         my @stags;
 416         if ($genename =~ /\w=\w/) {
 417             # new format (e.g., Name=RCHY1; Synonyms=ZNF363, CHIMP)
 418             for my $n (split(m{\s+and\s+},$genename)) {
 419                 my @genenames;
 420                 for my $section (split(m{\s*;\s*},$n)) {
 421                     my ($tag, $rest) = split("=",$section);
 422                     $rest ||= '';
 423                     for my $val (split(m{\s*,\s*},$rest)) {
 424                         push @genenames, [$tag => $val];
 425                     }
 426                 }
 427                 push @stags, ['gene_name' => \@genenames];
 428             }
 429         } else {
 430             # old format
 431             for my $section (split(/ AND /, $genename)) {
 432                 my @genenames;
 433                 $section =~ s/[\(\)\.]//g;
 434                 my @names = split(m{\s+OR\s+}, $section);
 435                 push @genenames, ['Name' => shift @names];
 436                 push @genenames, map {['Synonyms' => $_]} @names;
 437                 push @stags, ['gene_name' => \@genenames]
 438             }
 439         } #use Data::Dumper; print Dumper $gn, $genename;# exit;
 440         my $gn = Bio::Annotation::TagTree->new(-tagname => 'gene_name',
 441                                                -value => ['gene_names' => \@stags]);
 442         $annotation->add_Annotation('gene_name', $gn);
 443     }
 444
 445     FEATURE_TABLE :
 446           # if there is no feature table, or if we've got beyond, exit loop or don't
 447           # even enter    HL 05/11/2000
 448           while (defined $_ && /^FT/ ) {
 449               my $ftunit = $self->_read_FTHelper_swissprot($_);
 450
 451               # process ftunit
 452               # when parsing of the line fails we get undef returned
 453               if ($ftunit) {
 454                   push(@features,
 455                        $ftunit->_generic_seqfeature($self->location_factory(),
 456                                                     $params{'-seqid'}, "SwissProt"));
 457               } else {
 458                   $self->warn("failed to parse feature table line for seq " .
 459                               $params{'-display_id'}. "\n$_");
 460               }
 461               $_ = $self->_readline;
 462           }
 463     while ( defined($_) && ! /^SQ/ ) {
 464         $_ = $self->_readline;
 465     }
 466     $seqc = "";
 467     while ( defined ($_ = $self->_readline) ) {
 468         last if m{^//};
 469         s/[^A-Za-z]//g;
 470         $seqc .= uc($_);
 471     }
 472
 473     my $seq=  $self->sequence_factory->create
 474         (-verbose  => $self->verbose,
 475          %params,
 476          -seq      => $seqc,
 477          -desc     => $desc,
 478          -features => \@features,
 479          -annotation => $annotation,
 480         );
 481
 482     # The annotation doesn't get added by the constructor
 483     $seq->annotation($annotation);
 484
 485     return $seq;
 486 }
 487
 488 =head2 write_seq
 489
 490  Title   : write_seq
 491  Usage   : $stream->write_seq($seq)
 492  Function: writes the $seq object (must be seq) to the stream
 493  Returns : 1 for success and 0 for error
 494  Args    : array of 1 to n Bio::SeqI objects
 495
 496
 497 =cut
 498
 499 sub write_seq {
 500     my ($self,@seqs) = @_;
 501     foreach my $seq ( @seqs ) {
 502         $self->throw("Attempting to write with no seq!") unless defined $seq;
 503
 504         if ( ! ref $seq || ! $seq->isa('Bio::SeqI') ) {
 505             $self->warn(" $seq is not a SeqI compliant module. Attempting to dump, but may fail!");
 506         }
 507
 508         my $i;
 509         my $str = $seq->seq;
 510
 511         my $div;
 512         my $ns = ($seq->can('namespace')) && $seq->namespace();
 513         my $len = $seq->length();
 514
 515         if ( !$seq->can('division') || ! defined ($div = $seq->division()) ) {
 516             $div = 'UNK';
 517         }
 518
 519         # namespace dictates database, takes precedent over division. Sorry!
 520         if (defined($ns) && $ns ne '') {
 521             $div = ($ns eq 'Swiss-Prot') ? 'Reviewed'    :
 522                 ($ns eq 'TrEMBL')     ? 'Unreviewed' :
 523                     $ns;
 524         } else {
 525             $ns = 'Swiss-Prot';
 526             # division not reset; acts as fallback
 527         }
 528
 529         $self->warn("No whitespace allowed in SWISS-PROT display id [". $seq->display_id. "]")
 530             if $seq->display_id =~ /\s/;
 531
 532         my $temp_line;
 533         if ( $self->_id_generation_func ) {
 534             $temp_line = &{$self->_id_generation_func}($seq);
 535         } else {
 536             #$temp_line = sprintf ("%10s     STANDARD;      %3s;   %d AA.",
 537             #            $seq->primary_id()."_".$div,$mol,$len);
 538             # Reconstructing the ID relies heavily upon the input source having
 539             # been in a format that is parsed as this routine expects it -- that is,
 540             # by this module itself. This is bad, I think, and immediately breaks
 541             # if e.g. the Bio::DB::GenPept module is used as input.
 542             # Hence, switch to display_id(); _every_ sequence is supposed to have
 543             # this. HL 2000/09/03
 544             # Changed to reflect ID line changes in UniProt
 545             # Oct 2006 - removal of molecule type - see bug 2134
 546             $temp_line = sprintf ("%-24s%-12s%9d AA.",
 547                                   $seq->display_id(), $div.';', $len);
 548         }
 549
 550         $self->_print( "ID   $temp_line\n");
 551
 552         # if there, write the accession line
 553         local($^W) = 0;         # suppressing warnings about uninitialized fields
 554
 555         if ( $self->_ac_generation_func ) {
 556             $temp_line = &{$self->_ac_generation_func}($seq);
 557             $self->_print( "AC   $temp_line\n");
 558         }
 559         elsif ($seq->can('accession_number') ) {
 560             my $ac_line = $seq->accession_number;
 561             if ($seq->can('get_secondary_accessions') ) {
 562                 foreach my $sacc ($seq->get_secondary_accessions) {
 563                     $ac_line .= "; ". $sacc;;
 564                 }
 565                 $ac_line .= ";";
 566             }
 567
 568             $self->_write_line_swissprot_regex("AC   ","AC   ",$ac_line,
 569                                                "\\s\+\|\$",$LINE_LENGTH);
 570         }
 571         # otherwise - cannot print <sigh>
 572
 573
 574         # Date lines and sequence versions (changed 6/15/2006)
 575         # This is rebuilt from scratch using the current SwissProt/UniProt format
 576         if ( $seq->can('get_dates') ) {
 577             my @dates =  $seq->get_dates();
 578             my $ct = 1;
 579             my $seq_version = $seq->version;
 580             my ($update_version) = $seq->annotation->get_Annotations("seq_update");
 581             foreach my $dt (@dates) {
 582                 $self->_write_line_swissprot_regex("DT   ","DT   ",
 583                                                    $dt.', integrated into UniProtKB/'.$ns.'.',
 584                                                    "\\s\+\|\$",$LINE_LENGTH) if $ct == 1;
 585                 $self->_write_line_swissprot_regex("DT   ","DT   ",
 586                                                    $dt.", sequence version ".$update_version->display_text.'.',
 587                                                    "\\s\+\|\$",$LINE_LENGTH) if $ct == 2;
 588                 $self->_write_line_swissprot_regex("DT   ","DT   ",
 589                                                    $dt.", entry version $seq_version.",
 590                                                    "\\s\+\|\$",$LINE_LENGTH) if $ct == 3;
 591                 $ct++;
 592             }
 593         }
 594
 595         #Definition lines
 596         $self->_write_line_swissprot_regex("DE   ","DE   ",$seq->desc(),"\\s\+\|\$",$LINE_LENGTH);
 597
 598         #Gene name; print out new format
 599         foreach my $gene ( my @genes = $seq->annotation->get_Annotations('gene_name') ) {
 600             # gene is a Bio::Annotation::TagTree;
 601             if ($gene->isa('Bio::Annotation::TagTree')) {
 602                 my @genelines;
 603                 for my $node ($gene->findnode('gene_name')) {
 604                     # check for Name and Synonym first, then the rest get tagged on
 605                     my $geneline = "GN   ";
 606                     my %genedata = $node->hash;
 607                     for my $tag (@GENE_NAME_ORDER) {
 608                         if (exists $genedata{$tag}) {
 609                             $geneline .= (ref $genedata{$tag} eq 'ARRAY') ?
 610                                 "$tag=".join(', ',@{$genedata{$tag}})."; " :
 611                                 "$tag=$genedata{$tag}; ";
 612                             delete $genedata{$tag};
 613                         }
 614                     }
 615                     # add rest
 616                     for my $tag (sort keys %genedata) {
 617                         $geneline .= (ref $genedata{$tag} eq 'ARRAY') ?
 618                             "$tag=".join(', ',@{$genedata{$tag}})."; " :
 619                             "$tag=$genedata{$tag}; ";
 620                         delete $genedata{$tag};
 621                     }
 622                     push @genelines, "$geneline\n";
 623                 }
 624                 $self->_print(join("GN   and\n",@genelines));
 625             } else { # fall back to getting stringified output
 626                 $self->_write_line_swissprot_regex("GN   ","GN   ",
 627                                                    $gene->display_text,
 628                                                    "\\s\+\|\$",
 629                                                    $LINE_LENGTH);
 630             }
 631         }
 632
 633         # Organism lines
 634         if ($seq->can('species') && (my $spec = $seq->species)) {
 635             my @class = $spec->classification();
 636             shift(@class);
 637             my $species = $spec->species;
 638             my $genus = $spec->genus;
 639             my $OS = $spec->scientific_name;
 640             if ($class[-1] =~ /viruses/i) {
 641                 $OS = $species;
 642                 $OS .=  " ". $spec->sub_species if $spec->sub_species;
 643             }
 644             foreach (($spec->variant, $spec->common_name)) {
 645                 $OS .= " ($_)" if $_;
 646             }
 647             $self->_print( "OS   $OS.\n");
 648             my $OC = join('; ', reverse(@class)) .'.';
 649             $self->_write_line_swissprot_regex("OC   ","OC   ",$OC,"\; \|\$",$LINE_LENGTH);
 650             if ($spec->organelle) {
 651                 $self->_write_line_swissprot_regex("OG   ","OG   ",$spec->organelle,"\; \|\$",$LINE_LENGTH);
 652             }
 653             if ($spec->ncbi_taxid) {
 654                 $self->_print("OX   NCBI_TaxID=".$spec->ncbi_taxid.";\n");
 655             }
 656         }
 657
 658         # Reference lines
 659         my $t = 1;
 660         foreach my $ref ( $seq->annotation->get_Annotations('reference') ) {
 661             $self->_print( "RN   [$t]\n");
 662             # changed by lorenz 08/03/00
 663             # j.gilbert and h.lapp agreed that the rp line in swissprot seems
 664             # more like a comment than a parseable value, so print it as is
 665             if ($ref->rp) {
 666                 $self->_write_line_swissprot_regex("RP   ","RP   ",$ref->rp,
 667                                                    "\\s\+\|\$",$LINE_LENGTH);
 668             }
 669             if ($ref->comment) {
 670                 $self->_write_line_swissprot_regex("RC   ","RC   ",$ref->comment,
 671                                                    "\\s\+\|\$",$LINE_LENGTH);
 672             }
 673             if ($ref->medline or $ref->pubmed or $ref->doi) {
 674                 # new RX format in swissprot LP 09/17/00
 675                 # RX line can now have a DOI, Heikki 13 Feb 2008
 676
 677                 my $line;
 678                 $line .= "MEDLINE=". $ref->medline. '; ' if $ref->medline;
 679                 $line .= "PubMed=". $ref->pubmed. '; ' if $ref->pubmed;
 680                 $line .= "DOI=". $ref->doi. '; ' if $ref->doi;
 681                 chop $line;
 682
 683                 $self->_write_line_swissprot_regex("RX   ","RX   ",
 684                                                    $line,
 685                                                    "\\s\+\|\$",$LINE_LENGTH);
 686
 687             }
 688             my $author = $ref->authors .';' if($ref->authors);
 689             my $title = $ref->title .';' if( $ref->title);
 690             my $rg = $ref->rg . ';' if $ref->rg;
 691             $author =~ s/([\w\.]) (\w)/$1#$2/g;  # add word wrap protection char '#'
 692
 693             $self->_write_line_swissprot_regex("RG   ","RG   ",$rg,"\\s\+\|\$",$LINE_LENGTH) if $rg;
 694             $self->_write_line_swissprot_regex("RA   ","RA   ",$author,"\\s\+\|\$",$LINE_LENGTH) if $author;
 695             $self->_write_line_swissprot_regex("RT   ","RT   ",$title,'[\s\-]+|$',$LINE_LENGTH) if $title;
 696             $self->_write_line_swissprot_regex("RL   ","RL   ",$ref->location,"\\s\+\|\$",$LINE_LENGTH);
 697             $t++;
 698         }
 699
 700         # Comment lines
 701
 702         foreach my $comment ( $seq->annotation->get_Annotations('comment') ) {
 703             foreach my $cline (split ("\n", $comment->text)) {
 704                 while (length $cline > 74) {
 705                     $self->_print("CC   ",(substr $cline,0,74),"\n");
 706                     $cline = substr $cline,74;
 707                 }
 708                 $self->_print("CC   ",$cline,"\n");
 709             }
 710         }
 711
 712         # Database xref lines
 713
 714         foreach my $dblink ( $seq->annotation->get_Annotations('dblink') ) {
 715             my ($primary_id) = $dblink->primary_id;
 716
 717             if (defined($dblink->comment) && ($dblink->comment) ) {
 718                 $self->_print("DR   ",$dblink->database,"; ",$primary_id,"; ",
 719                               $dblink->optional_id,"; ",$dblink->comment,".\n");
 720             } elsif ($dblink->optional_id) {
 721                 $self->_print("DR   ",$dblink->database,"; ",
 722                               $primary_id,"; ",
 723                               $dblink->optional_id,".\n");
 724             } else {
 725                 $self->_print("DR   ",$dblink->database,
 726                               "; ",$primary_id,"; ","-.\n");
 727             }
 728         }
 729
 730         # Evidence lines
 731
 732         foreach my $evidence ( $seq->annotation->get_Annotations('evidence') ) {
 733             $self->_print("PE   ",$evidence->value,";\n");
 734         }
 735
 736         # if there, write the kw line
 737         {
 738             my $kw;
 739             if ( my $func = $self->_kw_generation_func ) {
 740                 $kw = &{$func}($seq);
 741             } elsif ( $seq->can('keywords') ) {
 742                 $kw = $seq->keywords;
 743                 if ( ref($kw) =~ /ARRAY/i ) {
 744                     $kw = join("; ", @$kw);
 745                 }
 746                 $kw .= '.' if $kw and $kw !~ /\.$/ ;
 747             }
 748             $kw =~ s/([\w\.]) (\w)/$1#$2/g;  # add word wrap protection char '#'
 749             $self->_write_line_swissprot_regex("KW   ","KW   ",
 750                                                $kw, "\\s\+\|\$",$LINE_LENGTH)
 751                 if $kw;
 752         }
 753
 754         #Check if there is seqfeatures before printing the FT line
 755         my @feats = $seq->can('top_SeqFeatures') ? $seq->top_SeqFeatures : ();
 756         if ($feats[0]) {
 757             if ( defined $self->_post_sort ) {
 758
 759                 # we need to read things into an array. Process. Sort them. Print 'em
 760
 761                 my $post_sort_func = $self->_post_sort();
 762                 my @fth;
 763
 764                 foreach my $sf ( @feats ) {
 765                     push(@fth,Bio::SeqIO::FTHelper::from_SeqFeature($sf,$seq));
 766                 }
 767                 @fth = sort { &$post_sort_func($a,$b) } @fth;
 768
 769                 foreach my $fth ( @fth ) {
 770                     $self->_print_swissprot_FTHelper($fth);
 771                 }
 772             } else {
 773                 # not post sorted. And so we can print as we get them.
 774                 # lower memory load...
 775
 776                 foreach my $sf ( @feats ) {
 777                     my @fth = Bio::SeqIO::FTHelper::from_SeqFeature($sf,$seq);
 778                     foreach my $fth ( @fth ) {
 779                         if ( ! $fth->isa('Bio::SeqIO::FTHelper') ) {
 780                             $sf->throw("Cannot process FTHelper... $fth");
 781                         }
 782                         $self->_print_swissprot_FTHelper($fth);
 783                     }
 784                 }
 785             }
 786
 787             if ( $self->_show_dna() == 0 ) {
 788                 return;
 789             }
 790         }
 791         # finished printing features.
 792
 793         # molecular weight
 794         my $mw = ${Bio::Tools::SeqStats->get_mol_wt($seq->primary_seq)}[0];
 795         # checksum
 796         # was crc32 checksum, changed it to crc64
 797         my $crc64 = $self->_crc64(\$str);
 798         $self->_print( sprintf("SQ   SEQUENCE  %4d AA;  %d MW;  %16s CRC64;\n",
 799                                $len,$mw,$crc64));
 800         $self->_print( "    ");
 801         my $linepos;
 802         for ($i = 0; $i < length($str); $i += 10) {
 803             $self->_print( " ", substr($str,$i,10));
 804             $linepos += 11;
 805             if ( ($i+10)%60 == 0 && (($i+10) < length($str))) {
 806                 $self->_print( "\n    ");
 807             }
 808         }
 809         $self->_print( "\n//\n");
 810
 811         $self->flush if $self->_flush_on_write && defined $self->_fh;
 812         return 1;
 813     }
 814 }
 815
 816 # Thanks to James Gilbert for the following two. LP 08/01/2000
 817
 818 =head2 _generateCRCTable
 819
 820  Title   : _generateCRCTable
 821  Usage   :
 822  Function:
 823  Example :
 824  Returns :
 825  Args    :
 826
 827
 828 =cut
 829
 830 sub _generateCRCTable {
 831   # 10001000001010010010001110000100
 832   # 32
 833     my $poly = 0xEDB88320;
 834     my ($self) = shift;
 835
 836     $self->{'_crcTable'} = [];
 837     foreach my $i (0..255) {
 838         my $crc = $i;
 839         for (my $j=8; $j > 0; $j--) {
 840             if ($crc & 1) {
 841                 $crc = ($crc >> 1) ^ $poly;
 842             } else {
 843                 $crc >>= 1;
 844             }
 845         }
 846         ${$self->{'_crcTable'}}[$i] = $crc;
 847     }
 848 }
 849
 850
 851 =head2 _crc32
 852
 853  Title   : _crc32
 854  Usage   :
 855  Function:
 856  Example :
 857  Returns :
 858  Args    :
 859
 860
 861 =cut
 862
 863 sub _crc32 {
 864     my( $self, $str ) = @_;
 865
 866     $self->throw("Argument to crc32() must be ref to scalar")
 867         unless ref($str) eq 'SCALAR';
 868
 869     $self->_generateCRCTable() unless exists $self->{'_crcTable'};
 870
 871     my $len = length($$str);
 872
 873     my $crc = 0xFFFFFFFF;
 874     for (my $i = 0; $i < $len; $i++) {
 875         # Get upper case value of each letter
 876         my $int = ord uc substr $$str, $i, 1;
 877         $crc = (($crc >> 8) & 0x00FFFFFF) ^
 878             ${$self->{'_crcTable'}}[ ($crc ^ $int) & 0xFF ];
 879     }
 880     return $crc;
 881 }
 882
 883 =head2 _crc64
 884
 885  Title   : _crc64
 886  Usage   :
 887  Function:
 888  Example :
 889  Returns :
 890  Args    :
 891
 892
 893 =cut
 894
 895 sub _crc64{
 896     my ($self, $sequence) = @_;
 897     my $POLY64REVh = 0xd8000000;
 898     my @CRCTableh = 256;
 899     my @CRCTablel = 256;
 900     my $initialized;
 901
 902     my $seq = $$sequence;
 903
 904     my $crcl = 0;
 905     my $crch = 0;
 906     if (!$initialized) {
 907         $initialized = 1;
 908         for (my $i=0; $i<256; $i++) {
 909             my $partl = $i;
 910             my $parth = 0;
 911             for (my $j=0; $j<8; $j++) {
 912                 my $rflag = $partl & 1;
 913                 $partl >>= 1;
 914                 $partl |= (1 << 31) if $parth & 1;
 915                 $parth >>= 1;
 916                 $parth ^= $POLY64REVh if $rflag;
 917             }
 918             $CRCTableh[$i] = $parth;
 919             $CRCTablel[$i] = $partl;
 920         }
 921     }
 922
 923     foreach (split '', $seq) {
 924         my $shr = ($crch & 0xFF) << 24;
 925         my $temp1h = $crch >> 8;
 926         my $temp1l = ($crcl >> 8) | $shr;
 927         my $tableindex = ($crcl ^ (unpack "C", $_)) & 0xFF;
 928         $crch = $temp1h ^ $CRCTableh[$tableindex];
 929         $crcl = $temp1l ^ $CRCTablel[$tableindex];
 930     }
 931     my $crc64 = sprintf("%08X%08X", $crch, $crcl);
 932     return $crc64;
 933 }
 934
 935 =head2 _print_swissprot_FTHelper
 936
 937  Title   : _print_swissprot_FTHelper
 938  Usage   :
 939  Function:
 940  Example :
 941  Returns :
 942  Args    :
 943
 944
 945 =cut
 946
 947 sub _print_swissprot_FTHelper {
 948     my ($self,$fth,$always_quote) = @_;
 949     $always_quote ||= 0;
 950     my ($start,$end) = ('?', '?');
 951
 952     if ( ! ref $fth || ! $fth->isa('Bio::SeqIO::FTHelper') ) {
 953         $fth->warn("$fth is not a FTHelper class. ".
 954                    "Attempting to print, but there could be tears!");
 955     }
 956     my $desc = "";
 957
 958     for my $tag ( qw(description gene note product) ) {
 959         if ( exists $fth->field->{$tag} ) {
 960             $desc = @{$fth->field->{$tag}}[0].".";
 961             last;
 962         }
 963     }
 964     $desc =~ s/\.$//;
 965
 966     my $ftid = "";
 967     if ( exists $fth->field->{'FTId'} ) {
 968         $ftid = @{$fth->field->{'FTId'}}[0]. '.';
 969     }
 970
 971     my $key =substr($fth->key,0,8);
 972     my $loc = $fth->loc;
 973     if ( $loc =~ /(\?|\d+|\>\d+|<\d+)?\.\.(\?|\d+|<\d+|>\d+)?/ ) {
 974         $start = $1 if defined $1;
 975         $end = $2 if defined $2;
 976
 977         # to_FTString only returns one value when start == end, #JB955
 978         # so if no match is found, assume it is both start and end #JB955
 979     } elsif ( $loc =~ /join\((\d+)((?:,\d+)+)?\)/) {
 980         my @y = ($1);
 981         if ( defined( my $m = $2) ) {
 982             $m =~ s/^\,//;
 983             push @y, split(/,/,$m);
 984         }
 985         for my $x ( @y ) {
 986             $self->_write_line_swissprot_regex(
 987                                                sprintf("FT   %-8s %6s %6s       ",
 988                                                        $key,
 989                                                        $x ,$x),
 990                                                "FT                                ",
 991                                                $desc.'.','\s+|$',$LINE_LENGTH);
 992         }
 993         return;
 994     } else {
 995         $start = $end = $fth->loc;
 996     }
 997     if ($desc) {
 998         $self->_write_line_swissprot_regex(sprintf("FT   %-8s %6s %6s       ",
 999                                                    $key,
1000                                                    $start ,$end),
1001                                            "FT                                ",
1002                                            $desc. '.', '\s+|$', $LINE_LENGTH);
1003     } else {                    #HELIX and STRAND do not have descriptions
1004         $self->_write_line_swissprot_regex(sprintf("FT   %-8s %6s %6s",
1005                                                    $key,
1006                                                    $start ,$end),
1007                                            "FT                         ",
1008                                            ' ', '\s+|$', $LINE_LENGTH);
1009     }
1010
1011
1012     if ($ftid) {
1013         $self->_write_line_swissprot_regex("FT                                ",
1014                                            "FT                                ",
1015                                            "/FTId=$ftid",'.|$',$LINE_LENGTH);
1016
1017     }
1018
1019 }
1020 #'
1021
1022 =head2 _read_swissprot_References
1023
1024  Title   : _read_swissprot_References
1025  Usage   :
1026  Function: Reads references from swissprot format. Internal function really
1027  Example :
1028  Returns :
1029  Args    :
1030
1031
1032 =cut
1033
1034 sub _read_swissprot_References{
1035     my ($self,$line) = @_;
1036     my ($b1, $b2, $rp, $rg, $title, $loc, $au, $med, $com, $pubmed, $doi);
1037     my @refs;
1038     local $_ = $line;
1039     while ( defined $_ ) {
1040         if ( /^[^R]/ || /^RN/ ) {
1041             if ( $rp ) {
1042                 $rg =~ s/;\s*$//g if defined($rg);
1043                 if (defined($au)) {
1044                     $au =~ s/;\s*$//;
1045                 } else {
1046                     $au = $rg;
1047                 }
1048                 $title =~ s/;\s*$//g if defined($title);
1049                 push @refs, Bio::Annotation::Reference->new
1050                     (-title   => $title,
1051                      -start   => $b1,
1052                      -end     => $b2,
1053                      -authors => $au,
1054                      -location=> $loc,
1055                      -medline => $med,
1056                      -pubmed  => $pubmed,
1057                      -doi     => $doi,
1058                      -comment => $com,
1059                      -rp      => $rp,
1060                      -rg      => $rg,
1061                      -tagname => 'reference',
1062                     );
1063                 # reset state for the next reference
1064                 $rp = '';
1065             }
1066             if (index($_,'R') != 0) {
1067                 $self->_pushback($_); # want this line to go back on the list
1068                 last;           # may be the safest exit point HL 05/11/2000
1069             }
1070             # don't forget to reset the state for the next reference
1071             $b1 = $b2 = $rg = $med = $com = $pubmed = $doi = undef;
1072             $title = $loc = $au = undef;
1073         } elsif ( /^RP\s{3}(.+? OF (\d+)-(\d+).*)/) {
1074             $rp  .= $1;
1075             $b1   = $2;
1076             $b2   = $3;
1077         } elsif ( /^RP\s{3}(.*)/) {
1078             if ($rp) {
1079                 $rp .= " ".$1;
1080             } else {
1081                 $rp = $1;
1082             }
1083         } elsif (/^RX\s{3}(.*)/) { # each reference can have only one RX line
1084             my $line = $1;
1085             $med = $1 if $line =~ /MEDLINE=(\d+);/;
1086             $pubmed = $1 if $line =~ /PubMed=(\d+);/;
1087             $doi = $1 if $line =~ /DOI=(.+);/;
1088         } elsif ( /^RA\s{3}(.*)/ ) {
1089             $au .= $au ? " $1" : $1;
1090         } elsif ( /^RG\s{3}(.*)/ ) {
1091             $rg .= $rg ? " $1" : $1;
1092         } elsif ( /^RT\s{3}(.*)/ ) {
1093             if ($title) {
1094                 my $tline = $1;
1095                 $title .= ($title =~ /[\w;,:\?!]$/) ? " $tline" : $tline;
1096             } else {
1097                 $title = $1;
1098             }
1099         } elsif (/^RL\s{3}(.*)/ ) {
1100             $loc .= $loc ? " $1" : $1;
1101         } elsif ( /^RC\s{3}(.*)/ ) {
1102             $com .= $com ? " $1" : $1;
1103         }
1104         $_ = $self->_readline;
1105     }
1106     return \@refs;
1107 }
1108
1109
1110 =head2 _read_swissprot_Species
1111
1112  Title   : _read_swissprot_Species
1113  Usage   :
1114  Function: Reads the swissprot Organism species and classification
1115            lines.
1116              Able to deal with unconventional species names.
1117  Example : OS Unknown prokaryotic organism
1118              $genus = undef ; $species = Unknown prokaryotic organism
1119  Returns : A Bio::Species object
1120  Args    :
1121
1122 =cut
1123
1124 sub _read_swissprot_Species {
1125     my( $self,$line ) = @_;
1126     my $org;
1127     local $_ = $line;
1128
1129     my( $sub_species, $species, $genus, $common, $variant, $ncbi_taxid, $sci_name, $class_lines, $descr );
1130     my $osline = "";
1131     my $do_genus_check = 1;
1132     while ( defined $_ ) {
1133         last unless /^O[SCGX]/;
1134         # believe it or not, but OS may come multiple times -- at this time
1135         # we can't capture multiple species
1136         if (/^OS\s+(\S.+)/ && (! defined($sci_name))) {
1137             $osline .= " " if $osline;
1138             $osline .= $1;
1139             if ($osline =~ s/(,|, and|\.)$//) {
1140                 # OS lines are usually like:
1141                 # Homo sapiens (human)
1142                 # where we have $sci_name followed by $descr (common name) in
1143                 # brackets, but we can also have:
1144                 # Venerupis (Ruditapes) philippinarum
1145                 # where we have brackets but they don't indicate a $descr
1146                 if ($osline =~ /[^\(\)]+\(.+\)[^\(\)]+$/) {
1147                     #*** Danger! no idea if this will pick up some syntaxes for
1148                     #    common names as well)
1149                     $sci_name = $osline;
1150                     $sci_name =~ s/\.$//;
1151                     $descr = '';
1152                     $do_genus_check = 0;
1153                 } else {
1154                     ($sci_name, $descr) = $osline =~ /(\S[^\(]+)(.*)/;
1155                 }
1156                 $sci_name =~ s/\s+$//;
1157
1158                 while ($descr =~ /\(([^\)]+)\)/g) {
1159                     my $item = $1;
1160                     # strain etc may not necessarily come first (yes, swissprot
1161                     # is messy)
1162                     if ((! defined($variant)) &&
1163                         (($item =~ /(^|[^\(\w])([Ss]train|isolate|serogroup|serotype|subtype|clone)\b/) ||
1164                          ($item =~ /^(biovar|pv\.|type\s+)/))) {
1165                         $variant = $item;
1166                     } elsif ($item =~ s/^subsp\.\s+//) {
1167                         if (! $sub_species) {
1168                             $sub_species = $item;
1169                         } elsif (! $variant) {
1170                             $variant = $item;
1171                         }
1172                     } elsif (! defined($common)) {
1173                         # we're only interested in the first common name
1174                         $common = $item;
1175                         if ((index($common, '(') >= 0) &&
1176                             (index($common, ')') < 0)) {
1177                             $common .= ')';
1178                         }
1179                     }
1180                 }
1181             }
1182         } elsif (s/^OC\s+(\S.+)$//) {
1183             $class_lines .= $1;
1184         } elsif (/^OG\s+(.*)/) {
1185             $org = $1;
1186         } elsif (/^OX\s+(.*)/ && (! defined($ncbi_taxid))) {
1187             my $taxstring = $1;
1188             # we only keep the first one and ignore all others
1189             if ($taxstring =~ /NCBI_TaxID=([\w\d]+)/) {
1190                 $ncbi_taxid = $1;
1191             } else {
1192                 $self->throw("$taxstring doesn't look like NCBI_TaxID");
1193             }
1194         }
1195         $_ = $self->_readline;
1196     }
1197     $self->_pushback($_);       # pushback the last line because we need it
1198
1199     $sci_name || return;
1200
1201     # if the organism belongs to taxid 32644 then no Bio::Species object.
1202     return if grep { $_ eq $sci_name } @Unknown_names;
1203
1204     # Convert data in classification lines into classification array.
1205     # Remove trailing . then split on ';' or '.;' so that classification that is 2
1206     # or more words will still get matched, use map() to remove trailing/leading/intervening
1207     # spaces
1208     $class_lines=~s/\.\s*$//;
1209     my @class = map { s/^\s+//; s/\s+$//; s/\s{2,}/ /g; $_; } split /[;\.]*;/, $class_lines;
1210
1211     if ($class[0] =~ /viruses/i) {
1212         # viruses have different OS/OC syntax
1213         my @virusnames = split(/\s+/, $sci_name);
1214         $species = (@virusnames > 1) ? pop(@virusnames) : '';
1215         $genus = join(" ", @virusnames);
1216         $sub_species = $descr;
1217     } elsif ($do_genus_check) {
1218         # do we have a genus?
1219         my $possible_genus = $class[-1];
1220         $possible_genus .= "|$class[-2]" if $class[-2];
1221         if ($sci_name =~ /^($possible_genus)/) {
1222             $genus = $1;
1223             ($species) = $sci_name =~ /^$genus\s+(.+)/;
1224         } else {
1225             $species = $sci_name;
1226         }
1227         # is this organism of rank species or is it lower?
1228         # (doesn't catch everything, but at least the guess isn't dangerous)
1229         if ($species && $species =~ /subsp\.|var\./) {
1230             ($species, $sub_species) = $species =~ /(.+)\s+((?:subsp\.|var\.).+)/;
1231         }
1232     }
1233
1234     # Bio::Species array needs array in Species -> Kingdom direction
1235     unless ($class[-1] eq $sci_name) {
1236         push(@class, $sci_name);
1237     }
1238     @class = reverse @class;
1239
1240     my $taxon = Bio::Species->new();
1241     $taxon->scientific_name($sci_name);
1242     $taxon->classification(@class);
1243     $taxon->common_name($common) if $common;
1244     $taxon->sub_species($sub_species) if $sub_species;
1245     $taxon->organelle($org) if $org;
1246     $taxon->ncbi_taxid($ncbi_taxid) if $ncbi_taxid;
1247     $taxon->variant($variant) if $variant;
1248
1249     # done
1250     return $taxon;
1251 }
1252
1253 =head2 _filehandle
1254
1255  Title   : _filehandle
1256  Usage   : $obj->_filehandle($newval)
1257  Function:
1258  Example :
1259  Returns : value of _filehandle
1260  Args    : newvalue (optional)
1261
1262
1263 =cut
1264
1265 # inherited from SeqIO.pm ! HL 05/11/2000
1266
1267 =head2 _read_FTHelper_swissprot
1268
1269  Title   : _read_FTHelper_swissprot
1270  Usage   : _read_FTHelper_swissprot(\$buffer)
1271  Function: reads the next FT key line
1272  Example :
1273  Returns : Bio::SeqIO::FTHelper object
1274  Args    :
1275
1276
1277 =cut
1278
1279 sub _read_FTHelper_swissprot {
1280     my ($self,$line ) = @_;
1281     # initial version implemented by HL 05/10/2000
1282     # FIXME this may not be perfect, so please review
1283     # lots of cleaning up by JES 2004/07/01, still may not be perfect =)
1284     # FTId now sepated from description as a qualifier
1285
1286     local $_ = $line;
1287     my ($key,                   # The key of the feature
1288         $loc,                   # The location line from the feature
1289         $desc,                  # The descriptive text
1290         $ftid,                  # feature Id is like a qualifier but there can be only one of them
1291        );
1292     if ( m/^FT\s{3}(\w+)\s+([\d\?\<]+)[\s.]+([\d\?\>]+)\s*(.*)$/ox) {
1293         $key = $1;
1294         my $loc1 = $2;
1295         my $loc2 = $3;
1296         $loc = "$loc1..$loc2";
1297         if ($4 && (length($4) > 0)) {
1298             $desc = $4;
1299             chomp($desc);
1300         } else {
1301             $desc = "";
1302         }
1303     } elsif ( m/^FT\s{3}(\w+)\s+([\d\?\<]+)\s+(.*)$/ox) {
1304         $key = $1;
1305         my $loc1 = $2;
1306         my $loc2 = $2;
1307         $loc = "$loc1";
1308         if ($3 && (length($3) > 0)) {
1309             $desc = $3;
1310             chomp($desc);
1311         } else {
1312             $desc = "";
1313         }
1314     }
1315
1316     while ( defined($_ = $self->_readline) && /^FT\s{4,}(\S.*)$/ ) {
1317         my $continuation_line = $1;
1318         if ( $continuation_line =~ /.FTId=(.*)\./ ) {
1319             $ftid=$1;
1320         }
1321         elsif ( $desc) {
1322             $desc .= " $continuation_line";
1323         } else {
1324             $desc = $continuation_line;
1325         }
1326         chomp $desc;
1327     }
1328     $self->_pushback($_);
1329     unless( $key ) {
1330         # No feature key. What's this?
1331         $self->warn("No feature key in putative feature table line: $line");
1332         return;
1333     }
1334
1335     # Make the new FTHelper object
1336     my $out = Bio::SeqIO::FTHelper->new(-verbose => $self->verbose());
1337     $out->key($key);
1338     $out->loc($loc);
1339
1340     # store the description if there is one
1341     if ( $desc && length($desc) ) {
1342         $desc =~ s/\.$//;
1343         push(@{$out->field->{"description"}}, $desc);
1344     }
1345     # Store the qualifier i.e. FTId
1346     if ( $ftid ) {
1347         push(@{$out->field->{"FTId"}}, $ftid);
1348     }
1349     return $out;
1350 }
1351
1352
1353 =head2 _write_line_swissprot
1354
1355  Title   : _write_line_swissprot
1356  Usage   :
1357  Function: internal function
1358  Example :
1359  Returns :
1360  Args    :
1361
1362
1363 =cut
1364
1365 sub _write_line_swissprot{
1366     my ($self,$pre1,$pre2,$line,$length) = @_;
1367
1368     $length || $self->throw( "Miscalled write_line_swissprot without length. Programming error!");
1369     my $subl = $length - length $pre2;
1370     my $linel = length $line;
1371     my $i;
1372
1373     my $sub = substr($line,0,$length - length $pre1);
1374
1375     $self->_print( "$pre1$sub\n");
1376
1377     for ($i= ($length - length $pre1);$i < $linel;) {
1378         $sub = substr($line,$i,($subl));
1379         $self->_print( "$pre2$sub\n");
1380         $i += $subl;
1381     }
1382
1383 }
1384
1385 =head2 _write_line_swissprot_regex
1386
1387  Title   : _write_line_swissprot_regex
1388  Usage   :
1389  Function: internal function for writing lines of specified
1390            length, with different first and the next line
1391            left hand headers and split at specific points in the
1392            text
1393  Example :
1394  Returns : nothing
1395  Args    : file handle, first header, second header, text-line, regex for line breaks, total line length
1396
1397
1398 =cut
1399
1400 sub _write_line_swissprot_regex {
1401     my ($self,$pre1,$pre2,$line,$regex,$length) = @_;
1402
1403     #print STDOUT "Going to print with $line!\n";
1404
1405     $length || $self->throw( "Miscalled write_line_swissprot without length. Programming error!");
1406
1407     if ( length $pre1 != length $pre2 ) {
1408         $self->warn( "len 1 is ". length ($pre1) . " len 2 is ". length ($pre2) . "\n");
1409         $self->throw( "Programming error - cannot called write_line_swissprot_regex with different length \npre1 ($pre1) and \npre2 ($pre2) tags!");
1410     }
1411
1412     my $subl = $length - (length $pre1) -1 ;
1413
1414     my $first_line = 1;
1415     while ($line =~ m/(.{1,$subl})($regex)/g) {
1416         my $s = $1.$2;
1417         $s =~ s/([\w\.])#(\w)/$1 $2/g  # remove word wrap protection char '#'
1418             if $pre1 eq "RA   " or $pre1 eq "KW   ";
1419         # remove annoying extra spaces at the end of the wrapped lines
1420         substr($s, -1, 1, '') if substr($s, -1, 1) eq ' ';
1421         if ($first_line) {
1422             $self->_print( "$pre1$s\n");
1423             $first_line = 0;
1424         } else {
1425             $self->_print( "$pre2$s\n");
1426         }
1427
1428     }
1429 }
1430
1431 =head2 _post_sort
1432
1433  Title   : _post_sort
1434  Usage   : $obj->_post_sort($newval)
1435  Function:
1436  Returns : value of _post_sort
1437  Args    : newvalue (optional)
1438
1439
1440 =cut
1441
1442 sub _post_sort{
1443     my $obj = shift;
1444     if ( @_ ) {
1445         my $value = shift;
1446         $obj->{'_post_sort'} = $value;
1447     }
1448     return $obj->{'_post_sort'};
1449
1450 }
1451
1452 =head2 _show_dna
1453
1454  Title   : _show_dna
1455  Usage   : $obj->_show_dna($newval)
1456  Function:
1457  Returns : value of _show_dna
1458  Args    : newvalue (optional)
1459
1460
1461 =cut
1462
1463 sub _show_dna{
1464     my $obj = shift;
1465     if ( @_ ) {
1466         my $value = shift;
1467         $obj->{'_show_dna'} = $value;
1468     }
1469     return $obj->{'_show_dna'};
1470
1471 }
1472
1473 =head2 _id_generation_func
1474
1475  Title   : _id_generation_func
1476  Usage   : $obj->_id_generation_func($newval)
1477  Function:
1478  Returns : value of _id_generation_func
1479  Args    : newvalue (optional)
1480
1481
1482 =cut
1483
1484 sub _id_generation_func{
1485     my $obj = shift;
1486     if ( @_ ) {
1487         my $value = shift;
1488         $obj->{'_id_generation_func'} = $value;
1489     }
1490     return $obj->{'_id_generation_func'};
1491
1492 }
1493
1494 =head2 _ac_generation_func
1495
1496  Title   : _ac_generation_func
1497  Usage   : $obj->_ac_generation_func($newval)
1498  Function:
1499  Returns : value of _ac_generation_func
1500  Args    : newvalue (optional)
1501
1502
1503 =cut
1504
1505 sub _ac_generation_func{
1506     my $obj = shift;
1507     if ( @_ ) {
1508         my $value = shift;
1509         $obj->{'_ac_generation_func'} = $value;
1510     }
1511     return $obj->{'_ac_generation_func'};
1512
1513 }
1514
1515 =head2 _sv_generation_func
1516
1517  Title   : _sv_generation_func
1518  Usage   : $obj->_sv_generation_func($newval)
1519  Function:
1520  Returns : value of _sv_generation_func
1521  Args    : newvalue (optional)
1522
1523
1524 =cut
1525
1526 sub _sv_generation_func{
1527     my $obj = shift;
1528     if ( @_ ) {
1529         my $value = shift;
1530         $obj->{'_sv_generation_func'} = $value;
1531     }
1532     return $obj->{'_sv_generation_func'};
1533
1534 }
1535
1536 =head2 _kw_generation_func
1537
1538  Title   : _kw_generation_func
1539  Usage   : $obj->_kw_generation_func($newval)
1540  Function:
1541  Returns : value of _kw_generation_func
1542  Args    : newvalue (optional)
1543
1544
1545 =cut
1546
1547 sub _kw_generation_func{
1548     my $obj = shift;
1549     if ( @_ ) {
1550         my $value = shift;
1551         $obj->{'_kw_generation_func'} = $value;
1552     }
1553     return $obj->{'_kw_generation_func'};
1554
1555 }
1556
1557 1;