2 # BioPerl module for Bio::SeqIO::swiss
4 # Copyright Elia Stupka
6 # You may distribute this module under the same terms as perl itself
8 # POD documentation - main docs before the code
12 Bio::SeqIO::swiss - Swissprot sequence input/output stream
16 It is probably best not to use this object directly, but
17 rather go through the SeqIO handler system:
21 $stream = Bio::SeqIO->new(-file => $filename,
24 while ( my $seq = $stream->next_seq() ) {
25 # do something with $seq
30 This object can transform Bio::Seq objects to and from Swiss-Pprot flat
33 There is a lot of flexibility here about how to dump things which needs
36 =head2 GN (Gene name) line management details
38 A Uniprot/Swiss-Prot entry holds information on one protein
39 sequence. If that sequence is identical across genes and species, they
40 are all merged into one entry. This creates complex needs for several
41 annotation fields in swiss-prot format.
43 The latest syntax for GN line is described in the user manual:
45 http://www.expasy.ch/sprot/userman.html#GN_line
47 Each of the possibly multiple genes in an entry can have Name,
48 Synonyms (only if there is a name), OrderedLocusNames (names from
49 genomic sequences) and ORFNames (temporary or cosmid names). "Name"
50 here really means "symbol". This complexity is now dealt with the
53 A new Bio::AnnotationI class was created in order to store the
54 data in tag-value pairs. This class (Bio::Annotation::TagTree)
55 is stored in the Bio::Annotation::Collection object and is
56 accessed like all other annotations. The tag name is 'gene_name'.
58 There is a single Bio::Annotation::TagTree per sequence record, which
59 corresponds to the original class that stored this data
60 (Bio::Annotation::StructuredValue). Depending on how we progress
61 this may change to represent each group of gene names.
63 For now, to access the gene name tree annotation, one uses the below method:
65 my ($gene) = $seq->annotation->get_Annotations('gene_name');
67 If you are only interested in displaying the values, value() returns a
68 string with similar formatting.
70 There are several ways to get directly at the information you want if you
71 know the element (tag) for the data. For gene names all data is stored with
72 the element-tag pairs:
74 "element1=tag1, tag2, tag3; element2=tag4, tag5;"
76 This normally means the element will be 'Name', 'Synonyms', etc. and the
77 gene names the values. Using findval(), you can do the following:
79 # grab a flattened list of all gene names
80 my @names = $ann->findval('Name');
82 # or iterated through the nodes and grab the name for each group
83 for my $node ($ann->findnode('gene_name')) {
84 my @names = $node->findval('Name');
87 The current method for parsing gene name data (and reconstructing gene name
88 output) is very generic. This is somewhat preemptive if, for instance, UniProt
89 decides to update and add another element name to the current ones using the
90 same formatting layout. Under those circumstances, one can iterate through the
91 tag tree in a safe way and retrieve all node data like so.
93 # retrieve the gene name nodes (groups like names, synonyms, etc).
94 for my $ann ($seq->annotation->get_Annotations('gene_name')) {
96 # each gene name group
97 for my $node ($ann->findnode('gene_name')) {
100 # each gene name node (tag => value pair)
101 for my $n ($node->children) {
102 print "\t".$n->element.": ".$n->children."\n";
107 For more uses see Bio::Annotation::TagTree.
109 Since Uniprot/Swiss-Prot format have been around for quite some time, the
110 parser is also able to read in the older GN line syntax where genes
111 are separated by AND and various symbols by OR. The first symbol is
112 taken to be the 'Name' and the remaining ones are stored as 'Synonyms'.
114 Also, for UniProt output we support using other Bio::AnnotationI, but in this
115 case we only use the stringified version of the annotation. This is to allow for
116 backwards compatibility with code that previously used
117 Bio::Annotation::SimpleValue or other Bio::AnnotationI classes.
119 =head2 Optional functions
125 (output only) shows the dna or not
129 (output only) provides a sorting func which is applied to the FTHelpers
132 =item _id_generation_func()
134 This is function which is called as
136 print "ID ", $func($seq), "\n";
138 To generate the ID line. If it is not there, it generates a sensible ID
139 line using a number of tools.
141 If you want to output annotations in Swissprot format they need to be
142 stored in a Bio::Annotation::Collection object which is accessible
143 through the Bio::SeqI interface method L<annotation()|annotation>.
145 The following are the names of the keys which are polled from a
146 L<Bio::Annotation::Collection> object.
148 reference - Should contain Bio::Annotation::Reference objects
149 comment - Should contain Bio::Annotation::Comment objects
150 dblink - Should contain Bio::Annotation::DBLink objects
151 gene_name - Should contain Bio::Annotation::SimpleValue object
159 User feedback is an integral part of the evolution of this
160 and other Bioperl modules. Send your comments and suggestions,
161 preferably to one of the Bioperl mailing lists.
162 Your participation is much appreciated.
164 bioperl-l@bioperl.org - General discussion
165 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
169 Please direct usage questions or support issues to the mailing list:
171 I<bioperl-l@bioperl.org>
173 rather than to the module maintainer directly. Many experienced and
174 reponsive experts will be able look at the problem and quickly
175 address it. Please include a thorough description of the problem
176 with code and data examples if at all possible.
178 =head2 Reporting Bugs
180 Report bugs to the Bioperl bug tracking system to help us keep track
181 the bugs and their resolution.
182 Bug reports can be submitted via the web:
184 https://github.com/bioperl/bioperl-live/issues
186 =head1 AUTHOR - Elia Stupka
188 Email elia@tll.org.sg
192 The rest of the documentation details each of the object methods.
193 Internal methods are usually preceded with a _
197 # Let the code begin...
199 package Bio
::SeqIO
::swiss
;
201 use vars
qw(@Unknown_names @Unknown_genus);
203 use Bio::SeqIO::FTHelper;
204 use Bio::SeqFeature::Generic;
206 use Bio::Tools::SeqStats;
207 use Bio::Seq::SeqFactory;
208 use Bio::Annotation::Collection;
209 use Bio::Annotation::Comment;
210 use Bio::Annotation::Reference;
211 use Bio::Annotation::DBLink;
212 use Bio::Annotation::SimpleValue;
213 use Bio::Annotation::TagTree;
215 use base qw(Bio::SeqIO);
217 our $LINE_LENGTH = 76;
219 # this is for doing species name parsing
220 @Unknown_names=('other', 'unidentified',
221 'unknown organism', 'not specified',
222 'not shown', 'Unspecified', 'Unknown',
223 'None', 'unclassified', 'unidentified organism',
226 # dictionary of synonyms for taxid 32644
227 # all above can be part of valid species name
228 @Unknown_genus = qw(unknown unclassified uncultured unidentified);
230 # if there are any other gene name tags, they are added to the end
231 our @GENE_NAME_ORDER = qw(Name Synonyms OrderedLocusNames ORFNames);
234 my($self,@args) = @_;
235 $self->SUPER::_initialize
(@args);
236 # hash for functions for decoding keys.
237 $self->{'_func_ftunit_hash'} = {};
238 # sets this to one by default. People can change it
240 if ( ! defined $self->sequence_factory ) {
241 $self->sequence_factory(Bio
::Seq
::SeqFactory
->new
242 (-verbose
=> $self->verbose(),
243 -type
=> 'Bio::Seq::RichSeq'));
250 Usage : $seq = $stream->next_seq()
251 Function: returns the next sequence in the stream
252 Returns : Bio::Seq object
258 my ($self,@args) = @_;
259 my ($pseq,$c,$line,$name,$desc,$acc,$seqc,$mol,$div, $sptr,$seq_div,
260 $date,$comment,@date_arr);
262 my ($annotation, %params, @features) = ( Bio
::Annotation
::Collection
->new());
266 1 while defined($_ = $self->_readline) && /^\s+$/;
267 return unless defined $_ && /^ID\s/;
269 # fixed to allow _DIVISION to be optional for bug #946
270 # see bug report for more information
272 # 9/6/06 Note: Swiss/TrEMBL sequences have no division acc. to UniProt
273 # release notes; this is fixed to simplify the regex parsing
274 # STANDARD (SwissProt) and PRELIMINARY (TrEMBL) added to namespace()
277 (\S
+) \s
+ # $1 entryname
278 ([^\s
;]+); \s
+ # $2 DataClass
279 (?
:PRT
;)? \s
+ # Molecule Type (optional)
280 [0-9]+[ ]AA \
. # Sequencelength (capture?)
283 # I couldn't find any new current UniProt sequences
284 # that matched this format:
285 # || m/^ID\s+(\S+)\s+(_([^\s_]+))? /ox ) {
286 $self->throw("swissprot stream with no ID. Not swissprot in my book");
288 ($name, $seq_div) = ($1, $2);
289 $params{'-namespace'} =
290 ($seq_div eq 'Reviewed' || $seq_div eq 'STANDARD') ?
'Swiss-Prot' :
291 ($seq_div eq 'Unreviewed' || $seq_div eq 'PRELIMINARY') ?
'TrEMBL' :
293 # we shouldn't be setting the division, but for now...
294 my ($junk, $division) = split q
(_
), $name;
295 $params{'-division'} = $division;
296 $params{'-alphabet'} = 'protein';
297 # this is important to have the id for display in e.g. FTHelper, otherwise
298 # you won't know which entry caused an error
299 $params{'-display_id'} = $name;
301 BEFORE_FEATURE_TABLE
:
302 while ( defined($_ = $self->_readline) ) {
303 # Exit at start of Feature table and at the sequence at the
304 # latest HL 05/11/2000
305 last if( /^(FT|SQ)/ );
307 # Description line(s)
308 if (/^DE\s+(\S.*\S)/) {
309 $desc .= $desc ?
" $1" : $1;
312 elsif (/^GN\s+(.*)/) {
313 $genename .= " " if $genename;
317 elsif ( /^AC\s+(.+)/) {
318 my @accs = split(/[; ]+/, $1); # allow space in addition
319 $params{'-accession_number'} = shift @accs
320 unless defined $params{'-accession_number'};
321 push @
{$params{'-secondary_accessions'}}, @accs;
323 #date and sequence version
324 elsif ( /^DT\s+(.*)/ ) {
326 my ($date, $version) = split(' ', $line, 2);
327 $date =~ tr/,//d; # remove comma if new version
328 if ($version =~ /\(Rel\. (\d+), Last sequence update\)/ || # old
329 /sequence version (\d+)/) { #new
330 my $update = Bio
::Annotation
::SimpleValue
->new
331 (-tagname
=> 'seq_update',
334 $annotation->add_Annotation($update);
335 } elsif ($version =~ /\(Rel\. (\d+), Last annotation update\)/ || #old
336 /entry version (\d+)/) { #new
337 $params{'-version'} = $1;
339 push @
{$params{'-dates'}}, $date;
342 elsif ( /^PE\s+(.*)/ ) {
344 $line =~ s/;\s*//; # trim trailing semicolon and any spaces at the end of the line
345 my $evidence = Bio
::Annotation
::SimpleValue
->new
346 (-tagname
=> 'evidence',
349 $annotation->add_Annotation($evidence);
351 # Organism name and phylogenetic information
353 my $species = $self->_read_swissprot_Species($_);
354 $params{'-species'}= $species;
355 # now we are one line ahead -- so continue without reading the next
360 my $refs = $self->_read_swissprot_References($_);
361 foreach my $r (@
$refs) {
362 $annotation->add_Annotation('reference',$r);
366 elsif (/^CC\s{3}(.*)/) {
369 while (defined ($_ = $self->_readline) && /^CC\s{3}(.*)/ ) {
370 $comment .= $1 . "\n";
373 # note: don't try to process comments here -- they may contain
374 # structure. LP 07/30/2000
376 my $commobj = Bio
::Annotation
::Comment
->new(-tagname
=> 'comment',
378 $annotation->add_Annotation('comment',$commobj);
380 $self->_pushback($_);
384 # /^DR\s+(\S+)\;\s+(\S+)\;\s+(\S+)[\;\.](.*)$/) {
385 # new regexp from Andreas Kahari bug #1584
386 elsif (/^DR\s+(\S+)\;\s+(\S+)\;\s+([^;]+)[\;\.](.*)$/) {
387 my ($database,$primaryid,$optional,$comment) = ($1,$2,$3,$4);
389 # drop leading and training spaces and trailing .
390 $comment =~ s/\.\s*$//;
391 $comment =~ s/^\s+//;
393 my $dblinkobj = Bio
::Annotation
::DBLink
->new
394 (-database
=> $database,
395 -primary_id
=> $primaryid,
396 -optional_id
=> $optional,
397 -comment
=> $comment,
398 -tagname
=> 'dblink',
401 $annotation->add_Annotation('dblink',$dblinkobj);
404 elsif ( /^KW\s+(.*)$/ ) {
405 my @kw = split(/\s*\;\s*/,$1);
406 defined $kw[-1] && $kw[-1] =~ s/\.$//;
407 push @
{$params{'-keywords'}}, @kw;
410 # process and parse the gene name line if there was one (note: we
411 # can't do this above b/c GN may be multi-line and we can't
412 # unequivocally determine whether we've seen the last GN line in
416 if ($genename =~ /\w=\w/) {
417 # new format (e.g., Name=RCHY1; Synonyms=ZNF363, CHIMP)
418 for my $n (split(m{\s+and\s+},$genename)) {
420 for my $section (split(m{\s*;\s*},$n)) {
421 my ($tag, $rest) = split("=",$section);
423 for my $val (split(m{\s*,\s*},$rest)) {
424 push @genenames, [$tag => $val];
427 push @stags, ['gene_name' => \
@genenames];
431 for my $section (split(/ AND /, $genename)) {
433 $section =~ s/[\(\)\.]//g;
434 my @names = split(m{\s+OR\s+}, $section);
435 push @genenames, ['Name' => shift @names];
436 push @genenames, map {['Synonyms' => $_]} @names;
437 push @stags, ['gene_name' => \
@genenames]
439 } #use Data::Dumper; print Dumper $gn, $genename;# exit;
440 my $gn = Bio
::Annotation
::TagTree
->new(-tagname
=> 'gene_name',
441 -value
=> ['gene_names' => \
@stags]);
442 $annotation->add_Annotation('gene_name', $gn);
446 # if there is no feature table, or if we've got beyond, exit loop or don't
447 # even enter HL 05/11/2000
448 while (defined $_ && /^FT/ ) {
449 my $ftunit = $self->_read_FTHelper_swissprot($_);
452 # when parsing of the line fails we get undef returned
455 $ftunit->_generic_seqfeature($self->location_factory(),
456 $params{'-seqid'}, "SwissProt"));
458 $self->warn("failed to parse feature table line for seq " .
459 $params{'-display_id'}. "\n$_");
461 $_ = $self->_readline;
463 while ( defined($_) && ! /^SQ/ ) {
464 $_ = $self->_readline;
467 while ( defined ($_ = $self->_readline) ) {
473 my $seq= $self->sequence_factory->create
474 (-verbose
=> $self->verbose,
478 -features
=> \
@features,
479 -annotation
=> $annotation,
482 # The annotation doesn't get added by the constructor
483 $seq->annotation($annotation);
491 Usage : $stream->write_seq($seq)
492 Function: writes the $seq object (must be seq) to the stream
493 Returns : 1 for success and 0 for error
494 Args : array of 1 to n Bio::SeqI objects
500 my ($self,@seqs) = @_;
501 foreach my $seq ( @seqs ) {
502 $self->throw("Attempting to write with no seq!") unless defined $seq;
504 if ( ! ref $seq || ! $seq->isa('Bio::SeqI') ) {
505 $self->warn(" $seq is not a SeqI compliant module. Attempting to dump, but may fail!");
512 my $ns = ($seq->can('namespace')) && $seq->namespace();
513 my $len = $seq->length();
515 if ( !$seq->can('division') || ! defined ($div = $seq->division()) ) {
519 # namespace dictates database, takes precedent over division. Sorry!
520 if (defined($ns) && $ns ne '') {
521 $div = ($ns eq 'Swiss-Prot') ?
'Reviewed' :
522 ($ns eq 'TrEMBL') ?
'Unreviewed' :
526 # division not reset; acts as fallback
529 $self->warn("No whitespace allowed in SWISS-PROT display id [". $seq->display_id. "]")
530 if $seq->display_id =~ /\s/;
533 if ( $self->_id_generation_func ) {
534 $temp_line = &{$self->_id_generation_func}($seq);
536 #$temp_line = sprintf ("%10s STANDARD; %3s; %d AA.",
537 # $seq->primary_id()."_".$div,$mol,$len);
538 # Reconstructing the ID relies heavily upon the input source having
539 # been in a format that is parsed as this routine expects it -- that is,
540 # by this module itself. This is bad, I think, and immediately breaks
541 # if e.g. the Bio::DB::GenPept module is used as input.
542 # Hence, switch to display_id(); _every_ sequence is supposed to have
543 # this. HL 2000/09/03
544 # Changed to reflect ID line changes in UniProt
545 # Oct 2006 - removal of molecule type - see bug 2134
546 $temp_line = sprintf ("%-24s%-12s%9d AA.",
547 $seq->display_id(), $div.';', $len);
550 $self->_print( "ID $temp_line\n");
552 # if there, write the accession line
553 local($^W
) = 0; # suppressing warnings about uninitialized fields
555 if ( $self->_ac_generation_func ) {
556 $temp_line = &{$self->_ac_generation_func}($seq);
557 $self->_print( "AC $temp_line\n");
559 elsif ($seq->can('accession_number') ) {
560 my $ac_line = $seq->accession_number;
561 if ($seq->can('get_secondary_accessions') ) {
562 foreach my $sacc ($seq->get_secondary_accessions) {
563 $ac_line .= "; ". $sacc;;
568 $self->_write_line_swissprot_regex("AC ","AC ",$ac_line,
569 "\\s\+\|\$",$LINE_LENGTH);
571 # otherwise - cannot print <sigh>
574 # Date lines and sequence versions (changed 6/15/2006)
575 # This is rebuilt from scratch using the current SwissProt/UniProt format
576 if ( $seq->can('get_dates') ) {
577 my @dates = $seq->get_dates();
579 my $seq_version = $seq->version;
580 my ($update_version) = $seq->annotation->get_Annotations("seq_update");
581 foreach my $dt (@dates) {
582 $self->_write_line_swissprot_regex("DT ","DT ",
583 $dt.', integrated into UniProtKB/'.$ns.'.',
584 "\\s\+\|\$",$LINE_LENGTH) if $ct == 1;
585 $self->_write_line_swissprot_regex("DT ","DT ",
586 $dt.", sequence version ".$update_version->display_text.'.',
587 "\\s\+\|\$",$LINE_LENGTH) if $ct == 2;
588 $self->_write_line_swissprot_regex("DT ","DT ",
589 $dt.", entry version $seq_version.",
590 "\\s\+\|\$",$LINE_LENGTH) if $ct == 3;
596 $self->_write_line_swissprot_regex("DE ","DE ",$seq->desc(),"\\s\+\|\$",$LINE_LENGTH);
598 #Gene name; print out new format
599 foreach my $gene ( my @genes = $seq->annotation->get_Annotations('gene_name') ) {
600 # gene is a Bio::Annotation::TagTree;
601 if ($gene->isa('Bio::Annotation::TagTree')) {
603 for my $node ($gene->findnode('gene_name')) {
604 # check for Name and Synonym first, then the rest get tagged on
605 my $geneline = "GN ";
606 my %genedata = $node->hash;
607 for my $tag (@GENE_NAME_ORDER) {
608 if (exists $genedata{$tag}) {
609 $geneline .= (ref $genedata{$tag} eq 'ARRAY') ?
610 "$tag=".join(', ',@
{$genedata{$tag}})."; " :
611 "$tag=$genedata{$tag}; ";
612 delete $genedata{$tag};
616 for my $tag (sort keys %genedata) {
617 $geneline .= (ref $genedata{$tag} eq 'ARRAY') ?
618 "$tag=".join(', ',@
{$genedata{$tag}})."; " :
619 "$tag=$genedata{$tag}; ";
620 delete $genedata{$tag};
622 push @genelines, "$geneline\n";
624 $self->_print(join("GN and\n",@genelines));
625 } else { # fall back to getting stringified output
626 $self->_write_line_swissprot_regex("GN ","GN ",
634 if ($seq->can('species') && (my $spec = $seq->species)) {
635 my @class = $spec->classification();
637 my $species = $spec->species;
638 my $genus = $spec->genus;
639 my $OS = $spec->scientific_name;
640 if ($class[-1] =~ /viruses/i) {
642 $OS .= " ". $spec->sub_species if $spec->sub_species;
644 foreach (($spec->variant, $spec->common_name)) {
645 $OS .= " ($_)" if $_;
647 $self->_print( "OS $OS.\n");
648 my $OC = join('; ', reverse(@class)) .'.';
649 $self->_write_line_swissprot_regex("OC ","OC ",$OC,"\; \|\$",$LINE_LENGTH);
650 if ($spec->organelle) {
651 $self->_write_line_swissprot_regex("OG ","OG ",$spec->organelle,"\; \|\$",$LINE_LENGTH);
653 if ($spec->ncbi_taxid) {
654 $self->_print("OX NCBI_TaxID=".$spec->ncbi_taxid.";\n");
660 foreach my $ref ( $seq->annotation->get_Annotations('reference') ) {
661 $self->_print( "RN [$t]\n");
662 # changed by lorenz 08/03/00
663 # j.gilbert and h.lapp agreed that the rp line in swissprot seems
664 # more like a comment than a parseable value, so print it as is
666 $self->_write_line_swissprot_regex("RP ","RP ",$ref->rp,
667 "\\s\+\|\$",$LINE_LENGTH);
670 $self->_write_line_swissprot_regex("RC ","RC ",$ref->comment,
671 "\\s\+\|\$",$LINE_LENGTH);
673 if ($ref->medline or $ref->pubmed or $ref->doi) {
674 # new RX format in swissprot LP 09/17/00
675 # RX line can now have a DOI, Heikki 13 Feb 2008
678 $line .= "MEDLINE=". $ref->medline. '; ' if $ref->medline;
679 $line .= "PubMed=". $ref->pubmed. '; ' if $ref->pubmed;
680 $line .= "DOI=". $ref->doi. '; ' if $ref->doi;
683 $self->_write_line_swissprot_regex("RX ","RX ",
685 "\\s\+\|\$",$LINE_LENGTH);
688 my $author = $ref->authors .';' if($ref->authors);
689 my $title = $ref->title .';' if( $ref->title);
690 my $rg = $ref->rg . ';' if $ref->rg;
691 $author =~ s/([\w\.]) (\w)/$1#$2/g; # add word wrap protection char '#'
693 $self->_write_line_swissprot_regex("RG ","RG ",$rg,"\\s\+\|\$",$LINE_LENGTH) if $rg;
694 $self->_write_line_swissprot_regex("RA ","RA ",$author,"\\s\+\|\$",$LINE_LENGTH) if $author;
695 $self->_write_line_swissprot_regex("RT ","RT ",$title,'[\s\-]+|$',$LINE_LENGTH) if $title;
696 $self->_write_line_swissprot_regex("RL ","RL ",$ref->location,"\\s\+\|\$",$LINE_LENGTH);
702 foreach my $comment ( $seq->annotation->get_Annotations('comment') ) {
703 foreach my $cline (split ("\n", $comment->text)) {
704 while (length $cline > 74) {
705 $self->_print("CC ",(substr $cline,0,74),"\n");
706 $cline = substr $cline,74;
708 $self->_print("CC ",$cline,"\n");
712 # Database xref lines
714 foreach my $dblink ( $seq->annotation->get_Annotations('dblink') ) {
715 my ($primary_id) = $dblink->primary_id;
717 if (defined($dblink->comment) && ($dblink->comment) ) {
718 $self->_print("DR ",$dblink->database,"; ",$primary_id,"; ",
719 $dblink->optional_id,"; ",$dblink->comment,".\n");
720 } elsif ($dblink->optional_id) {
721 $self->_print("DR ",$dblink->database,"; ",
723 $dblink->optional_id,".\n");
725 $self->_print("DR ",$dblink->database,
726 "; ",$primary_id,"; ","-.\n");
732 foreach my $evidence ( $seq->annotation->get_Annotations('evidence') ) {
733 $self->_print("PE ",$evidence->value,";\n");
736 # if there, write the kw line
739 if ( my $func = $self->_kw_generation_func ) {
740 $kw = &{$func}($seq);
741 } elsif ( $seq->can('keywords') ) {
742 $kw = $seq->keywords;
743 if ( ref($kw) =~ /ARRAY/i ) {
744 $kw = join("; ", @
$kw);
746 $kw .= '.' if $kw and $kw !~ /\.$/ ;
748 $kw =~ s/([\w\.]) (\w)/$1#$2/g; # add word wrap protection char '#'
749 $self->_write_line_swissprot_regex("KW ","KW ",
750 $kw, "\\s\+\|\$",$LINE_LENGTH)
754 #Check if there is seqfeatures before printing the FT line
755 my @feats = $seq->can('top_SeqFeatures') ?
$seq->top_SeqFeatures : ();
757 if ( defined $self->_post_sort ) {
759 # we need to read things into an array. Process. Sort them. Print 'em
761 my $post_sort_func = $self->_post_sort();
764 foreach my $sf ( @feats ) {
765 push(@fth,Bio
::SeqIO
::FTHelper
::from_SeqFeature
($sf,$seq));
767 @fth = sort { &$post_sort_func($a,$b) } @fth;
769 foreach my $fth ( @fth ) {
770 $self->_print_swissprot_FTHelper($fth);
773 # not post sorted. And so we can print as we get them.
774 # lower memory load...
776 foreach my $sf ( @feats ) {
777 my @fth = Bio
::SeqIO
::FTHelper
::from_SeqFeature
($sf,$seq);
778 foreach my $fth ( @fth ) {
779 if ( ! $fth->isa('Bio::SeqIO::FTHelper') ) {
780 $sf->throw("Cannot process FTHelper... $fth");
782 $self->_print_swissprot_FTHelper($fth);
787 if ( $self->_show_dna() == 0 ) {
791 # finished printing features.
794 my $mw = ${Bio
::Tools
::SeqStats
->get_mol_wt($seq->primary_seq)}[0];
796 # was crc32 checksum, changed it to crc64
797 my $crc64 = $self->_crc64(\
$str);
798 $self->_print( sprintf("SQ SEQUENCE %4d AA; %d MW; %16s CRC64;\n",
802 for ($i = 0; $i < length($str); $i += 10) {
803 $self->_print( " ", substr($str,$i,10));
805 if ( ($i+10)%60 == 0 && (($i+10) < length($str))) {
806 $self->_print( "\n ");
809 $self->_print( "\n//\n");
811 $self->flush if $self->_flush_on_write && defined $self->_fh;
816 # Thanks to James Gilbert for the following two. LP 08/01/2000
818 =head2 _generateCRCTable
820 Title : _generateCRCTable
830 sub _generateCRCTable
{
831 # 10001000001010010010001110000100
833 my $poly = 0xEDB88320;
836 $self->{'_crcTable'} = [];
837 foreach my $i (0..255) {
839 for (my $j=8; $j > 0; $j--) {
841 $crc = ($crc >> 1) ^ $poly;
846 ${$self->{'_crcTable'}}[$i] = $crc;
864 my( $self, $str ) = @_;
866 $self->throw("Argument to crc32() must be ref to scalar")
867 unless ref($str) eq 'SCALAR';
869 $self->_generateCRCTable() unless exists $self->{'_crcTable'};
871 my $len = length($$str);
873 my $crc = 0xFFFFFFFF;
874 for (my $i = 0; $i < $len; $i++) {
875 # Get upper case value of each letter
876 my $int = ord uc substr $$str, $i, 1;
877 $crc = (($crc >> 8) & 0x00FFFFFF) ^
878 ${$self->{'_crcTable'}}[ ($crc ^ $int) & 0xFF ];
896 my ($self, $sequence) = @_;
897 my $POLY64REVh = 0xd8000000;
902 my $seq = $$sequence;
908 for (my $i=0; $i<256; $i++) {
911 for (my $j=0; $j<8; $j++) {
912 my $rflag = $partl & 1;
914 $partl |= (1 << 31) if $parth & 1;
916 $parth ^= $POLY64REVh if $rflag;
918 $CRCTableh[$i] = $parth;
919 $CRCTablel[$i] = $partl;
923 foreach (split '', $seq) {
924 my $shr = ($crch & 0xFF) << 24;
925 my $temp1h = $crch >> 8;
926 my $temp1l = ($crcl >> 8) | $shr;
927 my $tableindex = ($crcl ^ (unpack "C", $_)) & 0xFF;
928 $crch = $temp1h ^ $CRCTableh[$tableindex];
929 $crcl = $temp1l ^ $CRCTablel[$tableindex];
931 my $crc64 = sprintf("%08X%08X", $crch, $crcl);
935 =head2 _print_swissprot_FTHelper
937 Title : _print_swissprot_FTHelper
947 sub _print_swissprot_FTHelper
{
948 my ($self,$fth,$always_quote) = @_;
950 my ($start,$end) = ('?', '?');
952 if ( ! ref $fth || ! $fth->isa('Bio::SeqIO::FTHelper') ) {
953 $fth->warn("$fth is not a FTHelper class. ".
954 "Attempting to print, but there could be tears!");
958 for my $tag ( qw(description gene note product) ) {
959 if ( exists $fth->field->{$tag} ) {
960 $desc = @
{$fth->field->{$tag}}[0].".";
967 if ( exists $fth->field->{'FTId'} ) {
968 $ftid = @
{$fth->field->{'FTId'}}[0]. '.';
971 my $key =substr($fth->key,0,8);
973 if ( $loc =~ /(\?|\d+|\>\d+|<\d+)?\.\.(\?|\d+|<\d+|>\d+)?/ ) {
974 $start = $1 if defined $1;
975 $end = $2 if defined $2;
977 # to_FTString only returns one value when start == end, #JB955
978 # so if no match is found, assume it is both start and end #JB955
979 } elsif ( $loc =~ /join\((\d+)((?:,\d+)+)?\)/) {
981 if ( defined( my $m = $2) ) {
983 push @y, split(/,/,$m);
986 $self->_write_line_swissprot_regex(
987 sprintf("FT %-8s %6s %6s ",
991 $desc.'.','\s+|$',$LINE_LENGTH);
995 $start = $end = $fth->loc;
998 $self->_write_line_swissprot_regex(sprintf("FT %-8s %6s %6s ",
1002 $desc. '.', '\s+|$', $LINE_LENGTH);
1003 } else { #HELIX and STRAND do not have descriptions
1004 $self->_write_line_swissprot_regex(sprintf("FT %-8s %6s %6s",
1008 ' ', '\s+|$', $LINE_LENGTH);
1013 $self->_write_line_swissprot_regex("FT ",
1015 "/FTId=$ftid",'.|$',$LINE_LENGTH);
1022 =head2 _read_swissprot_References
1024 Title : _read_swissprot_References
1026 Function: Reads references from swissprot format. Internal function really
1034 sub _read_swissprot_References
{
1035 my ($self,$line) = @_;
1036 my ($b1, $b2, $rp, $rg, $title, $loc, $au, $med, $com, $pubmed, $doi);
1039 while ( defined $_ ) {
1040 if ( /^[^R]/ || /^RN/ ) {
1042 $rg =~ s/;\s*$//g if defined($rg);
1048 $title =~ s/;\s*$//g if defined($title);
1049 push @refs, Bio
::Annotation
::Reference
->new
1061 -tagname
=> 'reference',
1063 # reset state for the next reference
1066 if (index($_,'R') != 0) {
1067 $self->_pushback($_); # want this line to go back on the list
1068 last; # may be the safest exit point HL 05/11/2000
1070 # don't forget to reset the state for the next reference
1071 $b1 = $b2 = $rg = $med = $com = $pubmed = $doi = undef;
1072 $title = $loc = $au = undef;
1073 } elsif ( /^RP\s{3}(.+? OF (\d+)-(\d+).*)/) {
1077 } elsif ( /^RP\s{3}(.*)/) {
1083 } elsif (/^RX\s{3}(.*)/) { # each reference can have only one RX line
1085 $med = $1 if $line =~ /MEDLINE=(\d+);/;
1086 $pubmed = $1 if $line =~ /PubMed=(\d+);/;
1087 $doi = $1 if $line =~ /DOI=(.+);/;
1088 } elsif ( /^RA\s{3}(.*)/ ) {
1089 $au .= $au ?
" $1" : $1;
1090 } elsif ( /^RG\s{3}(.*)/ ) {
1091 $rg .= $rg ?
" $1" : $1;
1092 } elsif ( /^RT\s{3}(.*)/ ) {
1095 $title .= ($title =~ /[\w;,:\?!]$/) ?
" $tline" : $tline;
1099 } elsif (/^RL\s{3}(.*)/ ) {
1100 $loc .= $loc ?
" $1" : $1;
1101 } elsif ( /^RC\s{3}(.*)/ ) {
1102 $com .= $com ?
" $1" : $1;
1104 $_ = $self->_readline;
1110 =head2 _read_swissprot_Species
1112 Title : _read_swissprot_Species
1114 Function: Reads the swissprot Organism species and classification
1116 Able to deal with unconventional species names.
1117 Example : OS Unknown prokaryotic organism
1118 $genus = undef ; $species = Unknown prokaryotic organism
1119 Returns : A Bio::Species object
1124 sub _read_swissprot_Species
{
1125 my( $self,$line ) = @_;
1129 my( $sub_species, $species, $genus, $common, $variant, $ncbi_taxid, $sci_name, $class_lines, $descr );
1131 my $do_genus_check = 1;
1132 while ( defined $_ ) {
1133 last unless /^O[SCGX]/;
1134 # believe it or not, but OS may come multiple times -- at this time
1135 # we can't capture multiple species
1136 if (/^OS\s+(\S.+)/ && (! defined($sci_name))) {
1137 $osline .= " " if $osline;
1139 if ($osline =~ s/(,|, and|\.)$//) {
1140 # OS lines are usually like:
1141 # Homo sapiens (human)
1142 # where we have $sci_name followed by $descr (common name) in
1143 # brackets, but we can also have:
1144 # Venerupis (Ruditapes) philippinarum
1145 # where we have brackets but they don't indicate a $descr
1146 if ($osline =~ /[^\(\)]+\(.+\)[^\(\)]+$/) {
1147 #*** Danger! no idea if this will pick up some syntaxes for
1148 # common names as well)
1149 $sci_name = $osline;
1150 $sci_name =~ s/\.$//;
1152 $do_genus_check = 0;
1154 ($sci_name, $descr) = $osline =~ /(\S[^\(]+)(.*)/;
1156 $sci_name =~ s/\s+$//;
1158 while ($descr =~ /\(([^\)]+)\)/g) {
1160 # strain etc may not necessarily come first (yes, swissprot
1162 if ((! defined($variant)) &&
1163 (($item =~ /(^|[^\(\w])([Ss]train|isolate|serogroup|serotype|subtype|clone)\b/) ||
1164 ($item =~ /^(biovar|pv\.|type\s+)/))) {
1166 } elsif ($item =~ s/^subsp\.\s+//) {
1167 if (! $sub_species) {
1168 $sub_species = $item;
1169 } elsif (! $variant) {
1172 } elsif (! defined($common)) {
1173 # we're only interested in the first common name
1175 if ((index($common, '(') >= 0) &&
1176 (index($common, ')') < 0)) {
1182 } elsif (s/^OC\s+(\S.+)$//) {
1184 } elsif (/^OG\s+(.*)/) {
1186 } elsif (/^OX\s+(.*)/ && (! defined($ncbi_taxid))) {
1188 # we only keep the first one and ignore all others
1189 if ($taxstring =~ /NCBI_TaxID=([\w\d]+)/) {
1192 $self->throw("$taxstring doesn't look like NCBI_TaxID");
1195 $_ = $self->_readline;
1197 $self->_pushback($_); # pushback the last line because we need it
1199 $sci_name || return;
1201 # if the organism belongs to taxid 32644 then no Bio::Species object.
1202 return if grep { $_ eq $sci_name } @Unknown_names;
1204 # Convert data in classification lines into classification array.
1205 # Remove trailing . then split on ';' or '.;' so that classification that is 2
1206 # or more words will still get matched, use map() to remove trailing/leading/intervening
1208 $class_lines=~s/\.\s*$//;
1209 my @class = map { s/^\s+//; s/\s+$//; s/\s{2,}/ /g; $_; } split /[;\.]*;/, $class_lines;
1211 if ($class[0] =~ /viruses/i) {
1212 # viruses have different OS/OC syntax
1213 my @virusnames = split(/\s+/, $sci_name);
1214 $species = (@virusnames > 1) ?
pop(@virusnames) : '';
1215 $genus = join(" ", @virusnames);
1216 $sub_species = $descr;
1217 } elsif ($do_genus_check) {
1218 # do we have a genus?
1219 my $possible_genus = $class[-1];
1220 $possible_genus .= "|$class[-2]" if $class[-2];
1221 if ($sci_name =~ /^($possible_genus)/) {
1223 ($species) = $sci_name =~ /^$genus\s+(.+)/;
1225 $species = $sci_name;
1227 # is this organism of rank species or is it lower?
1228 # (doesn't catch everything, but at least the guess isn't dangerous)
1229 if ($species && $species =~ /subsp\.|var\./) {
1230 ($species, $sub_species) = $species =~ /(.+)\s+((?:subsp\.|var\.).+)/;
1234 # Bio::Species array needs array in Species -> Kingdom direction
1235 unless ($class[-1] eq $sci_name) {
1236 push(@class, $sci_name);
1238 @class = reverse @class;
1240 my $taxon = Bio
::Species
->new();
1241 $taxon->scientific_name($sci_name);
1242 $taxon->classification(@class);
1243 $taxon->common_name($common) if $common;
1244 $taxon->sub_species($sub_species) if $sub_species;
1245 $taxon->organelle($org) if $org;
1246 $taxon->ncbi_taxid($ncbi_taxid) if $ncbi_taxid;
1247 $taxon->variant($variant) if $variant;
1256 Usage : $obj->_filehandle($newval)
1259 Returns : value of _filehandle
1260 Args : newvalue (optional)
1265 # inherited from SeqIO.pm ! HL 05/11/2000
1267 =head2 _read_FTHelper_swissprot
1269 Title : _read_FTHelper_swissprot
1270 Usage : _read_FTHelper_swissprot(\$buffer)
1271 Function: reads the next FT key line
1273 Returns : Bio::SeqIO::FTHelper object
1279 sub _read_FTHelper_swissprot
{
1280 my ($self,$line ) = @_;
1281 # initial version implemented by HL 05/10/2000
1282 # FIXME this may not be perfect, so please review
1283 # lots of cleaning up by JES 2004/07/01, still may not be perfect =)
1284 # FTId now sepated from description as a qualifier
1287 my ($key, # The key of the feature
1288 $loc, # The location line from the feature
1289 $desc, # The descriptive text
1290 $ftid, # feature Id is like a qualifier but there can be only one of them
1292 if ( m/^FT\s{3}(\w+)\s+([\d\?\<]+)[\s.]+([\d\?\>]+)\s*(.*)$/ox) {
1296 $loc = "$loc1..$loc2";
1297 if ($4 && (length($4) > 0)) {
1303 } elsif ( m/^FT\s{3}(\w+)\s+([\d\?\<]+)\s+(.*)$/ox) {
1308 if ($3 && (length($3) > 0)) {
1316 while ( defined($_ = $self->_readline) && /^FT\s{4,}(\S.*)$/ ) {
1317 my $continuation_line = $1;
1318 if ( $continuation_line =~ /.FTId=(.*)\./ ) {
1322 $desc .= " $continuation_line";
1324 $desc = $continuation_line;
1328 $self->_pushback($_);
1330 # No feature key. What's this?
1331 $self->warn("No feature key in putative feature table line: $line");
1335 # Make the new FTHelper object
1336 my $out = Bio
::SeqIO
::FTHelper
->new(-verbose
=> $self->verbose());
1340 # store the description if there is one
1341 if ( $desc && length($desc) ) {
1343 push(@
{$out->field->{"description"}}, $desc);
1345 # Store the qualifier i.e. FTId
1347 push(@
{$out->field->{"FTId"}}, $ftid);
1353 =head2 _write_line_swissprot
1355 Title : _write_line_swissprot
1357 Function: internal function
1365 sub _write_line_swissprot
{
1366 my ($self,$pre1,$pre2,$line,$length) = @_;
1368 $length || $self->throw( "Miscalled write_line_swissprot without length. Programming error!");
1369 my $subl = $length - length $pre2;
1370 my $linel = length $line;
1373 my $sub = substr($line,0,$length - length $pre1);
1375 $self->_print( "$pre1$sub\n");
1377 for ($i= ($length - length $pre1);$i < $linel;) {
1378 $sub = substr($line,$i,($subl));
1379 $self->_print( "$pre2$sub\n");
1385 =head2 _write_line_swissprot_regex
1387 Title : _write_line_swissprot_regex
1389 Function: internal function for writing lines of specified
1390 length, with different first and the next line
1391 left hand headers and split at specific points in the
1395 Args : file handle, first header, second header, text-line, regex for line breaks, total line length
1400 sub _write_line_swissprot_regex
{
1401 my ($self,$pre1,$pre2,$line,$regex,$length) = @_;
1403 #print STDOUT "Going to print with $line!\n";
1405 $length || $self->throw( "Miscalled write_line_swissprot without length. Programming error!");
1407 if ( length $pre1 != length $pre2 ) {
1408 $self->warn( "len 1 is ". length ($pre1) . " len 2 is ". length ($pre2) . "\n");
1409 $self->throw( "Programming error - cannot called write_line_swissprot_regex with different length \npre1 ($pre1) and \npre2 ($pre2) tags!");
1412 my $subl = $length - (length $pre1) -1 ;
1415 while ($line =~ m/(.{1,$subl})($regex)/g) {
1417 $s =~ s/([\w\.])#(\w)/$1 $2/g # remove word wrap protection char '#'
1418 if $pre1 eq "RA " or $pre1 eq "KW ";
1419 # remove annoying extra spaces at the end of the wrapped lines
1420 substr($s, -1, 1, '') if substr($s, -1, 1) eq ' ';
1422 $self->_print( "$pre1$s\n");
1425 $self->_print( "$pre2$s\n");
1434 Usage : $obj->_post_sort($newval)
1436 Returns : value of _post_sort
1437 Args : newvalue (optional)
1446 $obj->{'_post_sort'} = $value;
1448 return $obj->{'_post_sort'};
1455 Usage : $obj->_show_dna($newval)
1457 Returns : value of _show_dna
1458 Args : newvalue (optional)
1467 $obj->{'_show_dna'} = $value;
1469 return $obj->{'_show_dna'};
1473 =head2 _id_generation_func
1475 Title : _id_generation_func
1476 Usage : $obj->_id_generation_func($newval)
1478 Returns : value of _id_generation_func
1479 Args : newvalue (optional)
1484 sub _id_generation_func
{
1488 $obj->{'_id_generation_func'} = $value;
1490 return $obj->{'_id_generation_func'};
1494 =head2 _ac_generation_func
1496 Title : _ac_generation_func
1497 Usage : $obj->_ac_generation_func($newval)
1499 Returns : value of _ac_generation_func
1500 Args : newvalue (optional)
1505 sub _ac_generation_func
{
1509 $obj->{'_ac_generation_func'} = $value;
1511 return $obj->{'_ac_generation_func'};
1515 =head2 _sv_generation_func
1517 Title : _sv_generation_func
1518 Usage : $obj->_sv_generation_func($newval)
1520 Returns : value of _sv_generation_func
1521 Args : newvalue (optional)
1526 sub _sv_generation_func
{
1530 $obj->{'_sv_generation_func'} = $value;
1532 return $obj->{'_sv_generation_func'};
1536 =head2 _kw_generation_func
1538 Title : _kw_generation_func
1539 Usage : $obj->_kw_generation_func($newval)
1541 Returns : value of _kw_generation_func
1542 Args : newvalue (optional)
1547 sub _kw_generation_func
{
1551 $obj->{'_kw_generation_func'} = $value;
1553 return $obj->{'_kw_generation_func'};