2 # BioPerl module for Bio::SeqIO::swiss
4 # Copyright Elia Stupka
6 # You may distribute this module under the same terms as perl itself
8 # POD documentation - main docs before the code
12 Bio::SeqIO::swiss - Swissprot sequence input/output stream
16 It is probably best not to use this object directly, but
17 rather go through the SeqIO handler system:
21 $stream = Bio::SeqIO->new(-file => $filename,
24 while ( my $seq = $stream->next_seq() ) {
25 # do something with $seq
30 This object can transform Bio::Seq objects to and from Swiss-Pprot flat
33 There is a lot of flexibility here about how to dump things which needs
36 =head2 GN (Gene name) line management details
38 A Uniprot/Swiss-Prot entry holds information on one protein
39 sequence. If that sequence is identical across genes and species, they
40 are all merged into one entry. This creates complex needs for several
41 annotation fields in swiss-prot format.
43 The latest syntax for GN line is described in the user manual:
45 http://www.expasy.ch/sprot/userman.html#GN_line
47 Each of the possibly multiple genes in an entry can have Name,
48 Synonyms (only if there is a name), OrderedLocusNames (names from
49 genomic sequences) and ORFNames (temporary or cosmid names). "Name"
50 here really means "symbol". This complexity is now dealt with the
53 A new Bio::AnnotationI class was created in order to store the
54 data in tag-value pairs. This class (Bio::Annotation::TagTree)
55 is stored in the Bio::Annotation::Collection object and is
56 accessed like all other annotations. The tag name is 'gene_name'.
58 There is a single Bio::Annotation::TagTree per sequence record, which
59 corresponds to the original class that stored this data
60 (Bio::Annotation::StructuredValue). Depending on how we progress
61 this may change to represent each group of gene names.
63 For now, to access the gene name tree annotation, one uses the below method:
65 my ($gene) = $seq->annotation->get_Annotations('gene_name');
67 If you are only interested in displaying the values, value() returns a
68 string with similar formatting.
70 There are several ways to get directly at the information you want if you
71 know the element (tag) for the data. For gene names all data is stored with
72 the element-tag pairs:
74 "element1=tag1, tag2, tag3; element2=tag4, tag5;"
76 This normally means the element will be 'Name', 'Synonyms', etc. and the
77 gene names the values. Using findval(), you can do the following:
79 # grab a flattened list of all gene names
80 my @names = $ann->findval('Name');
82 # or iterated through the nodes and grab the name for each group
83 for my $node ($ann->findnode('gene_name')) {
84 my @names = $node->findval('Name');
87 The current method for parsing gene name data (and reconstructing gene name
88 output) is very generic. This is somewhat preemptive if, for instance, UniProt
89 decides to update and add another element name to the current ones using the
90 same formatting layout. Under those circumstances, one can iterate through the
91 tag tree in a safe way and retrieve all node data like so.
93 # retrieve the gene name nodes (groups like names, synonyms, etc).
94 for my $ann ($seq->annotation->get_Annotations('gene_name')) {
96 # each gene name group
97 for my $node ($ann->findnode('gene_name')) {
100 # each gene name node (tag => value pair)
101 for my $n ($node->children) {
102 print "\t".$n->element.": ".$n->children."\n";
107 For more uses see Bio::Annotation::TagTree.
109 Since Uniprot/Swiss-Prot format have been around for quite some time, the
110 parser is also able to read in the older GN line syntax where genes
111 are separated by AND and various symbols by OR. The first symbol is
112 taken to be the 'Name' and the remaining ones are stored as 'Synonyms'.
114 Also, for UniProt output we support using other Bio::AnnotationI, but in this
115 case we only use the stringified version of the annotation. This is to allow for
116 backwards compatibility with code that previously used
117 Bio::Annotation::SimpleValue or other Bio::AnnotationI classes.
119 =head2 Optional functions
125 (output only) shows the dna or not
129 (output only) provides a sorting func which is applied to the FTHelpers
132 =item _id_generation_func()
134 This is function which is called as
136 print "ID ", $func($seq), "\n";
138 To generate the ID line. If it is not there, it generates a sensible ID
139 line using a number of tools.
141 If you want to output annotations in Swissprot format they need to be
142 stored in a Bio::Annotation::Collection object which is accessible
143 through the Bio::SeqI interface method L<annotation()|annotation>.
145 The following are the names of the keys which are polled from a
146 L<Bio::Annotation::Collection> object.
148 reference - Should contain Bio::Annotation::Reference objects
149 comment - Should contain Bio::Annotation::Comment objects
150 dblink - Should contain Bio::Annotation::DBLink objects
151 gene_name - Should contain Bio::Annotation::SimpleValue object
159 User feedback is an integral part of the evolution of this
160 and other Bioperl modules. Send your comments and suggestions,
161 preferably to one of the Bioperl mailing lists.
162 Your participation is much appreciated.
164 bioperl-l@bioperl.org - General discussion
165 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
169 Please direct usage questions or support issues to the mailing list:
171 I<bioperl-l@bioperl.org>
173 rather than to the module maintainer directly. Many experienced and
174 reponsive experts will be able look at the problem and quickly
175 address it. Please include a thorough description of the problem
176 with code and data examples if at all possible.
178 =head2 Reporting Bugs
180 Report bugs to the Bioperl bug tracking system to help us keep track
181 the bugs and their resolution.
182 Bug reports can be submitted via the web:
184 https://github.com/bioperl/bioperl-live/issues
186 =head1 AUTHOR - Elia Stupka
188 Email elia@tll.org.sg
192 The rest of the documentation details each of the object methods.
193 Internal methods are usually preceded with a _
197 # Let the code begin...
199 package Bio
::SeqIO
::swiss
;
200 use vars
qw(@Unknown_names @Unknown_genus);
202 use Bio::SeqIO::FTHelper;
203 use Bio::SeqFeature::Generic;
205 use Bio::Tools::SeqStats;
206 use Bio::Seq::SeqFactory;
207 use Bio::Annotation::Collection;
208 use Bio::Annotation::Comment;
209 use Bio::Annotation::Reference;
210 use Bio::Annotation::DBLink;
211 use Bio::Annotation::SimpleValue;
212 use Bio::Annotation::TagTree;
214 use base qw(Bio::SeqIO);
216 our $LINE_LENGTH = 76;
218 # this is for doing species name parsing
219 @Unknown_names=('other', 'unidentified',
220 'unknown organism', 'not specified',
221 'not shown', 'Unspecified', 'Unknown',
222 'None', 'unclassified', 'unidentified organism',
225 # dictionary of synonyms for taxid 32644
226 # all above can be part of valid species name
227 @Unknown_genus = qw(unknown unclassified uncultured unidentified);
229 # if there are any other gene name tags, they are added to the end
230 our @GENE_NAME_ORDER = qw(Name Synonyms OrderedLocusNames ORFNames);
233 my($self,@args) = @_;
234 $self->SUPER::_initialize
(@args);
235 # hash for functions for decoding keys.
236 $self->{'_func_ftunit_hash'} = {};
237 # sets this to one by default. People can change it
239 if ( ! defined $self->sequence_factory ) {
240 $self->sequence_factory(Bio
::Seq
::SeqFactory
->new
241 (-verbose
=> $self->verbose(),
242 -type
=> 'Bio::Seq::RichSeq'));
249 Usage : $seq = $stream->next_seq()
250 Function: returns the next sequence in the stream
251 Returns : Bio::Seq object
257 my ($self,@args) = @_;
258 my ($pseq,$c,$line,$name,$desc,$acc,$seqc,$mol,$div, $sptr,$seq_div,
259 $date,$comment,@date_arr);
261 my ($annotation, %params, @features) = ( Bio
::Annotation
::Collection
->new());
265 1 while defined($_ = $self->_readline) && /^\s+$/;
266 return unless defined $_ && /^ID\s/;
268 # fixed to allow _DIVISION to be optional for bug #946
269 # see bug report for more information
271 # 9/6/06 Note: Swiss/TrEMBL sequences have no division acc. to UniProt
272 # release notes; this is fixed to simplify the regex parsing
273 # STANDARD (SwissProt) and PRELIMINARY (TrEMBL) added to namespace()
276 (\S
+) \s
+ # $1 entryname
277 ([^\s
;]+); \s
+ # $2 DataClass
278 (?
:PRT
;)? \s
+ # Molecule Type (optional)
279 [0-9]+[ ]AA \
. # Sequencelength (capture?)
282 # I couldn't find any new current UniProt sequences
283 # that matched this format:
284 # || m/^ID\s+(\S+)\s+(_([^\s_]+))? /ox ) {
285 $self->throw("swissprot stream with no ID. Not swissprot in my book");
287 ($name, $seq_div) = ($1, $2);
288 $params{'-namespace'} =
289 ($seq_div eq 'Reviewed' || $seq_div eq 'STANDARD') ?
'Swiss-Prot' :
290 ($seq_div eq 'Unreviewed' || $seq_div eq 'PRELIMINARY') ?
'TrEMBL' :
292 # we shouldn't be setting the division, but for now...
293 my ($junk, $division) = split q
(_
), $name;
294 $params{'-division'} = $division;
295 $params{'-alphabet'} = 'protein';
296 # this is important to have the id for display in e.g. FTHelper, otherwise
297 # you won't know which entry caused an error
298 $params{'-display_id'} = $name;
300 BEFORE_FEATURE_TABLE
:
301 while ( defined($_ = $self->_readline) ) {
302 # Exit at start of Feature table and at the sequence at the
303 # latest HL 05/11/2000
304 last if( /^(FT|SQ)/ );
306 # Description line(s)
307 if (/^DE\s+(\S.*\S)/) {
308 $desc .= $desc ?
" $1" : $1;
311 elsif (/^GN\s+(.*)/) {
312 $genename .= " " if $genename;
316 elsif ( /^AC\s+(.+)/) {
317 my @accs = split(/[; ]+/, $1); # allow space in addition
318 $params{'-accession_number'} = shift @accs
319 unless defined $params{'-accession_number'};
320 push @
{$params{'-secondary_accessions'}}, @accs;
322 #date and sequence version
323 elsif ( /^DT\s+(.*)/ ) {
325 my ($date, $version) = split(' ', $line, 2);
326 $date =~ tr/,//d; # remove comma if new version
327 if ($version =~ /\(Rel\. (\d+), Last sequence update\)/ || # old
328 /sequence version (\d+)/) { #new
329 my $update = Bio
::Annotation
::SimpleValue
->new
330 (-tagname
=> 'seq_update',
333 $annotation->add_Annotation($update);
334 } elsif ($version =~ /\(Rel\. (\d+), Last annotation update\)/ || #old
335 /entry version (\d+)/) { #new
336 $params{'-version'} = $1;
338 push @
{$params{'-dates'}}, $date;
341 elsif ( /^PE\s+(.*)/ ) {
343 $line =~ s/;\s*//; # trim trailing semicolon and any spaces at the end of the line
344 my $evidence = Bio
::Annotation
::SimpleValue
->new
345 (-tagname
=> 'evidence',
348 $annotation->add_Annotation($evidence);
350 # Organism name and phylogenetic information
352 my $species = $self->_read_swissprot_Species($_);
353 $params{'-species'}= $species;
354 # now we are one line ahead -- so continue without reading the next
359 my $refs = $self->_read_swissprot_References($_);
360 foreach my $r (@
$refs) {
361 $annotation->add_Annotation('reference',$r);
365 elsif (/^CC\s{3}(.*)/) {
368 while (defined ($_ = $self->_readline) && /^CC\s{3}(.*)/ ) {
369 $comment .= $1 . "\n";
372 # note: don't try to process comments here -- they may contain
373 # structure. LP 07/30/2000
375 my $commobj = Bio
::Annotation
::Comment
->new(-tagname
=> 'comment',
377 $annotation->add_Annotation('comment',$commobj);
379 $self->_pushback($_);
383 # /^DR\s+(\S+)\;\s+(\S+)\;\s+(\S+)[\;\.](.*)$/) {
384 # new regexp from Andreas Kahari bug #1584
385 elsif (/^DR\s+(\S+)\;\s+(\S+)\;\s+([^;]+)[\;\.](.*)$/) {
386 my ($database,$primaryid,$optional,$comment) = ($1,$2,$3,$4);
388 # drop leading and training spaces and trailing .
389 $comment =~ s/\.\s*$//;
390 $comment =~ s/^\s+//;
392 my $dblinkobj = Bio
::Annotation
::DBLink
->new
393 (-database
=> $database,
394 -primary_id
=> $primaryid,
395 -optional_id
=> $optional,
396 -comment
=> $comment,
397 -tagname
=> 'dblink',
400 $annotation->add_Annotation('dblink',$dblinkobj);
403 elsif ( /^KW\s+(.*)$/ ) {
404 my @kw = split(/\s*\;\s*/,$1);
405 defined $kw[-1] && $kw[-1] =~ s/\.$//;
406 push @
{$params{'-keywords'}}, @kw;
409 # process and parse the gene name line if there was one (note: we
410 # can't do this above b/c GN may be multi-line and we can't
411 # unequivocally determine whether we've seen the last GN line in
415 if ($genename =~ /\w=\w/) {
416 # new format (e.g., Name=RCHY1; Synonyms=ZNF363, CHIMP)
417 for my $n (split(m{\s+and\s+},$genename)) {
419 for my $section (split(m{\s*;\s*},$n)) {
420 my ($tag, $rest) = split("=",$section);
422 for my $val (split(m{\s*,\s*},$rest)) {
423 push @genenames, [$tag => $val];
426 push @stags, ['gene_name' => \
@genenames];
430 for my $section (split(/ AND /, $genename)) {
432 $section =~ s/[\(\)\.]//g;
433 my @names = split(m{\s+OR\s+}, $section);
434 push @genenames, ['Name' => shift @names];
435 push @genenames, map {['Synonyms' => $_]} @names;
436 push @stags, ['gene_name' => \
@genenames]
438 } #use Data::Dumper; print Dumper $gn, $genename;# exit;
439 my $gn = Bio
::Annotation
::TagTree
->new(-tagname
=> 'gene_name',
440 -value
=> ['gene_names' => \
@stags]);
441 $annotation->add_Annotation('gene_name', $gn);
445 # if there is no feature table, or if we've got beyond, exit loop or don't
446 # even enter HL 05/11/2000
447 while (defined $_ && /^FT/ ) {
448 my $ftunit = $self->_read_FTHelper_swissprot($_);
451 # when parsing of the line fails we get undef returned
454 $ftunit->_generic_seqfeature($self->location_factory(),
455 $params{'-seqid'}, "SwissProt"));
457 $self->warn("failed to parse feature table line for seq " .
458 $params{'-display_id'}. "\n$_");
460 $_ = $self->_readline;
462 while ( defined($_) && ! /^SQ/ ) {
463 $_ = $self->_readline;
466 while ( defined ($_ = $self->_readline) ) {
472 my $seq= $self->sequence_factory->create
473 (-verbose
=> $self->verbose,
477 -features
=> \
@features,
478 -annotation
=> $annotation,
481 # The annotation doesn't get added by the constructor
482 $seq->annotation($annotation);
490 Usage : $stream->write_seq($seq)
491 Function: writes the $seq object (must be seq) to the stream
492 Returns : 1 for success and 0 for error
493 Args : array of 1 to n Bio::SeqI objects
499 my ($self,@seqs) = @_;
500 foreach my $seq ( @seqs ) {
501 $self->throw("Attempting to write with no seq!") unless defined $seq;
503 if ( ! ref $seq || ! $seq->isa('Bio::SeqI') ) {
504 $self->warn(" $seq is not a SeqI compliant module. Attempting to dump, but may fail!");
511 my $ns = ($seq->can('namespace')) && $seq->namespace();
512 my $len = $seq->length();
514 if ( !$seq->can('division') || ! defined ($div = $seq->division()) ) {
518 # namespace dictates database, takes precedent over division. Sorry!
519 if (defined($ns) && $ns ne '') {
520 $div = ($ns eq 'Swiss-Prot') ?
'Reviewed' :
521 ($ns eq 'TrEMBL') ?
'Unreviewed' :
525 # division not reset; acts as fallback
528 $self->warn("No whitespace allowed in SWISS-PROT display id [". $seq->display_id. "]")
529 if $seq->display_id =~ /\s/;
532 if ( $self->_id_generation_func ) {
533 $temp_line = &{$self->_id_generation_func}($seq);
535 #$temp_line = sprintf ("%10s STANDARD; %3s; %d AA.",
536 # $seq->primary_id()."_".$div,$mol,$len);
537 # Reconstructing the ID relies heavily upon the input source having
538 # been in a format that is parsed as this routine expects it -- that is,
539 # by this module itself. This is bad, I think, and immediately breaks
540 # if e.g. the Bio::DB::GenPept module is used as input.
541 # Hence, switch to display_id(); _every_ sequence is supposed to have
542 # this. HL 2000/09/03
543 # Changed to reflect ID line changes in UniProt
544 # Oct 2006 - removal of molecule type - see bug 2134
545 $temp_line = sprintf ("%-24s%-12s%9d AA.",
546 $seq->display_id(), $div.';', $len);
549 $self->_print( "ID $temp_line\n");
551 # if there, write the accession line
552 local($^W
) = 0; # suppressing warnings about uninitialized fields
554 if ( $self->_ac_generation_func ) {
555 $temp_line = &{$self->_ac_generation_func}($seq);
556 $self->_print( "AC $temp_line\n");
558 elsif ($seq->can('accession_number') ) {
559 my $ac_line = $seq->accession_number;
560 if ($seq->can('get_secondary_accessions') ) {
561 foreach my $sacc ($seq->get_secondary_accessions) {
562 $ac_line .= "; ". $sacc;;
567 $self->_write_line_swissprot_regex("AC ","AC ",$ac_line,
568 "\\s\+\|\$",$LINE_LENGTH);
570 # otherwise - cannot print <sigh>
573 # Date lines and sequence versions (changed 6/15/2006)
574 # This is rebuilt from scratch using the current SwissProt/UniProt format
575 if ( $seq->can('get_dates') ) {
576 my @dates = $seq->get_dates();
578 my $seq_version = $seq->version;
579 my ($update_version) = $seq->annotation->get_Annotations("seq_update");
580 foreach my $dt (@dates) {
581 $self->_write_line_swissprot_regex("DT ","DT ",
582 $dt.', integrated into UniProtKB/'.$ns.'.',
583 "\\s\+\|\$",$LINE_LENGTH) if $ct == 1;
584 $self->_write_line_swissprot_regex("DT ","DT ",
585 $dt.", sequence version ".$update_version->display_text.'.',
586 "\\s\+\|\$",$LINE_LENGTH) if $ct == 2;
587 $self->_write_line_swissprot_regex("DT ","DT ",
588 $dt.", entry version $seq_version.",
589 "\\s\+\|\$",$LINE_LENGTH) if $ct == 3;
595 $self->_write_line_swissprot_regex("DE ","DE ",$seq->desc(),"\\s\+\|\$",$LINE_LENGTH);
597 #Gene name; print out new format
598 foreach my $gene ( my @genes = $seq->annotation->get_Annotations('gene_name') ) {
599 # gene is a Bio::Annotation::TagTree;
600 if ($gene->isa('Bio::Annotation::TagTree')) {
602 for my $node ($gene->findnode('gene_name')) {
603 # check for Name and Synonym first, then the rest get tagged on
604 my $geneline = "GN ";
605 my %genedata = $node->hash;
606 for my $tag (@GENE_NAME_ORDER) {
607 if (exists $genedata{$tag}) {
608 $geneline .= (ref $genedata{$tag} eq 'ARRAY') ?
609 "$tag=".join(', ',@
{$genedata{$tag}})."; " :
610 "$tag=$genedata{$tag}; ";
611 delete $genedata{$tag};
615 for my $tag (sort keys %genedata) {
616 $geneline .= (ref $genedata{$tag} eq 'ARRAY') ?
617 "$tag=".join(', ',@
{$genedata{$tag}})."; " :
618 "$tag=$genedata{$tag}; ";
619 delete $genedata{$tag};
621 push @genelines, "$geneline\n";
623 $self->_print(join("GN and\n",@genelines));
624 } else { # fall back to getting stringified output
625 $self->_write_line_swissprot_regex("GN ","GN ",
633 if ($seq->can('species') && (my $spec = $seq->species)) {
634 my @class = $spec->classification();
636 my $species = $spec->species;
637 my $genus = $spec->genus;
638 my $OS = $spec->scientific_name;
639 if ($class[-1] =~ /viruses/i) {
641 $OS .= " ". $spec->sub_species if $spec->sub_species;
643 foreach (($spec->variant, $spec->common_name)) {
644 $OS .= " ($_)" if $_;
646 $self->_print( "OS $OS.\n");
647 my $OC = join('; ', reverse(@class)) .'.';
648 $self->_write_line_swissprot_regex("OC ","OC ",$OC,"\; \|\$",$LINE_LENGTH);
649 if ($spec->organelle) {
650 $self->_write_line_swissprot_regex("OG ","OG ",$spec->organelle,"\; \|\$",$LINE_LENGTH);
652 if ($spec->ncbi_taxid) {
653 $self->_print("OX NCBI_TaxID=".$spec->ncbi_taxid.";\n");
659 foreach my $ref ( $seq->annotation->get_Annotations('reference') ) {
660 $self->_print( "RN [$t]\n");
661 # changed by lorenz 08/03/00
662 # j.gilbert and h.lapp agreed that the rp line in swissprot seems
663 # more like a comment than a parseable value, so print it as is
665 $self->_write_line_swissprot_regex("RP ","RP ",$ref->rp,
666 "\\s\+\|\$",$LINE_LENGTH);
669 $self->_write_line_swissprot_regex("RC ","RC ",$ref->comment,
670 "\\s\+\|\$",$LINE_LENGTH);
672 if ($ref->medline or $ref->pubmed or $ref->doi) {
673 # new RX format in swissprot LP 09/17/00
674 # RX line can now have a DOI, Heikki 13 Feb 2008
677 $line .= "MEDLINE=". $ref->medline. '; ' if $ref->medline;
678 $line .= "PubMed=". $ref->pubmed. '; ' if $ref->pubmed;
679 $line .= "DOI=". $ref->doi. '; ' if $ref->doi;
682 $self->_write_line_swissprot_regex("RX ","RX ",
684 "\\s\+\|\$",$LINE_LENGTH);
687 my $author = $ref->authors .';' if($ref->authors);
688 my $title = $ref->title .';' if( $ref->title);
689 my $rg = $ref->rg . ';' if $ref->rg;
690 $author =~ s/([\w\.]) (\w)/$1#$2/g; # add word wrap protection char '#'
692 $self->_write_line_swissprot_regex("RG ","RG ",$rg,"\\s\+\|\$",$LINE_LENGTH) if $rg;
693 $self->_write_line_swissprot_regex("RA ","RA ",$author,"\\s\+\|\$",$LINE_LENGTH) if $author;
694 $self->_write_line_swissprot_regex("RT ","RT ",$title,'[\s\-]+|$',$LINE_LENGTH) if $title;
695 $self->_write_line_swissprot_regex("RL ","RL ",$ref->location,"\\s\+\|\$",$LINE_LENGTH);
701 foreach my $comment ( $seq->annotation->get_Annotations('comment') ) {
702 foreach my $cline (split ("\n", $comment->text)) {
703 while (length $cline > 74) {
704 $self->_print("CC ",(substr $cline,0,74),"\n");
705 $cline = substr $cline,74;
707 $self->_print("CC ",$cline,"\n");
711 # Database xref lines
713 foreach my $dblink ( $seq->annotation->get_Annotations('dblink') ) {
714 my ($primary_id) = $dblink->primary_id;
716 if (defined($dblink->comment) && ($dblink->comment) ) {
717 $self->_print("DR ",$dblink->database,"; ",$primary_id,"; ",
718 $dblink->optional_id,"; ",$dblink->comment,".\n");
719 } elsif ($dblink->optional_id) {
720 $self->_print("DR ",$dblink->database,"; ",
722 $dblink->optional_id,".\n");
724 $self->_print("DR ",$dblink->database,
725 "; ",$primary_id,"; ","-.\n");
731 foreach my $evidence ( $seq->annotation->get_Annotations('evidence') ) {
732 $self->_print("PE ",$evidence->value,";\n");
735 # if there, write the kw line
738 if ( my $func = $self->_kw_generation_func ) {
739 $kw = &{$func}($seq);
740 } elsif ( $seq->can('keywords') ) {
741 $kw = $seq->keywords;
742 if ( ref($kw) =~ /ARRAY/i ) {
743 $kw = join("; ", @
$kw);
745 $kw .= '.' if $kw and $kw !~ /\.$/ ;
747 $kw =~ s/([\w\.]) (\w)/$1#$2/g; # add word wrap protection char '#'
748 $self->_write_line_swissprot_regex("KW ","KW ",
749 $kw, "\\s\+\|\$",$LINE_LENGTH)
753 #Check if there is seqfeatures before printing the FT line
754 my @feats = $seq->can('top_SeqFeatures') ?
$seq->top_SeqFeatures : ();
756 if ( defined $self->_post_sort ) {
758 # we need to read things into an array. Process. Sort them. Print 'em
760 my $post_sort_func = $self->_post_sort();
763 foreach my $sf ( @feats ) {
764 push(@fth,Bio
::SeqIO
::FTHelper
::from_SeqFeature
($sf,$seq));
766 @fth = sort { &$post_sort_func($a,$b) } @fth;
768 foreach my $fth ( @fth ) {
769 $self->_print_swissprot_FTHelper($fth);
772 # not post sorted. And so we can print as we get them.
773 # lower memory load...
775 foreach my $sf ( @feats ) {
776 my @fth = Bio
::SeqIO
::FTHelper
::from_SeqFeature
($sf,$seq);
777 foreach my $fth ( @fth ) {
778 if ( ! $fth->isa('Bio::SeqIO::FTHelper') ) {
779 $sf->throw("Cannot process FTHelper... $fth");
781 $self->_print_swissprot_FTHelper($fth);
786 if ( $self->_show_dna() == 0 ) {
790 # finished printing features.
793 my $mw = ${Bio
::Tools
::SeqStats
->get_mol_wt($seq->primary_seq)}[0];
795 # was crc32 checksum, changed it to crc64
796 my $crc64 = $self->_crc64(\
$str);
797 $self->_print( sprintf("SQ SEQUENCE %4d AA; %d MW; %16s CRC64;\n",
801 for ($i = 0; $i < length($str); $i += 10) {
802 $self->_print( " ", substr($str,$i,10));
804 if ( ($i+10)%60 == 0 && (($i+10) < length($str))) {
805 $self->_print( "\n ");
808 $self->_print( "\n//\n");
810 $self->flush if $self->_flush_on_write && defined $self->_fh;
815 # Thanks to James Gilbert for the following two. LP 08/01/2000
817 =head2 _generateCRCTable
819 Title : _generateCRCTable
829 sub _generateCRCTable
{
830 # 10001000001010010010001110000100
832 my $poly = 0xEDB88320;
835 $self->{'_crcTable'} = [];
836 foreach my $i (0..255) {
838 for (my $j=8; $j > 0; $j--) {
840 $crc = ($crc >> 1) ^ $poly;
845 ${$self->{'_crcTable'}}[$i] = $crc;
863 my( $self, $str ) = @_;
865 $self->throw("Argument to crc32() must be ref to scalar")
866 unless ref($str) eq 'SCALAR';
868 $self->_generateCRCTable() unless exists $self->{'_crcTable'};
870 my $len = length($$str);
872 my $crc = 0xFFFFFFFF;
873 for (my $i = 0; $i < $len; $i++) {
874 # Get upper case value of each letter
875 my $int = ord uc substr $$str, $i, 1;
876 $crc = (($crc >> 8) & 0x00FFFFFF) ^
877 ${$self->{'_crcTable'}}[ ($crc ^ $int) & 0xFF ];
895 my ($self, $sequence) = @_;
896 my $POLY64REVh = 0xd8000000;
901 my $seq = $$sequence;
907 for (my $i=0; $i<256; $i++) {
910 for (my $j=0; $j<8; $j++) {
911 my $rflag = $partl & 1;
913 $partl |= (1 << 31) if $parth & 1;
915 $parth ^= $POLY64REVh if $rflag;
917 $CRCTableh[$i] = $parth;
918 $CRCTablel[$i] = $partl;
922 foreach (split '', $seq) {
923 my $shr = ($crch & 0xFF) << 24;
924 my $temp1h = $crch >> 8;
925 my $temp1l = ($crcl >> 8) | $shr;
926 my $tableindex = ($crcl ^ (unpack "C", $_)) & 0xFF;
927 $crch = $temp1h ^ $CRCTableh[$tableindex];
928 $crcl = $temp1l ^ $CRCTablel[$tableindex];
930 my $crc64 = sprintf("%08X%08X", $crch, $crcl);
934 =head2 _print_swissprot_FTHelper
936 Title : _print_swissprot_FTHelper
946 sub _print_swissprot_FTHelper
{
947 my ($self,$fth,$always_quote) = @_;
949 my ($start,$end) = ('?', '?');
951 if ( ! ref $fth || ! $fth->isa('Bio::SeqIO::FTHelper') ) {
952 $fth->warn("$fth is not a FTHelper class. ".
953 "Attempting to print, but there could be tears!");
957 for my $tag ( qw(description gene note product) ) {
958 if ( exists $fth->field->{$tag} ) {
959 $desc = @
{$fth->field->{$tag}}[0].".";
966 if ( exists $fth->field->{'FTId'} ) {
967 $ftid = @
{$fth->field->{'FTId'}}[0]. '.';
970 my $key =substr($fth->key,0,8);
972 if ( $loc =~ /(\?|\d+|\>\d+|<\d+)?\.\.(\?|\d+|<\d+|>\d+)?/ ) {
973 $start = $1 if defined $1;
974 $end = $2 if defined $2;
976 # to_FTString only returns one value when start == end, #JB955
977 # so if no match is found, assume it is both start and end #JB955
978 } elsif ( $loc =~ /join\((\d+)((?:,\d+)+)?\)/) {
980 if ( defined( my $m = $2) ) {
982 push @y, split(/,/,$m);
985 $self->_write_line_swissprot_regex(
986 sprintf("FT %-8s %6s %6s ",
990 $desc.'.','\s+|$',$LINE_LENGTH);
994 $start = $end = $fth->loc;
997 $self->_write_line_swissprot_regex(sprintf("FT %-8s %6s %6s ",
1001 $desc. '.', '\s+|$', $LINE_LENGTH);
1002 } else { #HELIX and STRAND do not have descriptions
1003 $self->_write_line_swissprot_regex(sprintf("FT %-8s %6s %6s",
1007 ' ', '\s+|$', $LINE_LENGTH);
1012 $self->_write_line_swissprot_regex("FT ",
1014 "/FTId=$ftid",'.|$',$LINE_LENGTH);
1021 =head2 _read_swissprot_References
1023 Title : _read_swissprot_References
1025 Function: Reads references from swissprot format. Internal function really
1033 sub _read_swissprot_References
{
1034 my ($self,$line) = @_;
1035 my ($b1, $b2, $rp, $rg, $title, $loc, $au, $med, $com, $pubmed, $doi);
1038 while ( defined $_ ) {
1039 if ( /^[^R]/ || /^RN/ ) {
1041 $rg =~ s/;\s*$//g if defined($rg);
1047 $title =~ s/;\s*$//g if defined($title);
1048 push @refs, Bio
::Annotation
::Reference
->new
1060 -tagname
=> 'reference',
1062 # reset state for the next reference
1065 if (index($_,'R') != 0) {
1066 $self->_pushback($_); # want this line to go back on the list
1067 last; # may be the safest exit point HL 05/11/2000
1069 # don't forget to reset the state for the next reference
1070 $b1 = $b2 = $rg = $med = $com = $pubmed = $doi = undef;
1071 $title = $loc = $au = undef;
1072 } elsif ( /^RP\s{3}(.+? OF (\d+)-(\d+).*)/) {
1076 } elsif ( /^RP\s{3}(.*)/) {
1082 } elsif (/^RX\s{3}(.*)/) { # each reference can have only one RX line
1084 $med = $1 if $line =~ /MEDLINE=(\d+);/;
1085 $pubmed = $1 if $line =~ /PubMed=(\d+);/;
1086 $doi = $1 if $line =~ /DOI=(.+);/;
1087 } elsif ( /^RA\s{3}(.*)/ ) {
1088 $au .= $au ?
" $1" : $1;
1089 } elsif ( /^RG\s{3}(.*)/ ) {
1090 $rg .= $rg ?
" $1" : $1;
1091 } elsif ( /^RT\s{3}(.*)/ ) {
1094 $title .= ($title =~ /[\w;,:\?!]$/) ?
" $tline" : $tline;
1098 } elsif (/^RL\s{3}(.*)/ ) {
1099 $loc .= $loc ?
" $1" : $1;
1100 } elsif ( /^RC\s{3}(.*)/ ) {
1101 $com .= $com ?
" $1" : $1;
1103 $_ = $self->_readline;
1109 =head2 _read_swissprot_Species
1111 Title : _read_swissprot_Species
1113 Function: Reads the swissprot Organism species and classification
1115 Able to deal with unconventional species names.
1116 Example : OS Unknown prokaryotic organism
1117 $genus = undef ; $species = Unknown prokaryotic organism
1118 Returns : A Bio::Species object
1123 sub _read_swissprot_Species
{
1124 my( $self,$line ) = @_;
1128 my( $sub_species, $species, $genus, $common, $variant, $ncbi_taxid, $sci_name, $class_lines, $descr );
1130 my $do_genus_check = 1;
1131 while ( defined $_ ) {
1132 last unless /^O[SCGX]/;
1133 # believe it or not, but OS may come multiple times -- at this time
1134 # we can't capture multiple species
1135 if (/^OS\s+(\S.+)/ && (! defined($sci_name))) {
1136 $osline .= " " if $osline;
1138 if ($osline =~ s/(,|, and|\.)$//) {
1139 # OS lines are usually like:
1140 # Homo sapiens (human)
1141 # where we have $sci_name followed by $descr (common name) in
1142 # brackets, but we can also have:
1143 # Venerupis (Ruditapes) philippinarum
1144 # where we have brackets but they don't indicate a $descr
1145 if ($osline =~ /[^\(\)]+\(.+\)[^\(\)]+$/) {
1146 #*** Danger! no idea if this will pick up some syntaxes for
1147 # common names as well)
1148 $sci_name = $osline;
1149 $sci_name =~ s/\.$//;
1151 $do_genus_check = 0;
1153 ($sci_name, $descr) = $osline =~ /(\S[^\(]+)(.*)/;
1155 $sci_name =~ s/\s+$//;
1157 while ($descr =~ /\(([^\)]+)\)/g) {
1159 # strain etc may not necessarily come first (yes, swissprot
1161 if ((! defined($variant)) &&
1162 (($item =~ /(^|[^\(\w])([Ss]train|isolate|serogroup|serotype|subtype|clone)\b/) ||
1163 ($item =~ /^(biovar|pv\.|type\s+)/))) {
1165 } elsif ($item =~ s/^subsp\.\s+//) {
1166 if (! $sub_species) {
1167 $sub_species = $item;
1168 } elsif (! $variant) {
1171 } elsif (! defined($common)) {
1172 # we're only interested in the first common name
1174 if ((index($common, '(') >= 0) &&
1175 (index($common, ')') < 0)) {
1181 } elsif (s/^OC\s+(\S.+)$//) {
1183 } elsif (/^OG\s+(.*)/) {
1185 } elsif (/^OX\s+(.*)/ && (! defined($ncbi_taxid))) {
1187 # we only keep the first one and ignore all others
1188 if ($taxstring =~ /NCBI_TaxID=([\w\d]+)/) {
1191 $self->throw("$taxstring doesn't look like NCBI_TaxID");
1194 $_ = $self->_readline;
1196 $self->_pushback($_); # pushback the last line because we need it
1198 $sci_name || return;
1200 # if the organism belongs to taxid 32644 then no Bio::Species object.
1201 return if grep { $_ eq $sci_name } @Unknown_names;
1203 # Convert data in classification lines into classification array.
1204 # Remove trailing . then split on ';' or '.;' so that classification that is 2
1205 # or more words will still get matched, use map() to remove trailing/leading/intervening
1207 $class_lines=~s/\.\s*$//;
1208 my @class = map { s/^\s+//; s/\s+$//; s/\s{2,}/ /g; $_; } split /[;\.]*;/, $class_lines;
1210 if ($class[0] =~ /viruses/i) {
1211 # viruses have different OS/OC syntax
1212 my @virusnames = split(/\s+/, $sci_name);
1213 $species = (@virusnames > 1) ?
pop(@virusnames) : '';
1214 $genus = join(" ", @virusnames);
1215 $sub_species = $descr;
1216 } elsif ($do_genus_check) {
1217 # do we have a genus?
1218 my $possible_genus = $class[-1];
1219 $possible_genus .= "|$class[-2]" if $class[-2];
1220 if ($sci_name =~ /^($possible_genus)/) {
1222 ($species) = $sci_name =~ /^$genus\s+(.+)/;
1224 $species = $sci_name;
1226 # is this organism of rank species or is it lower?
1227 # (doesn't catch everything, but at least the guess isn't dangerous)
1228 if ($species && $species =~ /subsp\.|var\./) {
1229 ($species, $sub_species) = $species =~ /(.+)\s+((?:subsp\.|var\.).+)/;
1233 # Bio::Species array needs array in Species -> Kingdom direction
1234 unless ($class[-1] eq $sci_name) {
1235 push(@class, $sci_name);
1237 @class = reverse @class;
1239 my $taxon = Bio
::Species
->new();
1240 $taxon->scientific_name($sci_name);
1241 $taxon->classification(@class);
1242 $taxon->common_name($common) if $common;
1243 $taxon->sub_species($sub_species) if $sub_species;
1244 $taxon->organelle($org) if $org;
1245 $taxon->ncbi_taxid($ncbi_taxid) if $ncbi_taxid;
1246 $taxon->variant($variant) if $variant;
1255 Usage : $obj->_filehandle($newval)
1258 Returns : value of _filehandle
1259 Args : newvalue (optional)
1264 # inherited from SeqIO.pm ! HL 05/11/2000
1266 =head2 _read_FTHelper_swissprot
1268 Title : _read_FTHelper_swissprot
1269 Usage : _read_FTHelper_swissprot(\$buffer)
1270 Function: reads the next FT key line
1272 Returns : Bio::SeqIO::FTHelper object
1278 sub _read_FTHelper_swissprot
{
1279 my ($self,$line ) = @_;
1280 # initial version implemented by HL 05/10/2000
1281 # FIXME this may not be perfect, so please review
1282 # lots of cleaning up by JES 2004/07/01, still may not be perfect =)
1283 # FTId now sepated from description as a qualifier
1286 my ($key, # The key of the feature
1287 $loc, # The location line from the feature
1288 $desc, # The descriptive text
1289 $ftid, # feature Id is like a qualifier but there can be only one of them
1291 if ( m/^FT\s{3}(\w+)\s+([\d\?\<]+)\s+([\d\?\>]+)\s*(.*)$/ox) {
1295 $loc = "$loc1..$loc2";
1296 if ($4 && (length($4) > 0)) {
1304 while ( defined($_ = $self->_readline) && /^FT\s{20,}(\S.*)$/ ) {
1305 my $continuation_line = $1;
1306 if ( $continuation_line =~ /.FTId=(.*)\./ ) {
1310 $desc .= " $continuation_line";
1312 $desc = $continuation_line;
1316 $self->_pushback($_);
1318 # No feature key. What's this?
1319 $self->warn("No feature key in putative feature table line: $line");
1323 # Make the new FTHelper object
1324 my $out = Bio
::SeqIO
::FTHelper
->new(-verbose
=> $self->verbose());
1328 # store the description if there is one
1329 if ( $desc && length($desc) ) {
1331 push(@
{$out->field->{"description"}}, $desc);
1333 # Store the qualifier i.e. FTId
1335 push(@
{$out->field->{"FTId"}}, $ftid);
1341 =head2 _write_line_swissprot
1343 Title : _write_line_swissprot
1345 Function: internal function
1353 sub _write_line_swissprot
{
1354 my ($self,$pre1,$pre2,$line,$length) = @_;
1356 $length || $self->throw( "Miscalled write_line_swissprot without length. Programming error!");
1357 my $subl = $length - length $pre2;
1358 my $linel = length $line;
1361 my $sub = substr($line,0,$length - length $pre1);
1363 $self->_print( "$pre1$sub\n");
1365 for ($i= ($length - length $pre1);$i < $linel;) {
1366 $sub = substr($line,$i,($subl));
1367 $self->_print( "$pre2$sub\n");
1373 =head2 _write_line_swissprot_regex
1375 Title : _write_line_swissprot_regex
1377 Function: internal function for writing lines of specified
1378 length, with different first and the next line
1379 left hand headers and split at specific points in the
1383 Args : file handle, first header, second header, text-line, regex for line breaks, total line length
1388 sub _write_line_swissprot_regex
{
1389 my ($self,$pre1,$pre2,$line,$regex,$length) = @_;
1391 #print STDOUT "Going to print with $line!\n";
1393 $length || $self->throw( "Miscalled write_line_swissprot without length. Programming error!");
1395 if ( length $pre1 != length $pre2 ) {
1396 $self->warn( "len 1 is ". length ($pre1) . " len 2 is ". length ($pre2) . "\n");
1397 $self->throw( "Programming error - cannot called write_line_swissprot_regex with different length \npre1 ($pre1) and \npre2 ($pre2) tags!");
1400 my $subl = $length - (length $pre1) -1 ;
1403 while ($line =~ m/(.{1,$subl})($regex)/g) {
1405 $s =~ s/([\w\.])#(\w)/$1 $2/g # remove word wrap protection char '#'
1406 if $pre1 eq "RA " or $pre1 eq "KW ";
1407 # remove annoying extra spaces at the end of the wrapped lines
1408 substr($s, -1, 1, '') if substr($s, -1, 1) eq ' ';
1410 $self->_print( "$pre1$s\n");
1413 $self->_print( "$pre2$s\n");
1422 Usage : $obj->_post_sort($newval)
1424 Returns : value of _post_sort
1425 Args : newvalue (optional)
1434 $obj->{'_post_sort'} = $value;
1436 return $obj->{'_post_sort'};
1443 Usage : $obj->_show_dna($newval)
1445 Returns : value of _show_dna
1446 Args : newvalue (optional)
1455 $obj->{'_show_dna'} = $value;
1457 return $obj->{'_show_dna'};
1461 =head2 _id_generation_func
1463 Title : _id_generation_func
1464 Usage : $obj->_id_generation_func($newval)
1466 Returns : value of _id_generation_func
1467 Args : newvalue (optional)
1472 sub _id_generation_func
{
1476 $obj->{'_id_generation_func'} = $value;
1478 return $obj->{'_id_generation_func'};
1482 =head2 _ac_generation_func
1484 Title : _ac_generation_func
1485 Usage : $obj->_ac_generation_func($newval)
1487 Returns : value of _ac_generation_func
1488 Args : newvalue (optional)
1493 sub _ac_generation_func
{
1497 $obj->{'_ac_generation_func'} = $value;
1499 return $obj->{'_ac_generation_func'};
1503 =head2 _sv_generation_func
1505 Title : _sv_generation_func
1506 Usage : $obj->_sv_generation_func($newval)
1508 Returns : value of _sv_generation_func
1509 Args : newvalue (optional)
1514 sub _sv_generation_func
{
1518 $obj->{'_sv_generation_func'} = $value;
1520 return $obj->{'_sv_generation_func'};
1524 =head2 _kw_generation_func
1526 Title : _kw_generation_func
1527 Usage : $obj->_kw_generation_func($newval)
1529 Returns : value of _kw_generation_func
1530 Args : newvalue (optional)
1535 sub _kw_generation_func
{
1539 $obj->{'_kw_generation_func'} = $value;
1541 return $obj->{'_kw_generation_func'};