Bio::Tools::CodonTable::is_start_codon: check in case of ambiguous codons (#266)
[bioperl-live.git] / lib / Bio / SeqIO / genbank.pm
blob9f9926c3ada82ac1d3fd3af1882c8d7c2a501d86
2 # BioPerl module for Bio::SeqIO::genbank
4 # Please direct questions and support issues to <bioperl-l@bioperl.org>
6 # Cared for by Bioperl project bioperl-l(at)bioperl.org
8 # Copyright Elia Stupka and contributors see AUTHORS section
10 # You may distribute this module under the same terms as perl itself
12 # POD documentation - main docs before the code
14 =head1 NAME
16 Bio::SeqIO::genbank - GenBank sequence input/output stream
18 =head1 SYNOPSIS
20 It is probably best not to use this object directly, but
21 rather go through the SeqIO handler:
23 $stream = Bio::SeqIO->new(-file => $filename,
24 -format => 'GenBank');
26 while ( my $seq = $stream->next_seq ) {
27 # do something with $seq
31 =head1 DESCRIPTION
33 This object can transform Bio::Seq objects to and from GenBank flat
34 file databases.
36 There is some flexibility here about how to write GenBank output
37 that is not fully documented.
39 =head2 Optional functions
41 =over 3
43 =item _show_dna()
45 (output only) shows the dna or not
47 =item _post_sort()
49 (output only) provides a sorting func which is applied to the FTHelpers
50 before printing
52 =item _id_generation_func()
54 This is function which is called as
56 print "ID ", $func($seq), "\n";
58 To generate the ID line. If it is not there, it generates a sensible ID
59 line using a number of tools.
61 If you want to output annotations in Genbank format they need to be
62 stored in a Bio::Annotation::Collection object which is accessible
63 through the Bio::SeqI interface method L<annotation()|annotation>.
65 The following are the names of the keys which are pulled from a
66 L<Bio::Annotation::Collection> object:
68 reference - Should contain Bio::Annotation::Reference objects
69 comment - Should contain Bio::Annotation::Comment objects
70 dblink - Should contain a Bio::Annotation::DBLink object
71 segment - Should contain a Bio::Annotation::SimpleValue object
72 origin - Should contain a Bio::Annotation::SimpleValue object
73 wgs - Should contain a Bio::Annotation::SimpleValue object
75 =back
77 =head1 Where does the data go?
79 Data parsed in Bio::SeqIO::genbank is stored in a variety of data
80 fields in the sequence object that is returned. Here is a partial list
81 of fields.
83 Items listed as RichSeq or Seq or PrimarySeq and then NAME() tell you
84 the top level object which defines a function called NAME() which
85 stores this information.
87 Items listed as Annotation 'NAME' tell you the data is stored the
88 associated Bio::AnnotationCollectionI object which is associated with
89 Bio::Seq objects. If it is explicitly requested that no annotations
90 should be stored when parsing a record of course they will not be
91 available when you try and get them. If you are having this problem
92 look at the type of SeqBuilder that is being used to construct your
93 sequence object.
95 Comments Annotation 'comment'
96 References Annotation 'reference'
97 Segment Annotation 'segment'
98 Origin Annotation 'origin'
99 Dbsource Annotation 'dblink'
101 Accessions PrimarySeq accession_number()
102 Secondary accessions RichSeq get_secondary_accessions()
103 GI number PrimarySeq primary_id()
104 LOCUS PrimarySeq display_id()
105 Keywords RichSeq get_keywords()
106 Dates RichSeq get_dates()
107 Molecule RichSeq molecule()
108 Seq Version RichSeq seq_version()
109 PID RichSeq pid()
110 Division RichSeq division()
111 Features Seq get_SeqFeatures()
112 Alphabet PrimarySeq alphabet()
113 Definition PrimarySeq description() or desc()
114 Version PrimarySeq version()
116 Sequence PrimarySeq seq()
118 There is more information in the Feature-Annotation HOWTO about each
119 field and how it is mapped to the Sequence object
120 L<http://bioperl.org/howtos/Features_and_Annotations_HOWTO.html>.
122 =head1 FEEDBACK
124 =head2 Mailing Lists
126 User feedback is an integral part of the evolution of this and other
127 Bioperl modules. Send your comments and suggestions preferably to one
128 of the Bioperl mailing lists. Your participation is much appreciated.
130 bioperl-l@bioperl.org - General discussion
131 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
133 =head2 Support
135 Please direct usage questions or support issues to the mailing list:
137 I<bioperl-l@bioperl.org>
139 rather than to the module maintainer directly. Many experienced and
140 reponsive experts will be able look at the problem and quickly
141 address it. Please include a thorough description of the problem
142 with code and data examples if at all possible.
144 =head2 Reporting Bugs
146 Report bugs to the Bioperl bug tracking system to help us keep track
147 the bugs and their resolution. Bug reports can be submitted via the web:
149 https://github.com/bioperl/bioperl-live/issues
151 =head1 AUTHOR - Bioperl Project
153 bioperl-l at bioperl.org
155 Original author Elia Stupka, elia -at- tigem.it
157 =head1 CONTRIBUTORS
159 Ewan Birney birney at ebi.ac.uk
160 Jason Stajich jason at bioperl.org
161 Chris Mungall cjm at fruitfly.bdgp.berkeley.edu
162 Lincoln Stein lstein at cshl.org
163 Heikki Lehvaslaiho, heikki at ebi.ac.uk
164 Hilmar Lapp, hlapp at gmx.net
165 Donald G. Jackson, donald.jackson at bms.com
166 James Wasmuth, james.wasmuth at ed.ac.uk
167 Brian Osborne, bosborne at alum.mit.edu
168 Chris Fields, cjfields at bioperl dot org
170 =head1 APPENDIX
172 The rest of the documentation details each of the object
173 methods. Internal methods are usually preceded with a _
175 =cut
177 # Let the code begin...
179 package Bio::SeqIO::genbank;
181 use strict;
183 use Bio::SeqIO::FTHelper;
184 use Bio::SeqFeature::Generic;
185 use Bio::Species;
186 use Bio::Seq::SeqFactory;
187 use Bio::Annotation::Collection;
188 use Bio::Annotation::Comment;
189 use Bio::Annotation::Reference;
190 use Bio::Annotation::DBLink;
192 use base qw(Bio::SeqIO);
194 our $FTQUAL_LINE_LENGTH = 60;
196 our %FTQUAL_NO_QUOTE = map {$_ => 1} qw(
197 anticodon citation
198 codon codon_start
199 compare cons_splice
200 direction estimated_length
201 evidence label
202 mod_base number
203 rpt_type rpt_unit
204 rpt_unit_range tag_peptide
205 transl_except transl_table
206 usedin
209 our %DBSOURCE = map {$_ => 1} qw(
210 EchoBASE IntAct SWISS-2DPAGE ECO2DBASE ECOGENE TIGRFAMs
211 TIGR GO InterPro Pfam PROSITE SGD GermOnline
212 HSSP PhosSite Ensembl RGD AGD ArrayExpress KEGG
213 H-InvDB HGNC LinkHub PANTHER PRINTS SMART SMR
214 MGI MIM RZPD-ProtExp ProDom MEROPS TRANSFAC Reactome
215 UniGene GlycoSuiteDB PIRSF HSC-2DPAGE PHCI-2DPAGE
216 PMMA-2DPAGE Siena-2DPAGE Rat-heart-2DPAGE Aarhus/Ghent-2DPAGE
217 Biocyc MetaCyc Biocyc:Metacyc GenomeReviews FlyBase
218 TMHOBP COMPLUYEAST-2DPAGE OGP DictyBase HAMAP
219 PhotoList Gramene WormBase WormPep Genew ZFIN
220 PeroxiBase MaizeDB TAIR DrugBank REBASE HPA
221 swissprot GenBank GenPept REFSEQ embl PDB UniProtKB
222 DIP PeptideAtlas PRIDE CYGD HOGENOME Gene3D
223 Project);
225 our %VALID_MOLTYPE = map {$_ => 1} qw(NA DNA RNA tRNA rRNA cDNA cRNA ms-DNA
226 mRNA uRNA ss-RNA ss-DNA snRNA snoRNA PRT);
228 our %VALID_ALPHABET = (
229 'bp' => 'dna',
230 'aa' => 'protein',
231 'rc' => '' # rc = release candidate; file has no sequences
234 sub _initialize {
235 my($self, @args) = @_;
237 $self->SUPER::_initialize(@args);
238 # hash for functions for decoding keys.
239 $self->{'_func_ftunit_hash'} = {};
240 $self->_show_dna(1); # sets this to one by default. People can change it
241 if ( not defined $self->sequence_factory ) {
242 $self->sequence_factory
243 (Bio::Seq::SeqFactory->new(-verbose => $self->verbose,
244 -type => 'Bio::Seq::RichSeq'));
248 =head2 next_seq
250 Title : next_seq
251 Usage : $seq = $stream->next_seq()
252 Function: returns the next sequence in the stream
253 Returns : Bio::Seq object
254 Args :
256 =cut
258 sub next_seq {
259 my ($self, @args) = @_;
260 my %args = @args;
261 my $builder = $self->sequence_builder;
262 my $seq;
263 my %params;
265 RECORDSTART:
266 while (1) {
267 my $buffer;
268 my ( @acc, @features );
269 my ( $display_id, $annotation );
270 my $species;
272 # initialize; we may come here because of starting over
273 @features = ();
274 $annotation = undef;
275 @acc = ();
276 $species = undef;
277 %params = ( -verbose => $self->verbose ); # reset hash
278 local ($/) = "\n";
279 while ( defined( $buffer = $self->_readline ) ) {
280 last if index( $buffer, 'LOCUS ' ) == 0;
282 return unless defined $buffer; # end of file
283 $buffer =~ /^LOCUS\s+(\S.*)$/o
284 or $self->throw( "GenBank stream with bad LOCUS line. "
285 . "Not GenBank in my book. Got '$buffer'");
286 my @tokens = split( ' ', $1 );
288 # this is important to have the id for display in e.g. FTHelper,
289 # otherwise you won't know which entry caused an error
290 $display_id = shift @tokens;
291 $params{'-display_id'} = $display_id;
293 # may still be useful if we don't want the seq
294 my $seqlength = shift @tokens;
295 if ( exists $VALID_ALPHABET{$seqlength} ) {
296 # moved one token too far. No locus name?
297 $self->warn( "Bad LOCUS name? Changing [$params{'-display_id'}] "
298 . "to 'unknown' and length to '$display_id'"
300 $params{'-display_id'} = 'unknown';
301 $params{'-length'} = $display_id;
303 # add token back...
304 unshift @tokens, $seqlength;
306 else {
307 $params{'-length'} = $seqlength;
310 # the alphabet of the entry
311 # shouldn't assign alphabet unless one
312 # is specifically designated (such as for rc files)
313 my $alphabet = lc( shift @tokens );
314 $params{'-alphabet'} =
315 ( exists $VALID_ALPHABET{$alphabet} )
316 ? $VALID_ALPHABET{$alphabet}
317 : $self->warn("Unknown alphabet: $alphabet");
319 # for aa there is usually no 'molecule' (mRNA etc)
320 if ( $params{'-alphabet'} eq 'protein' ) {
321 $params{'-molecule'} = 'PRT';
323 else {
324 $params{'-molecule'} = shift(@tokens);
327 # take care of lower case issues
328 if ( $params{'-molecule'} eq 'dna' or $params{'-molecule'} eq 'rna' ) {
329 $params{'-molecule'} = uc $params{'-molecule'};
331 $self->debug( "Unrecognized molecule type: " . $params{'-molecule'} )
332 if not exists( $VALID_MOLTYPE{ $params{'-molecule'} } );
334 my $circ = shift @tokens;
335 if ( $circ eq 'circular' ) {
336 $params{'-is_circular'} = 1;
337 $params{'-division'} = shift @tokens;
339 else {
340 # 'linear' or 'circular' may actually be omitted altogether
341 $params{'-division'} =
342 ( CORE::length($circ) == 3 ) ? $circ : shift @tokens;
344 my $date = join( ' ', @tokens ); # we lump together the rest
346 # this is per request bug #1513
347 # we can handle:
348 # 9-10-2003
349 # 9-10-03
350 # 09-10-2003
351 # 09-10-03
352 if ( $date =~ s/\s*((\d{1,2})-(\w{3})-(\d{2,4})).*/$1/ ) {
353 if ( length($date) < 11 ) {
354 # improperly formatted date
355 # But we'll be nice and fix it for them
356 my ( $d, $m, $y ) = ( $2, $3, $4 );
357 if ( length($d) == 1 ) {
358 $d = "0$d";
361 # guess the century here
362 if ( length($y) == 2 ) {
363 if ( $y > 60 ) { # arbitrarily guess that '60' means 1960
364 $y = "19$y";
366 else {
367 $y = "20$y";
369 $self->warn( "Date was malformed, guessing the "
370 . "century for $date to be $y\n"
373 $params{'-dates'} = [ join( '-', $d, $m, $y ) ];
375 else {
376 $params{'-dates'} = [$date];
380 # set them all at once
381 $builder->add_slot_value(%params);
382 %params = ();
384 # parse the rest if desired, otherwise start over
385 if ( not $builder->want_object ) {
386 $builder->make_object;
387 next RECORDSTART;
390 # set up annotation depending on what the builder wants
391 if ( $builder->want_slot('annotation') ) {
392 $annotation = Bio::Annotation::Collection->new;
395 $buffer = $self->_readline;
396 while ( defined( my $line = $buffer ) ) {
397 # Description line(s)
398 if ($line =~ /^DEFINITION\s+(\S.*\S)/) {
399 my @desc = ($1);
400 while ( defined( $line = $self->_readline ) ) {
401 if ($line =~ /^\s+(.*)/) {
402 push( @desc, $1 );
403 next;
405 last;
407 $builder->add_slot_value( -desc => join( ' ', @desc ) );
409 # we'll continue right here because DEFINITION
410 # always comes at the top of the entry
411 $buffer = $line;
414 # accession number (there can be multiple accessions)
415 if ($line =~ /^ACCESSION\s+(\S.*\S)/) {
416 push( @acc, split( /\s+/, $1 ) );
417 while ( defined( $line = $self->_readline ) ) {
418 if ($line =~ /^\s+(.*)/) {
419 push( @acc, split( /\s+/, $1 ) );
420 next;
422 last;
424 $buffer = $line;
425 next;
428 # PID
429 elsif ($line =~ /^PID\s+(\S+)/) {
430 $params{'-pid'} = $1;
433 # Version number
434 elsif ($line =~ /^VERSION\s+(\S.+)$/) {
435 my ( $acc, $gi ) = split( ' ', $1 );
436 if ( $acc =~ /^\w+\.(\d+)/ ) {
437 $params{'-version'} = $1;
438 $params{'-seq_version'} = $1;
440 if ( $gi && ( index( $gi, "GI:" ) == 0 ) ) {
441 $params{'-primary_id'} = substr( $gi, 3 );
445 # Keywords
446 elsif ($line =~ /^KEYWORDS\s+(\S.*)/) {
447 my @kw = split( /\s*\;\s*/, $1 );
448 while ( defined( $line = $self->_readline ) ) {
449 chomp $line;
450 if ($line =~ /^\s+(.*)/) {
451 push( @kw, split( /\s*\;\s*/, $1 ) );
452 next;
454 last;
457 @kw && $kw[-1] =~ s/\.$//;
458 $params{'-keywords'} = \@kw;
459 $buffer = $line;
460 next;
463 # Organism name and phylogenetic information
464 elsif ($line =~ /^SOURCE\s+\S/) {
465 if ( $builder->want_slot('species') ) {
466 $species = $self->_read_GenBank_Species( \$buffer );
467 $builder->add_slot_value( -species => $species );
469 else {
470 while ( defined( $buffer = $self->_readline ) ) {
471 last if substr( $buffer, 0, 1 ) ne ' ';
474 next;
477 # References
478 elsif ($line =~ /^REFERENCE\s+\S/) {
479 if ($annotation) {
480 my @refs = $self->_read_GenBank_References( \$buffer );
481 foreach my $ref (@refs) {
482 $annotation->add_Annotation( 'reference', $ref );
485 else {
486 while ( defined( $buffer = $self->_readline ) ) {
487 last if substr( $buffer, 0, 1 ) ne ' ';
490 next;
493 # Project
494 elsif ($line =~ /^PROJECT\s+(\S.*)/) {
495 if ($annotation) {
496 my $project =
497 Bio::Annotation::SimpleValue->new( -value => $1 );
498 $annotation->add_Annotation( 'project', $project );
502 # Comments may be plain text or Structured Comments.
503 # Structured Comments are made up of tag/value pairs and have beginning
504 # and end delimiters like ##*-Data-START## and ##*-Data-END##
505 elsif ($line =~ /^COMMENT\s+(\S.*)/) {
506 if ($annotation) {
507 my $comment = $1;
508 while ( defined( $line = $self->_readline ) ) {
509 last if ($line =~ /^\S/);
510 $comment .= $line;
512 $comment =~ s/ +/ /g;
513 # Structured Comment, do not remove returns in the tabular section
514 if ( my ( $text, $table )= $comment
515 =~ /([^#]*)(##\S+Data-START##.+?##\S+Data-END##)/is
517 $text =~ s/\n/ /g if $text;
518 $table =~ s/START##/START##\n/;
519 $table =~ s/^\s+//gm;
520 $comment = $text . "\n" . $table;
522 # Plain text, remove returns
523 else {
524 $comment =~ s/\n/ /g;
526 $annotation->add_Annotation(
527 'comment',
528 Bio::Annotation::Comment->new(
529 -text => $comment,
530 -tagname => 'comment'
533 $buffer = $line;
535 else {
536 while ( defined( $buffer = $self->_readline ) ) {
537 last if substr( $buffer, 0, 1 ) ne ' ';
540 next;
543 # Corresponding Genbank nucleotide id, Genpept only
544 elsif ($line =~ /^DB(?:SOURCE|LINK)\s+(\S.+)/) {
545 if ($annotation) {
546 my $dbsource = $1;
547 while ( defined( $line = $self->_readline ) ) {
548 last if ($line =~ /^\S/);
549 $dbsource .= $line;
552 # deal with UniProKB dbsources
553 if ( $dbsource =~
554 s/(UniProt(?:KB)?|swissprot):\s+locus\s+(\S+)\,.+\n//
556 $annotation->add_Annotation(
557 'dblink',
558 Bio::Annotation::DBLink->new(
559 -primary_id => $2,
560 -database => $1,
561 -tagname => 'dblink'
564 if ( $dbsource =~ s/\s+created:\s+([^\.]+)\.\n// ) {
565 $annotation->add_Annotation(
566 'swissprot_dates',
567 Bio::Annotation::SimpleValue->new(
568 -tagname => 'date_created',
569 -value => $1
573 while ( $dbsource =~
574 s/\s+(sequence|annotation)\s+updated:\s+([^\.]+)\.\n//g
576 $annotation->add_Annotation(
577 'swissprot_dates',
578 Bio::Annotation::SimpleValue->new(
579 -tagname => 'date_updated',
580 -value => $2
584 $dbsource =~ s/\n/ /g;
585 if ( $dbsource =~
586 s/\s+xrefs:\s+((?:\S+,\s+)+\S+)\s+xrefs/xrefs/
588 # will use $i to determine even or odd
589 # for swissprot the accessions are paired
590 my $i = 0;
591 for my $dbsrc ( split( /,\s+/, $1 ) ) {
592 if ( $dbsrc =~ /(\S+)\.(\d+)/
593 or $dbsrc =~ /(\S+)/
595 my ( $id, $version ) = ( $1, $2 );
596 $version = '' unless defined $version;
597 my $db = ( $id =~ /^\d\S{3}/ ) ? 'PDB'
598 : ( $i++ % 2 ) ? 'GenPept'
599 : 'GenBank';
601 $annotation->add_Annotation(
602 'dblink',
603 Bio::Annotation::DBLink->new(
604 -primary_id => $id,
605 -version => $version,
606 -database => $db,
607 -tagname => 'dblink'
613 elsif ( $dbsource =~ s/\s+xrefs:\s+(.+)\s+xrefs/xrefs/i ) {
614 # download screwed up and ncbi didn't put acc in for gi numbers
615 my $i = 0;
616 for my $id ( split( /\,\s+/, $1 ) ) {
617 my ( $acc, $db );
618 if ( $id =~ /gi:\s+(\d+)/ ) {
619 $acc = $1;
620 $db = ( $i++ % 2 ) ? 'GenPept' : 'GenBank';
622 elsif ( $id =~ /pdb\s+accession\s+(\S+)/ ) {
623 $acc = $1;
624 $db = 'PDB';
626 else {
627 $acc = $id;
628 $db = '';
630 $annotation->add_Annotation(
631 'dblink',
632 Bio::Annotation::DBLink->new(
633 -primary_id => $acc,
634 -database => $db,
635 -tagname => 'dblink'
640 else {
641 $self->debug("Cannot match $dbsource\n");
643 if ( $dbsource =~ s/xrefs\s+
644 \(non\-sequence\s+databases\):\s+
645 ((?:\S+,\s+)+\S+)//x
647 for my $id ( split( /\,\s+/, $1 ) ) {
648 my $db;
650 # this is because GenBank dropped the spaces!!!
651 # I'm sure we're not going to get this right
652 ##if ( $id =~ s/^://i ) {
653 ## $db = $1;
655 $db = substr( $id, 0, index( $id, ':' ) );
656 if ( not exists $DBSOURCE{$db} ) {
657 $db = ''; # do we want 'GenBank' here?
659 $id = substr( $id, index( $id, ':' ) + 1 );
660 $annotation->add_Annotation(
661 'dblink',
662 Bio::Annotation::DBLink->new(
663 -primary_id => $id,
664 -database => $db,
665 -tagname => 'dblink'
671 else {
672 if ( $dbsource =~
673 /^(\S*?):?\s*accession\s+(\S+)\.(\d+)/
675 my ( $db, $id, $version ) = ( $1, $2, $3 );
676 $annotation->add_Annotation(
677 'dblink',
678 Bio::Annotation::DBLink->new(
679 -primary_id => $id,
680 -version => $version,
681 -database => $db || 'GenBank',
682 -tagname => 'dblink'
686 elsif ( $dbsource =~ /^(\S*?):?\s*accession\s+(\S+)/ ) {
687 my ( $db, $id ) = ( $1, $2 );
688 $annotation->add_Annotation(
689 'dblink',
690 Bio::Annotation::DBLink->new(
691 -primary_id => $id,
692 -database => $db || 'GenBank',
693 -tagname => 'dblink'
697 elsif ( $dbsource =~ /(\S+)([\.:])\s*(\S+)/ ) {
698 my ( $db, $version );
699 my @ids = ();
700 if ( $2 eq ':' ) {
701 $db = $1;
703 # Genbank 192 release notes say this: "The second
704 # field can consist of multiple comma-separated
705 # identifiers, if a sequence record has multiple
706 # DBLINK cross-references of a given type."
707 # For example: DBLINK Project:100,200,300"
708 @ids = split( /,/, $3 );
710 else {
711 ( $db, $version ) = ( 'GenBank', $3 );
712 $ids[0] = $1;
715 foreach my $id (@ids) {
716 $annotation->add_Annotation(
717 'dblink',
718 Bio::Annotation::DBLink->new(
719 -primary_id => $id,
720 -version => $version,
721 -database => $db,
722 -tagname => 'dblink'
727 else {
728 $self->warn(
729 "Unrecognized DBSOURCE data: $dbsource\n");
732 $buffer = $line;
734 else {
735 while ( defined( $buffer = $self->_readline ) ) {
736 last if substr( $buffer, 0, 1 ) ne ' ';
739 next;
742 # Exit at start of Feature table, or start of sequence
743 if ($line =~ /^(FEATURES|ORIGIN)/) {
744 my $trap;
746 last if ($line =~ /^(FEATURES|ORIGIN)/);
748 # Get next line and loop again
749 $buffer = $self->_readline;
751 return unless defined $buffer;
753 # add them all at once for efficiency
754 $builder->add_slot_value(
755 -accession_number => shift(@acc),
756 -secondary_accessions => \@acc,
757 %params
759 $builder->add_slot_value( -annotation => $annotation ) if $annotation;
760 %params = (); # reset before possible re-use to avoid setting twice
762 # start over if we don't want to continue with this entry
763 if ( not $builder->want_object ) {
764 $builder->make_object;
765 next RECORDSTART;
768 # some "minimal" formats may not necessarily have a feature table
769 if ( $builder->want_slot('features')
770 and defined $buffer
771 and $buffer =~ /^FEATURES/o
773 # need to read the first line of the feature table
774 $buffer = $self->_readline;
776 # DO NOT read lines in the while condition -- this is done
777 # as a side effect in _read_FTHelper_GenBank!
779 # part of new circular spec:
780 # commented out for now until kinks worked out
781 #my $sourceEnd = 0;
782 #$sourceEnd = $2 if ($buffer =~ /(\d+?)\.\.(\d+?)$/);
784 while ( defined $buffer ) {
785 # check immediately -- not at the end of the loop
786 # note: GenPept entries obviously do not have a BASE line
787 last if ( $buffer =~ /^BASE|ORIGIN|CONTIG|WGS/o );
789 # slurp in one feature at a time -- at return, the start of
790 # the next feature will have been read already, so we need
791 # to pass a reference, and the called method must set this
792 # to the last line read before returning
793 my $ftunit = $self->_read_FTHelper_GenBank( \$buffer );
795 # implement new circular spec: features that cross the origin are now
796 # seamless instead of being 2 separate joined features
797 # commented out until kinks get worked out
798 #if ((! $args{'-nojoin'}) && $ftunit->{'loc'} =~ /^join\((\d+?)\.\.(\d+?),(\d+?)..(\d+?)\)$/
799 #&& $sourceEnd == $2 && $3 == 1) {
800 #my $start = $1;
801 #my $end = $2 + $4;
802 #$ftunit->{'loc'} = "$start..$end";
805 # fix suggested by James Diggans
807 if ( not defined $ftunit ) {
808 # GRRRR. We have fallen over. Try to recover
809 $self->warn( "Unexpected error in feature table for "
810 . $params{'-display_id'}
811 . " Skipping feature, attempting to recover" );
813 unless ( $buffer =~ /^\s{5,5}\S+/o
814 or $buffer =~ /^\S+/o
816 $buffer = $self->_readline;
818 next; # back to reading FTHelpers
821 # process ftunit
822 my $feat =
823 $ftunit->_generic_seqfeature( $self->location_factory,
824 $display_id );
826 # add taxon_id from source if available
827 if ( $species
828 and $feat->primary_tag eq 'source'
829 and $feat->has_tag('db_xref')
830 and ( not $species->ncbi_taxid
831 or ( $species->ncbi_taxid
832 and $species->ncbi_taxid =~ /^list/ ) )
834 foreach my $tagval ( $feat->get_tag_values('db_xref') ) {
835 if ( index( $tagval, "taxon:" ) == 0 ) {
836 $species->ncbi_taxid( substr( $tagval, 6 ) );
837 last;
842 # add feature to list of features
843 push( @features, $feat );
845 $builder->add_slot_value( -features => \@features );
848 if ( defined $buffer ) {
849 # CONTIG lines: TODO, this needs to be cleaned up
850 if ($buffer =~/^CONTIG\s+(.*)/o) {
851 my $ctg = $1;
852 while ( defined( $buffer = $self->_readline ) ) {
853 last if $buffer =~ m{^ORIGIN|//}o;
854 $buffer =~ s/\s+(.*)/$1/;
855 $ctg .= $buffer;
857 if ($ctg) {
858 $annotation->add_Annotation(
859 Bio::Annotation::SimpleValue->new(
860 -tagname => 'contig',
861 -value => $ctg
866 elsif ($buffer =~ /^WGS|WGS_SCAFLD\s+/o) { # catch WGS/WGS_SCAFLD lines
867 while ( $buffer =~ s/(^WGS|WGS_SCAFLD)\s+// ) { # gulp lines
868 chomp $buffer;
869 $annotation->add_Annotation(
870 Bio::Annotation::SimpleValue->new(
871 -value => $buffer,
872 -tagname => lc $1
875 $buffer = $self->_readline;
878 elsif ( $buffer !~ m{^ORIGIN|//}o ) { # advance to the sequence, if any
879 while ( defined( $buffer = $self->_readline ) ) {
880 last if $buffer =~ m{^(ORIGIN|//)};
884 if ( not $builder->want_object ) {
885 $builder->make_object; # implicit end-of-object
886 next RECORDSTART;
888 if ( $builder->want_slot('seq') ) {
889 # the fact that we want a sequence does not necessarily mean that
890 # there also is a sequence ...
891 if ( defined $buffer and $buffer =~ s/^ORIGIN\s+// ) {
892 if ( $annotation and length($buffer) > 0 ) {
893 $annotation->add_Annotation(
894 'origin',
895 Bio::Annotation::SimpleValue->new(
896 -tagname => 'origin',
897 -value => $buffer
901 my $seqc = '';
902 while ( defined( $buffer = $self->_readline ) ) {
903 last if $buffer =~ m{^//};
904 $buffer = uc $buffer;
905 $buffer =~ s/[^A-Za-z]//g;
906 $seqc .= $buffer;
909 $builder->add_slot_value( -seq => $seqc );
912 elsif ( defined($buffer) and ( substr( $buffer, 0, 2 ) ne '//' ) ) {
913 # advance to the end of the record
914 while ( defined( $buffer = $self->_readline ) ) {
915 last if substr( $buffer, 0, 2 ) eq '//';
919 # Unlikely, but maybe the sequence is so weird that we don't want it
920 # anymore. We don't want to return undef if the stream's not exhausted
921 # yet.
922 $seq = $builder->make_object;
923 next RECORDSTART unless $seq;
924 last RECORDSTART;
925 } # end while RECORDSTART
927 return $seq;
930 =head2 write_seq
932 Title : write_seq
933 Usage : $stream->write_seq($seq)
934 Function: writes the $seq object (must be seq) to the stream
935 Returns : 1 for success and 0 for error
936 Args : array of 1 to n Bio::SeqI objects
938 =cut
940 sub write_seq {
941 my ($self,@seqs) = @_;
943 foreach my $seq ( @seqs ) {
944 $self->throw("Attempting to write with no seq!") unless defined $seq;
946 if ( not ref $seq or not $seq->isa('Bio::SeqI') ) {
947 $self->warn(" $seq is not a SeqI compliant module. Attempting to dump, but may fail!");
950 my $str = $seq->seq;
951 my $len = $seq->length;
952 my $alpha = $seq->alphabet;
954 my ($div, $mol);
955 if ( not $seq->can('division')
956 or not defined($div = $seq->division)
958 $div = 'UNK';
960 if ( not $seq->can('molecule')
961 or not defined ($mol = $seq->molecule)
963 $mol = $alpha || 'DNA';
966 my $circular = ($seq->is_circular) ? 'circular' : 'linear ';
968 local($^W) = 0; # suppressing warnings about uninitialized fields.
970 my $temp_line;
971 if ( $self->_id_generation_func ) {
972 $temp_line = &{$self->_id_generation_func}($seq);
974 else {
975 my $date = '';
976 if ( $seq->can('get_dates') ) {
977 ($date) = $seq->get_dates;
980 $self->warn("No whitespace allowed in GenBank display id [". $seq->display_id. "]")
981 if $seq->display_id =~ /\s/;
983 my @data = ( lc($alpha) eq 'protein' ) ? ('aa', '', '') : ('bp', '', $mol);
984 $temp_line = sprintf ("%-12s%-15s%13s %s%4s%-8s%-8s %3s %-s\n",
985 'LOCUS', $seq->id, $len,
986 @data, $circular, $div, $date);
989 $self->_print($temp_line);
990 $self->_write_line_GenBank_regex("DEFINITION ", " ",
991 $seq->desc, "\\s\+\|\$",80);
993 # if there, write the accession line
995 if ( $self->_ac_generation_func ) {
996 $temp_line = &{$self->_ac_generation_func}($seq);
997 $self->_print("ACCESSION $temp_line\n");
999 else {
1000 my @acc = ();
1001 push @acc, $seq->accession_number;
1002 if ( $seq->isa('Bio::Seq::RichSeqI') ) {
1003 push @acc, $seq->get_secondary_accessions;
1005 $self->_print("ACCESSION ", join(" ", @acc), "\n");
1006 # otherwise - cannot print <sigh>
1009 # if PID defined, print it
1010 if ($seq->isa('Bio::Seq::RichSeqI') and $seq->pid) {
1011 $self->_print("PID ", $seq->pid, "\n");
1014 # if there, write the version line
1015 if ( defined $self->_sv_generation_func ) {
1016 $temp_line = &{$self->_sv_generation_func}($seq);
1017 if ( $temp_line ) {
1018 $self->_print("VERSION $temp_line\n");
1021 elsif ($seq->isa('Bio::Seq::RichSeqI') and defined($seq->seq_version)) {
1022 my $id = $seq->primary_id; # this may be a GI number
1023 my $data = (defined $id and $id =~ /^\d+$/) ? " GI:$id" : "";
1024 $self->_print("VERSION ",
1025 $seq->accession_number, ".",
1026 $seq->seq_version, $data, "\n");
1029 # if there, write the PROJECT line
1030 for my $proj ( $seq->annotation->get_Annotations('project') ) {
1031 $self->_print("PROJECT ".$proj->value."\n");
1034 # if there, write the DBSOURCE line
1035 foreach my $ref ( $seq->annotation->get_Annotations('dblink') ) {
1036 my ($db, $id) = ($ref->database, $ref->primary_id);
1037 my $prefix = $db eq 'Project' ? 'DBLINK' : 'DBSOURCE';
1038 my $text = $db eq 'GenBank' ? ''
1039 : $db eq 'Project' ? "$db:$id"
1040 : "$db accession $id";
1041 $self->_print(sprintf ("%-11s %s\n", $prefix, $text));
1044 # if there, write the keywords line
1045 if ( defined $self->_kw_generation_func ) {
1046 $temp_line = &{$self->_kw_generation_func}($seq);
1047 $self->_print("KEYWORDS $temp_line\n");
1049 elsif ( $seq->can('keywords') ) {
1050 my $kw = $seq->keywords;
1051 $kw .= '.' if ( $kw !~ /\.$/ );
1052 $self->_print("KEYWORDS $kw\n");
1055 # SEGMENT if it exists
1056 foreach my $ref ( $seq->annotation->get_Annotations('segment') ) {
1057 $self->_print(sprintf ("%-11s %s\n",'SEGMENT',
1058 $ref->value));
1061 # Organism lines
1062 if (my $spec = $seq->species) {
1063 my ($on, $sn, $cn) = ($spec->can('organelle') ? $spec->organelle : '',
1064 $spec->scientific_name,
1065 $spec->common_name);
1066 my @classification;
1067 if ($spec->isa('Bio::Species')) {
1068 @classification = $spec->classification;
1069 shift @classification;
1071 else {
1072 # Bio::Taxon should have a DB handle of some type attached, so
1073 # derive the classification from that
1074 my $node = $spec;
1075 while ($node) {
1076 $node = $node->ancestor || last;
1077 unshift @classification, $node->node_name;
1078 #$node eq $root && last;
1080 @classification = reverse @classification;
1082 my $abname = $spec->name('abbreviated') ? # from genbank file
1083 $spec->name('abbreviated')->[0] : $sn;
1084 my $sl = $on ? "$on " : '';
1085 $sl .= $cn ? "$abname ($cn)" : $abname;
1087 $self->_write_line_GenBank_regex("SOURCE ", ' 'x12, $sl, "\\s\+\|\$", 80);
1088 $self->_print(" ORGANISM ", $spec->scientific_name, "\n");
1089 my $OC = join('; ', reverse @classification) . '.';
1090 $self->_write_line_GenBank_regex(' 'x12,' 'x12, $OC, "\\s\+\|\$", 80);
1093 # Reference lines
1094 my $count = 1;
1095 foreach my $ref ( $seq->annotation->get_Annotations('reference') ) {
1096 $temp_line = "REFERENCE $count";
1097 if ($ref->start) {
1098 $temp_line .= sprintf (" (%s %d to %d)",
1099 ($seq->alphabet() eq "protein" ?
1100 "residues" : "bases"),
1101 $ref->start, $ref->end);
1103 elsif ($ref->gb_reference) {
1104 $temp_line .= sprintf (" (%s)", $ref->gb_reference);
1106 $self->_print("$temp_line\n");
1107 $self->_write_line_GenBank_regex(" AUTHORS ", ' 'x12,
1108 $ref->authors, "\\s\+\|\$", 80);
1109 $self->_write_line_GenBank_regex(" CONSRTM ", ' 'x12,
1110 $ref->consortium, "\\s\+\|\$", 80) if $ref->consortium;
1111 $self->_write_line_GenBank_regex(" TITLE ", ' 'x12,
1112 $ref->title, "\\s\+\|\$", 80);
1113 $self->_write_line_GenBank_regex(" JOURNAL ", ' 'x12,
1114 $ref->location, "\\s\+\|\$", 80);
1115 if ( $ref->medline) {
1116 $self->_write_line_GenBank_regex(" MEDLINE ", ' 'x12,
1117 $ref->medline, "\\s\+\|\$", 80);
1118 # I am assuming that pubmed entries only exist when there
1119 # are also MEDLINE entries due to the indentation
1121 # This could be a wrong assumption
1122 if ( $ref->pubmed ) {
1123 $self->_write_line_GenBank_regex(" PUBMED ", ' 'x12,
1124 $ref->pubmed, "\\s\+\|\$", 80);
1126 # put remark at the end
1127 if ($ref->comment) {
1128 $self->_write_line_GenBank_regex(" REMARK ", ' 'x12,
1129 $ref->comment, "\\s\+\|\$", 80);
1131 $count++;
1134 # Comment lines
1135 foreach my $comment ( $seq->annotation->get_Annotations('comment') ) {
1136 $self->_write_line_GenBank_regex("COMMENT ", ' 'x12,
1137 $comment->text, "\\s\+\|\$", 80);
1140 # FEATURES section
1141 $self->_print("FEATURES Location/Qualifiers\n");
1143 if ( defined $self->_post_sort ) {
1144 # we need to read things into an array. Process. Sort them. Print 'em
1145 my $post_sort_func = $self->_post_sort;
1146 my @fth;
1148 foreach my $sf ( $seq->top_SeqFeatures ) {
1149 push @fth, Bio::SeqIO::FTHelper::from_SeqFeature($sf, $seq);
1152 @fth = sort { &$post_sort_func($a, $b) } @fth;
1154 foreach my $fth ( @fth ) {
1155 $self->_print_GenBank_FTHelper($fth);
1158 else {
1159 # not post sorted. And so we can print as we get them.
1160 # lower memory load...
1161 foreach my $sf ( $seq->top_SeqFeatures ) {
1162 my @fth = Bio::SeqIO::FTHelper::from_SeqFeature($sf, $seq);
1163 foreach my $fth ( @fth ) {
1164 if ( ! $fth->isa('Bio::SeqIO::FTHelper') ) {
1165 $sf->throw("Cannot process FTHelper... $fth");
1167 $self->_print_GenBank_FTHelper($fth);
1172 # deal with WGS; WGS_SCAFLD present only if WGS is also present
1173 if ($seq->annotation->get_Annotations('wgs')) {
1174 foreach my $wgs (map {$seq->annotation->get_Annotations($_)}
1175 qw(wgs wgs_scaffold)
1177 $self->_print(sprintf ("%-11s %s\n",
1178 uc($wgs->tagname),
1179 $wgs->value));
1181 $self->_show_dna(0);
1183 if ($seq->annotation->get_Annotations('contig')) {
1184 my $ct = 0;
1185 my $cline;
1186 foreach my $contig ($seq->annotation->get_Annotations('contig')) {
1187 unless ($ct) {
1188 $cline = uc($contig->tagname) . " " . $contig->value . "\n";
1190 else {
1191 $cline = " " . $contig->value . "\n";
1193 $self->_print($cline);
1194 $ct++;
1197 if ( $seq->length == 0 ) {
1198 $self->_show_dna(0);
1201 if ( $self->_show_dna == 0 ) {
1202 $self->_print("\n//\n");
1203 return;
1206 # finished printing features.
1208 $str =~ tr/A-Z/a-z/;
1210 my ($o) = $seq->annotation->get_Annotations('origin');
1211 $self->_print(sprintf("%-12s%s\n",
1212 'ORIGIN', $o ? $o->value : ''));
1213 # print out the sequence
1214 my $nuc = 60; # Number of nucleotides per line
1215 my $whole_pat = 'a10' x 6; # Pattern for unpacking a whole line
1216 my $out_pat = 'A11' x 6; # Pattern for packing a line
1217 my $length = length $str;
1219 # Calculate the number of nucleotides which fit on whole lines
1220 my $whole = int($length / $nuc) * $nuc;
1222 # Print the whole lines
1223 my $i;
1224 for ($i = 0; $i < $whole; $i += $nuc) {
1225 my $blocks = pack $out_pat,
1226 unpack $whole_pat,
1227 substr($str, $i, $nuc);
1228 chop $blocks;
1229 $self->_print(sprintf("%9d $blocks\n", $i + $nuc - 59));
1232 # Print the last line
1233 if (my $last = substr($str, $i)) {
1234 my $last_len = length($last);
1235 my $last_pat = 'a10' x int($last_len / 10)
1236 . 'a' . $last_len % 10;
1237 my $blocks = pack $out_pat,
1238 unpack($last_pat, $last);
1239 $blocks =~ s/ +$//;
1240 $self->_print(sprintf("%9d $blocks\n",
1241 $length - $last_len + 1));
1244 $self->_print("//\n");
1246 $self->flush if $self->_flush_on_write && defined $self->_fh;
1247 return 1;
1251 =head2 _print_GenBank_FTHelper
1253 Title : _print_GenBank_FTHelper
1254 Usage :
1255 Function:
1256 Example :
1257 Returns :
1258 Args :
1260 =cut
1262 sub _print_GenBank_FTHelper {
1263 my ( $self, $fth ) = @_;
1265 if ( not ref $fth or not $fth->isa('Bio::SeqIO::FTHelper') ) {
1266 $fth->warn(
1267 "$fth is not a FTHelper class. Attempting to print but there could be issues"
1271 my $spacer = ( length $fth->key >= 15 ) ? ' ' : '';
1272 $self->_write_line_GenBank_regex(
1273 sprintf( " %-16s%s", $fth->key, $spacer ),
1274 " " x 21, $fth->loc, "\,\|\$", 80 );
1276 foreach my $tag ( sort keys %{ $fth->field } ) {
1277 # Account for hash structure in Annotation::DBLink, not the expected array
1278 if ( $tag eq 'db_xref' and grep /HASH/, @{ $fth->field->{$tag} }) {
1279 for my $ref ( @{ $fth->field->{$tag} } ) {
1280 my $db = $ref->{'database'};
1281 my $id = $ref->{'primary_id'};
1282 $self->_write_line_GenBank_regex
1283 ( " " x 21, " " x 21,
1284 "/$tag=\"$db:$id\"", "\.\|\$", 80 );
1287 # The usual case, where all values are found in an array
1288 else {
1289 foreach my $value ( @{ $fth->field->{$tag} } ) {
1290 $value =~ s/\"/\"\"/g;
1291 if ( $value eq "_no_value" ) {
1292 $self->_write_line_GenBank_regex
1293 ( " " x 21, " " x 21,
1294 "/$tag", "\.\|\$", 80 );
1297 # There are almost 3x more quoted qualifier values and they
1298 # are more common too so we take quoted ones first.
1299 # Long qualifiers, that will be line wrapped, are always quoted
1301 # Note (cjf): any tags explicitly *not* quoted must
1302 # remain unquoted in all output for proper round trip
1303 # parsing:
1304 # see https://github.com/bioperl/bioperl-live/issues/321
1306 # If this breaks *NCBI* examples please file an issue with
1307 # example files
1308 elsif ( not $FTQUAL_NO_QUOTE{$tag} ) {
1309 my ($pat) = ( $value =~ /\s/ ? '\s|$' :
1310 '.|$' );
1311 $self->_write_line_GenBank_regex
1312 ( " " x 21, " " x 21,
1313 "/$tag=\"$value\"", $pat, 80 );
1315 else {
1316 my ($pat) = ( $value =~ /[,]/ ? '[,]|$' :
1317 '.|$' );
1318 $self->_write_line_GenBank_regex
1319 ( " " x 21, " " x 21,
1320 "/$tag=$value", $pat, 80 );
1327 =head2 _read_GenBank_References
1329 Title : _read_GenBank_References
1330 Usage :
1331 Function: Reads references from GenBank format. Internal function really
1332 Returns :
1333 Args :
1335 =cut
1337 sub _read_GenBank_References {
1338 my ($self, $buffer) = @_;
1339 my (@refs);
1340 my $ref;
1342 # assume things are starting with RN
1343 if ( $$buffer !~ /^REFERENCE/ ) {
1344 warn("Not parsing line '$$buffer' which maybe important");
1347 my $line = $$buffer;
1349 my (@title,@loc,@authors,@consort,@com,@medline,@pubmed);
1351 REFLOOP:
1352 while( defined($line) or defined($line = $self->_readline) ) {
1353 if ($line =~ /^\s{2}AUTHORS\s+(.*)/o) {
1354 push @authors, $1;
1355 while ( defined($line = $self->_readline) ) {
1356 if ($line =~ /^\s{9,}(.*)/o) {
1357 push @authors, $1;
1358 next;
1360 last;
1362 $ref->authors(join(' ', @authors));
1365 if ($line =~ /^\s{2}CONSRTM\s+(.*)/o) {
1366 push @consort, $1;
1367 while ( defined($line = $self->_readline) ) {
1368 if ($line =~ /^\s{9,}(.*)/o) {
1369 push @consort, $1;
1370 next;
1372 last;
1374 $ref->consortium(join(' ', @consort));
1377 if ($line =~ /^\s{2}TITLE\s+(.*)/o) {
1378 push @title, $1;
1379 while ( defined($line = $self->_readline) ) {
1380 if ($line =~ /^\s{9,}(.*)/o) {
1381 push @title, $1;
1382 next;
1384 last;
1386 $ref->title(join(' ', @title));
1389 if ($line =~ /^\s{2}JOURNAL\s+(.*)/o) {
1390 push @loc, $1;
1391 while ( defined($line = $self->_readline) ) {
1392 # we only match when there are at least 4 spaces
1393 # there is probably a better way to match this
1394 # as it assumes that the describing tag is short enough
1395 if ($line =~ /^\s{9,}(.*)/o) {
1396 push @loc, $1;
1397 next;
1399 last;
1401 $ref->location(join(' ', @loc));
1402 redo REFLOOP;
1405 if ($line =~ /^\s{2}REMARK\s+(.*)/o) {
1406 push @com, $1;
1407 while ( defined($line = $self->_readline) ) {
1408 if ($line =~ /^\s{9,}(.*)/o) {
1409 push @com, $1;
1410 next;
1412 last;
1414 $ref->comment(join(' ', @com));
1415 redo REFLOOP;
1418 if ( $line =~ /^\s{2}MEDLINE\s+(.*)/ ) {
1419 push @medline, $1;
1420 while ( defined($line = $self->_readline) ) {
1421 if ($line =~ /^\s{9,}(.*)/) {
1422 push @medline, $1;
1423 next;
1425 last;
1427 $ref->medline(join(' ', @medline));
1428 redo REFLOOP;
1431 if ( $line =~ /^\s{3}PUBMED\s+(.*)/ ) {
1432 push @pubmed, $1;
1433 while ( defined($line = $self->_readline) ) {
1434 if ($line =~ /^\s{9,}(.*)/) {
1435 push @pubmed, $1;
1436 next;
1438 last;
1440 $ref->pubmed(join(' ', @pubmed));
1441 redo REFLOOP;
1444 if ( $line =~ /^REFERENCE/o ) {
1445 # store current reference
1446 $self->_add_ref_to_array(\@refs,$ref) if defined $ref;
1448 # reset
1449 @authors = ();
1450 @title = ();
1451 @loc = ();
1452 @com = ();
1453 @pubmed = ();
1454 @medline = ();
1456 # create the new reference object
1457 $ref = Bio::Annotation::Reference->new(-tagname => 'reference');
1459 # check whether start and end base is given
1460 if ($line =~ /^REFERENCE\s+\d+\s+\([a-z]+ (\d+) to (\d+)\)/){
1461 $ref->start($1);
1462 $ref->end($2);
1464 elsif ($line =~ /^REFERENCE\s+\d+\s+\((.*)\)/) {
1465 $ref->gb_reference($1);
1469 last if ($line =~ /^(FEATURES)|(COMMENT)/o);
1471 $line = undef; # Empty $line to trigger read of next line
1474 # store last reference
1475 $self->_add_ref_to_array(\@refs, $ref) if defined $ref;
1477 $$buffer = $line;
1479 #print "\nnumber of references found: ", $#refs+1,"\n";
1481 return @refs;
1484 =head2 _add_ref_to_array
1486 Title: _add_ref_to_array
1487 Usage:
1488 Function: Adds a Reference object to an array of Reference objects, takes
1489 care of possible cleanups to be done (currently, only author and title
1490 will be chopped of trailing semicolons).
1491 Args: A reference to an array of Reference objects and
1492 the Reference object to be added
1493 Returns: nothing
1495 =cut
1497 sub _add_ref_to_array {
1498 my ($self, $refs, $ref) = @_;
1500 # first, polish author and title by removing possible trailing semicolons
1501 my $au = $ref->authors;
1502 my $title = $ref->title;
1503 $au =~ s/;\s*$//g if $au;
1504 $title =~ s/;\s*$//g if $title;
1505 $ref->authors($au);
1506 $ref->title($title);
1507 # the rest should be clean already, so go ahead and add it
1508 push @{$refs}, $ref;
1511 =head2 _read_GenBank_Species
1513 Title : _read_GenBank_Species
1514 Usage :
1515 Function: Reads the GenBank Organism species and classification
1516 lines. Able to deal with unconvential Organism naming
1517 formats, and varietas in plants
1518 Example : ORGANISM unknown marine gamma proteobacterium NOR5
1519 $genus = undef
1520 $species = unknown marine gamma proteobacterium NOR5
1522 ORGANISM Drosophila sp. 'white tip scutellum'
1523 $genus = Drosophila
1524 $species = sp. 'white tip scutellum'
1525 (yes, this really is a species and that is its name)
1526 $subspecies = undef
1528 ORGANISM Ajellomyces capsulatus var. farciminosus
1529 $genus = Ajellomyces
1530 $species = capsulatus
1531 $subspecies = var. farciminosus
1533 ORGANISM Hepatitis delta virus
1534 $genus = undef (though this virus has a genus in its lineage, we
1535 cannot know that without a database lookup)
1536 $species = Hepatitis delta virus
1538 Returns : A Bio::Species object
1539 Args : A reference to the current line buffer
1541 =cut
1543 sub _read_GenBank_Species {
1544 my ($self, $buffer) = @_;
1546 my @unkn_names = ('other', 'unknown organism', 'not specified', 'not shown',
1547 'Unspecified', 'Unknown', 'None', 'unclassified',
1548 'unidentified organism', 'not supplied');
1549 # dictionary of synonyms for taxid 32644
1550 my @unkn_genus = ('unknown', 'unclassified', 'uncultured', 'unidentified');
1551 # all above can be part of valid species name
1553 my $line = $$buffer;
1555 my( $sub_species, $species, $genus, $sci_name, $common,
1556 $class_lines, $source_flag, $abbr_name, $organelle, $sl );
1557 my %source = map { $_ => 1 } qw(SOURCE ORGANISM CLASSIFICATION);
1559 # upon first entering the loop, we must not read a new line -- the SOURCE
1560 # line is already in the buffer (HL 05/10/2000)
1561 my ($ann, $tag, $data);
1562 while (defined($line) or defined($line = $self->_readline)) {
1563 # de-HTMLify (links that may be encountered here don't contain
1564 # escaped '>', so a simple-minded approach suffices)
1565 $line =~ s{<[^>]+>}{}g;
1566 if ($line =~ m{^(?:\s{0,2})(\w+)\s+(.+)?$}ox) {
1567 ($tag, $data) = ($1, $2 || '');
1568 last if ($tag and not exists $source{$tag});
1570 else {
1571 return unless $tag;
1572 ($data = $line) =~ s{^\s+}{};
1573 chomp $data;
1574 $tag = 'CLASSIFICATION' if ( $tag ne 'CLASSIFICATION'
1575 and $tag eq 'ORGANISM'
1576 # Don't match "str." or "var." (fix NC_021815),
1577 # and don't match ".1" (fix NC_021902)
1578 and $line =~ m{(?<!\bstr|\bvar)[;\.]+(?!\d)});
1580 (exists $ann->{$tag}) ? ($ann->{$tag} .= ' '.$data) : ($ann->{$tag} .= $data);
1581 $line = undef;
1584 ($sl, $class_lines, $sci_name) = ($ann->{SOURCE}, $ann->{CLASSIFICATION}, $ann->{ORGANISM});
1586 $$buffer = $line;
1588 $sci_name or return;
1590 # parse out organelle, common name, abbreviated name if present;
1591 # this should catch everything, but falls back to
1592 # entire SOURCE line just in case
1593 if ($sl =~ m{^(mitochondrion|chloroplast|plastid)?
1594 \s*(.*?)
1595 \s*(?: \( (.*?) \) )?\.?
1597 }xms
1599 ($organelle, $abbr_name, $common) = ($1, $2, $3); # optional
1601 else {
1602 $abbr_name = $sl; # nothing caught; this is a backup!
1605 # Convert data in classification lines into classification array.
1606 # only split on ';' or '.' so that classification that is 2 or more words will
1607 # still get matched, use map() to remove trailing/leading/intervening spaces
1608 my @class = map { $_ =~ s/^\s+//;
1609 $_ =~ s/\s+$//;
1610 $_ =~ s/\s{2,}/ /g;
1611 $_; }
1612 split /(?<!subgen)[;\.]+/, $class_lines;
1614 # do we have a genus?
1615 my $possible_genus = quotemeta($class[-1])
1616 . ($class[-2] ? "|" . quotemeta($class[-2]) : '');
1617 if ($sci_name =~ /^($possible_genus)/) {
1618 $genus = $1;
1619 ($species) = $sci_name =~ /^$genus\s+(.+)/;
1621 else {
1622 $species = $sci_name;
1625 # is this organism of rank species or is it lower?
1626 # (we don't catch everything lower than species, but it doesn't matter -
1627 # this is just so we abide by previous behaviour whilst not calling a
1628 # species a subspecies)
1629 if ($species and $species =~ /(.+)\s+((?:subsp\.|var\.).+)/) {
1630 ($species, $sub_species) = ($1, $2);
1633 # Don't make a species object if it's empty or "Unknown" or "None"
1634 # return unless $genus and $genus !~ /^(Unknown|None)$/oi;
1635 # Don't make a species object if it belongs to taxid 32644
1636 # my $unkn = grep { $_ =~ /^\Q$sl\E$/; } @unkn_names;
1637 my $unkn = grep { $_ eq $sl } @unkn_names;
1638 return unless (defined $species or defined $genus) and $unkn == 0;
1640 # Bio::Species array needs array in Species -> Kingdom direction
1641 push @class, $sci_name;
1642 @class = reverse @class;
1644 my $make = Bio::Species->new;
1645 $make->scientific_name($sci_name);
1646 $make->classification(@class) if @class > 0;
1647 $make->common_name( $common ) if $common;
1648 $make->name('abbreviated', $abbr_name) if $abbr_name;
1649 $make->organelle($organelle) if $organelle;
1650 #$make->sub_species( $sub_species ) if $sub_species;
1651 return $make;
1654 =head2 _read_FTHelper_GenBank
1656 Title : _read_FTHelper_GenBank
1657 Usage : _read_FTHelper_GenBank($buffer)
1658 Function: reads the next FT key line
1659 Example :
1660 Returns : Bio::SeqIO::FTHelper object
1661 Args : filehandle and reference to a scalar
1663 =cut
1665 sub _read_FTHelper_GenBank {
1666 my ($self, $buffer) = @_;
1668 my ($key, # The key of the feature
1669 $loc # The location line from the feature
1671 my @qual = (); # An array of lines making up the qualifiers
1673 if ($$buffer =~ /^\s{5}(\S+)\s+(.+?)\s*$/o) {
1674 $key = $1;
1675 $loc = $2;
1676 # Read all the lines up to the next feature
1677 while ( defined(my $line = $self->_readline) ) {
1678 if ($line =~ /^(\s+)(.+?)\s*$/o) {
1679 # Lines inside features are preceded by 21 spaces
1680 # A new feature is preceded by 5 spaces
1681 if (length($1) > 6) {
1682 # Add to qualifiers if we're in the qualifiers, or if it's
1683 # the first qualifier
1684 if (@qual or (index($2,'/') == 0)) {
1685 push @qual, $2;
1687 # We're still in the location line, so append to location
1688 else {
1689 $loc .= $2;
1692 else {
1693 # We've reached the start of the next feature
1694 # Put the first line of the next feature into the buffer
1695 $$buffer = $line;
1696 last;
1699 else {
1700 # We're at the end of the feature table
1701 # Put the first line of the next feature into the buffer
1702 $$buffer = $line;
1703 last;
1707 else {
1708 # No feature key
1709 $self->debug("no feature key!\n");
1710 # change suggested by JDiggans to avoid infinite loop-
1711 # see bugreport 1062.
1712 # reset buffer to prevent infinite loop
1713 $$buffer = $self->_readline;
1714 return;
1717 # Make the new FTHelper object
1718 my $out = Bio::SeqIO::FTHelper->new;
1719 $out->verbose($self->verbose);
1720 $out->key($key);
1721 $out->loc($loc);
1723 # Now parse and add any qualifiers. (@qual is kept
1724 # intact to provide informative error messages.)
1725 QUAL:
1726 for (my $i = 0; $i < @qual; $i++) {
1727 my $data = $qual[$i];
1728 my ( $qualifier, $value ) = ($data =~ m{^/([^=]+)(?:=\s*(.+))?})
1729 or $self->warn( "cannot see new qualifier in feature $key: "
1730 . $data);
1731 $qualifier = '' if not defined $qualifier;
1733 if (defined $value) {
1734 # Do we have a quoted value?
1735 if (substr($value, 0, 1) eq '"') {
1736 # Keep adding to value until we find the trailing quote
1737 # and the quotes are balanced
1738 while ($value !~ /\"$/ or $value =~ tr/"/"/ % 2) {
1739 if ($i >= $#qual) {
1740 $self->warn( "Unbalanced quote in:\n"
1741 . join("\n", @qual)
1742 . "No further qualifiers will "
1743 . "be added for this feature");
1744 last QUAL;
1746 # modifying a for-loop variable inside of the loop
1747 # is not the best programming style ...
1748 $i++;
1749 my $next = $qual[$i];
1751 # add to value with a space unless the value appears
1752 # to be a sequence (translation for example)
1753 # if (($value.$next) =~ /[^A-Za-z\"\-]/o) {
1754 # changed to explicitly look for translation tag - cjf 06/8/29
1755 if ($qualifier !~ /^translation$/i ) {
1756 $value .= " ";
1758 $value .= $next;
1760 # Trim leading and trailing quotes
1761 $value =~ s/^"|"$//g;
1762 # Undouble internal quotes
1763 $value =~ s/""/\"/g;
1765 elsif ( $value =~ /^\(/ ) { # values quoted by ()s
1766 # Keep adding to value until we find the trailing bracket
1767 # and the ()s are balanced
1768 my $left = ($value =~ tr/\(/\(/); # count left parens
1769 my $right = ($value =~ tr/\)/\)/); # count right parens
1771 while( $left != $right ) { # was "$value !~ /\)$/ or $left != $right"
1772 if ( $i >= $#qual) {
1773 $self->warn( "Unbalanced parens in:\n"
1774 . join("\n", @qual)
1775 . "\nNo further qualifiers will "
1776 . "be added for this feature");
1777 last QUAL;
1779 $i++;
1780 my $next = $qual[$i];
1781 $value .= $next;
1782 $left += ($next =~ tr/\(/\(/);
1783 $right += ($next =~ tr/\)/\)/);
1787 else {
1788 $value = '_no_value';
1790 # Store the qualifier
1791 $out->field->{$qualifier} ||= [];
1792 push @{$out->field->{$qualifier}}, $value;
1794 return $out;
1797 =head2 _write_line_GenBank
1799 Title : _write_line_GenBank
1800 Usage :
1801 Function: internal function
1802 Example :
1803 Returns :
1804 Args :
1806 =cut
1808 sub _write_line_GenBank {
1809 my ($self, $pre1, $pre2, $line, $length) = @_;
1811 $length or $self->throw("Miscalled write_line_GenBank without length. Programming error!");
1812 my $subl = $length - length $pre2;
1813 my $linel = length $line;
1814 my $i;
1816 my $subr = substr($line,0,$length - length $pre1);
1818 $self->_print("$pre1$subr\n");
1819 for($i = ($length - length $pre1); $i < $linel; $i += $subl) {
1820 $subr = substr($line, $i, $subl);
1821 $self->_print("$pre2$subr\n");
1825 =head2 _write_line_GenBank_regex
1827 Title : _write_line_GenBank_regex
1828 Usage :
1829 Function: internal function for writing lines of specified
1830 length, with different first and the next line
1831 left hand headers and split at specific points in the
1832 text
1833 Example :
1834 Returns : nothing
1835 Args : file handle,
1836 first header,
1837 second header,
1838 text-line,
1839 regex for line breaks,
1840 total line length
1842 =cut
1844 sub _write_line_GenBank_regex {
1845 my ($self, $pre1, $pre2, $line, $regex, $length) = @_;
1847 # print STDOUT "Going to print with $line!\n";
1849 $length or $self->throw("Miscalled write_line_GenBank without length. Programming error!");
1851 my $subl = $length - (length $pre1) - 2;
1852 my @lines = ();
1854 CHUNK:
1855 while ($line) {
1856 foreach my $pat ($regex, '[,;\.\/-]\s|'.$regex, '[,;\.\/-]|'.$regex) {
1857 if ($line =~ m/^(.{0,$subl})($pat)(.*)/ ) {
1858 my $l = $1 . $2;
1859 $line = substr($line, length $l);
1860 # be strict about not padding spaces according to
1861 # genbank format
1862 $l =~ s/\s+$//;
1863 next CHUNK if ($l eq '');
1864 push @lines, $l;
1865 next CHUNK;
1868 # if we get here none of the patterns matched $subl or less chars
1869 $self->warn( "trouble dissecting \"$line\"\n into chunks "
1870 . "of $subl chars or less - this tag won't print right");
1871 # insert a space char to prevent infinite loops
1872 $line = substr($line, 0, $subl) . " " . substr($line, $subl);
1874 my $s = shift @lines;
1875 $self->_print("$pre1$s\n") if $s;
1876 foreach my $s ( @lines ) {
1877 $self->_print("$pre2$s\n");
1881 =head2 _post_sort
1883 Title : _post_sort
1884 Usage : $obj->_post_sort($newval)
1885 Function:
1886 Returns : value of _post_sort
1887 Args : newvalue (optional)
1889 =cut
1891 sub _post_sort {
1892 my ($obj,$value) = @_;
1893 if ( defined $value) {
1894 $obj->{'_post_sort'} = $value;
1896 return $obj->{'_post_sort'};
1899 =head2 _show_dna
1901 Title : _show_dna
1902 Usage : $obj->_show_dna($newval)
1903 Function:
1904 Returns : value of _show_dna
1905 Args : newvalue (optional)
1907 =cut
1909 sub _show_dna {
1910 my ($obj,$value) = @_;
1911 if ( defined $value) {
1912 $obj->{'_show_dna'} = $value;
1914 return $obj->{'_show_dna'};
1917 =head2 _id_generation_func
1919 Title : _id_generation_func
1920 Usage : $obj->_id_generation_func($newval)
1921 Function:
1922 Returns : value of _id_generation_func
1923 Args : newvalue (optional)
1925 =cut
1927 sub _id_generation_func {
1928 my ($obj,$value) = @_;
1929 if ( defined $value ) {
1930 $obj->{'_id_generation_func'} = $value;
1932 return $obj->{'_id_generation_func'};
1935 =head2 _ac_generation_func
1937 Title : _ac_generation_func
1938 Usage : $obj->_ac_generation_func($newval)
1939 Function:
1940 Returns : value of _ac_generation_func
1941 Args : newvalue (optional)
1943 =cut
1945 sub _ac_generation_func {
1946 my ($obj,$value) = @_;
1947 if ( defined $value ) {
1948 $obj->{'_ac_generation_func'} = $value;
1950 return $obj->{'_ac_generation_func'};
1953 =head2 _sv_generation_func
1955 Title : _sv_generation_func
1956 Usage : $obj->_sv_generation_func($newval)
1957 Function:
1958 Returns : value of _sv_generation_func
1959 Args : newvalue (optional)
1961 =cut
1963 sub _sv_generation_func {
1964 my ($obj,$value) = @_;
1965 if ( defined $value ) {
1966 $obj->{'_sv_generation_func'} = $value;
1968 return $obj->{'_sv_generation_func'};
1971 =head2 _kw_generation_func
1973 Title : _kw_generation_func
1974 Usage : $obj->_kw_generation_func($newval)
1975 Function:
1976 Returns : value of _kw_generation_func
1977 Args : newvalue (optional)
1979 =cut
1981 sub _kw_generation_func {
1982 my ($obj,$value) = @_;
1983 if ( defined $value ) {
1984 $obj->{'_kw_generation_func'} = $value;
1986 return $obj->{'_kw_generation_func'};