1 # BioPerl module for Bio::SeqIO
3 # Please direct questions and support issues to <bioperl-l@bioperl.org>
5 # Cared for by Ewan Birney <birney@ebi.ac.uk>
6 # and Lincoln Stein <lstein@cshl.org>
8 # Copyright Ewan Birney
10 # You may distribute this module under the same terms as perl itself
13 # October 18, 1999 Largely rewritten by Lincoln Stein
15 # POD documentation - main docs before the code
19 Bio::SeqIO - Handler for SeqIO Formats
25 $in = Bio::SeqIO->new(-file => "inputfilename" ,
27 $out = Bio::SeqIO->new(-file => ">outputfilename" ,
30 while ( my $seq = $in->next_seq() ) {
31 $out->write_seq($seq);
34 # Now, to actually get at the sequence object, use the standard Bio::Seq
35 # methods (look at Bio::Seq if you don't know what they are)
39 $in = Bio::SeqIO->new(-file => "inputfilename" ,
40 -format => 'genbank');
42 while ( my $seq = $in->next_seq() ) {
43 print "Sequence ",$seq->id, " first 10 bases ",
44 $seq->subseq(1,10), "\n";
48 # The SeqIO system does have a filehandle binding. Most people find this
49 # a little confusing, but it does mean you can write the world's
50 # smallest reformatter
54 $in = Bio::SeqIO->newFh(-file => "inputfilename" ,
56 $out = Bio::SeqIO->newFh(-format => 'EMBL');
58 # World's shortest Fasta<->EMBL format converter:
59 print $out $_ while <$in>;
64 Bio::SeqIO is a handler module for the formats in the SeqIO set (eg,
65 Bio::SeqIO::fasta). It is the officially sanctioned way of getting at
66 the format objects, which most people should use.
68 The Bio::SeqIO system can be thought of like biological file handles.
69 They are attached to filehandles with smart formatting rules (eg,
70 genbank format, or EMBL format, or binary trace file format) and
71 can either read or write sequence objects (Bio::Seq objects, or
72 more correctly, Bio::SeqI implementing objects, of which Bio::Seq is
73 one such object). If you want to know what to do with a Bio::Seq
74 object, read L<Bio::Seq>.
76 The idea is that you request a stream object for a particular format.
77 All the stream objects have a notion of an internal file that is read
78 from or written to. A particular SeqIO object instance is configured
79 for either input or output. A specific example of a stream object is
80 the Bio::SeqIO::fasta object.
82 Each stream object has functions
88 $stream->write_seq($seq);
90 As an added bonus, you can recover a filehandle that is tied to the
91 SeqIO object, allowing you to use the standard E<lt>E<gt> and print
92 operations to read and write sequence objects:
96 $stream = Bio::SeqIO->newFh(-format => 'Fasta',
98 # read from standard input or the input filenames
100 while ( $seq = <$stream> ) {
101 # do something with $seq
106 print $stream $seq; # when stream is in output mode
108 This makes the simplest ever reformatter
113 my $format2 = shift || die
114 "Usage: reformat format1 format2 < input > output";
118 my $in = Bio::SeqIO->newFh(-format => $format1, -fh => \*ARGV );
119 my $out = Bio::SeqIO->newFh(-format => $format2 );
120 # Note: you might want to quote -format to keep older
121 # perl's from complaining.
123 print $out $_ while <$in>;
128 =head2 Bio::SeqIO-E<gt>new()
130 $seqIO = Bio::SeqIO->new(-file => 'seqs.fasta', -format => $format);
131 $seqIO = Bio::SeqIO->new(-fh => \*FILEHANDLE, -format => $format);
132 $seqIO = Bio::SeqIO->new(-string => $string , -format => $format);
133 $seqIO = Bio::SeqIO->new(-format => $format);
135 The new() class method constructs a new Bio::SeqIO object. The returned object
136 can be used to retrieve or print Seq objects. new() accepts the following
143 A file path to be opened for reading or writing. The usual Perl
146 'file' # open file for reading
147 '>file' # open file for writing
148 '>>file' # open file for appending
149 '+<file' # open file read/write
151 To read from or write to a piped command, open a filehandle and use the -fh
156 You may use new() with a opened filehandle, provided as a glob reference. For
157 example, to read from STDIN:
159 my $seqIO = Bio::SeqIO->new(-fh => \*STDIN);
161 A string filehandle is handy if you want to modify the output in the
162 memory, before printing it out. The following program reads in EMBL
163 formatted entries from a file and prints them out in fasta format with
168 my $in = Bio::SeqIO->new(-file => "emblfile",
170 while ( my $seq = $in->next_seq() ) {
171 # the output handle is reset for every file
172 my $stringio = IO::String->new($string);
173 my $out = Bio::SeqIO->new(-fh => $stringio,
175 # output goes into $string
176 $out->write_seq($seq);
178 $string =~ s|(>)(\w+)|$1<font color="Red">$2</font>|g;
183 Filehandles can also be used to read from or write to a piped command:
186 #convert .fastq.gz to .fasta
187 open my $zcat, 'zcat seq.fastq.gz |' or die $!;
188 my $in=Bio::SeqIO->new(-fh=>$zcat,
190 my $out=Bio::SeqIO->new(-file=>'>seq.fasta',
192 while (my $seq=$in->next_seq) {
193 $out->write_seq($seq)
198 A string to read the sequences from. For example:
200 my $string = ">seq1\nACGCTAGCTAGC\n";
201 my $seqIO = Bio::SeqIO->new(-string => $string);
205 Specify the format of the file. Supported formats include fasta,
206 genbank, embl, swiss (SwissProt), Entrez Gene and tracefile formats
207 such as abi (ABI) and scf. There are many more, for a complete listing
208 see the SeqIO HOWTO (L<http://bioperl.org/howtos/SeqIO_HOWTO.html>).
210 If no format is specified and a filename is given then the module will
211 attempt to deduce the format from the filename suffix. If there is no
212 suffix that Bioperl understands then it will attempt to guess the
213 format based on file content. If this is unsuccessful then SeqIO will
216 The format name is case-insensitive: 'FASTA', 'Fasta' and 'fasta' are
219 Currently, the tracefile formats (except for SCF) require installation
220 of the external Staden "io_lib" package, as well as the
221 Bio::SeqIO::staden::read package available from the bioperl-ext
226 Sets the alphabet ('dna', 'rna', or 'protein'). When the alphabet is
227 set then Bioperl will not attempt to guess what the alphabet is. This
228 may be important because Bioperl does not always guess correctly.
232 By default, all files (or filehandles) opened for writing sequences
233 will be flushed after each write_seq() (making the file immediately
234 usable). If you do not need this facility and would like to marginally
235 improve the efficiency of writing multiple sequences to the same file
236 (or filehandle), pass the -flush option '0' or any other value that
237 evaluates as defined but false:
239 my $gb = Bio::SeqIO->new(-file => "<gball.gbk",
241 my $fa = Bio::SeqIO->new(-file => ">gball.fa",
243 -flush => 0); # go as fast as we can!
244 while($seq = $gb->next_seq) { $fa->write_seq($seq) }
248 Provide a Bio::Factory::SequenceFactoryI object. See the sequence_factory() method.
252 Provide a Bio::Factory::LocationFactoryI object. See the location_factory() method.
256 Provide a Bio::Factory::ObjectBuilderI object. See the object_builder() method.
260 =head2 Bio::SeqIO-E<gt>newFh()
262 $fh = Bio::SeqIO->newFh(-fh => \*FILEHANDLE, -format=>$format);
263 $fh = Bio::SeqIO->newFh(-format => $format);
266 This constructor behaves like new(), but returns a tied filehandle
267 rather than a Bio::SeqIO object. You can read sequences from this
268 object using the familiar E<lt>E<gt> operator, and write to it using
269 print(). The usual array and $_ semantics work. For example, you can
270 read all sequence objects into an array like this:
274 Other operations, such as read(), sysread(), write(), close(), and
275 printf() are not supported.
277 =head1 OBJECT METHODS
279 See below for more detailed summaries. The main methods are:
281 =head2 $sequence = $seqIO-E<gt>next_seq()
283 Fetch the next sequence from the stream, or nothing if no more.
285 =head2 $seqIO-E<gt>write_seq($sequence [,$another_sequence,...])
287 Write the specified sequence(s) to the stream.
289 =head2 TIEHANDLE(), READLINE(), PRINT()
291 These provide the tie interface. See L<perltie> for more details.
297 User feedback is an integral part of the evolution of this and other
298 Bioperl modules. Send your comments and suggestions preferably to one
299 of the Bioperl mailing lists.
301 Your participation is much appreciated.
303 bioperl-l@bioperl.org - General discussion
304 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
308 Please direct usage questions or support issues to the mailing list:
310 bioperl-l@bioperl.org
312 rather than to the module maintainer directly. Many experienced and
313 responsive experts will be able look at the problem and quickly
314 address it. Please include a thorough description of the problem
315 with code and data examples if at all possible.
317 =head2 Reporting Bugs
319 Report bugs to the Bioperl bug tracking system to help us keep track
320 the bugs and their resolution. Bug reports can be submitted via the
323 https://github.com/bioperl/bioperl-live/issues
325 =head1 AUTHOR - Ewan Birney, Lincoln Stein
327 Email birney@ebi.ac.uk
332 The rest of the documentation details each of the object
333 methods. Internal methods are usually preceded with a _
337 #' Let the code begin...
344 use Bio
::Factory
::FTLocationFactory
;
345 use Bio
::Seq
::SeqBuilder
;
346 use Bio
::Tools
::GuessSeqFormat
;
349 use parent
qw(Bio::Root::Root Bio::Root::IO Bio::Factory::SequenceStreamI);
351 my %valid_alphabet_cache;
357 Usage : $stream = Bio::SeqIO->new(-file => 'sequences.fasta',
359 Function: Returns a new sequence stream
360 Returns : A Bio::SeqIO stream initialised with the appropriate format
361 Args : Named parameters indicating where to read the sequences from or to
363 -file => filename, OR
364 -fh => filehandle to attach to, OR
367 Additional arguments, all with reasonable defaults:
368 -format => format of the sequences, usually auto-detected
369 -alphabet => 'dna', 'rna', or 'protein'
370 -flush => 0 or 1 (default: flush filehandles after each write)
371 -seqfactory => sequence factory
372 -locfactory => location factory
373 -objbuilder => object builder
375 See L<Bio::SeqIO::Handler>
382 my ($caller, @args) = @_;
383 my $class = ref($caller) || $caller;
385 # or do we want to call SUPER on an object if $caller is an
387 if( $class =~ /Bio::SeqIO::(\S+)/ ) {
388 my ($self) = $class->SUPER::new
(@args);
389 $self->_initialize(@args);
393 @params{ map { lc $_ } keys %params } = values %params; # lowercase keys
395 unless( defined $params{-file
} ||
396 defined $params{-fh
} ||
397 defined $params{-string
} ) {
398 $class->throw("file argument provided, but with an undefined value")
399 if exists $params{'-file'};
400 $class->throw("fh argument provided, but with an undefined value")
401 if exists $params{'-fh'};
402 $class->throw("string argument provided, but with an undefined value")
403 if exists($params{'-string'});
404 # $class->throw("No file, fh, or string argument provided"); # neither defined
407 # Determine or guess sequence format and variant
408 my $format = $params{'-format'};
410 if ($params{-file
}) {
411 # Guess from filename extension, and then from file content
412 $format = $class->_guess_format( $params{-file
} ) ||
413 Bio
::Tools
::GuessSeqFormat
->new(-file
=> $params{-file
} )->guess;
414 } elsif ($params{-fh
}) {
415 # Guess from filehandle content
416 $format = Bio
::Tools
::GuessSeqFormat
->new(-fh
=> $params{-fh
} )->guess;
417 } elsif ($params{-string
}) {
418 # Guess from string content
419 $format = Bio
::Tools
::GuessSeqFormat
->new(-text
=> $params{-string
})->guess;
423 # changed 1-3-11; no need to print out an empty string (only way this
424 # exception is triggered) - cjfields
425 $class->throw("Could not guess format from file, filehandle or string")
427 $format = "\L$format"; # normalize capitalization to lower case
429 if ($format =~ /-/) {
430 ($format, my $variant) = split('-', $format, 2);
431 $params{-variant
} = $variant;
434 return unless( $class->_load_format_module($format) );
435 return "Bio::SeqIO::$format"->new(%params);
443 Usage : $fh = Bio::SeqIO->newFh(-file=>$filename,-format=>'Format')
444 Function: Does a new() followed by an fh()
445 Example : $fh = Bio::SeqIO->newFh(-file=>$filename,-format=>'Format')
446 $sequence = <$fh>; # read a sequence object
447 print $fh $sequence; # write a sequence object
448 Returns : filehandle tied to the Bio::SeqIO::Fh class
451 See L<Bio::SeqIO::Fh>
457 return unless my $self = $class->new(@_);
466 Function: Get or set the IO filehandle
467 Example : $fh = $obj->fh; # make a tied filehandle
468 $sequence = <$fh>; # read a sequence object
469 print $fh $sequence; # write a sequence object
470 Returns : filehandle tied to Bio::SeqIO class
477 my $class = ref($self) || $self;
478 my $s = Symbol
::gensym
;
479 tie
$$s,$class,$self;
484 # _initialize is chained for all SeqIO classes
487 my($self, @args) = @_;
489 # flush is initialized by the Root::IO init
491 my ($seqfact,$locfact,$objbuilder, $alphabet) =
492 $self->_rearrange([qw(SEQFACTORY
498 $locfact = Bio
::Factory
::FTLocationFactory
->new(-verbose
=> $self->verbose)
500 $objbuilder = Bio
::Seq
::SeqBuilder
->new(-verbose
=> $self->verbose)
502 $self->sequence_builder($objbuilder);
503 $self->location_factory($locfact);
505 # note that this should come last because it propagates the sequence
506 # factory to the sequence builder
507 $seqfact && $self->sequence_factory($seqfact);
510 $alphabet && $self->alphabet($alphabet);
512 # initialize the IO part
513 $self->_initialize_io(@args);
520 Usage : $seq = stream->next_seq
521 Function: Reads the next sequence object from the stream and returns it.
523 Certain driver modules may encounter entries in the stream
524 that are either misformatted or that use syntax not yet
525 understood by the driver. If such an incident is
526 recoverable, e.g., by dismissing a feature of a feature
527 table or some other non-mandatory part of an entry, the
528 driver will issue a warning. In the case of a
529 non-recoverable situation an exception will be thrown. Do
530 not assume that you can resume parsing the same stream
531 after catching the exception. Note that you can always turn
532 recoverable errors into exceptions by calling
535 Returns : a Bio::Seq sequence object, or nothing if no more sequences
540 See L<Bio::Root::RootI>, L<Bio::Factory::SeqStreamI>, L<Bio::Seq>
545 my ($self, $seq) = @_;
546 $self->throw("Sorry, you cannot read from a generic Bio::SeqIO object.");
553 Usage : $stream->write_seq($seq)
554 Function: writes the $seq object into the stream
555 Returns : 1 for success and 0 for error
556 Args : Bio::Seq object
561 my ($self, $seq) = @_;
562 $self->throw("Sorry, you cannot write to a generic Bio::SeqIO object.");
569 Usage : $format = $stream->format()
570 Function: Get the sequence format
571 Returns : sequence format, e.g. fasta, fastq
576 # format() method inherited from Bio::Root::IO
582 Usage : $self->alphabet($newval)
583 Function: Set/get the molecule type for the Seq objects to be created.
584 Example : $seqio->alphabet('protein')
585 Returns : value of alphabet: 'dna', 'rna', or 'protein'
586 Args : newvalue (optional)
587 Throws : Exception if the argument is not one of 'dna', 'rna', or 'protein'
592 my ($self, $value) = @_;
594 if ( defined $value) {
596 unless ($valid_alphabet_cache{$value}) {
597 # instead of hard-coding the allowed values once more, we check by
598 # creating a dummy sequence object
600 require Bio
::PrimarySeq
;
601 my $seq = Bio
::PrimarySeq
->new( -verbose
=> $self->verbose,
602 -alphabet
=> $value );
605 $self->throw("Invalid alphabet: $value\n. See Bio::PrimarySeq for allowed values.");
607 $valid_alphabet_cache{$value} = 1;
609 $self->{'alphabet'} = $value;
611 return $self->{'alphabet'};
615 =head2 _load_format_module
617 Title : _load_format_module
618 Usage : *INTERNAL SeqIO stuff*
619 Function: Loads up (like use) a module at run time on demand
626 sub _load_format_module
{
627 my ($self, $format) = @_;
628 my $module = "Bio::SeqIO::" . $format;
632 $ok = $self->_load_module($module);
636 $self: $format cannot be found
638 For more information about the SeqIO system please see the SeqIO docs.
639 This includes ways of checking for formats at compile time, not run time
647 =head2 _concatenate_lines
649 Title : _concatenate_lines
650 Usage : $s = _concatenate_lines($line, $continuation_line)
651 Function: Private. Concatenates two strings assuming that the second stems
652 from a continuation line of the first. Adds a space between both
653 unless the first ends with a dash.
655 Takes care of either arg being empty.
662 sub _concatenate_lines
{
663 my ($self, $s1, $s2) = @_;
664 $s1 .= " " if($s1 && ($s1 !~ /-$/) && $s2);
665 return ($s1 ?
$s1 : "") . ($s2 ?
$s2 : "");
672 Usage : $obj->_filehandle($newval)
673 Function: This method is deprecated. Call _fh() instead.
675 Returns : value of _filehandle
676 Args : newvalue (optional)
681 my ($self,@args) = @_;
682 return $self->_fh(@args);
688 Title : _guess_format
689 Usage : $obj->_guess_format($filename)
690 Function: guess format based on file suffix
692 Returns : guessed format of filename (lower case)
694 Notes : formats that _filehandle() will guess include fasta,
695 genbank, scf, pir, embl, raw, gcg, ace, bsml, swissprot,
702 return unless $_ = shift;
704 return 'abi' if /\.ab[i1]$/i;
705 return 'ace' if /\.ace$/i;
706 return 'alf' if /\.alf$/i;
707 return 'bsml' if /\.(bsm|bsml)$/i;
708 return 'ctf' if /\.ctf$/i;
709 return 'embl' if /\.(embl|ebl|emb|dat)$/i;
710 return 'entrezgene' if /\.asn$/i;
711 return 'exp' if /\.exp$/i;
712 return 'fasta' if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa)$/i;
713 return 'fastq' if /\.fastq$/i;
714 return 'gcg' if /\.gcg$/i;
715 return 'genbank' if /\.(gb|gbank|genbank|gbk|gbs)$/i;
716 return 'phd' if /\.(phd|phred)$/i;
717 return 'pir' if /\.pir$/i;
718 return 'pln' if /\.pln$/i;
719 return 'qual' if /\.qual$/i;
720 return 'raw' if /\.txt$/i;
721 return 'scf' if /\.scf$/i;
722 # from Strider 1.4 Release Notes: The file name extensions used by
723 # Strider 1.4 are ".xdna", ".xdgn", ".xrna" and ".xprt" for DNA,
724 # DNA Degenerate, RNA and Protein Sequence Files, respectively
725 return 'strider' if /\.(xdna|xdgn|xrna|xprt)$/i;
726 return 'swiss' if /\.(swiss|sp)$/i;
727 return 'ztr' if /\.ztr$/i;
738 my ($class,$val) = @_;
739 return bless {'seqio' => $val}, $class;
745 return $self->{'seqio'}->next_seq() || undef unless wantarray;
747 push @list, $obj while $obj = $self->{'seqio'}->next_seq();
754 $self->{'seqio'}->write_seq(@_);
758 =head2 sequence_factory
760 Title : sequence_factory
761 Usage : $seqio->sequence_factory($seqfactory)
762 Function: Get/Set the Bio::Factory::SequenceFactoryI
763 Returns : Bio::Factory::SequenceFactoryI
764 Args : [optional] Bio::Factory::SequenceFactoryI
768 sub sequence_factory
{
769 my ($self, $obj) = @_;
771 if( ! ref($obj) || ! $obj->isa('Bio::Factory::SequenceFactoryI') ) {
772 $self->throw("Must provide a valid Bio::Factory::SequenceFactoryI object to ".ref($self)."::sequence_factory()");
774 $self->{'_seqio_seqfactory'} = $obj;
775 my $builder = $self->sequence_builder();
776 if($builder && $builder->can('sequence_factory') &&
777 (! $builder->sequence_factory())) {
778 $builder->sequence_factory($obj);
781 $self->{'_seqio_seqfactory'};
785 =head2 object_factory
787 Title : object_factory
788 Usage : $obj->object_factory($newval)
789 Function: This is an alias to sequence_factory with a more generic name.
791 Returns : value of object_factory (a scalar)
792 Args : on set, new value (a scalar or undef, optional)
797 return shift->sequence_factory(@_);
801 =head2 sequence_builder
803 Title : sequence_builder
804 Usage : $seqio->sequence_builder($seqfactory)
805 Function: Get/Set the Bio::Factory::ObjectBuilderI used to build sequence
806 objects. This applies to rich sequence formats only, e.g. genbank
809 If you do not set the sequence object builder yourself, it
810 will in fact be an instance of L<Bio::Seq::SeqBuilder>, and
811 you may use all methods documented there to configure it.
813 Returns : a Bio::Factory::ObjectBuilderI compliant object
814 Args : [optional] a Bio::Factory::ObjectBuilderI compliant object
818 sub sequence_builder
{
819 my ($self, $obj) = @_;
821 if( ! ref($obj) || ! $obj->isa('Bio::Factory::ObjectBuilderI') ) {
822 $self->throw("Must provide a valid Bio::Factory::ObjectBuilderI object to ".ref($self)."::sequence_builder()");
824 $self->{'_object_builder'} = $obj;
826 $self->{'_object_builder'};
830 =head2 location_factory
832 Title : location_factory
833 Usage : $seqio->location_factory($locfactory)
834 Function: Get/Set the Bio::Factory::LocationFactoryI object to be used for
835 location string parsing
836 Returns : a Bio::Factory::LocationFactoryI implementing object
837 Args : [optional] on set, a Bio::Factory::LocationFactoryI implementing
842 sub location_factory
{
843 my ($self,$obj) = @_;
845 if( ! ref($obj) || ! $obj->isa('Bio::Factory::LocationFactoryI') ) {
846 $self->throw("Must provide a valid Bio::Factory::LocationFactoryI" .
847 " object to ".ref($self)."->location_factory()");
849 $self->{'_seqio_locfactory'} = $obj;
851 $self->{'_seqio_locfactory'};