1 # BioPerl module for Bio::SeqIO
3 # Please direct questions and support issues to <bioperl-l@bioperl.org>
5 # Cared for by Ewan Birney <birney@ebi.ac.uk>
6 # and Lincoln Stein <lstein@cshl.org>
8 # Copyright Ewan Birney
10 # You may distribute this module under the same terms as perl itself
13 # October 18, 1999 Largely rewritten by Lincoln Stein
15 # POD documentation - main docs before the code
19 Bio::SeqIO - Handler for SeqIO Formats
25 $in = Bio::SeqIO->new(-file => "inputfilename" ,
27 $out = Bio::SeqIO->new(-file => ">outputfilename" ,
30 while ( my $seq = $in->next_seq() ) {
31 $out->write_seq($seq);
34 # Now, to actually get at the sequence object, use the standard Bio::Seq
35 # methods (look at Bio::Seq if you don't know what they are)
39 $in = Bio::SeqIO->new(-file => "inputfilename" ,
40 -format => 'genbank');
42 while ( my $seq = $in->next_seq() ) {
43 print "Sequence ",$seq->id, " first 10 bases ",
44 $seq->subseq(1,10), "\n";
48 # The SeqIO system does have a filehandle binding. Most people find this
49 # a little confusing, but it does mean you can write the world's
50 # smallest reformatter
54 $in = Bio::SeqIO->newFh(-file => "inputfilename" ,
56 $out = Bio::SeqIO->newFh(-format => 'EMBL');
58 # World's shortest Fasta<->EMBL format converter:
59 print $out $_ while <$in>;
64 Bio::SeqIO is a handler module for the formats in the SeqIO set (eg,
65 Bio::SeqIO::fasta). It is the officially sanctioned way of getting at
66 the format objects, which most people should use.
68 The Bio::SeqIO system can be thought of like biological file handles.
69 They are attached to filehandles with smart formatting rules (eg,
70 genbank format, or EMBL format, or binary trace file format) and
71 can either read or write sequence objects (Bio::Seq objects, or
72 more correctly, Bio::SeqI implementing objects, of which Bio::Seq is
73 one such object). If you want to know what to do with a Bio::Seq
74 object, read L<Bio::Seq>.
76 The idea is that you request a stream object for a particular format.
77 All the stream objects have a notion of an internal file that is read
78 from or written to. A particular SeqIO object instance is configured
79 for either input or output. A specific example of a stream object is
80 the Bio::SeqIO::fasta object.
82 Each stream object has functions
88 $stream->write_seq($seq);
90 As an added bonus, you can recover a filehandle that is tied to the
91 SeqIO object, allowing you to use the standard E<lt>E<gt> and print
92 operations to read and write sequence objects:
96 $stream = Bio::SeqIO->newFh(-format => 'Fasta',
98 # read from standard input or the input filenames
100 while ( $seq = <$stream> ) {
101 # do something with $seq
106 print $stream $seq; # when stream is in output mode
108 This makes the simplest ever reformatter
113 my $format2 = shift || die
114 "Usage: reformat format1 format2 < input > output";
118 my $in = Bio::SeqIO->newFh(-format => $format1, -fh => \*ARGV );
119 my $out = Bio::SeqIO->newFh(-format => $format2 );
120 # Note: you might want to quote -format to keep older
121 # perl's from complaining.
123 print $out $_ while <$in>;
128 =head2 Bio::SeqIO-E<gt>new()
130 $seqIO = Bio::SeqIO->new(-file => 'seqs.fasta', -format => $format);
131 $seqIO = Bio::SeqIO->new(-fh => \*FILEHANDLE, -format => $format);
132 $seqIO = Bio::SeqIO->new(-string => $string , -format => $format);
133 $seqIO = Bio::SeqIO->new(-format => $format);
135 The new() class method constructs a new Bio::SeqIO object. The returned object
136 can be used to retrieve or print Seq objects. new() accepts the following
143 A file path to be opened for reading or writing. The usual Perl
146 'file' # open file for reading
147 '>file' # open file for writing
148 '>>file' # open file for appending
149 '+<file' # open file read/write
151 To read from or write to a piped command, open a filehandle and use the -fh
156 You may use new() with a opened filehandle, provided as a glob reference. For
157 example, to read from STDIN:
159 my $seqIO = Bio::SeqIO->new(-fh => \*STDIN);
161 A string filehandle is handy if you want to modify the output in the
162 memory, before printing it out. The following program reads in EMBL
163 formatted entries from a file and prints them out in fasta format with
168 my $in = Bio::SeqIO->new(-file => "emblfile",
170 while ( my $seq = $in->next_seq() ) {
171 # the output handle is reset for every file
172 my $stringio = IO::String->new($string);
173 my $out = Bio::SeqIO->new(-fh => $stringio,
175 # output goes into $string
176 $out->write_seq($seq);
178 $string =~ s|(>)(\w+)|$1<font color="Red">$2</font>|g;
183 Filehandles can also be used to read from or write to a piped command:
186 #convert .fastq.gz to .fasta
187 open my $zcat, 'zcat seq.fastq.gz |' or die $!;
188 my $in=Bio::SeqIO->new(-fh=>$zcat,
190 my $out=Bio::SeqIO->new(-file=>'>seq.fasta',
192 while (my $seq=$in->next_seq) {
193 $out->write_seq($seq)
198 A string to read the sequences from. For example:
200 my $string = ">seq1\nACGCTAGCTAGC\n";
201 my $seqIO = Bio::SeqIO->new(-string => $string);
205 Specify the format of the file. Supported formats include fasta,
206 genbank, embl, swiss (SwissProt), Entrez Gene and tracefile formats
207 such as abi (ABI) and scf. There are many more, for a complete listing
208 see the SeqIO HOWTO (L<http://bioperl.org/howtos/SeqIO_HOWTO.html>).
210 If no format is specified and a filename is given then the module will
211 attempt to deduce the format from the filename suffix. If there is no
212 suffix that Bioperl understands then it will attempt to guess the
213 format based on file content. If this is unsuccessful then SeqIO will
216 The format name is case-insensitive: 'FASTA', 'Fasta' and 'fasta' are
221 Sets the alphabet ('dna', 'rna', or 'protein'). When the alphabet is
222 set then Bioperl will not attempt to guess what the alphabet is. This
223 may be important because Bioperl does not always guess correctly.
227 By default, all files (or filehandles) opened for writing sequences
228 will be flushed after each write_seq() (making the file immediately
229 usable). If you do not need this facility and would like to marginally
230 improve the efficiency of writing multiple sequences to the same file
231 (or filehandle), pass the -flush option '0' or any other value that
232 evaluates as defined but false:
234 my $gb = Bio::SeqIO->new(-file => "<gball.gbk",
236 my $fa = Bio::SeqIO->new(-file => ">gball.fa",
238 -flush => 0); # go as fast as we can!
239 while($seq = $gb->next_seq) { $fa->write_seq($seq) }
243 Provide a Bio::Factory::SequenceFactoryI object. See the sequence_factory() method.
247 Provide a Bio::Factory::LocationFactoryI object. See the location_factory() method.
251 Provide a Bio::Factory::ObjectBuilderI object. See the object_builder() method.
255 =head2 Bio::SeqIO-E<gt>newFh()
257 $fh = Bio::SeqIO->newFh(-fh => \*FILEHANDLE, -format=>$format);
258 $fh = Bio::SeqIO->newFh(-format => $format);
261 This constructor behaves like new(), but returns a tied filehandle
262 rather than a Bio::SeqIO object. You can read sequences from this
263 object using the familiar E<lt>E<gt> operator, and write to it using
264 print(). The usual array and $_ semantics work. For example, you can
265 read all sequence objects into an array like this:
269 Other operations, such as read(), sysread(), write(), close(), and
270 printf() are not supported.
272 =head1 OBJECT METHODS
274 See below for more detailed summaries. The main methods are:
276 =head2 $sequence = $seqIO-E<gt>next_seq()
278 Fetch the next sequence from the stream, or nothing if no more.
280 =head2 $seqIO-E<gt>write_seq($sequence [,$another_sequence,...])
282 Write the specified sequence(s) to the stream.
284 =head2 TIEHANDLE(), READLINE(), PRINT()
286 These provide the tie interface. See L<perltie> for more details.
292 User feedback is an integral part of the evolution of this and other
293 Bioperl modules. Send your comments and suggestions preferably to one
294 of the Bioperl mailing lists.
296 Your participation is much appreciated.
298 bioperl-l@bioperl.org - General discussion
299 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
303 Please direct usage questions or support issues to the mailing list:
305 bioperl-l@bioperl.org
307 rather than to the module maintainer directly. Many experienced and
308 responsive experts will be able look at the problem and quickly
309 address it. Please include a thorough description of the problem
310 with code and data examples if at all possible.
312 =head2 Reporting Bugs
314 Report bugs to the Bioperl bug tracking system to help us keep track
315 the bugs and their resolution. Bug reports can be submitted via the
318 https://github.com/bioperl/bioperl-live/issues
320 =head1 AUTHOR - Ewan Birney, Lincoln Stein
322 Email birney@ebi.ac.uk
327 The rest of the documentation details each of the object
328 methods. Internal methods are usually preceded with a _
332 #' Let the code begin...
339 use Bio
::Factory
::FTLocationFactory
;
340 use Bio
::Seq
::SeqBuilder
;
341 use Bio
::Tools
::GuessSeqFormat
;
344 use parent
qw(Bio::Root::Root Bio::Root::IO Bio::Factory::SequenceStreamI);
346 my %valid_alphabet_cache;
352 Usage : $stream = Bio::SeqIO->new(-file => 'sequences.fasta',
354 Function: Returns a new sequence stream
355 Returns : A Bio::SeqIO stream initialised with the appropriate format
356 Args : Named parameters indicating where to read the sequences from or to
358 -file => filename, OR
359 -fh => filehandle to attach to, OR
362 Additional arguments, all with reasonable defaults:
363 -format => format of the sequences, usually auto-detected
364 -alphabet => 'dna', 'rna', or 'protein'
365 -flush => 0 or 1 (default: flush filehandles after each write)
366 -seqfactory => sequence factory
367 -locfactory => location factory
368 -objbuilder => object builder
370 See L<Bio::SeqIO::Handler>
377 my ($caller, @args) = @_;
378 my $class = ref($caller) || $caller;
380 # or do we want to call SUPER on an object if $caller is an
382 if( $class =~ /Bio::SeqIO::(\S+)/ ) {
383 my ($self) = $class->SUPER::new
(@args);
384 $self->_initialize(@args);
388 @params{ map { lc $_ } keys %params } = values %params; # lowercase keys
390 unless( defined $params{-file
} ||
391 defined $params{-fh
} ||
392 defined $params{-string
} ) {
393 $class->throw("file argument provided, but with an undefined value")
394 if exists $params{'-file'};
395 $class->throw("fh argument provided, but with an undefined value")
396 if exists $params{'-fh'};
397 $class->throw("string argument provided, but with an undefined value")
398 if exists($params{'-string'});
399 # $class->throw("No file, fh, or string argument provided"); # neither defined
402 # Determine or guess sequence format and variant
403 my $format = $params{'-format'};
405 if ($params{-file
}) {
406 # Guess from filename extension, and then from file content
407 $format = $class->_guess_format( $params{-file
} ) ||
408 Bio
::Tools
::GuessSeqFormat
->new(-file
=> $params{-file
} )->guess;
409 } elsif ($params{-fh
}) {
410 # Guess from filehandle content
411 $format = Bio
::Tools
::GuessSeqFormat
->new(-fh
=> $params{-fh
} )->guess;
412 } elsif ($params{-string
}) {
413 # Guess from string content
414 $format = Bio
::Tools
::GuessSeqFormat
->new(-text
=> $params{-string
})->guess;
418 # changed 1-3-11; no need to print out an empty string (only way this
419 # exception is triggered) - cjfields
420 $class->throw("Could not guess format from file, filehandle or string")
422 $format = "\L$format"; # normalize capitalization to lower case
424 if ($format =~ /-/) {
425 ($format, my $variant) = split('-', $format, 2);
426 $params{-variant
} = $variant;
429 return unless( $class->_load_format_module($format) );
430 return "Bio::SeqIO::$format"->new(%params);
438 Usage : $fh = Bio::SeqIO->newFh(-file=>$filename,-format=>'Format')
439 Function: Does a new() followed by an fh()
440 Example : $fh = Bio::SeqIO->newFh(-file=>$filename,-format=>'Format')
441 $sequence = <$fh>; # read a sequence object
442 print $fh $sequence; # write a sequence object
443 Returns : filehandle tied to the Bio::SeqIO::Fh class
446 See L<Bio::SeqIO::Fh>
452 return unless my $self = $class->new(@_);
461 Function: Get or set the IO filehandle
462 Example : $fh = $obj->fh; # make a tied filehandle
463 $sequence = <$fh>; # read a sequence object
464 print $fh $sequence; # write a sequence object
465 Returns : filehandle tied to Bio::SeqIO class
472 my $class = ref($self) || $self;
473 my $s = Symbol
::gensym
;
474 tie
$$s,$class,$self;
479 # _initialize is chained for all SeqIO classes
482 my($self, @args) = @_;
484 # flush is initialized by the Root::IO init
486 my ($seqfact,$locfact,$objbuilder, $alphabet) =
487 $self->_rearrange([qw(SEQFACTORY
493 $locfact = Bio
::Factory
::FTLocationFactory
->new(-verbose
=> $self->verbose)
495 $objbuilder = Bio
::Seq
::SeqBuilder
->new(-verbose
=> $self->verbose)
497 $self->sequence_builder($objbuilder);
498 $self->location_factory($locfact);
500 # note that this should come last because it propagates the sequence
501 # factory to the sequence builder
502 $seqfact && $self->sequence_factory($seqfact);
505 $alphabet && $self->alphabet($alphabet);
507 # initialize the IO part
508 $self->_initialize_io(@args);
515 Usage : $seq = stream->next_seq
516 Function: Reads the next sequence object from the stream and returns it.
518 Certain driver modules may encounter entries in the stream
519 that are either misformatted or that use syntax not yet
520 understood by the driver. If such an incident is
521 recoverable, e.g., by dismissing a feature of a feature
522 table or some other non-mandatory part of an entry, the
523 driver will issue a warning. In the case of a
524 non-recoverable situation an exception will be thrown. Do
525 not assume that you can resume parsing the same stream
526 after catching the exception. Note that you can always turn
527 recoverable errors into exceptions by calling
530 Returns : a Bio::Seq sequence object, or nothing if no more sequences
535 See L<Bio::Root::RootI>, L<Bio::Factory::SeqStreamI>, L<Bio::Seq>
540 my ($self, $seq) = @_;
541 $self->throw("Sorry, you cannot read from a generic Bio::SeqIO object.");
548 Usage : $stream->write_seq($seq)
549 Function: writes the $seq object into the stream
550 Returns : 1 for success and 0 for error
551 Args : Bio::Seq object
556 my ($self, $seq) = @_;
557 $self->throw("Sorry, you cannot write to a generic Bio::SeqIO object.");
564 Usage : $format = $stream->format()
565 Function: Get the sequence format
566 Returns : sequence format, e.g. fasta, fastq
571 # format() method inherited from Bio::Root::IO
577 Usage : $self->alphabet($newval)
578 Function: Set/get the molecule type for the Seq objects to be created.
579 Example : $seqio->alphabet('protein')
580 Returns : value of alphabet: 'dna', 'rna', or 'protein'
581 Args : newvalue (optional)
582 Throws : Exception if the argument is not one of 'dna', 'rna', or 'protein'
587 my ($self, $value) = @_;
589 if ( defined $value) {
591 unless ($valid_alphabet_cache{$value}) {
592 # instead of hard-coding the allowed values once more, we check by
593 # creating a dummy sequence object
595 require Bio
::PrimarySeq
;
596 my $seq = Bio
::PrimarySeq
->new( -verbose
=> $self->verbose,
597 -alphabet
=> $value );
600 $self->throw("Invalid alphabet: $value\n. See Bio::PrimarySeq for allowed values.");
602 $valid_alphabet_cache{$value} = 1;
604 $self->{'alphabet'} = $value;
606 return $self->{'alphabet'};
610 =head2 _load_format_module
612 Title : _load_format_module
613 Usage : *INTERNAL SeqIO stuff*
614 Function: Loads up (like use) a module at run time on demand
621 sub _load_format_module
{
622 my ($self, $format) = @_;
623 my $module = "Bio::SeqIO::" . $format;
627 $ok = $self->_load_module($module);
631 $self: $format cannot be found
633 For more information about the SeqIO system please see the SeqIO docs.
634 This includes ways of checking for formats at compile time, not run time
642 =head2 _concatenate_lines
644 Title : _concatenate_lines
645 Usage : $s = _concatenate_lines($line, $continuation_line)
646 Function: Private. Concatenates two strings assuming that the second stems
647 from a continuation line of the first. Adds a space between both
648 unless the first ends with a dash.
650 Takes care of either arg being empty.
657 sub _concatenate_lines
{
658 my ($self, $s1, $s2) = @_;
659 $s1 .= " " if($s1 && ($s1 !~ /-$/) && $s2);
660 return ($s1 ?
$s1 : "") . ($s2 ?
$s2 : "");
667 Usage : $obj->_filehandle($newval)
668 Function: This method is deprecated. Call _fh() instead.
670 Returns : value of _filehandle
671 Args : newvalue (optional)
676 my ($self,@args) = @_;
677 return $self->_fh(@args);
683 Title : _guess_format
684 Usage : $obj->_guess_format($filename)
685 Function: guess format based on file suffix
687 Returns : guessed format of filename (lower case)
689 Notes : formats that _filehandle() will guess include fasta,
690 genbank, scf, pir, embl, raw, gcg, ace, bsml, swissprot,
697 return unless $_ = shift;
699 return 'abi' if /\.ab[i1]$/i;
700 return 'ace' if /\.ace$/i;
701 return 'alf' if /\.alf$/i;
702 return 'bsml' if /\.(bsm|bsml)$/i;
703 return 'ctf' if /\.ctf$/i;
704 return 'embl' if /\.(embl|ebl|emb|dat)$/i;
705 return 'entrezgene' if /\.asn$/i;
706 return 'exp' if /\.exp$/i;
707 return 'fasta' if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa)$/i;
708 return 'fastq' if /\.fastq$/i;
709 return 'gcg' if /\.gcg$/i;
710 return 'genbank' if /\.(gb|gbank|genbank|gbk|gbs)$/i;
711 return 'phd' if /\.(phd|phred)$/i;
712 return 'pir' if /\.pir$/i;
713 return 'pln' if /\.pln$/i;
714 return 'qual' if /\.qual$/i;
715 return 'raw' if /\.txt$/i;
716 return 'scf' if /\.scf$/i;
717 # from Strider 1.4 Release Notes: The file name extensions used by
718 # Strider 1.4 are ".xdna", ".xdgn", ".xrna" and ".xprt" for DNA,
719 # DNA Degenerate, RNA and Protein Sequence Files, respectively
720 return 'strider' if /\.(xdna|xdgn|xrna|xprt)$/i;
721 return 'swiss' if /\.(swiss|sp)$/i;
722 return 'ztr' if /\.ztr$/i;
733 my ($class,$val) = @_;
734 return bless {'seqio' => $val}, $class;
740 return $self->{'seqio'}->next_seq() || undef unless wantarray;
742 push @list, $obj while $obj = $self->{'seqio'}->next_seq();
749 $self->{'seqio'}->write_seq(@_);
753 =head2 sequence_factory
755 Title : sequence_factory
756 Usage : $seqio->sequence_factory($seqfactory)
757 Function: Get/Set the Bio::Factory::SequenceFactoryI
758 Returns : Bio::Factory::SequenceFactoryI
759 Args : [optional] Bio::Factory::SequenceFactoryI
763 sub sequence_factory
{
764 my ($self, $obj) = @_;
766 if( ! ref($obj) || ! $obj->isa('Bio::Factory::SequenceFactoryI') ) {
767 $self->throw("Must provide a valid Bio::Factory::SequenceFactoryI object to ".ref($self)."::sequence_factory()");
769 $self->{'_seqio_seqfactory'} = $obj;
770 my $builder = $self->sequence_builder();
771 if($builder && $builder->can('sequence_factory') &&
772 (! $builder->sequence_factory())) {
773 $builder->sequence_factory($obj);
776 $self->{'_seqio_seqfactory'};
780 =head2 object_factory
782 Title : object_factory
783 Usage : $obj->object_factory($newval)
784 Function: This is an alias to sequence_factory with a more generic name.
786 Returns : value of object_factory (a scalar)
787 Args : on set, new value (a scalar or undef, optional)
792 return shift->sequence_factory(@_);
796 =head2 sequence_builder
798 Title : sequence_builder
799 Usage : $seqio->sequence_builder($seqfactory)
800 Function: Get/Set the Bio::Factory::ObjectBuilderI used to build sequence
801 objects. This applies to rich sequence formats only, e.g. genbank
804 If you do not set the sequence object builder yourself, it
805 will in fact be an instance of L<Bio::Seq::SeqBuilder>, and
806 you may use all methods documented there to configure it.
808 Returns : a Bio::Factory::ObjectBuilderI compliant object
809 Args : [optional] a Bio::Factory::ObjectBuilderI compliant object
813 sub sequence_builder
{
814 my ($self, $obj) = @_;
816 if( ! ref($obj) || ! $obj->isa('Bio::Factory::ObjectBuilderI') ) {
817 $self->throw("Must provide a valid Bio::Factory::ObjectBuilderI object to ".ref($self)."::sequence_builder()");
819 $self->{'_object_builder'} = $obj;
821 $self->{'_object_builder'};
825 =head2 location_factory
827 Title : location_factory
828 Usage : $seqio->location_factory($locfactory)
829 Function: Get/Set the Bio::Factory::LocationFactoryI object to be used for
830 location string parsing
831 Returns : a Bio::Factory::LocationFactoryI implementing object
832 Args : [optional] on set, a Bio::Factory::LocationFactoryI implementing
837 sub location_factory
{
838 my ($self,$obj) = @_;
840 if( ! ref($obj) || ! $obj->isa('Bio::Factory::LocationFactoryI') ) {
841 $self->throw("Must provide a valid Bio::Factory::LocationFactoryI" .
842 " object to ".ref($self)."->location_factory()");
844 $self->{'_seqio_locfactory'} = $obj;
846 $self->{'_seqio_locfactory'};