1 package Bio
::DB
::SeqFeature
::Store
::FeatureFileLoader
;
6 Bio::DB::SeqFeature::Store::FeatureFileLoader -- feature file loader for Bio::DB::SeqFeature::Store
10 use Bio::DB::SeqFeature::Store;
11 use Bio::DB::SeqFeature::Store::FeatureFileLoader;
13 # Open the sequence database
14 my $db = Bio::DB::SeqFeature::Store->new( -adaptor => 'DBI::mysql',
15 -dsn => 'dbi:mysql:test',
19 Bio::DB::SeqFeature::Store::FeatureFileLoader->new(-store => $db,
23 $loader->load('./my_genome.fff');
28 The Bio::DB::SeqFeature::Store::FeatureFileLoader object parsers
29 FeatureFile-format sequence annotation files and loads
30 Bio::DB::SeqFeature::Store databases. For certain combinations of
31 SeqFeature classes and SeqFeature::Store databases it features a "fast
32 load" mode which will greatly accelerate the loading of databases by a
35 FeatureFile Format (.fff) is very simple:
37 mRNA B0511.1 Chr1:1..100 Type=UTR;Note="putative primase"
38 mRNA B0511.1 Chr1:101..200,300..400,500..800 Type=CDS
39 mRNA B0511.1 Chr1:801..1000 Type=UTR
43 Cosmid B0511 3185..3294
44 Cosmid B0511 10946..11208
45 Cosmid B0511 13126..13511
46 Cosmid B0511 11394..11539
47 EST yk260e10.5 15569..15724
48 EST yk672a12.5 537..618,3187..3294
49 EST yk595e6.5 552..618
50 EST yk595e6.5 3187..3294
51 EST yk846e07.3 11015..11208
53 yk53c10.3 15000..15500,15700..15800
54 yk53c10.5 18892..19154
55 EST yk53c10.5 16032..16105
56 SwissProt PECANEX 13153-13656 Note="Swedish fish"
57 FGENESH "Predicted gene 1" 1-205,518-616,661-735,3187-3365,3436-3846 "Pfam domain"
60 There are up to four columns of WHITESPACE (not necessarily tab)
61 delimited text. Embedded whitespace must be escaped using shell
62 escaping rules (quoting the column or backslashing whitespace).
64 Column 1: The feature type. You may use type:subtype as a convention
67 Column 2: The feature name/ID.
69 Column 3: The position of this feature in base pair
70 coordinates. Ranges can be given as either
71 start-end or start..end. A chromosome position
72 can be specified using the format "reference:start..end".
73 A discontinuous feature can be specified by giving
74 multiple ranges separated by commas. Minus-strand features
75 are indicated by specifying a start > end.
77 Column 4: Comment/attribute field. A single Note can be given, or
78 a series of attribute=value pairs, separated by
79 spaces or semicolons, as in "score=23;type=transmembrane"
81 =head2 Specifying Positions and Ranges
83 A feature position is specified using a sequence ID (a genbank
84 accession number, a chromosome name, a contig, or any other meaningful
85 reference system, followed by a colon and a position range. Ranges are
86 two integers separated by double dots or the hyphen. Examples:
87 "Chr1:516..11208", "ctgA:1-5000". Negative coordinates are allowed, as
90 A discontinuous range ("split location") uses commas to separate the
93 Gene B0511.1 Chr1:516..619,3185..3294,10946..11208
95 In the case of a split location, the sequence id only has to appear in
96 front of the first range.
98 Alternatively, a split location can be indicated by repeating the
99 features type and name on multiple adjacent lines:
101 Gene B0511.1 Chr1:516..619
102 Gene B0511.1 Chr1:3185..3294
103 Gene B0511.1 Chr1:10946..11208
105 If all the locations are on the same reference sequence, you can
106 specify a default chromosome using a "reference=E<lt>seqidE<gt>":
109 Gene B0511.1 516..619
110 Gene B0511.1 3185..3294
111 Gene B0511.1 10946..11208
113 The default seqid is in effect until the next "reference" line
118 Tags can be added to features by adding a fourth column consisting of
121 Gene B0511.1 Chr1:516..619,3185..3294 Note="Putative primase"
123 Tags and their values take any form you want, and multiple tags can be
124 separated by semicolons. You can also repeat tags multiple times:
126 Gene B0511.1 Chr1:516..619,3185..3294 GO_Term=GO:100;GO_Term=GO:2087
128 Several tags have special meanings:
133 Type The primary tag for a subfeature.
134 Score The score of a feature or subfeature.
135 Phase The phase of a feature or subfeature.
136 URL A URL to link to (via the Bio::Graphics library).
137 Note A note to attach to the feature for display by the Bio::Graphics library.
139 For example, in the common case of an mRNA, you can use the "Type" tag
140 to distinguish the parts of the mRNA into UTR and CDS:
142 mRNA B0511.1 Chr1:1..100 Type=UTR
143 mRNA B0511.1 Chr1:101..200,300..400,500..800 Type=CDS
144 mRNA B0511.1 Chr1:801..1000 Type=UTR
146 The top level feature's primary tag will be "mRNA", and its subparts
147 will have types UTR and CDS as indicated. Additional tags that are
148 placed in the first line of the feature will be applied to the top
149 level. In this example, the note "Putative primase" will be applied to
150 the mRNA at the top level of the feature:
152 mRNA B0511.1 Chr1:1..100 Type=UTR;Note="Putative primase"
153 mRNA B0511.1 Chr1:101..200,300..400,500..800 Type=CDS
154 mRNA B0511.1 Chr1:801..1000 Type=UTR
156 =head2 Feature Groups
158 Features can be grouped so that they are rendered by the "group"
159 glyph. To start a group, create a two-column feature entry showing
160 the group type and a name for the group. Follow this with a list of
161 feature entries with a blank type. For example:
164 yk53c10.3 15000-15500,15700-15800
165 yk53c10.5 18892-19154
167 This example is declaring that the ESTs named yk53c10.3 and yk53c10.5
168 belong to the same group named yk53c10.
170 =head2 Comments and the #include Directive
172 Lines that begin with the # sign are treated as comments and
173 ignored. When a # sign appears within a line, everything to the right
174 of the symbol is also ignored, unless it looks like an HTML fragment or
179 glyph = generic # this comment is ignored
181 link = http://www.google.com/search?q=$name#results
183 Be careful, because the processing of # signs uses a regexp heuristic. To be safe,
184 always put a space after the # sign to make sure it is treated as a comment.
186 The special comment "#include 'filename'" acts like the C preprocessor
187 directive and will insert the comments of a named file into the
188 position at which it occurs. Relative paths will be treated relative
189 to the file in which the #include occurs. Nested #include directives
192 #include "/usr/local/share/my_directives.txt"
193 #include 'my_directives.txt'
194 #include chromosome3_features.gff3
196 You can enclose the file path in single or double quotes as shown
197 above. If there are no spaces in the filename the quotes are optional.
199 Include file processing is not very smart. Avoid creating circular
200 #include references. You have been warned!
204 Note that this loader always creates denormalized features such that
205 subfeatures and their parents are stored as one big database
206 object. The GFF3 format and its loader is usually preferred for both
207 space and execution efficiency.
217 use Text
::ParseWords
'shellwords','quotewords';
219 use base
'Bio::DB::SeqFeature::Store::Loader';
224 Usage : $loader = Bio::DB::SeqFeature::Store::FeatureFileLoader->new(@options)
225 Function: create a new parser
226 Returns : a Bio::DB::SeqFeature::Store::FeatureFileLoader parser and loader
227 Args : several - see below
230 This method creates a new FeatureFile loader and establishes its connection
231 with a Bio::DB::SeqFeature::Store database. Arguments are -name=E<gt>$value
232 pairs as described in this table:
237 -store A writable Bio::DB::SeqFeature::Store database handle.
239 -seqfeature_class The name of the type of Bio::SeqFeatureI object to create
240 and store in the database (Bio::DB::SeqFeature by default)
242 -sf_class A shorter alias for -seqfeature_class
244 -verbose Send progress information to standard error.
246 -fast If true, activate fast loading (see below)
248 -chunk_size Set the storage chunk size for nucleotide/protein sequences
251 -tmp Indicate a temporary directory to use when loading non-normalized
254 When you call new(), a connection to a Bio::DB::SeqFeature::Store
255 database should already have been established and the database
256 initialized (if appropriate).
258 Some combinations of Bio::SeqFeatures and Bio::DB::SeqFeature::Store
259 databases support a fast loading mode. Currently the only reliable
260 implementation of fast loading is the combination of DBI::mysql with
261 Bio::DB::SeqFeature. The other important restriction on fast loading
262 is the requirement that a feature that contains subfeatures must occur
263 in the FeatureFile file before any of its subfeatures. Otherwise the
264 subfeatures that occurred before the parent feature will not be
265 attached to the parent correctly. This restriction does not apply to
266 normal (slow) loading.
268 If you use an unnormalized feature class, such as
269 Bio::SeqFeature::Generic, then the loader needs to create a temporary
270 database in which to cache features until all their parts and subparts
271 have been seen. This temporary databases uses the "bdb" adaptor. The
272 -tmp option specifies the directory in which that database will be
273 created. If not present, it defaults to the system default tmp
274 directory specified by File::Spec-E<gt>tmpdir().
276 The -chunk_size option allows you to tune the representation of
277 DNA/Protein sequence in the Store database. By default, sequences are
278 split into 2000 base/residue chunks and then reassembled as
279 needed. This avoids the problem of pulling a whole chromosome into
280 memory in order to fetch a short subsequence from somewhere in the
281 middle. Depending on your usage patterns, you may wish to tune this
282 parameter using a chunk size that is larger or smaller than the
287 # sub new {} inherited
292 Usage : $count = $loader->load(@ARGV)
293 Function: load the indicated files or filehandles
294 Returns : number of feature lines loaded
295 Args : list of files or filehandles
298 Once the loader is created, invoke its load() method with a list of
299 FeatureFile or FASTA file paths or previously-opened filehandles in order to
300 load them into the database. Compressed files ending with .gz, .Z and
301 .bz2 are automatically recognized and uncompressed on the fly. Paths
302 beginning with http: or ftp: are treated as URLs and opened using the
303 LWP GET program (which must be on your path).
305 FASTA files are recognized by their initial "E<gt>" character. Do not feed
306 the loader a file that is neither FeatureFile nor FASTA; I don't know what
307 will happen, but it will probably not be what you expect.
311 # sub load {} inherited
315 The following read-only accessors return values passed or created during new():
317 store() the long-term Bio::DB::SeqFeature::Store object
319 tmp_store() the temporary Bio::DB::SeqFeature::Store object used
322 sfclass() the Bio::SeqFeatureI class
324 fast() whether fast loading is active
326 seq_chunk_size() the sequence chunk size
328 verbose() verbose progress messages
332 # sub store {} inherited
333 # sub tmp_store {} inherited
334 # sub sfclass {} inherited
335 # sub fast {} inherited
336 # sub seq_chunk_size {} inherited
337 # sub verbose {} inherited
339 =head2 default_seqfeature_class
341 $class = $loader->default_seqfeature_class
343 Return the default SeqFeatureI class (Bio::Graphics::Feature).
347 sub default_seqfeature_class
{ #override
349 return 'Bio::Graphics::Feature';
355 $count = $loader->load_fh($filehandle)
357 Load the FeatureFile data at the other end of the filehandle and return true
358 if successful. Internally, load_fh() invokes:
361 do_load($filehandle);
366 # sub load_fh { } inherited
368 =head2 start_load, finish_load
370 These methods are called at the start and end of a filehandle load.
374 sub create_load_data
{
376 $self->SUPER::create_load_data
();
377 $self->{load_data
}{mode
} = 'fff';
378 $self->{load_data
}{CurrentGroup
} = undef;
384 $self->SUPER::finish_load
;
389 $loader->load_line($data);
391 Load a line of a FeatureFile file. You must bracket this with calls to
392 start_load() and finish_load()!
394 $loader->start_load();
395 $loader->load_line($_) while <FH>;
396 $loader->finish_load();
405 return unless $line =~ /\S/; # blank line
406 my $load_data = $self->{load_data
};
408 $load_data->{mode
} = 'fff' if /\s/; # if it has any whitespace in
409 # it, then back to fff mode
411 if ($line =~ /^\#\s?\#\s*([\#]+)/) { ## meta instruction
412 $load_data->{mode
} = 'fff';
413 $self->handle_meta($1);
415 } elsif ($line =~ /^\#/) {
416 $load_data->{mode
} = 'fff'; # just to be safe
420 elsif ($line =~ /^>\s*(\S+)/) { # FASTA lines are coming
421 $load_data->{mode
} = 'fasta';
422 $self->start_or_finish_sequence($1);
425 elsif ($load_data->{mode
} eq 'fasta') {
426 $self->load_sequence($line);
429 elsif ($load_data->{mode
} eq 'fff') {
430 $self->handle_feature($line);
431 if (++$load_data->{count
} % 1000 == 0) {
432 my $now = $self->time();
433 my $nl = -t STDOUT
&& !$ENV{EMACS
} ?
"\r" : "\n";
434 $self->msg(sprintf("%d features loaded in %5.2fs...$nl",
435 $load_data->{count
},$now - $load_data->{start_time
}));
436 $load_data->{start_time
} = $now;
441 $self->throw("I don't know what to do with this line:\n$line");
448 $loader->handle_meta($meta_directive)
450 This method is called to handle meta-directives such as
451 ##sequence-region. The method will receive the directive with the
452 initial ## stripped off.
456 # sub handle_meta { } inherited
458 =head2 handle_feature
460 $loader->handle_feature($gff3_line)
462 This method is called to process a single FeatureFile line. It manipulates
463 information stored a data structure called $self-E<gt>{load_data}.
471 my $ld = $self->{load_data
};
473 # handle reference line
474 if (/^reference\s*=\s*(.+)/) {
475 $ld->{reference
} = $1;
480 my @tokens = quotewords
('\s+',1,$_);
481 for (0..2) { # remove quotes from everything but last column
482 next unless defined $tokens[$_];
483 $tokens[$_] =~ s/^"//;
484 $tokens[$_] =~ s/"$//;
487 if (@tokens < 3) { # short line; assume a group identifier
488 $self->store_current_feature();
489 my $type = shift @tokens;
490 my $name = shift @tokens;
491 $ld->{CurrentGroup
} = $self->_make_indexed_feature($name,$type,'',{_ff_group
=>1});
492 $self->_indexit($name => 1);
496 my($type,$name,$strand,$bounds,$attributes);
498 if ($tokens[2] =~ /^([+-.]|[+-]?[01])$/) { # old version
499 ($type,$name,$strand,$bounds,$attributes) = @tokens;
500 } else { # new version
501 ($type,$name,$bounds,$attributes) = @tokens;
504 # handle case of there only being one value in the last column,
505 # in which case we treat it the same as Note="value"
506 my $attr = $self->parse_attributes($attributes);
508 # @parts is an array of ([ref,start,end],[ref,start,end],...)
509 my @parts = map { [/(?:(\w+):)?(-?\d+)(?:-|\.\.)(-?\d+)/]} split /(?:,| )\s*/,$bounds;
511 # deal with groups -- a group is ending if $type is defined
512 # and CurrentGroup is set
513 if ($type && $ld->{CurrentGroup
}) {
514 $self->_store_group();
517 $type = '' unless defined $type;
518 $name = '' unless defined $name;
519 $type ||= $ld->{CurrentGroup
}->primary_tag if $ld->{CurrentGroup
};
521 my $reference = $ld->{reference
} || 'ChrUN';
523 if (defined $_ && ref($_) eq 'ARRAY'
527 $strand ||= $_->[1] <= $_->[2] ?
'+' : '-';
528 ($_->[1],$_->[2]) = ($_->[2],$_->[1]) if $_->[1] > $_->[2];
530 $reference = $_->[0] if defined $_->[0];
531 $_ = [@
{$_}[1,2]]; # strip off the reference.
534 # now @parts is an array of [start,end] and $reference contains the seqid
536 # apply coordinate mapper
537 if ($self->{coordinate_mapper
} && $reference) {
538 my @remapped = $self->{coordinate_mapper
}->($reference,@parts);
539 ($reference,@parts) = @remapped if @remapped;
542 # either create a new feature or add a segment to it
543 my $feature = $ld->{CurrentFeature
};
545 $ld->{OldPartType
} = $ld->{PartType
};
546 if (exists $attr->{Type
} || exists $attr->{type
}) {
547 $ld->{PartType
} = $attr->{Type
}[0] || $attr->{type
}[0];
549 $ld->{PartType
} = $type;
553 local $^W
= 0; # avoid uninit warning when display_name() is called
555 # if this is a different feature from what we have now, then we
556 # store the current one, and create a new one
557 if ($feature->display_name ne $name ||
558 $feature->method ne $type) {
559 $self->store_current_feature; # new feature, store old one
561 } else { # create a new multipart feature
562 $self->_multilevel_feature($feature,$ld->{OldPartType
})
563 unless $feature->get_SeqFeatures;
564 my $part = $self->_make_feature($name,
570 $feature->add_SeqFeature($part);
574 $feature ||= $self->_make_indexed_feature($name,
575 $type, # side effect is to set CurrentFeature
581 # add more segments to the current feature
583 for my $part (@parts) {
584 $type ||= $feature->primary_tag;
585 my $sp = $self->_make_feature($name,
591 $feature->add_SeqFeature($sp);
596 sub _multilevel_feature
{ # turn a single-level feature into a multilevel one
600 my %attributes = $f->attributes;
601 $attributes{Score
} = [$f->score] if defined $f->score;
602 $attributes{Phase
} = [$f->phase] if defined $f->phase;
603 my @args = ($f->display_name,
610 my $subpart = $self->_make_feature(@args);
611 $f->add_SeqFeature($subpart);
614 sub _make_indexed_feature
{
616 my $f = $self->_make_feature(@_);
617 my $name = $f->display_name;
618 $self->{load_data
}{CurrentFeature
} = $f;
619 $self->{load_data
}{CurrentID
} = $name;
620 $self->_indexit($name => 1);
626 my ($name,$type,$strand,$attributes,$ref,$start,$end) = @_;
628 # some basic error checking
629 $self->throw("syntax error at line $.: '$_'")
630 if ($ref && !defined $start)
631 or ($ref && !defined $end)
632 or ($start && $start !~ /^[-\d]+$/)
633 or ($end && $end !~ /^[-\d]+$/)
639 my @args = (-name
=> $name,
640 -strand
=> $strand eq '+' ?
1
647 -attributes
=> $attributes,
650 if (my ($method,$source) = $type =~ /(\S+):(\S+)/) {
651 push @args,(-primary_tag
=> $method,
654 push @args,(-primary_tag
=> $type);
657 push @args,(-seq_id
=> $ref) if defined $ref;
658 push @args,(-start
=> $start) if defined $start;
659 push @args,(-end
=> $end) if defined $end;
661 # pull out special attributes
662 if (my $score = $attributes->{Score
} || $attributes->{score
}) {
663 push @args,(-score
=> $score->[0]);
664 delete $attributes->{$_} foreach qw(Score score);
667 if (my $note = $attributes->{Note
} || $attributes->{note
}) {
668 push @args,(-desc
=> join '; ',@
$note);
669 delete $attributes->{$_} foreach qw(Note note);
672 if (my $url = $attributes->{url
} || $attributes->{Url
}) {
673 push @args,(-url
=> $url->[0]);
674 delete $attributes->{$_} foreach qw
(Url url
);
677 if (my $phase = $attributes->{phase
} || $attributes->{Phase
}) {
678 push @args,(-phase
=> $phase->[0]);
679 delete $attributes->{$_} foreach qw
(Phase phase
);
682 $self->_indexit($name=>1)
683 if $self->index_subfeatures && $name;
685 return $self->sfclass->new(@args);
688 =head2 store_current_feature
690 $loader->store_current_feature()
692 This method is called to store the currently active feature in the
693 database. It uses a data structure stored in $self-E<gt>{load_data}.
697 sub store_current_feature
{ # overridden
701 # if there is an open group, then we simply add the current
702 # feature to the group.
703 my $ld = $self->{load_data
};
704 if ($ld->{CurrentGroup
} && $ld->{CurrentFeature
}) {
705 $ld->{CurrentGroup
}->add_SeqFeature($ld->{CurrentFeature
})
706 unless $ld->{CurrentGroup
} eq $ld->{CurrentFeature
}; # paranoia - shouldn't happen
710 $self->SUPER::store_current_feature
();
716 my $ld = $self->{load_data
};
717 my $group = $ld->{CurrentGroup
} or return;
718 # if there is an unattached feature, then add it
719 $self->store_current_feature() if $ld->{CurrentFeature
};
720 $ld->{CurrentFeature
} = $group;
721 $ld->{CurrentID
} = $group->display_name;
722 $self->_indexit($ld->{CurrentID
} => 1);
723 undef $ld->{CurrentGroup
};
724 $self->store_current_feature();
727 =head2 build_object_tree
729 $loader->build_object_tree()
731 This method gathers together features and subfeatures and builds the
732 graph that connects them.
737 # put objects together
739 sub build_object_tree
{
740 croak
"We shouldn't be building an object tree in the FeatureFileLoader";
743 =head2 build_object_tree_in_tables
745 $loader->build_object_tree_in_tables()
747 This method gathers together features and subfeatures and builds the
748 graph that connects them, assuming that parent/child relationships
749 will be stored in a database table.
753 sub build_object_tree_in_tables
{
754 croak
"We shouldn't be building an object tree in the FeatureFileLoader";
757 =head2 build_object_tree_in_features
759 $loader->build_object_tree_in_features()
761 This method gathers together features and subfeatures and builds the
762 graph that connects them, assuming that parent/child relationships are
763 stored in the seqfeature objects themselves.
767 sub build_object_tree_in_features
{
768 croak
"We shouldn't be building an object tree in the FeatureFileLoader";
771 =head2 attach_children
773 $loader->attach_children($store,$load_data,$load_id,$feature)
775 This recursively adds children to features and their subfeatures. It
776 is called when subfeatures are directly contained within other
777 features, rather than stored in a relational table.
781 sub attach_children
{
782 croak
"We shouldn't be attaching children in the
786 =head2 parse_attributes
788 @attributes = $loader->parse_attributes($attribute_line)
790 This method parses the information contained in the $attribute_line
791 into a flattened hash (array). It may return one element, in which case it is
796 sub parse_attributes
{
800 $att ||= ''; # to prevent uninit variable warnings from quotewords()
802 my @pairs = quotewords
('[;\s]',1,$att);
804 for my $pair (@pairs) {
805 unless ($pair =~ /=/) {
806 push @
{$attributes{Note
}},(quotewords
('',0,$pair))[0] || $pair;
808 my ($tag,$value) = quotewords
('\s*=\s*',0,$pair);
809 $tag = 'Note' if $tag eq 'description';
810 push @
{$attributes{$tag}},$value;
816 =head2 start_or_finish_sequence
818 $loader->start_or_finish_sequence('Chr9')
820 This method is called at the beginning and end of a fasta section.
832 This is an early version, so there are certainly some bugs. Please
833 use the BioPerl bug tracking system to report bugs.
837 L<Bio::DB::SeqFeature::Store>,
838 L<Bio::DB::SeqFeature::Segment>,
839 L<Bio::DB::SeqFeature::NormalizedFeature>,
840 L<Bio::DB::SeqFeature::GFF3Loader>,
841 L<Bio::DB::SeqFeature::Store::DBI::mysql>,
842 L<Bio::DB::SeqFeature::Store::bdb>
846 Lincoln Stein E<lt>lstein@cshl.orgE<gt>.
848 Copyright (c) 2006 Cold Spring Harbor Laboratory.
850 This library is free software; you can redistribute it and/or modify
851 it under the same terms as Perl itself.