t/AlignIO/AlignIO.t: fix number of tests in plan (fixup c523e6bed866)
[bioperl-live.git] / Bio / DB / SeqFeature / Store / GFF3Loader.pm
blobc149528596d9a4b2a976f992eb3ae25601840334
1 package Bio::DB::SeqFeature::Store::GFF3Loader;
4 =head1 NAME
6 Bio::DB::SeqFeature::Store::GFF3Loader -- GFF3 file loader for Bio::DB::SeqFeature::Store
8 =head1 SYNOPSIS
10 use Bio::DB::SeqFeature::Store;
11 use Bio::DB::SeqFeature::Store::GFF3Loader;
13 # Open the sequence database
14 my $db = Bio::DB::SeqFeature::Store->new( -adaptor => 'DBI::mysql',
15 -dsn => 'dbi:mysql:test',
16 -write => 1 );
18 my $loader = Bio::DB::SeqFeature::Store::GFF3Loader->new(-store => $db,
19 -verbose => 1,
20 -fast => 1);
22 $loader->load('./my_genome.gff3');
25 =head1 DESCRIPTION
27 The Bio::DB::SeqFeature::Store::GFF3Loader object parsers GFF3-format
28 sequence annotation files and loads Bio::DB::SeqFeature::Store
29 databases. For certain combinations of SeqFeature classes and
30 SeqFeature::Store databases it features a "fast load" mode which will
31 greatly accelerate the loading of GFF3 databases by a factor of 5-10.
33 The GFF3 file format has been extended very slightly to accommodate
34 Bio::DB::SeqFeature::Store. First, the loader recognizes is a new
35 directive:
37 # #index-subfeatures [0|1]
39 Note that you can place a space between the two #'s in order to
40 prevent GFF3 validators from complaining.
42 If this is true, then subfeatures are indexed (the default) so that
43 they can be retrieved with a query. See L<Bio::DB::SeqFeature::Store>
44 for an explanation of this. If false, then subfeatures can only be
45 accessed through their parent feature.
47 Second, the loader recognizes a new attribute tag called index, which
48 if present, controls indexing of the current feature. Example:
50 ctg123 . TF_binding_site 1000 1012 . + . ID=tfbs00001;index=1
52 You can use this to turn indexing on and off, overriding the default
53 for a particular feature.
55 Note that the loader keeps a record -- in memory -- of each feature
56 that it has processed. If you find the loader running out of memory on
57 particularly large GFF3 files, please split the input file into
58 smaller pieces and do the load in steps.
60 =cut
63 # load utility - incrementally load the store based on GFF3 file
65 # two modes:
66 # slow mode -- features can occur in any order in the GFF3 file
67 # fast mode -- all features with same ID must be contiguous in GFF3 file
69 use strict;
70 use Carp 'croak';
71 use Bio::DB::GFF::Util::Rearrange;
72 use Bio::DB::SeqFeature::Store::LoadHelper;
73 use constant DEBUG => 0;
75 use base 'Bio::DB::SeqFeature::Store::Loader';
78 my %Special_attributes =(
79 Gap => 1, Target => 1,
80 Parent => 1, Name => 1,
81 Alias => 1, ID => 1,
82 index => 1, Index => 1,
84 my %Strandedness = ( '+' => 1,
85 '-' => -1,
86 '.' => 0,
87 '' => 0,
88 0 => 0,
89 1 => 1,
90 -1 => -1,
91 +1 => 1,
92 undef => 0,
95 =head2 new
97 Title : new
98 Usage : $loader = Bio::DB::SeqFeature::Store::GFF3Loader->new(@options)
99 Function: create a new parser
100 Returns : a Bio::DB::SeqFeature::Store::GFF3Loader gff3 parser and loader
101 Args : several - see below
102 Status : public
104 This method creates a new GFF3 loader and establishes its connection
105 with a Bio::DB::SeqFeature::Store database. Arguments are -name=E<gt>$value
106 pairs as described in this table:
108 Name Value
109 ---- -----
111 -store A writable Bio::DB::SeqFeature::Store database handle.
113 -seqfeature_class The name of the type of Bio::SeqFeatureI object to create
114 and store in the database (Bio::DB::SeqFeature by default)
116 -sf_class A shorter alias for -seqfeature_class
118 -verbose Send progress information to standard error.
120 -fast If true, activate fast loading (see below)
122 -chunk_size Set the storage chunk size for nucleotide/protein sequences
123 (default 2000 bytes)
125 -tmp Indicate a temporary directory to use when loading non-normalized
126 features.
128 -ignore_seqregion Ignore ##sequence-region directives. The default is to create a
129 feature corresponding to the directive.
131 -noalias_target Don't create an Alias attribute for a target_id named in a
132 Target attribute. The default is to create an Alias
133 attribute containing the target_id found in a Target
134 attribute.
136 When you call new(), a connection to a Bio::DB::SeqFeature::Store
137 database should already have been established and the database
138 initialized (if appropriate).
140 Some combinations of Bio::SeqFeatures and Bio::DB::SeqFeature::Store
141 databases support a fast loading mode. Currently the only reliable
142 implementation of fast loading is the combination of DBI::mysql with
143 Bio::DB::SeqFeature. The other important restriction on fast loading
144 is the requirement that a feature that contains subfeatures must occur
145 in the GFF3 file before any of its subfeatures. Otherwise the
146 subfeatures that occurred before the parent feature will not be
147 attached to the parent correctly. This restriction does not apply to
148 normal (slow) loading.
150 If you use an unnormalized feature class, such as
151 Bio::SeqFeature::Generic, then the loader needs to create a temporary
152 database in which to cache features until all their parts and subparts
153 have been seen. This temporary databases uses the "berkeleydb"
154 adaptor. The -tmp option specifies the directory in which that
155 database will be created. If not present, it defaults to the system
156 default tmp directory specified by File::Spec-E<gt>tmpdir().
158 The -chunk_size option allows you to tune the representation of
159 DNA/Protein sequence in the Store database. By default, sequences are
160 split into 2000 base/residue chunks and then reassembled as
161 needed. This avoids the problem of pulling a whole chromosome into
162 memory in order to fetch a short subsequence from somewhere in the
163 middle. Depending on your usage patterns, you may wish to tune this
164 parameter using a chunk size that is larger or smaller than the
165 default.
167 =cut
169 sub new {
170 my $class = shift;
171 my $self = $class->SUPER::new(@_);
172 my ($ignore_seqregion) = rearrange(['IGNORE_SEQREGION'],@_);
173 $self->ignore_seqregion($ignore_seqregion);
174 my ($noalias_target) = rearrange(['NOALIAS_TARGET'],@_);
175 $self->noalias_target($noalias_target);
176 $self;
179 =head2 ignore_seqregion
181 $ignore_it = $loader->ignore_seqregion([$new_flag])
183 Get or set the ignore_seqregion flag, which if true, will cause
184 GFF3 ##sequence-region directives to be ignored. The default behavior
185 is to create a feature corresponding to the region.
187 =cut
189 sub ignore_seqregion {
190 my $self = shift;
191 my $d = $self->{ignore_seqregion};
192 $self->{ignore_seqregion} = shift if @_;
196 =head2 noalias_target
198 $noalias_target = $loader->noalias_target([$new_flag])
200 Get or set the noalias_target flag, which if true, will disable the creation of
201 an Alias attribute for a target_id named in a Target attribute. The default is
202 to create an Alias attribute containing the target_id found in a Target
203 attribute.
205 =cut
207 sub noalias_target {
208 my $self = shift;
209 my $d = $self->{noalias_target};
210 $self->{noalias_target} = shift if @_;
214 =head2 load
216 Title : load
217 Usage : $count = $loader->load(@ARGV)
218 Function: load the indicated files or filehandles
219 Returns : number of feature lines loaded
220 Args : list of files or filehandles
221 Status : public
223 Once the loader is created, invoke its load() method with a list of
224 GFF3 or FASTA file paths or previously-opened filehandles in order to
225 load them into the database. Compressed files ending with .gz, .Z and
226 .bz2 are automatically recognized and uncompressed on the fly. Paths
227 beginning with http: or ftp: are treated as URLs and opened using the
228 LWP GET program (which must be on your path).
230 FASTA files are recognized by their initial "E<gt>" character. Do not feed
231 the loader a file that is neither GFF3 nor FASTA; I don't know what
232 will happen, but it will probably not be what you expect.
234 =cut
236 # sub load { } inherited
238 =head2 accessors
240 The following read-only accessors return values passed or created during new():
242 store() the long-term Bio::DB::SeqFeature::Store object
244 tmp_store() the temporary Bio::DB::SeqFeature::Store object used
245 during loading
247 sfclass() the Bio::SeqFeatureI class
249 fast() whether fast loading is active
251 seq_chunk_size() the sequence chunk size
253 verbose() verbose progress messages
255 =cut
257 # sub store inherited
258 # sub tmp_store inherited
259 # sub sfclass inherited
260 # sub fast inherited
261 # sub seq_chunk_size inherited
262 # sub verbose inherited
264 =head2 Internal Methods
266 The following methods are used internally and may be overridden by
267 subclasses.
269 =over 4
271 =item default_seqfeature_class
273 $class = $loader->default_seqfeature_class
275 Return the default SeqFeatureI class (Bio::DB::SeqFeature).
277 =cut
279 # sub default_seqfeature_class { } inherited
281 =item subfeatures_normalized
283 $flag = $loader->subfeatures_normalized([$new_flag])
285 Get or set a flag that indicates that the subfeatures are
286 normalized. This is deduced from the SeqFeature class information.
288 =cut
290 # sub subfeatures_normalized { } inherited
292 =item subfeatures_in_table
294 $flag = $loader->subfeatures_in_table([$new_flag])
296 Get or set a flag that indicates that feature/subfeature relationships
297 are stored in a table. This is deduced from the SeqFeature class and
298 Store information.
300 =cut
302 # sub subfeatures_in_table { } inherited
304 =item load_fh
306 $count = $loader->load_fh($filehandle)
308 Load the GFF3 data at the other end of the filehandle and return true
309 if successful. Internally, load_fh() invokes:
311 start_load();
312 do_load($filehandle);
313 finish_load();
315 =cut
317 # sub load_fh { } inherited
319 =item start_load, finish_load
321 These methods are called at the start and end of a filehandle load.
323 =cut
325 sub create_load_data { #overridden
326 my $self = shift;
327 $self->SUPER::create_load_data;
328 $self->{load_data}{TemporaryID} = "GFFLoad0000000";
329 $self->{load_data}{IndexSubfeatures} = $self->index_subfeatures();
330 $self->{load_data}{mode} = 'gff';
332 $self->{load_data}{Helper} =
333 Bio::DB::SeqFeature::Store::LoadHelper->new($self->{tmpdir});
336 sub finish_load { #overridden
337 my $self = shift;
339 $self->store_current_feature(); # during fast loading, we will have a feature left at the very end
340 $self->start_or_finish_sequence(); # finish any half-loaded sequences
342 $self->msg("Building object tree...");
343 my $start = $self->time();
344 $self->build_object_tree;
345 $self->msg(sprintf "%5.2fs\n",$self->time()-$start);
347 if ($self->fast) {
348 $self->msg("Loading bulk data into database...");
349 $start = $self->time();
350 $self->store->finish_bulk_update;
351 $self->msg(sprintf "%5.2fs\n",$self->time()-$start);
353 eval {$self->store->commit};
355 # don't delete load data so that caller can ask for the loaded IDs
356 # $self->delete_load_data;
359 =item do_load
361 $count = $loader->do_load($fh)
363 This is called by load_fh() to load the GFF3 file's filehandle and
364 return the number of lines loaded.
366 =cut
368 # sub do_load { } inherited
370 =item load_line
372 $loader->load_line($data);
374 Load a line of a GFF3 file. You must bracket this with calls to
375 start_load() and finish_load()!
377 $loader->start_load();
378 $loader->load_line($_) while <FH>;
379 $loader->finish_load();
381 =cut
383 sub load_line { #overridden
384 my $self = shift;
385 my $line = shift;
387 chomp($line);
388 my $load_data = $self->{load_data};
389 $load_data->{line}++;
391 return unless $line =~ /^\S/; # blank line
393 # if it has a tab in it or looks like a chrom.sizes file, switch to gff mode
394 $load_data->{mode} = 'gff' if $line =~ /\t/
395 or $line =~ /^\w+\s+\d+\s*$/;
397 if ($line =~ /^\#\s?\#\s*(.+)/) { ## meta instruction
398 $load_data->{mode} = 'gff';
399 $self->handle_meta($1);
401 } elsif ($line =~ /^\#/) {
402 $load_data->{mode} = 'gff'; # just to be safe
403 return; # comment
406 elsif ($line =~ /^>\s*(\S+)/) { # FASTA lines are coming
407 $load_data->{mode} = 'fasta';
408 $self->start_or_finish_sequence($1);
411 elsif ($load_data->{mode} eq 'fasta') {
412 $self->load_sequence($line);
415 elsif ($load_data->{mode} eq 'gff') {
416 $self->handle_feature($line);
417 if (++$load_data->{count} % 1000 == 0) {
418 my $now = $self->time();
419 my $nl = -t STDOUT && !$ENV{EMACS} ? "\r" : "\n";
420 local $^W = 0; # kill uninit variable warning
421 $self->msg(sprintf("%d features loaded in %5.2fs (%5.2fs/1000 features)...%s$nl",
422 $load_data->{count},$now - $load_data->{start_time},
423 $now - $load_data->{millenium_time},
424 ' ' x 80
426 $load_data->{millenium_time} = $now;
430 else {
431 $self->throw("I don't know what to do with this line:\n$line");
436 =item handle_meta
438 $loader->handle_meta($meta_directive)
440 This method is called to handle meta-directives such as
441 ##sequence-region. The method will receive the directive with the
442 initial ## stripped off.
444 =cut
446 sub handle_meta {
447 my $self = shift;
448 my $instruction = shift;
450 if ( $instruction =~ /^#$/ ) {
451 $self->store_current_feature() ; # during fast loading, we will have a feature left at the very end
452 $self->start_or_finish_sequence(); # finish any half-loaded sequences
453 if ( $self->store->can('handle_resolution_meta') ) {
454 $self->store->handle_resolution_meta($instruction);
456 return;
459 if ($instruction =~ /sequence-region\s+(.+)\s+(-?\d+)\s+(-?\d+)/i
460 && !$self->ignore_seqregion()) {
461 my($ref,$start,$end,$strand) = $self->_remap($1,$2,$3,+1);
462 my $feature = $self->sfclass->new(-name => $ref,
463 -seq_id => $ref,
464 -start => $start,
465 -end => $end,
466 -strand => $strand,
467 -primary_tag => 'region');
468 $self->store->store($feature);
469 return;
472 if ($instruction =~/index-subfeatures\s+(\S+)/i) {
473 $self->{load_data}{IndexSubfeatures} = $1;
474 $self->store->index_subfeatures($1);
475 return;
478 if ( $self->store->can('handle_unrecognized_meta') ) {
479 $self->store->handle_unrecognized_meta($instruction);
480 return;
484 =item handle_feature
486 $loader->handle_feature($gff3_line)
488 This method is called to process a single GFF3 line. It manipulates
489 information stored a data structure called $self-E<gt>{load_data}.
491 =cut
493 sub handle_feature { #overridden
494 my $self = shift;
495 my $gff_line = shift;
496 my $ld = $self->{load_data};
498 my $allow_whitespace = $self->allow_whitespace;
500 # special case for a chrom.sizes-style line
501 my @columns;
502 if ($gff_line =~ /^(\w+)\s+(\d+)\s*$/) {
503 @columns = ($1,undef,'chromosome',1,$2,undef,undef,undef,"Name=$1");
504 } else {
505 $gff_line =~ s/\s+/\t/g if $allow_whitespace;
506 @columns = map {$_ eq '.' ? undef : $_ } split /\t/,$gff_line;
509 $self->invalid_gff($gff_line) if @columns < 4;
510 $self->invalid_gff($gff_line) if @columns > 9 && $allow_whitespace;
513 local $^W = 0;
514 if (@columns > 9) { #oops, split too much due to whitespace
515 $columns[8] = join(' ',@columns[8..$#columns]);
519 my ($refname,$source,$method,$start,$end,$score,$strand,$phase,$attributes) = @columns;
521 $self->invalid_gff($gff_line) unless defined $refname;
522 $self->invalid_gff($gff_line) unless !defined $start || $start =~ /^[\d.-]+$/;
523 $self->invalid_gff($gff_line) unless !defined $end || $end =~ /^[\d.-]+$/;
524 $self->invalid_gff($gff_line) unless defined $method;
526 $strand = $Strandedness{$strand||0};
527 my ($reserved,$unreserved) = $attributes ? $self->parse_attributes($attributes) : ();
529 my $name = ($reserved->{Name} && $reserved->{Name}[0]);
531 my $has_loadid = defined $reserved->{ID}[0];
533 my $feature_id = defined $reserved->{ID}[0] ? $reserved->{ID}[0] : $ld->{TemporaryID}++;
534 my @parent_ids = @{$reserved->{Parent}} if defined $reserved->{Parent};
536 my $index_it = $ld->{IndexSubfeatures};
537 if (exists $reserved->{Index} || exists $reserved->{index}) {
538 $index_it = $reserved->{Index}[0] || $reserved->{index}[0];
541 # Everything in the unreserved hash becomes an attribute, so we copy
542 # some attributes over
543 $unreserved->{Note} = $reserved->{Note} if exists $reserved->{Note};
544 $unreserved->{Alias} = $reserved->{Alias} if exists $reserved->{Alias};
545 $unreserved->{Target} = $reserved->{Target} if exists $reserved->{Target};
546 $unreserved->{Gap} = $reserved->{Gap} if exists $reserved->{Gap};
547 $unreserved->{load_id}= $reserved->{ID} if exists $reserved->{ID};
549 # mec@stowers-institute.org, wondering why not all attributes are
550 # carried forward, adds ID tag in particular service of
551 # round-tripping ID, which, though present in database as load_id
552 # attribute, was getting lost as itself
553 # $unreserved->{ID}= $reserved->{ID} if exists $reserved->{ID};
555 # TEMPORARY HACKS TO SIMPLIFY DEBUGGING
556 $feature_id = '' unless defined $feature_id;
557 $name = '' unless defined $name; # prevent uninit variable warnings
558 # push @{$unreserved->{Alias}},$feature_id if $has_loadid && $feature_id ne $name;
559 # If DEBUG != 0, any Parent attribute is also copied over (as 'parent_id')
560 $unreserved->{parent_id} = \@parent_ids if DEBUG && @parent_ids;
562 # POSSIBLY A PERMANENT HACK -- TARGETS BECOME ALIASES
563 # THIS IS TO ALLOW FOR TARGET-BASED LOOKUPS
564 if (exists $reserved->{Target} && !$self->{noalias_target}) {
565 my %aliases = map {$_=>1} @{$unreserved->{Alias}};
566 for my $t (@{$reserved->{Target}}) {
567 (my $tc = $t) =~ s/\s+.*$//; # get rid of coordinates
568 $name ||= $tc;
569 push @{$unreserved->{Alias}},$tc unless $name eq $tc || $aliases{$tc};
573 ($refname,$start,$end,$strand) = $self->_remap($refname,$start,$end,$strand) or return;
575 my @args = (-display_name => $name,
576 -seq_id => $refname,
577 -start => $start,
578 -end => $end,
579 -strand => $strand || 0,
580 -score => $score,
581 -phase => $phase,
582 -primary_tag => $method || 'feature',
583 -source => $source,
584 -tag => $unreserved,
585 -attributes => $unreserved,
588 # Here's where we handle feature lines that have the same ID (multiple locations, not
589 # parent/child relationships)
591 my $old_feat;
593 # Current feature is the same as the previous feature, which hasn't yet been loaded
594 if (defined $ld->{CurrentID} && $ld->{CurrentID} eq $feature_id) {
595 $old_feat = $ld->{CurrentFeature};
598 # Current feature is the same as a feature that was loaded earlier
599 elsif (defined(my $id = $self->{load_data}{Helper}->local2global($feature_id))) {
600 $old_feat = $self->fetch($feature_id)
601 or $self->warn(<<END);
602 ID=$feature_id has been used more than once, but it cannot be found in the database.
603 This can happen if you have specified fast loading, but features sharing the same ID
604 are not contiguous in the GFF file. This will be loaded as a separate feature.
605 Line $.: "$_"
609 # contiguous feature, so add a segment
610 warn $old_feat if defined $old_feat and !ref $old_feat;
611 if (defined $old_feat) {
612 # set this to 1 to disable split-location behavior
613 if (0 && @parent_ids) { # If multiple features are held together by the same ID
614 $feature_id = $ld->{TemporaryID}++; # AND they have a Parent attribute, this causes an undesirable
615 } # additional layer of aggregation. Changing the ID fixes this.
616 elsif (
617 $old_feat->seq_id ne $refname ||
618 $old_feat->start != $start ||
619 $old_feat->end != $end # make sure endpoints are distinct
622 $self->add_segment($old_feat,$self->sfclass->new(@args));
623 return;
627 # we get here if this is a new feature
628 # first of all, store the current feature if it is there
629 $self->store_current_feature() if defined $ld->{CurrentID};
631 # now create the new feature
632 # (index top-level features only if policy asks us to)
633 my $feature = $self->sfclass->new(@args);
634 $feature->object_store($self->store) if $feature->can('object_store'); # for lazy table features
635 $ld->{CurrentFeature} = $feature;
636 $ld->{CurrentID} = $feature_id;
638 my $top_level = !@parent_ids;
639 my $has_id = defined $reserved->{ID}[0];
640 $index_it ||= $top_level;
642 my $helper = $ld->{Helper};
643 $helper->indexit($feature_id=>1) if $index_it;
644 $helper->toplevel($feature_id=>1) if !$self->{fast}
645 && $top_level; # need to track top level features
648 # remember parentage
649 for my $parent (@parent_ids) {
650 $helper->add_children($parent=>$feature_id);
655 sub invalid_gff {
656 my $self = shift;
657 my $line = shift;
658 $self->throw("invalid GFF line at line $self->{load_data}{line}.\n".$line);
661 =item allow_whitespace
663 $allow_it = $loader->allow_whitespace([$newvalue]);
665 Get or set the allow_whitespace flag. If true, then GFF3 files are
666 allowed to be delimited with whitespace in addition to tabs.
668 =cut
670 sub allow_whitespace {
671 my $self = shift;
672 my $d = $self->{allow_whitespace};
673 $self->{allow_whitespace} = shift if @_;
677 =item store_current_feature
679 $loader->store_current_feature()
681 This method is called to store the currently active feature in the
682 database. It uses a data structure stored in $self-E<gt>{load_data}.
684 =cut
686 # sub store_current_feature { } inherited
688 =item build_object_tree
690 $loader->build_object_tree()
692 This method gathers together features and subfeatures and builds the graph that connects them.
694 =cut
697 # put objects together
699 sub build_object_tree {
700 my $self = shift;
701 $self->subfeatures_in_table ? $self->build_object_tree_in_tables : $self->build_object_tree_in_features;
704 =item build_object_tree_in_tables
706 $loader->build_object_tree_in_tables()
708 This method gathers together features and subfeatures and builds the
709 graph that connects them, assuming that parent/child relationships
710 will be stored in a database table.
712 =cut
714 sub build_object_tree_in_tables {
715 my $self = shift;
716 my $store = $self->store;
717 my $helper = $self->{load_data}{Helper};
719 while (my ($load_id,$children) = $helper->each_family()) {
721 my $parent_id = $helper->local2global($load_id);
722 die $self->throw("$load_id doesn't have a primary id")
723 unless defined $parent_id;
725 my @children = map {$helper->local2global($_)} @$children;
726 # this updates the table that keeps track of parent/child relationships,
727 # but does not update the parent object -- so (start,end) had better be right!!!
728 $store->add_SeqFeature($parent_id,@children);
734 =item build_object_tree_in_features
736 $loader->build_object_tree_in_features()
738 This method gathers together features and subfeatures and builds the
739 graph that connects them, assuming that parent/child relationships are
740 stored in the seqfeature objects themselves.
742 =cut
744 sub build_object_tree_in_features {
745 my $self = shift;
746 my $store = $self->store;
747 my $tmp = $self->tmp_store;
748 my $ld = $self->{load_data};
749 my $normalized = $self->subfeatures_normalized;
751 my $helper = $ld->{Helper};
753 while (my $load_id = $helper->each_toplevel) {
754 my $feature = $self->fetch($load_id)
755 or $self->throw("$load_id (id="
756 .$helper->local2global($load_id)
757 ." should have a database entry, but doesn't");
758 $self->attach_children($store,$ld,$load_id,$feature);
759 # Indexed objects are updated, not created anew
760 $feature->primary_id(undef) unless $helper->indexit($load_id);
761 $store->store($feature);
766 =item attach_children
768 $loader->attach_children($store,$load_data,$load_id,$feature)
770 This recursively adds children to features and their subfeatures. It
771 is called when subfeatures are directly contained within other
772 features, rather than stored in a relational table.
774 =cut
776 sub attach_children {
777 my $self = shift;
778 my ($store,$ld,$load_id,$feature) = @_;
780 my $children = $ld->{Helper}->children() or return;
781 for my $child_id (@$children) {
782 my $child = $self->fetch($child_id)
783 or $self->throw("$child_id should have a database entry, but doesn't");
784 $self->attach_children($store,$ld,$child_id,$child); # recursive call
785 $feature->add_SeqFeature($child);
789 =item fetch
791 my $feature = $loader->fetch($load_id)
793 Given a load ID (from the ID= attribute) this method returns the
794 feature from the temporary database or the permanent one, depending on
795 where it is stored.
797 =cut
799 sub fetch {
800 my $self = shift;
801 my $load_id = shift;
802 my $helper = $self->{load_data}{Helper};
803 my $id = $helper->local2global($load_id);
805 return
806 ($self->subfeatures_normalized || $helper->indexit($load_id)
807 ? $self->store->fetch($id)
808 : $self->tmp_store->fetch($id)
812 =item add_segment
814 $loader->add_segment($parent,$child)
816 This method is used to add a split location to the parent.
818 =cut
820 sub add_segment {
821 my $self = shift;
822 my ($parent,$child) = @_;
824 if ($parent->can('add_segment')) { # probably a lazy table feature
825 my $segment_count = $parent->can('denormalized_segment_count') ? $parent->denormalized_segment_count
826 : $parent->can('denormalized_segments ') ? $parent->denormalized_segments
827 : $parent->can('segments') ? $parent->segments
828 : 0;
829 unless ($segment_count) { # convert into a segmented object
830 my $segment;
831 if ($parent->can('clone')) {
832 $segment = $parent->clone;
833 } else {
834 my %clone = %$parent;
835 $segment = bless \%clone,ref $parent;
837 delete $segment->{segments};
838 eval {$segment->object_store(undef) };
839 $segment->primary_id(undef);
841 # this updates the object and expands its start and end positions without writing
842 # the segments into the database as individual objects
843 $parent->add_segment($segment);
845 $parent->add_segment($child);
846 1; # for debugging
849 # a conventional Bio::SeqFeature::Generic object - create a split location
850 else {
851 my $current_location = $parent->location;
852 if ($current_location->can('add_sub_Location')) {
853 $current_location->add_sub_Location($child->location);
854 } else {
855 eval "require Bio::Location::Split" unless Bio::Location::Split->can('add_sub_Location');
856 my $new_location = Bio::Location::Split->new();
857 $new_location->add_sub_Location($current_location);
858 $new_location->add_sub_Location($child->location);
859 $parent->location($new_location);
864 =item parse_attributes
866 ($reserved,$unreserved) = $loader->parse_attributes($attribute_line)
868 This method parses the information contained in the $attribute_line
869 into two hashrefs, one containing the values of reserved attribute
870 tags (e.g. ID) and the other containing the values of unreserved ones.
872 =cut
874 sub parse_attributes {
875 my $self = shift;
876 my $att = shift;
878 unless ($att =~ /=/) { # ouch! must be a GFF line
879 require Bio::DB::SeqFeature::Store::GFF2Loader
880 unless Bio::DB::SeqFeature::Store::GFF2Loader->can('parse_attributes');
881 return $self->Bio::DB::SeqFeature::Store::GFF2Loader::parse_attributes($att);
884 my @pairs = map { my ($name,$value) = split '=';
885 [$self->unescape($name) => $value];
886 } split ';',$att;
887 my (%reserved,%unreserved);
888 foreach (@pairs) {
889 my $tag = $_->[0];
891 unless (defined $_->[1]) {
892 warn "$tag does not have a value at GFF3 file line $.\n";
893 next;
896 my @values = split ',',$_->[1];
897 map {$_ = $self->unescape($_);} @values;
898 if ($Special_attributes{$tag}) { # reserved attribute
899 push @{$reserved{$tag}},@values;
900 } else {
901 push @{$unreserved{$tag}},@values
904 return (\%reserved,\%unreserved);
907 =item start_or_finish_sequence
909 $loader->start_or_finish_sequence('Chr9')
911 This method is called at the beginning and end of a fasta section.
913 =cut
915 # sub start_or_finish_sequence { } inherited
917 =item load_sequence
919 $loader->load_sequence('gatttcccaaa')
921 This method is called to load some amount of sequence after
922 start_or_finish_sequence() is first called.
924 =cut
926 # sub load_sequence { } inherited
928 =item open_fh
930 my $io_file = $loader->open_fh($filehandle_or_path)
932 This method opens up the indicated file or pipe, using some
933 intelligence to recognized compressed files and URLs and doing the
934 right thing.
936 =cut
938 # sub open_fh { } inherited
940 # sub msg { } inherited
942 =item time
944 my $time = $loader->time
946 This method returns the current time in seconds, using Time::HiRes if available.
948 =cut
950 # sub time { } inherited
952 =item unescape
954 my $unescaped = GFF3Loader::unescape($escaped)
956 This is an internal utility. It is the same as CGI::Util::unescape,
957 but doesn't change pluses into spaces and ignores unicode escapes.
959 =cut
961 # sub unescape { } inherited
963 sub _remap {
964 my $self = shift;
965 my ($ref,$start,$end,$strand) = @_;
966 my $mapper = $self->coordinate_mapper;
967 return ($ref,$start,$end,$strand) unless $mapper;
969 my ($newref,$coords) = $mapper->($ref,[$start,$end]);
970 return unless defined $coords->[0];
971 if ($coords->[0] > $coords->[1]) {
972 @{$coords} = reverse(@{$coords});
973 $strand *= -1;
975 return ($newref,@{$coords},$strand);
978 sub _indexit { # override
979 my $self = shift;
980 return $self->{load_data}{Helper}->indexit(@_);
983 sub _local2global { # override
984 my $self = shift;
985 return $self->{load_data}{Helper}->local2global(@_);
988 =item local_ids
990 my $ids = $self->local_ids;
991 my $id_cnt = @$ids;
993 After performing a load, this returns an array ref containing all the
994 load file IDs that were contained within the file just loaded.
996 =cut
998 sub local_ids { # override
999 my $self = shift;
1000 return $self->{load_data}{Helper}->local_ids(@_);
1003 =item loaded_ids
1005 my $ids = $loader->loaded_ids;
1006 my $id_cnt = @$ids;
1008 After performing a load, this returns an array ref containing all the
1009 feature primary ids that were created during the load.
1011 =cut
1013 sub loaded_ids { # override
1014 my $self = shift;
1015 return $self->{load_data}{Helper}->loaded_ids(@_);
1020 __END__
1022 =back
1024 =head1 BUGS
1026 This is an early version, so there are certainly some bugs. Please
1027 use the BioPerl bug tracking system to report bugs.
1029 =head1 SEE ALSO
1031 L<Bio::DB::SeqFeature::Store>,
1032 L<Bio::DB::SeqFeature::Segment>,
1033 L<Bio::DB::SeqFeature::NormalizedFeature>,
1034 L<Bio::DB::SeqFeature::GFF2Loader>,
1035 L<Bio::DB::SeqFeature::Store::DBI::mysql>,
1036 L<Bio::DB::SeqFeature::Store::berkeleydb>
1038 =head1 AUTHOR
1040 Lincoln Stein E<lt>lstein@cshl.orgE<gt>.
1042 Copyright (c) 2006 Cold Spring Harbor Laboratory.
1044 This library is free software; you can redistribute it and/or modify
1045 it under the same terms as Perl itself.
1047 =cut