2 # BioPerl module for Bio::SeqIO::bsml
4 # Please direct questions and support issues to <bioperl-l@bioperl.org>
6 # Cared for by Charles Tilford (tilfordc@bms.com)
7 # Copyright (C) Charles Tilford 2001
9 # This library is free software; you can redistribute it and/or
10 # modify it under the terms of the GNU Lesser General Public
11 # License as published by the Free Software Foundation; either
12 # version 2.1 of the License, or (at your option) any later version.
14 # This library is distributed in the hope that it will be useful,
15 # but WITHOUT ANY WARRANTY; without even the implied warranty of
16 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
17 # Lesser General Public License for more details.
19 # You should have received a copy of the GNU Lesser General Public
20 # License along with this library; if not, write to the Free Software
21 # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
22 # Also at: http://www.gnu.org/copyleft/lesser.html
24 # Much of the basic documentation in this module has been
25 # cut-and-pasted from the embl.pm (Ewan Birney) SeqIO module.
29 Bio::SeqIO::bsml - BSML sequence input/output stream
33 It is probably best not to use this object directly, but rather go
34 through the SeqIO handler system. To read a BSML file:
36 $stream = Bio::SeqIO->new( -file => $filename, -format => 'bsml');
38 while ( my $bioSeqObj = $stream->next_seq() ) {
39 # do something with $bioSeqObj
42 To write a Seq object to the current file handle in BSML XML format:
44 $stream->write_seq( -seq => $seqObj);
46 If instead you would like a XML::DOM object containing the BSML, use:
48 my $newXmlObject = $stream->to_bsml( -seq => $seqObj);
52 In addition to parts of the Bio:: hierarchy, this module uses:
58 This object can transform Bio::Seq objects to and from BSML (XML)
63 2/1/02 - I have changed the API to more closely match argument
64 passing used by other BioPerl methods ( -tag => value ). Internal
65 methods are using the same API, but you should not be calling those
72 User feedback is an integral part of the evolution of this and other
73 Bioperl modules. Send your comments and suggestions preferably to one
74 of the Bioperl mailing lists. Your participation is much appreciated.
76 bioperl-l@bioperl.org - General discussion
77 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
81 Please direct usage questions or support issues to the mailing list:
83 I<bioperl-l@bioperl.org>
85 rather than to the module maintainer directly. Many experienced and
86 reponsive experts will be able look at the problem and quickly
87 address it. Please include a thorough description of the problem
88 with code and data examples if at all possible.
92 Report bugs to the Bioperl bug tracking system to help us keep track
93 the bugs and their resolution.
94 Bug reports can be submitted via the web:
96 https://github.com/bioperl/bioperl-live/issues
98 =head2 Things Still to Do
100 * The module now uses the new Collection.pm system. However,
101 Annotations associated with a Feature object still seem to use the
102 old system, so parsing with the old methods are included..
104 * Generate Seq objects with no sequence data but an assigned
105 length. This appears to be an issue with Bio::Seq. It is possible
106 (and reasonable) to make a BSML document with features but no
109 * Support <Seq-data-import>. Do not know how commonly this is used.
111 * Some features are awaiting implementation in later versions of
114 * Nested feature support
116 * Complex feature (ie joins)
118 * Unambiguity in strand (ie -1,0,1, not just 'complement' )
120 * More friendly dblink structures
122 * Location.pm (or RangeI::union?) appears to have a bug when 'expand'
125 * More intelligent hunting for sequence and feature titles? It is not
126 terribly clear where the most appropriate field is located, better
127 grepping (eg looking for a reasonable count for spaces and numbers)
128 may allow for titles better than "AE008041".
130 =head1 AUTHOR - Charles Tilford
132 Bristol-Myers Squibb Bioinformatics
134 Email tilfordc@bms.com
136 I have developed the BSML specific code for this package, but have used
137 code from other SeqIO packages for much of the nuts-and-bolts. In particular
138 I have used code from the embl.pm module either directly or as a framework
139 for many of the subroutines that are common to SeqIO modules.
143 package Bio
::SeqIO
::bsml
;
146 use Bio
::SeqFeature
::Generic
;
149 use Bio
::Seq
::SeqFactory
;
150 use Bio
::Annotation
::Collection
;
151 use Bio
::Annotation
::Comment
;
152 use Bio
::Annotation
::Reference
;
153 use Bio
::Annotation
::DBLink
;
155 use base
qw(Bio::SeqIO);
157 my $idcounter = {}; # Used to generate unique id values
158 my $nvtoken = ": "; # The token used if a name/value pair has to be stuffed
165 # LS: this seems to get overwritten on line 1317, generating a redefinition error.
167 # CAT: This was inappropriately added in revision 1.10 - I added the check for
168 # existence of a sequence factory to the actual _initialize
170 # my($self,@args) = @_;
171 # $self->SUPER::_initialize(@args);
172 # if( ! defined $self->sequence_factory ) {
173 # $self->sequence_factory(Bio::Seq::SeqFactory->new(-verbose => $self->verbose(),
174 # -type => 'Bio::Seq::RichSeq'));
181 Usage : my $bioSeqObj = $stream->next_seq
182 Function: Retrieves the next sequence from a SeqIO::bsml stream.
183 Returns : A reference to a Bio::Seq::RichSeq object
191 my $bioSeq = $self->sequence_factory->create(-verbose
=>$self->verbose());
193 unless (exists $self->{'domtree'}) {
194 $self->throw("A BSML document has not yet been parsed.");
197 my $dom = $self->{'domtree'};
198 my $seqElements = $dom->getElementsByTagName ("Sequence");
199 if ($self->{'current_node'} == $seqElements->getLength ) {
200 # There are no more <Sequence>s to process
203 my $xmlSeq = $seqElements->item($self->{'current_node'});
205 # Assume that title attribute contains the best display id
206 if (my $val = $xmlSeq->getAttribute( "title")) {
207 $bioSeq->display_id($val);
210 # Set the molecule type
211 if (my $val = $xmlSeq->getAttribute( "molecule" )) {
212 my %mol = ('dna' => 'DNA', 'rna' => 'RNA', 'aa' => 'protein');
213 $bioSeq->molecule($mol{ lc($val) });
216 # Set the accession number
217 if (my $val = $xmlSeq->getAttribute( "ic-acckey" )) {
218 $bioSeq->accession_number($val);
221 # Get the sequence data for the element
222 if (my $seqData = &FIRSTDATA
($xmlSeq->getElementsByTagName("Seq-data")
224 # Sequence data exists, transfer to the Seq object
225 # Remove white space and CRs (not neccesary?)
226 $seqData =~ s/[\s\n\r]//g;
227 $bioSeq->seq($seqData);
228 } elsif (my $import = $xmlSeq->getElementsByTagName("Seq-dataimport")
230 #>>>> # What about <Seq-data-import> ??
232 } elsif (my $val = $xmlSeq->getAttribute("length")) {
233 # No sequence defined, set the length directly
235 #>>>> # This does not appear to work - length is apparently calculated
236 # from the sequence. How to make a "virtual" sequence??? Such
237 # creatures are common in BSML...
238 $bioSeq->length($val);
241 my $species = Bio
::Species
->new();
242 my @classification = ();
244 # Peruse the generic <Attributes> - those that are direct children of
245 # the <Sequence> or the <Feature-tables> element
246 # Sticky wicket here - data not controlled by schema, could be anything
248 my %specs = ('common_name' => 'y',
251 'sub_species' => 'y',
254 'add_date' => [ qw(date date-created date-last-updated)],
255 'keywords' => [ 'keyword', ],
256 'seq_version' => [ 'version' ],
257 'division' => [ 'division' ],
258 'add_secondary_accession' => ['accession'],
260 'primary_id' => [ 'primary.id', 'primary_id' ],
263 my $floppies = &GETFLOPPIES
($xmlSeq);
264 for my $attr (@
{$floppies}) {
265 # Don't want to get attributes from <Feature> or <Table> elements yet
266 my $parent = $attr->getParentNode->getNodeName;
267 next unless($parent eq "Sequence" || $parent eq "Feature-tables");
269 my ($name, $content) = &FLOPPYVALS
($attr);
271 if (exists $specs{$name}) { # It looks like part of species...
272 $species->$name($content);
276 # Cycle through the Seq methods:
277 for my $method (keys %seqMap) {
278 # Cycle through potential matching attributes:
279 for my $match (@
{$seqMap{$method}}) {
280 # If the <Attribute> name matches one of the keys,
281 # set $value, unless it has already been set
282 $value ||= $content if ($name =~ /$match/i);
286 if( $method eq 'seq_version'&& $value =~ /\S+\.(\d+)/ ) {
287 # hack for the fact that data in version is actually
291 $bioSeq->$method($value);
295 if( $name eq 'database-xref' ) {
296 my ($link_id,$link_db) = split(/:/,$value);
297 push @links, Bio
::Annotation
::DBLink
->new(-primary_id
=> $link_id,
298 -database
=> $link_db);
300 next if ($value ne "");
302 if ($name =~ /^species$/i) { # Uh, it's the species designation?
303 if ($content =~ / /) {
304 # Assume that a full species name has been provided
305 # This will screw up if the last word is the subspecies...
306 my @break = split " ", $content;
307 @classification = reverse @break;
309 $classification[0] = $content;
313 if ($name =~ /sub[_ ]?species/i) { # Should be the subspecies...
314 $species->sub_species( $content );
317 if ($name =~ /classification/i) { # Should be species classification
318 # We will assume that there are spaces separating the terms:
319 my @bits = split " ", $content;
320 # Now make sure there is not other cruft as well (eg semi-colons)
321 for my $i (0..$#bits) {
322 $bits[$i] =~ /(\w+)/;
325 $species->classification( @bits );
328 if ($name =~ /comment/) {
329 my $com = Bio
::Annotation
::Comment
->new('-text' => $content);
330 # $bioSeq->annotation->add_Comment($com);
331 $bioSeq->annotation->add_Annotation('comment', $com);
334 # Description line - collect all descriptions for later assembly
335 if ($name =~ /descr/) {
336 push @seqDesc, $content;
339 # Ok, we have no idea what this attribute is. Dump to SimpleValue
340 my $simp = Bio
::Annotation
::SimpleValue
->new( -value
=> $content);
341 $bioSeq->annotation->add_Annotation($name, $simp);
343 unless ($#seqDesc < 0) {
344 $bioSeq->desc( join "; ", @seqDesc);
347 #>>>> This should be modified so that any IDREF associated with the
348 # <Reference> is then used to associate the reference with the
349 # appropriate Feature
351 # Extract out <Reference>s associated with the sequence
354 -title
=> "RefTitle",
355 -authors
=> "RefAuthors",
356 -location
=> "RefJournal",
358 for my $ref ( $xmlSeq->getElementsByTagName ("Reference") ) {
360 for my $tag (keys %tags) {
361 my $rt = &FIRSTDATA
($ref->getElementsByTagName($tags{$tag})->item(0));
363 $rt =~ s/^[\s\r\n]+//; # Kill leading space
364 $rt =~ s/[\s\r\n]+$//; # Kill trailing space
365 $rt =~ s/[\s\r\n]+/ /; # Collapse internal space runs
366 $refVals{$tag} = $rt;
368 my $reference = Bio
::Annotation
::Reference
->new( %refVals );
370 # Pull out any <Reference> information hidden in <Attributes>
372 comment
=> [ 'comment', 'remark' ],
373 medline
=> [ 'medline', ],
374 pubmed
=> [ 'pubmed' ],
375 start
=> [ 'start', 'begin' ],
376 end
=> [ 'stop', 'end' ],
379 my $floppies = &GETFLOPPIES
($ref);
380 for my $attr (@
{$floppies}) {
381 my ($name, $content) = &FLOPPYVALS
($attr);
383 # Cycle through the Seq methods:
384 for my $method (keys %refMap) {
385 # Cycle through potential matching attributes:
386 for my $match (@
{$refMap{$method}}) {
387 # If the <Attribute> name matches one of the keys,
388 # set $value, unless it has already been set
389 $value ||= $content if ($name =~ /$match/i);
392 my $str = '$reference->' . $method . "($value)";
397 next if ($value ne "");
398 # Don't know what the <Attribute> is, dump it to comments:
399 push @refCom, $name . $nvtoken . $content;
401 unless ($#refCom < 0) {
402 # Random stuff was found, tack it to the comment field
403 my $exist = $reference->comment;
404 $exist .= join ", ", @refCom;
405 $reference->comment($exist);
407 push @refs, $reference;
409 $bioSeq->annotation->add_Annotation('reference' => $_) for @refs;
410 my $ann_col = $bioSeq->annotation;
411 # Extract the <Feature>s for this <Sequence>
412 for my $feat ( $xmlSeq->getElementsByTagName("Feature") ) {
413 $bioSeq->add_SeqFeature( $self->_parse_bsml_feature($feat) );
416 $species->classification( @classification );
417 $bioSeq->species( $species );
419 $bioSeq->annotation->add_Annotation('dblink' => $_) for @links;
421 $self->{'current_node'}++;
424 #- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
425 # Get all the <Attribute> and <Qualifier> children for an object, and
426 # return them as an array reference
427 # ('floppy' since these elements have poor/no schema control)
432 my $attributes = $obj->getElementsByTagName ("Attribute");
433 for (my $i = 0; $i < $attributes->getLength; $i++) {
434 push @floppies, $attributes->item($i);
436 my $qualifiers = $obj->getElementsByTagName ("Qualifier");
437 for (my $i = 0; $i < $qualifiers->getLength; $i++) {
438 push @floppies, $qualifiers->item($i);
442 #- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
443 # Given a DOM <Attribute> or <Qualifier> object, return the [name, value] pair
448 if ($obj->getNodeName eq "Attribute") {
449 $name = $obj->getAttribute('name');
450 $value = $obj->getAttribute('content');
451 } elsif ($obj->getNodeName eq "Qualifier") {
452 # Wheras <Attribute>s require both 'name' and 'content' attributes,
453 # <Qualifier>s can technically have either blank (and sometimes do)
454 my $n = $obj->getAttribute('value-type');
455 $name = $n if ($n ne "");
456 my $v = $obj->getAttribute('value');
457 $value = $v if ($v ne "");
459 return ($name, $value);
461 #- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
462 # Returns the value of the first TEXT_NODE encountered below an element
463 # Rational - avoid grabbing a comment rather than the PCDATA. Not foolproof...
466 return unless ($element);
468 my $hopefuls = $element->getChildNodes;
470 for (my $i = 0; $i < $hopefuls->getLength; $i++) {
471 if ($hopefuls->item($i)->getNodeType ==
472 XML
::DOM
::Node
::TEXT_NODE
() ) {
473 $data = $hopefuls->item($i)->getNodeValue;
479 #- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
480 # Just collapses whitespace runs in a string
483 $string =~ s/[\s\r\n]+/ /g;
490 Usage : my $domDoc = $obj->to_bsml(@args)
491 Function: Generates an XML structure for one or more Bio::Seq objects.
492 If $seqref is an array ref, the XML tree generated will include
493 all the sequences in the array.
494 Returns : A reference to the XML DOM::Document object generated / modified
495 Args : Argument array in form of -key => val. Recognized keys:
497 -seq A Bio::Seq reference, or an array reference of many of them
499 -xmldoc Specifies an existing XML DOM document to add the sequences
500 to. If included, then only data (no page formatting) will
501 be added. If not, a new XML::DOM::Document will be made,
502 and will be populated with both <Sequence> data, as well as
503 <Page> display elements.
505 -nodisp Do not generate <Display> elements, or any children
506 thereof, even if -xmldoc is not set.
508 -skipfeat If set to 'all', all <Feature>s will be skipped. If it is
509 a hash reference, any <Feature> with a class matching a key
510 in the hash will be skipped - for example, to skip 'source'
511 and 'score' features, use:
513 -skipfeat => { source => 'Y', score => 'Y' }
515 -skiptags As above: if set to 'all', no tags are included, and if a
516 hash reference, those specific tags will be ignored.
518 Skipping some or all tags and features can result in
519 noticeable speed improvements.
521 -nodata If true, then <Seq-data> will not be included. This may be
522 useful if you just want annotations and do not care about
523 the raw ACTG information.
525 -return Default is 'xml', which will return a reference to the BSML
526 XML object. If set to 'seq' will return an array ref of the
527 <Sequence> objects added (rather than the whole XML object)
529 -close Early BSML browsers will crash if an element *could* have
530 children but does not, and is closed as an empty element
531 e.g. <Styles/>. If -close is true, then such tags are given
532 a comment child to explicitly close them e.g. <Styles><!--
533 --></Styles>. This is default true, set to "0" if you do
534 not want this behavior.
536 Examples : my $domObj = $stream->to_bsml( -seq => \@fourCoolSequenceObjects,
537 -skipfeat => { source => 1 },
540 # Or add sequences to an existing BSML document:
541 $stream->to_bsml( -seq => \@fourCoolSequenceObjects,
542 -skipfeat => { source => 1 },
543 -xmldoc => $myBsmlDocumentInProgress, );
549 my $args = $self->_parseparams( -close => 1,
552 $args->{NODISP
} ||= $args->{NODISPLAY
};
553 my $seqref = $args->{SEQ
};
554 $seqref = (ref($seqref) eq 'ARRAY') ?
$seqref : [ $seqref ];
556 #############################
557 # Basic BSML XML Components #
558 #############################
561 my ($bsmlElem, $defsElem, $seqsElem, $dispElem);
562 if ($args->{XMLDOC
}) {
563 # The user has provided an existing XML DOM object
564 $xml = $args->{XMLDOC
};
565 unless ($xml->isa("XML::DOM::Document")) {
566 $self->throw('SeqIO::bsml.pm error:\n'.
567 'When calling ->to_bsml( { xmldoc => $myDoc }), $myDoc \n' .
568 'should be an XML::DOM::Document object, or an object that\n'.
569 'inherits from that class (like BsmlHelper.pm)');
572 # The user has not provided a new document, make one from scratch
573 $xml = XML
::DOM
::Document
->new();
574 $xml->setXMLDecl( $xml->createXMLDecl("1.0") );
575 my $url = "http://www.labbook.com/dtd/bsml2_2.dtd";
576 my $doc = $xml->createDocumentType("Bsml",$url);
577 $xml->setDoctype($doc);
578 $bsmlElem = $self->_addel( $xml, 'Bsml');
579 $defsElem = $self->_addel( $bsmlElem, 'Definitions');
580 $seqsElem = $self->_addel( $defsElem, 'Sequences');
581 unless ($args->{NODISP
}) {
582 $dispElem = $self->_addel( $bsmlElem, 'Display');
583 my $stylElem = $self->_addel( $dispElem, 'Styles');
584 my $style = $self->_addel( $stylElem, 'Style', {
585 type
=> "text/css" });
587 qq(Interval
-widget
{ display
: "1"; }\n) .
588 qq(Feature
{ display
-auto
: "1"; });
589 $style->appendChild( $xml->createTextNode($styleText) );
593 # Establish fundamental BSML elements, if they do not already exist
594 $bsmlElem ||= $xml->getElementsByTagName("Bsml")->item(0);
595 $defsElem ||= $xml->getElementsByTagName("Definitions")->item(0);
596 $seqsElem ||= $xml->getElementsByTagName("Sequences")->item(0);
602 # Map over Bio::Seq to BSML
603 my %mol = ('dna' => 'DNA', 'rna' => 'RNA', 'protein' => 'AA');
606 for my $bioSeq (@
{$seqref}) {
607 my $xmlSeq = $xml->createElement("Sequence");
608 my $FTs = $xml->createElement("Feature-tables");
610 # Array references to hold <Reference> objects:
611 my $seqRefs = []; my $featRefs = [];
612 # Array references to hold <Attribute> values (not objects):
617 "This file generated to BSML 2.2 standards - " .
618 "joins will be collapsed to a single feature enclosing all members of the join"
620 push @
{$seqDesc}, ["description" , eval{$bioSeq->desc}];
621 for my $kwd ( eval{$bioSeq->get_keywords} ) {
622 push @
{$seqDesc}, ["keyword" , $kwd];
624 push @
{$seqDesc}, ["keyword" , eval{$bioSeq->keywords}];
625 push @
{$seqDesc}, ["version" , eval{
626 join(".", $bioSeq->accession_number, $bioSeq->seq_version); }];
627 push @
{$seqDesc}, ["division" , eval{$bioSeq->division}];
628 push @
{$seqDesc}, ["pid" , eval{$bioSeq->pid}];
629 # push @{$seqDesc}, ["bio_object" , ref($bioSeq)];
630 push @
{$seqDesc}, ["primary_id" , eval{$bioSeq->primary_id}];
631 for my $dt (eval{$bioSeq->get_dates()} ) {
632 push @
{$seqDesc}, ["date" , $dt];
634 for my $ac (eval{$bioSeq->get_secondary_accessions()} ) {
635 push @
{$seqDesc}, ["secondary_accession" , $ac];
638 # Determine the accession number and a unique identifier
639 my $acc = $bioSeq->accession_number eq "unknown" ?
640 "" : $bioSeq->accession_number;
642 my $pi = $bioSeq->primary_id;
643 if ($pi && $pi !~ /Bio::/) {
644 # Not sure I understand what primary_id is... It sometimes
645 # is a string describing a reference to a BioSeq object...
646 $id = "SEQ" . $bioSeq->primary_id;
648 # Nothing useful found, make a new unique ID
649 $id = $acc || ("SEQ-io" . $idcounter->{Sequence
}++);
651 # print "$id->",ref($bioSeq->primary_id),"\n";
652 # An id field with spaces is interpreted as an idref - kill the spaces
654 # Map over <Sequence> attributes
655 my %attr = ( 'title' => $bioSeq->display_id,
656 'length' => $bioSeq->length,
659 'representation' => 'raw',
661 $attr{molecule
} = $mol{ lc($bioSeq->molecule) } if $bioSeq->can('molecule');
664 for my $a (keys %attr) {
665 $xmlSeq->setAttribute($a, $attr{$a}) if (defined $attr{$a} &&
668 # Orphaned Attributes:
669 $xmlSeq->setAttribute('topology', 'circular')
670 if ($bioSeq->is_circular);
671 # <Sequence> strand, locus
673 $self->_add_page($xml, $xmlSeq) if ($dispElem);
678 # Check for Bio::Annotations on the * <Sequence> *.
679 $self->_parse_annotation( -xml
=> $xml, -obj
=> $bioSeq,
680 -desc
=> $seqDesc, -refs
=> $seqRefs);
682 # Incorporate species data
683 if (ref($bioSeq->species) eq 'Bio::Species') {
684 # Need to peer into Bio::Species ...
685 my @specs = ('common_name', 'genus', 'species', 'sub_species');
686 for my $sp (@specs) {
687 next unless (my $val = $bioSeq->species()->$sp());
688 push @
{$seqDesc}, [$sp , $val];
690 push @
{$seqDesc}, ['classification',
691 (join " ", $bioSeq->species->classification) ];
692 # Species::binomial will return "genus species sub_species" ...
693 } elsif (my $val = $bioSeq->species) {
694 # Ok, no idea what it is, just dump it in there...
695 push @
{$seqDesc}, ["species", $val];
698 # Add the description <Attribute>s for the <Sequence>
699 for my $seqD (@
{$seqDesc}) {
700 $self->_addel($xmlSeq, "Attribute", {
701 name
=> $seqD->[0], content
=> $seqD->[1]}) if ($seqD->[1]);
704 # If sequence references were added, make a Feature-table for them
705 unless ($#{$seqRefs} < 0) {
706 my $seqFT = $self->_addel($FTs, "Feature-table", {
707 title
=> "Sequence References", });
708 for my $feat (@
{$seqRefs}) {
709 $seqFT->appendChild($feat);
713 # This is the appropriate place to add <Feature-tables>
714 $xmlSeq->appendChild($FTs);
720 #>>>> # Perhaps it is better to loop through top_Seqfeatures?...
721 #>>>> # ...however, BSML does not have a hierarchy for Features
723 if (defined $args->{SKIPFEAT
} &&
724 $args->{SKIPFEAT
} eq 'all') {
725 $args->{SKIPFEAT
} = { all
=> 1};
726 } else { $args->{SKIPFEAT
} ||= {} }
727 for my $class (keys %{$args->{SKIPFEAT
}}) {
728 $args->{SKIPFEAT
}{lc($class)} = $args->{SKIPFEAT
}{$class};
730 # Loop through all the features
731 my @features = $bioSeq->all_SeqFeatures();
732 if (@features && !$args->{SKIPFEAT
}{all
}) {
733 my $ft = $self->_addel($FTs, "Feature-table", {
734 title
=> "Features", });
735 for my $bioFeat (@features ) {
737 my $class = lc($bioFeat->primary_tag);
738 # The user may have specified to ignore this type of feature
739 next if ($args->{SKIPFEAT
}{$class});
740 my $id = "FEAT-io" . $idcounter->{Feature
}++;
741 my $xmlFeat = $self->_addel( $ft, 'Feature', {
744 'value-type' => $bioFeat->source_tag });
745 # Check for Bio::Annotations on the * <Feature> *.
746 $self->_parse_annotation( -xml
=> $xml, -obj
=> $bioFeat,
747 -desc
=> $featDesc, -id
=> $id,
748 -refs
=>$featRefs, );
749 # Add the description stuff for the <Feature>
750 for my $de (@
{$featDesc}) {
751 $self->_addel($xmlFeat, "Attribute", {
752 name
=> $de->[0], content
=> $de->[1]}) if ($de->[1]);
754 $self->_parse_location($xml, $xmlFeat, $bioFeat);
756 # loop through the tags, add them as <Qualifiers>
757 next if (defined $args->{SKIPTAGS
} &&
758 $args->{SKIPTAGS
} =~ /all/i);
759 # Tags can consume a lot of CPU cycles, and can often be
760 # rather non-informative, so -skiptags can allow one total or
761 # selective omission of tags.
762 for my $tag ($bioFeat->all_tags()) {
763 next if (exists $args->{SKIPTAGS
}{$tag});
764 for my $val ($bioFeat->each_tag_value($tag)) {
765 $self->_addel( $xmlFeat, 'Qualifier', {
766 'value-type' => $tag ,
778 if ( (my $data = $bioSeq->seq) && !$args->{NODATA
} ) {
779 my $d = $self->_addel($xmlSeq, 'Seq-data');
780 $d->appendChild( $xml->createTextNode($data) );
783 # If references were added, make a Feature-table for them
784 unless ($#{$featRefs} < 0) {
785 my $seqFT = $self->_addel($FTs, "Feature-table", {
786 title
=> "Feature References", });
787 for my $feat (@
{$featRefs}) {
788 $seqFT->appendChild($feat);
792 # Place the completed <Sequence> tree as a child of <Sequences>
793 $seqsElem->appendChild($xmlSeq);
794 push @xmlSequences, $xmlSeq;
797 # Prevent browser crashes by explicitly closing empty elements:
798 if ($args->{CLOSE
}) {
799 my @problemChild = ('Sequences', 'Sequence', 'Feature-tables',
800 'Feature-table', 'Screen', 'View',);
801 for my $kid (@problemChild) {
802 for my $prob ($xml->getElementsByTagName($kid)) {
803 unless ($prob->hasChildNodes) {
805 $xml->createComment(" Must close <$kid> explicitly "));
811 if (defined $args->{RETURN
} &&
812 $args->{RETURN
} =~ /seq/i) {
813 return \
@xmlSequences;
822 Usage : $obj->write_seq(@args)
823 Function: Prints out an XML structure for one or more Bio::Seq objects.
824 If $seqref is an array ref, the XML tree generated will include
825 all the sequences in the array. This method is fairly simple,
826 most of the processing is performed within to_bsml.
827 Returns : A reference to the XML object generated / modified
828 Args : Argument array. Recognized keys:
830 -seq A Bio::Seq reference, or an array reference of many of them
832 Alternatively, the method may be called simply as...
834 $obj->write_seq( $bioseq )
836 ... if only a single argument is passed, it is assumed that
837 it is the sequence object (can also be an array ref of
840 -printmime If true prints "Content-type: $mimetype\n\n" at top of
841 document, where $mimetype is the value designated by this
842 key. For generic XML use text/xml, for BSML use text/x-bsml
844 -return This option will be suppressed, since the nature of this
845 method is to print out the XML document. If you wish to
846 retrieve the <Sequence> objects generated, use the to_bsml
853 my $args = $self->_parseparams( @_);
855 # If only a single value is passed, assume it is the seq object
858 # Build a BSML XML DOM object based on the sequence(s)
859 my $xml = $self->to_bsml( @_,
861 # Convert to a string
862 my $out = $xml->toString;
863 # Print after putting a return after each element - more readable
865 $self->_print("Content-type: " . $args->{PRINTMIME
} . "\n\n")
866 if ($args->{PRINTMIME
});
867 $self->_print( $out );
868 # Return the DOM tree in case the user wants to do something with it
870 $self->flush if $self->_flush_on_write && defined $self->_fh;
874 =head1 INTERNAL METHODS
875 #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-
877 The following methods are used for internal processing, and should probably
878 not be accessed by the user.
880 =head2 _parse_location
882 Title : _parse_location
883 Usage : $obj->_parse_location($xmlDocument, $parentElem, $SeqFeatureObj)
884 Function: Adds <Interval-loc> and <Site-loc> children to <$parentElem> based
885 on locations / sublocations found in $SeqFeatureObj. If
886 sublocations exist, the original location will be ignored.
887 Returns : An array ref containing the elements added to the parent.
888 These will have already been added to <$parentElem>
889 Args : 0 The DOM::Document being modified
890 1 The DOM::Element parent that you want to add to
891 2 Reference to the Bio::SeqFeature being analyzed
895 ###############################
896 # <Interval-loc> & <Site-loc> #
897 ###############################
899 sub _parse_location
{
901 my ($xml, $xmlFeat, $bioFeat) = @_;
902 my $bioLoc = $bioFeat->location;
904 if (ref($bioLoc) =~ /Split/) {
905 @locations = $bioLoc->sub_Location;
906 # BSML 2.2 does not recognize / support joins. For this reason,
907 # we will just use the upper-level location. The line below can
908 # be deleted or commented out if/when BSML 3 supports complex
909 # interval deffinitions:
910 @locations = ($bioLoc);
912 @locations = ($bioLoc);
916 # Add the site or interval positional information:
917 for my $loc (@locations) {
918 my ($start, $end) = ($loc->start, $loc->end);
920 # Strand information is not well described in BSML
921 $locAttr{complement
} = 1 if ($loc->strand == -1);
922 if ($start ne "" && ($start == $end || $end eq "")) {
923 $locAttr{sitepos
} = $start;
924 push @added, $self->_addel($xmlFeat,'Site-loc',\
%locAttr);
925 } elsif ($start ne "" && $end ne "") {
927 # The feature is on the complementary strand
928 ($start, $end) = ($end, $start);
929 $locAttr{complement
} = 1;
931 $locAttr{startpos
} = $start;
932 $locAttr{endpos
} = $end;
933 push @added, $self->_addel($xmlFeat,'Interval-loc',\
%locAttr);
935 warn "Failure to parse SeqFeature location. Start = '$start' & End = '$end'";
941 =head2 _parse_bsml_feature
943 Title : _parse_bsml_feature
944 Usage : $obj->_parse_bsml_feature($xmlFeature )
945 Function: Will examine the <Feature> element provided by $xmlFeature and
946 return a generic seq feature.
947 Returns : Bio::SeqFeature::Generic
948 Args : 0 XML::DOM::Element <Feature> being analyzed.
952 sub _parse_bsml_feature
{
956 my $basegsf = Bio
::SeqFeature
::Generic
->new();
961 # Use the class as the primary tag value, if it is present
962 if ( my $val = $feat->getAttribute("class") ) {
963 $basegsf->primary_tag($val);
966 # Positional information is in <Interval-loc>s or <Site-loc>s
967 # We need to grab these in order, to try to recreate joins...
969 for my $kid ($feat->getChildNodes) {
970 my $nodeName = $kid->getNodeName;
971 next unless ($nodeName eq "Interval-loc" ||
972 $nodeName eq "Site-loc");
973 push @locations, $kid;
975 if ($#locations == 0) {
976 # There is only one location specified
977 $self->_parse_bsml_location($locations[0], $basegsf);
978 } elsif ($#locations > 0) {
979 #>>>> # This is not working, I think the error is somewhere downstream
980 # of add_sub_SeqFeature, probably in RangeI::union ?
981 # The sub features are added fine, but the EXPANDed parent feature
982 # location has a messed up start - Bio::SeqFeature::Generic ref
983 # instead of an integer - and an incorrect end - the end of the first
984 # sub feature added, not of the union of all of them.
986 # Also, the SeqIO::genbank.pm output is odd - the sub features appear
987 # to be listed with the *previous* feature, not this one.
989 for my $location (@locations) {
990 my $subgsf = $self->_parse_bsml_location($location);
991 # print "start ", $subgsf->start,"\n";
992 # print "end ", $subgsf->end,"\n";
993 $basegsf->add_sub_SeqFeature($subgsf, 'EXPAND');
995 # print $feat->getAttribute('id'),"\n";
996 # print $basegsf->primary_tag,"\n";
999 # What to do if there are no locations? Nothing needed?
1002 # Look at any <Attribute>s or <Qualifier>s that are present:
1003 my $floppies = &GETFLOPPIES
($feat);
1004 for my $attr (@
{$floppies}) {
1005 my ($name, $content) = &FLOPPYVALS
($attr);
1006 # Don't know what the object is, dump it to a tag:
1007 $basegsf->add_tag_value(lc($name), $content);
1010 # Mostly this helps with debugging, but may be of utility...
1011 # Add a tag holding the BSML id value
1012 if ( (my $val = $feat->getAttribute('id')) &&
1013 !$basegsf->has_tag('bsml-id')) {
1014 # Decided that this got a little sloppy...
1015 # $basegsf->add_tag_value("bsml-id", $val);
1020 =head2 _parse_bsml_location
1022 Title : _parse_bsml_location
1023 Usage : $obj->_parse_bsml_feature( $intOrSiteLoc, $gsfObject )
1024 Function: Will examine the <Interval-loc> or <Site-loc> element provided
1025 Returns : Bio::SeqFeature::Generic
1026 Args : 0 XML::DOM::Element <Interval/Site-loc> being analyzed.
1027 1 Optional SeqFeature::Generic to use
1031 sub _parse_bsml_location
{
1033 my ($loc, $gsf) = @_;
1035 $gsf ||= Bio
::SeqFeature
::Generic
->new();
1036 my $type = $loc->getNodeName;
1038 if ($type eq 'Interval-loc') {
1039 $start = $loc->getAttribute('startpos');
1040 $end = $loc->getAttribute('endpos');
1041 } elsif ($type eq 'Site-loc') {
1042 $start = $end = $loc->getAttribute('sitepos');
1044 warn "Unknown location type '$type', could not make GSF\n";
1047 $gsf->start($start);
1050 # BSML does not have an explicit method to set undefined strand
1051 if (my $s = $loc->getAttribute("complement")) {
1058 # We're setting "strand nonspecific" here - bad idea?
1059 # In most cases the user likely meant it to be on the + strand
1066 =head2 _parse_reference
1068 Title : _parse_reference
1069 Usage : $obj->_parse_reference(@args )
1070 Function: Makes a new <Reference> object from a ::Reference, which is
1071 then stored in an array provide by -refs. It will be
1072 appended to the XML tree later.
1074 Args : Argument array. Recognized keys:
1076 -xml The DOM::Document being modified
1078 -refobj The Annotation::Reference Object
1080 -refs An array reference to hold the new <Reference> DOM object
1082 -id Optional. If the XML id for the 'calling' element is
1083 provided, it will be placed in any <Reference> refs
1088 sub _parse_reference
{
1090 my $args = $self->_parseparams( @_);
1091 my ($xml, $ref, $refRef) = ($args->{XML
}, $args->{REFOBJ
}, $args->{REFS
});
1097 my $xmlRef = $xml->createElement("Reference");
1098 #>> This may not be the right way to make a BSML dbxref...
1099 if (my $link = $ref->medline) {
1100 $xmlRef->setAttribute('dbxref', $link);
1103 # Make attributes for some of the characteristics
1104 my %stuff = ( start
=> $ref->start,
1107 comment
=> $ref->comment,
1108 pubmed
=> $ref->pubmed,
1110 for my $s (keys %stuff) {
1111 $self->_addel($xmlRef, "Attribute", {
1112 name
=> $s, content
=> $stuff{$s} }) if ($stuff{$s});
1114 $xmlRef->setAttribute('refs', $args->{ID
}) if ($args->{ID
});
1115 # Add the basic information
1116 # Should probably check for content before creation...
1117 $self->_addel($xmlRef, "RefAuthors")->
1118 appendChild
( $xml->createTextNode(&STRIP
($ref->authors)) );
1119 $self->_addel($xmlRef, "RefTitle")->
1120 appendChild
( $xml->createTextNode(&STRIP
($ref->title)) );
1121 $self->_addel($xmlRef, "RefJournal")->
1122 appendChild
( $xml->createTextNode(&STRIP
($ref->location)) );
1123 # References will be added later in a <Feature-Table>
1124 push @
{$refRef}, $xmlRef;
1127 =head2 _parse_annotation
1129 Title : _parse_annotation
1130 Usage : $obj->_parse_annotation(@args )
1131 Function: Will examine any Annotations found in -obj. Data found in
1132 ::Comment and ::DBLink structures, as well as Annotation
1133 description fields are stored in -desc for later
1134 generation of <Attribute>s. <Reference> objects are generated
1135 from ::References, and are stored in -refs - these will
1136 be appended to the XML tree later.
1138 Args : Argument array. Recognized keys:
1140 -xml The DOM::Document being modified
1142 -obj Reference to the Bio object being analyzed
1144 -descr An array reference for holding description text items
1146 -refs An array reference to hold <Reference> DOM objects
1148 -id Optional. If the XML id for the 'calling' element is
1149 provided, it will be placed in any <Reference> refs
1154 sub _parse_annotation
{
1156 my $args = $self->_parseparams( @_);
1157 my ($xml, $obj, $descRef, $refRef) =
1158 ( $args->{XML
}, $args->{OBJ
}, $args->{DESC
}, $args->{REFS
} );
1159 # No good place to put any of this (except for references). Most stuff
1160 # just gets dumped to <Attribute>s
1161 my $ann = $obj->annotation;
1162 return unless ($ann);
1163 # use BMS::Branch; my $debug = BMS::Branch->new( ); warn "$obj :"; $debug->branch($ann);
1164 unless (ref($ann) =~ /Collection/) {
1165 # Old style annotation. It seems that Features still use this
1167 $self->_parse_annotation_old(@_);
1171 for my $key ($ann->get_all_annotation_keys()) {
1172 for my $thing ($ann->get_Annotations($key)) {
1173 if ($key eq 'description') {
1174 push @
{$descRef}, ["description" , $thing->value];
1175 } elsif ($key eq 'comment') {
1176 push @
{$descRef}, ["comment" , $thing->text];
1177 } elsif ($key eq 'dblink') {
1178 # DBLinks get dumped to attributes, too
1179 push @
{$descRef}, ["db_xref" , $thing->database . ":"
1180 . $thing->primary_id ];
1181 if (my $com = $thing->comment) {
1182 push @
{$descRef}, ["link" , $com->text ];
1185 } elsif ($key eq 'reference') {
1186 $self->_parse_reference( @_, -refobj
=> $thing );
1187 } elsif (ref($thing) =~ /SimpleValue/) {
1188 push @
{$descRef}, [$key , $thing->value];
1191 push @
{$descRef}, ["error", "bsml.pm did not understand ".
1192 "'$key' = '$thing'" ];
1198 =head2 _parse_annotation_old
1200 Title : _parse_annotation_old
1201 Usage : $obj->_parse_annotation_old(@args)
1202 Function: As above, but for the old Annotation system.
1203 Apparently needed because Features are still using the old-style
1206 Args : Argument array. Recognized keys:
1208 -xml The DOM::Document being modified
1210 -obj Reference to the Bio object being analyzed
1212 -descr An array reference for holding description text items
1214 -refs An array reference to hold <Reference> DOM objects
1216 -id Optional. If the XML id for the 'calling' element is
1217 provided, it will be placed in any <Reference> refs
1226 sub _parse_annotation_old
{
1228 my $args = $self->_parseparams( @_);
1229 my ($xml, $obj, $descRef, $refRef) =
1230 ( $args->{XML
}, $args->{OBJ
}, $args->{DESC
}, $args->{REFS
} );
1231 # No good place to put any of this (except for references). Most stuff
1232 # just gets dumped to <Attribute>s
1233 if (my $ann = $obj->annotation) {
1234 push @
{$descRef}, ["annotation", $ann->description];
1235 for my $com ($ann->each_Comment) {
1236 push @
{$descRef}, ["comment" , $com->text];
1239 # Gene names just get dumped to <Attribute name="gene">
1240 for my $gene ($ann->each_gene_name) {
1241 push @
{$descRef}, ["gene" , $gene];
1244 # DBLinks get dumped to attributes, too
1245 for my $link ($ann->each_DBLink) {
1246 push @
{$descRef}, ["db_xref" ,
1247 $link->database . ":" . $link->primary_id ];
1248 if (my $com = $link->comment) {
1249 push @
{$descRef}, ["link" , $com->text ];
1253 # References get produced and temporarily held
1254 for my $ref ($ann->each_Reference) {
1255 $self->_parse_reference( @_, -refobj
=> $ref );
1263 Usage : $obj->_add_page($xmlDocument, $xmlSequenceObject)
1264 Function: Adds a simple <Page> and <View> structure for a <Sequence>
1265 Returns : a reference to the newly created <Page>
1266 Args : 0 The DOM::Document being modified
1267 1 Reference to the <Sequence> object
1273 my ($xml, $seq) = @_;
1274 my $disp = $xml->getElementsByTagName("Display")->item(0);
1275 my $page = $self->_addel($disp, "Page");
1276 my ($width, $height) = ( 7.8, 5.5);
1277 my $screen = $self->_addel($page, "Screen", {
1278 width
=> $width, height
=> $height, });
1279 # $screen->appendChild($xml->createComment("Must close explicitly"));
1280 my $view = $self->_addel($page, "View", {
1281 seqref
=> $seq->getAttribute('id'),
1282 title
=> $seq->getAttribute('title'),
1284 title2
=> "{LENGTH} {UNIT}",
1286 $self->_addel($view, "View-line-widget", {
1287 shape
=> 'horizontal',
1288 hcenter
=> $width/2 + 0.7,
1289 'linear-length' => $width - 2,
1291 $self->_addel($view, "View-axis-widget");
1299 Usage : $obj->_addel($parentElem, 'ChildName',
1300 { anAttr => 'someValue', anotherAttr => 'aValue',})
1301 Function: Add an element with attribute values to a DOM tree
1302 Returns : a reference to the newly added element
1303 Args : 0 The DOM::Element parent that you want to add to
1304 1 The name of the new child element
1305 2 Optional hash reference containing
1306 attribute name => attribute value assignments
1312 my ($root, $name, $attr) = @_;
1314 # Find the DOM::Document for the parent
1315 my $doc = $root->getOwnerDocument || $root;
1316 my $elem = $doc->createElement($name);
1317 for my $a (keys %{$attr}) {
1318 $elem->setAttribute($a, $attr->{$a});
1320 $root->appendChild($elem);
1327 Usage : $obj->_show_dna($newval)
1328 Function: (cut-and-pasted directly from embl.pm)
1329 Returns : value of _show_dna
1330 Args : newvalue (optional)
1338 $obj->{'_show_dna'} = $value;
1340 return $obj->{'_show_dna'};
1346 Usage : $dom = $obj->_initialize(@args)
1347 Function: Coppied from embl.pm, and augmented with initialization of the
1350 Args : -file => the XML file to be parsed
1355 my($self,@args) = @_;
1357 $self->SUPER::_initialize
(@args);
1358 # hash for functions for decoding keys.
1359 $self->{'_func_ftunit_hash'} = {};
1360 $self->_show_dna(1); # sets this to one by default. People can change it
1362 my %param = @args; # From SeqIO.pm
1363 @param{ map { lc $_ } keys %param } = values %param; # lowercase keys
1364 if ( exists $param{-file
} && $param{-file
} !~ /^>/) {
1365 # Is it blasphemy to add your own keys to an object in another package?
1366 # domtree => the parsed DOM tree retruned by XML::DOM
1367 $self->{'domtree'} = $self->_parse_xml( $param{-file
} );
1368 # current_node => the <Sequence> node next in line for next_seq
1369 $self->{'current_node'} = 0;
1372 $self->sequence_factory( Bio
::Seq
::SeqFactory
->new
1373 ( -verbose
=> $self->verbose(),
1374 -type
=> 'Bio::Seq::RichSeq'))
1375 if( ! defined $self->sequence_factory );
1381 Title : _parseparams
1382 Usage : my $paramHash = $obj->_parseparams(@args)
1383 Function: Borrowed from Bio::Parse.pm, who borrowed it from CGI.pm
1384 Lincoln Stein -> Richard Resnick -> here
1385 Returns : A hash reference of the parameter keys (uppercase) pointing to
1387 Args : An array of key, value pairs. Easiest to pass values as:
1388 -key1 => value1, -key2 => value2, etc
1389 Leading "-" are removed.
1398 # Hacked out from Parse.pm
1399 # The next few lines strip out the '-' characters which
1400 # preceed the keys, and capitalizes them.
1401 for (my $i=0;$i<@param;$i+=2) {
1402 $param[$i]=~s/^\-//;
1403 $param[$i]=~tr/a-z/A-Z/;
1405 pop @param if @param %2; # not an even multiple
1413 Usage : $dom = $obj->_parse_xml($filename)
1414 Function: uses XML::DOM to construct a DOM tree from the BSML document
1415 Returns : a reference to the parsed DOM tree
1416 Args : 0 Path to the XML file needing to be parsed
1425 $self->throw("Could not parse non-existant XML file '$file'.");
1428 my $parser = XML
::DOM
::Parser
->new();
1429 my $doc = $parser->parsefile ($file);
1435 # Reports off the net imply that DOM::Parser will memory leak if you
1436 # do not explicitly dispose of it:
1437 # http://aspn.activestate.com/ASPN/Mail/Message/perl-xml/788458
1438 my $dom = $self->{'domtree'};
1439 # For some reason the domtree can get undef-ed somewhere...
1440 $dom->dispose if ($dom);
1444 =head1 TESTING SCRIPT
1446 The following script may be used to test the conversion process. You
1447 will need a file of the format you wish to test. The script will
1448 convert the file to BSML, store it in /tmp/bsmltemp, read that file
1449 into a new SeqIO stream, and write it back as the original
1450 format. Comparison of this second file to the original input file
1451 will allow you to track where data may be lost or corrupted. Note
1452 that you will need to specify $readfile and $readformat.
1455 # Tests preservation of details during round-trip conversion:
1456 # $readformat -> BSML -> $readformat
1457 my $tempspot = "/tmp/bsmltemp"; # temp folder to hold generated files
1458 my $readfile = "rps4y.embl"; # The name of the file you want to test
1459 my $readformat = "embl"; # The format of the file being tested
1461 system "mkdir $tempspot" unless (-d $tempspot);
1462 # Make Seq object from the $readfile
1463 my $biostream = Bio::SeqIO->new( -file => "$readfile" );
1464 my $seq = $biostream->next_seq();
1466 # Write BSML from SeqObject
1467 my $bsmlout = Bio::SeqIO->new( -format => 'bsml',
1468 -file => ">$tempspot/out.bsml");
1469 warn "\nBSML written to $tempspot/out.bsml\n";
1470 $bsmlout->write_seq($seq);
1471 # Need to kill object for following code to work... Why is this so?
1474 # Make Seq object from BSML
1475 my $bsmlin = Bio::SeqIO->new( -file => "$tempspot/out.bsml",
1477 my $seq2 = $bsmlin->next_seq();
1479 # Write format back from Seq Object
1480 my $genout = Bio::SeqIO->new( -format => $readformat,
1481 -file => ">$tempspot/out.$readformat");
1482 $genout->write_seq($seq2);
1483 warn "$readformat written to $tempspot/out.$readformat\n";
1486 # Join information (not possible in BSML 2.2)
1487 # Sequence type (??)