deobfuscator/Deobfuscator/bin/deob_index.pl

   1 #!/usr/bin/perl
   2
   3 # deob_index.pl
   4 # part of the Deobfuscator package
   5 # by Laura Kavanaugh and Dave Messina
   6 #
   7 # cared for by Dave Messina <dave-pause@davemessina.net>
   8 #
   9 # POD documentation - main docs before the code
  10
  11 =head1 NAME
  12
  13 deob_index.pl - extracts BioPerl documentation and indexes it in a database for easy retrieval
  14
  15 =head1 VERSION
  16
  17 This document describes deob_index.pl version 0.0.3
  18
  19
  20 =head1 SYNOPSIS
  21
  22 deob_index.pl <path to BioPerl lib> <output path>
  23
  24 =over
  25
  26 =item <path to BioPerl lib>
  27
  28 a directory path pointing to the root of the BioPerl lib tree. e.g. /export/share/lib/perl5/site_perl/5.8.7/Bio/
  29
  30 =item <output path>
  31
  32 where you would like deob_index.pl to put its output files.
  33
  34 =back
  35
  36
  37 =head1 DESCRIPTION
  38
  39 deob_index.pl goes through the entire BioPerl library tree looking for
  40 .pm and .pl files. For each one it finds, it tries to extract module-level
  41 POD documentation (e.g. SYNOPSIS, DESCRIPTION) and store it in a BerkeleyDB.
  42 It also tries to extract documentation for each method in the module and
  43 store that in a separate BerkeleyDB.
  44
  45 Specific parts of the documentation for a module or method may be retrieved
  46 individually using the functions available in Deobfuscator.pm. See that module
  47 for details.
  48
  49 While going through and trying to parse each module, deob_index.pl also
  50 reports what pieces of the documentation it can't find. For example, if
  51 a method's documentation doesn't describe the data type it returns, this
  52 script logs that information to a file. This type of automated documentation-
  53 checking could be used to standardize and improve the documentation in
  54 BioPerl.
  55
  56 deob_index.pl creates four files:
  57
  58 =over
  59
  60 =item C<< package_list.txt >>
  61
  62 A plaintext file listing each package found in the BioPerl directory that was
  63 searched. Packages are listed by their module names, such as 'Bio::SeqIO'.
  64 This file is used by L<deob_interface.cgi>.
  65
  66 =item C<< packages.db >>
  67
  68 A Berkeley DB, which stores package-level documentation, such as
  69 the synopsis and the description. Each key is a package name,
  70 e.g. "Bio::SeqIO", and each value string is composed of the
  71 individual pieces of the documentation kept separate by
  72 unique string record separators. The individual pieces of
  73 documentation are pulled out of the string using the
  74 get_pkg_docs function in Deobfuscator.pm. See that package
  75 for details.
  76
  77 =item C<< methods.db >>
  78
  79 Like packages.db, methods.db is also a Berkeley DB, except it
  80 stores various pieces of information about individual methods
  81 available to a class. Each method might have documentation
  82 about its usage, its arguments, its return values, an example,
  83 and a description of its function.
  84
  85 Each key is the fully-qualified method name, e.g.
  86 "Bio::SeqIO::next_seq". Each value is a string containing all
  87 of the pieces of documentation concatenated together and
  88 separated by unique strings serving as record separators. The
  89 extraction of the actual documentation in these strings is
  90 handled by the get_method_docs subroutine in the Deobfuscator.pm
  91 module. See that package for details.
  92
  93 Not all methods will have all of these types of documentation,
  94 and some methods will not have the different pieces of
  95 information clearly labeled and separated. For the latter type,
  96 deob_index.pl will try to store whatever free-form
  97 documentation that does exist, and the get_method_docs function
  98 in Deobfuscator.pm, if called without arguments, will return
  99 that documentation.
 100
 101 =item C<< deob_index.log >>
 102
 103 This file contains detailed information about errors
 104 encountered while trying to extract documentation during
 105 the indexing process.
 106
 107 Each line in deob_index.log is a key-value pair describing
 108 a single parsing error.
 109
 110 =back
 111
 112
 113 =head1 DIAGNOSTICS
 114
 115 These are the parsing error codes reported in 'deob_index.log'.
 116
 117 =head2 Package errors
 118
 119 =over
 120
 121 =item C<< PKG_NAME >>
 122
 123 couldn't find the name of the package
 124
 125 =item C<< SYNOPSIS >>
 126
 127 couldn't find the synopsis
 128
 129 =item C<< DESC >>
 130
 131 couldn't find the description
 132
 133 =item C<< METHODS >>
 134
 135 couldn't find any methods
 136
 137 =item C<< PKG_DUP >>
 138
 139 This package name occurs more than once
 140
 141 =back
 142
 143 =head2 Method errors
 144
 145 =over
 146
 147 =item C<< FUNCTION >>
 148
 149 couldn't find the function description
 150
 151 =item C<< EXAMPLE >>
 152
 153 couldn't find the example
 154
 155 =item C<< ARGS >>
 156
 157 couldn't find the method's arguments
 158
 159 =item C<< USAGE >>
 160
 161 couldn't find the usage statement
 162
 163 =item C<< RETURNS >>
 164
 165 couldn't find the return values
 166
 167 =item C<< FREEFORM >>
 168
 169 This method's documentation doesn't conform to the BioPerl standard of having
 170 clearly-labeled fields for title, function, example, args, usage, and returns.
 171
 172 =item C<< METH_DUP >>
 173
 174 This method name occurs more than once
 175
 176 =back
 177
 178
 179 =head1 CONFIGURATION AND ENVIRONMENT
 180
 181 This software requires:
 182
 183 =over
 184
 185 =item A working installation of the Berkeley DB
 186
 187 The Berkeley DB comes standard with most UNIX distributions, so you may
 188 already have it installed. See L<http://www.sleepycat.com> for more information.
 189
 190 =item BioPerl
 191
 192 deob_index.pl recursively navigates a directory of BioPerl modules. Note
 193 that the BioPerl module directory need not be "installed"; any old location
 194 will do. See L<http://www.bioperl.org> for the latest version.
 195
 196 =back
 197
 198
 199 =head1 DEPENDENCIES
 200
 201 L<version>, L<File::Find>, L<DB_File>
 202
 203
 204 =head1 INCOMPATIBILITIES
 205
 206 None reported.
 207
 208
 209 =head1 BUGS AND LIMITATIONS
 210
 211 No bugs have been reported.
 212
 213 deob_index.pl currently expects the sections of POD in a BioPerl module to
 214 be in a particular order, namely: NAME, SYNOPSIS, DESCRIPTION, CONSTRUCTORS,
 215 ... , APPENDIX. Those sections are expected to be marked with =head1 POD tags,
 216 and the documentation for each method is expected to be in =head2 sections
 217 in the APPENDIX. The order of SYNOPSIS and DESCRIPTION can be flipped, but
 218 this behavior should not be taken as encouragement to do so.
 219
 220 Most, but not all BioPerl modules conform to this standard. Those that do not
 221 will cause deob_index.pl to report them as errors. Although the consistency
 222 of this standard is desirable for end-users of the documentation, this code
 223 probably needs to be a little bit more flexible (patches welcome!).
 224
 225 This software has only been tested in a UNIX environment.
 226
 227
 228 =head1 FEEDBACK
 229
 230 =head2 Mailing Lists
 231
 232 User feedback is an integral part of the evolution of this and other
 233 Bioperl modules. Send your comments and suggestions preferably to one
 234 of the Bioperl mailing lists.  Your participation is much appreciated.
 235
 236   bioperl-l@bioperl.org                       - General discussion
 237   http://bioperl.org/Support.html   - About the mailing lists
 238
 239 =head2 Reporting Bugs
 240
 241 Report bugs to the Bioperl bug tracking system to help us keep track
 242 the bugs and their resolution.  Bug reports can be submitted via the
 243 web:
 244
 245   https://github.com/bioperl/bioperl-live/issues
 246
 247
 248 =head1 SEE ALSO
 249
 250 L<Deobfuscator>, L<deob_interface.cgi>, L<deob_detail.cgi>
 251
 252
 253 =head1 AUTHOR
 254
 255 Dave Messina C<< <dave-pause@davemessina.net> >>
 256
 257
 258 =head1 CONTRIBUTORS
 259
 260 =over
 261
 262 =item Laura Kavanaugh
 263
 264 =item David Curiel
 265
 266 =back
 267
 268
 269 =head1 ACKNOWLEDGMENTS
 270
 271 This software was developed originally at the Cold Spring Harbor Laboratory's
 272 Advanced Bioinformatics Course between Oct 12-25, 2005. Many thanks to David
 273 Curiel, who provided much-needed guidance and assistance on this project.
 274
 275
 276 =head1 LICENSE AND COPYRIGHT
 277
 278 Copyright (C) 2005-6 Laura Kavanaugh and Dave Messina. All Rights Reserved.
 279
 280 This module is free software; you may redistribute it and/or modify it under the
 281 same terms as Perl itself. See L<perlartistic>.
 282
 283
 284 =head1 DISCLAIMER
 285
 286 This software is provided "as is" without warranty of any kind.
 287
 288 =cut
 289
 290 use version; $VERSION = qv('0.0.2');
 291 use warnings;
 292 use strict;
 293 use File::Find;
 294 use DB_File;
 295 use IO::File;
 296 use Getopt::Std;
 297 use File::Spec;
 298
 299 # GetOpt::Std-related settings
 300 $Getopt::Std::STANDARD_HELP_VERSION = 1;
 301 getopts('s:x:');
 302
 303 my $DEBUG = 0;
 304
 305 my $usage = "
 306 deob_index.pl - extracts and parses BioPerl POD
 307 and stores the info in a database.
 308
 309 USAGE: deob_index.pl [-s bioperl-version] [-x exclude_file] <BioPerl lib dir> <output dir>
 310
 311 where
 312
 313 <BioPerl lib dir> is the BioPerl distribution you'd like to index
 314
 315     e.g. /export/share/lib/perl5/site_perl/5.8.7/Bio/
 316
 317 and
 318
 319 <output dir> is where the output files should be placed
 320
 321 OPTIONS:
 322 -s    user-supplied string to declare BioPerl's version
 323       (which will be displayed by deob_interface.cgi)
 324 -x    excluded modules file (a module paths to skip; see POD for details)
 325 ";
 326
 327 unless ( @ARGV == 2 ) { die $usage; }
 328
 329 my ( $source_dir, $dest_dir ) = @ARGV;
 330
 331 # check source_dir for full path and repair if it's a relative path
 332 unless ( File::Spec->file_name_is_absolute( $source_dir ) ) {
 333     $source_dir = File::Spec->rel2abs( $source_dir ) ;
 334 }
 335
 336 # check dest_dir for full path and repair if it's a relative path
 337 unless ( File::Spec->file_name_is_absolute( $dest_dir ) ) {
 338     $dest_dir = File::Spec->rel2abs( $dest_dir ) ;
 339 }
 340
 341 # NOTE: we're allowing only one source directory, but File::Find supports
 342 # passing an array of dirs.
 343
 344 # read in an optional list of modules to exclude from indexing
 345 # - this is aimed at modules with external dependencies that are often not
 346 # - present and thus will prevent deob_interface.cgi from loading them
 347 our ($opt_s, $opt_x);
 348 my %exclude;
 349 if (defined $opt_x) {
 350     my $exclude_fh = IO::File->new($opt_x, "r")
 351         or die "couldn't open $opt_x\n";
 352     while (<$exclude_fh>) {
 353         chomp;
 354         next if ( /^\#/ || /^\s*$/ ); # ignore comments and blank lines
 355         $exclude{$_} = 1;
 356     }
 357     print STDERR "Found ", scalar keys %exclude, " modules to be excluded.\n";
 358 }
 359
 360
 361 # save a list of the BioPerl modules to a file
 362 my $list; # filehandle
 363 my $list_file = $dest_dir . "/package_list.txt";
 364 if ( -e $list_file) { unlink($list_file); }
 365 open $list, ">$list_file" or die "deob_index.pl: couldn't open $list_file:$!\n";
 366 my @list_holder; # hold all package names so we can sort them before writing.
 367
 368 # record misbehaving BioPerl docs to a file
 369 my $log;    # filehandle
 370 my $logfile = $dest_dir . "/deob_index.log";
 371 open $log, ">$logfile" or die "deob_index.pl: couldn't open $logfile:$!\n";
 372
 373 # create databases
 374 my $meth_file = $dest_dir . '/methods.db';
 375 if ( -e $meth_file ) { unlink($meth_file); }    # remove for production?
 376 my $meth_db = create_db($meth_file) or die "deob_index.pl: couldn't create $meth_file: $!\n";
 377 my $pkg_file = $dest_dir . '/packages.db';
 378 if ( -e $pkg_file ) { unlink($pkg_file); }      # remove for production?
 379 my $pkg_db = create_db($pkg_file) or die "deob_index.pl: couldn't create $pkg_file: $!\n";
 380
 381 # used to make sure we're parsing in the right order
 382 my %FLAG;
 383
 384 # store version string in packages.db
 385 $pkg_db->{'__BioPerl_Version'} = $opt_s ? $opt_s : 'unknown';
 386
 387 # keep stats on our indexing
 388 my %stats = (
 389               'files'    => 0,
 390               'pkg_name' => 0,
 391               'desc'     => 0,
 392               'synopsis' => 0,
 393               'methods'  => 0,
 394             );
 395
 396 # wanted points to the subroutine which is run on each found file
 397 # ( in this program, that subroutine is &extract_pod )
 398 # no_chdir prevents find from chdir'ing into each subsequent directory
 399 my %FIND_OPTIONS = ( wanted => \&extract_pod);#, no_chdir => 1 );
 400
 401 # This is the important line - Find::File actually doing the
 402 # traversal of the directory tree.
 403 find( \%FIND_OPTIONS, $source_dir );
 404
 405 # sort and write out package list
 406 foreach my $sorted_pkg (sort @list_holder) {
 407     print $list $sorted_pkg, "\n";
 408 }
 409
 410 # store user-supplied BioPerl version number
 411
 412 # output stats
 413 print STDOUT "\nThis indexing run found:\n";
 414 print $log "\nThis indexing run found:\n";
 415 foreach my $stat ( 'files', 'pkg_name', 'desc', 'synopsis', 'methods' ) {
 416     printf STDOUT "%5d %s\n", $stats{$stat}, $stat;
 417     printf $log "%5d %s\n", $stats{$stat}, $stat;
 418 }
 419
 420 # close files and DBs
 421 untie $meth_db or die "deob_index.pl: couldn't close $meth_file: $!\n";
 422 untie $pkg_db  or die "deob_index.pl: couldn't close $pkg_file: $!\n";
 423 close $list    or die "deob_index.pl: couldn't close $list: $!\n";
 424 close $log     or die "deob_index.pl: couldn't close $log: $!\n";
 425 my $mode = 0666;
 426 chmod($mode, $pkg_file, $meth_file, $list_file);
 427
 428 ### Parsing subroutines ###
 429 sub extract_pod {
 430     my ($file) = $_;
 431     my $long_file = $File::Find::name;
 432
 433     # skip if it's on our exclude list
 434     foreach my $one (keys %exclude) {
 435         if ($File::Find::name =~ /$one$/) {
 436             print STDERR "Excluding $file\n";
 437             print $log "Excluding $file\n";
 438             return;
 439         }
 440     }
 441
 442     # skip unless it's a perl file that exists
 443     return unless ( $file =~ /\.PLS$/ ) or ( $file =~ /\.p[ml]$/ );
 444     return unless -e $file;
 445
 446     $stats{'files'}++;
 447
 448     open my $fh, '<', $File::Find::name or die "deob_index.pl: could not read file '$file': $!\n";
 449
 450     # these have to be done in order
 451     my ( $pkg_name, $short_desc ) = get_pkg_name($fh);
 452     my ($synopsis, $desc);
 453     LOOP: while (my ($type, $section) = get_generic($fh) ) {
 454         if    ($type eq 'synopsis')    { $synopsis = $section; }
 455         elsif ($type eq 'description') { $desc     = $section; }
 456         else { last LOOP; }
 457     }
 458
 459     my $constructors = get_constructors($fh);
 460     my $methods      = get_methods($fh);
 461
 462     # record package name to our package list file
 463     if ($pkg_name) { push @list_holder, $pkg_name; }
 464
 465     # store valid package data here
 466     my @pkg_data;
 467
 468     # error reporting
 469     if ($pkg_name) {
 470         $stats{'pkg_name'}++;
 471         print $pkg_name, "\n" if $DEBUG == 1;
 472     }
 473     else {
 474         print $log " PKG_NAME: $long_file\n";
 475     }
 476     if ($short_desc) {
 477         $stats{'short_desc'}++;
 478         push @pkg_data, $short_desc;
 479         print $short_desc, "\n" if $DEBUG == 1;
 480     }
 481     else {
 482                 push @pkg_data, 'no short description available'; # store something
 483         print $log "SHORT_DESC: $long_file\n";
 484     }
 485     if ($synopsis) {
 486         $stats{'synopsis'}++;
 487         print $synopsis, "\n" if $DEBUG == 1;
 488         push @pkg_data, $synopsis;
 489     }
 490     else {
 491                 push @pkg_data, 'no synopsis available'; # store something
 492         print $log " SYNOPSIS: $long_file\n";
 493     }
 494     if ($desc) {
 495         $stats{'desc'}++;
 496         print $desc, "\n" if $DEBUG == 1;
 497         push @pkg_data, $desc;
 498     }
 499     else {
 500                 push @pkg_data, 'no description available'; # store something
 501         print $log "     DESC: $long_file\n";
 502     }
 503     if ($methods) {
 504         my $method_count = scalar keys %$methods;
 505         print "**** Found $method_count methods in $pkg_name\n"
 506             if $DEBUG == 2;
 507         foreach my $method ( keys %$methods ) {
 508             $stats{'methods'}++;
 509             print $method, "\n//\n" if $DEBUG == 2;
 510         }
 511     }
 512     else {
 513         print $log "  METHODS: $long_file\n";
 514     }
 515
 516     # prepare data for databases
 517     my $pkg_record   = pkg_prep(@pkg_data);
 518     my $meth_records = meth_prep( $pkg_name, $methods );
 519
 520     # load data in databases
 521     if ($pkg_name) {
 522         pkg_load( $pkg_db, $pkg_name, $pkg_record );
 523         meth_load( $meth_db, $meth_records );
 524     }
 525 }
 526
 527 sub slurp_until_next {
 528     my ($fh) = @_;
 529
 530     my @lines;
 531     my $prev_line = $_;
 532
 533
 534     LINE: while (<$fh>) {
 535         next LINE if $_ eq $prev_line;
 536
 537         # if it's a POD directive
 538         if (/^\=/) {
 539
 540             # reset our position to the beginning of the line
 541             # so it is seen as part of the next POD section
 542             seek $fh, -length($_), 1;
 543             last LINE;
 544         }
 545         else {
 546             push @lines, $_;
 547         }
 548     }
 549     return join q{}, @lines;
 550 }
 551
 552 sub get_pkg_name {
 553     my ($fh) = @_;
 554
 555     my $pkg_name;
 556     my $short_desc;
 557
 558     LINE: while (<$fh>) {
 559         chomp;
 560         print "**", $_, "\n" if $DEBUG == 2;
 561
 562         # grab package name
 563         # - "short desc" is the one-line description of the package
 564         if ( $_ =~ /^\=head1\s+NAME/ ) {
 565             <$fh>;
 566             my $next_line = <$fh>;
 567             ( $pkg_name, $short_desc ) = split /\s+/, $next_line, 2;
 568                         $short_desc .= slurp_until_next($fh);
 569
 570             # strip off leading dash
 571             $short_desc =~ s/^(\-)+\s+//;
 572
 573                         # strip off trailing spaces
 574                         $short_desc =~ s/\s+$//;
 575
 576                         # strip any newlines
 577                         $short_desc =~ s/\n/ /;
 578
 579             print $pkg_name, "\n" if $DEBUG == 1;
 580
 581             last LINE;
 582         }
 583
 584         # we've hit a =head1, but it's the wrong one
 585         elsif ( $_ =~ /^\=head1\s+/ ) {
 586             last LINE;
 587         }
 588     }
 589     if ($pkg_name) {
 590         $FLAG{'pkg_name'} = 1;
 591         return $pkg_name, $short_desc;
 592     }
 593 }
 594
 595 sub get_generic {
 596     my ($fh) = @_;
 597
 598     my $section;
 599
 600     LINE: while (<$fh>) {
 601         chomp;
 602         print "**", $_, "\n" if $DEBUG == 2;
 603
 604         if ( $_ =~ /^\=head1\s+SYNOPSIS/ ) {
 605             $section = slurp_until_next($fh);
 606             if ($section) {
 607                 $FLAG{'synopsis'} = 1;
 608                 return ('synopsis', $section);
 609             }
 610             else { last LINE; }
 611         }
 612         elsif ( $_ =~ /^\=head1\s+DESCRIPTION/ ) {
 613             $section = slurp_until_next($fh);
 614             if ($section) {
 615                 $FLAG{'description'} = 1;
 616                 return ('description', $section);
 617             }
 618             else { last LINE; }
 619         }
 620
 621         # if we hit the APPENDIX, time to stop
 622         elsif (/^\=head1\s+APPENDIX/) {
 623
 624             # reset our position to the beginning of the line
 625             # so it is seen by the next parser
 626             seek $fh, -length($_)*2, 1;
 627             last LINE;
 628         }
 629     }
 630 }
 631
 632 sub get_synopsis {
 633     my ($fh) = @_;
 634
 635     my $synopsis;
 636
 637     LINE: while (<$fh>) {
 638         chomp;
 639         print "**", $_, "\n" if $DEBUG == 2;
 640
 641         if ( $_ =~ /^\=head1\s+SYNOPSIS/ ) {
 642             $synopsis = slurp_until_next($fh);
 643             last LINE;
 644         }
 645
 646         # we've hit a =head1, but it's the wrong one
 647         elsif ( $_ =~ /^\=head1\s+/ ) {
 648             last LINE;
 649         }
 650     }
 651     if ($synopsis) {
 652         $FLAG{'synopsis'} = 1;
 653         return $synopsis;
 654     }
 655 }
 656
 657 sub get_desc {
 658     my ($fh) = @_;
 659
 660     my $desc;
 661
 662     LINE: while (<$fh>) {
 663         chomp;
 664         print "**", $_, "\n" if $DEBUG == 2;
 665
 666         if ($_ =~ /^=head1\s+VERSION/ ) {
 667             slurp_until_next($fh);
 668         }
 669
 670         if ( $_ =~ /^\=head1\s+DESCRIPTION/ ) {
 671             $desc = slurp_until_next($fh);
 672             last LINE;
 673         }
 674
 675         # we've hit a =head1, but it's the wrong one
 676         elsif ( $_ =~ /^\=head1\s+/ ) {
 677             last LINE;
 678         }
 679     }
 680     if ($desc) {
 681         $FLAG{'description'} = 1;
 682         return $desc;
 683     }
 684 }
 685
 686 sub get_constructors {
 687
 688     # not implemented
 689
 690     # should return a hashref
 691 }
 692
 693 sub get_methods {
 694     my ($fh) = @_;
 695     my %methods;
 696
 697     # we shouldn't see any methods until after the APPENDIX
 698     my $seen_appendix = 0;
 699
 700     # there's an '=cut' after we enter the APPENDIX
 701     # we know the method '=head2' tags will come after it
 702     my $seen_first_cut = 0;
 703
 704     LINE: while (<$fh>) {
 705         if ( $_ =~ /^\=head1\s+APPENDIX/ ) {
 706             $seen_appendix = 1;
 707         }
 708
 709         # this should be the first tag after the APPENDIX
 710         if ( $seen_appendix && $_ =~ /^\=cut/ ) {
 711             $seen_first_cut = 1;
 712         }
 713
 714         # this should be a method
 715         if ( $seen_first_cut && $_ =~ /^\=head2\s+(\S+)/ ) {
 716             $methods{$1} = slurp_until_next($fh);
 717         }
 718     }
 719
 720     # returns a hashref
 721     return \%methods;
 722 }
 723
 724 ### Database subroutines ###
 725 sub create_db {
 726     my ($filename) = @_;
 727
 728     my %hash;
 729     my $hashref = \%hash;
 730
 731     tie %hash, "DB_File", $filename
 732         or die "ERROR: couldn't open $filename:$!\n";
 733
 734     return $hashref;
 735 }
 736
 737 sub pkg_prep {
 738
 739     # unique string on which to split our sub-records
 740     my $rec_sep = 'DaVe-ReC-sEp';
 741
 742     my $record = join $rec_sep, @_;
 743
 744     return $record;
 745 }
 746
 747 sub meth_prep {
 748     my ( $pkg_name, $methods ) = @_;
 749     my %records;
 750
 751     foreach my $entry ( keys %$methods ) {
 752         my $key = $pkg_name . '::' . $entry;
 753         my $record;    # what will be stored in the db
 754         my $rec_sep = 'DaVe-ReC-sEp';
 755
 756         # if the method conforms to the BioPerl doc spec,
 757         # we will split it into constituent pieces before storing
 758         # it in the db. If not, we store the whole thing as one lump.
 759
 760         my $last;      # for grabbing multi-line entries
 761         my %fields = (
 762             'title'    => '',
 763             'usage'    => '',
 764             'function' => '',
 765             'example'  => '',
 766             'returns'  => '',
 767             'args'     => '',
 768         );
 769
 770
 771         my @lines = split "\n", $methods->{$entry};
 772         foreach my $line (@lines) {
 773             if ( $line =~ /^\s+Title\s+:(.*)/ ) {
 774                 next if $1 =~ /^\s+$/;
 775                 $fields{'title'} = $1;
 776                 $last = \$fields{'title'};
 777             }
 778             elsif ( $line =~ /^\s+Usage\s+:(.*)/ ) {
 779                 next if $1 =~ /^\s+$/;
 780                 $fields{'usage'} = $1;
 781                 $last = \$fields{'usage'};
 782             }
 783             elsif ( $line =~ /^\s+Function\s?:(.*)/ ) {
 784                 next if $1 =~ /^\s+$/;
 785                 $fields{'function'} = $1;
 786                 $last = \$fields{'function'};
 787             }
 788             elsif ( $line =~ /^\s+Example\s+:(.*)/ ) {
 789                 next if $1 =~ /^\s+$/;
 790                 $fields{'example'} = $1;
 791                 $last = \$fields{'example'};
 792             }
 793             elsif ( $line =~ /^\s+Returns\s+:(.*)/ ) {
 794                 next if $1 =~ /^\s+$/;
 795                 $fields{'returns'} = $1;
 796                 $last = \$fields{'returns'};
 797             }
 798             elsif ( $line =~ /^\s+Args\s+:(.*)/ ) {
 799                 next if $1 =~ /^\s+$/;
 800                 $fields{'args'} = $1;
 801                 $last = \$fields{'args'};
 802             }
 803
 804             # grab multi-line entries
 805             elsif ( $line =~ /^\s{8,}(\s.*)/ ) { $$last .= $1; }
 806         }
 807
 808         # debugging
 809         if ( $DEBUG == 2 ) {
 810             print "** $entry **\n";
 811             foreach my $field ( keys %fields ) {
 812                 print STDOUT $field, "\t", $fields{$field}, "\n";
 813             }
 814             print "\n";
 815         }
 816
 817         # if any of our fields have a value, store subrecords
 818         my $filled_fields = grep /\w+/, values %fields;
 819         print STDERR $key, "\t", $filled_fields, "\n" if $DEBUG == 3;
 820         if ( $filled_fields > 0 ) {
 821             if ( !$fields{'title'} ) { print $log '    TITLE: ', $key, "\n"; }
 822             if ( !$fields{'usage'} ) { print $log '    USAGE: ', $key, "\n"; }
 823             if ( !$fields{'function'} ) {
 824                 print $log ' FUNCTION: ', $key, "\n";
 825             }
 826             if ( !$fields{'example'} ) {
 827                 print $log '  EXAMPLE: ', $key, "\n";
 828             }
 829             if ( !$fields{'returns'} ) {
 830                 print $log '  RETURNS: ', $key, "\n";
 831             }
 832             if ( !$fields{'args'} ) { print $log '     ARGS: ', $key, "\n"; }
 833
 834             # create the records to be stored in the db
 835             foreach my $field ( keys %fields ) {
 836                 my $subrecord
 837                     = $rec_sep . '-' . $field . '|' . $fields{$field};
 838                 $record .= $subrecord;
 839             }
 840
 841             # store the records
 842             $records{$key} = $record;
 843         }
 844
 845         # if no subfields, store whatever docs we do have for the method
 846         else {
 847             $record = $methods->{$entry};
 848             print $log ' FREEFORM: ', $key, "\n";
 849         }
 850     }
 851     return \%records;
 852 }
 853
 854 sub pkg_load {
 855     my ( $pkg_db, $pkg_name, $record ) = @_;
 856
 857     if ( exists $pkg_db->{$pkg_name} ) {
 858         print $log '  PKG_DUP: ', $pkg_name, "\n";
 859         warn(
 860             "$pkg_name already exists in package db!\n",
 861             "existing record:\n$pkg_db->{$pkg_name}\n",
 862             "attempted to add:\n$record\n",
 863             )
 864             if $DEBUG == 2;
 865     }
 866     else {
 867         $pkg_db->{$pkg_name} = $record;
 868     }
 869 }
 870
 871 sub meth_load {
 872     my ( $meth_db, $records ) = @_;
 873
 874     foreach my $method ( keys %$records ) {
 875         if ( exists( $meth_db->{$method} ) ) {
 876             print $log ' METH_DUP: ', $method, "\n";
 877             warn(
 878                 "$method already exists in method db!\n",
 879                 "existing record:\n$meth_db->{$method}\n",
 880                 "attempted to add:\n$records->{$method}\n",
 881                 )
 882                 if $DEBUG == 2;
 883         }
 884         else {
 885             $meth_db->{$method} = $records->{$method};
 886         }
 887     }
 888 }
 889
 890 __END__