4 # part of the Deobfuscator package
5 # by Laura Kavanaugh and Dave Messina
7 # cared for by Dave Messina <dave-pause@davemessina.net>
9 # POD documentation - main docs before the code
13 deob_index.pl - extracts BioPerl documentation and indexes it in a database for easy retrieval
17 This document describes deob_index.pl version 0.0.3
22 deob_index.pl <path to BioPerl lib> <output path>
26 =item <path to BioPerl lib>
28 a directory path pointing to the root of the BioPerl lib tree. e.g. /export/share/lib/perl5/site_perl/5.8.7/Bio/
32 where you would like deob_index.pl to put its output files.
39 deob_index.pl goes through the entire BioPerl library tree looking for
40 .pm and .pl files. For each one it finds, it tries to extract module-level
41 POD documentation (e.g. SYNOPSIS, DESCRIPTION) and store it in a BerkeleyDB.
42 It also tries to extract documentation for each method in the module and
43 store that in a separate BerkeleyDB.
45 Specific parts of the documentation for a module or method may be retrieved
46 individually using the functions available in Deobfuscator.pm. See that module
49 While going through and trying to parse each module, deob_index.pl also
50 reports what pieces of the documentation it can't find. For example, if
51 a method's documentation doesn't describe the data type it returns, this
52 script logs that information to a file. This type of automated documentation-
53 checking could be used to standardize and improve the documentation in
56 deob_index.pl creates four files:
60 =item C<< package_list.txt >>
62 A plaintext file listing each package found in the BioPerl directory that was
63 searched. Packages are listed by their module names, such as 'Bio::SeqIO'.
64 This file is used by L<deob_interface.cgi>.
66 =item C<< packages.db >>
68 A Berkeley DB, which stores package-level documentation, such as
69 the synopsis and the description. Each key is a package name,
70 e.g. "Bio::SeqIO", and each value string is composed of the
71 individual pieces of the documentation kept separate by
72 unique string record separators. The individual pieces of
73 documentation are pulled out of the string using the
74 get_pkg_docs function in Deobfuscator.pm. See that package
77 =item C<< methods.db >>
79 Like packages.db, methods.db is also a Berkeley DB, except it
80 stores various pieces of information about individual methods
81 available to a class. Each method might have documentation
82 about its usage, its arguments, its return values, an example,
83 and a description of its function.
85 Each key is the fully-qualified method name, e.g.
86 "Bio::SeqIO::next_seq". Each value is a string containing all
87 of the pieces of documentation concatenated together and
88 separated by unique strings serving as record separators. The
89 extraction of the actual documentation in these strings is
90 handled by the get_method_docs subroutine in the Deobfuscator.pm
91 module. See that package for details.
93 Not all methods will have all of these types of documentation,
94 and some methods will not have the different pieces of
95 information clearly labeled and separated. For the latter type,
96 deob_index.pl will try to store whatever free-form
97 documentation that does exist, and the get_method_docs function
98 in Deobfuscator.pm, if called without arguments, will return
101 =item C<< deob_index.log >>
103 This file contains detailed information about errors
104 encountered while trying to extract documentation during
105 the indexing process.
107 Each line in deob_index.log is a key-value pair describing
108 a single parsing error.
115 These are the parsing error codes reported in 'deob_index.log'.
117 =head2 Package errors
121 =item C<< PKG_NAME >>
123 couldn't find the name of the package
125 =item C<< SYNOPSIS >>
127 couldn't find the synopsis
131 couldn't find the description
135 couldn't find any methods
139 This package name occurs more than once
147 =item C<< FUNCTION >>
149 couldn't find the function description
153 couldn't find the example
157 couldn't find the method's arguments
161 couldn't find the usage statement
165 couldn't find the return values
167 =item C<< FREEFORM >>
169 This method's documentation doesn't conform to the BioPerl standard of having
170 clearly-labeled fields for title, function, example, args, usage, and returns.
172 =item C<< METH_DUP >>
174 This method name occurs more than once
179 =head1 CONFIGURATION AND ENVIRONMENT
181 This software requires:
185 =item A working installation of the Berkeley DB
187 The Berkeley DB comes standard with most UNIX distributions, so you may
188 already have it installed. See L<http://www.sleepycat.com> for more information.
192 deob_index.pl recursively navigates a directory of BioPerl modules. Note
193 that the BioPerl module directory need not be "installed"; any old location
194 will do. See L<http://www.bioperl.org> for the latest version.
201 L<version>, L<File::Find>, L<DB_File>
204 =head1 INCOMPATIBILITIES
209 =head1 BUGS AND LIMITATIONS
211 No bugs have been reported.
213 deob_index.pl currently expects the sections of POD in a BioPerl module to
214 be in a particular order, namely: NAME, SYNOPSIS, DESCRIPTION, CONSTRUCTORS,
215 ... , APPENDIX. Those sections are expected to be marked with =head1 POD tags,
216 and the documentation for each method is expected to be in =head2 sections
217 in the APPENDIX. The order of SYNOPSIS and DESCRIPTION can be flipped, but
218 this behavior should not be taken as encouragement to do so.
220 Most, but not all BioPerl modules conform to this standard. Those that do not
221 will cause deob_index.pl to report them as errors. Although the consistency
222 of this standard is desirable for end-users of the documentation, this code
223 probably needs to be a little bit more flexible (patches welcome!).
225 This software has only been tested in a UNIX environment.
232 User feedback is an integral part of the evolution of this and other
233 Bioperl modules. Send your comments and suggestions preferably to one
234 of the Bioperl mailing lists. Your participation is much appreciated.
236 bioperl-l@bioperl.org - General discussion
237 http://bioperl.org/Support.html - About the mailing lists
239 =head2 Reporting Bugs
241 Report bugs to the Bioperl bug tracking system to help us keep track
242 the bugs and their resolution. Bug reports can be submitted via the
245 https://github.com/bioperl/bioperl-live/issues
250 L<Deobfuscator>, L<deob_interface.cgi>, L<deob_detail.cgi>
255 Dave Messina C<< <dave-pause@davemessina.net> >>
262 =item Laura Kavanaugh
269 =head1 ACKNOWLEDGMENTS
271 This software was developed originally at the Cold Spring Harbor Laboratory's
272 Advanced Bioinformatics Course between Oct 12-25, 2005. Many thanks to David
273 Curiel, who provided much-needed guidance and assistance on this project.
276 =head1 LICENSE AND COPYRIGHT
278 Copyright (C) 2005-6 Laura Kavanaugh and Dave Messina. All Rights Reserved.
280 This module is free software; you may redistribute it and/or modify it under the
281 same terms as Perl itself. See L<perlartistic>.
286 This software is provided "as is" without warranty of any kind.
290 use version
; $VERSION = qv
('0.0.2');
299 # GetOpt::Std-related settings
300 $Getopt::Std
::STANDARD_HELP_VERSION
= 1;
306 deob_index.pl - extracts and parses BioPerl POD
307 and stores the info in a database.
309 USAGE: deob_index.pl [-s bioperl-version] [-x exclude_file] <BioPerl lib dir> <output dir>
313 <BioPerl lib dir> is the BioPerl distribution you'd like to index
315 e.g. /export/share/lib/perl5/site_perl/5.8.7/Bio/
319 <output dir> is where the output files should be placed
322 -s user-supplied string to declare BioPerl's version
323 (which will be displayed by deob_interface.cgi)
324 -x excluded modules file (a module paths to skip; see POD for details)
327 unless ( @ARGV == 2 ) { die $usage; }
329 my ( $source_dir, $dest_dir ) = @ARGV;
331 # check source_dir for full path and repair if it's a relative path
332 unless ( File
::Spec
->file_name_is_absolute( $source_dir ) ) {
333 $source_dir = File
::Spec
->rel2abs( $source_dir ) ;
336 # check dest_dir for full path and repair if it's a relative path
337 unless ( File
::Spec
->file_name_is_absolute( $dest_dir ) ) {
338 $dest_dir = File
::Spec
->rel2abs( $dest_dir ) ;
341 # NOTE: we're allowing only one source directory, but File::Find supports
342 # passing an array of dirs.
344 # read in an optional list of modules to exclude from indexing
345 # - this is aimed at modules with external dependencies that are often not
346 # - present and thus will prevent deob_interface.cgi from loading them
347 our ($opt_s, $opt_x);
349 if (defined $opt_x) {
350 my $exclude_fh = IO
::File
->new($opt_x, "r")
351 or die "couldn't open $opt_x\n";
352 while (<$exclude_fh>) {
354 next if ( /^\#/ || /^\s*$/ ); # ignore comments and blank lines
357 print STDERR
"Found ", scalar keys %exclude, " modules to be excluded.\n";
361 # save a list of the BioPerl modules to a file
362 my $list; # filehandle
363 my $list_file = $dest_dir . "/package_list.txt";
364 if ( -e
$list_file) { unlink($list_file); }
365 open $list, ">$list_file" or die "deob_index.pl: couldn't open $list_file:$!\n";
366 my @list_holder; # hold all package names so we can sort them before writing.
368 # record misbehaving BioPerl docs to a file
369 my $log; # filehandle
370 my $logfile = $dest_dir . "/deob_index.log";
371 open $log, ">$logfile" or die "deob_index.pl: couldn't open $logfile:$!\n";
374 my $meth_file = $dest_dir . '/methods.db';
375 if ( -e
$meth_file ) { unlink($meth_file); } # remove for production?
376 my $meth_db = create_db
($meth_file) or die "deob_index.pl: couldn't create $meth_file: $!\n";
377 my $pkg_file = $dest_dir . '/packages.db';
378 if ( -e
$pkg_file ) { unlink($pkg_file); } # remove for production?
379 my $pkg_db = create_db
($pkg_file) or die "deob_index.pl: couldn't create $pkg_file: $!\n";
381 # used to make sure we're parsing in the right order
384 # store version string in packages.db
385 $pkg_db->{'__BioPerl_Version'} = $opt_s ?
$opt_s : 'unknown';
387 # keep stats on our indexing
396 # wanted points to the subroutine which is run on each found file
397 # ( in this program, that subroutine is &extract_pod )
398 # no_chdir prevents find from chdir'ing into each subsequent directory
399 my %FIND_OPTIONS = ( wanted
=> \
&extract_pod
);#, no_chdir => 1 );
401 # This is the important line - Find::File actually doing the
402 # traversal of the directory tree.
403 find
( \
%FIND_OPTIONS, $source_dir );
405 # sort and write out package list
406 foreach my $sorted_pkg (sort @list_holder) {
407 print $list $sorted_pkg, "\n";
410 # store user-supplied BioPerl version number
413 print STDOUT
"\nThis indexing run found:\n";
414 print $log "\nThis indexing run found:\n";
415 foreach my $stat ( 'files', 'pkg_name', 'desc', 'synopsis', 'methods' ) {
416 printf STDOUT
"%5d %s\n", $stats{$stat}, $stat;
417 printf $log "%5d %s\n", $stats{$stat}, $stat;
420 # close files and DBs
421 untie $meth_db or die "deob_index.pl: couldn't close $meth_file: $!\n";
422 untie $pkg_db or die "deob_index.pl: couldn't close $pkg_file: $!\n";
423 close $list or die "deob_index.pl: couldn't close $list: $!\n";
424 close $log or die "deob_index.pl: couldn't close $log: $!\n";
426 chmod($mode, $pkg_file, $meth_file, $list_file);
428 ### Parsing subroutines ###
431 my $long_file = $File::Find
::name
;
433 # skip if it's on our exclude list
434 foreach my $one (keys %exclude) {
435 if ($File::Find
::name
=~ /$one$/) {
436 print STDERR
"Excluding $file\n";
437 print $log "Excluding $file\n";
442 # skip unless it's a perl file that exists
443 return unless ( $file =~ /\.PLS$/ ) or ( $file =~ /\.p[ml]$/ );
444 return unless -e
$file;
448 open my $fh, '<', $File::Find
::name
or die "deob_index.pl: could not read file '$file': $!\n";
450 # these have to be done in order
451 my ( $pkg_name, $short_desc ) = get_pkg_name
($fh);
452 my ($synopsis, $desc);
453 LOOP
: while (my ($type, $section) = get_generic
($fh) ) {
454 if ($type eq 'synopsis') { $synopsis = $section; }
455 elsif ($type eq 'description') { $desc = $section; }
459 my $constructors = get_constructors
($fh);
460 my $methods = get_methods
($fh);
462 # record package name to our package list file
463 if ($pkg_name) { push @list_holder, $pkg_name; }
465 # store valid package data here
470 $stats{'pkg_name'}++;
471 print $pkg_name, "\n" if $DEBUG == 1;
474 print $log " PKG_NAME: $long_file\n";
477 $stats{'short_desc'}++;
478 push @pkg_data, $short_desc;
479 print $short_desc, "\n" if $DEBUG == 1;
482 push @pkg_data, 'no short description available'; # store something
483 print $log "SHORT_DESC: $long_file\n";
486 $stats{'synopsis'}++;
487 print $synopsis, "\n" if $DEBUG == 1;
488 push @pkg_data, $synopsis;
491 push @pkg_data, 'no synopsis available'; # store something
492 print $log " SYNOPSIS: $long_file\n";
496 print $desc, "\n" if $DEBUG == 1;
497 push @pkg_data, $desc;
500 push @pkg_data, 'no description available'; # store something
501 print $log " DESC: $long_file\n";
504 my $method_count = scalar keys %$methods;
505 print "**** Found $method_count methods in $pkg_name\n"
507 foreach my $method ( keys %$methods ) {
509 print $method, "\n//\n" if $DEBUG == 2;
513 print $log " METHODS: $long_file\n";
516 # prepare data for databases
517 my $pkg_record = pkg_prep
(@pkg_data);
518 my $meth_records = meth_prep
( $pkg_name, $methods );
520 # load data in databases
522 pkg_load
( $pkg_db, $pkg_name, $pkg_record );
523 meth_load
( $meth_db, $meth_records );
527 sub slurp_until_next
{
534 LINE
: while (<$fh>) {
535 next LINE
if $_ eq $prev_line;
537 # if it's a POD directive
540 # reset our position to the beginning of the line
541 # so it is seen as part of the next POD section
542 seek $fh, -length($_), 1;
549 return join q{}, @lines;
558 LINE
: while (<$fh>) {
560 print "**", $_, "\n" if $DEBUG == 2;
563 # - "short desc" is the one-line description of the package
564 if ( $_ =~ /^\=head1\s+NAME/ ) {
566 my $next_line = <$fh>;
567 ( $pkg_name, $short_desc ) = split /\s+/, $next_line, 2;
568 $short_desc .= slurp_until_next
($fh);
570 # strip off leading dash
571 $short_desc =~ s/^(\-)+\s+//;
573 # strip off trailing spaces
574 $short_desc =~ s/\s+$//;
577 $short_desc =~ s/\n/ /;
579 print $pkg_name, "\n" if $DEBUG == 1;
584 # we've hit a =head1, but it's the wrong one
585 elsif ( $_ =~ /^\=head1\s+/ ) {
590 $FLAG{'pkg_name'} = 1;
591 return $pkg_name, $short_desc;
600 LINE
: while (<$fh>) {
602 print "**", $_, "\n" if $DEBUG == 2;
604 if ( $_ =~ /^\=head1\s+SYNOPSIS/ ) {
605 $section = slurp_until_next
($fh);
607 $FLAG{'synopsis'} = 1;
608 return ('synopsis', $section);
612 elsif ( $_ =~ /^\=head1\s+DESCRIPTION/ ) {
613 $section = slurp_until_next
($fh);
615 $FLAG{'description'} = 1;
616 return ('description', $section);
621 # if we hit the APPENDIX, time to stop
622 elsif (/^\=head1\s+APPENDIX/) {
624 # reset our position to the beginning of the line
625 # so it is seen by the next parser
626 seek $fh, -length($_)*2, 1;
637 LINE
: while (<$fh>) {
639 print "**", $_, "\n" if $DEBUG == 2;
641 if ( $_ =~ /^\=head1\s+SYNOPSIS/ ) {
642 $synopsis = slurp_until_next
($fh);
646 # we've hit a =head1, but it's the wrong one
647 elsif ( $_ =~ /^\=head1\s+/ ) {
652 $FLAG{'synopsis'} = 1;
662 LINE
: while (<$fh>) {
664 print "**", $_, "\n" if $DEBUG == 2;
666 if ($_ =~ /^=head1\s+VERSION/ ) {
667 slurp_until_next
($fh);
670 if ( $_ =~ /^\=head1\s+DESCRIPTION/ ) {
671 $desc = slurp_until_next
($fh);
675 # we've hit a =head1, but it's the wrong one
676 elsif ( $_ =~ /^\=head1\s+/ ) {
681 $FLAG{'description'} = 1;
686 sub get_constructors
{
690 # should return a hashref
697 # we shouldn't see any methods until after the APPENDIX
698 my $seen_appendix = 0;
700 # there's an '=cut' after we enter the APPENDIX
701 # we know the method '=head2' tags will come after it
702 my $seen_first_cut = 0;
704 LINE
: while (<$fh>) {
705 if ( $_ =~ /^\=head1\s+APPENDIX/ ) {
709 # this should be the first tag after the APPENDIX
710 if ( $seen_appendix && $_ =~ /^\=cut/ ) {
714 # this should be a method
715 if ( $seen_first_cut && $_ =~ /^\=head2\s+(\S+)/ ) {
716 $methods{$1} = slurp_until_next
($fh);
724 ### Database subroutines ###
729 my $hashref = \
%hash;
731 tie
%hash, "DB_File", $filename
732 or die "ERROR: couldn't open $filename:$!\n";
739 # unique string on which to split our sub-records
740 my $rec_sep = 'DaVe-ReC-sEp';
742 my $record = join $rec_sep, @_;
748 my ( $pkg_name, $methods ) = @_;
751 foreach my $entry ( keys %$methods ) {
752 my $key = $pkg_name . '::' . $entry;
753 my $record; # what will be stored in the db
754 my $rec_sep = 'DaVe-ReC-sEp';
756 # if the method conforms to the BioPerl doc spec,
757 # we will split it into constituent pieces before storing
758 # it in the db. If not, we store the whole thing as one lump.
760 my $last; # for grabbing multi-line entries
771 my @lines = split "\n", $methods->{$entry};
772 foreach my $line (@lines) {
773 if ( $line =~ /^\s+Title\s+:(.*)/ ) {
774 next if $1 =~ /^\s+$/;
775 $fields{'title'} = $1;
776 $last = \
$fields{'title'};
778 elsif ( $line =~ /^\s+Usage\s+:(.*)/ ) {
779 next if $1 =~ /^\s+$/;
780 $fields{'usage'} = $1;
781 $last = \
$fields{'usage'};
783 elsif ( $line =~ /^\s+Function\s?:(.*)/ ) {
784 next if $1 =~ /^\s+$/;
785 $fields{'function'} = $1;
786 $last = \
$fields{'function'};
788 elsif ( $line =~ /^\s+Example\s+:(.*)/ ) {
789 next if $1 =~ /^\s+$/;
790 $fields{'example'} = $1;
791 $last = \
$fields{'example'};
793 elsif ( $line =~ /^\s+Returns\s+:(.*)/ ) {
794 next if $1 =~ /^\s+$/;
795 $fields{'returns'} = $1;
796 $last = \
$fields{'returns'};
798 elsif ( $line =~ /^\s+Args\s+:(.*)/ ) {
799 next if $1 =~ /^\s+$/;
800 $fields{'args'} = $1;
801 $last = \
$fields{'args'};
804 # grab multi-line entries
805 elsif ( $line =~ /^\s{8,}(\s.*)/ ) { $$last .= $1; }
810 print "** $entry **\n";
811 foreach my $field ( keys %fields ) {
812 print STDOUT
$field, "\t", $fields{$field}, "\n";
817 # if any of our fields have a value, store subrecords
818 my $filled_fields = grep /\w+/, values %fields;
819 print STDERR
$key, "\t", $filled_fields, "\n" if $DEBUG == 3;
820 if ( $filled_fields > 0 ) {
821 if ( !$fields{'title'} ) { print $log ' TITLE: ', $key, "\n"; }
822 if ( !$fields{'usage'} ) { print $log ' USAGE: ', $key, "\n"; }
823 if ( !$fields{'function'} ) {
824 print $log ' FUNCTION: ', $key, "\n";
826 if ( !$fields{'example'} ) {
827 print $log ' EXAMPLE: ', $key, "\n";
829 if ( !$fields{'returns'} ) {
830 print $log ' RETURNS: ', $key, "\n";
832 if ( !$fields{'args'} ) { print $log ' ARGS: ', $key, "\n"; }
834 # create the records to be stored in the db
835 foreach my $field ( keys %fields ) {
837 = $rec_sep . '-' . $field . '|' . $fields{$field};
838 $record .= $subrecord;
842 $records{$key} = $record;
845 # if no subfields, store whatever docs we do have for the method
847 $record = $methods->{$entry};
848 print $log ' FREEFORM: ', $key, "\n";
855 my ( $pkg_db, $pkg_name, $record ) = @_;
857 if ( exists $pkg_db->{$pkg_name} ) {
858 print $log ' PKG_DUP: ', $pkg_name, "\n";
860 "$pkg_name already exists in package db!\n",
861 "existing record:\n$pkg_db->{$pkg_name}\n",
862 "attempted to add:\n$record\n",
867 $pkg_db->{$pkg_name} = $record;
872 my ( $meth_db, $records ) = @_;
874 foreach my $method ( keys %$records ) {
875 if ( exists( $meth_db->{$method} ) ) {
876 print $log ' METH_DUP: ', $method, "\n";
878 "$method already exists in method db!\n",
879 "existing record:\n$meth_db->{$method}\n",
880 "attempted to add:\n$records->{$method}\n",
885 $meth_db->{$method} = $records->{$method};