3 # dbfetch style caching proxy for GenBank
7 use HTTP::Request::Common;
11 use vars qw(%GOT $BUFFER %MAPPING $CACHE);
13 use constant CACHE_LOCATION => '/usr/tmp/dbfetch_cache';
14 use constant MAX_SIZE => 100_000_000; # 100 megs, roughly
15 use constant CACHE_DEPTH => 4;
16 use constant EXPIRATION => "1 week";
17 use constant PURGE => "1 hour";
19 %MAPPING = (genbank => {db=>'nucleotide',
21 genpep => {db=>'protein',
23 # we're doing everything in callbacks, so initialize globals.
27 print header('text/plain');
29 param() or print_usage();
32 my $style = param('style');
33 my $format = param('format');
35 my @ids = split /\s+/,$id;
37 $format = 'genbank' if $format eq 'default'; # 'default' means the database's native format
39 $MAPPING{$db} or error(1=>"Unknown database [$db]");
40 $style eq 'raw' or error(2=>"Unknown style [$style]");
41 $format eq 'genbank' or error(3=>"Format [$format] not known for database [$db]");
43 $CACHE = Cache::FileCache->new({cache_root => CACHE_LOCATION,
44 default_expires_in => EXPIRATION,
45 cache_depth => CACHE_DEPTH,
46 namespace => 'dbfetch',
47 auto_purge_interval => PURGE});
49 # handle cached entries
51 if (my $obj = $CACHE->get($_)) {
57 # handle the remainder
58 @ids = grep {!$GOT{$_}} @ids;
60 my $request = POST('http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi',
61 [rettype => $MAPPING{$db}{rettype},
62 db => $MAPPING{$db}{db},
70 my $ua = LWP::UserAgent->new;
71 my $response = $ua->request($request,\&callback);
73 if ($response->is_error) {
74 my $status = $response->status_line;
75 error(6 => "HTTP error from GenBank [$status]");
79 my @missing_ids = grep {!$GOT{$_}} @ids;
80 foreach (@missing_ids) {
81 error(4=>"ID [$_] not found in database [$db]",1);
84 # my $response = $response->content;
89 my ($locus) = $record =~ /^LOCUS\s+(\S+)/m;
90 my ($accession) = $record =~ /^ACCESSION\s+(\S+)/m;
91 my ($version,$gi) = $record =~ /^VERSION\s+(\S+)\s+GI:(\d+)/m;
92 foreach ($locus,$accession,$version,$gi) {
94 $CACHE->set($_,$record);
102 while (($index = index($BUFFER,"//\n\n",$index))>=0) {
103 my $record = substr($BUFFER,0,$index);
104 $index += length("//\n\n");
105 substr($BUFFER,0,$index) = ''; $index = 0; # buffer was truncated, so search again from its start
106 process_record($record);
114 This script is intended to be used non-interactively.
116 Brief summary of arguments:
119 This interface does not specify what happens when biofetch is called
120 in interactive context. The implementations can return the entries
121 decorated with HTML tags and hypertext links.
123 A URL for biofetch consists of four sections:
127 2. host www.ebi.ac.uk
128 3. path to program /Tools/dbfetch/dbfetch
129 4. query string ?style=raw;format=embl;db=embl;id=J00231
134 The query string options are separated from the base URL (protocol +
135 host + path) by a question mark (?) and from each other by a semicolon
136 ';' (or by an ampersand '&'; see the CGI GET documents at
137 http://www.w3.org/CGI/). The order of options is not critical. It is
138 recommended that the ID be the last item.
140 Input for options should be case insensitive.
146 Descr : database name
148 Usage : db=genpep | db=genbank
151 Currently this server accepts "genbank" and "genpep".
156 Descr : +/- HTML tags
158 Usage : style=raw | style=html
159 Arg : enum (raw|html)
161 In non-interactive context, always give "style=raw". This uses
162 "Content-Type: text/plain". If other content types are needed (XML),
163 this part of the specification can be extended to accommodate them.
165 This server only accepts "raw".
171 Descr : format of the database entries returned
173 Usage : format=genbank
176 Format defaults to the distribution format of the database (embl for
177 the EMBL database). If another supported format is wanted, this option
178 must be given (e.g. formats for EMBL: fasta, bsml, agave).
180 This server only accepts "genbank" format.
185 Descr : unique database identifier(s)
187 Usage : id=J00231 | id=J00231+BUM
190 The ID option should be able to process all UIDs in a database. It
191 should not be necessary to know if the UID is an ID, accession number
192 or accession.version.
194 The number of entry UIDs allowed is implementation specific. If the
195 limit is exceeded, the program reports an error. The UIDs should
196 be separated by spaces (use '+' in a GET method string).
201 The following standardized one line messages should be printed out in
204 ERROR 1 Unknown database [$db].
205 ERROR 2 Unknown style [$style].
206 ERROR 3 Format [$format] not known for database [$db].
207 ERROR 4 ID [$id] not found in database [$db].
208 ERROR 5 Too many IDs [$count]. Max [$MAXIDS] allowed.
217 my ($code,$message,$noexit) = @_;
218 print "ERROR $code $message\n";
219 exit 0 unless $noexit;
226 bp_biofetch_genbank_proxy.pl - Caching BioFetch-compatible web proxy for GenBank
230 Install in the cgi-bin directory of a web server. Stand back.
234 This CGI script acts as the server side of the BioFetch protocol as
235 described in http://obda.open-bio.org/Specs/. It provides two
236 database access services, one for data source "genbank" (nucleotide
237 entries) and the other for data source "genpep" (protein entries).
239 This script works by forwarding its requests to NCBI's eutils script,
240 which lives at http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi.
241 It then reformats the output according to the BioFetch format so the
242 sequences can be processed and returned by the Bio::DB::BioFetch
243 module. Returned entries are temporarily cached on the Web server's
244 file system, allowing frequently-accessed entries to be retrieved
245 without another round trip to NCBI.
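
The caching step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's own code: C<split_records> and C<cache_keys> are hypothetical helper names, a plain hash stands in for Cache::FileCache, the sample record is made up and heavily truncated, and the GI number on the VERSION line is treated as optional.

```perl
use strict;
use warnings;

# Hypothetical helper: pull complete records out of a growing buffer.
# Records are assumed to end with "//\n\n"; an incomplete trailing
# record is left in the buffer for the next chunk of data.
sub split_records {
    my ($buffer_ref) = @_;
    my $terminator = "//\n\n";
    my @records;
    # Always search from the start, because the buffer shrinks each time.
    while ((my $end = index($$buffer_ref, $terminator)) >= 0) {
        # Keep the closing "//" line, drop the blank separator line.
        push @records, substr($$buffer_ref, 0, $end + length("//\n"));
        substr($$buffer_ref, 0, $end + length($terminator)) = '';
    }
    return @records;
}

# Hypothetical helper: every identifier under which a record may be
# requested later (locus name, accession, accession.version, GI).
sub cache_keys {
    my ($record) = @_;
    my ($locus)     = $record =~ /^LOCUS\s+(\S+)/m;
    my ($accession) = $record =~ /^ACCESSION\s+(\S+)/m;
    # Assumption: the GI field may be absent, so it is optional here.
    my ($version, $gi) = $record =~ /^VERSION\s+(\S+)(?:\s+GI:(\d+))?/m;
    my %seen;
    return grep { defined($_) && !$seen{$_}++ }
           ($locus, $accession, $version, $gi);
}

# Demonstration with made-up data: one complete record followed by the
# beginning of a second record that has not fully arrived yet.
my %cache;    # stand-in for the real cache object
my $buffer = "LOCUS       EXAMPLE1 10 bp\n"
           . "ACCESSION   X00000\n"
           . "VERSION     X00000.1  GI:12345\n"
           . "//\n\n"
           . "LOCUS       EXAMPLE2";

for my $record (split_records(\$buffer)) {
    $cache{$_} = $record for cache_keys($record);
}
print join(',', sort keys %cache), "\n";
```

Storing the same record under each of its identifiers is what lets a later request by locus name, accession, or accession.version hit the cache without knowing which kind of identifier was used the first time.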
249 You must have the following installed in order to run this script:
252 2) the perl modules LWP and Cache::FileCache
253 3) a web server (Apache recommended)
255 To install this script, copy it into the web server's cgi-bin
256 directory. You might want to shorten its name; "dbfetch" is
259 There are several constants located at the top of the script that you
260 may want to adjust. These are:
264 This is the location on the filesystem where the cached files will be
265 located. The default is /usr/tmp/dbfetch_cache.
269 This is the maximum size that the cache can grow to. When the cache
270 exceeds this size older entries will be deleted automatically. The
271 default setting is 100,000,000 bytes (100 MB).
275 Entries that were cached longer ago than this length of time will be
276 removed from the cache. The default is 1 week.
280 This constant specifies how often the cache will be purged of expired
281 entries. The default is 1 hour.
285 To see if this script is performing as expected, you may test it with
288 use Bio::DB::BioFetch;
289 my $db = Bio::DB::BioFetch->new(-baseaddress=>'http://localhost/cgi-bin/dbfetch',
292 my $seq = $db->get_Seq_by_id('DDU63596');
293 print $seq->seq,"\n";
295 This should print out a DNA sequence.
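
To test the server without Bio::DB::BioFetch, it can help to build the query URL by hand and to recognise the one-line error messages in the reply. The following is a minimal sketch using only core Perl. C<biofetch_url> and C<parse_error> are hypothetical helper names, and the sketch assumes the IDs are plain tokens that need no URL escaping.

```perl
use strict;
use warnings;

# Hypothetical helper: assemble a BioFetch query URL from its options.
# Options are separated by ';', multiple IDs are joined with '+', and
# the id option is placed last.
sub biofetch_url {
    my ($base, %opt) = @_;
    my @ids   = @{ $opt{id} || [] };
    my @pairs = map  { "$_=$opt{$_}" }
                grep { defined $opt{$_} } qw(style format db);
    push @pairs, 'id=' . join('+', @ids);
    return $base . '?' . join(';', @pairs);
}

# Hypothetical helper: recognise an "ERROR <code> <message>" line.
# Returns (code, message) on a match and an empty list otherwise.
sub parse_error {
    my ($line) = @_;
    return $line =~ /^ERROR\s+(\d+)\s+(.*?)\s*$/ ? ($1, $2) : ();
}

# Build the URL for the entry used in the test above.
my $url = biofetch_url('http://localhost/cgi-bin/dbfetch',
                       style  => 'raw',
                       format => 'genbank',
                       db     => 'genbank',
                       id     => ['DDU63596']);
print "$url\n";

# Check a sample reply line for an error report.
my ($code, $message) = parse_error("ERROR 1 Unknown database [embl]\n");
print "server reported error $code: $message\n" if defined $code;
```

The printed URL can be fetched with any HTTP client. A reply that begins with C<LOCUS> indicates a record was returned, while a line beginning with C<ERROR> carries one of the numbered messages listed in the usage text.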
299 L<Bio::DB::BioFetch>, L<Bio::DB::Registry>
303 Lincoln Stein, E<lt>lstein-at-cshl.orgE<gt>
305 Copyright (c) 2003 Cold Spring Harbor Laboratory
307 This library is free software; you can redistribute it and/or modify
308 it under the same terms as Perl itself. See DISCLAIMER.txt for
309 disclaimers of warranty.