1 .TH AGG 1 2011-04-04 agg "the news aggregator"
13 is a news aggregator following the UNIX philosophy. It
14 simply reads a news feed (currently RSS only) from stdin
15 and creates or updates a filesystem representation of
20 creates or updates the following directory structure in the
21 current working directory:
32 uses the mtime of files and directories to represent dates
35 If the feed directory does not exist,
37 will create it and store all items in the feed there.
38 The mtimes of the files will be set to the corresponding
39 date of publication; the mtime of the feed directory will
40 be set to the date of publication of the most recent item.
42 If the feed directory already exists (e.g. on subsequent
45 checks the mtime of the feed directory and only fetches
46 items with a newer date of publication, again setting the
47 mtimes for the items fetched in this run. The mtime of the
48 feed directory will be set to the date of publication of
49 the most recent item that was fetched in this run.
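.P
Since the dates are plain mtimes, the usual tools can be used to
inspect the result ("somefeed" below is just a made-up feed
directory name):
.nf
    ls -lt somefeed                  # items, newest first
    find somefeed -type f -mtime -1  # items published within the last day
.fi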
51 If an item does not have a publication date, it is set to
54 By manually changing the mtime of the feed directory, you
55 can make agg either skip unfetched items or refetch old
58 To avoid unintentionally changing the mtime and thus
59 skipping items, you can use a tiny wrapper called
65 Writing file names that are specified in the feed?
69 removes all slashes from file and directory names before
70 they are written, so everything ends up where it belongs.
71 You should run it in a dedicated directory, though.
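.P
For example ("~/news" and "feed.xml" are arbitrary names; feed.xml
stands for a previously downloaded feed):
.nf
    mkdir -p ~/news && cd ~/news
    agg < feed.xml
.fi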
73 But a malicious feed could use up all space/inodes.
75 Depends on your operating system (configuration). It's not
76 the job of a news aggregator to enforce quotas.
78 Why no download mechanism?
80 Because it's a news aggregator, not a
81 download-and-news-aggregation-program.
83 Why no user interface?
85 Because it's a news aggregator, not a
86 download-and-news-aggregation-and-news-reader-program.
87 The file system hierarchy created is pretty much usable
88 with the default UNIX tools. Feel free to write your own
91 No way! This program writes HTML!
93 Yes, I like to be able to subscribe to xkcd and similar,
94 even if it means I have to launch a graphical browser once
95 in a while. Anyway, there's
98 cat "$item" | elinks -dump
101 But do I have to download the feed by hand?
107 But this wastes traffic when there are no new items!
110 quits when it assumes that there are no new items (see
112 ). How much data is read unnecessarily depends on the
113 ratio of processing speed to download speed.
116 wget $URL -O - --limit-rate=10K | agg
119 Okay. But it only works on a single feed!
124 for feed in `cat feeds`; do
125 (wget $feed -qO - --limit-rate=10K | agg) &
129 How to fetch only new items from feeds that don't use
134 itself, since it would require a second level of storage that
135 contains (hashes of) everything the
137 directory contained -- including items you
138 explicitly deleted. You can easily build such functionality
139 on top using a few lines of shell code.
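.P
A possible sketch, not part of agg: the feed directory name
"somefeed", the state files "seen" and "kept", $URL and the use of
sha1sum are assumptions, and unchanged items are assumed to be
written identically when refetched. The idea is to reset the mtime
of the feed directory so that agg reconsiders every item, remember
a hash of everything ever fetched, and remove refetched items that
had already been deleted by hand:
.nf
    ls somefeed > kept 2>/dev/null   # items we still have
    touch -t 197001010000 somefeed   # make agg reconsider every item
    wget "$URL" -qO - | agg
    for item in somefeed/*; do
        [ -f "$item" ] || continue
        name=`basename "$item"`
        sum=`sha1sum < "$item" | cut -d' ' -f1`
        if grep -q "$sum" seen 2>/dev/null; then
            # fetched in an earlier run: keep it only if it was
            # still around before this run, i.e. not deleted by hand
            grep -qxF "$name" kept || rm "$item"
        else
            echo "$sum" >> seen      # genuinely new item, remember it
        fi
    done
.fi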
147 Uses fixed-size buffers to simplify the code. This may lead
148 to cut-off news texts or links. The chances of this happening
149 are low and the consequences minor.
151 Assumes items are ordered by descending publication date
152 (newest items on top). Processing stops as soon as an
153 old item is encountered.
155 Assumes items only change if their publication date
156 changes. Again, for simplicity.
158 Creates a "sub-feed" directory if the channel contains an
159 element that has a title tag but is not an item.
161 Supports only dates whose time zone is given as a numeric
162 offset (e.g. +0200), not as an abbreviation (e.g. CEST).
164 Item titles may conflict, especially if they were too long
165 and have been truncated.
167 Items will always be (over-)written in the order they are
170 HTML output is formatted badly.
172 The default mtime for items without pubDate should be now().
180 http://programmers.at/work/on/agg
184 git://repo.or.cz/agg.git
188 Andreas Waidler <arandes@programmers.at>