ispell-tl/Crawler.txt


NOTES ON THE CONSTRUCTION OF THE WORD LIST
   A preliminary version of this spell checking dictionary was assembled
with the help of my web crawler "An Crúbadán":

  http://borel.slu.edu/crubadan/

BUILDING TEXT CORPORA FOR MINORITY LANGUAGES

   Initially a small collection of "seed" texts is fed to the crawler
(a few hundred words of running text have been sufficient in practice).
Queries combining words from these texts are generated and passed to
the Google API, which returns a list of documents potentially written
in the target language.  These are downloaded, processed into plain text,
and formatted.  A combination of statistical techniques bootstrapped from
the initial seed texts (and refined as more texts are added to the database)
is used to determine which documents (or sections thereof) are written in
the target language.  The crawler then recursively follows links contained
within documents that are in the target language.  When these run out,
the entire process is repeated, with a new set of Google queries generated
from the new, larger corpus.
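
As an illustration only, the query-generation step might be sketched in
Python as follows.  The text above does not say how words are combined, so
drawing random word combinations is an assumption, and build_queries and
its parameters are hypothetical names.  The call to the search API itself
is left out.

```python
import random
import re


def build_queries(seed_text, num_queries=5, words_per_query=3, rng=None):
    """Combine words drawn from the seed text into search queries."""
    rng = rng or random.Random()
    # Collect runs of letters; the pattern excludes digits and underscores
    # but keeps accented letters such as "ú" intact.
    words = [w for w in re.findall(r"[^\W\d_]+", seed_text.lower())
             if len(w) > 1]
    vocabulary = sorted(set(words))
    if len(vocabulary) < words_per_query:
        raise ValueError("seed text is too small to build queries")
    # Assumption: each query is a random combination of distinct words.
    return [" ".join(rng.sample(vocabulary, words_per_query))
            for _ in range(num_queries)]
```

Each returned string would then be submitted as one search query, and the
resulting document list handed to the downloader.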

EXTRACTING A CLEAN WORD LIST

   The raw texts downloaded using the scheme just described contain a lot
of pollution and are unsuitable for use without some further processing.
I have been able to extract reasonably accurate spell checking dictionaries
by applying a series of simple filters.
   First, statistics measuring co-occurrence with the highest-frequency words
in the target language are used to filter out sections written in other
languages or containing mostly noise (e.g. computer code or tabular data).
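
A minimal sketch of this first filter, under stated assumptions: the exact
statistic is not given above, so the code below simply measures the share
of a section's tokens that are among the most frequent words of the
corpus.  The function names, the list size n and the threshold are all
hypothetical.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters, accents included


def tokenize(text):
    return WORD_RE.findall(text.lower())


def top_words(corpus_text, n=100):
    """The n most frequent words of the trusted (seed) corpus."""
    return {w for w, _ in Counter(tokenize(corpus_text)).most_common(n)}


def coverage(section, frequent):
    """Fraction of the section's tokens that are high-frequency words."""
    tokens = tokenize(section)
    if not tokens:
        return 0.0
    return sum(t in frequent for t in tokens) / len(tokens)


def keep_sections(sections, frequent, threshold=0.2):
    # Assumption: sections below the threshold are foreign text or noise.
    return [s for s in sections if coverage(s, frequent) >= threshold]
```

Running text in the target language tends to score well because function
words recur constantly, while computer code and tabular data contain few
of them.
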
The remaining text is tokenized and used to generate a word list sorted by
frequency, and the lowest-frequency words are filtered out.  Then, depending
on the target language, correctly spelled words from one or more "polluting"
languages are filtered out to be checked by hand later.  Usually this means
English, but I also filter Dutch from the Frisian corpus, Spanish from
Chamorro, etc.  The remaining words are used to generate 3-gram statistics
for the target language.  These are used to flag as "suspect" any remaining
words containing one or more improbable 3-grams.  Additional filters specific
to certain languages can be applied optionally; for instance, pairs of words
differing only in the presence or absence of diacritical marks can be flagged,
as can words with a capital letter appearing after the first letter, words
with no vowels, etc.
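
The 3-gram check and the optional filters could look roughly like the
Python sketch below.  It is an illustration under assumptions, not the
code actually used: "improbable" is taken to mean a 3-gram seen fewer than
min_count times across the word list, the vowel set is a placeholder that
would differ by language, and all function names are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict


def trigrams(word):
    padded = f"^{word}$"  # mark the word boundaries
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def flag_improbable(words, min_count=2):
    """Words containing a 3-gram seen fewer than min_count times."""
    counts = Counter(g for w in words for g in trigrams(w))
    return [w for w in words
            if any(counts[g] < min_count for g in trigrams(w))]


def strip_marks(word):
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def diacritic_pairs(words):
    """Groups of words differing only in their diacritical marks."""
    groups = defaultdict(set)
    for w in words:
        groups[strip_marks(w)].add(w)
    return [sorted(g) for g in groups.values() if len(g) > 1]


def has_inner_capital(word):
    return any(c.isupper() for c in word[1:])


def has_no_vowel(word, vowels="aeiou"):
    # Assumption: the vowel inventory must be adjusted per language.
    return not any(c in vowels for c in strip_marks(word).lower())
```

Words returned by any of these functions would be set aside as "suspect"
for checking by hand rather than deleted outright.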

Please contact me at the address below if you are interested in applying
these techniques to a new language.

Kevin Scannell
<scannell@slu.edu>
March 2004