lingucomponent/source/thesaurus/mythes/data_layout.txt

Description of the Structure of the Data needed by MyThes
---------------------------------------------------------

MyThes is very simple.  Almost all of the "smarts" are really
in the thesaurus data file itself.

The format for this file is as follows:

- no binary data

- lines end with a newline '\n', not a carriage return/linefeed pair

- Line 1 is a character string that describes the encoding
used for the file.  It is up to the calling program to convert
to and from this encoding if necessary.

     ISO8859-1 is used by the th_en_US_new.dat file.

     Strings currently recognized by OpenOffice.org are:

     UTF-8
     ISO8859-1
     ISO8859-2
     ISO8859-3
     ISO8859-4
     ISO8859-5
     ISO8859-6
     ISO8859-7
     ISO8859-8
     ISO8859-9
     ISO8859-10
     KOI8-R
     CP-1251
     ISO8859-14
     ISCII-DEVANAGARI


- All of the remaining lines of the file follow this structure:

entry|num_mean
pos|syn1_mean|syn2|...
.
.
.
pos|syn1_mean|syn2|...


where:

   entry      - all-lowercase version of the word or phrase being described
   num_mean   - number of meanings for this entry

   There is one meaning per line, and each meaning consists of:

   pos        - part of speech or other meaning-specific description
   syn1_mean  - synonym 1, also used to describe the meaning itself
   syn2       - synonym 2 for that meaning, etc.
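
The structure above can be read with a few lines of code.  The
following is only a sketch, not the MyThes implementation: the
function name parse_thesaurus is hypothetical, the input is assumed
to be already decoded to text by the caller, and trimming stray
spaces around synonyms is an illustrative choice.

```python
def parse_thesaurus(text):
    """Parse decoded data-file text into (encoding, {entry: meanings}).

    Each meaning is a (pos, [synonyms]) tuple.
    """
    lines = text.split("\n")
    encoding = lines[0].strip()          # Line 1: the encoding string
    entries = {}
    i = 1
    while i < len(lines):
        if not lines[i]:                 # skip the empty string after the final '\n'
            i += 1
            continue
        # rsplit guards against a '|' inside the entry itself (defensive only).
        entry, num_mean = lines[i].rsplit("|", 1)
        count = int(num_mean)
        meanings = []
        # The next num_mean lines hold one meaning each.
        for line in lines[i + 1 : i + 1 + count]:
            pos, *synonyms = line.split("|")
            # Illustrative choice: trim stray spaces and drop empty fields.
            synonyms = [s.strip() for s in synonyms if s.strip()]
            meanings.append((pos, synonyms))
        entries[entry] = meanings
        i += 1 + count
    return encoding, entries
```

The first synonym of each meaning is kept in the list as-is, since
it doubles as the description of that meaning.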


To make this clearer, here is actual data for the
entry "simple":

simple|9
(adj)|simple |elemental|ultimate|oversimplified|simplistic|simplex|simplified|unanalyzable|
undecomposable|uncomplicated|unsophisticated|easy|plain|unsubdivided
(adj)|elementary|uncomplicated|unproblematic|easy
(adj)|bare|mere|plain
(adj)|childlike|wide-eyed|dewy-eyed|naive |naif
(adj)|dim-witted|half-witted|simple-minded|retarded
(adj)|simple |unsubdivided|unlobed|smooth
(adj)|plain
(noun)|herb|herbaceous plant
(noun)|simpleton|person|individual|someone|somebody|mortal|human|soul


It says that "simple" has 9 different meanings and that each
meaning has its part of speech and at least 1 synonym,
with any others, if present, following on the same line.
(The first meaning is wrapped onto a second line here only
to fit the page; in the data file it is a single line.)



Once you have created your own structured text file, you can use
the Perl program "th_gen_idx.pl", found in this directory, to
create an index file that the MyThes code uses to seek into
your data file.

The correct way to run the Perl program is as follows:

cat th_en_US_new.dat | ./th_gen_idx.pl > th_en_US_new.idx
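
For readers who want to see what such an index generator has to
compute, here is a rough sketch.  It is not a port of th_gen_idx.pl;
it is inferred only from the index layout described in this
document.  The function name build_index is hypothetical, and
sorting the entries in byte order is an assumption that merely
agrees with the sample index output in this document.

```python
def build_index(data):
    """Return index-file bytes for the given data-file bytes."""
    lines = data.split(b"\n")
    encoding = lines[0]
    offset = len(lines[0]) + 1           # byte offset of the first entry line
    index = []                           # (entry, byte_offset) pairs
    i = 1
    while i < len(lines) and lines[i]:
        entry, num_mean = lines[i].rsplit(b"|", 1)
        index.append((entry, offset))
        count = int(num_mean.decode("ascii"))
        # Skip the entry line and its meaning lines; +1 for each '\n'.
        for line in lines[i : i + 1 + count]:
            offset += len(line) + 1
        i += 1 + count
    index.sort()                         # assumption: plain byte-order sort
    out = [encoding, str(len(index)).encode("ascii")]
    out += [entry + b"|" + str(off).encode("ascii") for entry, off in index]
    return b"\n".join(out) + b"\n"
```

Working on bytes rather than decoded text matters here, because the
offsets must be byte positions in the data file, whatever encoding
Line 1 declares.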



Then if you run "head" on the resulting index file, you should
see the following:

ISO8859-1
142689
'hood|10
's gravenhage|88
'tween|173
'tween decks|196
.22|231
.22 caliber|319
.22 calibre|365
.38 caliber|411
.38 calibre|457
.45 caliber|503
.45 calibre|549
0|595
1|666
1 chronicles|6283
1 esdras|6336


Line 1 is the same encoding string taken from the
structured thesaurus data file.

Line 2 is a count of the total number of entries
in your thesaurus.

All of the remaining lines are of the form

entry|byte_offset_into_data_file_where_entry_is_found
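
To show how the index and the data file fit together, here is a
small lookup sketch.  It is not the MyThes code.  The names
load_index and lookup are hypothetical, and the binary search
assumes the index entries are sorted in byte order, which the
sample above is consistent with but which this document does not
state.

```python
import bisect


def load_index(index):
    """Split index-file bytes into (encoding, entries, offsets)."""
    lines = index.split(b"\n")
    encoding = lines[0].decode("ascii")          # Line 1: encoding string
    count = int(lines[1].decode("ascii"))        # Line 2: number of entries
    entries, offsets = [], []
    for line in lines[2 : 2 + count]:
        entry, off = line.rsplit(b"|", 1)
        entries.append(entry)
        offsets.append(int(off.decode("ascii")))
    return encoding, entries, offsets


def lookup(word, entries, offsets, data_file):
    """Return the raw meaning lines for word, or None if it has no entry.

    data_file is a seekable binary file object for the data file.
    """
    i = bisect.bisect_left(entries, word)        # assumes byte-sorted entries
    if i == len(entries) or entries[i] != word:
        return None
    data_file.seek(offsets[i])                   # jump straight to "entry|num_mean"
    header = data_file.readline().rstrip(b"\n")
    num_mean = int(header.rsplit(b"|", 1)[1].decode("ascii"))
    return [data_file.readline().rstrip(b"\n") for _ in range(num_mean)]
```

The word passed to lookup must already be lowercased and encoded
with the encoding named on Line 1, since the comparison is done on
raw bytes.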


That's all there is to it.


Kevin
kevin.hendricks@sympatico.ca
