README.TXT

   1 ***********************************************************************
   2 TOOL:  SVMTool++
   3 AUTHOR(s): Jesus Gimenez, Lluis Marquez and Senen Moya
   4 CONTRIBUTIONS: Gael de Chalendar <Gael.de-Chalendar@cea.fr>
   5 DATE:   23/02/2007
   6 VERSION: 1.1.6
   7 DESCRIPTION: A general POS tagger generator based on Support Vector Machines.
   8 ***********************************************************************
   9
  10 Contents
  11 --------
  12
  13 LGPL.txt   _ license terms.
  14 README.TXT _ this file.
  15 CMakeLists.txt   _ main cmake file to compile and link SVMTool
  16      ./sample _ API sample.
  17         * main.cc    _ Sample source
  18         * main       _ Sample program
  19      ./src    _ SVMTool sources.
  20
  21 Compilation products
  22 --------------------
  23 * libsvmtool.so _ The SVMTool library, used by the executables and usable from
  24   your own programs
  25 * SVMTagger  _ POS-tagger
  26 * SVMTeval   _ evaluation component
  27 * SVMTlearn  _ learning component
  28
  29 Description
  30 -----------
  31
  32 SVMT is a simple and effective part-of-speech tagger based on Support Vector Machines. Our experimental evaluation shows that the SVM-based tagger is robust and flexible for feature modelling (including lexicalization), trains efficiently with almost no parameters to tune, and tags thousands of words per second, which makes it practical for real NLP applications. Regarding accuracy, the SVM-based tagger significantly outperforms the TnT tagger under exactly the same conditions, and achieves a very competitive accuracy of 97.17% for English on the WSJ corpus, which is comparable to the best taggers reported to date. This prototype is implemented in C++.
  33
  34 The SVMlight software implementation of Vapnik's Support Vector Machine [Vapnik, 1995] by Thorsten Joachims has been used to train the models. For further information on it see:
  35
  36     * T. Joachims, Making Large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (eds.), MIT Press, 1999.
  37
  38
  39
  40 COMPONENTS:
  41 ===========
  42
  43 The SVMTool consists of three main components:
  44
  45     * SVMTlearn
  46     * SVMTagger
  47     * SVMTeval
  48
  49 These are, respectively, the learner, the tagger and the evaluator.
  50
  51 ---------------------------------------------------------------------
  52 (1) SVMTlearn
  53 ---------------------------------------------------------------------
  54
  55 Given a training set of examples (either annotated or unannotated), SVMTlearn trains a set of SVM classifiers. To do so, it uses the SVM-light software package, an implementation of Vapnik's SVMs in C developed by Thorsten Joachims.
  56
  57 Training data must be in column format, i.e. a one-token-per-line corpus, sentence by sentence. The token is expected to be the first field of the line and the POS tag the second field. The rest of the line is ignored.
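
The column format described above can be read with a few lines of C++. This is only a minimal sketch: parseTrainingLine and TaggedToken are hypothetical names that are not part of the SVMTool API, and the sketch assumes that fields are separated by whitespace.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical type, not part of SVMTool: one token with its POS tag.
struct TaggedToken {
    std::string token;
    std::string pos;
};

// Reads the first two whitespace-separated fields of a training line
// (token, then POS). Any further fields are ignored.
// Returns false when the line has fewer than two fields.
bool parseTrainingLine(const std::string& line, TaggedToken& out) {
    std::istringstream in(line);
    return static_cast<bool>(in >> out.token >> out.pos);
}
```

For instance, the line "The DT extra columns" would yield the token "The" with the POS "DT", while a line holding a single field would be rejected.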
  58
  59 SVMTlearn behaviour is easily adjusted through a configuration file. These are the currently available options:
  60         - Sliding window (size and core position)
  61         - Feature set (word features, POS features, orthographic features)
  62         - Feature filtering (count cutoff and max mapping size)
  63         - SVM model compression
  64         - C parameter tuning
  65         - Test [ against a test set or via cross-validation ]
  66         - Dictionary repairing (heuristically and/or based on a correction list)
  67         - Ambiguous classes (may be optionally provided)
  68         - Open classes (may be optionally provided)
  69         - Backup lexicon (may be optionally provided)
  70
  71
  72 ---------------------------------------------------------------------
  73 (2) SVMTagger
  74 ---------------------------------------------------------------------
  75
  76 Given a text corpus (one token per line) and the path to a previously learned SVM model (including the automatically generated dictionary), SVMTagger performs the POS tagging of a sequence of words. Tagging proceeds on-line, based on a sliding window that defines the feature context considered at every decision. Predicted part-of-speech tags are fed forward as context features for subsequent tagging decisions.
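
The sliding window mentioned above can be pictured with the following sketch. It is not SVMTool's implementation: windowAt is a hypothetical helper, and the "_" padding symbol for positions outside the sentence is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the `size` tokens of the window whose core slot (0-based index
// `core` inside the window) is aligned with sentence position `pos`.
// Positions that fall outside the sentence are filled with "_".
std::vector<std::string> windowAt(const std::vector<std::string>& sentence,
                                  std::size_t pos,
                                  std::size_t size,
                                  std::size_t core) {
    assert(core < size);  // the core must lie inside the window
    std::vector<std::string> window;
    window.reserve(size);
    for (std::size_t i = 0; i < size; ++i) {
        // Signed arithmetic: the window may start before the sentence.
        const long idx = static_cast<long>(pos) + static_cast<long>(i)
                         - static_cast<long>(core);
        if (idx < 0 || idx >= static_cast<long>(sentence.size())) {
            window.push_back("_");
        } else {
            window.push_back(sentence[static_cast<std::size_t>(idx)]);
        }
    }
    return window;
}
```

With a window of size 5 and core position 2, the window for the first word of a sentence holds two padding slots, the word itself and the two words that follow it. In a left-to-right pass, the slots before the core would also carry the tags already predicted.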
  77
  78 The SVMTagger component works on standard input/output. It processes a one-token-per-line corpus, sentence by sentence. The token is expected to be the first field of the line. The predicted POS takes the second field in the output. The rest of the line remains unchanged. Lines beginning with '##' are ignored by the tagger.
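
The line-level input/output convention can be sketched as follows. This is one reading of the description above, not SVMTagger's actual code: isIgnoredLine and insertTag are hypothetical names, and the sketch assumes that fields are separated by a space or a tab and that the predicted tag is inserted (rather than substituted) as the second field.

```cpp
#include <cassert>
#include <string>

// True for lines the tagger ignores, i.e. those beginning with "##".
bool isIgnoredLine(const std::string& line) {
    return line.compare(0, 2, "##") == 0;
}

// Builds an output line: the token stays first, `tag` becomes the second
// field, and whatever followed the token is kept unchanged.
std::string insertTag(const std::string& line, const std::string& tag) {
    const std::string::size_type end = line.find_first_of(" \t");
    if (end == std::string::npos) {
        return line + " " + tag;  // the line holds only the token
    }
    return line.substr(0, end) + " " + tag + line.substr(end);
}
```

Under these assumptions, the input line "cat" tagged as NN becomes "cat NN", and any extra columns after the token are carried over as they are.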
  79
  80 If you integrate SVMTool in a larger application that does its own morphological analysis, you can pass the possible tags computed by your tools, thus helping SVMTagger to handle unknown words (for example named entities or words with an unusual case). In this case, each input line takes a second element: the parenthesized, comma-separated list of possible tags. For example:
  81 token1 (tag1,tag2)
  82 token2 (tag3,tag1,tag4)
  83 ...
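
A calling application could parse or validate lines in this format along the following lines. This is a sketch only: CandidateLine and parseCandidateLine are hypothetical names, not SVMTool functions, and the sketch assumes that the parenthesized list contains no spaces.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical type: a token with the tags proposed by an external analyzer.
struct CandidateLine {
    std::string token;
    std::vector<std::string> tags;
};

// Parses "token (tag1,tag2,...)". Returns false when the second element is
// missing, is not parenthesized, or contains no tag.
bool parseCandidateLine(const std::string& line, CandidateLine& out) {
    std::istringstream in(line);
    std::string list;
    if (!(in >> out.token >> list)) {
        return false;
    }
    if (list.size() < 2 || list.front() != '(' || list.back() != ')') {
        return false;
    }
    out.tags.clear();
    // Strip the parentheses and split the remainder on commas.
    std::istringstream tags(list.substr(1, list.size() - 2));
    std::string tag;
    while (std::getline(tags, tag, ',')) {
        if (!tag.empty()) {
            out.tags.push_back(tag);
        }
    }
    return !out.tags.empty();
}
```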
  84
  85 SVMTagger is flexible and adapts well to the needs of the user. The following options are currently available:
  86
  87         - Tagging scheme (greedy/sentence-level)
  88         - Tagging direction (left-to-right, right-to-left, or both)
  89         - One pass / Two passes
  90         - SVM Model Compression
  91         - Get all predictions (not only the winner)
  92         - Use of a softmax function to transform predictions into probabilities
  93         - Backup lexicon (may be optionally provided)
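
As an illustration of the softmax option in the list above, the sketch below turns a vector of per-tag prediction scores into probabilities. It shows the standard softmax function; the exact variant and scaling used by SVMTagger are not stated here, so this should be read as a generic example rather than the tool's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Standard softmax: p_i = exp(s_i) / sum_j exp(s_j).
// The maximum score is subtracted first for numerical stability, which
// does not change the result.
std::vector<double> softmax(const std::vector<double>& scores) {
    std::vector<double> probs(scores.size());
    if (scores.empty()) {
        return probs;
    }
    const double maxScore = *std::max_element(scores.begin(), scores.end());
    double sum = 0.0;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        probs[i] = std::exp(scores[i] - maxScore);
        sum += probs[i];
    }
    for (std::size_t i = 0; i < probs.size(); ++i) {
        probs[i] /= sum;
    }
    return probs;
}
```

The resulting values are positive, sum to one, and preserve the ranking of the original scores, so the winning tag stays the same.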
  94
  95
  96 ---------------------------------------------------------------------
  97 (3) SVMTeval
  98 ---------------------------------------------------------------------
  99
 100 Given an SVMTool predicted tagging output and the corresponding gold standard, SVMTeval evaluates performance in terms of accuracy. It is a useful component for tuning the system parameters, such as the C parameter, the feature patterns and filtering, the model compression, etc.
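
Accuracy here is the fraction of tokens whose predicted tag matches the gold-standard tag. A minimal sketch of that computation, using a hypothetical function that is not part of SVMTeval:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Fraction of positions where the predicted tag equals the gold tag.
// Both sequences must be aligned token by token; returns 0 for empty input.
double taggingAccuracy(const std::vector<std::string>& predicted,
                       const std::vector<std::string>& gold) {
    assert(predicted.size() == gold.size());
    if (gold.empty()) {
        return 0.0;
    }
    std::size_t correct = 0;
    for (std::size_t i = 0; i < gold.size(); ++i) {
        if (predicted[i] == gold[i]) {
            ++correct;
        }
    }
    return static_cast<double>(correct) / static_cast<double>(gold.size());
}
```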
 101
 102 Moreover, based on a given morphological dictionary (e.g. the one automatically generated at training time), results may also be presented for different sets of words (known words vs unknown words, ambiguous words vs unambiguous words). The same results can also be viewed from the perspective of ambiguity classes, i.e., words sharing the same kind of ambiguity may be considered together. Words sharing the same degree of disambiguation complexity, determined by the size of their ambiguity classes, can also be
 103 grouped.
 104
 105         - mode:
 106                 * 0 : full report
 107                 * 1 : brief report (overall accuracy)
 108                 * 2 : comparing known vs unknown words
 109                 * 3 : grouping words according to their level of ambiguity
 110                 * 4 : grouping words according to their class of ambiguity
 111                 * 5 : from the part-of-speech point of view
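
The idea behind the known vs unknown comparison (mode 2) can be sketched as follows. The types and the function are hypothetical and not part of SVMTeval, and the dictionary is reduced to a plain set of word forms for the sake of the example.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical types for illustration only.
struct EvalToken {
    std::string word;
    std::string predicted;
    std::string gold;
};

struct Counts {
    std::size_t correct;
    std::size_t total;
    Counts() : correct(0), total(0) {}
    double accuracy() const {
        return total == 0 ? 0.0 : static_cast<double>(correct) / total;
    }
};

struct KnownUnknownReport {
    Counts known;    // words found in the dictionary
    Counts unknown;  // words absent from the dictionary
};

// Splits the evaluation by whether each word appears in the dictionary.
KnownUnknownReport compareKnownUnknown(const std::vector<EvalToken>& tokens,
                                       const std::set<std::string>& dictionary) {
    KnownUnknownReport report;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        const EvalToken& t = tokens[i];
        Counts& bucket = dictionary.count(t.word) ? report.known : report.unknown;
        ++bucket.total;
        if (t.predicted == t.gold) {
            ++bucket.correct;
        }
    }
    return report;
}
```

Grouping by level or class of ambiguity (modes 3 and 4) would follow the same pattern, with the bucket chosen from the set of tags the dictionary lists for each word.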
 112
 113
 114 ---------------------------------------------------------------------
 115
 116
 117 NOTES:
 118 ======
 119 - Comments from NLP community members are very welcome.
 120
 121 - A Perl version is available at http://www.lsi.upc.edu/~SVMTool
 122
 123 - If you are interested in receiving upcoming updates of this freely available software, please do not hesitate to e-mail us:
 124         * Jesus Gimenez, at jgimenez@lsi.upc.edu
 125         * Lluis Marquez, at lluism@lsi.upc.edu
 126
 127
 128 CONTRIBUTING:
 129 =============
 130
 131 The SVMTool library is licensed under the LGPL, which means that it may be linked to and used by commercial software packages. However, the license also requires that any changes or improvements made to the library (and in this case also to the morphological data) be redistributed under LGPL terms.
 132
 133 Thus, if you improve the software or data, whether by adding new functionality, fixing bugs, or adding analyzers for new languages, you cannot distribute those changes under conditions different from those stated in the license (i.e. freely and with no usage restrictions).
 134
 135 If you want your changes and improvements to be useful to the many other people using this free software, please contact us (jgimenez@lsi.upc.es).
 136
 137
 138 SVMTool Discussion Group
 139 ========================
 140
 141 Discussion of features and bugs of this software, as well as information
 142 about upcoming updates, takes place on the SVMTool group, to which
 143 you can subscribe at:
 144
 145 http://groups-beta.google.com/group/SVMT
 146
 147 and post messages at:
 148
 149 SVMT@googlegroups.com
 150
 151
 152 REFERENCES:
 153 ===========
 154
 155 Please cite the following paper when referencing this tool in academic work:
 156
 157 * Jesus Gimenez and Lluis Marquez
 158   SVMTool: A general POS tagger generator based on Support Vector Machines.
 159   Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04).
 160   Lisbon, Portugal. 2004.