compound-split/README

Instructions for running the compound splitter, which is a reimplementation
and extension (more features, larger non-word list) of the model described in

  C. Dyer. (2009)  Using a maximum entropy model to build segmentation
            lattices for MT. In Proceedings of NAACL HLT 2009,
            Boulder, Colorado, June 2009.

If you use this software, please cite this paper.


GENERATING 1-BEST SEGMENTATIONS AND LATTICES
------------------------------------------------------------------------------

Here are some sample invocations:

  ./compound-split.pl --output 1best < infile.txt > out.1best.txt
      Segment infile.txt and write the 1-best segmentation to
      out.1best.txt.

  ./compound-split.pl --output plf < infile.txt > out.plf
      Write segmentation lattices for infile.txt in PLF format to out.plf,
      using the default beam.

  ./compound-split.pl --output plf --beam 3.5 < infile.txt > out.plf
      This generates denser lattices than usual (the default beam threshold
      is 2.2; higher numbers do less pruning).
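
If you want to inspect the lattices written with --output plf, the sketch
below may help. It assumes each output line is one lattice in the usual PLF
(Python lattice format) layout: a tuple of nodes, where each node is a tuple
of (word, score, distance-to-next-node) arcs. The function names and the
example lattice are made up for illustration and are not taken from real
output of compound-split.pl.

```python
import ast


def parse_plf(line):
    """Parse one PLF line into a list of nodes.

    Each node is a list of (word, score, offset) arcs, where offset is the
    number of nodes to advance when the arc is taken.
    """
    lattice = ast.literal_eval(line.strip())
    return [[(w, float(s), int(d)) for (w, s, d) in node] for node in lattice]


def segmentations(nodes, start=0):
    """Yield every path through the lattice as a list of words."""
    if start == len(nodes):
        # Reached the final state: one complete path.
        yield []
        return
    for word, _score, offset in nodes[start]:
        for rest in segmentations(nodes, start + offset):
            yield [word] + rest


# Hypothetical lattice: '#' first, then either one long word or two parts.
EXAMPLE_PLF = (
    "((('#',0.0,1),),"
    "(('tagesordnung',0.0,2),('tages',0.0,1),),"
    "(('ordnung',0.0,1),),)"
)

for path in segmentations(parse_plf(EXAMPLE_PLF)):
    print(" ".join(path))
```

Enumerating all paths is only practical for small lattices; with a large
--beam value the number of paths can grow quickly.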


MODEL TRAINING (only for the adventuresome)
------------------------------------------------------------------------------

I've included some training data for training a German-language lattice
segmentation model, and if you want to explore, you can modify the data.
If you're especially adventuresome, you can add features to cdec (the current
feature functions are found in ff_csplit.cc).  The training data and
references are in the file:

               dev.in-ref

The format is the unsegmented form on the right and the reference lattice on
the left, separated by a triple pipe ( ||| ).  Note that the segmentation
model inserts a # as the first word, so your segmentation references must
include this.
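
If you edit dev.in-ref by hand, a quick sanity check along these lines can
catch malformed lines before training. This is only a sketch: the function
names and the example line are hypothetical, the # test is a crude substring
check, and you pass in whichever field holds your reference.

```python
def split_fields(line, sep="|||"):
    """Split one line into its two fields around the triple pipe."""
    parts = [p.strip() for p in line.rstrip("\n").split(sep)]
    if len(parts) != 2:
        raise ValueError("expected exactly one %r separator, got %d"
                         % (sep, len(parts) - 1))
    return parts[0], parts[1]


def has_marker(reference):
    """Crude check that the reference mentions the leading '#' word."""
    return "#" in reference


def bad_lines(lines, reference_field=0):
    """Return 1-based numbers of lines that fail either check.

    reference_field selects which field is treated as the reference
    (0 = left of the separator, 1 = right).
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            fields = split_fields(line)
        except ValueError:
            bad.append(number)
            continue
        if not has_marker(fields[reference_field]):
            bad.append(number)
    return bad


# Made-up line for illustration, not copied from dev.in-ref.
EXAMPLE_LINE = "((('#',0,1),),(('tagesordnung',0,1),),) ||| tagesordnung"
```

Calling bad_lines(open("dev.in-ref")) would then list the lines that need
another look.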

To retrain the model (using MAP estimation of a conditional model), do the
following:

  cd de
  ./TRAIN

Note that the optimization objective is supposed to be non-convex, but I
haven't found that the choice of initialization has much of an effect.  I
haven't looked very hard, though; this might be something to explore.