pdf:status.md

The status is **not done** and *not started*.

* PDF files are [not](https://github.com/coolwanglu/pdf2htmlEX/issues/294) structured. They mostly contain only concatenated words, and the structure of the content is lost: a list, for example, is stored in the PDF as just a pile of text. One drawback is that many tools include footers as part of the page text.

* So the only strategy for extracting information from a PDF is to first extract the text and then try to analyze it. I think the analysis should be done manually (by people); doing it computationally is error-prone.

* We have [many tools](https://aur.archlinux.org/packages/?O=0&K=pdf) for achieving the first stage: [pdf-reader](https://github.com/yob/pdf-reader), [docsplit](http://documentcloud.github.com/docsplit/), [origami](http://www.sajithmr.me/origami-pdf-library-for-ruby), maxtract, latexifier, [pdfminer](http://www.unixuser.org/~euske/python/pdfminer/) (combined with [pdfgrep](http://pdfgrep.sourceforge.net/), a good start), [opaf](https://code.google.com/p/opaf/), [pdf2htmlEX](http://coolwanglu.github.io/pdf2htmlEX/) (the best rendering), [pdf2json](https://github.com/modesty/pdf2json), [sejda](http://www.sejda.com/#upload-file), [TeX4ht](http://tug.org/tex4ht/), pdf2xml [[1]](https://github.com/CrossRef/pdf2xml) [[2]](https://github.com/eliask/pdf2xml) [[3]](https://github.com/zejn/pypdf2xml). We have to test them.
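
The footer problem mentioned above can be partly handled after the first stage. Below is a minimal sketch, not a tested solution: it assumes the text has already been extracted by one of the tools into one string per page, and it only flags the last line of a page as a footer when a similar line recurs on most pages. The function names, the `min_share` threshold, and the sample pages are all hypothetical, and since the heuristic can be wrong, the output should still be reviewed by a person.

```python
from __future__ import annotations

import re
from collections import Counter


def normalize(line: str) -> str:
    """Collapse whitespace and mask digit runs so 'Page 3' and 'Page 12' compare equal."""
    return re.sub(r"\d+", "#", " ".join(line.split()))


def find_repeated_footers(pages: list[str], min_share: float = 0.6) -> set[str]:
    """Return normalized last lines that recur on at least min_share of the pages."""
    last_lines = []
    for page in pages:
        lines = [ln for ln in page.splitlines() if ln.strip()]
        if lines:
            last_lines.append(normalize(lines[-1]))
    counts = Counter(last_lines)
    # Require at least two occurrences so a one-page document has no "footer".
    threshold = max(2, min_share * len(pages))
    return {text for text, n in counts.items() if n >= threshold}


def strip_footers(pages: list[str], footers: set[str]) -> list[str]:
    """Drop the last non-empty line of each page when it matches a known footer."""
    cleaned = []
    for page in pages:
        lines = page.splitlines()
        # Index of the last non-empty line, or None for an empty page.
        idx = next(
            (i for i in range(len(lines) - 1, -1, -1) if lines[i].strip()), None
        )
        if idx is not None and normalize(lines[idx]) in footers:
            lines = lines[:idx]
        cleaned.append("\n".join(lines).rstrip())
    return cleaned


# Hypothetical extracted text, one string per page (assumption: made-up sample data).
pages = [
    "Shopping list\n- milk\n- eggs\nAnnual report - Page 1",
    "- bread\n- tea\nAnnual report - Page 2",
    "Closing notes\nAnnual report - Page 3",
]
footers = find_repeated_footers(pages)
cleaned = strip_footers(pages, footers)
```

Note that the list split across the first two sample pages is still just loose lines after cleaning: removing footers does not bring the structure back, it only removes one source of noise before the manual analysis.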