# PDF import

## Introduction

The code in this directory parses a PDF file and builds a LibreOffice
document containing similar elements, which can then be edited.
It is invoked when opening a PDF file, but **not** when inserting
a PDF into a document.  Inserting a PDF file renders it and inserts
a non-editable, rendered version.

The parsing is done by the [Poppler](https://poppler.freedesktop.org/)
library, which calls back into one layer of this code that is built as a
Poppler output device implementation.

The PDF format is specified by [this document](https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf).

Note that PDF is a language that describes how to **render** a page, not
a language for describing an editable document, so some of the conversion
relies on heuristics that don't always give good results.

Indeed, PDF is Turing complete, and can embed JavaScript, which is also
Turing complete, so it's a wonder that PDFs ever manage to display anything.

## Current limitations

- Not all elements have clipping implemented.

- LibreOffice's clipping routines all use the even-odd fill rule, whereas
PDF can (and usually does) use the non-zero winding rule, making some
clipping operations incorrect.

- In PDF, there's no concept of lines of text or paragraphs; each
character can be entirely separate.  The code has very simple heuristics
for reassembling characters back into lines of text.
Other programs, like *pdftotext*, have more complex heuristics that might be worth a try.

- Some operations that are cheap to express in PDF, like the more advanced
fills, generate many hundreds of objects in LibreOffice, which can make the
document painfully slow to open.  At least some of these could be improved
by implementing more of Poppler's output device API.  Some may require
expanding LibreOffice's set of fill types.

- There can be differences between a distribution's Poppler library build
and the Poppler that LibreOffice builds itself when it doesn't have a distro
build to use, e.g. in LibreOffice's own distributed builds or the bibisect
builds.  In particular, the distro builds may include a library
(supporting another embedded image type) that LibreOffice's build lacks.

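The difference between the two fill rules in the clipping bullet above can be seen with two nested squares drawn in the same direction. The following Python sketch is illustrative only and is not taken from LibreOffice or Poppler; the names `crossings` and `inside` are hypothetical. It casts a ray from the test point towards +x and either counts the crossings (even-odd) or sums their directions (non-zero).

```python
def crossings(point, polygon):
    """Return (signed winding sum, crossing count) for a +x ray from point."""
    px, py = point
    winding = 0
    count = 0
    n = len(polygon)
    for i in range(n):
        (x0, y0), (x1, y1) = polygon[i], polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y0 <= py) != (y1 <= py):
            # x coordinate where the edge meets that line
            x = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
            if x > px:
                count += 1
                winding += 1 if y1 > y0 else -1
    return winding, count


def inside(point, subpaths, rule):
    """Test a point against a path made of several closed subpaths."""
    winding = 0
    count = 0
    for polygon in subpaths:
        w, c = crossings(point, polygon)
        winding += w
        count += c
    if rule == "nonzero":
        return winding != 0
    return count % 2 == 1  # even-odd


# Two squares, both listed counter-clockwise.
outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
inner = [(3, 3), (7, 3), (7, 7), (3, 7)]
print(inside((5, 5), [outer, inner], "nonzero"))  # True
print(inside((5, 5), [outer, inner], "evenodd"))  # False
```

With these paths the inner square is covered under the non-zero rule but becomes a hole under the even-odd rule, which is the kind of mismatch that can make an imported clip wrong.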
## Fundamental limitations

- The character ordering of fonts embedded in a PDF is often ASCII, but not always.
Sometimes it is arbitrary.  Such fonts may then include a *ToUnicode* map allowing
programs to map the arbitrary index back to Unicode.  Alas, not all PDFs
include it, and some even use a bogus map to make it harder to copy/edit.
If the same PDF renders correctly in other readers but fails to copy-and-paste
correctly, then this is probably the issue.

- PDF can use complex programming in many places; for example, a simple fill
could be composed of a complex program to generate the fill tiles instead
of an obvious simple item that can be encoded as a LibreOffice shading type.
Rendering these down to image tiles works OK but can sometimes end up
with a fuzzy image rather than a nice sharp vector representation.

- Poppler's device interface API is not meant to be stable.  The code
thus has lots of `#ifdef`s to deal with different Poppler versions.

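As a rough illustration of the font-ordering problem in the first bullet above, the sketch below decodes the same character codes with and without a mapping. It is a simplification: a plain Python dict stands in for a *ToUnicode* map (real maps are CMap streams inside the PDF), and `decode_codes` is a hypothetical name, not a function from this code base.

```python
def decode_codes(codes, to_unicode=None):
    """Turn font character codes into text.

    to_unicode is a dict standing in for a ToUnicode map
    (an assumption made for illustration only).
    """
    if to_unicode is None:
        # No map: treat each code as its own code point, which is
        # only right when the font happens to be ASCII-ordered.
        return "".join(chr(c) for c in codes)
    # Codes missing from the map become U+FFFD so the gap is visible.
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)


codes = [1, 2, 3]  # arbitrary indices, as a subset font might use
print(decode_codes(codes, {1: "P", 2: "D", 3: "F"}))  # PDF
print(repr(decode_codes(codes)))                      # '\x01\x02\x03'
```

The glyphs drawn on the page are the same in both cases; only the recovered text differs, which matches the symptom of a PDF that renders correctly but copies as garbage.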
## Structure

Note that the structure is dictated by Poppler being GPL licensed, whereas
LibreOffice isn't.

- *xpdfwrapper/* contains the GPL code that's linked with Poppler
and forms the *xpdfimport* binary.  That binary outputs a stream
representing the PDF as simpler operations (lines, clipping operations,
images etc).  These form a series of commands on stdout, and binary
data (mostly images) on stderr.  Because both output streams are in use,
adding debug output is tricky.

- *wrapper/* contains the LibreOffice glue that execs the *xpdfimport*
binary and parses the stream.  It also sets up password entry for
protected PDFs.  After parsing a keyword and any data that
accompanies it, this layer then calls into the following
tree layer.

- *tree/* forms internal tree objects for each of the calls from the
wrapper layer.  The tree is then 'visited' by optimisation layers
(that do things like assemble individual characters into lines of text)
and then by backend-specific XML generators (e.g. for Draw and Writer)
that generate an XML stream to be parsed by the core of LibreOffice.

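The kind of optimisation pass mentioned for *tree/* (assembling individual characters into lines of text) can be pictured with a small sketch. This is not the actual LibreOffice implementation; it is a minimal Python illustration that assumes each glyph arrives as an `(x, y, char)` tuple, and the name `group_into_lines` and the `y_tolerance` parameter are hypothetical.

```python
def group_into_lines(glyphs, y_tolerance=1.0):
    """Group (x, y, char) glyphs into text lines, top to bottom."""
    lines = []  # each entry: [baseline_y, [(x, char), ...]]
    # PDF's y axis points up, so a larger y is higher on the page.
    for x, y, ch in sorted(glyphs, key=lambda g: -g[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            # Close enough to the current baseline: same line.
            lines[-1][1].append((x, ch))
        else:
            lines.append([y, [(x, ch)]])
    # Within each line, order the characters left to right.
    return ["".join(ch for _, ch in sorted(chars)) for _, chars in lines]


glyphs = [(16, 680, "o"), (10, 700, "H"), (10, 680, "y"), (16, 700.2, "i")]
print(group_into_lines(glyphs))  # ['Hi', 'yo']
```

Even this toy version shows where such heuristics get fragile: it ignores word spacing, rotated text and columns, all of which a real pass has to guess at.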
## Bug handling

- Please tag bugs with *filter:pdf* in component *filters and storage*.

- The *pdfseparate* utility, which is part of Poppler, is useful for splitting
a PDF into individual pages to figure out which page is causing a crash
or hang, or for shrinking the problem down.

- [qpdf](https://github.com/qpdf/qpdf) is useful for editing raw PDF
files to really cut down the number of primitives, but takes some
getting used to.

- The *xpdfimport* binary can be run independently of the rest of LibreOffice
to allow the translated stream to be examined:

        ./instdir/program/xpdfimport problem.pdf < /dev/null > stream 2> binarystream

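When comparing many files it can be handy to script the same invocation. The following is a minimal Python sketch that assumes the binary is called exactly as in the command above (PDF path as the only argument, stdin from `/dev/null`). The helper names `run_and_split` and `run_xpdfimport` are hypothetical, and decoding the command stream as UTF-8 is an assumption rather than something this README specifies.

```python
import subprocess


def run_and_split(argv):
    """Run argv with empty stdin; return (stdout bytes, stderr bytes)."""
    result = subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=False,  # keep whatever was produced, even after a crash
    )
    return result.stdout, result.stderr


def run_xpdfimport(pdf_path, binary="./instdir/program/xpdfimport"):
    """Return (list of command lines, raw binary data) for one PDF."""
    commands, binary_data = run_and_split([binary, pdf_path])
    # Assumption: the command stream is text; undecodable bytes are replaced.
    text = commands.decode("utf-8", errors="replace")
    return text.splitlines(), binary_data
```

Calling `run_xpdfimport("problem.pdf")` from the build directory would then give the command lines and the binary data as two separate values, mirroring the `stream` and `binarystream` files above.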