src/documentation/content/xdocs/hslf/quick-guide.xml

<?xml version="1.0" encoding="UTF-8"?>
<!--
   ====================================================================
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
   ====================================================================
-->
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN" "../dtd/document-v11.dtd">

<document>
    <header>
        <title>POI-HSLF - A Quick Guide</title>
        <subtitle>Overview</subtitle>
        <authors>
            <person name="Nick Burch" email="nick at torchbox dot com"/>
        </authors>
    </header>

    <body>
        <section><title>Basic Text Extraction</title>
        <p>For basic text extraction, use
        <code>org.apache.poi.hslf.extractor.PowerPointExtractor</code>, which
        accepts a file or an input stream. The <code>getText()</code> method
        returns the text from the slides, and the <code>getNotes()</code>
        method returns the text from the notes. Finally,
        <code>getText(true,true)</code> returns the text from both.
        </p>
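        <p>As a minimal sketch of the three calls described above, extraction
        might look like the following. The class name
        <code>BasicExtraction</code> and the command-line path argument are
        placeholders for illustration, and the constructor taking a file name
        is assumed to be the "accepts a file" form mentioned above.</p>
        <source><![CDATA[
import org.apache.poi.hslf.extractor.PowerPointExtractor;

public class BasicExtraction {
    public static void main(String[] args) throws Exception {
        // args[0] is a placeholder for the path to a .ppt file
        PowerPointExtractor extractor = new PowerPointExtractor(args[0]);

        // Text from the slides only
        String slideText = extractor.getText();

        // Text from the notes only
        String notesText = extractor.getNotes();

        // Text from both slides and notes
        String allText = extractor.getText(true, true);

        System.out.println("Slides:\n" + slideText);
        System.out.println("Notes:\n" + notesText);
        System.out.println("Both:\n" + allText);
    }
}
]]></source>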
        </section>

        <section><title>Specific Text Extraction</title>
        <p>To get specific bits of text, first create an
        <code>org.apache.poi.hslf.usermodel.SlideShow</code> (from an
        <code>org.apache.poi.hslf.HSLFSlideShow</code>, which accepts a file or
        an input stream). Use <code>getSlides()</code> and
        <code>getNotes()</code> to get the slides and notes. Each of these can
        be queried for its page ID, though they should already be returned in
        the right order.</p>
        <p>You can then call <code>getTextRuns()</code> on these to get
        their blocks of text. (One <code>TextRun</code> normally holds all the
        text in a given area of the page, e.g. in the title bar or in a box.)
        From the <code>TextRun</code>, you can extract the text and check
        what type of text it is (e.g. Body, Title). You can also call
        <code>getRichTextRuns()</code>, which returns the
        <code>RichTextRun</code>s that make up the <code>TextRun</code>. A
        <code>RichTextRun</code> is a sequence of text that shares the
        same character and paragraph formatting.
        </p>
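        <p>The walk from <code>SlideShow</code> down to individual
        <code>RichTextRun</code>s can be sketched roughly as follows. This
        assumes the class and package names listed in this guide; the class
        name <code>SpecificExtraction</code> and the path argument are
        placeholders, and <code>getRunType()</code> is assumed here to be the
        accessor for the text type, since this guide does not name it.</p>
        <source><![CDATA[
import org.apache.poi.hslf.HSLFSlideShow;
import org.apache.poi.hslf.model.Slide;
import org.apache.poi.hslf.model.TextRun;
import org.apache.poi.hslf.usermodel.RichTextRun;
import org.apache.poi.hslf.usermodel.SlideShow;

public class SpecificExtraction {
    public static void main(String[] args) throws Exception {
        // args[0] is a placeholder for the path to a .ppt file
        SlideShow show = new SlideShow(new HSLFSlideShow(args[0]));

        Slide[] slides = show.getSlides();
        for (int i = 0; i < slides.length; i++) {
            // One TextRun per area of text on the slide (title, body, box)
            TextRun[] runs = slides[i].getTextRuns();
            for (int j = 0; j < runs.length; j++) {
                // Assumed accessor: an int code for the kind of text
                int type = runs[j].getRunType();
                System.out.println("Slide " + (i + 1) + ", text type " + type);
                System.out.println(runs[j].getText());

                // Each RichTextRun shares one character/paragraph formatting
                RichTextRun[] richRuns = runs[j].getRichTextRuns();
                for (int k = 0; k < richRuns.length; k++) {
                    System.out.println("  [" + k + "] " + richRuns[k].getText());
                }
            }
        }
    }
}
]]></source>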
        </section>

        <section><title>Poor Quality Text Extraction</title>
        <p>If speed is the most important thing for you, and you don't mind
        getting back duplicate blocks of text, text from master sheets, and
        old text, then
        <code>org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor</code>
        might be of use.</p>
        <p><code>QuickButCruddyTextExtractor</code> doesn't use the normal
        record parsing code; instead, it does a blind search of the tree
        structure to find all text-holding records. You will get all the text,
        including lots of text you normally wouldn't want. However,
        you will get it back very fast.</p>
        <p>There are two ways of getting the text back.
        <code>getTextAsString()</code> returns a single string with all
        the text in it. <code>getTextAsVector()</code> returns a
        vector of strings, one for each text record found in the file.
        </p>
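        <p>A short sketch of both retrieval styles is shown below. The class
        name <code>QuickExtraction</code> and the path argument are
        placeholders, and the constructor taking a file name is an assumption.
        The result of <code>getTextAsVector()</code> is held as a
        <code>List&lt;?&gt;</code> so that the sketch does not depend on the
        exact declared return type.</p>
        <source><![CDATA[
import java.util.List;

import org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor;

public class QuickExtraction {
    public static void main(String[] args) throws Exception {
        // args[0] is a placeholder for the path to a .ppt file
        QuickButCruddyTextExtractor extractor =
                new QuickButCruddyTextExtractor(args[0]);

        // Everything in one string
        String allText = extractor.getTextAsString();
        System.out.println(allText);

        // One entry per text record found; expect duplicates and old text
        List<?> blocks = extractor.getTextAsVector();
        for (int i = 0; i < blocks.size(); i++) {
            System.out.println("[" + i + "] " + blocks.get(i));
        }
    }
}
]]></source>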
        </section>

        <section><title>Changing Text</title>
        <p>You can change text via
        <code>TextRun.setText(String)</code> or
        <code>RichTextRun.setText(String)</code>. It is not yet possible
        to add additional <code>TextRun</code>s or <code>RichTextRun</code>s.</p>
        <p>When calling <code>TextRun.setText(String)</code>, all
        the text ends up with the same formatting. When calling
        <code>RichTextRun.setText(String)</code>, the text retains
        the old formatting of that <code>RichTextRun</code>.
        </p>
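        <p>To illustrate the difference between the two setters, here is a
        hedged sketch that edits the first slide and saves the result. The
        class name <code>ChangeText</code> and both path arguments are
        placeholders. Saving through <code>SlideShow.write(OutputStream)</code>
        is an assumption, as this guide does not describe how to write a file
        back out.</p>
        <source><![CDATA[
import java.io.FileOutputStream;

import org.apache.poi.hslf.HSLFSlideShow;
import org.apache.poi.hslf.model.Slide;
import org.apache.poi.hslf.model.TextRun;
import org.apache.poi.hslf.usermodel.RichTextRun;
import org.apache.poi.hslf.usermodel.SlideShow;

public class ChangeText {
    public static void main(String[] args) throws Exception {
        // args[0] = input .ppt path, args[1] = output path (placeholders)
        SlideShow show = new SlideShow(new HSLFSlideShow(args[0]));
        Slide[] slides = show.getSlides();
        if (slides.length == 0) {
            return; // nothing to edit
        }
        TextRun[] runs = slides[0].getTextRuns();

        if (runs.length > 0) {
            // Replaces the whole block; all text ends up with one formatting
            runs[0].setText("Replaced via TextRun");
        }
        if (runs.length > 1) {
            RichTextRun[] richRuns = runs[1].getRichTextRuns();
            if (richRuns.length > 0) {
                // Replaces only this run; its existing formatting is kept
                richRuns[0].setText("Replaced via RichTextRun");
            }
        }

        // Assumed save step: write the modified slideshow to a new file
        FileOutputStream out = new FileOutputStream(args[1]);
        try {
            show.write(out);
        } finally {
            out.close();
        }
    }
}
]]></source>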
        </section>

        <section><title>Adding Slides</title>
        <p>You can add new slides by calling
        <code>SlideShow.createSlide()</code>, which adds a new slide
        to the end of the <code>SlideShow</code>. It is not currently possible to
        re-order slides, nor to add new text to slides (currently only
        adding Escher objects to new slides is supported).
        </p>
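        <p>A minimal sketch of appending a slide to an existing file follows.
        The class name <code>AddSlide</code> and both path arguments are
        placeholders, and writing the result with
        <code>SlideShow.write(OutputStream)</code> is an assumption that this
        guide does not itself confirm.</p>
        <source><![CDATA[
import java.io.FileOutputStream;

import org.apache.poi.hslf.HSLFSlideShow;
import org.apache.poi.hslf.model.Slide;
import org.apache.poi.hslf.usermodel.SlideShow;

public class AddSlide {
    public static void main(String[] args) throws Exception {
        // args[0] = input .ppt path, args[1] = output path (placeholders)
        SlideShow show = new SlideShow(new HSLFSlideShow(args[0]));
        int before = show.getSlides().length;

        // The new slide is added at the end of the slideshow
        Slide added = show.createSlide();

        int after = show.getSlides().length;
        System.out.println("Slides before: " + before + ", after: " + after);

        // Assumed save step: write the modified slideshow to a new file
        FileOutputStream out = new FileOutputStream(args[1]);
        try {
            show.write(out);
        } finally {
            out.close();
        }
    }
}
]]></source>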
        </section>

        <section><title>Guide to key classes</title>
        <ul>
        <li><code>org.apache.poi.hslf.HSLFSlideShow</code>
        Handles reading in and writing out files. Calls
        <code>org.apache.poi.hslf.record.Record</code> to build a tree
        of all the records in the file, which it allows access to.
        </li>
        <li><code>org.apache.poi.hslf.record.Record</code>
        Base class of all records. Also provides the main record generation
        code, which builds up a tree of records for a file.
        </li>
        <li><code>org.apache.poi.hslf.usermodel.SlideShow</code>
        Builds up model entries from the records, and presents a user-facing
        view of the file.
        </li>
        <li><code>org.apache.poi.hslf.model.Slide</code>
        A user-facing view of a slide in a slideshow. Allows you to get at the
        text of the slide, and at any drawing objects on it.
        </li>
        <li><code>org.apache.poi.hslf.model.TextRun</code>
        Holds all the text in a given area of the slide, and
        contains one or more <code>RichTextRun</code>s.
        </li>
        <li><code>org.apache.poi.hslf.usermodel.RichTextRun</code>
        Holds a run of text, all having the same character and
        paragraph styling. It is possible to modify the text and/or its styling.
        </li>
        <li><code>org.apache.poi.hslf.extractor.PowerPointExtractor</code>
        Uses the model code to allow extraction of text from files.
        </li>
        <li><code>org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor</code>
        Uses the record code to extract all the text from files very fast,
        but including deleted text (and other bits of crud).
        </li>
        </ul>
        </section>
    </body>
</document>