docs/enduser-slow.html

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="description" content="Explains how to speed up HTML Purifier through caching or inbound filtering." />
<link rel="stylesheet" type="text/css" href="./style.css" />

<title>Speeding up HTML Purifier - HTML Purifier</title>

</head><body>

<h1 class="subtitled">Speeding up HTML Purifier</h1>
<div class="subtitle">...also known as the HELP ME LIBRARY IS TOO SLOW MY PAGE TAKE TOO LONG page</div>

<div id="filing">Filed under End-User</div>
<div id="index">Return to the <a href="index.html">index</a>.</div>
<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>

<p>HTML Purifier is a very powerful library. But with power comes great
responsibility, in the form of longer execution times.  Remember, this
library isn't lightly grazing over submitted HTML: it's deconstructing
the whole thing, rigorously checking the parts, and then putting it back
together. </p>

<p>So, if it turns out that HTML Purifier is too slow for outbound
filtering (that is, purifying the HTML every time a page is displayed),
you've got a few options: </p>

<h2>Inbound filtering</h2>

<p>Perform filtering of HTML when it's submitted by the user. Since the
user is already submitting something, an extra half a second tacked on
to the load time probably isn't going to be that huge of a problem.
Then, displaying the content is simply a matter of outputting it
directly from your database/filesystem. The trouble with this method is
that your user loses the original text, and when doing edits, will be
handling the filtered text.  While this may be a good thing, especially
if you're using a WYSIWYG editor, it can also result in data loss if a
user makes a typo: whatever the filter strips or rewrites as a result is
gone for good. </p>

<p>Example (non-functional):</p>

<pre>&lt;?php
    /**
     * FORM SUBMISSION PAGE
     * display_error($message) : displays nice error page with message
     * display_success() : displays a nice success page
     * display_form() : displays the HTML submission form
     * database_insert($html) : inserts data into database as new row
     */
    if (!empty($_POST)) {
        require_once '/path/to/library/HTMLPurifier.auto.php';
        require_once 'HTMLPurifier.func.php';
        $dirty_html = isset($_POST['html']) ? $_POST['html'] : false;
        if (!$dirty_html) {
            display_error('You must write some HTML!');
            exit; // stop here so an empty submission is never stored
        }
        // purify once, at submission time
        $html = HTMLPurifier($dirty_html);
        database_insert($html);
        display_success();
        // notice that $dirty_html is *not* saved
    } else {
        display_form();
    }
?&gt;</pre>

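<p>Since editors work on the filtered text under this scheme, the edit
form has to load the stored HTML back into a form field. A minimal sketch
of that step, assuming a plain <code>&lt;textarea&gt;</code> rather than a
WYSIWYG editor; <code>render_edit_form()</code> is a hypothetical helper,
not part of HTML Purifier:</p>

<pre>&lt;?php
    /**
     * EDIT PAGE (sketch)
     * render_edit_form($stored_html) : hypothetical helper that builds an
     *     edit form pre-filled with the already-filtered HTML
     */
    function render_edit_form($stored_html) {
        // escape the markup so it appears as editable text inside the
        // textarea instead of being interpreted by the browser
        $escaped = htmlspecialchars($stored_html, ENT_QUOTES, 'UTF-8');
        return '&lt;form method="post" action=""&gt;'
             . '&lt;textarea name="html" rows="10" cols="60"&gt;' . $escaped . '&lt;/textarea&gt;'
             . '&lt;input type="submit" value="Save" /&gt;'
             . '&lt;/form&gt;';
    }

    echo render_edit_form('&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;');
?&gt;</pre>

<p>The submission page above can then purify and store the edited text
exactly as it does for a new post. </p>
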
<h2>Caching the filtered output</h2>

<p>Accept the submitted text and put it unaltered into the database, but
then also generate a filtered version and stash that in the database.
Serve the filtered version to readers, and the unaltered version to
editors.  If need be, you can invalidate the cache and have the cached
filtered version be regenerated on the first page view.  Pros? Full data
retention. Cons? It's more complicated, and opens other editors up to
XSS if they are using a WYSIWYG editor, since it renders the unfiltered
text (to fix that, they'd have to be able to get their hands on the
<em>really</em> original text served in plaintext mode). </p>

<p>Example (non-functional):</p>

<pre>&lt;?php
    /**
     * VIEW PAGE
     * display_error($message) : displays nice error page with message
     * cache_get($id) : retrieves HTML from fast cache (db or file)
     * cache_insert($id, $html) : inserts good HTML into cache system
     * database_get($id) : retrieves raw HTML from database
     */
    $id = isset($_GET['id']) ? (int) $_GET['id'] : false;
    if (!$id) {
        display_error('Must specify ID.');
        exit;
    }
    $html = cache_get($id); // filesystem or database
    if ($html === false) {
        // cache didn't have the HTML, generate it
        $raw_html = database_get($id);
        require_once '/path/to/library/HTMLPurifier.auto.php';
        require_once 'HTMLPurifier.func.php';
        $html = HTMLPurifier($raw_html);
        cache_insert($id, $html);
    }
    echo $html;
?&gt;</pre>

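<p>The <code>cache_get()</code> and <code>cache_insert()</code> helpers
above are left undefined. One possible file-based version is sketched
below, along with a <code>cache_delete()</code> for invalidating an entry
when the raw HTML is edited. The cache directory, the file naming, and
<code>cache_delete()</code> itself are assumptions made for illustration,
not part of HTML Purifier:</p>

<pre>&lt;?php
    // assumed location; any directory PHP can write to will do
    define('CACHE_DIR', sys_get_temp_dir() . '/purified-cache');

    function cache_path($id) {
        // $id is cast to int so it can't smuggle in path components
        return CACHE_DIR . '/' . (int) $id . '.html';
    }

    // returns the cached HTML, or false on a cache miss
    function cache_get($id) {
        $file = cache_path($id);
        return is_file($file) ? file_get_contents($file) : false;
    }

    // stores already-purified HTML for later page views
    function cache_insert($id, $html) {
        if (!is_dir(CACHE_DIR)) {
            mkdir(CACHE_DIR, 0700, true);
        }
        // LOCK_EX keeps two simultaneous page views from interleaving writes
        file_put_contents(cache_path($id), $html, LOCK_EX);
    }

    // invalidates an entry; call this whenever the raw HTML is edited
    function cache_delete($id) {
        $file = cache_path($id);
        if (is_file($file)) {
            unlink($file);
        }
    }
?&gt;</pre>

<p>With something like this in place, the edit page would save the raw
HTML and then call <code>cache_delete($id)</code>; the next page view
misses the cache and regenerates the filtered version. </p>
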
<h2>Summary</h2>

<p>In short, inbound filtering is the simple option and caching is the
robust option (albeit with bigger storage requirements). </p>

<p>There is a third option, independent of the two we've discussed: profile
and optimize HTML Purifier yourself. Be sure to report back your results
if you decide to do that! Especially if you port HTML Purifier to C++.
<tt>;-)</tt></p>

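<p>Before optimizing anything, it helps to measure how long purification
actually takes on your own content. A rough timing helper is sketched
below; <code>time_call()</code> is a hypothetical function, not part of
HTML Purifier, and the default run count is arbitrary:</p>

<pre>&lt;?php
    /**
     * time_call($fn, $arg, $runs) : hypothetical helper that returns the
     *     average wall-clock seconds taken by one call of $fn($arg)
     */
    function time_call($fn, $arg, $runs = 10) {
        $start = microtime(true);
        for ($i = 0; $i &lt; $runs; $i++) {
            call_user_func($fn, $arg);
        }
        return (microtime(true) - $start) / $runs;
    }
?&gt;</pre>

<p>It could be pointed at the same <code>HTMLPurifier()</code> function
used in the examples above (non-functional, and the sample HTML is made
up; substitute something representative of what your users submit):</p>

<pre>&lt;?php
    require_once '/path/to/library/HTMLPurifier.auto.php';
    require_once 'HTMLPurifier.func.php';

    $sample = '&lt;p&gt;Some &lt;b&gt;representative&lt;/b&gt; user-submitted HTML&lt;/p&gt;';
    HTMLPurifier($sample); // warm-up call, so one-time setup isn't counted
    printf("%.4f seconds per call\n", time_call('HTMLPurifier', $sample));
?&gt;</pre>
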
</body>
</html>

<!-- vim: et sw=4 sts=4 -->