docs/filter-levels.txt

   1
   2 Filter Levels
   3     When one size *does not* fit all
   4
   5 The more I think about it, the less sense it makes to maintain one huge
   6 monolithic HTMLDefinition class.  There's simply so much variation that
   7 could go into this definition: the set of HTML suitable for blog entries is
   8 definitely too large to allow in blog comments, and even going from
   9 Transitional to Strict requires changes to the definition.
  10
  11 However, allowing users to specify their own whitelists was an idea I
  12 rejected from the start.  Simply put, the typical programmer is too lazy
  13 to actually go through the trouble of investigating which tags, attributes
  14 and properties to allow.  HTMLDefinition makes up a big part of what
  15 HTMLPurifier is.
  16
  17 The idea, then, is to set up fundamentally different sets of definitions, which
  18 can be further customized using simpler configuration options.
  19
  20 Here are some fuzzy levels you could set:
  21
  22 1. Comments - WordPress recommends a, abbr, acronym, b, blockquote, cite,
  23     code, em, i, strike, strong; however, you could get away with only a, em and
  24     p; also having blockquote and pre tags would be helpful.
  25 2. BBCode - Emulate the usual tagset for forums: b, i, img, a, blockquote,
  26     pre, div, span and h[2-6] (the last three are for specially formatted
  27     posts, div and span require associated classes or inline styling enabled
  28     to be useful)
  29 3. Pages - As permissive as possible without allowing XSS.  No protection
  30     against bad design sense, unfortunately.  Suitable for wiki and page
  31     environments.
  32 4. Lint - Accept everything in the spec, a Tidy wannabe. (This probably won't
  33     get implemented as it would require routines for things like <object>
  34     and friends to be implemented, which is a lot of work for not a lot of
  35     benefit)
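
The levels above could be sketched as preset tag sets that a site then
tweaks with a couple of simple options. This is only an illustration in
Python: LEVELS, build_whitelist and the allow/forbid parameters are
hypothetical names, not HTMLPurifier's API, and only the two levels with a
concrete tag list are filled in.

```python
# Hypothetical presets; the names and exact tag sets are illustrative only.
LEVELS = {
    # The minimal comment set suggested above, plus blockquote and pre.
    "comments": {"a", "em", "p", "blockquote", "pre"},
    # The usual forum tagset.
    "bbcode": {"b", "i", "img", "a", "blockquote", "pre", "div", "span",
               "h2", "h3", "h4", "h5", "h6"},
}


def build_whitelist(level, allow=(), forbid=()):
    """Start from a preset level, then apply simple per-site tweaks."""
    if level not in LEVELS:
        raise ValueError(f"unknown filter level: {level!r}")
    return (LEVELS[level] | set(allow)) - set(forbid)


# A comment form that also wants <code> but no links:
comment_tags = build_whitelist("comments", allow=["code"], forbid=["a"])
```

The point of the design is that the preset does the investigative work and
the caller only states the difference from it.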
  36
  37 One final note: when you start axing tags that are more commonly used, you
  38 run the risk of accidentally destroying user data, especially if the data
  39 is incoming from a WYSIWYG editor that hasn't been synced accordingly. This may
  40 make forbidden-element-to-text transformations desirable (for example, for images).
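
A forbidden-element-to-text transformation might look roughly like the
following. It is a minimal sketch built on Python's standard html.parser,
not a description of how HTMLPurifier does it, and it is not a complete
sanitizer (attributes on allowed tags are simply dropped). The class and
function names are made up for the example.

```python
from html import escape
from html.parser import HTMLParser


class ForbiddenToText(HTMLParser):
    """Keep allowed tags; turn forbidden ones into visible text."""

    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            # Attributes are dropped for brevity; a real filter would
            # validate them instead.
            self.out.append(f"<{tag}>")
        elif tag == "img":
            # Show the alt text (or the source URL) instead of the image.
            attrs = dict(attrs)
            label = attrs.get("alt") or attrs.get("src") or ""
            self.out.append(escape(f"[image: {label}]"))
        else:
            # Any other forbidden tag is shown literally rather than lost.
            self.out.append(escape(self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <img ... />: emit the start form only.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        text = f"</{tag}>"
        self.out.append(text if tag in self.allowed else escape(text))

    def handle_data(self, data):
        self.out.append(escape(data))


def forbidden_to_text(html, allowed):
    parser = ForbiddenToText(allowed)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

With only p allowed, an image in the input survives as a readable
"[image: ...]" marker and other forbidden tags show up as escaped text, so
the user can at least see what was taken away.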
  41
  42
  43
  44 == Element Risk Analysis ==
  45
  46 Legend:
  47     [danger level] - regular tags / uncommon tags ~ deprecated tags
  48     [danger level]* - rare tags
  49
  50 1 - blockquote, code, em, i, p, tt / strong, sub, sup
  51 1* - abbr, acronym, bdo, cite, dfn, kbd, q, samp
  52 2 - b, br, del, div, pre, span / ins, s, strike ~ u
  53 3 - h2, h3, h4, h5, h6 ~ center
  54 4 - h1, big ~ font
  55 5 - a
  56 7 - area, map
  57
  58 The following are special-use tags; they should be enabled on a blanket basis:
  59
  60 Lists - dd, dl, dt, li, ol, ul ~ menu, dir
  61 Tables - caption, table, td, th, tr / col, colgroup, tbody, tfoot, thead
  62
  63 Forms - fieldset, form, input, label, legend, optgroup, option, select, textarea
  64 XSS - noscript, object, script ~ applet
  65 Meta - base, basefont, body, head, html, link, meta, style, title
  66 Frames - frame, frameset, iframe
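
One way to put the danger levels and blanket groups to work is a lookup
that builds a tag set from a maximum danger level plus whole groups. The
tables below are transcribed from the lists above; the names DANGER, RARE,
GROUPS and allowed_tags are hypothetical, and only the Lists and Tables
groups are included to keep the sketch short.

```python
# Danger levels from the analysis above (rare "1*" tags kept separate).
DANGER = {
    1: {"blockquote", "code", "em", "i", "p", "tt", "strong", "sub", "sup"},
    2: {"b", "br", "del", "div", "pre", "span", "ins", "s", "strike", "u"},
    3: {"h2", "h3", "h4", "h5", "h6", "center"},
    4: {"h1", "big", "font"},
    5: {"a"},
    7: {"area", "map"},
}
RARE = {"abbr", "acronym", "bdo", "cite", "dfn", "kbd", "q", "samp"}

# Special-use groups are switched on or off as a whole.
GROUPS = {
    "lists": {"dd", "dl", "dt", "li", "ol", "ul", "menu", "dir"},
    "tables": {"caption", "table", "td", "th", "tr",
               "col", "colgroup", "tbody", "tfoot", "thead"},
}


def allowed_tags(max_danger, groups=(), include_rare=False):
    """Collect every tag at or below max_danger, plus whole groups."""
    tags = set()
    for level, names in DANGER.items():
        if level <= max_danger:
            tags |= names
    if include_rare:
        tags |= RARE
    for group in groups:
        tags |= GROUPS[group]
    return tags
```

For instance, allowed_tags(2, groups=["lists"]) gives basic inline and
block formatting with lists, but no headings and no links.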
  67
  68 And tag specific notes:
  69
  70 a   - general problems involving linkspam
  71 b   - too much bold is bad, typographically speaking bold is discouraged
  72 br  - often misused
  73 center - CSS, usually no legit use
  74 del - only useful in editing context
  75 div - little meaning in certain contexts, e.g. blog comments
  76 h1  - usually no legit use, as header is already set by application
  77 h*  - not needed in blog comments
  78 hr  - usually not necessary in blog comments
  79 img - could be extremely undesirable if linking to external pics (CSRF, goatse)
  80 pre - could use formatting, only useful in code contexts
  81 q   - very little support
  82 s   - transform into span with styling or del?
  83 small - technically presentational
  84 span - depends on attribute allowances
  85 sub, sup - specialized
  86 u   - little legit use, prefer class with text-decoration
  87
  88 Based on the riskiness of these items, we may want to offer a %HTML.DisableImages
  89 directive and put URI filtering higher up on the priority list.
  90
  91
  92 == Attribute Risk Analysis ==
  93
  94 We actually have a surprisingly small assortment of allowed attributes (the
  95 rest are deprecated in Strict, and thus we opted not to allow them, even
  96 though our output is XHTML Transitional by default).
  97
  98 Required or URI - img.alt, img.src, a.href
  99 Medium risk - *.class, *.dir
 100 High risk - img.height, img.width, *.id, *.style
 101
 102 Table - colgroup/col.span, td/th.rowspan, td/th.colspan
 103 Uncommon - *.title, *.lang, *.xml:lang
 104 Rare - td/th.abbr, table.summary, {table}.charoff
 105 Rare URI - del.cite, ins.cite, blockquote.cite, q.cite, img.longdesc
 106 Presentational - {table}.align, {table}.valign, table.frame, table.rules,
 107     table.border
 108 Partially presentational - table.cellpadding, table.cellspacing,
 109     table.width, col.width, colgroup.width
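
The element.attribute notation used above, with * standing for any element,
maps naturally onto a small lookup. The following is a sketch under that
assumption; ALLOWED_ATTRS is an illustrative subset and the function names
are hypothetical.

```python
# Illustrative subset of the attributes above, written as
# "element.attribute"; "*" stands for any element.
ALLOWED_ATTRS = {
    "img.alt", "img.src", "a.href",
    "*.title", "*.lang",
    "td.colspan", "th.colspan", "td.rowspan", "th.rowspan",
}


def attr_allowed(element, attribute, allowed=ALLOWED_ATTRS):
    """True if the exact pair or the matching wildcard entry is listed."""
    return (f"{element}.{attribute}" in allowed
            or f"*.{attribute}" in allowed)


def filter_attrs(element, attrs, allowed=ALLOWED_ATTRS):
    """Drop every attribute that is not whitelisted for this element."""
    return {name: value for name, value in attrs.items()
            if attr_allowed(element, name, allowed)}
```

Leaving the high-risk entries (*.id, *.style, img.height, img.width) out of
the set is all it takes to disable them for every element.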
 110
 111
 112 == CSS Risk Analysis ==
 113
 114 There are certain CSS properties that are extremely useful inline, but
 115 as you get to more presentation-oriented styling it may not always be
 116 appropriate to inline them.
 117
 118 Useful - clear, float, border-collapse, caption-side
 119
 120 These CSS properties can break layouts if used improperly. We have excluded
 121 any CSS properties that are not currently implemented (such as position).
 122
 123 Dangerous, can go outside container - float
 124 Easy to abuse - font-size, font-family (font), width
 125 Colored - background-color (background), border-color (border), color
 126 Dramatic - border, list-style-position (list-style), margin, padding,
 127     text-align, text-indent, text-transform, vertical-align, line-height
 128
 129 Dramatic properties substantially change the look of text in ways that should
 130 probably have been reserved for other areas.
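
The categories above could drive a per-category filter for inline style
attributes, roughly as follows. This is a sketch only: the category names
and filter_style are hypothetical, float is filed under "useful" even
though it is also listed as dangerous, border is filed under "dramatic"
only, and splitting on ";" is naive (it would mishandle a ";" inside a
quoted string or url()).

```python
# Categories transcribed from the lists above, shorthands included.
CSS_CATEGORIES = {
    "useful": {"clear", "float", "border-collapse", "caption-side"},
    "abusable": {"font-size", "font-family", "font", "width"},
    "colored": {"background-color", "background", "border-color", "color"},
    "dramatic": {"border", "list-style-position", "list-style", "margin",
                 "padding", "text-align", "text-indent", "text-transform",
                 "vertical-align", "line-height"},
}


def filter_style(style, categories):
    """Keep only declarations whose property is in an enabled category."""
    allowed = set()
    for name in categories:
        allowed |= CSS_CATEGORIES[name]
    kept = []
    # Naive split: good enough for simple declarations only.
    for declaration in style.split(";"):
        prop, sep, value = declaration.partition(":")
        prop = prop.strip().lower()
        value = value.strip()
        if sep and value and prop in allowed:
            kept.append(f"{prop}: {value}")
    return "; ".join(kept)
```

A comment-level setup might enable only "useful" and "colored", so that
"color: red; font-size: 40px; float:left" is reduced to
"color: red; float: left".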