From 728e6c5b440f08f491998dedb8437c9aa6f0e548 Mon Sep 17 00:00:00 2001
From: "Edward Z. Yang"
Date: Sat, 30 Sep 2006 19:25:51 +0000
Subject: [PATCH] Sync 1.1 branch as much as possible with trunk.

git-svn-id: http://htmlpurifier.org/svnroot/htmlpurifier/branches/1.1@476 48356398-32a2-884e-a903-53898d9a118a
---
 NEWS                                    | 10 ++++
 docs/colors.txt                         | 23 ++++++++
 docs/filter-levels.txt                  | 97 +++++++++++++++++++++++++++------
 docs/strictness.txt                     | 25 +++++++++
 library/HTMLPurifier/HTMLDefinition.php | 36 ++++++++----
 5 files changed, 164 insertions(+), 27 deletions(-)
 create mode 100644 docs/colors.txt
 create mode 100644 docs/strictness.txt

diff --git a/NEWS b/NEWS
index d312c8ce..e25a41b1 100644
--- a/NEWS
+++ b/NEWS
@@ -24,6 +24,16 @@ NEWS ( CHANGELOG and HISTORY ) HTMLPurifier
 . Refactored parseData() to general Lexer class
 . Tester named "HTML Purifier" not "HTMLPurifier"
 
+1.1.1, released 2006-09-24
+! Configuration option to optionally Tidy up output for indentation to make up
+  for dropped whitespace by DOMLex (pretty-printing for the entire application
+  should be done by a page-wide Tidy)
+- Various documentation updates
+- Fixed parse error in configuration documentation script
+- Fixed fatal error in benchmark scripts, slightly augmented
+- As far as possible, whitespace is preserved in-between table children
+- Sample test-settings.php file included
+
 1.1.0, released 2006-09-16
 ! Directive documentation generation using XSLT
 ! XHTML can now be turned off, output becomes
diff --git a/docs/colors.txt b/docs/colors.txt
new file mode 100644
index 00000000..e0b74e45
--- /dev/null
+++ b/docs/colors.txt
@@ -0,0 +1,23 @@
+
+Colors
+    Hammering some sense into those content-makers
+
+Your website probably has a color-scheme. Green on white, purple on yellow,
+whatever. When you give users the ability to style their content, you may
+want them to keep in line with your styling. If your website is all
+about light colors, you don't want a user to come in and vandalize your
+page with a deep maroon.
+
+This is an extremely silly feature proposal, but I'm writing it down anyway.
+
+What if the user could constrain the colors specified in inline styles? You
+are only allowed to use these shades of dark green for text and these shades
+of light yellow for the background. At the very least, you could ensure
+that we did not have pale yellow on white text.
+
+Implementation issues:
+1. Requires the color attribute definition to know, currently, what the text
+and background colors are. This becomes difficult when classes are thrown
+into the mix.
+2. The user still has to define the permissible colors, how does one do
+something like that?
diff --git a/docs/filter-levels.txt b/docs/filter-levels.txt
index 09b0563f..52f4a05b 100644
--- a/docs/filter-levels.txt
+++ b/docs/filter-levels.txt
@@ -20,15 +20,32 @@ can further be customized using simpler configuration options.
 Here are some fuzzy levels you could set:
 
 1. Comments - Wordpress recommends a, abbr, acronym, b, blockquote, cite,
-   code, em, i, strike, strong; however, you could get away with only a, b and
-   i; also having p and pre tags would be helpful.
-2. Pages - As permissive as possible without allowing XSS. No protection
+   code, em, i, strike, strong; however, you could get away with only a, em and
+   p; also having blockquote and pre tags would be helpful.
+2. BBCode - Emulate the usual tagset for forums: b, i, img, a, blockquote,
+   pre, div, span and h[2-6] (the last three are for specially formatted
+   posts, div and span require associated classes or inline styling enabled
+   to be useful)
+3. Pages - As permissive as possible without allowing XSS. No protection
    against bad design sense, unfortunantely. Suitable for wiki and page
    environments.
-3. Lint - Accept everything in the spec, a Tidy wannabe.
+4. Lint - Accept everything in the spec, a Tidy wannabe. (This probably won't
+   get implemented as it would require routines for things like
+   and friends to be implemented, which is a lot of work for not a lot of
+   benefit)
 
-I've also decomposed tags into risk levels. An asterisk indicates that no one
-really uses that tag, tilde indicates it's deprecated.
+One final note: when you start axing tags that are more commonly used, you
+run the risk of accidentally destroying user data, especially if the data
+is incoming from a WYSIWYG editor that hasn't been synced accordingly. This may
+make forbidden element to text transformations desirable (for example, images).
+
+
+
+== Element Risk Analysis ==
+
+Legend:
+    [danger level] - regular tags / uncommon tags ~ deprecated tags
+    [danger level]* - rare tags
 
 1 - blockquote, code, em, i, p, tt / strong, sub, sup
 1* - abbr, acronym, bdo, cite, dfn, kbd, q, samp
@@ -38,30 +55,76 @@ really uses that tag, tilde indicates it's deprecated.
 5 - a
 7 - area, map
 
+These are special use tags, they should be enabled on a blanket basis.
+
 Lists - dd, dl, dt, li, ol, ul ~ menu, dir
 Tables - caption, table, td, th, tr / col, colgroup, tbody, tfoot, thead
+Forms - fieldset, form, input, label, legend, optgroup, option, select, textarea
 XSS - noscript, object, script ~ applet
-Meta - base, basefont, body, head, html, link, meta, style, title
 Frames - frame, frameset, iframe
 
 
 And tag specific notes:
 
 
-a - general problems involving linkspam
-b - too much bold is bad, typographically speaking bold is discouraged
-br - often misused
+a - general problems involving linkspam
+b - too much bold is bad, typographically speaking bold is discouraged
+br - often misused
 center - CSS, usually no legit use
 del - only useful in editing context
 div - little meaning in certain contexts i.e. blog comment
-h1 - usually no legit use, as header is already set by application
-h* - not needed in blog comments
-hr - usually not necessary in blog comments
-img - could be extremely undesirable if linking to external pics
+h1 - usually no legit use, as header is already set by application
+h* - not needed in blog comments
+hr - usually not necessary in blog comments
+img - could be extremely undesirable if linking to external pics (CSRF, goatse)
 pre - could use formatting, only useful in code contexts
-q - very little support
-s - transform into span with styling or del?
+q - very little support
+s - transform into span with styling or del?
 small - technically presentational
 span - depends on attribute allowances
 sub, sup - specialized
-u - little legit use, prefer class with text-decoration
+u - little legit use, prefer class with text-decoration
+
+Based on the riskiness of the items, we may want to offer %HTML.DisableImages
+attribute and put URI filtering higher up on the priority list.
+
+
+== Attribute Risk Analysis ==
+
+We actually have a surprisingly small assortment of allowed attributes (the
+rest are deprecated in strict, and thus we opted not to allow them, even
+though our output is XHTML Transitional by default.)
+
+Required URI - img.alt, img.src, a.href
+Medium risk - *.class, *.dir
+High risk - img.height, img.width, *.id, *.style
+
+Table - colgroup/col.span, td/th.rowspan, td/th.colspan
+Uncommon - *.title, *.lang, *.xml:lang
+Rare - td/th.abbr, table.summary, {table}.charoff
+Rare URI - del.cite, ins.cite, blockquote.cite, q.cite, img.longdesc
+Presentational - {table}.align, {table}.valign, table.frame, table.rules,
+    table.border
+Partially presentational - table.cellpadding, table.cellspacing,
+    table.width, col.width, colgroup.width
+
+
+== CSS Risk Analysis ==
+
+There are certain CSS elements that are extremely useful inline, but then
+as you get to more presentation oriented styling it may not always be
+appropriate to inline them.
+
+Useful - clear, float, border-collapse, caption-side
+
+These CSS properties can break layouts if used improperly. We have excluded
+any CSS properties that are not currently implemented (such as position).
+
+Dangerous, can go outside container - float
+Easy to abuse - font-size, font-family (font), width
+Colored - background-color (background), border-color (border), color
+Dramatic - border, list-style-position (list-style), margin, padding,
+    text-align, text-indent, text-transform, vertical-align, line-height
+
+Dramatic elements substantially change the look of text in ways that should
+probably have been reserved to other areas.
diff --git a/docs/strictness.txt b/docs/strictness.txt
new file mode 100644
index 00000000..b4f9268b
--- /dev/null
+++ b/docs/strictness.txt
@@ -0,0 +1,25 @@
+
+Is HTML Purifier Strict or Transitional?
+    A little bit of helpful guidance
+
+Despite the fact that HTML Purifier professes only to support transitional
+HTML, it rejects a lot of attributes and elements that are actually, indeed,
+valid. You can investigate progress.html to find out precisely what we
+are doing to these *deprecated* attributes.
+
+However, users have found that Strict HTML imposes some quite unreasonable
+restrictions on certain things. The start and value attributes in ol and
+li (respectively) perhaps are the most contested. There is currently no
+widely supported browser method short of JavaScript that can replace these
+two deprecated attributes. HTML Purifier does not currently support them, but
+it might behoove us to do so while our output is still transitional.
+
+Fortunately, that's the only real bugger case. The others have near-perfect
+CSS equivalents, and were presentational anyway. However, the other question
+pops up: should we always convert these to the CSS forms when 1. the spec
+allows them anyway and 2. older browsers support them better? After all, the
+whole point about CSS is to separate styling from content, so inline styling
+doesn't solve that problem.
+
+It's an icky question, and we'll have to deal with it as more and more
+transforms get implemented.
diff --git a/library/HTMLPurifier/HTMLDefinition.php b/library/HTMLPurifier/HTMLDefinition.php
index 34791ee7..9ef7d1c1 100644
--- a/library/HTMLPurifier/HTMLDefinition.php
+++ b/library/HTMLPurifier/HTMLDefinition.php
@@ -56,6 +56,7 @@ class HTMLPurifier_HTMLDefinition
 
     /**
      * String name of parent element HTML will be going into.
+     * @todo Allow this to be overloaded by user config
      * @public
      */
     var $info_parent = 'div';
@@ -111,12 +112,19 @@ class HTMLPurifier_HTMLDefinition
         //////////////////////////////////////////////////////////////////////
         // info[]->child : defines allowed children for elements
 
-        // entities: prefixed with e_ and _ replaces .
+        // entities: prefixed with e_ and _ replaces . from DTD
+        // double underlines are entities we made up
 
         // we don't use an array because that complicates interpolation
         // strings are used instead of arrays because if you use arrays,
         // you have to do some hideous manipulation with array_merge()
 
+        // todo: determine whether or not having allowed children
+        //       that aren't allowed globally affects security (it shouldn't)
+        // if above works out, extend children definitions to include all
+        //       possible elements (allowed elements will dictate which ones
+        //       get dropped)
+
         $e_special_extra = 'img';
         $e_special_basic = 'br | span | bdo';
         $e_special = "$e_special_basic | $e_special_extra";
@@ -142,16 +150,18 @@ class HTMLPurifier_HTMLDefinition
         $e_block = "p | $e_heading | div | $e_lists | $e_blocktext | table";
         $e__flow = "#PCDATA | $e_block | $e_inline | $e_misc";
         $e_Flow = new HTMLPurifier_ChildDef_Optional($e__flow);
-        $e_a_content = new HTMLPurifier_ChildDef_Optional("#PCDATA | $e_special".
-            " | $e_fontstyle | $e_phrase | $e_inline_forms | $e_misc_inline");
+        $e_a_content = new HTMLPurifier_ChildDef_Optional("#PCDATA".
+            " | $e_special | $e_fontstyle | $e_phrase | $e_inline_forms".
+            " | $e_misc_inline");
         $e_pre_content = new HTMLPurifier_ChildDef_Optional("#PCDATA | a".
             " | $e_special_basic | $e_fontstyle_basic | $e_phrase_basic".
             " | $e_inline_forms | $e_misc_inline");
-        $e_form_content = new HTMLPurifier_ChildDef_Optional(''); //unused
-        $e_form_button_content = new HTMLPurifier_ChildDef_Optional(''); // unused
+        $e_form_content = new HTMLPurifier_ChildDef_Optional('');//unused
+        $e_form_button_content = new HTMLPurifier_ChildDef_Optional('');//unused
 
         $this->info['ins']->child =
-        $this->info['del']->child = new HTMLPurifier_ChildDef_Chameleon($e__inline, $e__flow);
+        $this->info['del']->child =
+            new HTMLPurifier_ChildDef_Chameleon($e__inline, $e__flow);
 
         $this->info['blockquote']->child=
         $this->info['dd']->child =
@@ -225,7 +235,7 @@ class HTMLPurifier_HTMLDefinition
 
         //////////////////////////////////////////////////////////////////////
         // info[]->type : defines the type of the element (block or inline)
-        // reuses $e_Inline and $e_block
+        // reuses $e_Inline and $e_Block
 
         foreach ($e_Inline->elements as $name) {
             $this->info[$name]->type = 'inline';
@@ -243,7 +253,7 @@ class HTMLPurifier_HTMLDefinition
 
         $this->info['a']->excludes = array('a' => true);
         $this->info['pre']->excludes = array_flip(array('img', 'big', 'small',
-            // technically in spec, but we don't allow em anyway
+            // technically useless, but good to be in-depth
             'object', 'applet', 'font', 'basefont'));
 
         //////////////////////////////////////////////////////////////////////
@@ -253,6 +263,8 @@ class HTMLPurifier_HTMLDefinition
         // by the transform classes. It will, however, do simple and slightly
         // complex attribute value substitution
 
+        // the question of varying allowed attributes is more entangling.
+
         $e_Text = new HTMLPurifier_AttrDef_Text();
 
         // attrs, included in almost every single one except for a few,
@@ -297,7 +309,8 @@ class HTMLPurifier_HTMLDefinition
 
         $this->info['table']->attr['summary'] = $e_Text;
 
-        $this->info['table']->attr['border'] = new HTMLPurifier_AttrDef_Pixels();
+        $this->info['table']->attr['border'] =
+            new HTMLPurifier_AttrDef_Pixels();
 
         $e_Length = new HTMLPurifier_AttrDef_Length();
         $this->info['table']->attr['cellpadding'] =
@@ -329,7 +342,7 @@ class HTMLPurifier_HTMLDefinition
         $this->info['q']->attr['cite'] = $e_URI;
 
         //////////////////////////////////////////////////////////////////////
-        // UNIMP : info_tag_transform : transformations of tags
+        // info_tag_transform : transformations of tags
 
         $this->info_tag_transform['font'] = new HTMLPurifier_TagTransform_Font();
         $this->info_tag_transform['menu'] = new HTMLPurifier_TagTransform_Simple('ul');
@@ -339,6 +352,9 @@ class HTMLPurifier_HTMLDefinition
         //////////////////////////////////////////////////////////////////////
         // info[]->auto_close : tags that automatically close another
 
+        // todo: determine whether or not SGML-like modeling based on
+        //       mandatory/optional end tags would be a better policy
+
         // make sure you test using isset() not !empty()
 
         // these are all block elements: blocks aren't allowed in P
-- 
2.11.4.GIT