Public Git Hosting - mediawiki.git/commit

commit	9f14fbd002713abcf65a5b5cddc5d52dee90a977
author	C. Scott Ananian <cscott@cscott.net>
	Fri, 21 Jan 2022 22:03:26 +0000 (17:03 -0500)
committer	C. Scott Ananian <cscott@cscott.net>
	Fri, 4 Mar 2022 19:06:02 +0000 (14:06 -0500)
tree	35553d39a02079a3aa561478f5bb28a629064595
parent	ccaeb8368072bb00c6bbf6e2447a42da26acbbf2

Add Sanitizer::removeSomeTags() which uses Remex to tokenize

The existing Sanitizer::removeHTMLtags() method, in addition to having
dodgy capitalization, uses regular expressions to parse the HTML.
That produces corner cases like T298401 and T67747 and is not guaranteed
to yield balanced or well-formed HTML.

Instead, introduce and use a new Sanitizer::removeSomeTags() method
which is guaranteed to always return balanced and well-formed HTML.
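
To illustrate why a tokenizer-based approach can guarantee balanced
output where a regular expression cannot, here is a minimal Python
sketch. It is not the Remex-based PHP code in this commit: the class
name, the remove_some_tags() helper and the ALLOWED set are
hypothetical, attributes are dropped for brevity, and disallowed tags
are escaped to text, which is an assumption about the desired behavior.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list; the real list lives in MediaWiki's Sanitizer.
ALLOWED = {"b", "i", "sup", "sub", "span"}


class RemoveSomeTags(HTMLParser):
    """Keep allowed tags, escape everything else, always emit balanced HTML."""

    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = allowed
        self.out = []
        self.open_tags = []  # stack of allowed tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            self.open_tags.append(tag)
            self.out.append(f"<{tag}>")  # attributes dropped in this sketch
        else:
            # Disallowed tag: emit its source text as escaped character data.
            self.out.append(escape(self.get_starttag_text()))

    def handle_endtag(self, tag):
        if tag in self.allowed and tag in self.open_tags:
            # Close any inner tags that are still open, then this one.
            while self.open_tags:
                top = self.open_tags.pop()
                self.out.append(f"</{top}>")
                if top == tag:
                    break
        else:
            # Stray or disallowed end tag: escape it instead of emitting it.
            self.out.append(escape(f"</{tag}>"))

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def result(self):
        self.close()  # flush any buffered text
        while self.open_tags:  # close whatever the input left open
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def remove_some_tags(html, allowed=ALLOWED):
    parser = RemoveSomeTags(allowed)
    parser.feed(html)
    return parser.result()
```

Because every emitted start tag is pushed on a stack and every end tag
is matched against that stack, mis-nested input such as
"<b>bold <i>both</b> tail" comes out as "<b>bold <i>both</i></b> tail".
A regex that filters tags one at a time has no such stack and would
pass the mis-nesting through unchanged.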

Note that Sanitizer::removeHTMLtags()/::removeSomeTags() take a callback
argument which (as far as I can tell) is never used outside core. Mark
that argument as @internal, and clean up the version used by
::removeSomeTags().

Use the new ::removeSomeTags() method in the two places where
DISPLAYTITLE is handled (following up on T67747).  The use by the
legacy parser is more difficult to replace (and would have a
performance cost), so leave the old ::removeHTMLtags() method in place
for that call site for now: when the legacy parser is replaced by
Parsoid the need for the old ::removeHTMLtags() will go away.  In a
follow-up patch we'll rename ::removeHTMLtags() and mark it @internal
so that we can deprecate ::removeHTMLtags() for external use.

Some benchmarking code added.  On my machine, with PHP 7.4, the new
method tidies short 30-character title strings at a rate of about
6764/s while the tidy-based method being replaced here managed 6384/s.
Sanitizer::removeHTMLtags() blazes through short strings roughly 18x
faster (120,915/s); some of this difference is due to the setup cost of
creating the tag whitelist and the Remex pipeline, so further
optimizations could doubtless be done if Sanitizer::removeSomeTags()
is more widely used.
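
For readers who want to reproduce this kind of calls-per-second figure,
the following Python sketch shows one way to measure throughput and to
derive the speedup ratio from the numbers quoted above. It is not the
benchmarkSanitizer.php script from this commit; ops_per_second() and
its parameters are hypothetical names.

```python
import time


def ops_per_second(fn, arg, iterations=10_000):
    """Call fn(arg) repeatedly and return the observed calls per second."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn(arg)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return iterations / max(elapsed, 1e-9)


def speedup(fast_rate, slow_rate):
    """Ratio between two throughput figures, rounded to one decimal place."""
    return round(fast_rate / slow_rate, 1)


# Figures quoted in the commit message (calls per second).
REGEX_RATE = 120_915  # Sanitizer::removeHTMLtags()
REMEX_RATE = 6_764    # Sanitizer::removeSomeTags()
ratio = speedup(REGEX_RATE, REMEX_RATE)  # 17.9
```

Because part of the per-call cost is setup, a measurement like this on
short strings mostly reflects fixed overhead rather than per-character
parsing speed.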

Bug: T299722
Bug: T67747
Change-Id: Ic864c01471c292f11799c4fbdac4d7d30b8bc50f

autoload.php
includes/OutputPage.php
includes/parser/CoreParserFunctions.php
includes/parser/RemexRemoveTagHandler.php	[new file with mode: 0644]
includes/parser/RemexStripTagHandler.php
includes/parser/Sanitizer.php
maintenance/benchmarks/benchmarkSanitizer.php
tests/phpunit/includes/parser/SanitizerTest.php
tests/phpunit/unit/includes/parser/SanitizerUnitTest.php