Public Git Hosting - mediawiki.git/commit

Replace our mb_substr() fallback implementation with one which is not quite so horrible...

While not too awful on smallish strings, the way it worked was *murder* on large input:
the *entire string* would be broken up into an array of individual characters, sliced up,
then merged back together.

In my testing I couldn't even get the function to complete in a reasonable time for, say,
127k worth of text... not only did the regex split take forever, but it would eat an insane
amount of memory, likely triggering memory_limit hits in a sane world.

The new implementation counts characters from the beginning or end of a string to determine
the byte-based offsets to use for substr() start and count parameters, and only uses
a couple temporary dupes of the string in memory. For typical short offset/count cases
(take or trim one or a few characters) this performs about 3-5x worse than native mb_substr()
for in my testing.

Large offsets are optimized by first skipping the same number of bytes as characters, since all
characters take at least one byte. On primarily Latin text this made some of my test cases
actually *faster* than native mb_substr()! ;) For non-Latin texts this takes out a fair chunk
of our work, but can still leave us with very slow execution -- eg ~30ms to get through a few
dozens of kilobytes worth of offset on Japanese text. But at least it completes now!

This could probably be optimized further, perhaps skipping progressively smaller chunks in
binary-chop fashion. :)

For fun, my profiling results (profiling & test scripts are in a little git repo which I would
push to gitorious to poke at, but gitorious hates me right now and won't finish my repo setup):

strlen       mb_strlen short ascii - 0.0019ms - 19
strlen      xmb_strlen short ascii - 0.0672ms - 19

strlen       mb_strlen short unicode - 0.0019ms - 19
strlen      xmb_strlen short unicode - 0.0657ms - 19

strlen       mb_strlen long ascii - 0.0826ms - 20000
strlen      xmb_strlen long ascii - 0.1236ms - 20000

strlen       mb_strlen long unicode - 0.0774ms - 20000
strlen      xmb_strlen long unicode - 0.1901ms - 20000

strlen       mb_strlen san francisco - 0.4775ms - 126700
strlen      xmb_strlen san francisco - 0.4474ms - 126700

substr       mb_substr short ascii first - 0.0022ms - 1-byte string ("s") <- native
substr      xmb_substr short ascii first - 0.0168ms - 1-byte string ("s") <- old fallback
substr     xmb_substr3 short ascii first - 0.0069ms - 1-byte string ("s") <- new fallback

substr       mb_substr short ascii last - 0.0023ms - 1-byte string ("s")
substr      xmb_substr short ascii last - 0.0171ms - 1-byte string ("s")
substr     xmb_substr3 short ascii last - 0.0113ms - 1-byte string ("s")

substr       mb_substr short ascii trim last 9 - 0.0023ms - 10-byte string ("short asci")
substr      xmb_substr short ascii trim last 9 - 0.0183ms - 10-byte string ("short asci")
substr     xmb_substr3 short ascii trim last 9 - 0.0119ms - 10-byte string ("short asci")

substr       mb_substr short ascii middle 3 - 0.0022ms - 3-byte string ("sci")
substr      xmb_substr short ascii middle 3 - 0.0171ms - 3-byte string ("sci")
substr     xmb_substr3 short ascii middle 3 - 0.0149ms - 3-byte string ("sci")

substr       mb_substr short unicode first - 0.0022ms - 1-byte string ("s")
substr      xmb_substr short unicode first - 0.0184ms - 1-byte string ("s")
substr     xmb_substr3 short unicode first - 0.0071ms - 1-byte string ("s")

substr       mb_substr short unicode last - 0.0026ms - 2-byte string ("ß")
substr      xmb_substr short unicode last - 0.0187ms - 2-byte string ("ß")
substr     xmb_substr3 short unicode last - 0.0130ms - 2-byte string ("ß")

substr       mb_substr short unicode trim last 9 - 0.0024ms - 14-byte string ("short áéíó")
substr      xmb_substr short unicode trim last 9 - 0.0200ms - 14-byte string ("short áéíó")
substr     xmb_substr3 short unicode trim last 9 - 0.0137ms - 14-byte string ("short áéíó")

substr       mb_substr short unicode middle 3 - 0.0022ms - 6-byte string ("éíó")
substr      xmb_substr short unicode middle 3 - 0.0188ms - 6-byte string ("éíó")
substr     xmb_substr3 short unicode middle 3 - 0.0189ms - 6-byte string ("éíó")

substr       mb_substr san fran first - 0.0022ms - 1-byte string ("{")
substr     xmb_substr3 san fran first - 0.0069ms - 1-byte string ("{")

substr       mb_substr san fran last - 0.8914ms - 1-byte string ("\n")
substr     xmb_substr3 san fran last - 0.0109ms - 1-byte string ("\n")

substr       mb_substr san fran non-first - 0.5995ms - 127318-byte string (c00cabc812ac347bd2e81a3e3f04e23d)
substr     xmb_substr3 san fran non-first - 0.0213ms - 127318-byte string (c00cabc812ac347bd2e81a3e3f04e23d)

substr       mb_substr san fran middle 1k - 0.2218ms - 1025-byte string (c42eb5c511670f72ff4593a39219682c)
substr     xmb_substr3 san fran middle 1k - 0.3883ms - 1025-byte string (c42eb5c511670f72ff4593a39219682c)

substr       mb_substr boston-ja first - 0.0021ms - 1-byte string ("{")
substr     xmb_substr3 boston-ja first - 0.0068ms - 1-byte string ("{")

substr       mb_substr boston-ja last - 0.5497ms - 1-byte string ("\n")
substr     xmb_substr3 boston-ja last - 0.0110ms - 1-byte string ("\n")

substr       mb_substr boston-ja non-first - 0.4128ms - 127637-byte string (933e70d1d10f4d64cdfbd69b58592cd4)
substr     xmb_substr3 boston-ja non-first - 0.0216ms - 127637-byte string (933e70d1d10f4d64cdfbd69b58592cd4)

substr       mb_substr boston-ja middle 1k - 0.2237ms - 2006-byte string (1eaa8554ff4507109b1cba7a597d82bf)
substr     xmb_substr3 boston-ja middle 1k - 30.6811ms - 2006-byte string (1eaa8554ff4507109b1cba7a597d82bf)

commit	1505db42a26f9a3e7b3934d8d48e190e39a06924
author	Brion Vibber <brion@users.mediawiki.org>
	Wed, 26 Aug 2009 05:51:21 +0000 (26 05:51 +0000)
committer	Brion Vibber <brion@users.mediawiki.org>
	Wed, 26 Aug 2009 05:51:21 +0000 (26 05:51 +0000)
tree	7458d0bb281fb28c53a23358893cd1c05d338ec9	tree \| snapshot (tar.gz zip)
parent	8c3e5040fc082490bd6dcf6d711b00b22858aa8f	commit \| diff