Page MenuHomePhabricator

Remove use of utf8_encode and utf8_decode in Wikibase
Closed, ResolvedPublic

Description

utf8_encode() and utf8_decode() are deprecated in PHP 8.2, because their names are quite misleading (RFC): they convert strings from Latin 1 to UTF-8 and vice versa.

Wikibase uses them in SimpleCacheWithBagOStuff (since the original revision of that class):

	private function serialize( $value ) {
		$serializedValue = serialize( $value );
		$dataToStore = utf8_encode( $serializedValue );

		$signature = hash_hmac( 'sha256', $dataToStore, $this->secret );
		return json_encode( [ $signature, $dataToStore ] );
	}

	private function unserialize( $string, $default, array $loggingContext ) {
		$result = json_decode( $string );

		// ...
		$decodedData = utf8_decode( $data );

		if ( $decodedData === serialize( false ) ) {
			return false;
		}

		// phpcs:disable Generic.PHP.NoSilencedErrors.Discouraged
		$value = @unserialize(
			$decodedData,
			[
				'allowed_classes' => [ \stdClass::class ]
			]
		);

		// ...
		return $value;
	}

I’m pretty sure this use of the functions is pointless (though “correct” in that the methods perform their inverse operations). MediaWiki sets mb_internal_encoding( 'UTF-8' ); early in Setup.php (since T137509), so there is no reason to assume that PHP serialize() returns Latin 1 that needs to be UTF-8 encoded before being passed into json_encode() – we’re just putting JSONified mojibake in the inner cache.

shell.php
> $c = new HashBagOStuff()
> ( new Wikibase\Lib\SimpleCacheWithBagOStuff( $c, 'prefix', 'secret' ) )->set( 'key', 'äöü' )
> $c->get( $c->makeKey( 'prefix', 'key' ) )
= "["9b94f92e4cfdcd3a282fc9398acd9a77f6d74d72fe95930c67e3edc44ece2745","s:6:\"\u00c3\u00a4\u00c3\u00b6\u00c3\u00bc\";"]"
> json_decode( $c->get( $c->makeKey( 'prefix', 'key' ) ) )[1] 
= "s:6:"äöü";"

We should remove both method calls. (But we’ll need to tweak the cache key at the same time, to ensure that old mojibake serializations don’t get deserialized without the utf8_decode().)

Event Timeline

Note: the practical impact of this issue is that any Wikibase developer on PHP 8.2 will need to disable deprecation warnings on their wiki (I use error_reporting( E_ALL & ~E_DEPRECATED ); // T324202). In Wikidata team prioritization terms, I’d say that this affects developers (but it sounds like this task isn’t prioritized for us).


We should remove both method calls. (But we’ll need to tweak the cache key at the same time, to ensure that old mojibake serializations don’t get deserialized without the utf8_decode().)

Given that this cache is (AFAICT) used among other things for the term store / formatter cache, where it has a very high absolute hit rate (over a million hits per minute) and a pretty good relative hit rate (ca. 98%), I don’t think we can afford to just switch out the cache key without having some kind of fallback for the old value, unfortunately. But instead of changing the cache key, I think we can do something clever in SimpleCacheWithBagOStuff::unserialize(), between the json_decode() and the utf8_decode() – so that both “old-style” and “new-style” entries are stored in the underlying cache using the same key, but we detect whether to call utf8_decode() based on the JSON structure. (Maybe “new-style” values have an extra array element in the JSON after the $signatureToCheck and $data, or whatever.) That way we keep the key the same and only hit the underlying cache once rather than twice.

Yeah, I think since this change is a bit more global to Wikibase than just affecting Wikidata.org, then Wikibase Product Platform Team WPP should probably take a look at this.

So far, since wikidata.org production is on PHP v7.4 and the mw-cli tool is still on PHP v8.1 (and this is the tool we onboard new developers to use) this probably has less of an effect on developers who are not aware of the deprecation or how to set the proper error reporting flags to ignore it. But if this sufficiently bugs us as a team, there might be a reason to pick it up, so long as the impact is constricted.

However, judging from the comments above, this requires a bit more thinking as at least some compatibility layer is needed.

though “correct” in that the methods perform their inverse operations […] we’re just putting JSONified mojibake in the inner cache.

The utf8_decode documentation says "UTF-8 characters which do not exist in ISO-8859-1 (that is, code points above U+00FF) are replaced with ?". This operation is not reversible. I can confirm this from personal experience.

though “correct” in that the methods perform their inverse operations […] we’re just putting JSONified mojibake in the inner cache.

The utf8_decode documentation says "UTF-8 characters which do not exist in ISO-8859-1 (that is, code points above U+00FF) are replaced with ?". This operation is not reversible. I can confirm this from personal experience.

But this utf8_decode() call is never going to see “UTF-8 characters which do not exist in ISO-8859-1”. It only ever sees results from utf8_encode(), which treats its input as ISO-8859-1, so the output of utf8_encode() only contains UTF-8 characters which do exist in ISO-8859-1.

> $c = new HashBagOStuff()
> $sc = new Wikibase\Lib\SimpleCacheWithBagOStuff( $c, 'prefix', 'secret' )
> $sc->set( 'key', '🤔' ); // U+1F914
> $sc->get( 'key' )
= "🤔"

Oh, got it. It's the wrong way around.

The weird code was added in patchset 20 in July 2018, apparently by @WMDE-leszek: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/441203/19..20/lib/includes/SimpleCacheWithBagOStuff.php. The commit message was edited in the same patchset. It mentions 3 sources, but I can't find utf8_… mentioned on these pages.

I tried to read the discussion, but it's not very helpful. Unfortunately I wasn't involved back then. There is also no Phabricator task linked.

What's done in the code directly after the weird utf8_encode is a hash_hmac call. Could it be that this is not able to work with UTF-8 strings, but expects something else?

What is SimpleCacheWithBagOStuff even used for? What does the extra hashing do?

My suggestion is to not touch this class, but mark it as deprecated and replace all usages with something reliable from core.

An alternative is to change the serialization format and add a 3rd array element with a version number 2. Cache entries without this number are still passed through the old deserialization.

Change 983300 had a related patch set uploaded (by Ollie Shotton; author: Paladox):

[mediawiki/extensions/Wikibase@master] SimpleCacheWithBagOStuff: Replace utf8_decode with mb_convert_encoding

https://gerrit.wikimedia.org/r/983300

As I was explicitly called out, a few answers, plus an opinion at the end.

The weird code was added in patchset 20 in July 2018, apparently by @WMDE-leszek: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/441203/19..20/lib/includes/SimpleCacheWithBagOStuff.php. The commit message was edited in the same patchset. It mentions 3 sources, but I can't find utf8_… mentioned on these pages.

I tried to read the discussion, but it's not very helpful. Unfortunately I wasn't involved back then. There is also no Phabricator task linked.

What's done in the code directly after the weird utf8_encode is a hash_hmac call. Could it be that this is not able to work with UTF-8 strings, but expects something else?

Looking at the change history I can recall I pair-programmed parts of this change with Aleksey. I didn't remember the exact reason for adding utf8_encode and _decode. But thankfully there are tests which help to understand it.
Test in question is seemingly Wikibase\Lib\Tests\SimpleCacheTestCase::testBinaryData
Role of utf8_encode is apparently not about dealing with UTF-8 strings, but to ensure that $dataToStore in return json_encode( [ $signature, $dataToStore ] ); (Wikibase\Lib\SimpleCacheWithBagOStuff::serialize) is UTF-8 encoded. As per PHP documentation in case it was not, json_encode would return false, leading to incorrect value being stored in cache.

It might be, which maybe Lucas implies, that given the current usage of this cache class (I recall it was introduced to cache labels and descriptions of items, those could be maybe assumed to be always UTF-8-encoded) this situation might be impossible in practice. Whether this warrants removal of the encoding in the cache class, I am not sure.

What is SimpleCacheWithBagOStuff even used for?

It is a PSR-16-compliant implementation CacheInterface implementation wrapping Mediawiki's "BagOStuff", as per consquence of Decision to "Use PSR-16 PHP Cache Interface in Wikibase" (instead of binding to Mediawiki's less univeral interfaces or other abstractions)
Originally, SimpleCacheWithBagOStuff was introduced to use cache instead of secondary SQL table for efficient fetching of item label and description data to be used when displaying "references" to an item. It could be it has been also adopted for other use cases.

My suggestion is to not touch this class, but mark it as deprecated and replace all usages with something reliable from core.

I note that the latter part of the suggestion goes against the quoted above decision to use more common cache abstraction.

Finally, I'd like to address the above remarks of possibly introducing some fallback cache key generation algorithm to not change keys while changing the logic in the cache class. I'd think it would be sensible to provide some data, if it was actually intended to introduce some "smart" logic to keep cache keys unchanged. Without understanding the suggested impact it does seem like unnecessary complexity to me. I'd naively think if such a detail like changing the cache key generation algorithm (I guess equaling temporary dropping of cache) can take Wikidata down, we might have bigger problems than what is being discussed here. I admit I don't remember details of rollout of the cache approach (instead of the secondary SQL table) but I don't recall any spectacular outages on the way.

Finally, I'd like to address the above remarks of possibly introducing some fallback cache key generation algorithm to not change keys while changing the logic in the cache class. I'd think it would be sensible to provide some data, if it was actually intended to introduce some "smart" logic to keep cache keys unchanged. Without understanding the suggested impact it does seem like unnecessary complexity to me.

I already provided some data in T324202#9302370. A cache hit rate between 98% and 99% means that, if we remove the cache, the database will be hit with somewhere between 50× and 100× as many queries of this type as before. I don’t know how this query compares to other queries that hit the database (does it currently occupy 0.1% of the execution time? 1%? 10%? no idea), but it seems to run often enough to occasionally show up in the slow queries log (searching for wbt_term_in_lang, a table name that I expect to appear in the database query underlying this cache) – the few instances where this query was above the threshold to get logged (5 seconds) already sum to 47.4 s in the past hour.

I'd naively think if such a detail like changing the cache key generation algorithm (I guess equaling temporary dropping of cache) can take Wikidata down, we might have bigger problems than what is being discussed here. I admit I don't remember details of rollout of the cache approach (instead of the secondary SQL table) but I don't recall any spectacular outages on the way.

I don’t understand this argument. The cache was considered necessary when we introduced it (T242871), and while the site wasn’t down before then, the cache improved performance substantially (e.g. median parse time for Wikibase items halved, from ~200 ms to ~100 ms, T242871#6625282). More generally speaking – how can we build a well-performing website when the very mechanisms that make it perform well can be criticized purely on the grounds that, if those mechanisms were removed, the website would no longer perform well?

It’s possible the site won’t go down if we just change the cache key, but I’d rather avoid the massive load spike if we can.

Very bad choice of words in the argumentation, my apologies. What I attempted to mean is that I'd encourage to compare both the cost of adding some additional logic to some low level code (and for how long? for a day or long-term?) as well as the "operational" impact of changing cache keys and its duration. (massive load spike and lasting how long) I am not saying it is not sensible thing to add, neither meant, of course, to say performance optimizations should only be restricted to cases when they would prevent complete outage. I attempted to call for more empirical approach to discussing benefits and tradeoffs of potential optimization.

All that referring to the situation that the cache keys were actually supposed to change due to removal of utf8 encoding/decoding which I am not sure is actually wanted and necessary for the task at hand.

Would it be possible to expedite this fix, please? It's blocking CI (and the MediaWiki support for) PHP 8.2 generally.

Change 983300 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] SimpleCacheWithBagOStuff: Replace utf8_decode with mb_convert_encoding

https://gerrit.wikimedia.org/r/983300

Prio Notes:

Impact AreaAffectedNotes
production / end usersNot until production moves to PHP 8.2, which will take quite some time (the next target after mw-on-k8s is 8.1)
monitoring
development effortsutf8_encode() and utf8_decode() are confusing; developers on PHP 8.2 have to disable deprecation warnings
onboarding efforts
additional stakeholdersStrongly affected – blocks PHP 8.2 on several extensions, including ones by other teams (e.g. GrowthExperiments)

Change #1015992 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Fomafix):

[mediawiki/extensions/Wikibase@master] Replace deprecated utf8_encode

https://gerrit.wikimedia.org/r/1015992

I’m pretty sure this use of the functions is pointless (though “correct” in that the methods perform their inverse operations). MediaWiki sets mb_internal_encoding( 'UTF-8' ); early in Setup.php (since T137509), so there is no reason to assume that PHP serialize() returns Latin 1 that needs to be UTF-8 encoded before being passed into json_encode() – we’re just putting JSONified mojibake in the inner cache.

It turns out that the class has a test (SimpleCacheTestCase::testBinaryData(), inherited from an upstream library) that asserts that binary strings can be stored in the cache and retrieved correctly; this breaks if the utf8_encode() is removed. However, I see no reason for us to support this (we only put text into the cache, not binary data), so I think we should just skip the test.

Change #1017815 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] SimpleCacheWithBagOStuff: Remove double UTF-8 (1/3)

https://gerrit.wikimedia.org/r/1017815

Change #1017816 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] SimpleCacheWithBagOStuff: Remove double UTF-8 (2/3)

https://gerrit.wikimedia.org/r/1017816

Change #1017817 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] SimpleCacheWithBagOStuff: Remove double UTF-8 (3/3)

https://gerrit.wikimedia.org/r/1017817

Change #1018958 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] SimpleCacheWithBagOStuff: Check cached data more thoroughly

https://gerrit.wikimedia.org/r/1018958

Change #1019060 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] SimpleCacheWithBagOStuff: Extract methods from unserialize()

https://gerrit.wikimedia.org/r/1019060

Change #1017815 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] SimpleCacheWithBagOStuff: Remove double UTF-8 (1/3)

https://gerrit.wikimedia.org/r/1017815

Change #1019175 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@REL1_42] SimpleCacheWithBagOStuff: Remove double UTF-8 (1/3)

https://gerrit.wikimedia.org/r/1019175

Change #1019175 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@REL1_42] SimpleCacheWithBagOStuff: Remove double UTF-8 (1/3)

https://gerrit.wikimedia.org/r/1019175

Change #1017816 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] SimpleCacheWithBagOStuff: Remove double UTF-8 (2/3)

https://gerrit.wikimedia.org/r/1017816

Change #1020229 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@REL1_42] SimpleCacheWithBagOStuff: Remove double UTF-8 (2/3)

https://gerrit.wikimedia.org/r/1020229

Change #1020229 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@REL1_42] SimpleCacheWithBagOStuff: Remove double UTF-8 (2/3)

https://gerrit.wikimedia.org/r/1020229

Change #1017817 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] SimpleCacheWithBagOStuff: Remove double UTF-8 (3/3)

https://gerrit.wikimedia.org/r/1017817

Change #1018958 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] SimpleCacheWithBagOStuff: Check cached data more thoroughly

https://gerrit.wikimedia.org/r/1018958

Change #1019060 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] SimpleCacheWithBagOStuff: Extract methods from unserialize()

https://gerrit.wikimedia.org/r/1019060

Change #1015992 abandoned by Jforrester:

[mediawiki/extensions/Wikibase@master] Replace deprecated utf8_encode

Reason:

Removed in other patches, instead.

https://gerrit.wikimedia.org/r/1015992

Change #1023401 had a related patch set uploaded (by Jforrester; author: Jforrester):

[integration/config@master] Zuul: Re-apply PHP 8.2 CI to Wikibase-based code

https://gerrit.wikimedia.org/r/1023401

Given that this cache is (AFAICT) used among other things for the term store / formatter cache, where it has a very high absolute hit rate (over a million hits per minute) and a pretty good relative hit rate (ca. 98%), I don’t think we can afford to just switch out the cache key without having some kind of fallback for the old value, unfortunately.

The relative hit rate looks pretty stable over the past week (when step 2/3, emitting the new cache format, started to roll out with the train), so I think we can call this rollout a success :)

Minor note: I didn’t realize this before, but the multi-step rollout isn’t just important in case of train rollback – the cache is intentionally shared between repo and client (see TermFallbackCacheFactory::getTermFallbackCache()), so (if I understand correctly) at any time we can have two different train versions reading and writing the cache. So it’s extra important that we split this change across several trains ^^

Change #1023401 merged by jenkins-bot:

[integration/config@master] Zuul: Re-apply PHP 8.2 CI to Wikibase-based code

https://gerrit.wikimedia.org/r/1023401

Mentioned in SAL (#wikimedia-releng) [2024-04-26T18:18:26Z] <James_F> Zuul: Re-apply PHP 8.2 CI to Wikibase-based code for T360560 T324202 T353161