Stop needing to use wgLegacyEncoding in Wikimedia cluster production
Closed, ResolvedPublic

Event Timeline

@Jdforrester-WMF Do you expect T128149: Remove wgLegacyEncoding feature of Revision/BlobStore to have more dependencies than just this one?

The tasks exist separately. This task is about migrating Wikimedia production to not use the variable. T128149 is about (eventually, possibly not yet) removing the functionality from MediaWiki core. It'd be cheap and harmless to keep around (unused by default for years already) for a few releases longer if other parties may be involved as well. There's no rush there.

@Jdforrester-WMF Do you expect T128149: Remove wgLegacyEncoding feature of Revision/BlobStore to have more dependencies than just this one?

The tasks exist separately.

Not that separate when one is a blocker of the other... ;-)

Anyway, I was just wondering whether this might be a more or less useless intermediate step, which is why I asked...

Out of curiosity what are the pain points of it being left in that lead to wanting to remove it? Data already has to be loaded via compressed/encoded/multi-row/external storage/etc so nobody should be reading old raw text entries and being surprised at their encoding since years ago.

Just a desire for cleanliness, or a partial step to adding some sort of different functionality?

In T128150#2073245, @brion wrote:

Out of curiosity what are the pain points of it being left in that lead to wanting to remove it? Data already has to be loaded via compressed/encoded/multi-row/external storage/etc so nobody should be reading old raw text entries and being surprised at their encoding since years ago.

Just a desire for cleanliness, or a partial step to adding some sort of different functionality?

The former.

Hehe fair enough. :) In that case I would recommend adding a maintenance script to do the conversion & using it in production to clean up the existing rows, at which point we can safely drop it from our config file.

Maybe deprecate the var to mark it for removal in a future version for T128149 if you feel strongly, but it's a pretty narrow bit of actual code that's fairly well-isolated. (There are exactly two places that use $wgLegacyEncoding, in revision text fetching and password comparisons.)

The move started a new set of production warnings: PHP Warning: gzinflate(): data error. The volume is quite small though: 82 in the last day. I thought it was a case of legacy encoding lingering in a wiki that no longer has it set, but it also happens in nlwiki, which still has the legacy encoding option.

Found the root cause: it happens when the row in the text table is only compressed, with nothing else applied. In that case moveToExternal ran iconv on the still-compressed gzip data and only afterwards checked whether it was compressed, which leads to this fun.
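To make the ordering problem concrete, here is a minimal Python model of it, not the actual moveToExternal code. The function names are hypothetical, `errors="ignore"` is an assumption standing in for iconv's `//IGNORE` behaviour, and raw DEFLATE (`wbits=-15`) is used because that is the format PHP's gzinflate() reads.

```python
import zlib


def deflate(raw: bytes) -> bytes:
    """Raw DEFLATE stream, the format PHP's gzdeflate()/gzinflate() use."""
    compressor = zlib.compressobj(wbits=-15)
    return compressor.compress(raw) + compressor.flush()


def to_utf8_buggy(row: bytes) -> bytes:
    # Wrong order: the compressed bytes are treated as windows-1252 text
    # and re-encoded, so every byte >= 0x80 is rewritten or dropped.
    return row.decode("cp1252", errors="ignore").encode("utf-8")


def to_utf8_fixed(row: bytes) -> bytes:
    # Right order: decompress first, convert the text, then compress again.
    text = zlib.decompress(row, wbits=-15).decode("cp1252")
    return deflate(text.encode("utf-8"))
```

With the fixed order, `zlib.decompress(to_utf8_fixed(row), wbits=-15)` yields valid UTF-8 text; with the buggy order the result is usually no longer a valid DEFLATE stream, which would be consistent with the gzinflate() data errors reported above.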

To fix it, I'm thinking of loading the content from ES, running iconv from UTF-8 to windows-1252, inflating the gzip data, converting the result back to UTF-8, and then compressing it and storing it in ES again.

Yup. In nlwiki:
// old_id = 592755
> $extStoreAccess = MediaWiki\MediaWikiServices::getInstance()->getExternalStoreAccess();
> $text = $extStoreAccess->fetchFromURL( 'DB://cluster27/3026858' );
> \Wikimedia\AtEase\AtEase::suppressWarnings();
> $text = iconv( 'UTF-8//IGNORE', 'windows-1252', $text );
> \Wikimedia\AtEase\AtEase::restoreWarnings();
> $text = gzinflate( $text );
> \Wikimedia\AtEase\AtEase::suppressWarnings();
> $text = iconv( 'windows-1252', 'UTF-8//IGNORE', $text );
> \Wikimedia\AtEase\AtEase::restoreWarnings();
> var_dump( $text );
string(52) *Some Swedish (??) text*
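For anyone who wants to reason about this repair outside production, the same round trip can be modelled in a few lines of Python. This is a sketch under assumptions, not the script that was run: `damage` and `repair` are hypothetical names, Python's `cp1252` codec stands in for iconv's windows-1252 (the two may differ on unassigned bytes), and `errors="ignore"` approximates `//IGNORE`.

```python
import zlib
from typing import Optional

# Byte values that have no character in Python's cp1252 codec.
UNASSIGNED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def deflate(raw: bytes) -> bytes:
    """Raw DEFLATE stream, the format PHP's gzinflate() reads."""
    compressor = zlib.compressobj(wbits=-15)
    return compressor.compress(raw) + compressor.flush()


def damage(row: bytes) -> bytes:
    """Model of the faulty step: compressed bytes re-encoded as if they were text."""
    return row.decode("cp1252", errors="ignore").encode("utf-8")


def repair(blob: bytes) -> Optional[str]:
    """Mirror the session: UTF-8 -> windows-1252, inflate, then decode the text."""
    try:
        compressed = blob.decode("utf-8").encode("cp1252")
        return zlib.decompress(compressed, wbits=-15).decode("cp1252")
    except (UnicodeError, zlib.error):
        return None  # this blob cannot be recovered by the round trip


row = deflate("Smörgåsbord på svenska".encode("cp1252"))
recoverable = UNASSIGNED.isdisjoint(row)  # True if no byte was dropped by damage()
print(recoverable, repair(damage(row)))
```

In this model the round trip is lossless only when the compressed stream contains none of the unassigned bytes, since those are silently dropped during the faulty conversion. Whether that is what separated the rows fixed by script from the ones restored from backup is not stated in this thread.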

Ran some fix-up scripts that fixed 60% of them in dawiktionary, dawiki, svwiktionary and svwiki. Now I need to recover the nlwiki and enwiki rows from a backup.

Change 930715 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] moveToExternal: First decompress gziped entries before iconv

https://gerrit.wikimedia.org/r/930715

Change 930715 merged by jenkins-bot:

[mediawiki/core@master] moveToExternal: First decompress gziped entries before iconv

https://gerrit.wikimedia.org/r/930715

Change 930925 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.41.0-wmf.13] moveToExternal: First decompress gziped entries before iconv

https://gerrit.wikimedia.org/r/930925

Change 930925 merged by jenkins-bot:

[mediawiki/core@wmf/1.41.0-wmf.13] moveToExternal: First decompress gziped entries before iconv

https://gerrit.wikimedia.org/r/930925

Mentioned in SAL (#wikimedia-operations) [2023-06-19T08:36:18Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:930925|moveToExternal: First decompress gziped entries before iconv (T128150)]]

Mentioned in SAL (#wikimedia-operations) [2023-06-19T08:37:39Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:930925|moveToExternal: First decompress gziped entries before iconv (T128150)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-06-19T08:45:10Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:930925|moveToExternal: First decompress gziped entries before iconv (T128150)]] (duration: 08m 52s)

Change 931087 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Temporarily bring back legacy encoding in four wikis

https://gerrit.wikimedia.org/r/931087

Change 931087 merged by jenkins-bot:

[operations/mediawiki-config@master] Temporarily bring back legacy encoding in four wikis

https://gerrit.wikimedia.org/r/931087

Mentioned in SAL (#wikimedia-operations) [2023-06-19T08:53:42Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:931087|Temporarily bring back legacy encoding in four wikis (T128150)]]

Mentioned in SAL (#wikimedia-operations) [2023-06-19T08:55:12Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:931087|Temporarily bring back legacy encoding in four wikis (T128150)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-06-19T09:01:13Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:931087|Temporarily bring back legacy encoding in four wikis (T128150)]] (duration: 07m 31s)

Brought back everything that was still corrupted from backups in svwiktionary, dawiki, and svwiki. Only nlwiki and enwiki are left.

Now only enwiki. Started running it.

Change 931306 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Stop setting wgLegacyEncdoing

https://gerrit.wikimedia.org/r/931306

Change 931306 merged by jenkins-bot:

[operations/mediawiki-config@master] Stop setting wgLegacyEncdoing

https://gerrit.wikimedia.org/r/931306

Mentioned in SAL (#wikimedia-operations) [2023-06-20T10:22:31Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:931306|Stop setting wgLegacyEncdoing (T128150 T128151)]]

Mentioned in SAL (#wikimedia-operations) [2023-06-20T10:23:54Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:931306|Stop setting wgLegacyEncdoing (T128150 T128151)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-06-20T10:30:37Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:931306|Stop setting wgLegacyEncdoing (T128150 T128151)]] (duration: 08m 06s)

Ladsgroup moved this task from In progress to Done on the DBA board.

I'll let James do the further cleanup.