Every edit (including rollback) distorts non-ASCII text
Closed, ResolvedPublic

Description

Steps to reproduce:

  • Make any change to any page that includes a non-ASCII symbol, and the saved version will have distorted characters

Examples:

Analysis:

Background: The old Revision::getRevisionText() and the new BlobStore::expandBlob() methods apply the legacy encoding if no flags are provided - the "utf-8" flag is required to bypass this conversion.

As part of the refactoring for MCR, the code for constructing a Revision from an array was consolidated with the code for constructing from a row object. Row objects are required to have the old_flags field set; this field being null or empty would trigger legacy encoding conversion. The same logic was now applied for the 'flags' field when constructing from an array - which was a mistake. No conversion (or indeed decompression or other kinds of decoding) should be applied when constructing from arrays.

This mistake led to the legacy encoding conversion to be applied whenever a Revision object was constructed from an array - which is the case whenever a new Revision is prepared for insertion into the database while saving an edit. This caused data corruption by double-encoding.

Solution:
Do not apply any processing to the content blob when constructing a Revision from an array (at least not for the normal case of the 'flags' field not being set).

There are a very large number of changes, so older changes are hidden. Show Older Changes
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 11 2018, 5:33 PM
Skalman triaged this task as Unbreak Now! priority.Jan 11 2018, 5:35 PM
Restricted Application added subscribers: Liuxinyu970226, Jay8g, TerraCodes. · View Herald TranscriptJan 11 2018, 5:35 PM
Skalman updated the task description. (Show Details)Jan 11 2018, 5:39 PM
Skalman updated the task description. (Show Details)Jan 11 2018, 5:43 PM
Skalman updated the task description. (Show Details)Jan 11 2018, 5:50 PM

More information:

  • Piping through iconv -f utf8 -t windows-1252 appears to fix the text, so it might be double-UTF8-encoded.
  • I could reproduce this on a new page (User:Lucas Werkmeister (WMDE)/sandbox, including the reported behavior that the page is only corrupted after purge (the first render is still correct – wild guess: perhaps it still has the uncorrupted wikitext in memory but the database version is corrupted?).
  • I could not reproduce this on Swedish Wikipedia (test page.
  • According to the mediawiki-config repository (wmf-config/InitialiseSettings.php, right at the top), svwiktionary still uses the windows-1252 legacy encoding. So do svwiki and enwiki, where the problem doesn’t seem to occur – but they haven’t moved from wmf.15 to wmf.16 yet.
  • dawiktionary, which also uses the legacy encoding, seems to have the same bug (example edit).
Addshore added a subscriber: Addshore.
Addshore added a subscriber: daniel.

Change 403743 had a related patch set uploaded (by 20after4; owner: 20after4):
[operations/mediawiki-config@master] Rollback group1 to wmf.15 due to T184749

https://gerrit.wikimedia.org/r/403743

Change 403743 merged by jenkins-bot:
[operations/mediawiki-config@master] Rollback group1 to wmf.15 due to T184749

https://gerrit.wikimedia.org/r/403743

Mentioned in SAL (#wikimedia-operations) [2018-01-11T20:12:32Z] <twentyafterfour@tin> rebuilt and synchronized wikiversions files: Rollback group1 to wmf.15 due to T184749 refs T180749

Addshore added a comment.EditedJan 11 2018, 8:31 PM

I can reproduce locally on master with $wgLegacyEncoding = 'windows-1252'; and the test string "sammansättningar"

Edit any page, and on first load after edit it all appears to be okay, after a purge the text will be foobared

This code only got as far as group 1 and only seems to affect wikis with wgLegacyEncoding to windows-1252

So out of all of the wikis with the encoding:

'wgLegacyEncoding' => [
	'default' => false,

	'enwiki' => 'windows-1252',
	'dawiki' => 'windows-1252',
	'svwiki' => 'windows-1252',
	'nlwiki' => 'windows-1252',

	'dawiktionary' => 'windows-1252',
	'svwiktionary' => 'windows-1252',
],

dawiktionary and svwiktionary will be the ones that were hit with issues.

Change 403754 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/core@master] Extra tests for SqlBlobStore with 'windows-1252' legacy encoding

https://gerrit.wikimedia.org/r/403754

Git bisect confirms it is the MCR stuff.

6af796f3e0cf3e66cd7d7e59af8445f5712d68fe is the first bad commit
commit 6af796f3e0cf3e66cd7d7e59af8445f5712d68fe
Author: daniel <daniel.kinzler@wikimedia.de>
Date:   Thu Aug 31 20:41:04 2017 +0200

    MCR: Deprecate and gut Revision class

So, https://gerrit.wikimedia.org/r/#/c/399174/ and https://gerrit.wikimedia.org/r/#/c/374077/ (the second patch was merged earlier but unused until 399174 was merged)

Anomie added a subscriber: Anomie.Jan 11 2018, 9:26 PM

Prior to the MCR patch, passing 'text' in the array to Revision::__construct() would directly set ->mText. In the MCR patch, RevisionStore::emulateMainSlot_1_29() instead takes 'text' as if it were from the database old_text and uses an empty old_flags, which when it passes it through expandBlob() (as you'd expect to do with old_text) to determine the content of the page being saved causes it to be interpreted as the legacy encoding.

Change 403759 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Add tests for RevisionStore::newMutableRevisionFromArray and legacy encoding

https://gerrit.wikimedia.org/r/403759

Anomie said:

EditPage::internalAttemptSave() → WikiPage::doEditContent() → WikiPage::doModify() → Revision::__construct() calls RevisionStore::newMutableRevisionFromArray()

I don't think so. The pages are saved correctly in the db, as show by the presence of the utf-8 flag. The problem we are dealing with happens when reading that content.

This bug only happens when $wgLegacyEncoding is set. But the variable doesn't seem to be checked anywhere!

Change 403909 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/core@master] RevisionStore, fix loadSlotContent with no $blobFlags

https://gerrit.wikimedia.org/r/403909

Restricted Application added a project: User-Addshore. · View Herald TranscriptJan 12 2018, 11:58 AM
Addshore moved this task from Inbox to Next on the Multi-Content-Revisions board.Jan 12 2018, 11:58 AM
Addshore moved this task from Backlog to Needs Review on the User-Addshore board.
Johan moved this task from To Triage to In current Tech/News draft on the User-notice board.
This comment was removed by daniel.
daniel updated the task description. (Show Details)Jan 12 2018, 1:47 PM

Change 403922 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Document expandBlob behavior when no flags are given.

https://gerrit.wikimedia.org/r/403922

Change 403909 merged by jenkins-bot:
[mediawiki/core@master] RevisionStore, fix loadSlotContent with no $blobFlags

https://gerrit.wikimedia.org/r/403909

Change 403930 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/core@wmf/1.31.0-wmf.16] RevisionStore, fix loadSlotContent with no $blobFlags

https://gerrit.wikimedia.org/r/403930

Anomie said:

EditPage::internalAttemptSave() → WikiPage::doEditContent() → WikiPage::doModify() → Revision::__construct() calls RevisionStore::newMutableRevisionFromArray()

I don't think so. The pages are saved correctly in the db, as show by the presence of the utf-8 flag. The problem we are dealing with happens when reading that content.

The "utf-8" flag is set in the DB, yes. But the text itself is already mojibaked when saved in the DB. The reading back from the DB works fine.

Change 403759 abandoned by Anomie:
Add tests for RevisionStore::newMutableRevisionFromArray and legacy encoding

Reason:
Absorbed into Ieb02ac593fc6b42af1692d03d9d578a76417eb54

https://gerrit.wikimedia.org/r/403759

Status update: The fix is merged into master (with a backport to wmf.16) and thus on the Beta Cluster for testing. Given it is a Friday we will attempt a continuation of the train on Monday (per https://wikitech.wikimedia.org/wiki/Deployments/Holding_the_train#What_happens_next?), to be coordinated with @mmodell / @thcipriani and @Addshore.

The "utf-8" flag is set in the DB, yes. But the text itself is already mojibaked when saved in the DB. The reading back from the DB works fine.

☹ At least, affected revisions can be fixed by simply removing that flag.

The "utf-8" flag is set in the DB, yes. But the text itself is already mojibaked when saved in the DB. The reading back from the DB works fine.

☹ At least, affected revisions can be fixed by simply removing that flag.

No, that would turn the broken "åäö" into the even more broken "åäö" when the revision is read.

Change 404075 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/core@master] selenium, update page spec to include more chars

https://gerrit.wikimedia.org/r/404075

Change 403922 merged by jenkins-bot:
[mediawiki/core@master] Document expandBlob behavior when no flags are given.

https://gerrit.wikimedia.org/r/403922

Change 404152 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Add more tests for legacy encoding in RevisionTest

https://gerrit.wikimedia.org/r/404152

Change 404075 merged by WMDE-Fisch:
[mediawiki/core@master] selenium, update page spec to include more chars

https://gerrit.wikimedia.org/r/404075

daniel awarded a token.EditedJan 15 2018, 6:58 PM

The problem has been fixed on the live site since Wednesday, and is (hopefully?!) fixed on master. Can this ticket be closed or re-prioritized now? It no longer requires urgent attention.

The relevant incident report can be found at https://wikitech.wikimedia.org/wiki/Incident_documentation/20180111-LegacyEncoding

Change 403930 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/core@wmf/1.31.0-wmf.16] RevisionStore, fix loadSlotContent with no $blobFlags

https://gerrit.wikimedia.org/r/403930

Hrm, seem to be getting test failures trying to merge this patch.

Change 403930 abandoned by Ladsgroup:
RevisionStore, fix loadSlotContent with no $blobFlags

https://gerrit.wikimedia.org/r/403930

Change 403930 restored by Ladsgroup:
RevisionStore, fix loadSlotContent with no $blobFlags

https://gerrit.wikimedia.org/r/403930

Change 403930 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.16] RevisionStore, fix loadSlotContent with no $blobFlags

https://gerrit.wikimedia.org/r/403930

Mentioned in SAL (#wikimedia-operations) [2018-01-16T19:46:38Z] <thcipriani@tin> Synchronized php-1.31.0-wmf.16/includes/Storage/RevisionStore.php: [[gerrit:403930|RevisionStore, fix loadSlotContent with no $blobFlags]] T184749 (duration: 01m 13s)

Can this be marked as resolved now?

thcipriani closed this task as Resolved.Jan 17 2018, 4:40 PM

This fix was deployed to wmf.16 wikis yesterday and verified by @Addshore, closing.

Change 403754 merged by jenkins-bot:
[mediawiki/core@master] Extra tests for SqlBlobStore with 'windows-1252' legacy encoding

https://gerrit.wikimedia.org/r/403754

238482n375 lowered the priority of this task from Unbreak Now! to Lowest.Jun 15 2018, 8:03 AM
238482n375 removed Addshore as the assignee of this task.
238482n375 moved this task from Next Up to In Code Review on the Analytics-Kanban board.
238482n375 edited subscribers, added: 238482n375; removed: Aklapper.

SG9tZVBoYWJyaWNhdG9yCk5vIG1lc3NhZ2VzLiBObyBub3RpZmljYXRpb25zLgoKICAgIFNlYXJjaAoKQ3JlYXRlIFRhc2sKTWFuaXBoZXN0ClQxOTcyODEKRml4IGZhaWxpbmcgd2VicmVxdWVzdCBob3VycyAodXBsb2FkIGFuZCB0ZXh0IDIwMTgtMDYtMTQtMTEpCk9wZW4sIE5lZWRzIFRyaWFnZVB1YmxpYwoKICAgIEVkaXQgVGFzawogICAgRWRpdCBSZWxhdGVkIFRhc2tzLi4uCiAgICBFZGl0IFJlbGF0ZWQgT2JqZWN0cy4uLgogICAgUHJvdGVjdCBhcyBzZWN1cml0eSBpc3N1ZQoKICAgIE11dGUgTm90aWZpY2F0aW9ucwogICAgQXdhcmQgVG9rZW4KICAgIEZsYWcgRm9yIExhdGVyCgpUYWdzCgogICAgQW5hbHl0aWNzLUthbmJhbiAoSW4gUHJvZ3Jlc3MpCgpTdWJzY3JpYmVycwpBa2xhcHBlciwgSkFsbGVtYW5kb3UKQXNzaWduZWQgVG8KSkFsbGVtYW5kb3UKQXV0aG9yZWQgQnkKSkFsbGVtYW5kb3UsIEZyaSwgSnVuIDE1CkRlc2NyaXB0aW9uCgpPb3ppZSBqb2JzIGhhdmUgYmVlbiBmYWlsaW5nIGF0IGxlYXN0IGEgZmV3IHRpbWVzIGVhY2guIE1vcmUgaW52ZXN0aWdhdGlvbiBuZWVkZWQuCkpBbGxlbWFuZG91IGNyZWF0ZWQgdGhpcyB0YXNrLkZyaSwgSnVuIDE1LCA3OjIxIEFNCkhlcmFsZCBhZGRlZCBhIHN1YnNjcmliZXI6IEFrbGFwcGVyLiC3IFZpZXcgSGVyYWxkIFRyYW5zY3JpcHRGcmksIEp1biAxNSwgNzoyMSBBTQpKQWxsZW1hbmRvdSBjbGFpbWVkIHRoaXMgdGFzay5GcmksIEp1biAxNSwgNzoyMiBBTQpKQWxsZW1hbmRvdSB1cGRhdGVkIHRoZSB0YXNrIGRlc2NyaXB0aW9uLiAoU2hvdyBEZXRhaWxzKQpKQWxsZW1hbmRvdSBhZGRlZCBhIHByb2plY3Q6IEFuYWx5dGljcy1LYW5iYW4uCkpBbGxlbWFuZG91IG1vdmVkIHRoaXMgdGFzayBmcm9tIE5leHQgVXAgdG8gSW4gUHJvZ3Jlc3Mgb24gdGhlIEFuYWx5dGljcy1LYW5iYW4gYm9hcmQuCkNoYW5nZSBTdWJzY3JpYmVycwpDaGFuZ2UgUHJpb3JpdHkKQXNzaWduIC8gQ2xhaW0KTW92ZSBvbiBXb3JrYm9hcmQKQ2hhbmdlIFByb2plY3QgVGFncwpBbmFseXRpY3MtS2FuYmFuCtcKU2VjdXJpdHkK1wpXaWtpbWVkaWEtVkUtQ2FtcGFpZ25zIChTMi0yMDE4KQrXClNjYXAK1wpTY2FwIChTY2FwMy1BZG9wdGlvbi1QaGFzZTIpCtcKQWJ1c2VGaWx0ZXIK1wpEYXRhLXJlbGVhc2UK1wpIYXNodGFncwrXCkxhYnNEQi1BdWRpdG9yCtcKTGFkaWVzLVRoYXQtRk9TUy1NZWRpYVdpa2kK1wpMYW5ndWFnZS0yMDE4LUFwci1KdW5lCtcKTGFuZ3VhZ2UtMjAxOC1KYW4tTWFyCtcKSEhWTQrXCkhBV2VsY29tZQrXCkJvbGQKSXRhbGljcwpNb25vc3BhY2VkCkxpbmsKQnVsbGV0ZWQgTGlzdApOdW1iZXJlZCBMaXN0CkNvZGUgQmxvY2sKUXVvdGUKVGFibGUKVXBsb2FkIEZpbGUKTWVtZQpQcmV2aWV3CkhlbHAKRnVsbHNjcmVlbiBNb2RlClBpbiBGb3JtIE9uIFNjcmVlbgoyMzg0ODJuMzc1IGFkZGVkIHByb2plY3RzOiBTZWN1cml0eSwgV2lraW1lZGlhLVZFLUNhbXBhaWducyAoUzItMjAxOCksIFNjYXAgKFNjYXAzLUFkb3B0aW9uLVBoYXNlMiksIEFidXNlRmlsdGVyLCBEYXRhLXJlbGVhc2UsIEhhc2h0YWdzLCBMYWJzREItQXVkaXRvciwgTGFkaWVzLVRoYXQtRk9TUy1NZWRpYVdpa2ksIExhbmd1YWdlLTIwMTgtQXByLUp1bmUsIExhbmd1YWdlLTIwMTgtSmFuLU1hciwgSEhWTSwgSEFXZWxjb21lLlBSRVZJRVcKMjM4NDgybjM3NSBtb3ZlZCB0aGlzIHRhc2sgZnJvbSBJbiBQcm9ncmVzcyB0byBJbiBDb2RlIFJldmlldyBvbiB0aGUgQW5hbHl0aWNzLUthbmJhbiBib2FyZC4KMjM4NDgybjM3NSByZW1vdmVkIEpBbGxlbWFuZG91IGFzIHRoZSBhc3NpZ25lZSBvZiB0aGlzIHRhc2suCjIzODQ4Mm4zNzUgdHJpYWdlZCB0aGlzIHRhc2sgYXMgTG93ZXN0IHByaW9yaXR5LgoyMzg0ODJuMzc1IHJlbW92ZWQgc3Vic2NyaWJlcnM6IEFrbGFwcGVyLCBKQWxsZW1hbmRvdS4KQ29udGVudCBsaWNlbnNlZCB1bmRlciBDcmVhdGl2ZSBDb21tb25zIEF0dHJpYnV0aW9uLVNoYXJlQWxpa2UgMy4wIChDQy1CWS1TQSkgdW5sZXNzIG90aGVyd2lzZSBub3RlZDsgY29kZSBsaWNlbnNlZCB1bmRlciBHTlUgR2VuZXJhbCBQdWJsaWMgTGljZW5zZSAoR1BMKSBvciBvdGhlciBvcGVuIHNvdXJjZSBsaWNlbnNlcy4gQnkgdXNpbmcgdGhpcyBzaXRlLCB5b3UgYWdyZWUgdG8gdGhlIFRlcm1zIG9mIFVzZSwgUHJpdmFjeSBQb2xpY3ksIGFuZCBDb2RlIG9mIENvbmR1Y3QuILcgV2lraW1lZGlhIEZvdW5kYXRpb24gtyBQcml2YWN5IFBvbGljeSC3IENvZGUgb2YgQ29uZHVjdCC3IFRlcm1zIG9mIFVzZSC3IERpc2NsYWltZXIgtyBDQy1CWS1TQSC3IEdQTApZb3VyIGJyb3dzZXIgdGltZXpvbmUgc2V0dGluZyBkaWZmZXJzIGZyb20gdGhlIHRpbWV6b25lIHNldHRpbmcgaW4geW91ciBwcm9maWxlLCBjbGljayB0byByZWNvbmNpbGUu