Steps to reproduce:
- Edit and save any page that contains a non-ASCII character. The saved revision contains distorted characters.
Examples:
- https://sv.wiktionary.org/w/index.php?title=sur&type=revision&diff=3085353&oldid=3023155
  - A rollback that produced distorted text
- https://sv.wiktionary.org/wiki/Anv%C3%A4ndare:Skalman/test
  - Immediately after saving, the text displayed correctly (containing "åäö"), but after purging the cache (with ?action=purge) it displays as distorted text ("Ã¥Ã¤Ã¶")
- https://sv.wiktionary.org/w/index.php?title=backa&curid=42516&diff=3085356&oldid=2910267
  - This edit was made through the API (using on-wiki JavaScript)
  - Immediately after saving, the text displayed correctly, but after purging the cache (with ?action=purge) it displays as distorted text
Analysis:
Background: The old Revision::getRevisionText() and the new BlobStore::expandBlob() methods apply the legacy encoding if no flags are provided - the "utf-8" flag is required to bypass this conversion.
As part of the refactoring for MCR, the code for constructing a Revision from an array was consolidated with the code for constructing from a row object. Row objects are required to have the old_flags field set; this field being null or empty would trigger legacy encoding conversion. The same logic was now applied for the 'flags' field when constructing from an array - which was a mistake. No conversion (or indeed decompression or other kinds of decoding) should be applied when constructing from arrays.
This mistake caused the legacy encoding conversion to be applied whenever a Revision object was constructed from an array, which happens whenever a new Revision is prepared for insertion into the database while saving an edit. Because the new text is already UTF-8, converting it a second time corrupted the data by double-encoding.
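
A minimal sketch of the failure mode, in Python rather than MediaWiki's PHP. The function name expand_blob only mirrors the method named above; its signature, the LEGACY_ENCODING constant and the choice of ISO-8859-1 as the legacy encoding are assumptions made for illustration, not the actual implementation.

```python
# Simplified model of flag-gated legacy conversion (not MediaWiki code).
LEGACY_ENCODING = "iso-8859-1"  # assumption: the wiki's legacy encoding


def expand_blob(data: bytes, flags: frozenset) -> bytes:
    """Return UTF-8 bytes; convert from the legacy encoding unless flagged utf-8."""
    if "utf-8" in flags:
        return data  # already UTF-8, conversion bypassed
    # No utf-8 flag: treat the bytes as legacy-encoded and convert to UTF-8.
    return data.decode(LEGACY_ENCODING).encode("utf-8")


new_text = "åäö".encode("utf-8")  # text of a freshly prepared revision

with_flag = expand_blob(new_text, frozenset({"utf-8"}))
without_flag = expand_blob(new_text, frozenset())  # flags missing, as in the bug

print(with_flag.decode("utf-8"))     # åäö
print(without_flag.decode("utf-8"))  # Ã¥Ã¤Ã¶
```

Each UTF-8 byte of the original text is reinterpreted as a separate legacy character and encoded again, which is why every non-ASCII character turns into two.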
Solution:
Do not apply any processing to the content blob when constructing a Revision from an array (at least not for the normal case of the 'flags' field not being set).
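
The intended behaviour could look roughly like the following Python sketch. blob_from_array, the 'text' key and LEGACY_ENCODING are hypothetical names, and honouring explicitly supplied flags is an assumption about the non-normal case, which the description above leaves open.

```python
# Hypothetical sketch of the fix: no processing when 'flags' is not set.
LEGACY_ENCODING = "iso-8859-1"  # assumption: the wiki's legacy encoding


def blob_from_array(fields: dict) -> bytes:
    """Return the content blob for a revision constructed from an array."""
    data = fields["text"]
    if "flags" not in fields:
        # Normal case when preparing a new revision: use the text as given.
        return data
    if "utf-8" in fields["flags"]:
        return data
    # Assumption: explicitly supplied flags without utf-8 still mean legacy data.
    return data.decode(LEGACY_ENCODING).encode("utf-8")


saved = blob_from_array({"text": "åäö".encode("utf-8")})
print(saved.decode("utf-8"))  # åäö
```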