
Some rows (from the year 2004) in SQL databases have text in latin1 encoding
Open, Low, Public

Description

There are some rows in the frwiki_p database containing text encoded in latin1. This is probably not a good thing. Consider https://fr.wikipedia.org/w/index.php?title=France&diff=prev&oldid=498177&diffonly=1 ; the edit summary does not appear, and yet:

MariaDB [frwiki_p]> select rev_comment from revision where rev_id=498177;
+-----------------------------------------+
| rev_comment                             |
+-----------------------------------------+
| HasharBot - [[Cat�gorie:Pays d'Europe]] |
+-----------------------------------------+
1 row in set (0.00 sec)

Which appears to be b"HasharBot - [[Cat\xe9gorie:Pays d'Europe]]" in Python 3. The offending character is "é", which is encoded in latin1 instead of utf8 (b"HasharBot - [[Cat\xc3\xa9gorie:Pays d'Europe]]").
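To illustrate the mismatch in PHP (a sketch, not from the task): re-encoding the latin1 bytes as UTF-8 produces the expected two-byte sequence for "é".

$latin1 = "Cat\xE9gorie";                                      // latin1: "é" is the single byte 0xE9
$utf8 = mb_convert_encoding( $latin1, 'UTF-8', 'ISO-8859-1' ); // reinterpret as latin1, emit UTF-8
echo bin2hex( $utf8 ), "\n"; // "436174c3a9676f726965", i.e. "é" becomes 0xC3 0xA9
echo $utf8, "\n";            // "Catégorie"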

Please look into fixing this.

Event Timeline

Sigma raised the priority of this task to Needs Triage.
Sigma updated the task description.
Sigma added a project: Cloud-Services.
Sigma subscribed.
jcrespo set Security to Software security bug. (Aug 8 2015, 6:29 AM)
Restricted Application changed the visibility from "Public (No Login Required)" to "Custom Policy". (Aug 8 2015, 6:29 AM)
Restricted Application changed the edit policy from "All Users" to "Custom Policy".
Restricted Application added a project: acl*security.

Setting this to private, at least temporarily, until we can make sure this is not a security issue.

From the point of view of the database, there is nothing wrong here: the comment field in the database allows arbitrary binary strings, and the same (presumably) incorrect UTF-8 character appears on all production databases. The questions are:

  • Could this be a security concern?
  • Does the string really have an invalid utf-8 character?
  • Do we allow arbitrary strings (non-utf8)?
  • How was this inserted? Did the application/API allow it, or was it inserted some other way? Is it repeatable, and should we allow that?
  • If we do allow that, should we do something other than not showing the string at all? Should we sanitize/check output too?
  • Should we check for invalid strings on all databases?

The linked edit is from 2004, pretty sure MediaWiki (and Wikipedia) did use latin1 internally back then. This must have been missed in the conversion to utf-8 somehow, aeons ago. https://www.mediawiki.org/wiki/Manual:$wgUseLatin1

Thank you, @matmarex; I didn't check the date and assumed it was a recent edit. If you agree, I will remove the security protection, and either lower its priority to "I will do a slow check with time" or set it as won't fix / just fix this particular instance.

Yup, I think we should remove the security bit, fix this one by hand and maybe figure out if we should double check the conversion (if we can?)

jcrespo removed a project: acl*security.
jcrespo changed the visibility from "Custom Policy" to "Public (No Login Required)".
jcrespo changed the edit policy from "Custom Policy" to "All Users".
jcrespo changed Security from Software security bug to None.
Aklapper renamed this task from "latin1 encoding in sql databases" to "Some rows (from the year 2004) in SQL databases have text in latin1 encoding". (Jun 9 2019, 5:22 PM)

I'm not marking this as resolved because I don't know whether this is really considered a problem as such... but all revisions from before the MediaWiki 1.5 upgrade (in June 2005) on all formerly latin1 wikis, including English and French, will be encoded in latin1 unless they've been deleted and undeleted since the upgrade. That's exactly what the option $wgLegacyEncoding is for. See: https://www.mediawiki.org/wiki/Manual:$wgLegacyEncoding

Also see the relevant text here:
https://www.mediawiki.org/wiki/Manual:Upgrading
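For reference, the setting itself is a single line of configuration; a minimal sketch for LocalSettings.php, assuming the wiki's pre-1.5 content was windows-1252:

// MediaWiki then transparently converts old text blobs to UTF-8 on retrieval.
$wgLegacyEncoding = 'windows-1252';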

Very similar to, and probably a duplicate of, T351953 (which originated as a symptom of this error).

For future reference, a generic conversion can be done using CONVERT() to interpret the data as latin1, and then converting it again to utf8.

[07:57 UTC] krinkle at tools-login
$ sql cawiki

MariaDB [cawiki_p]> SET @bad = UNHEX('506CE06369642050E972657A20427275');
MariaDB [cawiki_p]> SELECT @bad;
+------------------+
| @bad             |
+------------------+
| Pl?cid P?rez Bru |
+------------------+
1 row in set (0.001 sec)

MariaDB [cawiki_p]> SELECT CONVERT(@bad USING latin1), HEX(CONVERT(CONVERT(@bad USING latin1), CHAR CHARACTER SET utf8));
+----------------------------+-------------------------------------------------------------------+
| CONVERT(@bad USING latin1) | HEX(CONVERT(CONVERT(@bad USING latin1), CHAR CHARACTER SET utf8)) |
+----------------------------+-------------------------------------------------------------------+
| Plàcid Pérez Bru           | 506CC3A06369642050C3A972657A20427275                              |
+----------------------------+-------------------------------------------------------------------+

I suggest going through all usernames/actor names on all wikis calling mb_check_encoding on them and see what's left.
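For illustration, here is what that check flags, using the same bytes as the cawiki example above (a sketch; values inlined rather than read from the database):

var_dump( mb_check_encoding( "Pl\xC3\xA0cid P\xC3\xA9rez Bru", 'UTF-8' ) ); // bool(true): valid UTF-8
var_dump( mb_check_encoding( "Pl\xE0cid P\xE9rez Bru", 'UTF-8' ) );         // bool(false): latin1 bytes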

Change #1130770 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/extensions/WikimediaMaintenance@master] findLegacyEncodingRows.php: New script to scan actor.actor_name

https://gerrit.wikimedia.org/r/1130770

I've scanned some of the wikis mentioned in this task and nearby tasks to start looking for stuff:

fr.wikipedia.org T108434

krinkle@mwmaint1002
$ mwscript ~/findLegacyEncodingRows.php --wiki frwiki --table actor
... checking approximately 11,321,096 rows in the actor table
Done! Checked 14,307,205 actor rows, of which 0 were bad!

ca.wikipedia.org T389559

$ mwscript ~/findLegacyEncodingRows.php --wiki cawiki --table actor
... checking approximately 1,017,452 rows in the actor table
Found actor.actor_id 342875 value: `Sovi�tic`
Found actor.actor_id 342887 value: `Sim� - Sim6(A)Cataloniamail.Com`
Found actor.actor_id 343066 value: `Lletiss�`
Found actor.actor_id 343415 value: `Sebasti�`
Done! Checked 1,327,914 actor rows, of which 4 were bad!

Various via T155529:

svwiktionary, T128156

$ mwscript ~/findLegacyEncodingRows.php --wiki svwiktionary --table actor
... checking approximately 99,198 rows in the actor table
Done! Checked 122,582 actor rows, of which 0 were bad!

dawiktionary, T128155

$ mwscript ~/findLegacyEncodingRows.php --wiki dawiktionary --table actor
... checking approximately 28,295 rows in the actor table
Done! Checked 32,857 actor rows, of which 0 were bad!

nlwiki, T128154

$ mwscript ~/findLegacyEncodingRows.php --wiki nlwiki --table actor
... checking approximately 3,027,473 rows in the actor table
Done! Checked 3,824,503 actor rows, of which 0 were bad!

svwiki, T128153

$ mwscript ~/findLegacyEncodingRows.php --wiki svwiki --table actor
... checking approximately 1,933,322 rows in the actor table
Done! Checked 2,391,152 actor rows, of which 0 were bad!

dawiki, T128152

$ mwscript ~/findLegacyEncodingRows.php --wiki dawiki --table actor
... checking approximately 807,312 rows in the actor table
Done! Checked 1,138,370 actor rows, of which 0 were bad!

enwiki T128151

$ mwscript ~/findLegacyEncodingRows.php --wiki enwiki --table actor
... checking approximately 86,602,018 rows in the actor table
Done! Checked 113,507,528 actor rows, of which 0 were bad!

Note that T128151: Migrate all old DB rows from windows-1252 to UTF-8 on enwiki and similar tasks were about text table/ES entries; $wgLegacyEncoding only applied during retrieval of blob entries. We might still have issues like this in any other table on those wikis.

Ack. I'm using those wikis because they are known to be old enough to have used a legacy encoding in the past, and were known to have unmigrated data at some point. Those tasks themselves were indeed fine and remain closed/resolved.

I've updated the script to support the comment table as well, and added a sleep: this is a somewhat larger dataset, so we should slow down the scan a bit to reduce database load.
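For illustration, the scan loop presumably looks something like this (a sketch of the batching-plus-sleep approach, not the actual script; fetchCommentBatch is a hypothetical helper):

$fromId = 0;
do {
	$rows = fetchCommentBatch( $fromId, 1000 ); // hypothetical: next 1000 rows after $fromId
	foreach ( $rows as $row ) {
		if ( !mb_check_encoding( $row->comment_text, 'UTF-8' ) ) {
			echo "Found comment.comment_id {$row->comment_id} value: `{$row->comment_text}`\n";
		}
		$fromId = $row->comment_id;
	}
	sleep( 1 ); // throttle between batches to reduce database load
} while ( $rows );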

Ran it on testwiki to start with. I was expecting zero hits, but got two dozen results. The invalid byte is at the end in all of these, so I'm guessing these aren't cases where we accepted a different encoding or neglected to normalize/migrate, but rather cases where MediaWiki cropped/trimmed values without (correctly) considering multi-byte characters. Interesting, but in some sense unrelated.

I've censored most of the results in the paste below because I suspect many of these are from deleted revisions. Remember that the comment table is referenced from public revs, deleted page revs, public logs, and oversighted/private logs.

$ mwscript ~/findLegacyEncodingRows.php --wiki testwiki --table comment
... checking approximately 229,125 rows in the comment table
Found comment.comment_id 24808 value: `[[Статья, которая была досмотренной, потом стала выверенной, а теперь не имеет флагов]] moved to [[Статья, которая была досмотренной, потом стала �`
Found comment.comment_id 24809 value: `[[Статья, которая была досмотренной, потом стала выверенной, а теперь не имеет флагов 2]] moved to [[Статья, которая была выверенной, потом стала д�`
XX
XX
XX
Found comment.comment_id 108858 value: `{{subst:User:Elfuego2|owner:en=admin|intro:en=asdfasdf|dc_creator:en=admin|security_context:en=admin|publisher:en=admin|otype:en=68|subtitle:en=|show:en=true|offline:en=15 January 2111 00:00|contributor:en=|created:en=15 January 2011 11:32|title=eĥoŝan�`
XX
XX
XX
Found comment.comment_id 125034 value: `Clean up: content was: ''''Мой текст'''{{delete}} ''Мой текст'' [[Мой текст]] [http://Мой текст]  '''ААААААААААААААААААААААААААААААААААААААА ааааааааааааааааааа�`
XX
XX
XX
Found comment.comment_id 127145 value: `Spam: content was: "{{Delete|cross-wiki spam}} XXX playing Poker ''ค�`
Done! Checked 264,588 comment rows, of which 25 were bad!

To my knowledge we don't trim edit summaries directly. As with page content, if the user input is too long, we validate and reject it. (To submit too long an edit summary today, you'd need to either use the API or bypass the HTML5 maxlength on the input client-side. In the past this was trivial, given that input maxlength did not yet exist. Either way, we reject on submission.)

But... these are all automatic edit summaries (WP:AES), for user actions where we include an excerpt of the page content as part of the edit summary or log action. For example, when deleting a redirect, we quote the first N chars of the content in the reason. I'm guessing we used to generate those excerpts without considering multi-byte strings.
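The failure mode is easy to reproduce with byte-based truncation (a sketch, not the historical MediaWiki code):

$text = 'Catégorie';                       // "é" is two bytes in UTF-8 (0xC3 0xA9)
$bad  = substr( $text, 0, 4 );             // byte-based cut: "Cat\xC3", invalid UTF-8
$good = mb_substr( $text, 0, 4, 'UTF-8' ); // character-based cut: "Caté"
var_dump( mb_check_encoding( $bad, 'UTF-8' ) );  // bool(false)
var_dump( mb_check_encoding( $good, 'UTF-8' ) ); // bool(true)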

I've run it on one other wiki (frwiki) to check whether the "truncated excerpt" problem is limited to testwiki, e.g. a problem with a feature that was only ever tested there before being fixed, or a bug that was caught on group0 and then fixed.

And... yeah, it's definitely on other wikis as well. From a quick look, most are the exact same thing. I found only one here that looked like a latin1 encoding issue (comment_id 1198681), but there are probably others drowned out by the noise from the truncation issue, so I won't scan any further until that is addressed.

$ mwscript ~/findLegacyEncodingRows.php --wiki frwiki --table comment
... checking approximately 64,596,902 rows in the comment table
Found comment.comment_id 839023 value: `{{Information |Description = XX}} Exemple : [[Cat�`
Found comment.comment_id 1198681 value: `Ryo - Homonymie r�solue � l'aide du robot: Royaume-Uni`
Found comment.comment_id 2227798 value: `{{Information |Description = XX}} [[Cat�`
XX
XX
Found comment.comment_id 48826612 value: `Bac à sable : contenait avant blanchiment « ▓▓▓░░░▓▓░░▓▓▓▓▓▓░░▓▓▓▓▓░░░▓▓▓▓▓▓░░▓▓  ▓▓▓▓░░▓▓░░▓▓░░░░░░▓▓░░▓▓░░▓▓░░░░�`
XX
XX
Found comment.comment_id 49003193 value: `Auto-promo, publicité ou spam, [[Wikipédia:Vérifiabilité|contenu non vérifiable]] : Le contenu était « === Trouver facilement des musiciens près de chez vous. ===  Vous êtes-vous déjà demandé pourquoi il est si difficile de trouver des m... �`
XX
XX
Found comment.comment_id 49006514 value: `[[Wikipédia:Critères d'admissibilité des articles|Ne répond pas aux critères d'admissibilité de Wikipédia]] : Le contenu était « XX.  On peut l'apercevoir dans la t�`
XX
XX
Found comment.comment_id 49006655 value: `[[Wikipédia:Critères d'admissibilité des articles|Ne répond pas aux critères d'admissibilité de Wikipédia]] : contenait avant blanchiment « == Biographie ==  XX artiste lyrique fran�`

Done! Checked 107,428,191 comment rows, of which 4,733 were bad!

The entry with the (suspected) latin1 encoding is, as expected, from 2004.

https://quarry.wmcloud.org/query/91900

SELECT * FROM revision WHERE rev_comment_id=1198681 LIMIT 1;

-- rev_id        507534
-- rev_timestamp 2004-06-05

https://fr.wikipedia.org/w/index.php?diff=507534

The truncation bug seems to have stopped around Feb 2012 which, given that fr.wikipedia.org is a major wiki, is likely close to when the bug was fixed in MediaWiki. If we wanted to confirm that in Git, this date should help narrow down the search.

SELECT * FROM logging WHERE log_comment_id=49006655 LIMIT 1;

-- log_id        35508409
-- log_action    delete
-- log_timestamp 2012-02-28

https://fr.wikipedia.org/w/index.php?title=Special:Log&logid=35508409

We could detect that case like this:

	if ( !mb_check_encoding( $row->value, 'UTF-8' ) ) {
		// Row contains invalid UTF-8
		$last = mb_substr( $row->value, -1, null, 'UTF-8' );
		if ( $last !== '' && mb_ord( $last, 'UTF-8' ) === false ) {
			// Last character is invalid
			if ( mb_check_encoding( substr( $row->value, 0, -strlen( $last ) ), 'UTF-8' ) ) {
				// Text without the last character is valid UTF-8, so this
				// looks like a truncated multi-byte character rather than
				// a legacy-encoded row.
			}
		}
	}

(adapted from https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/955833/8/includes/TopicSubscriptionsPager.php, where we detected a similar problem in DiscussionTools data)
(By the way, is it necessary to repeat the 'UTF-8' parameter everywhere? Can the default encoding ever not be UTF-8? We set it with mb_internal_encoding( 'UTF-8' ); in Setup.php.)

Then we could either update this data with a maintenance script (trim off the last character, add an ellipsis), or make RevisionStore or whatever transparently accept it.
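A minimal sketch of the maintenance-script option (repairTruncatedComment is a hypothetical helper, not an existing script): drop the dangling bytes of the truncated character, then mark the cut with an ellipsis.

function repairTruncatedComment( string $value ): string {
	$repaired = $value;
	// A UTF-8 character is at most 4 bytes, so a truncated one leaves
	// 1-3 dangling bytes at the end; drop them until the string is valid.
	for ( $i = 0; $i < 3 && !mb_check_encoding( $repaired, 'UTF-8' ); $i++ ) {
		$repaired = substr( $repaired, 0, -1 );
	}
	// Only use the result if the invalid bytes really were a trailing
	// truncation; otherwise (e.g. a legacy-encoded row) leave it alone.
	return mb_check_encoding( $repaired, 'UTF-8' ) ? $repaired . '…' : $value;
}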

Transparently accepting it can mess up checksums and cause all sorts of other issues. It's also quite expensive to run that check on every DB row.