
Some rows (from the year 2004) in SQL databases have text in latin1 encoding
Open, Low · Public

Description

There are some rows in the frwiki_p database containing text encoded in latin1. This is probably not a good thing. Consider https://fr.wikipedia.org/w/index.php?title=France&diff=prev&oldid=498177&diffonly=1 ; the edit summary does not appear, and yet:

MariaDB [frwiki_p]> select rev_comment from revision where rev_id=498177;
+-----------------------------------------+
| rev_comment                             |
+-----------------------------------------+
| HasharBot - [[Cat�gorie:Pays d'Europe]]  |
+-----------------------------------------+
1 row in set (0.00 sec)

In Python 3, this value is b"HasharBot - [[Cat\xe9gorie:Pays d'Europe]]". The offending character is "é", which is stored as the single latin1 byte \xe9 instead of the utf8 sequence \xc3\xa9 (b"HasharBot - [[Cat\xc3\xa9gorie:Pays d'Europe]]").
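The difference between the two byte strings can be checked directly in Python 3. This is a minimal sketch using only the bytes quoted above; the helper name decode_comment is hypothetical, and the latin1 fallback is an assumption about how one might read such rows, not something the databases do.

```python
from typing import Tuple

# Bytes as stored in rev_comment for rev_id 498177 (quoted in the description).
stored = b"HasharBot - [[Cat\xe9gorie:Pays d'Europe]]"


def decode_comment(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used): try UTF-8 first, fall back to latin1."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # latin1 maps every byte value to a code point, so this cannot fail.
        return raw.decode("latin-1"), "latin-1"


text, used = decode_comment(stored)
print(used, text)

# Re-encoding the recovered text gives the UTF-8 form quoted in the description.
fixed = text.encode("utf-8")
print(fixed)
```

The lone byte \xe9 is not valid UTF-8 on its own (it would start a three-byte sequence), which is why the strict decode fails and the fallback is taken.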

Please fix this.

Event Timeline

Sigma raised the priority of this task from to Needs Triage.
Sigma updated the task description.
Sigma added a project: Cloud-Services.
Sigma subscribed.
jcrespo set Security to Software security bug. · Aug 8 2015, 6:29 AM
Restricted Application changed the visibility from "Public (No Login Required)" to "Custom Policy". · Aug 8 2015, 6:29 AM
Restricted Application changed the edit policy from "All Users" to "Custom Policy".
Restricted Application added a project: acl*security.

Setting this at least temporarily to private before making sure this is not a security issue.

From the point of view of the database, there is nothing wrong here: the comment field allows arbitrary binary strings, and the same (presumably) incorrect utf-8 character is present on all production databases. The questions are:

  • Could this be a security concern?
  • Does the string really have an invalid utf-8 character?
  • Do we allow arbitrary strings (non-utf8)?
  • How was this inserted? Did the application or API allow it, or was another method used? Is it repeatable, and should we allow that?
  • If we do allow that, should we do something other than not showing the string at all? Should we sanitize/check output too?
  • Should we check for invalid strings on all databases?
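A check for invalid strings could be as simple as attempting a strict UTF-8 decode of each value. The following is a rough sketch, not an existing tool: the function names are hypothetical, the rows are in-memory stand-ins for whatever a database driver would return, and the rev_id values 1 and 2 are made up for illustration.

```python
from typing import Iterable, Iterator, Optional, Tuple


def is_valid_utf8(raw: bytes) -> bool:
    """True if the byte string decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def find_invalid_comments(
    rows: Iterable[Tuple[int, Optional[bytes]]]
) -> Iterator[Tuple[int, bytes]]:
    """Yield (rev_id, comment) for every comment that is not valid UTF-8."""
    for rev_id, comment in rows:
        if comment is not None and not is_valid_utf8(comment):
            yield rev_id, comment


# Stand-in rows; only the first one comes from this task.
sample_rows = [
    (498177, b"HasharBot - [[Cat\xe9gorie:Pays d'Europe]]"),
    (1, b"HasharBot - [[Cat\xc3\xa9gorie:Pays d'Europe]]"),  # hypothetical id
    (2, b"plain ASCII summary"),                              # hypothetical id
]

bad = list(find_invalid_comments(sample_rows))
print([rev_id for rev_id, _ in bad])
```

This only answers whether a value is valid UTF-8; it cannot tell which legacy encoding an invalid value was originally written in.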

The linked edit is from 2004; I'm pretty sure MediaWiki (and Wikipedia) used latin1 internally back then. This must have been missed in the conversion to utf-8 somehow, aeons ago. See https://www.mediawiki.org/wiki/Manual:$wgUseLatin1

Thank you, @matmarex, I didn't check the date and assumed it was a recent edit. If you agree, I will remove the security protection, and either lower the priority to "I will do a slow check with time" or set it as won't fix and just fix this particular instance.

Yup, I think we should remove the security bit, fix this one by hand and maybe figure out if we should double check the conversion (if we can?)

jcrespo removed a project: acl*security.
jcrespo changed the visibility from "Custom Policy" to "Public (No Login Required)".
jcrespo changed the edit policy from "Custom Policy" to "All Users".
jcrespo changed Security from Software security bug to None.
Aklapper renamed this task from latin1 encoding in sql databases to Some rows (from the year 2004) in SQL databases have text in latin1 encoding. · Jun 9 2019, 5:22 PM

I'm not marking this as resolved because I don't know whether this is really considered a problem as such ... but all revisions from before the MediaWiki 1.5 upgrade (in June 2005) in all formerly Latin1 wikis, including English and French, will be encoded in Latin1 unless they've been deleted and re-deleted since the upgrade. That's exactly what the option $wgLegacyEncoding is for. See: https://www.mediawiki.org/wiki/Manual:$wgLegacyEncoding

Also see the relevant text here:
https://www.mediawiki.org/wiki/Manual:Upgrading
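As a rough illustration of the idea behind $wgLegacyEncoding, the sketch below converts stored text on read when it is not marked as utf-8. It is a Python approximation under stated assumptions, not MediaWiki's PHP code: the function name, the representation of flags as a set of strings, and the choice of "iso-8859-1" as the legacy encoding are all hypothetical.

```python
from typing import Optional, Set


def load_revision_text(
    raw: bytes, flags: Set[str], legacy_encoding: Optional[str]
) -> str:
    """Decode stored text, converting from a legacy encoding when needed."""
    if "utf-8" in flags or legacy_encoding is None:
        # Marked as UTF-8, or no legacy encoding configured: decode as
        # UTF-8 and replace invalid bytes with U+FFFD instead of failing.
        return raw.decode("utf-8", errors="replace")
    # Not marked as UTF-8: assume it was written in the legacy encoding.
    return raw.decode(legacy_encoding)


# Old-style row without the utf-8 flag, legacy encoding configured.
print(load_revision_text(b"Cat\xe9gorie", set(), "iso-8859-1"))

# Same bytes with no legacy encoding configured: the byte is replaced.
print(load_revision_text(b"Cat\xe9gorie", set(), None))
```

The second call shows how an unconverted latin1 byte ends up as a replacement character, similar to the output in the task description.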