
Some rows (from the year 2004) in SQL databases have text in latin1 encoding
Open, LowPublic

Description

There are some rows in the frwiki_p database containing text encoded in latin1 rather than UTF-8. This is probably not a good thing, because the affected text is not displayed. Consider https://fr.wikipedia.org/w/index.php?title=France&diff=prev&oldid=498177&diffonly=1 : the edit summary does not appear, and yet:

MariaDB [frwiki_p]> select rev_comment from revision where rev_id=498177;
+-----------------------------------------+
| rev_comment                             |
+-----------------------------------------+
| HasharBot - [[Cat�gorie:Pays d'Europe]]  |
+-----------------------------------------+
1 row in set (0.00 sec)

In Python 3, this value is b"HasharBot - [[Cat\xe9gorie:Pays d'Europe]]". The offending character is "é", which is stored as the single latin1 byte 0xE9 instead of the two-byte UTF-8 sequence 0xC3 0xA9 (b"HasharBot - [[Cat\xc3\xa9gorie:Pays d'Europe]]").
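
To make the difference concrete, here is a minimal Python 3 sketch using only the two byte strings quoted above. The variable names are illustrative; nothing here touches the database.

```python
# The bytes as stored in rev_comment, and the UTF-8 form they should have.
raw = b"HasharBot - [[Cat\xe9gorie:Pays d'Europe]]"
expected = b"HasharBot - [[Cat\xc3\xa9gorie:Pays d'Europe]]"

# latin1 maps every byte to a code point, so this decode cannot fail.
as_latin1 = raw.decode("latin-1")

# Strict UTF-8 decoding rejects 0xE9 here: it announces a multi-byte
# sequence, but the next byte ("g") is not a continuation byte.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(as_latin1)
print(valid_utf8)
print(as_latin1.encode("utf-8") == expected)
```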

Please fix this.

Event Timeline

Sigma created this task. Aug 8 2015, 4:09 AM
Sigma raised the priority of this task from to Needs Triage.
Sigma updated the task description.
Sigma added a project: Cloud-Services.
Sigma added a subscriber: Sigma.
Restricted Application added a subscriber: Aklapper. Aug 8 2015, 4:09 AM
jcrespo set Security to Software security bug. Aug 8 2015, 6:29 AM
Restricted Application changed the visibility from "Public (No Login Required)" to "Custom Policy". Aug 8 2015, 6:29 AM
Restricted Application changed the edit policy from "All Users" to "Custom Policy".
Restricted Application added a project: acl*security.
jcrespo added a subscriber: jcrespo. Aug 8 2015, 6:37 AM

Setting this at least temporarily to private before making sure this is not a security issue.

From the database's point of view, there is nothing wrong here: the comment field allows arbitrary binary strings. And it has the same (presumably) incorrect UTF-8 character on all production databases. The questions are:

  • Could this be a security concern?
  • Does the string really have an invalid utf-8 character?
  • Do we allow arbitrary strings (non-utf8)?
  • How was this inserted? Did the application or API allow it, or was some other method used? Is it repeatable, and should we allow that?
  • If we allow that, should we do something other than not showing the string at all? Should we sanitize/check output too?
  • Should we check for invalid strings on all databases?
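
As a rough illustration of what checking for invalid strings could look like, the sketch below scans (rev_id, rev_comment) pairs and reports each value that is not valid UTF-8, along with the offset of the first bad byte. The function name `find_invalid_utf8` and the in-memory `rows` list are hypothetical; how the rows are actually fetched from each database is left open.

```python
from typing import Iterable, Iterator, Optional, Tuple


def find_invalid_utf8(
    rows: Iterable[Tuple[int, Optional[bytes]]]
) -> Iterator[Tuple[int, int]]:
    """Yield (rev_id, offset) for every comment that is not valid UTF-8.

    `rows` is assumed to be an iterable of (rev_id, rev_comment) pairs,
    with rev_comment as raw bytes (or None for NULL).
    """
    for rev_id, comment in rows:
        if comment is None:
            continue
        try:
            comment.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the index of the first undecodable byte.
            yield rev_id, exc.start


# Hypothetical sample data; only the first row comes from this task.
rows = [
    (498177, b"HasharBot - [[Cat\xe9gorie:Pays d'Europe]]"),
    (1, b"plain ASCII comment"),
    (2, "Cat\u00e9gorie".encode("utf-8")),
    (3, None),
]

for rev_id, offset in find_invalid_utf8(rows):
    print(rev_id, offset)
```

A check like this only says whether the bytes are valid UTF-8; it cannot tell which legacy encoding produced them.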

The linked edit is from 2004; I'm pretty sure MediaWiki (and Wikipedia) used latin1 internally back then. This must have been missed in the conversion to UTF-8 somehow, aeons ago. https://www.mediawiki.org/wiki/Manual:$wgUseLatin1

Thank you, @matmarex; I didn't check the date and assumed it was a recent edit. If you agree, I will remove the security protection, and either lower the priority to "I will do a slow check with time" or set it as won't fix/just fix this particular instance.

Yup, I think we should remove the security bit, fix this one by hand and maybe figure out if we should double check the conversion (if we can?)
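
For the by-hand fix, a conservative repair could look like the sketch below. It leaves already-valid UTF-8 untouched and otherwise reinterprets the bytes as latin1 before re-encoding. The function name `repair_legacy_comment` is hypothetical, and whether the legacy rows are strictly ISO-8859-1 or actually Windows-1252 is an assumption that would need checking, so the codec is a parameter.

```python
def repair_legacy_comment(data: bytes, legacy_codec: str = "latin-1") -> bytes:
    """Return `data` as UTF-8 bytes.

    Valid UTF-8 is returned unchanged. Anything else is assumed (not
    proven) to be in `legacy_codec` and is re-encoded as UTF-8.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(legacy_codec).encode("utf-8")
    return data


broken = b"HasharBot - [[Cat\xe9gorie:Pays d'Europe]]"
print(repair_legacy_comment(broken))
```

Checking for valid UTF-8 first matters because a latin1 decode never fails, so applying it blindly would double-encode rows that were converted correctly.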

jcrespo triaged this task as Low priority. Aug 8 2015, 6:57 PM
jcrespo removed a project: acl*security.
jcrespo changed the visibility from "Custom Policy" to "Public (No Login Required)".
jcrespo changed the edit policy from "Custom Policy" to "All Users".
jcrespo changed Security from Software security bug to None.
jcrespo moved this task from Triage to Backlog on the DBA board. Aug 14 2015, 2:18 PM
Aklapper renamed this task from "latin1 encoding in sql databases" to "Some rows (from the year 2004) in SQL databases have text in latin1 encoding". Jun 9 2019, 5:22 PM