Some rows (from the year 2004) in SQL databases have text in latin1 encoding
Open, Low, Public

Assigned To

None

Authored By

	Sigma
	Aug 8 2015, 4:09 AM

Description

There are some rows in the frwiki_p database containing text encoded in latin1. This is probably not a good thing. Consider https://fr.wikipedia.org/w/index.php?title=France&diff=prev&oldid=498177&diffonly=1 ; the edit summary does not appear, and yet:

MariaDB [frwiki_p]> select rev_comment from revision where rev_id=498177;
+-----------------------------------------+
| rev_comment                             |
+-----------------------------------------+
| HasharBot - [[Cat�gorie:Pays d'Europe]]  |
+-----------------------------------------+
1 row in set (0.00 sec)

Which appears to be b"HasharBot - [[Cat\xe9gorie:Pays d'Europe]]" in Python 3. The offending character is "é", which is encoded in latin1 instead of utf8 (b"HasharBot - [[Cat\xc3\xa9gorie:Pays d'Europe]]").
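A minimal Python 3 sketch of the difference between the two byte strings quoted above. The variable names are illustrative only; the byte literals are copied from the description.

```python
# Bytes as stored for rev_id 498177, and the UTF-8 form one would expect.
stored = b"HasharBot - [[Cat\xe9gorie:Pays d'Europe]]"
expected = b"HasharBot - [[Cat\xc3\xa9gorie:Pays d'Europe]]"

# Interpreting the stored bytes as latin1 recovers the intended text.
text = stored.decode("latin-1")

# Re-encoding that text as UTF-8 gives the expected byte string.
reencoded = text.encode("utf-8")

# Interpreting the stored bytes as UTF-8 fails: 0xE9 announces a
# multi-byte sequence, but the next byte ("g") is not a continuation byte.
try:
    stored.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(text)
print(reencoded == expected, utf8_ok)
```

This would explain why the edit summary does not appear in the diff view if the display path expects valid UTF-8, though the thread does not confirm the exact cause.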

Please do the needful and proceed with the process of fixing this.

Related Objects

		Status	Subtype	Assigned	Task
		Open	Feature	None	T18660 Database table cleanup (tracking)
		Open		None	T108434 Some rows (from the year 2004) in SQL databases have text in latin1 encoding

Event Timeline

Sigma created this task. Aug 8 2015, 4:09 AM

Sigma raised the priority of this task from to Needs Triage.

Sigma updated the task description. (Show Details)

Sigma added a project: Cloud-Services.

Sigma subscribed.

Restricted Application added a subscriber: Aklapper. Aug 8 2015, 4:09 AM

It seems to be terribly encoded only for some revisions: https://fr.wikipedia.org/w/index.php?title=Institut_d%27histoire_de_la_R%C3%A9volution_fran%C3%A7aise&diff=prev&oldid=59390655&diffonly=1 is in utf8.

jcrespo set Security to Software security bug. Aug 8 2015, 6:29 AM

Restricted Application changed the visibility from "Public (No Login Required)" to "Custom Policy". Aug 8 2015, 6:29 AM

Restricted Application changed the edit policy from "All Users" to "Custom Policy".

Restricted Application added a project: acl*security.

Setting this at least temporarily to private before making sure this is not a security issue.

From the database's point of view, there is nothing wrong here: the comment field allows arbitrary binary strings, and the same (presumably) invalid UTF-8 character is present on all production databases. The questions are:

Could this be a security concern?
Does the string really have an invalid utf-8 character?
Do we allow arbitrary strings (non-utf8)?
How was this inserted? Did the application or API allow it, or was some other method used? Is it repeatable, and should we allow that?
If we do allow it, should we do something other than not showing the string at all? Should we sanitize/check output too?
Should we check for invalid strings on all databases?
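As a rough illustration of what such a check could look like, here is a sketch in Python that flags byte strings which are not valid UTF-8. The function names and the `(rev_id, comment)` row shape are assumptions made for this example; how the rows would be fetched from a real database is deliberately left out, and the second row id is made up.

```python
from typing import Iterable, List, Tuple


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_invalid_utf8(rows: Iterable[Tuple[int, bytes]]) -> List[int]:
    """Return the ids of rows whose comment bytes are not valid UTF-8."""
    return [rev_id for rev_id, comment in rows if not is_valid_utf8(comment)]


# Sample data: the row from this task plus a hypothetical well-formed row.
sample_rows = [
    (498177, b"HasharBot - [[Cat\xe9gorie:Pays d'Europe]]"),
    (1, b"HasharBot - [[Cat\xc3\xa9gorie:Pays d'Europe]]"),  # hypothetical id
]

print(find_invalid_utf8(sample_rows))
```

A check like this only finds strings that fail UTF-8 decoding. Pure-ASCII comments are byte-identical in latin1 and UTF-8, so they would pass regardless of which encoding was originally intended.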

Krenair edited projects, added MediaWiki-libs-Rdbms; removed Cloud-Services. Aug 8 2015, 12:24 PM

The linked edit is from 2004, pretty sure MediaWiki (and Wikipedia) did use latin1 internally back then. This must have been missed in the conversion to utf-8 somehow, aeons ago. https://www.mediawiki.org/wiki/Manual:$wgUseLatin1

Thank you, @matmarex, didn't check the date and assumed it was a recent edit. If you agree with it, I will remove the security protection, and either lower its priority to "I will do a slow check with time" or set it as won't fix/just fix this particular instance.

Yup, I think we should remove the security bit, fix this one by hand and maybe figure out if we should double check the conversion (if we can?)

jcrespo triaged this task as Low priority. Aug 8 2015, 6:57 PM

jcrespo removed a project: acl*security.

jcrespo changed the visibility from "Custom Policy" to "Public (No Login Required)".

jcrespo changed the edit policy from "Custom Policy" to "All Users".

jcrespo changed Security from Software security bug to None.

jcrespo added a project: DBA. Aug 8 2015, 6:59 PM

jcrespo moved this task from Triage to Backlog on the DBA board. Aug 14 2015, 2:18 PM

• Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:48 PM

Krinkle moved this task from Untriaged to Usage problem on the MediaWiki-libs-Rdbms board. Aug 14 2018, 3:13 AM

Krinkle added a parent task: T18660: Database table cleanup (tracking).

Aklapper renamed this task from latin1 encoding in sql databases to Some rows (from the year 2004) in SQL databases have text in latin1 encoding. Jun 9 2019, 5:22 PM

ArielGlenn subscribed. Jun 9 2019, 5:26 PM

Reedy edited projects, added WMF-General-or-Unknown; removed MediaWiki-libs-Rdbms. Jun 9 2019, 8:43 PM

Scott added a project: Wikimedia-database-issue. Aug 19 2019, 1:47 PM

Scott moved this task from Untriaged to Bad data & Corruption on the Wikimedia-database-issue board.

RhinosF1 subscribed. Nov 10 2020, 7:43 AM

Isaac mentioned this in T285092: Encoding issues with externallinks tables. Jun 21 2021, 1:37 PM

DannyS712 mentioned this in T155529: Get rid of UTF-8 encoded as latin-1. Jul 1 2021, 3:12 AM

I'm not marking this as resolved because I don't know whether this is really considered a problem as such ... but all revisions from before the MediaWiki 1.5 upgrade (in June 2005) in all formerly Latin1 wikis, including English and French, will be encoded in Latin1 unless they've been deleted and undeleted since the upgrade. That's exactly what the option $wgLegacyEncoding is for. See: https://www.mediawiki.org/wiki/Manual:$wgLegacyEncoding

Also see the relevant text here:
https://www.mediawiki.org/wiki/Manual:Upgrading
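For readers of the replica databases, the general idea of a legacy-encoding fallback can be sketched in Python as below. This is not MediaWiki's implementation of $wgLegacyEncoding; it is a simple heuristic, and the function name and the default of latin-1 are assumptions for illustration.

```python
def decode_legacy(raw: bytes, legacy_encoding: str = "latin-1") -> str:
    """Decode bytes as UTF-8, falling back to a legacy encoding on failure.

    Heuristic sketch only: it assumes that anything which is not valid
    UTF-8 was written in the legacy encoding.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(legacy_encoding)


# Both the old latin1 bytes and the UTF-8 bytes yield the same text.
old_style = decode_legacy(b"HasharBot - [[Cat\xe9gorie:Pays d'Europe]]")
new_style = decode_legacy(b"HasharBot - [[Cat\xc3\xa9gorie:Pays d'Europe]]")
print(old_style == new_style)
```

The fallback never raises for latin-1, since every byte value maps to a character in that encoding. That also means it cannot tell genuinely corrupt data apart from legacy text.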

Krinkle edited projects, added Wikimedia-database-issue (Bad data); removed Wikimedia-database-issue. Apr 9 2022, 6:07 PM
