
[RFC] Future of charset and collation for mediawiki on mysql
Closed, Duplicate · Public

Description

MediaWiki is likely to fail on newer MySQL and MariaDB defaults, which use the 4-byte utf8mb4 character set and collation by default. How should we solve T193222? Should we migrate to binary collations only (defining sizes by number of bytes, not characters)? [This seems to be the most supported option.] If yes, how do we migrate existing utf8 installations? If not, how do we support larger indexes? What changes are needed to better support third-party users?

Related Objects

Event Timeline

MediaWiki core already uses varbinary for almost all fields, but as seen in T193222 extensions are less consistent.

For MySQL specifically, we should probably have the installer error out if the character set is utf8mb4 and innodb_large_prefix is not in effect. We should also have it always use either binary or utf8mb4 as character sets, never the 3-byte "utf8".
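A minimal sketch of the kind of pre-flight check being described, assuming a MySQL 5.6/5.7-era server (innodb_large_prefix and innodb_file_format are real server variables there, but were removed in MySQL 8.0, where large index prefixes are always available); this is illustrative, not MediaWiki's actual installer code:

```sql
-- Checks an installer might run before creating utf8mb4 tables (illustrative only)
SHOW VARIABLES LIKE 'innodb_large_prefix';              -- should be ON for long index prefixes on 5.6/5.7
SHOW VARIABLES LIKE 'innodb_file_format';               -- Barracuda is needed for large prefixes on 5.6/5.7
SELECT @@character_set_database, @@collation_database;  -- refuse the 3-byte 'utf8' here; accept binary or utf8mb4
```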

In general, I hate all the schema differences between different databases (see also T191231: RFC: Abstract schemas and schema changes). If we want to go with a "binary" character set on MySQL, we should do the same on all the other databases too. If we want to use a UTF-8 character set where varchar(255) means 255 codepoints, we should do that on all databases. In the modern day the latter would make a lot more sense, IMO.

I fully support killing the "Do you want to use UTF-8 collation" option (aka the BMP-only utf8 collation) in the installer. It is a poorly explained option that I'm sure 99% of our users misunderstand, and it just increases the diversity of things that can go wrong.

This is specifically a MediaWiki task (as in, not a bug that affects the WMF installation), so I support and will help with any decision you take. What is a good approach to get consensus, especially to avoid problems for third-party users? Are there stats on who is using utf8 installations?

CCicalese_WMF moved this task from Inbox to Backlog on the MediaWiki-Platform-Team-Archived board.

If you want this to be an RFC then you need to tag it with TechCom-RFC. But I'm not sure if it's ready for public discussion yet, since it's mostly just a problem statement.

Most people seem inclined to maintain a single charset going forward, which makes sense to me rather than maintaining two separate ones. If everybody here supports that (and I heard no voice against), I guess we need a way to migrate existing (non-WMF) installations and thus unify them all at a major version?

My only suggestion is to remove the UTF-8 installer option ASAP, as that is a reasonable first step and avoids creating more issues. Also, extensions should be converted to always assume binary on new installs(?)

I will let people suggest how that can be accomplished and take a back seat (I am not familiar with MediaWiki release cycles/workflows), but I am willing to help with anything I can.

In my opinion we should pick one datatype and standardize on it. Migrating existing columns to binary or utf8mb4 should be doable with a maintenance script, and these sorts of migrations and changes seem within scope for the schema abstraction RFC, since part of the concern there is taking the disparate environments of existing installs and standardizing them (primarily for non-MySQL DBMSes, but making MySQL focus on one and exactly one schema seems good as well). A rough illustration of what such a script might run is sketched below.
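As a rough, hedged illustration of what such a maintenance script might execute per table (the page/page_title names are taken from MediaWiki core purely as an example; test on a copy first):

```sql
-- Option A: standardize on binary (sizes counted in bytes)
ALTER TABLE page MODIFY page_title VARBINARY(255) NOT NULL;

-- Option B: standardize on 4-byte UTF-8 (sizes counted in characters; needs large index prefixes)
ALTER TABLE page CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
```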

The issue with standardizing on binary for everything is how absolutely terribly other DBMSes handle binary fields (looking at you, MSSQL). So, if we standardize MySQL on binary, I think other DBMSes would need to decide for themselves which to use (and then stick with that choice).

If we were designing MediaWiki from scratch today, my vote would be on utf8mb4. However, given the amount of legacy data we have in varbinary, migrating from varbinary to varchar could take a very, very long time on the larger Wikipedias as well as large 3rd party wikis, and this is a concern we need to account for. If the upgrade process is next to impossible for a DBA to manage, then they aren't going to want to upgrade and that doesn't help anyone.

As such, my opinion is that we should standardize everything (across all DBMSes) on binary, including those with terrible support for it. This will have the extra benefit of standardizing on UTF-8 across the board. Right now MSSQL uses nvarchar, which is UCS-2 in versions older than 2014 (not even UTF-16, which means it also doesn't support characters outside of the BMP; in 2014 and later it's proper UTF-16), and it also has index size issues because of that.

Vvjjkkii renamed this task from [RFC] Future of charset and collation for mediawiki on mysql to yedaaaaaaa. Jul 1 2018, 1:11 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
Mainframe98 renamed this task from yedaaaaaaa to [RFC] Future of charset and collation for mediawiki on mysql. Jul 1 2018, 11:43 AM
Mainframe98 lowered the priority of this task from High to Medium.
Mainframe98 updated the task description. (Show Details)
Mainframe98 added a subscriber: Aklapper.

T135969 is an old bug (opened many years ago), but it was recently "closed" abusively (by some MediaWiki developers who are only interested in cleaning up the backlogs and don't want to solve the reported problems, even this one, which is severe), even though it was reported many years ago (and was then considered perfectly valid, because it really affected several Wikimedia wikis). I cannot find the initial bug that existed long before Phabricator (the history of bugs/RFEs that were closed before Phabricator was opened was not imported). But it was accepted as valid at the time (and it concerned multiple wikis used by Wikimedia and many others, not all of which were migrated or reloaded).

Now you refuse to fix it, and say this is an "invalid" bug even though it is still there. MediaWiki was designed to support multiple SQL backends (and even Wikimedia has changed and migrated its databases multiple times). But there are still installations that CANNOT make such an SQL migration (the SQL engine is also used for something else and contains other data than just the wiki, or the wiki is used as an interface for other applications that need its existing database, and for which it is not viable to change the engine because apps other than MediaWiki are using it).

MediaWiki still just makes assumptions about how data is stored, but there is absolutely no data value validator and no re-encoding layer in these SQL adapters. Value ranges are not checked: maximum size of text fields or blobs (possibly blobs cannot be stored in the SQL engine itself, only some URI to a file accessible on the same host or to a shared folder), precision of numeric fields, support of native collation or not... This makes MediaWiki intrinsically insecure for anything other than a much more limited set of SQL setups than what is documented, and it also means that MediaWiki is not as universal as it was supposed to be (and as it was documented).

So if you now treat T135969 as "invalid", it means that you have (silently) reduced the goals at some point, and the MediaWiki docs need to be fixed. If MediaWiki starts and silently acts as if all is OK, that should mean the SQL engine is supported and that there is no such risk of severe silent data loss caused by normal edits made by any user.

Note that various users have been banned from editing on a wiki just because they innocently typed some valid text (e.g. containing emojis or supplementary Chinese characters): everything was OK when they previewed what they edited, they saved it, and then all their edits were lost along with other edits made by others (and these edits were not even recoverable). It was not the users' fault (they did not know how to resurrect the dropped text) but the fault of MediaWiki itself.

This bug has also caused damage in various private MediaWiki installations in corporate environments, so much so that MediaWiki was finally banned from their projects: they chose to go with another CMS that is friendlier and easier and safer to integrate with their existing data infrastructure, without forcing them to do a massive migration for what would be minor data for the wiki itself (most of the data being for other important apps).

Note that the "closed" bug T193222 was also caused by a lack of data constraint validation, which caused either silent truncation or errors at runtime. Clearly this is the same type of bug: data integrity is not guaranteed at all, it is only assumed by MediaWiki's SQL adapters and never checked. Silent data changes when saving are really extremely bad; a pure SQL-only declaration of datatype constraints is not enough, and the SQL adapters must still perform the correct checks and report invalid data to the user without executing what would be a destructive SQL query. I consider these two bugs exactly like SQL injection: it is a question of security and data integrity.

And security comes first (it requires strict data validation, which is also a requirement today for anything connected to the Internet and accepting input from random users), before any other goals (usability, performance and scalability).

@Verdy_p: Your last comment here does not seem to be helpful. Please read and understand https://www.mediawiki.org/wiki/Bug_management/Phabricator_etiquette on how and why not to get personal if you'd like to be active in Wikimedia Phabricator. Thanks for your understanding and for being respectful.

Because T135969 has spilled over here, and I can see the annoyance with someone using a "bad" encoding, I will try to be a bit conciliatory. However, I don't see any way to solve that other than documenting it (maybe that was done already):

"The only encoding supported for MySQL is binary; others may work in limited cases, but are unsupported. To migrate to binary, run ALTER TABLE table CONVERT TO CHARACTER SET binary; on all your tables."

If your complaint is the lack of documentation, that would be a fair one. If it is that you want to allow full-plane UTF-8 characters on a utf8(mb3) MySQL configuration, that is literally impossible. People should migrate to the supported charset(s). If the developers don't want to support multiple encodings (and I can see why: index length issues, meaning changes that would hide bugs), that is up to them. The good news is that, because this is open source, everyone can become a developer and support additional features! I highly encourage you to do that.

the wiki is used as an interface for other applications that need its existing database, and for which it is not viable to change the engine

This makes no sense to me: data can be stored in raw format and will be transformed on the fly to the configured client encoding, if the application is properly configured (character_set_client, character_set_connection and character_set_server).
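For reference, a quick way to inspect and set the per-connection conversion layer mentioned here (standard MySQL, shown only as an illustration):

```sql
SHOW VARIABLES LIKE 'character_set%';  -- client, connection, results, server, database
SET NAMES utf8mb4;                     -- sets character_set_client, _connection and _results for this session
```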

Your suggestion does not apply: it is not viable to convert all tables on an existing database that has other uses.
And I do not necessarily "want to support full-plane UTF-8" in a "utf8(mb3)" config. MediaWiki should still run without problems with that config, without causing major issues because of some unsupported characters that MediaWiki never checks.
Reread what I asked: I just want MediaWiki to check the character set (a simple insert or update in the database at startup, followed by a read, can immediately detect whether non-BMP characters are safe or not; it is enough to set a flag and then let MediaWiki produce a correct "preview" that warns the user that their edit cannot be saved "as is").
But the fact that MediaWiki continues working as if there were no problem (with no problem visible or reported even when previewing the edited page) is unsafe.
Is it so complicated to make such a check? It has a near-zero cost on a UTF-8(mb4) config, and would only force the code to validate text prior to saving or previewing if this "non-UTF-8" config is detected. What is the performance impact, really?
Now you suggest that I develop a patch, but that requires me to develop MediaWiki itself (and I don't like programming in PHP). My initial bug was a request for some developer to consider this as a request for improvement and a fix; this old bug has been valid for years and is still valid today, it just has not been solved. The current developers only seem to consider the needs of Wikimedia for its own wikis and forget the needs of other wikis that have different goals (MediaWiki is not made just for Wikimedia, which has lots of WMF-specific features not portable to other places that don't have a large farm of servers and a complex storage configuration). Most wikis outside Wikimedia run on a single host with their own local database engine (and cannot support multiple engines, due to resource constraints). That's why MediaWiki has many optional plugins they don't have to support, and why MediaWiki also supports several DB engines. I don't see why it could not support an existing "mb3" config, even if that means users won't be able to post non-BMP characters; but in this config MediaWiki should still be safe to use (and for now it is not).
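A minimal sketch of the round-trip probe being proposed, assuming the connection can express a supplementary-plane character; this is not existing MediaWiki code, and the table name is a placeholder:

```sql
CREATE TEMPORARY TABLE charset_probe (val VARCHAR(8));              -- inherits the database's default charset
-- On a 3-byte 'utf8' database this INSERT either errors (strict mode) or stores a mangled
-- value (non-strict mode); both outcomes signal "non-BMP characters are not safe here".
INSERT INTO charset_probe VALUES (CHAR(0xF09F9880 USING utf8mb4));  -- U+1F600, a 4-byte character
SELECT val = CHAR(0xF09F9880 USING utf8mb4) AS non_bmp_safe FROM charset_probe;  -- 1 only if the round trip preserved it
DROP TEMPORARY TABLE charset_probe;
```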

If you are referring to other applications using the same MySQL server – that's okay. The charset configuration for MySQL applies per database table. You can have a single MySQL server, and a single database, and have your MW tables use the charset that MW supports. The developers have provided a way to make it work that is compatible with your technical requirements.
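To make the per-table point concrete, a hedged sketch (the table names here are invented for the example): MediaWiki's tables and another application's tables can declare different charsets inside the same database:

```sql
CREATE TABLE mw_example  (title VARBINARY(255) NOT NULL) DEFAULT CHARSET=binary;
CREATE TABLE app_example (title VARCHAR(255)   NOT NULL) DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;  -- the other app keeps its 3-byte 'utf8'
```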

There's not just MySQL. Other organizations may use MSSQL, Sybase, Oracle, or Informix, all of which have their own charset support (and in all of them, installing additional charsets to support full UTF-8 is costly, as it also requires installing and maintaining collation data). In frequent cases, collations cannot be updated at every Unicode version, because that requires costly reindexing (but partial UTF-8 is possible, and I think this is why MySQL defined utf8(mb3), even if it also requires updating the collations when there are Unicode or CLDR updates for characters encoded in the BMP).

But generally organizations just stick to one collation rule and won't reindex all the time, so they also choose a charset once. Deploying a larger charset often requires significant checks and may introduce security holes or bugs in their own internal apps (unable to handle some characters properly, because unsafe charset mappings create duplicates/collisions that their software is not prepared to handle correctly).

And many small wikis (which are very active) don't have the time to migrate their database charset (and also make sure that other database constraints, such as maximum field sizes, will not be violated) within a reasonable time.

But the problem is that MySQL silently discards any text after a character it does not support (I think this is also a bug in MySQL: it should report an error; or MediaWiki's database handler traps this error, silently ignores it, and still instructs the DB to proceed with the truncated text). In such a case, the invalid insert/update should be avoided, and MediaWiki's DB adapter should perform safety checks on submitted values that can be reported to the MediaWiki core, which will then inform the user that their edit cannot be accepted as is. That DB adapter could then propose an updated text (where unwanted characters are removed or re-encoded differently, e.g. using NCRs), and it will be up to the editing user to check whether it is still safe to save, because they will get an updated "preview" before submitting again.
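A rough demonstration of the behaviour being described, under the assumption of a 3-byte 'utf8' column; the exact outcome (truncation at the bad character, replacement, or error 1366) depends on server version, connection charset and sql_mode, which is why the SHOW WARNINGS/SELECT steps are there to inspect what actually happened:

```sql
CREATE TEMPORARY TABLE probe (txt VARCHAR(64) CHARACTER SET utf8);  -- the 3-byte 'utf8'/'utf8mb3'
SET SESSION sql_mode = '';                                          -- non-strict: bad input becomes a warning, not an error
INSERT INTO probe VALUES (CONCAT('kept ', CHAR(0xF09F9880 USING utf8mb4), ' possibly lost'));
SHOW WARNINGS;                                                      -- typically "Incorrect string value" (1366)
SELECT txt FROM probe;                                              -- inspect what actually survived
```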

I wonder how this may be impacted by T191231. Theoretically, having such an abstraction layer could make it easier to add support for alternative collation configurations such as utf8mb4.

The current plan with respect to T191231 is to specify the behavior expected of the database-specific handler. In particular, string fields must be able to store any valid UTF-8 string (subject to column length limitations) and comparisons and ordering must work by Unicode code point (e.g. "a" and "A" are distinct, "a" always comes after "Z"). See https://www.mediawiki.org/wiki/User:Anomie/Abstract_schema/DB_Requirements#Character_sets_and_collations.

MySQL's 3-byte "utf8" would therefore never be acceptable, since it can't store non-BMP code points. Use of binary versus utf8mb4 is yet to be decided; the latter would probably be preferable, but may turn out to be significant work in the rest of MediaWiki to properly support it.
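As a quick illustration of that ordering requirement (this snippet is not taken from the linked requirements page): a binary collation satisfies it, while a case-insensitive one does not:

```sql
SELECT _utf8mb4'a' COLLATE utf8mb4_bin > _utf8mb4'Z' AS a_after_Z,           -- 1: code-point order
       _utf8mb4'a' COLLATE utf8mb4_bin = _utf8mb4'A' AS a_equals_A;          -- 0: 'a' and 'A' stay distinct
SELECT _utf8mb4'a' COLLATE utf8mb4_general_ci > _utf8mb4'Z' AS ci_a_after_Z; -- 0: fails the requirement
```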

Thank you for the explanation Anomie, it's all clear now. :)

CCicalese_WMF subscribed.

We believe that this is covered by T191231. Please feel free to reopen or comment there if you believe there are aspects of this task that are not covered there.

Because T135969 has spilled over here, and I can see the annoyance with someone using a "bad" encoding, I will try to be a bit conciliatory. However, I don't see any way to solve that other than documenting it (maybe that was done already):

"The only encoding supported for MySQL is binary; others may work in limited cases, but are unsupported. To migrate to binary, run ALTER TABLE table CONVERT TO CHARACTER SET binary; on all your tables."

If your complaint is the lack of documentation, that would be a fair one. If it is that you want to allow full-plane UTF-8 characters on a utf8(mb3) MySQL configuration, that is literally impossible. People should migrate to the supported charset(s). If the developers don't want to support multiple encodings (and I can see why: index length issues, meaning changes that would hide bugs), that is up to them. The good news is that, because this is open source, everyone can become a developer and support additional features! I highly encourage you to do that.

I am the sysadmin for wiki.openstreetmap.org. The ALTER TABLE command above is the first time I've seen any documentation on what might be needed to fix the "bad" encoding. Which tables?!?

Please document what is required for us poor outsiders to fix the issue. A method to identify whether a wiki is even affected would also be appreciated. Documentation on mediawiki.org or within the MediaWiki documentation would be ideal; hidden behind a locked cabinet in Phabricator is tricky. ;-)
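One hedged sketch of how an administrator might answer both questions, assuming the wiki lives in a database called my_wiki (a placeholder) and that a verified backup exists: the first query lists columns still on the 3-byte 'utf8' family, and the second only generates conversion statements for review rather than running them:

```sql
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'my_wiki'
  AND CHARACTER_SET_NAME IN ('utf8', 'utf8mb3');

SELECT CONCAT('ALTER TABLE `', TABLE_NAME, '` CONVERT TO CHARACTER SET binary;') AS suggested_statement
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'my_wiki'
  AND (TABLE_COLLATION LIKE 'utf8\_%' OR TABLE_COLLATION LIKE 'utf8mb3\_%');
```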

Hi, @Firefishy - you are replying to a 4-year-old, closed discussion, so you are likely to get very few responses. As a fellow OSMer, I would encourage you to file a new ticket with the problem you are suffering or suffered, and that way we can help you (through documentation or otherwise).

The truth is that the issue is not MediaWiki, but the historical artefact of 'utf8' in MySQL (and MariaDB) meaning https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html , which meant that 4-byte UTF-8 characters (think emojis) did not "fit" in MySQL's 3-byte 'utf8' encoding. That is documented in the MySQL manual.

MediaWiki at some point supported MySQL's utf8, but now recommends the binary encoding (even if UTF-8 is the real underlying format), as that allows collations or orderings that are sometimes created first by the international Wikimedia community and only implemented in other software much later. It also allows multiple collations on the same table. Other charsets may work for specific wikis (e.g. for English-only wikis, anything that can store 4-byte UTF-8 may work), but that is not something that is supported.

Please feel free to file a new task and add me for more context about what the MediaWiki documentation could do better, as this is mostly an upstream change.

The truth is that the issue is not MediaWiki, but the historical artefact of 'utf8' in MySQL (and MariaDB)...

I know about the MySQL utf8 issues...
I know that new installs of MediaWiki are fixed...

What is missing is any documentation or upgrade procedure (that I can find) for existing MediaWiki installations that used the (perfectly valid at the time) MySQL utf8 to move to the binary format used by new installs. Dump & reload is lossy.