
Saved edited page is truncated on supplementary characters (e.g. Emojis, or supplementary chinese, in Unicode planes 1 or 2) when your database doesn't support that
Closed, Declined · Public

Description

On the current version of MediaWiki, adding ANY supplementary character literally (such as 🕹, the joystick emoji), i.e. directly in its visible form (encoded in UTF-8), causes the page to be TRUNCATED at this character, discarding everything that follows!

The same page can be previewed correctly; just clicking "Save" triggers the bug and the text is lost!

Bug seen on Wikimedia Commons and on the OpenStreetMap wiki, running MediaWiki version 1.26.3 (fa11b59).

https://wiki.openstreetmap.org/wiki/Special:Version

This is a security issue (well, the affected pages can be restored from the history, but all edits past the supplementary character are invisible/lost).

Don't close this as "invalid"! Something is not documented, and evidently this does not affect the history comment or the start of the page;
this is something occurring after some volume of text. And the database is ALREADY using UTF-8 everywhere!

Event Timeline

Known workaround: use a numeric character reference ("&#x1xxxx;") instead of the literal UTF-8-encoded character.
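For illustration, here is one way to derive that escaped form from the literal character (a minimal sketch, assuming PHP ≥ 7.2 with the mbstring extension; not part of MediaWiki):

```
<?php
// Build the hexadecimal numeric character reference for a literal
// supplementary character such as U+1F579 (the joystick emoji).
$literal = "\u{1F579}";                    // 4 bytes in UTF-8
$ncr = sprintf('&#x%X;', mb_ord($literal, 'UTF-8'));
echo $ncr;                                 // prints: &#x1F579;
```

The &#x...; form is plain ASCII, so it survives a database that can only store 3-byte UTF-8 sequences.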

Not all supplementary characters cause this unexpected truncation of text. It looks like there is some special code supposed to filter/exclude certain characters, but it incorrectly deletes them from the wiki source (instead of just stripping the character, it truncates all the text starting at that character).
But it also occurs with some supplementary Chinese characters in the SIP (plane 2), for unknown reasons.

Note also that there is no truncation when using "Preview". This only happens when saving (without changing anything after the preview).
It also happens with "null edits" of existing pages already containing these characters.

Please provide steps to reproduce. It is unclear where (anywhere on a page?) and how (VisualEditor? Classic wikitext editor?) to "add ANY supplementary character" to see the problem. Also see https://www.mediawiki.org/wiki/How_to_report_a_bug for general information. Thanks!

Everything was said in the FIRST message. Read it correctly!

Steps to reproduce: just edit ANY page and add an emoji (such as U+1F579 as indicated in the first message) in the middle.

Click preview: all is OK.
Click save: that character is deleted, as well as everything that follows (edit again: everything after this character is missing! The page needs to be restored).

The bug was seen with the wikitext editor (I did not check whether this also affects VisualEditor).

Add this on the OpenStreetMap wiki, at the end of any page:

🕹MediaWiki bug test with emojis?

You'll see that the text "MediaWiki bug test with emojis?" is not saved (truncated along with the emoji).

Apparently this is caused by some internal anti-emoji filter that is enabled there.
There have been various quirks and bugs related to emoji support in MediaWiki, and various attempts to implement such filters (as emojis caused problems in some export tools, such as PDF generators). But maybe this was first attempted by trying to strip them (incorrectly) from wiki pages.

Or maybe this is a bug in the underlying database layer (some internal conversion to UTF-16 and a broken interface that does not like surrogates and incorrectly filters them? Or the database actually uses not UTF-8 but CESU-8, where supplementary characters are encoded as 6 bytes instead of 4, i.e. 3 bytes for each surrogate; when loading data, a check for UTF-8 conformance would then consider everything broken starting right at the 3 CESU-8 bytes of the first high surrogate, which are invalid in strict UTF-8).
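To make that hypothesis concrete, this small sketch (plain PHP, not part of MediaWiki) prints the strict UTF-8 bytes of U+1F579; the CESU-8 bytes in the comment are derived from its surrogate pair D83D DD79 and are what such a broken layer would produce:

```
<?php
// Strict UTF-8: U+1F579 is one 4-byte sequence.
echo bin2hex("\u{1F579}"), "\n";   // f09f95b9

// CESU-8 would instead encode the surrogate pair D83D DD79 as two 3-byte
// sequences, ED A0 BD ED B5 B9 (6 bytes); a strict UTF-8 validator rejects
// the stream at the first ED A0 BD.
```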

I made this test here!
http://wiki.openstreetmap.org/w/index.php?title=Overpass_API%2FPublic_transport_examples&type=revision&diff=1304144&oldid=1304114

"no diff" !
But the 31 bytes were counted in the history (the length of "MediaWiki bug test with emojis?", excluding the emoji itself from the count shown in the history). They are however ignored when rendering the page or loading it again in the editor, and were probably not saved at all.

You can do this test on any other page (your talk page, a template, any article, a File description page, a category page). It does not matter. Any insertion of a single emoji anywhere in the page will cause a severe, unexpected truncation of the page. If you insert a single emoji at the start of the page, it becomes completely blank.

Correction: the truncation apparently depends on the total page size. At the start of the page there is no truncation, and the emoji is also saved in the history comment. I see this bug when inserting after more than about 1 KB of content.

I'm doing further tests on my own sandbox page. Not all pages have this issue; it depends on where the emoji is inserted. It seems to be related to some internal (partial) buffer length when a UTF-8 sequence is split in two parts.

In fact I can reproduce the test with ANY supplementary character (not just emojis), such as:

  • cartographic and road sign pictograms: 🚫 🚸 🚀 🚁 🚂 🚃 🚄 🚅 🚆 🚇 🚈 🚉 🚊 🚋 🚠 🚡 🚢 🚣
  • Domino tiles: 🀰 🀱 🀲 🀳
  • Musical symbols: 𝄀 𝄁 𝄂 𝄃
  • Miscellaneous symbols (tools): 🔨 🔩 🔪 🔫
  • Ancient Greek numbers: 𐅀 𐅁 𐅂 𐅃
  • Gothic characters: 𐌰 𐌱 𐌲 𐌳
  • Old Turkic letters: 𐰀 𐰁 𐰂 𐰃
  • Ugaritic letters: 𐎀 𐎁 𐎂 𐎃
  • SIP sinograms for Han, such as U+20000 for the aspirated breathing sound "oh!" (pinyin "hē")

I can also reproduce the test on the Gothic Wikipedia (where there is also some strange page truncation occurring on random pages).

I see some differences in the underlying platforms:

  • Wikipedia uses MariaDB version 10.0.23-MariaDB-log, while the OSM wiki uses MySQL version 5.5.49-0ubuntu0.14.04.1
  • Wikipedia uses HHVM version 3.12.1 (srv), while the OSM wiki uses PHP version 5.5.9-1ubuntu4.16 (apache2handler)
  • Wikipedia uses ICU version 4.8.1.1, while the OSM wiki uses ICU version 52.1

Maybe this is an issue in PHP (I have doubts: the Gothic Wikipedia has existed for a long time, PHP handles supplementary characters in UTF-8 as plain octet strings regardless of encoding, and replacing PHP/Zend with HHVM should not have created such a difference), or more probably in the MySQL adapter (incorrect setting of the UTF-8 encoding). The bug is also probably not in ICU (which has full support for supplementary characters, provided that the internal encoding is correctly set).

Elasticsearch is in sync between both wikis (there is also no Lua support on the OSM wiki, so that cannot be the issue).

I should contact the OSM wiki admins for their opinion. Or something is incorrectly documented in MediaWiki about how to set up PHP and MySQL correctly. I doubt this is caused by Apache (because the preview works).

Everything was said in the FIRST message. Read it correctly!

Unfortunately not. As I said, it was missing a clear list of steps to reproduce so anybody else could follow those steps without having to interpret. See my previous comment. Thanks for having clarified those steps now.
(I'm also happy to offer a "please" as a prefix for any future commands. ;) )

Characters outside of the Basic Multilingual Plane are only supported when your database supports them. MySQL does not if you use the UTF-8 character set. You should use the "binary" character set when setting up your database. This is documented and not a MediaWiki bug.

Invalid? I can always reproduce it! How can you just ignore what is a *real* issue?
I also detailed the very simple list of steps (only one character needed in an article).
Also, the database in question supports UTF-8, as this is the default, and all characters in the BMP are supported (not just ASCII).
The pages are truncated without notice (in fact the full page is saved, but cannot be loaded completely after being saved, so its rendering is truncated and the end can no longer be edited).

Verdy_p triaged this task as High priority.
Verdy_p updated the task description.
Jdforrester-WMF subscribed.

Invalid? I can always reproduce it! How can you just ignore what is a *real* issue?

Please read @matmarex's closing note.

[…]

Also, the database in question supports UTF-8, as this is the default, and all characters in the BMP are supported (not just ASCII).

Yes. But this character is outside the BMP.

Jdforrester-WMF renamed this task from "Saved edited page is truncated on supplementary characters (e.g. Emojis, or supplementary chinese, in Unicode planes 1 or 2)" to "Saved edited page is truncated on supplementary characters (e.g. Emojis, or supplementary chinese, in Unicode planes 1 or 2) when your database doesn't support that". · May 23 2016, 3:01 PM

But the database is already in binary mode... (this has even been set automatically by the installation script since 2010), unless it was created separately without the installation script.
How can MySQL truncate the result when there is valid UTF-8 for the supplementary characters and the database already uses UTF-8 (if it is still set to this mode and not to binary mode)?
You also said this is documented; I found no place in the MediaWiki documentation where this could even cause an issue.
Is MySQL's UTF-8 support so broken that it only supports characters in the BMP and truncates all the rest in queries for the wikitext (but not in the saved history comment!!)?

Verdy_p updated the task description.

In fact the bug is in MediaWiki: it uses "utf8" instead of "utf8mb4" with MySQL 5.5!

But the database is already in binary mode... (this has even been set automatically by the installation script since 2010), unless it was created separately without the installation script.

The charset is set separately for every database, every table and every column. Presumably some of your columns are set to UTF-8 rather than binary.
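One way to verify that on a given installation is to ask MySQL directly which columns are still declared with the 3-byte charset (a sketch using mysqli; the connection parameters and the database name wikidb are illustrative):

```
<?php
// List columns declared with MySQL's 3-byte "utf8" charset; these are the
// ones that cannot store 4-byte (non-BMP) characters.
$db = new mysqli('localhost', 'wikiuser', 'secret', 'wikidb');
$res = $db->query(
    "SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME
       FROM information_schema.COLUMNS
      WHERE TABLE_SCHEMA = 'wikidb'
        AND CHARACTER_SET_NAME = 'utf8'"
);
while ($row = $res->fetch_assoc()) {
    echo "{$row['TABLE_NAME']}.{$row['COLUMN_NAME']}\n";
}
```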

You also said this is documented; I found no place in the MediaWiki documentation where this could even cause an issue.

The installer warns about this. "In binary mode, MediaWiki stores UTF-8 text to the database in binary fields. This is more efficient than MySQL's UTF-8 mode, and allows you to use the full range of Unicode characters. In UTF-8 mode, MySQL will know what character set your data is in, and can present and convert it appropriately, but it will not let you store characters above the Basic Multilingual Plane."

How can MySQL truncate the result when there is valid UTF-8 for the supplementary characters and the database already uses UTF-8 (if it is still set to this mode and not to binary mode)?

Is MySQL's UTF-8 support so broken that it only supports characters in the BMP and truncates all the rest in queries

Yup, that is precisely what happens.

In fact the bug is in MediaWiki: it uses "utf8" instead of "utf8mb4" with MySQL 5.5!

We still support MySQL 5.x, not just 5.5. This is T50767: Support 'utf8mb4' character set in MySQL 5.5 and above, by the way.
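For reference, converting a single table to the 4-byte charset on MySQL ≥ 5.5.3 looks roughly like the sketch below (the table name is illustrative; a real migration has to cover every table and deal with indexed VARCHAR columns that may exceed InnoDB's index key-prefix limit once each character can take 4 bytes, which is part of why this is not a trivial switch):

```
<?php
// Illustrative only: switch the connection and one table to utf8mb4.
$db = new mysqli('localhost', 'wikiuser', 'secret', 'wikidb');
$db->set_charset('utf8mb4');   // equivalent to "SET NAMES utf8mb4"
$db->query(
    "ALTER TABLE some_wiki_table
       CONVERT TO CHARACTER SET utf8mb4
       COLLATE utf8mb4_unicode_ci"
);
```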

You've changed that only for MariaDB
https://phabricator.wikimedia.org/rGEVLcea181a530764ecdede2525d315e8f1052b42faa

not for MySQL 5.5 and above (MediaWiki still issues "SET NAMES utf8" instead of "utf8mb4")

You've changed that only for MariaDB
https://phabricator.wikimedia.org/rGEVLcea181a530764ecdede2525d315e8f1052b42faa

not for MySQL 5.5 and above (MediaWiki still issues "SET NAMES utf8" instead of "utf8mb4")

How is this relevant? That commit isn't even in MediaWiki. It's Python code, for goodness sake.

And if only 3-byte UTF-8 is used (with the MySQL "utf8" encoding), there is nothing in MediaWiki that prevents the text from being silently truncated.
MediaWiki should know that this can happen, using a global variable such as $wgSupportSupplementaryPlanes, and implement a safe filter, sending a warning to the editing user that some characters cannot be saved and have been filtered (the page should not be saved immediately, but rendered with this filtering applied in a preview).

Also, the MediaWiki code is full of occurrences of "utf8" instead of "utf8mb4", and there is no test for that (except using the unsafe "binary mode" by testing an unrelated config variable).

Additionally, when starting up, MediaWiki should test the database by committing a single update of text containing some supplementary characters and reading it back from the database to see whether the text is preserved.
If not, it should apply a filtering pass before saving pages (occurrences of supplementary characters can be replaced by numeric character references if needed, instead of filtering these characters out completely).
The wiki administrator should also be informed in the startup log messages.
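A minimal sketch of such a startup probe (standalone PHP with mysqli, not an existing MediaWiki API; a real implementation would create the probe table with the same charset and options as the wiki's own text table, e.g. whatever $wgDBTableOptions specifies):

```
<?php
// Returns true only if a supplementary character survives a write/read
// round trip through the database.
function supportsSupplementaryCharacters(mysqli $db): bool {
    mysqli_report(MYSQLI_REPORT_OFF);   // rely on return values, not exceptions
    $probe = "\u{1F579}";               // U+1F579, a 4-byte UTF-8 sequence

    $db->query("CREATE TEMPORARY TABLE mw_charset_probe (t TEXT)
                CHARACTER SET utf8");   // mimic the charset of the wiki tables

    $stmt = $db->prepare("INSERT INTO mw_charset_probe (t) VALUES (?)");
    $stmt->bind_param('s', $probe);
    // In strict SQL mode the INSERT errors out instead of silently truncating;
    // either way the character did not survive, so treat failure as "no".
    $saved = null;
    if ($stmt->execute()) {
        $row = $db->query("SELECT t FROM mw_charset_probe")->fetch_row();
        $saved = is_array($row) ? $row[0] : null;
    }
    $db->query("DROP TEMPORARY TABLE mw_charset_probe");
    return $saved === $probe;
}
```

If the probe returns false, a flag such as the proposed $wgSupportSupplementaryPlanes (the name is only a suggestion) could be cleared and the filtering/warning path activated.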

Aklapper raised the priority of this task from High to Needs Triage. · May 23 2016, 4:15 PM

Please do not set the task priority field if you do not plan to work on fixing this task. Thanks.

Invalid? I can always reproduce it! How can you just ignore what is a *real* issue?

Nobody "ignored" an issue (also see the Phabricator etiquette for how to phrase this in a more acceptable way).
An explanation for the current behavior was provided and is documented.

MediaWiki should know that this can happen

See T135969#2318683 - the installer warns about this.

Also, the MediaWiki code is full of occurrences of "utf8" instead of "utf8mb4"

See the link provided in T135969#2318686.

Additionally, when starting up, MediaWiki should test the database by committing a single update of text containing some supplementary characters and reading it back from the database to see whether the text is preserved.

As far as I know there are no plans to implement this, hence the proposal to close this task as declined.

No plan for implementing it? Not even a safe detector for the condition that causes pages to be unexpectedly truncated even though they are correct when previewed?
Not all users will notice that their edit was lost or that it truncated the bottom of the page after saving.
Their edit will eventually be reverted by someone else, but the work done is completely lost; this is really unfair for what is a wiki installation problem and a general usability problem.
All wikis running MySQL below version 5.5 are affected (unless they have modified the installation script to use "binary" instead of the default MySQL "utf8" encoding, or set it manually to "utf8mb4"). This is a serious issue.

Even standard bots may insert a supplementary character at any time (this is the default for the Gothic Wikipedia for normal text everywhere). Wikis in Chinese are also likely to be affected at any time by characters in the SIP. Users of smartphones (Android, iPhones...) can now easily post emojis on every talk page (they could be banned for that if it truncates pages, even though this is not their fault and most users will not even know how to revert their own edit that was not properly saved).

The current MediaWiki documentation (the standard page about its installation) does not say a word about this issue. Wiki admins are not aware of the problem (or may not know the severe consequences when non-BMP characters are not supported; maybe they think they simply don't need these supplementary characters, which are not needed for the languages they intend to support)... until someone starts breaking pages unexpectedly (or a malicious bot abuses this bug by editing the initial section to insert a non-BMP character, causing the page to be truncated). This information is hidden in pages that are hard to find on Phabricator or other developer pages; it is hard to find and too technical.

I still found other pages truncated after saving (correct after preview) when inserting translations into Gothic (using the Gothic script, which is not in the BMP).

This is still an unsolved major issue.

And users are posting emojis randomly on talk pages, causing their unexpected truncation. NOTHING in MediaWiki prevents this truncation.

There MUST be something to detect when such characters won't be supported (even if this is because of a limitation of the backing SQL database), and at least there should be a filter that converts them to numeric character references rather than plain UTF-8, which does not work.

Adding a simple test at MediaWiki startup to check whether UTF-8 is fully supported would not cost much (you just need to update a single small text column in a single row of any table, commit it, and read it back to see whether it was properly saved). This would determine the level of UTF-8 support that MediaWiki can safely use (UCS_FULL, UCS2_ONLY, OCTETS_ONLY) to drive the filter (the filter would be a no-op only for UCS_FULL; OCTETS_ONLY means ISO 8859-1, extended to Windows-1252 by the HTML5 standard).

Just a pointer to the discussion at T194125: both migrating to utf8mb4 and migrating to binary should solve this, but there are open questions about how to do it. Whatever the decision, utf8mb3 should not be supported because of the issues it creates.

Whether you like it or not, it is a fact that MediaWiki has already been installed with the "utf8" option via the scripts that have long been shipped for creating the initial MySQL database.
Now you say you don't support it, but it worked as long as no one attempted to use supplementary characters (outside the BMP, needing 4 bytes instead of just 3).
It is a fact that MySQL silently truncates strings when storing them; no error is returned. The preview before saving is still correct (so it is not a problem of PHP, HHVM, or OS compatibility).

MediaWiki still fails to check whether 4-byte-encoded supplementary characters are really supported (it's not just the MySQL client interface used by PHP; the issue lies in the underlying table storage format), and it offers no way to use a filter that re-encodes the wikitext (with NCRs) when supplementary characters are not supported by the underlying store. This filter should be added automatically by the MediaWiki startup code (it may still log a message to alert the database owner that there is an encoding issue, that storage will consequently be less efficient, and that full binary UTF-8 ordering will not work as expected in categories: page contents as well as page titles may contain NCRs in the backing store). The filter would have to work in both directions: from conforming 4-byte UTF-8 to NCRs when submitting SQL queries, and from NCRs back to 4-byte UTF-8 when retrieving data; this can be transparent to the MediaWiki application built on top of it, as part of the SQL connector, using a small interface library.
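A minimal sketch of such a two-way filter (assuming PHP ≥ 7.2 with mbstring; the function names are illustrative, not existing MediaWiki APIs):

```
<?php
// Write direction: replace characters outside the BMP with hexadecimal
// numeric character references, which are plain ASCII and always storable.
function encodeSupplementaryToNcr(string $utf8Text): string {
    return preg_replace_callback(
        '/[\x{10000}-\x{10FFFF}]/u',
        function (array $m): string {
            return sprintf('&#x%X;', mb_ord($m[0], 'UTF-8'));
        },
        $utf8Text
    );
}

// Read direction: turn those NCRs back into literal 4-byte UTF-8.
// Caveat: this also rewrites NCRs that a user typed literally into the
// wikitext, so a lossless filter would need an extra escaping convention.
function decodeNcrToSupplementary(string $storedText): string {
    return preg_replace_callback(
        '/&#x([0-9A-Fa-f]{5,6});/',
        function (array $m): string {
            $cp = hexdec($m[1]);
            return ($cp >= 0x10000 && $cp <= 0x10FFFF)
                ? mb_chr($cp, 'UTF-8')
                : $m[0];
        },
        $storedText
    );
}
```

For example, encodeSupplementaryToNcr("🕹 test") yields "&#x1F579; test", which a 3-byte "utf8" column stores without loss, and decodeNcrToSupplementary() restores the original string.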

Some maintenance scripts running outside MediaWiki may find unexpected NCRs, but those scripts are only used by database administrators, the same people who install MediaWiki and read its logs. In most cases NCRs in page contents (including templates) will not be an issue; the only issues will be the presence of NCRs in page titles (including user names on that wiki!) and the bugs that may (rarely) occur in pages using the JavaScript or CSS content models:

  • in JavaScript you may get incorrect string lengths when measured in UTF-16 code units, because a supplementary NCR like "&#x12345;" will be counted as 9 ASCII code units instead of 2 surrogates.
  • in CSS, NCRs would need a different escape syntax if supplementary characters appear in property values (e.g. in generated content such as ::before { content: "text" }) or in selectors (notably in class names or element identifiers)
  • This will not be an issue at all for the default MediaWiki content model where NCRs can be used everywhere (just like in standard HTML).

In all cases, MediaWiki should not be usable as it is now, with silent truncation of text: the preview is correct, but the data actually saved does not match what was submitted and tested. This causes severe issues when anyone can sign in and post a comment on a talk page that clears everything after it, just by using their own signature.

You still need to perform better validation and secure the content even if the database store is not installed the way you would now like (it was installed the standard way years ago, and converting the database to use another encoding is a very long process). For me it is just simpler to implement the filter, as it does not require any database migration and it protects the content against unexpected truncation, which can damage many pages on the wiki (notably now that many users post comments from their smartphones, where they can easily type emojis).

You have no choice: please add minimal support so that supplementary characters won't unexpectedly truncate pages (this forces us to constantly revert changes completely, and any modification made after that non-BMP character is lost, with no way at all to get it back or fix it). So please detect these databases and then apply the re-encoding filter for the MediaWiki content model (this has to be done in MediaWiki itself, so yes, it is an internal MediaWiki bug that MediaWiki must handle itself, independently of the underlying database, PHP, or OS installation).

For now MediaWiki is completely unsafe and easily attackable (and users of these wikis are absolutely not at fault). It's evident that MediaWiki was initially released without taking care of full UTF-8 conformance, and the installation scripts provided were wrong.

@Verdy_p: If you want to implement this, https://www.mediawiki.org/wiki/Gerrit/Tutorial explains how to provide patches. Thanks.

Krinkle subscribed.

Per previous comments, it is not feasible to support all Unicode characters when the MySQL charset is set to something that is not meant to support them. This is not specific to MediaWiki. To use these characters, please migrate to a supported character set encoding that does support them.

This is what Wikipedia does, and is also what the MW installer recommends by default.

I just wanted MediaWiki to perform a basic check when it is not installed on a compliant database (the test can be extremely fast at startup: see whether non-BMP characters are supported or whether they cause text to be truncated; in that case a global boolean flag is set, and any data submission containing such a non-BMP character triggers a message so the user is informed that these characters are not supported; the submitted text is then rejected, and no unexpected truncation silently occurs). This is a basic security feature, as many wikis cannot be reinstalled on another database without a long offline migration period, and the underlying database will possibly not support it.

Of course this is not a problem for Wikipedia, but this report was NOT about Wikipedia; it was about MediaWiki itself and its installation. I don't see why MediaWiki must not be able to run on a database which does not support non-BMP characters.

Why do ALL wikis HAVE TO support the whole UCS in their database? This is an unnecessary requirement (MW does not REQUIRE it, it just RECOMMENDS it), even if it will of course limit the usage of the wiki. There are many small wikis that will never want to support the whole UCS, or will even just accept plain ASCII or some basic Windows-1252 encoding, because the database must also be compatible with other local processes or local custom extensions that take data from or put data into internal databases not supporting the UCS.

Not all wikis are open to the world; they may exist only within an organization for its own private use, in a single language and with a single basic encoding. Even the WMF may have small internal wikis running without full UCS support in their backend (this is just not the case for the wikis the WMF opens publicly to the Internet; there may even be small wikis on Toollabs with their own small SQL engine and restricted text encodings because they don't need anything else, such as full internationalization support: they may choose this because their local program is not prepared to support the Unicode algorithms and the complex data they require, plus the regular maintenance of that support data at each Unicode release, such as reindexing with new collation rules).

In summary, this old request is possibly invalid for Wikipedia or other Wikimedia wikis (but that is out of scope for what I reported!).
It remains really valid, and still uncorrected, for other wikis in general.

It's not a priority for Wikimedia wikis, but it remains a priority (and a severe bug) for others (which also do not necessarily use recent versions of MySQL or MariaDB, because MediaWiki offers and supports multiple other SQL backends). The test I request should be implemented in each backend interface (even if it will be a no-op for Wikimedia wikis) to detect the charset the backend really supports (and the necessary charset conversions if needed, possibly lossy if the backend's charset is not a conforming UTF). For now MediaWiki assumes that if its backend interface sends an INSERT, UPDATE, or similar request to the SQL engine and the request does not fail, the text has NOT been modified by the backend (this assumption is false, even though it could be checked easily, and only once, at MediaWiki startup).

It is still blocking T194125 (where the major concern is the volume of the migration, or severe performance problems if, to support the full UCS, an admin has to convert the storage to varbinary and then loses the SQL backend's integrated support for collation and sorting, forcing these to be implemented on the client side, in the backend interface, probably with huge memory constraints: imagine requesting the n-th item of an ordered list; you have to load the full list into the client, i.e. into the MediaWiki instance via its backend interface, sort it locally, then drop the unnecessary items. This case happens, for example, when navigating highly populated categories).

Note that a database may safely support the full UCS but not collation of the full set (it may collate only characters in the BMP and put everything else at the end; it may also not collate the full BMP: by default MediaWiki only collates the ASCII subset correctly and then orders everything else as binary; that's why we have "sort keys" in categories, even in Wikimedia wikis, and most of them don't have complete collation data except for a single language, where it may have been tuned specifically on a small subset, treating everything else in binary order!). Such a setting, however, is not affected by the current bug, as the text is safely stored and loaded without truncation or transformation.

Aklapper triaged this task as Lowest priority. · Jun 29 2019, 4:29 PM
Krinkle closed this task as Declined (edited). · Jul 11 2019, 3:15 PM

All the PHP engines and DB backends that MediaWiki supports are capable of processing characters beyond the BMP. If you are in a situation where this is not the case, you have either misconfigured the database server or are using software we do not provide support for.

Focus related helpful thoughts and ideas to T194125 instead.

Please do not re-open this task again.

Verdy_p reopened this task as Open (edited). · Jul 12 2019, 9:58 PM

That's wrong. Being "capable" is just assumed; it is never checked, and there are existing wikis using SQL backends that silently drop non-BMP characters (and everything that follows them), one of them being the OpenStreetMap wiki. Maybe it is misconfigured, but MediaWiki completely forgets to check that, and this causes silent data loss when editing.

My request remains open because NOTHING in the MediaWiki documentation requires the SQL backend to be fully UTF-8 capable (and this is evidently false for various SQL backends, INCLUDING those that you "support" but don't want to).

(Initially, when the bug was first opened, it also affected a few Wikimedia wikis; they have been migrated since then, by reloading their databases completely and even changing their SQL engine multiple times, but this is still not the case for many other wikis.)

And there are many wikis already affected by this bug: it is enough to insert a single non-BMP character in a page; everything is correct when previewing, and the truncation occurs AFTER saving.

I just wanted MediaWiki to check at startup that characters beyond the BMP are fully supported (a single SQL request can do that) and then set a flag. Later, when previewing or saving a page, if there are non-BMP characters (no SQL access is needed to check that), a warning is displayed on the preview and nothing is saved; the user must then remove or replace those characters, but cannot save at all.

Or there could be an extension/hook that converts these non-BMP characters to NCRs in the adapter performing the SQL request. ALL serious applications using an SQL database have an "adapter" layer in charge of character encoding conversion: this is really simple to do in this adapter, which can use the flag set at startup to know whether it must do so.

I'm not saying this is a bug in MediaWiki core itself; it's definitely a bug in the existing SQL adapters, which are really unsafe because they never perform any conversion, run no test at all, and there is NEVER any guarantee that what is saved will be what is retrieved later. It's just a bad assumption, even if the docs RECOMMEND using a full-UTF-8 backend (but STILL have NEVER said that backends MUST support the full UCS).

This remains a bug with the sysadmin's configuration of the database, and not with MediaWiki. Please do not edit war.

No, this is still installed as it was always documented. The basic test I request is also on topic for "MediaWiki database". This is a real bug in that part of MediaWiki, which never asserts but only assumes that things are configured as you expect. Wikimedia itself has changed the way encodings are used in the DB multiple times, and changed the SQL adapters accordingly, but it forgot this case, which is very simple to test (or at least assert at startup). If you made such an assertion and stopped the engine, you would receive tons of complaints that MediaWiki now refuses to run.
It can still be easily corrected by implementing (when required) the encoding converter (using NCRs for example, or saving pairs of surrogates, if supported by the engine).

As well, it's up to the SQL adapter to check other constraints on accepted values (notably the maximum size supported by the engine for various fields or for BLOBs: this is also never checked).