Page MenuHomePhabricator

[Bug] wb_terms table truncates labels exceeding 255 bytes, possibly leaving invalid UTF-8
Open, LowPublic

Description

On the russian UI of Q120180, in position held property (P39) (занимаемая должность), the reference label for Q19180760 is trimmed.

Details

Related Gerrit Patches:
mediawiki/extensions/Wikibase : masterFix truncated terms on the fly

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 11 2016, 9:27 AM
thiemowmde triaged this task as Medium priority.Aug 11 2016, 10:14 AM
thiemowmde moved this task from incoming to needs discussion or investigation on the Wikidata board.

There are multiple things to consider here here:

  • The label, description and alias fields are limited to 250 characters (see multilang-limits in Wikibase.default.php). Same for monolingual terms. But string values are limited to 400 characters. I'm not absolutely sure all this is as intended.
  • We are currently using the term_text column from the wb_terms table to display these labels. It's set to varbinary(255). But:
  • There are multiple problems with touching the term_text field:
    • There is no absolute maximum byte length a 250 characters UTF-8 string can have. A sequence of emoji may even exceed 1000 bytes.
    • [Solution: Increase length] but does MySQL still support all indexing features when we increase the length?
    • [Solution: Use VARCHAR] But does MySQL still support all indexing features when we change the field from VARBINARY to VARCHAR?
  • [Solution: Fallback to retrieving lookup] A possible workaround is to check the length of the strings returned by the EntityInfoTermLookup. If it's exactly 255 characters we repeat the lookup a second time with the EntityRetrievingTermLookup. (Or to add this fallback behavior directly to the EntityInfoTermLookup) The "retrieving" lookup was the one we used for all label lookups a few months ago. We started using the term table based lookup for performance reasons, to avoid deserializing hundreds of entity blobs just because we need a few labels.
  • [Solution: Elastic lookup] Long term solution should be to "drop" the terms table (read: do not use it any more) in favor of a proper elastic based index (see T125500: [Epic] Index Wikidata labels and descriptions as separate fields in ElasticSearch).
hoo added a subscriber: hoo.Aug 11 2016, 11:03 AM
  • [Solution: Increase length] but does MySQL still support all indexing features when we increase the length?

Yes, we had a similar problem with ips_site_page which we solved that way: T99459: ips_site_page is too short to store some (full) page titles.

hoo added a comment.Aug 12 2016, 2:39 PM

We have 2133 affected terms:

MariaDB [wikidatawiki_p]> SELECT COUNT(*) FROM wb_terms WHERE LENGTH( term_text) > 254;
+----------+
| COUNT(*) |
+----------+
|     2133 |
+----------+
1 row in set (8 min 15.88 sec)
thiemowmde renamed this task from Max lenght of reference label in russian to [Bug] Terms table truncates labels exceeding 255 bytes, possibly leaving invalid UTF-8.Aug 12 2016, 2:51 PM
daniel added a comment.EditedAug 23 2016, 5:32 PM

Some quick notes:

In repo/sql/Wikibase.sql, this is declared as term_text VARCHAR(255) BINARY NOT NULL, which is NOT the same as VARBINARY. VARCHAR(255) BINARY should use binary collation for UTF8 data, any would allow 255 unicode characters to be stored. The field was apparently changed to VARBINARY during deployment, possibly for backwards compat with MySQL4.

How ever we work around the issue that this field is VARBINARY(255) on the live system, we should strip any broken UTF8 from the end of the string, using StringNormalizer::removeBadCharLast.

Change 306253 had a related patch set uploaded (by Daniel Kinzler):
Fix truncated terms on the fly

https://gerrit.wikimedia.org/r/306253

It's VARBINARY in both my local database and on the live system (we can see it truncates at 255 bytes). I don't know how this happened. I assume the meaning of VARCHAR BINARY changed. It appears this was an alias for VARBINARY in the MySQL versions we use.

Here is an example search: https://www.wikidata.org/w/api.php?action=wbsearchentities&search=%D0%92%D1%8B%D1%81%D0%BE%D1%87%D0%B0%D0%B9%D1%88%D0%B8%D0%B9+%D0%BC%D0%B0%D0%BD%D0%B8%D1%84%D0%B5%D1%81%D1%82+1+%D0%BC%D0%B0%D1%80%D1%82%D0%B0+1881+%D0%B3%D0%BE%D0%B4&format=json&language=ru&uselang=ru&type=item. Note that the search result ends with \ufffd. This should be stripped, as @daniel suggests.

During story time we found that:

We said that we will, for now, solve this bug by implementing the "fallback" workaround described in T142691#2543325.

Given that it is only a problem in a small number of cases let's not spend time on the workaround and instead work on the proper fix long-term. That's what we also said in story time.

Could be relevant: I was able to create a item with 810 characters (https://www.wikidata.org/wiki/Q26903397), but I can't add more labels with that length. https://www.wikidata.org/wiki/Q2732136#P734 shows the broken display as described in this ticket.

Given that it is only a problem in a small number of cases let's not spend time on the workaround and instead work on the proper fix long-term. That's what we also said in story time.

The proper long term fix for this one is killing the wb_terms table.
Tagging T198866 as related.

Addshore renamed this task from [Bug] Terms table truncates labels exceeding 255 bytes, possibly leaving invalid UTF-8 to [Bug] wb_terms table truncates labels exceeding 255 bytes, possibly leaving invalid UTF-8.Sep 18 2018, 12:28 PM
Addshore lowered the priority of this task from Medium to Low.

Change 306253 abandoned by Daniel Kinzler:
Fix truncated terms on the fly

https://gerrit.wikimedia.org/r/306253

thiemowmde removed thiemowmde as the assignee of this task.Sep 11 2019, 2:12 PM
thiemowmde added a subscriber: thiemowmde.