[Bug] wb_terms table truncates labels exceeding 255 bytes, possibly leaving invalid UTF-8
Closed, InvalidPublic
Actions

Assigned To

None

Authored By

	Lea_Lacroix_WMDE
	Aug 11 2016, 9:27 AM

Description

On the russian UI of Q120180, in position held property (P39) (занимаемая должность), the reference label for Q19180760 is trimmed.

Screenshot from 2016-08-11 11-30-49.png (768×1 px, 150 KB)

Details

	Subject	Repo	Branch	Lines +/-
	Fix truncated terms on the fly	mediawiki/extensions/Wikibase	master	+19 -4

Customize query in gerrit

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Invalid		None	T142691 [Bug] wb_terms table truncates labels exceeding 255 bytes, possibly leaving invalid UTF-8
		Resolved		thiemowmde	T142820 [Task] Decide on quick fix for "Max length of reference label in russian"

Event Timeline

Lea_Lacroix_WMDE created this task.Aug 11 2016, 9:27 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 11 2016, 9:27 AM

Lea_Lacroix_WMDE updated the task description. (Show Details)Aug 11 2016, 9:31 AM

There are multiple things to consider here here:

The label, description and alias fields are limited to 250 characters (see multilang-limits in Wikibase.default.php). Same for monolingual terms. But string values are limited to 400 characters. I'm not absolutely sure all this is as intended.
We are currently using the term_text column from the wb_terms table to display these labels. It's set to varbinary(255). But:
There are multiple problems with touching the term_text field:
- There is no absolute maximum byte length a 250 characters UTF-8 string can have. A sequence of emoji may even exceed 1000 bytes.
- [Solution: Increase length] but does MySQL still support all indexing features when we increase the length?
- [Solution: Use VARCHAR] But does MySQL still support all indexing features when we change the field from VARBINARY to VARCHAR?
[Solution: Fallback to retrieving lookup] A possible workaround is to check the length of the strings returned by the EntityInfoTermLookup. If it's exactly 255 characters we repeat the lookup a second time with the EntityRetrievingTermLookup. (Or to add this fallback behavior directly to the EntityInfoTermLookup) The "retrieving" lookup was the one we used for all label lookups a few months ago. We started using the term table based lookup for performance reasons, to avoid deserializing hundreds of entity blobs just because we need a few labels.
[Solution: Elastic lookup] Long term solution should be to "drop" the terms table (read: do not use it any more) in favor of a proper elastic based index (see T125500: [Epic] Index Wikidata labels and descriptions as separate fields in ElasticSearch).

In T142691#2543325, @thiemowmde wrote:

[Solution: Increase length] but does MySQL still support all indexing features when we increase the length?

Yes, we had a similar problem with ips_site_page which we solved that way: T99459: ips_site_page is too short to store some (full) page titles.

• Glorian_Yapinus created subtask T142820: [Task] Decide on quick fix for "Max length of reference label in russian".Aug 12 2016, 10:05 AM

• Glorian_Yapinus mentioned this in T142820: [Task] Decide on quick fix for "Max length of reference label in russian".Aug 12 2016, 1:48 PM

hoo closed subtask T142820: [Task] Decide on quick fix for "Max length of reference label in russian" as Invalid.Aug 12 2016, 1:50 PM

Lydia_Pintscher reopened subtask T142820: [Task] Decide on quick fix for "Max length of reference label in russian" as Open.Aug 12 2016, 1:51 PM

We have 2133 affected terms:

MariaDB [wikidatawiki_p]> SELECT COUNT(*) FROM wb_terms WHERE LENGTH( term_text) > 254;
+----------+
| COUNT(*) |
+----------+
|     2133 |
+----------+
1 row in set (8 min 15.88 sec)

thiemowmde renamed this task from Max lenght of reference label in russian to [Bug] Terms table truncates labels exceeding 255 bytes, possibly leaving invalid UTF-8.Aug 12 2016, 2:51 PM

thiemowmde added subscribers: Lydia_Pintscher, daniel, • Jonas, aude.

Some quick notes:

In repo/sql/Wikibase.sql, this is declared as term_text VARCHAR(255) BINARY NOT NULL, which is NOT the same as VARBINARY. VARCHAR(255) BINARY should use binary collation for UTF8 data, any would allow 255 unicode characters to be stored. The field was apparently changed to VARBINARY during deployment, possibly for backwards compat with MySQL4.

How ever we work around the issue that this field is VARBINARY(255) on the live system, we should strip any broken UTF8 from the end of the string, using StringNormalizer::removeBadCharLast.

Change 306253 had a related patch set uploaded (by Daniel Kinzler):
Fix truncated terms on the fly

https://gerrit.wikimedia.org/r/306253

gerritbot added a project: Patch-For-Review.Aug 23 2016, 5:48 PM

It's VARBINARY in both my local database and on the live system (we can see it truncates at 255 bytes). I don't know how this happened. I assume the meaning of VARCHAR BINARY changed. It appears this was an alias for VARBINARY in the MySQL versions we use.

Here is an example search: https://www.wikidata.org/w/api.php?action=wbsearchentities&search=%D0%92%D1%8B%D1%81%D0%BE%D1%87%D0%B0%D0%B9%D1%88%D0%B8%D0%B9+%D0%BC%D0%B0%D0%BD%D0%B8%D1%84%D0%B5%D1%81%D1%82+1+%D0%BC%D0%B0%D1%80%D1%82%D0%B0+1881+%D0%B3%D0%BE%D0%B4&format=json&language=ru&uselang=ru&type=item. Note that the search result ends with \ufffd. This should be stripped, as @daniel suggests.

During story time we found that:

Certain closed subtasks of T75087: Efficient entity label lookup (tracking) are probably the reason for this bug.
The remaining open subtasks of T75087 will solve this specific bug.
A much wider parent task T86530: Replace wb_terms table with more specialized mechanisms for terms (tracking) also exists, outlining a series of individual replacements necessary to get rid of the terms table.

We said that we will, for now, solve this bug by implementing the "fallback" workaround described in T142691#2543325.

thiemowmde closed subtask T142820: [Task] Decide on quick fix for "Max length of reference label in russian" as Resolved.Aug 26 2016, 3:20 PM

thiemowmde added a project: Wikidata-Sprint-2016-08-30.

Given that it is only a problem in a small number of cases let's not spend time on the workaround and instead work on the proper fix long-term. That's what we also said in story time.

Lydia_Pintscher removed a project: Wikidata-Sprint-2016-08-30.Aug 30 2016, 9:09 AM

Could be relevant: I was able to create a item with 810 characters (https://www.wikidata.org/wiki/Q26903397), but I can't add more labels with that length. https://www.wikidata.org/wiki/Q2732136#P734 shows the broken display as described in this ticket.

thiemowmde mentioned this in T145932: [Bug] Special:NewItem does not validate maximum label length.Sep 17 2016, 11:08 AM

In T142691#2587131, @Lydia_Pintscher wrote:

Given that it is only a problem in a small number of cases let's not spend time on the workaround and instead work on the proper fix long-term. That's what we also said in story time.

The proper long term fix for this one is killing the wb_terms table.
Tagging T198866 as related.

Addshore renamed this task from [Bug] Terms table truncates labels exceeding 255 bytes, possibly leaving invalid UTF-8 to [Bug] wb_terms table truncates labels exceeding 255 bytes, possibly leaving invalid UTF-8.Sep 18 2018, 12:28 PM

Addshore lowered the priority of this task from Medium to Low.

Change 306253 abandoned by Daniel Kinzler:
Fix truncated terms on the fly

https://gerrit.wikimedia.org/r/306253

Addshore mentioned this in T208425: [EPIC] Kill the wb_terms table.Mar 25 2019, 4:38 PM

Lucas_Werkmeister_WMDE mentioned this in T225805: DatabaseTermIdsAcquirer fails on terms longer than 255 bytes.Jun 14 2019, 3:07 PM

hoo mentioned this in T162562: Choose a field length for entity id columns and stick to it.Aug 1 2019, 4:10 PM

Maintenance_bot removed a project: Patch-For-Review.Aug 1 2019, 5:11 PM

thiemowmde removed thiemowmde as the assignee of this task.Sep 11 2019, 2:12 PM

thiemowmde subscribed.