wb_terms contains invalid UTF-8 data
Closed, Invalid (Public)

Description

The wb_terms table contains terms (e.g. labels, descriptions, etc.) for Wikidata items. The length of these terms is limited by the table definition:

| term_text           | varbinary(255)      | NO   | MUL | NULL    |                |

However, nothing ensures that when longer data is cut off to fit the 255-byte column, the result is valid UTF-8. For example, running:

select * from wb_terms where term_full_entity_id='Q1102' and term_language='kn' and term_type='description';

we get this:

|  2349274243 |              0 | Q1102               | item             | kn            | description | ಪ್ಲುಟೋನಿಯಮ್ ಎಂಬುದು ಪೂ ಮತ್ತು ಅಣುಗಳ ಸಂಖ್ಯೆ 94 ಅನ್ನು ಹೊಂದಿರುವ ಟ್ರಾನ್ಸ್ಯುರಾನಿಕ್ ವಿಕಿರಣಶೀಲ ರಾಸಾಯನಿ?                                                                                                                                                                   |                 |           0 |

Note the ? at the end of the text: it is there because the text was cut off and now ends in an invalid UTF-8 sequence. I think Wikibase should cut the terms in a way that does not produce invalid sequences; otherwise, other code that uses that table may get all kinds of weird errors from functions that assume valid UTF-8.
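
To illustrate the kind of truncation I mean, here is a minimal sketch in Python (not Wikibase's actual code, which is PHP). The names `is_valid_utf8` and `truncate_utf8` and the sample string are hypothetical; only the 255-byte limit comes from the table definition above. It shows how a plain byte cut can leave a broken trailing sequence, and how dropping that incomplete sequence keeps the result valid:

```python
MAX_BYTES = 255  # from the varbinary(255) column definition


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def truncate_utf8(text: str, max_bytes: int = MAX_BYTES) -> str:
    """Shorten text so its UTF-8 encoding fits in max_bytes
    without splitting a multi-byte character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # A byte cut of valid UTF-8 can only be invalid at its very end,
    # so ignoring decode errors drops just the incomplete last character.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# Hypothetical sample: one ASCII byte followed by 100 Kannada letters
# (3 bytes each in UTF-8), 301 bytes in total.
sample = "a" + "\u0caa" * 100

naive = sample.encode("utf-8")[:MAX_BYTES]  # plain byte cut
safe = truncate_utf8(sample)                # cut on a character boundary
```

With this sample, the naive cut ends two bytes into a three-byte character and is no longer valid UTF-8, while the safe version stops at 253 bytes. Note that this only guarantees valid UTF-8; it can still separate a base letter from a following combining sign, which matters for scripts such as Kannada.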