Mixup of unicode characters in Query Service
Open, Medium, Public

Description

Some unicode characters seem to be mixed up in the Query Service results.

Example: on the item 12, the different characters are correctly described and represented.
However, on this query result, the character ⓬ appears in the "circled_number" instead of ⑫. Same thing for some other numbers.

Reported here

Event Timeline

Restricted Application added a subscriber: Aklapper.

This version of the query checks the revision which the query service imported. As far as I can tell, in Wikidata the data was correct at the time (i. e., the query service didn’t just miss an update), and in the RDF export it looks correct as well.

This can also be reproduced without reference to any particular item (link):

SELECT ("⑫" AS ?circled) ("⓬" AS ?negativeCircled) {}
circled | negativeCircled

And flipped (link):

SELECT ("⓬" AS ?negativeCircled) ("⑫" AS ?circled) {}
negativeCircled | circled

I suspect that, for whatever reason, it considers both literals to be identical, and whichever one it encounters first is then reused for the other one as well…?

They’re distinct Unicode characters, by the way, not any kind of combined characters: ⑫ U+246B CIRCLED NUMBER TWELVE, ⓬ U+24EC NEGATIVE CIRCLED NUMBER TWELVE. They’re not new characters, either: the former was added in Unicode 1.1 (June 1993), the latter in 3.2 (March 2002).
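As a quick sanity check of the claim above, here is a minimal Python sketch (standard library only, not part of the query service) that prints the code point and Unicode name of each literal and confirms that canonical normalization does not merge them. The variable names are purely illustrative.

```python
import unicodedata

# The two literals from the queries above, written as escapes so the
# code points are unambiguous regardless of font or editor.
circled = "\u246b"           # CIRCLED NUMBER TWELVE
negative_circled = "\u24ec"  # NEGATIVE CIRCLED NUMBER TWELVE

for ch in (circled, negative_circled):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Canonical normalization (NFC) leaves both characters untouched,
# so they remain two different strings.
nfc_circled = unicodedata.normalize("NFC", circled)
nfc_negative = unicodedata.normalize("NFC", negative_circled)
print(nfc_circled != nfc_negative)
```

So any conflation has to come from how the literals are keyed or compared, not from the characters being canonically equivalent.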

Probably related to other issues about Unicode and to the ICU collation level. I presume the collation level currently enabled in Blazegraph conflates these two.

These characters are indeed mapped to the same term in the DB.

SELECT ( ConstantNode(TermId(1415304733L)[⓬]) AS VarNode(negativeCircled) ) ( ConstantNode(TermId(1415304733L)[⑫]) AS VarNode(circled) )

Blazegraph uses ICU collation as its default key builder implementation.
The characters do indeed seem very similar, so ICU might treat them as equal.
This behavior could be changed, but the change might have many side effects, especially with complex Unicode sequences such as diacritics, and should be considered carefully. Another complication is that changing the key builder will require a full reload of all journals.

The issue is that by default Blazegraph uses the tertiary ICU collation level, IIRC (I can check the specific one), so it ignores some differences like this one and generates the same term key for both. It can be switched to Identical, but that would generate much larger term keys, which would hurt performance and increase storage size.

Gehel triaged this task as Low priority. Sep 15 2020, 7:49 AM

I've run into this twice today already. :(

First the character "Ꜵ" (AO ligature) was instead displayed as "🇦🇴" (flag of Angola) which was extremely confusing and I'm glad I was already aware of this ticket because who knows how much time I would've wasted trying to figure out what was wrong with my query otherwise.

Now the regex [\u2100-\u214F] suddenly failed on "ℰ" (U+2130) because it's decided the string is actually "𝔼" (U+1D53C).
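The regex failure can be reproduced outside the query service. This is a small Python sketch, standard library only, that applies the same character class to both characters. The closing NFKC comparison is only a hint at why a comparison that ignores font variants might consider them equal; it is not a statement about what ICU does internally.

```python
import re
import unicodedata

# The character class from the report: the Letterlike Symbols block.
LETTERLIKE = re.compile(r"[\u2100-\u214F]")

script_e = "\u2130"             # SCRIPT CAPITAL E, inside the block
double_struck_e = "\U0001D53C"  # MATHEMATICAL DOUBLE-STRUCK CAPITAL E, outside it

matches_script = LETTERLIKE.fullmatch(script_e) is not None
matches_double_struck = LETTERLIKE.fullmatch(double_struck_e) is not None
print(matches_script, matches_double_struck)

# Both characters have a compatibility mapping to a plain "E", so a
# comparison that disregards such variants can see them as the same letter.
same_base = (unicodedata.normalize("NFKC", script_e)
             == unicodedata.normalize("NFKC", double_struck_e))
print(same_base)
```

If the stored literal is silently swapped for the other variant, the regex is evaluated against U+1D53C and fails, which matches the symptom described.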

Indeed, Ꜵ/🇦🇴 seems to be the same issue:

SELECT ("🇦🇴" AS ?angola) ("Ꜵ" AS ?ao) {}
angola | ao
🇦🇴 | 🇦🇴
SELECT ("Ꜵ" AS ?ao) ("🇦🇴" AS ?angola) {}
ao | angola

Similarly, User:Unjoanqualsevol reported an issue with “l·” (two characters) and “ŀ” (one character – depending on font, the difference may not be very visible, but try dragging a selection and placing the end between the L and the dot) being mixed up (discussion permalink), which shows the same symptom –

SELECT ("Abeŀlio"@ca AS ?lWithMiddleDot) ("Abel·lio"@ca AS ?middleDot) {}
lWithMiddleDot | middleDot
Abeŀlio | Abeŀlio
SELECT ("Abel·lio"@ca AS ?middleDot) ("Abeŀlio"@ca AS ?lWithMiddleDot) {}
middleDot | lWithMiddleDot
Abel·lio | Abel·lio

– but only if I language-tag the strings as @ca: with plain strings on both sides, there is no mixup here. Very odd.
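The "first one seen wins" symptom is what one would expect from a term dictionary whose keys are lossy. The following toy model in Python illustrates that idea only. `ToyTermDictionary` and `lossy_key` are hypothetical names, and NFKC normalization is used merely as a stand-in for a key that ignores some distinctions (it happens to fold ŀ into l + ·). It is not how Blazegraph builds its keys, and it does not explain the language-tag dependence.

```python
import unicodedata


class ToyTermDictionary:
    """Maps a (possibly lossy) key to the first literal seen with that key."""

    def __init__(self, key_func):
        self._key_func = key_func
        self._terms = {}

    def intern(self, literal):
        # setdefault stores the literal only if the key is new;
        # otherwise it returns the literal stored earlier.
        return self._terms.setdefault(self._key_func(literal), literal)


def lossy_key(literal):
    # Assumption for illustration: NFKC stands in for a collation key
    # that treats U+0140 and "l" + U+00B7 as equal.
    return unicodedata.normalize("NFKC", literal)


l_with_middle_dot = "Abe\u0140lio"  # one character: U+0140
middle_dot = "Abel\u00b7lio"        # two characters: "l" + U+00B7

d1 = ToyTermDictionary(lossy_key)
first_order = (d1.intern(l_with_middle_dot), d1.intern(middle_dot))

d2 = ToyTermDictionary(lossy_key)
flipped_order = (d2.intern(middle_dot), d2.intern(l_with_middle_dot))

print(first_order)
print(flipped_order)
```

In both orders the second literal comes back as a copy of the first, mirroring the two query results above.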

I wonder if this issue has made a resurgence recently? Between Nikki and Unjoanqualsevol, we’ve now had two user reports within as many weeks, whereas otherwise I don’t remember hearing about this issue since it was reported over a year ago.

@Gehel Could you have a look, and as we had several mentions of it over the past weeks, maybe reconsider the priority of this task? Thanks a lot!

If you consider changing the collator configuration, note that the collator type should NOT be changed from the default value, ICU:
com.bigdata.btree.keys.KeyBuilder.collator=ICU
There are also the collator type options JDK and ASCII, but neither is usable: JDK basically results in the same comparison as ICU but generates much larger keys, and ASCII just assumes the source text to be ASCII and completely drops Unicode support.

As Stas mentioned, Blazegraph uses the default ICU collator strength, which depends on the locale of the literal but is Tertiary in most cases (that's why it might behave differently if a language tag is specified):
com.ibm.icu.text.Collator#getInstance(java.util.Locale)

You have 4 strength options besides default Tertiary:
Ref: http://userguide.icu-project.org/collation/concepts#TOC-Comparison-Levels

Primary Level: Typically, this is used to denote differences between base characters (for example, "a" < "b"). It is the strongest difference. For example, dictionaries are divided into different sections by base character. This is also called the level-1 strength.

Secondary Level: Accents in the characters are considered secondary differences (for example, "as" < "às" < "at"). Other differences between letters can also be considered secondary differences, depending on the language. A secondary difference is ignored when there is a primary difference anywhere in the strings. This is also called the level-2 strength.
Note: In some languages (such as Danish), certain accented letters are considered to be separate base characters. In most languages, however, an accented letter only has a secondary difference from the unaccented version of that letter.

Tertiary Level (Default in most cases): Upper and lower case differences in characters are distinguished at the tertiary level (for example, "ao" < "Ao" < "aò"). In addition, a variant of a letter differs from the base form on the tertiary level (such as "A" and "Ⓐ"). Another example is the difference between large and small Kana. A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings. This is also called the level-3 strength.

Quaternary Level: When punctuation is ignored (see Ignoring Punctuations (§)) at level 1-3, an additional level can be used to distinguish words with and without punctuation (for example, "ab" < "a-b" < "aB"). This difference is ignored when there is a primary, secondary or tertiary difference. This is also known as the level-4 strength. The quaternary level should only be used if ignoring punctuation is required or when processing Japanese text (see Hiragana processing (§)).

Identical Level: When all other levels are equal, the identical level is used as a tiebreaker. The Unicode code point values of the NFD form of each string are compared at this level, just in case there is no difference at levels 1-4. For example, Hebrew cantillation marks are only distinguished at this level. This level should be used sparingly, as only code point values differences between two strings is an extremely rare occurrence. Using this level substantially decreases the performance for both incremental comparison and sort key generation (as well as increasing the sort key length). It is also known as level 5 strength.

Quaternary level might be sufficient for 'Abeŀlio' if the dot counts as punctuation here, but given the need to distinguish between ⑫ and ⓬, the only option to consider is Identical.

The strength could be adjusted by specifying RWStore.properties parameter:
com.bigdata.btree.keys.KeyBuilder.collator.strength=Identical

This will not update the configuration of existing journals; you would need a full reload. Watch out for the size of the resulting journal: it will be larger, but it's hard to estimate by how much.

Hi, I'm not a developer, I only contribute to Wikidata, but IMHO it's really weird for a database to change the data encoding when it outputs results(!?) Yes, I know, the changes are in the indexing, but users don't care about internal pipes: data is encoded with A, they make a query, and the output uses encoding B. Really weird.

So, if encoding normalization of query results is a desired behaviour (is it?), data should be normalized on input, with a warning to the user when the recorded data differs from the input data. Of course, normalizing input causes a lot of trouble too. What happens if a record needs a non-normalized form?

If encoding normalization of query results isn't a desired behaviour, then identical level for collation/indexing should be used.

My 5 cents.

This problem also affects umlaut characters versus the use of the combining diaeresis, Unicode character U+0308. (And ListeriaBot cannot save the latter.)

Gehel raised the priority of this task from Low to Medium. Nov 9 2020, 4:19 PM
Gehel moved this task from All WDQS-related tasks to Current work on the Wikidata-Query-Service board.

Ꜵ being conflated with 🇦🇴 is a bug in the version of ICU4J we use; switching to ICU 68.1 (we currently use 4.8, from 2011) solves the problem.

Other issues related to similar characters (⑫ vs ⓬) do indeed require switching the collation strength to Identical, which will increase the key sizes by ~80%. It is hard to tell what the actual impact on the Blazegraph journal size is. As discussed in https://github.com/blazegraph/database/issues/93, it does seem that query performance should not be affected too much.
The user impact is hard to evaluate as well: while it's clearly wrong and confusing when two terms are conflated, we have no idea how useful the conflation can be when the terms are not ambiguous. There are perhaps queries that rely on it to find results.
In P13502 I've listed (by brute-force search) the characters that would no longer be conflated using Identical. This sadly does not take into account sequences (like emojis and the Angola flag), for which I don't have great ideas on how to evaluate the impact, but this particular problem could well be very isolated.

Concerning the version of ICU we currently use, I believe that using Identical will solve most of these problems, but we will probably still be affected by other bugs, especially when sorting. This probably deserves its own ticket and is more related to Blazegraph's tech debt.

To move this ticket forward it does seem clear that we can't enable this option on production machines without prior testing on sizes but also on user impact.
We don't have enough machines to run multiple tests at the same time and we might have to either:

  • wait for the planned tests (blank node removal with the streaming updater) to finish
  • or do it at the same time.

I'll prepare some puppet patches in the meantime

I did some experiments using one chunk of our dumps, which accounts for 31,883,361 triples, or ~3‰ of the dump size.
The journal size using the default tertiary strength is 154Gb; it grows to 174Gb using Identical, which is close to a 13% increase in size. Assuming that this increase remains linear, we would jump from 886Gb to 1Tb (114Gb increase) on the current production machines.
For the benefit (the terms that are no longer conflated): Identical stores 9855953 terms vs 9855878 for tertiary, which means that out of the 9855953 terms I inspected, only 75 are conflated.
Using collation strength Identical does not seem to be the right approach to me (cost vs benefit).
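The arithmetic behind this cost/benefit estimate can be rechecked with a few lines of Python. All inputs are the figures quoted above, and the linear extrapolation is the same assumption made in the comment, not a measured result.

```python
# Journal sizes measured on the test chunk (units as quoted above).
tertiary_size = 154
identical_size = 174

# Current production journal size, same unit.
production_size = 886

# Relative growth on the test chunk, then extrapolated linearly.
relative_increase = identical_size / tertiary_size - 1
projected_size = production_size * identical_size / tertiary_size
projected_growth = projected_size - production_size

# Distinct terms stored with each strength.
terms_identical = 9_855_953
terms_tertiary = 9_855_878
extra_terms = terms_identical - terms_tertiary

print(f"relative increase: {relative_increase:.1%}")
print(f"projected size: {projected_size:.0f}, growth: {projected_growth:.0f}")
print(f"additional distinct terms with Identical: {extra_terms}")
```

The result is roughly a 13% larger journal in exchange for 75 additional distinct terms on this chunk, which is the imbalance the comment points out.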

I believe we should at least fix the obvious ICU issues by upgrading the version used by blazegraph but concerning the symbols (P13502) we should try to find an alternative at the blazegraph level that does not involve a 13% increase in journal size.

I wonder, for instance, why Blazegraph is using collation for building its keys here: is the term index used for sorting or for range queries? If not, maybe there would be a way to add a custom key generator that just does NFC normalization and uses UTF-8 for the Term2ID index, a bit like what Lucene does.
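To make the idea concrete, here is a rough Python sketch of such a key function: NFC-normalize the literal, then use its UTF-8 bytes as the key. `term_key` is a hypothetical name. This only illustrates the keying scheme and says nothing about how it would plug into Blazegraph's key builder or whether the term index needs collation order.

```python
import unicodedata


def term_key(literal: str) -> bytes:
    """Hypothetical key: UTF-8 bytes of the NFC-normalized literal."""
    return unicodedata.normalize("NFC", literal).encode("utf-8")


# Distinct symbols keep distinct keys.
circled_differs = term_key("\u246b") != term_key("\u24ec")

# So do the one-character and two-character Catalan spellings.
catalan_differs = term_key("Abe\u0140lio") != term_key("Abel\u00b7lio")

# Canonically equivalent spellings still share a key:
# "u" + COMBINING DIAERESIS versus the precomposed letter.
umlaut_same = term_key("Mu\u0308ller") == term_key("M\u00fcller")

print(circled_differs, catalan_differs, umlaut_same)
```

With such keys only canonically equivalent strings would share a term, at the price of losing any ordering that the collation-based keys currently provide.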

To summarize:

  • using Identical does not seem to be a viable solution to this issue
  • upgrading blazegraph to a newer version of ICU will solve some of the problems
  • evaluate other approaches for computing the Term2ID keys to stop conflating symbols

Given that blazegraph is un-maintained I'm pessimistic about the third point, the second point sounds more approachable.

dcausse added a subscriber: dcausse.

Moving back to the backlog to re-evaluate the priority

Hi @CamelCaseNick , could you give us more details about the issue you encounter with the umlaut character?

  • example queries where the character switch is happening
  • how this affects Listeria (example, error message, etc.)

As David mentioned above, we're probably not going to be able to solve the root problem immediately, but we may be able to investigate a bit more on your specific issue.
Thanks in advance!

@Lea_Lacroix_WMDE I found some in p:P547/pq:P1932 when I created https://www.wikidata.org/wiki/User:CamelCaseNick/Stolpersteine/Hamburg/Mitte, e.g. http://www.wikidata.org/entity/Q66148105 contains U+0308 in Müller (https://w.wiki/pZZ).

As you can see in the query I used, I had a successful workaround with replace(replace(..., "u\u0308", "ü"), "o\u0308", "ö"). The error message was something like "couldn't save due to hash mismatch". I tried it again now (https://www.wikidata.org/w/index.php?title=User:CamelCaseNick/test&oldid=1319530534) and couldn't reproduce it with Listeria v2, i.e. it seems to work now.
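For reference, the nested replace() workaround is a hand-written special case of NFC normalization for two letters. A small Python sketch shows the equivalence on the "Müller" example. `replace_workaround` is an illustrative name that mirrors the SPARQL expression; it is not code from the query or from Listeria.

```python
import unicodedata

decomposed = "Mu\u0308ller"   # "u" followed by COMBINING DIAERESIS (U+0308)
precomposed = "M\u00fcller"   # single precomposed letter U+00FC


def replace_workaround(text: str) -> str:
    # Mirrors replace(replace(..., "u\u0308", "ü"), "o\u0308", "ö").
    return text.replace("u\u0308", "\u00fc").replace("o\u0308", "\u00f6")


print(decomposed == precomposed)                                # different code points
print(replace_workaround(decomposed) == precomposed)            # workaround result
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # general form
```

The two strings render identically but differ in length (7 vs 6 code points), which is the kind of difference that can trip up a byte-level hash comparison.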

Just ran into this again while trying to find items with the same Unicode character statement. There are 50,000 items with one of those statements (query) and 531,000 pairs of items allegedly having the same value (query).