
Default Blazegraph configuration confuses strings with and without RTL mark
Closed, Resolved · Public

Description

As discussed on wiki, Blazegraph seems to treat a string with a right-to-left mark (U+200F) and the same string without it as identical. This is not a good idea, especially given that strings in Blazegraph are "sticky": each string is indexed only once in the TERM2ID store, and subsequent accesses reuse the same entry. This leads to various strange effects when two strings get conflated into one.

Event Timeline

Restricted Application added a subscriber: Aklapper.

The reason seems to be that Blazegraph uses ICU collation keys, and the ICU collator seems to ignore U+200F by default. We may need a patch to change that. The relevant code is in: https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata/src/java/com/bigdata/btree/keys/ICUSortKeyGenerator.java

We may want to try changing the collator strength, or find some other way to avoid such issues.
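
To illustrate the suspected mechanism, here is a toy model in Python. It is a sketch under stated assumptions, not ICU's algorithm: `toy_sort_key` is a hypothetical stand-in that simply drops Unicode format characters (category Cf, which includes U+200F) before building a key, and `term2id` is a plain dict standing in for the TERM2ID index.

```python
import unicodedata


def toy_sort_key(s: str) -> bytes:
    """Hypothetical stand-in for a collation key that ignores format characters.

    Real ICU keys are built from collation weights. This only mimics the one
    property that matters here: U+200F contributes nothing to the key.
    """
    kept = (ch for ch in s if unicodedata.category(ch) != "Cf")
    return "".join(kept).encode("utf-8")


def intern_term(term2id: dict, s: str) -> int:
    """Return the ID for s, assigning a new ID only if its key is unseen."""
    return term2id.setdefault(toy_sort_key(s), len(term2id))


term2id = {}
with_mark = "0000 0000 4698 056X\u200f"
without_mark = "0000 0000 4698 056X"

id_a = intern_term(term2id, with_mark)
id_b = intern_term(term2id, without_mark)

print(with_mark == without_mark)  # False: the strings differ by one code point
print(id_a == id_b)               # True: both land on the same index entry
```

Whichever string is stored first "wins" the entry in this model, which matches the sticky behaviour described in the task description.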

Test case: insert data

INSERT {
  <a> <b> "0000 0000 4698 056X\u200F" .
  <c> <d> "0000 0000 4698 056X" .
} WHERE {}

Then query:

SELECT * WHERE {
 ?x ?y "0000 0000 4698 056X"
}

It should only produce one result, but it produces two now.

Setting the option -Dcom.bigdata.btree.keys.KeyBuilder.collator.strength=Identical appears to fix the issue, but it requires a full reindex and almost doubles the size of the keys for strings, which may affect the space consumed. I'll see if there's a way to fix the immediate problem more directly.
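
The trade-off can be pictured with a toy model in Python. The assumption here is that an identical-strength key behaves roughly like the regular key followed by a tie-breaker derived from the NFD form of the full string. `toy_key` and its byte layout are hypothetical; real ICU keys are encoded differently, and the sizes printed are not measurements of Blazegraph.

```python
import unicodedata


def toy_key(s: str, identical: bool = False) -> bytes:
    """Toy collation key (hypothetical, not ICU's real encoding)."""
    # Base key: format characters such as U+200F are ignored.
    base = "".join(
        ch for ch in s if unicodedata.category(ch) != "Cf"
    ).encode("utf-8")
    if not identical:
        return base
    # Tie-breaker: NFD form of the full string, ignorable characters included.
    tie_breaker = unicodedata.normalize("NFD", s).encode("utf-8")
    return base + b"\x00" + tie_breaker


with_mark = "0000 0000 4698 056X\u200f"
without_mark = "0000 0000 4698 056X"

# Default strength: the two keys collide.
print(toy_key(with_mark) == toy_key(without_mark))  # True

# Identical strength: the tie-breaker separates them.
print(toy_key(with_mark, identical=True) == toy_key(without_mark, identical=True))  # False

# The key for the plain string grows from 19 to 39 bytes in this model.
print(len(toy_key(without_mark)), len(toy_key(without_mark, identical=True)))  # 19 39
```

In this sketch the tie-breaker is about as long as the base key, which is consistent with the near-doubling noted above, though the real ratio depends on ICU's encoding.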

Values affected:

0000 0000 4698 056X
0000 0000 3227 156X
0000 0000 5154 1895
0000 0000 5328 9611
0000 0000 7896 3086
0000 0003 6772 0443
0000 0000 5661 6438
0000 0000 8043 5485
0000 0003 7884 5356
0000 0003 9447 4903

Mentioned in SAL (#wikimedia-operations) [2018-06-26T05:28:57Z] <SMalyshev> testing fix for T197447 on wdqs1009

Applied the temp database fix to wdqs2001 and wdqs2002. Seems to be working. I'll let them run with it for a bit; if I don't see anything weird, I'll apply it to the rest of the servers.

I am not sure whether to apply the more generic collation fix, since it inflates the keys and the problem seems to be pretty rare.

Mentioned in SAL (#wikimedia-operations) [2018-06-27T20:02:06Z] <SMalyshev> applied fix for T197447 to eqiad wdqs cluster, which involved restart of the services

The immediate problem seems to be fixed. I will not switch the collator for now, due to the need for a full reindex and the potential performance impact. If this happens more often, I'll reconsider.

Vvjjkkii renamed this task from Default Blazegraph configuration confuses strings with and without RTL mark to ouaaaaaaaa. · Jul 1 2018, 1:03 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Smalyshev as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from ouaaaaaaaa to Default Blazegraph configuration confuses strings with and without RTL mark. · Jul 2 2018, 12:09 PM
CommunityTechBot closed this task as Resolved.
CommunityTechBot assigned this task to Smalyshev.
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.