Replace wb_terms table with more specialized mechanisms for terms (tracking)
Open, HighPublic

Description

wb_terms is one of the largest tables controlled by Wikibase, and probably that is queried and updated most frequently. This ticket aims to find a replacement for that large monolithic table.

Re-implementations:

  • Uniqueness constraints should be re-implemented based on hashes T74430
  • Lookup by label/alias should be implemented based on Cirrus/Elastic T194143
  • Searching in the client (ArticlePlaceholder) T177453
  • SpecialEntitiesWithoutPage (EntitiesWithoutTermFinder) T194144
  • Term lookup by entityid+termtype+language T143706+T194158?

If all of these are done, we should be able to remove the wb_terms table (in the Wikimedia installation, for 3rd parties other solutions might be needed).

Related Objects

StatusAssignedTask
ResolvedLydia_Pintscher
OpenNone
DuplicateNone
OpenNone
ResolvedLydia_Pintscher
ResolvedLydia_Pintscher
ResolvedLydia_Pintscher
ResolvedAddshore
Resolvedaude
OpenNone
ResolvedLydia_Pintscher
ResolvedNone
ResolvedWMDE-leszek
OpenNone
OpenNone
OpenNone
ResolvedLadsgroup
Resolvedaude
ResolvedMarostegui
ResolvedLadsgroup
ResolvedAndrew
ResolvedLadsgroup
Resolvedaude
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedSmalyshev
DuplicateNone
OpenNone
OpenNone
OpenNone
OpenNone
daniel created this task.Jan 12 2015, 12:46 PM
daniel updated the task description. (Show Details)
daniel raised the priority of this task from to Needs Triage.
daniel added a subscriber: daniel.
Lydia_Pintscher set Security to None.
Lydia_Pintscher triaged this task as High priority.
Lydia_Pintscher added a project: Performance.
Lydia_Pintscher added subscribers: aude, JanZerebecki, hoo and 2 others.

This generally seems fine to me. Definitely better than the current terms table.

Lookup by label/alias should be implemented based on Cirrus/Elastic

This is the only thing I'm really *frown* at. Should it say "search" instead of "lookup"? If it's just lookup, I'd like to know what the motivation to use Cirrus/Elastic for this is.

aude added a comment.Jan 15 2015, 1:01 PM

I generally agree with Daniel's proposal and think Elastic is suitable for the lookup use cases. Elastic well works for both "search" and query / exact matches.

Elastic might even be ok for bulk lookups, but we'll want to investigate further the performance of the two approaches, and think it's worth keeping the sql implementation also so not to force hard dependency on Elastic.

@jeroen: the "lookup" may be an exact match, case insensitive, and/or a prefix match. The code that handles these cases for wb_terms is complex and brittle.

Doing this in Cirrus would scale better, because Cirrus supports sharding. We should try to avoid big tables like wb_terms, or at least, avoid queries against them.

one of the largest tables

FYI: currently 128,836,060 rows.

properties by language+description

By language + label.

Tobi_WMDE_SW added a comment.EditedJan 15 2015, 1:53 PM

Outcome of the meeting with @aude @JanZerebecki @thiemowmde @daniel

(1) uniqueness constraints:
-> PROBLEM: currently done by a self-JOIN on the huge table -> bad idea
-> solution would be a separate table with hashes
-> AGREEMENT moving uniqueness constraints away from terms table
-> AGREEMENT the linked suggestion (hash based) on how to do this should be discussed again before starting to implement it (and needs to be discussed with WMF (performance))

(2) finding properties/items by label:
-> PROBLEM: timeout for short prefixes (caused by the fact that we sort the huge result set by weight in memory)
-> AGREEMENT: use elastic but keep current structure of terms table to be able to fall back
-> AGREEMENT: there should be a fallback for users that don’t want to depend on elastic
-> AGREEMENT: split off interfaces from term index (for each type of query, and one for updating)
-> AGREEMENT: once we have the elastic solution, we can think of moving away from the fall back (or find a different solution as fallback)

(3) finding labels by id:
-> AGREEMENT: keep it as it is for now until we have solved problem 1 and 2 and then re-visit to discuss this again

general AGREEMENT: out of those 3 problem 2 has highest priority.

(2) finding props by label:
-> PROBLEM: timeout for short search-stings
-> AGREEMENT: use elastic but keep current structure of terms table to be able to fall back
-> AGREEMENT: there should be a fallback for users that don’t want to depend on elastic
-> AGREEMENT: split off interfaces from term index (for each type of query, and one for updating)
-> AGREEMENT: once we have the elastic solution, we can think of moving away from the fall back (or find a different solution as fallback)

Where is this used? I thought it would be entityselectors like the search box in the header, but on Wikidata short search strings seem to work just fine.

@adrian: it's used for type-ahead in the entity selector, as you assumed. Type-ahead for short prefixes fails often enough to be prominent in the logs.

We also need this when referring to a property by label in the {{#property}} parser function, and on Special:ItemDisambiguation, as well as the corresponding API module (which is what type-ahead uses).

Ok, thanks @daniel. Is this about substring, exact or prefix matches? I'd think exact and prefix matches is something we should be able to do without resorting to Elasticsearch. On the other hand, I'd trust you all to know that better than I do :)

@adrianheine: it'S about prefix matches, and yes, it could be done without elastic. Getting it right with SQL is a lot harder though than getting it right with elastic.

aude added a comment.Jan 15 2015, 5:16 PM

it would be nice if suggestions would consider normalized diacritic marks. (e.g. find Zúrich in Spanish if entering "Zurich" without having to type the "ú" and without needing "Zurich" as an alias)

elastic can help with this sort of thing, better than SQL.

Note sure if this has been considered: the existing terms table cannot deal with non numeric entity ids. Any new solution we create should not have the same problem.

The ticket "T86950: Implement TermLookup directly on top of wb_terms table" does not technically block this, but should be addressed along with the other tasks.

I suppose core does normalization as well as prefix matching, right? Can't we re-use their mechanism?

I think the base feature (searching for items / properties, based on prefix) should work without elastic as it does currently. Would be great to have that pluggable, though.

@adrian: Core does prefix matching on page titles, and applies different kinds of normalization for different tasks. I'm not sure which mechanism could be re-used for our purpose. The prefix matching is done directly on the page.page_title field. The normalization is done on the full text index table (which is not used on wikimedia wikis). Neither seems very useful to us.

I propose the following next steps:

  • introduce an EntitySearch service based on wb_terms (T86949)
  • introduce an UniqueFeature service to replace the XxxDuplicateDetector classes, based on a new wb_unique_feature table (T74430)
  • Implement TermLookup directly on top of wb_terms table (T86950)

After this, re-evaluate which parts of the TermIndex interface we still need, and how the structure of wb_terms can now be optimized.

daniel renamed this task from RFC: replace wb_terms table with more specialized mechanisms for terms to Tracking: replace wb_terms table with more specialized mechanisms for terms.Feb 4 2015, 1:43 PM
daniel added a project: Tracking.
daniel added a comment.Feb 9 2015, 4:43 PM

For Wikidata-Sprint-2015-02-03 we picked this up as an RFC. The RFC part is "done" now, per the comment above.
I suggest to keep this open as a tracking bug, but drop it from the sprint. Alternatively, we turn this back into an RFC we can close, and create a new tracking bug.

Nemo_bis renamed this task from Tracking: replace wb_terms table with more specialized mechanisms for terms to Replace wb_terms table with more specialized mechanisms for terms (tracking).Jul 19 2015, 9:58 AM
daniel updated the task description. (Show Details)Aug 23 2016, 5:15 PM

I may have done this comment on the wrong ticket: T163551#3221748

hoo updated the task description. (Show Details)May 8 2018, 1:42 PM
hoo updated the task description. (Show Details)May 8 2018, 1:46 PM