Centralize document id and namespace generation
Closed, Resolved · Public

Description

To support multi-wiki indices we need to prefix wiki-specific values, namely the document IDs and the namespaces. This is best supported by centralizing the generation of these values so that they are prefixed consistently across the codebase.

This needs to take into account not only when we create documents to be indexed, but also when these values are used at query time.
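
A minimal sketch of what such a central helper could look like, written in Python purely for illustration (CirrusSearch itself is PHP). The class name `DocumentIdBuilder`, the `|` separator, and the `prefix_ids` flag are all assumptions made for this example, not names taken from the actual patch. The point is that both the indexing path and the query path call the same two methods, so the prefix is applied and removed in exactly one place.

```python
class DocumentIdBuilder:
    """Hypothetical single place where index-level document IDs are built."""

    SEPARATOR = "|"  # assumed separator; the real patch may use something else

    def __init__(self, wiki_id: str, prefix_ids: bool) -> None:
        self.wiki_id = wiki_id
        self.prefix_ids = prefix_ids

    def doc_id(self, page_id: int) -> str:
        """Index-time and query-time: page ID -> ID stored in the index."""
        raw = str(page_id)
        if self.prefix_ids:
            return f"{self.wiki_id}{self.SEPARATOR}{raw}"
        return raw

    def page_id(self, doc_id: str) -> int:
        """Result handling: ID stored in the index -> local page ID."""
        if not self.prefix_ids:
            return int(doc_id)
        prefix = self.wiki_id + self.SEPARATOR
        if not doc_id.startswith(prefix):
            # The document belongs to another wiki sharing the index.
            raise ValueError(f"{doc_id!r} does not belong to {self.wiki_id}")
        return int(doc_id[len(prefix):])
```

With this shape, code that builds documents and code that builds queries never concatenates the wiki prefix itself; it only asks the builder.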

Event Timeline

Restricted Application added projects: Discovery, Discovery-Search. · Jul 6 2016, 6:23 PM
Restricted Application added subscribers: Zppix, Aklapper.
EBernhardson updated the task description. · Jul 6 2016, 6:34 PM
debt triaged this task as Normal priority. · Jul 6 2016, 9:16 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.

One thing I'm still not sure about is how to handle the transition period. Whether IDs are prefixed depends on how the index was built, not on the current setting. If the index was built without prefixed IDs, we need to keep using unprefixed IDs. Perhaps I can use the metastore index for this; will see.
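
To make the transition idea concrete, here is a rough Python sketch in which the decision is recorded when the index is built and read back later, instead of being taken from the current configuration. The metastore is modeled as a plain dict, and the function names and the `prefix_ids` field are hypothetical; nothing here reflects the real metastore schema.

```python
def record_build_settings(metastore: dict, index_name: str, prefix_ids: bool) -> None:
    """Called once when an index is (re)built: remember how it was built."""
    metastore[index_name] = {"prefix_ids": prefix_ids}


def index_uses_prefixed_ids(metastore: dict, index_name: str) -> bool:
    """Decide from the stored build record, not from the current config."""
    entry = metastore.get(index_name)
    if entry is None:
        # No record: assume the index predates prefixing (an assumption).
        return False
    return bool(entry.get("prefix_ids", False))
```

An index built before the change keeps reporting unprefixed IDs until it is rebuilt, even if the configuration flag has already been flipped.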

Change 300179 had a related patch set uploaded (by EBernhardson):
[WIP] Centralize document and namespace id generation

https://gerrit.wikimedia.org/r/300179

As I work through this I realize that we can't do something like Title::makeTitle($hit['_source']['namespace'], $hit['_source']['title']), because the namespace is prefixed. We can unprefix namespaces, but we won't necessarily know the exact prefix used by other wikis stored within the index. We can make assumptions, but that may not be a great idea.

I wonder, though, whether we really need to prefix namespaces. We could instead build boolean filters that check both the wiki field and the namespace field. This is likely less performant, but I'm not sure by how much. Integrating the ID directly into the namespace may be premature optimization that simply makes other things more complicated.
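
As a rough illustration of the boolean-filter alternative, the Python sketch below builds an Elasticsearch-style `bool` filter that matches on the wiki field and the namespace field separately, leaving namespaces unprefixed. The function name is hypothetical, and the field names `wiki` and `namespace` are assumed from the wording above rather than taken from the actual mapping.

```python
def wiki_namespace_filter(namespaces_by_wiki: dict) -> dict:
    """Build a filter matching (wiki, namespace) pairs without prefixing.

    namespaces_by_wiki maps a wiki ID to the namespace numbers to search,
    e.g. {"enwiki": [0, 1]}.
    """
    per_wiki = [
        {
            "bool": {
                "filter": [
                    {"term": {"wiki": wiki}},
                    {"terms": {"namespace": sorted(namespaces)}},
                ]
            }
        }
        for wiki, namespaces in sorted(namespaces_by_wiki.items())
    ]
    if len(per_wiki) == 1:
        return per_wiki[0]
    # Several wikis: a document must match at least one per-wiki clause.
    return {"bool": {"should": per_wiki, "minimum_should_match": 1}}
```

Because the stored namespace stays a plain integer, a hit can still be passed to something like Title::makeTitle without any unprefixing step.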

Change 300179 merged by jenkins-bot:
Centralize document id generation

https://gerrit.wikimedia.org/r/300179

debt closed this task as Resolved. · Aug 26 2016, 4:28 PM