
Resolve document id clashes with unified type
Closed, Declined · Public

Description

Unifying the type between archive/page/namespace is going to produce clashes. For example, we index namespaces with the namespace id as the document id, so a page whose numeric id overlaps a namespace id would collide.

We previously tracked down most of the places that deal with document ids and labeled them as such; it might be possible to add a prefix to them, since Elasticsearch treats document ids as strings anyway.
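
As a rough illustration of the prefix idea, a minimal sketch (the namespace-/page- prefix scheme and the helper below are hypothetical, not the existing CirrusSearch naming):

```python
# Hypothetical sketch: disambiguate document ids by prefixing them with the
# document kind, relying on Elasticsearch treating ids as opaque strings.

def doc_id(kind: str, numeric_id: int) -> str:
    """Build a string document id such as 'namespace-4' or 'page-4'."""
    return f"{kind}-{numeric_id}"

# A namespace and a page that share the numeric id 4 no longer clash:
assert doc_id("namespace", 4) != doc_id("page", 4)
```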

Event Timeline

EBernhardson created this task.

If we split namespace into its own single index for all wikis, the main remaining difference may now only be the archive documents. Perhaps we need to revisit archive and see if it makes sense to unify the two as a single document with two different states (archived/current)? Will need to think about it.

For the namespace data we could perhaps consider storing it in the metastore index.
For archives I have no idea whether we could unify them into a single index. A minor concern is that we would store different kinds of wikis, such as private and public ones, in the same index.

I was actually thinking namespace could be its own single index, shared between the wikis. I suppose we could use the metastore for that; it's tiny data and fits the metastore concept.
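
To make that concrete, a minimal sketch of one shared namespace index, assuming the elasticsearch-py 8.x client; the index name, endpoint, and composite id scheme are illustrative assumptions, not the actual schema:

```python
# Illustrative only: storing namespace names for every wiki in one shared
# index, keyed by a composite id so per-wiki numeric ids cannot clash.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def index_namespace(wiki: str, namespace_id: int, name: str) -> None:
    es.index(
        index="namespaces",            # hypothetical shared index name
        id=f"{wiki}-{namespace_id}",   # e.g. "enwiki-4"
        document={"wiki": wiki, "namespace_id": namespace_id, "name": name},
    )

index_namespace("enwiki", 4, "Wikipedia")
index_namespace("frwiki", 4, "Wikipédia")
```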

For archive I wasn't thinking of unifying between wikis, but rather unifying the lifecycle of new page -> indexed -> deleted -> archived. As far as I can tell the archive page id is the same page id it had when it was indexed, so this would allow keeping the document id simple. Having a unified document model/lifecycle for the index seems desirable and easier to reason about than having two different things (page/archive) with a field distinguishing them. Of course the field to distinguish them would still exist; I'm mostly just thinking about treating it as a single thing.
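
A sketch of the single-lifecycle idea, assuming the elasticsearch-py 8.x client; the index name and the state field and its values are made up for illustration:

```python
# Hypothetical single document model: one document per page id, with a state
# field flipped through the lifecycle instead of moving the page between a
# page type and an archive type.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Page is created and indexed.
es.index(index="enwiki_content", id="42",
         document={"title": "Example", "state": "current"})

# On deletion the same document is kept and marked archived, so the document
# id stays the same across the whole lifecycle.
es.update(index="enwiki_content", id="42", doc={"state": "archived"})
```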

A single doc model seems reasonable and matches how mappings work: a single unified mapping.
When you speak of a unified lifecycle, do you mean we would flip a kind of state field in the document to indicate its status?
Because one thing to consider is that redirects currently do not have their own documents when they are in the "indexed" state. So the flow cannot be a single deleted -> archived step; it will still have to deal with redirects, I think.
Another concern I have is regarding the copy_to: [suggest]. Currently the archive type does not specify this copy_to behavior for the title field.
I'm not even sure how that works with elastic; I bet elastic just makes sure that the indexed fields are compatible with fields of the same name in other types and is still able to treat copy_to differently depending on the doc's type.
With a single mapping, archived titles will perhaps start to pop up in DYM suggestions, which does not seem desirable?
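
To make the copy_to concern concrete, roughly what the relevant fragment of a single unified mapping would look like (field names are illustrative): copy_to is declared on the field, not per document, so every title, archived or not, would feed suggest.

```python
# Illustrative unified mapping fragment: copy_to applies to the title field of
# every document in the index; it cannot be varied per document kind.
unified_mapping = {
    "properties": {
        "title": {"type": "text", "copy_to": ["suggest"]},
        "suggest": {"type": "text"},
    }
}
```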

I hadn't thought about copy_to, so I tried it out; indeed Elasticsearch seems to handle multiple types with varied copy_to on the same field correctly. I don't see any obvious solution to this while moving away from multiple types, short of adding a field that is only populated for archive documents. Looking around the cluster, we have at most 50k archive documents per wiki.
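
One shape that workaround could take (purely a sketch; the archive_title field name is invented): archive documents store their title in a separate field without copy_to, so only live page titles feed the suggest field.

```python
# Hypothetical workaround: a dedicated title field for archive documents that
# does not copy into suggest, keeping archived titles out of DYM suggestions.
mapping_with_archive_field = {
    "properties": {
        "title": {"type": "text", "copy_to": ["suggest"]},  # live pages only
        "archive_title": {"type": "text"},  # populated only for archives
        "suggest": {"type": "text"},
    }
}
```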

Something we haven't discussed but should at least ponder the implications of: what if we make a new per-wiki archive index? This is by far the simplest solution. In terms of production usage these are all tiny; nothing needs more than a single shard. So we have 1 primary, 2 replicas, and 900 wikis. That would add a total of 2700 tiny shards to the 9340 in the cluster, roughly a 29% increase. We haven't had problems (that I'm aware of) with master timeouts recently; perhaps the increased timeout along with cluster version updates has resolved some of our problems there? Elasticsearch has a blog post giving a "rule of thumb" of 20-25 shards per GB of heap, or 21-26k shards on a cluster of our size. Overall, though, I'm wary of adding so many new shards. I wonder if there is some test we could devise for the warm-spare cluster to create archive indices and measure master latency for a variety of actions. I suppose we would have to repeat the test on a loaded cluster, though, to ensure it's representative.
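
For reference, the arithmetic behind those figures (the cluster heap total is back-solved from the quoted numbers, not a measured value):

```python
# Shard increase from per-wiki archive indices, using the figures quoted above.
wikis = 900
shards_per_index = 1 + 2                        # 1 primary + 2 replicas
new_shards = wikis * shards_per_index           # 2700
print(new_shards, f"{new_shards / 9340:.1%}")   # 2700 28.9%

# The quoted 21-26k ceiling from the 20-25 shards per GB rule of thumb implies
# roughly 1050 GB of heap across the cluster (an inference, not a measurement).
print(20 * 1050, 25 * 1050)                     # 21000 26250
```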

Now I'm off in a completely different direction, but it's perhaps worthwhile to consider the easy way more directly before dismissing it. I wasted a bunch of time on multi-wiki indices before and then found they weren't really necessary.

We decided not to add a new type field, and instead to split archive into its own index. To handle the sharding problem we will create 2 new "tiny" clusters on the existing hardware and split all of the tiny wikis between them.
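
For illustration, a minimal sketch of what a dedicated per-wiki archive index could look like, assuming the elasticsearch-py 8.x client; the index name, shard counts, and mapping are assumptions, and real index creation would go through CirrusSearch's own maintenance tooling rather than ad-hoc calls like these:

```python
# Hypothetical per-wiki archive index: tiny, a single primary shard, with
# archived pages keeping the page id they had while they were live, since
# they can no longer clash with current pages.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

wiki = "enwiki"
es.indices.create(
    index=f"{wiki}_archive",
    settings={"number_of_shards": 1, "number_of_replicas": 2},
    mappings={"properties": {"title": {"type": "text"}}},
)

es.index(index=f"{wiki}_archive", id="42", document={"title": "Deleted page"})
```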