Update OtherIndex to operate on a cluster other than the one holding the wiki
Closed, Resolved, Public

Description

To allow storing wikis on separate clusters while still supporting the OtherIndex functionality (on-wiki commons multimedia search with duplicate removal), CirrusSearch needs to understand, in this limited circumstance, that other clusters exist and route indexing/search operations to the right one.

Event Timeline

EBernhardson created this task.

For search we might consider Cross Cluster Search. This was added in 5.4, came out of beta in 6.0, and is the blessed replacement for tribe nodes. It essentially allows us to query the other index as if it were local by prefixing the index with the cluster name, such as large:commonswiki_file. This lets the cirrus code ignore the question of which cluster (eqiad, codfw?) to read from, relegating it to the Elasticsearch configuration.
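A minimal sketch of the prefixing described above, assuming the remote cluster alias configured in Elasticsearch is known to the caller. The helper name crossClusterIndexName is hypothetical and not part of CirrusSearch.

```php
<?php
// Hypothetical helper, not CirrusSearch code: builds the index expression
// used by cross-cluster search, "<cluster alias>:<index name>".
function crossClusterIndexName( ?string $clusterAlias, string $index ): string {
    // A null alias means the index lives on the cluster being queried.
    return $clusterAlias === null ? $index : $clusterAlias . ':' . $index;
}

echo crossClusterIndexName( 'large', 'commonswiki_file' ), "\n"; // large:commonswiki_file
echo crossClusterIndexName( null, 'commonswiki_file' ), "\n";    // commonswiki_file
```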

For indexing we need to be a little more involved. The use cases look to be:

  • Default operation issues writes to all clusters. Writes to clusters not accepting writes are backed up into the job queue.
  • Maintenance scripts issue writes to a single cluster. Multi-cluster operations are almost always handled by separate invocations.

When we perform OtherIndex updates to a wiki on a separate cluster from commonswiki we need to map to the correct cluster.

  • A file created on qqwiki in cluster eqiad-small01 needs to send writes to eqiad-large and codfw-large.
  • A maintenance script on qqwiki in cluster eqiad-small01 needs to send writes only to eqiad-large. Whether to send to only the local cluster or to all configured clusters is currently gated on a flag called 'same-cluster'.

Current config: [ NS_FILE => 'commonswiki_file' ]

Proposal 1: [ NS_FILE => [ ['codfw-large', 'eqiad-large'], 'commonswiki_file' ] ]

Since search time is handled by cluster configuration, the only thing we really need here is the list of all clusters that need writes. This easily handles standard operation. For maintenance scripts, though, we need to know that when passed a wiki connection to eqiad-small01 we should only choose eqiad-large. We could perhaps hardcode a naming convention and match only cluster names with the same prefix before a - delimiter.
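The naming-convention idea could look roughly like the sketch below. It assumes every cluster name has the form "<datacenter>-<rest>" and that the 'same-cluster' flag is passed in as a boolean. Both function names are hypothetical.

```php
<?php
// Hypothetical sketch, not CirrusSearch code.
// Assumes cluster names look like "<datacenter>-<rest>", e.g. "eqiad-large".
function clusterPrefix( string $cluster ): string {
    // Everything before the first "-" is treated as the datacenter prefix.
    return explode( '-', $cluster, 2 )[0];
}

function selectOtherIndexClusters( array $writeClusters, string $localCluster, bool $sameCluster ): array {
    if ( !$sameCluster ) {
        // Default operation: write to every configured cluster.
        return $writeClusters;
    }
    // Maintenance scripts: keep only clusters sharing the local prefix.
    $prefix = clusterPrefix( $localCluster );
    return array_values( array_filter(
        $writeClusters,
        function ( string $name ) use ( $prefix ): bool {
            return clusterPrefix( $name ) === $prefix;
        }
    ) );
}

print_r( selectOtherIndexClusters( [ 'codfw-large', 'eqiad-large' ], 'eqiad-small01', true ) );
```

The obvious weakness is that the convention is implicit: a cluster name that does not follow the prefix pattern would silently match nothing.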

Proposal 2:

I think part of the reason all of the above feels awkward is that we are treating all clusters as equals, distinguished mostly by name, and not imposing the logical structure we want on them. To give them a better structure we need names. I propose, roughly:

$wgCirrusSearchReplicaGroup = 'small-01';
$wgCirrusSearchReplicaGroups = [
  'small-01' => [
    'eqiad' => 'eqiad-small-01',
    'codfw' => 'codfw-small-01',
  ],
  'small-02' => [
    'eqiad' => 'eqiad-small-02',
    'codfw' => 'codfw-small-02',
  ],
  'large' => [
    'eqiad' => 'eqiad-large',
    ...
  ],
];

The values inside the arrays are names in wgCirrusSearchClusters, and everything that deals with cluster names would primarily deal with the inner structure specified by the local replica group. OtherIndex operations, instead of referring to a separate cluster, will refer to the appropriate replica group.
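A rough sketch of how writes might be resolved under proposal 2. The function name is hypothetical, and the 'codfw' entry of the 'large' group is an assumption taken from the codfw-large cluster mentioned earlier in the thread.

```php
<?php
// Hypothetical sketch, not CirrusSearch code. Group contents are illustrative;
// the 'codfw' entry for 'large' is an assumption.
$replicaGroups = [
    'small-01' => [ 'eqiad' => 'eqiad-small-01', 'codfw' => 'codfw-small-01' ],
    'large' => [ 'eqiad' => 'eqiad-large', 'codfw' => 'codfw-large' ],
];

function otherIndexWriteClusters(
    array $groups, string $localGroup, string $targetGroup, ?string $localCluster = null
): array {
    if ( $localCluster === null ) {
        // Default operation: write to every replica of the target group.
        return array_values( $groups[$targetGroup] );
    }
    // Maintenance script: find which replica the local connection is
    // (e.g. 'eqiad') and use the same replica of the target group.
    $replica = array_search( $localCluster, $groups[$localGroup], true );
    if ( $replica === false || !isset( $groups[$targetGroup][$replica] ) ) {
        return [];
    }
    return [ $groups[$targetGroup][$replica] ];
}

print_r( otherIndexWriteClusters( $replicaGroups, 'small-01', 'large', 'eqiad-small-01' ) );
```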

Does proposal 2 mean we get rid of wgCirrusSearchWriteClusters?
On one hand I like the simplicity of the first proposal, but I'd go with something like:

[ NS_FILE => [
   'read_cluster' => 'logical_name_used_by_crosscluster_search', # OR '__local__' if mw-config detects this wiki belongs to the same cluster
   'write_clusters' => [
         'eqiad-large-01',
         'codfw-large-01',
   ],
   'index_name' => 'commonswiki_file',
] ];

But I think this will require extra logic in mw-config, both to adjust the value properly for every wiki and to ensure that switching between clusters remains a relatively easy config change.

And in the end I think OtherIndex is not the only component that requires some adaptation. Crosswiki search will likely require some changes unless we guarantee that all the wikis that can be accessed from a crosswiki search live on the same cluster. It's unclear to me how to do this: we managed to get crosswiki working using only the data available in SiteMatrix, but here it looks like we will also need to infer another piece of information to determine which cluster to query.

For splitting wikis between clusters and ensuring sister searches stay on the same cluster, I was hoping I could get by with a test case in mw-config that pokes at the SiteMatrix configuration (unfortunately without the SiteMatrix code in mw-debug test suite) and verifies that everything belongs on the "correct" cluster.

To actually choose what goes where, though, I'm not sure. I'm half tempted to generate a whitelist for the large cluster and distribute the rest of the wikis between the small clusters based on the first letter of the wiki, with some letter chosen as the cutoff point between the two clusters. It wouldn't be perfectly balanced, but it would be enough, and I hope it's reasonable to assume all sister wikis will have the same first letter in their wikiid.
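The whitelist-plus-cutoff idea might be sketched as follows. The function name, the group names returned, and the cutoff letter 'n' are all placeholders for illustration, not values anyone has agreed on.

```php
<?php
// Hypothetical sketch, not mw-config code. The cutoff letter and the
// whitelist contents are placeholders.
function assignReplicaGroup( string $wikiId, array $largeWikis, string $cutoff ): string {
    if ( in_array( $wikiId, $largeWikis, true ) ) {
        // Whitelisted wikis always go to the large cluster.
        return 'large';
    }
    // Everything else is split by first letter: strictly before the
    // cutoff goes to small-01, the cutoff letter and after to small-02.
    $first = strtolower( substr( $wikiId, 0, 1 ) );
    return strcmp( $first, $cutoff ) < 0 ? 'small-01' : 'small-02';
}

foreach ( [ 'commonswiki', 'dewiktionary', 'ruwikibooks' ] as $wiki ) {
    echo $wiki, ' => ', assignReplicaGroup( $wiki, [ 'commonswiki' ], 'n' ), "\n";
}
```

Because the split depends only on the first letter, sister wikis such as dewiki and dewiktionary land in the same group as long as neither is whitelisted separately.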

[ NS_FILE => [
   'read_cluster' => 'logical_name_used_by_crosscluster_search', # OR '__local__' if mw-config detects this wiki belongs to the same cluster
   'write_clusters' => [
         'eqiad-large-01',
         'codfw-large-01',
   ],
   'index_name' => 'commonswiki_file',
] ];

Ideally, we want to have a generic wgCirrusSearchExtraIndexes configuration that doesn't need to be dynamically built per-wiki by mediawiki-config. Another option would be to configure a mapping from "some cluster" to appropriate extra index cluster.

The configuration would become something like:

[ NS_FILE => [
    'indexName' => 'commonswiki_file',
    'clusters' => [
        'eqiad-a' => 'eqiad-a',
        'eqiad-b' => 'eqiad-a',
        'eqiad-c' => 'eqiad-a',
        'codfw-a' => 'codfw-a',
        'codfw-b' => 'codfw-a',
        'codfw-c' => 'codfw-a',
    ],
] ]

This makes the simplifying assumption that Elasticsearch cross-cluster search is configured with the same cluster names we use in CirrusSearch. Using anything else seems like it would be overcomplicated anyway. If we add the assumption that any unconfigured mapping points to the current cluster, then this can also reasonably easily handle the existing configuration without modification. When the value is configured as a string instead of an array we can assume ['indexName' => 'foo', 'clusters' => []]. When we look up the mapping in clusters we follow the assumption that unconfigured clusters write to themselves.
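The normalization and fallback rules above can be sketched like this. It is an illustration under the stated assumptions, not the code from the merged patch, and all three function names are hypothetical.

```php
<?php
// Hypothetical sketch, not the merged CirrusSearch implementation.
function normalizeExtraIndex( $config ): array {
    if ( is_string( $config ) ) {
        // Legacy string form: just an index name, no cluster mapping.
        return [ 'indexName' => $config, 'clusters' => [] ];
    }
    return $config + [ 'clusters' => [] ];
}

function extraIndexCluster( $config, string $sourceCluster ): string {
    $config = normalizeExtraIndex( $config );
    // Unconfigured clusters are assumed to write to themselves.
    return $config['clusters'][$sourceCluster] ?? $sourceCluster;
}

function extraIndexSearchName( $config, string $sourceCluster ): string {
    $config = normalizeExtraIndex( $config );
    $target = extraIndexCluster( $config, $sourceCluster );
    // Assumes the cross-cluster alias equals the CirrusSearch cluster name.
    return $target === $sourceCluster
        ? $config['indexName']
        : $target . ':' . $config['indexName'];
}

$config = [
    'indexName' => 'commonswiki_file',
    'clusters' => [ 'eqiad-b' => 'eqiad-a', 'codfw-b' => 'codfw-a' ],
];
echo extraIndexSearchName( $config, 'eqiad-b' ), "\n";             // eqiad-a:commonswiki_file
echo extraIndexSearchName( 'commonswiki_file', 'eqiad-b' ), "\n";  // commonswiki_file
```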

Change 443009 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] OtherIndex support multiple clusters

https://gerrit.wikimedia.org/r/443009

Change 443009 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] OtherIndex support multiple clusters

https://gerrit.wikimedia.org/r/443009

Just to add a note: I discovered this morning that Elasticsearch will refuse to boot if the other cluster, set up like this:

search:
        remote:
                other:
                        seeds: 10.11.12.1:9300

is not running.

This means we can't make clusters dependent on each other.

Mentioned in SAL (#wikimedia-operations) [2018-07-05T18:55:28Z] <ebernhardson> T194678 pause cirrussearch writes to codfw to check how kafka+mirrormaker responds

Mentioned in SAL (#wikimedia-operations) [2018-07-05T19:07:19Z] <ebernhardson> T194678 un-pause cirrussearch writes to codfw

This means we can't make clusters dependent on each other.

As discussed on IRC, this isn't going to be a problem for our current plans, as we only need cross-cluster search from the new small clusters back to the large cluster.