Switching search traffic between datacenters should be faster
Open, High · Public · 5 Estimated Story Points

Description

Currently, switching search traffic between datacenters requires a change in wmf-config. Traffic routing should be dynamic configuration, and should be extremely easy and fast to change when needed.

It has been suggested to use etcd as a dynamic configuration store for this kind of configuration.

https://wikitech.wikimedia.org/wiki/Incident_documentation/20160818-Elasticsearch

Event Timeline

Restricted Application added a subscriber: Aklapper.
debt triaged this task as Medium priority. Aug 25 2016, 10:17 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
debt subscribed.

This is a documentation task, so that we can do this faster and more efficiently.

Actually, this is not only documentation (the current documentation is not too bad). Ideally we want a mechanism to switch traffic that is simpler and more immediate than deploying a config change. That mechanism could be based on etcd. We need to dig a bit deeper to understand what is required to make this happen. The current strategy around etcd is to use etcd + templates to generate config files on the fly. As far as I know there isn't an implementation yet, and this might need some changes in the way we treat Cirrus configuration. @EBernhardson probably knows more than I do about this.
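To make the "etcd + templates" idea concrete, here is a minimal sketch in Python. It is not an existing implementation: a plain dict stands in for etcd, and the key name `search/default_cluster`, the allowlist, and the template layout are all assumptions made for illustration only.

```python
from string import Template

# Assumption: the set of valid values. eqiad and codfw are the datacenter
# names that appear in the SAL entries on this task.
KNOWN_CLUSTERS = {"eqiad", "codfw"}

# Hypothetical template for the generated file. "$$" is an escaped "$",
# so the rendered output contains a literal PHP variable name.
CONFIG_TEMPLATE = Template(
    "<?php\n"
    "// Generated file, do not edit by hand.\n"
    "$$wgCirrusSearchDefaultCluster = '$default_cluster';\n"
)


def render_config(store):
    """Render the config file from values read out of the key-value store.

    `store` is a dict standing in for etcd; the key name is hypothetical.
    """
    cluster = store["search/default_cluster"]
    if cluster not in KNOWN_CLUSTERS:
        # Refuse to generate a config that points at an unknown cluster.
        raise ValueError(f"unknown search cluster: {cluster!r}")
    return CONFIG_TEMPLATE.substitute(default_cluster=cluster)


if __name__ == "__main__":
    store = {"search/default_cluster": "eqiad"}
    print(render_config(store))
    # Switching traffic becomes a single value change followed by a re-render.
    store["search/default_cluster"] = "codfw"
    print(render_config(store))
```

In a real setup the re-render would presumably be triggered by a watch on the store rather than called by hand; that part is left out here.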

The main thing I see is that we need to split reading from and writing to Elasticsearch. This is already the plan with cirrus-streaming-updater, after which reads could go to search.discovery.wmnet.

If we wanted to move faster, some configuration changes could make it happen today. We would need to define a new search cluster in $wgCirrusSearchClusters that queries search.discovery.wmnet, set $wgCirrusSearchDefaultCluster to use that cluster, and not add that cluster to $wgCirrusSearchWriteClusters.
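The read/write split described above can be modelled roughly as follows. This is a Python sketch of the selection logic only, not the actual PHP configuration: the three fields mirror $wgCirrusSearchClusters, $wgCirrusSearchDefaultCluster and $wgCirrusSearchWriteClusters, while the class, the cluster name `dnsdisc`, and the per-datacenter hostnames are placeholders I made up.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SearchConfig:
    """Simplified stand-in for the three Cirrus settings named above."""

    clusters: Dict[str, str]   # mirrors $wgCirrusSearchClusters (name -> host)
    default_cluster: str       # mirrors $wgCirrusSearchDefaultCluster
    write_clusters: List[str]  # mirrors $wgCirrusSearchWriteClusters

    def read_endpoint(self) -> str:
        """Reads go to whichever cluster is configured as the default."""
        return self.clusters[self.default_cluster]

    def write_endpoints(self) -> List[str]:
        """Writes go only to the clusters explicitly listed for writing."""
        return [self.clusters[name] for name in self.write_clusters]


# Placeholder hostnames for the per-datacenter clusters; only
# search.discovery.wmnet comes from the discussion above.
config = SearchConfig(
    clusters={
        "eqiad": "search.eqiad.example",
        "codfw": "search.codfw.example",
        "dnsdisc": "search.discovery.wmnet",  # hypothetical cluster name
    },
    default_cluster="dnsdisc",
    write_clusters=["eqiad", "codfw"],
)

if __name__ == "__main__":
    print("reads  ->", config.read_endpoint())
    print("writes ->", config.write_endpoints())
```

With this shape, moving read traffic between datacenters no longer touches the config at all: it depends only on where the discovery record points, while writes keep going to every real cluster.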

Realized while looking at this in mediawiki-config that we also need to deploy an envoy proxy that handles these connections.

Another loose end: we report the default search cluster via APIQuerySetInfoGeneralInfo, which is read by SRE tools to ensure we don't forget that traffic is directed at non-local clusters. We probably need to either kill those checks or replace them with something that looks at the etcd-based data.

Change 838182 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] envoy: Add service proxys for cirrussearch read traffic

https://gerrit.wikimedia.org/r/838182

Change 838270 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Add services for read operations

https://gerrit.wikimedia.org/r/838270

Change 838271 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] [WIP] Use discovery dns for cirrus read traffic

https://gerrit.wikimedia.org/r/838271

Change 838276 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Drop client side connect timeout config

https://gerrit.wikimedia.org/r/838276

The above patches would mostly take care of cirrussearch, but apifeatureusage and translate still need to be handled. Apifeatureusage should be easy, since it's read-only from the wiki side. Translate will need some consideration, because it needs different logic for reads and writes.

Change 838276 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: Drop client side connect timeout config

https://gerrit.wikimedia.org/r/838276

Mentioned in SAL (#wikimedia-operations) [2022-10-13T20:08:06Z] <samtar@deploy1002> Started scap: Backport for [[gerrit:838276|cirrus: Drop client side connect timeout config (T143553)]], [[gerrit:838269|cirrus: remove cross-dc poolcounter increases]]

Mentioned in SAL (#wikimedia-operations) [2022-10-13T20:08:27Z] <samtar@deploy1002> samtar and ebernhardson: Backport for [[gerrit:838276|cirrus: Drop client side connect timeout config (T143553)]], [[gerrit:838269|cirrus: remove cross-dc poolcounter increases]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-10-13T20:13:37Z] <samtar@deploy1002> Finished scap: Backport for [[gerrit:838276|cirrus: Drop client side connect timeout config (T143553)]], [[gerrit:838269|cirrus: remove cross-dc poolcounter increases]] (duration: 05m 31s)

Gehel raised the priority of this task from Medium to High. Nov 14 2022, 4:20 PM

The last datacenter switchover went without much issue. We need to review exactly how it went, but we can probably close this ticket if we are happy with the current situation.

@Gehel Let us know if you are still interested in pursuing this. Otherwise, feel free to close out.

dcausse subscribed.

I think we're still interested in this; all blockers have now been resolved.

Gehel changed the point value for this task from 2 to 5. Jun 5 2023, 3:46 PM