Switching search traffic between datacenters should be faster
Closed, Resolved | Public | 5 Estimated Story Points

Description

Currently, switching search traffic between datacenters requires a change in wmf-config. Traffic routing should be dynamic configuration, and it should be easy and fast to change when needed.

It has been suggested to use etcd as a dynamic configuration store for this kind of configuration.

https://wikitech.wikimedia.org/wiki/Incident_documentation/20160818-Elasticsearch

Event Timeline

Gehel changed the point value for this task from 2 to 5. (Jun 5 2023, 3:46 PM)

When rolling this out, we will also need to review envoy-related alerts and make sure they are updated to the new discovery DNS envoy cluster names.

Change #1136422 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/alerts@master] search: Update envoy alerts for discovery dns names

https://gerrit.wikimedia.org/r/1136422

Change #1136422 merged by jenkins-bot:

[operations/alerts@master] search: Update envoy alerts for discovery dns names

https://gerrit.wikimedia.org/r/1136422

Change #838182 merged by Bking:

[operations/puppet@production] envoy: Add service proxys for cirrussearch read traffic

https://gerrit.wikimedia.org/r/838182

Change #1143617 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/dns@master] search: add discovery records for secondary clusters

https://gerrit.wikimedia.org/r/1143617

Change #1143622 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] search: Update dnsdisc envoy upstreams

https://gerrit.wikimedia.org/r/1143622

Change #1143633 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cirrussearch: Add cluster-specific domain name as a SAN

https://gerrit.wikimedia.org/r/1143633

Change #1143891 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/dns@master] search: cname specific search clusters to the lvs pool

https://gerrit.wikimedia.org/r/1143891

Change #1143633 merged by Bking:

[operations/puppet@production] cirrussearch: Add cluster-specific domain name as a SAN

https://gerrit.wikimedia.org/r/1143633

Change #1143891 merged by Bking:

[operations/dns@master] search: cname specific search clusters to the lvs pool

https://gerrit.wikimedia.org/r/1143891

Change #1143617 merged by Bking:

[operations/dns@master] search: add discovery records for secondary clusters

https://gerrit.wikimedia.org/r/1143617

Change #1145278 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] etcd data for search-{psi,omega} dns discovery

https://gerrit.wikimedia.org/r/1145278

Order of operations: cnames in LVS (https://gerrit.wikimedia.org/r/c/operations/dns/+/1145276) -> etcd data for psi & omega (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1145278/) -> dyna records (https://gerrit.wikimedia.org/r/c/operations/dns/+/1145277) -> update dnsdisc envoy upstreams (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143622)

Change #1145278 merged by Bking:

[operations/puppet@production] etcd data for search-{psi,omega} dns discovery

https://gerrit.wikimedia.org/r/1145278

Change #1143622 merged by Bking:

[operations/puppet@production] search: Update dnsdisc envoy upstreams

https://gerrit.wikimedia.org/r/1143622

Change #1146075 had a related patch set uploaded (by Ryan Kemper; author: Ssingh):

[operations/puppet@production] Revert "etcd data for search-{psi,omega} dns discovery"

https://gerrit.wikimedia.org/r/1146075

Change #1146075 merged by Ryan Kemper:

[operations/puppet@production] Revert "etcd data for search-{psi,omega} dns discovery"

https://gerrit.wikimedia.org/r/1146075

Unfortunately, we had to roll back these changes for the second day in a row, and during the process, we set off alerts that paged on-call SREs.

The OpenSearch migration (T388610) is our primary focus for now, so let's revisit this ticket after the migration so we can give it the full attention it deserves.

Change #1151300 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] search: Add dnsdisc entries for omega and psi clusters

https://gerrit.wikimedia.org/r/1151300

Change #1151303 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/dns@master] Add search-{psi,omega}.svc.$dc.wmnet cnames

https://gerrit.wikimedia.org/r/1151303

Change #1151308 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] etcd data for search-{psi,omega} dns discovery

https://gerrit.wikimedia.org/r/1151308

Change #1151316 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] search: Update dnsdisc envoy upstreams

https://gerrit.wikimedia.org/r/1151316

Change #1151304 had a related patch set uploaded (by Ryan Kemper; author: Ebernhardson):

[operations/dns@master] search: Add search-{psi,omega} geoip discovery entries

https://gerrit.wikimedia.org/r/1151304

Change #1151308 merged by Ryan Kemper:

[operations/puppet@production] etcd data for search-{psi,omega} dns discovery

https://gerrit.wikimedia.org/r/1151308

Change #1151300 merged by Ryan Kemper:

[operations/puppet@production] search: Add dnsdisc entries for omega and psi clusters

https://gerrit.wikimedia.org/r/1151300

Change #1151304 merged by Ryan Kemper:

[operations/dns@master] search: Add search-{psi,omega} geoip discovery entries

https://gerrit.wikimedia.org/r/1151304

Change #1151303 merged by Ryan Kemper:

[operations/dns@master] Add search-{psi,omega}.svc.$dc.wmnet cnames

https://gerrit.wikimedia.org/r/1151303

Mentioned in SAL (#wikimedia-operations) [2025-06-11T18:08:45Z] <sukhe> sudo cumin 'A:lvs-secondary-eqiad or A:lvs-secondary-codfw' 'run-puppet-agent': T143553

Mentioned in SAL (#wikimedia-operations) [2025-06-11T18:10:57Z] <sukhe> sudo cumin 'A:lvs-low-traffic-eqiad or A:lvs-low-traffic-codfw' 'run-puppet-agent': T143553

Change #1151316 merged by Ryan Kemper:

[operations/puppet@production] search: Update dnsdisc envoy upstreams

https://gerrit.wikimedia.org/r/1151316

We deployed this per the plan outlined in https://phabricator.wikimedia.org/T143553#10861215 (with the addition of some authdns updates and a few other changes). Everything looks good; we should be ready to continue on the next MediaWiki train.

Change #838270 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: Add services for read operations

https://gerrit.wikimedia.org/r/838270

Mentioned in SAL (#wikimedia-operations) [2025-06-18T20:26:48Z] <ebernhardson@deploy1003> Started scap sync-world: Backport for [[gerrit:838270|cirrus: Add services for read operations (T143553)]]

Mentioned in SAL (#wikimedia-operations) [2025-06-18T20:29:07Z] <ebernhardson@deploy1003> ebernhardson: Backport for [[gerrit:838270|cirrus: Add services for read operations (T143553)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-06-18T20:37:59Z] <ebernhardson@deploy1003> Finished scap sync-world: Backport for [[gerrit:838270|cirrus: Add services for read operations (T143553)]] (duration: 11m 11s)

Change #838271 merged by jenkins-bot:

[operations/mediawiki-config@master] Use discovery dns for elasticsearch read traffic

https://gerrit.wikimedia.org/r/838271

Mentioned in SAL (#wikimedia-operations) [2025-06-18T20:54:38Z] <ebernhardson@deploy1003> Started scap sync-world: Backport for [[gerrit:838271|Use discovery dns for elasticsearch read traffic (T143553)]]

Mentioned in SAL (#wikimedia-operations) [2025-06-18T20:56:52Z] <ebernhardson@deploy1003> ebernhardson: Backport for [[gerrit:838271|Use discovery dns for elasticsearch read traffic (T143553)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-06-18T21:04:52Z] <ebernhardson@deploy1003> Finished scap sync-world: Backport for [[gerrit:838271|Use discovery dns for elasticsearch read traffic (T143553)]] (duration: 10m 14s)

This is mostly done: read traffic is all flowing through the dns-disc endpoints, and traffic can now be moved with the same tooling as everything else. It is not quite finished, as T397377 was opened for a change we noticed in the dashboards.

We should test moving traffic. This should be a simple conftool command, though running it requires SRE privileges.

In theory we should be able to depool codfw like so, causing all traffic to move to eqiad (from https://wikitech.wikimedia.org/wiki/DNS/Discovery):

for cluster in search-omega search-psi search; do
  sudo confctl --object-type discovery select "dnsdisc=${cluster},name=codfw" set/pooled=false
done

This can then be reversed with:

for cluster in search-omega search-psi search; do
  sudo confctl --object-type discovery select "dnsdisc=${cluster},name=codfw" set/pooled=true
done
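
Since the depool and repool loops differ only in the datacenter name and the pooled value, they could be wrapped in a small dry-run helper. The following is only a sketch: `search_dc_commands` and `SEARCH_CLUSTERS` are hypothetical names that are not part of conftool, and the function only prints the confctl commands from the loops above so they can be reviewed before anyone with the right access runs them.

```shell
# Hypothetical dry-run helper (not part of conftool): prints the confctl
# commands that pool or depool one datacenter for every search cluster.
# Nothing is executed against etcd; the commands are only echoed.

# Cluster names taken from the loops above.
SEARCH_CLUSTERS="search-omega search-psi search"

search_dc_commands() {
  local dc="$1" pooled="$2" cluster

  # Require a datacenter name and an explicit true/false value.
  if [ -z "$dc" ]; then
    echo "usage: search_dc_commands <dc> <true|false>" >&2
    return 1
  fi
  case "$pooled" in
    true|false) ;;
    *)
      echo "usage: search_dc_commands <dc> <true|false>" >&2
      return 1
      ;;
  esac

  # Emit one command per cluster, in the same form as the manual loops.
  for cluster in $SEARCH_CLUSTERS; do
    printf 'sudo confctl --object-type discovery select "dnsdisc=%s,name=%s" set/pooled=%s\n' \
      "$cluster" "$dc" "$pooled"
  done
}
```

For example, `search_dc_commands codfw false` prints the three depool commands and `search_dc_commands codfw true` prints the matching repool commands. Printing rather than executing keeps a review step in place before any traffic moves.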

Traffic test complete, moved as expected. Commands have been documented at https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Multi-DC_%2F_Multi-Cluster_Operations

Dashboard updates still need to be performed, but there is a separate ticket for that.