
14 March 2023 eqiad Service repooling
Closed, Resolved · Public

Description

A half-hour before (10:00 UTC)

  • !log Locking scap deployment for service switchover - T331541
  • Add a scap lock on deploy2002.codfw.wmnet echo "Deployment lock for service switchover - T331541" | sudo tee -a /var/lock/scap-global-lock
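Before moving on, it is worth confirming the lock actually took. A minimal sketch (the lock path is the one used in the step above; `check_scap_lock` is a hypothetical helper, not part of scap):

```shell
#!/bin/sh
# Sketch: confirm the global scap lock is in place before the switchover
# steps proceed. Defaults to the path used in the step above.
check_scap_lock() {
  lock="${1:-/var/lock/scap-global-lock}"
  if [ -f "$lock" ]; then
    # Print the reason recorded in the lock file
    echo "scap locked: $(cat "$lock")"
    return 0
  else
    echo "WARNING: no scap lock at $lock" >&2
    return 1
  fi
}
```

Run on the deployment host after the `tee` step; a non-zero exit means the lock is missing and deployments are not actually blocked.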

All services

  • !log Running sre.switchdc.mediawiki.00-optional-warmup-caches - T331541
  • Run sudo cookbook sre.switchdc.mediawiki.00-optional-warmup-caches --live-test eqiad codfw
  • !log Repooling all active/active services in eqiad - T331541
  • Run sudo cookbook sre.discovery.datacenter pool eqiad --reason "Datacenter Switchover - eqiad RO repool" --task-id T331541
  • Run sudo cookbook sre.discovery.service-route --reason T331541 depool --wipe-cache codfw restbase-async
  • Unlock scap sudo rm -v /var/lock/scap-global-lock
  • !log All active/active services in eqiad repooled, scap unlocked - T331541
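The sequence above could be wrapped in a single script so the log lines and cookbook invocations stay in lockstep. A sketch, with commands and flags copied from the checklist; `sal_log` is a hypothetical stand-in for the `!log` SAL announcement, and `main` is invoked explicitly on the cumin host:

```shell
#!/bin/sh
set -eu

TASK="T331541"

# Hypothetical helper standing in for the !log IRC/SAL announcement.
sal_log() { printf '!log %s - %s\n' "$1" "$TASK"; }

main() {
  sal_log "Running sre.switchdc.mediawiki.00-optional-warmup-caches"
  sudo cookbook sre.switchdc.mediawiki.00-optional-warmup-caches --live-test eqiad codfw

  sal_log "Repooling all active/active services in eqiad"
  sudo cookbook sre.discovery.datacenter pool eqiad \
    --reason "Datacenter Switchover - eqiad RO repool" --task-id "$TASK"
  sudo cookbook sre.discovery.service-route --reason "$TASK" \
    depool --wipe-cache codfw restbase-async

  # Release the global scap lock taken in the preparation step
  sudo rm -v /var/lock/scap-global-lock
  sal_log "All active/active services in eqiad repooled, scap unlocked"
}

# Invoke explicitly when running for real: main
```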

Event Timeline

Clement_Goubert updated the task description.
Clement_Goubert moved this task from Incoming 🐫 to this.quarter 🍕 on the serviceops board.
Clement_Goubert changed the task status from Open to In Progress. Mar 14 2023, 9:56 AM
Clement_Goubert updated the task description.

Mentioned in SAL (#wikimedia-operations) [2023-03-14T10:02:37Z] <claime> Locking scap deployment for service switchover - T331541

Mentioned in SAL (#wikimedia-operations) [2023-03-14T10:28:19Z] <claime> Running sre.switchdc.mediawiki.00-optional-warmup-caches - T331541

Mentioned in SAL (#wikimedia-operations) [2023-03-14T10:32:51Z] <claime> Repooling all active/active services in eqiad - T331541

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: Datacenter Switchover - eqiad RO repool - T331541 started.

Mentioned in SAL (#wikimedia-operations) [2023-03-14T10:33:13Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: Datacenter Switchover - eqiad RO repool - T331541

cgoubert@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: Datacenter Switchover - eqiad RO repool - T331541 completed.

Mentioned in SAL (#wikimedia-operations) [2023-03-14T10:47:52Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in eqiad: Datacenter Switchover - eqiad RO repool - T331541

Mentioned in SAL (#wikimedia-operations) [2023-03-14T10:48:12Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.service-route depool restbase-async in codfw: T331541

Mentioned in SAL (#wikimedia-operations) [2023-03-14T10:48:16Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool restbase-async in codfw: T331541

Mentioned in SAL (#wikimedia-operations) [2023-03-14T11:13:37Z] <claime> We are encountering unexpected DNS anycast issues following T331541; latencies are increased but there is no production outage.

Mentioned in SAL (#wikimedia-operations) [2023-03-14T11:51:35Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.service-route depool appservers-ro in eqiad: T331541

Mentioned in SAL (#wikimedia-operations) [2023-03-14T11:52:21Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool appservers-ro in eqiad: T331541

Mentioned in SAL (#wikimedia-operations) [2023-03-14T12:08:14Z] <cgoubert@cumin1001> START - Cookbook sre.discovery.service-route pool appservers-ro in eqiad: T331541

Mentioned in SAL (#wikimedia-operations) [2023-03-14T12:13:18Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool appservers-ro in eqiad: T331541

Mentioned in SAL (#wikimedia-operations) [2023-03-14T14:16:41Z] <claime> All active/active services in eqiad repooled, DNS issues resolved - T331541

We ran into a PowerDNS configuration issue: instead of traffic being spread over both datacenters, RO traffic was switched entirely to eqiad.
Google doc for the issue

This has been fixed by https://gerrit.wikimedia.org/r/898738, and we are now in a stable state with eqiad pooled for RO traffic.