Rollout use of mcrouter for MediaWiki in production
Closed, Resolved · Public

Description

Finishing the MediaWiki config side of the mcrouter deployment requires a series of staged steps. As planned, each step is followed by a waiting period of a day or more before the next one begins.

Steps

  • Direct cache writes to both nutcracker and mcrouter for mediawiki.org (https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/440469/); wait 1 day
  • Direct cache writes to both nutcracker and mcrouter for all wikis (https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/440470/). This doubles cache writes and space usage until the multi-write stage is over; wait 1 day
  • Switch cache reads to mcrouter on testwiki/mediawiki.org; wait 3 days
  • Switch cache reads to mcrouter on all wikis; wait 1 week
  • Remove nutcracker from cache write operations. From this point on, rollback is trickier: it requires either restarting the cache servers or relying on purgeChangedFiles.php/purgeChangedPages.php; wait 1 day
  • Enable prefix-based wildcard purges for mcrouter for testwikis/mw.org. Any rollback to nutcracker would need to revert this too, because nutcracker does not understand wildcard purges: it would purge keys literally named after the wildcard pattern, which in practice purges nothing; wait 2 days
  • Enable prefix-based wildcard purges for mcrouter for all wikis.
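
The read/write stages above can be modelled with a small sketch. This is only an illustration in Python, not the actual mediawiki-config change: `DictCache` and `StagedCache` are hypothetical names, the backends are plain dicts, and it assumes the two proxies end up holding separate copies of each key (which is what the note about doubled space usage suggests).

```python
from dataclasses import dataclass, field


@dataclass
class DictCache:
    """Stand-in for one cache pool (dict-backed, for illustration only)."""
    name: str
    data: dict = field(default_factory=dict)

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value

    def delete(self, key):
        self.data.pop(key, None)


class StagedCache:
    """Hypothetical wrapper: writes fan out to `write_to`, reads use `read_from`."""

    def __init__(self, read_from, write_to):
        self.read_from = read_from
        self.write_to = list(write_to)

    def get(self, key):
        return self.read_from.get(key)

    def set(self, key, value):
        for backend in self.write_to:
            backend.set(key, value)

    def delete(self, key):
        for backend in self.write_to:
            backend.delete(key)


nutcracker = DictCache("nutcracker")
mcrouter = DictCache("mcrouter")

# Multi-write stage: read from nutcracker, write to both pools.
cache = StagedCache(read_from=nutcracker, write_to=[nutcracker, mcrouter])
cache.set("page:1", "v1")

# Read-switch stage: reads move to mcrouter, already warmed by the dual writes.
cache = StagedCache(read_from=mcrouter, write_to=[nutcracker, mcrouter])
warm_read = cache.get("page:1")

# Write-removal stage: nutcracker no longer receives writes, so its copy goes stale.
cache = StagedCache(read_from=mcrouter, write_to=[mcrouter])
cache.set("page:1", "v2")
stale_in_old_pool = nutcracker.get("page:1")
```

In this model the last stage leaves an outdated value behind in the old pool, which is one way to see why a rollback after that point needs a cache restart or a targeted purge rather than a plain config revert.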

Monitor

  • Relevant logstash channels: ObjectCache, memcached, and the mediawiki-errors aggregate channel
  • Relevant grafana dashboards: https://grafana.wikimedia.org/dashboard/db/prometheus-memcached-dc-stats?orgId=1
  • Graphite: "MediaWiki.wanobjectcache.*.hit.good.rate" and "MediaWiki.wanobjectcache.*.miss.compute.rate" should look sane during all steps (since warmup is involved)
  • Other things to watch: performance.wikimedia.org graphs
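
One way to make "should look sane" concrete is to compare the hit ratio before and after a step. The sketch below is an assumption, not part of the rollout tooling: the function names, the 10-point threshold, and the sample rates are all made up for illustration, and the inputs stand for the summed `hit.good.rate` and `miss.compute.rate` values read off Graphite.

```python
def hit_ratio(hit_rate, miss_rate):
    """Fraction of lookups served from cache; None when there is no traffic."""
    total = hit_rate + miss_rate
    if total <= 0:
        return None
    return hit_rate / total


def looks_sane(baseline, current, max_drop=0.10):
    """True if the ratio fell by no more than `max_drop` (an arbitrary threshold)."""
    if baseline is None or current is None:
        return False
    return (baseline - current) <= max_drop


# Illustrative numbers only, not real measurements.
before = hit_ratio(hit_rate=900.0, miss_rate=100.0)
after = hit_ratio(hit_rate=850.0, miss_rate=150.0)
step_ok = looks_sane(before, after)
```

A temporary dip is expected right after a read switch because of warmup, so a check like this is more useful on a window taken some time after the deploy than on the first few minutes.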

Known issues and patches:

Other patches:

Restricted Application added a subscriber: Aklapper. · Jun 26 2018, 6:38 PM
Imarlier added a subscriber: Imarlier.
Krinkle moved this task from Backlog to Doing on the Availability (MediaWiki-MultiDC) board.
Krinkle renamed this task from Production MediaWiki mcrouter use rollout to Rollout use of mcrouter for MediaWiki in production.

Change 440469 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[operations/mediawiki-config@master] Make mediawiki.org write to both nutcracker and mcrouter

https://gerrit.wikimedia.org/r/440469

Joe added a subscriber: Joe. · Jul 2 2018, 9:34 AM

+1 to the overall plan; I'd like to see dates attached to the various steps now, so that we can have a clear schedule.

Imarlier moved this task from Inbox to Next-up on the Performance-Team board. · Jul 2 2018, 8:27 PM
aaron added a comment. · Jul 3 2018, 3:13 PM

> +1 to the overall plan; I'd like to see dates attached to the various steps now, so that we can have a clear schedule.

I want to know that T197450 is not related first. With that out of the way, I can set some SWAT dates.

Change 440469 merged by jenkins-bot:
[operations/mediawiki-config@master] Make test wikis just write to both nutcracker and mcrouter

https://gerrit.wikimedia.org/r/440469

aaron added a comment. · Jul 5 2018, 2:01 PM

> Change 440469 merged by jenkins-bot:
> [operations/mediawiki-config@master] Make test wikis just write to both nutcracker and mcrouter
>
> https://gerrit.wikimedia.org/r/440469

Followed shortly by https://gerrit.wikimedia.org/r/443970 for mw.org

Change 440470 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/mediawiki-config@master] Make all non-test wikis write to both nutcracker and mcrouter

https://gerrit.wikimedia.org/r/440470

aaron updated the task description. · Jul 5 2018, 2:18 PM

Change 443977 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Migrate BreadCrumbs extension to Quibble

https://gerrit.wikimedia.org/r/443977

Change 443977 merged by jenkins-bot:
[integration/config@master] Migrate BreadCrumbs extension to Quibble

https://gerrit.wikimedia.org/r/443977

Change 440470 merged by jenkins-bot:
[operations/mediawiki-config@master] Make all non-test wikis write to both nutcracker and mcrouter

https://gerrit.wikimedia.org/r/440470

Change 444932 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/mediawiki-config@master] Make all non-test wikis write to both nutcracker and mcrouter again

https://gerrit.wikimedia.org/r/444932

Change 444932 merged by jenkins-bot:
[operations/mediawiki-config@master] Make all non-test wikis write to both nutcracker and mcrouter again

https://gerrit.wikimedia.org/r/444932

aaron updated the task description. · Jul 10 2018, 6:37 PM

Mentioned in SAL (#wikimedia-operations) [2018-07-11T20:57:51Z] <krinkle@deploy1001> Synchronized wmf-config/mc.php: Ifa659de6453 - Revert multi-write mcrouter for most wikis - T198239 (duration: 00m 58s)

Change 445314 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@master] objectcache: make BagOStuff::mergeViaLock() timeout more sensible

https://gerrit.wikimedia.org/r/445314

Change 445012 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@master] objectcache: minor fix to MultiWriteBagOStuff::doWrite()

https://gerrit.wikimedia.org/r/445012

Krinkle updated the task description. · Jul 12 2018, 3:08 AM
Krinkle updated the task description. · Jul 12 2018, 3:13 AM
Krinkle updated the task description. · Jul 12 2018, 3:24 AM
Krinkle updated the task description. · Jul 12 2018, 5:13 AM

Change 445040 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@master] [WIP] Make MultiWriteBagOStuff use the native merge() of each backend

https://gerrit.wikimedia.org/r/445040

Change 445427 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@master] objectcache: improve logging and error handling in BagOStuff

https://gerrit.wikimedia.org/r/445427

Krinkle updated the task description. · Jul 12 2018, 8:32 PM
Krinkle updated the task description. · Jul 17 2018, 12:45 AM
Krinkle updated the task description.
Krinkle updated the task description.
Krinkle updated the task description. · Jul 17 2018, 12:47 AM
Krinkle updated the task description. · Jul 17 2018, 12:55 AM

Change 445012 merged by jenkins-bot:
[mediawiki/core@master] objectcache: minor fix to MultiWriteBagOStuff::doWrite()

https://gerrit.wikimedia.org/r/445012

Change 445314 merged by jenkins-bot:
[mediawiki/core@master] objectcache: make BagOStuff::mergeViaLock() timeout more sensible

https://gerrit.wikimedia.org/r/445314

Change 446342 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.32.0-wmf.12] objectcache: make BagOStuff::mergeViaLock() timeout more sensible

https://gerrit.wikimedia.org/r/446342

aaron updated the task description. · Jul 17 2018, 3:30 PM

Change 446342 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.12] objectcache: make BagOStuff::mergeViaLock() timeout more sensible

https://gerrit.wikimedia.org/r/446342

Change 445040 merged by jenkins-bot:
[mediawiki/core@master] Make MultiWriteBagOStuff use the native merge() of each backend

https://gerrit.wikimedia.org/r/445040

Change 446763 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.32.0-wmf.13] Make MultiWriteBagOStuff use the native merge() of each backend

https://gerrit.wikimedia.org/r/446763

Change 446763 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.13] Make MultiWriteBagOStuff use the native merge() of each backend

https://gerrit.wikimedia.org/r/446763

Change 445427 merged by jenkins-bot:
[mediawiki/core@master] objectcache: improve logging and error handling in BagOStuff

https://gerrit.wikimedia.org/r/445427

Change 447819 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/mediawiki-config@master] Revert "Revert "Make all non-test wikis write to both nutcracker and mcrouter again""

https://gerrit.wikimedia.org/r/447819

Krinkle updated the task description. · Jul 27 2018, 12:01 AM

Change 448172 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.32.0-wmf.14] objectcache: improve logging and error handling in BagOStuff

https://gerrit.wikimedia.org/r/448172

Change 448172 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.14] objectcache: improve logging and error handling in BagOStuff

https://gerrit.wikimedia.org/r/448172

Krinkle updated the task description. · Jul 27 2018, 2:20 AM

Change 447819 merged by jenkins-bot:
[operations/mediawiki-config@master] Make all wikis write to both nutcracker and mcrouter (3)

https://gerrit.wikimedia.org/r/447819

Mentioned in SAL (#wikimedia-operations) [2018-07-30T18:25:51Z] <thcipriani@deploy1001> Synchronized wmf-config/mc.php: SWAT: [[gerrit:447819|Make all wikis write to both nutcracker and mcrouter (3)]] T198239 (duration: 00m 48s)

aaron moved this task from Next-up to Doing on the Performance-Team board. · Jul 30 2018, 8:19 PM

Change 449603 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/mediawiki-config@master] Use mcrouter for cache reads for test wikis

https://gerrit.wikimedia.org/r/449603

Change 449604 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/mediawiki-config@master] Use mcrouter for cache reads on all wikis

https://gerrit.wikimedia.org/r/449604

Change 449605 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/mediawiki-config@master] Only do cache writes to mcrouter for all wikis

https://gerrit.wikimedia.org/r/449605

Change 449606 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/mediawiki-config@master] Allow broadcasted mcrouter cache operations for purges

https://gerrit.wikimedia.org/r/449606

Change 449603 merged by jenkins-bot:
[operations/mediawiki-config@master] Use mcrouter for cache reads for test wikis

https://gerrit.wikimedia.org/r/449603

aaron updated the task description. · Aug 1 2018, 5:24 PM
Krinkle updated the task description. · Aug 1 2018, 8:09 PM
Krinkle updated the task description.

Change 449604 merged by jenkins-bot:
[operations/mediawiki-config@master] Use mcrouter for cache reads on all wikis

https://gerrit.wikimedia.org/r/449604

aaron updated the task description. · Aug 13 2018, 7:59 PM

Change 449605 merged by jenkins-bot:
[operations/mediawiki-config@master] Only do cache writes to mcrouter for all wikis

https://gerrit.wikimedia.org/r/449605

Change 449606 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable broadcasted mcrouter cache operations for test wikis and mw.org

https://gerrit.wikimedia.org/r/449606

Change 452592 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/mediawiki-config@master] Enable broadcasted mcrouter operations for all wikis

https://gerrit.wikimedia.org/r/452592

aaron updated the task description. · Aug 18 2018, 5:51 AM

Change 452592 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable broadcasted mcrouter operations for all wikis

https://gerrit.wikimedia.org/r/452592

aaron closed this task as Resolved. · Aug 21 2018, 8:02 AM
aaron updated the task description.