Rollout use of mcrouter for MediaWiki in production
Open, Needs Triage, Public

Description

Finishing the MediaWiki config side of the mcrouter deploy requires a series of steps done in stages. As planned, each step is followed by a wait of a day or more before the next one begins.

  • Direct cache writes to both nutcracker and mcrouter for mediawiki.org (https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/440469/); wait 1 day
  • Direct cache writes to both nutcracker and mcrouter for all wikis (https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/440470/); wait 1 day (this will double cache writes and space usage until this multi-write stage is over)
  • Switch cache reads to mcrouter on testwiki/mediawiki.org; wait 3 days
  • Switch cache reads to mcrouter on all wikis; wait 1 week
  • Remove nutcracker from cache write operations. From this point on rollback is trickier: it requires either a restart of the cache servers or relying on purgeChangedFiles.php/purgeChangedPages.php; wait 1 day
  • Enable prefix-based wildcard purges for mcrouter for testwikis/mw.org. Any rollback to nutcracker would need to revert this too, because nutcracker does not understand wildcard purges: it would only purge keys literally carrying those names, so in practice nothing would be purged; wait 2 days
  • Enable prefix-based wildcard purges for mcrouter for all wikis.
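The multi-write and read-switch stages above can be pictured with a toy model. This is a minimal sketch in Python, not MediaWiki's actual implementation: DictCache and MultiWriteCache are hypothetical names, the two in-memory dictionaries merely stand in for the nutcracker and mcrouter pools, and the key name is made up. It assumes reads are served by the first tier only, while writes go to every tier.

```python
class DictCache:
    """Toy in-memory stand-in for one cache pool (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value


class MultiWriteCache:
    """Writes go to every tier; reads are served by the first tier only."""

    def __init__(self, tiers):
        self.tiers = list(tiers)

    def get(self, key):
        return self.tiers[0].get(key)

    def set(self, key, value):
        for tier in self.tiers:
            tier.set(key, value)


nutcracker = DictCache("nutcracker")
mcrouter = DictCache("mcrouter")

# Multi-write stage: reads still come from nutcracker, writes hit both pools.
multi_write = MultiWriteCache([nutcracker, mcrouter])
multi_write.set("example:key", "value")

# Read-switch stage: same two pools, but mcrouter now serves reads.
read_switched = MultiWriteCache([mcrouter, nutcracker])
warm_value = read_switched.get("example:key")  # already present in mcrouter

# Final write stage: nutcracker is dropped, so it stops receiving updates.
mcrouter_only = MultiWriteCache([mcrouter])
mcrouter_only.set("example:key", "newer value")
stale_value = nutcracker.get("example:key")  # still holds the old value
```

Because every write during the multi-write stage lands in both pools, the mcrouter side is already warm when reads move over, at the cost of the doubled writes and space noted in the plan. Once nutcracker stops receiving writes its contents go stale, which is presumably why rollback after that step needs a cache restart or the purge scripts.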

Relevant Logstash channels: ObjectCache, memcached, and the mediawiki-errors aggregate channel
Relevant Grafana dashboards: https://grafana.wikimedia.org/dashboard/db/prometheus-memcached-dc-stats?orgId=1
Graphite: "MediaWiki.wanobjectcache.*.hit.good.rate" and "MediaWiki.wanobjectcache.*.miss.compute.rate" should look sane during all steps (the multi-write stages warm up the mcrouter side before reads switch to it)
Other things to watch: performance.wikimedia.org graphs
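One way to make "look sane" concrete is to turn the two Graphite rates into a hit ratio and compare it before and during a step. The sketch below is only an illustration in Python: hit_ratio and looks_sane are hypothetical helpers, the 0.05 tolerance is an arbitrary placeholder, the sample numbers are invented, and treating hit.good and miss.compute as the only two outcomes is a simplification. Nothing here queries Graphite.

```python
def hit_ratio(hit_rate, miss_rate):
    """Fraction of lookups served from cache, or None when there is no traffic."""
    total = hit_rate + miss_rate
    if total <= 0:
        return None
    return hit_rate / total


def looks_sane(baseline_ratio, current_ratio, max_drop=0.05):
    """True if the hit ratio fell by no more than max_drop (hypothetical tolerance)."""
    if baseline_ratio is None or current_ratio is None:
        return False
    return baseline_ratio - current_ratio <= max_drop


# Invented sample rates (events per second), purely for illustration.
baseline = hit_ratio(hit_rate=900.0, miss_rate=100.0)
during_step = hit_ratio(hit_rate=880.0, miss_rate=120.0)
cold_cache = hit_ratio(hit_rate=300.0, miss_rate=700.0)

small_dip_ok = looks_sane(baseline, during_step)   # small dip within tolerance
cold_cache_ok = looks_sane(baseline, cold_cache)   # large drop, e.g. reads hitting a cold pool
```

A large drop of the kind modelled by cold_cache is what the warm-up stages are meant to prevent, so it would be a signal to pause before the next step.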

Known issues and patches:

Other patches:

aaron created this task. Tue, Jun 26, 6:38 PM
Restricted Application added a subscriber: Aklapper. Tue, Jun 26, 6:38 PM
Imarlier added a subscriber: Imarlier.
Krinkle moved this task from Backlog to Doing on the Availability (MediaWiki-MultiDC) board.
Krinkle renamed this task from Production MediaWiki mcrouter use rollout to Rollout use of mcrouter for MediaWiki in production.

Change 440469 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[operations/mediawiki-config@master] Make mediawiki.org write to both nutcracker and mcrouter

https://gerrit.wikimedia.org/r/440469

Joe added a subscriber: Joe. Mon, Jul 2, 9:34 AM

+1 to the overall plan; I'd like to see dates attached to the various steps now, so that we can have a clear schedule.

Imarlier moved this task from Inbox to Next-up on the Performance-Team board. Mon, Jul 2, 8:27 PM
aaron added a comment. Tue, Jul 3, 3:13 PM

> +1 to the overall plan; I'd like to see dates attached to the various steps now, so that we can have a clear schedule.

I want to know that T197450 is not related first. With that out of the way, I can set some SWAT dates.

Change 440469 merged by jenkins-bot:
[operations/mediawiki-config@master] Make test wikis just write to both nutcracker and mcrouter

https://gerrit.wikimedia.org/r/440469

aaron added a comment. Thu, Jul 5, 2:01 PM

> Change 440469 merged by jenkins-bot:
> [operations/mediawiki-config@master] Make test wikis just write to both nutcracker and mcrouter
>
> https://gerrit.wikimedia.org/r/440469

Followed shortly by https://gerrit.wikimedia.org/r/443970 for mw.org

Change 440470 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/mediawiki-config@master] Make all non-test wikis write to both nutcracker and mcrouter

https://gerrit.wikimedia.org/r/440470

aaron updated the task description. Thu, Jul 5, 2:18 PM

Change 443977 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Migrate BreadCrumbs extension to Quibble

https://gerrit.wikimedia.org/r/443977

Change 443977 merged by jenkins-bot:
[integration/config@master] Migrate BreadCrumbs extension to Quibble

https://gerrit.wikimedia.org/r/443977

Change 440470 merged by jenkins-bot:
[operations/mediawiki-config@master] Make all non-test wikis write to both nutcracker and mcrouter

https://gerrit.wikimedia.org/r/440470

Change 444932 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/mediawiki-config@master] Make all non-test wikis write to both nutcracker and mcrouter again

https://gerrit.wikimedia.org/r/444932

Change 444932 merged by jenkins-bot:
[operations/mediawiki-config@master] Make all non-test wikis write to both nutcracker and mcrouter again

https://gerrit.wikimedia.org/r/444932

aaron updated the task description. Tue, Jul 10, 6:37 PM

Mentioned in SAL (#wikimedia-operations) [2018-07-11T20:57:51Z] <krinkle@deploy1001> Synchronized wmf-config/mc.php: Ifa659de6453 - Revert multi-write mcrouter for most wikis - T198239 (duration: 00m 58s)

Change 445314 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@master] objectcache: make BagOStuff::mergeViaLock() timeout more sensible

https://gerrit.wikimedia.org/r/445314

Change 445012 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@master] objectcache: minor fix to MultiWriteBagOStuff::doWrite()

https://gerrit.wikimedia.org/r/445012

Krinkle updated the task description. Thu, Jul 12, 3:08 AM
Krinkle updated the task description. Thu, Jul 12, 3:13 AM
Krinkle updated the task description. Thu, Jul 12, 3:24 AM
Krinkle updated the task description. Thu, Jul 12, 5:13 AM

Change 445040 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@master] [WIP] Make MultiWriteBagOStuff use the native merge() of each backend

https://gerrit.wikimedia.org/r/445040

Change 445427 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@master] objectcache: improve logging and error handling in BagOStuff

https://gerrit.wikimedia.org/r/445427

Krinkle updated the task description. Thu, Jul 12, 8:32 PM
Krinkle updated the task description. Tue, Jul 17, 12:45 AM
Krinkle updated the task description.
Krinkle updated the task description.
Krinkle updated the task description. Tue, Jul 17, 12:47 AM
Krinkle updated the task description. Tue, Jul 17, 12:55 AM

Change 445012 merged by jenkins-bot:
[mediawiki/core@master] objectcache: minor fix to MultiWriteBagOStuff::doWrite()

https://gerrit.wikimedia.org/r/445012

Change 445314 merged by jenkins-bot:
[mediawiki/core@master] objectcache: make BagOStuff::mergeViaLock() timeout more sensible

https://gerrit.wikimedia.org/r/445314

Change 446342 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.32.0-wmf.12] objectcache: make BagOStuff::mergeViaLock() timeout more sensible

https://gerrit.wikimedia.org/r/446342

aaron updated the task description. Tue, Jul 17, 3:30 PM

Change 446342 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.12] objectcache: make BagOStuff::mergeViaLock() timeout more sensible

https://gerrit.wikimedia.org/r/446342