reinstall eqiad memcache servers with jessie
Closed, Resolved (Public)

Description

The mc* servers in codfw already run jessie; we should reimage the eqiad memcache servers with jessie too.

Details

Related Gerrit Patches:
operations/mediawiki-config : master : Removed mc1003 from the Lock Manager pools for maintenance.
operations/puppet : production : Add mc1003 back to the redis/memcached pools after maintenance.
operations/puppet : production : Added a partman option to mc.cfg to allow fully automated partitioning.
operations/puppet : production : Fix the mcXXXX partman config to allow fully automated PXE OS installs.
operations/puppet : production : Remove mc1003 from the redis/memcached pools for maintenance.
operations/mediawiki-config : master : Remove mc1002 from the Lock Manager pool for maintenance.
operations/puppet : production : Add mc1002 back to the redis/memcached pools after maintenance.
operations/puppet : production : Remove mc1002 from redis/memcached pools for maintenance.
operations/mediawiki-config : master : Add mc1001.eqiad back to the Redis Lock Managers after maintenance.
operations/puppet : production : Add mc1001.eqiad back to the redis/memcached pool after maintenance.
operations/puppet : production : Remove mc1001 from the redis/memcached pools for maintenance
operations/mediawiki-config : master : Remove mc1001.eqiad from the list of lock managers for maintenance.
operations/puppet : production : Add Debian PXE boot option to mc100[123] servers.
operations/puppet : production : Add mc1017/mc1018 back to the memcached/redis pools after maintenance.
operations/puppet : production : Remove mc1017/mc1018 from the redis/memcached pools for maintenance.
operations/puppet : production : Add mc1016 back to the redis/memcached pools after maintenance.
operations/puppet : production : Remove mc1016 from the redis/memcached pools for maintenance.
operations/puppet : production : Add mc1015.eqiad.wmnet back in the redis/memcached pools.
operations/puppet : production : Remove mc1015.eqiad.wmnet from the redis-memcached pool for maintenance.
operations/puppet : production : Add mc1014.eqiad back to redis/memcached pool after maintenance.
operations/puppet : production : Remove mc1014 from memcached/redis pools for maintenance.
operations/puppet : production : Add mc1012/mc1013 to redis/memcached pools after maintenance.
operations/puppet : production : Remove mc1012/mc1013 from the redis/memcached pools for maintenance.
operations/puppet : production : Add mc1010/mc1011 back to the redis/memcached pools after maintenance.
operations/puppet : production : Remove mc1010/mc1011 from redis/memcached pools for maintenance
operations/puppet : production : Add mc1008/mc1009 back into redis/memcached pools after maintenance.
operations/puppet : production : Remove mc1008/mc1009 from redis/memcached pool for maintenance
operations/puppet : production : Add mc1007.eqiad back in service after maintenance.
operations/puppet : production : Temporary remove of mc1007 from the memcached/redis pool for maintenance.
operations/puppet : production : Add mc1006 back into redis/memcached pools after maintenance.
operations/puppet : production : Temporary remove of mc1006.eqiad from the redis/memcached pool for maintenance.
operations/puppet : production : Adding support for PXE Jessie Installer to mcXXXX hosts.
operations/puppet : production : Add mc1005.eqiad back into redis/memcached pools
operations/puppet : production : Temporary removing mc1005 from the redis/memcached pools.
operations/puppet : production : Add Jessie option for PXE boot to memcached/redis servers. Temporary removing mc1005 for maintenance.
operations/puppet : production : dhcp: switch mc1004/1005 to jessie installer

Event Timeline

elukey added a comment. Feb 9 2016, 6:48 PM

For some reason the last code reviews didn't get into this phab task:

de-pool mc1004: https://gerrit.wikimedia.org/r/#/c/269378/
re-pool mc1004: https://gerrit.wikimedia.org/r/#/c/269422

Each time a host is removed from or added to the pool, we observe ~30 minutes of errors like:

Memcached error for key "enwiki:messages:en:lock" on server "/var/run/nutcracker/nutcracker.sock:0": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY

This does not seem to come from nutcracker but from MediaWiki (more precisely, its memcached plugin for PHP) reacting to unreachable caching nodes; nutcracker returns some errors until the point at which the server is auto-ejected from the pool.

The impact on users is almost none, apart from a possible latency increase due to cache misses.
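
The error text above is consistent with a client-side policy in which a server is marked as failed after repeated errors and only retried after a timeout. The following is a minimal sketch of that general pattern, not the actual code of MediaWiki's memcached client or of nutcracker; the class name, the failure limit and the retry timeout are all hypothetical.

```python
import time


class FailoverServer:
    """Toy model of a 'disabled until timed retry' policy (hypothetical names and values)."""

    def __init__(self, failure_limit=2, retry_timeout=30.0, clock=time.monotonic):
        self.failure_limit = failure_limit   # consecutive errors before disabling
        self.retry_timeout = retry_timeout   # seconds to wait before retrying
        self.clock = clock                   # injectable clock, handy for testing
        self.failures = 0
        self.disabled_until = None

    def is_available(self):
        # Available if never disabled, or if the retry timeout has elapsed.
        return self.disabled_until is None or self.clock() >= self.disabled_until

    def record_failure(self):
        # Count the error and disable the server once the limit is reached.
        self.failures += 1
        if self.failures >= self.failure_limit:
            self.disabled_until = self.clock() + self.retry_timeout

    def record_success(self):
        # A successful request clears the failure state.
        self.failures = 0
        self.disabled_until = None
```

With a policy like this, requests for keys owned by a de-pooled host fail fast until the retry window expires, which would show up as cache misses rather than user-visible errors.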

elukey added a comment. Feb 9 2016, 6:49 PM

Next steps:

  1. Work with Joe on https://phabricator.wikimedia.org/T124761 to get wmf-reimage up and running again.
  2. Roll out Jessie to the other nodes following the above procedure (de-pool/re-pool from hiera each time).
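
The de-pool/re-image/re-pool loop described in step 2 could be sketched roughly as follows. This is only an illustration of the ordering: the four callables are hypothetical hooks (in practice each step was a reviewed Gerrit change or a manual operation), and the health check is an assumption that the procedure above does not spell out.

```python
def rolling_reimage(hosts, depool, reimage, is_healthy, repool):
    """Reimage hosts one at a time; stop at the first host that fails its check.

    All four callables are hypothetical hooks supplied by the caller.
    Returns the list of hosts that were completed and re-pooled.
    """
    done = []
    for host in hosts:
        depool(host)        # e.g. merge the hiera change removing the host
        reimage(host)       # e.g. PXE-boot the host into the Jessie installer
        if not is_healthy(host):
            break           # leave the host de-pooled for investigation
        repool(host)        # e.g. merge the hiera change adding it back
        done.append(host)
    return done
```

Stopping at the first unhealthy host keeps the blast radius to one shard at a time.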

Adding a related task handled by Joe as reference: https://phabricator.wikimedia.org/T126395

I didn't check that Redis/Memcached were correctly bound to 0.0.0.0 rather than 127.0.0.1, and this caused user impact while editing. The fix has been rolled out: https://gerrit.wikimedia.org/r/#/c/269616/1
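
A check along these lines could be added to a post-reimage checklist. The sketch below only classifies a configured bind address; the function name is hypothetical, and it does not inspect a running daemon or account for firewalls.

```python
import ipaddress


def accepts_remote_clients(bind_address):
    """Return True if a listener bound to this address is reachable from other hosts.

    A loopback bind (127.0.0.0/8, ::1) only accepts local connections, while
    the unspecified address (0.0.0.0, ::) listens on every interface.
    """
    addr = ipaddress.ip_address(bind_address)
    return not addr.is_loopback
```

For example, `accepts_remote_clients("127.0.0.1")` is `False`, while `accepts_remote_clients("0.0.0.0")` is `True`.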

Change 269626 had a related patch set uploaded (by Elukey):
Add Jessie option for PXE boot to memcached/redis servers. Temporary removing mc1005 for maintenance.

https://gerrit.wikimedia.org/r/269626

Change 269626 abandoned by Elukey:
Add Jessie option for PXE boot to memcached/redis servers. Temporary removing mc1005 for maintenance.

Reason:
Splitting the change in two code reviews

https://gerrit.wikimedia.org/r/269626

Change 269639 had a related patch set uploaded (by Elukey):
Temporary removing mc1005 from the redis/memcached pools.

https://gerrit.wikimedia.org/r/269639

Change 269639 merged by Elukey:
Temporary removing mc1005 from the redis/memcached pools.

https://gerrit.wikimedia.org/r/269639

Change 269660 had a related patch set uploaded (by Elukey):
Add mc1005.eqiad back into redis/memcached pools

https://gerrit.wikimedia.org/r/269660

Change 269660 merged by Elukey:
Add mc1005.eqiad back into redis/memcached pools

https://gerrit.wikimedia.org/r/269660

Change 269668 had a related patch set uploaded (by Elukey):
Adding support for PXE Jessie Installer to mcXXXX hosts.

https://gerrit.wikimedia.org/r/269668

Change 269668 merged by Elukey:
Adding support for PXE Jessie Installer to mcXXXX hosts.

https://gerrit.wikimedia.org/r/269668

Change 269677 had a related patch set uploaded (by Elukey):
Temporary remove of mc1006.eqiad from the redis/memcached pool for maintenance.

https://gerrit.wikimedia.org/r/269677

Change 269677 merged by Elukey:
Temporary remove of mc1006.eqiad from the redis/memcached pool for maintenance.

https://gerrit.wikimedia.org/r/269677

Change 269715 had a related patch set uploaded (by Elukey):
Add mc1006 back into redis/memcached pools after maintenance.

https://gerrit.wikimedia.org/r/269715

Change 269715 merged by Elukey:
Add mc1006 back into redis/memcached pools after maintenance.

https://gerrit.wikimedia.org/r/269715

Change 269921 had a related patch set uploaded (by Elukey):
Temporary remove of mc1007 from the memcached/redis pool for maintenance.

https://gerrit.wikimedia.org/r/269921

Change 269921 merged by Elukey:
Temporary remove of mc1007 from the memcached/redis pool for maintenance.

https://gerrit.wikimedia.org/r/269921

Change 269934 had a related patch set uploaded (by Elukey):
Add mc1007.eqiad back in service after maintenance.

https://gerrit.wikimedia.org/r/269934

Change 269934 merged by Elukey:
Add mc1007.eqiad back in service after maintenance.

https://gerrit.wikimedia.org/r/269934

Change 269939 had a related patch set uploaded (by Elukey):
Remove mc1008/mc1009 from redis/memcached pool for maintenance.

https://gerrit.wikimedia.org/r/269939

Change 269939 merged by Elukey:
Remove mc1008/mc1009 from redis/memcached pool for maintenance

https://gerrit.wikimedia.org/r/269939

Change 269953 had a related patch set uploaded (by Elukey):
Add mc1008/mc1009 back into redis/memcached pools after maintenance.

https://gerrit.wikimedia.org/r/269953

Change 269953 merged by Elukey:
Add mc1008/mc1009 back into redis/memcached pools after maintenance.

https://gerrit.wikimedia.org/r/269953

Change 269961 had a related patch set uploaded (by Elukey):
Remove mc1010/mc1011 from redis/memcached pools for maintenance

https://gerrit.wikimedia.org/r/269961

Change 269961 merged by Elukey:
Remove mc1010/mc1011 from redis/memcached pools for maintenance

https://gerrit.wikimedia.org/r/269961

Change 269982 had a related patch set uploaded (by Elukey):
Add mc1010/mc1011 back to the redis/memcached pools after maintenance.

https://gerrit.wikimedia.org/r/269982

Change 269982 merged by Elukey:
Add mc1010/mc1011 back to the redis/memcached pools after maintenance.

https://gerrit.wikimedia.org/r/269982

Change 269993 had a related patch set uploaded (by Elukey):
Remove mc1012/mc1013 from the redis/memcached pools for maintenance.

https://gerrit.wikimedia.org/r/269993

Change 269993 merged by Elukey:
Remove mc1012/mc1013 from the redis/memcached pools for maintenance.

https://gerrit.wikimedia.org/r/269993

Change 270014 had a related patch set uploaded (by Elukey):
Add mc1012/mc1013 to redis/memcached pools after maintenance.

https://gerrit.wikimedia.org/r/270014

Change 270014 merged by Elukey:
Add mc1012/mc1013 to redis/memcached pools after maintenance.

https://gerrit.wikimedia.org/r/270014

Legoktm added a subscriber: Legoktm.

Stopping the work due to a latency regression spotted by Ori: https://phabricator.wikimedia.org/T126700

Johan added a subscriber: Johan. Feb 12 2016, 12:09 PM

OK, a couple of questions as this was tagged with user-notice (thanks, Legoktm!):

*) Will this affect Wikimedians who won't find out any other way? Who needs to know, and what is the effect on everyone else?

*) How would you phrase this?

"(Advanced item) The Wikimedia memory cache servers in the eqiad cluster have now been upgraded to Debian 8 (Jessie)"?

Or will be, given that the work has been stopped.

Joe added a comment. Feb 12 2016, 12:25 PM

To be very clear: we don't know for sure (or at all) if the latency spike was due to this change.

Hi @Johan,

I am going to try to answer your questions:

*) Will this affect Wikimedians who won't find out any other way? Who needs to know, and what is the effect on everyone else?
*) How would you phrase this?

Two separate issues have occurred so far, namely:

  • user session loss due to my work: each server in the mcXXXX pool handles a subset of the sessions, which is not replicated on the other servers. During the past days there have been time windows (~30 mins each) in which users' sessions might have been dropped. I have already sent an email to wikitech-ambassadors explaining what happened, but let me know if I need to do more.
  • the time taken to save a user edit has nearly doubled in the past two days (p75, so ~25% of the users have been impacted), as tracked in https://phabricator.wikimedia.org/T126700. We are not really sure whether the problem is a consequence of my work (reimaged servers running with MediaWiki caches not yet warmed up) or a symptom of something else. We are going to wait until Monday to see whether the latency regression goes away, and then we should be able to say something conclusive.
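
To make the "p75, so ~25%" reasoning concrete: the 75th percentile is the value that roughly a quarter of the samples meet or exceed, so a doubling at p75 means the slowest quarter of saves got noticeably slower. A small sketch with purely synthetic numbers (not real save timings) follows.

```python
import statistics


def p75(samples):
    """Return the 75th percentile (third quartile) of the samples."""
    return statistics.quantiles(samples, n=4)[2]


def share_at_or_above(samples, threshold):
    """Fraction of samples whose value is at or above the threshold."""
    return sum(1 for s in samples if s >= threshold) / len(samples)


# Synthetic save times in milliseconds, for illustration only.
save_times_ms = list(range(1, 101))
slow_share = share_at_or_above(save_times_ms, p75(save_times_ms))
```

Here `slow_share` comes out at a quarter of the samples, matching the ~25% figure quoted above.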

"(Advanced item) The Wikimedia memory cache servers in the eqiad cluster have now been upgraded to Debian 8 (Jessie)"?

This is the final goal, but at the moment we are halfway through. I was supposed to migrate more servers today, but we decided to stop and wait until Monday.

Hopefully I haven't written any inaccuracies; Joe, please correct me if I said something horribly wrong.

@Johan: feel free to follow up with me on IRC (elukey) if you have more questions!

Luca

@elukey: Thanks, and sorry for my late reply; I was travelling over the weekend. My original question was actually less about the problems potentially related to the process and more about who is affected by the change once it is done, and how. (:

https://phabricator.wikimedia.org/T126700 is still a blocker for this task; the rest of the hosts will probably be re-imaged in the second half of this week or next week.

Summary:

mc1001 -> mc1003 are still on Ubuntu
mc1004 -> mc1013 are on Debian
mc1014 -> mc1018 are still on Ubuntu
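
For tracking purposes, the summary above can be expanded into a per-host inventory. The sketch below only restates the three ranges listed above and counts the hosts still to be re-imaged; the helper and variable names are hypothetical.

```python
def expand(prefix, first, last):
    """Expand an inclusive numeric range into host names, e.g. mc1001..mc1003."""
    return [f"{prefix}{n}" for n in range(first, last + 1)]


# Ranges and OS labels are taken from the summary above.
STATUS = {}
for first, last, os_name in [(1001, 1003, "ubuntu"),
                             (1004, 1013, "debian"),
                             (1014, 1018, "ubuntu")]:
    for host in expand("mc", first, last):
        STATUS[host] = os_name

# Hosts that still need to be re-imaged with Debian.
pending = sorted(h for h, os_name in STATUS.items() if os_name == "ubuntu")
```

This gives 18 hosts in total, 10 already on Debian and 8 still pending.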

@Johan: I am planning to re-image two hosts today since I am no longer blocked by the latency regression. The main side effect is that 2/18 of the total user sessions will be dropped today, and those users will be forced to log in again. There might also be some sporadic login issues while I am doing the work, but I don't expect them to be significant.
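
The 2/18 estimate follows from sessions being spread evenly over 18 non-replicated shards. A rough simulation is sketched below; the modulo placement and the SHA-256 hash are assumptions made for illustration (the production proxy uses its own key distribution), and the two wiped hosts are picked arbitrarily.

```python
import hashlib

SERVERS = [f"mc{n}" for n in range(1001, 1019)]  # 18 hosts, as in this task


def shard_for(session_key, servers=SERVERS):
    """Map a key to one server with a stable hash (modulo placement is an assumption)."""
    digest = hashlib.sha256(session_key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def fraction_lost(keys, wiped):
    """Share of keys whose only copy lives on one of the wiped servers."""
    return sum(1 for k in keys if shard_for(k) in wiped) / len(keys)


# Synthetic session keys; wiping any two hosts should lose roughly 2/18 (~11%).
keys = [f"session:{i}" for i in range(20000)]
lost = fraction_lost(keys, {"mc1014", "mc1015"})
```

Because no shard is replicated, the lost fraction depends only on how many hosts are wiped in one window, not on which ones.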

I need to re-image 8 hosts very soon, so the plan is to start with 2 today, 3 tomorrow and the remaining 3 on Friday.

I am going to send an email to wikitech-ambassadors to alert them; really sorry for the short notice. Please let me know on IRC if any major problem comes up.

Thanks!

Luca

@Johan: quick update, I'll start tomorrow EU morning, not today.

Change 273191 had a related patch set uploaded (by Elukey):
Remove mc1014 from memcached/redis pools for maintenance.

https://gerrit.wikimedia.org/r/273191

Change 273191 merged by Elukey:
Remove mc1014 from memcached/redis pools for maintenance.

https://gerrit.wikimedia.org/r/273191

Change 273257 had a related patch set uploaded (by Elukey):
Add mc1014.eqiad back to redis/memcached pool after maintenance.

https://gerrit.wikimedia.org/r/273257

Change 273257 merged by Elukey:
Add mc1014.eqiad back to redis/memcached pool after maintenance.

https://gerrit.wikimedia.org/r/273257

Change 273272 had a related patch set uploaded (by Elukey):
Remove mc1015.eqiad.wmnet from the redis/memcached pool for maintenance.

https://gerrit.wikimedia.org/r/273272

Change 273272 merged by Elukey:
Remove mc1015.eqiad.wmnet from the redis-memcached pool for maintenance.

https://gerrit.wikimedia.org/r/273272

Change 273299 had a related patch set uploaded (by Elukey):
Add mc1015.eqiad.wmnet back in the redis/memcached pools.

https://gerrit.wikimedia.org/r/273299

Change 273299 merged by Elukey:
Add mc1015.eqiad.wmnet back in the redis/memcached pools.

https://gerrit.wikimedia.org/r/273299

Change 273415 had a related patch set uploaded (by Elukey):
Remove mc1016 from the redis/memcached pools for maintenance.

https://gerrit.wikimedia.org/r/273415

Change 273415 merged by Elukey:
Remove mc1016 from the redis/memcached pools for maintenance.

https://gerrit.wikimedia.org/r/273415

Change 273424 had a related patch set uploaded (by Elukey):
Add mc1016 back to the redis/memcached pools after maintenance.

https://gerrit.wikimedia.org/r/273424

Change 273424 merged by Elukey:
Add mc1016 back to the redis/memcached pools after maintenance.

https://gerrit.wikimedia.org/r/273424

Change 273430 had a related patch set uploaded (by Elukey):
Remove mc1017/mc1018 from the redis/memcached pools for maintenance.

https://gerrit.wikimedia.org/r/273430

Change 273430 merged by Elukey:
Remove mc1017/mc1018 from the redis/memcached pools for maintenance.

https://gerrit.wikimedia.org/r/273430

Change 273443 had a related patch set uploaded (by Elukey):
Add mc1017/mc1018 back to the memcached/redis pools after maintenance.

https://gerrit.wikimedia.org/r/273443

Change 273443 merged by Elukey:
Add mc1017/mc1018 back to the memcached/redis pools after maintenance.

https://gerrit.wikimedia.org/r/273443

Elitre added a subscriber: Elitre. Feb 29 2016, 3:15 PM

Change 273949 had a related patch set uploaded (by Elukey):
Remove mc1001.eqiad from the list of lock managers for maintenance.

https://gerrit.wikimedia.org/r/273949

Change 273951 had a related patch set uploaded (by Elukey):
Add Debian PXE boot option to mc100[123] servers.

https://gerrit.wikimedia.org/r/273951

Change 273951 merged by Elukey:
Add Debian PXE boot option to mc100[123] servers.

https://gerrit.wikimedia.org/r/273951

Change 273952 had a related patch set uploaded (by Elukey):
Remove mc1001 from the redis/memcached pools for maintenance

https://gerrit.wikimedia.org/r/273952

Change 273949 merged by Elukey:
Remove mc1001.eqiad from the list of lock managers for maintenance.

https://gerrit.wikimedia.org/r/273949

Change 273952 merged by Elukey:
Remove mc1001 from the redis/memcached pools for maintenance

https://gerrit.wikimedia.org/r/273952

Change 273969 had a related patch set uploaded (by Elukey):
Add mc1001.eqiad back to the redis/memcached pool after maintenance.

https://gerrit.wikimedia.org/r/273969

Change 273969 merged by Elukey:
Add mc1001.eqiad back to the redis/memcached pool after maintenance.

https://gerrit.wikimedia.org/r/273969

Change 273970 had a related patch set uploaded (by Elukey):
Add mc1001.eqiad back to the Redis Lock Managers after maintenance.

https://gerrit.wikimedia.org/r/273970

Change 273970 merged by Elukey:
Add mc1001.eqiad back to the Redis Lock Managers after maintenance.

https://gerrit.wikimedia.org/r/273970

Change 274065 had a related patch set uploaded (by Elukey):
Remove mc1002 from redis/memcached pools for maintenance.

https://gerrit.wikimedia.org/r/274065

Change 274065 merged by Elukey:
Remove mc1002 from redis/memcached pools for maintenance.

https://gerrit.wikimedia.org/r/274065

Change 274068 had a related patch set uploaded (by Elukey):
Remove mc1002 from the Lock Manager pool for maintenance.

https://gerrit.wikimedia.org/r/274068

elukey added a comment. Mar 1 2016, 9:45 AM

@Elitre, @Johan: Hi! Today I am going to work on the servers holding user sessions again, so I expect some intermittent issues, such as users being forced to log in again. Hopefully this is the last step, after which the work will be complete. Please let me know on IRC (elukey) if you have any issues or questions.

Thanks!

Change 274068 merged by Elukey:
Remove mc1002 from the Lock Manager pool for maintenance.

https://gerrit.wikimedia.org/r/274068

Change 274071 had a related patch set uploaded (by Elukey):
Fix the mcXXXX partman config to allow fully automated PXE OS installs.

https://gerrit.wikimedia.org/r/274071

Change 274077 had a related patch set uploaded (by Elukey):
Add mc1002 back to the redis/memcached pools after maintenance.

https://gerrit.wikimedia.org/r/274077

Change 274077 merged by Elukey:
Add mc1002 back to the redis/memcached pools after maintenance.

https://gerrit.wikimedia.org/r/274077

Change 274097 had a related patch set uploaded (by Elukey):
Remove mc1003 from the redis/memcached pools for maintenance.

https://gerrit.wikimedia.org/r/274097

Change 274097 merged by Elukey:
Remove mc1003 from the redis/memcached pools for maintenance.

https://gerrit.wikimedia.org/r/274097

Change 274100 had a related patch set uploaded (by Elukey):
Removed mc1003 from the Lock Manager pools for maintenance.

https://gerrit.wikimedia.org/r/274100

Change 274100 merged by Elukey:
Removed mc1003 from the Lock Manager pools for maintenance.

https://gerrit.wikimedia.org/r/274100

Change 274071 merged by Elukey:
Fix the mcXXXX partman config to allow fully automated PXE OS installs.

https://gerrit.wikimedia.org/r/274071

Change 274106 had a related patch set uploaded (by Elukey):
Added a partman option to mc.cfg to allow fully automated partitioning.

https://gerrit.wikimedia.org/r/274106

Change 274106 merged by Elukey:
Added a partman option to mc.cfg to allow fully automated partitioning.

https://gerrit.wikimedia.org/r/274106

Change 274111 had a related patch set uploaded (by Elukey):
Add mc1003 back to the redis/memcached pools after maintenance.

https://gerrit.wikimedia.org/r/274111

Change 274111 merged by Elukey:
Add mc1003 back to the redis/memcached pools after maintenance.

https://gerrit.wikimedia.org/r/274111

elukey added a comment. Mar 1 2016, 6:23 PM

Task completed, all the mc servers migrated.

elukey closed this task as Resolved. Mar 1 2016, 6:24 PM
elukey added a subscriber: elukey.
Dzahn awarded a token. Mar 2 2016, 12:38 AM