Rack/Setup new memcache servers mc1019-36
Closed, Resolved (Public)

Description

Need to find out how these should be spread around. @faidon How do you want these racked? Will they go to 10G racks?

Memcache1019-1036

  • receive in normally
  • rack in A6/B6/C5/D4?
  • bios and ilom setup
  • network port setup (description/enable/labs support vlan)
  • update dns records (mgmt and production)
  • update install_server module (Jessie, db partition recipe for hw raid)
  • install OS
  • accept/sign puppet/salt keys
  • service implementation

Related Objects

Status: Resolved; Assigned: elukey

Event Timeline

Cmjohnson added a subtask: Unknown Object (Task). Jun 8 2016, 8:07 PM
Cmjohnson renamed this task from Rack/Setup (18) new memcache Servers to Rack/Setupnew memcache servers mc1019-36. Jun 8 2016, 8:09 PM
Cmjohnson updated the task description.
Cmjohnson updated the task description.
Cmjohnson renamed this task from Rack/Setupnew memcache servers mc1019-36 to Rack/Setup new memcache servers mc1019-36. Jun 9 2016, 11:48 AM
Cmjohnson updated the task description.
Cmjohnson added subscribers: faidon, BBlack, mark.

Tried installing; a few boxes installed, but mc1020-21, mc1023-25, mc1030-33 and mc1035 give the message below.

The highlighted entry will be executed automatically in 0s.
Unable to find LVM volume mc1020-vg/root

Volume group "mc1020-vg" not found
Skipping volume group mc1020-vg

Unable to find LVM volume mc1020-vg/root

Volume group "mc1020-vg" not found
Skipping volume group mc1020-vg

Unable to find LVM volume mc1020-vg/root

Volume group "mc1020-vg" not found
Skipping volume group mc1020-vg

Unable to find LVM volume mc1020-vg/root
Gave up waiting for root device. Common problems:

  • Boot args (cat /proc/cmdline)
    • Check rootdelay= (did the system wait long enough?)
    • Check root= (did the system wait for the right device?)
  • Missing modules (cat /proc/modules; ls /dev)

ALERT! /dev/mapper/mc1020--vg-root does not exist. Dropping to a shell!
modprobe: module ehci-orion not found in modules.dep

BusyBox v1.22.1 (Debian 1:1.22.0-9+deb8u1) built-in shell (ash)
Enter 'help' for a list of built-in commands.

/bin/sh: can't access tty; job control turned off
(initramfs)

@faidon or @BBlack could you look at this when you get a chance.

RobH closed subtask Unknown Object (Task) as Resolved. Oct 12 2016, 5:46 PM

I'm guessing it's probably a bad partman recipe for the new hardware? In any case, these seem to be sitting in a half-installed state presently.

I added rootdelay=60 and mc1020 booted correctly, so we can keep going with the installation. I am going to dedicate some time to them next week!
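
For illustration only, here is a minimal Python sketch of what adding `rootdelay=60` to a kernel command line amounts to. The helper name `with_rootdelay` and the sample command line are hypothetical; the actual boot arguments used on mc1020 are not shown in this task.

```python
def with_rootdelay(cmdline, seconds=60):
    """Return cmdline with rootdelay set to seconds, replacing any existing value."""
    # Drop any rootdelay= argument that is already present.
    args = [a for a in cmdline.split() if not a.startswith("rootdelay=")]
    # Append the new value so the initramfs waits longer for the root device.
    args.append(f"rootdelay={seconds}")
    return " ".join(args)

# Hypothetical command line, built from the device name in the error above.
sample = "root=/dev/mapper/mc1020--vg-root ro quiet"
print(with_rootdelay(sample))
```

The longer delay gives the initramfs more time to wait for the root device before giving up, which is one of the "common problems" the boot message itself suggests checking.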

elukey triaged this task as Medium priority. Oct 21 2016, 2:38 PM

Mentioned in SAL (#wikimedia-operations) [2016-10-21T14:50:37Z] <elukey> reimaging mc1020 with wmf-auto-reimage (T137345)

Mentioned in SAL (#wikimedia-operations) [2016-10-21T15:28:31Z] <elukey> reimaging mc1019 with wmf-auto-reimage (T137345)

Mentioned in SAL (#wikimedia-operations) [2016-10-21T16:05:19Z] <elukey> reimaging mc1021 with wmf-auto-reimage (T137345)

Change 317171 had a related patch set uploaded (by Elukey):
Remove mc1019 from role memcached (new host)

https://gerrit.wikimedia.org/r/317171

Change 317171 merged by Elukey:
Remove mc1019 from role memcached (new host)

https://gerrit.wikimedia.org/r/317171

Mentioned in SAL (#wikimedia-operations) [2016-10-24T07:46:33Z] <elukey> reimaging mc1022.eqiad.wmnet (T137345)

Hosts reimaged:

elukey@neodymium:~$ sudo -i salt -E 'mc10(19|2[0-9]|3[0-6]).eqiad.wmnet' cmd.run 'uname -a'
mc1034.eqiad.wmnet:
    Linux mc1034 4.4.0-2-amd64 #1 SMP Debian 4.4.2-3+wmf6 (2016-10-18) x86_64 GNU/Linux
mc1025.eqiad.wmnet:
    Linux mc1025 4.4.0-2-amd64 #1 SMP Debian 4.4.2-3+wmf6 (2016-10-18) x86_64 GNU/Linux
mc1024.eqiad.wmnet:
    Linux mc1024 4.4.0-2-amd64 #1 SMP Debian 4.4.2-3+wmf6 (2016-10-18) x86_64 GNU/Linux
mc1033.eqiad.wmnet:
    Linux mc1033 4.4.0-2-amd64 #1 SMP Debian 4.4.2-3+wmf6 (2016-10-18) x86_64 GNU/Linux
mc1029.eqiad.wmnet:
    Linux mc1029 4.4.0-2-amd64 #1 SMP Debian 4.4.2-3+wmf6 (2016-10-18) x86_64 GNU/Linux
mc1030.eqiad.wmnet:
    Linux mc1030 4.4.0-2-amd64 #1 SMP Debian 4.4.2-3+wmf6 (2016-10-18) x86_64 GNU/Linux
mc1023.eqiad.wmnet:
    Linux mc1023 4.4.0-2-amd64 #1 SMP Debian 4.4.2-3+wmf6 (2016-10-18) x86_64 GNU/Linux
mc1028.eqiad.wmnet:
    Linux mc1028 4.4.0-2-amd64 #1 SMP Debian 4.4.2-3+wmf6 (2016-10-18) x86_64 GNU/Linux
mc1027.eqiad.wmnet:
    Linux mc1027 4.4.0-2-amd64 #1 SMP Debian 4.4.2-3+wmf6 (2016-10-18) x86_64 GNU/Linux
mc1019.eqiad.wmnet:
    Linux mc1019 4.4.0-2-amd64 #1 SMP Debian 4.4.2-3+wmf6 (2016-10-18) x86_64 GNU/Linux
mc1026.eqiad.wmnet:
    Linux mc1026 4.4.0-2-amd64 #1 SMP Debian 4.4.2-3+wmf6 (2016-10-18) x86_64 GNU/Linux
mc1021.eqiad.wmnet:
    Linux mc1021 4.4.0-2-amd64 #1 SMP Debian 4.4.2-3+wmf6 (2016-10-18) x86_64 GNU/Linux
mc1020.eqiad.wmnet:
    Linux mc1020 4.4.0-2-amd64 #1 SMP Debian 4.4.2-3+wmf6 (2016-10-18) x86_64 GNU/Linux
mc1035.eqiad.wmnet:
    Linux mc1035 4.4.0-2-amd64 #1 SMP Debian 4.4.2-3+wmf6 (2016-10-18) x86_64 GNU/Linux
mc1022.eqiad.wmnet:
    Linux mc1022 4.4.0-2-amd64 #1 SMP Debian 4.4.2-3+wmf6 (2016-10-18) x86_64 GNU/Linux
mc1036.eqiad.wmnet:
    Linux mc1036 4.4.0-2-amd64 #1 SMP Debian 4.4.2-3+wmf6 (2016-10-18) x86_64 GNU/Linux
mc1031.eqiad.wmnet:
    Linux mc1031 4.4.0-2-amd64 #1 SMP Debian 4.4.2-3+wmf6 (2016-10-18) x86_64 GNU/Linux
mc1032.eqiad.wmnet:
    Linux mc1032 4.4.0-2-amd64 #1 SMP Debian 4.4.2-3+wmf6 (2016-10-18) x86_64 GNU/Linux
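
As a sanity check on the Salt target used above, the following Python sketch (not part of the original task) applies the same regular expression to a range of candidate hostnames and counts the matches. The candidate range mc1001 to mc1040 is an arbitrary assumption chosen for illustration, and the dots are escaped here for strictness.

```python
import re

# Same pattern as the Salt -E target above, with literal dots escaped.
PATTERN = re.compile(r"mc10(19|2[0-9]|3[0-6])\.eqiad\.wmnet")

def matching_hosts(first=1001, last=1040):
    """Return candidate hostnames in [first, last] that fully match PATTERN."""
    candidates = [f"mc{n}.eqiad.wmnet" for n in range(first, last + 1)]
    return [h for h in candidates if PATTERN.fullmatch(h)]

hosts = matching_hosts()
print(len(hosts), hosts[0], hosts[-1])
```

The pattern selects mc1019 through mc1036, which is 18 hosts, the same number that answered in the output above.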

These hosts don't have memcached/redis yet since site.pp doesn't know anything about them. So next steps should be:

  1. sanity check all the new hosts
  2. add them to puppet, figuring out network related config like IPsec
  3. establish a migration strategy to deprecate mc1001->mc1018. As far as I know we could simply replace one node at a time in mediawiki_memcached_servers and mediawiki::redis_servers::eqiad. Not sure if we'll be able to keep outstanding user sessions, so it might be good to follow up with community liaisons to announce it properly.
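
The "one node at a time" idea in step 3 can be sketched as swapping a single entry in a server list while leaving every other entry untouched. The `replace_shard` helper and the three-entry list below are hypothetical and do not reflect the real contents of mediawiki_memcached_servers.

```python
def replace_shard(servers, old, new):
    """Return a copy of servers with old swapped for new at the same index."""
    if old not in servers:
        raise ValueError(f"{old} is not in the server list")
    # Build a new list so the caller's original list is not modified.
    return [new if s == old else s for s in servers]

# Hypothetical three-shard list; the real list is longer.
servers = ["mc1001", "mc1002", "mc1003"]
print(replace_shard(servers, "mc1001", "mc1019"))
```

Changing one entry per step keeps each change small and easy to revert if the new host misbehaves.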

@Joe let me know if you have concerns/suggestions/etc.

Change 319278 had a related patch set uploaded (by Elukey):
Add new mc* servers to site.pp with role:spare

https://gerrit.wikimedia.org/r/319278

Change 319278 abandoned by Elukey:
Add new mc* servers to site.pp with role:spare

Reason:
Following Moritz's suggestions, I had a misconception of role:spare.

https://gerrit.wikimedia.org/r/319278

Change 323517 had a related patch set uploaded (by Elukey):
Avoid Redis IPsec replication if the host doesn't need it.

https://gerrit.wikimedia.org/r/323517

Change 323517 merged by Elukey:
Avoid Redis IPsec replication if the host doesn't need it.

https://gerrit.wikimedia.org/r/323517

Change 323805 had a related patch set uploaded (by Elukey):
Add mc1019 to site.pp

https://gerrit.wikimedia.org/r/323805

Change 323807 had a related patch set uploaded (by Elukey):
Add temporary override to mc1019 hiera config to allow Redis config

https://gerrit.wikimedia.org/r/323807

Change 332983 had a related patch set uploaded (by Elukey):
WIP - Add base Redis instance if no MW shard is configured.

https://gerrit.wikimedia.org/r/332983

Change 323807 abandoned by Elukey:
[WIP] Add temporary dc to Redis config to allow a eqiad replica

https://gerrit.wikimedia.org/r/323807

Change 332983 abandoned by Elukey:
WIP - Add base Redis instance if no MW shard is configured.

https://gerrit.wikimedia.org/r/332983

Change 337010 had a related patch set uploaded (by Elukey):
Apply role memcached to the new mc1* hosts

https://gerrit.wikimedia.org/r/337010

After a long investigation (and a lot of changes!) we are almost ready to switch the first shard (mc1001 -> mc1019) following this procedure:

  • Replace shard IPs in puppet - https://gerrit.wikimedia.org/r/#/c/336972/
  • Disable puppet on mc1001 and mc2019 (assumption: no change triggered for the other hosts)
  • Code review merge and force puppet run on mc1019 (this will install and configure properly Redis)
  • Force puppet on mw1* to update the nutcracker's config using the Cumin filter 'R:Class = Mediawiki::Nutcracker and *.eqiad.wmnet'
  • Force Redis on mc1019 to become slaveof mc1001 (ensuring the replication works fine afterwards)
  • Set Redis on mc1019 as a write-enabled replica
  • Restart Nutcracker on one host to make sure that the new config is ok and does not generate tons of errors.
  • systemctl restart nutcracker executed via Cumin on hosts 'R:Class = Mediawiki::Nutcracker and *.eqiad.wmnet'
  • Remove slaveof settings from Redis on mc1019
  • Enable and execute puppet on mc1001 and mc2019 (mc2019 will become slaveof mc1019)
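
The Redis steps in the procedure above can be sketched as an ordered list of commands to run on the new host. This is only an illustration, not a record of what was actually executed: the port 6379 and the helper name `redis_cutover_commands` are assumptions, and `CONFIG SET slave-read-only no` is one way to make a replica accept writes on Redis versions of that era.

```python
def redis_cutover_commands(old_host, port=6379):
    """Ordered Redis commands to run on the NEW host during a shard swap.

    old_host is the shard being retired (e.g. mc1001); port is an assumed default.
    """
    return [
        # 1. Start replicating from the old shard.
        f"SLAVEOF {old_host} {port}",
        # 2. Accept writes while still replicating (write-enabled replica).
        "CONFIG SET slave-read-only no",
        # 3. Once nutcracker points at the new host, detach from the old one.
        "SLAVEOF NO ONE",
    ]

for cmd in redis_cutover_commands("mc1001.eqiad.wmnet"):
    print(cmd)
```

The nutcracker restarts happen between the second and third command, so the new host already holds the old shard's data and accepts writes when clients start connecting to it.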

Note that we will use Cumin and not Salt!

If this procedure turns out to be sound and safe, we'll proceed with the other shards. There might be some user impact while restarting nutcracker due to the split-brain scenario, but in my opinion that is an acceptable compromise for an invasive task like this one.

Thanks @elukey ! While we're at it with (de)commissioning hosts from nutcracker there's also a config change to listen for stats on localhost which would be nice to have merged too: https://gerrit.wikimedia.org/r/#/c/324642

Change 337010 merged by Elukey:
[operations/puppet@production] Apply role memcached to the new mc1* hosts

https://gerrit.wikimedia.org/r/337010

So the new hosts are in puppet with role memcached, and since we are very close to the switchover it might be better to defer the switch until codfw is the active DC.

That makes sense, but I'm wondering if we can replace one of the hosts prior to the switchover, so that we test it with Linux 4.9 (so that we know that it's fine with 4.9, and can upgrade all of codfw after completing the switchover and all of eqiad during the switchover).

Yes we definitely can, the procedure is written above and we can target mc1019 (decommissioning mc1001).

Maybe we can schedule it for next week?

Change 350422 had a related patch set uploaded (by Giuseppe Lavagetto):
[operations/puppet@production] site.pp: move mc1013-18 to spares, mc1031-36 into rotation

https://gerrit.wikimedia.org/r/350422

Change 350422 merged by Giuseppe Lavagetto:
[operations/puppet@production] site.pp: move mc1013-18 to spares, mc1031-36 into rotation

https://gerrit.wikimedia.org/r/350422

Change 350549 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Swap mc1001->mc1012 with mc1019->mc2030

https://gerrit.wikimedia.org/r/350549

Change 351254 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/mediawiki-config@master] Replace mc100[123] with mc10(19|2[01]) after hw refresh

https://gerrit.wikimedia.org/r/351254

Mentioned in SAL (#wikimedia-operations) [2017-05-02T07:59:20Z] <elukey> Swap mc1001->mc1012 with mc1019->mc2030 - T137345 (more informative :)

Change 350549 merged by Elukey:
[operations/puppet@production] Swap mc1001->mc1012 with mc1019->mc2030

https://gerrit.wikimedia.org/r/350549

Mentioned in SAL (#wikimedia-operations) [2017-05-02T08:32:36Z] <elukey> stop and mask redis on mc1001-mc1018 - T137345

Change 351254 merged by Elukey:
[operations/mediawiki-config@master] Replace Redis lock IPs (mc100[123]) after hw refresh

https://gerrit.wikimedia.org/r/351254