
codfw:rack/setup mc2019-mc2036
Closed, Resolved · Public

Description

This task will track the racking and setup of the 18 new memcached servers received in T151930.

Racking information

Row A
mc2019 rack A1
mc2020 rack A5
mc2021 rack A8
mc2022 rack A8

Row B
mc2023 rack B1
mc2024 rack B5
mc2025 rack B8
mc2026 rack B8

Row C
mc2027 rack C1
mc2028 rack C1
mc2029 rack C4
mc2030 rack C5
mc2031 rack C5

Row D
mc2032 rack D1
mc2033 rack D4
mc2034 rack D5
mc2035 rack D5
mc2036 rack D8

  • receive in and attach packing slip to parent task T151930
  • rack systems, update racktables
  • create mgmt DNS entries (both asset tag and hostname)
  • create production DNS entries (internal VLAN)
  • update/create sub-task with network port info for all new hosts
  • install_server module update (MAC address and partitioning info; partition like existing mc systems)
  • install OS
  • puppet/salt accept
  • hand off to @elukey for service implementation

@Joe, @RobH: please review the racking information and add comments.
Thanks.

Event Timeline

So we could start putting some of these mc systems in racks with mw systems, which would eliminate the need to have any two mc systems in the same rack. That may be overkill, however; with this proposed plan, no more than two mc systems are ever in the same rack.

I'm fine with this, and it also happens to be exactly what was recommended in IRC, so looks good to me.

Papaul triaged this task as Medium priority. Jan 19 2017, 6:02 PM

I quickly tested salt/puppet to help out and everything seems to be working as expected, except for mc2033 and mc2034 (powered down?).

Change 335208 had a related patch set uploaded (by Elukey):
Extend role memcached to the new codfw mc hosts

https://gerrit.wikimedia.org/r/335208

So a couple of remarks before starting:

  1. Role memcached is currently applied to 18 hosts in eqiad and 16 hosts in codfw. mc2001 and mc2016 are each running two instances of Redis rather than only one, to take care of the two extra shards.
  2. Redis will not be deployed on the new hosts until their IPs are configured in redis_server.yaml.

So as a first step I'd switch one shard and monitor it for a couple of days before proceeding with the rest. If we want to replace mc2001 with mc2019, the procedure might be:

  1. disable puppet on mc1001 and mc2001 (shard01) as a precautionary measure.
  2. merge a puppet change to replace mc2001's IP with mc2019's in the codfw section of redis_server.yaml.
  3. run puppet on mc2019 to verify that Redis gets installed alongside IPsec (even if we'll need to wait for mc2001's puppet run to get the transport fully running).
  4. run puppet on mc1001 and verify the replication with mc2019 (a rough check sketch follows this list).
  5. run puppet on mc2001.
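
For step 4, something along these lines could be used to eyeball the replication state from both ends. This is only a rough sketch, assuming the phpredis extension is available and Redis listens on its default port 6379; the host names come from this task and the password is a placeholder:

<?php
// Rough replication check, not the actual verification procedure.
$redisPassword = 'CHANGEME'; // placeholder

$master = new Redis();
$master->connect( 'mc1001.eqiad.wmnet', 6379 );
$master->auth( $redisPassword );
// On the eqiad master we expect role:master and connected_slaves >= 1.
print_r( $master->info( 'replication' ) );

$replica = new Redis();
$replica->connect( 'mc2019.codfw.wmnet', 6379 );
$replica->auth( $redisPassword );
// On the new codfw host we expect role:slave and master_link_status:up.
print_r( $replica->info( 'replication' ) );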

@Joe how does it sound?

I think it would be good to merge https://gerrit.wikimedia.org/r/#/c/319820/ first (possibly with an additional Hiera knob to only affect the newly set up hosts; it will save us upgrade headaches in the long run).

Change 335208 merged by Elukey:
Extend role memcached to the new codfw mc hosts

https://gerrit.wikimedia.org/r/335208

Change 335449 had a related patch set uploaded (by Elukey):
Replace mc2001 with mc2019 in Mediawiki Redis shards

https://gerrit.wikimedia.org/r/335449

Mentioned in SAL (#wikimedia-operations) [2017-02-02T10:53:03Z] <elukey> Swap mc2001 with mc2019 (Redis codfw replicas) - T155755

Change 335449 merged by Elukey:
Replace mc2001 with mc2019 in Mediawiki Redis/Memcached shards

https://gerrit.wikimedia.org/r/335449

Change 335627 had a related patch set uploaded (by Elukey):
Replace mc200[23] Redis/Memcached codfw shards with mc202[01]

https://gerrit.wikimedia.org/r/335627

Change 335627 merged by Elukey:
Replace mc200[23] Redis/Memcached codfw shards with mc202[01]

https://gerrit.wikimedia.org/r/335627

Mentioned in SAL (#wikimedia-operations) [2017-02-02T11:40:52Z] <elukey> Swap mc2002 with mc2020, mc2003 with mc2021 (Redis codfw replicas) - T155755

Change 335655 had a related patch set uploaded (by Elukey):
Replace Redis/Memcached shards mc200[4567] with mc202[2345]

https://gerrit.wikimedia.org/r/335655

Change 335655 merged by Elukey:
Replace Redis/Memcached shards mc200[4567] with mc202[2345]

https://gerrit.wikimedia.org/r/335655

Change 335676 had a related patch set uploaded (by Elukey):
Replace codfw Memcached/Redis mc2008->mc2011 with mc2026->mc2029

https://gerrit.wikimedia.org/r/335676

Change 335676 merged by Elukey:
Replace codfw Memcached/Redis mc2008->mc2011 with mc2026->mc2029

https://gerrit.wikimedia.org/r/335676

Change 335780 had a related patch set uploaded (by Elukey):
Disable auto-restart for nutcracker when config.yaml changes

https://gerrit.wikimedia.org/r/335780

Today I discovered that swapping mc hosts in codfw is not a completely safe operation, since all the mw hosts in eqiad have multiple pools configured in nutcracker, including the codfw ones. So changing one codfw mc shard means changing all the nutcracker configurations across eqiad and codfw, triggering a restart of a daemon that doesn't support graceful reload (https://github.com/twitter/twemproxy/issues/6).
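
To make the blast radius clearer: the nutcracker config rendered on every MediaWiki host contains one pool per datacenter, roughly like the structure below. This is only a sketch (the shard entries are made up, not the production list), and yaml_emit() from the PECL yaml extension is used here just to print it in config.yaml form:

<?php
// Illustrative only: replacing a single codfw shard changes this structure,
// and therefore the rendered config.yaml, on eqiad app servers as well.
$pools = [
        'redis_eqiad' => [
                'listen'           => '/var/run/nutcracker/redis_eqiad.sock 0666',
                'redis'            => true,
                'auto_eject_hosts' => false,
                'servers'          => [
                        'mc1001.eqiad.wmnet:6379:1 shard01',
                        'mc1002.eqiad.wmnet:6379:1 shard02',
                ],
        ],
        'redis_codfw' => [
                'listen'           => '/var/run/nutcracker/redis_codfw.sock 0666',
                'redis'            => true,
                'auto_eject_hosts' => false,
                'servers'          => [
                        'mc2019.codfw.wmnet:6379:1 shard01',
                        'mc2020.codfw.wmnet:6379:1 shard02',
                ],
        ],
];
echo yaml_emit( $pools );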

These restarts might have caused some users to lose their sessions during the following time windows:

Feb 02:
10:50 -> 11:30 UTC
11:30 -> 12:00 UTC

Feb 03:
09:00 -> 09:30 UTC

From https://grafana.wikimedia.org/dashboard/db/edit-count I don't see a huge impact, but I am adding the CL tag anyway to notify them.

Next steps:

  1. merge https://gerrit.wikimedia.org/r/335780 (or something similar with base::service_unit) to prevent nutcracker restarts on config change.
  2. replace the remaining shards.

I'm not a CL, but let me know if it's worth setting up a CentralNotice maintenance banner for logged-in users just for those time periods.

@Jseddon Hello! The next round of host swaps will not create any impact, so for the moment the CentralNotice banner is not needed, but it might be when I have to do the same for the eqiad hosts, so I'll keep it in mind! Thanks :)

Elitre subscribed.

If I understand this correctly, you want people to be notified about past outages. That's a line for Tech News, which has included similar items in the past. I'll make sure Johan sees this, but if there's more, please LUK?

My goal was to notify you so that you can answer questions from the community (if any). The outage was limited to a few user sessions lost during a brief window of time, so it's probably not worth a broad notification in Tech News. If you think it is worth it, let me know; I am learning how to be a good ops citizen who notifies users :)

Thanks!

Johan subscribed.

OK, then I won't include it (but thanks for checking if it should be, @Elitre!).

I believe we're still... experimenting with the tag thing, and although this use case is perfectly fine, do feel free to reach out directly to one of us in the future or to send an email to our mailing list so that we all see the heads-up :)

Back to the original task :)

So from T111575, it seems that there is a valid reason to have both eqiad and codfw listed in all the nutcracker configs. I am not sure if any of the mw hosts are using this option, but if nutcracker supported graceful config reload there wouldn't be any issue.

More info from the mediawiki-config repo:

<?php
// Register one Redis object cache per datacenter, each reached through the
// local nutcracker proxy socket rather than by talking to the mc hosts directly.
foreach ( [ 'eqiad', 'codfw' ] as $dc ) {
        $wgObjectCaches["redis_{$dc}"] = [
                'class'       => 'RedisBagOStuff',
                'servers'     => [ "/var/run/nutcracker/redis_{$dc}.sock" ],
                'password'    => $wmgRedisPassword,
                'loggroup'    => 'redis',
                'reportDupes' => false
        ];
}
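
For context, this is roughly how MediaWiki code can end up talking to either pool through those sockets; a minimal sketch against the standard ObjectCache/BagOStuff interface, with a made-up key, value and TTL:

<?php
// Runs inside MediaWiki (ObjectCache is a MediaWiki class), not standalone.
$codfwCache = ObjectCache::getInstance( 'redis_codfw' );
$codfwCache->set( 'example-key', 'example-value', 300 );
$value = $codfwCache->get( 'example-key' );

Because both caches point at local nutcracker sockets, app servers in one datacenter still carry the other datacenter's pool in their config, which is what made the swaps above noisy.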

https://gerrit.wikimedia.org/r/#/c/335780 will surely help, but we need to think about either applying patches from upstream to nutcracker or evaluating alternatives like Dynomite or Mcrouter.

Change 335780 merged by Elukey:
Disable auto-restart for nutcracker when config.yaml changes

https://gerrit.wikimedia.org/r/335780

Change 336419 had a related patch set uploaded (by Elukey):
Replace Memcached/Redis codfw shard12->16

https://gerrit.wikimedia.org/r/336419

Change 336419 merged by Elukey:
Replace Memcached/Redis codfw shard12->16

https://gerrit.wikimedia.org/r/336419

Work completed; all the nutcrackers in codfw have been restarted to pick up the change. Please note: after https://gerrit.wikimedia.org/r/#/c/335780, nutcracker is no longer restarted automatically on config change (the problem arose when a change in the codfw pool triggered a restart of all the eqiad nutcrackers, due to how the config is laid out).