Split memcached in eqiad across multiple racks/rows
Closed, Resolved (Public)

Description

Presently all eqiad memcached servers reside in the same rack. This is for legacy reasons: the mc* servers use 10GbE fiber, and 10GbE switch ports have only been available in a limited number of racks in our datacenter.

The servers need to be shuffled across the datacenter for maximum resiliency.

Details

Reference
rt6889

Event Timeline

rtimport raised the priority of this task to Medium. Dec 18 2014, 1:49 AM
rtimport added a project: ops-core.
rtimport set Reference to rt6889.

Worth mentioning: mc1017 and mc1018 have never been set up. Linking ticket (RT6351).

Reference to ticket #6351 added by cmjohnson

mark raised the priority of this task from Medium to High. Feb 5 2015, 6:50 PM
mark subscribed.
mark added a subscriber: Cmjohnson.

@Cmjohnson, can you propose a plan here to spread the memcached servers across three or more 10G racks?

We'll be adding some Varnish (10G) servers soon as well, so let's take that into account.

At the moment, moving servers to other racks/rows would mean a global rebalance of the cluster, losing a lot of memcached keys in the process.

I will try to find a way to trick nutcracker into not doing that.

Until I've investigated that, please put this task on hold.
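
A minimal sketch of the kind of nutcracker (twemproxy) pool definition this hinges on; the pool name, listen address, IPs, ports and weights below are illustrative, not the production values. Giving each server a trailing name makes ketama hash on that stable name rather than on ip:port, so a host can get a new address after a rack move without its keys being remapped:

```
eqiad-memcached:
  listen: 127.0.0.1:11212       # illustrative; a unix socket would work too
  hash: md5
  distribution: ketama          # consistent hashing
  auto_eject_hosts: false
  timeout: 250
  servers:
    # format: ip:port:weight name -- when a name is given, ketama hashes the
    # name, so changing the IP does not reshuffle keys across shards
    - 10.64.0.180:11211:1 shard01
    - 10.64.0.181:11211:1 shard02
    - 10.64.0.182:11211:1 shard03
```

The point is that shard identity is decoupled from the network address; the shard labels themselves would live in hieradata (see the caveats further down in this task).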

faidon removed a subscriber: Joe.
faidon renamed this task from "split memcached service across multiple racks" to "Split memcached in eqiad across multiple racks/rows". Feb 9 2015, 10:25 AM
faidon reassigned this task from Cmjohnson to Joe.
faidon edited projects, added ops-eqiad; removed netops.
faidon changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)".
faidon changed the edit policy from "WMF-NDA (Project)" to "All Users".
faidon added subscribers: Christopher, Joe.
faidon removed a subscriber: Christopher.

So this rebalancing is going to block the deployment of 2 of the 6 Restbase systems.

Can we plan to move mc1017-mc1018 into row D? Or, at minimum, move them out of the rack now so the restbase systems can take the space?

I'd advise possibly not wiring up mc1017-mc1018, wherever they end up, until we figure out the full rebalancing plan.

This is not blocked by me at all now; my investigation into rebalancing is done.

Whenever mc1017 and mc1018 are moved and reinstalled, I can work on moving traffic off of mc1001-mc1002 (or whichever other machines we want to move first).

I'll reassign the ticket to Chris.

mc1017/mc1018 have been moved to D8 in eqiad.

Some caveats:

  • Whenever moving a server, we need to change the IP in a few places (see the sketch after this list):
    1. puppet/hieradata/eqiad.yml (adding a label like "shard_N")
    2. mediawiki-config/wmf-config/session.php (redis for sessions)
    3. mediawiki-config/wmf-config/filebackend.php, if we move mc1001-mc1003 (redis locking). My advice would therefore be not to move any of mc1001-mc1003 /now/, and to change the config later so it uses servers in different rows.
  • Moving one server will mean losing between 5 and 10% of the memcached keys, and between 6 and 13% of the sessions. Also, session data will first be saved and then lost (the same holds for memcached data, but that is supposed to be fully ephemeral, right?).
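
As a rough illustration of item 1 above (the hieradata key name, labels and addresses here are assumptions, not the real values): only the IP changes when a host moves to another row, while the shard label stays fixed, so nutcracker keeps mapping the same key ranges to the same shard. The loss estimates above are in line with this: with roughly 16 active shards, each shard owns about 1/16 ≈ 6% of the keyspace, and re-pointing one shard at a freshly reinstalled host discards that shard's data.

```
# puppet/hieradata/eqiad.yml -- illustrative excerpt; key name, labels and
# addresses are assumptions
memcached_servers:
  - 10.64.0.180:11211:1 shard01    # mc1001, stays in its current rack
  - 10.64.32.80:11211:1 shard07    # mc1007, new address after moving to another row
  - 10.64.48.90:11211:1 shard13    # mc1013, new address after moving to another row
```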

So there will be some user-facing impact from this maintenance work.

We should aim to move at least one server today, possibly two.

What I understood from IRC yesterday is:

  • mc1001-mc1006 will stay in the existing rack (A5)
  • mc1007-mc1012 will move to C8
  • mc1013-mc1018 will move to D8

Redis locking being confined to one availability zone is a problem, though, and that needs to be adjusted in the config at a later point; I think that's what you meant above.

@faidon: yes, that's what I meant; I'd just like to reduce the number of potential issues while moving the servers.

DNS entries have been made and are sitting in gerrit for review. https://gerrit.wikimedia.org/r/#/c/190358/

Switch ports have been labeled, enabled, and set to the correct VLAN.

The racks have been prepped with cabling and power strips, and Racktables has been updated.

All servers have been moved/brought back online.