
Split memcached in eqiad across multiple racks/rows
Closed, ResolvedPublic

Description

Presently all eqiad memcached servers reside in the same rack. This is due to legacy reasons that have to do with the fact that mc* servers have 10GbE fiber and the availability of 10GbE switches in our datacenter.

The servers need to be shuffled across the datacenter for maximum resiliency.

Details

Reference
rt6889

Event Timeline

rtimport raised the priority of this task from to Normal.
rtimport set Reference to rt6889.
RobH created this task.Feb 21 2014, 6:14 PM

-- Worth mentioning: mc1017 and mc1018 have never been set up. Linking ticket
(RT6351).
Chris Johnson
Operations Engineer
Wikimedia Foundation, Inc
(415) 578-0844
<cmjohnson at wikimedia>

Reference to ticket #6351 added by cmjohnson

mark raised the priority of this task from Normal to High.Feb 5 2015, 6:50 PM
mark added a subscriber: mark.
chasemp set Security to None.
mark assigned this task to Christopher.Feb 5 2015, 10:32 PM
mark added a subscriber: Cmjohnson.

@Cmjohnson, can you propose a plan here to spread out the memcached servers to 3 10G racks or more?

We'll be adding some Varnish (10G) servers soon as well, so let's take that into account.

RobH reassigned this task from Christopher to Cmjohnson.Feb 6 2015, 12:00 AM
Joe added a subscriber: Joe.Feb 9 2015, 9:05 AM

Moving servers to other racks/rows will mean a global rebalance of the cluster at the moment, losing a lot of memcached keys in the process.

I will try to find a way to trick nutcracker not to do that.

Until I have investigated that, please hold this task.

faidon updated the task description. (Show Details)Feb 9 2015, 9:05 AM
faidon removed a subscriber: Joe.
faidon updated the task description. (Show Details)Feb 9 2015, 9:13 AM
faidon renamed this task from split memcached service across multiple racks to Split memcached in eqiad across multiple racks/rows.
faidon edited projects, added ops-eqiad; removed netops.
faidon changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)".
faidon changed the edit policy from "WMF-NDA (Project)" to "All Users".
faidon reassigned this task from Cmjohnson to Joe.
faidon added subscribers: Christopher, Joe.
faidon removed a subscriber: Christopher.
RobH added a comment.Feb 9 2015, 6:58 PM

So this rebalancing is going to block the deployment of 2 of the 6 Restbase systems.

Can we plan to move mc1017-mc1018 into row D? Or at minimum, out of the rack now to take the space for the restbase systems?

I'd advise possibly not wiring mc1017-mc1018 wherever they end up until we figure out the full rebalancing plan.

This is not blocked by me at all now - my investigation into rebalancing is done.

Whenever mc1017 and mc1018 are moved and reinstalled I can work on moving traffic off of mc1001-1002 (or whichever other machines we want to move first).

I'll reassign the ticket to Chris.

Joe reassigned this task from Joe to Cmjohnson.Feb 10 2015, 7:27 AM

mc1017/18 have been moved to d8 in eqiad.

Joe added a comment.EditedFeb 11 2015, 9:02 AM

Some caveats:

  • Whenever moving a server, we need to change the IP in a few places:
    1. puppet/hieradata/eqiad.yml (adding a label like "shard_N")
    2. mediawiki-config/wmf-config/session.php (redis for sessions)
    3. mediawiki-config/wmf-config/filebackend.php if we move mc1001-mc1003 (redis locking). My advice would therefore be not to move any of mc1001-mc1003 /now/, and to change the config to use servers in different rows later.
  • Moving one server will mean losing between 5 and 10% of the memcached keys, and between 6 and 13% of the sessions. Also, session data will be first saved then lost (the same holds for memcached data, but that's supposed to be fully ephemeral, right?)

So there will be some user-facing impact from this maintenance work.

We should aim at moving at least one server today, possibly 2.
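To illustrate why a stable label per server (the "shard_N" idea above) matters, here is a minimal sketch of a consistent-hash ring. It is a simplified ring, not nutcracker's actual ketama implementation, and the IP addresses, key names and the choice of shard_07 are hypothetical. It compares how many keys change owner when one server gets a new IP, depending on whether ring points are derived from the address or from the label, and how many keys are lost while one of 18 servers is out of the pool.

```python
import bisect
import hashlib


def _hash32(text):
    """Map a string to a 32-bit integer using the first 4 bytes of its MD5."""
    return int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "big")


def build_ring(node_ids, points_per_node=160):
    """Place several points per node on the ring, sorted by hash value."""
    ring = [(_hash32(f"{node}-{i}"), node)
            for node in node_ids
            for i in range(points_per_node)]
    ring.sort()
    return ring


def lookup(ring, key):
    """Return the node owning the first ring point at or after the key's hash."""
    idx = bisect.bisect_left(ring, (_hash32(key), ""))
    return ring[idx % len(ring)][1]  # wrap around past the last point


def changed_fraction(owners_before, owners_after):
    """Fraction of keys whose owner differs between two assignments."""
    changed = sum(1 for a, b in zip(owners_before, owners_after) if a != b)
    return changed / len(owners_before)


# Hypothetical layout: 18 shards, one of which gets a new IP after the move.
before = {f"shard_{n:02d}": f"10.64.0.{n}" for n in range(1, 19)}
after = dict(before, shard_07="10.64.48.7")
keys = [f"key:{i}" for i in range(5000)]

# Case 1: ring points derived from the IP address.
ip_to_shard_before = {ip: shard for shard, ip in before.items()}
ip_to_shard_after = {ip: shard for shard, ip in after.items()}
ring_ip_before = build_ring(before.values())
ring_ip_after = build_ring(after.values())
moved_by_ip = changed_fraction(
    [ip_to_shard_before[lookup(ring_ip_before, k)] for k in keys],
    [ip_to_shard_after[lookup(ring_ip_after, k)] for k in keys],
)

# Case 2: ring points derived from the label, which survives the move.
ring_label_before = build_ring(before.keys())
ring_label_after = build_ring(after.keys())
owners_by_label = [lookup(ring_label_before, k) for k in keys]
moved_by_label = changed_fraction(
    owners_by_label,
    [lookup(ring_label_after, k) for k in keys],
)

# Case 3: one shard is out of the pool while it is being moved.
ring_without_07 = build_ring(s for s in before if s != "shard_07")
lost_while_offline = changed_fraction(
    owners_by_label,
    [lookup(ring_without_07, k) for k in keys],
)

print(f"remapped when hashing by IP:    {moved_by_ip:.1%}")
print(f"remapped when hashing by label: {moved_by_label:.1%}")
print(f"remapped while shard_07 is out: {lost_while_offline:.1%}")
```

With labels the mapping is untouched by an address change, so the only loss comes from the server being unavailable (and its cache being empty afterwards). In this sketch that is roughly one shard's share of the key space, about 1/18 for 18 equally weighted servers.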

faidon added a subscriber: faidon.Feb 11 2015, 9:08 AM

What I understood from IRC yesterday is:

  • mc1001-mc1006 will stay in the existing rack (A5)
  • mc1007-mc1012 will move to C8
  • mc1013-mc1018 will move to D8

Redis locking being confined to one availability zone is a problem, though, and that needs to be adjusted in the config at a later point; I think that's what you meant above.

Joe added a comment.Feb 11 2015, 9:11 AM

@faidon yes I meant that, I'd just like to reduce the number of potential issues while moving the servers.

faidon changed the status of subtask T82259: mc1016 mgmt not working from Stalled to Open.

DNS entries have been made and are sitting in gerrit for review. https://gerrit.wikimedia.org/r/#/c/190358/

Switch ports have been labeled, enabled and set to the correct VLAN.

The racks have been prepped with cabling and power strips, and racktables has been updated.

Joe added a comment.Feb 19 2015, 5:44 PM

All servers have been moved/brought back online.

Joe closed this task as Resolved.Feb 19 2015, 5:44 PM