Split memcached in eqiad across multiple racks/rows
Closed, Resolved (Public)

Description

Presently all eqiad memcached servers reside in the same rack. This is for legacy reasons: the mc* servers use 10GbE fiber, and 10GbE switch ports have only been available in a limited number of racks in our datacenter.

The servers need to be shuffled across the datacenter for maximum resiliency.

Details

Reference
rt6889

Event Timeline

rtimport raised the priority of this task to Medium. Dec 18 2014, 1:49 AM
rtimport added a project: ops-core.
rtimport set Reference to rt6889.

Worth mentioning: mc1017 and mc1018 have never been set up. Linking ticket (RT6351).

Reference to ticket #6351 added by cmjohnson

mark raised the priority of this task from Medium to High. Feb 5 2015, 6:50 PM
mark subscribed.
mark added a subscriber: Cmjohnson.

@Cmjohnson, can you propose a plan here to spread the memcached servers across three or more 10G racks?

We'll be adding some Varnish (10G) servers soon as well, so let's take that into account.

At the moment, moving servers to other racks/rows would mean a global rebalance of the cluster, losing a lot of memcached keys in the process.

I will try to find a way to trick nutcracker into not doing that.

Until I've investigated that, please put this task on hold.
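
A minimal sketch of the kind of nutcracker (twemproxy) pool definition this hinges on; the pool name, listen address, IPs, ports and weights below are illustrative, not the production values. Giving each server a trailing name makes ketama hash on that stable name rather than on ip:port, so a host can get a new address after a rack move without its keys being remapped:

```
eqiad-memcached:
  listen: 127.0.0.1:11212       # illustrative; a unix socket would work too
  hash: md5
  distribution: ketama          # consistent hashing
  auto_eject_hosts: false
  timeout: 250
  servers:
    # format: ip:port:weight name -- when a name is given, ketama hashes the
    # name, so changing the IP does not reshuffle keys across shards
    - 10.64.0.180:11211:1 shard01
    - 10.64.0.181:11211:1 shard02
    - 10.64.0.182:11211:1 shard03
```

The point is that shard identity is decoupled from the network address; the shard labels themselves would live in hieradata (see the caveats further down in this task).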

faidon removed a subscriber: Joe.
faidon renamed this task from "split memcached service across multiple racks" to "Split memcached in eqiad across multiple racks/rows". Feb 9 2015, 10:25 AM
faidon reassigned this task from Cmjohnson to Joe.
faidon edited projects, added ops-eqiad; removed netops.
faidon changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)".
faidon changed the edit policy from "WMF-NDA (Project)" to "All Users".
faidon added subscribers: Christopher, Joe.
faidon removed a subscriber: Christopher.

So this rebalancing is going to block the deployment of 2 of the 6 Restbase systems.

Can we plan to move mc1017-mc1018 into row D? Or, at minimum, move them out of the rack now so the restbase systems can take the space?

I'd advise possibly not wiring up mc1017-mc1018, wherever they end up, until we figure out the full rebalancing plan.

This is not blocked by me at all now; my investigation into rebalancing is done.

Whenever mc1017 and mc1018 are moved and reinstalled, I can work on moving traffic off of mc1001-mc1002 (or whichever other machines we want to move first).

I'll reassign the ticket to Chris.

mc1017/mc1018 have been moved to D8 in eqiad.

Some caveats:

  • Whenever moving a server, we need to change the IP in a few places (see the sketch after this list):
    1. puppet/hieradata/eqiad.yml (adding a label like "shard_N")
    2. mediawiki-config/wmf-config/session.php (redis for sessions)
    3. mediawiki-config/wmf-config/filebackend.php, if we move mc1001-mc1003 (redis locking). My advice would therefore be not to move any of mc1001-mc1003 /now/, and to change the config later so it uses servers in different rows.
  • Moving one server will mean losing between 5 and 10% of the memcached keys, and between 6 and 13% of the sessions. Also, session data will first be saved and then lost (the same holds for memcached data, but that is supposed to be fully ephemeral, right?).
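
As a rough illustration of item 1 above (the hieradata key name, labels and addresses here are assumptions, not the real values): only the IP changes when a host moves to another row, while the shard label stays fixed, so nutcracker keeps mapping the same key ranges to the same shard. The loss estimates above are in line with this: with roughly 16 active shards, each shard owns about 1/16 ≈ 6% of the keyspace, and re-pointing one shard at a freshly reinstalled host discards that shard's data.

```
# puppet/hieradata/eqiad.yml -- illustrative excerpt; key name, labels and
# addresses are assumptions
memcached_servers:
  - 10.64.0.180:11211:1 shard01    # mc1001, stays in its current rack
  - 10.64.32.80:11211:1 shard07    # mc1007, new address after moving to another row
  - 10.64.48.90:11211:1 shard13    # mc1013, new address after moving to another row
```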

So there will be some user-facing impact from this maintenance work.

We should aim to move at least one server today, possibly two.

What I understood from IRC yesterday is:

  • mc1001-mc1006 will stay in the existing rack (A5)
  • mc1007-mc1012 will move to C8
  • mc1013-mc1018 will move to D8

Redis locking being confined to one availability zone is a problem, though, and that needs to be adjusted in the config at a later point; I think that's what you meant above.

@faidon: yes, that's what I meant; I'd just like to reduce the number of potential issues while moving the servers.

DNS entries have been made and are sitting in gerrit for review. https://gerrit.wikimedia.org/r/#/c/190358/

Switch ports have been labeled, enabled, and set to the correct VLAN.

The racks have been prepped with cabling and power strips, and Racktables has been updated.

All servers have been moved/brought back online.