new external storage cluster(s)
Closed, Resolved (Public)

Description

Our current external storage clusters just reached 90% disk usage and icinga sent warnings. We still have some months to spare; the current clusters have been filling up for at least 2.5 years.

We need to:

  • Review the ES server spec, mainly to maximize disk space.
  • Order and provision 6 to 8 nodes for EQIAD (and the same for CODFW, so 12 or 16 in total)

MediaWiki writes to multiple active ES clusters in order to avoid SPOF. A cluster is a minimum of 3 nodes (1 master, 2 slaves) but ideally for capacity 4 nodes (1 master, 3 slaves).

Related Objects

Status    Assigned  Task
Open      None
Resolved  jcrespo

Event Timeline

jcrespo triaged this task as Normal priority. Jul 20 2015, 3:39 PM
jcrespo moved this task from Triage to Backlog on the DBA board.
demon added a subscriber: demon. Jul 20 2015, 6:06 PM

The first thing we need to determine is the future of this service: should new clusters or shards be added from the application point of view, or should we be 100% transparent to the application and just add extra capacity to the existing servers (both disk and memory)?

@Springle pointed out that the InnoDB buffer pool efficiency is far from perfect, and that this will need more than extra storage. This is a good point to gather a bit of feedback from the application side about the prospects of this service.

If just adding space isn't as efficient for InnoDB and we would do better with new servers then let's do that. As long as existing data is copied over to a new cluster I don't see why we couldn't. It would require a bit of read-only time to the affected wikis (can be kept minimal), but otherwise I don't see any real blockers here from the application POV...just some config changes.

I'm less sure on the "can we do better in MW" with regards to better compression of blobs and so forth. That's more a question for @tstarling or @aaron...

It would require a bit of read-only time

Do not worry about the operational part; thanks to replication, the impact would be minimal, if any. :-)

Please note that hardware purchase, installation and data migration take months.

jcrespo added a subscriber: Cmjohnson. Edited Jul 21 2015, 4:58 PM

From the high level point of view, there are 2 main options here:

  • Keep the old servers, which have no warranty and, right now, no replacements (see for example T103843), and buy only a batch of larger disks (they probably cannot simply be added to the existing ones).
  • Decommission the old servers in line with the renewal policy and buy more powerful servers that would allow a) consolidation, doing more with fewer servers, and faster; b) less human effort, thanks to newer parts; c) warranty-covered replacements for some years.

One important constraint is that the service should be up and running by November (see previous graphic).

We could buy new servers, immediately configure MW to write to the new cluster, then recompress the old cluster, and decommission it when recompression is done (say 3-6 months).

We're currently using about 2.7 GB per day, for each of the two ES clusters (by linear regression on yearly ganglia part_max_used). At that rate, we would need about 4900 GB for the next 5 years, plus about 600GB for recompressed old data, so say 5500 GB. That should give plenty of headroom if we plan for another recompression/decommission cycle after 3 years.

If there is no recompression, and we just migrate the ~2700 GB of old data, the projected disk usage after 5 years will be more like 8100 GB per cluster.
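The arithmetic behind these estimates can be sketched as a simple linear projection. This is only an illustration using the figures quoted above; the function and constant names are made up for the example. Note that a plain sum of the ~2700 GB of old data and 5 years of new writes comes to roughly 7600 GB, somewhat below the ~8100 GB quoted, so the quoted figure presumably includes growth or margin that is not spelled out in this thread.

```python
# Rough capacity projection using the figures quoted in this thread.
DAILY_GROWTH_GB = 2.7        # per active cluster, from the ganglia regression
RECOMPRESSED_OLD_GB = 600    # estimated size of old data after recompression
UNCOMPRESSED_OLD_GB = 2700   # approximate size of old data if migrated as-is


def projected_usage_gb(years: float, carried_over_gb: float,
                       daily_growth_gb: float = DAILY_GROWTH_GB) -> float:
    """Linear projection: data carried over plus new writes over `years`."""
    return carried_over_gb + daily_growth_gb * 365 * years


new_writes = projected_usage_gb(5, 0)
with_recompression = projected_usage_gb(5, RECOMPRESSED_OLD_GB)
without_recompression = projected_usage_gb(5, UNCOMPRESSED_OLD_GB)

print(f"new writes over 5 years: {new_writes:.1f} GB")
print(f"with recompression:      {with_recompression:.1f} GB")
print(f"without recompression:   {without_recompression:.1f} GB")
```

Changing `years` to 3 gives the shorter horizon implied by the recompression/decommission cycle mentioned above.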

It probably makes sense to have a dedicated cold storage cluster, instead of putting the newly recompressed data in the active cluster, since the hardware parameters are a bit different. I see that es1001-1004 is currently used for cold storage. It has a read load in the vicinity of 1MB/s per server, 7.8TB of disk usage, 1.3TB free, so it is a bit overpowered with 12 spindles. Presumably a lot of that 7.8TB has not been recompressed, especially cluster22 and cluster23.

So I am thinking that we want 3 new clusters per datacenter: 2 active and 1 cold. We switch MW to write to the 2 new clusters, then recompress as much as possible into the cold cluster, and also migrate any legacy data to it such as cluster1/cluster2. Then decommission es1001-es1010.

Elitre added a subscriber: Elitre. Jul 22 2015, 2:03 PM
jcrespo added a comment. Edited Jul 24 2015, 3:12 PM

So, summarizing @RobH @Springle:

  • 3 nodes per cluster (ideally 4) x 3 clusters x 2 datacenters = 18 to 24 nodes. 2 of the 3 clusters per datacenter are needed within a hard deadline of less than 3 months.
  • Hardware RAID
  • 12 TB of hardware HD storage (6 TB on RAID 10) for a 3-year provision, 16 TB for a 5-year provision (most important)
  • 96 or 128 GB memory
  • IO bound: if we have to choose between investing in memory, CPU or disk, invest in disks, but we do not need SSDs (throughput over latency)
  • Do not invest in many CPUs (MySQL does not parallelize over 32 cores)
  • Soft requirement: try to homogenize hardware for future MySQL production purchases (for example, T106847)
  • Warranty, especially for disk replacement!
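As a quick check of the sizing in the list above, here is a small sketch of the node count and the RAID 10 capacity. It assumes standard two-way mirroring, so usable space is half of the raw disk total; the function names are illustrative only.

```python
def total_nodes(nodes_per_cluster: int, clusters_per_dc: int = 3,
                datacenters: int = 2) -> int:
    """Nodes needed across all clusters and datacenters."""
    return nodes_per_cluster * clusters_per_dc * datacenters


def raid10_usable_tb(raw_tb: float) -> float:
    """RAID 10 with two-way mirrors: usable capacity is half of raw."""
    return raw_tb / 2


print(total_nodes(3), total_nodes(4))               # 18 24
print(raid10_usable_tb(12), raid10_usable_tb(16))   # 6.0 8.0
```

Under that assumption the 16 TB option would leave about 8 TB usable per node.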

+1 to the provisioning.

Also +1 to Tim's plan, once we get the hardware.

Some nodes, like es1009, are down to 6% available disk space.

Hardware has already been ordered for eqiad.

jcrespo added a comment. Edited Aug 13 2015, 10:30 AM

es1005 RAID degraded. We will not replace the failed disk, as full server replacements are on their way in.

                Device Present
                ================
Virtual Drives    : 1 
  Degraded        : 1 
  Offline         : 0 
Physical Devices  : 14 
  Disks           : 12 
  Critical Disks  : 4 
  Failed Disks    : 1
Enclosure Device ID: 32
Slot Number: 4
Drive's position: DiskGroup: 0, Span: 2, Arm: 0
Enclosure position: N/A
Device Id: 4
WWN: 5000C50054E8DE5C
Sequence Number: 3
Media Error Count: 3075
Other Error Count: 15
Predictive Failure Count: 71
Last Predictive Failure Event Seq Number: 13840
PD Type: SAS

Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Sector Size:  0
Firmware state: Failed
Device Firmware Level: ES65
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5000c50054e8de5d
SAS Address(1): 0x0
Connected Port Number: 0(path0) 
Inquiry Data: SEAGATE ST3600057SS     ES656SL4G6AK            
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :53C (127.40 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Port-1 :
Port status: Active
Port's Linkspeed: Unknown 
Drive has flagged a S.M.A.R.T alert : Yes
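A report of this shape is easy to check mechanically. The following is a minimal sketch that assumes the text above has already been captured into a string; it does not call any RAID tool, and the rule "any non-zero error count needs attention" is an illustrative assumption, not a documented threshold. Repeated keys such as `Port status` simply keep their last value.

```python
def parse_pd_report(text: str) -> dict[str, str]:
    """Turn 'Key: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def needs_attention(fields: dict[str, str]) -> bool:
    """Flag a disk that has failed or is reporting errors."""
    failed = fields.get("Firmware state", "").lower() == "failed"
    predictive = int(fields.get("Predictive Failure Count", "0")) > 0
    media_errors = int(fields.get("Media Error Count", "0")) > 0
    return failed or predictive or media_errors


# A few lines copied from the report above.
sample = """\
Slot Number: 4
Media Error Count: 3075
Predictive Failure Count: 71
Firmware state: Failed
"""
report = parse_pd_report(sample)
print(report["Slot Number"], needs_attention(report))   # 4 True
```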

es1009 and es1006, both read/write masters, are down to 5% free disk space (175G). Any news about the order?
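To put that number in perspective, here is a rough sketch of the remaining headroom. It combines the 175G free figure with the ~2.7 GB per day growth estimated earlier in this thread, and assumes growth stays linear and that the per-cluster rate applies to each master.

```python
def days_until_full(free_gb: float, daily_growth_gb: float) -> float:
    """Days of headroom left at a constant daily growth rate."""
    if daily_growth_gb <= 0:
        raise ValueError("daily growth must be positive")
    return free_gb / daily_growth_gb


remaining = days_until_full(free_gb=175, daily_growth_gb=2.7)
print(f"about {remaining:.0f} days of headroom")   # about 65 days of headroom
```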

RobH reassigned this task from jcrespo to Cmjohnson. Aug 24 2015, 6:28 PM

It shows delivered on 2015-08-20. https://rt.wikimedia.org/Ticket/Display.html?id=9524 has not been updated, even though tracking shows delivered.

I'm going to assign this task to Chris, as the RT is also assigned to him.

I think everything is here but the disks?

It would be nice to have 2 units mounted ASAP alongside es1006 and es1009 (which they will replace), ready to clone the existing data. I've asked for comment on the data migration here: T106386#1571242

Change 234035 had a related patch set uploaded (by Jcrespo):
Adding new External Storage nodes as MariaDB::core

https://gerrit.wikimedia.org/r/234035

Change 234035 merged by Jcrespo:
Adding new External Storage nodes as MariaDB::core

https://gerrit.wikimedia.org/r/234035

jcrespo added a comment. Edited Aug 27 2015, 7:36 AM

Arrangement:

ES1
===
es1012 [A2]
es1018 [D1]
es1016 [C2]

ES2
===
es1011 [A2] MASTER
es1013 [B1]
es1015 [C2]

ES3
===
es1014 [B1] MASTER
es1017 [D1]
es1019 [D3]
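The arrangement above can be written down as a small data structure and checked for how each cluster is spread out. This sketch assumes the bracketed labels are rack positions whose leading letter names the row; the thread does not state that spreading a cluster across rows is a requirement, so the check is purely illustrative.

```python
# Layout copied from the arrangement above; bracketed labels are assumed to be
# rack positions, with the leading letter naming the row.
ARRANGEMENT = {
    "ES1": {"es1012": "A2", "es1018": "D1", "es1016": "C2"},
    "ES2": {"es1011": "A2", "es1013": "B1", "es1015": "C2"},
    "ES3": {"es1014": "B1", "es1017": "D1", "es1019": "D3"},
}


def shared_rows(nodes: dict[str, str]) -> dict[str, list[str]]:
    """Return rows that host more than one node of the same cluster."""
    by_row: dict[str, list[str]] = {}
    for host, rack in nodes.items():
        by_row.setdefault(rack[0], []).append(host)
    return {row: hosts for row, hosts in by_row.items() if len(hosts) > 1}


for cluster, nodes in ARRANGEMENT.items():
    print(cluster, shared_rows(nodes) or "one node per row")
```

With this layout only ES3 reports a shared row (es1017 and es1019, both in row D).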

Change 234225 had a related patch set uploaded (by Jcrespo):
Reorganization of new External Storage nodes

https://gerrit.wikimedia.org/r/234225

Change 234225 merged by Jcrespo:
Reorganization of new External Storage nodes

https://gerrit.wikimedia.org/r/234225

Change 234244 had a related patch set uploaded (by Jcrespo):
Pool es1011, depool es1008 as storage nodes

https://gerrit.wikimedia.org/r/234244

Change 234244 merged by Jcrespo:
Pool es1011, depool es1008 as storage nodes

https://gerrit.wikimedia.org/r/234244

Change 234479 had a related patch set uploaded (by Jcrespo):
Depool es1001 for cloning, increase es1011 weight, pool es1014

https://gerrit.wikimedia.org/r/234479

Change 234479 merged by Jcrespo:
Depool es1001 for cloning, increase es1011 weight, pool es1014

https://gerrit.wikimedia.org/r/234479

The es1 servers may be some of our oldest servers for critical production usage (16GB of memory!). They still have 1.8TB drives, but they could be as old as 4 years, according to some (imported) phabricator tickets.

Change 234965 had a related patch set uploaded (by Jcrespo):
Depool es1007 for maintenance

https://gerrit.wikimedia.org/r/234965

Change 234965 merged by Jcrespo:
Depool es1007 for maintenance

https://gerrit.wikimedia.org/r/234965

Change 235192 had a related patch set uploaded (by Jcrespo):
Repool es1007, pool es1013 for the first time

https://gerrit.wikimedia.org/r/235192

Change 235192 merged by Jcrespo:
Repool es1007, pool es1013 for the first time

https://gerrit.wikimedia.org/r/235192

Change 235193 had a related patch set uploaded (by Jcrespo):
Depool es1010 to clone it to es1017

https://gerrit.wikimedia.org/r/235193

Change 235193 merged by Jcrespo:
Depool es1010 to clone it to es1017

https://gerrit.wikimedia.org/r/235193

Cmjohnson reassigned this task from Cmjohnson to jcrespo. Sep 1 2015, 1:16 PM

Assigning this to @jcrespo

Thanks again, I've already seen the entries on racktables! Was waiting for that to fully own it.

Change 235276 had a related patch set uploaded (by Jcrespo):
Repool es1010, pool es1017 for the first time

https://gerrit.wikimedia.org/r/235276

Change 235276 merged by Jcrespo:
Repool es1010, pool es1017 for the first time

https://gerrit.wikimedia.org/r/235276

Change 235423 had a related patch set uploaded (by Jcrespo):
Depool es1002 in order to clone it to new server es1016

https://gerrit.wikimedia.org/r/235423

Change 235423 merged by Jcrespo:
Depool es1002 in order to clone it to new server es1016

https://gerrit.wikimedia.org/r/235423

Change 235735 had a related patch set uploaded (by Jcrespo):
Switchover of es2 master from es1006 to es1011

https://gerrit.wikimedia.org/r/235735

Change 235735 merged by Jcrespo:
Switchover of es2 master from es1006 to es1011

https://gerrit.wikimedia.org/r/235735

After several clones and failovers, the new nodes are working as masters, alongside the old nodes. I will now start slowly depooling the old nodes so that they can be decommissioned soon.
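The pool/depool/weight changes listed above follow a common pattern: shift read traffic by adjusting per-host weights, with weight 0 meaning depooled. The sketch below is a generic illustration of that idea, not MediaWiki's actual load balancer or configuration; the hostnames and weights are hypothetical.

```python
import random

# Hypothetical read weights for one cluster; names and numbers are made up.
weights = {"old-node": 1, "new-node-a": 1, "new-node-b": 1}


def depool(weights: dict[str, int], host: str) -> dict[str, int]:
    """Return a copy where `host` receives no traffic (weight 0)."""
    return {**weights, host: 0}


def pick_replica(weights: dict[str, int], rng: random.Random) -> str:
    """Choose a host with probability proportional to its weight."""
    hosts = [h for h, w in weights.items() if w > 0]
    if not hosts:
        raise RuntimeError("no pooled hosts left")
    return rng.choices(hosts, weights=[weights[h] for h in hosts], k=1)[0]


after = depool(weights, "old-node")
after["new-node-a"] = 3   # raise the weight of the new nodes
after["new-node-b"] = 3

rng = random.Random(0)
picks = {pick_replica(after, rng) for _ in range(100)}
print(sorted(picks))      # the depooled host never appears
```

Because `depool` returns a copy, the previous weights stay available if a node has to be repooled.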

Change 237414 had a related patch set uploaded (by Jcrespo):
Depool es1001 for decommision; increase weight of es1015 and es1019

https://gerrit.wikimedia.org/r/237414

Change 237414 merged by Jcrespo:
Depool es1001 for decommision; increase weight of es1015 and es1019

https://gerrit.wikimedia.org/r/237414

Change 238494 had a related patch set uploaded (by Jcrespo):
Depool es1003, es1004, es1007 and es1010 for decommision

https://gerrit.wikimedia.org/r/238494

Change 238494 merged by Jcrespo:
Depool es1003, es1004, es1007 and es1010 for decommision

https://gerrit.wikimedia.org/r/238494

This is almost completed: we just need to wait for the old server to finish processing dump queries, stop the MySQL instances to confirm that nothing still depends on them, and clean up the configuration file.

Eqiad old servers are no longer in use; created T113080.

Not closing because codfw is still pending.

jcrespo closed this task as Resolved. Sep 27 2015, 3:18 PM

Actually, closing, tracking codfw separately.

jcrespo mentioned this in Unknown Object (Task). Jan 27 2016, 1:36 PM