
eqiad: replacement tin/deployment server
Closed, ResolvedPublic

Description

This task covers the hardware request to replace tin. Tin is currently the deployment server in eqiad but, as noted on T174449, it has a bad disk.

Tin went out of warranty on 2015-08-29. It has cabled, not hot-swap, disks, so replacing the disk would require downtime. Additionally, tin uses an older H310 hw raid controller. We no longer use these, as they are not wholly reliable. When we did use them, we put them into passthrough mode and did not rely on the hw raid. It seems tin was set up before all of those changes, since it is using hw raid on a cruddy H310 controller.

Requesting approval to use a shelf spare as a replacement server. This will allow us to image a new deployment server and get all the data there before shutting down tin.

Tin has very, very modest hardware: a single Intel(R) Xeon(R) CPU E5-2420 (1.9GHz/6 cores), 16GB RAM, and dual 500GB disks in a raid1 mirror. These small requirements could even be viable for a ganeti instance, but I'm not certain we want our deployment system to reside within a ganeti VM. (Though if it can, this may be the ideal time to do so, due to the hw failure.)

At the time this task was created, there were 6 spare systems in eqiad available for assignment.

wmf4660 (was restbase1005): warranty end 2018-02-05, HP ProLiant DL360, Intel® Xeon® Processor E5-2450 v2 (2.50 GHz / 8 cores), 64GB RAM, no disks (system has 4 bays, but only 3 disk sleds installed)
WMF4727: warranty end 2018-12-05, Dell PowerEdge R430, Intel® Xeon® Processor E5-2623 v3 (3.00 GHz / 4 cores), 32GB RAM, (4) 4TB SATA
wmf4748: warranty end 2019-03-24, Dell PowerEdge R430, Intel® Xeon® Processor E5-2640 v3 (2.60 GHz / 8 cores), 64GB RAM, (2) 1TB SATA

The last system has 3 other systems identical to it, so of the 6 spare systems in total, 4 are identical (including wmf4748). I'd recommend either using wmf4748, or adding disks to wmf4660 and using that.

If we pick the old restbase system wmf4660 as the replacement, we need to install 2 disks from shelf spares.
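As a rough illustration of the comparison above, the following sketch encodes the three listed spares and picks the ones that could replace tin without adding parts. The specs are copied from the list above; the selection criteria (at least tin's RAM and disk count, longest warranty first) and all names such as `Spare` and `ready_replacements` are assumptions made for this example, not an established selection process.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Spare:
    name: str
    warranty_end: date
    ram_gb: int
    disks: int  # number of disks currently installed


# Specs copied from the spare list in this task.
SPARES = [
    Spare("wmf4660", date(2018, 2, 5), 64, 0),
    Spare("WMF4727", date(2018, 12, 5), 32, 4),
    Spare("wmf4748", date(2019, 3, 24), 64, 2),
]

# tin's current hardware: 16GB RAM, 2 disks in a raid1 mirror.
TIN_RAM_GB = 16
TIN_DISKS = 2


def ready_replacements(spares):
    """Return spares usable as-is (assumed criteria), longest warranty first."""
    usable = [
        s for s in spares
        if s.ram_gb >= TIN_RAM_GB and s.disks >= TIN_DISKS
    ]
    return sorted(usable, key=lambda s: s.warranty_end, reverse=True)


print([s.name for s in ready_replacements(SPARES)])
```

Under these assumed criteria wmf4660 drops out only because it has no disks installed, which matches the note that it would need 2 disks from shelf spares.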

Event Timeline

RobH created this task. Aug 29 2017, 4:13 PM
RobH moved this task from Backlog to Pending Approval on the hardware-requests board.
RobH added a subscriber: bd808. Aug 29 2017, 5:10 PM

Please note that there has been some IRC discussion. The relevance of moving deployment to ganeti was discussed on T144578. Additionally, @bd808 noted that "MediaWiki scap would be *way* faster if the deploy servers had SSD. Building the l10n data requires a bazillion stat() calls."
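To get a rough feel for the stat() cost mentioned in the quote above, a minimal sketch like the following could be run against a directory tree. It is not part of scap; the function name `time_stat_calls` and the choice of directory are hypothetical, and it only measures plain stat() calls on files, not the actual l10n build.

```python
import os
import time


def time_stat_calls(root):
    """Walk root, stat() every file, and return (file_count, elapsed_seconds)."""
    count = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                os.stat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            count += 1
    return count, time.perf_counter() - start


if __name__ == "__main__":
    files, seconds = time_stat_calls(".")
    print(f"{files} stat() calls in {seconds:.3f}s")
```

Timings depend heavily on whether the file metadata is already cached in memory; the difference between spinning disks and SSDs shows up mainly on a cold cache.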

Reviewing tin's performance on https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=tin&var-network=bond0&from=now-7d&to=now shows that it tends to have IO spikes, most likely during deployments. The CPU spikes at the same time, but rarely above 50%, and memory utilization (used, not cached) is also fairly low.

The IO usage likely makes this non-ideal for ganeti.

No shelf spares have SSDs, but the recommended spare does have dual 1TB 7.2K RPM SATA 6Gbps 2.5in hot-plug hard drives. This chassis could have SSDs swapped in, but they would not be covered under the system warranty, so this is not advised. (We could purchase SSDs via Dell to keep them covered under the system warranty, which is in place until 2019-03-24.)

I'd advise we try out the spare system with the SATA disks and see how well it works. It's a higher performance system than iron. If the IO spikes continue to slow things down, then we could look at ordering SSDs from Dell for this system.

demon added a subscriber: demon. Aug 29 2017, 5:13 PM

I'd advise we try out the spare system with the SATA disks and see how well it works. It's a higher performance system than iron. If the IO spikes continue to slow things down, then we could look at ordering SSDs from Dell for this system.

This sounds reasonable to me. SSDs would probably help, but aren't worth holding up tin's replacement over. Additionally, the localization caches are going to be (eventually) moved away from CDB files, which should reduce our IO usage.

RobH assigned this task to mark. Sep 5 2017, 3:07 PM

Assigning to @mark for approval of spare server usage.

@mark: We have 4 spare systems in total on the shelf identical to this one: wmf4748, warranty end 2019-03-24, Dell PowerEdge R430, Intel Xeon E5-2640 v3 (2.60 GHz / 8 cores), 64GB RAM, (2) 1TB SATA

I'd like to assign this one to replace tin, which has a bad disk and is years out of warranty. Please comment and assign back to me for follow-up, thanks!

mark added a comment. Sep 7 2017, 4:55 PM

Approved.

Dzahn added a subscriber: Dzahn. Dec 18 2017, 6:54 PM