
Refresh Parser cache servers pc1001-pc1003
Closed, Resolved (Public)

Description

This task will track the overall refresh and quoting for replacing our parsercache cluster. While this hardware-requests is public, the pricing and procurement tasks will be private.

  • Dual CPU - Whatever the best Xeon price point per core/Ghz
  • Memory: 200GB+ preferred; kept to 256GB or 512GB to keep memory speed high.
  • Disks: Intel S3710 preferred, but S3610 SSDs are acceptable if S3710s aren't available. 2.5TB of overall RAID0 capacity required.

Google Sheet comparison of costs (WMF only viewable): https://docs.google.com/a/wikimedia.org/spreadsheets/d/1uJ12uwPbwtoJLFBQ1rROgoHN_8lqlTWcoWbUQxRWkYQ/edit?usp=sharing

original task body

pc1001 - pc1003 are Ciscos, old and past any warranty. We need to replace these ASAP.

Presumably we can get away with roughly similar specs (~ 192 GB and a few TB of I/O), but we should discuss. First, we need to find an owner for this...

...and I guess we need to replicate this to codfw as well.

Related Objects

Status    Subtype  Assigned  Task
Resolved           RobH
Resolved           jcrespo
Resolved           jcrespo

Event Timeline

mark raised the priority of this task to Needs Triage.
mark updated the task description.
mark subscribed.

No replication: these should be local-only, since they are an N-level cache (we can warm them one time, though). But yes, set them up on codfw as well, for the first time.

I can own this, unless you grew tired of me already on the previous purchase :-).

jcrespo triaged this task as Medium priority.Sep 8 2015, 2:31 PM

@jcrespo: Can you advise what specs would be ideal for parsercache use? The initial task assumes machines similar to the Ciscos, but often those Ciscos were put into use simply because we had them available in the rack, even though they were more powerful than the task required.

As such, I just want to confirm and try to get a rough idea what the specifications needed would be. We're using SSDs in the Ciscos, and as they seem cheaper than SAS, would likely be the default for this.

When @mark mentions replicating to codfw, I think he meant the ordering of the machines should be mirrored in both sites? If he didn't, I would assume that this would still need hardware server (not data) mirroring? If we have one site go down, parsercache is a core service for our users, correct?

The current pc1001 has the following:

  • Dual Intel(R) Xeon(R) CPU X5650 @ 2.67GHz (6 cores per cpu)
  • 196GB in 4GB DIMMs, synchronous, 1333 MHz
  • Disk: Single SSDSA2M160 160GB

I'd assume we'd actually want dual SSDs, so that a single disk failure doesn't crash the system? With the nearly non-existent disk requirements presented, we should be able to find a 1U system for this specification.

If we can determine the smallest usable system requirements for one parsercache 'instance', we can scale accordingly to determine overall system quantity. Example: if we determine we need a system identical to what we have, that works out to roughly one CPU core per 16GB of memory per 10GB of SSD space; and if the current cluster is the minimum, we must have a total of 36 cores (3 servers x 12 cores) at that 16GB/10GB-per-core ratio to support the current need. (This is oversimplified, as scaling vertically and horizontally carry different considerations.)
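A rough sketch of that scaling arithmetic in Python, using only the rounded per-core ratios quoted in this comment (16GB of memory and 10GB of SSD per core). The names `PerCoreRatio` and `cluster_totals` are made up for illustration, and the three-server count is an assumption taken from the pc1001-pc1003 range in the task title.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerCoreRatio:
    """Resources assumed to accompany each CPU core (rounded figures from the comment)."""
    memory_gb: int
    ssd_gb: int


def cluster_totals(cores: int, ratio: PerCoreRatio) -> dict:
    """Scale the per-core ratio up to a whole cluster of `cores` cores."""
    return {
        "cores": cores,
        "memory_gb": cores * ratio.memory_gb,
        "ssd_gb": cores * ratio.ssd_gb,
    }


# Assumption: 3 existing servers (pc1001-pc1003), each dual-socket with 6 cores per CPU.
SERVERS = 3
CORES_PER_SERVER = 2 * 6

totals = cluster_totals(SERVERS * CORES_PER_SERVER, PerCoreRatio(memory_gb=16, ssd_gb=10))
print(totals)
```

With these inputs, the 36 cores map to 576GB of memory and 360GB of SSD across the cluster. The result is only as good as the per-server spec it starts from.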

Please note: This hardware-requests ticket should contain all the specification and metrics data, since it is public. We'll end up creating private S4 space tasks in the procurement project for quoting.

@RobH,

Parsercaches hold no user data, only cache, so we need them to be fast, but not reliable. To convince you this is the case: I once set 2 of them to read-only mode and nobody noticed for days, because they could still be read from and writes went to the third. The wikis suffer if the service fails completely, but a server is depooled automatically if it fails, and the data itself does not need to be replicated or preserved in any way. They are already in a redundant configuration at the server level.

Your spec, however, is wrong: pc1001 has 2.2 TB of SSD disks:

cat /sys/class/block/sd*/device/model 
INTEL SSDSA2M160
INTEL SSDSA2BZ30
INTEL SSDSA2BZ30
INTEL SSDSA2BZ30
INTEL SSDSA2BZ30
INTEL SSDSA2BZ30
INTEL SSDSA2BZ30
INTEL SSDSA2BZ30

lsblk
NAME               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                  8:0    0 149.1G  0 disk 
├─sda1               8:1    0   123G  0 part /
├─sda2               8:2    0   7.5G  0 part [SWAP]
└─sda3               8:3    0  18.6G  0 part /tmp
sdb                  8:16   0 279.5G  0 disk 
└─tank-data (dm-0) 252:0    0   1.8T  0 lvm  /a
sdc                  8:32   0 279.5G  0 disk 
└─tank-data (dm-0) 252:0    0   1.8T  0 lvm  /a
sdd                  8:48   0 279.5G  0 disk 
└─tank-data (dm-0) 252:0    0   1.8T  0 lvm  /a
sde                  8:64   0 279.5G  0 disk 
└─tank-data (dm-0) 252:0    0   1.8T  0 lvm  /a
sdf                  8:80   0 279.5G  0 disk 
└─tank-data (dm-0) 252:0    0   1.8T  0 lvm  /a
sdg                  8:96   0 279.5G  0 disk 
└─tank-data (dm-0) 252:0    0   1.8T  0 lvm  /a
sdh                  8:112  0 279.5G  0 disk 
└─tank-data (dm-0) 252:0    0   1.8T  0 lvm  /a
sr0                 11:0    1  1024M  0 rom
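As a cross-check of the 2.2 TB figure against the `lsblk` output above, here is a minimal Python sketch that sums the per-disk sizes. It assumes `lsblk` is reporting binary units (GiB), which is its default, and converts to decimal TB as used on drive labels. The sizes are copied by hand from the listing rather than parsed.

```python
GIB = 1024 ** 3   # lsblk's default "G" suffix is a binary gibibyte
TB = 1000 ** 4    # vendor-style decimal terabyte

# Whole-disk sizes in GiB, copied from the lsblk listing above.
disks_gib = {
    "sda": 149.1,  # system disk (/, swap, /tmp)
    "sdb": 279.5, "sdc": 279.5, "sdd": 279.5, "sde": 279.5,
    "sdf": 279.5, "sdg": 279.5, "sdh": 279.5,  # members of the tank-data LVM volume
}

# Total raw SSD capacity across all eight disks.
total_tb = sum(disks_gib.values()) * GIB / TB

# Raw capacity of the seven data disks only (everything except sda).
data_tb = sum(size for name, size in disks_gib.items() if name != "sda") * GIB / TB

print(f"all SSDs:   {total_tb:.2f} TB")
print(f"data disks: {data_tb:.2f} TB")
```

This gives roughly 2.26 TB across all eight SSDs and about 2.10 TB raw for the seven data disks.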

IOPS is not too crazy: we see a constant 1000 IOPS, mostly because we have a hit ratio of 98.0% on the buffer pool (memory). Unlike the regular wikis, disk space requirements are not expected to grow (they are bound by the amount of data cached). MySQL is right now using 1.55TB for the data.

So, what we need is cheap servers (no hw RAID) with *lots* of memory (if we want to improve specs, go for 200+GB); we will use all of it. The rest is less relevant: 2.5 TB of non-redundant SSD disk (RAID0). As usual with MySQL, do not focus too much on the number of cores: it has a constant load of 2 and low CPU utilization.

3 servers x 2 datacenters. While I would be tempted to go to only 2 servers per datacenter, we need 3 for the redundancy on maintenance.
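To put numbers on the request, here is a small Python sketch of the resulting totals. It assumes the 1.55TB of MySQL data is a per-server figure (it was measured on one host) and that maintenance takes one server per datacenter out of service at a time. The variable names are illustrative only.

```python
SERVERS_PER_DC = 3
DATACENTERS = 2
SSD_TB_PER_SERVER = 2.5   # requested non-redundant RAID0 capacity
MYSQL_DATA_TB = 1.55      # current MySQL data size (assumed to be per server)
MIN_MEMORY_GB = 200       # lower bound of the "200+GB" memory request

# Size of the overall order.
total_servers = SERVERS_PER_DC * DATACENTERS
total_ssd_tb = total_servers * SSD_TB_PER_SERVER
total_min_memory_gb = total_servers * MIN_MEMORY_GB

# How full each server's SSD array would be with today's data.
fill_ratio = MYSQL_DATA_TB / SSD_TB_PER_SERVER
headroom_tb = SSD_TB_PER_SERVER - MYSQL_DATA_TB

# Servers still serving in a datacenter while one is down for maintenance.
in_service_during_maintenance = SERVERS_PER_DC - 1

print(f"{total_servers} servers, {total_ssd_tb:.1f} TB SSD, "
      f">= {total_min_memory_gb} GB memory in total")
print(f"per-server fill: {fill_ratio:.0%}, headroom: {headroom_tb:.2f} TB")
print(f"in service during maintenance: {in_service_during_maintenance} per datacenter")
```

Under those assumptions the order is 6 servers with 15 TB of SSD in total. Each server would start at about 62% full, and two servers per datacenter stay in service during maintenance.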

Claiming this task, as I've started the process of obtaining quotes on the associated blocking tasks.

RobH changed the task status from Open to Stalled.Oct 29 2015, 6:13 PM
RobH updated the task description.
RobH added a subtask: Unknown Object (Task).Nov 5 2015, 2:25 PM
RobH changed the status of subtask Unknown Object (Task) from Open to Stalled.Nov 6 2015, 2:45 PM
RobH changed the status of subtask Unknown Object (Task) from Open to Stalled.

Please note that all requested quotes have been obtained off the sub-tasks and are now in review.

jcrespo closed subtask Unknown Object (Task) as Declined.Nov 16 2015, 4:05 PM

I'm resolving this parent task, as it compared Dell and HP system specifications during the refresh. We've gone with the Dell systems on T117068.

I'll create on-site tasks for receiving off T117068.

RobH changed the status of subtask Unknown Object (Task) from Stalled to Open.Dec 18 2015, 4:23 PM
RobH closed subtask Unknown Object (Task) as Resolved.Dec 18 2015, 5:17 PM
jcrespo mentioned this in Unknown Object (Task).May 29 2018, 3:55 PM
jcrespo mentioned this in Unknown Object (Task).May 29 2018, 4:07 PM