
new labstore hardware for eqiad
Closed, Resolved · Public

Description

Synopsis:

labstore1001 is the active NFS server for most mounts in Labs. It has 8 cores and 32G of RAM and is overloaded. There are various measures in place to reduce the load on the server as much as possible, but in general we are at the point of either limiting users often and aggressively or increasing the available resources. We discovered this server is older than anticipated, having been bought in 2012(?).

labstore1002 is the passive NFS server for labstore1001, but it requires an on-site physical cable move in order to assume the role, and at the moment (T121905) has an atypical configuration which (afaik) would make putting it in service difficult. 8 cores, 32G RAM. We discovered this server is older than anticipated, having been bought in 2012(?).

labstore1003 is the active server for dumps and miscellaneous large files. It is currently idle most of the time. 16 cores, 64G RAM.


Possibilities for replacement are tbd.

Event Timeline

chasemp created this task.Feb 5 2016, 11:23 PM
chasemp updated the task description. (Show Details)
chasemp raised the priority of this task from to Normal.
chasemp added a project: Operations.
chasemp added subscribers: Aklapper, chasemp.
mark added a subscriber: mark.

Need some guidance here on how we can sort out new servers while breaking up the existing shelves.

mark added a comment.Feb 10 2016, 10:22 AM

Need some guidance here on how we can sort out new servers while breaking up the existing shelves.

This is a diagram that describes the current connection topology: https://wikitech.wikimedia.org/wiki/Labs_NFS#/media/File:Labstore_hardware_diagram.png

...with the exception that currently one of two servers (atm labstore1002) is actually disconnected at all times, to avoid conflicts.

As you can see, we have 5 drive shelves connected and active (and one additional spare). There are two daisy chains of shelves, one consisting of 3 shelves, the other of 2 shelves. Each server has (at least) two ports that can be connected to such a chain.

So, splitting these shelves over multiple (sharded) NFS servers could be as simple as moving one of the current daisy chains to another server with an external SAS controller port.

mark added a comment.Feb 10 2016, 10:25 AM

As for the general plan:

I think we should simply buy new, higher specced versions of the current labstore1001, with more/faster CPUs/cores, more memory (128 GB perhaps?) and 10G Ethernet. Then we could install one of them (let's call it labstore1004 for now), switch over to it as if it were current labstore1002. If that's working well, we do the same with labstore1001->labstore1003. Then we have the current setup, but on higher capacity, newer machines.

Given that we could still use the old machines for another year perhaps, we could then experiment with sharding on them, and split off say 2 shelves and host e.g. non-Tools NFS there, or something.

mark added a comment.Feb 10 2016, 10:26 AM

One thing to consider is internal drives. The current systems have those (12 of them I think), but they aren't really used, as we can't access them "from the other server" after a failover.

Perhaps we can consider buying the new ones without internal drives, and only rely on the shelves. That would also mean we can buy 1U servers instead of 2U, and that would be helpful to squeeze them in the current rack where the shelves are.

chasemp added a comment.EditedFeb 17 2016, 10:48 PM

General summary and outline of above and IRC discussions.

Desired acquisition:

Two servers with 128G of RAM, 32 cores, and >=10T of local disk after RAID1 or 10, and a 10G interface. SAS controllers necessary.

Problem(s):

Labstore1001 is overloaded. I have some measures in place in Tools to minimize load issues sourced from there, but it is not difficult at present to overwhelm our NFS setup. With the coming container-oriented Tools setup I believe our client count will increase (especially during the long transition period), and we have been unable to sustain an nfsd thread count that matches our concurrent clients. We see periods of retrans on clients with nfsstat and there is little we can do about it. We have reached the point of diminishing returns for labstore1001 tuning: 8 cores and 16G of RAM is just not enough for 250+ concurrent NFS clients with variable load. With more hardware resources I believe we can find a sustainable model, but we should also limit the ability of certain portions of the environment to affect others. I would like to isolate home directories over NFS, for both growth containment and performance isolation. In some fashion I want to shard on classes of service as discussed here: https://wikitech.wikimedia.org/wiki/Labs_labs_labs/future#Defining_Classes_of_Service.
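(As a side note on the retrans observation: below is a minimal sketch, not an existing script, of the kind of client-side check involved. It reads the same counters nfsstat does from /proc/net/rpc/nfs, where the client "rpc" line holds calls/retrans/authrefrsh; the 60s interval is arbitrary.)

# Sketch only: sample client-side NFS RPC retransmissions once a minute.
import time

def rpc_counters(path='/proc/net/rpc/nfs'):
    """Return (calls, retrans) from the client RPC counters that nfsstat reads."""
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields and fields[0] == 'rpc':
                return int(fields[1]), int(fields[2])
    raise RuntimeError('no rpc line in %s; is this an NFS client?' % path)

if __name__ == '__main__':
    prev_calls, prev_retrans = rpc_counters()
    while True:
        time.sleep(60)
        calls, retrans = rpc_counters()
        print('%d calls, %d retrans in the last 60s'
              % (calls - prev_calls, retrans - prev_retrans))
        prev_calls, prev_retrans = calls, retrans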

We have an unknown surrounding the viability of dual-connected shelves and failover for identically configured servers. If I understand correctly, we previously had a cross-connected shelf setup with the idea that remote failover was possible, but this is suspected of having caused data loss / corruption. Ideally, the new hardware would allow us to work this out.

https://grafana.wikimedia.org/dashboard/db/labs-labstore1001

Open question:

"There is one shelf that is known to have had issues with the controller on labstore1002 (shelf4, above), which was avoided in the current setup and is not currently used." from https://wikitech.wikimedia.org/wiki/Labs_NFS#Software_RAID. I don't see a "shelve 4" on https://wikitech.wikimedia.org/wiki/File:Labstore_hardware_diagram.png.

I have also searched for issues surrounding this shelf and talked to cmjohnson, who said he has never heard of any issues. So I'm wondering whether one of the shelves is suspect and, if so, how and why.

Current situation:

I think much of https://wikitech.wikimedia.org/wiki/Labs_NFS is no longer correct. Shelves do not present themselves as a single RAID 0, and no shelves are currently connected to labstore1002 as shown in https://wikitech.wikimedia.org/wiki/File:Labstore_hardware_diagram.png, but I do believe the failover mode at https://wikitech.wikimedia.org/wiki/Labs_NFS#Failover is current.

I am continuing as if:

  • we have two labstore servers (8 cores, 16G RAM)
  • we have 5 shelves (but see open questions section)
  • each shelf has 12 1.8TB SAS drives
  • labstore1001 and labstore1002 have SAS controllers
  • current failover mode requires hands-on intervention
  • servers and shelves occupy U space 43-29 in C3 eqiad

https://racktables.wikimedia.org/index.php?page=rack&rack_id=1958

Next iteration layout:

Labstore1004 and Labstore1005 (new servers) with 2 shelves attached allowing a 'shared' 21.6T of allocatable storage with mirrored shelves.

Labstore1001 and Labstore1002 (current servers) with 2 shelves attached allowing a 'shared' 21.6T of allocatable storage with mirrored shelves.

1 shelf in excess for TBD
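(For reference, the 21.6T per pair works out as follows; a quick arithmetic sketch assuming each shelf holds the 12 x 1.8TB SAS drives listed above and that one shelf of each pair mirrors the other.)

# Sketch: usable space for a mirrored pair of shelves.
DRIVES_PER_SHELF = 12
DRIVE_TB = 1.8

raw_pair_tb = 2 * DRIVES_PER_SHELF * DRIVE_TB  # 43.2 TB raw across the two shelves
usable_tb = raw_pair_tb / 2                    # shelf-level mirroring halves it
print(usable_tb)                               # 21.6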

Labstore1003 remains as "dumps" and large file share. I am considering whether it makes sense to move the scratch share here as well.

Rough shuffle outline:

  1. Acquire Labstore1004/5
  2. Labstore1002 moved to another location with space for labstore1001 and 2 shelves to follow. I think we have a few candidates in row C.
  3. Labstore1004 put in place of labstore1002 (assuming 2U or smaller)
  4. Tools/others shares are moved to local disks on Labstore1004 (with backups running nightly). At this point, resource-wise, we are in a better place. Scratch is handled by labstore1003, which has the headroom (this is not backed up now). Maps follow labstore1002 with local disk at $new_location.
  5. Labstore1001 and 2 shelves moved close to labstore1002.
  6. Labstore1005 put in place of labstore1001 in c3 with 3 shelves near it and labstore1004.
  7. Connect labstore1005 and shelves in c3
  8. Connect labstore1001 and shelves at $new_location
  9. Reformat/reallocate space on Labstore1001 shelves for /home directories and maps
  10. This will allow for a proof of concept of failover in both modes: always-connected storage arrays, and on-site cable moves between similarly configured servers.
  11. Reformat/reallocate space on Labstore1005 shelves for Tools and others, with a better understanding of the failover mechanism gained from the labstore1001/1002 shelf configurations.
  12. Move tools/others to labstore1004/1005 shelves
  13. Move /home and maps to labstore1001/1002 shelves

Some of the latter bits, especially testing shelf failure modes, are negotiable in order of operations, depending on how sane it is to keep maps/home on labstore1001/1002 while working things out. It would also be sane to put things back into a labstore1004/1005-centric mode beforehand, but as outlined I believe we can get by with 2 short-ish periods of user-impacting migration, after which we would be in a better situation both times.

mark added a subscriber: RobH.Feb 18 2016, 4:14 PM

I'll review the new layout in a bit more detail, but overall I think we can proceed with getting hardware quotes in the meantime.

@RobH: can you create a procurement ticket out of the above request for 2 servers, and proceed with that?

RobH added a comment.EditedFeb 18 2016, 9:29 PM

I have some questions in regards to the disk shelf requirements of the two new servers.

The issue is that the old MD1200 and MD1220 are NOT compatible with new Dell systems. The older systems use a 6Gbps backplane, while the new Dell SAS controllers and MD shelves use a 12Gbps backplane. I investigated this particular issue when we wanted to update the dataset server specification (T125422), as that server connects to an array of MD1200 disk shelves.

So if we order a new Dell system, Dell states we can't simply install the SAS controller from the older Dell system; it has to use the newer, twice-as-fast controller. This means Dell won't support the old shelves on the new controllers. They will refer us to an after-market, non-Dell cable that supposedly downgrades the speed to 6Gbps and works, but it would be a hacky, unsupported solution.

As such, it seems the server quote request must also include a full new generation of disk shelves if we want vendor support. Alternatively, we can attempt to use the older shelves with the new Dells, or potentially HP systems, but that wouldn't be supported by either vendor. Any quote request for the servers would then of course exclude the SAS controller to drive the shelves. We would have to either use the supported Dell controller with an unsupported cable (with a multi-shelf array), or attempt a third-party controller to drive the existing shelves in the new system.

If that sounds acceptable, I'll get the quotes without the shelves. Replacing the shelves would be a large overall cost, so I'm not sure if we want them quoted as well?

RobH added a comment.Feb 19 2016, 6:05 PM

Ok, I've just chatted with @mark about this via IRC and we have the following plan:

  • Get quotes for the servers without the disk shelves
  • Get quotes for the servers with the disk shelves
  • Test a new Dell R430 with an older Dell controller for an MD1200 and attach it to the spare lab disk shelf for testing.
    • If it works without issue, Dell was incorrect in advising it wouldn't work.
RobH mentioned this in Unknown Object (Task).Feb 19 2016, 7:15 PM
RobH added a subtask: Unknown Object (Task).
RobH claimed this task.Mar 3 2016, 11:04 PM
RobH changed the task status from Open to Stalled.

I'll keep this assigned to me, as the procurement tasks are all pending.

chasemp added a comment.EditedMar 18 2016, 8:24 PM

An overview of where I'm at with this, to further T127508:

In https://phabricator.wikimedia.org/T127508#2063406 I outlined that I believe our historical perspective on this has been ill-fated. I have also dug up T117453 and closed it as invalid. Pursuing this model we have replaced all 4 SAS cards in labstore1001, 1002, 2001, and 2002 within the recent past. I believe this was a red herring, and in all likelihood those cards are fine. As noted in T117453:

SCSI reservations, while they should do the trick, are not supported by the H800 controller even though the disks actually support it.

Moving on, we need to either:

  • Purchase an MD3000 (or equivalent) or some redundant server SCSI hardware that can support SCSI reservations, either by linking the hosts (as it appears to be done) or by having the logic in a storage head that polices concurrent host access for consistency
  • Find a way to have software-oriented redundancy and high availability

https://phabricator.wikimedia.org/T126089#2041718 has indicated that it's a brand new world if we want to get into a higher class of SAS storage and capabilities. We would in all likelihood need new cards, shelves, servers, etc., as the outcome of T127490 seems to be that there is no workable solution for old cards in newer servers. I spent a bit of time with Chris looking at ways to piece this together, and last I knew, no such luck.

So I have been looking at software possibilities, the best suited of which is DRBD.

Within my tools-renewal project in Labs I set up an nfs-server and nfs-server-backup and tested an NFS server on LVM/DRBD vs. the same NFS server on LVM. I ran into an issue since we have no precedent for floating an internal IP within Labs. I toyed with allowing it, but the nuance of how instance security policies work made it a long-tailed issue. So I switched to using a few of our labtest hosts.

  • As long as the FSID is the same (it affects the client-side handle), it is pretty feasible to fail over server-side NFS (on DRBD or not) and have clients catch up after a brief "blip", assuming data integrity between the server-side nodes (a minimal consistency check for this is sketched a bit further below).
  • DRBD with synchronous writes adds somewhere around 15-20% overhead for protocol C, and can be inched up with protocol B, but that is still beholden to the sync buffers; i.e. large traffic patterns can overwhelm the sync, neutering the advantages of protocol B. I also tried protocol A to get a feel for how far I could turn things up, and really there is little advantage there depending on the traffic patterns themselves. It's true that completely asynchronous replication between nodes is advantageous at small volumes, but eventually the secondary node will be overwhelmed and a full resync is triggered (even though DRBD can do a more intelligent hash-and-compare sync nowadays). The X factor is moving the metadata off to its own physical volume, as I think we can reclaim some of that performance for sure, but I'm lacking a good test setup for that at the moment. The silver lining is that if there is a short-term or a serious performance issue, disabling the secondary node puts DRBD at pretty low overhead, as an UNKNOWN peer is basically ignored and the primary continues along at a small loss above raw LVM. Incidentally, LVM itself adds very, very little overhead to the physical storage in my testing, which is based primarily on dd and small (50MB) to large (10-50G) files (a rough version of that probe is sketched right after this list). I'm hopeful that with improved hardware, fast local storage, and metadata isolation we can be fine at 85-95% of raw throughput while using DRBD. I'm against a wall where we need to iterate on this problem from this point forward, as it's very hard to tell much more than that the configuration is sound.
  • DRBD on LVM is a supported configuration, and there are indications it is actually recommended. DRBD on LVM supports dynamic volume growth (https://drbd.linbit.com/en/users-guide/s-resizing.html), and DRBD allows for automated snapshotting of a partition in the case of a degraded-node resync, where loss of a single node would result in total data loss.
  • No matter what method we use, whether it's periodic replication of snapshots or an ongoing DRBD meta-partition, we are going to be forced to accept periods of outage or read-only operation to ensure consistency between nodes. Our current use of NFS for logs and other highly active stateful data is never going to be well suited to this. T127367 is as important as it has always been, if not more so, as pursuing redundancy here shows how weak our model is.
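(A rough sketch of the write probe mentioned in the second bullet; this is illustrative only, the actual testing was done with dd. The paths in the usage comments are hypothetical test mounts.)

# Sketch: time a large synchronous write and report MB/s, to compare a
# DRBD-backed volume against the same filesystem layout on plain LVM.
import os
import time

def write_throughput_mb_s(path, total_mb, block_kb=1024):
    """Write total_mb of zeroes in block_kb chunks, fsync, and return MB/s."""
    block = b'\0' * (block_kb * 1024)
    start = time.time()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(total_mb * 1024 // block_kb):
            os.write(fd, block)
        os.fsync(fd)  # make sure the data actually reaches the device
    finally:
        os.close(fd)
    return total_mb / (time.time() - start)

# Hypothetical usage, one path per backing store being compared:
# print(write_throughput_mb_s('/srv/drbd-test/probe.dat', 1024))
# print(write_throughput_mb_s('/srv/lvm-test/probe.dat', 1024))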

So some version of software redundancy and failover is doable here, but we will have to tweak and work out some of the specifics of how we want clients to handle periods of failover, and even brief unavailability.
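(On the FSID point from the first bullet above: a minimal sketch, not an existing script, that pulls the pinned fsid= options out of /etc/exports so the output can be diffed between the primary and the secondary. It assumes exports pin an explicit fsid and ignores continuation lines.)

# Sketch: list path -> fsid for every export that pins one.
import re

def pinned_fsids(exports='/etc/exports'):
    """Return {export_path: fsid} for exports with an explicit fsid option."""
    result = {}
    with open(exports) as f:
        for line in f:
            line = line.split('#', 1)[0].strip()  # drop comments
            if not line:
                continue
            path = line.split()[0]
            match = re.search(r'fsid=([^,)\s]+)', line)
            if match:
                result[path] = match.group(1)
    return result

if __name__ == '__main__':
    for path, fsid in sorted(pinned_fsids().items()):
        print('%s fsid=%s' % (path, fsid))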

On the specifications needed without SCSI storage, we are looking at (after some cleanup today):

/dev/mapper/labstore-tools   ext4      8.0T  5.0T  3.1T  63% /srv/project/tools
/dev/mapper/labstore-maps    ext4      6.0T  2.7T  3.3T  45% /srv/project/maps
/dev/mapper/labstore-others  ext4       11T  2.9T  8.0T  27% /srv/others
/dev/mapper/labstore-scratch ext4      984G  537G  397G  58% /srv/scratch

That puts our live storage use at 10.6T. We can shuffle scratch off anywhere, and at 1T it is not terribly difficult. I expect this arrangement to shift considerably as things change and our storage becomes more robust. We have already seen that as our storage solution solidifies, users are able to use it more, and demands surface that previously ran into a cataclysmic ceiling.

So, separating out user files within Tools at least, we can do something like the following (collapsing onto the new nodes short-term while the existing nodes are reinstalled and made sane):

1T tools /home
8T /project/tools
10T /home
1T /scratch (on labstore1003?)

...allowing for a shift of things to the new hardware while a plan is figured out for splitting our shelves between labstore1001/1002 in service of the same software redundancy mechanism. In that case we would move /home (and possibly even /tools/home) over to the existing hardware, and have only operationalized storage with decent growth possibility on the new hardware.

i.e. I believe Plan C from T127508 is sane as the next level of iteration, but there will definitely have to be some iteration over the next 24 months as storage needs surface.

An R720xd can accommodate up to 12 * 4TB SATA drives (10 * 4TB / 2 for RAID 10 = ~20TB usable).
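(To make that arithmetic explicit; a quick sketch. The assumption that 2 of the bays are held back, e.g. for hot spares or an OS mirror, is mine and is just how the ~20TB figure appears to be derived. The 24 * 2TB alternative that comes up in the next comment is included for comparison; how many drives would be reserved in that configuration isn't stated.)

# Sketch: usable capacity of a RAID 10 set, with some drives held back.
def raid10_usable_tb(drive_count, drive_tb, reserved=2):
    """RAID 10 keeps half of the raw capacity of the drives it actually uses."""
    return (drive_count - reserved) * drive_tb / 2.0

print(raid10_usable_tb(12, 4))  # 12 x 4TB -> 20.0 TB, matching the figure above
print(raid10_usable_tb(24, 2))  # 24 x 2TB -> 22.0 TB with the same 2-drive reserve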

RobH added a comment.Mar 22 2016, 3:52 PM

So it turns out the CPU we have picked out cannot work with the 12 * 4TB option, as the combination draws too much voltage. However, it can work with 24 * 2TB, so we're having that quote generated now.

I'll claim this hw-request task until the sub-tasks are ordered.

mark added a comment.Mar 23 2016, 1:57 PM

My main fear is that 20 TB (after RAID10) is not a lot of headroom, considering we are using over 10 TB today, and we'll need some room for LVM snapshots etc. as well. With just internal drives and an already full chassis there are no real options for expanding it, it seems, and I'm not sure this will last us very long. What's the plan for that if it happens?

This layout:

/dev/mapper/labstore-tools   ext4      8.0T  5.0T  3.1T  63% /srv/project/tools
/dev/mapper/labstore-maps    ext4      6.0T  2.7T  3.3T  45% /srv/project/maps
/dev/mapper/labstore-others  ext4       11T  2.9T  8.0T  27% /srv/others
/dev/mapper/labstore-scratch ext4      984G  537G  397G  58% /srv/scratch

is only for the duration of the transition, to allow moving traffic off of the current labstore1001/1002 so they can be reimaged; then others, maps, and maybe the Tools /home move back to them depending on how things stand usage-wise. At that point we either end up with 5.0T of Tools (which includes the 700G or so of Tools /home), or less and closer to 4T with /home for Tools on the current hardware. Projects within Tools are the most significant aggregate traffic by a large margin and the most impacted by the rate limiting. Even there, though, we stand to claim back space as we transition logging out.

Restricted Application added a subscriber: TerraCodes. · View Herald TranscriptApr 19 2016, 7:03 PM
RobH closed this task as Resolved.Apr 19 2016, 7:04 PM

This request has been fulfilled via order on procurement task T127508. Resolving this hardware-requests task.

RobH changed the status of subtask Unknown Object (Task) from Open to Stalled.Jul 11 2016, 5:33 PM
RobH closed subtask Unknown Object (Task) as Resolved.Oct 12 2016, 5:50 PM