Page MenuHomePhabricator

Storage capacity upgrade for WDQS
Open, NormalPublic

Description

Given the growth rate of Wikidata, we need to increase storage capacity for WDQS.

There are 2 possible strategies:

  1. increase the storage capacity of our current servers

This is fairly straightforward, but would require adding storage in our current servers.

  1. migrate from RAID1 to RAID0

Other services (elasticsearch / cassandra) manage redundancy at the cluster level and not at the server level. For this to make sense, we need to ensure that loosing a full server is a non event (the cluster is robust to loosing a node AND reimaging is cheap). In the case of WDQS, we currently have 3 nodes per clusters (4 clusters: public / internal for both eqiad and codfw), increasing it to 4 nodes would provide enough redundancy at the cluster level that loosing a node is a non issue.

Note that while the storage ('/srv') is migrated to RAID0, we'll keep the OS on RAID1.

Note: the current implementation of WDQS is not shardable, this is addressed separately (T221631 / Wikidata_query_service/ScalingStrategy)

Event Timeline

Gehel created this task.Apr 23 2019, 2:37 PM
Restricted Application added a project: Wikidata. · View Herald TranscriptApr 23 2019, 2:37 PM
Smalyshev added a subscriber: Smalyshev.EditedApr 23 2019, 7:05 PM

This also makes me note that if we do introduce sharding (in any shape, either with Blazegraph or another solution) we'd need even more servers, since each shard would need to be at least on 2-3 servers to survive, so for sharding to make any sense we'd need at least 6, maybe even more servers, otherwise we'd just store every or nearly every shard on every server, which makes sharding pointless.

So if we'd want resilience to loss of a server and meaningful sharding, we'd need something like 2x servers probably.

Smalyshev updated the task description. (Show Details)Apr 23 2019, 7:59 PM
Gehel added a comment.Apr 25 2019, 8:58 AM

This also makes me note that if we do introduce sharding (in any shape, either with Blazegraph or another solution) we'd need even more servers, since each shard would need to be at least on 2-3 servers to survive, so for sharding to make any sense we'd need at least 6, maybe even more servers, otherwise we'd just store every or nearly every shard on every server, which makes sharding pointless.
So if we'd want resilience to loss of a server and meaningful sharding, we'd need something like 2x servers probably.

As discussed on IRC, this depends a lot on what the future solution is, how it shards / replicate, and what the scaling strategy is. I don't think we can take any decision on the future architecture at this point. We might want to reserve some budget for whatever solution we come up with, but the scope isn't defined enough yet to have any meaningful estimate.

faidon added a subscriber: faidon.Apr 25 2019, 10:32 AM

I don't think it makes sense to perpetuate a vertical scaling model. Both of the options listed here (adding disks, RAID 0) are things that we generally do not do, due to the hidden costs and burdens for everyone involved. Taking machines offline and rebuilding them from scratch just because a disk failed or because we need more storage is really something that we need to avoid, and something that the data center operations team cannot really support with its existing staffing (esp. taking into account the failure rate of disks).

PoC/MVPs with vertical scaling are obviously OK and we can be somewhat flexible in HW needs for those, but for a service with the maturity and popularity of WDQS I think it makes sense to start designing it for scale at this point. It's clear that it's here to stay, and that going down a vertical scaling route today would only mean that we're deferring the problem until the next storage expansion, i.e. creating more tech debt. Horizontal scaling and quick/cheap operations for expansion (= adding boxes over time, staggering them over multiple FY) is really the way to go here.

Budget-wise we can be flexible indeed! We can allocate some funds without defining exactly how we would spend them (within reason), and defer that decision until there is more clarity on the technical front. Does that make sense?

Gehel added a comment.Apr 25 2019, 5:12 PM

I don't think it makes sense to perpetuate a vertical scaling model.

I could not agree more! And there is work going into this direction (see this page). But changing the full stack while not breaking the current clients is not going to happen overnight. So we need a solution in the meantime.

I don't think it makes sense to perpetuate a vertical scaling model

With regard to disk space, I don't think we have a choice at least for the next FY. Even if we find alternative backend that can be a drop-in replacement for Blazegraph (or somehow miraculously discover an easy way to shard the data without killing performance) - and my current estimate for this optimistic scenario is "very low probability" - implementing that is likely going to take time. And in all that time we'd have to continue service queries on existing platform. Which means each node has to store the whole set of the query data.

We can discuss future sharding solutions, and it's totally fine, but realistically I personally do not see any scenario where we have any such solution implemented and ready to the point we can 100% migrate our Wikidata query load to it in a year. Right now we are no further in this road than "we should think about trying some options", and we aren't even clear on what these options would be. It's a long road from here to a working sharding solution, and I don't see a way to make through it in a shorter time. If somebody sees a way please tell me. Which for me means that at least for the next FY, we need to plan for the current model - which implies hosting all the data on each server - to be what we have.

Right now our data size is about 840G, and it keeps growing. I know Wikidata has growth projections so I'll add them here later but you can look on the DB size graph and make your own conclusions. For me, this says we'll start to run out of disk space before the end of this calendar year. So we need to find solution for that - be it new disks, RAID0 or whatever magic there is.

So when we discuss the budget for the next FY, I think we should base it on current model and growth projections we have now being the facts on the ground. If we find better model (and we do allocate time and resources to look for it), great, but we should not be basing our planning on miracles happening.

Taking machines offline and rebuilding them from scratch just because a disk failed or because we need more storage is really something that we need to avoid

So far it was my experience this happened every time we did capacity upgrade. Has this changed (regardless of RAID0 move, I'm talking about current situation) or we still need to take the host offline when we add disk capacity now? If it hasn't changed, is there a model (not involving changing WDQS platform software, etc.) that can allow us to upgrade capacity without reimaging?

Gehel moved this task from needs triage to Ops / SRE on the Discovery-Search board.May 1 2019, 4:35 PM
Smalyshev triaged this task as Normal priority.Jun 21 2019, 4:47 AM
RobH added a subscriber: RobH.EditedJul 9 2019, 4:07 PM

I'd like to note that I'm against the trend of our raid configurations becoming raid0 per node, and relying on horizontal node redundancy over disk redundancy.

Losing a single disk in a raid0 node invalidates the entire node, making it a single point of urgency for our on-site engineers to address quickly. If we stick with some kind of actual disk redundancy, we lessen the urgency of those disk failures by an order of magnitude.

Edit addition: wdqs could lose a single node, but could it lose two? Disk failures are the most common hardware failure, and it wouldnt be unheard of for two systems to have a single disk failure within a day or two of one another. This is particularly true when the systems have had disks purchased in batches.

wdqs could lose a single node, but could it lose two?

Depends. Losing two hosts in the same cluster in the same datacenter would be a huge issue. But we could temporarily borrow a node between clusters (e.g. internal -> public) if such thing happens. Also, we should be adding one host for the public cluster shortly (we want to do it regardless of anything else) so this would make it more resilient against such scenario.

Gehel added a subscriber: mark.Jul 10 2019, 9:43 AM

wdqs could lose a single node, but could it lose two?

The current configuration is 3 nodes per cluster, and we can loose one node. The plan here is to add 1 node per cluster to ensure that we can loose 2 nodes. So yes, we are taking this into account. Even if we loose 2 nodes, we can move nodes between the public and private clusters. Things start to get tensed if loose > 3 nodes in the same DC.

The chat I had with @mark can be summarized as:

  • urgency and coordination needs are the killer for DC Ops, we need to keep those at a minimum.
  • WDQS nodes are entirely independent from each other, and can be taken down or restarted whenever, without synchronization with the service owner
  • loosing a node is a non event

We probably want to revisit all this if / when we change to another database.

Gehel mentioned this in Unknown Object (Task).Jul 11 2019, 9:28 AM
Gehel mentioned this in Unknown Object (Task).Jul 11 2019, 12:29 PM
Gehel added a subtask: Unknown Object (Task).
Gehel added a subtask: Unknown Object (Task).
Gehel updated the task description. (Show Details)Jul 19 2019, 1:44 PM

I've just updated the task description to make it clear that even if we move storage to RAID0, we'll keep the OS on RAID1 (same scheme used by elasticsearch servers).

Gehel added a comment.Jul 19 2019, 2:35 PM

Rough back of the enveloppe calculation of the cost of staying on RAID1 is on T227755#5349525. Since it contains pricing, I'm keeping this on the procurement task that is private.

Talked to @Gehel, @faidon, and @RobH - and we're going to proceed with RAID0 for this install. Although RAID0 is not ideal, we can make an exception here, as long as a drive failure that causes the system to go down, won't result in any dc-ops onsite repair emergencies.

Thanks,
Willy