
Dedicated servers on WMCS to test WDQS scalability strategy
Open, Medium, Public

Description

As discussed here, we need to explore alternatives to Blazegraph to scale WDQS. This requires room for free experimentation, as well as testing of the scalability itself (which requires multiple nodes). Discussion with @bd808 indicates that we should be able to get dedicated servers on WMCS, which would provide adequate resources and isolation.

Ideally 3 servers with the following specs would be needed (limited validation of a solution would be possible with 2 servers).

Specs:

  • 8 cores
  • 128G RAM
  • 3T usable SSD storage (probably 6T RAID1)
  • 1G NIC
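The usable-storage figure above follows from the RAID level: RAID1 mirrors every disk, so usable capacity is half the raw capacity. As a sanity check, here is a rough (hypothetical) helper applied to the disk configurations mentioned in this task; it is not part of any WMF tooling and ignores filesystem overhead.

```python
def usable_capacity(disk_count, disk_size_tb, raid_level):
    """Rough usable capacity in TB for common RAID levels."""
    if raid_level in ("RAID1", "RAID10"):
        # Mirrored (or striped mirrors): half the raw capacity survives.
        return disk_count * disk_size_tb / 2
    if raid_level == "RAID0":
        # Pure striping: all raw capacity, but no redundancy.
        return disk_count * disk_size_tb
    raise ValueError(f"unhandled RAID level: {raid_level}")

# The spec above: ~6T raw mirrored -> ~3T usable
print(usable_capacity(2, 3.0, "RAID1"))
# wdqs10{09,10}: 4x800GB in RAID10 -> 1.6TB usable
print(usable_capacity(4, 0.8, "RAID10"))
# cloudvirt10[25-30]: 6x1.92TB in RAID10 -> ~5.76TB usable
print(usable_capacity(6, 1.92, "RAID10"))
```

These numbers match the figures quoted later in the thread (1.6TB for the wdqs prod hosts, ~5.7TB for the recent cloudvirts).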

Event Timeline

Gehel created this task. Apr 23 2019, 2:29 PM
Restricted Application added a subscriber: Aklapper. Apr 23 2019, 2:29 PM
Gehel triaged this task as Medium priority. Apr 23 2019, 2:30 PM
Gehel added a project: Wikidata-Query-Service.
Restricted Application added a project: Wikidata. Apr 23 2019, 2:30 PM
Andrew added a subscriber: Andrew. Apr 23 2019, 3:30 PM

Can you tell me more about what these hosts would actually do? Are they virt hosts hosting one big VM each?

> Can you tell me more about what these hosts would actually do? Are they virt hosts hosting one big VM each?

I think we need to think about this ask from the point of view of virtual instances with the ask being 3 VMs with these specs. The related data point to think about is the testing that @Smalyshev did with help from @Andrew this past year where Stas tried out various testing instances for WDQS staging/testing. That was tracked in T206636: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service, which I think this task is really a continuation of.

bd808 added a comment. Apr 23 2019, 3:51 PM

The next step here is to determine the hardware requirements and then find out if there is current fiscal year budget to cover procuring the required hardware, or if we can allocate a dedicated server from the existing cloudvirt pool and backfill that capacity.
The wdqs10{09,10} prod hosts each have 8 cores, 128GB RAM, 4x800GB SSD (1.6TB RAID10). Our most recent cloudvirt servers (cloudvirt10[25-30]) are 36 physical cores, 512GB RAM, 6x1.92TB SSD (5.7TB RAID10). I think this means that a single cloudvirt would be more than enough hardware if it was not shared with other arbitrary workloads.

Yes, judging from our preliminary tests, if we get uncontested use of the server, or maybe even of a certain chunk of it (not sure if that's possible?), it would be enough. Note that an interesting scenario we want to test in the foreseeable future involves a cluster setup, so we'd want at least 2 hosts (not sure whether they have to be on 2 separate hardware machines) with requirements close to those of the wdqs hosts. It could be that splitting a cloudvirt host into two VMs exclusively used by these test hosts would be OK. Not sure if the virtualization we do now allows this kind of fixed resource allocation (probably I/O resources also need to be taken care of?)

> Not sure if the virtualization we do now allows this kind of fixed resource allocation (probably I/O resources also need to be taken care of?)

We have a few special instances that we do this for today. It is something we are experimenting with for virtualization of shared Cloud Services systems such as the ToolsDB databases. The process requires cloud-root assistance to create the initial instances using CLI magic, but it is possible. Basically we mark the cloudvirt as unavailable for selection by the normal OpenStack scheduler and then force-create the desired instances there manually.
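The flow described above can be sketched with standard OpenStack admin commands. This is a hypothetical illustration only: the host, flavor, and image names are made up, and both steps require admin rights on the cloud.

```shell
# 1. Take the hypervisor out of the normal scheduling pool so the
#    scheduler stops placing arbitrary instances on it.
openstack compute service set --disable \
    --disable-reason "reserved for WDQS scale testing" \
    cloudvirt-test.example nova-compute

# 2. Force-create an instance on that specific hypervisor.
#    The zone:host form of --availability-zone is an admin-only way
#    to bypass normal scheduler placement.
openstack server create \
    --flavor wdqs-test-xlarge \
    --image debian-9.0-stretch \
    --availability-zone nova:cloudvirt-test.example \
    wdqs-test-01
```

This is a configuration sketch, not a runnable script; the exact flow WMCS uses may differ.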

bd808 added a comment. Apr 23 2019, 4:04 PM

I see 2 ways to approach this:

  1. Try to map these requirements into our "standard" cloudvirt hosts and expand our FY2019/2020 planned growth to include the needed expansion
  2. Spec and rack specific hardware sized for these instances using the "fake ironic" model of making them virt servers with a single instance each

A single cloudvirt identical to cloudvirt10[25-30] would provide all of the CPU and RAM needed for all 3 instances. It would not, however, provide the amount of disk being asked for (9T usable in total). We are actively discussing moving instance storage to some shared storage cluster (currently looking into Ceph), but do not currently know how that will work out, how long it will take, etc. It is also very unknown what kind of IOPS it will provide.
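To make the bottleneck concrete, here is a rough fit-check using the numbers quoted above (a hypothetical sanity check, not WMF tooling):

```python
# One cloudvirt10[25-30]-class host vs. the 3 requested test VMs.
cloudvirt = {"cores": 36, "ram_gb": 512, "disk_tb": 5.7}
requested_vm = {"cores": 8, "ram_gb": 128, "disk_tb": 3.0}
vm_count = 3

for resource, available in cloudvirt.items():
    needed = requested_vm[resource] * vm_count
    fits = "fits" if needed <= available else "does NOT fit"
    print(f"{resource}: need {needed}, have {available} -> {fits}")

# cores: need 24, have 36 -> fits
# ram_gb: need 384, have 512 -> fits
# disk_tb: need 9.0, have 5.7 -> does NOT fit
```

Disk is the only resource that does not fit, which is exactly why the discussion turns to shared storage (Ceph) and dedicated hardware.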

Gehel added a comment. Apr 23 2019, 4:11 PM

Yes, this is the continuation of T206636.

> 1. Try to map these requirements into our "standard" cloudvirt hosts and expand our FY2019/2020 planned growth to include the needed expansion
> 2. Spec and rack specific hardware sized for these instances using the "fake ironic" model of making them virt servers with a single instance each

In terms of constraints:

  • We know that IO is critical to the performance of WDQS in its current implementation, and we suspect the same of any replacement solution. So I'm not sure that having the 3 VMs share the same IO is a great idea.
  • We need to test the distributed nature of whatever replacement solution we pick. Hosting the 3 VMs on the same physical host would probably hide any issues caused by increased latency between separate physical servers.

So my preference would be to have dedicated servers, with a single VM on each, basically using our cloud infrastructure only as an isolation layer. I'm not sure how much sense this makes and we're very much open to suggestions!

> So my preference would be to have dedicated servers, with a single VM on each, basically using our cloud infrastructure only as an isolation layer. I'm not sure how much sense this makes and we're very much open to suggestions!

Yes, it makes sense. We are budgeting for the 3 hosts in Q1 of FY19/20.

Gehel moved this task from needs triage to Ops / SRE on the Discovery-Search board. May 1 2019, 4:35 PM

I verified today that we have budget for this build in FY19/20 Q1 (July-September 2019). Before I start a procurement request, @Gehel can you confirm that this test cluster is still desired by the Search team?

Perhaps @Smalyshev could confirm this? As I understand it T206561 is stalled on this issue (and has been for almost a year).

Evaluating both Virtuoso and other solutions (like JanusGraph) would require this. @Gehel should know the details.

bd808 assigned this task to Andrew. Sep 11 2019, 3:06 PM
bd808 edited projects, added cloud-services-team (Kanban); removed cloud-services-team.

Assigning to @Andrew so he can start the procurement process with DCOps.