Dedicated servers on WMCS to test WDQS scalability strategy
Closed, Resolved · Public

Description

As discussed here, we need to explore alternatives to Blazegraph to scale WDQS. This requires freedom to experiment and the ability to test scalability itself, which in turn requires multiple nodes. Discussion with @bd808 indicates that we should be able to have dedicated servers on WMCS, which would provide adequate resources and isolation.

Ideally 3 servers with the following specs would be needed (limited validation of a solution would be possible with 2 servers).

Specs:

  • 8 cores
  • 128G RAM
  • 3T usable SSD storage (probably 6T RAID1)
  • 1G NIC

Event Timeline

Gehel created this task. Apr 23 2019, 2:29 PM
Restricted Application added a subscriber: Aklapper. Apr 23 2019, 2:29 PM
Gehel triaged this task as Medium priority. Apr 23 2019, 2:30 PM
Gehel added a project: Wikidata-Query-Service.
Restricted Application added a project: Wikidata. Apr 23 2019, 2:30 PM
Andrew added a subscriber: Andrew. Apr 23 2019, 3:30 PM

Can you tell me more about what these hosts would actually do? Are they virt hosts hosting one big VM each?

I think we need to treat this request as one for virtual instances, that is, 3 VMs with these specs. The related data point is the testing that @Smalyshev did with help from @Andrew this past year, where Stas tried out various testing instances for WDQS staging/testing. That was tracked in T206636: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service, which I think this task is really a continuation of.

bd808 added a comment. Apr 23 2019, 3:51 PM

The next step here is to determine the hardware requirements and then find out if there is current fiscal year budget to cover procuring the required hardware, or if we can allocate a dedicated server from the existing cloudvirt pool and backfill that capacity.

The wdqs10{09,10} prod hosts each have 8 cores, 128GB RAM, 4x800GB SSD (1.6TB RAID10). Our most recent cloudvirt servers (cloudvirt10[25-30]) are 36 physical cores, 512GB RAM, 6x1.92TB SSD (5.7TB RAID10). I think this means that a single cloudvirt would be more than enough hardware if it was not shared with other arbitrary workloads.
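The usable-capacity figures quoted above follow from RAID10 mirroring every disk, so usable space is half the raw total. A minimal sketch of that arithmetic, assuming plain RAID10 with no spares or filesystem overhead (the function name is illustrative and not taken from any WMF tooling):

```python
def raid10_usable_tb(disk_count: int, disk_size_tb: float) -> float:
    """Usable capacity of a RAID10 array: half the raw capacity (mirrored pairs)."""
    if disk_count < 4 or disk_count % 2:
        raise ValueError("RAID10 needs an even number of disks, at least 4")
    return disk_count * disk_size_tb / 2


# Disk layouts as quoted in the comment above.
wdqs_usable = raid10_usable_tb(4, 0.8)        # wdqs10{09,10}: 4x800GB
cloudvirt_usable = raid10_usable_tb(6, 1.92)  # cloudvirt10[25-30]: 6x1.92TB

print(f"wdqs: {wdqs_usable:.2f} TB, cloudvirt: {cloudvirt_usable:.2f} TB")
```

The cloudvirt result comes out at 5.76 TB, which matches the roughly 5.7TB quoted for those hosts.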

Yes, judging from our preliminary test, if we get uncontested use of the server, or maybe even of a fixed chunk of it (not sure if that is possible), it would be enough. Note that an interesting scenario that we want to test in the foreseeable future involves a cluster setup, so we'd want at least 2 hosts (not sure whether they have to be on 2 separate hardware machines) with specs close to what the wdqs hosts have. It could be that splitting a cloudvirt host into two VMs used exclusively by these test hosts would be OK. Not sure if the virtualization that we do now allows that kind of fixed resource allocation (I/O resources probably also need to be taken care of?)

We have a few special instances that we do this for today. It is something that we are experimenting with for virtualization of shared Cloud Services systems such as the ToolsDB databases. The process requires cloud-root assistance to create the initial instances using CLI magic, but it is possible. Basically, we mark the cloudvirt as unavailable for selection by the normal OpenStack scheduler and then force-create the desired instances there manually.
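The placement approach described above can be pictured with a toy model: hosts flagged as unavailable are skipped by normal scheduling, while an operator can still name such a host explicitly. This is a sketch in plain Python, not OpenStack code; the host names, the `pick_host` helper, and the "first enabled host wins" rule are all assumptions made for illustration.

```python
from typing import Dict, Optional


class NoValidHost(Exception):
    """Raised when no host can take the instance."""


def pick_host(hosts: Dict[str, bool], forced_host: Optional[str] = None) -> str:
    """Choose a host. `hosts` maps host name -> enabled for normal scheduling."""
    if forced_host is not None:
        if forced_host not in hosts:
            raise NoValidHost(forced_host)
        # Operator override: placement ignores the enabled flag.
        return forced_host
    candidates = sorted(name for name, enabled in hosts.items() if enabled)
    if not candidates:
        raise NoValidHost("no enabled hosts")
    return candidates[0]


# Hypothetical pool: one shared host, one host reserved for dedicated instances.
pool = {"cloudvirt-shared": True, "cloudvirt-dedicated": False}

normal_placement = pick_host(pool)
forced_placement = pick_host(pool, forced_host="cloudvirt-dedicated")
print(normal_placement, forced_placement)
```

Ordinary requests never land on the reserved host, so whatever is force-created there keeps the hardware to itself.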

bd808 added a comment. Apr 23 2019, 4:04 PM

I see 2 ways to approach this:

  1. Try to map these requirements into our "standard" cloudvirt hosts and expand our FY2019/2020 planned growth to include the needed expansion
  2. Spec and rack specific hardware sized for these instances using the "fake ironic" model of making them virt servers with a single instance each

A single cloudvirt identical to cloudvirt10[25-30] would provide all of the CPU and RAM needed for all 3 instances. However, it would not provide the amount of disk being asked for (9T usable). We are actively discussing moving instance storage to a shared storage cluster (currently looking into Ceph), but do not yet know how this will work out, how long it will take, etc. It is also unknown what kind of IOPS this will provide.
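The sizing argument above can be checked with a few lines of arithmetic. This is a rough sketch using only the figures quoted in this task; it assumes no CPU or RAM overcommit and ignores hypervisor overhead, and the `Resources` and `shortfalls` names are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cores: int
    ram_gb: int
    disk_tb: float


def shortfalls(request: Resources, count: int, host: Resources) -> dict:
    """Per resource, how far `count` copies of `request` exceed what `host` has."""
    needed = {
        "cores": request.cores * count,
        "ram_gb": request.ram_gb * count,
        "disk_tb": request.disk_tb * count,
    }
    available = {"cores": host.cores, "ram_gb": host.ram_gb, "disk_tb": host.disk_tb}
    # Keep only the resources where demand is larger than supply.
    return {k: needed[k] - available[k] for k in needed if needed[k] > available[k]}


requested_vm = Resources(cores=8, ram_gb=128, disk_tb=3.0)  # specs from the description
cloudvirt = Resources(cores=36, ram_gb=512, disk_tb=5.7)    # cloudvirt10[25-30]

missing = shortfalls(requested_vm, 3, cloudvirt)
print(missing)
```

Three instances need 24 cores, 384GB RAM, and 9T of disk, so only the disk entry shows up as a shortfall.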

Gehel added a comment. Apr 23 2019, 4:11 PM

Yes, this is the continuation of T206636.

In terms of constraints:

  • We know that IO is critical to the performance of WDQS in its current implementation, and we suspect the same thing about any replacement solution. So I'm not sure that sharing the same IO on the 3 VMs is a great solution.
  • We need to test the distributed nature of whatever replacement solution. By hosting the 3 VMs on the same physical host, we will probably hide whatever issue might come from increased latency between different physical servers.

So my preference would be to have dedicated servers, with a single VM on each, basically using our cloud infrastructure only as an isolation layer. I'm not sure how much sense this makes and we're very much open to suggestions!

Yes, it makes sense. We are budgeting for the 3 hosts in Q1 of FY19/20.

Gehel moved this task from needs triage to Ops / SRE on the Discovery-Search board. May 1 2019, 4:35 PM

I verified today that we have budget for this build in FY19/20 Q1 (July-September 2019). Before I start a procurement request, @Gehel can you confirm that this test cluster is still desired by the Search team?

Perhaps @Smalyshev could confirm this? As I understand it T206561 is stalled on this issue (and has been for almost a year).

Evaluating both Virtuoso and other solutions (like JanusGraph) would require that. @Gehel should know the details.

bd808 assigned this task to Andrew. Sep 11 2019, 3:06 PM
bd808 edited projects, added cloud-services-team (Kanban); removed cloud-services-team.

Assigning to @Andrew so he can start the procurement process with DCOps.

Change 575312 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Add cloudvirt-wdqs hosts

https://gerrit.wikimedia.org/r/575312

Change 575312 merged by Andrew Bogott:
[operations/puppet@production] Add cloudvirt-wdqs hosts

https://gerrit.wikimedia.org/r/575312

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudvirt-wdqs1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002271949_andrew_128500_cloudvirt-wdqs1002_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt-wdqs1002.eqiad.wmnet']

Of which those FAILED:

['cloudvirt-wdqs1002.eqiad.wmnet']

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudvirt-wdqs1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002271950_andrew_128610_cloudvirt-wdqs1002_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudvirt-wdqs1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002271950_andrew_128639_cloudvirt-wdqs1003_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt-wdqs1002.eqiad.wmnet']

Of which those FAILED:

['cloudvirt-wdqs1002.eqiad.wmnet']

Completed auto-reimage of hosts:

['cloudvirt-wdqs1003.eqiad.wmnet']

Of which those FAILED:

['cloudvirt-wdqs1003.eqiad.wmnet']

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudvirt-wdqs1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002272013_andrew_133073_cloudvirt-wdqs1003_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudvirt-wdqs1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002272018_andrew_134870_cloudvirt-wdqs1002_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt-wdqs1003.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['cloudvirt-wdqs1002.eqiad.wmnet']

and were ALL successful.

@Gehel after a long time in dc-limbo these hosts are just about ready to go. I'm in the process of setting up a project and custom VM flavor for you to use.

Do you have a sense of when you'll have time to pay attention to the new systems? If at this point you don't want to think about them for a month or two we might repurpose them as temporary test boxes; on the other hand if you want to start hacking right away I can have them ready for you by tomorrow or Monday.

Andrew changed the task status from Open to Stalled. Feb 28 2020, 2:42 PM

Gehel responded on IRC:

we don't need those cloud wdqs servers for at least another 3 months, maybe more (priority changed).

So, this task is stalled until 2020-06-01. @Gehel, if your timeline changes please let us know -- we can get them back in your hands with a week or so worth of notice.

Andrew removed Andrew as the assignee of this task. Feb 28 2020, 3:50 PM

Breadcrumbs for whoever picks this up in June: I created a 'wdqs-scaling' project and a custom flavor 'wdqs-scaling', which is associated with the 'wdqs' host aggregate and is just big enough to fill up one of these servers with a single giant VM.
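To picture how a flavor tied to a host aggregate restricts placement: OpenStack can match requirements carried by a flavor against metadata set on an aggregate, so that instances of that flavor only land on the aggregate's hosts. The sketch below is a toy model of that matching in plain Python, not OpenStack code. The `pool` metadata key, the `general` aggregate, and the aggregate membership shown are assumptions for illustration; the comment above does not say how the association was actually configured.

```python
from dataclasses import dataclass, field


@dataclass
class Aggregate:
    name: str
    hosts: set
    metadata: dict = field(default_factory=dict)


@dataclass
class Flavor:
    name: str
    required_metadata: dict = field(default_factory=dict)


def eligible_hosts(flavor: Flavor, aggregates: list) -> set:
    """Hosts in every aggregate whose metadata satisfies all of the flavor's requirements."""
    hosts = set()
    for agg in aggregates:
        if all(agg.metadata.get(k) == v for k, v in flavor.required_metadata.items()):
            hosts |= agg.hosts
    return hosts


# Assumed membership: the three cloudvirt-wdqs hosts named elsewhere in this task.
wdqs = Aggregate(
    "wdqs",
    {"cloudvirt-wdqs1001", "cloudvirt-wdqs1002", "cloudvirt-wdqs1003"},
    {"pool": "wdqs"},  # hypothetical metadata key and value
)
general = Aggregate("general", {"cloudvirt-example"}, {"pool": "general"})  # hypothetical

wdqs_flavor = Flavor("wdqs-scaling", {"pool": "wdqs"})
placement = eligible_hosts(wdqs_flavor, [wdqs, general])
print(sorted(placement))
```

Note that in this toy model a flavor with no requirements matches every aggregate, so keeping other workloads off the dedicated hosts needs a separate restriction.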

Change 576903 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] openstack: switch cloudvirt-wdqs servers to Ceph

https://gerrit.wikimedia.org/r/576903

These hosts will temporarily be used for testing CloudVPS's new Ceph storage cluster. To switch back to local storage, we'll need to migrate or delete all running virtual machines and revert this patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/576903

Change 576903 merged by Jhedden:
[operations/puppet@production] openstack: switch cloudvirt-wdqs servers to Ceph

https://gerrit.wikimedia.org/r/576903

Mentioned in SAL (#wikimedia-cloud) [2020-03-04T20:03:31Z] <jeh> add cloudvirt-wdqs100[123] to ceph host aggregate T221631

EBernhardson added a subscriber: EBernhardson. May 5 2020, 3:47 PM

Priorities have changed again: the Commons query service is back at the top of the stack. Where are we in terms of a timeline for making these available? Have we finished up the Ceph testing? We aren't certain we will need these servers; I'm just trying to get an idea of where this is and whether it's an open option.

@EBernhardson sorry for the delay in responding! We can return these hosts to you at any time -- I'll start working on that shortly unless you're back to not needing them :)

Andrew claimed this task. May 14 2020, 3:07 PM
Andrew closed this task as Resolved. May 17 2020, 4:13 AM

@EBernhardson I've switched the wdqs hosts back to local storage so you should be able to recreate the VMs that you need any time. Let me know if you run into any trouble!