Dedicated servers on WMCS to test WDQS scalability strategy
Closed, Resolved · Public

Assigned To

Authored By

	Gehel
	Apr 23 2019, 2:29 PM

Description

As discussed here, we need to explore alternatives to Blazegraph to scale WDQS. This requires free experimentation and testing of the scalability itself, which in turn requires multiple nodes. Discussion with @bd808 indicates that we should be able to have dedicated servers on WMCS, which would provide adequate resources and isolation.

Ideally 3 servers with the following specs would be needed (limited validation of a solution would be possible with 2 servers).

Specs:

8 cores
128G RAM
3T usable SSD storage (probably 6T RAID1)
1G NIC

Details

	Subject	Repo	Branch	Lines +/-
	openstack: switch cloudvirt-wdqs servers to Ceph	operations/puppet	production	+2 -1
	Add cloudvirt-wdqs hosts	operations/puppet	production	+5 -0

Related Objects

Status	Assigned	Task
Resolved	Gehel	T221630 [Epic] Search platform - Hardware requests for 2019-2020
Resolved	Gehel	T206636 Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service
Resolved	Andrew	T221631 Dedicated servers on WMCS to test WDQS scalability strategy
Resolved	wiki_willy	T232654 eqiad: three clouvirt-wdqs servers for WDQS testing
		Unknown Object (Task)
Resolved	• Cmjohnson	T235685 (Need by: 2020-03-02) rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet
Resolved	Andrew	T252784 Remove Ceph from cloudvirt-wdqs100x, Add ceph to cloudvirt1004 and cloudvirt1006

Event Timeline

Gehel created this task. Apr 23 2019, 2:29 PM

Restricted Application added a subscriber: Aklapper. Apr 23 2019, 2:29 PM

Gehel triaged this task as Medium priority. Apr 23 2019, 2:30 PM

Gehel added a project: Wikidata-Query-Service.

Restricted Application added a project: Wikidata. Apr 23 2019, 2:30 PM

Gehel mentioned this in T221632: Storage capacity upgrade for WDQS. Apr 23 2019, 2:37 PM

Can you tell me more about what these hosts would actually do? Are they virt hosts hosting one big VM each?

In T221631#5131021, @Andrew wrote:

Can you tell me more about what these hosts would actually do? Are they virt hosts hosting one big VM each?

I think we need to think about this ask from the point of view of virtual instances with the ask being 3 VMs with these specs. The related data point to think about is the testing that @Smalyshev did with help from @Andrew this past year where Stas tried out various testing instances for WDQS staging/testing. That was tracked in T206636: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service, which I think this task is really a continuation of.

Krenair subscribed. Apr 23 2019, 3:47 PM

In T206636#4941969, @bd808 wrote:

The next step here is to determine the hardware requirements and then find out if there is current fiscal year budget to cover procuring the required hardware, or if we can allocate a dedicated server from the existing cloudvirt pool and backfill that capacity.

The wdqs10{09,10} prod hosts each have 8 cores, 128GB RAM, 4x800GB SSD (1.6TB RAID10). Our most recent cloudvirt servers (cloudvirt10[25-30]) are 36 physical cores, 512GB RAM, 6x1.92TB SSD (5.7TB RAID10). I think this means that a single cloudvirt would be more than enough hardware if it was not shared with other arbitrary workloads.
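As a quick sanity check of the storage figures quoted above, the sketch below computes usable capacity for mirrored disk layouts. It assumes that RAID1 and RAID10 both keep two copies of every block and it ignores filesystem overhead; the helper name `usable_tb` is illustrative and does not come from the task.

```python
def usable_tb(disks: int, disk_tb: float, copies: int = 2) -> float:
    """Usable capacity of a mirrored array (RAID1/RAID10), ignoring filesystem overhead."""
    if disks % copies != 0:
        raise ValueError("mirrored layouts need a multiple of `copies` disks")
    return disks * disk_tb / copies


# wdqs10{09,10}: 4 x 800 GB SSD in RAID10
wdqs_prod = usable_tb(4, 0.8)
# cloudvirt10[25-30]: 6 x 1.92 TB SSD in RAID10
cloudvirt = usable_tb(6, 1.92)

print(f"wdqs prod: {wdqs_prod:.2f} TB usable")
print(f"cloudvirt: {cloudvirt:.2f} TB usable")
```

Under these assumptions the results are 1.60 TB and 5.76 TB, in line with the 1.6TB and 5.7TB figures quoted in the comment.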

In T206636#4946372, @Smalyshev wrote:

Yes, judging from our preliminary test, if we get uncontested use of the server, or maybe even of a certain chunk of it (not sure if that is possible?), it would be enough. Note that an interesting scenario that we want to test in the foreseeable future involves a cluster setup, so we'd want at least 2 hosts (not sure whether they have to be on 2 separate hardware machines) with requirements close to what the wdqs hosts have. It could be that splitting a cloudvirt host into two VMs used exclusively by these test hosts would be OK. Not sure if virtualization that we do now allows such kind of fixed resource allocations (probably also I/O resources need to be taken care of?)

In T206636#4948039, @bd808 wrote:

In T206636#4946372, @Smalyshev wrote:

Not sure if virtualization that we do now allows such kind of fixed resource allocations (probably also I/O resources need to be taken care of?)

We have a few special instances that we do this for today. It is something that we are experimenting with for virtualization of shared Cloud Services systems such as the ToolsDB databases. The process requires cloud-root assistance to create the initial instances using CLI magic, but it is possible. Basically, we mark the cloudvirt as unavailable for selection by the normal OpenStack scheduler and then force-create the desired instances there manually.
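The approach described above can be pictured with a toy model. This is only a sketch of the idea, not OpenStack code: the `Host` class, the `schedule` and `force_create` helpers, and all host and instance names are hypothetical, and the "least-loaded host" policy is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    schedulable: bool = True  # False = hidden from the normal scheduler
    instances: list = field(default_factory=list)


def schedule(hosts, instance):
    """Normal path: pick the least-loaded host that is open for scheduling."""
    candidates = [h for h in hosts if h.schedulable]
    if not candidates:
        raise RuntimeError("no schedulable host")
    target = min(candidates, key=lambda h: len(h.instances))
    target.instances.append(instance)
    return target


def force_create(hosts, instance, host_name):
    """Admin path: place the instance on a named host, bypassing the scheduler."""
    target = next(h for h in hosts if h.name == host_name)
    target.instances.append(instance)
    return target


hosts = [
    Host("cloudvirt-a"),
    Host("cloudvirt-b"),
    Host("cloudvirt-dedicated", schedulable=False),
]

# The dedicated instance is created by hand on the reserved host.
force_create(hosts, "wdqs-test-1", "cloudvirt-dedicated")

# Ordinary tenant instances never land on the reserved host.
placed = [schedule(hosts, f"tenant-vm-{i}").name for i in range(4)]
print(placed)
```

The point of the design is that the reserved host stays part of the cloud, so it keeps the isolation and tooling, while arbitrary workloads can never be placed next to the dedicated instance.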

bd808 added a parent task: T206636: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service. Apr 23 2019, 3:51 PM

I see 2 ways to approach this:

Try to map these requirements into our "standard" cloudvirt hosts and expand our FY2019/2020 planned growth to include the needed expansion
Spec and rack specific hardware sized for these instances using the "fake ironic" model of making them virt servers with a single instance each

A single cloudvirt identical to cloudvirt10[25-30] would provide all of the CPU and RAM needed for all 3 instances. It would not, however, provide the amount of disk being asked for (9T usable). We are actively discussing moving instance storage to a shared storage cluster (currently looking into Ceph), but do not yet know how this will work out, how long it will take, etc. It is also unknown what kind of IOPS this would provide.
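The arithmetic behind that statement can be sketched as follows, using the requested per-VM specs from the task description and the cloudvirt10[25-30] figures quoted earlier. The sketch assumes no CPU or RAM overcommit and no hypervisor overhead; the `Resources` class and `shortfalls` helper are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cores: int
    ram_gb: int
    disk_tb: float


requested_vm = Resources(cores=8, ram_gb=128, disk_tb=3.0)
cloudvirt = Resources(cores=36, ram_gb=512, disk_tb=5.7)
vm_count = 3


def shortfalls(vm, host, count):
    """Return {resource: (needed, available)} for every resource the host cannot cover."""
    result = {}
    for name in ("cores", "ram_gb", "disk_tb"):
        needed = getattr(vm, name) * count
        available = getattr(host, name)
        if needed > available:
            result[name] = (needed, available)
    return result


print(shortfalls(requested_vm, cloudvirt, vm_count))
# {'disk_tb': (9.0, 5.7)}
```

Three VMs need 24 cores and 384G of RAM, which fit within 36 cores and 512GB, while the 9T of disk exceeds the 5.7TB available.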

Yes, this is the continuation of T206636.

In T221631#5131322, @bd808 wrote:

Try to map these requirements into our "standard" cloudvirt hosts and expand our FY2019/2020 planned growth to include the needed expansion

Spec and rack specific hardware sized for these instances using the "fake ironic" model of making them virt servers with a single instance each

In terms of constraints:

We know that IO is critical to the performance of WDQS in its current implementation, and we suspect the same thing about any replacement solution. So I'm not sure that sharing the same IO on the 3 VMs is a great solution.
We need to test the distributed nature of whatever replacement solution. By hosting the 3 VMs on the same physical host, we will probably hide whatever issue might come from increased latency between different physical servers.

So my preference would be to have dedicated servers, with a single VM on each, basically using our cloud infrastructure only as an isolation layer. I'm not sure how much sense this makes and we're very much open to suggestions!

In T221631#5131369, @Gehel wrote:

So my preference would be to have dedicated servers, with a single VM on each, basically using our cloud infrastructure only as an isolation layer. I'm not sure how much sense this makes and we're very much open to suggestions!

Yes, it makes sense. We are budgeting for the 3 hosts in Q1 of FY19/20.

Gehel moved this task from needs triage to Ops / SRE on the Discovery-Search board. May 1 2019, 4:35 PM

Smalyshev moved this task from Incoming to Operations/SRE on the Wikidata-Query-Service board. May 2 2019, 9:34 PM

• RazShuty subscribed. May 9 2019, 8:05 AM

I verified today that we have budget for this build in FY19/20 Q1 (July-September 2019). Before I start a procurement request, @Gehel can you confirm that this test cluster is still desired by the Search team?

Perhaps @Smalyshev could confirm this? As I understand it T206561 is stalled on this issue (and has been for almost a year).

Evaluating both Virtuoso and other solutions (like JanusGraph) would require that. @Gehel should know the details.

Assigning to @Andrew so he can start the procurement process with DCOps.

bd808 moved this task from Inbox to Doing on the cloud-services-team (Kanban) board. Sep 11 2019, 3:13 PM

Andrew added a subtask: T232654: eqiad: three clouvirt-wdqs servers for WDQS testing. Sep 11 2019, 6:57 PM

Andrew moved this task from Doing to Blocked on the cloud-services-team (Kanban) board. Sep 25 2019, 3:24 PM

Iamamz3 subscribed. Dec 23 2019, 8:08 PM

RobH closed subtask T232654: eqiad: three clouvirt-wdqs servers for WDQS testing as Resolved. Jan 22 2020, 7:16 PM

Gehel mentioned this in T206636: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service. Feb 26 2020, 2:00 PM

Change 575312 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Add cloudvirt-wdqs hosts

https://gerrit.wikimedia.org/r/575312

gerritbot added a project: Patch-For-Review. Feb 27 2020, 5:44 PM

Change 575312 merged by Andrew Bogott:
[operations/puppet@production] Add cloudvirt-wdqs hosts

https://gerrit.wikimedia.org/r/575312

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudvirt-wdqs1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002271949_andrew_128500_cloudvirt-wdqs1002_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt-wdqs1002.eqiad.wmnet']

Of which those FAILED:

['cloudvirt-wdqs1002.eqiad.wmnet']

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudvirt-wdqs1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002271950_andrew_128610_cloudvirt-wdqs1002_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudvirt-wdqs1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002271950_andrew_128639_cloudvirt-wdqs1003_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt-wdqs1002.eqiad.wmnet']

Of which those FAILED:

['cloudvirt-wdqs1002.eqiad.wmnet']

Completed auto-reimage of hosts:

['cloudvirt-wdqs1003.eqiad.wmnet']

Of which those FAILED:

['cloudvirt-wdqs1003.eqiad.wmnet']

Maintenance_bot removed a project: Patch-For-Review. Feb 27 2020, 8:11 PM

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudvirt-wdqs1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002272013_andrew_133073_cloudvirt-wdqs1003_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

cloudvirt-wdqs1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002272018_andrew_134870_cloudvirt-wdqs1002_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudvirt-wdqs1003.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['cloudvirt-wdqs1002.eqiad.wmnet']

and were ALL successful.

@Gehel after a long time in dc-limbo these hosts are just about ready to go. I'm in the process of setting up a project and custom VM flavor for you to use.

Do you have a sense of when you'll have time to pay attention to the new systems? If at this point you don't want to think about them for a month or two we might repurpose them as temporary test boxes; on the other hand if you want to start hacking right away I can have them ready for you by tomorrow or Monday.

Gehel responded on IRC:

we don't need those cloud wdqs servers for at least another 3 months, maybe more (priority changed).

So, this task is stalled until 2020-06-01. @Gehel, if your timeline changes please let us know -- we can get them back in your hands with a week or so worth of notice.

Andrew removed Andrew as the assignee of this task. Feb 28 2020, 3:50 PM

Breadcrumbs for whoever picks this up in June: I created a 'wdqs-scaling' project and a custom flavor 'wdqs-scaling', which is associated with the 'wdqs' host aggregate and is just big enough to fill up one of these servers with a single giant VM.
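The comment does not say how the flavor is associated with the aggregate. One common OpenStack pattern is to match flavor extra specs against host aggregate metadata, and the toy model below assumes that pattern. The metadata key `dedicated`, the `standard` aggregate, the host `cloudvirt-x`, and the `eligible_hosts` helper are all hypothetical; only the flavor name, the aggregate name, and the three cloudvirt-wdqs host names come from this task.

```python
# Toy model: aggregate metadata and flavor extra specs are plain dicts.
aggregates = {
    "wdqs": {
        "hosts": {"cloudvirt-wdqs1001", "cloudvirt-wdqs1002", "cloudvirt-wdqs1003"},
        "metadata": {"dedicated": "wdqs"},  # hypothetical metadata key/value
    },
    "standard": {
        "hosts": {"cloudvirt-x"},  # hypothetical general-purpose host
        "metadata": {},
    },
}

flavor = {"name": "wdqs-scaling", "extra_specs": {"dedicated": "wdqs"}}


def eligible_hosts(flavor, aggregates):
    """Hosts in aggregates whose metadata satisfies every extra spec of the flavor."""
    hosts = set()
    for agg in aggregates.values():
        specs = flavor["extra_specs"]
        if all(agg["metadata"].get(key) == value for key, value in specs.items()):
            hosts |= agg["hosts"]
    return sorted(hosts)


print(eligible_hosts(flavor, aggregates))
```

In this model, instances created with the 'wdqs-scaling' flavor can only land on hosts in the 'wdqs' aggregate. Because the flavor is sized to fill a whole server, each host ends up with a single VM.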

Change 576903 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] openstack: switch cloudvirt-wdqs servers to Ceph

https://gerrit.wikimedia.org/r/576903

gerritbot added a project: Patch-For-Review. Mar 4 2020, 5:14 PM

These hosts will temporarily be used for testing CloudVPS's new Ceph storage cluster. To switch back to local storage, we'll need to migrate or delete all running virtual machines and revert this patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/576903

Change 576903 merged by Jhedden:
[operations/puppet@production] openstack: switch cloudvirt-wdqs servers to Ceph

https://gerrit.wikimedia.org/r/576903

Mentioned in SAL (#wikimedia-cloud) [2020-03-04T20:03:31Z] <jeh> add cloudvirt-wdqs100[123] to ceph host aggregate T221631

Maintenance_bot removed a project: Patch-For-Review. Mar 4 2020, 8:10 PM

Gehel mentioned this in T251489: Validate that we have enough resources on WMCS for a SPARQL Endpoint for Commons. Apr 30 2020, 8:03 AM

Priorities have changed again: the Commons query service is back at the top of the stack. What is the timeline for making these servers available? Has the Ceph testing finished? We aren't certain we will need these servers; I'm just trying to get an idea of where this stands and whether it's an open option.

@EBernhardson sorry for the delay in responding! We can return these hosts to you at any time -- I'll start working on that shortly unless you're back to not needing them :)

Andrew claimed this task. May 14 2020, 3:07 PM

Andrew added a subtask: T252784: Remove Ceph from cloudvirt-wdqs100x, Add ceph to cloudvirt1004 and cloudvirt1006. May 16 2020, 5:54 PM

@EBernhardson I've switched the wdqs hosts back to local storage so you should be able to recreate the VMs that you need any time. Let me know if you run into any trouble!

Thanks!

bking subscribed. Mar 24 2022, 3:25 PM

Andrew mentioned this in T324147: Investigate and document cloudvirt-wdqs servers. Dec 2 2022, 9:10 PM
