
New clusters for SPARQL Endpoint for Commons
Closed, ResolvedPublic

Description

We need a production service for a SPARQL Endpoint for Commons. Estimates on hardware needs are a complete guess, as we have no baseline for the load on this service and only a vague idea of the data growth over time.

A few data points:

  • current estimate is about 5M pages with structured data on Commons
  • largest increase so far was +1M / month
  • in comparison WDQS has ~11B triples
  • current journal size on sdcquery01 (test server) is 2.5G, but that data dump is more than 6 months old (see the rough projection after this list)
  • we want at least 3 servers per DC, to provide enough redundancy in case of hardware failure
  • current test server runs with 8G of heap
  • having a SPARQL Endpoint for Commons is likely to enable our community to write more tools / bots and thus increase the growth rate
  • our current test instance on WMCS has 4 VCPU, 8G RAM and 80G disk (but no query load)
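
For a sense of scale, here is a rough back-of-envelope projection based on the data points above. The page count, monthly growth and WDQS comparison come from this task; the triples-per-page figure and the 48-month horizon are assumptions for illustration, not measurements:

```
# Back-of-envelope projection from the data points above.
# pages_now and growth_per_month are from this task; triples_per_page
# and the 48-month horizon are assumptions, not measured figures.

pages_now = 5_000_000          # current pages with structured data on Commons
growth_per_month = 1_000_000   # largest monthly increase observed so far
months = 48                    # assumed ~4-year hardware lifetime
triples_per_page = 20          # assumed average

pages_projected = pages_now + growth_per_month * months
triples_projected = pages_projected * triples_per_page

print(f"~{pages_projected / 1e6:.0f}M pages, "
      f"~{triples_projected / 1e9:.1f}B triples "
      f"(vs ~11B triples on WDQS)")
# -> ~53M pages, ~1.1B triples (vs ~11B triples on WDQS)
```

This is a worst-case linear extrapolation of the largest monthly increase seen so far; actual growth may well plateau earlier.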

All estimates below are for both datacenters (eqiad + codfw).

Estimated specs (oversized to try to account for growth over the lifetime of those servers):

  • 6x single Xeon (4C/8T), 64G RAM, 500G usable SSD space (software RAID 1/10), 1G NIC

Minimal specs right now (estimate):

  • 6x single Xeon (2C/4T), 32G RAM, 100G usable HDD space, 1G NIC

Alternate option (discussed with @akosiaris): deploy on ganeti with specs:

  • 4x 4CPU, 16G RAM, 100G HDD
  • Note: Ganeti might not be a long-term option, depending on data growth, but it is unlikely to be an issue during the first year, so we can delay real hardware provisioning until we know more about the growth profile. The instances would still need to be provisioned in Ganeti.
  • Note: Ganeti has a 16G RAM limit; this seems a bit short to fit both the JVM heap and the disk cache, so some validation is required (a rough sketch of that budget follows below)
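
A minimal sketch of that memory budget, assuming an 8G heap (as on the current test server) and ~2G of OS/service overhead (the overhead figure is an assumption):

```
# Rough memory-budget check for a 16G Ganeti instance.
# blazegraph_heap_gb matches the current test server; overhead_gb is assumed.

total_ram_gb = 16
blazegraph_heap_gb = 8   # heap size used on the current test server
overhead_gb = 2          # assumed OS + updater + misc services

page_cache_gb = total_ram_gb - blazegraph_heap_gb - overhead_gb
journal_gb = 2.5         # current journal on sdcquery01 (stale dump, will grow)

print(f"~{page_cache_gb}G left for page cache, "
      f"against a {journal_gb}G journal today")
# -> ~6G of page cache: today's journal fits entirely, but the margin
#    disappears once the journal grows past ~6G.
```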

Note: after having run WCQS on WMCS for a while, it seems that we underestimated the hardware requirements. It makes more sense to copy the specs from WDQS. See T254232#6819856.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel added a subscriber: akosiaris.
Gehel added a subscriber: dcausse.

After discussion with @dcausse, we agree that the Ganeti option seems the best one for now. The 16G RAM limit should not be an issue with the data size we have at this point.

Talked with @dcausse about this on IRC today. Based on the data we are now seeing [1], the estimates in here are too low. We are thinking that a 16G Ganeti instance will not be sufficient for the growth we are seeing. I'm fairly suspicious of our "minimal" sizing of 32G RAM, considering we have a 4-year refresh cycle; I wouldn't be surprised if we are talking 1B+ triples (half of 60M media files tagged with ~20 triples each) by 2024. My intuition is that 64G is the minimum we should be considering, but the price of memory should be taken into account. If we could have 256G and not worry about it for the cost of a few $k, we would get that back and more in saved time and focus, compared to the hacks we would otherwise have to put in place several years from now to fit into something smaller than necessary.

The beta service is currently being stood up in WMCS. We have access to bare-metal machines installed in WMCS specifically for wdqs (reported as 132G RAM, 3.2T disk, 32 cores, though 132G of RAM seems suspicious). I'm not sure how long the query service is intended to stay in WMCS, but this instance should have enough runway to get us to next fiscal year, when we can request machines.

[1] https://analytics.wikimedia.org/published/notebooks/computer-aided-tagging/CAT-usage-report.html

Gehel triaged this task as High priority.Sep 8 2020, 7:14 PM
Gehel added a subtask: Unknown Object (Task).Feb 10 2021, 4:40 PM

A few notes from the current wcqs-beta:

  • CPU is mostly idle, but we have close to no user traffic and data import is only run weekly
  • of the 118G of RAM, 28G are used and the rest (90G) is cache. That leads me to think we should be OK with 64G; 128G is probably better to allow for some growth
  • The journal is currently 376G (after a recent data reload, so probably not much space is wasted). We should shoot for 50% disk utilization max, so that we have space for duplicating the journal; that means 800G to 1T (see the quick check after this list)
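
A quick check of those numbers; this is only a sketch that reuses the figures quoted in the notes above (376G journal, 28G resident, 64G/128G candidate RAM sizes):

```
journal_gb = 376                 # current wcqs-beta journal
max_utilization = 0.5            # leave room to duplicate the journal
min_usable_disk_gb = journal_gb / max_utilization
print(f"minimum usable disk: {min_usable_disk_gb:.0f}G")
# -> 752G, i.e. 800G to 1T after rounding up for growth

ram_used_gb = 28                 # resident usage observed on wcqs-beta
for total_ram_gb in (64, 128):
    cache_gb = total_ram_gb - ram_used_gb
    print(f"{total_ram_gb}G RAM -> ~{cache_gb}G page cache "
          f"(~{cache_gb / journal_gb:.0%} of the current journal)")
# -> 64G RAM -> ~36G page cache (~10% of the current journal)
# -> 128G RAM -> ~100G page cache (~27% of the current journal)
```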

With all of the above, using the same specs as WDQS (16C/32T @ 2.5GHz, 128G RAM, 6T SSD / 3T usable in RAID 1/10) seems reasonable, and would allow switching servers between the clusters. This might be a bit wasteful on SSDs (and those are expensive).

I don't have a good feel for what growth is going to look like. I could argue both ways: either with increased adoption we'll see growth in data size, edit rate and query rate, or we could plateau once the main body of media has structured data added to it. My feeling is that using the same specs as WDQS should allow for enough growth (it is unlikely that WCQS grows much larger than WDQS) and leave the door open to scaling horizontally.

Gehel updated the task description.
Gehel added a subtask: Unknown Object (Task).Feb 12 2021, 1:59 PM
Papaul closed subtask Unknown Object (Task) as Resolved.Mar 22 2021, 2:30 PM
Jclark-ctr closed subtask Unknown Object (Task) as Resolved.Apr 16 2021, 6:12 PM
Gehel claimed this task.

Service implementation is tracked as part of T280001