
New clusters for SPARQL Endpoint for Commons
Closed, ResolvedPublic

Description

We need a production service for a SPARQL Endpoint for Commons. Estimates on hardware needs are a complete guess, as we have no baseline for the load on this service and only a vague idea of the data growth over time.

A few data points:

  • current estimate is about 5M pages with structured data on Commons
  • largest increase so far was +1M / month
  • in comparison WDQS has ~11B triples
  • current journal size on sdcquery01 (test server) is 2.5G, but that data dump is more than 6 months old (see the rough projection after this list)
  • we want at least 3 servers per DC, to provide enough redundancy in case of hardware failure
  • current test server runs with 8G of heap
  • having a SPARQL Endpoint for Commons is likely to enable our community to write more tools / bots and thus increase the growth rate
  • our current test instance on WMCS has 4 VCPU, 8G RAM and 80G disk (but no query load)
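
For a sense of scale, here is a rough back-of-envelope projection based on the data points above. The page count, monthly growth and WDQS comparison come from this task; the triples-per-page figure and the 48-month horizon are assumptions for illustration, not measurements:

```
# Back-of-envelope projection from the data points above.
# pages_now and growth_per_month are from this task; triples_per_page
# and the 48-month horizon are assumptions, not measured figures.

pages_now = 5_000_000          # current pages with structured data on Commons
growth_per_month = 1_000_000   # largest monthly increase observed so far
months = 48                    # assumed ~4-year hardware lifetime
triples_per_page = 20          # assumed average

pages_projected = pages_now + growth_per_month * months
triples_projected = pages_projected * triples_per_page

print(f"~{pages_projected / 1e6:.0f}M pages, "
      f"~{triples_projected / 1e9:.1f}B triples "
      f"(vs ~11B triples on WDQS)")
# -> ~53M pages, ~1.1B triples (vs ~11B triples on WDQS)
```

This is a worst-case linear extrapolation of the largest monthly increase seen so far; actual growth may well plateau earlier.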

All estimates below are for both datacenters (eqiad + codfw).

Estimated specs (oversized to try to account for growth over the lifetime of those servers):

  • 6x single Xeon (4C/8T), 64G RAM, 500G usable SSD space (software RAID 1/10), 1G NIC

Minimal specs right now (estimate):

  • 6x single Xeon (2C/4T), 32G RAM, 100G usable HDD space, 1G NIC

Alternate option (discussed with @akosiaris): deploy on ganeti with specs:

  • 4x 4CPU, 16G RAM, 100G HDD
  • Note: Ganeti might not be a long-term option, depending on data growth, but it is unlikely to be an issue during the first year, so we can delay real hardware provisioning until we know more about the growth profile. The instances would still need to be provisioned in Ganeti.
  • Note: Ganeti has a 16G RAM limit; this seems a bit short to fit both the JVM heap and the disk cache, so some validation is required (a rough sketch of that budget follows below)
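
A minimal sketch of that memory budget, assuming an 8G heap (as on the current test server) and ~2G of OS/service overhead (the overhead figure is an assumption):

```
# Rough memory-budget check for a 16G Ganeti instance.
# blazegraph_heap_gb matches the current test server; overhead_gb is assumed.

total_ram_gb = 16
blazegraph_heap_gb = 8   # heap size used on the current test server
overhead_gb = 2          # assumed OS + updater + misc services

page_cache_gb = total_ram_gb - blazegraph_heap_gb - overhead_gb
journal_gb = 2.5         # current journal on sdcquery01 (stale dump, will grow)

print(f"~{page_cache_gb}G left for page cache, "
      f"against a {journal_gb}G journal today")
# -> ~6G of page cache: today's journal fits entirely, but the margin
#    disappears once the journal grows past ~6G.
```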

Note: after having run WCQS on WMCS for a while, it seems that we underestimated the hardware requirements. It makes more sense to copy the specs from WDQS. See T254232#6819856.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel added a subscriber: akosiaris.
Gehel added a subscriber: dcausse.

After discussion with @dcausse, we agree that the Ganeti option seems the best one for now. The 16G RAM limit should not be an issue with the data size we have at this point.

Talked with @dcausse about this on IRC today. Based on the data we are now seeing [1], the estimates in here are too low. We are thinking that a 16G Ganeti instance will not be sufficient for the growth we are seeing. I'm fairly suspicious of our "minimal" sizing of 32G RAM, considering we have a 4-year refresh cycle; I wouldn't be surprised if we are talking 1B+ triples (half of 60M media files tagged with ~20 triples each) by 2024. My intuition is that 64G is the minimum we should be considering, but the price of memory should be taken into account. If we could have 256G and not worry about it for the cost of a few $k, we would get that back and more in saved time and focus, compared to the hacks we would otherwise have to put in place several years from now to fit into something smaller than necessary.

The beta service is currently being stood up in WMCS. We have access to bare-metal machines installed in WMCS specifically for wdqs (reported as 132G RAM, 3.2T disk, 32 cores, though 132G of RAM seems suspicious). I'm not sure how long the query service is intended to stay in WMCS, but this instance should have enough runway to get us to next fiscal year, when we can request machines.

[1] https://analytics.wikimedia.org/published/notebooks/computer-aided-tagging/CAT-usage-report.html

Gehel triaged this task as High priority.Sep 8 2020, 7:14 PM
Gehel added a subtask: Unknown Object (Task).Feb 10 2021, 4:40 PM

A few notes from the current wcqs-beta:

  • CPU is mostly idle, but we have close to no user traffic and data import is only run weekly
  • of the 118G of RAM, 28G are used and the rest (90G) is cache. That leads me to think we should be OK with 64G; 128G is probably better to allow for some growth
  • The journal is currently 376G (after a recent data reload, so probably not much space is wasted). We should shoot for 50% disk utilization max, so that we have space for duplicating the journal; that means 800G to 1T (see the quick check after this list)
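
A quick check of those numbers; this is only a sketch that reuses the figures quoted in the notes above (376G journal, 28G resident, 64G/128G candidate RAM sizes):

```
journal_gb = 376                 # current wcqs-beta journal
max_utilization = 0.5            # leave room to duplicate the journal
min_usable_disk_gb = journal_gb / max_utilization
print(f"minimum usable disk: {min_usable_disk_gb:.0f}G")
# -> 752G, i.e. 800G to 1T after rounding up for growth

ram_used_gb = 28                 # resident usage observed on wcqs-beta
for total_ram_gb in (64, 128):
    cache_gb = total_ram_gb - ram_used_gb
    print(f"{total_ram_gb}G RAM -> ~{cache_gb}G page cache "
          f"(~{cache_gb / journal_gb:.0%} of the current journal)")
# -> 64G RAM -> ~36G page cache (~10% of the current journal)
# -> 128G RAM -> ~100G page cache (~27% of the current journal)
```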

With all of the above, using the same specs as WDQS (16C/32T @ 2.5GHz, 128G RAM, 6T SSD / 3T usable in RAID 1/10) seems reasonable, and would allow switching servers between the clusters. This might be a bit wasteful on SSDs (and those are expensive).

I don't have a good feel for what growth is going to look like. I could argue both ways: either with increased adoption we'll see growth in data size, edit rate and query rate, or we could plateau once the main body of media has structured data added to it. My feeling is that using the same specs as WDQS should allow for enough growth (it is unlikely that WCQS grows much larger than WDQS) and leave the door open to scaling horizontally.

Gehel updated the task description.
Gehel added a subtask: Unknown Object (Task).Feb 12 2021, 1:59 PM
Papaul closed subtask Unknown Object (Task) as Resolved.Mar 22 2021, 2:30 PM
Jclark-ctr closed subtask Unknown Object (Task) as Resolved.Apr 16 2021, 6:12 PM
Gehel claimed this task.

Service implementation is tracked as part of T280001