We need a production service for a SPARQL Endpoint for Commons. The hardware estimates are rough guesses, since we don't have a baseline for the load on this service and only vague ideas about how the data will grow over time.
A few data points:
- current estimate is about 5M pages with structured data on Commons
- largest increase so far was +1M / month
- in comparison WDQS has ~11B triples
- current journal size on sdcquery01 (test server) is 2.5G, but that data dump is > 6 months old
- we want at least 3 servers per DC, to provide enough redundancy in case of hardware failure
- current test server runs with 8G of heap
- having a SPARQL Endpoint for Commons is likely going to enable our community to write more tools / bots and thus increase the growth rate
- our current test instance on WMCS has 4 VCPU, 8G RAM and 80G disk (but no query load)
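To make the growth assumptions explicit, here is a back-of-envelope projection of journal size from the data points above. The number of pages covered by the old dump is an assumption for illustration (the dump predates the current 5M estimate), as is the linear-growth model; none of this is a measurement.

```python
# Back-of-envelope capacity sketch from the data points above.
pages_now = 5_000_000          # current estimate of pages with structured data
growth_per_month = 1_000_000   # largest increase observed so far
journal_gb = 2.5               # journal size of the >6-month-old dump
journal_pages = 4_000_000      # ASSUMPTION: pages covered by that old dump

gb_per_page = journal_gb / journal_pages

def projected_journal_gb(months, growth=growth_per_month):
    """Linear-growth projection of journal size after `months` months."""
    pages = pages_now + months * growth
    return pages * gb_per_page

# Worst-case observed growth sustained over a 4-year server lifetime:
print(round(projected_journal_gb(48), 1))
```

Even under this pessimistic growth rate the raw journal stays in the tens of gigabytes, which is why the disk estimates below are dominated by headroom and RAID overhead rather than the data itself.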
All estimates below are for both datacenters (eqiad + codfw) combined.
Estimated specs (oversized to try to account for growth over the lifetime of those servers):
- 6x single Xeon (4C/8T), 64G RAM, 500G usable SSD space (software RAID 1 or 10), 1G NIC
Minimal specs right now (estimate):
- 6x single Xeon (2C/4T), 32G RAM, 100G usable HDD space, 1G NIC
Alternate option (discussed with @akosiaris): deploy on ganeti with specs:
- 4x 4CPU, 16G RAM, 100G HDD
- Note: Ganeti might not be a long-term option, depending on data growth, but it is unlikely to be an issue during the first year, so we can delay real hardware provisioning until we know more about the growth profile. This option would still require provisioning the VMs in Ganeti.
- Note: Ganeti has a 16G RAM limit per VM, which seems a bit short to fit both the JVM heap and the OS disk cache; some validation is required
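A rough RAM budget illustrates why 16G looks tight. The OS overhead figure is an assumption for illustration; the heap size comes from the current test server.

```python
# Rough RAM budget for a 16G Ganeti VM. The heap size matches the
# current test server; the OS overhead is an ASSUMPTION, not measured.
vm_ram_gb = 16
heap_gb = 8            # current test server runs with 8G of heap
os_overhead_gb = 2     # ASSUMPTION: OS, monitoring, other processes
cache_gb = vm_ram_gb - heap_gb - os_overhead_gb
print(cache_gb)
```

That leaves on the order of 6G for the page cache, so once the journal grows past that, queries would increasingly hit disk; this is the validation point flagged above.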
Note: after having run WCQS on WMCS for a while, it seems that we underestimated the hardware requirements. It makes more sense to copy the specs from WDQS. See T254232#6819856.