
WDQS is hitting allocator limit on Blazegraph
Closed, Resolved · Public

Description

As described in T213134: wdqs100[78] database corruption, we've approached - and on some servers, exceeded - the allocator limit on Blazegraph. This task will aggregate all tasks we need to perform to solve the issue. Related incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20190110-WDQS

Action plan (tasks to follow shortly):

Immediate:

  • Copy database from wdq[345] to wdq7 and wdq8
  • Restore updates on wdq7 and wdq8
  • Collect allocator stats everywhere and see which servers are also in danger
  • Write an incident report

Short-term:

  • Split category namespace into a separate instance of Blazegraph (T213212)

Longer-term (will require data reload):

  • Disable "raw records" in Blazegraph
  • Consider inlining values & references
  • Consider setting INLINE_TEXT_LITERALS so that short strings are inlined; inlined values don't use allocators
  • Check what other things could be inlined

Event Timeline

Restricted Application added a subscriber: Aklapper.
Smalyshev updated the task description.
Smalyshev updated the task description.
Smalyshev added a subscriber: Gehel.

Here's allocator status for various servers:

wdq3 228802 33342
wdq4 228514 33630
wdq5 228537 33607
wdq6 261748 396
wdq7 262144 0
wdq8 262144 0
wdq21 228035 34109
wdq22 228068 34076
wdq23 226738 35406
wdq24 260729 1415
wdq25 260868 1276
wdq26 260782 1362
wdq9 247252 14892
wdq10 246942 15202

The first number is the allocator total; the second is the distance from the upper bound (262144, which each pair adds up to).
It looks like wdq6 is in danger, and wdq2[456] are close to it too.
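
For anyone who wants to re-check the numbers above, here is a minimal sketch that parses the same table and labels each server by how many allocators it has left. The thresholds (`DANGER_BELOW`, `CLOSE_BELOW`) and the function names are hypothetical, picked only to reproduce the reading in this comment; they are not values used by Blazegraph or by WDQS monitoring.

```python
# Allocator status copied from the comment above:
# server name, allocator total, distance from the upper bound.
STATUS = """
wdq3 228802 33342
wdq4 228514 33630
wdq5 228537 33607
wdq6 261748 396
wdq7 262144 0
wdq8 262144 0
wdq21 228035 34109
wdq22 228068 34076
wdq23 226738 35406
wdq24 260729 1415
wdq25 260868 1276
wdq26 260782 1362
wdq9 247252 14892
wdq10 246942 15202
"""

# Hypothetical thresholds, chosen only to mirror the reading in the comment.
DANGER_BELOW = 1000
CLOSE_BELOW = 2000


def parse_status(text):
    """Return a list of (server, used, remaining) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines around the table
        server, used, remaining = line.split()
        rows.append((server, int(used), int(remaining)))
    return rows


def classify(remaining):
    """Label a server by how many allocators it has left."""
    if remaining == 0:
        return "exhausted"
    if remaining < DANGER_BELOW:
        return "danger"
    if remaining < CLOSE_BELOW:
        return "close"
    return "ok"


if __name__ == "__main__":
    # Print the servers with the least headroom first.
    for server, used, remaining in sorted(parse_status(STATUS), key=lambda r: r[2]):
        print(f"{server:6} used={used:6} left={remaining:5} {classify(remaining)}")
```

With these thresholds wdq7 and wdq8 come out as exhausted, wdq6 as in danger, and wdq2[456] as close, which matches the summary above.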

Mentioned in SAL (#wikimedia-operations) [2019-01-09T12:02:42Z] <gehel> repool wdqs100[78] - data import complete - T213210

Smalyshev updated the task description.
Smalyshev lowered the priority of this task from High to Medium. Mar 19 2019, 6:24 PM

Change 512890 had a related patch set uploaded (by Igor Kim; owner: Igor Kim):
[wikidata/query/rdf@master] Updated branching factors for disabled raw records

https://gerrit.wikimedia.org/r/512890

Change 512890 merged by jenkins-bot:
[wikidata/query/rdf@master] Updated branching factors for disabled raw records

https://gerrit.wikimedia.org/r/512890

Currently, the dashboard shows 246K+ allocators, and we are using up under 30 per day, so at this rate it should last us decades :) I think the problem is solved for now.
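
As a rough check of the "decades" estimate, here is a small sketch of the arithmetic. It assumes the 246K figure is the number of allocators still free and that consumption stays at the 30-per-day upper bound; both are readings of the comment above, not confirmed values, and `runway_years` is a hypothetical helper name.

```python
def runway_years(free_allocators, used_per_day, days_per_year=365):
    """Years until the free allocators run out at a constant daily rate."""
    return free_allocators / used_per_day / days_per_year


if __name__ == "__main__":
    # Assumed inputs: 246K free allocators, 30 consumed per day.
    print(f"{runway_years(246_000, 30):.1f} years")
```

At those assumed values the runway is a little over 22 years, and longer if the real rate stays under 30 per day.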