Description
wdqs1007 started to produce weird write errors (see the whole log in https://github.com/blazegraph/database/issues/114) and it looks like the database is corrupted. It probably needs to be reloaded from another server.
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Smalyshev | T213210 WDQS is hitting allocator limit on Blazegraph
Resolved | | Smalyshev | T213134 wdqs100[78] database corruption
Event Timeline
@Gehel According to the discussion in https://github.com/blazegraph/database/issues/114, it might be possible to fix the journal. So let's not reload it until we have checked it.
The same thing is happening with wdqs1008, 3 hours later. Something spooky is going on... I will talk tomorrow morning with Bryan from Blazegraph; I'm not sure it's possible to do anything till then.
Mentioned in SAL (#wikimedia-operations) [2019-01-08T04:26:38Z] <onimisionipe> depooling wdqs1008 - T213134
I'm not touching wdqs100[78] yet, so that you have whatever is needed for investigation.
Long story short, the reason for the issue is that we've hit a hard limit on the number of allocators in Blazegraph. See https://wiki.blazegraph.com/wiki/index.php/FixedAllocators for details - there can't be more than 256K allocators. Allocator usage can be seen under http://localhost:9999/bigdata/status?dumpJournal - looking at the servers shows that we have a lot of small allocator blocks in use. We'll have to rearrange our data so that we use fewer allocators, since they are a constrained resource and raising the limit is currently impossible in Blazegraph.
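A rough sketch of how the "which servers are also in danger" check could be scripted, in Python. It only builds the `dumpJournal` URL quoted above and compares an allocator count against the limit. Everything else is an assumption for illustration: the helper names, the 10% threshold, and reading "256K" as 256 * 1024. The format of the `dumpJournal` output is not assumed here, so the allocator count has to be read from the dump by hand.

```python
from urllib.request import urlopen

# Hard cap mentioned above ("256K allocators").
# Assumption: 256K is taken to mean 256 * 1024.
ALLOCATOR_LIMIT = 256 * 1024


def dump_journal_url(host="localhost", port=9999):
    """Build the status URL quoted in the comment above."""
    return f"http://{host}:{port}/bigdata/status?dumpJournal"


def fetch_dump_journal(host="localhost", port=9999, timeout=60):
    """Download the raw dump as text; no parsing is attempted."""
    with urlopen(dump_journal_url(host, port), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def allocator_headroom(used, limit=ALLOCATOR_LIMIT):
    """Fraction of the allocator budget still free (0.0 to 1.0)."""
    if used < 0:
        raise ValueError("allocator count cannot be negative")
    return max(limit - used, 0) / limit


def servers_in_danger(counts, threshold=0.10):
    """Return hosts whose free allocator budget is below `threshold`.

    `counts` maps a host name to an allocator count that was read
    manually from that host's dumpJournal output.
    """
    return sorted(
        host for host, used in counts.items()
        if allocator_headroom(used) < threshold
    )
```

For example, `servers_in_danger({"host-a": 262000, "host-b": 100000})` flags only `host-a`; the host names and counts here are made up, not measurements from the real servers.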
Action plan (tasks to follow shortly):
Immediate:
- Copy database from wdq[345] to wdq7 and wdq8
- Restore updates on wdq7 and wdq8
- Collect allocator stats everywhere and see which servers are also in danger
- Write an incident report
Short-term:
- Split category namespace into a separate instance of Blazegraph
Longer-term (will require data reload):
- Disable "raw records" in Blazegraph
- Consider inlining values & references
- Consider setting INLINE_TEXT_LITERALS so that short strings are inlined, which doesn't use allocators
- Check what other things could be inlined
The servers are normal now, so I am closing this one and the rest will be done in T213210.