wdqs1007 started producing unusual write errors (see the full log in https://github.com/blazegraph/database/issues/114), and it looks like the database is corrupted. It probably needs to be reloaded from another server.
Long story short, the cause is that we've hit a hard limit on the number of allocators in Blazegraph. See https://wiki.blazegraph.com/wiki/index.php/FixedAllocators for details: there can't be more than 256K allocators. Allocator usage can be seen at http://localhost:9999/bigdata/status?dumpJournal, and looking at the servers shows that we have a lot of small allocator blocks in use. Since allocators are a constrained resource and raising the limit is currently impossible in Blazegraph, we'll have to rearrange our data so that it uses fewer of them.
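For collecting the stats, a minimal Python sketch could look like the one below. It only builds the dumpJournal URL mentioned above, fetches the page, and computes how much of the limit is left for a given allocator count. The helper names (`dump_journal_url`, `fetch_dump_journal`, `allocator_headroom`) are made up for illustration, the limit assumes "256K" means 256 * 1024, and the allocator count itself still has to be read from the dump by hand, since the output format is not parsed here.

```python
from urllib.request import urlopen

# Assumption: "256K allocators" is taken to mean 256 * 1024.
ALLOCATOR_LIMIT = 256 * 1024


def dump_journal_url(host: str = "localhost", port: int = 9999) -> str:
    """Build the dumpJournal status URL for one Blazegraph host."""
    return f"http://{host}:{port}/bigdata/status?dumpJournal"


def fetch_dump_journal(host: str = "localhost", port: int = 9999,
                       timeout: float = 60.0) -> str:
    """Fetch the raw dumpJournal page as text (no parsing is attempted)."""
    with urlopen(dump_journal_url(host, port), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def allocator_headroom(allocators_in_use: int,
                       limit: int = ALLOCATOR_LIMIT) -> float:
    """Return the fraction of the allocator limit that is still free."""
    if not 0 <= allocators_in_use <= limit:
        raise ValueError("allocator count must be between 0 and the limit")
    return 1.0 - allocators_in_use / limit
```

Running `fetch_dump_journal()` on each server and saving the output would give us a snapshot to compare; `allocator_headroom(n)` then tells us how close a server with `n` allocators is to the limit.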
Action plan (tasks to follow shortly):
- Copy database from wdq to wdq7 and wdq8
- Restore updates on wdq7 and wdq8
- Collect allocator stats everywhere and see which servers are also in danger
- Write an incident report
- Split category namespace into a separate instance of Blazegraph
Longer-term (will require data reload):
- Disable "raw records" in Blazegraph
- Consider inlining values & references
- Consider setting INLINE_TEXT_LITERALS so that short strings are inlined; inlined values don't use allocators
- Check what other things could be inlined