wdqs100[78] database corruption
Closed, Resolved · Public

Description

wdqs1007 started producing unusual write errors (see the full log in https://github.com/blazegraph/database/issues/114), and it looks like the database is corrupted. It probably needs to be reloaded from another server.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Smalyshev triaged this task as Unbreak Now! priority. · Jan 8 2019, 12:22 AM

@Gehel According to the discussion in https://github.com/blazegraph/database/issues/114, it might be possible to fix the journal. So let's not reload it until we have checked it.

The same is happening with wdq8, 3 hours later. Something spooky is going on... I will talk with Bryan from Blazegraph tomorrow morning; I'm not sure it's possible to do anything until then.

I'm not touching wdqs100[78] yet, so that you have whatever is needed for investigation.

Addshore renamed this task from wdqs1007 database corruption to wdqs100[78] database corruption. · Jan 8 2019, 11:51 AM
Addshore subscribed.

Long story short, the reason for the issue is that we've hit a hard limit on the number of allocators in Blazegraph. See https://wiki.blazegraph.com/wiki/index.php/FixedAllocators for details: there can't be more than 256K allocators. Allocator usage can be seen under http://localhost:9999/bigdata/status?dumpJournal, and looking at the servers shows that we use a lot of small allocator blocks. We'll have to rearrange our data so that we use fewer allocators, since they are a constrained resource and raising the limit is currently impossible in Blazegraph.

Action plan (tasks to follow shortly):

Immediate:

  • Copy database from wdq[345] to wdq7 and wdq8
  • Restore updates on wdq7 and wdq8
  • Collect allocator stats everywhere and see which servers are also in danger
  • Write an incident report
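
For the allocator stats item above, a minimal sketch of how the collected numbers could be compared against the limit. It assumes that "256K" means 256 * 1024 and that the per-server allocator counts have already been read off the dumpJournal status page by hand; the function name, the 80% warning threshold, and the sample host names and counts are hypothetical, not measurements.

```python
# Assumption: the "256K" allocator limit means 256 * 1024.
ALLOCATOR_LIMIT = 256 * 1024


def allocator_headroom(counts, warn_fraction=0.8):
    """Map each host to (fraction of the limit used, in-danger flag).

    counts: dict of host name -> number of allocators in use, collected
    manually from the dumpJournal status page (parsing is not attempted here).
    warn_fraction: hypothetical threshold above which a server is flagged.
    """
    report = {}
    for host, used in counts.items():
        fraction = used / ALLOCATOR_LIMIT
        report[host] = (fraction, fraction >= warn_fraction)
    return report


if __name__ == "__main__":
    # Placeholder values for illustration only.
    sample = {"server-a": 131072, "server-b": 250000}
    for host, (fraction, danger) in sorted(allocator_headroom(sample).items()):
        flag = " (in danger)" if danger else ""
        print(f"{host}: {fraction:.1%} of limit{flag}")
```

The threshold is a judgment call; anything that leaves enough headroom to react before the hard limit is reached would do.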

Short-term:

  • Split category namespace into a separate instance of Blazegraph

Longer-term (will require data reload):

  • Disable "raw records" in Blazegraph
  • Consider inlining values & references
  • Consider setting INLINE_TEXT_LITERALS so that short strings are inlined; inlined values don't use allocators
  • Check what other things could be inlined
Smalyshev claimed this task.

The servers are normal now, so I am closing this one and the rest will be done in T213210.