
blazegraph journal on wdqs1005 is oversized
Closed, Resolved (Public)

Description

For some reason, the journal on wdqs1005 has grown to 1.1T (from the usual ~650G)

Event Timeline

Gehel created this task. Wed, Nov 13, 4:40 PM
Restricted Application added a project: Wikidata. Wed, Nov 13, 4:40 PM
Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2019-11-13T16:40:59Z] <gehel> depool wdqs1005 - T238232

@Igorkim78 file has been uploaded.

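Igor advised running curl 'http://localhost:9999/bigdata/status?dumpJournal&dumpPages' to get journal dumps from wdqs1005 and from a healthy server (wdqs1006) for comparison. The URL needs to be quoted, otherwise the shell treats the & as a background operator.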
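
For anyone who prefers to script the dump collection rather than use curl, here is a minimal Python sketch. Only the URL comes from this task; the function names, the default timeout and the assumption that the response is text are illustrative and not part of any Blazegraph tooling.

```python
from urllib.request import urlopen


def dump_journal_url(host: str = "localhost", port: int = 9999) -> str:
    """Build the status URL quoted in this task (dumpJournal + dumpPages)."""
    return f"http://{host}:{port}/bigdata/status?dumpJournal&dumpPages"


def fetch_dump(host: str = "localhost", port: int = 9999,
               timeout: float = 600.0) -> str:
    """Fetch the journal dump as text.

    The timeout is an assumption: dumping a large journal can take a while,
    so a generous value is used here.
    """
    with urlopen(dump_journal_url(host, port), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def save_dump(path: str, host: str = "localhost", port: int = 9999) -> None:
    """Write the dump to a file so two servers can be compared later."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(fetch_dump(host, port))
```

Run on each host, something like save_dump("wdqs1005-journal.txt") gives one file per server to compare side by side.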

wdqs1006 reports that 574.6GiB are reserved for the journal and 544.3GiB are actually used (~5% of space unused).
wdqs1005, by contrast, reports that 1037.7GiB are reserved and only 543.5GiB are actually used (~47% of space unused).
Most of the %FileWaste is reserved for 8K allocators, but %SlotWaste is also higher than usual for the 4K (10 times higher than usual), 2K, 64 (3 times), 320 and 768 (2 times) allocators.
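
The unused-space percentages above can be checked with a few lines of Python. The figures are the ones quoted in this task; the helper name is just for illustration.

```python
def unused_fraction(reserved_gib: float, used_gib: float) -> float:
    """Fraction of the reserved journal space that holds no live data."""
    if reserved_gib <= 0:
        raise ValueError("reserved size must be positive")
    return (reserved_gib - used_gib) / reserved_gib


# (GiB reserved, GiB used) as reported by the journal dumps in this task.
servers = {
    "wdqs1006": (574.6, 544.3),
    "wdqs1005": (1037.7, 543.5),
}

for name, (reserved, used) in servers.items():
    print(f"{name}: {unused_fraction(reserved, used):.1%} unused")
# wdqs1006: 5.3% unused
# wdqs1005: 47.6% unused
```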

The number of slots allocated by 8K allocators is similar on both servers (less than 5% difference): about 5,179M vs 5,431M, and only ~1% of them (~63M) remain in use. This is a consequence of updates: for each update, the parts of the indices related to the changing data have to be copied to a new allocator with the changes applied. The old allocator can then be marked as unused and reused for later updates, once all connections that refer to the commit point linked to those allocators are closed. But if a commit point cannot be released, its allocators also remain locked.

Analyzing the Grafana reports, I assume most of the allocators were consumed gradually from Nov 1, 6:00 to Nov 4, 18:00.

Given all the above, the conclusion is that something (most probably a connection left open, intentionally or unintentionally) blocked the release of allocators for 3.5 days and prevented their reuse, so updates had to allocate new allocators. The commit point was then released and the locked allocators were released with it, but they could not be removed from the file, so the file simply ended up with more sparse (unused) space.
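
To make the mechanism above concrete, here is a toy model in Python. It is not Blazegraph's actual allocator code: the ToyStore class, its fields and the numbers are all hypothetical, and it only captures three ideas from the analysis: updates copy data into new slots, retired slots can be reused only once no commit point pins them, and the file never shrinks.

```python
from dataclasses import dataclass


@dataclass
class ToyStore:
    """Toy copy-on-write store (hypothetical, for illustration only)."""
    live: int                 # slots holding current data
    pinned: bool = False      # True while an old commit point cannot be released
    file_slots: int = 0       # slots ever carved out of the file (never shrinks)
    free: int = 0             # retired slots available for reuse
    deferred: int = 0         # retired slots still locked by the pinned commit point

    def __post_init__(self) -> None:
        self.file_slots = self.live

    def update(self, n: int) -> None:
        """Rewrite n slots: take n new slots, retire the n old ones."""
        reused = min(self.free, n)
        self.free -= reused
        self.file_slots += n - reused   # grow the file only if nothing is reusable
        if self.pinned:
            self.deferred += n          # old slots stay locked
        else:
            self.free += n              # old slots become reusable at once

    def release_pin(self) -> None:
        """Release the commit point: locked slots become free, file size is unchanged."""
        self.pinned = False
        self.free += self.deferred
        self.deferred = 0


def simulate(pinned: bool, rounds: int = 10, batch: int = 50,
             live: int = 100) -> ToyStore:
    store = ToyStore(live=live, pinned=pinned)
    for _ in range(rounds):
        store.update(batch)
    store.release_pin()
    return store


healthy = simulate(pinned=False)
stuck = simulate(pinned=True)
print("healthy:", healthy.file_slots, "slots in file,", healthy.free, "free")
print("stuck:  ", stuck.file_slots, "slots in file,", stuck.free, "free")
```

With these made-up numbers, the healthy store levels off at 150 slots, while the pinned one grows to 600 slots and ends with 500 of them free once the pin is released. Both hold the same 100 live slots, which mirrors the pattern seen on wdqs1005: similar data in use, a much larger file.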

Mentioned in SAL (#wikimedia-operations) [2019-11-14T10:06:33Z] <gehel> copying journal from wdqs1007 to wdqs1005 - T238232

Copy completed, server back up, pooled and caught up on lag.

TJones closed this task as Resolved. Wed, Nov 20, 4:56 PM
TJones claimed this task.