
WDQS disk usage increase is correlated with reloading of categories
Closed, ResolvedPublic

Description

We are getting low on disk space for the WDQS servers. This is being addressed in T196485. In the meantime, while looking at graphs, we see a weekly increase in disk usage at the same time the reloadCategories.sh cron is scheduled. The increase seems to be between 5 and 20 GB each time. The latest categories .ttl.gz files are left on disk, but they are too small to explain this increase.
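For anyone wanting to reproduce this on a host, something like the following shows the same picture (paths are assumptions based on the usual WDQS layout, adjust as needed):

```
# Paths are assumptions; adjust to the actual WDQS data directory.
df -h /srv                                # overall usage on the data partition
ls -lh /srv/wdqs/wikidata.jnl             # the Blazegraph journal that grows
ls -lh /srv/wdqs/*categories*.ttl.gz      # the dump files left behind
```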

It looks to me like the old category namespaces are not being cleaned up.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Looking at http://localhost:9999/bigdata/#namespaces it seems that the old category namespaces are being deleted. But maybe the disk space is not recovered on deletion?
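For completeness, the same list can be pulled from the REST API instead of the workbench; this is my understanding of the NanoSparqlServer endpoint:

```
# List all namespaces known to this Blazegraph instance.
# The endpoint returns an RDF description of each namespace.
curl -s 'http://localhost:9999/bigdata/namespace' -H 'Accept: application/rdf+xml'
```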

It looks like there is some configuration around the release of historical data. Setting com.bigdata.service.AbstractTransactionService.minReleaseAge=1 might allow us to reclaim the space.
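For reference, that would be a one-line change in RWStore.properties, something like:

```
# Release historical commit points after 1 ms so their space can be recycled.
com.bigdata.service.AbstractTransactionService.minReleaseAge=1
```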

Yep, this looks like what we should be doing.

Damn, we already set minReleaseAge=1 in RWStore.properties. We need to look for something else.

Change 448591 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: disable categories reload

https://gerrit.wikimedia.org/r/448591

Change 448591 merged by Gehel:
[operations/puppet@production] wdqs: disable categories reload

https://gerrit.wikimedia.org/r/448591

Change 448597 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: fix ensure of reload categories cron

https://gerrit.wikimedia.org/r/448597

Change 448597 merged by Gehel:
[operations/puppet@production] wdqs: fix ensure of reload categories cron

https://gerrit.wikimedia.org/r/448597

Generally, since the new categories are loaded before the old ones are deleted, the space bump is expected; Blazegraph also allocates disk space in big chunks, so the bump can be noticeable. What is also expected is that when the old categories namespace is removed, its space is freed up and then reused for newly incoming data. However, I am not sure how to verify that. There might be fragmentation or leak issues.
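To illustrate the load-then-delete pattern (a sketch of what reloadCategories.sh does, not the actual script; the namespace names, file path, and loading mechanism here are assumptions based on Blazegraph's REST API):

```
# Sketch only: the real reloadCategories.sh differs in details.
NEW=categories$(date +%Y%m%d)        # namespace for this week's dump
OLD=categories20180715               # previous week's namespace (illustrative)

# 1. Create the new namespace.
curl -s -X POST 'http://localhost:9999/bigdata/namespace' \
  -H 'Content-Type: text/plain' \
  --data "com.bigdata.rdf.sail.namespace=$NEW"

# 2. Load the new dump while the old namespace still holds its data;
#    this is the moment both copies exist, hence the space bump.
curl -s -X POST "http://localhost:9999/bigdata/namespace/$NEW/sparql" \
  -H 'Content-Type: application/sparql-update' \
  --data "LOAD <file:///srv/wdqs/categories.ttl>"

# 3. Drop the old namespace; Blazegraph should mark its allocators free
#    for reuse, but the journal file itself never shrinks.
curl -s -X DELETE "http://localhost:9999/bigdata/namespace/$OLD"
```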

We probably need to look into internal Blazegraph metrics and see what actual disk usage vs. number of triples vs. allocated space looks like.
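If I read the NanoSparqlServer docs right, those numbers should be exposed over HTTP, e.g.:

```
# Dump Blazegraph's internal performance counters (allocations, disk, etc.).
curl -s 'http://localhost:9999/bigdata/counters'
# High-level status page, including journal and namespace information.
curl -s 'http://localhost:9999/bigdata/status'
```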

Blazegraph also has a space-compacting tool, but it requires a database shutdown and I am not sure how long it would take to run. I can experiment with that.
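The tool I have in mind is the offline journal compactor; the invocation below is my best guess at the class and arguments, to be verified before trying it on a production journal:

```
# Must be run with Blazegraph stopped; writes a compacted copy of the journal.
java -cp blazegraph.jar com.bigdata.journal.CompactJournalUtility \
  /srv/wdqs/wikidata.jnl /srv/wdqs/wikidata-compacted.jnl
```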

Also, once T198356 is implemented, we won't need to reload the categories namespace (at least not as often), so this issue should be eliminated.

Smalyshev claimed this task.

This does not happen anymore since we're using dailies.