
disk space alert on maps1001
Closed, ResolvedPublic

Description

Needs investigation.

4:12 PM <icinga-wm> PROBLEM - Disk space on maps1001 is CRITICAL: DISK CRITICAL - free space: /srv 54936 MB (3% inode=99%)
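
A reasonable first step for this kind of investigation is to check what is actually consuming the partition. A minimal sketch, assuming the usual data directories live under /srv (not confirmed for maps1001):

# Overall usage of the partition that triggered the alert
df -h /srv

# Largest consumers under /srv, sorted by size
sudo du -sh /srv/* 2>/dev/null | sort -h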

Event Timeline

This is related to T194966. New disks are on the way to increase storage space.

In the meantime, I have reduced the Cassandra GC grace time from 10 days to 4 days. This seems to have helped a bit, but not by much.
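
For reference, GC grace is the per-table gc_grace_seconds setting in Cassandra, so a change like the one above would be applied with ALTER TABLE. A minimal sketch; the keyspace and table names are placeholders, not the actual maps schema:

# 4 days = 345600 seconds; keyspace/table names are hypothetical
cqlsh -e "ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 345600;"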

Mentioned in SAL (#wikimedia-operations) [2018-07-24T12:11:49Z] <gehel> vacuum full of postgres on maps1001 to try to reclaim space - T200228
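
For context, VACUUM FULL rewrites each table into a new file and returns the reclaimed space to the operating system, unlike a plain VACUUM which only marks dead rows as reusable. A minimal sketch of what such a run might look like, assuming it is done with vacuumdb as the postgres user:

# Rewrites every table in the gis database; takes an ACCESS EXCLUSIVE lock per table
sudo -u postgres vacuumdb --full --verbose --dbname=gis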

For the record, check_postgres_bloat -H localhost -u osmupdater -db gis gives a report of the wasted space in the gis db. In our case, almost 150GB seems to be recoverable.

Thanks, @Gehel. I guess this is a duplicate of T194966 and can be closed, then.

> For the record, check_postgres_bloat -H localhost -u osmupdater -db gis gives a report of the wasted space in the gis db. In our case, almost 150GB seems to be recoverable.

Ah, thanks, this is really handy to know!

@Mholloway I was reading T194966 as the general issue of growing space usage and this ticket as the current disk usage alert. But either way is fine with me.

Gehel triaged this task as High priority. Jul 24 2018, 1:35 PM

OK, let's leave this open until the situation is resolved.

Change 447627 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] maps: disable OSM updates on eqiad while vacuum is running

https://gerrit.wikimedia.org/r/447627

Mentioned in SAL (#wikimedia-operations) [2018-07-24T15:29:20Z] <gehel> restart postgres on maps1001 - T200228

Change 447627 merged by Gehel:
[operations/puppet@production] maps: disable OSM updates on eqiad while vacuum is running

https://gerrit.wikimedia.org/r/447627

Mentioned in SAL (#wikimedia-operations) [2018-07-24T19:09:16Z] <gehel> resetting postgres data on maps1003 after failing replication - T200228

60GB has been recovered so far. We should be able to recover another 125GB from planet_osm_ways, but given the size of that table and the amount of available disk space, the vacuum would probably fail. Let's keep that option open, but wait for the new disks and reimage instead.
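
For context, VACUUM FULL needs enough free disk space to hold a complete rewritten copy of the table while it runs, which is why it would likely fail on planet_osm_ways here. A minimal sketch of checking the numbers before attempting it; the database and table names are taken from this task, everything else is standard PostgreSQL:

# On-disk size of the table, including indexes and TOAST
sudo -u postgres psql -d gis -c "SELECT pg_size_pretty(pg_total_relation_size('planet_osm_ways'));"

# Free space on the partition holding the data directory
df -h /srv

# Only worth attempting if free space comfortably exceeds the table size
sudo -u postgres psql -d gis -c "VACUUM FULL VERBOSE planet_osm_ways;"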

Note: during the vacuum, the slaves could not keep up with replication and are now out of sync. I'm resetting them one by one.
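
For reference, resetting a PostgreSQL replica typically means discarding its data directory and re-cloning it from the primary. A minimal sketch, assuming streaming replication bootstrapped with pg_basebackup; the data directory path, primary hostname, and replication user are placeholders, not the actual maps configuration:

# On the out-of-sync replica (path, host and user are hypothetical)
sudo systemctl stop postgresql
sudo -u postgres rm -rf /srv/postgresql/main
sudo -u postgres pg_basebackup -h maps1001.eqiad.wmnet -U replication \
    -D /srv/postgresql/main -X stream -R --progress
sudo systemctl start postgresql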

Mentioned in SAL (#wikimedia-operations) [2018-07-25T06:53:18Z] <gehel> resetting postgres data on maps1004 after failing replication - T200228

Mentioned in SAL (#wikimedia-operations) [2018-07-25T13:00:53Z] <gehel> resetting postgres data on maps1002 after failing replication - T200228

Gehel claimed this task.

We now have ~200GB free on the /srv partition. This should be enough to hold us over until the new disks arrive and the full reimage is done. Closing this for now.

Change 449528 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] maps: re-enable osm replication

https://gerrit.wikimedia.org/r/449528

Change 449528 merged by Gehel:
[operations/puppet@production] maps: re-enable osm replication

https://gerrit.wikimedia.org/r/449528