Needs investigation.
4:12 PM <icinga-wm> PROBLEM - Disk space on maps1001 is CRITICAL: DISK CRITICAL - free space: /srv 54936 MB (3% inode=99%)
This is related to T194966. New disks are on the way to increase storage space.
In the meantime, I have reduced the Cassandra GC grace time from 10 days to 4 days. This seems to have helped a bit, but not that much.
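For reference, a minimal sketch of how such a change could be applied per table via cqlsh; the keyspace and table names below are placeholders, not necessarily the actual maps schema:

```
# Hypothetical example: lower gc_grace_seconds from 10 days (864000 s) to 4 days (345600 s)
# on one table. Keyspace/table names are placeholders.
cqlsh maps1001.eqiad.wmnet -e "ALTER TABLE tiles_keyspace.tiles WITH gc_grace_seconds = 345600;"
```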
Mentioned in SAL (#wikimedia-operations) [2018-07-24T12:11:49Z] <gehel> vacuum full of postgres on maps1001 to try to reclaim space - T200228
For the record, check_postgres_bloat -H localhost -u osmupdater -db gis gives a report of the wasted space in the gis db. In our case, almost 150GB seems to be recoverable.
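The reclaim itself (the SAL entry above) boils down to a VACUUM FULL on the bloated tables; a minimal sketch, assuming the gis database and that OSM updates are paused first (the table list here is illustrative):

```
# Hypothetical sketch: reclaim bloat with VACUUM FULL on the gis database.
# VACUUM FULL rewrites each table and takes an exclusive lock, so OSM updates
# must be stopped first; table names are illustrative.
sudo -u postgres psql -d gis -c "VACUUM (FULL, VERBOSE) planet_osm_polygon;"
sudo -u postgres psql -d gis -c "VACUUM (FULL, VERBOSE) planet_osm_line;"
```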
@Mholloway My understanding was that T194966 covers the general issue of increasing space usage, and this ticket the current disk over-usage alert. But either way is fine with me.
Change 447627 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] maps: disable OSM updates on eqiad while vacuum is running
Mentioned in SAL (#wikimedia-operations) [2018-07-24T15:29:20Z] <gehel> restart postgres on maps1001 - T200228
Change 447627 merged by Gehel:
[operations/puppet@production] maps: disable OSM updates on eqiad while vacuum is running
Mentioned in SAL (#wikimedia-operations) [2018-07-24T19:09:16Z] <gehel> resetting postgres data on maps1003 after failing replication - T200228
60GB have been recovered so far. We should be able to recover another 125GB from planet_osm_ways, but given the size of that table and the limited disk space available, that vacuum would probably fail. Let's keep the option open, but wait for the new disks and a reimage instead.
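A quick way to see where the remaining reclaimable space sits (e.g. how large planet_osm_ways is relative to the rest); a sketch, assuming the gis database:

```
# Hypothetical sketch: list the largest relations in the gis database.
sudo -u postgres psql -d gis -c "
  SELECT relname,
         pg_size_pretty(pg_total_relation_size(oid)) AS total_size
  FROM pg_class
  WHERE relkind = 'r'
  ORDER BY pg_total_relation_size(oid) DESC
  LIMIT 10;"
```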
Note: during the vacuum, the slaves could not keep up with replication and are now out of sync. I'm resetting them one by one.
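For reference, a minimal sketch of re-seeding an out-of-sync standby from the master with pg_basebackup; the hostnames, data directory and replication user are assumptions, not the exact procedure used here:

```
# Hypothetical sketch of resetting a standby: stop it, wipe its data directory,
# and re-clone from the master. Paths, hostnames and the replication user are
# assumptions; -R writes a minimal recovery.conf so streaming replication resumes.
sudo service postgresql stop
sudo rm -rf /srv/postgresql/9.6/main/*
sudo -u postgres pg_basebackup -h maps1001.eqiad.wmnet -U replication \
     -D /srv/postgresql/9.6/main -X stream -R -P
sudo service postgresql start
```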
Mentioned in SAL (#wikimedia-operations) [2018-07-25T06:53:18Z] <gehel> resetting postgres data on maps1004 after failing replication - T200228
Mentioned in SAL (#wikimedia-operations) [2018-07-25T13:00:53Z] <gehel> resetting postgres data on maps1002 after failing replication - T200228
We now have ~200GB free on the /srv partition. This should be sufficient to hold us over until the new disks arrive and the hosts are reimaged. Closing this for now.
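For completeness, the free space can be confirmed directly on the host:

```
# Check remaining space on the data partition (matches the ~200GB figure above).
df -h /srv
```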
Change 449528 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] maps: re-enable osm replication
Change 449528 merged by Gehel:
[operations/puppet@production] maps: re-enable osm replication