Current Status: WARNING (for 4d 10h 15m 36s)
Status Information: DISK WARNING - free space: /srv 103841 MB (5% inode=93%):
Performance Data: /dev=0MB;9;9;0;10 /run=1352MB;12081;12467;0;12853 /=4902MB;88058;90868;0;93679 /dev/shm=0MB;30205;31169;0;32133 /run/lock=0MB;4;4;0;5 /sys/fs/cgroup=0MB;30205;31169;0;32133 /srv=1945563MB;2029556;2094329;0;2159103
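The Performance Data field above follows the usual Nagios plugin layout of space-separated `label=value[unit];warn;crit;min;max` entries. A minimal sketch for splitting it into fields is shown here. The names `PerfDatum` and `parse_perfdata` are made up for illustration, and the sketch assumes unquoted labels and plain numeric thresholds (no range syntax).

```python
import re
from typing import List, NamedTuple, Optional


class PerfDatum(NamedTuple):
    """One perfdata entry (hypothetical helper type, not part of any plugin)."""
    label: str
    value: float
    unit: str
    warn: Optional[float]
    crit: Optional[float]
    minimum: Optional[float]
    maximum: Optional[float]


# A number followed by an optional unit such as "MB" or "%".
_VALUE_RE = re.compile(r"^(-?[0-9.]+)([A-Za-z%]*)$")


def _num(text: str) -> Optional[float]:
    # Empty threshold fields are allowed and mean "not set".
    return float(text) if text else None


def parse_perfdata(perfdata: str) -> List[PerfDatum]:
    """Parse space-separated perfdata entries; assumes labels contain no spaces."""
    result = []
    for token in perfdata.split():
        label, _, rest = token.rpartition("=")
        fields = rest.split(";")
        fields += [""] * (5 - len(fields))  # pad missing trailing fields
        match = _VALUE_RE.match(fields[0])
        if not label or match is None:
            continue  # skip anything that does not look like label=value
        result.append(
            PerfDatum(
                label.strip("'"),
                float(match.group(1)),
                match.group(2),
                *[_num(f) for f in fields[1:5]],
            )
        )
    return result
```

For example, `parse_perfdata("/srv=1945563MB;2029556;2094329;0;2159103")[0].value` gives the used space on /srv in MB.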
Description
Related Objects
- Mentioned Here
- T210818: Move admin cron jobs to systemd timers
Event Timeline
/srv/carbon/whisper/admin-monitoring/fullstackd-* is using 1.5TB
@Andrew do we need to keep Graphite monitoring data for the fullstackd instances? I couldn't find an alert for them so I'm not sure if we're using this data.
I believe that is consumed by this dashboard:
https://grafana.wikimedia.org/d/000000405/wmcs-api-uptimes?orgId=1
It may be used elsewhere. I don't mind if you purge older records if needed.
Relevant docs: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Metrics_life-cycle
/usr/local/bin/archive-instances has been failing since 2018-05-31.
/var/log/graphite/instance-archiver.log @ labmon1001:
2018-05-31 13:01:14,813 Exception!
Traceback (most recent call last):
  File "/usr/local/bin/archive-instances", line 149, in <module>
    archived_name = archive_host(project, host)
  File "/usr/local/bin/archive-instances", line 79, in archive_host
    os.rename(cur_path, archived_path)
OSError: [Errno 13] Permission denied
Latest error:
2019-02-11 13:01:37,019 Found 142 host(s) in 55923 project(s) to archive
2019-02-11 13:01:37,020 Exception!
Traceback (most recent call last):
  File "/usr/local/bin/archive-instances", line 149, in <module>
    archived_name = archive_host(project, host)
  File "/usr/local/bin/archive-instances", line 79, in archive_host
    os.rename(cur_path, archived_path)
OSError: [Errno 13] Permission denied
It seems that at some point in April 2018, archive-instances was executed as root, which left some files under /srv/carbon/whisper/archived_metrics with incorrect ownership. As a result, the regular cronjob running as _graphite could not move files there.
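An ownership problem like this can be confirmed by listing everything under the directory that is not owned by the expected user. The following is only a sketch: `find_foreign_owned` is a hypothetical helper, not part of archive-instances, and it does not follow symlinks.

```python
import os
from typing import List


def find_foreign_owned(root: str, expected_uid: int) -> List[str]:
    """Return paths under root (including root) whose owner is not expected_uid."""
    foreign = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Check the directory itself, then every file directly inside it.
        candidates = [dirpath] + [os.path.join(dirpath, n) for n in filenames]
        for path in candidates:
            # lstat inspects a symlink itself rather than its target.
            if os.lstat(path).st_uid != expected_uid:
                foreign.append(path)
    return foreign
```

On the host in question, the expected UID could be looked up with `pwd.getpwnam("_graphite").pw_uid` and passed in together with /srv/carbon/whisper/archived_metrics. Any path returned is a candidate cause of the `Permission denied` raised by `os.rename`.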
Fix user/group permissions.
gtirloni@labmon1001:~$ sudo chown _graphite:graphite /srv/carbon/whisper/archived_metrics
_graphite@labmon1001:~$ df -h /srv
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/labmon1001--vg-data  2.1T  2.0T   52G  98% /srv
_graphite@labmon1001:~$ time /usr/local/bin/archive-instances
_graphite@labmon1001:~$ find /srv/carbon/whisper/archived_metrics -mtime +90 -type f -delete
_graphite@labmon1001:~$ df -h /srv
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/labmon1001--vg-data  2.1T  1.2T  850G  58% /srv
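The `find ... -mtime +90 -type f -delete` step can be approximated in Python with a dry-run default, which makes it easier to review what would be removed first. This is a sketch only: `prune_old_files` is a hypothetical name, and it uses a plain age threshold in seconds, whereas `find -mtime +90` rounds ages down to whole days and so matches files at least 91 days old.

```python
import os
import stat
import time
from typing import List, Optional


def prune_old_files(
    root: str,
    max_age_days: float,
    dry_run: bool = True,
    now: Optional[float] = None,
) -> List[str]:
    """Return regular files under root older than max_age_days; delete them unless dry_run."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    matched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)
            # Only regular files, like find's "-type f"; symlinks are skipped.
            if stat.S_ISREG(info.st_mode) and info.st_mtime < cutoff:
                matched.append(path)
                if not dry_run:
                    os.remove(path)
    return matched
```

Calling `prune_old_files("/srv/carbon/whisper/archived_metrics", 90)` only lists candidates; passing `dry_run=False` deletes them.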
The situation regarding monitoring cronjobs should improve with T210818.
Mentioned in SAL (#wikimedia-cloud) [2019-02-11T18:13:16Z] <gtirloni> cleaned old metrics data in labmon1001 T215417