
labmon1001: archive-instances not working
Closed, Resolved (Public)

Description

Current Status:	  WARNING   (for 4d 10h 15m 36s)
Status Information:	DISK WARNING - free space: /srv 103841 MB (5% inode=93%):
Performance Data:	/dev=0MB;9;9;0;10 /run=1352MB;12081;12467;0;12853 /=4902MB;88058;90868;0;93679 /dev/shm=0MB;30205;31169;0;32133 /run/lock=0MB;4;4;0;5 /sys/fs/cgroup=0MB;30205;31169;0;32133 /srv=1945563MB;2029556;2094329;0;2159103
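
For readers unfamiliar with the "Performance Data" line, the following is a minimal sketch of how it could be split into fields, assuming the common Nagios plugin layout `label=value[unit];warn;crit;min;max`. The names `PerfDatum` and `parse_perfdata` are hypothetical and are not part of any tool mentioned in this task; quoted labels containing spaces are not handled.

```python
import re
from typing import List, NamedTuple, Optional


class PerfDatum(NamedTuple):
    """One perfdata entry (hypothetical helper type for this sketch)."""
    label: str
    value: float
    unit: str
    warn: Optional[float]
    crit: Optional[float]
    minimum: Optional[float]
    maximum: Optional[float]


# A number followed by an optional unit such as "MB" or "%".
_VALUE_RE = re.compile(r"^(-?[0-9.]+)([A-Za-z%]*)$")


def _num(field: str) -> Optional[float]:
    """Convert a threshold field to float; empty or range syntax gives None."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_perfdata(text: str) -> List[PerfDatum]:
    """Parse space-separated perfdata tokens, skipping malformed ones."""
    data = []
    for token in text.split():
        label, _, rest = token.rpartition("=")
        fields = rest.split(";")
        match = _VALUE_RE.match(fields[0])
        if not label or not match:
            continue
        # Pad so that missing trailing fields become None.
        fields += [""] * (5 - len(fields))
        data.append(PerfDatum(label, float(match.group(1)), match.group(2),
                              _num(fields[1]), _num(fields[2]),
                              _num(fields[3]), _num(fields[4])))
    return data


# Usage with the /srv entry from the alert above.
srv = parse_perfdata("/srv=1945563MB;2029556;2094329;0;2159103")[0]
print(srv.label, srv.value, srv.unit, srv.maximum)
```

The raw value/maximum ratio need not match the free-space percentage in the status line, since the plugin may account for reserved blocks when computing it.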

Event Timeline

/srv/carbon/whisper/admin-monitoring/fullstackd-* is using 1.5TB

@Andrew do we need to keep Graphite monitoring data for the fullstackd instances? I couldn't find an alert for them so I'm not sure if we're using this data.

I believe that is consumed by this dashboard:

https://grafana.wikimedia.org/d/000000405/wmcs-api-uptimes?orgId=1

It may be used elsewhere. I don't mind if you purge older records if needed.

Relevant docs: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Metrics_life-cycle

/usr/local/bin/archive-instances has been failing since 2018-05-31.

/var/log/graphite/instance-archiver.log @ labmon1001:

2018-05-31 13:01:14,813 Exception!
Traceback (most recent call last):
  File "/usr/local/bin/archive-instances", line 149, in <module>
    archived_name = archive_host(project, host)
  File "/usr/local/bin/archive-instances", line 79, in archive_host
    os.rename(cur_path, archived_path)
OSError: [Errno 13] Permission denied

Latest error:

2019-02-11 13:01:37,019 Found 142 host(s) in 55923 project(s) to archive
2019-02-11 13:01:37,020 Exception!
Traceback (most recent call last):
  File "/usr/local/bin/archive-instances", line 149, in <module>
    archived_name = archive_host(project, host)
  File "/usr/local/bin/archive-instances", line 79, in archive_host
    os.rename(cur_path, archived_path)
OSError: [Errno 13] Permission denied
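
The traceback does not say which path was denied. As a rough diagnostic sketch (not part of archive-instances; `rename_blockers` is a hypothetical helper), one could check the permissions that `os.rename` typically needs on Linux before calling it: write and search access on both parent directories, and write access on the source itself when it is a directory being moved to a new parent.

```python
import os


def rename_blockers(src, dst):
    """Return best-effort reasons why os.rename(src, dst) may be denied.

    Checks are made for the user running this code, so run it as the
    same account as the cron job (assumed here to be _graphite).
    """
    blockers = []
    needed = os.W_OK | os.X_OK
    for label, path in (("source", src), ("destination", dst)):
        parent = os.path.dirname(os.path.abspath(path))
        # os.access also returns False when the parent does not exist.
        if not os.access(parent, needed):
            blockers.append(f"{label} parent not writable: {parent}")
    # Moving a directory to another parent rewrites its ".." entry,
    # which requires write permission on the directory itself.
    if os.path.isdir(src) and not os.access(src, os.W_OK):
        blockers.append(f"source directory not writable: {src}")
    return blockers
```

An empty list does not guarantee success (sticky bits and mount options are not checked), but a non-empty one points at the directory to inspect.
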
GTirloni renamed this task from "labmon1001: DISK WARNING - free space: /srv 103841 MB (5% inode=93%):" to "labmon1001: archive-instances not working". Feb 11 2019, 5:47 PM

It seems that at some point in April 2018, archive-instances was executed as root, which left some files under /srv/carbon/whisper/archived_metrics with incorrect ownership. The regular cronjob running as _graphite could then no longer move files there.
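
To locate entries with unexpected ownership, a small scan along these lines might help. This is only a sketch: `find_foreign_owned` is a hypothetical helper, and looking up the uid via `pwd.getpwnam("_graphite")` assumes that account exists on the host.

```python
import os


def find_foreign_owned(root, expected_uid):
    """Return directories and files under root not owned by expected_uid.

    Symlinks are inspected themselves (lstat) rather than followed.
    """
    mismatched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        candidates = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in candidates:
            try:
                if os.lstat(path).st_uid != expected_uid:
                    mismatched.append(path)
            except OSError:
                # The entry vanished or is unreadable; skip it.
                continue
    return mismatched


# Example usage (assumed account name and path, run on the affected host):
#   import pwd
#   uid = pwd.getpwnam("_graphite").pw_uid
#   for path in find_foreign_owned("/srv/carbon/whisper/archived_metrics", uid):
#       print(path)
```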

Fixed the user/group ownership:

gtirloni@labmon1001:~$ sudo chown _graphite:graphite /srv/carbon/whisper/archived_metrics

_graphite@labmon1001:~$ df -h /srv
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/labmon1001--vg-data  2.1T  2.0T   52G  98% /srv

_graphite@labmon1001:~$ time /usr/local/bin/archive-instances

_graphite@labmon1001:~$ find /srv/carbon/whisper/archived_metrics -mtime +90 -type f -delete

_graphite@labmon1001:~$ df -h /srv
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/labmon1001--vg-data  2.1T  1.2T  850G  58% /srv
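
For reference, the `find ... -mtime +90 -type f -delete` step above could be approximated in Python with a dry-run option, which makes it possible to review what would be removed first. This is a sketch under the assumption that `-mtime +90` means "age in whole days greater than 90"; `purge_old_files` is a hypothetical name, not an existing tool.

```python
import os
import stat
import time

SECONDS_PER_DAY = 86400


def purge_old_files(root, days, dry_run=True, now=None):
    """Return regular files under root whose age in whole days exceeds days.

    With dry_run=False the matching files are also deleted.
    """
    now = time.time() if now is None else now
    matched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # Mirror "-type f": skip symlinks and other non-regular files.
            if not stat.S_ISREG(st.st_mode):
                continue
            age_days = int((now - st.st_mtime) // SECONDS_PER_DAY)
            if age_days > days:
                matched.append(path)
                if not dry_run:
                    os.remove(path)
    return matched


# Example usage (path taken from the transcript above):
#   for path in purge_old_files("/srv/carbon/whisper/archived_metrics", 90):
#       print("would delete", path)
```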

The situation regarding monitoring cronjobs will improve with T210818.

Mentioned in SAL (#wikimedia-cloud) [2019-02-11T18:13:16Z] <gtirloni> cleaned old metrics data in labmon1001 T215417