
tools: out of disk: tools-worker-1020
Closed, Resolved · Public

Description

The tools-worker-1020.tools.eqiad.wmflabs VM is running out of disk space, as reported by the Grafana dashboard https://grafana-labs.wikimedia.org/dashboard/db/tools-basic-alerts?refresh=5m&orgId=1

I jumped in via SSH and investigated where the bulk of the data is:

df -h
Filesystem                                                Size  Used Avail Use% Mounted on
udev                                                       10M     0   10M   0% /dev
tmpfs                                                     1.6G  178M  1.4G  12% /run
/dev/vda3                                                  19G   17G  733M  96% /
[...]
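
A quick way to narrow down which directory is eating the root filesystem (a sketch; depth can be adjusted):

# summarize first-level directory sizes on /, staying on this filesystem
sudo du -xh --max-depth=1 / 2>/dev/null | sort -h | tail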

The problem seems to be in /var/log/

aborrero@tools-worker-1020:/var/log$ ls -Slah | head
total 14G
-rw-r--r--  1 root        adm     6.4G Jan 10 12:48 syslog
-rw-r-----  1 root        adm     6.4G Jan 10 12:48 daemon.log
-rw-r-----  1 root        adm     285M Jan 10 12:48 auth.log
-rw-r--r--  1 root        adm      36M Jan 10 12:40 messages
-rw-r-----  1 root        adm      29M Jan 10 06:25 kern.log
-rw-r-----  1 root        adm      14M Jul  2  2017 daemon.log.1
-rw-r-----  1 root        adm     9.7M Jul  2  2017 auth.log.1
-rw-r-----  1 root        adm     7.0M Jan 10 12:40 user.log
-rw-rw-r--  1 root        utmp    5.1M Jan 10 12:32 lastlog

Both syslog and daemon.log are flooded with the same messages:

aborrero@tools-worker-1020:/var/log$ sudo tail syslog
Jan 10 12:49:12 tools-worker-1020 kubelet[30011]: E0110 12:49:12.735990   30011 kubelet_getters.go:249] Could not read directory /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes: open /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes: no such file or directory
Jan 10 12:49:12 tools-worker-1020 kubelet[30011]: E0110 12:49:12.736075   30011 kubelet_volumes.go:159] Orphaned pod "bcb36fe1-7d3d-11e7-9b1a-fa163edef48a" found, but error open /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes: no such file or directory occured during reading volume dir from disk
Jan 10 12:49:14 tools-worker-1020 kubelet[30011]: E0110 12:49:14.730062   30011 kubelet_getters.go:249] Could not read directory /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes: open /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes: no such file or directory
Jan 10 12:49:14 tools-worker-1020 kubelet[30011]: E0110 12:49:14.730103   30011 kubelet_volumes.go:159] Orphaned pod "bcb36fe1-7d3d-11e7-9b1a-fa163edef48a" found, but error open /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes: no such file or directory occured during reading volume dir from disk
Jan 10 12:49:16 tools-worker-1020 kubelet[30011]: E0110 12:49:16.739702   30011 kubelet_getters.go:249] Could not read directory /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes: open /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes: no such file or directory
Jan 10 12:49:16 tools-worker-1020 kubelet[30011]: E0110 12:49:16.739743   30011 kubelet_volumes.go:159] Orphaned pod "bcb36fe1-7d3d-11e7-9b1a-fa163edef48a" found, but error open /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes: no such file or directory occured during reading volume dir from disk
Jan 10 12:49:18 tools-worker-1020 kubelet[30011]: E0110 12:49:18.731541   30011 kubelet_getters.go:249] Could not read directory /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes: open /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes: no such file or directory
Jan 10 12:49:18 tools-worker-1020 kubelet[30011]: E0110 12:49:18.731573   30011 kubelet_volumes.go:159] Orphaned pod "bcb36fe1-7d3d-11e7-9b1a-fa163edef48a" found, but error open /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes: no such file or directory occured during reading volume dir from disk
Jan 10 12:49:20 tools-worker-1020 kubelet[30011]: E0110 12:49:20.742577   30011 kubelet_getters.go:249] Could not read directory /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes: open /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes: no such file or directory
Jan 10 12:49:20 tools-worker-1020 kubelet[30011]: E0110 12:49:20.742617   30011 kubelet_volumes.go:159] Orphaned pod "bcb36fe1-7d3d-11e7-9b1a-fa163edef48a" found, but error open /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes: no such file or directory occured during reading volume dir from disk

I checked disk usage trends in Graphite, and it seems this is not a new issue (https://graphite-labs.wikimedia.org/graphlot/?width=586&height=308&_salt=1515589449.215&from=00%3A00_20170801&until=23%3A59_20180110&target=tools.tools-worker-1020.diskspace.root.byte_percentfree)

tools-worker-1020-df.png (404×603 px, 29 KB)

Other worker hosts don't show this behavior.

Probably an issue with kubelet on this host.
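
A quick sanity check that kubelet is indeed the source of the flood (sketch; the pod UID is taken from the log excerpt above):

# count how many syslog lines come from the orphaned-pod error
sudo grep -c 'Orphaned pod "bcb36fe1-7d3d-11e7-9b1a-fa163edef48a"' /var/log/syslog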

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2018-01-10T13:13:23Z] <arturo> detected low space in tools-worker-1020, big files in /var/log due to kubelet issue. Opened T184604

Mentioned in SAL (#wikimedia-cloud) [2018-01-10T13:18:21Z] <arturo> aborrero@tools-k8s-master-01:~$ sudo kubectl cordon tools-worker-1020.tools.eqiad.wmflabs for T184604

Mentioned in SAL (#wikimedia-cloud) [2018-01-10T13:45:13Z] <arturo> T184604 aborrero@tools-worker-1020:/var/log$ sudo mkdir /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes

Mentioned in SAL (#wikimedia-cloud) [2018-01-10T13:46:25Z] <arturo> T184604 aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1020.tools.eqiad.wmflabs

aborrero triaged this task as Low priority.

After cleaning those files by hand (logrotate won't handle them), I tried cordoning and draining tools-worker-1020 and restarting kubelet. The same log errors are continuously being produced again.
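
For the record, roughly what I ran (sketch from memory; the drain flags and the kubelet restart command are my assumption, only the cordon/uncordon commands are in the SAL entries above):

# on tools-k8s-master-01: mark the node unschedulable and evict its pods
sudo kubectl cordon tools-worker-1020.tools.eqiad.wmflabs
sudo kubectl drain tools-worker-1020.tools.eqiad.wmflabs --ignore-daemonsets --force
# on tools-worker-1020: restart kubelet
sudo service kubelet restart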

All lines were referencing the missing /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes directory, so I tried creating it by hand and, magically, the errors seem to have disappeared. So I uncordoned the instance again; let's see what happens.
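
In case other orphaned pod directories show up, a sketch for spotting them (the kubectl part is run from the k8s master; assumes this kubectl version supports jsonpath output):

# pod UIDs kubelet still has on disk (run on tools-worker-1020)
ls /var/lib/kubelet/pods/
# pod UIDs the API server knows about (run on tools-k8s-master-01);
# any UID present on disk but missing here is a candidate orphan
kubectl get pods --all-namespaces -o jsonpath='{.items[*].metadata.uid}'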

BTW, logrotate does not seem to be working as it should:

sudo logrotate -f /etc/logrotate.d/rsyslog 
error: error creating output file /var/log/syslog.1.gz: File exists
error: error creating output file /var/log/mail.info.1.gz: File exists
error: error creating output file /var/log/mail.warn.1.gz: File exists
error: error creating output file /var/log/mail.err.1.gz: File exists
error: error creating output file /var/log/mail.log.1.gz: File exists
error: error creating output file /var/log/daemon.log.1.gz: File exists
error: error creating output file /var/log/kern.log.1.gz: File exists
error: error creating output file /var/log/auth.log.1.gz: File exists
error: error creating output file /var/log/user.log.1.gz: File exists
error: error creating output file /var/log/messages.1.gz: File exists

Those files are all empty (I don't know why), but it seems they are preventing normal log rotation. Deleting them by hand right now:

aborrero@tools-worker-1020:/var/log$ for i in $(sudo logrotate -f /etc/logrotate.d/rsyslog 2>&1 | awk -F' ' '{print $6}' | tr -d ':') ; do sudo rm -i $i ; done

After deleting these files, logrotate seems to work properly. I checked by triggering some rotations by hand.
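
Verification was essentially (sketch):

# force a rotation and confirm fresh .1 / .1.gz files show up with today's date
sudo logrotate -f /etc/logrotate.d/rsyslog
ls -lah /var/log/syslog* /var/log/daemon.log*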

Mentioned in SAL (#wikimedia-cloud) [2018-01-10T13:57:44Z] <arturo> T184604 cleaned stalled log files that prevented logrotate from working. Triggered a couple of logrotate runs by hand in tools-worker-1020.tools.eqiad.wmflabs

