
[tools-sgecron-01] The server is running out of space, daemon.log is growing rapidly
Closed, Resolved (Public)

Description

It seems that there's some process generating lots of logs due to exceptions (/var/log/daemon.log):

Nov 27 18:20:59 tools-sgecron-01 collector-runner[9414]: 2020-11-27 18:20:59,699 Exception trying to validate / load tool wikiviz
Nov 27 18:20:59 tools-sgecron-01 collector-runner[9414]: Traceback (most recent call last):
Nov 27 18:20:59 tools-sgecron-01 collector-runner[9414]:   File "/usr/lib/python3/dist-packages/tools/manifest/webservicemonitor.py", line 39, in from_name
Nov 27 18:20:59 tools-sgecron-01 collector-runner[9414]:     user_info = pwd.getpwnam(username)
Nov 27 18:20:59 tools-sgecron-01 collector-runner[9414]: KeyError: 'getpwnam(): name not found: tools.wikiviz'
Nov 27 18:20:59 tools-sgecron-01 collector-runner[9414]: During handling of the above exception, another exception occurred:
Nov 27 18:20:59 tools-sgecron-01 collector-runner[9414]: Traceback (most recent call last):
Nov 27 18:20:59 tools-sgecron-01 collector-runner[9414]:   File "/usr/lib/python3/dist-packages/tools/manifest/webservicemonitor.py", line 146, in collect
Nov 27 18:20:59 tools-sgecron-01 collector-runner[9414]:     tool = Tool.from_name(toolname)
Nov 27 18:20:59 tools-sgecron-01 collector-runner[9414]:   File "/usr/lib/python3/dist-packages/tools/manifest/webservicemonitor.py", line 42, in from_name
Nov 27 18:20:59 tools-sgecron-01 collector-runner[9414]:     raise Tool.InvalidToolException("No tool with name %s" % (name,))
Nov 27 18:20:59 tools-sgecron-01 collector-runner[9414]: tools.manifest.webservicemonitor.Tool.InvalidToolException: No tool with name wikiviz

It seems to be trying to look up many tools that it cannot find.
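
For readers puzzling over the two chained tracebacks, here is a rough reconstruction of the failing code path, inferred only from the log lines above. It is a sketch, not the actual webservicemonitor.py source: the "tools." prefix is assumed from the KeyError message, the exception class is flattened out of the Tool class, and the lookup parameter is a hypothetical addition so the behaviour can be exercised without LDAP.

```python
import pwd


class InvalidToolException(Exception):
    """Stand-in for Tool.InvalidToolException seen in the traceback."""


def from_name(name, prefix="tools.", lookup=pwd.getpwnam):
    # Assumption: the account name is the tool name with a "tools." prefix,
    # as suggested by "name not found: tools.wikiviz" in the log.
    username = prefix + name
    try:
        return lookup(username)
    except KeyError:
        # Raising a new exception inside the except block is what makes
        # Python print "During handling of the above exception, another
        # exception occurred" and log both tracebacks for every tool.
        raise InvalidToolException("No tool with name %s" % (name,))
```

If the name-service lookup fails for every tool, each collection pass would log one such double traceback per tool, which would be consistent with the log growth described here.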

This prevents users from saving any crontab on any tools-sge* host (crontabs are copied to sgecron, which is out of space).

Event Timeline

dcaro created this task. Nov 27 2020, 6:24 PM
Restricted Application added a subscriber: Aklapper. Nov 27 2020, 6:24 PM
dcaro added a comment. Nov 27 2020, 6:25 PM

Note that for example wikiviz does exist:

https://admin.toolforge.org/tool/wikiviz

dcaro added a comment. Nov 27 2020, 6:32 PM

This works:

root@tools-sgecron-01:/var/log# getent passwd tools.wikiviz
tools.wikiviz:*:<uid>:<gid>:tools.wikiviz:/data/project/wikiviz:/bin/bash

and this:

root@tools-sgecron-01:/var/log# python -c "import pwd; print pwd.getpwnam('tools.wikiviz')"
pwd.struct_passwd(pw_name='tools.wikiviz', pw_passwd='*', pw_uid=<uid>, pw_gid=<gid>, pw_gecos='tools.wikiviz', pw_dir='/data/project/wikiviz', pw_shell='/bin/bash')

Krenair added a comment. Edited Nov 29 2020, 5:58 PM

sudo service webservicemonitor restart has shut it up. Broken connection to LDAP/SSSD or something? I notice sssd has only been running since Tue 2020-11-24 18:06:07 UTC (4 days ago), and zgrep collector-runner /var/log/syslog.3.gz | grep Traceback -C3 | head -n 300 reveals these exceptions started only 34 seconds later. That file also shows puppet had just applied a config change and restarted sssd.

Maybe we're missing a subscribe/notify relationship in puppet to have it restart webservicemonitor as well. If that's awkward (do we still have some old sssd alternative lurking somewhere that's conditional in puppet?), then maybe we can make webservicemonitor detect this by checking for the existence of some always-existing LDAP user and, when that fails, crash so that systemd restarts it.
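
A minimal sketch of the "always-existing LDAP user" check suggested above, offered as an illustration rather than a patch. The canary account name is hypothetical, the lookup parameter exists only to make the function testable, and the sketch assumes the service unit is configured so that systemd restarts it after a non-zero exit (for example with Restart=on-failure), which this task does not confirm.

```python
import pwd
import sys

# Hypothetical canary account: any LDAP-backed user that is guaranteed
# to exist would do. This name is an assumption, not taken from the task.
CANARY_USER = "tools.admin"


def ldap_lookup_works(username, lookup=pwd.getpwnam):
    """Return True if the name-service lookup for username succeeds."""
    try:
        lookup(username)
    except KeyError:
        return False
    return True


def check_or_exit(username=CANARY_USER, lookup=pwd.getpwnam):
    """Exit with status 1 when the canary user cannot be resolved.

    The non-zero exit is meant to let a supervisor such as systemd
    restart the process instead of leaving it logging an exception
    for every tool.
    """
    if not ldap_lookup_works(username, lookup):
        sys.exit(1)
```

Calling check_or_exit() at the start of each collection pass would turn a stale name-service connection into a single restart rather than a traceback per tool.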

Bstorm added a subscriber: Bstorm. Nov 30 2020, 3:34 PM

We had an LDAP outage last week. LDAP was hard down. I can get the datetime later. It may have been broken ever since, as I did not check it.

Andrew closed this task as Resolved. Dec 8 2020, 5:18 PM
Andrew claimed this task.