
Fix systemd and possibly logrotate around the wmf-pt-kill service for multi-instance wikireplicas
Closed, Resolved (Public)

Description

Pages went off on Sun Feb 7 00:05:02 UTC 2021 because logrotate still has a job installed by the package, which also installs the single wmf-pt-kill systemd service. When the scheduled logrotate run fired, its postrotate script for wmf-pt-kill.service failed, and that left logrotate.service in a failed state in systemd.

Jan 17 00:00:02 clouddb1016 logrotate[6515]: Job for wmf-pt-kill.service failed because the control process exited with error code.
Jan 17 00:00:02 clouddb1016 logrotate[6515]: See "systemctl status wmf-pt-kill.service" and "journalctl -xe" for details.
Jan 17 00:00:02 clouddb1016 logrotate[6515]: error: error running shared postrotate script for '/var/log/wmf-pt-kill/wmf-pt-kill.log '
Jan 17 00:00:02 clouddb1016 systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Jan 17 00:00:02 clouddb1016 systemd[1]: logrotate.service: Failed with result 'exit-code'.
Jan 17 00:00:02 clouddb1016 systemd[1]: Failed to start Rotate log files.

Proposed fix: get Puppet to clean up the package's logrotate job for wmf-pt-kill, add logrotate scripts for the multi-socket services, and perhaps mask the service that the package installs so it stops "failing" when things try to run it.
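As a rough illustration of the per-instance logrotate piece, here is a minimal Python sketch that renders one stanza per section. The helper name, the retention settings, and the use of `copytruncate` in place of a postrotate restart are assumptions for illustration (copytruncate is the directive mentioned further down this task). The log path pattern follows the `wmf-pt-kill-s7.log` file shown later. The real change is whatever the Gerrit patch contains, not this.

```python
def render_logrotate_stanza(section: str,
                            log_dir: str = "/var/log/wmf-pt-kill") -> str:
    """Render a logrotate stanza for one per-section wmf-pt-kill log.

    Hypothetical helper: the directive values below are illustrative,
    not copied from the merged Puppet change.
    """
    log_path = f"{log_dir}/wmf-pt-kill-{section}.log"
    lines = [
        f"{log_path} {{",
        "    daily",         # assumed rotation frequency
        "    rotate 7",      # assumed retention
        "    missingok",     # do not error if the log does not exist yet
        "    notifempty",    # skip rotation when the log is empty
        "    compress",
        "    copytruncate",  # copy then truncate, so no service restart
        "}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "s7" appears in this task; any other section name would be an assumption.
    print(render_logrotate_stanza("s7"))
```

Because the stanza has no postrotate script, nothing in it tries to start or restart the package's `wmf-pt-kill.service`, which is what produced the failures above.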

Event Timeline

Bstorm triaged this task as Medium priority. Feb 7 2021, 12:19 AM
Bstorm created this task.

Change 662797 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: adjust logrotate for multiinstance on wmf-pt-kill

https://gerrit.wikimedia.org/r/662797

Made some comments on the patchset regarding the current situation with two processes accessing the same file.

Change 662797 merged by Bstorm:
[operations/puppet@production] wikireplicas: adjust logrotate for multiinstance on wmf-pt-kill

https://gerrit.wikimedia.org/r/662797

I think this should be good now. We'll know if it continues to log things after logrotate runs. If it doesn't, then copytruncate wasn't sufficient and it does need a restart.
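To make the copytruncate reasoning concrete, here is a small Python simulation under one loud assumption: that the daemon keeps its log open in append mode. Whether wmf-pt-kill actually opens its log that way is not confirmed in this task. The file names are made up, and the copy-then-truncate steps only imitate what logrotate's `copytruncate` does.

```python
import os
import shutil
import tempfile


def simulate_copytruncate() -> tuple[str, str]:
    """Return (rotated_text, live_text) after a simulated copytruncate."""
    with tempfile.TemporaryDirectory() as tmp:
        live = os.path.join(tmp, "wmf-pt-kill-s7.log")
        rotated = live + ".1"

        # Stand-in for the daemon: a long-lived handle opened in append
        # mode, so every write lands at the current end of the file.
        writer = open(live, "a")
        try:
            writer.write("before rotation\n")
            writer.flush()

            # Stand-in for logrotate's copytruncate: copy the log aside,
            # then truncate the original in place (same inode, same name).
            shutil.copyfile(live, rotated)
            os.truncate(live, 0)

            # The daemon keeps writing through its old handle, no restart.
            writer.write("after rotation\n")
            writer.flush()
        finally:
            writer.close()

        with open(rotated) as fh:
            rotated_text = fh.read()
        with open(live) as fh:
            live_text = fh.read()
    return rotated_text, live_text


if __name__ == "__main__":
    old, new = simulate_copytruncate()
    print("rotated copy:", old.strip())
    print("live log:", new.strip())
```

If the writer did not append (or if rotation renamed the file instead of truncating it), new entries would not show up cleanly in the live log, which is the case where a restart would be needed.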

For the record, I have generated a query that got logged. Once the log is rotated, we can generate another one and check whether it gets logged to the new file.

root@clouddb1014:/var/log# cat wmf-pt-kill/wmf-pt-kill-s7.log
# 2021-02-15T08:22:29 KILL 4939549 (Query 309 sec) select sleep(600)
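The check described above could be scripted along these lines. This is only a sketch: the regular expression is inferred from the single log line shown here, and the field names (`thread_id`, `command`, and so on) and function names are my own labels, not anything defined by pt-kill.

```python
import re
from datetime import datetime
from typing import Iterable, NamedTuple, Optional


class KillEntry(NamedTuple):
    timestamp: datetime
    thread_id: int   # assumed meaning of the number after KILL
    command: str
    seconds: int
    query: str


# Pattern inferred from the one sample line in this task.
KILL_LINE = re.compile(
    r"^# (?P<ts>\S+) KILL (?P<id>\d+) "
    r"\((?P<command>\w+) (?P<secs>\d+) sec\) (?P<query>.*)$"
)


def parse_kill_line(line: str) -> Optional[KillEntry]:
    """Parse one log line, or return None if it does not match."""
    match = KILL_LINE.match(line.strip())
    if match is None:
        return None
    return KillEntry(
        timestamp=datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S"),
        thread_id=int(match["id"]),
        command=match["command"],
        seconds=int(match["secs"]),
        query=match["query"],
    )


def logged_after(lines: Iterable[str], rotated_at: datetime) -> bool:
    """True if any KILL entry is newer than the given rotation time."""
    entries = (parse_kill_line(line) for line in lines)
    return any(e is not None and e.timestamp > rotated_at for e in entries)


if __name__ == "__main__":
    sample = "# 2021-02-15T08:22:29 KILL 4939549 (Query 309 sec) select sleep(600)"
    print(parse_kill_line(sample))
```

Feeding the lines of the live log and the time of the last rotation to `logged_after` would answer whether the service kept logging to the new file.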
Bstorm claimed this task.

This is done, I believe.