
Logrotate should restart services when more people are around
Closed, Resolved · Public

Description

One of the reasons behind a really long outage: https://wikitech.wikimedia.org/wiki/Incident_documentation/20181129-ores

Logrotate restarted services at 6am, when not many people are around.

Event Timeline

Restricted Application added a subscriber: Aklapper.

I am afraid we can't really change it. It's been at 06:25am (UTC in our case) forever, and people expect that. Changing it would break their current expectations. Note that this is true for all services and software, and it hasn't really caused an issue for a long time. So we should do a better job of surfacing and fixing the issues, not change the logrotate schedule.

jijiki triaged this task as Medium priority. · Dec 3 2018, 1:43 PM

@Ladsgroup feel free to mark this as "Resolved" if you feel we don't have other options.

So we should do a better job of surfacing and fixing the issues, not change the logrotate schedule.

In general I agree with your note, but I was thinking of preventing similar downtime at several levels. I'm already working on binding the services and the config files, but this seemed like a good idea to me too. I'll leave it to the SREs to decide; feel free to close this task if you still think it should not happen.

akosiaris claimed this task.

I'll do so, thanks.