High latency on appservers
Closed, Declined (Public)

Event Timeline

Change 656546 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::mediawiki::appserver: temporary disable php-fpm restarts

https://gerrit.wikimedia.org/r/656546

Change 656546 merged by Elukey:
[operations/puppet@production] role::mediawiki::appserver: temporary disable php-fpm restarts

https://gerrit.wikimedia.org/r/656546

Mentioned in SAL (#wikimedia-operations) [2021-01-16T12:10:13Z] <elukey> 'elukey@cumin1001:~$ sudo cumin 'A:mw-eqiad' 'run-puppet-agent' -b 10' T272215)

Change 656548 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::mediawiki::canary_appserver: disable php-fpm restart timer

https://gerrit.wikimedia.org/r/656548

Change 656548 merged by Elukey:
[operations/puppet@production] role::mediawiki::canary_appserver: disable php-fpm restart timer

https://gerrit.wikimedia.org/r/656548

Mentioned in SAL (#wikimedia-operations) [2021-01-16T12:18:53Z] <elukey> elukey@cumin1001:~$ sudo cumin 'A:mw-app-canary and A:mw-eqiad' 'run-puppet-agent' -b 10 - T272215

For the archives: in my first change I disabled the timers on the API canaries by mistake, and we later decided to leave it that way.

The timers have been re-enabled, and the next scap deployment should properly run check_and_restart for php7-fpm and restart the services.

jcrespo triaged this task as Medium priority. (Edited Jan 18 2021, 5:25 PM)

This was UBN on Saturday; based on Joe's comment, I am now setting it to Medium.

More details are yet to be provided on the Incident report. I can help with that once the right people are happy with the status of things.

Mentioned in SAL (#wikimedia-operations) [2021-01-19T04:39:34Z] <Krinkle> unlocked per https://phabricator.wikimedia.org/T272215#6755025

Krinkle subscribed.

Tentatively tagging with Incident-Docs as such. If I recall correctly, there are some follow-ups we wanted to file for this but I don't see them. Anyway, feel free to close if all done (some links to tasks would be nice for future reference).

I personally don't feel capable of writing proper docs, filing follow-ups, or closing it. When I said "more details are yet to be provided", it was a call for help 0:-). Hopefully someone in serviceops can provide such help?

There were follow-ups that were resolved the next Monday. Some had tasks; for others I didn't bother filing any because they were extremely quick fixes.

Then there is the long-standing problem with the pybal etcd driver, which has had a task open (and lingering for lack of resources) for a long time.

lmata subscribed.

Closing this documentation task, as it is unlikely the documentation will be completed further.