Document at https://docs.google.com/document/d/1aE40srfhssb4ra2q_wiJ6KFXDuUjkiGM_3S49EqNwxw
More details coming soon.
Placeholder for public summary: https://wikitech.wikimedia.org/wiki/Incident_documentation/20210116-appserver_latency
Status | Subtype | Assigned | Task
---|---|---|---
Declined | | None | T272215 High latency on appservers
Resolved | | Joe | T272262 The safe service restart script doesn't detect failure when running with poolcounter.
Change 656546 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::mediawiki::appserver: temporary disable php-fpm restarts
Change 656546 merged by Elukey:
[operations/puppet@production] role::mediawiki::appserver: temporary disable php-fpm restarts
Mentioned in SAL (#wikimedia-operations) [2021-01-16T12:10:13Z] <elukey> 'elukey@cumin1001:~$ sudo cumin 'A:mw-eqiad' 'run-puppet-agent' -b 10' T272215)
Change 656548 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::mediawiki::canary_appserver: disable php-fpm restart timer
Change 656548 merged by Elukey:
[operations/puppet@production] role::mediawiki::canary_appserver: disable php-fpm restart timer
Mentioned in SAL (#wikimedia-operations) [2021-01-16T12:18:53Z] <elukey> elukey@cumin1001:~$ sudo cumin 'A:mw-app-canary and A:mw-eqiad' 'run-puppet-agent' -b 10 - T272215
To keep archives happy: by mistake, in my first change, I disabled the timers on the API canaries and we decided to leave it as it was later on.
The timers have been reenabled, and the next scap deployment should properly run check_and_restart for php7-fpm, and restart those.
This was UBN on Saturday; based on Joe's comment, I am now lowering it to Medium.
More details are yet to be provided on the Incident report, I can help with that once the right people are happy with the status of things.
Mentioned in SAL (#wikimedia-operations) [2021-01-19T04:39:34Z] <Krinkle> unlocked per https://phabricator.wikimedia.org/T272215#6755025
Tentatively tagging with Incident-Docs as such. If I recall correctly, there are some follow-ups we wanted to file for this but I don't see them. Anyway, feel free to close if all done (some links to tasks would be nice for future reference).
I personally don't feel capable of writing proper docs, filing follow-ups, or closing it. When I said "more details are yet to be provided", it was a call for help 0:-). Hopefully someone in serviceops can provide such help?
There were follow-ups that were resolved the following Monday. Some had tasks; for others I didn't bother filing one because they were extremely quick fixes.
Then there is the long-standing problem with the pybal etcd driver, which has had a task open (and lingering for lack of resources) for a long time.
Closing this documentation task, as it is unlikely the documentation will be completed further.