High latency on appservers
Closed, Declined · Public

Assigned To

None

Authored By

	jcrespo
	Jan 16 2021, 11:48 AM

Description

https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1610774926682&to=1610800126682&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200

Document at https://docs.google.com/document/d/1aE40srfhssb4ra2q_wiJ6KFXDuUjkiGM_3S49EqNwxw

More details coming soon.

Placeholder for public summary: https://wikitech.wikimedia.org/wiki/Incident_documentation/20210116-appserver_latency

Details

	Subject	Repo	Branch	Lines +/-
	role::mediawiki::appserver: temporary disable php-fpm restarts	operations/puppet	production	+5 -2
	role::mediawiki::canary_appserver: disable php-fpm restart timer	operations/puppet	production	+2 -1


Related Objects

		Status	Subtype	Assigned	Task
		Declined		None	T272215 High latency on appservers
		Resolved		Joe	T272262 The safe service restart script doesn't detect failure when running with poolcounter.

Event Timeline

jcrespo created this task. Jan 16 2021, 11:48 AM

Restricted Application added a subscriber: Aklapper. Jan 16 2021, 11:48 AM

LSobanski subscribed. Jan 16 2021, 11:49 AM

Change 656546 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::mediawiki::appserver: temporary disable php-fpm restarts

https://gerrit.wikimedia.org/r/656546

gerritbot added a project: Patch-For-Review. Jan 16 2021, 11:50 AM

Change 656546 merged by Elukey:
[operations/puppet@production] role::mediawiki::appserver: temporary disable php-fpm restarts

https://gerrit.wikimedia.org/r/656546

Mentioned in SAL (#wikimedia-operations) [2021-01-16T12:10:13Z] <elukey> elukey@cumin1001:~$ sudo cumin 'A:mw-eqiad' 'run-puppet-agent' -b 10 - T272215

Maintenance_bot removed a project: Patch-For-Review. Jan 16 2021, 12:10 PM

Change 656548 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::mediawiki::canary_appserver: disable php-fpm restart timer

https://gerrit.wikimedia.org/r/656548

Change 656548 merged by Elukey:
[operations/puppet@production] role::mediawiki::canary_appserver: disable php-fpm restart timer

https://gerrit.wikimedia.org/r/656548

Mentioned in SAL (#wikimedia-operations) [2021-01-16T12:18:53Z] <elukey> elukey@cumin1001:~$ sudo cumin 'A:mw-app-canary and A:mw-eqiad' 'run-puppet-agent' -b 10 - T272215

RhinosF1 subscribed. Jan 16 2021, 12:23 PM

To keep the archives happy: in my first change I disabled the timers on the API canaries by mistake, and we later decided to leave them disabled.

taavi subscribed. Jan 16 2021, 12:30 PM

Maintenance_bot removed a project: Patch-For-Review. Jan 16 2021, 1:10 PM

Ladsgroup subscribed. Jan 16 2021, 3:06 PM

dancy subscribed. Jan 16 2021, 5:48 PM

CDanis subscribed. Jan 16 2021, 5:59 PM

Joe added a subtask: T272262: The safe service restart script doesn't detect failure when running with poolcounter. Jan 18 2021, 10:37 AM

Joe closed subtask T272262: The safe service restart script doesn't detect failure when running with poolcounter. as Resolved.

The timers have been reenabled, and the next scap deployment should properly run check_and_restart for php7-fpm, and restart those.

This was UBN on Saturday; based on Joe's comment, I am now lowering it to Medium.

More details are yet to be provided on the Incident report, I can help with that once the right people are happy with the status of things.

jcrespo moved this task from Backlog to Acknowledged on the SRE board. Jan 18 2021, 5:27 PM

Mentioned in SAL (#wikimedia-operations) [2021-01-19T04:39:34Z] <Krinkle> unlocked per https://phabricator.wikimedia.org/T272215#6755025

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident. Feb 17 2021, 4:50 AM

In T272215#6755992, @jcrespo wrote:

More details are yet to be provided on the Incident report, I can help with that once the right people are happy with the status of things.

Tentatively tagging with Incident-Docs as such. If I recall correctly, there are some follow-ups we wanted to file for this but I don't see them. Anyway, feel free to close if all done (some links to tasks would be nice for future reference).

Krinkle updated the task description. Feb 17 2021, 5:14 AM

I personally don't feel capable of writing proper docs, filing follow-ups, or closing this task. When I said "more details are yet to be provided", it was a call for help 0:-). Hopefully someone in serviceops can provide such help?

In T272215#6836259, @Krinkle wrote:

In T272215#6755992, @jcrespo wrote:

More details are yet to be provided on the Incident report, I can help with that once the right people are happy with the status of things.

Tentatively tagging with Incident-Docs as such. If I recall correctly, there are some follow-ups we wanted to file for this but I don't see them. Anyway, feel free to close if all done (some links to tasks would be nice for future reference).

There were follow-ups that were resolved the next Monday. Some had tasks; for others I didn't bother filing one because they were extremely quick fixes.

Then there is the long-standing problem with the pybal etcd driver, which has had a task open (and lingering for lack of resources) for a long time.

RhinosF1 mentioned this in T277433: InstantCommons can render a wiki completely unavailable during an outage. Mar 15 2021, 7:02 AM

Closing this documentation task, as it is unlikely the documentation will be completed further.
