Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	jijiki
	Jul 29 2019, 10:11 PM

Description

It appears that sometime after 2019-07-24 ~20:00 UTC, mobileapps started generating a lot of errors of worker died, restarting and worker stopped sending heartbeats, killing.

SAL around that time:

20:45 ppchelko@deploy1001: Started deploy [restbase/deploy@7911f65]: Store PCS endpoints T222384
20:39 bsitzmann@deploy1001: Finished deploy [mobileapps/deploy@2e2ce6c]: Update mobileapps to 1751a2e (duration: 04m 20s)
20:38 ppchelko@deploy1001: Finished deploy [changeprop/deploy@bf28187]: Rerender PCS endpoints T222384 (duration: 01m 34s)
20:36 ppchelko@deploy1001: Started deploy [changeprop/deploy@bf28187]: Rerender PCS endpoints T222384
20:35 bsitzmann@deploy1001: Started deploy [mobileapps/deploy@2e2ce6c]: Update mobileapps to 1751a2e

When this is resolved, we need to revert 541891

Details

Subject	Repo	Branch	Lines +/-
Prevent expensive mobile-html transforms from blocking the event loop	mediawiki/services/mobileapps	master	+4 K -2 K
Prevent expensive mobile-html transforms from blocking the event loop	mediawiki/services/mobileapps	master	+4 K -2 K
Prevent expensive mobile-html transforms from blocking the event loop	mediawiki/services/mobileapps	master	+4 K -2 K
Prevent expensive mobile-html transforms from blocking the event loop	mediawiki/services/mobileapps	master	+4 K -2 K
Revert "lvs::monitor_services: increase number of tries before MCS is critical"	operations/puppet	production	+1 -1
Speed up media endpoints, prevent them from blocking the event loop	mediawiki/services/mobileapps	master	+125 -51
Remove classifyElements from prepForTheme	mediawiki/services/mobileapps	master	+41 -75
lvs::monitor_services: increase number of tries before MCS is critical	operations/puppet	production	+1 -1
Performance tweaks: domUtil + addSectionEditButtons	mediawiki/services/mobileapps	master	+12 -16
Increase num_workers to (ncpu * 1.5)	mediawiki/services/mobileapps/deploy	master	+1 -1
Performance: Remove language variant info from page metadata response	mediawiki/services/mobileapps	master	+7 -101
Performance: Change onthisday spec x-ample to selected only	mediawiki/services/mobileapps	master	+2 -21

Related Objects
Search...

Status	Assigned	Task
Resolved	• Mholloway	T229286 Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration
Resolved	• Mholloway	T170455 Extract the feed endpoints from PCS into a new wikifeeds service
Declined	None	T222821 Sunset /page/definitions?
Resolved	Krenair	T225557 Puppet fails on newly created deployment-wikifeeds01
Resolved	MSantos	T229287 Profile wikifeeds memory usage for Helm chart
Resolved	• Mholloway	T233028 Define SLIs/SLOs for wikifeeds
Resolved	• mobrovac	T231361 Stop pregenerating and storing /page/definition responses
Invalid	None	T234215 Stop pregenerating /page/media
Resolved	• mobrovac	T235173 Stop pregenerating /page/metadata
Resolved	Mhurd	T235213 Optimize talk endpoint performance
Resolved	JoeWalsh	T235571 Consolidate wikimedia-page-library into the mobileapps repo

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

• Mholloway raised the priority of this task from Medium to High.Jul 31 2019, 6:49 PM

In T229286#5381330, @Mholloway wrote:

So one issue right off the bat is that we're generating /feed/onthisday/all in the service in addition to the individual pieces, which seems like a lot of redundant work. How much work would it be to compose /feed/onthisday/all in RESTBase instead, the way we've been doing with /feed/featured for some time now?

We already are doing so https://github.com/wikimedia/restbase/blob/master/v1/onthisday.js#L57-L76

Weird. Why are we still getting internal requests for /feed/onthisday/all, then?

In T229286#5381396, @Mholloway wrote:

Weird. Why are we still getting internal requests for /feed/onthisday/all, then?

It's an MCS x-ample test, see spec/feed/on-this-day.yaml I believe the x-ample should be changed to make it less costly.

Change 526752 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/mobileapps@master] Performance: Change onthisday spec x-amples to deaths only

https://gerrit.wikimedia.org/r/526752

gerritbot added a project: Patch-For-Review.Jul 31 2019, 7:12 PM

Change 526752 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Performance: Change onthisday spec x-ample to selected only

https://gerrit.wikimedia.org/r/526752

Change 526763 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/mobileapps@master] Performance: Remove language variant info from page metadata response

https://gerrit.wikimedia.org/r/526763

Change 526763 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Performance: Remove language variant info from page metadata response

https://gerrit.wikimedia.org/r/526763

In T229286#5381330, @Mholloway wrote:

I see that the new /page/talk endpoint also has pretty bad latency, which is not exactly a surprise. I'll take a look and see if I can find some tweaks to make there.

On second thought, if /page/talk isn't being pregenerated, and is only getting ~1 req/s, that's the wrong place to focus. I'll take a good, hard look at mobile-html in the morning.

I should point out that the picture actually got much worse on 7/30 — more than double the already-elevated error rate shown in the description. I think it was caused by the new language variant support that was deployed that day.

I think we're just asking this one poor service to do too much.

Patching up a few crashers (captured in T229521 and T229630) looks to have improved the situation considerably with respect to worker deaths, but the current level is still elevated compared to the baseline prior to 7/24.

• Mholloway mentioned this in T184753: Use stored page leads when creating page summaries to reduce MCS load.Aug 4 2019, 1:47 PM

• Mholloway claimed this task.Aug 5 2019, 1:48 PM

• Mholloway moved this task from To Do to Doing on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board.

@Mholloway We are getting a number of alerts where scb* servers return 504 to healthchecks, what do you suggest?

The plan is to relieve pressure on mobileapps by pushing forward on the MCS-PCS split (i.e., launching the standalone Wikifeeds service and moving feed endpoint traffic over to it). An initial chart is in progress in https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/526679/.

• Mholloway signed these changes with MFA.Aug 13 2019, 2:40 PM

• Mholloway removed a subscriber: • wikifeeds.

• Mholloway added a subtask: T170455: Extract the feed endpoints from PCS into a new wikifeeds service.Aug 13 2019, 3:21 PM

• Mholloway moved this task from Doing to Blocked on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board.Aug 13 2019, 7:12 PM

• Mholloway updated the task description. (Show Details)Aug 15 2019, 1:19 AM

@mobrovac @Pchelolo are we still pregenerating responses for the /page/definition endpoint? I don't know how to determine that from the RESTBase code, but it looks like it's still the case given the steady stream of /page/definition 501s in the mobileapps logs.

If we are still pregenerating, I think we should strongly consider stopping, given the very low request rate (~2.5 req/min, according to T222821#5168253), which we should easily be able to handle without pregeneration. That would likely help ease the burden here.

Stopping regeneration === stopping storage. However, looking into grafana external vs internal definitions latencies, the difference is not that high, so I think we could safely do so.

Could you please create a subtask for doing just that?
It would require

Removing storage from RESTBase/Cassandra
Modifying test/prod change-prop config to stop regenerating

• Mholloway moved this task from Blocked to Doing on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board.Aug 27 2019, 3:38 PM

In T229286#5442034, @Pchelolo wrote:

Could you please create a subtask for doing just that?

Done: T231361: Stop pregenerating and storing /page/definition responses

It looks like the rate of worker deaths approximately doubled for some unknown reason on 8/19 (though to nowhere near the previous high levels).

There was no mobileapps deployment, and the RESTBase deploy commit for that day looks unrelated.

• mobrovac closed subtask T231361: Stop pregenerating and storing /page/definition responses as Resolved.Sep 11 2019, 11:00 AM

Stopping /page/definition pregeneration appears to have helped, but we're still seeing a nontrivial number of worker deaths.

Screenshot from 2019-09-11 08-22-09 cropped.png (317×1 px, 22 KB)

From activity on T170455: Extract the feed endpoints from PCS into a new wikifeeds service it looks like that's nearing deployment; I expect that moving feed traffic to that service will also help.

jijiki moved this task from Incoming 🐫 to this.quarter 🍕 on the serviceops board.Sep 13 2019, 3:27 AM

akosiaris closed subtask T170455: Extract the feed endpoints from PCS into a new wikifeeds service as Resolved.Sep 18 2019, 2:04 PM

akosiaris reopened subtask T170455: Extract the feed endpoints from PCS into a new wikifeeds service as Open.

Change 538036 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/mobileapps/deploy@master] Increase num_workers to (ncpu * 1.5)

https://gerrit.wikimedia.org/r/538036

Change 538036 merged by Mholloway:
[mediawiki/services/mobileapps/deploy@master] Increase num_workers to (ncpu * 1.5)

https://gerrit.wikimedia.org/r/538036

Mentioned in SAL (#wikimedia-operations) [2019-09-19T14:51:07Z] <mholloway-shell@deploy1001> Started deploy [mobileapps/deploy@16a6af1]: Increase num_workers to (ncpu * 1.5) (T229286)

Mentioned in SAL (#wikimedia-operations) [2019-09-19T14:56:44Z] <mholloway-shell@deploy1001> Finished deploy [mobileapps/deploy@16a6af1]: Increase num_workers to (ncpu * 1.5) (T229286) (duration: 05m 39s)

• Mholloway mentioned this in rGMOD16a6af10c741: Increase num_workers to (ncpu * 1.5).Sep 19 2019, 3:33 PM

Increasing the number of workers by 50% had a different effect than expected. We expected response latencies to increase slightly and worker deaths to reduce; in fact, however, the result was a small improvement in p50 internal response latencies and no change in the rate of worker deaths.

In T229286#5506798, @Mholloway wrote:

Increasing the number of workers by 50% had a different effect than expected. We expected response latencies to increase slightly and worker deaths to reduce; in fact, however, the result was a small improvement in p50 internal response latencies

This is actually easily explained by the fact that since there are more workers now, work on any given request starts sooner, so even if it takes slightly longer, the gain of starting work sooner outweighs the delay that happens due to longer processing times.

and no change in the rate of worker deaths.

This is indeed weird and unexpected. Technically, though, the ratio of worker deaths in now lower given that there are more workers. However, that's not really an improvement if the same absolute amount of requests fails due to worker deaths.

Change 538728 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/mobileapps@master] Performance tweaks: domUtil + addSectionEditButtons

https://gerrit.wikimedia.org/r/538728

Change 538728 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Performance tweaks: domUtil + addSectionEditButtons

https://gerrit.wikimedia.org/r/538728

Mentioned in SAL (#wikimedia-operations) [2019-09-25T12:13:25Z] <mholloway-shell@deploy1001> Started deploy [mobileapps/deploy@241b284]: Performance tweaks: domUtil + addSectionEditButtons (T229286)

Mentioned in SAL (#wikimedia-operations) [2019-09-25T12:18:43Z] <mholloway-shell@deploy1001> Finished deploy [mobileapps/deploy@241b284]: Performance tweaks: domUtil + addSectionEditButtons (T229286) (duration: 05m 17s)

Mentioned in SAL (#wikimedia-operations) [2019-09-25T20:17:28Z] <mholloway-shell@deploy1001> Started deploy [mobileapps/deploy@dbf4e7e]: Speed up querySelectors in domUtil (T229286)

Mentioned in SAL (#wikimedia-operations) [2019-09-25T20:23:00Z] <mholloway-shell@deploy1001> Finished deploy [mobileapps/deploy@dbf4e7e]: Speed up querySelectors in domUtil (T229286) (duration: 05m 32s)

I've confirmed with @JoeWalsh that the apps aren't using the /page/media endpoint (the Android app has already transitioned to /page/media-list), so we can stop pregenerating it. I've created a task for this and added it as a subtask here.

• Mholloway moved this task from Doing to Blocked on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board.Sep 30 2019, 3:54 PM

• mobrovac closed subtask T234215: Stop pregenerating /page/media as Invalid.Sep 30 2019, 3:59 PM

Ah, I see now from https://github.com/wikimedia/mediawiki-services-change-propagation-deploy/commit/6df20caea35912d5003c50a6e2674b6d4707d692 that the PCS endpoints being pregenerated since 7/24 are:

media-list
metadata
mobile-html
references

• Mholloway moved this task from Blocked to Doing on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board.Oct 1 2019, 3:38 PM

Change 541891 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] lvs::monitor_services: increase number of tries before MCS is critical

https://gerrit.wikimedia.org/r/541891

jijiki updated the task description. (Show Details)Oct 9 2019, 6:43 PM

@Mholloway We are thinking of switching the recipients of MCS alerts of this sort to only mobileapps as we are getting more alerts that are non-actionable, is that ok ? The alternative would be to change the number of tries before alerting and/or the interval between tries.

• Mholloway moved this task from Doing to Blocked on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board.Oct 10 2019, 7:02 PM

I'm continuing to work with @JoeWalsh to tweak performance, and hoping that T235173: Stop pregenerating /page/metadata will also take some pressure off the event loop. (In the future it would be good if we start pregenerating endpoint responses in a staggered fashion rather than all at once, so that it's easier to identify offenders and roll back for improvement where needed.)

@jijiki I guess I don't have any objection to us being the only recipients of alerts.

JoeWalsh added a subtask: T235213: Optimize talk endpoint performance.Oct 11 2019, 12:27 PM

• mobrovac closed subtask T235173: Stop pregenerating /page/metadata as Resolved.Oct 14 2019, 9:01 AM

Change 541891 merged by Effie Mouzeli:
[operations/puppet@production] lvs::monitor_services: increase number of tries before MCS is critical

https://gerrit.wikimedia.org/r/541891

• Mholloway mentioned this in T235571: Consolidate wikimedia-page-library into the mobileapps repo.Oct 15 2019, 10:02 PM

LGoto added a subtask: T235571: Consolidate wikimedia-page-library into the mobileapps repo.Oct 16 2019, 3:37 PM

• Mholloway renamed this task from "worker died, restarting" mobileapps issue to Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration.Oct 16 2019, 3:56 PM

• Mholloway added a project: Epic.

• Mholloway removed a project: Patch-For-Review.

• Mholloway edited projects, added Product-Infrastructure-Team-Backlog-Deprecated; removed Product-Infrastructure-Team-Backlog-Deprecated (Kanban).

• Mholloway moved this task from Needs triage to Epics on the Product-Infrastructure-Team-Backlog-Deprecated board.

• Mholloway closed subtask T170455: Extract the feed endpoints from PCS into a new wikifeeds service as Resolved.Oct 16 2019, 3:58 PM

Change 544854 had a related patch set uploaded (by Joewalsh; owner: Joewalsh):
[mediawiki/services/mobileapps@master] Remove classifyElements from prepForTheme

https://gerrit.wikimedia.org/r/544854

gerritbot added a project: Patch-For-Review.Oct 21 2019, 10:09 AM

Change 544879 had a related patch set uploaded (by Joewalsh; owner: Joewalsh):
[mediawiki/services/mobileapps@master] Speed up media endpoints, prevent them from blocking the event loop

https://gerrit.wikimedia.org/r/544879

Change 544936 had a related patch set uploaded (by Joewalsh; owner: Joewalsh):
[mediawiki/services/mobileapps@master] WIP: Prevent expensive mobile-html transforms from blocking the event loop

https://gerrit.wikimedia.org/r/544936

JoeWalsh mentioned this in T236089: [EPIC] Page content service tech debt.Oct 21 2019, 4:40 PM

Change 544854 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Remove classifyElements from prepForTheme

https://gerrit.wikimedia.org/r/544854

Change 544879 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Speed up media endpoints, prevent them from blocking the event loop

https://gerrit.wikimedia.org/r/544879

JoeWalsh mentioned this in T236137: mobile-html: Exclude elements from theming using a `notheme` class instead of querying every page for specific elements.Oct 22 2019, 11:09 AM

This issue is now resolved after removing a very expensive part of mobile-html and re-writing the media* endpoints to be more performant and preventing them from blocking the event loop.

Is there a best practice for expensive tasks in our node services? Are total processing time and event loop blocking time reasonable metrics to test against?

Change 545873 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/puppet@production] Revert "lvs::monitor_services: increase number of tries before MCS is critical"

https://gerrit.wikimedia.org/r/545873

Change 545873 merged by Effie Mouzeli:
[operations/puppet@production] Revert "lvs::monitor_services: increase number of tries before MCS is critical"

https://gerrit.wikimedia.org/r/545873

@Mholloway Can we mark this as resolved?

• Mholloway closed this task as Resolved.Oct 30 2019, 12:13 PM

Change 551803 had a related patch set uploaded (by Joewalsh; owner: Joewalsh):
[mediawiki/services/mobileapps@master] Prevent expensive mobile-html transforms from blocking the event loop

https://gerrit.wikimedia.org/r/551803

Change 551803 abandoned by Joewalsh:
Prevent expensive mobile-html transforms from blocking the event loop

https://gerrit.wikimedia.org/r/551803

Change 544936 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Prevent expensive mobile-html transforms from blocking the event loop

https://gerrit.wikimedia.org/r/544936

Change 551875 had a related patch set uploaded (by Mholloway; owner: Joewalsh):
[mediawiki/services/mobileapps@master] Prevent expensive mobile-html transforms from blocking the event loop

https://gerrit.wikimedia.org/r/551875

Change 551876 had a related patch set uploaded (by Joewalsh; owner: Joewalsh):
[mediawiki/services/mobileapps@master] Prevent expensive mobile-html transforms from blocking the event loop

https://gerrit.wikimedia.org/r/551876

Change 551875 abandoned by Mholloway:
Prevent expensive mobile-html transforms from blocking the event loop

https://gerrit.wikimedia.org/r/551875

Change 551876 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Prevent expensive mobile-html transforms from blocking the event loop

https://gerrit.wikimedia.org/r/551876

Mentioned in SAL (#wikimedia-operations) [2019-11-19T19:01:37Z] <mholloway-shell@deploy1001> Started deploy [mobileapps/deploy@6e6bd42]: Prevent expensive content transforms from blocking the event loop (T229286)

Mentioned in SAL (#wikimedia-operations) [2019-11-19T19:08:27Z] <mholloway-shell@deploy1001> Finished deploy [mobileapps/deploy@6e6bd42]: Prevent expensive content transforms from blocking the event loop (T229286) (duration: 06m 49s)

• Mholloway closed subtask T235571: Consolidate wikimedia-page-library into the mobileapps repo as Resolved.Nov 21 2019, 3:34 PM

• Mholloway mentioned this in T239344: Mobileapps flapping since 2019-11-26 0:00 UTC.Nov 27 2019, 2:48 PM

• JMinor closed subtask T235213: Optimize talk endpoint performance as Resolved.Dec 2 2019, 7:01 PM

	F30299163: Screenshot from 2019-09-11 08-22-09 cropped.png
	Sep 11 2019, 12:39 PM

Resolve service instability due to excessive event loop blockage since starting PCS response pregenerationClosed, ResolvedPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration
Closed, ResolvedPublic
Actions

Related Objects
Search...