
Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration
Closed, ResolvedPublic

Description

It appears that sometime after 2019-07-24 ~20:00 UTC, mobileapps started generating a lot of "worker died, restarting" and "worker stopped sending heartbeats, killing" errors.
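(As I understand it, both messages point at a blocked event loop: service-runner workers send periodic heartbeats to the master process, and a worker stuck in long synchronous work can't send them, so the master eventually kills it. A minimal sketch of the effect, with made-up numbers rather than the actual service-runner heartbeat code:)

```
// Simulate a worker heartbeat: a timer that should fire every 100 ms.
let last = Date.now();
const heartbeat = setInterval(() => {
  const now = Date.now();
  console.log(`heartbeat after ${now - last} ms`); // ~100 ms while the loop is free
  last = now;
}, 100);

// Simulate an expensive synchronous transform (e.g. heavy DOM manipulation).
// While this runs, no timers, I/O callbacks, or heartbeats can execute.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) { /* busy-wait */ }
}

setTimeout(() => {
  blockFor(2000);                                  // heartbeats go silent for ~2 s
  setTimeout(() => clearInterval(heartbeat), 300);
}, 300);
```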

SAL around that time:

20:45 ppchelko@deploy1001: Started deploy [restbase/deploy@7911f65]: Store PCS endpoints T222384
20:39 bsitzmann@deploy1001: Finished deploy [mobileapps/deploy@2e2ce6c]: Update mobileapps to 1751a2e (duration: 04m 20s)
20:38 ppchelko@deploy1001: Finished deploy [changeprop/deploy@bf28187]: Rerender PCS endpoints T222384 (duration: 01m 34s)
20:36 ppchelko@deploy1001: Started deploy [changeprop/deploy@bf28187]: Rerender PCS endpoints T222384
20:35 bsitzmann@deploy1001: Started deploy [mobileapps/deploy@2e2ce6c]: Update mobileapps to 1751a2e

mcs.png (74 KB)

When this is resolved, we need to revert change 541891.

Related Objects

Event Timeline

Mholloway raised the priority of this task from Medium to High. Jul 31 2019, 6:49 PM

So one issue right off the bat is that we're generating /feed/onthisday/all in the service in addition to the individual pieces, which seems like a lot of redundant work. How much work would it be to compose /feed/onthisday/all in RESTBase instead, the way we've been doing with /feed/featured for some time now?

We are already doing so: https://github.com/wikimedia/restbase/blob/master/v1/onthisday.js#L57-L76
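(For reference, the composition there amounts to fanning out to the individual onthisday types in parallel and merging the results. A rough sketch of the idea, not the actual RESTBase code; fetchJson is a hypothetical stand-in for RESTBase's internal request machinery:)

```
// Sketch: compose /feed/onthisday/all from the per-type endpoints
// instead of asking MCS to assemble it.
const TYPES = ['selected', 'births', 'deaths', 'events', 'holidays'];

async function onThisDayAll(fetchJson, domain, mm, dd) {
  const results = await Promise.all(
    TYPES.map((type) =>
      fetchJson(`https://${domain}/api/rest_v1/feed/onthisday/${type}/${mm}/${dd}`)
    )
  );
  // Each per-type response is keyed by its type; merge them into one object.
  return TYPES.reduce((all, type, i) => ({ ...all, [type]: results[i][type] }), {});
}
```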

Weird. Why are we still getting internal requests for /feed/onthisday/all, then?

> Weird. Why are we still getting internal requests for /feed/onthisday/all, then?

It's an MCS x-ample test; see spec/feed/on-this-day.yaml. I believe the x-ample should be changed to make it less costly.

Change 526752 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/mobileapps@master] Performance: Change onthisday spec x-amples to deaths only

https://gerrit.wikimedia.org/r/526752

Change 526752 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Performance: Change onthisday spec x-ample to selected only

https://gerrit.wikimedia.org/r/526752

Change 526763 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/mobileapps@master] Performance: Remove language variant info from page metadata response

https://gerrit.wikimedia.org/r/526763

Change 526763 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Performance: Remove language variant info from page metadata response

https://gerrit.wikimedia.org/r/526763

I see that the new /page/talk endpoint also has pretty bad latency, which is not exactly a surprise. I'll take a look and see if I can find some tweaks to make there.

On second thought, if /page/talk isn't being pregenerated, and is only getting ~1 req/s, that's the wrong place to focus. I'll take a good, hard look at mobile-html in the morning.

I should point out that the picture actually got much worse on 7/30 — more than double the already-elevated error rate shown in the description. I think it was caused by the new language variant support that was deployed that day.

I think we're just asking this one poor service to do too much.

Patching up a few crashers (captured in T229521 and T229630) looks to have improved the situation considerably with respect to worker deaths, but the current level is still elevated compared to the baseline prior to 7/24.

@Mholloway We are getting a number of alerts where scb* servers return 504 to healthchecks. What do you suggest?

The plan is to relieve pressure on mobileapps by pushing forward on the MCS-PCS split (i.e., launching the standalone Wikifeeds service and moving feed endpoint traffic over to it). An initial chart is in progress in https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/526679/.

Mholloway removed a subscriber: wikifeeds.

@mobrovac @Pchelolo are we still pregenerating responses for the /page/definition endpoint? I don't know how to determine that from the RESTBase code, but it looks like it's still the case given the steady stream of /page/definition 501s in the mobileapps logs.

If we are still pregenerating, I think we should strongly consider stopping, given the very low request rate (~2.5 req/min, according to T222821#5168253), which we should easily be able to handle without pregeneration. That would likely help ease the burden here.

Stopping regeneration === stopping storage. However, looking at the Grafana external vs. internal definition latencies, the difference is not that high, so I think we can safely do so.

Could you please create a subtask for doing just that?
It would require

  1. Removing storage from RESTBase/Cassandra
  2. Modifying test/prod change-prop config to stop regenerating

It looks like the rate of worker deaths approximately doubled for some unknown reason on 8/19 (though to nowhere near the previous high levels).

There was no mobileapps deployment, and the RESTBase deploy commit for that day looks unrelated.

Stopping /page/definition pregeneration appears to have helped, but we're still seeing a nontrivial number of worker deaths.

Screenshot from 2019-09-11 08-22-09 cropped.png (22 KB)

From activity on T170455: Extract the feed endpoints from PCS into a new wikifeeds service, it looks like that's nearing deployment; I expect that moving feed traffic to that service will also help.

Change 538036 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/mobileapps/deploy@master] Increase num_workers to (ncpu * 1.5)

https://gerrit.wikimedia.org/r/538036

Change 538036 merged by Mholloway:
[mediawiki/services/mobileapps/deploy@master] Increase num_workers to (ncpu * 1.5)

https://gerrit.wikimedia.org/r/538036

Mentioned in SAL (#wikimedia-operations) [2019-09-19T14:51:07Z] <mholloway-shell@deploy1001> Started deploy [mobileapps/deploy@16a6af1]: Increase num_workers to (ncpu * 1.5) (T229286)

Mentioned in SAL (#wikimedia-operations) [2019-09-19T14:56:44Z] <mholloway-shell@deploy1001> Finished deploy [mobileapps/deploy@16a6af1]: Increase num_workers to (ncpu * 1.5) (T229286) (duration: 05m 39s)

Increasing the number of workers by 50% had a different effect than expected. We expected response latencies to increase slightly and worker deaths to reduce; in fact, however, the result was a small improvement in p50 internal response latencies and no change in the rate of worker deaths.

> Increasing the number of workers by 50% had a different effect than expected. We expected response latencies to increase slightly and worker deaths to reduce; in fact, however, the result was a small improvement in p50 internal response latencies

This is actually easily explained: since there are more workers now, work on any given request starts sooner, so even if processing each request takes slightly longer, the gain from starting sooner outweighs the added processing time. For example (hypothetical numbers), a request that used to wait 40 ms for a free worker and take 60 ms to process still comes out ahead if it now waits 10 ms and takes 70 ms to process.

> and no change in the rate of worker deaths.

This is indeed weird and unexpected. Technically, though, the ratio of worker deaths is now lower, given that there are more workers. However, that's not really an improvement if the same absolute number of requests fails due to worker deaths.

Change 538728 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/mobileapps@master] Performance tweaks: domUtil + addSectionEditButtons

https://gerrit.wikimedia.org/r/538728

Change 538728 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Performance tweaks: domUtil + addSectionEditButtons

https://gerrit.wikimedia.org/r/538728

Mentioned in SAL (#wikimedia-operations) [2019-09-25T12:13:25Z] <mholloway-shell@deploy1001> Started deploy [mobileapps/deploy@241b284]: Performance tweaks: domUtil + addSectionEditButtons (T229286)

Mentioned in SAL (#wikimedia-operations) [2019-09-25T12:18:43Z] <mholloway-shell@deploy1001> Finished deploy [mobileapps/deploy@241b284]: Performance tweaks: domUtil + addSectionEditButtons (T229286) (duration: 05m 17s)

Mentioned in SAL (#wikimedia-operations) [2019-09-25T20:17:28Z] <mholloway-shell@deploy1001> Started deploy [mobileapps/deploy@dbf4e7e]: Speed up querySelectors in domUtil (T229286)

Mentioned in SAL (#wikimedia-operations) [2019-09-25T20:23:00Z] <mholloway-shell@deploy1001> Finished deploy [mobileapps/deploy@dbf4e7e]: Speed up querySelectors in domUtil (T229286) (duration: 05m 32s)

Mholloway added a subscriber: JoeWalsh.

I've confirmed with @JoeWalsh that the apps aren't using the /page/media endpoint (the Android app has already transitioned to /page/media-list), so we can stop pregenerating it. I've created a task for this and added it as a subtask here.

Ah, I see now from https://github.com/wikimedia/mediawiki-services-change-propagation-deploy/commit/6df20caea35912d5003c50a6e2674b6d4707d692 that the PCS endpoints being pregenerated since 7/24 are:

  • media-list
  • metadata
  • mobile-html
  • references

Change 541891 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] lvs::monitor_services: increase number of tries before MCS is critical

https://gerrit.wikimedia.org/r/541891

@Mholloway We are thinking of switching the recipients of MCS alerts of this sort to only the mobileapps team, as we are getting more alerts that are non-actionable. Is that OK? The alternative would be to change the number of tries before alerting and/or the interval between tries.

I'm continuing to work with @JoeWalsh to tweak performance, and hoping that T235173: Stop pregenerating /page/metadata will also take some pressure off the event loop. (In the future it would be good to start pregenerating endpoint responses in a staggered fashion rather than all at once, so that it's easier to identify offenders and roll back where improvement is needed.)

@jijiki I guess I don't have any objection to us being the only recipients of alerts.

Change 541891 merged by Effie Mouzeli:
[operations/puppet@production] lvs::monitor_services: increase number of tries before MCS is critical

https://gerrit.wikimedia.org/r/541891

Mholloway renamed this task from "worker died, restarting" mobileapps issue to Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration. Oct 16 2019, 3:56 PM
Mholloway added a project: Epic.
Mholloway removed a project: Patch-For-Review.

Change 544854 had a related patch set uploaded (by Joewalsh; owner: Joewalsh):
[mediawiki/services/mobileapps@master] Remove classifyElements from prepForTheme

https://gerrit.wikimedia.org/r/544854

Change 544879 had a related patch set uploaded (by Joewalsh; owner: Joewalsh):
[mediawiki/services/mobileapps@master] Speed up media endpoints, prevent them from blocking the event loop

https://gerrit.wikimedia.org/r/544879

Change 544936 had a related patch set uploaded (by Joewalsh; owner: Joewalsh):
[mediawiki/services/mobileapps@master] WIP: Prevent expensive mobile-html transforms from blocking the event loop

https://gerrit.wikimedia.org/r/544936

Change 544854 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Remove classifyElements from prepForTheme

https://gerrit.wikimedia.org/r/544854

Change 544879 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Speed up media endpoints, prevent them from blocking the event loop

https://gerrit.wikimedia.org/r/544879

This issue is now resolved, after removing a very expensive part of the mobile-html transforms and rewriting the media* endpoints to be more performant and to stop them from blocking the event loop.
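(For the record, the general pattern for keeping these transforms from monopolizing the event loop is to process the content in chunks and yield between chunks, e.g. via setImmediate. A rough sketch of the idea, not the actual patch; the helper and element names are illustrative:)

```
// Process a large set of DOM nodes in slices, yielding to the event loop
// between slices so timers, heartbeats, and other requests get a turn.
const yieldToEventLoop = () => new Promise((resolve) => setImmediate(resolve));

async function transformInChunks(nodeList, transformOne, chunkSize = 250) {
  const nodes = Array.from(nodeList);     // works for arrays and DOM NodeLists
  for (let i = 0; i < nodes.length; i += chunkSize) {
    for (const node of nodes.slice(i, i + chunkSize)) {
      transformOne(node);                 // keep the synchronous per-node work small
    }
    await yieldToEventLoop();             // let other callbacks run between chunks
  }
}

// Hypothetical usage inside a mobile-html style transform:
// await transformInChunks(doc.querySelectorAll('section'), addSectionEditButton);
```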

Is there a best practice for expensive tasks in our Node services? Are total processing time and event loop blocking time reasonable metrics to test against?
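(One option for the second metric: Node's perf_hooks exposes an event loop delay histogram via monitorEventLoopDelay (Node 11.10+), which could back a test assertion or a dashboard gauge. A sketch of what such a check might look like; the threshold and helper names are made up:)

```
const { monitorEventLoopDelay } = require('perf_hooks');

// Sample event loop delay while exercising an endpoint, then report both metrics.
async function measureEventLoopDelay(run) {
  const histogram = monitorEventLoopDelay({ resolution: 10 }); // sample every 10 ms
  histogram.enable();
  const start = process.hrtime.bigint();
  await run();                                                 // e.g. render a large page
  const totalMs = Number(process.hrtime.bigint() - start) / 1e6;
  histogram.disable();
  return {
    totalMs,                                    // total processing time
    p99BlockMs: histogram.percentile(99) / 1e6  // 99th percentile event loop delay
  };
}

// Hypothetical test usage:
// const { totalMs, p99BlockMs } = await measureEventLoopDelay(() => renderMobileHtml(bigPage));
// assert(p99BlockMs < 50, 'transform should not block the event loop for long stretches');
```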

Change 545873 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/puppet@production] Revert "lvs::monitor_services: increase number of tries before MCS is critical"

https://gerrit.wikimedia.org/r/545873

Change 545873 merged by Effie Mouzeli:
[operations/puppet@production] Revert "lvs::monitor_services: increase number of tries before MCS is critical"

https://gerrit.wikimedia.org/r/545873

Change 551803 had a related patch set uploaded (by Joewalsh; owner: Joewalsh):
[mediawiki/services/mobileapps@master] Prevent expensive mobile-html transforms from blocking the event loop

https://gerrit.wikimedia.org/r/551803

Change 551803 abandoned by Joewalsh:
Prevent expensive mobile-html transforms from blocking the event loop

https://gerrit.wikimedia.org/r/551803

Change 544936 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Prevent expensive mobile-html transforms from blocking the event loop

https://gerrit.wikimedia.org/r/544936

Change 551875 had a related patch set uploaded (by Mholloway; owner: Joewalsh):
[mediawiki/services/mobileapps@master] Prevent expensive mobile-html transforms from blocking the event loop

https://gerrit.wikimedia.org/r/551875

Change 551876 had a related patch set uploaded (by Joewalsh; owner: Joewalsh):
[mediawiki/services/mobileapps@master] Prevent expensive mobile-html transforms from blocking the event loop

https://gerrit.wikimedia.org/r/551876

Change 551875 abandoned by Mholloway:
Prevent expensive mobile-html transforms from blocking the event loop

https://gerrit.wikimedia.org/r/551875

Change 551876 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Prevent expensive mobile-html transforms from blocking the event loop

https://gerrit.wikimedia.org/r/551876

Mentioned in SAL (#wikimedia-operations) [2019-11-19T19:01:37Z] <mholloway-shell@deploy1001> Started deploy [mobileapps/deploy@6e6bd42]: Prevent expensive content transforms from blocking the event loop (T229286)

Mentioned in SAL (#wikimedia-operations) [2019-11-19T19:08:27Z] <mholloway-shell@deploy1001> Finished deploy [mobileapps/deploy@6e6bd42]: Prevent expensive content transforms from blocking the event loop (T229286) (duration: 06m 49s)