
Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration
Open, High, Public

Description

It appears that sometime after 2019-07-24 ~20:00 UTC, mobileapps started generating a lot of "worker died, restarting" and "worker stopped sending heartbeats, killing" errors.

SAL around that time:

20:45 ppchelko@deploy1001: Started deploy [restbase/deploy@7911f65]: Store PCS endpoints T222384
20:39 bsitzmann@deploy1001: Finished deploy [mobileapps/deploy@2e2ce6c]: Update mobileapps to 1751a2e (duration: 04m 20s)
20:38 ppchelko@deploy1001: Finished deploy [changeprop/deploy@bf28187]: Rerender PCS endpoints T222384 (duration: 01m 34s)
20:36 ppchelko@deploy1001: Started deploy [changeprop/deploy@bf28187]: Rerender PCS endpoints T222384
20:35 bsitzmann@deploy1001: Started deploy [mobileapps/deploy@2e2ce6c]: Update mobileapps to 1751a2e

When this is resolved, we need to revert change 541891.

Event Timeline

jijiki created this task. Jul 29 2019, 10:11 PM
Restricted Application added a subscriber: Aklapper. Jul 29 2019, 10:11 PM
jijiki triaged this task as Normal priority. Jul 29 2019, 10:12 PM

These worker deaths are being caused by the additional load generated by pregenerating responses for the PCS endpoints for storage in RESTBase.

We are pushing forward on splitting out the /feed endpoints into a new Wikifeeds service and serving them from there, which should help ease the load. Next step is creating a Helm chart, which is in progress (T229287).

Aside from that, I would have expected that the pregeneration jobs would be finishing up by now and the worker deaths subsiding, but instead it looks like they're getting more frequent over the past couple of days. (Any idea when those will be finished, @Pchelolo?)

> Aside from that, I would have expected that the pregeneration jobs would be finishing up by now and the worker deaths subsiding, but instead it looks like they're getting more frequent over the past couple of days. (Any idea when those will be finished, @Pchelolo?)

You have a misconception about what pregeneration means. We are not pregenerating the content once and then leaving it be. Pregeneration is done on every edit (or template edit) to refresh the stored content, so the rate will stay where it is now; it's never going to decrease.
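
For a concrete picture, here is a rough sketch of what that per-edit refresh amounts to (illustrative only; the real behavior is driven by the change-prop rules and RESTBase storage, not application code like this, and the function name is made up):

```js
// Illustrative only: a simplified view of per-edit PCS re-rendering.
// Assumes Node 18+ for global fetch; the real logic lives in change-prop
// rules, not in code like this.
const PREGENERATED_ENDPOINTS = ['media-list', 'metadata', 'mobile-html', 'references'];

// Called for every edit (or template edit) event consumed from the event
// stream, so the request rate tracks the overall edit rate.
async function rerenderPcs(domain, title) {
  await Promise.all(PREGENERATED_ENDPOINTS.map((endpoint) =>
    // cache-control: no-cache tells RESTBase to regenerate the response in
    // mobileapps and overwrite the stored copy instead of serving it from storage.
    fetch(`https://${domain}/api/rest_v1/page/${endpoint}/${encodeURIComponent(title)}`, {
      headers: { 'cache-control': 'no-cache' },
    })));
}

// e.g. rerenderPcs('en.wikipedia.org', 'Earth');
```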

So one issue right off the bat is that we're generating /feed/onthisday/all in the service in addition to the individual pieces, which seems like a lot of redundant work. How much work would it be to compose /feed/onthisday/all in RESTBase instead, the way we've been doing with /feed/featured for some time now?

I see that the new /page/talk endpoint also has pretty bad latency, which is not exactly a surprise. I'll take a look and see if I can find some tweaks to make there.

Mholloway raised the priority of this task from Normal to High. Jul 31 2019, 6:49 PM

> So one issue right off the bat is that we're generating /feed/onthisday/all in the service in addition to the individual pieces, which seems like a lot of redundant work. How much work would it be to compose /feed/onthisday/all in RESTBase instead, the way we've been doing with /feed/featured for some time now?

We are already doing that: https://github.com/wikimedia/restbase/blob/master/v1/onthisday.js#L57-L76
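
(To spell out the composition for anyone following along: it is essentially parallel fetches of the individual pieces merged into one object. A minimal sketch of the idea, not the actual RESTBase code; assumes Node 18+ for global fetch.)

```js
// Minimal sketch of composing /feed/onthisday/all from the individual parts.
const PARTS = ['selected', 'births', 'deaths', 'events', 'holidays'];

async function onThisDayAll(domain, mm, dd) {
  const bodies = await Promise.all(PARTS.map(async (part) => {
    const res = await fetch(`https://${domain}/api/rest_v1/feed/onthisday/${part}/${mm}/${dd}`);
    return res.json();
  }));
  // Each per-part response is shaped like { selected: [...] }, { births: [...] },
  // etc., so merging them yields the combined "all" object.
  return Object.assign({}, ...bodies);
}

// e.g. onThisDayAll('en.wikipedia.org', '07', '24').then(console.log);
```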

Weird. Why are we still getting internal requests for /feed/onthisday/all, then?

> Weird. Why are we still getting internal requests for /feed/onthisday/all, then?

It's an MCS x-ample test; see spec/feed/on-this-day.yaml. I believe the x-ample should be changed to make it less costly.
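
(For context on why the x-ample matters for load: the monitoring checks replay every declared example request against the service on each probe, so an example that hits the full /feed/onthisday/all aggregate redoes all of that work over and over. A simplified sketch of the idea, not the actual service-checker code; the paths, titles, and port below are made up for illustration. Assumes Node 18+ for global fetch.)

```js
// Simplified sketch: monitoring replays each declared x-ample on every probe,
// so the cost of the chosen example is paid repeatedly.
const X_AMPLES = [
  // Costly: exercises the full aggregate, which fans out to every onthisday part.
  { title: 'on this day, all events', path: '/en.wikipedia.org/v1/feed/onthisday/all/01/15' },
  // Cheaper replacement: a single part is enough to prove the route works.
  { title: 'on this day, selected events', path: '/en.wikipedia.org/v1/feed/onthisday/selected/01/15' },
];

async function runProbes(baseUri) {
  for (const ample of X_AMPLES) {
    const res = await fetch(baseUri + ample.path);
    console.log(`${ample.title}: ${res.status}`);
  }
}

// e.g. runProbes('http://localhost:8888'); // port is illustrative
```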

Change 526752 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/mobileapps@master] Performance: Change onthisday spec x-amples to deaths only

https://gerrit.wikimedia.org/r/526752

Change 526752 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Performance: Change onthisday spec x-ample to selected only

https://gerrit.wikimedia.org/r/526752

Change 526763 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/mobileapps@master] Performance: Remove language variant info from page metadata response

https://gerrit.wikimedia.org/r/526763

Change 526763 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Performance: Remove language variant info from page metadata response

https://gerrit.wikimedia.org/r/526763

> I see that the new /page/talk endpoint also has pretty bad latency, which is not exactly a surprise. I'll take a look and see if I can find some tweaks to make there.

On second thought, if /page/talk isn't being pregenerated, and is only getting ~1 req/s, that's the wrong place to focus. I'll take a good, hard look at mobile-html in the morning.
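
(Background for anyone wondering how slow endpoints turn into "worker stopped sending heartbeats, killing": the worker's liveness signal runs on the same event loop that does the HTML processing, so a long stretch of synchronous work delays the heartbeat past the kill threshold. A self-contained toy demonstration of the effect, not service-runner's actual code:)

```js
// Demonstrates how synchronous, CPU-bound work starves timers sharing the
// event loop. Toy model of the heartbeat mechanism, not service-runner code.
const HEARTBEAT_INTERVAL_MS = 1000;
let lastBeat = Date.now();

setInterval(() => {
  const now = Date.now();
  // Idle: prints ~1000ms. During the blocking work below it jumps to several
  // seconds, which is the kind of gap that gets a worker killed.
  console.log(`heartbeat after ${now - lastBeat}ms`);
  lastBeat = now;
}, HEARTBEAT_INTERVAL_MS);

// Stand-in for a heavy mobile-html transform: several seconds of synchronous work.
function blockEventLoop(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) { /* busy wait */ }
}

setTimeout(() => blockEventLoop(5000), 2000);
setTimeout(() => process.exit(0), 10000);
```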

I should point out that the picture actually got much worse on 7/30 — more than double the already-elevated error rate shown in the description. I think it was caused by the new language variant support that was deployed that day.

I think we're just asking this one poor service to do too much.

Patching up a few crashers (captured in T229521 and T229630) looks to have improved the situation considerably with respect to worker deaths, but the current level is still elevated compared to the baseline prior to 7/24.

@Mholloway We are getting a number of alerts where scb* servers return 504 to health checks. What do you suggest?

Mholloway added a subscriber: wikifeeds. (Edited) Aug 13 2019, 2:39 PM

The plan is to relieve pressure on mobileapps by pushing forward on the MCS-PCS split (i.e., launching the standalone Wikifeeds service and moving feed endpoint traffic over to it). An initial Helm chart is in progress at https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/526679/.

Mholloway signed these changes with MFA. Aug 13 2019, 2:40 PM
Mholloway removed a subscriber: wikifeeds.
Mholloway updated the task description. Aug 15 2019, 1:19 AM
Mholloway added a subscriber: mobrovac. (Edited) Aug 27 2019, 2:48 PM

@mobrovac @Pchelolo are we still pregenerating responses for the /page/definition endpoint? I don't know how to determine that from the RESTBase code, but it looks like it's still the case given the steady stream of /page/definition 501s in the mobileapps logs.

If we are still pregenerating, I think we should strongly consider stopping, given the very low request rate (~2.5 req/min, according to T222821#5168253), which we should easily be able to handle without pregeneration. That would likely help ease the burden here.

Stopping regeneration === stopping storage. However, looking at the Grafana external vs. internal definition latencies, the difference is not that high, so I think we can safely do so.

Could you please create a subtask for doing just that?
It would require:

  1. Removing storage from RESTBase/Cassandra
  2. Modifying test/prod change-prop config to stop regenerating

> Could you please create a subtask for doing just that?

Done: T231361: Stop pregenerating and storing /page/definition responses

It looks like the rate of worker deaths approximately doubled for some unknown reason on 8/19 (though to nowhere near the previous high levels).

There was no mobileapps deployment, and the RESTBase deploy commit for that day looks unrelated.

Stopping /page/definition pregeneration appears to have helped, but we're still seeing a nontrivial number of worker deaths.

From activity on T170455: Extract the feed endpoints from PCS into a new wikifeeds service, it looks like that's nearing deployment; I expect that moving feed traffic to that service will also help.

jijiki moved this task from Backlog to Externally Blocked on the serviceops board. Sep 13 2019, 3:27 AM

Change 538036 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/mobileapps/deploy@master] Increase num_workers to (ncpu * 1.5)

https://gerrit.wikimedia.org/r/538036

Change 538036 merged by Mholloway:
[mediawiki/services/mobileapps/deploy@master] Increase num_workers to (ncpu * 1.5)

https://gerrit.wikimedia.org/r/538036

Mentioned in SAL (#wikimedia-operations) [2019-09-19T14:51:07Z] <mholloway-shell@deploy1001> Started deploy [mobileapps/deploy@16a6af1]: Increase num_workers to (ncpu * 1.5) (T229286)

Mentioned in SAL (#wikimedia-operations) [2019-09-19T14:56:44Z] <mholloway-shell@deploy1001> Finished deploy [mobileapps/deploy@16a6af1]: Increase num_workers to (ncpu * 1.5) (T229286) (duration: 05m 39s)

Mholloway added a comment. (Edited) Sep 19 2019, 4:31 PM

Increasing the number of workers by 50% had a different effect than expected. We expected response latencies to increase slightly and worker deaths to decrease; instead, the result was a small improvement in p50 internal response latencies and no change in the rate of worker deaths.

> Increasing the number of workers by 50% had a different effect than expected. We expected response latencies to increase slightly and worker deaths to decrease; instead, the result was a small improvement in p50 internal response latencies

This is actually easily explained: since there are more workers now, work on any given request starts sooner, so even if each request takes slightly longer to process, the gain from starting sooner outweighs the extra processing time.
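
(A toy queueing illustration of that point, with entirely made-up numbers: same bursty load, 50% more workers, each request assumed ~20% slower due to CPU contention, and p50 still improves because time spent waiting for a free worker shrinks.)

```js
// Toy multi-server FIFO queue; numbers are illustrative, not measured.
function simulate(numWorkers, serviceMs, burstSize = 60, burstGapMs = 600, bursts = 50) {
  const workerFreeAt = new Array(numWorkers).fill(0);
  const latencies = [];
  for (let b = 0; b < bursts; b++) {
    for (let i = 0; i < burstSize; i++) {
      const arrival = b * burstGapMs;            // requests arrive in bursts
      let w = 0;                                 // pick the earliest-free worker
      for (let j = 1; j < numWorkers; j++) {
        if (workerFreeAt[j] < workerFreeAt[w]) w = j;
      }
      const start = Math.max(arrival, workerFreeAt[w]);
      workerFreeAt[w] = start + serviceMs;
      latencies.push(workerFreeAt[w] - arrival); // queueing + processing time
    }
  }
  latencies.sort((x, y) => x - y);
  return latencies[Math.floor(latencies.length / 2)]; // p50
}

// Same offered load; the larger pool pays ~20% more per request yet wins on p50.
console.log('p50 with 8 workers @ 75 ms:', simulate(8, 75), 'ms');   // 300 ms
console.log('p50 with 12 workers @ 90 ms:', simulate(12, 90), 'ms'); // 270 ms
```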

> and no change in the rate of worker deaths.

This is indeed weird and unexpected. Technically, though, the ratio of worker deaths is now lower, given that there are more workers. However, that's not really an improvement if the same absolute number of requests fails due to worker deaths.

Change 538728 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/mobileapps@master] Performance tweaks: domUtil + addSectionEditButtons

https://gerrit.wikimedia.org/r/538728

Change 538728 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Performance tweaks: domUtil + addSectionEditButtons

https://gerrit.wikimedia.org/r/538728

Mentioned in SAL (#wikimedia-operations) [2019-09-25T12:13:25Z] <mholloway-shell@deploy1001> Started deploy [mobileapps/deploy@241b284]: Performance tweaks: domUtil + addSectionEditButtons (T229286)

Mentioned in SAL (#wikimedia-operations) [2019-09-25T12:18:43Z] <mholloway-shell@deploy1001> Finished deploy [mobileapps/deploy@241b284]: Performance tweaks: domUtil + addSectionEditButtons (T229286) (duration: 05m 17s)

Mentioned in SAL (#wikimedia-operations) [2019-09-25T20:17:28Z] <mholloway-shell@deploy1001> Started deploy [mobileapps/deploy@dbf4e7e]: Speed up querySelectors in domUtil (T229286)

Mentioned in SAL (#wikimedia-operations) [2019-09-25T20:23:00Z] <mholloway-shell@deploy1001> Finished deploy [mobileapps/deploy@dbf4e7e]: Speed up querySelectors in domUtil (T229286) (duration: 05m 32s)
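
(For the record, the general shape of these querySelector tweaks, with made-up selectors and attribute names; this is not the actual patch. The idea is to stop re-running broad, document-wide queries and instead run one narrower query and reuse the result. Assumes the domino package, which mobileapps uses for its DOM.)

```js
// Illustrative only: replace repeated broad queries with one scoped, reusable one.
const domino = require('domino');

// Before: a document-wide scan is re-evaluated for every section processed.
function hasEditLinkSlow(doc, sectionId) {
  return Array.from(doc.querySelectorAll('a'))
    .some((a) => a.getAttribute('data-section') === String(sectionId));
}

// After: query once with a specific selector, index the result, reuse it.
function buildEditLinkIndex(doc) {
  const index = new Set();
  for (const a of Array.from(doc.querySelectorAll('a[data-section]'))) {
    index.add(a.getAttribute('data-section'));
  }
  return index;
}

const doc = domino.createDocument('<section><a data-section="0" href="#">edit</a></section>');
console.log(buildEditLinkIndex(doc).has('0')); // true
```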

Mholloway added a subscriber: JoeWalsh.

I've confirmed with @JoeWalsh that the apps aren't using the /page/media endpoint (the Android app has already transitioned to /page/media-list), so we can stop pregenerating it. I've created a task for this and added it as a subtask here.

Mholloway added a comment. (Edited) Mon, Sep 30, 4:17 PM

Ah, I see now from https://github.com/wikimedia/mediawiki-services-change-propagation-deploy/commit/6df20caea35912d5003c50a6e2674b6d4707d692 that the PCS endpoints being pregenerated since 7/24 are:

  • media-list
  • metadata
  • mobile-html
  • references

Change 541891 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] lvs::monitor_services: increase number of tries before MCS is critical

https://gerrit.wikimedia.org/r/541891

jijiki updated the task description. Wed, Oct 9, 6:43 PM

@Mholloway We are thinking of switching the recipients of MCS alerts of this sort to mobileapps only, as we are getting more alerts that are non-actionable. Is that OK? The alternative would be to change the number of tries before alerting and/or the interval between tries.

I'm continuing to work with @JoeWalsh to tweak performance, and hoping that T235173: Stop pregenerating /page/metadata will also take some pressure off the event loop. (In the future it would be good to start pregenerating endpoint responses in a staggered fashion rather than all at once, so that it's easier to identify offenders and roll back for improvement where needed.)

@jijiki I guess I don't have any objection to us being the only recipients of alerts.

Change 541891 merged by Effie Mouzeli:
[operations/puppet@production] lvs::monitor_services: increase number of tries before MCS is critical

https://gerrit.wikimedia.org/r/541891

Mholloway renamed this task from "worker died, restarting" mobileapps issue to Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration. Wed, Oct 16, 3:56 PM
Mholloway added a project: Epic.
Mholloway removed a project: Patch-For-Review.
Mholloway moved this task from Needs triage to Epics on the Product-Infrastructure-Team-Backlog board.