Feed checks timeout on RESTBase deploy
Closed, Resolved · Public

Description

We already add a pretty significant delay between groups and between the deploy stage and the check stage (sketched below), but the feed endpoints still time out 2-3 times per deployment.

Interestingly, the checks don't fail after the deploy is done; MCS also shows increased p99 latencies for the feed endpoints during a RESTBase deploy.

This is very annoying, as it forces us to repeat the deploy several times before all the groups succeed.
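A minimal sketch of the flow in question, with hypothetical helper names (deployGroup and checkFeedEndpoints stand in for the real deploy tooling; the group names and delay values are assumptions):

```
// Sketch of the rolling deploy with delays described above; not the real tooling.
const GROUPS = ['canary', 'group1', 'group2'];
const STAGE_DELAY_MS = 30_000; // delay between deploy stage and check stage
const GROUP_DELAY_MS = 60_000; // delay between groups

const sleep = (ms: number) => new Promise<void>(res => setTimeout(res, ms));

async function deployGroup(group: string): Promise<void> {
  // stand-in for pushing the new code to one group of hosts
}

async function checkFeedEndpoints(group: string): Promise<void> {
  // stand-in for probing the feed endpoints; this is the stage that times out
}

async function rollingDeploy(): Promise<void> {
  for (const group of GROUPS) {
    await deployGroup(group);
    await sleep(STAGE_DELAY_MS); // let restarted workers settle before checking
    await checkFeedEndpoints(group);
    await sleep(GROUP_DELAY_MS); // pause before moving to the next group
  }
}
```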

Event Timeline

@Pchelolo can you add projects to this task? Thanks!

The feed content includes a lot of things already stored in RESTBase (Parsoid content of pages, Pageview API data, summaries of included pages, ...). Are the hosts that are being updated properly depooled? Should MCS use a different hostname to refer to content in RB?

Are the hosts that are being updated properly depooled?

They should be... If they weren't, we'd have much bigger issues.

One thing that strikes me is that I don't see any significant spikes in p95/p99 latencies for endpoints in the RESTBase dashboard correlating with the deployments.

We've seen this before, but then the problem somehow went away on its own, only to come back again. Our theory was that during all the restarts we don't actually wait for all the workers to come up, so while restarts are in flight the overall number of workers can be smaller; however, we've added very significant delays to ensure that's not the case.
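Actually waiting for the workers to come up, rather than sleeping for a fixed interval, would look something like this sketch; workerCount() and the expected count are hypothetical, not existing tooling:

```
// Poll until the expected number of workers is back before proceeding.
const EXPECTED_WORKERS = 8; // assumed per-host worker count

async function workerCount(): Promise<number> {
  // stand-in: in reality this might scrape service metrics or process lists
  return EXPECTED_WORKERS;
}

async function waitForWorkers(timeoutMs = 60_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if ((await workerCount()) >= EXPECTED_WORKERS) {
      return; // full capacity restored; safe to run checks or continue
    }
    await new Promise(res => setTimeout(res, 1000));
  }
  throw new Error('workers did not come back up in time');
}
```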

Another idea is to temporarily set debug to true in config.prod.yaml, similarly to how it's done in config.dev.yaml[1] (see the sketch below). This would also log outgoing requests (to backend services, like the MW API or other services in RB). Check it out locally first to see whether there would be any useful info there.

[1] https://phabricator.wikimedia.org/diffusion/GMOA/browse/master/config.dev.yaml$79
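For reference, a minimal sketch of what that change might look like; the exact structure should be copied from config.dev.yaml [1], and the placement of the flag under conf here is an assumption:

```
# Hypothetical excerpt of config.prod.yaml; mirror config.dev.yaml [1] for
# the real structure.
services:
  - name: mobileapps
    conf:
      # With debug on, outgoing requests (MW API, other RB services) are
      # logged as well, at the cost of much noisier logs.
      debug: true
```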

Another idea is to temporarily set debug to true in config.prod.yaml, similarly to how it's done in config.dev.yaml

@bearND I'm worried that would be too much logging... But I think we will eventually have to do that for both RB and MCS and do a rolling restart of RB. I had to repeat the deploy 4 times today because of this.

mobrovac subscribed.

Raising priority to High as this has started happening more and more often, even during normal RESTBase operation.

Mentioned in SAL (#wikimedia-operations) [2018-10-18T10:29:46Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@88c8f26]: Parallelise onthisday call - T203588

Mentioned in SAL (#wikimedia-operations) [2018-10-18T10:41:09Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@88c8f26]: Parallelise onthisday call - T203588 (duration: 11m 24s)
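For context on the "Parallelise onthisday call" change deployed above: the onthisday feed aggregates several independent per-type sub-requests, so issuing them concurrently instead of sequentially cuts the end-to-end latency from the sum of the sub-requests to roughly the slowest one. A sketch of the pattern (fetchType and the type list are illustrative, not the actual MCS/RB code):

```
// Illustration of the parallelisation pattern; fetchType is a hypothetical
// stand-in for the per-type backend request.
const TYPES = ['selected', 'births', 'deaths', 'events', 'holidays'];

async function fetchType(type: string): Promise<unknown> {
  return {}; // stand-in for the real request
}

// Before: sub-requests run one after another, so latency is the sum.
async function onThisDaySequential(): Promise<Record<string, unknown>> {
  const results: Record<string, unknown> = {};
  for (const t of TYPES) {
    results[t] = await fetchType(t);
  }
  return results;
}

// After: sub-requests run concurrently, so latency is roughly the slowest one.
async function onThisDayParallel(): Promise<Record<string, unknown>> {
  const values = await Promise.all(TYPES.map(t => fetchType(t)));
  const results: Record<string, unknown> = {};
  TYPES.forEach((t, i) => { results[t] = values[i]; });
  return results;
}
```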

Mentioned in SAL (#wikimedia-operations) [2018-10-18T11:33:13Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@1041a02]: Disable onthisday check - T203588

Mentioned in SAL (#wikimedia-operations) [2018-10-18T11:54:36Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@1041a02]: Disable onthisday check - T203588 (duration: 21m 23s)

mobrovac lowered the priority of this task from High to Medium. Edited Oct 18 2018, 11:58 AM

While PR 1074 did drastically improve the performance of the onthisday endpoint, RB deployments were still failing (I managed to fully deploy RB after 5 attempts). Therefore, I opted to disable the check for the time being. There were no deployment failures afterwards.

This is only a temporary measure, though. We still need to address this problem.
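For the record, the mechanics of disabling the check depend on how the deploy checks are defined. If they are driven by x-monitor/x-amples stanzas in the endpoint spec, as with service-checker, the change would amount to something like this hypothetical excerpt:

```
# Hypothetical spec excerpt: with service-checker-style checks, setting
# x-monitor to false excludes the endpoint from post-deploy verification.
/feed/onthisday/{type}/{mm}/{dd}:
  get:
    x-monitor: false  # previously true, with x-amples defining the probe
```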

This has not been happening for a while now, so it somehow fixed itself. Please reopen if it starts happening again.