
GrowthExperiments Special:Homepage: investigate performance regression since September 6 2022
Closed, ResolvedPublic

Description

It appears that there is a performance regression for GrowthExperiments Special:Homepage feature coinciding with the rollout of wmf.28 (Grafana).

Last 7 days (compared to 7 days prior in orange):

image.png (2×2 px, 1 MB)

30 day view:

image.png (2×2 px, 1 MB)

Note that there are several samples recording ~2 seconds to load Special:Homepage on desktop, compared to ~700ms which we usually expect.

The spike (logs for the day) coincides with:

  • 2022-09-06 03:02 Started scap: testwikis wikis to 1.39.0-wmf.28 refs T314189 log entry
  • 2022-09-06 03:26 multi-DC stage 4: all traffic to appservers-ro, rolling out via puppet 03:24-03:54 (log entry)

At first glance, I don't see any changes in GrowthExperiments or MediaWiki core that would account for the performance regression.

Candidates for contributors to this:

  • image-recommendation cache fetch see T317187#8220110
  • wmf.27-wmf.28 changes – unlikely, given that we observe the spikes when the train is at group0 and most of the traffic to Special:Homepage comes from group1/group2 wikis
  • ElasticSearch config switch patch, deployed prior to wmf.28 rollout.
  • multi-DC deployment
  • php7.4 deployment
  • ...

Event Timeline

Tentatively adding as a deployment blocker until we can figure out what is causing this.

kostajh renamed this task from Investigate performance regression in wmf.28 to GrowthExperiments Special:Homepage: investigate performance regression in wmf.28. Sep 7 2022, 12:31 PM
kostajh updated the task description.

It seems like the ElasticSearch integration might be a significant contributor to the degradation; see, for example, https://performance.wikimedia.org/xhgui/run/symbol?id=631891d12005d5f09bb75548&symbol=SearchEngine%3A%3AsearchText (using the PHP 7.4 backend), which would align with the risky patches identified in T314189#8200929 (cc @EBernhardson @dcausse).

kostajh triaged this task as High priority. Sep 7 2022, 12:56 PM

Another possibility is that the interaction with the image suggestion service (example XHGui) is implicated; if metadata for an image suggestion is not found in the cache, our code calls the service via getWithSetCallback. But so far, I only see that happen if I set resetCache=1 as a query parameter on Special:Homepage, and that wouldn't explain the overall spike in the graphs above.
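
As an illustration, here is a minimal sketch of the getWithSetCallback pattern described above; the cache key, TTL, client and method names are assumptions, not the actual GrowthExperiments code:

```php
$metadata = $cache->getWithSetCallback(
	$cache->makeKey( 'image-suggestion-metadata', $pageId ), // hypothetical cache key
	WANObjectCache::TTL_HOUR, // hypothetical TTL
	static function () use ( $client, $pageId ) {
		// Only runs on a cache miss; this is where the (potentially slow)
		// call to the image suggestion service would happen.
		return $client->fetchSuggestions( $pageId ); // hypothetical client and method
	}
);
```

On a cache hit the callback never runs, so a warm cache should keep the service call out of the request's critical path.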

@kostajh thanks for pinging us. We are in the middle of the Elasticsearch upgrade, and search latency might be affected: during the rollout of wmf.28, the MW appservers running in eqiad will hit Elasticsearch in codfw and thus pay extra cross-DC latency, compared to wmf.27 where they hit Elasticsearch in the same DC.
I don't know if this can explain all of the degradation you see, but it's certainly contributing to it.

Timings for Special:Homepage?useCache=0, using no topic filters and copyedit|linkrecommendation as the task types:

  • PHP 7.4, testwiki (wmf.28) 1,524,438 µs
  • PHP 7.4 cswiki (wmf.27) 1,346,095 µs
  • PHP 7.2 testwiki (wmf.28) 1,289,526 µs
  • PHP 7.2 cswiki (wmf.27) 1,160,520 µs

So perhaps both the PHP 7.4 rollout (T271736) and the ElasticSearch change are implicated. I don't know why the PHP 7.4 rollout would be related, though.

It can't be due to the wmf.28 deployment, because the latency spike occurred at 04:26, before the deployment at 08:09. A single sample with p99=p75=12.3 seconds at 2022-09-06 04:26 dominates the p75 view at wider time ranges. In the p99 graph, you can see p99 spikes of similar magnitude going back a long way, for example 12.5 seconds at 2022-08-13 20:15.

The reason p75=p99 in the 04:26 bucket is that there were only 2 samples in it, so the p75 is by definition the same as the p99: both are just the larger of the two samples.

homepage desktop count.png (300×873 px, 25 KB)
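
A quick way to see why p75 equals p99 there, using the nearest-rank definition of a percentile (Graphite's exact aggregation may differ slightly):

```php
// With only two samples, both the 75th and the 99th percentile resolve to the
// larger sample, hence p75 == p99 for that bucket.
function percentile( array $samples, float $p ): float {
	sort( $samples );
	$rank = (int)ceil( $p / 100 * count( $samples ) ) - 1;
	return $samples[$rank];
}

$bucket = [ 0.7, 12.3 ]; // the two samples, in seconds
echo percentile( $bucket, 75 ); // 12.3
echo percentile( $bucket, 99 ); // 12.3
```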

Another possibility is that the interaction with the image suggestion service (example XHGui) is implicated; if metadata for an image suggestion is not found in the cache, our code calls the service via getWithSetCallback. But so far, I only see that happen if I set resetCache=1 as a query parameter on Special:Homepage, and that wouldn't explain the overall spike in the graphs above.

Image recommendations have their own timing metric in statsd/graphite. It often takes exactly 10 seconds, so maybe that is a timeout. You can see it in the p99:

imageRecommendationProvider-p99.png (299×1 px, 33 KB)
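
As a rough illustration of how a metric like that ends up pinned at exactly 10 seconds (the metric key, timeout value and surrounding code are assumptions, not the actual provider implementation):

```php
$start = microtime( true );
// Assumed 10-second client-side timeout on the call to the image suggestion service.
$request = $httpRequestFactory->create( $url, [ 'timeout' => 10 ] );
$status = $request->execute();
// Record the elapsed time in statsd; a hit timeout shows up as ~10,000 ms.
$statsdDataFactory->timing(
	'growthExperiments.imageRecommendationProvider.get', // hypothetical metric key
	( microtime( true ) - $start ) * 1000
);
```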

It could be our general Kubernetes problem, which I wrote about at T313374. Envoy metrics show that 7% of requests to the image-suggestion endpoint result in a 503.

In the last 7 days:

Response code    Requests
200              338168
404              0
503              26397
504              72
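
For reference, the ~7% figure follows directly from those counts:

```php
$counts = [ 200 => 338168, 404 => 0, 503 => 26397, 504 => 72 ];
$total = array_sum( $counts );            // 364637 requests in total
echo $counts[503] / $total * 100;         // ~7.2% of requests got a 503
```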

503 responses as a percentage of all responses, by Envoy destination, last 24 hours:

[edit: this is only appserver requests, and ores mostly uses the job queue, so the number for ores here is not reflective of the ores service health as a whole]

Envoy service                  503 %
ores                           14.50%
image-suggestion               8.06%
shellbox                       0.10%
shellbox-syntaxhighlight       0.04%
restbase                       0.00%
search-psi-https_eqiad         0.00%
local_port_80                  0.00%
search-https_eqiad             0.00%
eventgate-analytics            0.00%
eventgate-analytics-external   0.00%
sessionstore                   0.00%
echostore                      0.00%
eventgate-main                 0.00%
api-rw                         0.00%
mathoid                        0.00%
parsoid-php                    0.00%
search-https_codfw             0.00%
search-omega-https_eqiad       0.00%
shellbox-constraints           0.00%
shellbox-timeline              0.00%
termbox                        0.00%
wdqs-internal                  0.00%

ores and image-suggestion almost always appear in the Envoy logs as UC "Upstream connection termination in addition to 503 response code", unlike Shellbox where errors are flagged as UF "Upstream connection failure in addition to 503 response code".

I confirmed that 503 errors are not correlated with pod termination.

The appserver envoy metrics show many 503s, but the istio metrics and logs show no errors. So my theory is that something goes wrong with kube-proxy, but only for ingestion. If I understand correctly, the request flow is MW -> envoy -> kube-proxy -> istio/envoy -> kube-proxy -> image-suggestions. If kube-proxy was broken both times, istio would see errors.

If kube-proxy was broken, most of the services you've listed above would get the same kind of error rates you get from image-suggestion, so I don't think it's probable.

We will have to dig a bit deeper, I suspect the problem might be an interaction between envoy and istio (which ironically, is also envoy).

Another possibility is that the interaction with the image suggestion service (example XHGui) is implicated; if metadata for an image suggestion is not found in the cache, our code calls the service via getWithSetCallback. But so far, I only see that happen if I set resetCache=1 as a query parameter on Special:Homepage, and that wouldn't explain the overall spike in the graphs above.

Image recommendations have their own timing metric in statsd/graphite. It often takes exactly 10 seconds, so maybe that is a timeout. You can see it in the p99:

imageRecommendationProvider-p99.png (299×1 px, 33 KB)

Actually, sorry, that shouldn't be related. It's not straightforward, AFAICT, to quickly see from an XHGui run what is being executed in a deferred context, but the image suggestion cache fetch (or miss) should be happening in a deferred update. I think we can rule that out as a contributing factor to this problem.

It could be our general Kubernetes problem, which I wrote about at T313374. Envoy metrics show that 7% of requests to the image-suggestion endpoint result in a 503.

In the last 7 days:

Response code    Requests
200              338168
404              0
503              26397
504              72

Thanks for sharing that. We can follow-up on that in T313973: GrowthExperiments\NewcomerTasks\AddImage\ServiceImageRecommendationProvider::get Unable to decode JSON response for page {title} upstream connect error or disconnect/reset before headers. reset reason: connection termination.

It could be our general Kubernetes problem, which I wrote about at T313374. Envoy metrics show that 7% of requests to the image-suggestion endpoint result in a 503.

In the last 7 days:

Response code    Requests
200              338168
404              0
503              26397
504              72

AIUI you got these numbers from calls originating from the appserver cluster only, right?
When taking requests from all clusters into account, the errors are more in the region of 3%:
https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=All&var-origin_instance=All&var-destination=image-suggestion&from=now-7d&to=now
But I would also assume these are a consequence of T313973 (or: we should at least rule that out).

@kostajh thanks for pinging us. We are in the middle of the Elasticsearch upgrade, and search latency might be affected: during the rollout of wmf.28, the MW appservers running in eqiad will hit Elasticsearch in codfw and thus pay extra cross-DC latency, compared to wmf.27 where they hit Elasticsearch in the same DC.
I don't know if this can explain all of the degradation you see, but it's certainly contributing to it.

OK. Yeah, this definitely contributes to it, but it's hard to say how much. When rendering Special:Homepage, we fetch a task set for the user from the cache, but before returning the response, we call ElasticSearch to validate the contents of the task set. I see a spike in the Local search task suggester p95 chart, but it is only in the tens of milliseconds and doesn't really explain the larger jumps we see in the Special:Homepage rendering times.
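
A minimal sketch of that render-time flow, with the cache key and validation call assumed rather than taken from the actual GrowthExperiments code:

```php
// Fetch the cached task set for the user, then re-validate it against search
// before rendering. While search traffic was served cross-DC, this validation
// round trip added latency to every Special:Homepage render.
$taskSet = $cache->get( $cache->makeKey( 'homepage-tasks', $user->getId() ) ); // hypothetical key
if ( $taskSet ) {
	$taskSet = $taskSuggester->filter( $user, $taskSet ); // hypothetical validation call
}
```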

That said, I've chatted with @Tgr and other Growth engineers, and the spike in Special:Homepage rendering time is not unacceptably large. So I think we can move forward with the wmf.28 rollout (cc @jeena, @hashar) and wait to see how the completed ElasticSearch upgrade affects the performance metrics.

AIUI you got these numbers from calls originating from the appserver cluster only, right?
When taking requests from all clusters into account, the errors are more in the region of 3%:

Yes, but this task is for Special:Homepage which is served by the appserver cluster.

I think this task should be closed as invalid or merged with T313973 since I don't think there is any performance regression in wmf.28. As Eric wrote at T313973#8218325:

To be clear to anyone joining the discussion late: If something is not done, records will continue to grow without bounds, the queries will take longer and longer to answer, and responses will become larger and larger.

The graphs posted in the task description show that the service got slower before the wmf.28 deployment to group0. And most requests are not for group0.

kostajh renamed this task from GrowthExperiments Special:Homepage: investigate performance regression in wmf.28 to GrowthExperiments Special:Homepage: investigate performance regression since September 6 2022. Sep 9 2022, 8:44 AM

AIUI you got these numbers from calls originating from the appserver cluster only, right?
When taking requests from all clusters into account, the errors are more in the region of 3%:

Yes, but this task is for Special:Homepage which is served by the appserver cluster.

I think this task should be closed as invalid or merged with T313973 since I don't think there is any performance regression in wmf.28. As Eric wrote at T313973#8218325:

Agreed that it doesn't appear related to wmf.28. I've updated the task title to reflect that, and removed the wmf.28 blocker as parent task in T317187#8220207.

To be clear to anyone joining the discussion late: If something is not done, records will continue to grow without bounds, the queries will take longer and longer to answer, and responses will become larger and larger.

The graphs posted in the task description show that the service got slower before the wmf.28 deployment to group0. And most requests are not for group0.

To clarify (and rephrase what I wrote in T317187#8220110), the issues with slow query times for the image suggestion service should not have anything to do with the spikes seen in the p99 for Special:Homepage rendering since September 6 2022. The image suggestion service can indeed be called during render of Special:Homepage, but only in a deferred, post-send update (see TaskSetListener.php). So whatever issues exist with the image suggestion service, they don't seem related to the spike seen in the p99 graph for desktop rendering of Special:Homepage.
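
For context, a minimal sketch of the post-send pattern referenced above (the warmCache() helper is hypothetical; see TaskSetListener.php for the actual code):

```php
// Runs after the response has been sent to the client, so the image suggestion
// service call cannot affect the measured render time of Special:Homepage.
DeferredUpdates::addCallableUpdate( static function () use ( $imageRecommendationProvider, $taskSet ) {
	$imageRecommendationProvider->warmCache( $taskSet ); // hypothetical helper
}, DeferredUpdates::POSTSEND );
```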

I added a section in the task description with possible contributors to the spike.

Here's how the last week looks:

image.png (1×2 px, 616 KB)

Desktop rendering time for Special:Homepage has roughly doubled compared to last week. Mobile has also gotten slower, but not as badly, so that might be a clue.

We might wait for T308676: Elasticsearch 7.10.2 rollout plan to be completed, to see how that impacts things. It's possible that the code for validating the contents of a cached TaskSet is causing the slowdown, since that involves a cross-DC search query.

Things look more normal now that the ElasticSearch 7 migration has been completed and search traffic is again served from eqiad.

image.png (630×2 px, 234 KB)

The orange line at 20:15 corresponds to the log entry: 20:15 thcipriani@deploy1002: Finished scap: Backport for Revert "cirrus: Handle transition to elasticsearch 7.10" (T308676) (duration: 07m 39s)

Checking for wmf.3/wmf.2: I'll check again after the wmf.3 deployment is done for all groups.

Last week:

Screen Shot 2022-09-27 at 3.52.14 PM.png (1×2 px, 757 KB)

Checked after the wmf.6 deployment to group 0. It still shows quite high numbers.

Screen Shot 2022-10-18 at 2.50.56 PM.png (1×2 px, 530 KB)

I think it is more or less "normal", though I agree that ~900ms is too long for rendering the homepage.

The spike we were tracking with this task was from September 7 to 15, where you can see the render time was much worse.

image.png (1×2 px, 697 KB)

Checked after the wmf.6 deployment to group 0. It still shows quite high numbers.

Screen Shot 2022-10-18 at 2.50.56 PM.png (1×2 px, 530 KB)

I think it is more or less "normal", though I agree that ~900ms is too long for rendering the homepage.

The spike we were tracking with this task was from September 7 to 15, where you can see the render time was much worse.

image.png (1×2 px, 697 KB)

Thanks! Closing as Resolved.