
Performance regression: language variant endpoint performance degraded after a deploy (1.35.0-wmf.24)
Open, Medium, Public

Description

See this Grafana panel for the March 1 to March 31 timeframe. It shows that after the March 18th deploy, the performance of the language variant endpoints degraded (by up to 2x).

The lang conversion request rate panel shows that request rates also went up. This rate had temporarily dipped drastically after March 9th, but the new request rate on March 18th was higher than the request rate before March 9th.

It is possible that the increased request rate accounts for some of the changes -- maybe we have a different mix of requests after March 18th?
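To make the "different mix of requests" hypothesis concrete, here is a minimal sketch of how the aggregate latency can roughly double while no individual request type gets slower. The request types, counts, and latencies below are entirely hypothetical and are not taken from the dashboards.

```python
def mean_latency(mix):
    """Return the request-weighted mean latency for a {name: (requests, latency_ms)} mix."""
    total_requests = sum(count for count, _ in mix.values())
    weighted = sum(count * latency for count, latency in mix.values())
    return weighted / total_requests

# Hypothetical numbers: per-request latency is the same before and after;
# only the volume and share of the slower request type change.
before = {"cheap": (900, 100.0), "expensive": (100, 500.0)}
after = {"cheap": (600, 100.0), "expensive": (600, 500.0)}

print(mean_latency(before))  # 140.0
print(mean_latency(after))   # 300.0
```

If per-variant or per-wiki latency is available, comparing it before and after the change would help separate a mix shift from a genuine slowdown.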

In any case, this should be investigated. The deployment log for Parsoid on March 18th doesn't show anything out of the ordinary. There was a bump of the langconv library version. Worth looking at what changed there.

Event Timeline

ssastry created this task.

According to the SAL, Parsoid got bumped when we deployed v0.12.0-a5 as part of 1.35.0-wmf.24 to group0 at 00:39 UTC on 2020-03-19. (1.35.0-wmf.23 was delayed and only finished deploying to all wikis at 14:06 UTC on 2020-03-18.)

The slowdown occurred over the space of two hours, 18:00-20:00 UTC on 2020-03-17. (We only have hourly samples, but the sample for 19:00 is halfway between the samples for 18:00 and 20:00, so this appears to have been a somewhat gradual change.)

Since the slowdown began more than a day before that Parsoid deploy, this is more consistent with some other configuration change which brought a different mix of traffic.
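As a quick check of the timing argument, the gap between the observed slowdown and the group0 deploy can be computed from the timestamps quoted above. This is only a sketch; the variable and function names are illustrative.

```python
from datetime import datetime, timezone

# Timestamps quoted in this task (all UTC).
slowdown_start = datetime(2020, 3, 17, 18, 0, tzinfo=timezone.utc)
slowdown_end = datetime(2020, 3, 17, 20, 0, tzinfo=timezone.utc)
wmf24_group0 = datetime(2020, 3, 19, 0, 39, tzinfo=timezone.utc)

def hours_between(earlier, later):
    """Return the elapsed time between two datetimes in hours."""
    return (later - earlier).total_seconds() / 3600

# The slowdown had fully settled in more than a day before the deploy.
print(hours_between(slowdown_start, wmf24_group0))
print(hours_between(slowdown_end, wmf24_group0))
```

The two values are 30 h 39 min and 28 h 39 min respectively, so the degradation was complete well before 1.35.0-wmf.24 reached group0.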

There's a mobileapps deploy at 18:01 2020-03-17: bsitzmann@deploy1001: Finished deploy [mobileapps/deploy@b6bff94]: Update mobileapps to 3c73ca3 (duration: 06m 06s)

That change doesn't look particularly relevant, though: https://gerrit.wikimedia.org/r/580395

Still a mystery. For investigation, here is the list of SAL events from 17:00 to 20:00 on 2020-03-17:

18:53 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.24/extensions/Wikibase/lib/includes/Store/Sql/Terms/DatabaseItemTermStoreWriter.php: Do not lock rows when there's no term returned (T247553 T246898), To catch the train (duration: 01m 08s)
18:50 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
18:45 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
18:45 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
18:41 dzahn@cumin1001: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
18:39 mutante: removing mw1238 through mw1243 - decom with cookbook (T247780 T245099)
18:38 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
18:38 dzahn@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
18:37 dzahn@cumin1001: START - Cookbook sre.hosts.decommission
18:35 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw123[8-9].eqiad.wmnet
18:35 dzahn@cumin1001: conftool action : set/pooled=inactive; selector: name=mw124[0-3].eqiad.wmnet
18:29 otto@deploy1001: helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
18:01 bsitzmann@deploy1001: Finished deploy [mobileapps/deploy@b6bff94]: Update mobileapps to 3c73ca3 (duration: 06m 06s)
18:00 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
17:58 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
17:56 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.23/languages/LanguageConverter.php: languages: Don't assume in LanguageConverter (T235360) (duration: 01m 07s)
17:55 bsitzmann@deploy1001: Started deploy [mobileapps/deploy@b6bff94]: Update mobileapps to 3c73ca3
17:55 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
17:53 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw124[0-3].eqiad.wmnet
17:53 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw123[89].eqiad.wmnet
17:52 Amir1: warming up cache for Q70M to Q80M for new term store on db1111, db1126, db1104, db1092 (T219123)
17:46 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.23/extensions/Wikibase/lib/includes/Store/Sql/Terms/DatabaseItemTermStoreWriter.php: Do not lock rows when there's no term returned (T247553 T246898) (duration: 01m 07s)
17:42 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
17:40 otto@deploy1001: helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
17:37 ejegg: updated payments-wiki from 86ce0361f9 to 72856949a1
17:30 bearND: mobileapps deploy failed on canary, rolled back
17:29 bsitzmann@deploy1001: Finished deploy [mobileapps/deploy@266e6da]: Update mobileapps to 6370784 (duration: 04m 00s)
17:25 bsitzmann@deploy1001: Started deploy [mobileapps/deploy@266e6da]: Update mobileapps to 6370784
17:24 elukey@deploy1001: Finished deploy [analytics/superset/deploy@3f3ddcb]: Upgrade PyHive to 0.6.2 (duration: 00m 43s)
17:24 elukey@deploy1001: Started deploy [analytics/superset/deploy@3f3ddcb]: Upgrade PyHive to 0.6.2
17:18 dzahn@cumin1001: conftool action : set/pooled=yes; selector: name=mw1280.eqiad.wmnet
17:17 dzahn@cumin1001: conftool action : set/pooled=no; selector: name=mw1280.eqiad.wmnet
17:10 jynus: purging some old rows on pc1010 on a screen to earn some time T247788
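To make a list like the one above easier to search, here is a rough sketch that parses SAL lines pasted in this `HH:MM actor: message` form and filters them by time window and keyword. The regular expression and helper names are assumptions made for this example, not part of any SAL tooling; the sample lines are copied from the list above.

```python
import re
from datetime import time

# Assumed line shape: "HH:MM actor: message", as pasted in this task.
SAL_LINE = re.compile(r"^(\d{2}):(\d{2}) (\S+): (.*)$")

def parse_sal(lines):
    """Parse pasted SAL lines into (time, actor, message) tuples, skipping non-matching lines."""
    entries = []
    for line in lines:
        match = SAL_LINE.match(line.strip())
        if match is None:
            continue
        hh, mm, actor, message = match.groups()
        entries.append((time(int(hh), int(mm)), actor, message))
    return entries

def filter_entries(entries, start, end, keyword=""):
    """Keep entries within [start, end] whose message contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [e for e in entries if start <= e[0] <= end and needle in e[2].lower()]

sample = [
    "18:01 bsitzmann@deploy1001: Finished deploy [mobileapps/deploy@b6bff94]: Update mobileapps to 3c73ca3 (duration: 06m 06s)",
    "17:56 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.23/languages/LanguageConverter.php: languages: Don't assume in LanguageConverter (T235360) (duration: 01m 07s)",
    "17:52 Amir1: warming up cache for Q70M to Q80M for new term store on db1111, db1126, db1104, db1092 (T219123)",
    "17:10 jynus: purging some old rows on pc1010 on a screen to earn some time T247788",
]

hits = filter_entries(parse_sal(sample), time(17, 0), time(20, 0), "languageconverter")
for when, actor, message in hits:
    print(when.strftime("%H:%M"), actor, message)
```

A keyword match only narrows the list; whether any matching entry is actually related to the slowdown still needs to be checked by hand.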

There's some Wikibase activity in that window. Maybe they are using the RESTBase language converter endpoint?