Page MenuHomePhabricator

1.34.0-wmf.21 cause termbox to emit: Test get rendered termbox returned the unexpected status 500
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error message
PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL:
Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
Impact

I guess some termbox are not rendered for wikidata.org

Notes

Happened after deployment of 1.34.0-wmf.21 with the Icinga alarm going on and off.

Repro:

deploy1001:~$ service-checker-swagger -t 15 termbox.svc.codfw.wmnet http://termbox.svc.codfw.wmnet:3030
/termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200)

Though sometime we have All endpoints are healthy. So it is going on and off :/

Logs via type:wikibase-termbox: https://logstash.wikimedia.org/goto/3146ff65c2320362e12501bb1fe55789

Event Timeline

Change 534514 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/mediawiki-config@master] Rollback wikidata to 1.34.0-wmf.20

https://gerrit.wikimedia.org/r/534514

Note that the alarm keeps flapping since we also have: RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy

Change 534514 merged by jenkins-bot:
[operations/mediawiki-config@master] Rollback wikidata to 1.34.0-wmf.20

https://gerrit.wikimedia.org/r/534514

Mentioned in SAL (#wikimedia-operations) [2019-09-04T19:35:50Z] <hashar@deploy1001> rebuilt and synchronized wikiversions files: rollback wikidatawiki to 1.34.0-wmf.20 for T232035 - T220746

hashar triaged this task as Unbreak Now! priority.Sep 4 2019, 7:58 PM

I guess I will poke wmde/serviceops tomorrow morning to figure out what went bad. Logstash hasn't many details beside host(s) like termbox-production-XXX-YYY emitting:

timeout of 3000ms exceeded

 ? request_headers_Accept 	
	application/json, text/plain, */*
? request_headers_Host 	
	www.wikidata.org
? request_headers_User-Agent 	
	wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.18.1
? request_params_format 	
	json
? request_params_id 	
	Q1
? request_params_revision 	
	103
? request_params_title 	
	Special:EntityData
t request_url 	
	http://api-ro.discovery.wmnet/w/index.php

t url
  /termbox?preferredLanguages=de%7Cen&entity=Q1&editLink=%2Fedit%2FQ1347&language=de&revision=103

Which I guess would be a request made to https://www.wikidata.org/w/index.php?title=Special:EntityData&format=json&id=Q1&revision=103

Anyway, after the rollback to 1.34.0-wmf.20, the alarm is gone.

I think this is probably the same as T229313. We suspected it might be related to T231011; perhaps the new train makes this problem more pronounced

I think this is probably the same as T229313. We suspected it might be related to T231011; perhaps the new train makes this problem more pronounced

Those slowdowns have gone away, and occasional timeouts you've seen before can't be the same problem as the one here, which seems to be release-related.

AIUI, .21 is probably causing a perf degradation on wikidata and this should be properly investigated.

Mentioned in SAL (#wikimedia-operations) [2019-09-05T07:56:13Z] <hashar> Switching "wikidatawiki" on mwdebug1001 to 1.34.0-wmf.21 by editing /srv/mediawiki/wikiversions.php # T232035

Some debugging thingies, reaching out to https://www.wikidata.org/w/index.php?title=Special:EntityData&format=json&id=Q1&revision=103 via mwdebug1001.eqiad.wmnet:

time curl -H 'Host: www.wikidata.org' 'http://mwdebug1001.eqiad.wmnet/w/index.php?title=Special:EntityData&format=json&id=Q1&revision=103'

With 1.34.0-wmf.20, first request took 1.6s, second, 106ms.

On mwdebug1001 I then edited /srv/mediawiki/wikiversions.php to bump wikidatawiki. The above query is still fast, 90ms - 126ms.

I tried the same from codfw's deployment host and saw the same. No obvious difference between api-ro and mwdebug1001 in the time to respond.

Change 534578 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/mediawiki-config@master] Promote wikidata to 1.34.0-wmf.21

https://gerrit.wikimedia.org/r/534578

Change 534578 merged by jenkins-bot:
[operations/mediawiki-config@master] Promote wikidata to 1.34.0-wmf.21

https://gerrit.wikimedia.org/r/534578

Mentioned in SAL (#wikimedia-operations) [2019-09-05T08:21:35Z] <hashar@deploy1001> rebuilt and synchronized wikiversions files: Promote wikidatawiki to 1.34.0-wmf.21 for T232035 - T220746

Promoted wikidatawiki again and I ran the service checker again on deploy1001.eqiad.wmnet using:

time service-checker-swagger -t 15 termbox.svc.codfw.wmnet http://termbox.svc.codfw.wmnet:3030
All endpoints are healthy

Query takes roughly 750ms - 1000 ms and sometime take slightly more than 3 seconds which cause the service to be flagged as CRITICAL.

Note that when the service is flagged as critical, time reports 3600 ms. So I guess roughly 600ms of processing + the 3 second timeout.

service-checker-swagger does not give any detail as to what it is testing. I could not find any mediawiki logs that could hint at why the queries are slow :-\

Change 534583 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/mediawiki-config@master] Rollback wikidata to 1.34.0-wmf.20

https://gerrit.wikimedia.org/r/534583

Looking at icinga_contact.raw: "irc", Termbox over 15 days https://logstash.wikimedia.org/goto/336ca058ad55c63297324fd699fd7b83 That shows alarms every time we deploy. So most probably there is a weird issue with hitting cold caches due to the new wmf version :-^

Change 534583 merged by jenkins-bot:
[operations/mediawiki-config@master] Rollback wikidata to 1.34.0-wmf.20

https://gerrit.wikimedia.org/r/534583

So the real issue was:

  • termbox correctly uses the api-ro.discovery.wmnet host
  • the discovery record was incorrectly set to active-active
  • so requests from termbox would just go to the nearest dc, meaning that in codfw it would face super-cold caches after every release
  • as a consequence, some requests would time out because of the cold caches at all levels

Mentioned in SAL (#wikimedia-operations) [2019-09-05T08:55:26Z] <hashar@deploy1001> rebuilt and synchronized wikiversions files: Rollback wikidatawiki to 1.34.0-wmf.20 for T232035

Rolled back to 1.34.0-wmf.20. The latency is similar ~ 750ms

https://grafana.wikimedia.org/d/AJf0z_7Wz/termbox on codfw datacenter shows some errors. So all indications seem to aim at queries hitting codfw and thus cold caches which cause some queries to be slower than expected. It is unknown why the behavior changed though, most probably a change in mediawiki/core caches handling ? :-\

Change 534585 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/mediawiki-config@master] Wikidata back to 1.34.0-wmf.21

https://gerrit.wikimedia.org/r/534585

Change 534585 merged by jenkins-bot:
[operations/mediawiki-config@master] Wikidata back to 1.34.0-wmf.21

https://gerrit.wikimedia.org/r/534585

hashar claimed this task.

Fixed with the help of @Tarrow and @Joe

Apparently due to api-ro queries hitting codfw which has cold cache and thus cause queries to be slow / exceed the timeout whenever a new wmf version has been deployed. We most probably have hit the issue on every single train deployment but this week was the first time investigating.

Probably fixed for real by:

08:38 	<oblivian@puppetmaster1001> 	conftool action : set/pooled=false; selector: dnsdisc=a.*-ro,name=codfw

Which make the requests to api-ro go to the live dc

Mentioned in SAL (#wikimedia-operations) [2019-09-05T09:05:54Z] <hashar@deploy1001> rebuilt and synchronized wikiversions files: Promote wikidatawiki to 1.34.0-wmf.21 for T232035 - T220746

DannyS712 subscribed.

[batch] remove patch for review tag from resolved tasks