Maniphest T232035

1.34.0-wmf.21 cause termbox to emit: Test get rendered termbox returned the unexpected status 500
Closed, ResolvedPublicPRODUCTION ERROR
Actions

Description

Error message

PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL:
Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service

Impact

I guess some termbox are not rendered for wikidata.org

Notes

Happened after deployment of 1.34.0-wmf.21 with the Icinga alarm going on and off.

Repro:

deploy1001:~$ service-checker-swagger -t 15 termbox.svc.codfw.wmnet http://termbox.svc.codfw.wmnet:3030
/termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200)

Though sometime we have All endpoints are healthy. So it is going on and off :/

Logs via type:wikibase-termbox: https://logstash.wikimedia.org/goto/3146ff65c2320362e12501bb1fe55789

Details

Stack Trace

Subject	Repo	Branch	Lines +/-
Wikidata back to 1.34.0-wmf.21	operations/mediawiki-config	master	+1 -1
Rollback wikidata to 1.34.0-wmf.20	operations/mediawiki-config	master	+1 -1
Promote wikidata to 1.34.0-wmf.21	operations/mediawiki-config	master	+1 -1
Rollback wikidata to 1.34.0-wmf.20	operations/mediawiki-config	master	+1 -1

Customize query in gerrit

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Resolved	Release	hashar	T220746 1.34.0-wmf.21 deployment blockers
		Resolved	PRODUCTION ERROR	hashar	T232035 1.34.0-wmf.21 cause termbox to emit: Test get rendered termbox returned the unexpected status 500

Event Timeline

hashar created this task.Sep 4 2019, 7:27 PM

Restricted Application removed a project: Patch-For-Review. · View Herald TranscriptSep 4 2019, 7:27 PM

hashar added projects: Wikidata-Termbox, Wikidata.Sep 4 2019, 7:28 PM

Change 534514 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/mediawiki-config@master] Rollback wikidata to 1.34.0-wmf.20

https://gerrit.wikimedia.org/r/534514

gerritbot added a project: Patch-For-Review.Sep 4 2019, 7:30 PM

Note that the alarm keeps flapping since we also have: RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy

hashar updated the task description. (Show Details)Sep 4 2019, 7:32 PM

Change 534514 merged by jenkins-bot:
[operations/mediawiki-config@master] Rollback wikidata to 1.34.0-wmf.20

https://gerrit.wikimedia.org/r/534514

Mentioned in SAL (#wikimedia-operations) [2019-09-04T19:35:50Z] <hashar@deploy1001> rebuilt and synchronized wikiversions files: rollback wikidatawiki to 1.34.0-wmf.20 for T232035 - T220746

Stashbot mentioned this in T220746: 1.34.0-wmf.21 deployment blockers.Sep 4 2019, 7:35 PM

hashar edited projects, added serviceops; removed Patch-For-Review.Sep 4 2019, 7:52 PM

I guess I will poke wmde/serviceops tomorrow morning to figure out what went bad. Logstash hasn't many details beside host(s) like termbox-production-XXX-YYY emitting:

timeout of 3000ms exceeded

 ? request_headers_Accept 	
	application/json, text/plain, */*
? request_headers_Host 	
	www.wikidata.org
? request_headers_User-Agent 	
	wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.18.1
? request_params_format 	
	json
? request_params_id 	
	Q1
? request_params_revision 	
	103
? request_params_title 	
	Special:EntityData
t request_url 	
	http://api-ro.discovery.wmnet/w/index.php

t url
  /termbox?preferredLanguages=de%7Cen&entity=Q1&editLink=%2Fedit%2FQ1347&language=de&revision=103

Which I guess would be a request made to https://www.wikidata.org/w/index.php?title=Special:EntityData&format=json&id=Q1&revision=103

Anyway, after the rollback to 1.34.0-wmf.20, the alarm is gone.

Restricted Application added a subscriber: Liuxinyu970226. · View Herald TranscriptSep 4 2019, 7:58 PM

Ladsgroup edited projects, added Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)); removed Release-Engineering-Team (Deployment services).Sep 4 2019, 9:12 PM

Ladsgroup added subscribers: Lydia_Pintscher, • alaa_wmde, Tarrow and 3 others.

The link to logs: https://logstash.wikimedia.org/goto/3146ff65c2320362e12501bb1fe55789

Agusbou2015 subscribed.Sep 4 2019, 10:57 PM

hashar updated the task description. (Show Details)Sep 5 2019, 6:40 AM

I think this is probably the same as T229313. We suspected it might be related to T231011; perhaps the new train makes this problem more pronounced

• Matthias_Geisler_WMDE subscribed.Sep 5 2019, 7:28 AM

In T232035#5467309, @Tarrow wrote:

I think this is probably the same as T229313. We suspected it might be related to T231011; perhaps the new train makes this problem more pronounced

Those slowdowns have gone away, and occasional timeouts you've seen before can't be the same problem as the one here, which seems to be release-related.

AIUI, .21 is probably causing a perf degradation on wikidata and this should be properly investigated.

Mentioned in SAL (#wikimedia-operations) [2019-09-05T07:56:13Z] <hashar> Switching "wikidatawiki" on mwdebug1001 to 1.34.0-wmf.21 by editing /srv/mediawiki/wikiversions.php # T232035

Some debugging thingies, reaching out to https://www.wikidata.org/w/index.php?title=Special:EntityData&format=json&id=Q1&revision=103 via mwdebug1001.eqiad.wmnet:

time curl -H 'Host: www.wikidata.org' 'http://mwdebug1001.eqiad.wmnet/w/index.php?title=Special:EntityData&format=json&id=Q1&revision=103'

With 1.34.0-wmf.20, first request took 1.6s, second, 106ms.

On mwdebug1001 I then edited /srv/mediawiki/wikiversions.php to bump wikidatawiki. The above query is still fast, 90ms - 126ms.

I tried the same from codfw's deployment host and saw the same. No obvious difference between api-ro and mwdebug1001 in the time to respond.

Change 534578 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/mediawiki-config@master] Promote wikidata to 1.34.0-wmf.21

https://gerrit.wikimedia.org/r/534578

gerritbot added a project: Patch-For-Review.Sep 5 2019, 8:13 AM

Change 534578 merged by jenkins-bot:
[operations/mediawiki-config@master] Promote wikidata to 1.34.0-wmf.21

https://gerrit.wikimedia.org/r/534578

Mentioned in SAL (#wikimedia-operations) [2019-09-05T08:21:35Z] <hashar@deploy1001> rebuilt and synchronized wikiversions files: Promote wikidatawiki to 1.34.0-wmf.21 for T232035 - T220746

Promoted wikidatawiki again and I ran the service checker again on deploy1001.eqiad.wmnet using:

time service-checker-swagger -t 15 termbox.svc.codfw.wmnet http://termbox.svc.codfw.wmnet:3030
All endpoints are healthy

Query takes roughly 750ms - 1000 ms and sometime take slightly more than 3 seconds which cause the service to be flagged as CRITICAL.

Note that when the service is flagged as critical, time reports 3600 ms. So I guess roughly 600ms of processing + the 3 second timeout.

service-checker-swagger does not give any detail as to what it is testing. I could not find any mediawiki logs that could hint at why the queries are slow :-\

Change 534583 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/mediawiki-config@master] Rollback wikidata to 1.34.0-wmf.20

https://gerrit.wikimedia.org/r/534583

Looking at icinga_contact.raw: "irc", Termbox over 15 days https://logstash.wikimedia.org/goto/336ca058ad55c63297324fd699fd7b83 That shows alarms every time we deploy. So most probably there is a weird issue with hitting cold caches due to the new wmf version :-^

Change 534583 merged by jenkins-bot:
[operations/mediawiki-config@master] Rollback wikidata to 1.34.0-wmf.20

https://gerrit.wikimedia.org/r/534583

So the real issue was:

termbox correctly uses the api-ro.discovery.wmnet host
the discovery record was incorrectly set to active-active
so requests from termbox would just go to the nearest dc, meaning that in codfw it would face super-cold caches after every release
as a consequence, some requests would time out because of the cold caches at all levels

Mentioned in SAL (#wikimedia-operations) [2019-09-05T08:55:26Z] <hashar@deploy1001> rebuilt and synchronized wikiversions files: Rollback wikidatawiki to 1.34.0-wmf.20 for T232035

Rolled back to 1.34.0-wmf.20. The latency is similar ~ 750ms

https://grafana.wikimedia.org/d/AJf0z_7Wz/termbox on codfw datacenter shows some errors. So all indications seem to aim at queries hitting codfw and thus cold caches which cause some queries to be slower than expected. It is unknown why the behavior changed though, most probably a change in mediawiki/core caches handling ? :-\

Change 534585 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/mediawiki-config@master] Wikidata back to 1.34.0-wmf.21

https://gerrit.wikimedia.org/r/534585

Change 534585 merged by jenkins-bot:
[operations/mediawiki-config@master] Wikidata back to 1.34.0-wmf.21

https://gerrit.wikimedia.org/r/534585

Fixed with the help of @Tarrow and @Joe

Apparently due to api-ro queries hitting codfw which has cold cache and thus cause queries to be slow / exceed the timeout whenever a new wmf version has been deployed. We most probably have hit the issue on every single train deployment but this week was the first time investigating.

Probably fixed for real by:

08:38 	<oblivian@puppetmaster1001> 	conftool action : set/pooled=false; selector: dnsdisc=a.*-ro,name=codfw

Which make the requests to api-ro go to the live dc

Mentioned in SAL (#wikimedia-operations) [2019-09-05T09:05:54Z] <hashar@deploy1001> rebuilt and synchronized wikiversions files: Promote wikidatawiki to 1.34.0-wmf.21 for T232035 - T220746

Jdforrester-WMF moved this task from INBOX to Completed on the Release-Engineering-Team-TODO (201909) board.Sep 20 2019, 8:35 PM

Maintenance_bot moved this task from incoming to in progress on the Wikidata board.Oct 25 2019, 4:13 AM

[batch] remove patch for review tag from resolved tasks

1.34.0-wmf.21 cause termbox to emit: Test get rendered termbox returned the unexpected status 500Closed, ResolvedPublicPRODUCTION ERRORActions