/api/rest_v1/page/pdf/* service unstable
Open, Low, Public

Description

Alert:

[09:47 UTC] <icinga-wm> PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
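
The "11.11% of data above the critical threshold" figure in the alert is consistent with one out of nine sampled datapoints exceeding 1000 reqs/min. A minimal sketch of that arithmetic, assuming the check simply reports the share of datapoints in its window that exceed the threshold; the function name and the sample values are hypothetical and not taken from Graphite:

```python
def percent_above(datapoints, threshold):
    """Return the percentage of datapoints strictly above the threshold."""
    if not datapoints:
        return 0.0
    over = sum(1 for value in datapoints if value > threshold)
    return 100.0 * over / len(datapoints)


# Hypothetical window of nine per-minute 5xx counts; only one exceeds 1000.
window = [120, 95, 110, 130, 1450, 105, 98, 115, 101]
share = percent_above(window, 1000.0)
print(f"{share:.2f}% of data above the critical threshold [1000.0]")
```

Under that assumption, a single spike within the window is enough to trip the alert.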

The top failing URLs at the time of the alert: https://logstash.wikimedia.org/goto/58461347c38952237e54c310e42fa8d4

GET https://es.wikipedia.org/api/rest_v1/page/pdf/Mari_(diosa_vasca)	2,160
GET https://zh.wikipedia.org/api/rest_v1/page/pdf/瓜拉克萨巴	1,029
GET https://es.wikipedia.org/api/rest_v1/page/pdf/Marianne_Jean-Baptiste	1,013
GET https://ja.wikipedia.org/api/rest_v1/page/pdf/カウナス・モスク	1,011
GET https://ru.wikipedia.org/api/rest_v1/page/pdf/Ftp_(программа)	1,005
GET https://es.wikipedia.org/api/rest_v1/page/pdf/Francisco_Herboso_España	1,003
GET https://ja.wikipedia.org/api/rest_v1/page/pdf/山口俊一	830
GET https://ru.wikipedia.org/api/rest_v1/page/pdf/Лескен_(Северная_Осетия)	748
GET https://ar.wikipedia.org/api/rest_v1/page/pdf/بريلة	610
GET https://pl.wikipedia.org/w/api.php?ucuser=Paweł Ziemian BOT&maxlag=10&uclimit=1&format=json&action=query&rawcontinue=&list=usercontribs&ucprop=ids|title|timestamp|comment|flags

There was a mix of HTTP responses with 500 and 503 codes.
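
To see how the failures are distributed, the counts listed above can be tallied per wiki and per endpoint. This is only a sketch: the (URL, count) pairs are copied from the list, the final `api.php` entry is left out because it has no count in the source, and the helper name `tally_by_host` is hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

PDF_ROUTE = "/api/rest_v1/page/pdf/"

# (URL, failure count) pairs copied from the top-failing list above.
top_failing = [
    ("https://es.wikipedia.org/api/rest_v1/page/pdf/Mari_(diosa_vasca)", 2160),
    ("https://zh.wikipedia.org/api/rest_v1/page/pdf/瓜拉克萨巴", 1029),
    ("https://es.wikipedia.org/api/rest_v1/page/pdf/Marianne_Jean-Baptiste", 1013),
    ("https://ja.wikipedia.org/api/rest_v1/page/pdf/カウナス・モスク", 1011),
    ("https://ru.wikipedia.org/api/rest_v1/page/pdf/Ftp_(программа)", 1005),
    ("https://es.wikipedia.org/api/rest_v1/page/pdf/Francisco_Herboso_España", 1003),
    ("https://ja.wikipedia.org/api/rest_v1/page/pdf/山口俊一", 830),
    ("https://ru.wikipedia.org/api/rest_v1/page/pdf/Лескен_(Северная_Осетия)", 748),
    ("https://ar.wikipedia.org/api/rest_v1/page/pdf/بريلة", 610),
]


def tally_by_host(entries):
    """Sum failure counts per hostname."""
    totals = Counter()
    for url, count in entries:
        totals[urlsplit(url).hostname] += count
    return totals


by_host = tally_by_host(top_failing)
pdf_total = sum(count for url, count in top_failing if PDF_ROUTE in url)

for host, count in by_host.most_common():
    print(f"{host}\t{count}")
print(f"PDF endpoint total\t{pdf_total}")
```

Every counted entry in the list hits the PDF rendering route, which points at the service behind `/api/rest_v1/page/pdf/` rather than at any single wiki.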

jcrespo created this task. Tue, Jan 8, 10:13 AM
Restricted Application added subscribers: Cosine02, Aklapper. Tue, Jan 8, 10:13 AM

Mentioned in SAL (#wikimedia-operations) [2019-01-08T11:33:03Z] <mobrovac@deploy1001> Started restart [electron-render/deploy@94d27d7]: Electron strugling, restart - T213154

This is a known and recurring issue where the electron service fails to respond to requests in time. I have restarted it, as this usually helps, but in the long run we will be replacing it with Proton (which should happen this quarter).

mobrovac triaged this task as Low priority. Tue, Jan 8, 11:37 AM

If this is a known, ongoing issue with a service that is in the process of being decommissioned, you can close this ticket; there is no reason to keep it open. But I would suggest sending an email to ops@ linking to the above comment and saying so. I didn't know this, and probably other people didn't either, yet the issue sends alerts to Icinga.