
/api/rest_v1/page/pdf/* service unstable
Closed, Resolved, Public

Description

Alert:

[09:47 UTC] <icinga-wm> PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
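One way to read the percentage in this alert is as the share of recent datapoints that exceeded the critical threshold. The sketch below assumes that interpretation; the alert text itself does not confirm how the check computes it. The `percent_above` helper and the sample values are hypothetical, chosen so that one of nine per-minute samples exceeds 1000, which gives the 11.11% reported above.

```python
def percent_above(samples, threshold):
    """Return the percentage of non-null samples strictly above threshold."""
    # Drop missing datapoints (None) before counting.
    valid = [s for s in samples if s is not None]
    if not valid:
        return 0.0
    above = sum(1 for s in valid if s > threshold)
    return 100.0 * above / len(valid)


# Hypothetical 5xx requests/min values (not taken from the real graph);
# only one of the nine samples exceeds the critical threshold of 1000.
samples = [120, 95, 140, 110, 2300, 130, 105, 98, 115]
pct = percent_above(samples, 1000.0)
print(f"{pct:.2f}% of data above the critical threshold [1000.0]")
```

Under this reading, a single short spike inside the sampling window is enough to trip the alert, which would match a brief burst of failing PDF requests.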

The top failing URLs at the time: https://logstash.wikimedia.org/goto/58461347c38952237e54c310e42fa8d4

GET https://es.wikipedia.org/api/rest_v1/page/pdf/Mari_(diosa_vasca)	2,160
GET https://zh.wikipedia.org/api/rest_v1/page/pdf/瓜拉克萨巴	1,029
GET https://es.wikipedia.org/api/rest_v1/page/pdf/Marianne_Jean-Baptiste	1,013
GET https://ja.wikipedia.org/api/rest_v1/page/pdf/カウナス・モスク	1,011
GET https://ru.wikipedia.org/api/rest_v1/page/pdf/Ftp_(программа)	1,005
GET https://es.wikipedia.org/api/rest_v1/page/pdf/Francisco_Herboso_España	1,003
GET https://ja.wikipedia.org/api/rest_v1/page/pdf/山口俊一	830
GET https://ru.wikipedia.org/api/rest_v1/page/pdf/Лескен_(Северная_Осетия)	748
GET https://ar.wikipedia.org/api/rest_v1/page/pdf/بريلة	610
GET https://pl.wikipedia.org/w/api.php?ucuser=Paweł Ziemian BOT&maxlag=10&uclimit=1&format=json&action=query&rawcontinue=&list=usercontribs&ucprop=ids|title|timestamp|comment|flags

The failing responses were a mix of HTTP 500 and 503 codes.

Event Timeline

jcrespo created this task. Jan 8 2019, 10:13 AM
Restricted Application added subscribers: Cosine02, Aklapper. Jan 8 2019, 10:13 AM

Mentioned in SAL (#wikimedia-operations) [2019-01-08T11:33:03Z] <mobrovac@deploy1001> Started restart [electron-render/deploy@94d27d7]: Electron strugling, restart - T213154

This is a known and recurring issue where the electron service fails to respond to requests in time. I have restarted it, as this usually helps, but in the long run we will be replacing it with Proton (which should happen this quarter).

mobrovac triaged this task as Low priority. Jan 8 2019, 11:37 AM

If this is a known, ongoing issue with a service that is already being decommissioned, you can close this ticket; there is no reason to keep it open. But I would suggest sending an email to ops@ linking to the above comment and saying so (I didn't know this, and probably other people didn't either, yet it sends alerts to icinga).

Pchelolo closed this task as Resolved. Jun 24 2019, 10:43 PM
Pchelolo claimed this task.
Pchelolo added a subscriber: Pchelolo.

The service backing this feature was swapped for a completely new one. No reason to have this task anymore.