electron/pdfrender hangs
Open, High, Public

Description

This task is a fork of T159922. We use it to keep track of pdfrender service hangs in production.

mobrovac created this task. Sep 4 2017, 11:04 AM
Restricted Application added a subscriber: Aklapper. Sep 4 2017, 11:04 AM

Mentioned in SAL (#wikimedia-operations) [2017-09-13T17:55:05Z] <gwicke> rolling restart of pdfrender service in eqiad after hang T174916 T172815

Mentioned in SAL (#wikimedia-operations) [2017-10-12T15:17:27Z] <mobrovac@tin> Started restart [electron-render/deploy@8dd5f13]: Electron hanging - T174916

Mentioned in SAL (#wikimedia-operations) [2017-10-15T11:15:54Z] <mobrovac@tin> Started restart [electron-render/deploy@8dd5f13]: Electron hanging - T174916

Mentioned in SAL (#wikimedia-operations) [2017-10-16T11:10:18Z] <mobrovac@tin> Started restart [electron-render/deploy@8dd5f13]: Electron hanging - T174916

Mentioned in SAL (#wikimedia-operations) [2017-10-19T10:07:51Z] <mobrovac@tin> Started restart [electron-render/deploy@8dd5f13]: Electron hanging - T174916

Mentioned in SAL (#wikimedia-operations) [2017-11-06T12:43:03Z] <mobrovac@tin> Started restart [electron-render/deploy@8dd5f13]: Electron stuck, restarting - T174916

Mentioned in SAL (#wikimedia-operations) [2017-11-07T07:18:02Z] <mobrovac@tin> Started restart [electron-render/deploy@8dd5f13]: Electron stuck, restarting - T174916

Mentioned in SAL (#wikimedia-operations) [2017-11-19T07:01:33Z] <mobrovac@tin> Started restart [electron-render/deploy@8dd5f13]: electron stuck - T174916

Mentioned in SAL (#wikimedia-operations) [2017-11-21T14:08:04Z] <mobrovac@tin> Started restart [electron-render/deploy@8dd5f13]: electron stuck - T174916

Mentioned in SAL (#wikimedia-operations) [2017-12-12T21:12:01Z] <mobrovac@tin> Started restart [electron-render/deploy@94d27d7]: Electron hanging - T174916

pdfrender on all eqiad hosts required restarts tonight (UTC), see SAL. Thanks @madhuvishy for taking care of it.

Mentioned in SAL (#wikimedia-operations) [2017-12-27T15:54:30Z] <mobrovac@tin> Started restart [electron-render/deploy@94d27d7]: Bounce Electron, stuck - T174916

Mentioned in SAL (#wikimedia-operations) [2018-02-26T17:57:54Z] <mobrovac@tin> Started restart [electron-render/deploy@94d27d7]: Stuck, restart - T174916

Restarted pdfrender today on scb1001, scb1002 and scb1004. Best guess is a bad render, because they all went down within the same few minutes, around 4:20 am:

Apr 02 04:20:39 scb1002 pdfrender[12485]: [2018-04-02T04:20:39.076Z] -@208.80.155.119 - GET 200 / 72 "check_http/v2.1.1 (monitoring-plugins 2.1.1)" 0.294 ms

It would be nice to know what jobs cause this hang.
It would also be nice to have a timeout on jobs so that they can be shot after some number of minutes.
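
To illustrate the two wishes above, here is a minimal sketch of a per-job timeout that also logs which job had to be killed. It is not how pdfrender is implemented: it assumes each job can run as its own child process, and the function name, job IDs, and stand-in command are all hypothetical.

```python
import logging
import subprocess
import sys

log = logging.getLogger("render_jobs")


def run_job_with_timeout(job_id, cmd, timeout_s):
    """Run one job as a child process and kill it if it exceeds timeout_s.

    Hypothetical helper for illustration only.
    Returns True if the job exited with status 0 in time, False otherwise.
    """
    log.info("job %s started: %s", job_id, cmd)
    try:
        # When the timeout expires, subprocess.run kills the child,
        # waits for it, and then raises TimeoutExpired.
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # This log line is what would identify the job that hung.
        log.error("job %s killed after %s s: %s", job_id, timeout_s, cmd)
        return False
    log.info("job %s exited with status %d", job_id, result.returncode)
    return result.returncode == 0


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Stand-in for a stuck render: a child that sleeps far past the limit.
    stuck = [sys.executable, "-c", "import time; time.sleep(60)"]
    run_job_with_timeout("example-1", stuck, timeout_s=1)
```

With something along these lines, a single stuck job would be killed and logged instead of requiring a restart of the whole service.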

Mentioned in SAL (#wikimedia-operations) [2018-04-13T07:52:50Z] <mobrovac@tin> Started restart [electron-render/deploy@94d27d7]: Kick Electron, hanging - T174916

Mentioned in SAL (#wikimedia-operations) [2018-04-13T07:58:04Z] <mobrovac@tin> Started restart [electron-render/deploy@94d27d7]: Kick Electron, hanging, take 2 - T174916

There was instability on most or all scb1* pdfrender services from 16:32 to 16:45. It stopped after a manual restart.

Mentioned in SAL (#wikimedia-operations) [2018-09-10T07:18:17Z] <volans> restarted pdfrender on scb2004 - T174916

Mentioned in SAL (#wikimedia-operations) [2018-11-08T01:18:44Z] <mutante> scb1004 - systemctl restart pdfrender (T174916)