
electron/pdfrender hangs
Closed, Resolved (Public)

Description

This task is a fork of T159922. It is used to keep track of pdfrender service hangs during production use.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2017-09-13T17:55:05Z] <gwicke> rolling restart of pdfrender service in equiad after hang T174916 T172815

Mentioned in SAL (#wikimedia-operations) [2017-10-12T15:17:27Z] <mobrovac@tin> Started restart [electron-render/deploy@8dd5f13]: Electron hanging - T174916

Mentioned in SAL (#wikimedia-operations) [2017-10-15T11:15:54Z] <mobrovac@tin> Started restart [electron-render/deploy@8dd5f13]: Electron hanging - T174916

Mentioned in SAL (#wikimedia-operations) [2017-10-16T11:10:18Z] <mobrovac@tin> Started restart [electron-render/deploy@8dd5f13]: Electron hanging - T174916

Mentioned in SAL (#wikimedia-operations) [2017-10-19T10:07:51Z] <mobrovac@tin> Started restart [electron-render/deploy@8dd5f13]: Electron hanging - T174916

Mentioned in SAL (#wikimedia-operations) [2017-11-06T12:43:03Z] <mobrovac@tin> Started restart [electron-render/deploy@8dd5f13]: Electron stuck, restrting - T174916

Mentioned in SAL (#wikimedia-operations) [2017-11-07T07:18:02Z] <mobrovac@tin> Started restart [electron-render/deploy@8dd5f13]: Electron stuck, restarting - T174916

Mentioned in SAL (#wikimedia-operations) [2017-11-19T07:01:33Z] <mobrovac@tin> Started restart [electron-render/deploy@8dd5f13]: electron stuck - T174916

Mentioned in SAL (#wikimedia-operations) [2017-11-21T14:08:04Z] <mobrovac@tin> Started restart [electron-render/deploy@8dd5f13]: electron stuck - T174916

Mentioned in SAL (#wikimedia-operations) [2017-12-12T21:12:01Z] <mobrovac@tin> Started restart [electron-render/deploy@94d27d7]: Electron hanging - T174916

pdfrender on all eqiad hosts required restarts tonight (UTC), see SAL. Thanks @madhuvishy for taking care of it.

Mentioned in SAL (#wikimedia-operations) [2017-12-27T15:54:30Z] <mobrovac@tin> Started restart [electron-render/deploy@94d27d7]: Bounce Electron, stuck - T174916

Mentioned in SAL (#wikimedia-operations) [2018-02-26T17:57:54Z] <mobrovac@tin> Started restart [electron-render/deploy@94d27d7]: Stuck, restart - T174916

Restarted today on scb1001, scb1002, and scb1004. Best guess is a bad render, because they all went down within the same few minutes, around 4:20 am.

Apr 02 04:20:39 scb1002 pdfrender[12485]: [2018-04-02T04:20:39.076Z] -@208.80.155.119 - GET 200 / 72 "check_http/v2.1.1 (monitoring-plugins 2.1.1)" 0.294 ms

It would be nice to know what jobs cause this hang.
It would also be nice to have a timeout on jobs so that they can be shot after some number of minutes.
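
For illustration only, here is a rough sketch of what such a per-job timeout could look like, assuming each render job runs as its own child process. This is not the pdfrender/electron-render code; `run_job_with_timeout`, the job id, and the commands are all hypothetical. It also logs which job was killed, which would help with the first point.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_job_with_timeout(job_id: str, cmd: Sequence[str], timeout_s: float) -> Optional[int]:
    """Run one job as a child process and kill it if it exceeds timeout_s.

    Returns the exit code, or None if the job was killed for running too long.
    The name, arguments, and log format are hypothetical, not the service's API.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed and reaped the child here.
        # Logging the job id and command records which job caused the hang.
        print(
            f"job {job_id} exceeded {timeout_s}s and was killed: {list(cmd)}",
            file=sys.stderr,
        )
        return None
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for a stuck render: a child process that sleeps too long.
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_job_with_timeout("example-job", stuck, timeout_s=1.0))
```

Note that this only kills the direct child. A renderer that spawns helper processes of its own would need process-group or cgroup-level cleanup, which is left out here.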

Mentioned in SAL (#wikimedia-operations) [2018-04-13T07:52:50Z] <mobrovac@tin> Started restart [electron-render/deploy@94d27d7]: Kick Electron, hanging - T174916

Mentioned in SAL (#wikimedia-operations) [2018-04-13T07:58:04Z] <mobrovac@tin> Started restart [electron-render/deploy@94d27d7]: Kick Electron, hanging, take 2 - T174916

There was instability on most or all scb1* pdfrender services from 16:32 to 16:45; it stopped after a manual restart.

Mentioned in SAL (#wikimedia-operations) [2018-09-10T07:18:17Z] <volans> restarted pdfrender on scb2004 - T174916

Mentioned in SAL (#wikimedia-operations) [2018-11-08T01:18:44Z] <mutante> scb1004 - systemctl restart pdfrender (T174916)

Just a note, the service was flapping for a while, and I have restarted it on scb1004.

Screenshot from 2019-03-27 19-52-53.png (341×903 px, 23 KB)

Additional follow-up: There were numerous OOMs in the log, even though the box has around 20 GB of free RAM +/- buffers. I'm not sure if there's a service that spikes that high or if it's the slice that's causing the OOM, but it's an interesting data point.

Mentioned in SAL (#wikimedia-operations) [2019-04-01T22:52:45Z] <XioNoX> restart pdfrender on scb1003 - T174916

Mentioned in SAL (#wikimedia-operations) [2019-04-04T00:40:33Z] <chaomodus> restart pdfrender on scb1003 - T174916

Mentioned in SAL (#wikimedia-operations) [2019-04-05T22:46:11Z] <chaomodus> restarted pdfrender on scb1002 T174916

Mentioned in SAL (#wikimedia-operations) [2019-04-08T18:24:30Z] <mobrovac> restart pdfrender on scb2001 - T174916

Mentioned in SAL (#wikimedia-operations) [2019-05-07T09:47:18Z] <jbond42> restart pdfrender on scb1004 - T174916

Pchelolo claimed this task.
Pchelolo added a subscriber: Pchelolo.

Electron is not used anymore, closing.