
pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003
Closed, Resolved, Public

Description

On March 8 at 00:30:32 UTC the pdfrender service on scb1003 got a SIGTERM (from systemd) and shut down gracefully. This was part of a normal deploy and is documented in SAL at https://tools.wmflabs.org/sal/production?d=2017-03-08.

However, after being started normally, the process is not listening on its TCP port and icinga has alerted on it.

connect to address 10.64.32.153 and port 5252: Connection refused

The processes seem to be running fine according to systemctl status pdfrender

* pdfrender.service - "pdfrender service"
   Loaded: loaded (/lib/systemd/system/pdfrender.service; enabled)
   Active: active (running) since Wed 2017-03-08 00:30:33 UTC; 8h ago
 Main PID: 30726 (firejail)
   CGroup: /system.slice/pdfrender.service
           |-30726 /usr/bin/firejail --profile=/etc/firejail/pdfrender.profile /usr/bin/nodejs /srv/deployment/electron-render/deploy/src...
           |-30728 /usr/bin/python /usr/bin/xpra start :427 --no-daemon
           |-30731 Xorg-for-Xpra-:427 -dpi 96 -noreset -nolisten tcp +extension GLX +extension RANDR +extension RENDER -logfile /home/pdf...
           |-30805 /usr/bin/firejail --profile=/etc/firejail/pdfrender.profile /usr/bin/nodejs /srv/deployment/electron-render/deploy/src...
           |-30806 /usr/bin/firejail --profile=/etc/firejail/pdfrender.profile /usr/bin/nodejs /srv/deployment/electron-render/deploy/src...
           |-30813 /usr/bin/nodejs /srv/deployment/electron-render/deploy/src/bin/electron-render-service.js
           |-30819 /srv/deployment/electron-render/deploy-cache/revs/5ec56146bd70cd14c62a42cfd5a9d1ae4e58c14d/node_modules/electron-prebu...
           |-30852 /srv/deployment/electron-render/deploy-cache/revs/5ec56146bd70cd14c62a42cfd5a9d1ae4e58c14d/node_modules/electron-prebu...
           `-30882 /srv/deployment/electron-render/deploy-cache/revs/5ec56146bd70cd14c62a42cfd5a9d1ae4e58c14d/node_modules/electron-prebu...

journal has some logs (maybe helpful, maybe not). They are in reverse chronological order btw (journalctl -ru output)

Mar 08 00:30:35 scb1003 pdfrender[30726]: AssertionError: display is not set!
Mar 08 00:30:35 scb1003 pdfrender[30726]: File "xpra/x11/bindings/core_bindings.pyx", line 57, in xpra.x11.bindings.core_bindings.X11CoreBin
Mar 08 00:30:35 scb1003 pdfrender[30726]: X11Window = X11WindowBindings()
Mar 08 00:30:35 scb1003 pdfrender[30726]: File "/usr/lib/python2.7/dist-packages/xpra/x11/gtk_x11/prop.py", line 25, in <module>
Mar 08 00:30:35 scb1003 pdfrender[30726]: from xpra.x11.gtk_x11.prop import prop_get, prop_set
Mar 08 00:30:35 scb1003 pdfrender[30726]: File "/usr/lib/python2.7/dist-packages/xpra/client/gtk_base/gtk_client_window_base.py", line 33, i
Mar 08 00:30:35 scb1003 pdfrender[30726]: from xpra.client.gtk_base.gtk_client_window_base import GTKClientWindowBase, HAS_X11_BINDINGS
Mar 08 00:30:35 scb1003 pdfrender[30726]: File "/usr/lib/python2.7/dist-packages/xpra/client/gtk2/gtk2_window_base.py", line 15, in <module>
Mar 08 00:30:35 scb1003 pdfrender[30726]: from xpra.client.gtk2.gtk2_window_base import GTK2WindowBase
Mar 08 00:30:35 scb1003 pdfrender[30726]: File "/usr/lib/python2.7/dist-packages/xpra/client/gtk2/client_window.py", line 9, in <module>
Mar 08 00:30:35 scb1003 pdfrender[30726]: from xpra.client.gtk2.client_window import ClientWindow
Mar 08 00:30:35 scb1003 pdfrender[30726]: File "/usr/lib/python2.7/dist-packages/xpra/client/gtk2/border_client_window.py", line 10, in <mod
Mar 08 00:30:35 scb1003 pdfrender[30726]: from xpra.client.gtk2.border_client_window import BorderClientWindow
Mar 08 00:30:35 scb1003 pdfrender[30726]: File "/usr/lib/python2.7/dist-packages/xpra/client/gtk2/client.py", line 37, in <module>
Mar 08 00:30:35 scb1003 pdfrender[30726]: toolkit_module = __import__(client_module, globals(), locals(), ['XpraClient'])
Mar 08 00:30:35 scb1003 pdfrender[30726]: File "/usr/lib/python2.7/dist-packages/xpra/scripts/main.py", line 1174, in make_client
Mar 08 00:30:35 scb1003 pdfrender[30726]: app = make_client(error_cb, opts)
Mar 08 00:30:35 scb1003 pdfrender[30726]: File "/usr/lib/python2.7/dist-packages/xpra/scripts/main.py", line 1111, in run_client
Mar 08 00:30:35 scb1003 pdfrender[30726]: return run_client(error_cb, options, args, mode)
Mar 08 00:30:35 scb1003 pdfrender[30726]: File "/usr/lib/python2.7/dist-packages/xpra/scripts/main.py", line 761, in run_mode
Mar 08 00:30:35 scb1003 pdfrender[30726]: return run_mode(script_file, err, options, args, mode, defaults)
Mar 08 00:30:35 scb1003 pdfrender[30726]: File "/usr/lib/python2.7/dist-packages/xpra/scripts/main.py", line 103, in main
Mar 08 00:30:35 scb1003 pdfrender[30726]: Traceback (most recent call last):
Mar 08 00:30:35 scb1003 pdfrender[30726]: xpra main error:
Mar 08 00:30:35 scb1003 pdfrender[30726]: 2017-03-08 00:30:35,085 failed load posix keyboard bindings: display is not set!
Mar 08 00:30:34 scb1003 pdfrender[30726]: from xpra.x11.gtk_x11 import gdk_display_source
Mar 08 00:30:34 scb1003 pdfrender[30726]: /usr/lib/python2.7/dist-packages/xpra/client/gtk2/__init__.py:7: GtkWarning: IA__gdk_screen_get_ro
Mar 08 00:30:34 scb1003 pdfrender[30726]: warnings.warn(str(e), _gtk.Warning)
Mar 08 00:30:34 scb1003 pdfrender[30726]: [199B blob data]
Mar 08 00:30:34 scb1003 pdfrender[30726]: Warning: a protocol list is present, the new list "unix,inet,inet6" will not be installed
Mar 08 00:30:34 scb1003 pdfrender[30726]: Reading profile /etc/firejail/disable-passwdmgr.inc
Mar 08 00:30:34 scb1003 pdfrender[30726]: Reading profile /etc/firejail/disable-programs.inc
Mar 08 00:30:34 scb1003 pdfrender[30726]: Reading profile /etc/firejail/disable-common.inc
Mar 08 00:30:34 scb1003 pdfrender[30726]: Reading profile /etc/firejail/default.profile
Mar 08 00:30:34 scb1003 pdfrender[30726]: Reading profile /etc/firejail/pdfrender.profile
Mar 08 00:30:34 scb1003 pdfrender[30726]: 2017-03-08 00:30:34,264 xpra is ready.
Mar 08 00:30:34 scb1003 pdfrender[30726]: 2017-03-08 00:30:34,251 running with pid 30728
Mar 08 00:30:34 scb1003 pdfrender[30726]: 2017-03-08 00:30:34,251 xpra server version 0.14.10 (r7983)
Mar 08 00:30:34 scb1003 pdfrender[30726]: 2017-03-08 00:30:34,246 cannot load dbus helper: No module named dbus
Mar 08 00:30:34 scb1003 pdfrender[30726]: 2017-03-08 00:30:34,168 server uuid is 8102737322954f3d80ea573401b3777e
Mar 08 00:30:33 scb1003 pdfrender[30726]: (==) Using system config directory "/usr/share/X11/xorg.conf.d"
Mar 08 00:30:33 scb1003 pdfrender[30726]: (++) Using config file: "/etc/xpra/xorg.conf"
Mar 08 00:30:33 scb1003 pdfrender[30726]: (++) Log file: "/home/pdfrender/.xpra/Xorg.:427.log", Time: Wed Mar  8 00:30:33 2017
Mar 08 00:30:33 scb1003 pdfrender[30726]: (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
Mar 08 00:30:33 scb1003 pdfrender[30726]: (++) from command line, (!!) notice, (II) informational,
Mar 08 00:30:33 scb1003 pdfrender[30726]: Markers: (--) probed, (**) from config file, (==) default setting,
Mar 08 00:30:33 scb1003 pdfrender[30726]: to make sure that you have the latest version.
Mar 08 00:30:33 scb1003 pdfrender[30726]: Before reporting problems, check http://wiki.x.org
Mar 08 00:30:33 scb1003 pdfrender[30726]: Current version of pixman: 0.32.6
Mar 08 00:30:33 scb1003 pdfrender[30726]: xorg-server 2:1.16.4-1 (http://www.debian.org/support)
Mar 08 00:30:33 scb1003 pdfrender[30726]: Build Date: 11 February 2015  12:32:02AM
Mar 08 00:30:33 scb1003 pdfrender[30726]: Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-3-amd64 root=UUID=0a595999-dbe0-4882-b1f7-2040
Mar 08 00:30:33 scb1003 pdfrender[30726]: Current Operating System: Linux scb1003 4.4.0-3-amd64 #1 SMP Debian 4.4.2-3+wmf7 (2016-11-04) x86_
Mar 08 00:30:33 scb1003 pdfrender[30726]: Build Operating System: Linux 3.16.0-4-amd64 x86_64 Debian
Mar 08 00:30:33 scb1003 pdfrender[30726]: X Protocol Version 11, Revision 0
Mar 08 00:30:33 scb1003 pdfrender[30726]: Release Date: 2014-12-20
Mar 08 00:30:33 scb1003 pdfrender[30726]: X.Org X Server 1.16.4
Mar 08 00:30:33 scb1003 pdfrender[30726]: Warning: a protocol list is present, the new list "unix,inet,inet6" will not be installed
Mar 08 00:30:33 scb1003 pdfrender[30726]: Reading profile /etc/firejail/disable-passwdmgr.inc
Mar 08 00:30:33 scb1003 pdfrender[30726]: Reading profile /etc/firejail/disable-programs.inc
Mar 08 00:30:33 scb1003 pdfrender[30726]: Reading profile /etc/firejail/disable-common.inc
Mar 08 00:30:33 scb1003 pdfrender[30726]: Reading profile /etc/firejail/default.profile
Mar 08 00:30:33 scb1003 pdfrender[30726]: Reading profile /etc/firejail/pdfrender.profile
Mar 08 00:30:33 scb1003 systemd[1]: Started "pdfrender service".
Mar 08 00:30:33 scb1003 systemd[1]: Starting "pdfrender service"...
Mar 08 00:30:32 scb1003 pdfrender[26738]: (EE) Server terminated successfully (0). Closing log file.
Mar 08 00:30:32 scb1003 pdfrender[26738]: Xpra: Fatal IO error 2 (No such file or directory) on X server :711.
Mar 08 00:30:32 scb1003 pdfrender[26738]: 2017-03-08 00:30:32,835 killing xvfb with pid 26742
Mar 08 00:30:32 scb1003 pdfrender[26738]: 2017-03-08 00:30:32,835 removing socket /home/pdfrender/.xpra/scb1003-711
Mar 08 00:30:32 scb1003 pdfrender[26738]: got deadly signal SIGTERM, exiting
Mar 08 00:30:32 scb1003 pdfrender[26738]: 2017-03-08 00:30:32,835 got signal SIGTERM, exiting
Mar 08 00:30:32 scb1003 pdfrender[26738]: 2017-03-08 00:30:32,835
Mar 08 00:30:32 scb1003 pdfrender[26738]: Parent is shutting down, bye...
Mar 08 00:30:32 scb1003 pdfrender[26738]: Child received signal 15, shutting down the sandbox...
Mar 08 00:30:32 scb1003 pdfrender[26738]: Parent received signal 15, shutting down the child process...
Mar 08 00:30:32 scb1003 pdfrender[26738]: Parent pid 26821, child pid 26822
Mar 08 00:30:32 scb1003 systemd[1]: Stopping "pdfrender service"...

It's not clear why the TCP port has not been opened by the process. After some debugging, @Giuseppe and I think this will probably go away upon a restart, but we've decided to open the phab task so this is known and can be debugged a bit more if needed.

Restart command

To restart all instances in eqiad, something like this works:

for i in 1 2 3 4; do echo "$i"; ssh "scb100$i.eqiad.wmnet" "sudo service pdfrender restart"; done

Event Timeline

The display assertion vaguely points towards xpra or Xorg. Smells like a race condition on service restart, possibly with the old Xorg or xpra still hanging around. The line "Mar 08 00:30:32 scb1003 pdfrender[26738]: Xpra: Fatal IO error 2 (No such file or directory) on X server :711." could be related.

I would also expect this to go away on restart. I think the logs already give us all the information we can get out of this instance, and restarting it is fine.

We'll need to reproduce this in order to fix it.

This is (unfortunately) a common scenario on service start-up. The current work-around is stopping the service, waiting a bit and then starting it (just restarting it also works sometimes, but it's not as reliable).

As an immediate work-around, maybe we could add a delay in the restart process? One way to do this might be to add a sleep in https://www.freedesktop.org/software/systemd/man/systemd.service.html#ExecStopPost=.

Ottomata triaged this task as Medium priority. Mar 8 2017, 6:55 PM

Change 341833 had a related patch set uploaded (by GWicke):
[operations/puppet] Delay service shut-down to work around xpra race

https://gerrit.wikimedia.org/r/341833

The same thing has just happened when I've tried to update the service to a newer version (see T160764). Will put the patch for puppet SWAT and attempt to deploy after it's merged.

Change 341833 merged by Filippo Giunchedi:
[operations/puppet@production] PDFRender: Delay service shut-down to work around xpra race

https://gerrit.wikimedia.org/r/341833

Although the patch was merged, the situation didn't change - the exact same log is produced on server restart. This blocks deployment of the new version T160764 because, although for the old version the error was there, it didn't prevent the service from starting up. The new version just hangs after the AssertionError and doesn't accept connections. In beta cluster though this doesn't happen.

The problem (pdfrender hanging at startup) just showed up again on scb1002, and it seems there is no way to get around that race condition at the moment (no amount of waiting is ok).

For the moment, I couldn't find anything justifying what's happening besides a race condition at startup time, so that the pdf service tries to access xpra before it's ready.

I would suggest we run xpra as a separate service that pdfrender depends upon in systemd, and use some systemd primitive to ensure the latter starts only when the former is ready.
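As a rough illustration of "start only when the display is ready", here is a minimal Node.js sketch that polls for the X display's Unix socket before the renderer would be launched. It assumes the conventional socket location /tmp/.X11-unix/X<n> is visible to the caller (inside a firejail sandbox it may not be), and the helper names displaySocketPath, isSocket and waitForDisplay are hypothetical, not part of electron-render-service. A socket that exists only shows that Xorg is accepting connections; it does not prove that xpra itself has finished initialising.

```javascript
'use strict';
const fs = require('fs');

// Map an X display name such as ":427" (or ":427.0") to the conventional
// Unix socket path used by Xorg. The location is an assumption about the host.
function displaySocketPath(display) {
  const match = /^:(\d+)(?:\.\d+)?$/.exec(display);
  if (!match) {
    throw new Error(`unsupported display name: ${display}`);
  }
  return `/tmp/.X11-unix/X${match[1]}`;
}

// Return true if `path` exists and is a Unix socket, false otherwise.
function isSocket(path) {
  try {
    return fs.statSync(path).isSocket();
  } catch (err) {
    return false; // missing path or permission problem: treat as not ready
  }
}

// Poll until the display socket appears, or reject once the deadline passes.
function waitForDisplay(display, timeoutMs, intervalMs = 100) {
  const path = displaySocketPath(display);
  const deadline = Date.now() + timeoutMs;
  return new Promise((resolve, reject) => {
    const poll = () => {
      if (isSocket(path)) {
        resolve(path);
      } else if (Date.now() >= deadline) {
        reject(new Error(`display ${display} not ready after ${timeoutMs} ms`));
      } else {
        setTimeout(poll, intervalMs);
      }
    };
    poll();
  });
}
```

A caller could run something like waitForDisplay(':427', 10000) and start the renderer only once the promise resolves, exiting non-zero on rejection so that systemd can retry.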

For now I'll ack the alarm in icinga and wait for one of you to be online to further debug the issue together.

Happened again today afaics on scb100[12], resolved restarting pdfrender on both.

Restarted pdfrender on scb1004, scb200[2,4] today, xpra race condition traces in /srv/log/pdfrender

MoritzMuehlenhoff raised the priority of this task from Medium to High. Jun 14 2017, 7:19 AM
MoritzMuehlenhoff subscribed.

I think the only reliable way to fix this would be to add a systemd unit to xpra (the Debian package doesn't include one, but Arch Linux has a proposed one at https://wiki.archlinux.org/index.php/Xpra#Server) and then amend the unit for pdfrender with

[Unit]
After=xpra.service
Wants=xpra.service

We have a system service which coordinates service startup, let's use that.

Mentioned in SAL (#wikimedia-operations) [2017-06-17T16:51:00Z] <volans> restarted pdfrender on scb200[2,4] T159922

Is there an ETA for a permanent fix? It seems to me that we've already delayed this too much given the frequency at which it's happening lately.

There is currently a debate as to whether Electron will be kept at all in T166188: Architecture of new rendering backend for Extension:Collection. I would suggest waiting to see what the final outcome of that is before investing effort into finding the correct fix for Electron.

Change 359967 had a related patch set uploaded (by GWicke; owner: GWicke):
[operations/puppet@production] Restart pdfrender service once per day

https://gerrit.wikimedia.org/r/359967

Marko's new version of the patch actually checks whether the service is responsive, and only restarts it when necessary. This should eliminate the need for manual restarts, and should buy us enough time until the longer term browser-based renderer is determined in T166188.
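The patch itself is not quoted in this task, so the following is only a minimal Node.js sketch of the same idea (probe first, restart only on failure), not the merged implementation. The function names probePort, shouldRestart, checkAndRestart and restartPdfrender are hypothetical; host and port would be whatever the instance listens on (5252 in the alerts above). Note that a plain TCP probe only catches the "Connection refused" case and would not detect a process that accepts connections but never answers.

```javascript
'use strict';
const net = require('net');
const { execFile } = require('child_process');

// Resolve true if a TCP connection to host:port succeeds within timeoutMs.
function probePort(host, port, timeoutMs) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    const finish = (ok) => {
      socket.destroy();
      resolve(ok);
    };
    socket.setTimeout(timeoutMs, () => finish(false));
    socket.once('connect', () => finish(true));
    socket.once('error', () => finish(false));
  });
}

// Pure decision helper: restart only if the last `threshold` probes all failed.
function shouldRestart(results, threshold) {
  if (results.length < threshold) {
    return false;
  }
  return results.slice(-threshold).every((ok) => ok === false);
}

// Hypothetical restart action; assumes the caller may run systemctl.
function restartPdfrender() {
  return new Promise((resolve, reject) => {
    execFile('systemctl', ['restart', 'pdfrender'], (err) => (err ? reject(err) : resolve()));
  });
}

// Probe a few times and restart only when every probe failed.
async function checkAndRestart({ host, port, attempts, timeoutMs, restart }) {
  const results = [];
  for (let i = 0; i < attempts; i++) {
    results.push(await probePort(host, port, timeoutMs));
  }
  if (shouldRestart(results, attempts)) {
    await restart();
    return 'restarted';
  }
  return 'healthy';
}
```

Run periodically (for example from cron), a call such as checkAndRestart({ host: '127.0.0.1', port: 5252, attempts: 3, timeoutMs: 2000, restart: restartPdfrender }) would leave a healthy instance alone. Requiring several consecutive failures is a design choice to avoid restarting on a single slow probe.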

I am also wondering if there were any hangs during normal operation (not after a manual restart) before the Electron upgrade on April 11th, or if this is a regression.

@GWicke, @mobrovac - currently all of the options we are seriously looking at include electron. We are leaning towards T168871#3449625 and would be ready to commit unless there are significant objections from your team or ops. Do we know what portion of renders this bug is affecting? Currently, electron is used as the default renderer on all projects.

@ovasileva, thanks for the update. Anything that uses a generic browser based print process is good news to me. Electron is just one of several possible wrappers around Chrome, and we can replace it with other options like the newer native Chrome print mode without any changes in behavior or output.

The Electron issues are relatively rare, but when they happen, they hang one instance until it is restarted. This is something we need to address in the longer term, once a decision between generic browser-based rendering (ex: Electron) vs. specialized tools like wkhtmltopdf has been made on your end.

@GWicke - about 90% sure that we'll go with electron. If there are no significant objections to the proposal in T168871, we will probably be committing to that approach by the end of this week.

Change 359967 merged by Giuseppe Lavagetto:
[operations/puppet@production] PDF Render: Check hourly if the service is running via cron

https://gerrit.wikimedia.org/r/359967

on scb1002

Current Status:	  CRITICAL  
 (for 0d 5h 51m 50s)
Status Information:	connect to address 10.64.16.21 and port 5252: Connection refused
HTTP CRITICAL - Unable to open TCP socket

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=scb1002&service=pdfrender

duration: 0d 5h 52m 52s

Mentioned in SAL (#wikimedia-operations) [2017-07-28T02:26:56Z] <mutante> scb1002 - systemctl restart pdfrender - was "connect to address 10.64.16.21 and port 5252: Connection refused" in Icinga since a couple hours (T159922) - recovered

19:27 < icinga-wm> RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time

Mentioned in SAL (#wikimedia-operations) [2017-07-31T17:12:05Z] <herron> scb1001 restarted pdfrender service - T159922

Mentioned in SAL (#wikimedia-operations) [2017-07-31T20:48:59Z] <mutante> restarting pdfrender service on sc1001 after icinga alert (T159922)

^ that was scb1002 - not sc1001 - typo

This happened today at around 6:58 on scb1001, due to the OOM killer killing one of electron's children

Aug  2 06:57:58 scb1001 kernel: [3687844.422723] Memory cgroup out of memory: Kill process 13114 (electron) score 922 or sacrifice child
Aug  2 06:57:58 scb1001 kernel: [3687844.433046] Killed process 13116 (electron) total-vm:322044kB, anon-rss:928kB, file-rss:25052kB, shmem-rss:0kB

Then xpra shuts down and everything gets restarted by systemd

Aug  2 06:58:07 scb1001 pdfrender[13039]: xpra at :918 has exited.
Aug  2 06:58:08 scb1001 pdfrender[13039]: Xpra server pid 13040, xpra client pid 13073, jail 13074
Aug  2 06:58:09 scb1001 pdfrender[13039]: (EE) Server terminated successfully (0). Closing log file.
Aug  2 06:58:21 scb1001 systemd[1]: pdfrender.service holdoff time over, scheduling restart.
Aug  2 06:58:21 scb1001 systemd[1]: Stopping "pdfrender service"...
Aug  2 06:58:21 scb1001 systemd[1]: Starting "pdfrender service"...
Aug  2 06:58:21 scb1001 systemd[1]: Started "pdfrender service".

I noticed that when a restart fails, pdfrender doesn't even emit "Renderer listening on http://0.0.0.0:5252" in the logs, which matches what we're seeing (nothing listening on the socket).

Looking at the code it seems either electron hasn't finished booting and/or windowpool never finishes?

// Electron finished booting
electronApp.once('ready', () => {
  electronApp.ready = true;
  app.pool = new WindowPool();
  const listener = app.listen(PORT, HOSTNAME, () => printBootMessage(listener));
});

In either case there should be some failsafe to at least retry whatever electron or windowpool were trying to do

Mentioned in SAL (#wikimedia-operations) [2017-08-07T12:39:29Z] <_joe_> restarting pdfrender on scb1001, T159922

I just reported the issue on the other ticket. I'm interested in a fix, of course, but I'm unsubscribing to try to keep my inbox under 100 phab daily emails :-) Re-add me if you need help.

Mentioned in SAL (#wikimedia-operations) [2017-08-13T19:01:50Z] <godog> bounce pdfrender on scb1001 and scb1003 - T159922

This just happened again, any thoughts on what I wrote in T159922#3492238 ? Namely that xpra might not be necessarily the root cause

Mentioned in SAL (#wikimedia-operations) [2017-08-15T13:21:51Z] <moritzm> bounced pdfrender on scb1001 (T159922)

Mentioned in SAL (#wikimedia-operations) [2017-08-15T15:21:30Z] <mobrovac> restarting pdfrender on scb1001, added some debug messages to help us diagnose T159922

I added some debugging messages to the service on scb1001, and, in fact, the initialisation process never finishes:

[INIT] Electron App back-end ready, creating the windows ...
[WindowPool] Created window with id = 1, n = 7
[WindowPool] Created window with id = 2, n = 6
[WindowPool] Created window with id = 3, n = 5
[WindowPool] Created window with id = 4, n = 4
[WindowPool] Created window with id = 5, n = 3

It hangs after a certain amount of windows have been created. For reference, when the service starts up normally, these log entries are produced:

[INIT] Electron App back-end ready, creating the windows ...
[WindowPool] Created window with id = 1, n = 7
[WindowPool] Created window with id = 2, n = 6
[WindowPool] Created window with id = 3, n = 5
[WindowPool] Created window with id = 4, n = 4
[WindowPool] Created window with id = 5, n = 3
[WindowPool] Created window with id = 6, n = 2
[WindowPool] Created window with id = 7, n = 1
[WindowPool] Created window with id = 8, n = 0
[INIT] Window creating completed
Renderer listening on http://0.0.0.0:5252
Usage: GET http://0.0.0.0:5252/[pdf|png|jpeg]?accessKey=[token]&url=http%3A%2F%2Fgoogle.com

I don't have an answer as to why the window-creation process hangs at the moment.

Mentioned in SAL (#wikimedia-operations) [2017-08-16T08:10:24Z] <moritzm> bounced pdfrender on scb1004 (T159922)

I tried setting a time-out during the initialisation process that would kill the process if it doesn't complete in a reasonable amount of time (and hence allow systemd to restart it), but unfortunately it seems that Electron's window-creation process blocks the libuv loop entirely, so the time-out callback is never executed.
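To illustrate why an in-process time-out cannot help here, the sketch below first shows that a timer scheduled before a synchronous block never fires while the event loop is starved, and then outlines one possible alternative: a separate supervisor process that enforces the start-up deadline from outside. Both helper names (timerFiredDuringBlock, superviseStartup) are hypothetical. Matching on the "Renderer listening on" line is an assumption based on the normal start-up log quoted earlier, and killing only the direct child may not be enough when the service runs under a wrapper such as firejail.

```javascript
'use strict';
const { spawn } = require('child_process');

// Demonstration: a timer scheduled before a synchronous busy loop cannot
// fire until the loop returns control to the event loop.
function timerFiredDuringBlock(blockMs) {
  let fired = false;
  setTimeout(() => { fired = true; }, 0);
  const end = Date.now() + blockMs;
  while (Date.now() < end) {
    // busy-wait: the event loop is starved, so the timer callback cannot run
  }
  return fired;
}

// External watchdog: start the service as a child process and kill it if the
// ready marker does not show up on stdout before the deadline. The deadline
// timer lives in the supervisor, so a blocked loop in the child cannot stop it.
function superviseStartup(command, args, readyMarker, deadlineMs) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: ['ignore', 'pipe', 'inherit'] });
    let output = '';

    const timer = setTimeout(() => {
      child.kill('SIGKILL');
      reject(new Error(`no "${readyMarker}" within ${deadlineMs} ms`));
    }, deadlineMs);

    const onData = (chunk) => {
      output += chunk;
      if (output.includes(readyMarker)) {
        clearTimeout(timer);
        child.stdout.removeListener('data', onData);
        resolve(child);
      }
    };

    child.stdout.pipe(process.stdout); // keep forwarding (and draining) the logs
    child.stdout.on('data', onData);
    child.once('error', (err) => { clearTimeout(timer); reject(err); });
    child.once('exit', (code) => {
      clearTimeout(timer);
      reject(new Error(`exited before becoming ready (code ${code})`));
    });
  });
}
```

A supervisor could call superviseStartup with the service's start command, the marker 'Renderer listening on' and a deadline of its choosing, and exit non-zero on rejection so that systemd schedules a restart.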

I did notice, however, that even when that happens, in most cases at least 4 windows get created, so a temporary work-around might be to lower the concurrency to 4. That should be enough to serve the current rate of requests (~2 reqs/sec).

Change 372156 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/puppet@production] PDF Render: Lower the concurrency to 4

https://gerrit.wikimedia.org/r/372156

Change 372156 merged by Alexandros Kosiaris:
[operations/puppet@production] PDF Render: Lower the concurrency to 4

https://gerrit.wikimedia.org/r/372156

At least one instance on scb100* broke again about 20 hours ago:

image.png (630×1 px, 94 KB)

I just restarted the instances on scb1001-1004, which resolved the alert.

Assuming this was an instance with the reduced concurrency applied, it seems that this change did not help much either.

The concurrency change was meant only to help with the restart process itself, not with an already-running service process hanging (which is a different beast). If you managed to restart them in one go, then it's actually working :)

Mentioned in SAL (#wikimedia-operations) [2017-08-24T21:59:54Z] <gwicke> restarted pdfrender service instances in eqiad / T159922

Mentioned in SAL (#wikimedia-operations) [2017-08-28T18:48:52Z] <gwicke> restarted pdfrender instances in eqiad (T159922)

This task was about pdfrender failing to start, and that problem has been "hotfixed".

It has nothing to do with the service hanging and us having no functional monitoring (because we have no swagger spec for this, for instance).

I'd resolve this ticket at this point. @mobrovac what do you think?

I personally am not sure whether the startup issues are caused by the same underlying issue as the hangs, or not. I would imagine that an electron worker process restarting could run into similar hangs as on service startup.

In any case, the hangs are an important problem right now, requiring frequent manual intervention. If we decide to close this task, let's make sure we have another task to keep tracking the hangs.

FWIW I support closing this and moving the other parts into their own tasks. This task is already long enough and treats at least 3 different things (pdfrender hanging problem + pdfrender start-up problem + discussions pertaining to a new service, i.e. T166188), so it's becoming a little bit confusing. Let's rename this one to reflect that we've treated the pdfrender start-up problem (which is pretty much 80% of the task) and move the other parts into their own tasks.

mobrovac claimed this task.

Agreed, this task has become confusing. As the start-up issue has been worked around, I am closing this task. I have created T174916: electron/pdfrender hangs where we can track the service's hangs in production.