When a bunch of webgrid-lighttpd nodes were depooled on Jan. 6th, we discovered that they didn't inform the proxylistener process on the Toolforge proxy system that the services were rescheduled.
That left the proxy out of sync with the grid, and those tools returned 503 errors on the web. It turns out there is a long-standing bug: webgrid jobs *should* be marked "not rerunnable" so that, instead of being rescheduled, they are stopped as grid jobs (which runs the epilog script against the proxy) and then restarted by the webservice monitor. Rescheduling a job on the grid skips the epilog script, so the job ends up without a port on the proxy and no webservice is reachable, even though it reports to the webservice monitor that it is running (because it is running on the grid, just not on the web).
The fix is to ensure that webservice submits its jobs with qsub -r n instead of -r y or whatever it does now.
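
As a rough illustration of the proposed fix, here is a minimal sketch of building a qsub command line that marks the job as not rerunnable. This is not the actual webservice code: the function name, the job name, the script path, and the use of "webgrid-lighttpd" as the queue name are all assumptions made for the example. Only the -r n flag comes from the fix described above.

```python
import shlex


def build_qsub_command(script_path, job_name, queue="webgrid-lighttpd"):
    """Return a qsub argument list for a job that must not be rerun.

    Hypothetical helper: the real webservice code may assemble its
    command differently. The queue name is an assumption.
    """
    return [
        "qsub",
        "-N", job_name,   # job name
        "-q", queue,      # target queue (assumed name)
        "-r", "n",        # not rerunnable: the grid should stop the job, not reschedule it
        script_path,
    ]


# Placeholder job name and path, for illustration only.
command = build_qsub_command("/path/to/webservice-runner", "lighttpd-mytool")
print(shlex.join(command))
```

With -r n in place, a depooled node should cause the job to be stopped rather than rescheduled, so the epilog script runs and the webservice monitor restarts the job.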