So since ocg1003 was misbehaving lately, I took a look:
because it was neither decommissioned nor reimaged (see T84723), it was still processing jobs and, quite surprisingly, receiving requests from the appservers in this form:
GET /?command=download_file&collection_id=75e63cbbd79765efb78d951745923ae3c2a6a5f2&writer=rdf2latex HTTP/1.0 Host: ocg1003:8000 ...
These come in directly from the appservers. Even without accounting for the error in setting the Host header, this breaks our whole infrastructure quite spectacularly: we assume that all traffic to our individual hosts is mediated by a load balancer, and this clearly violates that.
I understand this is not easy to fix, but it needs to be fixed as soon as possible; otherwise, whenever a host goes down for any reason, we'll fail a portion of the requests.
[CSA] We can implement a flag to tell ocg1003 to stop taking new tasks from the queue. In conjunction with one of the queue purging scripts (or just waiting a few days) that will ensure that ocg1003 becomes quiescent with no cached files, and can be taken down.
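To make the proposed flag concrete, here is a minimal sketch in Python of a worker that stops taking new tasks once a drain flag is set. This is not the OCG service's actual code: the `Worker` class, the `draining` flag, `run_once`, and the in-memory queue are all hypothetical stand-ins for whatever the real job queue and worker loop look like.

```python
import queue
import threading


class Worker:
    """Hypothetical worker that pulls jobs from a shared queue."""

    def __init__(self, jobs):
        self.jobs = jobs
        # Assumed drain flag: once set, the worker takes no new jobs.
        self.draining = threading.Event()
        self.done = []

    def run_once(self):
        """Process at most one job; return False if nothing was taken."""
        if self.draining.is_set():
            # Leave remaining jobs on the queue for the other hosts.
            return False
        try:
            job = self.jobs.get_nowait()
        except queue.Empty:
            return False
        self.done.append(job)
        return True


def demo():
    jobs = queue.Queue()
    for name in ("job-a", "job-b", "job-c"):
        jobs.put(name)

    worker = Worker(jobs)
    worker.run_once()                # takes job-a while still active
    worker.draining.set()            # operator flips the flag
    took_more = worker.run_once()    # refuses further work
    return worker.done, took_more, jobs.qsize()
```

The flag only stops new work from being picked up. Files already cached on the host would still need to be purged, or left to expire, before the host can be taken down.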