The issue is summarized in this Gerrit CR, https://gerrit.wikimedia.org/r/c/operations/puppet/+/483999:
Zotero pages currently fire because multiple instances become
unresponsive when the Node.js maximum heap size is reached and are
killed, only to be immediately restarted by Kubernetes. That means
a quick recovery at the instance level and an eventual (and usually
rather quick) recovery at the service level. Zotero is a dependency
of citoid, but the latter is able to function even without
zotero, albeit with reduced functionality.
Unfortunately, zotero pages are currently non-actionable, as the
service usually recovers long before a human has decided to act.
Even in the pathological case where the service is reported as down
long enough, the only course of action is to delete all pods and let
Kubernetes restart everything, which merely speeds up the recovery a bit.
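For reference, the manual remediation described above amounts to something like the following. This is a sketch: the namespace name is an assumption, not a confirmed value from the deployment.

```
# Delete all zotero pods; the Deployment's ReplicaSet recreates them immediately.
# The "zotero" namespace is an assumption about how the service is deployed.
kubectl -n zotero delete pods --all
```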
Due to all of the above, disable paging for zotero. IRC and email alerts
will continue to be sent as normal.
The underlying issue is the lack of a readiness probe, as stated in T213689. One attempted mitigation was to adjust the Node.js heap size to either delay the problem or make it less severe; this was tried in T213414.
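A readiness probe along the lines of the following would let Kubernetes stop routing traffic to an unresponsive instance rather than paging. This is only a sketch: the path, port, and timing values are assumptions for illustration, not zotero's actual configuration.

```yaml
# Sketch of a readinessProbe for the zotero container.
# All values (path, port, delays, thresholds) are illustrative assumptions.
readinessProbe:
  httpGet:
    path: /
    port: 1969
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```

With a probe like this, a pod whose Node.js heap is exhausted would be marked NotReady and taken out of the service endpoints until it recovers or is restarted, instead of serving failed requests that trigger pages.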
When we deployed the new image with the heap size changes, we also deployed commits merged from upstream in the meantime. One of them caused a zlib issue due to an incompatibility between the zlib version installed on Debian and the one bundled with Node.js (details are in this CR: https://gerrit.wikimedia.org/r/c/mediawiki/services/zotero/+/484205/).
This task is an umbrella task for the latest zotero incidents; it should be closed once we no longer receive multiple pages due to zotero.