
Zotero service crashes and pages multiple times.
Closed, Resolved · Public

Description

The issue is summarized in this gerrit CR, https://gerrit.wikimedia.org/r/c/operations/puppet/+/483999:

zotero pages currently arrive because multiple instances become unresponsive when the nodejs maximum heap is reached and are killed by zotero itself, only to be immediately restarted by kubernetes. That means a quick recovery at the instance level and an eventual (and usually rather quick) recovery at the service level. The service is a dependency of citoid, but the latter is able to function even without zotero, albeit with reduced functionality.

Unfortunately zotero pages are currently not actionable, as the service usually recovers long before a human has decided to perform any action. Even in a pathological case where the service is reported as down for long enough, the only course of action is to delete all pods and allow kubernetes to restart everything, effectively just rushing things a bit. Due to all of the above, disable paging for zotero. IRC and email alerts will continue to be sent as normal.
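
For concreteness, the "delete all pods" remediation mentioned above amounts to removing the zotero pods and letting their controller recreate them. A minimal sketch using the Kubernetes Python client, assuming a hypothetical zotero namespace and an app=zotero label (the real selectors in the production clusters may differ):

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (use load_incluster_config()
# when running inside the cluster instead).
config.load_kube_config()
v1 = client.CoreV1Api()

# Delete every pod matching the label selector; the owning controller
# (Deployment/ReplicaSet) immediately schedules replacements, which is
# the same thing kubernetes does on its own whenever a pod dies.
v1.delete_collection_namespaced_pod(
    namespace="zotero",            # assumed namespace
    label_selector="app=zotero",   # assumed label
)
```

In practice this is the equivalent of a single kubectl delete pods -l app=zotero -n zotero; since the replacements come up within seconds, paging a human to run it adds little value, hence the change above.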

The underlying issue is the lack of a readiness probe, as stated in T213689. One of the attempted mitigations was to adjust the node heap size, either to delay the problem or, with a smaller heap, to make the failures less severe; this was tried in T213414.
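
For illustration, a readiness probe of the kind T213689 asks for would look roughly like the sketch below, expressed here with the Kubernetes Python client; the path, port, and timings are assumptions made for the example, not values taken from that task or from the actual deployment:

```python
from kubernetes import client

# Hypothetical readiness probe: poll an HTTP endpoint on the container and
# mark the pod unready after a few consecutive failures. Path, port and
# timings are illustrative assumptions.
readiness_probe = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/", port=1969),
    initial_delay_seconds=10,
    period_seconds=10,
    timeout_seconds=5,
    failure_threshold=3,
)

container = client.V1Container(
    name="zotero",
    image="zotero:example",        # placeholder image reference
    readiness_probe=readiness_probe,
)
```

With a working probe, kubernetes would stop routing requests to an instance that has become unresponsive instead of waiting for it to be killed and restarted.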

When we deployed the new image with the heap size changes, we also deployed commits merged from upstream since the previous build. One of them caused a zlib issue due to an incompatibility between the zlib version installed on Debian and the one bundled with node (details are in this CR: https://gerrit.wikimedia.org/r/c/mediawiki/services/zotero/+/484205/).

This task is sort of an umbrella task for the latest zotero incidents; it should be closed once we no longer receive multiple pages due to zotero.

Event Timeline

CDanis triaged this task as Medium priority. Jan 14 2019, 2:37 PM

Mentioned in SAL (#wikimedia-operations) [2019-01-14T14:52:25Z] <akosiaris> upgrade zotero pods to 2019-01-14-115905-candidate in codfw T213693

Mentioned in SAL (#wikimedia-operations) [2019-01-14T15:04:51Z] <akosiaris> upgrade zotero pods to 2019-01-14-115905-candidate in eqiad T213693

akosiaris lowered the priority of this task from Medium to Low. Jan 14 2019, 3:09 PM
akosiaris subscribed.

We have already identified a specific URL that was able to send zotero into what appears to be a busy loop. The upgrade done a few minutes ago in both codfw and eqiad seems to have addressed that specific incident. We should monitor the service over the next few days/weeks to see if the problem arises again. Given that it parses arbitrary data, it's quite possible we haven't seen the last of it.

greg subscribed.

Meta: Reading "This task is sort of an umbrella task for the latest zotero incidents; it should be closed once we no longer receive multiple pages due to zotero." makes me think this is an "active situation" task, not a follow-up.

fsero claimed this task.

After the latest deployments of zotero this has been fixed.