
Zotero service crashes and pages multiple times.
Closed, Resolved · Public

Description

The issue is summarized in this gerrit CR, https://gerrit.wikimedia.org/r/c/operations/puppet/+/483999:

zotero pages currently arrive because multiple instances become unresponsive when the nodejs maximum heap is reached and are killed by zotero itself, only to be immediately restarted by kubernetes. That means a quick recovery at the instance level and an eventual (and usually rather quick) recovery at the service level. The service is a dependency of citoid, but the latter is able to function even without zotero, albeit with reduced functionality.

Unfortunately zotero pages are currently not actionable, as the service usually recovers long before a human has decided to perform any action. Even in a pathological case where the service is reported as down for long enough, the only course of action is to delete all pods and allow kubernetes to restart everything, effectively just rushing things a bit. Due to all of the above, disable paging for zotero. IRC and email alerts will continue to be sent as normal.
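
For concreteness, the "delete all pods" remediation mentioned above amounts to removing the zotero pods and letting their controller recreate them. A minimal sketch using the Kubernetes Python client, assuming a hypothetical zotero namespace and an app=zotero label (the real selectors in the production clusters may differ):

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (use load_incluster_config()
# when running inside the cluster instead).
config.load_kube_config()
v1 = client.CoreV1Api()

# Delete every pod matching the label selector; the owning controller
# (Deployment/ReplicaSet) immediately schedules replacements, which is
# the same thing kubernetes does on its own whenever a pod dies.
v1.delete_collection_namespaced_pod(
    namespace="zotero",            # assumed namespace
    label_selector="app=zotero",   # assumed label
)
```

In practice this is the equivalent of a single kubectl delete pods -l app=zotero -n zotero; since the replacements come up within seconds, paging a human to run it adds little value, hence the change above.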

The underlying issue is the lack of a readiness probe, as stated in T213689. One of the attempted mitigations was to adjust the node heap size, either to delay the problem or, with a smaller heap, to make the failures less severe; this was tried in T213414.
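
For illustration, a readiness probe of the kind T213689 asks for would look roughly like the sketch below, expressed here with the Kubernetes Python client; the path, port, and timings are assumptions made for the example, not values taken from that task or from the actual deployment:

```python
from kubernetes import client

# Hypothetical readiness probe: poll an HTTP endpoint on the container and
# mark the pod unready after a few consecutive failures. Path, port and
# timings are illustrative assumptions.
readiness_probe = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/", port=1969),
    initial_delay_seconds=10,
    period_seconds=10,
    timeout_seconds=5,
    failure_threshold=3,
)

container = client.V1Container(
    name="zotero",
    image="zotero:example",        # placeholder image reference
    readiness_probe=readiness_probe,
)
```

With a working probe, kubernetes would stop routing requests to an instance that has become unresponsive instead of waiting for it to be killed and restarted.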

When we deployed the new image with the heap size changes, we also deployed commits merged from upstream since the previous build. One of them caused a zlib issue due to an incompatibility between the zlib version installed on Debian and the one bundled with node (details are in this CR: https://gerrit.wikimedia.org/r/c/mediawiki/services/zotero/+/484205/).

This task is sort of an umbrella task for the latest zotero incidents; it should be closed once we no longer receive multiple pages due to zotero.

Event Timeline

CDanis triaged this task as Medium priority. Jan 14 2019, 2:37 PM

Mentioned in SAL (#wikimedia-operations) [2019-01-14T14:52:25Z] <akosiaris> upgrade zotero pods to 2019-01-14-115905-candidate in codfw T213693

Mentioned in SAL (#wikimedia-operations) [2019-01-14T15:04:51Z] <akosiaris> upgrade zotero pods to 2019-01-14-115905-candidate in eqiad T213693

akosiaris lowered the priority of this task from Medium to Low. Jan 14 2019, 3:09 PM
akosiaris subscribed.

We have already identified a specific URL that was able to send zotero into what appears to be a busy loop. The upgrade done a few minutes ago in both codfw and eqiad seems to have addressed that specific incident. We should monitor the service over the next few days/weeks to see if the problem arises again. Given that it parses arbitrary data, it's quite possible we haven't seen the last of it.

greg subscribed.

Meta: Reading "This task is sort of an umbrella task for the latest zotero incidents; it should be closed once we no longer receive multiple pages due to zotero." makes me think this is an "active situation" task, not a follow-up.

fsero claimed this task.

After the latest deployments of zotero this has been fixed.