
zotero paging / serving 5xxes after CPU spikes
Closed, Resolved · Public

Description

<+icinga-wm> PROBLEM - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
<+icinga-wm> RECOVERY - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
<_joe_> it was a higher cpu spike compared to the ones we seem to have regularly
<_joe_> sadly zotero is a dark box
<_joe_> see https://grafana.wikimedia.org/d/2oPtfvXWk/zotero?viewPanel=28&orgId=1&refresh=1m&from=now-24h&to=now
<_joe_> it doesn't seem like the usual memleak is causing this though
<rzl> yeah
<rzl> memory's a little high but not crazy
<_joe_> I think the cure might be raising the number of replicas if this keeps happening
<_joe_> or maybe someone is just trying to cite a url that causes this :/
<rzl> sounds plausible - when's the last time we re-provisioned it?
<_joe_> I have no idea :)
<+icinga-wm> PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
<_joe_> but they seem to have all been started 30 hours ago
<_joe_> and I see several restarts for most pods
<_joe_> due to failing liveness probes
<_joe_> rzl: I would try to raise the number of pods allocated by say 10%, and /or a rolling restart of the pods
<rzl> _joe_: I was interested in whether the CPU usage is uniform between the pods, and it looks like it's not: https://grafana.wikimedia.org/goto/e5npVTN7k
<rzl> most of them are fine, a few are spiking
<rzl> so I'm not sure more pods would help
<+icinga-wm> RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
<_joe_> at least to keep the service available I'd say :P
<_joe_> but yes, possibly not :/
<_joe_> you can check the times zotero went into a cpu spike and find the urls that were requested to url-downloader in that timespan from the pod ip
<_joe_> in the squid logs
<-> tgr|away is now known as tgr
<rzl> looking -- I don't know my way around the squid logs super well so I may not get there faster than the spike ends on its own at this point :P but let's see
<rzl> man I'm not even sure if I can *find* the squid logs faster than the spike ends on its own :/
<rzl> more practice for me later
<legoktm> are they not on the url-downloader host itself?
<legoktm> increasing the # of replicas should at least let some more pods keep serving requests while others are tied up with the massive requests, right?
<rzl> zotero 5xxs and cpu both look recovered
<_joe_> yes the spike is recovered
<_joe_> and yes the squid logs should be under /var/log/squid-something on the host :)
<_joe_> they have the "client" ip, and ua iirc
<_joe_> so you should be able to find the calls to specific zotero pods from their IP address
<legoktm> git blame says the # of replicas has been the same since at least July 2019, when it was committed to git
<_joe_> (it's in kubectl describe pods)
<rzl> _joe_: ahh okay cool
<_joe_> legoktm: yes normally we have issues with zotero due to memleaks
<rzl> yeah I think I just haven't happened to look at this file before, I'm going to keep digging around for this even though it looks like the incident is over
<_joe_> and in the past we had such cpu problems when zotero was used to cite certain wiki pages - in which case you won't find anything in the squid logs
<legoktm> so throwing more "hardware" at this seems reasonable
<rzl> yeah I'll buy that
<legoktm> so add 2 more replicas? 4?
<rzl> okay out of the three pods that were spiking at 17:25, none of their IP addresses appears in the squid log as far as I can tell
<rzl> (although I got to them post-restart, and I'm not sure if they get new IPs when that happens)
<rzl> legoktm: there's what, 10 now? +4 sounds reasonable to me but I don't have a great sense of what it costs resource-wise
<rzl> like I can look up the allocation, I just don't know if it's "a lot" on our infra :P
<rzl> or rather, we have 10 in eqiad and 10 in codfw, we're talking about +4 in each
<legoktm> I don't have a good sense of that either, I struggled a bit with that for Shellbox too
<rzl> something to talk about at Monday's svcops meeting but let's run with +4 in the meantime

Event Timeline

Change 723604 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/deployment-charts@master] zotero: Add 4 more replicas (total 14)

https://gerrit.wikimedia.org/r/723604

Change 723604 merged by jenkins-bot:

[operations/deployment-charts@master] zotero: Add 4 more replicas (total 14)

https://gerrit.wikimedia.org/r/723604

Open questions:

  • Is there a better way to fix this / identify the problematic requests?
  • Is throwing more resources at zotero an acceptable long-term solution or just short-term?
  • How many replicas is appropriate? And generally, how much does a replica "cost" in terms of overall cluster size, etc.?

> Is there a better way to fix this / identify the problematic requests?

Zotero logs are nigh on useless (actively harmful in the past) and have been disabled in 827891af. That makes it difficult to figure out directly what has happened. Using citoid logs as a proxy can help (but not always), e.g. https://logstash.wikimedia.org/goto/b8ce6f59b6417c0a79797056a1d0b957 covers some of the first few minutes where the CPU increase is noticeable. And yet I can't find a silver bullet. That being said, the last few pages seem to have been related to parsing really large PDF files. On the other hand, PDFs are often requested to be cited (or at least that's my cursory reading of a day's worth of logs).

> Is throwing more resources at zotero an acceptable long-term solution or just short-term?

I think it is (kinda, we might want another 4 or 5). More on my thinking below.

> How many replicas is appropriate?

We don't have SLOs for this service, so there is no product-owned guidance here. It's also important to point out that zotero is NOT written by Wikimedia.

However, using Citoid as a proxy:

Zotero is a single-threaded nodejs app. That means it has an event queue, but whenever it is doing something CPU intensive (like parsing a large PDF) it is unable to serve the queue. It also does not offer any endpoint that can be used for a readinessProbe. That effectively means that when a given replica is serving a large request it can't serve anything else AND it is not depooled from the rotation, so requests will still head its way. In the majority of cases that's ok: Citoid is usually fine with zotero requests timing out and falls back to its internal parser. Plus, once zotero is done with the large request it will drain node's event queue, which USUALLY (judging from the CPU graphs) is way cheaper CPU-wise.

So what happens when we get paged is one of the following scenarios:

  • All, or a majority of, replicas get some repeated request (a user submits the url they want cited multiple times) and end up CPU-pegged and unable to service requests, including the monitoring check, which times out 3 times and raises the alert.
  • A replica gets a large request and becomes unable to serve more requests for a while; in a really unlucky situation the monitoring requests flow its way, time out, and raise the alert.
  • Something in the spectrum between those 2 extremes.

In this specific case, grafana points out high CPU for 2 replicas, so... we got unlucky? The main reason I say this is that https://grafana.wikimedia.org/d/NJkCVermz/citoid?viewPanel=46&orgId=1&from=1632499800000&to=1632511919000 shows Citoid was seeing 500s before and after the page, but we got no page for those events. Zooming out to 7 days shows that those 5xx are not unusual at all.

> And generally, how much does a replica "cost" in terms of overall cluster size, etc.?

That is answered by https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/zotero/values.yaml#14. TL;DR is on average 200M RAM and 0.2 of a CPU, with spikes to 1 CPU and 2G RAM. So not much. We can add more.
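
For convenience, here is a minimal sketch of those numbers expressed as a Kubernetes resources stanza. The figures are the ones from the TL;DR above; the actual nesting in the zotero values.yaml may differ.

resources:
  requests:
    cpu: 200m      # ~0.2 of a CPU on average
    memory: 200Mi  # ~200M RAM on average
  limits:
    cpu: "1"       # observed spikes up to 1 full CPU
    memory: 2Gi    # observed spikes up to ~2G RAM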

So, a solution?

The good one would be to add a cheap HTTP GET endpoint to zotero, add a readinessProbe, and allow kubernetes to depool/pool replicas organically.

Lacking the will to invest in doing that, let's fall back to queuing theory. Using the p90 (~6s) for latencies and 3 rps for traffic (and ignoring nodejs' event queue, which adds a bit of a buffer on its own), and an M/M/Inf (at least one server per customer) model, we get that we need 18.75 servers, so 19 replicas. Round it up to 20 and it will probably be a long time before we get paged again for this.
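
For the record, the arithmetic behind that figure is just the expected occupancy of an M/M/Inf queue. A sketch: the quoted 18.75 implies a mean service time of 6.25 s; plugging in the ~6 s p90 instead gives 18, so the 19-20 replica conclusion doesn't change.

\[
  E[\text{busy servers}] \;=\; \lambda \cdot E[S] \;=\; 3\ \text{req/s} \times 6.25\ \text{s} \;\approx\; 18.75
\]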

I think some form of readinessProbe is indeed necessary, either by adding a GET endpoint or by adding curl to the zotero container plus a script that checks that POST requests work.
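
To make that concrete, here is a rough sketch of the two probe shapes being discussed, assuming zotero keeps listening on port 1969; the GET path and the helper script name below are placeholders, not something that exists today.

readinessProbe:            # option 1: a cheap HTTP GET endpoint upstream would add
  httpGet:
    path: /?spec           # placeholder path for the cheap GET endpoint
    port: 1969
  periodSeconds: 10
  timeoutSeconds: 3

# option 2: no GET endpoint; bake curl into the image plus a small script that
# issues a known-good POST and exits non-zero when it fails
readinessProbe:
  exec:
    command: ["/usr/local/bin/zotero-readiness-check"]   # hypothetical script
  periodSeconds: 10
  timeoutSeconds: 5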

The PDF connection might be a red herring because, although that's what happened in the past, attempting to translate PDFs was disabled here: https://github.com/zotero/translation-server/issues/68

https://github.com/zotero/translation-server/commit/5cf698e20acf8a2fe40219a464f5cbdf7154d24f

https://github.com/zotero/translation-server/commit/9f2c36664fe5f0c3e98ffeb369194f7c242120c3

Locally I've checked that this is still working as expected. So the rejections happen because PDFs get requested a lot, not necessarily because they're related to the crash.

I will point out, also, that I have recently noticed that a lot of our traffic has, for a long time, probably been coming from some external source and not us. Some third party is requesting the "zotero" format, which I know neither the Citoid extension/visual editor nor the reftoolbar uses.

https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&refresh=5m&from=now-30m&to=now&var-dc=eqiad%20prometheus%2Fk8s&var-service=citoid

You can see here that the majority of requests are for the zotero format (including a lot of the requests around the crash). Mediawiki format requests are likely largely visual editor, and mediawiki-basefields is RefToolBar. The Proveit gadget also uses the mediawiki format. It is possible there is a consumer that legitimately uses the zotero format for Wikimedia projects, but I don't know of one.

If this is causing us to have to allocate additional resources we should probably figure out where it's coming from.

> The PDF connection might be a red herring [...] attempting to translate PDFs was disabled here: https://github.com/zotero/translation-server/issues/68
> https://github.com/zotero/translation-server/commit/5cf698e20acf8a2fe40219a464f5cbdf7154d24f
> https://github.com/zotero/translation-server/commit/9f2c36664fe5f0c3e98ffeb369194f7c242120c3

Per those commits, the limit seems to be 50MB. @dstillman is probably right in his comment that it is maybe too high. Can we lower this more / is it configurable now? Like 10MB? Do we have any use case requiring that we be able to cite objects > 10MB?

> Some third party is requesting the "zotero" format [...] If this is causing us to have to allocate additional resources we should probably figure out where it's coming from.

I did have a look at that. It's a single IP in OVH (a hosting provider) in Canada that seems to be responsible for ~90% of that traffic, but there are others (albeit at volumes two orders of magnitude lower). Some quick searches and calculations imply ~1 rps and not a commercial service. The traffic isn't a lot; we probably can't call it a violation of https://www.mediawiki.org/wiki/API:Etiquette#Request_limit.

> Per those commits, the limit seems to be 50MB. [...] Can we lower this more / is it configurable now? Like 10MB?

A PR is doable. I don't really have a sense of how often it's exceeded, though, now that PDFs are excluded. We could just try it.

> I did have a look at that. It's a single IP in OVH (a hosting provider) in Canada that seems to be responsible for ~90% of that traffic [...]

Great, thanks for checking it out!

In terms of a get endpoint, would swagger docs suffice? I started to do something like that but never got around to finishing it :/ https://github.com/zotero/translation-server/issues/76

> A PR is doable. I don't really have a sense of how often it's exceeded, though, now that PDFs are excluded. We could just try it.

Neither do I. I guess we lack metrics for that, but on the other hand we lack metrics for everything at the zotero level (aside from CPU/memory/network), so that's expected.

But on a more general note, magic numbers such as the one at https://github.com/zotero/translation-server/commit/9f2c36664fe5f0c3e98ffeb369194f7c242120c3#diff-9097bfe5ae868eccdb921b5a043bd3bc4087e8aacd257640ab51f1513baae7b4R86 should be configurable, unless the default is so sensible that no one would ever override it.

> In terms of a get endpoint, would swagger docs suffice? I started to do something like that but never got around to finishing it :/ https://github.com/zotero/translation-server/issues/76

That would work fine. Many thanks for working on that!

In review -> https://github.com/zotero/translation-server/pull/131

> In review -> https://github.com/zotero/translation-server/pull/131

@Mvolz, care to ping upstream again on that one? It's been 6 months already.

We had another slew of pages disrupting our weekend as SREs, pages for which we could literally do nothing other than restart things and hope the problem doesn't show up again. This is not sustainable and is demoralizing long term. I am afraid that if this continues, SRE will have to pull support for zotero and stop receiving pages for incidents/outages/degradations the service faces.

Judging from the graphs (https://w.wiki/4zi8), we still had spare capacity (14 replicas and at most 9 fully utilized) and could serve requests. So if we had this GET endpoint to use as a readiness probe, kubernetes would have automatically depooled the non-responsive instances until they were responsive again, leading to no user-visible issues for end users.

> We had another slew of pages disrupting our weekend as SREs. [...] I am afraid that if this continues, SRE will have to pull support for zotero and stop receiving pages for incidents/outages/degradations the service faces.

Oh no, that's terrible :(

Let's just merge this locally; that's what we have a fork for. Whether it gets done later upstream or not, that's all fixable during normal working hours :).

Change 774848 had a related patch set uploaded (by Mvolz; author: Mvolz):

[mediawiki/services/zotero@master] Add OpenApi spec and Swagger documentation

https://gerrit.wikimedia.org/r/774848

Change 776284 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] zotero: Disable paging

https://gerrit.wikimedia.org/r/776284

For posterity's sake, alert histograms from icinga for the 2 instances of zotero.

codfw:

image.png (388×976 px, 19 KB)

eqiad:

image.png (416×966 px, 19 KB)

In none of these cases did we do anything more than indiscriminately restart all instances.

Change 776284 merged by Alexandros Kosiaris:

[operations/puppet@production] zotero: Disable paging

https://gerrit.wikimedia.org/r/776284

Mentioned in SAL (#wikimedia-operations) [2022-04-02T11:26:40Z] <akosiaris> disable zotero paging until T291707 is resolved.

Change 774848 merged by jenkins-bot:

[mediawiki/services/zotero@master] Add OpenApi spec and Swagger documentation

https://gerrit.wikimedia.org/r/774848

Change 777758 had a related patch set uploaded (by Mvolz; author: Mvolz):

[operations/deployment-charts@master] Update zotero to include get endpoint

https://gerrit.wikimedia.org/r/777758

Change 777758 merged by jenkins-bot:

[operations/deployment-charts@master] Update zotero to include get endpoint

https://gerrit.wikimedia.org/r/777758

akosiaris@deploy1002:~$ kubectl get events
LAST SEEN   TYPE      REASON      OBJECT                                  MESSAGE
2m49s       Warning   Unhealthy   pod/zotero-production-684574794-6n999   Readiness probe failed: Get http://10.64.64.81:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
38s         Warning   Unhealthy   pod/zotero-production-684574794-g5xr6   Readiness probe failed: Get http://10.64.65.14:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
6m17s       Warning   Unhealthy   pod/zotero-production-684574794-gv9nf   Readiness probe failed: Get http://10.64.68.96:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
9m40s       Warning   Unhealthy   pod/zotero-production-684574794-jglk4   Readiness probe failed: Get http://10.64.64.67:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
17m         Warning   Unhealthy   pod/zotero-production-684574794-jlk69   Readiness probe failed: Get http://10.64.68.239:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
3m35s       Warning   Unhealthy   pod/zotero-production-684574794-nlqz2   Readiness probe failed: Get http://10.64.64.60:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
17m         Warning   Unhealthy   pod/zotero-production-684574794-nr4ng   Readiness probe failed: Get http://10.64.69.222:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
31m         Warning   Unhealthy   pod/zotero-production-684574794-ps595   Readiness probe failed: Get http://10.64.68.138:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
27m         Warning   Unhealthy   pod/zotero-production-684574794-sk5zw   Readiness probe failed: Get http://10.64.67.19:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
39m         Warning   Unhealthy   pod/zotero-production-684574794-st8qr   Readiness probe failed: Get http://10.64.67.28:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2m38s       Warning   Unhealthy   pod/zotero-production-684574794-tj2qh   Readiness probe failed: Get http://10.64.67.249:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
48m         Warning   Unhealthy   pod/zotero-production-684574794-wf5rq   Readiness probe failed: Get http://10.64.71.49:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
5m2s        Warning   Unhealthy   pod/zotero-production-684574794-wlbwg   Readiness probe failed: Get http://10.64.66.98:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
34m         Warning   Unhealthy   pod/zotero-production-684574794-xtdqm   Readiness probe failed: Get http://10.64.66.153:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Readiness probe works! @Mvolz, many thanks for that!

I'll be monitoring this for the next few days and re-enable paging early next week.

I don't see an alert since Apr 5 in Icinga for either codfw or eqiad. I am gonna re-enable paging.

Change 779118 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] Revert "zotero: Disable paging"

https://gerrit.wikimedia.org/r/779118

Change 779118 merged by Alexandros Kosiaris:

[operations/puppet@production] Revert "zotero: Disable paging"

https://gerrit.wikimedia.org/r/779118

akosiaris claimed this task.

Tentatively resolving this.