
Request took down both zotero and citoid (exceeding memory)
Closed, Invalid · Public

Description

A request took down both zotero and citoid, probably by taking up too much memory:

https://grafana.wikimedia.org/d/000000620/xxxx-zotero-debugging-kubernetes?orgId=1&from=1579716341745&to=1579717330571&fullscreen&panelId=41

https://logstash.wikimedia.org/app/kibana#/dashboard/default?_g=(refreshInterval%3A(display%3AOff%2Cpause%3A!f%2Cvalue%3A0)%2Ctime%3A(from%3A'2020-01-22T18%3A10%3A06.199Z'%2Cmode%3Aabsolute%2Cto%3A'2020-01-22T18%3A20%3A06.201Z')) for the time range; search for type:citoid (the "share" function doesn't include the search term).

Sample citoid crash:

{
  "_index": "logstash-2020.01.22",
  "_type": "citoid",
  "_id": "AW_Og9d1KWrIH1QReyyj",
  "_version": 1,
  "_score": null,
  "_source": {
    "source_host": "10.192.64.28",
    "err_levelPath": "fatal/service-runner/unhandled",
    "level": "EMERGENCY",
    "line": "22",
    "pid": 16,
    "type": "citoid",
    "message": "Maximum call stack size exceeded",
    "version": "1.0",
    "normalized_message": "Maximum call stack size exceeded",
    "tags": [
      "input-gelf-12201",
      "deprecated-input",
      "es",
      "gelf",
      "normalized_message_untrimmed"
    ],
    "err_message": "",
    "file": "/srv/service/node_modules/domutils/lib/traversal.js",
    "@timestamp": "2020-01-22T18:29:02.551Z",
    "err_name": "RangeError",
    "host": "citoid-production-685469f44d-nnmn8",
    "@version": "1",
    "gelf_level": "0",
    "err_stack": "RangeError: Maximum call stack size exceeded\n    at Object.exports.getName (/srv/service/node_modules/domutils/lib/traversal.js:22:27)\n    at tag (/srv/service/node_modules/css-select/lib/general.js:21:11)\n    at findAll (/srv/service/node_modules/domutils/lib/querying.js:87:6)\n    at findAll (/srv/service/node_modules/domutils/lib/querying.js:90:27)\n    at findAll (/srv/service/node_modules/domutils/lib/querying.js:90:27)\n    at findAll (/srv/service/node_modules/domutils/lib/querying.js:90:27)\n    at findAll (/srv/service/node_modules/domutils/lib/querying.js:90:27)\n    at findAll (/srv/service/node_modules/domutils/lib/querying.js:90:27)\n    at findAll (/srv/service/node_modules/domutils/lib/querying.js:90:27)\n    at findAll (/srv/service/node_modules/domutils/lib/querying.js:90:27)\n    at findAll (/srv/service/node_modules/domutils/lib/querying.js:90:27)\n    at findAll (/srv/service/node_modules/domutils/lib/querying.js:90:27)\n    at findAll (/srv/service/node_modules/domutils/lib/querying.js:90:27)\n    at findAll (/srv/service/node_modules/domutils/lib/querying.js:90:27)\n    at findAll (/srv/service/node_modules/domutils/lib/querying.js:90:27)\n    at findAll (/srv/service/node_modules/domutils/lib/querying.js:90:27)"
  },
  "fields": {
    "@timestamp": [
      1579717742551
    ]
  },
  "sort": [
    1579717742551
  ]
}
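
The RangeError above comes from findAll in domutils recursing once per DOM nesting level, so a pathologically deep document can exhaust the call stack. A minimal, self-contained sketch of the failure mode (not citoid's actual code; the tree here is built synthetically):

// Recursive querying in the style of domutils' findAll: one stack frame
// per nesting level of the tree being searched.
function findAll(test, nodes) {
  let result = [];
  for (const node of nodes) {
    if (test(node)) result.push(node);
    if (node.children && node.children.length) {
      result = result.concat(findAll(test, node.children));
    }
  }
  return result;
}

// Build an absurdly deep tree, like the HTML a misbehaving page can produce.
let tree = { name: 'div', children: [] };
for (let i = 0; i < 1e6; i++) tree = { name: 'div', children: [tree] };

try {
  findAll((n) => n.name === 'a', [tree]);
} catch (e) {
  console.log(e.name + ': ' + e.message); // RangeError: Maximum call stack size exceeded
}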

We don't actually know what the request was yet, as the kibana logs don't display it (it should be in the raw citoid logs).

Event Timeline

Mvolz created this task. · Jan 22 2020, 7:03 PM
Restricted Application added a subscriber: Aklapper. · Jan 22 2020, 7:03 PM
MoritzMuehlenhoff triaged this task as High priority. · Jan 23 2020, 8:23 AM
akosiaris added a comment (edited). · Jan 23 2020, 8:34 AM

> it should be in the raw logs

Not sure what this refers to. If it is the zotero logs, note that those don't exist. We had to turn them off, as they were in a really bad format that caused issues in our logging pipeline. Assuming https://github.com/zotero/translation-server/issues/2 is addressed, we could reenable them. The disabling was done back in 827891af3d49beefc4a32042ff0ee2907dc77f0b.

If it's the citoid logs, running kubectl logs -l app=citoid after having sourced .hfenv from the correct cluster directory works. But those are going to be much harder to parse, of course.

But I do see in logstash an err_body_internalURI attribute in some entries that seems to have what we need?

Mvolz updated the task description. · Jan 23 2020, 11:12 AM

I just noticed that, for some reason, setting DEBUG_LEVEL: 0 for zotero no longer works.

>> it should be in the raw logs
>
> Not sure what this refers to. If it is the zotero logs, note that those don't exist. We had to turn them off, as they were in a really bad format that caused issues in our logging pipeline. Assuming https://github.com/zotero/translation-server/issues/2 is addressed, we could reenable them. The disabling was done back in 827891af3d49beefc4a32042ff0ee2907dc77f0b.
> If it's the citoid logs, running kubectl logs -l app=citoid after having sourced .hfenv from the correct cluster directory works. But those are going to be much harder to parse, of course.
> But I do see in logstash an err_body_internalURI attribute in some entries that seems to have what we need?

The citoid logs, yes. So err_body_internalURI does look like what we need, but it wasn't in the entry I was interested in.

Part of the problem, I think, is that a request that breaks zotero won't necessarily bubble up to kibana from citoid, because we don't have a high enough logging level on those events (i.e. not warn). Historically zotero misses were very common and nothing to worry about, and we used citoid directly for requests a lot more. Now that zotero handles the overwhelming majority of requests (as of the last six months to a year or so), this should be reconsidered: we should warn whenever zotero can't handle a request. Some of these will be for trivial reasons (there are a few select websites zotero fails on; it's a known issue but not a stability issue), but for incidents like these it would be helpful.

Mvolz added a comment (edited). · Jan 23 2020, 12:31 PM

>> it should be in the raw logs
>
> Not sure what this refers to. If it is the zotero logs, note that those don't exist. We had to turn them off, as they were in a really bad format that caused issues in our logging pipeline. Assuming https://github.com/zotero/translation-server/issues/2 is addressed, we could reenable them. The disabling was done back in 827891af3d49beefc4a32042ff0ee2907dc77f0b.
> If it's the citoid logs, running kubectl logs -l app=citoid after having sourced .hfenv from the correct cluster directory works. But those are going to be much harder to parse, of course.

I tried

source .hfenv 
kubectl logs -l app=citoid-production

in both /srv/deployment-charts/helmfile.d/services/codfw/citoid and /srv/deployment-charts/helmfile.d/services/eqiad/citoid.

No results. Any idea what I'm doing wrong? (citoid wasn't listed as an option: the only options were citoid-production, production-metrics-exporter, or citoid-production-tls-proxy, and none of them listed any logfiles.)

akosiaris added a comment (edited). · Jan 23 2020, 1:07 PM

Use -l app=citoid, as that's the value of the app label, not citoid-production.

You can have a look at all the labels that are attached to pods either at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/charts/citoid/templates/deployment.yaml#17 or just by doing kubectl describe deployment and looking at the pod template/labels stanza.

What it did ask for was the name of the container, since there are 3 containers in the pod:

  1. citoid-production (what you want)
  2. production-metrics-exporter (the prometheus exporter for the grafana dashboards)
  3. citoid-production-tls-proxy (the container that makes sure TLS is terminated before the app)

The exact call would be

kubectl logs -l app=citoid -c citoid-production

> I just noticed that, for some reason, setting DEBUG_LEVEL: 0 for zotero no longer works.

Turns out those aren't the logs we disabled. They're errors from various other libraries, e.g. failing to parse some CSS, as in the stack trace below:

2020-01-21T16:11:24.714112759Z Error: Could not parse CSS stylesheet
2020-01-21T16:11:24.714170578Z     at exports.createStylesheet (/srv/app/node_modules/jsdom/lib/jsdom/living/helpers/stylesheets.js:34:21)
2020-01-21T16:11:24.714177931Z     at HTMLStyleElementImpl._updateAStyleBlock (/srv/app/node_modules/jsdom/lib/jsdom/living/nodes/HTMLStyleElement-impl.js:67:5)
2020-01-21T16:11:24.714183737Z     at HTMLStyleElementImpl._poppedOffStackOfOpenElements (/srv/app/node_modules/jsdom/lib/jsdom/living/nodes/HTMLStyleElement-impl.js:42:10)
2020-01-21T16:11:24.714188711Z     at OpenElementStack.pop (/srv/app/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:17:12)
2020-01-21T16:11:24.714209803Z     at Object.endTagInText [as END_TAG_TOKEN] (/srv/app/node_modules/parse5/lib/parser/index.js:2153:20)
2020-01-21T16:11:24.714214318Z     at Parser._processToken (/srv/app/node_modules/parse5/lib/parser/index.js:657:55)
2020-01-21T16:11:24.714218228Z     at Parser._processInputToken (/srv/app/node_modules/parse5/lib/parser/index.js:684:18)
2020-01-21T16:11:24.714222279Z     at Parser._runParsingLoop (/srv/app/node_modules/parse5/lib/parser/index.js:440:18)
2020-01-21T16:11:24.714226235Z     at Parser.parse (/srv/app/node_modules/parse5/lib/parser/index.js:344:14)
2020-01-21T16:11:24.714230145Z     at Object.parse (/srv/app/node_modules/parse5/lib/index.js:10:19)

or errors directly from node (also not handled/disabled)

2020-01-22T11:01:41.436402876Z (node:1) UnhandledPromiseRejectionWarning
2020-01-22T11:01:41.436456979Z (node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2554)
2020-01-22T11:01:41.460166655Z (node:1) UnhandledPromiseRejectionWarning
2020-01-22T11:01:41.460196958Z (node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2555)
2020-01-22T11:01:41.49044047Z (node:1) UnhandledPromiseRejectionWarning
2020-01-22T11:01:41.490488134Z (node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2556)
2020-01-22T11:01:43.609115114Z (node:1) UnhandledPromiseRejectionWarning
2020-01-22T11:01:43.609174512Z (node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2557)
2020-01-22T11:01:43.687915951Z (node:1) UnhandledPromiseRejectionWarning
2020-01-22T11:01:43.687958881Z (node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2558)
2020-01-22T11:01:47.862200244Z (node:1) UnhandledPromiseRejectionWarning
2020-01-22T11:01:47.862254667Z (node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2559)
2020-01-22T11:01:49.870064446Z (node:1) UnhandledPromiseRejectionWarning
2020-01-22T11:01:49.87011025Z (node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2560)
2020-01-22T11:01:52.593730548Z (node:1) UnhandledPromiseRejectionWarning
2020-01-22T11:01:52.593782319Z (node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2561)
2020-01-22T11:01:52.616259154Z (node:1) UnhandledPromiseRejectionWarning
2020-01-22T11:01:52.616294627Z (node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2562)
2020-01-22T11:01:56.79085685Z (node:1) UnhandledPromiseRejectionWarning
2020-01-22T11:01:56.790905348Z (node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2563)
2020-01-22T11:01:56.812243323Z (node:1) UnhandledPromiseRejectionWarning
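
Both kinds of noise above can be captured at the source instead of leaking to stderr. A minimal sketch, assuming the jsdom package; handleLibError is a hypothetical hook standing in for whatever the service's logger provides:

const { JSDOM, VirtualConsole } = require('jsdom');

// Hypothetical hook: forward library errors to the service logger.
function handleLibError(source, err) {
  console.warn('[' + source + ']', err && err.message ? err.message : err);
}

// jsdom routes internal errors (e.g. "Could not parse CSS stylesheet")
// to a VirtualConsole instead of writing them to the real console.
const virtualConsole = new VirtualConsole();
virtualConsole.on('jsdomError', (err) => handleLibError('jsdom', err));
new JSDOM('<style>p { color: } }</style>', { virtualConsole }); // illustrative bad CSS

// Node only prints UnhandledPromiseRejectionWarning when no handler is
// registered; a process-level handler logs the actual reason and stack.
process.on('unhandledRejection', (reason) => {
  handleLibError('promise', reason instanceof Error ? reason : new Error(String(reason)));
});
Promise.reject(new Error('example rejection'));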
Mvolz added a comment. · Jan 23 2020, 1:55 PM

> Use -l app=citoid, as that's the value of the app label, not citoid-production.
> You can have a look at all the labels that are attached to pods either at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/charts/citoid/templates/deployment.yaml#17 or just by doing kubectl describe deployment and looking at the pod template/labels stanza.
> What it did ask for was the name of the container, since there are 3 containers in the pod:
>
>   1. citoid-production (what you want)
>   2. production-metrics-exporter (the prometheus exporter for the grafana dashboards)
>   3. citoid-production-tls-proxy (the container that makes sure TLS is terminated before the app)
>
> The exact call would be
> kubectl logs -l app=citoid -c citoid-production

Thanks. That's working now, but I've downloaded the log file and it's just what's already available on kibana: warn level or higher. There are no trace- or debug-level messages (10/20) in the logs - I don't suppose we have those anywhere?

> Thanks. That's working now, but I've downloaded the log file and it's just what's already available on kibana: warn level or higher. There are no trace- or debug-level messages (10/20) in the logs - I don't suppose we have those anywhere?

Nope. What's in there is what citoid is instructed to log. The config is at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/charts/citoid/templates/config.yaml#16

Mvolz updated the task description. · Jan 24 2020, 11:35 AM
Mvolz added a comment (edited). · Jan 24 2020, 2:37 PM

>> Thanks. That's working now, but I've downloaded the log file and it's just what's already available on kibana: warn level or higher. There are no trace- or debug-level messages (10/20) in the logs - I don't suppose we have those anywhere?
>
> Nope. What's in there is what citoid is instructed to log. The config is at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/charts/citoid/templates/config.yaml#16

Ah - well, unfortunately, without the trace and debug messages, and with err_body_internalURI missing from the warning log entries of interest, the logs seem largely useless for figuring out which url was the culprit :/

I've put in a PR to bump zotero misses up to warn; what do you think about logging trace and debug as well in the meantime? Overkill?

err_body_internalURI should ideally be in every single log message too; I guess I'll try to figure out why that isn't happening.
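
As a sketch of one way to get there (hypothetical, not citoid's actual wiring): with a bunyan-style logger, binding the URI to a per-request child logger once means every subsequent log line carries it automatically.

const bunyan = require('bunyan');

const logger = bunyan.createLogger({ name: 'citoid' });

function handleRequest(req) {
  // `internalURI` stands in for whatever field name the service uses.
  const reqLogger = logger.child({ internalURI: req.url });
  reqLogger.warn('zotero miss');          // carries internalURI
  reqLogger.info('fell back to citoid');  // so does every other line
}

handleRequest({ url: 'https://example.org/article' });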

>>> Thanks. That's working now, but I've downloaded the log file and it's just what's already available on kibana: warn level or higher. There are no trace- or debug-level messages (10/20) in the logs - I don't suppose we have those anywhere?
>
>> Nope. What's in there is what citoid is instructed to log. The config is at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/charts/citoid/templates/config.yaml#16
>
> Ah - well, unfortunately, without the trace and debug messages, and with err_body_internalURI missing from the warning log entries of interest, the logs seem largely useless for figuring out which url was the culprit :/
> I've put in a PR to bump zotero misses up to warn; what do you think about logging trace and debug as well in the meantime? Overkill?

info would be acceptable, I guess. trace/debug might be overkill though, especially given that citoid seems to log the body of the entire page (why does it do that again?). Also, we don't know when the next event will occur (it's been months since we last got paged for zotero memory issues).

> err_body_internalURI should ideally be in every single log message too; I guess I'll try to figure out why that isn't happening.

+1

>>>> Thanks. That's working now, but I've downloaded the log file and it's just what's already available on kibana: warn level or higher. There are no trace- or debug-level messages (10/20) in the logs - I don't suppose we have those anywhere?
>
>>> Nope. What's in there is what citoid is instructed to log. The config is at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/charts/citoid/templates/config.yaml#16
>
>> Ah - well, unfortunately, without the trace and debug messages, and with err_body_internalURI missing from the warning log entries of interest, the logs seem largely useless for figuring out which url was the culprit :/
>> I've put in a PR to bump zotero misses up to warn; what do you think about logging trace and debug as well in the meantime? Overkill?
>
> info would be acceptable, I guess. trace/debug might be overkill though, especially given that citoid seems to log the body of the entire page (why does it do that again?). Also, we don't know when the next event will occur (it's been months since we last got paged for zotero memory issues).
>
>> err_body_internalURI should ideally be in every single log message too; I guess I'll try to figure out why that isn't happening.
>
> +1

So it turns out (I obviously didn't write this part) that we have a wrapper specifically for http errors that logs the internal uri and also the entire body. This is actually pretty useful if anyone is paying attention to the logs, which I may actually start doing; individual websites do funny things sometimes, and the only way to know these problems exist is to at least log them. I agree that warning with the whole body can be overkill, but it's definitely useful and readable sometimes. For instance, I saw right in the logs that the body of a page said "you need to email so-and-so to get permission to scrape these pages", and I did so. (Although they haven't emailed me back yet!)

Anyway, that also explains why we don't have an internal uri for any of the other errors; it's only there for the http errors.

I guess we aren't gonna find the source of this specific event. @Mvolz, do you feel we are at least better prepared, logging-wise, for the next time this happens?

Mvolz added a comment. · Thu, Mar 12, 2:27 PM

> I guess we aren't gonna find the source of this specific event. @Mvolz, do you feel we are at least better prepared, logging-wise, for the next time this happens?

Yes...

But I would still like to solve https://github.com/wikimedia/service-template-node/issues/128, because that way any warnings would always come with the request id. Right now we are logging failed zotero urls, so we might be able to catch the culprit based on timing, but the ideal way is to have the request info directly in every citoid error.

akosiaris closed this task as Invalid. · Thu, Mar 12, 3:10 PM

OK, I've pinged @Pchelolo at https://github.com/wikimedia/service-template-node/issues/128; maybe he can help. But otherwise, I am inclined to close this as Invalid for now and open a new task (or reopen this one) at the next occurrence. Feel free to reopen.