
Graphite returning server errors (out of memory?)
Closed, Resolved (Public)

Description

For the past hour or so, Graphite (specifically graphite1004?) seems to be having trouble; first, it returned 502 Bad Gateway errors (first icinga alert 2019-03-05 16:17:01), then 503 Service Unavailable (no icinga alerts for these, first seen by me 2019-03-05 16:49:40). @herron already tried restarting uwsgi-graphite-web (SAL 1, 2), but it doesn’t seem to have fixed the problem yet (though at first it appeared to solve the 502 errors). The server might be out of memory:

2019-03-05 16:52:34 <herron> graphite1004 kernel: [10179701.956141] oom_reaper: reaped process 184953 (uwsgi), now anon-rss:0kB, file-rss:0kB, shmem-rss:124kB

Event Timeline

It seems this was actually caused by me – I was editing the wikidata-edits board, specifically the OAuth panel, which uses a lot of aliasSub calls to turn e. g. wikidata.rc.edits.oauth.1253 into wikidata.rc.edits.oauth.QuickStatements, so that aliasByNode() afterwards shows the OAuth consumer name instead of the consumer ID. (The list of consumers is hand-maintained, and I wanted to update it.)

In the process, I noticed that this would turn wikidata.rc.edits.oauth.593 into wikidata.rc.edits.oauth.5Widar13, because the 93 → Widar [1.3] replacement wasn’t anchored to replace only a full “node” in the series name. While trying to figure out how exactly to fix this (I think the correct solution is aliasSub('\.93', '.Widar13')), I inadvertently requested a bunch of renders where the aliasSub() changed the number of periods in the series name, which seems to have confused aliasByNode() (according to https://www.irccloud.com/pastebin/dVXB5kM1/). The problem persisted until I closed the browser tab that kept making these bad requests; after another restart of the service (to clear out the requests that were never going to finish or time out), it seems to have gone away.
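To illustrate the anchoring problem described above, here is a minimal sketch. It assumes aliasSub() behaves roughly like Python's re.sub() applied to the series name, and that aliasByNode() picks a node by splitting the name on periods. The helper names alias_sub and alias_by_node are hypothetical stand-ins, not Graphite's actual code, and the trailing `$` anchor is an addition of this sketch rather than something proposed in the thread.

```python
import re


def alias_sub(name, search, replace):
    """Hypothetical stand-in for aliasSub(): regex-replace on one series name."""
    return re.sub(search, replace, name)


def alias_by_node(name, node):
    """Hypothetical stand-in for aliasByNode(): pick one period-separated node."""
    return name.split(".")[node]


series = "wikidata.rc.edits.oauth.593"

# Unanchored: "93" also matches the tail of consumer ID 593.
unanchored = alias_sub(series, r"93", "Widar13")

# Anchored to a whole trailing node: 593 is left alone, 93 is renamed.
anchored_miss = alias_sub(series, r"\.93$", ".Widar13")
anchored_hit = alias_sub("wikidata.rc.edits.oauth.93", r"\.93$", ".Widar13")

# A replacement that itself contains a period adds a node to the name,
# so node 4 is no longer the full consumer name.
dotted = alias_sub("wikidata.rc.edits.oauth.93", r"\.93$", ".Widar [1.3]")

print(unanchored)                # wikidata.rc.edits.oauth.5Widar13
print(anchored_miss)             # wikidata.rc.edits.oauth.593
print(anchored_hit)              # wikidata.rc.edits.oauth.Widar13
print(alias_by_node(dotted, 4))  # Widar [1
```

Without an end anchor, a pattern like '\.93' would presumably still match the start of a longer ID such as .931, which is why the sketch anchors both sides of the node.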

Looking at the board is fine now; I’ll try to be more careful when editing it (i. e. edit the query in Notepad instead of using Grafana’s graphical editor, then try it out by curling the Graphite API directly once I think it’s fine, and notify ops if the request appears to be moribund).
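As a rough sketch of the "try it against the Graphite API directly" approach, the following builds a render URL and fetches it with a client-side timeout, so a bad query is abandoned instead of being retried indefinitely by a dashboard. The base URL, the example target, and the 30-second timeout are assumptions for illustration; /render with target, from, and format parameters is Graphite's render API.

```python
import json
import urllib.parse
import urllib.request


def build_render_url(base_url, target, from_="-1h"):
    """Build a Graphite /render URL that returns JSON for a single target."""
    query = urllib.parse.urlencode(
        {"target": target, "from": from_, "format": "json"}
    )
    return f"{base_url.rstrip('/')}/render?{query}"


def fetch_render(url, timeout_s=30):
    """Fetch one render; raises on timeout instead of waiting forever."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.load(resp)


# Hypothetical host and query, for illustration only.
url = build_render_url(
    "https://graphite.example.org",
    "aliasByNode(wikidata.rc.edits.oauth.*, 4)",
)
print(url)
# data = fetch_render(url)  # run manually once the query looks right
```

Issuing a single request by hand this way means a query that never finishes costs one worker for one timeout period, rather than a steady stream of requests from an open browser tab.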

Edit: my edited query was not fine; I’ll stop messing with this for now.

Change 494620 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] add uwsgi worker timeouts + max RSS for graphite

https://gerrit.wikimedia.org/r/494620

jbond triaged this task as High priority. (Mar 6 2019, 10:43 AM)
jbond subscribed.

Change 494620 merged by CDanis:
[operations/puppet@production] graphite: uwsgi workers: set timeouts + max RSS

https://gerrit.wikimedia.org/r/494620

Lucas, can you verify that this is resolved?

Seems so, yes. I got an error from Varnish after 60 seconds and the graphite-eqiad board looks totally healthy. Thanks!