Kartotherian/Maps outage followups, 2020-10-29
Closed, Resolved (Public)

Description

  • lightweight incident report for the incident
  • open a task to migrate all Graphite uses to Prometheus, and also to make sure the exported metrics make sense (e.g. the existing console has some notes about naming inconsistencies that make things difficult to interpret)
  • open a task to clean up some of the common logging spam, and to add more data to error logs
  • open a design task to fix the "backlog cascading failure" issue. Somehow, service-runner should know that a request is too old to be worth working on, and throw it away before doing actual work on it. (If we fix this, then adding capacity back to the service would have been sufficient to bring us out of the outage. It would also mean that, in the general case, bursts of extra traffic while we're near capacity are much less likely to push us into an outage from which we can't recover.)
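
A minimal sketch of the "throw away requests that are too old" idea from the last item, written in Python for illustration only. It is not service-runner's API: `DeadlineQueue`, `QueuedRequest`, and `max_age_s` are hypothetical names, and the age limit would in practice be chosen to match the timeout at the Traffic layer.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Optional


@dataclass
class QueuedRequest:
    payload: str
    enqueued_at: float  # clock reading taken when the request arrived


class DeadlineQueue:
    """FIFO queue that discards requests older than max_age_s on dequeue."""

    def __init__(self, max_age_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_age_s = max_age_s  # hypothetical knob, not a real setting
        self.clock = clock
        self.items: Deque[QueuedRequest] = deque()
        self.dropped = 0

    def put(self, payload: str) -> None:
        self.items.append(QueuedRequest(payload, self.clock()))

    def get(self) -> Optional[QueuedRequest]:
        """Return the oldest request still worth serving, or None if empty."""
        while self.items:
            req = self.items.popleft()
            if self.clock() - req.enqueued_at <= self.max_age_s:
                return req
            # The caller has most likely given up already, so skip the work.
            self.dropped += 1
        return None
```

The age check happens at dequeue time, so a stale request costs only a comparison rather than a full tile render. Counting drops in `dropped` gives a number that could be exported as a metric.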

Event Timeline

Change 637536 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/dns@master] temporarily failoid kartotherian in eqiad

https://gerrit.wikimedia.org/r/637536

CDanis added subscribers: Gehel, hnowlan, Joe, wiki_willy.

This was a capacity issue.

The depooling of maps1004, followed by the usual daily zenith of traffic, led to the rest of maps@eqiad being very over capacity and pegged at 100% CPU.

We repooled maps1004 at 16:26; however, its CPU usage immediately shot to 100% along with the rest of the cluster.

Unfortunately, it seems that Node.js and/or service-runner services can get into a state where they are always processing backlogged queries -- many of which must have already timed out at the Traffic layer! -- and thus never make further forward progress without manual intervention. This should be fixed.
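
The backlog behaviour described above can be reproduced with a toy model. The sketch below is only an illustration under stated assumptions: one FIFO queue, fixed capacity per tick, and made-up numbers that are not measurements from this incident. `simulate` and all of its parameters are hypothetical names.

```python
from collections import deque


def simulate(arrivals, capacity, timeout, shed_stale):
    """Return, per tick, how many requests were answered within `timeout` ticks."""
    queue = deque()  # each entry is the tick at which a request arrived
    useful = []
    for tick, n in enumerate(arrivals):
        queue.extend([tick] * n)
        if shed_stale:
            # Discard requests whose caller has timed out, at no capacity cost.
            while queue and tick - queue[0] > timeout:
                queue.popleft()
        served_in_time = 0
        for _ in range(min(capacity, len(queue))):
            arrived = queue.popleft()
            if tick - arrived <= timeout:
                served_in_time += 1
        useful.append(served_in_time)
    return useful


# Illustrative load: a burst at 3x capacity, then traffic just under capacity.
arrivals = [30] * 5 + [9] * 10
without_shedding = simulate(arrivals, capacity=10, timeout=2, shed_stale=False)
with_shedding = simulate(arrivals, capacity=10, timeout=2, shed_stale=True)
```

In this model the server without shedding stays fully busy after the burst yet answers nothing in time, because every request it picks up is already older than the timeout. With shedding, useful throughput stays at capacity throughout. A real service would also see client retries, which this sketch leaves out.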

At 16:46 we percussively restarted kartotherian on all maps1* and that brought things back to the usual steady state of 60% CPU utilization -- which, again, means we're underprovisioned.
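
A rough way to see why a 60% steady state leaves little headroom is sketched below. It assumes perfectly even load balancing and CPU use that scales linearly with traffic, neither of which holds exactly here. The host count in the example is hypothetical, since the pool size is not stated in this task, and `utilization_after_depool` is an illustrative helper, not an existing tool.

```python
def utilization_after_depool(current_util: float, hosts: int,
                             depooled: int = 1) -> float:
    """Per-host utilization once `depooled` hosts leave an evenly balanced pool."""
    remaining = hosts - depooled
    if remaining <= 0:
        raise ValueError("at least one host must stay pooled")
    # The same total work is spread over fewer hosts.
    return current_util * hosts / remaining


# Hypothetical pool of 4 hosts at the 60% steady state mentioned above.
projected = utilization_after_depool(0.60, hosts=4)
```

Under these assumptions, losing one host from a pool of four moves each remaining host from 60% to roughly 80% before any daily peak is added on top.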

The new hardware seems to have been racked in codfw (T260271), although it looks like the new servers are not yet pooled. In eqiad, however, where the situation is more critical because our load balancing is imperfect and eqiad consistently runs hotter due to receiving all the European traffic*, we're still waiting for the new hardware to be racked: T260269

(*: It should be the case that this doesn't matter, because either core datacenter should be capable of serving all traffic alone, so we are N+1. However, without the new hardware, this is a very load-bearing "should".)

Change 637536 abandoned by CDanis:
[operations/dns@master] temporarily failoid kartotherian in eqiad

Reason:
resolved by kartotherian restart

https://gerrit.wikimedia.org/r/637536

CDanis renamed this task from Kartotherian/Maps issues, 2020-10-29 to Kartotherian/Maps outage followups, 2020-10-29. Oct 29 2020, 5:45 PM

Kartotherian's logging was a complicating factor that could be improved - there are many log messages that look like potentially critical errors that are actually quite benign. Additionally, the timeouts caused by backlogged queries were not obvious as timeouts in responses.

Filed T266820 to bring codfw maps to production

Thanks!

And yeah, both the logging and the exported metrics could use a lot of love...

I've written some proposed followups in the task description, feel free to comment or edit :)

My earlier comment at T266807#6589341 is pretty close to a lightweight IR, fwiw.

jijiki triaged this task as Medium priority. Nov 10 2020, 4:27 PM

Removing inactive task assignee.

@Marostegui: Thank you for following up, I missed your earlier ping.

Reading T266807#6591631, it seems to me we have the basics for a report which would cover item #1 (in this task's description), assuming a report is created in wikitech.

However, the rest of the items still need to be addressed. I'm adding @WDoranWMF for guidance on maps.

This is more than 2 years old and the latest updates don't offer any new info. @lmata, should we just close this? After all this time, I doubt we can glean anything useful from this one. The architecture of Maps has already changed substantially since the time of this incident.

lmata claimed this task.

Thanks, yeah, I don't think this will see any more progress, and I agree with your assessment. So I'm closing the task.