Tue, May 21
We saw one of these events at 14:48 today; pybal reported fetch failures for -- and wanted to depool -- basically the entire appserver fleet: https://phabricator.wikimedia.org/P8551
Mon, May 20
+1. In general I think it would be a great idea to do a lot more with annotations than we presently do:
Sun, May 19
Thanks! We now believe this is resolved.
Tue, May 14
Mon, May 13
Here's my tentative plan for moving forward with this, including a rollout procedure:
Fri, May 10
Wed, May 8
Some quick notes from today's meeting:
Tue, May 7
Trying out a few things here:
Mon, May 6
cc @mark, who I know is about to start looking at hardware requests for the coming FY
We now have IRC alerting based on scraping each Prometheus instance for its process_start_time_seconds metric.
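For reference, here's a minimal sketch of the kind of check this implies -- the endpoint, job label, and restart window are illustrative assumptions, not the production configuration:

```
#!/usr/bin/env python3
"""Rough sketch: report Prometheus instances that restarted recently.

Queries process_start_time_seconds over the HTTP API; in the real setup
the result would be relayed to IRC rather than printed.
"""
import time
import requests

PROM_URL = "http://prometheus.example.org:9090"  # hypothetical endpoint
RESTART_WINDOW = 300  # seconds; illustrative threshold

def recently_restarted(prom_url=PROM_URL, window=RESTART_WINDOW):
    """Return instances whose process started within the last `window` seconds."""
    resp = requests.get(
        f"{prom_url}/api/v1/query",
        params={"query": 'process_start_time_seconds{job="prometheus"}'},
        timeout=10,
    )
    resp.raise_for_status()
    now = time.time()
    return [
        r["metric"].get("instance", "unknown")
        for r in resp.json()["data"]["result"]
        if now - float(r["value"][1]) < window
    ]

if __name__ == "__main__":
    for instance in recently_restarted():
        print(f"Prometheus restarted recently on {instance}")
```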
My patches are also stuck in the queue, and I'm seeing teammates manually V+2 their Puppet changes.
Fri, May 3
It does seem much faster now, thanks @elukey! The impact of loading 30 days on Prometheus is also minimal now -- modest CPU usage, and while there was some increase in RAM consumption over baseline while we were both playing with this, it's not concerning. Thank you :)
Thu, May 2
Also, sorry -- I don't have a lot of time left this week; I can take a deeper look next week.
I think you should just be able to remove the "custom all value" in the dashboard settings and have it work. In this case Grafana will create its own 'all' value that is simply a regex OR'ing together all the known values, which it looks like it computes based on the cluster=kafka_jumbo hidden variable.
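For what it's worth, the auto 'all' value Grafana builds is essentially every known value of the variable OR'd into one regex. A rough Python equivalent of that behaviour (ignoring the filtering by the hidden cluster variable), against a hypothetical Prometheus endpoint and label:

```
import re
import requests

PROM_URL = "http://prometheus.example.org:9090"  # hypothetical endpoint

def all_value_regex(label, prom_url=PROM_URL):
    """Approximate Grafana's auto 'all' value: OR together every known value of a label."""
    resp = requests.get(f"{prom_url}/api/v1/label/{label}/values", timeout=10)
    resp.raise_for_status()
    values = resp.json()["data"]
    return "(" + "|".join(re.escape(v) for v in values) + ")"

# Yields something like instance=~"(host1:9100|host2:9100|host3:9100)"
print(all_value_regex("instance"))
```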
I got tied up with goal work and incident response and have only had a little time to spend on this.
I think the 'real' thing we need to notify on here is when Swift decides it wants to stop using a disk (which it did here)
Wed, May 1
Tue, Apr 30
As documented in T222112#5147131, this didn't actually fix the dashboard at fault in this particular incident, but I've heard from another large-scale Prometheus user (and Prometheus dev) that they've had similar problems and recommend 10M as a value.
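Assuming the value in question is Prometheus's query.max-samples limit (my reading of the above, not something stated outright), the running value can be sanity-checked via the flags API, e.g.:

```
import requests

PROM_URL = "http://prometheus.example.org:9090"  # hypothetical endpoint

def max_samples(prom_url=PROM_URL):
    """Read the live query.max-samples setting from a Prometheus server."""
    resp = requests.get(f"{prom_url}/api/v1/status/flags", timeout=10)
    resp.raise_for_status()
    return int(resp.json()["data"]["query.max-samples"])

# A 10M limit would show up here as 10000000.
print(max_samples())
```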
I'm pretty sure these panels are responsible for most of the Prometheus load.
They take much longer to load than the rest of the panels, and some of them errored out with the new settings.
I think https://grafana.wikimedia.org/d/000000607/cluster-overview might have been missed here? I see at least some old metrics being used there, e.g. node_memory_Cached in the "Memory per host" section.
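For auditing other dashboards, here's a quick-and-dirty sketch that scans an exported dashboard JSON for a few of the pre-0.16 node_exporter metric names -- the mapping is partial and from memory, not an authoritative list:

```
import json
import re
import sys

# Partial mapping of pre-0.16 node_exporter names to their renamed forms.
RENAMED = {
    "node_memory_Cached": "node_memory_Cached_bytes",
    "node_memory_MemFree": "node_memory_MemFree_bytes",
    "node_memory_Buffers": "node_memory_Buffers_bytes",
}

def stale_metrics(dashboard_path):
    """Yield (old_name, count) for old metric names found in a dashboard JSON export."""
    with open(dashboard_path) as f:
        text = json.dumps(json.load(f))
    for old in RENAMED:
        # \b keeps us from matching the already-renamed *_bytes form.
        hits = len(re.findall(rf"{old}\b", text))
        if hits:
            yield old, hits

if __name__ == "__main__":
    for old, hits in stale_metrics(sys.argv[1]):
        print(f"{old}: {hits} occurrence(s); probably wants {RENAMED[old]}")
```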
Mon, Apr 29
Very easy to reproduce this presently. Not an OOM but close.
Fri, Apr 26
Thu, Apr 25
Wed, Apr 24
@Cwek Thank you very much for the detailed report! I've rolled back the experimental change to our DNS records, and by now, more than enough time should have passed for the TTLs to expire on the records that seemed to cause inaccessibility. Hopefully this will rectify things.
Apr 23 2019
authdns-update complete as of ~20:33:56 UTC.
Apr 21 2019
A bit out of date, but
Apr 19 2019
Found a logstash fatal that definitely implicates the database on a commonswiki Special:Log pageload
Looks like the database being slow? Pretty sure this is a MW API call backing the pageload of Special:Log on commonswiki.
Apr 18 2019
Apr 17 2019
Calling this resolved for now -- the mtail-based events are also being monitored by Icinga, and would have caught all the previous instances missed by the node_exporter/kernel counter stats.
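As a rough sketch of the shape of that check (the metric name and threshold here are hypothetical stand-ins, not the actual mtail program's output):

```
import sys
import requests

PROM_URL = "http://prometheus.example.org:9090"  # hypothetical endpoint
# Hypothetical mtail-derived counter; the real metric name differs.
QUERY = 'increase(mtail_kernel_event_total[15m])'

def check(prom_url=PROM_URL, query=QUERY, critical=1.0):
    """Nagios-style check: CRITICAL if the mtail counter increased recently."""
    resp = requests.get(f"{prom_url}/api/v1/query",
                        params={"query": query}, timeout=10)
    resp.raise_for_status()
    worst = max((float(r["value"][1]) for r in resp.json()["data"]["result"]),
                default=0.0)
    if worst >= critical:
        print(f"CRITICAL: {worst:.0f} events in the last 15m")
        return 2
    print("OK: no recent events")
    return 0

if __name__ == "__main__":
    sys.exit(check())
```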
Apr 16 2019
17:02:39 <+logmsgbot> !log cdanis@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1280.eqiad.wmnet
13:26:19 <+logmsgbot> !log cdanis@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1280.eqiad.wmnet,cluster=api_appserver
Apr 15 2019
looks like logmsgbot was happily chattering away in #wikimedia-overload because of some race condition (within IRC services?) reconnecting to freenode around Apr 13 17:06.
@Cmjohnson I'm on US East time and can handle the depool. Give me a ping when you're ready
Apr 12 2019
Apr 8 2019
So it sounds like the firewall work is done (thanks Arzhel!)