Thu, Jul 18
Wed, Jul 17
Tue, Jul 16
First used to validate the change made in https://gerrit.wikimedia.org/r/c/operations/puppet/+/522992 but useful generically.
Mon, Jul 15
The backlog in Kafka should clear in just a few more minutes. Closing this; separate issues to be opened later for follow-up work.
I've started an incident document at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190715-logstash and would appreciate more contributors.
I think we likely want to revisit this.
Sun, Jul 14
Fri, Jul 12
Thu, Jul 11
Just a note that it happened again today ;)
Wed, Jul 10
Wed, Jul 3
Wow, that was quick, thanks! Riccardo should have time to do the deploy while I'm on vacation 🙃
Tue, Jul 2
Mon, Jul 1
I think I just found another one that was missed: https://grafana.wikimedia.org/d/000000545/ganeti
Fri, Jun 28
NB that the default limit in Varnish actually allows for slightly fewer than 64 headers -- the first line of an HTTP response is parsed into several pseudo-headers internally, so AIUI there's space for ~59 headers.
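For anyone who wants to eyeball this from the client side, here's a rough sketch. The numbers are my assumptions (the default http_max_hdr=64, and ~5 slots consumed internally for the status-line pseudo-headers, leaving ~59), not gospel:

```
#!/usr/bin/env python3
"""Rough client-side check of how close a response gets to Varnish's
header budget. A sketch, not tooling: assumes http_max_hdr=64 (the
varnishd default) and ~5 internally consumed pseudo-header slots."""
import requests

VARNISH_HTTP_MAX_HDR = 64   # varnishd default for the http_max_hdr parameter
PSEUDO_HEADER_SLOTS = 5     # assumed internal overhead, leaving ~59 usable slots

def header_budget(url: str) -> None:
    resp = requests.get(url, timeout=10)
    # NB: requests folds duplicate headers (e.g. repeated Set-Cookie) into
    # one entry, so this can undercount what Varnish actually sees.
    count = len(resp.headers)
    usable = VARNISH_HTTP_MAX_HDR - PSEUDO_HEADER_SLOTS
    print(f"{url}: {count} response headers, ~{usable - count} below the ~{usable} limit")

if __name__ == "__main__":
    header_budget("https://en.wikipedia.org/wiki/Main_Page")
```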
Thu, Jun 27
I'm told the plan is to move these onto Ganeti in PoPs, so that seems just as good.
So, it looks like this 500 did in fact come from the application layer... but shouldn't we still be getting more response headers from the edge?
Wed, Jun 26
I am unable to create a Discourse account because of this loop.
+1 for the relative simplicity of Thanos (from both a design and a deployment perspective).
Tue, Jun 25
Mon, Jun 24
Am I alone in feeling like this probably deserves an incident report?
it's just one (not-often-used) link down, not a site down; UBN is unnecessary IMO
Telia reports a 'major outage' and is tracking the status of our circuit in case 00993514
Jun 20 2019
My guess is that the beginning of this problem correlates with the beginning of the fetch failures in the first graph panel here:
Jun 19 2019
@Reedy manually ran the global renames that were never queued properly.
Jun 5 2019
Some curious stuff in the monitoring data:
Jun 4 2019
Indeed, thanks @ema ! I talked a bit with @fgiunchedi about this earlier, and we tweaked the wording on the Logstash dashboard to remind users that "Varnish" ranking highly in the "top n Backends" panel is not necessarily reflective of a Varnish issue.
Jun 3 2019
There is a "Nagios Compatible" transport, but it is underdocumented and seems to also only write to a local filesystem path (which is presumed to be a Nagios external command FIFO).
Jun 1 2019
@Marostegui just found something we forgot: the use of Prometheus metrics in Grafana's variable definitions (e.g. by a label_values() query)
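For anyone unfamiliar: a label_values(metric, label) variable boils down to a series lookup against Prometheus, so these dashboards do depend on the metrics existing. A rough Python equivalent of what Grafana does (the Prometheus URL and metric/label names are illustrative):

```
#!/usr/bin/env python3
"""Roughly what label_values(metric, label) resolves to: a series lookup
against the Prometheus HTTP API, from which the label values are pulled."""
import requests

PROM_URL = "http://localhost:9090"  # assumed local Prometheus

def label_values(metric: str, label: str) -> list:
    # /api/v1/series returns the label sets of all series matching the selector
    resp = requests.get(f"{PROM_URL}/api/v1/series",
                        params={"match[]": metric}, timeout=10)
    resp.raise_for_status()
    series = resp.json()["data"]
    return sorted({s[label] for s in series if label in s})

if __name__ == "__main__":
    # e.g. which instances expose node_uname_info
    print(label_values("node_uname_info", "instance"))
```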
May 31 2019
SGTM @elukey, thanks!
May 30 2019
May 29 2019
Andrew, can you (or someone else) advise on rolling out this change for Analytics?
May 23 2019
May 21 2019
We saw one of these events at 14:48 today and pybal reported fetch failures for -- and wanted to depool -- basically the entire appserver fleet: https://phabricator.wikimedia.org/P8551
May 20 2019
+1. In general I think it would be a great idea to do a lot more with annotations than we presently do:
May 19 2019
Thanks! We now believe this is resolved.
May 14 2019
May 13 2019
Here's my tentative plan for moving forward with this, including a rollout procedure:
May 10 2019
May 8 2019
Some quick notes from today's meeting:
May 7 2019
Trying out a few things here:
May 6 2019
cc @mark who I know is about to start looking at hardware requests for the coming FY
We now have IRC alerting based on scraping each Prometheus for its process_start_time_seconds metric.
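Roughly, the check boils down to something like this sketch (the hostnames and the 15-minute window are illustrative, and the IRC side isn't shown):

```
#!/usr/bin/env python3
"""Sketch of the kind of check described above: scrape a Prometheus
server's own /metrics endpoint and flag a recent restart based on
process_start_time_seconds."""
import time
import requests

RESTART_WINDOW = 15 * 60  # flag processes started within the last 15 minutes

def recently_restarted(base_url: str) -> bool:
    text = requests.get(f"{base_url}/metrics", timeout=10).text
    for line in text.splitlines():
        # skips the '# HELP' / '# TYPE' comment lines automatically
        if line.startswith("process_start_time_seconds"):
            start = float(line.split()[-1])
            return (time.time() - start) < RESTART_WINDOW
    return False  # metric missing; a real check would alert on this too

if __name__ == "__main__":
    for host in ["http://prometheus1001:9090", "http://prometheus2001:9090"]:  # hypothetical
        if recently_restarted(host):
            print(f"ALERT: {host} restarted recently")
```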
My patches are also stuck in the queue, and I'm seeing teammates manually V+2 their Puppet changes.
May 3 2019
It does seem much faster now, thanks @elukey ! The impact of loading 30 days on Prometheus is also minimal now: modest CPU usage, and while there was some increase in RAM consumption over baseline while we were both playing with this, it's not concerning. Thank you :)
May 2 2019
Also sorry, I don't have a lot of time left over this week; can take a deeper look next week
I think you should just be able to remove the "custom all value" in the dashboard settings and have it work. In that case Grafana will create its own "all" value that is simply a regex OR'ing together all the known values, which it appears to compute based on the cluster=kafka_jumbo hidden variable.
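To illustrate what I mean, here's a toy reconstruction of that regex "all" value (the broker names are hypothetical):

```
#!/usr/bin/env python3
"""Illustration of the behavior described above: with no custom "all"
value, Grafana effectively substitutes a regex OR of every known option."""
import re

known_values = ["kafka-jumbo1001", "kafka-jumbo1002", "kafka-jumbo1003"]

# Grafana escapes each value and joins them with '|', roughly:
all_value = "(" + "|".join(re.escape(v) for v in known_values) + ")"
print(all_value)  # (kafka\-jumbo1001|kafka\-jumbo1002|kafka\-jumbo1003)
```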
I got tied up with goal work and incident response and have only had a little time to spend on this.
I think the 'real' thing we need to notify on here is when Swift decides it wants to stop using a disk (which it did here)
May 1 2019
Apr 30 2019
As documented in T222112#5147131, this didn't actually fix the dashboard at fault in this particular incident, but I've heard from another large-scale Prometheus user (and Prometheus dev) that they've had similar problems and recommend 10M as a value.
I'm pretty sure these panels are responsible for most of the Prometheus load.
They take much longer to load than the rest of the panels, and some of them errored out with the new settings.
I think https://grafana.wikimedia.org/d/000000607/cluster-overview might have been missed here? I see at least some old metrics being used there, e.g. node_memory_Cached in the "Memory per host" section.
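If it helps, here's a quick sketch for scanning a dashboard's JSON for pre-0.16 metric names (node_exporter 0.16 added unit suffixes, so node_memory_Cached became node_memory_Cached_bytes); the rename list beyond that one metric is illustrative:

```
#!/usr/bin/env python3
"""Scan a Grafana dashboard's JSON for stale (pre-0.16) node_exporter
metric names. The rename map entries beyond node_memory_Cached are
illustrative examples of the same unit-suffix change."""
import json
import re
import requests

GRAFANA = "https://grafana.wikimedia.org"
OLD_METRICS = ["node_memory_Cached", "node_memory_Buffers", "node_memory_MemFree"]

def find_stale_metrics(dashboard_uid: str) -> None:
    # /api/dashboards/uid/<uid> returns the full dashboard JSON;
    # may need an API token on instances without anonymous read access
    dash = requests.get(f"{GRAFANA}/api/dashboards/uid/{dashboard_uid}", timeout=10).json()
    blob = json.dumps(dash)
    for metric in OLD_METRICS:
        # \b fails before '_', so node_memory_Cached_bytes won't match
        hits = re.findall(rf"\b{metric}\b", blob)
        if hits:
            print(f"{metric}: {len(hits)} occurrence(s)")

if __name__ == "__main__":
    find_stale_metrics("000000607")  # the cluster-overview dashboard mentioned above
```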