Wed, Jan 19
This is done, graphite is back in eqiad
Tue, Jan 18
This is done, cergen 0.2.6 includes this feature
Most of the entries in service::catalog now have a probes section; these are the failures as of today:
This is complete, I believe; we're backing up the daily hierarchy now
Mon, Jan 17
I suspect this has to do with the Grafana 8 test instance at grafana2001 evaluating alerts and, for some reason (to be investigated), reaching different conclusions about the alert's state (cf. T282863: Upgrade Grafana to 8.x)
Another data point: today while investigating T298945 I ran into this on grafana2001's logs:
Wed, Jan 5
Tue, Jan 4
I took a quick look and the timeout seems related to the number of metrics matching MediaWiki.resourceloader_build.*.sample_rate (about 5k). For example, narrowing to MediaWiki.resourceloader_build.wikibase*.sample_rate lets me load 2 of 30 days of data. The change might be due to (either or both):
Thank you @Papaul
Dec 17 2021
I've band-aided the immediate issue; leaving the task open since we haven't addressed the high volume of logs
And ran into this again :( Quite annoying for preformatted sections in VE; I need to switch back to source editing
With the last set of patches we're able to add probes for the majority of internal/discovery services, including use cases like sending JSON strings and checking responses with regular expressions.
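For reference, a sketch of what such a probe entry can look like; aside from the probes key itself, the field names here are illustrative and may not match the actual service::catalog schema:

```
probes:
  - type: http
    path: /api/healthz              # URL path to probe (hypothetical field)
    post_json: '{"query": "ping"}'  # send a JSON body via POST (hypothetical field)
    must_contain_regexp: 'pong'     # assert on the response body (hypothetical field)
```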
Dec 16 2021
yes +1 to spread around rows as much as we can
Dec 15 2021
Blackholing emails is Good Enough™ for now
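For the record, the classic way to blackhole mail for a local recipient, assuming a Postfix/sendmail-style /etc/aliases setup (the alias name is made up):

```
# Route the alias to /dev/null and rebuild the alias database.
# "dbmail" is an illustrative alias name.
echo 'dbmail: /dev/null' >> /etc/aliases
newaliases
```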
Reverting to 5.10.0-9 has brought back stability; resolving. We still have T297433 to update the firmware, which will happen when dcops can get to it
I have implemented part of this work for service::catalog network probes; specifically I needed to export a per-service state field. It's even easier if the "calculations" of metrics can happen locally on the Prometheus hosts, see also https://gerrit.wikimedia.org/r/c/operations/puppet/+/747140 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/747139
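To illustrate what "locally on the Prometheus hosts" can mean in practice, a minimal recording-rule sketch: probe_success is the usual blackbox exporter metric, while the service label and rule name are assumptions for the example:

```
groups:
  - name: service_catalog_probes
    rules:
      # Precompute a per-service 5m availability ratio from probe results,
      # so dashboards/alerts don't have to repeat the aggregation.
      - record: service:probe_success:avg5m
        expr: avg by (service) (avg_over_time(probe_success[5m]))
```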
I believe this is now (partially?) done, and service-runner supports Prometheus natively these days. What do you think, @akosiaris?
This is possible now with Prometheus and Alertmanager, cc T294564
I just got my session expired on alerts.w.o and the application correctly detected the situation and reloaded; resolving
Dec 13 2021
I'm tentatively resolving the task since all short-term mitigations are completed; feel free to reopen if something is amiss
Dec 10 2021
@Manuel @Lydia_Pintscher going forward I suggest also investing resources in switching to Prometheus as the supported metrics system. Graphite is deprecated and in "life support" mode while all producers (essentially MediaWiki and related) are being ported over, thanks!
Thank you for the summary @gmodena ! Some replies inline
Yes, once you have logs in Elasticsearch you can turn search queries into Prometheus metrics; from there you get dashboards and alerts too (either based on Grafana, or as Prometheus alerting rules in operations/alerts.git). HTH!
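To illustrate, a minimal alerting rule of the kind that could live in operations/alerts.git; log_errors_total is a hypothetical counter standing in for whatever the logs-to-metrics pipeline ends up exporting:

```
groups:
  - name: logs_example
    rules:
      - alert: HighErrorLogRate
        # Fire when the (hypothetical) error-log counter grows faster
        # than 10 events/s for 10 minutes straight.
        expr: sum(rate(log_errors_total[5m])) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Error log rate has been above 10/s for 10 minutes
```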
Dec 9 2021
I've rolled back graphite2003 to 5.10.0-9-amd64; next steps, as per the IRC convo, are to wait and confirm graphite2003's stability, and to consider upgrading the firmware on graphite1004 since we might want that anyway
I looked at the stack trace and to me it looks like either a kernel bug (we've never run graphite with 5.10.0-8-amd64, as per the Thanos metrics link) or faulty hardware (the SSDs are kinda old, but I believe we'd be seeing different failures from at least one of the drives)
The temporary netconsole client on graphite1004 paid off; see https://phabricator.wikimedia.org/P18076 for logs from the host (journalctl -u netconsole on centrallog1001).
Thank you folks for taking care of this!
Dec 8 2021
For the record, for testing purposes I've manually enabled netconsole on graphite1004 and pointed it to centrallog1001. Once the patch series above is merged, the same config will be in puppet too
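For reference, "manually enabled" boils down to loading the netconsole module with source/target parameters; the addresses and interface below are placeholders, not the real hosts':

```
# netconsole=<src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
# All addresses below are placeholders.
modprobe netconsole netconsole=6665@10.0.0.1/eth0,6666@10.0.0.2/00:11:22:33:44:55
```

On the receiving side any UDP listener will do; here it's wrapped in a systemd unit, hence the journalctl -u netconsole above.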
Just a note that, given the recent Grafana 8 vulnerability, we should make sure to upgrade to the latest 8.x version
Thanks @gmodena for the summary! Do you have a list of the metrics and labels you pushed in your local environment? It'll help with reviewing the names/practices/etc.
Thank you folks for investigating this! I am taking a look too and so far have failed to find anything of note
Dec 7 2021
I chatted with @MoritzMuehlenhoff re: the rollback; apt won't let you remove the running kernel, though there's a way to ask GRUB to reboot into another menu entry (the second entry of the first submenu in this case). Therefore the procedure can look like this:
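A sketch of that procedure, assuming the stock Debian GRUB menu layout and GRUB_DEFAULT=saved (which grub-reboot needs); the kernel package name to purge is illustrative:

```
# One-shot boot into the second entry of the first submenu
# ("Advanced options" in a stock Debian GRUB menu):
grub-reboot '1>1'
reboot
# Once the host is back up on the older kernel, remove the newer
# one (package name illustrative):
apt purge linux-image-5.10.0-10-amd64
```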
Dec 6 2021
This is back just now FWIW
Dec 3 2021
Very strange indeed; I just sent a test email and another invitation from VO to firstname.lastname@example.org. I've also sent you an email from an external address; let's see which ones make it through
Dec 2 2021
Hello Jesse, thanks for reaching out. You have the VO invitation in your inbox now; please see also https://wikitech.wikimedia.org/wiki/VictorOps for general documentation and next steps. I'll resolve the task, though please feel free to reopen if something is amiss!
Dec 1 2021
Nov 29 2021
Went another route, namely setting proper permissions on the mount directory
Nov 26 2021
Nov 25 2021
A different but related solution by @Kormat would be to disable/blackhole emails altogether