User Details
- User Since: Oct 3 2014, 8:06 AM (499 w, 2 d)
- Availability: Available
- IRC Nick: godog
- LDAP User: Filippo Giunchedi
- MediaWiki User: FGiunchedi (WMF)
Wed, Apr 24
The bandaid is in place (restart rsyslog.service every 4 hours; the 4 is a magic number and can be tweaked). Let's see how we go with this in place while a better solution is found.
This is done! We're running jaeger collector and query 1.56.
Thu, Apr 18
Tue, Apr 16
This is done. I've gone for a simpler approach for now: just delete old jaeger indices via curator. If the need arises we can do other tweaks such as replica / shard count adjustments and so on (rough sketch of the retention idea below).
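For reference, a minimal Python sketch of the same retention idea; the real setup uses curator, and the endpoint, index naming pattern and 7-day cutoff below are assumptions, not the actual values:

```python
from datetime import datetime, timedelta, timezone

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical opensearch/elasticsearch endpoint
cutoff = datetime.now(timezone.utc) - timedelta(days=7)  # assumed retention

# Assumes date-suffixed index names, e.g. jaeger-span-2024-04-16
for row in es.cat.indices(index="jaeger-*", h="index", format="json"):
    name = row["index"]
    try:
        year, month, day = map(int, name.rsplit("-", 3)[-3:])
        created = datetime(year, month, day, tzinfo=timezone.utc)
    except (TypeError, ValueError):
        continue  # skip indices without a parseable date suffix
    if created < cutoff:
        es.indices.delete(index=name)
```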
My take on having the alert group (i.e. the alert name) at the beginning is the following (see the sketch after this list):
- alerts.w.o is "keyed" to alert groups, not individual alerts
- the optional count of firing alerts refers to the alert group as a whole; e.g. I find it confusing to have an alert count next to an individual alert (FIRING [4x] <summary> (<alert group>))
- the alert group name already gives a broad indication of what's wrong
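Purely as an illustration of the layout argument (not the actual notification template; the alert name and summary are made up):

```python
def notification_line(status: str, group: str, count: int, summary: str) -> str:
    """Group-first layout: the [Nx] count describes the whole alert group,
    not the individual alert whose summary is shown."""
    counter = f" [{count}x]" if count > 1 else ""
    return f"{status}{counter} {group}: {summary}"

# e.g. "FIRING [4x] SystemdUnitFailed: a systemd unit has failed on host X"
print(notification_line("FIRING", "SystemdUnitFailed", 4, "a systemd unit has failed on host X"))
```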
Mon, Apr 15
I'm optimistically resolving this since logstash.w.o (nowadays opensearch dashboards) is working as expected
I've fixed the issue with the following on centrallog hosts:
I'm optimistically resolving this since we no longer use mod auth ldap
Fri, Apr 12
I've chatted with @JMeybohm about this, and since for example a seemingly-related fix (https://github.com/rsyslog/rsyslog/pull/5012/commits/e8ac82e09f930bf99421cc323c24a9dbf215f9da) is present in the Debian testing repos (plus potentially other fixes), I've backported rsyslog to bullseye as 8.2404.0-1~bpo11+1. It is actually a straight backport (no changes needed), although there's a flaky imfile test which I haven't verified is equally flaky on sid.
I'm +1 on moving resolved / firing to the beginning and seeing what the feedback is.
Thu, Apr 11
Following up from a chat yesterday:
Wed, Apr 10
Tue, Apr 9
Indeed, I think grafana_labs.certs.yaml as a whole can be ditched
Mon, Apr 8
The configuration hasn't changed, though we did upgrade to Bookworm and with that came a new version of Alertmanager, so it might be a regression.
Yes we'll be trimming the retention more this week @MatthewVernon
Fri, Apr 5
Thank you @Jhancock.wm @herron !
Thu, Apr 4
Wed, Apr 3
Moving off Q4 board since we have hw in capex spreadsheet and it'll be coming next FY
Resolving this since capacity is under control now and we have more coming next FY as per T357747: Capacity planning/estimation for Thanos
I'm moving this off this Q board since I believe it'll happen further down the line
I haven't observed OOMs related to the WAL when doing a Prometheus rolling restart today, so I'm optimistically resolving the task; to be reopened if things change.
This is done, available space for Prometheus data has been expanded
Done by @Fabfur, thank you!
Indeed, I don't see the alert at https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DAlertLintProblem
Tue, Apr 2
Thank you folks for the quick action on this! Appreciate it
Mar 29 2024
This is done, we're using . as separator
This is done, we have more space for prometheus
I pushed forward with this to be in a stable/known state ASAP, i.e. alert1001 and alert2001 are both on puppet 7 now and catalogs compile successfully
Set batphone for today until Monday COB
Mar 28 2024
I've been working on debugging this too, here's my understanding (rough sketch of the flow after this list):
- naggen2 is used to generate icinga configuration for nagios_host and nagios_service exported resources; it runs as a generator on the puppet master/server
- naggen2 reads /etc/puppet/puppetdb.conf to discover the puppetdb url
- on puppetmaster naggen2 works because the puppetdb url points at port 8443, where ssl cert validation is optional
- this is not the case on puppetserver, thus naggen2 can't query puppetdb
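A minimal sketch of that discovery/query flow in Python (hypothetical, not naggen2's actual code; the config path, query shape and function names are assumptions):

```python
import configparser
import json

import requests

def puppetdb_base_url(conf_path="/etc/puppet/puppetdb.conf"):
    # puppetdb.conf is ini-style; the [main] section lists server_urls,
    # which is where the puppetdb endpoint gets discovered.
    config = configparser.ConfigParser()
    config.read(conf_path)
    return config["main"]["server_urls"].split(",")[0].strip()

def exported_nagios_resources(resource_type="Nagios_service"):
    # Query PuppetDB's v4 resources endpoint for exported resources of the
    # given type (Nagios_host / Nagios_service), to be turned into icinga config.
    query = ["and", ["=", "type", resource_type], ["=", "exported", True]]
    # Note: no client certificate is presented here; that works when the
    # configured URL points at the port-8443 endpoint (client cert validation
    # optional, the puppetmaster case) but not against the puppetserver setup,
    # hence the breakage described above.
    resp = requests.get(
        f"{puppetdb_base_url()}/pdb/query/v4/resources",
        params={"query": json.dumps(query)},
    )
    resp.raise_for_status()
    return resp.json()
```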
Mar 27 2024
Also cc @VRiley-WMF, could you help with this? Thank you!
Mar 26 2024
This is fixed, I've undone my symlink bandaid. I've also reported the issue at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1067768
Mar 25 2024
Good news and bad news, in the sense that I can't reproduce the OOM in prometheus k8s in codfw. I suspect my fix at https://gerrit.wikimedia.org/r/1013515 to fetch fewer Envoy metrics significantly reduced load, so replaying the WAL doesn't pose a memory problem anymore.
Thank you @andrea.denisse for the suggestions! Let's indeed discuss further what the best options are going forward.
Promising results: samples/s in eqiad went from ~200k/s to ~110k/s after the change (and slightly increasing).
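For context, the samples/s figure can be read off Prometheus itself; a minimal sketch, with the URL as a placeholder:

```python
import requests

PROM_URL = "http://localhost:9090"  # placeholder Prometheus instance

def samples_per_second():
    # Per-second rate of samples appended to the TSDB head over the last 5m.
    query = "rate(prometheus_tsdb_head_samples_appended_total[5m])"
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

print(f"{samples_per_second():,.0f} samples/s")
```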
@Jclark-ctr it looks like one of the new SSDs from {T359452} isn't happy. I've located the drive so it should be blinking; could we replace it ASAP? Please ping me on IRC when you can, thank you!
Mar 22 2024
Thank you for the heads up; for context I'm working on T352640: Fix Pontoon to bootstrap from Bookworm and Puppetserver which will enable us to rebuild the whole o11y stack with Bullseye/Bookworm VMs
Mar 21 2024
Thank you @andrea.denisse for taking a look!
Mar 20 2024
This is done, thank you @Papaul
Mar 19 2024
@Jclark-ctr @VRiley-WMF please ping me on irc when you get on site tomorrow and we can coordinate, I'll be around, thank you!
Thank you @Jhancock.wm! How's tomorrow at 16:00 UTC for you? We'll be doing both hosts one at a time, and just to confirm: the drives are hot swap, right?
Good point re: statsd_exporter_events_conflict_total. Looking at a mw-on-k8s world, I think linking the statsd-exporter lifecycle to mw seems the easiest, which also raises the question: maybe it happens already during mw deployments as pods are cycled?
For awareness, see also https://phabricator.wikimedia.org/T359178#9640223 re: statsv in the context of varnishkafka deprecation/removal.
Mar 8 2024
Yeah, having some ballpark numbers will be a great help @cmooney. Unless we're talking hundreds of thousands more metrics than we have now I think we're good to go; tens of thousands we can absorb without much effort/resources.
Ah yes indeed, thank you @JMeybohm !
Indeed the WAL grew quite fast (faster than I expected anyways) as the mw-on-k8s migration progressed (we're at ~50% now)
Mar 6 2024
Calling this done, albeit with a hack.
Logs from ircecho.service
Thank you @LSobanski ! Those are known, I've silenced the alerts for now, leaving the task open as a reminder