Thank you for following up @Andrew. I'm wondering if we could hack something locally to unblock that specific bit and see what else needs fixing?
Wed, May 12
Tue, May 11
I can't find an option to instruct icinga to stop sending ACK notifications on a per-contact basis, unfortunately. Since the issue seems benign I'll resolve; feel free to reopen though!
Mon, May 10
RAID firmware upgraded and host rebooted 2x, we're back
Message at boot up
Looks like the host is busted, I'll try a reboot
Fri, May 7
Thu, May 6
Tue, May 4
For a bit of context: to keep a good "emulation" of production there needs to be a block device (LVM, a loop device, or similar) for puppet (and the scripts) to mkfs/mount/etc.
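If it helps to reproduce this locally, here's a minimal sketch (paths are made up; the losetup/mount steps puppet would exercise need root and are left as comments):

```shell
# Create a sparse 1 GiB image file to stand in for the block device
truncate -s 1G /tmp/fake-pv.img

# mkfs works directly on the image file, no root needed
# (guarded in case e2fsprogs isn't installed)
command -v mkfs.ext4 >/dev/null && mkfs.ext4 -Fq /tmp/fake-pv.img

# The root-only steps that puppet/the scripts would then drive:
#   losetup --find --show /tmp/fake-pv.img   # -> e.g. /dev/loop0
#   mount /dev/loop0 /mnt/fake-pv
```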
While we're on the topic (ah!) of apifeatureusage: with mediawiki logs on kafka we don't strictly need logstash anymore for the kafka -> cirrussearch ingestion, if the feature stays based on mw logs (as opposed to event platform).
Mon, May 3
I did some work on this last week; there are temporary patches on netmon1002 to get things going at least minimally and collect voltage/current/power/etc from the PDU's branches. I ran into trouble with conditional discovery and asked upstream about it: https://community.librenms.org/t/skipping-values-based-on-oids-in-another-table-with-yaml-discovery/15689
Fri, Apr 30
I had a brief look into this to check the logstash pipeline health. I can't find events in the dashboard for the last 90d, although from the sent payload I'm guessing the messages should end up in the (eqiad|codfw).mediawiki.client.error topics in the kafka "logging" cluster (?).
Thu, Apr 29
Wed, Apr 28
Thank you @Papaul, today I poked a little at librenms chatsworth support and it looks like the current support is not complete (certainly not as complete as sentry3/sentry4); we'd need to add support for inbound current and environmental monitors. I can dedicate some time this quarter to this. @wiki_willy, what's the timeline for the testing phase?
Tue, Apr 27
This is complete
Host is decommissioned
Thank you @Papaul, could you forward the attached mib? I'll take a look, though I think a call will be best
Mon, Apr 26
AFAICT all of these "proto incidents" are ACKs issued by icinga (not SOC ACKs) and as such don't page folks in SOC. I think the proper action here might be to instruct icinga to stop sending ACKs to SOC, or to leave things as-is since there weren't any mis-pages?
FWIW +1 on lowering debug level, AFAIK mwlog1001 is indeed quite close to being replaced by mwlog1002 in T224565: Migrate mwlog/udp2log servers to Buster
I agree, we should be restricting #page to alerts that page folks, not sure of an alternative tag though (or remove the tag altogether for now) cc @ayounsi
All thanos-fe hosts reimaged, resolving
From my tests the culprit seems to be the webproxy hosts closing the transfer after ~4MB; using urldownloader works as expected. Which proxy were you using for the tests @Urbanecm?
Fri, Apr 23
FWIW this is still happening (namely when GET'ing a query with an SSO session in need of a refresh, the thanos UI shows "Error executing query: OK"; fully refreshing the page works). The UI worked fine for me throughout a working day, but the next day it stopped working until I refreshed. What's the current lifetime of an SSO session before the background refresh kicks in?
As a data point, after the forcelogin change (thanks!) I haven't experienced faulty logins/redirects when moving from grafana.w.o to grafana-rw.w.o
Back to 90-ish percent max fs utilization
Host will be ready for decom next week and filesystems are mostly empty already, no need to replace disks. Leaving the task open until decom
Wed, Apr 21
SGTM. In practical terms the work involves adding the account to hieradata/common/profile/thanos/swift.yaml in puppet.git, and the private bits to the "public private" repo and the real private.git.
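Roughly, the puppet.git side might look like the sketch below. The hiera key and account names here are made up for illustration; the exact structure should follow what's already in swift.yaml, and the real key goes only in private.git:

```yaml
# hieradata/common/profile/thanos/swift.yaml (sketch; names are hypothetical)
profile::thanos::swift::accounts:
  new_project:                      # hypothetical account name
    access: 'AUTH_new_project'
    auth: 'plaintext'

# The matching secret lives only in the private repos, e.g.:
# profile::thanos::swift::accounts_keys:
#   new_project: 'dummy-key'        # real value only in private.git
```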
Bizarre, PCC is a NOOP indeed. The patch LGTM, but I see mailman3 hasn't logged anything to journald on lists1002 since this morning?
Thank you for the feedback! Replies below
Upgraded librenms today in T266987 and added alertmanager-codfw.w.o to the AM transports.
For the specific problem I think you could also use a case statement (preferably keyed on a hiera variable like Andrew suggested in the review, similar to is_critical). HTH!
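A sketch of the case-statement idea, with a made-up hiera flag (all names here are illustrative, not the real ones):

```puppet
# Hypothetical: drive behaviour from a hiera-controlled flag,
# similar in spirit to is_critical
$special_mode = lookup('profile::foo::special_mode', Boolean, 'first', false)

case $special_mode {
  true:    { $check_interval = 30 }   # tighter checks when the flag is on
  default: { $check_interval = 300 }
}
```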
For daemons logging to syslog/journald, the tl;dr to get the logs into logstash is to add the "program name" to modules/profile/files/rsyslog/lookup_table_output.json with value "kafka local" (or only "kafka" if you are not interested in local logs). For daemons logging to local files, the tl;dr is a similar setup plus the "input file" part of rsyslog (i.e. rsyslog::input::file). Hope that helps! Happy to review patches of course and/or provide more guidance
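For illustration, an entry in that lookup table might look roughly like this. The overall shape follows rsyslog's standard lookup-table JSON format, but the program names and the "nomatch" value here are made up; check the existing file for the real structure:

```json
{
  "version": 1,
  "nomatch": "none",
  "type": "string",
  "table": [
    { "index": "mydaemon",    "value": "kafka local" },
    { "index": "otherdaemon", "value": "kafka" }
  ]
}
```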
We have implemented paging for non-ops teams in VO/splunk oncall within icinga, and alertmanager has that capability as well. I'm boldly resolving the task, but feel free to reopen!