Mon, May 9
This idea arose in an IRC conversation while looking into stale Icinga alerts on the unhandled dashboard; @Dzahn please add/adjust/edit anything I missed!
'chunking_advertise_hosts =' (disabling chunking) has been applied to both MXes, and we have not seen this error recur since that change was made.
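For reference, a minimal sketch of what that setting looks like in the Exim main configuration (exact file location varies by deployment):

```
# Disable advertising the ESMTP CHUNKING extension (and thus BDAT).
# The default is '*' (advertise to all hosts); an empty host list
# advertises it to no one.
chunking_advertise_hosts =
```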
Hi @bcampbell, while SRE is investigating could ITS please open a case with the google postmasters about this issue as well?
Looking at the count of log lines matching "BDAT command used when CHUNKING not advertised" on mx1001, this appears to have begun on the 5th, which was also the date that exim was restarted after the host was rebooted for a kernel update.
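The per-day counting itself is trivial; a minimal self-contained sketch of the idea (the sample lines below are fabricated stand-ins for illustration, not real exim log entries):

```python
# Count log lines containing the rejection message, grouped by date.
from collections import Counter

NEEDLE = "BDAT command used when CHUNKING not advertised"

sample_lines = [  # stand-in entries for illustration only
    "2022-05-05 10:00:01 SMTP protocol error: " + NEEDLE,
    "2022-05-05 10:05:12 Completed",
    "2022-05-06 09:12:00 SMTP protocol error: " + NEEDLE,
]

# First whitespace-separated field is the date in these sample lines.
per_day = Counter(line.split()[0] for line in sample_lines if NEEDLE in line)
print(dict(per_day))  # → {'2022-05-05': 1, '2022-05-06': 1}
```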
Tue, May 3
Seeing a significant drop in CONNECT (blue) since https://gerrit.wikimedia.org/r/776878 was applied, looking better!
Mon, Apr 25
A round of kafka-logging rolling reboots was completed today using sre.kafka.reboot-workers. Resolving!
Apr 14 2022
JFTR this was discussed at the last o11y meeting and sounds good. I went ahead and made a copy of the status-page dashboard to incorporate the current home dashboard "welcome to grafana" bits with the status page panels and arranged things to try and fit as much as possible on screen at once. Edits welcome, but if that looks good let's go ahead and set it as the home.
Apr 7 2022
Apr 6 2022
Apr 5 2022
If I'm understanding correctly the idea is to have a set of generic curator rules that would automatically set retention based on patterns like "2weeks" or "2days" in the index name?
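If so, a hedged sketch of what one such generic rule could look like as a Curator action file (the index pattern, the "2days" naming convention, and the delete action are assumptions for illustration):

```yaml
actions:
  1:
    action: delete_indices
    description: Delete indices whose name carries a "2days" retention hint
    options:
      ignore_empty_list: true
    filters:
      - filtertype: pattern
        kind: regex
        value: '.*-2days-.*'
      - filtertype: age
        source: creation_date
        direction: older
        unit: days
        unit_count: 2
```

One such action per retention keyword ("2days", "2weeks", ...) would keep the rules generic while the index name drives the retention.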
Apr 4 2022
Hi @tchin, the requested access has now been provisioned and will be fully deployed within 30 minutes (as puppet runs complete across the fleet)
Apr 1 2022
Sounds like that'd work, although I wonder if there are any alternatives that may be more obvious at a glance? Could we get away with doing something like including the unit as a keyword before the stamp? e.g.
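To make that concrete, a purely hypothetical naming scheme along those lines (index names invented for illustration):

```
logstash-2weeks-2022.04.01   # retention unit keyword before the date stamp
logstash-2days-2022.04.01
```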
Mar 31 2022
Hey @Legoktm, thanks for the report. Yes, it looks like these were indeed set to inactive; they've been enabled and should be working now.
Looking more closely I see all bullseye hosts have the unit enabled, while all buster hosts do not.
Looks like ipmiseld isn't enabled on a sampling of these hosts; letting Puppet ensure the service is enabled and running seems like a good next step.
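A minimal Puppet sketch of that next step (where the resource lives and any surrounding class are left out here):

```puppet
# Ensure ipmiseld is started now and enabled at boot.
service { 'ipmiseld':
  ensure => running,
  enable => true,
}
```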
Mar 30 2022
Thanks for the report @kostajh yes this has been addressed and an acknowledgement has been added here as well https://www.wikimediastatus.net/incidents/ft72m2rcs8tg
Mar 28 2022
Removing from the sre access request queue while the details of the request are being clarified. Please re-tag when ready for implementation and/or assistance is needed from sre clinic duty.
Hello, I'll close this as invalid for now since the task will need to specify what access/group is being requested, and an approving party, in order to move forward.
Resolving as the near-term access requested in the description has been provisioned, please reopen if any follow up is needed. Thanks!
Mar 25 2022
Mar 22 2022
Mar 21 2022
Mar 17 2022
Disks have been added and the volume group on the host has been grown. Thanks @Jclark-ctr!
Mar 16 2022
Mar 15 2022
The reason for checking these via the proxy is that the Prometheus hosts can't reach all of the watchrat-checked URLs directly, and it's simpler to have one blackbox exporter configuration that uses a proxy and works for all cases than to split the config between proxied and non-proxied URLs. Here's the current config: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/prometheus/templates/blackbox_exporter/common.yml.erb$25-34
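For illustration, the relevant shape of a blackbox exporter module that probes through a proxy is roughly this (the module name and proxy URL are placeholders; see the linked template for the real values):

```yaml
modules:
  http_via_proxy:        # hypothetical module name
    prober: http
    timeout: 10s
    http:
      # Route every probe through one proxy, so the prometheus hosts
      # don't need direct reachability to each checked URL.
      proxy_url: 'http://webproxy.example:8080'
```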
Mar 14 2022
Mar 11 2022
Thanks for the task @BTullis!
Mar 4 2022
Hey @Jclark-ctr, could we schedule an installation window for next week?
Mar 3 2022
Neat, thanks @RhinosF1 that should be good enough to get started with this
To take a step back, the varnish SLO dashboard linked in the description didn't actually originate from a template. Presumably it's a manual fork of the original etcd SLO example dashboard that has since been adjusted by hand.
Mar 2 2022
Added +20g to /dev/mapper/centrallog1001--vg-data
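For the record, growing an LV and its filesystem is typically done along these lines (illustrative command sketch; an ext4 filesystem on the volume is an assumption):

```
# Grow the logical volume by 20 GiB, then grow the filesystem to match.
lvextend -L +20g /dev/mapper/centrallog1001--vg-data
resize2fs /dev/mapper/centrallog1001--vg-data
```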
Centrallog1001 is above the Icinga threshold today; I'll see what I can prune while we wait on the long-term solution via T301926.
00-partial_logs was a directory used during the centrallog host switchover, I've cleaned that up (removed it) just now and will keep an eye on the next run.
Mar 1 2022
Had a shower of IRC alerts today after deploying the freeipmi-ipmiseld package. Not a critical situation, but it overwhelmed the operations channel with noise and caused the bot to be kicked and ircecho to be temporarily disabled (to avoid a recovery shower).
Feb 28 2022
Something like https://github.com/pyrra-dev/pyrra seems worth exploring for this and possibly more
SGTM. IIRC Java 8 was in use by kafka-logging, which is no longer colocated on the logstash hosts.
To complicate matters, rsyslog also appears to throw errors when a module is loaded but not actively used, e.g.:
Feb 25 2022
Hey @bking! Just created your account, you should have received an email from the system to confirm.
Feb 24 2022
Scorecard has been filled in based on the info in the incident report
Feb 22 2022
Feb 17 2022
Looks much better now, resolving!
+1 for reverse proxying the prometheus web interface behind SSO, that seems straightforward to me and could be useful in other cases as well
Feb 16 2022
Thanks for looking at this! From what I can tell centrallog1001 has 2x 1TB disks installed, but you are seeing 8x in hardware?