User Details
- User Since
- Oct 3 2014, 8:06 AM (493 w, 3 d)
- Availability
- Available
- IRC Nick
- godog
- LDAP User
- Filippo Giunchedi
- MediaWiki User
- FGiunchedi (WMF)
Fri, Mar 8
Yeah, having some ballpark numbers would be a great help @cmooney. Unless we're talking hundreds of thousands more metrics than we have now, I think we're good to go; tens of thousands we can absorb without much effort/resources
Ah yes indeed, thank you @JMeybohm !
Indeed the WAL grew quite fast (faster than I expected, anyway) as the mw-on-k8s migration progressed (we're at ~50% now)
Wed, Mar 6
Calling this done, albeit with a hack
Logs from ircecho.service
Thank you @LSobanski ! Those are known, I've silenced the alerts for now, leaving the task open as a reminder
All good! Thank you @colewhite for the merge
Tue, Mar 5
Something else that didn't work well: the current version of ircecho doesn't seem to attempt reopening the files it is supposed to watch in /var/log/icinga. I have "fixed" this by creating said .log files and then restarting ircecho, which then properly opened/tailed the files
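For the record, the bandaid boils down to something like this (a sketch only; the filename list here is an assumption, the real one comes from ircecho's config):

```python
#!/usr/bin/env python3
"""Pre-create the log files ircecho expects, then restart it so it
re-opens them. Sketch only; adjust LOG_FILES to match ircecho's config."""
import pathlib
import subprocess

LOG_DIR = pathlib.Path("/var/log/icinga")
LOG_FILES = ["irc.log"]  # hypothetical name; ircecho's config has the real list

for name in LOG_FILES:
    # Create the file if it's missing; ircecho won't re-open a file that
    # didn't exist when it started tailing.
    (LOG_DIR / name).touch(exist_ok=True)

# Restart ircecho so it opens/tails the now-existing files.
subprocess.run(["systemctl", "restart", "ircecho.service"], check=True)
```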
I've bandaided the issue on alert2001; we'll need a proper fix:
Thank you for the detailed write up on this @Krinkle ! See below for my take:
Stalling until thanos-compact finishes its cycle and we can assess how much space is used, too
With the new 1.6TB disk in place we have ~2.2TB of raid0, which is great. This is fine for the short/medium term but not the long term, since it means thanos-compact is able to complete a cycle only on titan2001 right now. We'll get the other hosts in line in terms of space soon though (whether next FY or this FY is TBD)
Brilliant, thank you very much @Jhancock.wm !
Mon, Mar 4
Thank you @Jhancock.wm ! I'd like to go for the 1x 1.6TB SSD please to be added to the existing SSDs in titan2001
Optimistically resolving since we've moved to prometheus-based alerts for puppet failures, which do aggregate and should DTRT in this case too
I'm tentatively resolving this since I believe we didn't see new occurrences
Fri, Mar 1
bot is indeed working -- thanks again @brennen
Hey @Peter, I checked the apache logs on grafana1002 and couldn't find anything relevant from the 22nd. However, we (o11y) recommend turning off DatasourceError notifications for alerts; see the full rationale and instructions at https://wikitech.wikimedia.org/wiki/Grafana#DatasourceError_notification_spam (not sure if you had come across this already, though)
Thank you for the report! In general I agree we should be aggregating on the unit name itself, which would make the alert clearer; to achieve this we can change the grouping logic when routing alerts. I'll take a stab at it next week
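Roughly what I have in mind for the routing change, sketched as Alertmanager config rendered from Python (the `name` label carrying the unit is an assumption; use whatever label our alerts actually set):

```python
import yaml  # pyyaml

# Group SystemdUnitFailed notifications by the failing unit itself (plus
# site), rather than only by instance, so the notification names the unit.
route = {
    "match": {"alertname": "SystemdUnitFailed"},
    "group_by": ["name", "site"],  # "name" = systemd unit label, assumed
}
print(yaml.safe_dump({"route": route}, sort_keys=False))
```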
The issue rang a bell, and indeed we fixed it in https://gerrit.wikimedia.org/r/c/operations/puppet/+/981407, although on the standby host the override file with the fix is never deployed because icinga-am is set to not run (and rightfully so).
Thu, Feb 29
For reference, the full list of search-related graphite alerts:
Wed, Feb 28
Thank you @andrea.denisse for filing the task! I'm thinking of reverting the thanos debug logging from T356788: thanos-query probedown due to OOM of both eqiad titan frontends, since we now have a better idea of problematic queries. The blackbox-exporter logs, though, will need to stay at debug level since they are used for debugging alerts themselves (e.g. ProbeDown links to the blackbox-exporter logs in logstash)
Fri, Feb 23
This happened again today. Recovery was better in the sense that the titan hosts themselves remained available: the OOM killer kicked in and things recovered without intervention. The page still went out as probes failed, though.
And we're done
Fantastic, thank you @brennen ! I'll take the task and resolve once the bot is confirmed working
I'm more confident in resolving the task now, since WidespreadPuppetFailure has been fixed and will alert on a >= 3% failure rate in a given site (either the agent failed, or no resources)
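For the curious, the shape of the check is roughly the following (illustrative only: the metric/label names and the Prometheus endpoint are assumptions; the deployed expression lives in the alerts repo):

```python
import requests

# ">= 3% of hosts failing puppet in a given site", as a per-site fraction.
QUERY = (
    "(count by (site) (puppet_agent_failed == 1)"
    " / count by (site) (puppet_agent_failed)) >= 0.03"
)

resp = requests.get(
    "http://prometheus.example.org/api/v1/query",  # hypothetical endpoint
    params={"query": QUERY},
    timeout=10,
)
resp.raise_for_status()
for r in resp.json()["data"]["result"]:
    print(r["metric"].get("site"), r["value"][1])
```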
Thu, Feb 22
Turns out I was too hasty here: WidespreadPuppetFailure should have fired and it didn't. Reopening this and I'll investigate.
I'm optimistically calling this resolved as there won't be critical notification spam going forward
Wed, Feb 21
Calling this done since https://trace.wikimedia.org now is a thing, thank you all involved @akosiaris @CDanis @taavi !
cc @MatthewVernon and SRE-swift-storage for your input re: capacity planning and hardware needs for thanos-be, let me know what you think!
Tue, Feb 20
cheers @thcipriani ! @brennen could you help us with this request? thank you!
Mon, Feb 19
Feb 16 2024
I've been digging into the ingressgateway logs and found the following upon issuing the curl above:
Feb 15 2024
Update: I've been poking at ingress/istio after the change above without any luck; the current symptom looks like a timeout:
Now Thanos services run in their own slice, which should help with enforcing resource limits.
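A quick way to verify a unit landed in the slice and that resource accounting is on (a sketch; the unit name is assumed):

```python
import subprocess

# Show which slice the unit runs in, plus its memory accounting/limits.
out = subprocess.run(
    ["systemctl", "show", "thanos-query.service",
     "--property=Slice", "--property=MemoryAccounting",
     "--property=MemoryMax"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)
```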
Feb 14 2024
After a little thought I think at the very least we should do the following:
Happened again, albeit on titan1001 only, with query-frontend and store both consuming CPU and memory and the host becoming unresponsive
Feb 13 2024
Thank you for reaching out; I generally agree with the rationale, and I'm ok to try a larger repeat_interval for SystemdUnitFailed. I'll send a patch to implement that for any SystemdUnitFailed alert regardless of team, though we can tune it as needed.
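The gist of the patch, sketched below; the actual interval we land on may differ (2d here is a placeholder):

```python
import yaml

# A dedicated route for SystemdUnitFailed with a longer repeat_interval,
# so still-firing failures re-notify less often than the global default
# (commonly 4h in Alertmanager).
route = {
    "match": {"alertname": "SystemdUnitFailed"},
    "repeat_interval": "2d",  # placeholder value, to be tuned per feedback
}
print(yaml.safe_dump(route, sort_keys=False))
```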
We've implemented this in other tasks
@Peachey88 thank you for your help on this; however, please don't retitle @phaultfinder tasks, as the title is used as a search key and retitling causes a new task to be created (T357400)
Feb 12 2024
Feb 9 2024
Since we're back to Icinga semantics in terms of waiting before alerting, I'm resolving the task!
The patch above does essentially that, i.e. it matches SystemdUnitFailed semantics to what we were expecting from Icinga (a 3-minute leeway)
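In rule terms it boils down to the `for` clause; the expression below is illustrative, not the deployed rule:

```python
import yaml

# Give the alert a ~3 minute leeway before firing, matching the Icinga-era
# behaviour of waiting through a few failed checks.
rule = {
    "alert": "SystemdUnitFailed",
    "expr": 'node_systemd_unit_state{state="failed"} == 1',  # illustrative
    "for": "3m",
}
print(yaml.safe_dump(rule, sort_keys=False))
```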
Resolving this since we have multi-team alerts: both host-based ownership and a pattern for even finer-grained ownership, e.g. for systemd units living on the same hosts but owned by different teams
I'm boldly resolving this: the progress indicator is the "TODO" pivot table at https://docs.google.com/spreadsheets/d/19nxCXldb804TJCXGy4Z2BHG_1wRksRnKcPC6sXfjQuM/edit#gid=701141702, which lists all the Icinga checks we have yet to migrate
I've silenced the alert related to this for 60d
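For reference, the silence boils down to roughly this amtool call (the matcher and comment are hypothetical):

```python
import subprocess

# Add a 60-day silence for the alert tied to this task.
subprocess.run(
    ["amtool", "silence", "add",
     "alertname=SomeAlert",  # hypothetical matcher
     "--duration=60d",
     "--comment=task still open, silencing in the meantime"],
    check=True,
)
```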
Yes, this is possible, albeit a bit clunky at the moment. The way we do it for e.g. dcops tasks is to group the alerts on instance and then change the title for the webhook (url-encoded) to create tasks with the instance in the title. This also means you'll be getting a ticket per host; let me know if you'd like assistance with this
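A sketch of the title trick; the webhook path and parameter name are assumptions for illustration:

```python
from urllib.parse import quote

instance = "alert2001:9100"  # comes from the alert's grouping labels
title = f"Alert firing on {instance}"
# Bake the per-host title, url-encoded, into the task-creation webhook URL.
url = f"https://phabricator.example.org/webhook/create-task?title={quote(title)}"
print(url)
```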
Thank you for taking a look! I believe this was caused by loki starting on grafana2001 before the sync and writing its own WAL; all good
Feb 8 2024
This is completed!
Current avenues I'm exploring:
- Tighten the memory limits: thanos-query memory utilization jumps up very fast, and I suspect that in certain cases there isn't enough memory left for the host to remain usable. That is of course a worse scenario than thanos-query being restarted, and it takes longer to recover from (a sketch of a possible drop-in is after this list)
- Add debug logging to thanos-query as @CDanis pointed out
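Sketch of the first avenue as a systemd drop-in; the figures are placeholders, not recommendations:

```python
import pathlib
import subprocess

# Cap thanos-query's memory so the OOM hits the service, not the host.
dropin = pathlib.Path("/etc/systemd/system/thanos-query.service.d")
dropin.mkdir(parents=True, exist_ok=True)
(dropin / "memory.conf").write_text(
    "[Service]\n"
    "MemoryHigh=20G\n"  # soft cap: reclaim/throttle before the hard cap
    "MemoryMax=24G\n"   # hard cap: the unit is OOM-killed past this
)
subprocess.run(["systemctl", "daemon-reload"], check=True)
subprocess.run(["systemctl", "restart", "thanos-query.service"], check=True)
```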
Feb 7 2024
Considering the points above (crashloop detection being a new feature, buster being on the way out) I'm declining the task, though feel free to reopen as you see fit!