User Details
- User Since: May 30 2017, 5:25 PM (362 w, 3 d)
- Availability: Available
- IRC Nick: herron
- LDAP User: Herron
- MediaWiki User: Unknown
Yesterday
This started happening because I added a grouping workaround in the parent task: essentially the grouping is now done by puppet instead of by pyrra itself. As a result, pyrra generates more output config files, e.g. slo-$site.yaml instead of the previous single slo.yaml
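For context, each per-site output file ends up holding pyrra SLO definitions along these lines (a minimal sketch; the name, labels, and metric selectors are hypothetical):

```yaml
# Hypothetical slo-eqiad.yaml produced by the puppet-side grouping
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: requests-availability-eqiad
  labels:
    site: eqiad
spec:
  target: "99.9"
  window: 4w
  indicator:
    ratio:
      errors:
        metric: http_requests_total{site="eqiad", code=~"5.."}
      total:
        metric: http_requests_total{site="eqiad"}
```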
Fri, May 3
Reviewed filesystem utilization on the codfw/eqiad Prometheus hosts and grew their k8s and ops filesystems, targeting ~85% free space on each
Tue, Apr 30
Still seeing two spaces after the status, e.g. "FIRING: ", although I'm not seeing a clear cause for that
Mon, Apr 29
Overall this dashboard is meant to show graphite utilization for the whole installation, so I think the thing to do is add filters to drill down as needed.
Added two panels at the bottom of the dashboard to display count over time details using the time picker
Fri, Apr 26
FWIW we recently added disk capacity to these hosts, with about ~1T free in the VG. I've also made a note to discuss/plan with the team next week how best to allocate the additional space for the long term. In the meantime, should it fire again it is safe to grow the LV again, though hopefully it won't be necessary.
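For reference, growing the LV and its filesystem is a single command (a sketch only; the VG/LV names and size increment below are hypothetical and should be adjusted to the host):

```shell
# Check free extents in the VG first
sudo vgs

# Grow the LV by 200G and resize the filesystem in the same step (-r)
sudo lvextend -r -L +200G /dev/vg0/data
```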
Thu, Apr 25
FWIW this looks quite similar to the diff from the experimental patch where the global cert name was changed to a discovery.wmnet domain:
Tue, Apr 23
Thanks! Looks good!
Prometheus1005 is down and depooled, so any time works!
Fri, Apr 19
Reopening -- today we experienced a memory issue on prometheus1005 which presumably relates to this maintenance. Could we arrange to swap the faulty DIMM outlined in T362990? Thanks in advance!
Tue, Apr 16
FWIW I think the current alert text makes sense based on the premise that all alert recipients will/should know how the alerting system's internals are structured.
Apr 10 2024
While considering this I'd also like to propose moving the (alert name) to the end of the message at the same time. For example:
Apr 9 2024
Hey @VRiley-WMF, I'll help out with this one for the o11y side.
Apr 8 2024
With T352756, T359879 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1017873 in mind, I think it'd be worth spending some time here to work out a strategy for bringing backfilled metrics into production.
Apr 4 2024
I think we're in good shape here; please reopen if anything else is needed.
SSD and RAM upgrades have been installed, thanks @Jhancock.wm!
Apr 2 2024
CC from IRC chat -- We've tentatively scheduled this for this Weds afternoon (Eastern TZ, 4/3/2024)
Mar 27 2024
Ah, excellent! I thought we would have to order new. Yes, in that case let's go ahead with 32 GB DDR4-2666, please. Thank you!
Mar 22 2024
Scrapping this, as the RAM procurement request for titan has already been submitted.
Mar 21 2024
FWIW I just went through a similar triage and broker restart process in T358870
Mar 20 2024
IMO it'd be worth considering a split of the collector role into independent dashboard, logstash and potentially logs-api roles to provide additional isolation between services.
Mar 19 2024
The relocation/rebalance plan outlined in T326419#9639228 has finished running, and a re-run of topicmappr with fresh metrics now shows no proposed moves. Resolving!
Mar 18 2024
Using topicmappr I've generated the below plan to rebalance kafka-logging eqiad:
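The generated plan itself isn't reproduced here, but the shape of what a rebalance assignment looks like can be illustrated with a toy round-robin sketch (illustrative only; topicmappr additionally weighs per-broker storage and partition-size metrics, and the topic/broker names below are hypothetical):

```python
def toy_rebalance(partitions, brokers):
    """Spread partitions evenly across brokers, round-robin.

    A toy stand-in for a rebalance plan; the real tool (topicmappr)
    also factors in broker storage metrics when proposing moves.
    """
    plan = {}
    for i, partition in enumerate(sorted(partitions)):
        plan[partition] = brokers[i % len(brokers)]
    return plan

# Hypothetical partitions and broker IDs
plan = toy_rebalance(["logging-2", "logging-0", "logging-3", "logging-1"],
                     [1001, 1002])
```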
Mar 14 2024
Renewed certs have been deployed and alerts have cleared, and the auto renew period for Kafka certificates has been increased from 11 to 30 days before expiration.
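The renewal decision itself is simple; a sketch of the widened window (the function and dates are hypothetical, the real logic lives in the certificate tooling):

```python
from datetime import date, timedelta

RENEW_BEFORE = timedelta(days=30)  # previously 11 days

def should_renew(expiry: date, today: date) -> bool:
    """True once today falls inside the auto-renew window before expiry."""
    return today >= expiry - RENEW_BEFORE
```

With the old 11-day window a cert expiring April 1 would not renew until March 21; at 30 days it renews from March 2, leaving much more slack before expiry alerts fire.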
Mar 11 2024
Spent some time looking into this, and I'm also wondering why these queries seem to yield different results when run over more than ~15d
Mar 1 2024
A few observations after looking into this:
@fgiunchedi thanks for the task! sure, having a look into this
Feb 27 2024
Footer is now in place on (slo|pyrra).wikimedia.org
Feb 7 2024
Here's a capture of what was logged by apache in the minute leading up to titan1002 hanging: https://phabricator.wikimedia.org/P56453
Jan 18 2024
Jan 17 2024
A custom grafana graphite datasource exporter, and a grafana dashboard using these metrics to outline current graphite datasource utilization have been deployed.
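The exporter's output boils down to per-datasource gauges in Prometheus exposition format; a toy sketch (the metric and label names are hypothetical, and the deployed exporter gathers counts from the Grafana API rather than taking them as a dict):

```python
def render_metrics(dashboards_per_datasource):
    """Render per-datasource dashboard counts as a Prometheus gauge.

    Toy sketch of the exposition-format output; the real exporter
    collects these counts by walking the Grafana API.
    """
    lines = ["# TYPE grafana_datasource_dashboards gauge"]
    for ds, count in sorted(dashboards_per_datasource.items()):
        lines.append(
            f'grafana_datasource_dashboards{{datasource="{ds}"}} {count}')
    return "\n".join(lines) + "\n"

output = render_metrics({"graphite": 42, "prometheus": 310})
```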
Jan 12 2024
Thanks for the context, although I'm still wondering why benthos and logstash would share a consumer group. As I understand it, benthos would be ingesting these topics instead of logstash, as opposed to benthos and logstash consuming the topics together.
@colewhite could you expand on the rationale for benthos joining the logstash consumer groups as opposed to using their own 'benthos' consumer groups?
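The distinction in question can be sketched with a toy model of Kafka consumer-group semantics (hypothetical consumer and partition names):

```python
def toy_assign(partitions, groups):
    """Toy model of Kafka consumer-group semantics.

    Consumers sharing a group split the topic's partitions between
    them; distinct groups each independently receive every partition.
    """
    assignment = {}
    for group, consumers in groups.items():
        per_consumer = {c: [] for c in consumers}
        for i, partition in enumerate(partitions):
            per_consumer[consumers[i % len(consumers)]].append(partition)
        assignment[group] = per_consumer
    return assignment

# Shared group: the two consumers split the partitions, so each message
# is delivered to only one of them.
shared = toy_assign(["p0", "p1"], {"logstash": ["logstash-1", "benthos-1"]})

# Separate groups: each group independently receives the full topic.
separate = toy_assign(["p0", "p1"],
                      {"logstash": ["logstash-1"], "benthos": ["benthos-1"]})
```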
Dec 22 2023
The requested access has been granted and will be live within the next 30 minutes. Transitioning to resolved, but please don't hesitate to reopen if any followup is needed. Thanks!
Great, thanks! I've just uploaded the patch. The next step will be a few approvals:
Closing as invalid due to inactivity; please reopen with the requested updates when ready to proceed. Thanks!
Hi @AnnWF could you please confirm that the developer account name is correct? The account named wfan is not associated with the email address in the description. Thanks in advance!