User Details
- User Since: May 30 2017, 5:25 PM
- Availability: Available
- IRC Nick: herron
- LDAP User: Herron
- MediaWiki User: Unknown
Yesterday
Using topicmappr, I've generated the below plan to rebalance kafka-logging eqiad:
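The plan output itself isn't captured here. For illustration, a topicmappr-style partition map (Kafka's reassignment JSON format) can be sanity-checked for per-broker balance with a few lines of Python; the topic name and broker IDs below are made up:

```python
import json
from collections import Counter

# Hypothetical partition map in Kafka's reassignment JSON format,
# the shape of plan that topicmappr works with.
plan = json.loads("""
{
  "version": 1,
  "partitions": [
    {"topic": "logging-eqiad", "partition": 0, "replicas": [1001, 1002, 1003]},
    {"topic": "logging-eqiad", "partition": 1, "replicas": [1002, 1003, 1004]},
    {"topic": "logging-eqiad", "partition": 2, "replicas": [1003, 1004, 1001]}
  ]
}
""")

# Count how many replicas each broker would hold under the proposed plan,
# and how many partitions each broker would lead (first replica = leader).
replica_counts = Counter(b for p in plan["partitions"] for b in p["replicas"])
leader_counts = Counter(p["replicas"][0] for p in plan["partitions"])

for broker in sorted(replica_counts):
    print(f"broker {broker}: {replica_counts[broker]} replicas, "
          f"{leader_counts.get(broker, 0)} leaders")
```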
Thu, Mar 14
Renewed certs have been deployed and alerts have cleared, and the auto-renew period for Kafka certificates has been increased from 11 to 30 days before expiration.
Mon, Mar 11
Spent some time looking into this, and I'm also wondering why these queries seem to yield different results when run over more than ~15d
Fri, Mar 1
A few observations after looking into this:
@fgiunchedi thanks for the task! Sure, having a look into this.
Tue, Feb 27
Footer is now in place on (slo|pyrra).wikimedia.org
Feb 7 2024
Here's a capture of what was logged by apache in the minute leading up to titan1002 hanging: https://phabricator.wikimedia.org/P56453
Jan 17 2024
A custom Grafana graphite-datasource exporter has been deployed, along with a Grafana dashboard that uses these metrics to outline current graphite datasource utilization.
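For reference, a minimal sketch of the general shape of such an exporter using prometheus_client; the metric name, label, port, and utilization numbers are hypothetical stand-ins, not the deployed code:

```python
import time
from prometheus_client import Gauge, start_http_server

# Hypothetical metric; the deployed exporter's names/labels may differ.
DATASOURCE_QUERIES = Gauge(
    "grafana_graphite_datasource_queries",
    "Number of dashboard queries using a graphite datasource",
    ["datasource"],
)

def collect_utilization():
    # Placeholder: the real exporter would inspect Grafana's dashboards/API
    # to count graphite datasource usage.
    return {"graphite-eqiad": 120, "graphite-codfw": 45}

if __name__ == "__main__":
    start_http_server(9123)  # arbitrary port for this sketch
    while True:
        for name, count in collect_utilization().items():
            DATASOURCE_QUERIES.labels(datasource=name).set(count)
        time.sleep(60)
```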
Jan 12 2024
Thanks for the context, although I'm still wondering why benthos and logstash would share a consumer group. As I understand it, benthos would be ingesting these topics instead of logstash, rather than benthos and logstash consuming the topics together.
@colewhite could you expand on the rationale for benthos joining the logstash consumer groups as opposed to using their own 'benthos' consumer groups?
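For context on the distinction, a minimal kafka-python sketch (broker and topic names are hypothetical): consumers sharing a group_id split a topic's partitions between them, while a consumer in its own group receives the full stream independently.

```python
from kafka import KafkaConsumer

# Consumers that share a group_id divide the topic's partitions among
# themselves, so each message is processed by only one of them.
shared = KafkaConsumer(
    "udp_localhost",                              # hypothetical topic
    group_id="logstash",                          # same group as logstash
    bootstrap_servers="kafka-logging1001:9092",   # hypothetical broker
)

# A consumer with its own group_id gets an independent copy of the
# full stream, regardless of what the logstash group consumes.
independent = KafkaConsumer(
    "udp_localhost",
    group_id="benthos",
    bootstrap_servers="kafka-logging1001:9092",
)
```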
Dec 22 2023
The requested access has been granted and will be live within the next 30 minutes. Transitioning to resolved, but please don't hesitate to reopen if any followup is needed. Thanks!
Great, thanks! I've just uploaded the patch. The next step will be a few approvals:
Closing as invalid due to inactivity; please reopen with the requested updates when ready to proceed. Thanks!
Hi @AnnWF could you please confirm that the developer account name is correct? The account named wfan is not associated with the email address in the description. Thanks in advance!
The requested access has been granted and will be live within the next 30 minutes. Transitioning to resolved, but please don't hesitate to reopen if any followup is needed. Thanks!
The requested access has been granted and will be live within the next 30 minutes. Transitioning to resolved, but please don't hesitate to reopen if any followup is needed. Thanks!
Dec 21 2023
Hello! A couple of approvals will be needed on this task in order to proceed:
Dec 20 2023
Thanks for the detailed request!
Dec 19 2023
In theory pyrra should do the right thing, but we're running into a couple of issues in our current deployment:
Thanks for the task, also related is https://github.com/pyrra-dev/pyrra/issues/986
Dec 18 2023
Hello! I'm removing the access request tag from this task as it doesn't appear actionable by SRE clinic duty. Please re-add if/when clinic duty attention is needed. Thanks!
Hello! A few approvals are needed here before proceeding
Hi @odimitrijevic, @Milimetric -- could you please review/approve this user addition to the analytics-privatedata-users group? Thanks in advance!
Hello! Grooming the backlog today. Given that we've been in a holding pattern on this for some time, I'll temporarily close as 'invalid' (since we need user input in order to proceed), with the understanding that it'll be reopened by the requestor when ready to proceed. Thanks!
Dec 7 2023
We're in a stable state with option 1 outlined in the description (increase heap to 4g) completed. Transitioning to resolved.
I'm reviewing the backlog today (almost exactly one year since the last update!) and I think we're ok to close this since certspotter failures were addressed, and we can re-evaluate if/when ready in a new task. Please reopen if I'm wrong about that.
I'm having a look through our backlog, and AFAIK the items in the task description have been done by now. Please revert my changes & reopen if I'm wrong about that.
Dec 6 2023
In cases where outbound mail delivery is important, basic inbound mail handling should be configured for the (sub)domain and any from addresses too. This way things like bounces, messages to abuse@, callbacks, DMARC, etc. can be properly handled.
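A quick way to sanity-check the basics for a (sub)domain, sketched with dnspython; the domain below is a placeholder:

```python
import dns.resolver

DOMAIN = "example.wikimedia.org"  # placeholder (sub)domain

def check_inbound_mail(domain: str) -> None:
    # An MX record is needed so bounces and abuse@ mail can be delivered.
    try:
        for mx in dns.resolver.resolve(domain, "MX"):
            print(f"MX: {mx.exchange} (pref {mx.preference})")
    except dns.resolver.NoAnswer:
        print("no MX record: bounces to this domain will fail")

    # A published DMARC record lets receivers report/enforce alignment.
    try:
        for txt in dns.resolver.resolve(f"_dmarc.{domain}", "TXT"):
            print(f"DMARC: {txt}")
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
        print("no DMARC record published")

check_inbound_mail(DOMAIN)
```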
@elukey could the gaps possibly be attributed to changes in the queries/recording rules? I know we've been through a few iterations for these SLOs; is it possible the recording rule simply wasn't in place for that gap period?
Thank you for looking into this @brouberol. Yes I think you are right about the anonymous ACLs. I think as long as we have pretty clear steps to enable the ACLs, and more importantly steps to revert if something unexpected happens, we should give it a try. Experimenting on kafka-test first SGTM
Dec 1 2023
Re: dropping replica labels, we may be able to instruct thanos compact to deduplicate them instead (not tested).
Nov 29 2023
Optimistically resolving now that the checklist in the description is complete.
Nov 16 2023
One option that comes to mind is relabeling with something like labelkeep to ingest only the labels we want/need on the Prometheus side. That'd let us cut down the label set without modifying the framework at the source.
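As a toy illustration of the labelkeep semantics (plain Python rather than an actual Prometheus relabel config): labelkeep keeps only labels whose names fully match the regex and drops the rest.

```python
import re

# Toy model of Prometheus's labelkeep action: keep only labels whose
# *name* fully matches the (anchored) regex; all other labels are dropped.
def labelkeep(labels: dict, regex: str) -> dict:
    pattern = re.compile(regex)
    return {k: v for k, v in labels.items() if pattern.fullmatch(k)}

series = {
    "__name__": "task_cpu_usage",
    "instance": "worker1001",
    "job": "framework",
    "executor_id": "abc-123",    # high-cardinality label we don't need
    "container_uuid": "def-456",
}

# Keep only the labels we actually want/need on the Prometheus side.
print(labelkeep(series, "__name__|instance|job"))
# -> {'__name__': 'task_cpu_usage', 'instance': 'worker1001', 'job': 'framework'}
```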
Nov 15 2023
Prometheus1005 is an R440, which should have 10 total 2.5" bays; today there are (6) 2T SSDs installed. I think it'd be worth getting the ball rolling on adding another (4) SSDs.
Nov 9 2023
I spent some time today experimenting with https://github.com/grafana/cortex-tools, specifically cortextool analyse grafana, which looked promising but unfortunately throws parse errors when it encounters a period in the metric name, making it unsuitable for graphite metrics.
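A rough fallback sketch that pulls graphite query strings straight out of exported dashboard JSON; the field names are assumed from graphite-style dashboards, and nested rows are only loosely handled:

```python
import json

def graphite_targets(dashboard_json: str):
    """Yield graphite query strings from an exported Grafana dashboard."""
    dashboard = json.loads(dashboard_json)
    panels = list(dashboard.get("panels", []))
    while panels:
        panel = panels.pop()
        # Row panels can nest further panels one level down.
        panels.extend(panel.get("panels", []))
        for target in panel.get("targets", []):
            # Graphite datasource targets keep the query in "target".
            if "target" in target:
                yield target["target"]

with open("dashboard.json") as f:  # hypothetical exported dashboard
    for query in graphite_targets(f.read()):
        print(query)
```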
Nov 8 2023
I've had success mounting non-root filesystems that were unreliable (networked fs, external arrays, these kinds of things) using autofs, which these days can be done natively in systemd with automount units.
Nov 7 2023
Thanks for the input, everyone! Sounds like we have a consensus on option 1. I'll get started with rolling collector VM reboots into 12GB memory, then upload a patch for the JVMs and go from there.