User Details
- User Since: May 30 2017, 5:25 PM (401 w, 5 d)
- Availability: Available
- IRC Nick: herron
- LDAP User: Herron
- MediaWiki User: Unknown
Wed, Feb 5
Setting the environment variable ETCDCTL_API=2 for the backup script may be an option as well
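For illustration, a minimal sketch of what that could look like, assuming the backup script wraps etcdctl; the data and backup paths below are placeholders, not the script's real values:

```
# Sketch only: force the etcdctl v2 API for the backup run.
# Paths are illustrative placeholders.
export ETCDCTL_API=2
etcdctl backup \
  --data-dir /var/lib/etcd/main \
  --backup-dir "/srv/backups/etcd/$(date +%F)"
```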
Tue, Feb 4
| Wikitech account/LDAP | Herron |
| SUL account | KHerron (WMF) |
| Account linked on IDM | Y |
| I have visited MediaWiki:Loginprompt | Y |
| I have tried to reset my password using Special:PasswordReset | Y |
Wed, Jan 22
While reflecting on the options above, a third option comes to mind: modify the check_ scripts that will remain so that they emit metrics to a push gateway. Pros and cons are mostly the same as option 2, with the added pro of not needing a wrapper.
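To make option 3 concrete, here's a rough sketch of how a remaining check_ script could push its result; the Pushgateway host, job, and metric names are hypothetical, purely for illustration:

```
# Hypothetical sketch: publish the check result as a gauge to a Prometheus Pushgateway.
check_exit=0   # stand-in for the exit code of the existing check logic
cat <<EOF | curl --data-binary @- "http://pushgateway.example.org:9091/metrics/job/check_foo/instance/$(hostname -f)"
# TYPE check_foo_status gauge
check_foo_status ${check_exit}
EOF
```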
Fri, Jan 17
Querying k8s eqiad for {__name__=~"istio.*"} in the Prometheus web UI, I'm seeing two errors occur
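For reference, the same query can be run against the HTTP API in the style of the /ops example below; that the k8s eqiad instance is served under the /k8s path is my assumption here:

```
# Assumption: the k8s eqiad Prometheus instance lives under the /k8s path.
curl -g 'http://prometheus.svc.eqiad.wmnet/k8s/api/v1/query?query={__name__=~"istio.*"}' | jq .
```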
Thu, Jan 16
```
$ curl -g 'http://prometheus.svc.eqiad.wmnet/ops/api/v1/query?query={__name__!=""}' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   128  100   128    0     0      1      0  0:02:08  0:01:54  0:00:14    28
{
  "status": "error",
  "errorType": "execution",
  "error": "query processing would load too many samples into memory in query execution"
}
```
Tue, Jan 14
I'll untag o11y since the alerting portion of this looks done; please re-tag if needed, thanks!
Resolving as AFAIK we've been in steady state here for some time
Mon, Jan 13
While looking at https://gerrit.wikimedia.org/r/1110747 (which nicely highlights gap periods), I noticed that the recent gap appears to have shifted since I looked at log_dead_letters_hits:increase12w last week
Jan 8 2025
Cleaning up old task
Cleaning up old tasks
Cleaning up old task
Declining as we're moving away from Grizzly
Declining as we're moving away from Grizzly
Declining as we've moved to Pyrra
I'll resolve this as we now inherit an SLI/SLO metric convention via Pyrra
closing old task
Resolving since we have sampling working now via the otel collector
Jan 7 2025
AFAIK this was sorted out with the last work on the alert hosts by using a different failover strategy
Dec 19 2024
Looking again today with more time passed, I think the good news is we've dropped rx bandwidth from roughly 250MB/s to 150MB/s sustained, in the ballpark of a 40% reduction.
Dec 18 2024
I'm beginning to see some improvements with the caching bucket enabled, and I think there's room for further tuning/improvement. Please see details in https://phabricator.wikimedia.org/T368953#10413075
Initial results with the Thanos store caching bucket enabled look promising. I'm seeing reductions in duration, errors, and network/socket utilization, plus a slight decrease in CPU utilization.
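As a rough sketch of what "caching bucket enabled" means on the store gateway (the memcached address and file paths are placeholders, not the production configuration):

```
# Sketch only: thanos store with a memcached-backed caching bucket.
# Addresses and paths are illustrative placeholders.
thanos store \
  --objstore.config-file=/etc/thanos/objstore.yaml \
  --store.caching-bucket.config="$(cat <<'EOF'
type: MEMCACHED
config:
  addresses: ["localhost:11211"]
EOF
)"
```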
Dec 17 2024
A few more cache tuning patches that should have been linked on this task:
Thanks @fgiunchedi, that helps explain the lower-than-expected cache memory utilization in the frontend.
Dec 3 2024
I am writing this email in regard to case #1234678, which is about "escalation not working as expected".
As mentioned in the previous email, it has been observed that the user escalator_sysuser has routed incident #5465 from "SRE:SRE Business Hours (Escalation)" to "SRE:SRE Batphone (Escalation)". This change in escalation policy will result in notifications being sent to users from both escalation policies simultaneously.
Additionally, please note that the timestamps may vary, as the timeline payload captures the Recovery Alert timings. The incident trigger time can be viewed in the Critical Alert for the incident, so there was no delay from Splunk On-Call.
VMs built, tracking remaining setup in T381417: aux-k8s-codfw cluster setup
VMs built, tracking remaining setup in T381417: aux-k8s-codfw cluster setup
VMs built, tracking remaining setup in T381417: aux-k8s-codfw cluster setup
Nov 21 2024
cumin1002:~$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 1 --disk 50 --os bookworm --cluster codfw --group A -t T378988 aux-k8s-etcd2003
Nov 20 2024
escalator_sysuser is our account for the vo-escalate service, which runs from the active alert host. vo-escalate checks every 15 seconds for incidents that have not yet paged anyone and routes them to the batphone, and for some reason it fired on this incident.
This has been superseded by Pyrra, which generates the recording rules automatically
Nov 15 2024
I think we're good here!
Nov 13 2024
Hello! A few next steps to move forward with this request:
Nov 12 2024
Membership to LDAP group wmf has been provisioned, thanks!
uid=khantstop has been added to LDAP group wmf
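A quick way to double-check the membership, sketched here with an assumed base DN (not taken from this task):

```
# Assumption: groups live under ou=groups,dc=wikimedia,dc=org; adjust to the real base DN.
ldapsearch -x -b 'ou=groups,dc=wikimedia,dc=org' '(cn=wmf)' member | grep khantstop
```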
Nov 7 2024
```
ganeti1028:~# gnt-instance console aux-k8s-worker1004.eqiad.wmnet
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:C9d5dCz/suTrJQqv6NtW3q/R241NXTC1GL3JMLfaMY4.
Please contact your system administrator.
Add correct host key in /dev/null to get rid of this message.
Offending DSA key in /var/lib/ganeti/known_hosts:2
  remove with:
  ssh-keygen -f "/var/lib/ganeti/known_hosts" -R "ganeti01.svc.eqiad.wmnet"
RSA host key for ganeti01.svc.eqiad.wmnet has changed and you have requested strict checking.
Host key verification failed.
Failure: command execution error:
Connection to console of instance aux-k8s-worker1004.eqiad.wmnet failed, please check cluster configuration
```
Oct 31 2024
(Fixed in T377703)
```
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.825Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Beginning probe" probe=tcp timeout_seconds=3
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.825Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Resolving target address" target=10.64.16.86 ip_protocol=ip4
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.825Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Resolved target address" target=10.64.16.86 ip=10.64.16.86
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.827Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Dialing TCP with TLS"
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.876Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Successfully dialed"
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.876Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Probe succeeded" duration_seconds=0.050861874
```
Oct 29 2024
Looked into this a bit since the silence expired; the service being probed is up, but it looks like the related Prometheus blackbox exporter is failing to load the configured certificate
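One way to poke at this by hand is the exporter's /probe endpoint with debug output, sketched here assuming the default port 9115 on the Prometheus host:

```
# Sketch: ask the blackbox exporter to run the module once and dump its debug logs.
curl 'http://localhost:9115/probe?module=tcp_rsyslog_receiver_ip4&target=10.64.16.86:6514&debug=true'
```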
Oct 25 2024
Looked into this a bit and from what I can tell the feature for OTEL sampling within Thanos was introduced in a later version (v0.32.0)
Oct 21 2024
https://trace.wikimedia.org is now running 1.62.0.