herron (Keith Herron)
Site Reliability Engineer

User Details

User Since: May 30 2017, 5:25 PM (401 w, 5 d)
Availability: Available
IRC Nick: herron
LDAP User: Herron
MediaWiki User: Unknown

Recent Activity

Wed, Feb 5

herron added a comment to T385727: etcd: adapt etcd-backup.py for etcd 3.4.

setting environment ETCDCTL_API=2 for the backup script may be an option as well
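
As a rough illustration of that option, the sketch below pins ETCDCTL_API for a single child process and leaves the rest of the environment alone. The helper names are hypothetical, and the actual etcdctl command that etcd-backup.py runs is not shown in this feed, so a stand-in child process is used instead.

```python
import os
import subprocess
import sys


def etcdctl_env(api_version="2"):
    """Copy the current environment and pin ETCDCTL_API for a child process."""
    env = dict(os.environ)  # copy, so the parent environment is untouched
    env["ETCDCTL_API"] = api_version
    return env


def run_with_etcd_api(cmd, api_version="2"):
    """Run cmd (a list of arguments) with ETCDCTL_API set only for that process."""
    return subprocess.run(
        cmd,
        env=etcdctl_env(api_version),
        check=True,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # Stand-in child process that just echoes the variable it received.
    probe = [sys.executable, "-c", "import os; print(os.environ['ETCDCTL_API'])"]
    print(run_with_etcd_api(probe).stdout.strip())
```

Scoping the variable to the subprocess keeps any other tooling on the host on its own default API version.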

Wed, Feb 5, 4:21 PM · SRE Observability (FY2024/2025-Q3), SRE
herron created T385727: etcd: adapt etcd-backup.py for etcd 3.4.
Wed, Feb 5, 4:10 PM · SRE Observability (FY2024/2025-Q3), SRE

Tue, Feb 4

herron awarded T376267: ☂ Wikitech account linking and SUL error reporting a Cup of Joe token.
Tue, Feb 4, 7:07 PM · wikitech.wikimedia.org
herron updated the task description for T381417: aux-k8s-codfw cluster setup.
Tue, Feb 4, 6:33 PM · SRE Observability (FY2024/2025-Q3), Infrastructure-Foundations, SRE, Kubernetes
herron added a comment to T376267: ☂ Wikitech account linking and SUL error reporting.
Wikitech account/LDAP: Herron
SUL account: KHerron (WMF)
Account linked on IDM: Y
I have visited MediaWiki:Loginprompt: Y
I have tried to reset my password using Special:PasswordReset: Y
Tue, Feb 4, 5:40 PM · wikitech.wikimedia.org

Mon, Feb 3

herron added a comment to T381417: aux-k8s-codfw cluster setup.

Change #1116825 merged by Herron:

[operations/puppet@production] aux_k8s: apply etcd_aux_k8s role to aux-k8s-etcd200[345] nodes

https://gerrit.wikimedia.org/r/1116825

Mon, Feb 3, 7:11 PM · SRE Observability (FY2024/2025-Q3), Infrastructure-Foundations, SRE, Kubernetes

Wed, Jan 22

herron added a comment to T350360: Evaluate "drop in" replacement for nrpe scripts.

While reflecting on the options above, an option 3 comes to mind where we would modify the check_ scripts that will remain to emit metrics to a push gateway. Pros and cons are mostly cut/paste from option 2, with a pro of not needing a wrapper.
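
A minimal sketch of what option 3 could look like, assuming a check script reports its Nagios-style exit status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) as a gauge. The metric name, label, job name, and gateway address are hypothetical; the URL path follows the Pushgateway convention of /metrics/job/<job>/instance/<instance>.

```python
import urllib.parse
import urllib.request


def render_metric(name, value, labels=None):
    """Render one gauge sample in the Prometheus text exposition format."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    # The trailing newline matters: the text format expects the body to end with one.
    return f"# TYPE {name} gauge\n{name}{label_str} {value}\n"


def push_url(base, job, instance):
    """Build the grouping-key URL for a given job and instance."""
    job_q = urllib.parse.quote(job, safe="")
    instance_q = urllib.parse.quote(instance, safe="")
    return f"{base.rstrip('/')}/metrics/job/{job_q}/instance/{instance_q}"


def push(base, job, instance, body, timeout=5):
    """PUT the rendered metrics, replacing earlier samples in the same group."""
    req = urllib.request.Request(
        push_url(base, job, instance),
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Hypothetical names; nothing is sent over the network in this demo.
    body = render_metric("nrpe_check_status", 2, {"check": "check_disk"})
    print(push_url("http://pushgateway.example:9091", "nrpe", "host1"))
    print(body, end="")
```

Keeping the rendering separate from the HTTP call means a check script can be tested without a gateway.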

Wed, Jan 22, 4:01 PM · Observability-Alerting

Fri, Jan 17

herron added a comment to T383923: Prometheus: queries matching on {__name__} error out on larger instances.

querying k8s eqiad for {__name__=~"istio.*"} in prometheus web I'm seeing two errors occur

Fri, Jan 17, 4:12 PM · Observability-Metrics

Thu, Jan 16

herron added a comment to T383923: Prometheus: queries matching on {__name__} error out on larger instances.
$ curl -g 'http://prometheus.svc.eqiad.wmnet/ops/api/v1/query?query={__name__!=""}' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   128  100   128    0     0      1      0  0:02:08  0:01:54  0:00:14    28
{
  "status": "error",
  "errorType": "execution",
  "error": "query processing would load too many samples into memory in query execution"
}
Thu, Jan 16, 9:19 PM · Observability-Metrics
herron created T383923: Prometheus: queries matching on {__name__} error out on larger instances.
Thu, Jan 16, 7:18 PM · Observability-Metrics

Tue, Jan 14

herron removed a project from T380402: idp.wikimedia.org should have a paging blackbox probe: Observability-Alerting.

I'll untag o11y since the alerting portion of this looks done, please re-tag if needed thanks!

Tue, Jan 14, 4:04 PM · Infrastructure-Foundations
herron closed T335586: improve controls for on-call rotation management, a subtask of T313958: Evaluate viable candidates for incident paging, as Resolved.
Tue, Jan 14, 3:56 PM · Observability-Alerting
herron closed T335586: improve controls for on-call rotation management as Resolved.

Resolving as afaik we've been in steady state here for some time

Tue, Jan 14, 3:56 PM · Observability-Alerting
herron closed T350508: Grafana OnCall: Production service setup, a subtask of T350506: Explore Grafana OnCall for on-call schedule management and alert/page routing, as Declined.
Tue, Jan 14, 3:55 PM · Observability-Alerting
herron closed T350508: Grafana OnCall: Production service setup as Declined.
Tue, Jan 14, 3:55 PM · Observability-Alerting
herron awarded T352756: Gap in metrics rendered from Thanos Rules a Cup of Joe token.
Tue, Jan 14, 2:28 PM · SRE Observability (FY2024/2025-Q3), Observability-Metrics, Machine-Learning-Team

Mon, Jan 13

herron added a comment to T352756: Gap in metrics rendered from Thanos Rules.

While looking at https://gerrit.wikimedia.org/r/1110747 (which nicely highlights gap periods) I noticed that the recent gap appears to have shifted since I looked last week at log_dead_letters_hits:increase12w

Mon, Jan 13, 3:20 PM · SRE Observability (FY2024/2025-Q3), Observability-Metrics, Machine-Learning-Team

Jan 8 2025

herron added a subtask for T269333: Switch default Grafana datasource to Thanos: T256954: Port Prometheus dashboards to Thanos.
Jan 8 2025, 3:53 PM · SRE Observability (FY2024/2025-Q1), Observability-Metrics
herron added a parent task for T256954: Port Prometheus dashboards to Thanos: T269333: Switch default Grafana datasource to Thanos.
Jan 8 2025, 3:53 PM · Observability-Metrics, User-fgiunchedi, SRE
herron closed T254231: Icinga checks that are OK even if host is down as Resolved.

Cleaning up old task

Jan 8 2025, 3:51 PM · Observability-Alerting
herron closed T197171: Graph outbound mail volume on per-service or hostgroup level as Resolved.

Cleaning up old tasks

Jan 8 2025, 3:43 PM · Observability-Logging, SRE-Sprint-Week-Sustainability-March2023, Infrastructure-Foundations, Sustainability (Incident Followup), observability, Mail
herron closed T197171: Graph outbound mail volume on per-service or hostgroup level, a subtask of T370005: Email improvements round two (FY 2024/25), as Resolved.
Jan 8 2025, 3:43 PM · Epic, Infrastructure-Foundations, Mail
herron closed T193766: Ship host syslogs to ELK, a subtask of T205855: Investigate approaches to ingest sensitive log producers, as Resolved.
Jan 8 2025, 3:41 PM · observability, Wikimedia-Logstash, SRE
herron closed T193766: Ship host syslogs to ELK as Resolved.

Cleaning up old task

Jan 8 2025, 3:41 PM · Observability-Logging, observability, Wikimedia-Logstash, SRE
herron closed T193766: Ship host syslogs to ELK, a subtask of T213902: Implement sensitive logstash access control, as Resolved.
Jan 8 2025, 3:41 PM · Patch-Needs-Improvement, User-herron, Observability-Logging
herron closed T347499: Grafana oncall pilot environment (in prod/ganeti) as Declined.
Jan 8 2025, 3:26 PM · User-herron, Observability-Alerting
herron closed T331659: Grizzly: CI improvements as Declined.

Declining as we're moving away from Grizzly

Jan 8 2025, 3:14 PM · SRE-grizzly-sprint, Observability-Metrics
herron closed T331656: Grizzly: onboard "popular" dashboards as static json managed dashboards as Declined.

Declining as we're moving away from Grizzly

Jan 8 2025, 3:14 PM · SRE-grizzly-sprint, Observability-Metrics
herron closed T290012: Add service SLO URL to template, a subtask of T274665: Design and implement SLO Dashboard tooling, as Resolved.
Jan 8 2025, 2:43 PM · SRE Observability (FY2021/2022-Q1)
herron closed T290012: Add service SLO URL to template as Resolved.
Jan 8 2025, 2:43 PM · Observability-Metrics
herron closed T290009: Add Budget Burndown Panels to SLO Dashboard Template, a subtask of T274665: Design and implement SLO Dashboard tooling, as Declined.
Jan 8 2025, 2:43 PM · SRE Observability (FY2021/2022-Q1)
herron closed T290009: Add Budget Burndown Panels to SLO Dashboard Template as Declined.

declining as we've moved to pyrra

Jan 8 2025, 2:43 PM · Observability-Metrics
herron closed T274668: Standardize a SLI metrics naming/storage/mapping scheme, a subtask of T274665: Design and implement SLO Dashboard tooling, as Resolved.
Jan 8 2025, 2:41 PM · SRE Observability (FY2021/2022-Q1)
herron closed T274668: Standardize a SLI metrics naming/storage/mapping scheme as Resolved.

I'll resolve this as we now inherit a sli/slo metric convention via pyrra

Jan 8 2025, 2:41 PM · Observability-Metrics
herron closed T248400: elk7: fields indexed without position data; cannot run PhraseQuery, a subtask of T234854: Upgrade ELK Stack to version 7, as Resolved.
Jan 8 2025, 2:39 PM · SRE Observability (FY2021/2022-Q1), observability, Patch-For-Review, SRE, Wikimedia-Logstash
herron closed T248400: elk7: fields indexed without position data; cannot run PhraseQuery as Resolved.

closing old task

Jan 8 2025, 2:39 PM · Observability-Logging, observability, Patch-For-Review, SRE, Wikimedia-Logstash
herron closed T207296: Rationalize default logrotate "rotated" file extensions as Declined.
Jan 8 2025, 2:36 PM · Observability-Logging, observability, Wikimedia-Logstash, SRE
herron closed T378190: Thanos: set up trace sampling as Resolved.

Resolving since we have sampling working now via the otel collector

Jan 8 2025, 2:31 PM · Patch-For-Review, Observability-Tracing
herron closed T378190: Thanos: set up trace sampling, a subtask of T376179: Thanos: enable tracing, as Resolved.
Jan 8 2025, 2:31 PM · Observability-Tracing

Jan 7 2025

herron closed T333855: vopsbot needed manual restart after alerting hosts failover, a subtask of T333478: failover alert1001 to alert2001, as Resolved.
Jan 7 2025, 4:14 PM · Patch-For-Review, SRE Observability
herron closed T333855: vopsbot needed manual restart after alerting hosts failover as Resolved.

AFAIK this was sorted out with the last work on the alert hosts by using a different failover strategy

Jan 7 2025, 4:13 PM · Observability-Alerting

Dec 20 2024

herron added a comment to T368953: Thanos Cache Tuning.

let's please offboard the problematic liftwing SLOs before the holidays and get thanos/titan to a pre-raw-metric load

Dec 20 2024, 3:34 PM · Patch-For-Review, Observability-Metrics
herron added a comment to T302995: Transition to Pyrra for SLO Visualization and Management.

Mentioned in T368953 as well: the heavy liftwing SLOs have been offboarded for now. They are using a lot of Thanos system resources, and we think it'll be safest to offboard them during the break (apologies for the cross-posts, trying to keep SLO talk here and cache tuning in T368953)

Dec 20 2024, 3:34 PM · Patch-For-Review, User-herron, Observability-Metrics

Dec 19 2024

herron added a comment to T368953: Thanos Cache Tuning.

Looking again today with some more time passed, I think the good news is we've dropped rx bandwidth from roughly 250MB/s to 150MB/s sustained, a reduction of about 40%.

Dec 19 2024, 5:31 PM · Patch-For-Review, Observability-Metrics

Dec 18 2024

herron added a comment to T302995: Transition to Pyrra for SLO Visualization and Management.

I'm beginning to see some improvements with the caching bucket enabled, and I think there's room for further tuning/improvement. Please see details in https://phabricator.wikimedia.org/T368953#10413075

Dec 18 2024, 4:52 PM · Patch-For-Review, User-herron, Observability-Metrics
herron added a comment to T368953: Thanos Cache Tuning.

Initial results with the Thanos store caching bucket enabled look promising. I'm seeing reductions in duration, errors, and network/socket utilization, and a slight decrease in CPU utilization

Dec 18 2024, 4:50 PM · Patch-For-Review, Observability-Metrics

Dec 17 2024

herron added a comment to T368953: Thanos Cache Tuning.

A few more cache tuning patches that should have been linked on this task:

Dec 17 2024, 6:06 PM · Patch-For-Review, Observability-Metrics
herron added a comment to T302995: Transition to Pyrra for SLO Visualization and Management.

Thanks @fgiunchedi, that helps explain why cache memory utilization in the frontend is much lower than I'd expect.

Dec 17 2024, 5:32 PM · Patch-For-Review, User-herron, Observability-Metrics

Dec 12 2024

herron moved T382055: slos.wikimedia.org name not present on current certificate on cp hosts from Inbox to Backlog on the SRE Observability board.
Dec 12 2024, 2:55 PM · Observability-Metrics
herron edited projects for T382055: slos.wikimedia.org name not present on current certificate on cp hosts, added: SRE Observability; removed Observability-Metrics.
Dec 12 2024, 2:55 PM · Observability-Metrics
herron claimed T382055: slos.wikimedia.org name not present on current certificate on cp hosts.
Dec 12 2024, 2:54 PM · Observability-Metrics

Dec 10 2024

herron created T381901: MariaDB Replica SQL: s6 on db2158 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: ruwiki. .
Dec 10 2024, 5:23 PM · DBA

Dec 9 2024

herron updated the task description for T302995: Transition to Pyrra for SLO Visualization and Management.
Dec 9 2024, 6:14 PM · Patch-For-Review, User-herron, Observability-Metrics

Dec 3 2024

herron added a comment to T371244: VictorOps paged batphone immediately rather than after 5m.

I am writing this email in regard to case #1234678, which is about "escalation not working as expected".

As mentioned in the previous email, it has been observed that the user escalator_sysuser has routed incident #5465 from "SRE:SRE Business Hours (Escalation)" to "SRE:SRE Batphone (Escalation)". This change in escalation policy will result in notifications being sent to users from both escalation policies simultaneously.

Additionally, please note that the timestamps may vary as the timeline payload captures the Recovery Alert Timings. The incident triggered time can be viewed in the Critical Alert for the incident. So there was no delay from the Splunk On-Call.

Dec 3 2024, 7:15 PM · SRE Observability, SRE-OnFire, SRE
herron moved T381417: aux-k8s-codfw cluster setup from Inbox to FY2024/2025-Q3 on the SRE Observability board.
Dec 3 2024, 7:07 PM · SRE Observability (FY2024/2025-Q3), Infrastructure-Foundations, SRE, Kubernetes
herron closed T378987: codfw: (4x) aux-k8s-worker nodes as Resolved.

VMs built, tracking remaining setup in T381417: aux-k8s-codfw cluster setup

Dec 3 2024, 5:01 PM · vm-requests, SRE, Kubernetes
herron closed T378988: codfw: (3x) aux-k8s-etcd nodes as Resolved.

VMs built, tracking remaining setup in T381417: aux-k8s-codfw cluster setup

Dec 3 2024, 4:59 PM · vm-requests, SRE, Kubernetes
herron closed T378987: codfw: (4x) aux-k8s-worker nodes, a subtask of T378742: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams, as Resolved.
Dec 3 2024, 4:59 PM · Infrastructure-Foundations, SRE, Kubernetes, Epic
herron closed T378986: codfw: (2x) aux-k8s-ctrl nodes as Resolved.

VMs built, tracking remaining setup in T381417: aux-k8s-codfw cluster setup

Dec 3 2024, 4:59 PM · vm-requests, SRE, Kubernetes
herron closed T378988: codfw: (3x) aux-k8s-etcd nodes, a subtask of T378742: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams, as Resolved.
Dec 3 2024, 4:58 PM · Infrastructure-Foundations, SRE, Kubernetes, Epic
herron closed T378986: codfw: (2x) aux-k8s-ctrl nodes, a subtask of T378742: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams, as Resolved.
Dec 3 2024, 4:58 PM · Infrastructure-Foundations, SRE, Kubernetes, Epic
herron updated the task description for T378742: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams.
Dec 3 2024, 4:56 PM · Infrastructure-Foundations, SRE, Kubernetes, Epic
herron created T381417: aux-k8s-codfw cluster setup.
Dec 3 2024, 4:55 PM · SRE Observability (FY2024/2025-Q3), Infrastructure-Foundations, SRE, Kubernetes

Nov 21 2024

herron updated the task description for T378988: codfw: (3x) aux-k8s-etcd nodes.
Nov 21 2024, 10:31 PM · vm-requests, SRE, Kubernetes
herron added a comment to T378988: codfw: (3x) aux-k8s-etcd nodes.
cumin1002:~$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 1 --disk 50 --os bookworm --cluster codfw --group A -t T378988 aux-k8s-etcd2003
Nov 21 2024, 10:31 PM · vm-requests, SRE, Kubernetes

Nov 20 2024

herron added a comment to T371244: VictorOps paged batphone immediately rather than after 5m.

escalator_sysuser is our account for the vo-escalate service which runs from the active alert host. vo-escalate checks every 15 seconds looking for incidents that have not yet paged anyone, and routes them to the batphone, and for some reason it fired on this incident.
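
The polling behaviour described above can be sketched roughly as follows. This is not the vo-escalate code: the Incident fields, the policy names, and in particular the 300-second grace period (borrowed from the "after 5m" in the task title) are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    number: int
    triggered_at: float  # epoch seconds
    paged_users: list = field(default_factory=list)
    policy: str = "SRE Business Hours"


def needs_escalation(incident, now, grace_seconds=300.0):
    """True if nobody has been paged and the grace period has elapsed."""
    unpaged = not incident.paged_users
    return unpaged and (now - incident.triggered_at) >= grace_seconds


def escalate_unpaged(incidents, now, target_policy="SRE Batphone", grace_seconds=300.0):
    """One polling pass: reroute qualifying incidents and return their numbers."""
    rerouted = []
    for inc in incidents:
        if needs_escalation(inc, now, grace_seconds):
            inc.policy = target_policy
            rerouted.append(inc.number)
    return rerouted


if __name__ == "__main__":
    incidents = [
        Incident(1, triggered_at=0.0),                         # old and unpaged
        Incident(2, triggered_at=0.0, paged_users=["alice"]),  # already paged
        Incident(3, triggered_at=290.0),                       # still in grace period
    ]
    print(escalate_unpaged(incidents, now=300.0))
```

A real service would run a pass like this on a timer (every 15 seconds, per the comment above). In a sketch like this, a pass that treats an incident as older than it is, or that sees an empty paged list before paging has been recorded, would reroute it early.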

Nov 20 2024, 3:00 PM · SRE Observability, SRE-OnFire, SRE
herron closed T289615: Migrate existing SLO related metrics to recording rules, a subtask of T274668: Standardize a SLI metrics naming/storage/mapping scheme, as Invalid.
Nov 20 2024, 2:32 PM · Observability-Metrics
herron closed T289615: Migrate existing SLO related metrics to recording rules as Invalid.

This has been superseded by Pyrra which generates the recording rules automatically

Nov 20 2024, 2:32 PM · Patch-Needs-Improvement, Observability-Metrics

Nov 18 2024

CDanis awarded T378989: eqiad: (2x) aux-k8s-worker nodes a Party Time token.
Nov 18 2024, 6:57 PM · Kubernetes, vm-requests, SRE
herron added a comment to T374178: PrometheusRuleEvaluationFailures team-sre_opensearch.yaml.

What do you think about?

Nov 18 2024, 3:11 PM · Observability-Alerting

Nov 15 2024

herron updated the task description for T378742: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams.
Nov 15 2024, 6:49 PM · Infrastructure-Foundations, SRE, Kubernetes, Epic
herron closed T378989: eqiad: (2x) aux-k8s-worker nodes, a subtask of T378742: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams, as Resolved.
Nov 15 2024, 6:49 PM · Infrastructure-Foundations, SRE, Kubernetes, Epic
herron closed T378989: eqiad: (2x) aux-k8s-worker nodes as Resolved.

I think we're good here!

Nov 15 2024, 6:49 PM · Kubernetes, vm-requests, SRE

Nov 13 2024

herron updated the task description for T379678: Requesting access to deployment for dbrant.
Nov 13 2024, 6:07 PM · SRE, SRE-Access-Requests
herron moved T379678: Requesting access to deployment for dbrant from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.

Hello! A few next-steps to move forward with this request:

Nov 13 2024, 5:42 PM · SRE, SRE-Access-Requests
herron triaged T379678: Requesting access to deployment for dbrant as Medium priority.
Nov 13 2024, 5:38 PM · SRE, SRE-Access-Requests

Nov 12 2024

herron closed T379630: Grant Access to ldap/wmf for HArroyo-WMF as Resolved.

membership to ldap group wmf has been provisioned, thanks!

Nov 12 2024, 3:47 PM · SRE, LDAP-Access-Requests
herron added a member for WMF-NDA: hector.arroyo.
Nov 12 2024, 3:39 PM
herron closed T379409: Grant Access to ldap/wmf for khantstop as Resolved.

uid=khantstop has been added to ldap group wmf

Nov 12 2024, 2:56 PM · SRE, LDAP-Access-Requests
herron added a member for WMF-NDA: Khantstop.
Nov 12 2024, 2:47 PM

Nov 8 2024

lmata awarded T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514) a Like token.
Nov 8 2024, 8:17 PM · SRE Observability, sre-alert-triage
herron added a comment to T378989: eqiad: (2x) aux-k8s-worker nodes.
ganeti1028:~# gnt-instance console aux-k8s-worker1004.eqiad.wmnet
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:C9d5dCz/suTrJQqv6NtW3q/R241NXTC1GL3JMLfaMY4.
Please contact your system administrator.
Add correct host key in /dev/null to get rid of this message.
Offending DSA key in /var/lib/ganeti/known_hosts:2
  remove with:
  ssh-keygen -f "/var/lib/ganeti/known_hosts" -R "ganeti01.svc.eqiad.wmnet"
RSA host key for ganeti01.svc.eqiad.wmnet has changed and you have requested strict checking.
Host key verification failed.
Failure: command execution error:
Connection to console of instance aux-k8s-worker1004.eqiad.wmnet failed, please check cluster configuration

Looks like this VM got assigned to the new ganeti host ganeti1045.eqiad.wmnet, which might still be a work in progress (T378921)?

Nov 8 2024, 3:20 PM · Kubernetes, vm-requests, SRE

Nov 7 2024

herron added a comment to T378989: eqiad: (2x) aux-k8s-worker nodes.
ganeti1028:~# gnt-instance console aux-k8s-worker1004.eqiad.wmnet
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:C9d5dCz/suTrJQqv6NtW3q/R241NXTC1GL3JMLfaMY4.
Please contact your system administrator.
Add correct host key in /dev/null to get rid of this message.
Offending DSA key in /var/lib/ganeti/known_hosts:2
  remove with:
  ssh-keygen -f "/var/lib/ganeti/known_hosts" -R "ganeti01.svc.eqiad.wmnet"
RSA host key for ganeti01.svc.eqiad.wmnet has changed and you have requested strict checking.
Host key verification failed.
Failure: command execution error:
Connection to console of instance aux-k8s-worker1004.eqiad.wmnet failed, please check cluster configuration
Nov 7 2024, 7:22 PM · Kubernetes, vm-requests, SRE

Nov 4 2024

herron updated the task description for T378742: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams.
Nov 4 2024, 3:44 PM · Infrastructure-Foundations, SRE, Kubernetes, Epic
herron created T378989: eqiad: (2x) aux-k8s-worker nodes.
Nov 4 2024, 3:40 PM · Kubernetes, vm-requests, SRE
herron created T378988: codfw: (3x) aux-k8s-etcd nodes.
Nov 4 2024, 3:40 PM · vm-requests, SRE, Kubernetes
herron created T378987: codfw: (4x) aux-k8s-worker nodes.
Nov 4 2024, 3:40 PM · vm-requests, SRE, Kubernetes
herron created T378986: codfw: (2x) aux-k8s-ctrl nodes.
Nov 4 2024, 3:40 PM · vm-requests, SRE, Kubernetes

Nov 1 2024

colewhite awarded T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514) a Party Time token.
Nov 1 2024, 9:21 PM · SRE Observability, sre-alert-triage

Oct 31 2024

herron closed T377703: Alert in need of triage: ProbeDown (instance centrallog2002:6514) as Resolved.

(Fixed in T377703)

Oct 31 2024, 6:23 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), sre-alert-triage
herron closed T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514) as Resolved.
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.825Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Beginning probe" probe=tcp timeout_seconds=3
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.825Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Resolving target address" target=10.64.16.86 ip_protocol=ip4
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.825Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Resolved target address" target=10.64.16.86 ip=10.64.16.86
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.827Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Dialing TCP with TLS"
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.876Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Successfully dialed"
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.876Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Probe succeeded" duration_seconds=0.050861874
Oct 31 2024, 6:22 PM · SRE Observability, sre-alert-triage
herron closed T377703: Alert in need of triage: ProbeDown (instance centrallog2002:6514), a subtask of T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514), as Resolved.
Oct 31 2024, 6:22 PM · SRE Observability, sre-alert-triage
herron added a comment to T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514).

Change #1084199 merged by Herron:

[operations/puppet@production] profile::syslog::centralserver: use prometheus cert for blackbox check

https://gerrit.wikimedia.org/r/1084199

Oct 31 2024, 4:55 PM · SRE Observability, sre-alert-triage

Oct 30 2024

herron added a project to T376790: Split the permission to access Logstash from the cn=wmf and cn=nda groups: SRE Observability.
Oct 30 2024, 1:09 PM · SRE Observability, Infrastructure-Foundations, SRE

Oct 29 2024

herron added a comment to T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514).

Looked into this a bit since the silence expired. The service being probed is up, but it looks like the related Prometheus blackbox exporter is failing to load the configured certificate

Oct 29 2024, 5:38 PM · SRE Observability, sre-alert-triage
herron added a subtask for T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514): T377703: Alert in need of triage: ProbeDown (instance centrallog2002:6514).
Oct 29 2024, 5:35 PM · SRE Observability, sre-alert-triage
herron added a parent task for T377703: Alert in need of triage: ProbeDown (instance centrallog2002:6514): T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514).
Oct 29 2024, 5:35 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), sre-alert-triage

Oct 25 2024

herron updated subscribers of T378190: Thanos: set up trace sampling.

Looked into this a bit and from what I can tell the feature for OTEL sampling within Thanos was introduced in a later version (v0.32.0)

Oct 25 2024, 3:11 PM · Patch-For-Review, Observability-Tracing
herron created T378190: Thanos: set up trace sampling.
Oct 25 2024, 3:02 PM · Patch-For-Review, Observability-Tracing

Oct 21 2024

herron closed T376904: Upgrade to Jaeger v1.62.0 as Resolved.

https://trace.wikimedia.org is now running 1.62.0.

Oct 21 2024, 4:50 PM · Patch-For-Review, Observability-Tracing