
colewhite (cwhite)
User

User Details

User Since
Aug 21 2018, 6:05 PM (275 w, 2 d)
Availability
Available
LDAP User
Cwhite
MediaWiki User
CWhite (WMF) [ Global Accounts ]

Recent Activity

Fri, Nov 17

colewhite closed T350786: No entries at all in beta-logs.wmcloud.org since 2023-11-06 Z 12:15:39 as Resolved.

Definitely a different problem.

Fri, Nov 17, 4:46 PM · Quality-and-Test-Engineering-Team, SRE Observability (FY2023/2024-Q2), Observability-Logging, Release-Engineering-Team, Beta-Cluster-Infrastructure

Thu, Nov 16

colewhite closed T350786: No entries at all in beta-logs.wmcloud.org since 2023-11-06 Z 12:15:39 as Resolved.

Logstash was crashlooping because it was attempting to load a template that did not exist on the host anymore. Now that it is using the right template, logs are flowing again.

Thu, Nov 16, 3:31 PM · Quality-and-Test-Engineering-Team, SRE Observability (FY2023/2024-Q2), Observability-Logging, Release-Engineering-Team, Beta-Cluster-Infrastructure

Wed, Nov 15

colewhite added a comment to T350638: Track MediaWiki stats usage.

Injecting statslib into the legacy statsd code path introduces significant complexity and risk.

Wed, Nov 15, 9:01 PM · Patch-For-Review, Observability-Metrics, MediaWiki-libs-Stats
colewhite added a comment to T322424: Commons/multimedia errors, caused by repeated swift (cascading?) failures, late 2022.

@jcrespo roll-restarted swift proxies today using sre.swift.roll-restart-reboot-swift-ms-proxies cookbook in response to high 502s and 504s from ATS.

Wed, Nov 15, 6:09 PM · SRE, SRE-swift-storage

Tue, Nov 14

colewhite added a comment to T279112: meta.domain in Logstash seems to usually not like doing term matches.

This example in the task description is still broken today. E.g. meta.domain:wikipedia.org has either no results, or results that do not contain en.wikipedia.org, en.m.wikipedia.org, zh.m.wikipedia.org, and the other 200+ subdomains.

Instead, it sometimes matches spurious records from third-party proxy sites like fr-wikipedia.org.

Tue, Nov 14, 3:56 PM · SRE Observability (FY2023/2024-Q2), Instrument-ClientError, Observability-Logging, observability, Wikimedia-Logstash

Tue, Nov 7

colewhite added a comment to T350434: Logstash collector tuning.

Trying option 1 seems like a good start to try handling memory size issues. Note that we may want to adjust logstash tuning as well afterwards.

Tue, Nov 7, 12:40 AM · SRE Observability (FY2023/2024-Q2), Observability-Logging

Mon, Nov 6

colewhite claimed T350638: Track MediaWiki stats usage.
Mon, Nov 6, 11:55 PM · Patch-For-Review, Observability-Metrics, MediaWiki-libs-Stats
colewhite created T350638: Track MediaWiki stats usage.
Mon, Nov 6, 11:47 PM · Patch-For-Review, Observability-Metrics, MediaWiki-libs-Stats
colewhite updated the task description for T350597: Audit and prioritize metrics for conversion to statslib that are used for graphite-based alerting.
Mon, Nov 6, 11:32 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q2), Observability-Metrics

Fri, Nov 3

colewhite added a subtask for T213902: Implement sensitive logstash access control: T350516: Enable OpenSearch security plugin - Beta Logs.
Fri, Nov 3, 9:52 PM · Patch-Needs-Improvement, User-herron, SRE Observability (FY2023/2024-Q2), Observability-Logging
colewhite added a parent task for T350516: Enable OpenSearch security plugin - Beta Logs: T213902: Implement sensitive logstash access control.
Fri, Nov 3, 9:52 PM · Observability-Logging
colewhite created T350516: Enable OpenSearch security plugin - Beta Logs.
Fri, Nov 3, 9:52 PM · Observability-Logging
colewhite added a comment to T350366: Multiple images fail to build from sources.

The error in the loki build is due to its dependency on golang-1.13, which was dropped years ago. @colewhite do you think we can just remove the loki image instead?

Fri, Nov 3, 9:33 PM · serviceops

Oct 27 2023

colewhite closed T343021: Deploy prometheus-statsd-exporter to Test Env, a subtask of T343020: Converting MediaWiki Metrics to StatsLib, as Invalid.
Oct 27 2023, 11:05 PM · SRE Observability (FY2023/2024-Q2), Observability-Metrics
colewhite closed T343021: Deploy prometheus-statsd-exporter to Test Env as Invalid.

There are other tasks filed doing what this task intended.

Oct 27 2023, 11:05 PM · SRE Observability (FY2023/2024-Q2)

Oct 25 2023

colewhite added a comment to T302373: Upgrade prometheus-statsd-exporter.

Fresh off the presses: I think we can upgrade to statsd_exporter 0.25, which would allow us to drop our custom patch to relay statsd metrics.

Oct 25 2023, 3:49 PM · User-fgiunchedi, SRE Observability (FY2023/2024-Q2), Observability-Metrics

Oct 19 2023

colewhite closed T327218: Elastic/Opensearch shard size check: Round index size to the 2nd decimal as Resolved.
Oct 19 2023, 10:59 PM · Observability-Alerting

Oct 18 2023

colewhite closed T348795: Implement StatsLib improvements as Resolved.

Thanks, @aaron!

Oct 18 2023, 6:56 PM · MW-1.42-notes (1.42.0-wmf.2; 2023-10-24), SRE Observability (FY2023/2024-Q2), Observability-Metrics, MediaWiki-libs-Stats

Oct 17 2023

colewhite edited projects for T349140: Transaction profiler logs full query which is truncated by logstash, added: Observability-Logging; removed observability.
Oct 17 2023, 9:43 PM · Observability-Logging, Release-Engineering-Team (Radar), MediaWiki-libs-Rdbms, DBA, Wikimedia-Logstash
colewhite added a comment to T348508: Curator failed to delete indices in codfw.

Summarizing highlights from my IRC conversation with @fgiunchedi:

Oct 17 2023, 5:47 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q2), Observability-Logging
colewhite edited projects for T349067: Logstash ecs-* index has fields with conflicting types, added: Observability-Logging; removed SRE Observability, observability.

Thanks for the report!

Oct 17 2023, 5:35 PM · Observability-Logging, Wikimedia-Logstash
colewhite added a subtask for T342451: ECS labels field: OpenSearch attempts to detect field type by content: T349067: Logstash ecs-* index has fields with conflicting types.
Oct 17 2023, 5:34 PM · Observability-Logging
colewhite added a parent task for T349067: Logstash ecs-* index has fields with conflicting types: T342451: ECS labels field: OpenSearch attempts to detect field type by content.
Oct 17 2023, 5:34 PM · Observability-Logging, Wikimedia-Logstash

Oct 16 2023

colewhite added a comment to T344751: Decide on default histogram buckets for MediaWiki timers.

Since these 2 buckets were requested specifically I'd be inclined to either append them to the default set, or document the rationale for omitting them. [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 30, 60, (+Inf)] seems fair IMO

I have no problem keeping the 30s and 60s buckets. I was under the impression from the meeting that @Krinkle preferred to omit them?

Oct 16 2023, 11:55 PM · MediaWiki-libs-Stats, serviceops, Observability-Metrics
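
To illustrate what keeping the 30s and 60s buckets buys, here is a minimal sketch of cumulative (Prometheus-style) bucketing using the bounds quoted in the comment above. The function names are hypothetical and this is not StatsLib code; it only shows how an observation maps onto the proposed bounds.

```python
from bisect import bisect_left

# Upper bounds in seconds, as quoted in the comment above; +Inf is implicit.
BUCKETS = [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 30, 60]


def smallest_bucket(seconds):
    """Return the smallest upper bound that contains the observation."""
    i = bisect_left(BUCKETS, seconds)
    return BUCKETS[i] if i < len(BUCKETS) else float("inf")


def cumulative_counts(observations):
    """Each bucket counts every observation less than or equal to its bound."""
    bounds = BUCKETS + [float("inf")]
    return {le: sum(1 for v in observations if v <= le) for le in bounds}
```

With these bounds a 45s timing lands in the 60 bucket; if 30 and 60 were omitted, it would only be visible in +Inf.
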
colewhite added a comment to T344751: Decide on default histogram buckets for MediaWiki timers.

Per meeting with @Krinkle today:

Oct 16 2023, 6:13 PM · MediaWiki-libs-Stats, serviceops, Observability-Metrics

Oct 12 2023

colewhite added a comment to T348508: Curator failed to delete indices in codfw.

It doesn't look like there is an option to change the timeout parameter. We'll need to patch curator. 😕

Oct 12 2023, 11:08 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q2), Observability-Logging
colewhite created T348806: Rethink how metric label values are sanitized.
Oct 12 2023, 9:10 PM · Observability-Metrics, MediaWiki-libs-Stats
colewhite added a project to T344751: Decide on default histogram buckets for MediaWiki timers: MediaWiki-libs-Stats.
Oct 12 2023, 8:15 PM · MediaWiki-libs-Stats, serviceops, Observability-Metrics
colewhite changed the status of T245464: Use php-hrtime monotonic clock instead of microtime for perf measure in MW, a subtask of T255502: Goal: Save Timing median back under 1 second, from Open to In Progress.
Oct 12 2023, 8:12 PM · MediaWiki-Platform-Team
colewhite changed the status of T245464: Use php-hrtime monotonic clock instead of microtime for perf measure in MW from Open to In Progress.
Oct 12 2023, 8:12 PM · MW-1.42-notes (1.42.0-wmf.3; 2023-10-31), Patch-For-Review, MediaWiki-Platform-Team, MediaWiki-libs-Stats, Wikimedia-Performance-recommendation, User-jijiki, MediaWiki-libs-BagOStuff
colewhite changed the status of T348795: Implement StatsLib improvements from Open to In Progress.
Oct 12 2023, 7:59 PM · MW-1.42-notes (1.42.0-wmf.2; 2023-10-24), SRE Observability (FY2023/2024-Q2), Observability-Metrics, MediaWiki-libs-Stats
colewhite updated the task description for T348796: MediaWiki: Define new metric type - Histogram.
Oct 12 2023, 6:12 PM · SRE Observability (FY2023/2024-Q3), Observability-Metrics, MediaWiki-libs-Stats
colewhite created T348796: MediaWiki: Define new metric type - Histogram.
Oct 12 2023, 6:11 PM · SRE Observability (FY2023/2024-Q3), Observability-Metrics, MediaWiki-libs-Stats
colewhite created T348795: Implement StatsLib improvements.
Oct 12 2023, 6:00 PM · MW-1.42-notes (1.42.0-wmf.2; 2023-10-24), SRE Observability (FY2023/2024-Q2), Observability-Metrics, MediaWiki-libs-Stats
colewhite updated subscribers of T344751: Decide on default histogram buckets for MediaWiki timers.

I propose we implement a mixed approach.

Oct 12 2023, 5:49 PM · MediaWiki-libs-Stats, serviceops, Observability-Metrics
colewhite added a comment to T348508: Curator failed to delete indices in codfw.

OpenSearch replies with HTTP 200 {"acknowledged":false}, indicating the operation hasn't failed but has hit the 30s "explicit operation timeout". This is different from master_timeout (which curator is providing), which specifies the "timeout for connection to master".

Oct 12 2023, 12:11 AM · Patch-For-Review, SRE Observability (FY2023/2024-Q2), Observability-Logging
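
The distinction in the comment above (HTTP 200 with an unacknowledged body) can be sketched as a small classifier. This is only an illustration: `delete_outcome` and its return labels are hypothetical, and no curator or OpenSearch client call is shown.

```python
import json


def delete_outcome(status_code, body):
    """Classify a delete response as 'acknowledged', 'timed_out', or 'failed'."""
    if status_code != 200:
        return "failed"
    payload = json.loads(body)
    # HTTP 200 alone is not enough: the body must also report acknowledged=true.
    return "acknowledged" if payload.get("acknowledged") is True else "timed_out"
```

A caller that checks only the status code would treat the `{"acknowledged":false}` reply as success, which is why the body needs inspecting too.
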

Oct 11 2023

colewhite triaged T348508: Curator failed to delete indices in codfw as Medium priority.
Oct 11 2023, 5:19 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q2), Observability-Logging
colewhite changed the status of T348508: Curator failed to delete indices in codfw from Open to In Progress.

How long the operation took is suspiciously close to 30s, which is the default timeout parameter value. Maybe we should bump that in addition to (or instead of) master_timeout? (upstream docs)

Oct 11 2023, 5:18 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q2), Observability-Logging

Oct 10 2023

colewhite closed T348262: Remove opensearch shard size check for logstash cluster as Resolved.

Done!

Oct 10 2023, 7:49 PM · SRE Observability (FY2023/2024-Q2), Observability-Logging

Oct 4 2023

colewhite closed T335242: Decommission 'coal' and 'coal-web' services as Resolved.

Stopped and removed units for coal, uwsgi-coal, wmf_auto_restart_coal and wmf_auto_restart_uwsgi-coal.

Oct 4 2023, 11:30 PM · MediaWiki-Platform-Team (Radar), SRE Observability (FY2023/2024-Q2), observability, Projects-Cleanup
colewhite closed T347976: Curator cluster wide delete action fails on logstash hosts as Resolved.

Delete actions took a few milliseconds over 30s today. Optimistically resolving.

Oct 4 2023, 3:04 PM · Observability-Logging

Oct 3 2023

colewhite closed T345362: DatasourceError grafana alerting error message database is locked as Resolved.

Optimistically resolving now that WAL is enabled. Will watch the logs for new instances.

Oct 3 2023, 8:36 PM · Observability-Alerting
colewhite closed T345362: DatasourceError grafana alerting error message database is locked, a subtask of T317887: Upgrade to Grafana 9, as Resolved.
Oct 3 2023, 8:36 PM · SRE Observability (FY2023/2024-Q2), Observability-Metrics
colewhite changed the status of T347976: Curator cluster wide delete action fails on logstash hosts from Open to In Progress.

I concur. Let's bump the timeout.

Oct 3 2023, 3:24 PM · Observability-Logging

Oct 2 2023

colewhite added a comment to T344937: Decom dispatch infrastructure.

There's an outstanding Icinga alert that seems related: CRITICAL - degraded: The following units failed: dispatch-scheduler.service,docker-image-prune-old.service

Oct 2 2023, 8:02 PM · Incident Tooling, User-herron

Sep 28 2023

colewhite reopened T345362: DatasourceError grafana alerting error message database is locked, a subtask of T317887: Upgrade to Grafana 9, as Open.
Sep 28 2023, 3:24 PM · SRE Observability (FY2023/2024-Q2), Observability-Metrics
colewhite reopened T345362: DatasourceError grafana alerting error message database is locked as "Open".

Discovered some more evidence of this in logs this morning. There is another recommendation to enable WAL on the sqlite db (new in Grafana 9.4).

Sep 28 2023, 3:24 PM · Observability-Alerting
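
For readers unfamiliar with WAL, this is a minimal sketch of switching a SQLite database to write-ahead logging with Python's standard `sqlite3` module. It assumes a throwaway database file and does not show how Grafana itself enables the setting; `enable_wal` is a hypothetical helper.

```python
import os
import sqlite3
import tempfile


def enable_wal(path):
    """Switch the database at `path` to WAL and return the resulting journal mode."""
    conn = sqlite3.connect(path)
    try:
        # The pragma returns one row holding the journal mode now in effect.
        return conn.execute("PRAGMA journal_mode=WAL;").fetchone()[0]
    finally:
        conn.close()


# Demonstrate on a temporary database so no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    print(enable_wal(os.path.join(tmp, "example.db")))
```

In WAL mode readers do not block the writer and the writer does not block readers, which tends to reduce "database is locked" errors under concurrent access.
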

Sep 25 2023

colewhite closed T346893: Investigate swagger-exporter failures as Resolved.

Error logs have gone away and the prometheus view looks good.

Sep 25 2023, 11:35 PM · Observability-Alerting, serviceops
colewhite closed T346893: Investigate swagger-exporter failures, a subtask of T320620: Port openapi/swagger checks/alerts to Prometheus, as Resolved.
Sep 25 2023, 11:34 PM · Observability-Alerting, observability, serviceops
colewhite added a comment to T346796: Requesting access to analytics-privatedata-users for Aisha Khatun.

I am getting this error when I kinit
kinit: Client 'akhatun@WIKIMEDIA' not found in Kerberos database while getting initial credentials
Am I supposed to get a temporary password through email?

Sep 25 2023, 2:48 PM · SRE, SRE-Access-Requests

Sep 22 2023

colewhite reopened T335242: Decommission 'coal' and 'coal-web' services as "Open".
21:34:16 <Krinkle> cwhite: it seems 'coal' is still running on webperf1003. I guess we didn't absent it and/or intentionally removed it simply with intention to remove by hand but haven't yet?
Sep 22 2023, 11:07 PM · MediaWiki-Platform-Team (Radar), SRE Observability (FY2023/2024-Q2), observability, Projects-Cleanup
colewhite closed T342535: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz, a subtask of T345186: Deployment training request for mabualruz, as Resolved.
Sep 22 2023, 3:51 PM · Release-Engineering-Team (Deployment Training Requests)
colewhite closed T342535: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz as Resolved.

The group membership change has been deployed.

Sep 22 2023, 3:51 PM · SRE, SRE-Access-Requests
colewhite closed T346796: Requesting access to analytics-privatedata-users for Aisha Khatun as Resolved.

Restored the level of access held before last contract expired.

Sep 22 2023, 3:44 PM · SRE, SRE-Access-Requests

Sep 21 2023

colewhite closed T347110: Requesting access to deployment for dr0ptp4kt as Resolved.

The group membership change has been deployed.

Sep 21 2023, 11:40 PM · SRE, SRE-Access-Requests
colewhite updated the task description for T347110: Requesting access to deployment for dr0ptp4kt.
Sep 21 2023, 11:39 PM · SRE, SRE-Access-Requests
colewhite closed T346921: Migrate Bawolff from wmf ldap group to nda ldap group, a subtask of T345447: Re-evaluate WMF staff developer creation process, as Resolved.
Sep 21 2023, 11:35 PM · SecTeam-Processed, Code-Health, Security, production-risk-assessment
colewhite closed T346921: Migrate Bawolff from wmf ldap group to nda ldap group as Resolved.

Migrated to nda ldap group.

Sep 21 2023, 11:35 PM · SRE, LDAP-Access-Requests
colewhite updated subscribers of T342535: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz.
Sep 21 2023, 11:29 PM · SRE, SRE-Access-Requests
colewhite updated the task description for T342535: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz.
Sep 21 2023, 11:29 PM · SRE, SRE-Access-Requests
colewhite updated subscribers of T342535: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz.
Sep 21 2023, 11:26 PM · SRE, SRE-Access-Requests
colewhite updated the task description for T342535: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz.
Sep 21 2023, 11:22 PM · SRE, SRE-Access-Requests
colewhite updated subscribers of T347110: Requesting access to deployment for dr0ptp4kt.

ping: @thcipriani as approver for deployment group membership

Sep 21 2023, 11:21 PM · SRE, SRE-Access-Requests
colewhite moved T347110: Requesting access to deployment for dr0ptp4kt from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Sep 21 2023, 11:20 PM · SRE, SRE-Access-Requests
colewhite updated the task description for T347110: Requesting access to deployment for dr0ptp4kt.
Sep 21 2023, 11:19 PM · SRE, SRE-Access-Requests
colewhite moved T346796: Requesting access to analytics-privatedata-users for Aisha Khatun from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Sep 21 2023, 11:18 PM · SRE, SRE-Access-Requests
colewhite added a comment to T346796: Requesting access to analytics-privatedata-users for Aisha Khatun.

@MGerlach is there an expiry date for this contract renewal?

Sep 21 2023, 11:12 PM · SRE, SRE-Access-Requests
colewhite updated the task description for T346796: Requesting access to analytics-privatedata-users for Aisha Khatun.
Sep 21 2023, 11:10 PM · SRE, SRE-Access-Requests
colewhite closed T346694: Requesting access to analytics and search resources for dr0ptp4kt as Resolved.

The group membership change has been deployed.

Sep 21 2023, 10:55 PM · SRE, SRE-Access-Requests
colewhite updated the task description for T346694: Requesting access to analytics and search resources for dr0ptp4kt.
Sep 21 2023, 10:52 PM · SRE, SRE-Access-Requests
colewhite updated the task description for T346694: Requesting access to analytics and search resources for dr0ptp4kt.
Sep 21 2023, 10:51 PM · SRE, SRE-Access-Requests
colewhite added a comment to T344428: refreshUserImpactJob logs mysterious fatal errors.

mw2381:

$ ulimit -Hn
1048576
$ ulimit -Sn
1024
Sep 21 2023, 9:16 PM · Growth-Team (Sprint 1 (Growth Team)), serviceops, SRE, Performance Issue, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule
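
The same soft and hard open-file limits reported by `ulimit -Sn` and `ulimit -Hn` above can be read from inside a process. A minimal sketch using Python's standard `resource` module (Unix only); `open_file_limits` is a hypothetical helper name.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
print(f"soft={soft} hard={hard}")
```

The soft limit is the one actually enforced; a process may raise it up to the hard limit without extra privileges.
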
colewhite closed T339137: Ingest php syslog from Excimer UI (webperf host) into Logstash as Resolved.

I see logs in logstash! \o/

Sep 21 2023, 7:18 PM · Observability-Logging, Patch-For-Review, WikimediaDebug, Performance-Team

Sep 20 2023

colewhite added a comment to T288624: PHP Warning: curl_multi_remove_handle(): supplied resource is not a valid cURL Multi Handle resource.

Extra problem:
I cannot find a way to get visibility into these log messages in https://logstash.wikimedia.org/. Assistance would be appreciated. @colewhite can you provide advice?

Sep 20 2023, 8:41 PM · MW-1.40-notes, MW-1.39-notes, MW-1.41-notes (1.41.0-wmf.27; 2023-09-19), Patch-For-Review, Data Engineering and Event Platform Team, Data-Engineering, Event-Platform, Wikimedia-production-error, MediaWiki-libs-HTTP, Beta-Cluster-reproducible
colewhite closed T345362: DatasourceError grafana alerting error message database is locked, a subtask of T317887: Upgrade to Grafana 9, as Resolved.
Sep 20 2023, 6:06 PM · SRE Observability (FY2023/2024-Q2), Observability-Metrics
colewhite closed T345362: DatasourceError grafana alerting error message database is locked as Resolved.

It's been more than a week and I can see no more instances of this in the logs.

Sep 20 2023, 6:06 PM · Observability-Alerting
colewhite claimed T346893: Investigate swagger-exporter failures.
Sep 20 2023, 5:38 PM · Observability-Alerting, serviceops

Sep 19 2023

colewhite added a comment to T341792: Provision Zookeeper Cluster for storing Flink HA data.

Change 958991 merged by Btullis:

[operations/puppet@production] Add the analytics and search-platform teams to flink zk contacts

https://gerrit.wikimedia.org/r/958991

Sep 19 2023, 11:25 PM · Discovery-Search (Current work), Data-Platform-SRE
colewhite closed T290156: Config.dev.yaml and config.prod.yaml need updating to remove references to logstash as Resolved.

Considering how much time has passed, it's probably safe to say this is complete. If not, please reach out :)

Sep 19 2023, 9:58 PM · service-template-node
colewhite placed T288619: Improve the process to consume and use API.LOG to filter out bad performing queries either by extra tooling within o11y or analytics up for grabs.
Sep 19 2023, 9:56 PM · Observability-Logging

Sep 18 2023

colewhite added a comment to T344751: Decide on default histogram buckets for MediaWiki timers.

Interestingly, if StatsLib creates executeTiming_seconds_bucket as a counter and executeTiming_seconds as a timer and sends them to the exporter, this renders statsd-exporter inoperable until a restart is commanded. It appears we have to be careful not to step on the metric names as statsd-exporter would generate them for summaries and histograms.

Sep 18 2023, 8:26 PM · MediaWiki-libs-Stats, serviceops, Observability-Metrics
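
One way to guard against the collision described above is to check metric names against the series suffixes that Prometheus histograms and summaries generate (`_bucket`, `_sum`, `_count`). This is a sketch under that assumption, not statsd-exporter or StatsLib code; `find_collisions` and the `name -> type` mapping are hypothetical.

```python
GENERATED_SUFFIXES = ("_bucket", "_sum", "_count")


def find_collisions(metrics):
    """Given {name: type}, return names that clash with series generated for a timer."""
    timers = {name for name, kind in metrics.items() if kind == "timer"}
    clashes = []
    for name in metrics:
        for suffix in GENERATED_SUFFIXES:
            # A clash is a metric whose name equals <timer name> + <generated suffix>.
            if name.endswith(suffix) and name[: -len(suffix)] in timers:
                clashes.append(name)
    return sorted(clashes)
```

Applied to the pair from the comment, `executeTiming_seconds_bucket` is flagged because `executeTiming_seconds` is a timer.
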
colewhite added a comment to T344751: Decide on default histogram buckets for MediaWiki timers.

The handful of timing metrics we have (and actively make use of) vary a lot in their range. I'm not sure a single set can be of much use. The size and range of timing measures vary a lot throughout the platform, from job measures in the range between whole seconds and hours, to WANCache callbacks that are measures between 0.1 and 100 milliseconds.

Sep 18 2023, 8:25 PM · MediaWiki-libs-Stats, serviceops, Observability-Metrics

Sep 14 2023

colewhite renamed T346402: Many kafka errors in beta/deployment-prep from Beta logstash filled with kafka errors to Many kafka errors in beta/deployment-prep.
Sep 14 2023, 10:17 PM · Data-Engineering, Beta-Cluster-Infrastructure
colewhite edited projects for T346402: Many kafka errors in beta/deployment-prep, added: Data-Engineering; removed Observability-Logging.

It appears there is some problem with the kafka-jumbo nodes in deployment prep.

Sep 14 2023, 10:16 PM · Data-Engineering, Beta-Cluster-Infrastructure

Sep 12 2023

colewhite moved T345358: DatasourceError alerts emitted by jinxer are unhelpful from Inbox to Done on the SRE Observability (FY2023/2024-Q1) board.
Sep 12 2023, 9:14 PM · SRE Observability (FY2023/2024-Q1), Observability-Alerting
colewhite added a project to T345358: DatasourceError alerts emitted by jinxer are unhelpful: SRE Observability (FY2023/2024-Q1).
Sep 12 2023, 9:14 PM · SRE Observability (FY2023/2024-Q1), Observability-Alerting
colewhite closed T345358: DatasourceError alerts emitted by jinxer are unhelpful, a subtask of T317887: Upgrade to Grafana 9, as Resolved.
Sep 12 2023, 9:13 PM · SRE Observability (FY2023/2024-Q2), Observability-Metrics
colewhite closed T345358: DatasourceError alerts emitted by jinxer are unhelpful as Resolved.
Sep 12 2023, 9:13 PM · SRE Observability (FY2023/2024-Q1), Observability-Alerting
colewhite moved T345362: DatasourceError grafana alerting error message database is locked from Inbox to Radar on the Observability-Alerting board.
Sep 12 2023, 8:10 PM · Observability-Alerting
colewhite added a comment to T345362: DatasourceError grafana alerting error message database is locked.

@colewhite I'm getting DatasourceError rweb email alerts. Is that covered by this task or T344961 ?

Sep 12 2023, 8:00 PM · Observability-Alerting

Sep 11 2023

colewhite claimed T345362: DatasourceError grafana alerting error message database is locked.
Sep 11 2023, 9:35 PM · Observability-Alerting
colewhite added a comment to T345362: DatasourceError grafana alerting error message database is locked.

Grafana is updated and silence is removed.

Sep 11 2023, 9:35 PM · Observability-Alerting
colewhite added a comment to T345900: Add a dependency on the opensearch-py client.

Linking my comment here for visibility: T345337#9150551

Sep 11 2023, 2:11 PM · Infrastructure-Foundations, SRE-tools, Spicerack

Sep 7 2023

colewhite edited projects for T345884: mw2444 down, added: serviceops-radar; removed serviceops.
Sep 7 2023, 9:02 PM · serviceops, SRE, ops-codfw
colewhite updated the task description for T345884: mw2444 down.
Sep 7 2023, 8:56 PM · serviceops, SRE, ops-codfw
colewhite created T345884: mw2444 down.
Sep 7 2023, 8:52 PM · serviceops, SRE, ops-codfw
colewhite added a comment to T345362: DatasourceError grafana alerting error message database is locked.

9.4.14 is live on grafana-next. Will do some testing there before rolling to production early next week. Reinstalled the silence until we can complete the upgrade.

Sep 7 2023, 7:19 PM · Observability-Alerting
colewhite added a comment to T345337: spicerack: tox fails to install PyYAML using python 3.11 on bookworm.

@colewhite from your comment on the elastic search restart cookbook, can I assume that moving to the opensearch python packages would also allow us to drop this dependency from spicerack? I.e. we could use that package for the current search cookbooks and these new ones?

Sep 7 2023, 4:51 PM · Patch-For-Review, Infrastructure-Foundations, SRE-tools, Spicerack
colewhite added a comment to T345337: spicerack: tox fails to install PyYAML using python 3.11 on bookworm.

Ran into this today trying to pip install wikimedia-spicerack (Python 3.11).

Sep 7 2023, 4:00 PM · Patch-For-Review, Infrastructure-Foundations, SRE-tools, Spicerack
colewhite added a comment to T344798: Write a cookbook for rolling reboot/restart of datahubsearch servers.

Related: T255864: Use/adopt search cluster ES management cookbooks for logging ES too

Sep 7 2023, 12:01 AM · Data-Platform-SRE