
colewhite (cwhite)
User

Projects (17)

User Details

User Since
Aug 21 2018, 6:05 PM (399 w, 14 h)
Availability
Available
LDAP User
Cwhite
MediaWiki User
CWhite (WMF)

Recent Activity

Yesterday

colewhite closed T267664: Enhance smart_data_dump to support gathering metrics from both raid and standalone disks, a subtask of T267135: smart-data-dump should fail loudly when it can't gather metrics, as Resolved.
Tue, Apr 14, 11:29 PM · Observability-Alerting, SRE
colewhite closed T267664: Enhance smart_data_dump to support gathering metrics from both raid and standalone disks as Resolved.

Change is deployed and seems to pass the smoke test. Optimistically closing.

Tue, Apr 14, 11:29 PM · SRE Observability

Wed, Apr 8

colewhite added a comment to T422048: SLOBudgetBurn.

I overwrote that sector in the hope that the disk will reallocate it and that OpenSearch will correct any data discrepancies when it gets around to it.

Wed, Apr 8, 5:08 PM · observability
colewhite added a comment to T422048: SLOBudgetBurn.

I'm thinking disk read errors:

2026-04-07T04:56:16.083470+00:00 logging-hd2001 kernel: [1672676.995139] mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
2026-04-07T04:56:16.083496+00:00 logging-hd2001 kernel: [1672676.995151] mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
2026-04-07T04:56:16.083498+00:00 logging-hd2001 kernel: [1672676.995155] mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
2026-04-07T04:56:16.083499+00:00 logging-hd2001 kernel: [1672676.995164] mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
2026-04-07T04:56:16.083500+00:00 logging-hd2001 kernel: [1672676.995169] mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
2026-04-07T04:56:16.083502+00:00 logging-hd2001 kernel: [1672676.995175] mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
2026-04-07T04:56:16.083503+00:00 logging-hd2001 kernel: [1672676.995195] sd 0:0:1:0: [sdb] tag#6784 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=2s
2026-04-07T04:56:16.083504+00:00 logging-hd2001 kernel: [1672676.995204] sd 0:0:1:0: [sdb] tag#6784 Sense Key : Medium Error [current] [descriptor] 
2026-04-07T04:56:16.083506+00:00 logging-hd2001 kernel: [1672676.995209] sd 0:0:1:0: [sdb] tag#6784 Add. Sense: Read retries exhausted
2026-04-07T04:56:16.083507+00:00 logging-hd2001 kernel: [1672676.995214] sd 0:0:1:0: [sdb] tag#6784 CDB: Read(16) 88 00 00 00 00 01 1e e7 30 00 00 00 04 00 00 00
2026-04-07T04:56:16.083508+00:00 logging-hd2001 kernel: [1672676.995218] critical medium error, dev sdb, sector 4813435780 op 0x0:(READ) flags 0x80700 phys_seg 5 prio class 2
2026-04-07T04:56:19.210392+00:00 logging-hd2001 kernel: [1672680.153496] mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
2026-04-07T04:56:19.210409+00:00 logging-hd2001 kernel: [1672680.153501] sd 0:0:1:0: [sdb] tag#6815 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=2s
2026-04-07T04:56:19.210410+00:00 logging-hd2001 kernel: [1672680.153504] mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
2026-04-07T04:56:19.210411+00:00 logging-hd2001 kernel: [1672680.153513] sd 0:0:1:0: [sdb] tag#6815 Sense Key : Medium Error [current] [descriptor] 
2026-04-07T04:56:19.210413+00:00 logging-hd2001 kernel: [1672680.153518] mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
2026-04-07T04:56:19.210414+00:00 logging-hd2001 kernel: [1672680.153521] sd 0:0:1:0: [sdb] tag#6815 Add. Sense: Read retries exhausted
2026-04-07T04:56:19.210437+00:00 logging-hd2001 kernel: [1672680.153525] mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
2026-04-07T04:56:19.210438+00:00 logging-hd2001 kernel: [1672680.153528] sd 0:0:1:0: [sdb] tag#6815 CDB: Read(16) 88 00 00 00 00 01 1e e7 33 80 00 00 00 08 00 00
2026-04-07T04:56:19.210440+00:00 logging-hd2001 kernel: [1672680.153533] critical medium error, dev sdb, sector 4813435780 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
2026-04-07T04:56:22.318883+00:00 logging-hd2001 kernel: [1672683.261973] mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
2026-04-07T04:56:22.318906+00:00 logging-hd2001 kernel: [1672683.261977] sd 0:0:1:0: [sdb] tag#6817 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=3s
2026-04-07T04:56:22.318909+00:00 logging-hd2001 kernel: [1672683.261981] mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
2026-04-07T04:56:22.318911+00:00 logging-hd2001 kernel: [1672683.261991] sd 0:0:1:0: [sdb] tag#6817 Sense Key : Medium Error [current] [descriptor] 
2026-04-07T04:56:22.318912+00:00 logging-hd2001 kernel: [1672683.261997] sd 0:0:1:0: [sdb] tag#6817 Add. Sense: Read retries exhausted
2026-04-07T04:56:22.318914+00:00 logging-hd2001 kernel: [1672683.262003] sd 0:0:1:0: [sdb] tag#6817 CDB: Read(16) 88 00 00 00 00 01 1e e7 33 80 00 00 00 08 00 00
2026-04-07T04:56:22.318915+00:00 logging-hd2001 kernel: [1672683.262006] critical medium error, dev sdb, sector 4813435780 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
2026-04-07T04:56:25.443753+00:00 logging-hd2001 kernel: [1672686.386824] mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
2026-04-07T04:56:25.443778+00:00 logging-hd2001 kernel: [1672686.386848] sd 0:0:1:0: [sdb] tag#6826 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=3s
2026-04-07T04:56:25.443781+00:00 logging-hd2001 kernel: [1672686.386861] sd 0:0:1:0: [sdb] tag#6826 Sense Key : Medium Error [current] [descriptor] 
2026-04-07T04:56:25.443782+00:00 logging-hd2001 kernel: [1672686.386867] sd 0:0:1:0: [sdb] tag#6826 Add. Sense: Read retries exhausted
2026-04-07T04:56:25.443784+00:00 logging-hd2001 kernel: [1672686.386873] sd 0:0:1:0: [sdb] tag#6826 CDB: Read(16) 88 00 00 00 00 01 1e e7 33 80 00 00 00 08 00 00
2026-04-07T04:56:25.443785+00:00 logging-hd2001 kernel: [1672686.386877] critical medium error, dev sdb, sector 4813435780 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
2026-04-07T04:56:28.518821+00:00 logging-hd2001 kernel: [1672689.461882] mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
2026-04-07T04:56:28.518837+00:00 logging-hd2001 kernel: [1672689.461908] sd 0:0:1:0: [sdb] tag#6846 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=3s
2026-04-07T04:56:28.518839+00:00 logging-hd2001 kernel: [1672689.461920] sd 0:0:1:0: [sdb] tag#6846 Sense Key : Medium Error [current] [descriptor] 
2026-04-07T04:56:28.518841+00:00 logging-hd2001 kernel: [1672689.461926] sd 0:0:1:0: [sdb] tag#6846 Add. Sense: Read retries exhausted
2026-04-07T04:56:28.518842+00:00 logging-hd2001 kernel: [1672689.461931] sd 0:0:1:0: [sdb] tag#6846 CDB: Read(16) 88 00 00 00 00 01 1e e7 33 80 00 00 00 08 00 00
2026-04-07T04:56:28.518844+00:00 logging-hd2001 kernel: [1672689.461935] critical medium error, dev sdb, sector 4813435780 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
2026-04-07T04:56:31.610758+00:00 logging-hd2001 kernel: [1672692.553802] mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
2026-04-07T04:56:31.610769+00:00 logging-hd2001 kernel: [1672692.553806] sd 0:0:1:0: [sdb] tag#6788 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=3s
2026-04-07T04:56:31.610772+00:00 logging-hd2001 kernel: [1672692.553816] sd 0:0:1:0: [sdb] tag#6788 Sense Key : Medium Error [current] [descriptor] 
2026-04-07T04:56:31.610773+00:00 logging-hd2001 kernel: [1672692.553823] sd 0:0:1:0: [sdb] tag#6788 Add. Sense: Read retries exhausted
2026-04-07T04:56:31.610774+00:00 logging-hd2001 kernel: [1672692.553829] sd 0:0:1:0: [sdb] tag#6788 CDB: Read(16) 88 00 00 00 00 01 1e e7 33 80 00 00 00 08 00 00
2026-04-07T04:56:31.610776+00:00 logging-hd2001 kernel: [1672692.553832] critical medium error, dev sdb, sector 4813435780 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
2026-04-07T04:56:34.819055+00:00 logging-hd2001 kernel: [1672695.762075] mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
2026-04-07T04:56:34.819080+00:00 logging-hd2001 kernel: [1672695.762081] sd 0:0:1:0: [sdb] tag#6789 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=3s
2026-04-07T04:56:34.819084+00:00 logging-hd2001 kernel: [1672695.762095] sd 0:0:1:0: [sdb] tag#6789 Sense Key : Medium Error [current] [descriptor] 
2026-04-07T04:56:34.819086+00:00 logging-hd2001 kernel: [1672695.762100] sd 0:0:1:0: [sdb] tag#6789 Add. Sense: Read retries exhausted
2026-04-07T04:56:34.819087+00:00 logging-hd2001 kernel: [1672695.762106] sd 0:0:1:0: [sdb] tag#6789 CDB: Read(16) 88 00 00 00 00 01 1e e7 33 80 00 00 00 08 00 00
2026-04-07T04:56:34.819088+00:00 logging-hd2001 kernel: [1672695.762110] critical medium error, dev sdb, sector 4813435780 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
2026-04-07T04:56:37.894085+00:00 logging-hd2001 kernel: [1672698.837089] mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
2026-04-07T04:56:37.894109+00:00 logging-hd2001 kernel: [1672698.837113] sd 0:0:1:0: [sdb] tag#6818 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=3s
2026-04-07T04:56:37.894112+00:00 logging-hd2001 kernel: [1672698.837126] sd 0:0:1:0: [sdb] tag#6818 Sense Key : Medium Error [current] [descriptor] 
2026-04-07T04:56:37.894113+00:00 logging-hd2001 kernel: [1672698.837132] sd 0:0:1:0: [sdb] tag#6818 Add. Sense: Read retries exhausted
2026-04-07T04:56:37.894115+00:00 logging-hd2001 kernel: [1672698.837138] sd 0:0:1:0: [sdb] tag#6818 CDB: Read(16) 88 00 00 00 00 01 1e e7 33 80 00 00 00 08 00 00
2026-04-07T04:56:37.894116+00:00 logging-hd2001 kernel: [1672698.837141] critical medium error, dev sdb, sector 4813435780 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
2026-04-07T04:56:41.027614+00:00 logging-hd2001 kernel: [1672701.970600] mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
2026-04-07T04:56:41.027638+00:00 logging-hd2001 kernel: [1672701.970624] sd 0:0:1:0: [sdb] tag#6829 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=3s
2026-04-07T04:56:41.027640+00:00 logging-hd2001 kernel: [1672701.970636] sd 0:0:1:0: [sdb] tag#6829 Sense Key : Medium Error [current] [descriptor] 
2026-04-07T04:56:41.027642+00:00 logging-hd2001 kernel: [1672701.970642] sd 0:0:1:0: [sdb] tag#6829 Add. Sense: Read retries exhausted
2026-04-07T04:56:41.027643+00:00 logging-hd2001 kernel: [1672701.970647] sd 0:0:1:0: [sdb] tag#6829 CDB: Read(16) 88 00 00 00 00 01 1e e7 33 80 00 00 00 08 00 00
2026-04-07T04:56:41.027645+00:00 logging-hd2001 kernel: [1672701.970651] critical medium error, dev sdb, sector 4813435780 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
2026-04-07T04:56:42.866662+00:00 logging-hd2001 opensearch[378008]: fatal error in thread [opensearch[logging-hd2001-production-elk7-codfw][generic][T#22]], exiting
2026-04-07T04:56:42.867561+00:00 logging-hd2001 opensearch[378008]: java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
2026-04-07T04:56:42.867655+00:00 logging-hd2001 opensearch[378008]: #011at org.apache.lucene.store.BufferedChecksumIndexInput.readBytes(BufferedChecksumIndexInput.java:46)
2026-04-07T04:56:42.902773+00:00 logging-hd2001 opensearch[378008]: #011at org.apache.lucene.store.DataInput.readBytes(DataInput.java:72)
2026-04-07T04:56:42.902903+00:00 logging-hd2001 opensearch[378008]: #011at org.apache.lucene.store.ChecksumIndexInput.skipByReading(ChecksumIndexInput.java:79)
2026-04-07T04:56:42.937754+00:00 logging-hd2001 opensearch[378008]: #011at org.apache.lucene.store.ChecksumIndexInput.seek(ChecksumIndexInput.java:64)
2026-04-07T04:56:42.937876+00:00 logging-hd2001 opensearch[378008]: #011at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:618)
2026-04-07T04:56:42.937946+00:00 logging-hd2001 opensearch[378008]: #011at org.apache.lucene.codecs.lucene90.Lucene90PostingsReader.checkIntegrity(Lucene90PostingsReader.java:2049)
2026-04-07T04:56:42.938023+00:00 logging-hd2001 opensearch[378008]: #011at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsReader.checkIntegrity(Lucene90BlockTreeTermsReader.java:330)
2026-04-07T04:56:42.938114+00:00 logging-hd2001 opensearch[378008]: #011at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.checkIntegrity(PerFieldPostingsFormat.java:370)
2026-04-07T04:56:42.938261+00:00 logging-hd2001 opensearch[378008]: #011at org.apache.lucene.codecs.perfield.PerFieldMergeState$FilterFieldsProducer.checkIntegrity(PerFieldMergeState.java:296)
2026-04-07T04:56:42.938384+00:00 logging-hd2001 opensearch[378008]: #011at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:83)
2026-04-07T04:56:42.938459+00:00 logging-hd2001 opensearch[378008]: #011at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:205)
2026-04-07T04:56:42.938531+00:00 logging-hd2001 opensearch[378008]: #011at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:209)
2026-04-07T04:56:42.938623+00:00 logging-hd2001 opensearch[378008]: #011at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:298)
2026-04-07T04:56:42.938727+00:00 logging-hd2001 opensearch[378008]: #011at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:137)
2026-04-07T04:56:42.938841+00:00 logging-hd2001 opensearch[378008]: #011at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5140)
2026-04-07T04:56:42.946807+00:00 logging-hd2001 opensearch[378008]: #011at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4680)
2026-04-07T04:56:42.946917+00:00 logging-hd2001 opensearch[378008]: #011at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6432)
2026-04-07T04:56:42.947003+00:00 logging-hd2001 opensearch[378008]: #011at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:639)
2026-04-07T04:56:42.947078+00:00 logging-hd2001 opensearch[378008]: #011at org.opensearch.index.engine.OpenSearchConcurrentMergeScheduler.doMerge(OpenSearchConcurrentMergeScheduler.java:120)
2026-04-07T04:56:43.004903+00:00 logging-hd2001 opensearch[378008]: #011at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:700)
2026-04-07T04:56:44.157299+00:00 logging-hd2001 systemd[1]: opensearch_2@production-elk7-codfw.service: Main process exited, code=exited, status=128/n/a
2026-04-07T04:56:44.174262+00:00 logging-hd2001 systemd[1]: opensearch_2@production-elk7-codfw.service: Failed with result 'exit-code'.
2026-04-07T04:56:44.186310+00:00 logging-hd2001 systemd[1]: opensearch_2@production-elk7-codfw.service: Consumed 6h 53min 12.273s CPU time.
Wed, Apr 8, 2:23 PM · observability
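
The kernel lines above carry enough information to cross-check the reported sector against the SCSI commands that failed. A minimal sketch, assuming the standard READ(16) layout (opcode 0x88, 8-byte big-endian LBA in bytes 2-9, 4-byte transfer length in bytes 10-13) and assuming the kernel's sector number and the device LBA use the same 512-byte unit; the helper name `decode_read16` is made up for illustration:

```python
def decode_read16(cdb_hex: str) -> tuple[int, int]:
    """Return (start_lba, block_count) for a READ(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    count = int.from_bytes(cdb[10:14], "big")   # bytes 10-13: transfer length
    return lba, count


FAILED_SECTOR = 4813435780  # from the "critical medium error" lines

for cdb in (
    "88 00 00 00 00 01 1e e7 30 00 00 00 04 00 00 00",  # first failed read
    "88 00 00 00 00 01 1e e7 33 80 00 00 00 08 00 00",  # the retried read
):
    lba, count = decode_read16(cdb)
    print(lba, count, lba <= FAILED_SECTOR < lba + count)
```

Under those assumptions both reads cover the reported sector: the first is a 1024-block read starting at 4813434880, and the retries are 8-block reads starting at 4813435776.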

Fri, Apr 3

colewhite added a comment to T390215: Logstash is overwhelmed.

The istio-system namespace is logging ~980 events/sec. Many are just istio-ingressgateway for authority:page-analytics.discovery.wmnet (~833 events/sec).

Fri, Apr 3, 3:23 AM · SRE Observability (FY2025/2026-Q1), Patch-For-Review, Observability-Logging

Wed, Mar 25

colewhite added a comment to T418929: Q4:rack/setup/install kafka-logging100[6-8].

I'd like to propose a rename to logging-kafka so that these hosts follow the other logging-* hosts, indicating their role in the larger cluster.

Wed, Mar 25, 3:41 PM · Patch-For-Review, observability, SRE, ops-eqiad, DC-Ops
colewhite added a comment to T418931: Q3:rack/setup/install kafka-logging200[6-8].

I'd like to propose a rename to logging-kafka so that these hosts follow the other logging-* hosts, indicating their role in the larger cluster.

Wed, Mar 25, 3:41 PM · observability, SRE, ops-codfw, DC-Ops

Mon, Mar 16

colewhite updated the task description for T420158: Eqiad: lsw1-c2-eqiad BGP maintenance/ Tuesday 17th at 9:30 CDT.
Mon, Mar 16, 10:45 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), ServiceOps new, netops, Infrastructure-Foundations, SRE
colewhite updated subscribers of T413127: Directory Listing and Download from Object Storage.

2M files can be too much for a single Swift container, and for the backends in general. I have some suggestions:

  • Maybe drop all hourly graphs and logs after three months?
  • We could also drop all non-"all" graphs and logs after a year?

I think dropping those would save a lot of space without dropping too much useful stuff.

Mon, Mar 16, 8:51 PM · Patch-For-Review, MediaWiki-Platform-Team (Radar), Data-Persistence, Arc-Lamp
colewhite added a comment to T420034: deployment-kafka-logging01 is down for maintenance because Trixie is not yet well supported.

@colewhite I see logstash consumer groups connecting, when you have a moment could you verify if everything works for beta logs?

Mon, Mar 16, 4:56 PM · Beta-Cluster-Infrastructure

Mar 13 2026

colewhite edited projects for T417001: Upgrade Observability Kafka hosts to trixie, added: Observability-Logging; removed SRE Observability.
Mar 13 2026, 5:17 PM · Observability-Logging
colewhite added a comment to T416384: Reduce logstash logs from machine learning infra.

I think we are out of the woods: we now have ~20k logs/minute, mostly coming from the inference-service pods (kserve-container). I think we could definitely improve things there:

  1. The logs are not emitted as JSON or ECS, so for errors such as Python stack traces we get one log event per line. That is wasteful on the Logstash side and not great for human readers who need to investigate an outage the day after. If the logs are no longer on the pods because of rotation, getting a complete stack trace from Logstash is really tedious.
  2. We have a mixture of kserve traces, unicorn access logs, latency timings, etc. Do we need all of them?
Mar 13 2026, 3:42 PM · Patch-For-Review, Machine-Learning-Team (Q4 FY2025-26)
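
The first point in the list above (one log event per traceback line) can be illustrated with a small sketch. It uses only the Python standard library; the field names are loosely ECS-like and are an assumption, as are the logger name and the `build_logger` helper. It is not the inference services' actual configuration.

```python
import io
import json
import logging
import traceback
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Render each record, including any traceback, as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        doc = {
            "@timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "log.level": record.levelname.lower(),
            "log.logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info and record.exc_info[0] is not None:
            doc["error.type"] = record.exc_info[0].__name__
            # The whole multi-line traceback becomes a single escaped string field.
            doc["error.stack_trace"] = "".join(traceback.format_exception(*record.exc_info))
        return json.dumps(doc)


def build_logger(stream) -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger = logging.getLogger("inference-service-example")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buf = io.StringIO()
log = build_logger(buf)
try:
    1 / 0
except ZeroDivisionError:
    log.exception("prediction failed")
print(len(buf.getvalue().splitlines()))  # the traceback stays inside one line
```

Because `json.dumps` escapes the newlines in the traceback, a line-oriented shipper sees a single event rather than one event per stack frame.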

Mar 12 2026

colewhite closed T414098: Move https://status.wikimedia.org/ away from rackspace, a subtask of T376400: Redesign wikitech-static, as Resolved.
Mar 12 2026, 5:37 PM · Patch-For-Review, serviceops-radar, SRE-Unowned, SRE, wikitech.wikimedia.org
colewhite closed T414098: Move https://status.wikimedia.org/ away from rackspace as Resolved.

Last thing the old rackspace host handles is [301] http[s]://wikimediastatus.net -> https://www.wikimediastatus.net

Mar 12 2026, 5:37 PM · Patch-For-Review, collaboration-services, SRE Observability, cloud-services-team
colewhite closed T414098: Move https://status.wikimedia.org/ away from rackspace, a subtask of T408704: offline rackspace wikitech-static, online aws wikitech-static, as Resolved.
Mar 12 2026, 5:37 PM · Infrastructure-Foundations
colewhite added a parent task for T419887: Move wikimediastatus.net 301 to ncredir: T408704: offline rackspace wikitech-static, online aws wikitech-static.
Mar 12 2026, 5:36 PM · Patch-For-Review, Traffic, SRE Observability
colewhite added a subtask for T408704: offline rackspace wikitech-static, online aws wikitech-static: T419887: Move wikimediastatus.net 301 to ncredir.
Mar 12 2026, 5:36 PM · Infrastructure-Foundations
colewhite created T419887: Move wikimediastatus.net 301 to ncredir.
Mar 12 2026, 5:36 PM · Patch-For-Review, Traffic, SRE Observability
colewhite reopened T414098: Move https://status.wikimedia.org/ away from rackspace, a subtask of T376400: Redesign wikitech-static, as Open.
Mar 12 2026, 5:32 PM · Patch-For-Review, serviceops-radar, SRE-Unowned, SRE, wikitech.wikimedia.org
colewhite reopened T414098: Move https://status.wikimedia.org/ away from rackspace as Open.

Last thing the old rackspace host handles is [301] http[s]://wikimediastatus.net -> https://www.wikimediastatus.net

Mar 12 2026, 5:32 PM · Patch-For-Review, collaboration-services, SRE Observability, cloud-services-team
colewhite reopened T414098: Move https://status.wikimedia.org/ away from rackspace, a subtask of T408704: offline rackspace wikitech-static, online aws wikitech-static, as Open.
Mar 12 2026, 5:32 PM · Infrastructure-Foundations

Mar 11 2026

colewhite added a comment to T413127: Directory Listing and Download from Object Storage.

Can you give me an idea of number & size of objects, and what sort of bandwidth you expect this to need, please?

The files I'd like to serve can be found in swift at https://ms-fe.svc.eqiad.wmnet/v1/AUTH_performance/arclamp-(logs|svgs)-(hourly|daily)
Some napkin math with the currently configured 3-year retention period yields 2,023,560 files and ~634GB of data. Files range from a few hundred bytes to a little over a hundred megabytes.

Mar 11 2026, 8:36 PM · Patch-For-Review, MediaWiki-Platform-Team (Radar), Data-Persistence, Arc-Lamp
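
For a sense of scale, the two totals quoted above imply an average object size. A rough sketch, assuming "~634GB" means decimal gigabytes (the comment does not say which unit is meant):

```python
FILES = 2_023_560          # object count quoted in the comment
TOTAL_BYTES = 634 * 10**9  # "~634GB", read as decimal gigabytes (assumption)


def average_object_size(total_bytes: int, objects: int) -> float:
    """Mean object size in bytes."""
    return total_bytes / objects


avg = average_object_size(TOTAL_BYTES, FILES)
print(f"average object size: about {avg / 1000:.0f} kB")
```

The mean lands in the low hundreds of kilobytes. Since the stated range runs from a few hundred bytes to over a hundred megabytes, the mean alone says little about how sizes are distributed.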

Mar 6 2026

colewhite added a project to T418612: Audit mwlog storage and retention: Observability-Logging.

Poolcounter logs are greatly reduced post-deploy.

$ zcat poolcounter.log-20260304*.gz | wc -l
844298029
$ zcat poolcounter.log-20260305*.gz | wc -l
73026
Mar 6 2026, 12:57 AM · Observability-Logging, SRE Observability (FY2025/2026-Q3)
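
The two `wc -l` counts above can be turned into a reduction figure with a few lines of arithmetic. This sketch only restates the numbers from the comment; the function name is illustrative:

```python
def reduction(before: int, after: int) -> tuple[float, float]:
    """Return (fold_reduction, percent_dropped) between two line counts."""
    return before / after, 100.0 * (before - after) / before


fold, pct = reduction(844_298_029, 73_026)
print(f"{fold:,.0f}x fewer lines, {pct:.3f}% dropped")
```

That is a drop of roughly four orders of magnitude from one day to the next.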

Mar 2 2026

colewhite updated the task description for T418772: Eqiad: lsw1-d7-eqiad BGP maintenance.
Mar 2 2026, 9:17 PM · Prod-Kubernetes, ServiceOps new, netops, Infrastructure-Foundations, SRE

Feb 24 2026

colewhite added a project to T249663: write some recording rules for queries used in the appserver RED k8s dashboard: Observability-Metrics.
Feb 24 2026, 10:24 PM · Patch-For-Review, Observability-Metrics, SRE Observability (FY2025/2026-Q3), Prod-Kubernetes, ServiceOps new, SRE
colewhite added a comment to T343020: Converting MediaWiki Metrics to StatsLib.

However it has now received code review and is just waiting on response from the submitter, so it probably no longer needs to be in that column.

Feb 24 2026, 10:19 PM · SRE Observability (FY2025/2026-Q1), Essential-Work, MW-1.44-notes (1.44.0-wmf.28; 2025-05-06), Patch-For-Review, Observability-Metrics
colewhite added a project to T416863: csp-report-only topic has exploded in size: Observability-Logging.
Feb 24 2026, 10:53 AM · Observability-Logging, SRE Observability (FY2025/2026-Q3), ContentSecurityPolicy

Feb 23 2026

colewhite closed T418063: wikimediastatus.net has expired certificate as Resolved.

As part of T376400: Redesign wikitech-static, the backup wikitech moved away from the legacy wikitech-static host. Part of that process was pointing wikitech-static.wikimedia.org away from the legacy host to its new home. This made it impossible for certbot to complete its certificate renewal for a host that it no longer served. This renewal discontinuity interrupted wikimediastatus.net and status.wikimedia.org renewals.

Feb 23 2026, 8:18 PM · Incident Tooling, Traffic

Jan 28 2026

colewhite created T415784: GitLab Explore projects "Maximum Page Reached".
Jan 28 2026, 1:31 PM · GitLab (Upstream pit of despair 🕳️)

Jan 26 2026

colewhite closed T409363: Setup service name for Beta Cluster access to logstash service in logging project as Resolved.
Jan 26 2026, 10:02 AM · Scap, Observability-Logging, Beta-Cluster-Infrastructure

Jan 22 2026

colewhite added a subtask for T390215: Logstash is overwhelmed: T415289: logging-hd nodes evict-rejoin troubles.
Jan 22 2026, 4:59 PM · SRE Observability (FY2025/2026-Q1), Patch-For-Review, Observability-Logging
colewhite added a parent task for T415289: logging-hd nodes evict-rejoin troubles: T390215: Logstash is overwhelmed.
Jan 22 2026, 4:59 PM · Observability-Logging
colewhite created T415289: logging-hd nodes evict-rejoin troubles.
Jan 22 2026, 4:59 PM · Observability-Logging
colewhite closed T409339: Beta Cluster MediaWiki updates require logging-logstash-02.logging.eqiad1.wikimedia.cloud to allow access to port 9200 by `scap` as Resolved.

I've added monitoring and alerting to the new cname record. Considering this done!

Jan 22 2026, 4:58 PM · Observability-Logging, Beta-Cluster-Infrastructure, Scap
colewhite updated subscribers of T415270: librenms.syslog table is 800GB.
Jan 22 2026, 3:46 PM · observability, netops, DBA, Infrastructure-Foundations

Jan 21 2026

colewhite closed T414670: Change units for "network utilization" on "host overview" dashboard to bits/sec as Resolved.

Was bold and made this change. Will direct here for further discussion if there are other opinions.

Jan 21 2026, 9:36 PM · Observability-Metrics, SRE Observability (FY2025/2026-Q3)

Jan 15 2026

colewhite added a comment to T414670: Change units for "network utilization" on "host overview" dashboard to bits/sec.

I'm +1 for this change. Usually, I'm looking at the network panels to see where utilization is relative to the link speed. It'd be nice to not have to manually calculate it.

Jan 15 2026, 10:42 PM · Observability-Metrics, SRE Observability (FY2025/2026-Q3)
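
The manual calculation mentioned above is the byte-rate-to-link-speed conversion. A minimal sketch with made-up numbers (the traffic rate and link speed below are hypothetical, not taken from any dashboard):

```python
def link_utilization(bytes_per_sec: float, link_speed_bps: float) -> float:
    """Fraction of link capacity in use, given a byte rate and a link speed in bits/sec."""
    return bytes_per_sec * 8 / link_speed_bps


# Hypothetical example: 500 MB/s of traffic on a 10 Gbit/s link.
print(f"{link_utilization(500e6, 10e9):.0%}")
```

With the panel already in bits/sec, the factor of 8 disappears and the value can be compared to the link speed directly.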
colewhite added a comment to T414648: Request to increase quotas for logging project.

Thank you!!

Jan 15 2026, 4:32 PM · Observability-Logging, Cloud-VPS (Quota-requests)

Jan 14 2026

colewhite added a project to T414648: Request to increase quotas for logging project: Observability-Logging.
Jan 14 2026, 11:45 PM · Observability-Logging, Cloud-VPS (Quota-requests)
colewhite created T414648: Request to increase quotas for logging project.
Jan 14 2026, 11:45 PM · Observability-Logging, Cloud-VPS (Quota-requests)
colewhite added a comment to T414501: Improve tooling for long-running Thanos queries.

For context, the outage was caused by saturated nics on the titan hosts.

Jan 14 2026, 8:36 PM · SRE Observability
colewhite added a comment to T414607: Improve Grafana scalability.

For visibility, the outage today was a "grafana consumed all the memory" condition. https://grafana-next.wikimedia.org, which links to the read-only backup in the standby DC, remained available and was used to diagnose the primary grafana host.

Jan 14 2026, 8:31 PM · SRE Observability

Jan 12 2026

colewhite added a comment to T414098: Move https://status.wikimedia.org/ away from rackspace.

We (observability) asked about this in the all-SRE meeting and most preferred to 301 status.wm.o -> wikimediastatus.net on the grounds that there are many Wikimedia properties using third-party tools and hosting with different privacy policies. Another option presented was to host a small static page on miscweb.

Jan 12 2026, 9:29 PM · Patch-For-Review, collaboration-services, SRE Observability, cloud-services-team

Jan 6 2026

colewhite added a comment to T413842: Icinga config can break on ensure=>absent in nagios_common::check_command::config.

Puppet is running on icinga again after that last patch.

Jan 6 2026, 2:09 AM · Observability-Alerting
colewhite renamed T413842: Icinga config can break on ensure=>absent in nagios_common::check_command::config from Icinga config broken due to ensure absent to Icinga config can break on ensure=>absent in nagios_common::check_command::config.
Jan 6 2026, 2:08 AM · Observability-Alerting
colewhite created T413842: Icinga config can break on ensure=>absent in nagios_common::check_command::config.
Jan 6 2026, 12:43 AM · Observability-Alerting

Dec 23 2025

colewhite closed T413414: Put logging-sd[12]00[567] into service as Resolved.
Dec 23 2025, 7:20 PM · Observability-Logging

Dec 22 2025

colewhite changed the status of T413414: Put logging-sd[12]00[567] into service from Open to In Progress.
Dec 22 2025, 11:26 PM · Observability-Logging
colewhite created T413414: Put logging-sd[12]00[567] into service.
Dec 22 2025, 11:15 PM · Observability-Logging
colewhite updated the task description for T413127: Directory Listing and Download from Object Storage.
Dec 22 2025, 5:11 PM · Patch-For-Review, MediaWiki-Platform-Team (Radar), Data-Persistence, Arc-Lamp

Dec 18 2025

colewhite added a member for Arc-Lamp: colewhite.
Dec 18 2025, 8:49 PM
colewhite created T413127: Directory Listing and Download from Object Storage.
Dec 18 2025, 8:49 PM · Patch-For-Review, MediaWiki-Platform-Team (Radar), Data-Persistence, Arc-Lamp
colewhite added a comment to T412842: arclamp hosts ran out of space.

I learned some things.

Dec 18 2025, 2:37 PM · Observability-Logging, Arc-Lamp

Dec 17 2025

colewhite closed T412980: Logstash: add index on new x-wmf-* fields in Envoy access log. as Resolved.

Done! Let us know if something is amiss!

Dec 17 2025, 5:31 PM · Observability-Logging

Dec 16 2025

colewhite closed T412806: Logstash timezone defaulting to PST instead of UTC as Resolved.

I've reset the timezone back to UTC in the settings.

Dec 16 2025, 2:48 PM · Observability-Logging

Dec 11 2025

colewhite claimed T409363: Setup service name for Beta Cluster access to logstash service in logging project.

Record is in place and config change to scap is deployed.

Dec 11 2025, 5:19 PM · Scap, Observability-Logging, Beta-Cluster-Infrastructure
colewhite claimed T335610: Loki stores too many chunks.
Dec 11 2025, 3:17 PM · Observability-Logging
colewhite moved T335610: Loki stores too many chunks from Prioritized to Watching on the Observability-Logging board.
Dec 11 2025, 3:16 PM · Observability-Logging
colewhite added a comment to T335610: Loki stores too many chunks.

Will watch memory usage and filesystem usage in the coming weeks.

Dec 11 2025, 3:15 PM · Observability-Logging

Dec 9 2025

colewhite added a comment to T412143: ~5k/logs/sec from netdev.

List of affected hosts:

asw1-b3-magru
asw1-b4-magru
asw1-bw27-esams
asw1-by27-esams
cloudsw1-b1-codfw
lsw1-a2-codfw
lsw1-a3-codfw
lsw1-a4-codfw
lsw1-a5-codfw
lsw1-a6-codfw
lsw1-a7-codfw
lsw1-a8-codfw
lsw1-b2-codfw
lsw1-b3-codfw
lsw1-b4-codfw
lsw1-b5-codfw
lsw1-b6-codfw
lsw1-b7-codfw
lsw1-b8-codfw
lsw1-c1-codfw
lsw1-c2-codfw
lsw1-c3-codfw
lsw1-c4-codfw
lsw1-c5-codfw
lsw1-c6-codfw
lsw1-c7-codfw
lsw1-d1-codfw
lsw1-d2-codfw
lsw1-d3-codfw
lsw1-d4-codfw
lsw1-d5-codfw
lsw1-d6-codfw
lsw1-d7-codfw
lsw1-d8-codfw
lsw1-e1-eqiad
lsw1-e2-eqiad
lsw1-e3-eqiad
lsw1-e5-eqiad
lsw1-e6-eqiad
lsw1-e7-eqiad
lsw1-e8-eqiad
lsw1-f1-eqiad
lsw1-f2-eqiad
lsw1-f3-eqiad
lsw1-f5-eqiad
lsw1-f6-eqiad
lsw1-f7-eqiad
lsw1-f8-eqiad
ssw1-a1-codfw
ssw1-a8-codfw
ssw1-d1-codfw
ssw1-d8-codfw
ssw1-e1-eqiad
ssw1-f1-eqiad
Dec 9 2025, 6:39 PM · Observability-Logging, netops, Infrastructure-Foundations
colewhite added a subtask for T390215: Logstash is overwhelmed: T412143: ~5k/logs/sec from netdev.
Dec 9 2025, 5:36 PM · SRE Observability (FY2025/2026-Q1), Patch-For-Review, Observability-Logging
colewhite added a parent task for T412143: ~5k/logs/sec from netdev: T390215: Logstash is overwhelmed.
Dec 9 2025, 5:36 PM · Observability-Logging, netops, Infrastructure-Foundations
colewhite created T412143: ~5k/logs/sec from netdev.
Dec 9 2025, 5:35 PM · Observability-Logging, netops, Infrastructure-Foundations

Dec 5 2025

colewhite added a comment to T411474: Grafana "MW deploy" "Train deployments" annotations broken on some dashboards.

@colewhite is there a way to search for all of the dashboards that have graphite annotations? I fixed the 3 that @hashar explicitly listed, but I'm not excited enough about this to want to manually check every dashboard.

Dec 5 2025, 6:21 PM · Scap, Release-Engineering-Team (Radar), Grafana
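
One way to approach the question above is to scan exported dashboard JSON rather than open each dashboard by hand. This is a heuristic sketch only. It assumes the dashboard JSON model keeps annotation queries under `annotations.list`, and that a Graphite-backed annotation can be recognised by the word "graphite" in its datasource name, type, or uid. How the dashboards are exported is left out, and the sample dashboard is invented.

```python
def graphite_annotations(dashboard: dict) -> list[str]:
    """Names of annotation queries in a dashboard JSON model that appear to use Graphite."""
    hits = []
    for ann in dashboard.get("annotations", {}).get("list", []):
        ds = ann.get("datasource")
        # The datasource may be an object ({"type": ..., "uid": ...}) or a plain name string.
        if isinstance(ds, dict):
            ds_text = " ".join(str(v) for v in ds.values())
        else:
            ds_text = str(ds or "")
        if "graphite" in ds_text.lower():
            hits.append(ann.get("name", "<unnamed>"))
    return hits


# Invented sample, shaped like an exported dashboard.
sample = {
    "title": "Example dashboard",
    "annotations": {
        "list": [
            {"name": "Annotations & Alerts", "datasource": {"type": "grafana", "uid": "-- Grafana --"}},
            {"name": "MW deploy", "datasource": {"type": "graphite", "uid": "abc123"}},
            {"name": "Train deployments", "datasource": "graphite"},
        ]
    },
}
print(graphite_annotations(sample))
```

Running the function over every exported dashboard and keeping the non-empty results would give the list of dashboards to review.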
colewhite closed T406880: hCaptcha: Implement alerts as Resolved.

@colewhite it seems like an alert should have fired given these logs https://logstash.wikimedia.org/goto/6aee202d1a5f1feb5ac0504d4e2dbcd6 , but I do not see one in #psi-alerts.

Dec 5 2025, 5:35 PM · ConfirmEdit (CAPTCHA extension), Product Safety and Integrity (Sprint Mince Pie Dec 1 - Dec 12), MW-1.45-notes (1.45.0-wmf.25; 2025-10-28), Observability-Alerting, WE4.2 Bot detection (WE4.2 hCaptcha account creation trial)
colewhite closed T406880: hCaptcha: Implement alerts, a subtask of T404204: Investigate options for automatic fallback to FancyCAPTCHA, as Resolved.
Dec 5 2025, 5:35 PM · MW-1.45-notes (1.45.0-wmf.22; 2025-10-07), Product Safety and Integrity (Sprint Apfel Strudel (Sep 29 - Oct 17)), SRE, WE4.2 Bot detection (WE4.2 hCaptcha account creation trial)

Dec 4 2025

colewhite added a comment to T411474: Grafana "MW deploy" "Train deployments" annotations broken on some dashboards.

For awareness, there's also a regression affecting annotations in Grafana: https://github.com/grafana/grafana/issues/110265

Dec 4 2025, 10:23 PM · Scap, Release-Engineering-Team (Radar), Grafana
colewhite added a comment to T383563: mw.track: support for histogram metrics.

However, I lack the knowledge about what happens afterward. dogstatsd seems to support metrics (|h), but I'm not sure what it does with that and how it is passed on to the next part of our pipeline. And we also need to somehow define the buckets to use.
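
For illustration, a DogStatsD histogram sample is a plain-text datagram of the form `name:value|h`, optionally followed by `|#key:value` tags. The sketch below formats such a datagram and shows one way raw samples could be folded into cumulative, Prometheus-style buckets. The metric name, tags, and bucket boundaries are hypothetical, and this is not necessarily how mw.track or the rest of the pipeline handles histograms.

```python
from bisect import bisect_left


def format_histogram(name, value, tags=None):
    """Build a DogStatsD histogram datagram: name:value|h|#k:v,k2:v2"""
    line = f"{name}:{value}|h"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in tags.items())
    return line


def bucket_counts(samples, bounds):
    """Return cumulative counts keyed by upper bound, ending with +Inf."""
    bounds = sorted(bounds)
    counts = [0] * (len(bounds) + 1)
    for sample in samples:
        # A sample equal to a bound is counted in that bound's bucket.
        counts[bisect_left(bounds, sample)] += 1
    result = {}
    running = 0
    for bound, count in zip(bounds + [float("inf")], counts):
        running += count
        result[bound] = running
    return result
```

Choosing the boundaries is the open question raised above: they have to be fixed before samples are aggregated, since the raw values are discarded once counted.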

Dec 4 2025, 10:00 PM · MW-1.46-notes (1.46.0-wmf.14; 2026-02-03), Essential-Work, Growth-Team (FY2025-26 Q3 Sprint 2), User-Michael, patch-welcome, MediaWiki-Engineering, Observability-Metrics, Data-Engineering-Radar, MediaWiki-Platform-Team (Radar), Data-Engineering, MediaWiki-extensions-WikimediaEvents, Grafana, GrowthExperiments
colewhite added a comment to T410764: MediaWiki periodic job startupregistrystats-mediawikiwiki failed.

If you rename them, the next time the alert fires, it will create a new task instead of appending to the existing open one.

IMO that's not ideal; these tasks will clutter up search results and will be hard to navigate. Couldn't it use a custom Maniphest field rather than the title for identifying which task is related to a script?

I don't know, we'd have to ask SRE Observability about that

Dec 4 2025, 6:52 PM · MediaWiki-Platform-Team, SRE Observability, serviceops-deprecated

Dec 3 2025

colewhite closed T411623: Logstash: add index on new fields in Evnoy access log. as Resolved.

Done! Let us know if something is amiss!

Dec 3 2025, 3:15 PM · Observability-Logging
colewhite added a comment to T377273: Migrate scap metrics to Prometheus.

For batch-job oriented applications, we run a Prometheus PushGateway instance, which may be an option for this. A Prometheus metrics endpoint in a stateful service is the ideal, though. Since spiderpig is a daemon service, could it serve as the source of this data? (Guessing myself: probably not, but it can't hurt to ask the experts. 🙂)
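
A minimal sketch of what pushing from a batch job could look like, using only the Python standard library. The Pushgateway accepts the Prometheus text format on `/metrics/job/<job>`. The host name, job name, and metric name here are hypothetical, and the request is only built, not sent.

```python
import urllib.request
from urllib.parse import quote


def render_metrics(metrics):
    """Render {name: value} as Prometheus text exposition lines."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


def build_push_request(base_url, job, metrics):
    """Build (but do not send) a PUT request for a Pushgateway job group."""
    url = f"{base_url.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    body = render_metrics(metrics).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",  # PUT replaces every metric in this job's group
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Sending would be a single call at the end of the batch job, e.g.:
# urllib.request.urlopen(build_push_request(url, job, metrics), timeout=5)
```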

Dec 3 2025, 12:42 AM · Observability-Metrics, Scap

Dec 2 2025

colewhite removed projects from T411474: Grafana "MW deploy" "Train deployments" annotations broken on some dashboards: Observability-Metrics, observability.

The replacement for this annotation tool is to use the Public Logs datasource in Grafana which is backed by Loki. Please let us know if the Observability team can be of further assistance.

Dec 2 2025, 6:49 PM · Scap, Release-Engineering-Team (Radar), Grafana
colewhite renamed T411474: Grafana "MW deploy" "Train deployments" annotations broken on some dashboards from Graphana no more shows "MW deploy" "Train deployments" annotations to Grafana "MW deploy" "Train deployments" annotations broken on some dashboards.
Dec 2 2025, 4:05 PM · Scap, Release-Engineering-Team (Radar), Grafana

Nov 25 2025

colewhite closed T410795: Logstash: No longer able to install OpenSearch 2.7.0 from internal apt repo, a subtask of T407123: Mirror OpenSearch repos from upstream, as Resolved.
Nov 25 2025, 12:04 AM · Data-Platform-SRE (2026.01.05 - 2026.01.23), OKR-Work
colewhite closed T410795: Logstash: No longer able to install OpenSearch 2.7.0 from internal apt repo as Resolved.

component/opensearch27 provisioned and beta-logs is downgraded again.

Nov 25 2025, 12:04 AM · Observability-Logging

Nov 21 2025

colewhite updated the task description for T410795: Logstash: No longer able to install OpenSearch 2.7.0 from internal apt repo.
Nov 21 2025, 10:23 PM · Observability-Logging
colewhite added a subtask for T407123: Mirror OpenSearch repos from upstream: T410795: Logstash: No longer able to install OpenSearch 2.7.0 from internal apt repo.
Nov 21 2025, 10:21 PM · Data-Platform-SRE (2026.01.05 - 2026.01.23), OKR-Work
colewhite added a parent task for T410795: Logstash: No longer able to install OpenSearch 2.7.0 from internal apt repo: T407123: Mirror OpenSearch repos from upstream.
Nov 21 2025, 10:21 PM · Observability-Logging
colewhite triaged T410795: Logstash: No longer able to install OpenSearch 2.7.0 from internal apt repo as High priority.
Nov 21 2025, 10:21 PM · Observability-Logging
colewhite created T410795: Logstash: No longer able to install OpenSearch 2.7.0 from internal apt repo.
Nov 21 2025, 10:21 PM · Observability-Logging

Nov 20 2025

colewhite closed T410619: Alert in need of triage: SystemdUnitCrashLoop (instance grafana2001:9100) as Resolved.

I've removed the rsync job that I suspect was causing Loki to panic regularly.

Nov 20 2025, 4:32 PM · SRE Observability (FY2025/2026-Q2), Observability-Logging, sre-alert-triage

Nov 14 2025

colewhite added a comment to T409361: `logging` project missing normal DNS zone delegation.

Thank you!

Nov 14 2025, 5:17 PM · cloud-services-team, Cloud-VPS

Nov 6 2025

colewhite added a comment to T408378: Nokia OSPF alerts not working.

In today's case, the alert criteria weren't met because the metrics went missing.

Nov 6 2025, 9:13 PM · Observability-Alerting, netops, Infrastructure-Foundations, SRE

Nov 5 2025

colewhite added a comment to T409339: Beta Cluster MediaWiki updates require logging-logstash-02.logging.eqiad1.wikimedia.cloud to allow access to port 9200 by `scap`.

The network ACLs and the scap config were updated to use the newer host: logging-logstash-04.

Nov 5 2025, 9:52 PM · Observability-Logging, Beta-Cluster-Infrastructure, Scap

Nov 4 2025

colewhite added a project to T305223: Clean up stale Prometheus target and rules files: Observability-Metrics.
Nov 4 2025, 7:49 PM · Patch-For-Review, Observability-Metrics, SRE Observability (FY2025/2026-Q1)

Oct 28 2025

colewhite added a comment to T407219: Enrich MediaWiki logs with IP reputation data in Logstash.

Yeah. IPoid serves up a database created from the daily IP data dumps Spur gives us, so theoretically that dump could be provided directly to the Logstash hosts in some different format that can be queried efficiently locally (e.g. sqlite) but I imagine that would result in an infeasible amount of maintenance overhead.

Possibly not an infeasible amount of maintenance overhead. Depends on how stable the data source is and how performant it can be made. Performance testing is needed.

That seems problematic even with IPoid. I guess this is a no-go then?

I wouldn't say it's no-go because of that. Isolating the stream is possible and so is a separate data enrichment step. Isolating the stream would give us a more concrete picture of the load this feature will induce on the Spur data provider (whatever form it takes).

Is there maybe a way to do such post-processing inside the OpenSearch index?

Post-processing indexed events is less preferable as it introduces tombstoning overhead. We'd like to avoid that. It is far better to inject the data into the event before it reaches OpenSearch for storage. Knowing the volume of the isolated stream would help us assess the impact.

I see @kostajh wrote about how to import the Spur data to OpenSearch, so maybe there's an intent to do that anyway and we'd only have to connect the two indexes somehow?

OpenSearch has no JOIN clause. Depending on the size of the data Spur gives us, I do see the possibility of the separate data enrichment pipeline pulling data backed by an OpenSearch index, though. Performance testing is needed.
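
To make the "inject before storage" idea concrete, here is a rough sketch of an enrichment step that looks up each event's IP in a local table before the event is indexed. It uses an in-memory sqlite database purely as a stand-in for whatever form the Spur data ends up taking; the table layout, field names, and risk label are hypothetical.

```python
import sqlite3


def open_reputation_db(rows):
    """Create an in-memory lookup table from (ip, risk) pairs."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE reputation (ip TEXT PRIMARY KEY, risk TEXT)")
    db.executemany("INSERT INTO reputation VALUES (?, ?)", rows)
    return db


def enrich(event, db):
    """Return a copy of the event with reputation data attached, if any."""
    row = db.execute(
        "SELECT risk FROM reputation WHERE ip = ?", (event.get("ip"),)
    ).fetchone()
    enriched = dict(event)
    if row is not None:
        enriched["ip_reputation"] = {"risk": row[0]}
    return enriched
```

Events with no matching IP pass through unchanged, so the lookup acts like a left join performed in the pipeline rather than in OpenSearch.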

Oct 28 2025, 11:49 PM · MediaWiki-Debug-Logger, MediaWiki-extensions-IPReputation, Observability-Logging, Wikimedia-Logstash, Product Safety and Integrity

Oct 27 2025

colewhite added a project to T406170: Metric anomaly in resourceloader_build_seconds after 26 June 2025: MediaWiki-libs-Stats.
Oct 27 2025, 5:45 PM · MW-1.46-notes (1.46.0-wmf.1; 2025-11-05), MW-1.45-notes (1.45.0-wmf.25; 2025-10-28), MediaWiki-libs-Stats, MediaWiki-Platform-Team, Regression, MediaWiki-ResourceLoader

Oct 22 2025

colewhite added a comment to T406880: hCaptcha: Implement alerts.

@colewhite would you be able to also set up an alert for the Logstash entry for "hCaptcha is unavailable, falling back to FancyCaptcha"?

Oct 22 2025, 8:09 PM · ConfirmEdit (CAPTCHA extension), Product Safety and Integrity (Sprint Mince Pie Dec 1 - Dec 12), MW-1.45-notes (1.45.0-wmf.25; 2025-10-28), Observability-Alerting, WE4.2 Bot detection (WE4.2 hCaptcha account creation trial)

Oct 16 2025

colewhite added a comment to T404888: Parse DMARC reports and create a dashboard from data.

Based on my understanding of what DMARC reports contain, I think this data is fine to store in Logstash. I interpret the recommendation to "keep email logs out of logstash" as being about preventing leaks of data and metadata about private communications.

Oct 16 2025, 9:57 PM · Patch-For-Review, SRE Observability, Epic, Infrastructure-Foundations, Mail

Oct 15 2025

colewhite removed projects from T340187: Enable pl.wikisource clienterrors monitoring in Logstash: Observability-Logging, Wikimedia-Logstash.

We (observability) provide the infrastructure, but this request looks like a mediawiki config setting. I believe the linked patch would fulfill the ask, but someone else should make the decision on whether it is the right change.

Oct 15 2025, 4:14 PM · Patch-For-Review, Test Kitchen, Wikimedia-Site-requests
colewhite edited projects for T340187: Enable pl.wikisource clienterrors monitoring in Logstash, added: Observability-Logging; removed observability.
Oct 15 2025, 2:21 PM · Patch-For-Review, Test Kitchen, Wikimedia-Site-requests
colewhite edited projects for T406795: Q2:rack/setup/install logging-sd200[567], added: Observability-Logging; removed observability.
Oct 15 2025, 2:19 PM · Observability-Logging, SRE, ops-codfw, DC-Ops
colewhite edited projects for T406796: Q2:rack/setup/install logging-sd100[567], added: Observability-Logging; removed observability.
Oct 15 2025, 2:18 PM · Observability-Logging, SRE, ops-eqiad, DC-Ops
colewhite removed a project from T407185: Fix Kafka replicas skew: observability.
Oct 15 2025, 2:18 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Data-Engineering-Radar, serviceops-deprecated, Observability-Logging, Data-Engineering
colewhite assigned T407185: Fix Kafka replicas skew to herron.
Oct 15 2025, 2:17 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Data-Engineering-Radar, serviceops-deprecated, Observability-Logging, Data-Engineering
colewhite removed a project from T406308: Link to view requestctl rule in superset no longer working: SRE Observability.

Please correct me if I'm wrong, but is this link generated and displayed by requestctl?

Oct 15 2025, 2:14 PM · Sustainability (Incident Followup), requestctl
colewhite edited projects for T406689: Email alerts from Grafana stopped working?, added: Observability-Alerting; removed SRE Observability.
Oct 15 2025, 2:08 PM · Observability-Alerting, Grafana
colewhite edited projects for T407130: Remove check_systemd_unit_status-based Icinga alerts, added: Observability-Alerting; removed SRE Observability.
Oct 15 2025, 2:07 PM · Observability-Alerting
colewhite edited projects for T407320: Package benthos/redpanda for trixie, added: Observability-Logging; removed SRE Observability.

Do you happen to have a trixie host available that we can try the existing package on?

Oct 15 2025, 2:07 PM · Observability-Logging, Traffic

Oct 14 2025

colewhite added a comment to T407219: Enrich MediaWiki logs with IP reputation data in Logstash.

A few things come to mind immediately.

  1. We don't want to introduce an external (to us) dependency. Logstash has no internet access and it shouldn't have it. (I am guessing the workaround is to use IPoid?)
  2. We don't want to introduce additional latency to the overall pipeline. The desired stream ought to be isolated so that delays induced by the external data provider do not add latency for other tenants.
  3. Logstash handles thousands of events per second which translates into requests against the data source. Whatever handles the data should be able to absorb this extra load.
  4. The Logstash SLO may need to be tuned to handle the introduction of an external dependency.
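
On point 3, one common way to keep per-event lookups from turning into per-event requests is to cache results by IP. This is only a sketch under the assumption that reputation data changes slowly (the dumps are described as daily); `backend` is a hypothetical callable standing in for the real data source.

```python
from functools import lru_cache


def make_cached_lookup(backend, maxsize=100_000):
    """Wrap a slow lookup so repeated IPs do not reach the backend again."""

    @lru_cache(maxsize=maxsize)
    def lookup(ip):
        return backend(ip)

    return lookup
```

`lookup.cache_info()` reports hits and misses, which would give a first estimate of how much load the cache absorbs on a real stream.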
Oct 14 2025, 10:02 PM · MediaWiki-Debug-Logger, MediaWiki-extensions-IPReputation, Observability-Logging, Product Safety and Integrity, Wikimedia-Logstash