BCornwall (Brett Cornwall)
SRE/Traffic

User Details

User Since
May 4 2022, 6:41 PM (206 w, 4 d)
Availability
Available
IRC Nick
brett
LDAP User
BCornwall
MediaWiki User
BCornwall-WMF

Recent Activity

Wed, Apr 15

BCornwall closed T421007: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet as Resolved.

I can confirm it's behaving properly now! Reimage worked just fine and I don't have any kernel errors any more. It's been put into service. Thanks, @VRiley-WMF!

Wed, Apr 15, 6:11 PM · Traffic, SRE, ops-eqiad, DC-Ops

Tue, Apr 14

BCornwall changed the status of T408617: Containerize ncmonitor from In Progress to Stalled.
Tue, Apr 14, 7:23 PM · Traffic
BCornwall renamed T423310: IPv6 non-functional in GitLab CI environments from IPv6 non-functional in CI environments to IPv6 non-functional in GitLab CI environments.
Tue, Apr 14, 4:21 PM · collaboration-services
BCornwall created T423310: IPv6 non-functional in GitLab CI environments.
Tue, Apr 14, 4:17 PM · collaboration-services
BCornwall added a comment to T421421: Revert lvs1017 Mellanox NIC to Broadcom.

We've decided to move forward with this task. Would dcops be willing to handle the NIC revert in lvs1017?

Tue, Apr 14, 3:46 PM · SRE, Traffic, ops-eqiad, DC-Ops

Mon, Apr 13

BCornwall closed T423199: [Update DNS Record Request] - wikimedia.org - Add TXT verification for Miro as Resolved.
Mon, Apr 13, 10:06 PM · SRE, DNS, Traffic
BCornwall added a comment to T423199: [Update DNS Record Request] - wikimedia.org - Add TXT verification for Miro.

Hi, @JKelsoteel-WMF! This has been deployed, so I'm going to go ahead and close this. Please do re-open if something is not as expected. Thanks!

Mon, Apr 13, 10:06 PM · SRE, DNS, Traffic
BCornwall changed the status of T423199: [Update DNS Record Request] - wikimedia.org - Add TXT verification for Miro from Open to In Progress.
Mon, Apr 13, 8:53 PM · SRE, DNS, Traffic

Thu, Apr 9

BCornwall updated subscribers of P90343 T422860 haproxy packages.

cc @Fabfur - I told @bking that it's safe to delete those, since 2.8 in bullseye isn't listed for deletion.

Thu, Apr 9, 5:32 PM · Data-Platform-SRE, Traffic
BCornwall added a comment to T422261: sre.hosts.reboot-single cookbook removes any and all downtimes after reboot.

Thanks for the response, @elukey! Indeed, Icinga would ideally not even be used any more. Since the service in question is planned to be replaced in the upcoming months, it's not worth the porting effort. However, that is a good response: "Why are you using this dead alerting system in the first place? Migrate over to Prometheus/AM".

Thu, Apr 9, 4:09 PM · Infrastructure-Foundations, Traffic

Wed, Apr 8

BCornwall added a comment to T422261: sre.hosts.reboot-single cookbook removes any and all downtimes after reboot.

This highlights the larger problem of the opacity of cookbooks, particularly those that purport to be generalized. Removing downtimes not related to its own operation is overreaching and IMHO the solution is for it to remove only its own downtime.

Wed, Apr 8, 4:58 PM · Infrastructure-Foundations, Traffic

Tue, Apr 7

BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Tue, Apr 7, 6:09 PM · Traffic

Fri, Apr 3

BCornwall added a comment to T421421: Revert lvs1017 Mellanox NIC to Broadcom.

We're going to be discussing whether we still want to pursue this; sorry for the premature bug report. We'll probably discuss it next Tuesday in our sync-up.

Fri, Apr 3, 8:50 PM · SRE, Traffic, ops-eqiad, DC-Ops
BCornwall added a comment to T421007: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet.

No problem. I just re-ran it and can confirm that the issues are still present.

Fri, Apr 3, 4:11 PM · Traffic, SRE, ops-eqiad, DC-Ops
BCornwall moved T422261: sre.hosts.reboot-single cookbook removes any and all downtimes after reboot from Backlog to Radar/Not for Service on the Traffic board.
Fri, Apr 3, 4:03 PM · Infrastructure-Foundations, Traffic
BCornwall created T422261: sre.hosts.reboot-single cookbook removes any and all downtimes after reboot.
Fri, Apr 3, 4:02 PM · Infrastructure-Foundations, Traffic

Wed, Apr 1

BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Wed, Apr 1, 9:24 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Wed, Apr 1, 9:23 PM · Traffic
BCornwall closed T411097: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup as Resolved.
Wed, Apr 1, 5:21 PM · User-Raine, SRE, Traffic

Tue, Mar 31

BCornwall added a comment to T411097: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup.

The LVS service has been removed, the hosts decommissioned, and the hcaptcha_proxy module removed from puppet. I'm not sure that anything needs to happen with the Grafana dashboard. I see that there's an exclusion of the old hcaptcha hosts in the dashboard variables but otherwise don't see any remnants at first glance.

Tue, Mar 31, 8:15 PM · User-Raine, SRE, Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Tue, Mar 31, 7:26 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Tue, Mar 31, 5:53 PM · Traffic

Mon, Mar 30

BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mon, Mar 30, 11:03 PM · Traffic
BCornwall removed a subtask for T411097: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup: T420468: Retire mw-parsoid LVS service.
Mon, Mar 30, 11:02 PM · User-Raine, SRE, Traffic
BCornwall removed a parent task for T420468: Retire mw-parsoid LVS service: T411097: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup.
Mon, Mar 30, 11:02 PM · ServiceOps-Services-Oids, ServiceOps new
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mon, Mar 30, 9:17 PM · Traffic
BCornwall added a comment to T421007: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet.

@VRiley-WMF Do I need to check things again?

Mon, Mar 30, 9:03 PM · Traffic, SRE, ops-eqiad, DC-Ops
BCornwall changed the status of T411097: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup from Open to In Progress.
Mon, Mar 30, 9:01 PM · User-Raine, SRE, Traffic
BCornwall added a parent task for T420468: Retire mw-parsoid LVS service: T411097: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup.
Mon, Mar 30, 9:00 PM · ServiceOps-Services-Oids, ServiceOps new
BCornwall added a subtask for T411097: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup: T420468: Retire mw-parsoid LVS service.
Mon, Mar 30, 9:00 PM · User-Raine, SRE, Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mon, Mar 30, 8:51 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mon, Mar 30, 7:47 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mon, Mar 30, 5:57 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mon, Mar 30, 4:11 PM · Traffic

Thu, Mar 26

BCornwall updated the task description for T421421: Revert lvs1017 Mellanox NIC to Broadcom.
Thu, Mar 26, 6:53 PM · SRE, Traffic, ops-eqiad, DC-Ops
BCornwall updated the task description for T421421: Revert lvs1017 Mellanox NIC to Broadcom.
Thu, Mar 26, 6:53 PM · SRE, Traffic, ops-eqiad, DC-Ops
BCornwall added a subtask for T387145: Q3:test NIC for lvs1017: T421421: Revert lvs1017 Mellanox NIC to Broadcom.
Thu, Mar 26, 6:43 PM · SRE, ops-eqiad, Traffic, DC-Ops
BCornwall added a parent task for T421421: Revert lvs1017 Mellanox NIC to Broadcom: T387145: Q3:test NIC for lvs1017.
Thu, Mar 26, 6:43 PM · SRE, Traffic, ops-eqiad, DC-Ops
BCornwall added a project to T421421: Revert lvs1017 Mellanox NIC to Broadcom: Traffic.
Thu, Mar 26, 6:42 PM · SRE, Traffic, ops-eqiad, DC-Ops
BCornwall created T421421: Revert lvs1017 Mellanox NIC to Broadcom.
Thu, Mar 26, 6:42 PM · SRE, Traffic, ops-eqiad, DC-Ops
BCornwall added a comment to T196336: Icinga passive checks go awol and downtime stops working.

It's been consistent behavior for some weeks now - both downtimes are removed at once after the reboot occurs. Not sure if this is a cookbook issue or an icinga issue, given the context of this Icinga bug.

Thu, Mar 26, 4:33 PM · Observability-Alerting, SRE, Icinga, observability

Wed, Mar 25

BCornwall added a comment to T196336: Icinga passive checks go awol and downtime stops working.

Unsure if this is related to this particular issue, but running sre.hosts.downtime and then sre.hosts.reboot-single causes the downtime to be removed, so alerts still trigger.

Wed, Mar 25, 7:14 PM · Observability-Alerting, SRE, Icinga, observability
BCornwall closed T421207: Dead links on https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Process_count, a subtask of T421203: Bad ATS config led to large volume of 5xx from RESTBase, as Resolved.
Wed, Mar 25, 4:10 PM · Incident Severity 3, Traffic, Wikimedia-Incident
BCornwall closed T421207: Dead links on https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Process_count as Resolved.

The information on the page was not accurate and has been removed.

Wed, Mar 25, 4:10 PM · Documentation, Sustainability (Incident Followup), SRE, Traffic
BCornwall triaged T421207: Dead links on https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Process_count as Low priority.
Wed, Mar 25, 4:03 PM · Documentation, Sustainability (Incident Followup), SRE, Traffic
BCornwall changed the status of T421207: Dead links on https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Process_count from Open to In Progress.
Wed, Mar 25, 4:03 PM · Documentation, Sustainability (Incident Followup), SRE, Traffic
BCornwall changed the status of T421207: Dead links on https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Process_count, a subtask of T421203: Bad ATS config led to large volume of 5xx from RESTBase, from Open to In Progress.
Wed, Mar 25, 4:03 PM · Incident Severity 3, Traffic, Wikimedia-Incident

Tue, Mar 24

BCornwall created T421183: Update Pint package.
Tue, Mar 24, 10:16 PM · SRE Observability (FY2025/2026-Q4)
BCornwall closed T420789: lvs2013 NIC fails to set queue length / fails to initialize with ipip-multiqueue-optimizer as Resolved.
Tue, Mar 24, 8:24 PM · Traffic
BCornwall added a comment to T421007: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet.

Unfortunately, it's still throwing the errors. :(

Tue, Mar 24, 4:52 PM · Traffic, SRE, ops-eqiad, DC-Ops
BCornwall added a comment to T421007: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet.

@VRiley-WMF Yes, please do!

Tue, Mar 24, 3:10 PM · Traffic, SRE, ops-eqiad, DC-Ops
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Tue, Mar 24, 1:42 AM · Traffic
BCornwall updated the task description for T421007: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet.
Tue, Mar 24, 12:59 AM · Traffic, SRE, ops-eqiad, DC-Ops
BCornwall updated the task description for T421007: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet.
Tue, Mar 24, 12:57 AM · Traffic, SRE, ops-eqiad, DC-Ops
BCornwall renamed T421007: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet from hardware troubleshooting: Unable to PXE boot cp1115.eqiad.wmnet to hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet.
Tue, Mar 24, 12:55 AM · Traffic, SRE, ops-eqiad, DC-Ops
BCornwall added a comment to T421007: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet.

Marked cp1115 as "failed" in netbox

Tue, Mar 24, 12:43 AM · Traffic, SRE, ops-eqiad, DC-Ops
BCornwall renamed T421007: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet from firmware troubleshooting: Unable to PXE boot cp1115.eqiad.wmnet to hardware troubleshooting: Unable to PXE boot cp1115.eqiad.wmnet.
Tue, Mar 24, 12:41 AM · Traffic, SRE, ops-eqiad, DC-Ops
BCornwall added a comment to T421007: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet.

Hm. The specific output:

Tue, Mar 24, 12:39 AM · Traffic, SRE, ops-eqiad, DC-Ops

Mon, Mar 23

BCornwall moved T421007: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
Mon, Mar 23, 10:08 PM · Traffic, SRE, ops-eqiad, DC-Ops
BCornwall created T421007: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet.
Mon, Mar 23, 10:08 PM · Traffic, SRE, ops-eqiad, DC-Ops

Mar 20 2026

BCornwall changed the status of T420789: lvs2013 NIC fails to set queue length / fails to initialize with ipip-multiqueue-optimizer from Open to In Progress.
Mar 20 2026, 9:02 PM · Traffic
BCornwall updated the task description for T420789: lvs2013 NIC fails to set queue length / fails to initialize with ipip-multiqueue-optimizer.
Mar 20 2026, 8:51 PM · Traffic
BCornwall created T420789: lvs2013 NIC fails to set queue length / fails to initialize with ipip-multiqueue-optimizer.
Mar 20 2026, 8:50 PM · Traffic

Mar 19 2026

BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 19 2026, 5:38 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 19 2026, 5:07 PM · Traffic
BCornwall assigned T420586: Varnish sends invalid Server-Timing header when device is enrolled in an experiment to BBlack.
Mar 19 2026, 3:05 PM · Traffic
BCornwall changed the status of T420586: Varnish sends invalid Server-Timing header when device is enrolled in an experiment from Open to In Progress.
Mar 19 2026, 3:05 PM · Traffic

Mar 18 2026

BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 18 2026, 11:05 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 18 2026, 11:05 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 18 2026, 8:23 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 18 2026, 7:09 PM · Traffic
BCornwall updated the task description for T420498: Factor in pooled status for SLO measurements.
Mar 18 2026, 5:30 PM · SRE-SLO, observability, Traffic
BCornwall created T420498: Factor in pooled status for SLO measurements.
Mar 18 2026, 5:30 PM · SRE-SLO, observability, Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 18 2026, 5:10 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 18 2026, 5:08 PM · Traffic
BCornwall added a project to T419753: Decommission codfw cp hosts cp2027-cp2040: ops-codfw.
Mar 18 2026, 3:39 PM · SRE, DC-Ops, ops-codfw, decommission-hardware, Traffic

Mar 17 2026

BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 17 2026, 10:56 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 17 2026, 7:18 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 17 2026, 5:15 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 17 2026, 12:26 AM · Traffic

Mar 16 2026

BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 16 2026, 11:35 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 16 2026, 9:40 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 16 2026, 8:34 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 16 2026, 7:20 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 16 2026, 5:39 PM · Traffic

Mar 13 2026

BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 13 2026, 9:03 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 13 2026, 8:56 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 13 2026, 6:20 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 13 2026, 5:11 PM · Traffic
BCornwall edited P76274 lsdepooled - For using as a custom Waybar module..
Mar 13 2026, 4:28 PM
BCornwall edited P76274 lsdepooled - For using as a custom Waybar module..
Mar 13 2026, 4:24 PM
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 13 2026, 3:50 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 13 2026, 1:22 AM · Traffic

Mar 12 2026

BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 12 2026, 10:22 PM · Traffic
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 12 2026, 8:36 PM · Traffic
BCornwall renamed T419753: Decommission codfw cp hosts cp2027-cp2040 from Decommission codfw cp hosts cp2027-cp2042 to Decommission codfw cp hosts cp2027-cp2040.
Mar 12 2026, 5:41 PM · SRE, DC-Ops, ops-codfw, decommission-hardware, Traffic
BCornwall renamed T419753: Decommission codfw cp hosts cp2027-cp2040 from Decommission codfw cp hosts to Decommission codfw cp hosts cp2027-cp2042.
Mar 12 2026, 5:41 PM · SRE, DC-Ops, ops-codfw, decommission-hardware, Traffic
BCornwall added a project to T419753: Decommission codfw cp hosts cp2027-cp2040: decommission-hardware.
Mar 12 2026, 5:02 PM · SRE, DC-Ops, ops-codfw, decommission-hardware, Traffic
BCornwall closed T419611: hw troubleshooting: Comm Error: Backplane 0 for cp7012 as Resolved.

Thank you for all your work, rob. I was able to reimage and all seems well now. I'll re-open this if anything changes.

Mar 12 2026, 1:25 AM · Traffic, ops-magru, DC-Ops
BCornwall updated the task description for T401832: Upgrade Traffic hosts to trixie.
Mar 12 2026, 1:24 AM · Traffic