wiki_willy
User

Projects (9)

User Details

User Since
Apr 16 2019, 9:00 PM (365 w, 4 d)
Availability
Available
LDAP User
Wpao
MediaWiki User
WPao (WMF) [ Global Accounts ]

Recent Activity

Fri, Apr 17

wiki_willy updated subscribers of T423719: Repurpose tools-k8s-ctrl[1001-1002],tools-k8s-worker[1001-1008] to wikikube-worker13[73-82].

Thanks @Clement_Goubert! @VRiley-WMF & @Jclark-ctr - this is the hardware being repurposed that I mentioned during our DC-Ops meeting yesterday. Thanks, Willy

Fri, Apr 17, 4:39 PM · ServiceOps new, ServiceOps-Upgrades-Hardware, SRE, ops-eqiad, DC-Ops

Tue, Apr 7

wiki_willy updated subscribers of T422382: Degraded RAID on ml-serve1001.

Thanks @Jclark-ctr. Hi @isarantopoulos - since we're looking to refresh this soon, do you still need us to purchase a replacement drive? Thanks, Willy

Tue, Apr 7, 3:38 PM · Machine-Learning-Team, DC-Ops, SRE, ops-eqiad

Mon, Mar 30

wiki_willy updated subscribers of T380299: Revisit use of the wmf-deployment Gerrit group for deployment-charts rights.

Thanks @RLazarus, we'll get it added to agenda for Infra Foundations meeting on Monday, March 30.

Mon, Mar 30, 5:54 AM · Infrastructure-Foundations, ServiceOps new, Kubernetes

Thu, Mar 26

wiki_willy created T421442: Discrepancy for wikikube-worker[1360-1372].
Thu, Mar 26, 8:42 PM · SRE, ops-eqiad, DC-Ops

Mon, Mar 23

wiki_willy added a comment to T420041: db1253 depooled following host crash.

Adding the ops-eqiad tag and removing ops-eqdfw. @Jclark-ctr will take a look at it a bit later today.

Mon, Mar 23, 1:54 PM · DBA
wiki_willy edited projects for T420041: db1253 depooled following host crash, added: ops-eqiad; removed ops-eqdfw.
Mon, Mar 23, 1:52 PM · DBA

Mar 10 2026

wiki_willy added a comment to T418411: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS.

Hi @ssingh - I was thinking along the lines of seeing if we would be able to calculate SERT ourselves instead of installing the tool. When I dig around a bit, it looks like SERT takes the weighted geometric mean of 65% for CPU, 30% for Memory, and 5% for Storage workloads. Since we only have 25 servers, I was thinking an estimate might be good enough; and we could just let them know that's how we came up with the metric since we can't install SERT on our hosts. Another alternative, we could also consolidate all our questions and concerns together, and Rob could email it out to customercare@ to see if the support team can provide any other guidance.
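
As a rough illustration of the estimate described above, here is a minimal sketch of a weighted geometric mean using the 65% / 30% / 5% weights mentioned in the comment. The function name and the per-workload scores are hypothetical placeholders, and whether the actual SERT tool combines its results exactly this way is an assumption that is not confirmed here.

```python
import math

# Weights as described in the comment above (CPU 65%, memory 30%, storage 5%).
WEIGHTS = {"cpu": 0.65, "memory": 0.30, "storage": 0.05}


def weighted_geometric_mean(scores, weights=WEIGHTS):
    """Return exp(sum(w_i * ln(s_i)) / sum(w_i)) for strictly positive scores."""
    total_weight = sum(weights.values())
    log_sum = sum(weights[name] * math.log(scores[name]) for name in weights)
    return math.exp(log_sum / total_weight)


# Hypothetical per-workload scores for one server (made-up numbers, not measurements).
example_scores = {"cpu": 40.0, "memory": 20.0, "storage": 10.0}
estimate = weighted_geometric_mean(example_scores)
print(round(estimate, 2))
```

With only 25 servers, the same function could be run once per hardware model and the results reported alongside a note explaining how the estimate was derived.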

Mar 10 2026, 11:14 PM · SRE, ops-esams, ops-magru, DC-Ops
wiki_willy added a comment to T418411: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS.

Hey @ssingh - I guess technically, the way it's worded refers to just "new" servers. However, it'd also be a little weird if they're asking for just "new" because I don't think we provided this info last year either. Since our footprint at caching sites is super small, do you think providing an estimate would be possible? When we provide them the info, we could explain the reasoning behind it as well.

Mar 10 2026, 9:30 PM · SRE, ops-esams, ops-magru, DC-Ops
wiki_willy added a comment to T418411: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS.

I was reading the notification for DRMRS a bit more closely, and it looks like March 31 is the due date for Digital Realty to report the data to the EU, but the due date for us to provide the info to Digital Realty is this Friday, March 13. Updating the subject line to reflect the date.

Mar 10 2026, 7:38 AM · SRE, ops-esams, ops-magru, DC-Ops
wiki_willy renamed T418411: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS from Data Required for Energy Efficiency Directive: Due March 31 for DRMRS & May 15 for ESAMS to Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS.
Mar 10 2026, 5:58 AM · SRE, ops-esams, ops-magru, DC-Ops

Feb 25 2026

wiki_willy created T418411: Data Required for Energy Efficiency Directive: Due March 13 for DRMRS & May 15 for ESAMS.
Feb 25 2026, 8:00 PM · SRE, ops-esams, ops-magru, DC-Ops

Feb 24 2026

wiki_willy added a comment to T415002: Unusually high disk errors on the an-worker nodes since upgrading the disks.

Hi @BTullis - sure, that sounds like a good test plan. One thing to keep in mind though is the data center switchover (from codfw to eqiad) will be happening on March 24 and 25. We'll likely see an overall power increase of 15-20kW at eqiad during that time, so hopefully that's enough time to observe any potential changes or findings by then. Here's what I typically use for monitoring power by rack (at the bottom of the page):

Feb 24 2026, 6:11 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), SRE, DC-Ops, ops-eqiad

Feb 17 2026

wiki_willy assigned T417054: move the link from lvs1020 from ssw1-f1-eqiad to ssw1-e1-eqiad to VRiley-WMF.
Feb 17 2026, 9:02 PM · SRE, DC-Ops, ops-eqiad, Sustainability (Incident Followup)
wiki_willy updated subscribers of T366193: Anycast ns[01].wikimedia.org for IPv4.
Feb 17 2026, 8:00 PM · SRE, Traffic
wiki_willy added a comment to T417220: Missing PDU Info in Grafana.

Yeah, it could be because we're using new PDU models for cabinets E9-E14. Let's see what the Observability team recommends.

Feb 17 2026, 4:55 PM · Observability-Metrics
wiki_willy added a comment to T414948: Decommission an-worker11[17-41] but reuse an-worker11[17,18,31,33,34] as dse-k8s-workers.

Sounds good, thanks @brouberol !

Feb 17 2026, 2:14 AM · SRE, DC-Ops, ops-eqiad, Data-Platform-SRE (2026-02-13 - 2026-03-06)

Feb 13 2026

wiki_willy added a comment to T414948: Decommission an-worker11[17-41] but reuse an-worker11[17,18,31,33,34] as dse-k8s-workers.

Sure @BTullis - no problem at all. Can you do me a favor though and note which hostnames these are on the EOL document, referencing this ticket number? That way, we'll leave them off of our radar. Thanks, Willy

Feb 13 2026, 6:00 PM · SRE, DC-Ops, ops-eqiad, Data-Platform-SRE (2026-02-13 - 2026-03-06)

Feb 11 2026

wiki_willy created T417220: Missing PDU Info in Grafana.
Feb 11 2026, 7:18 PM · Observability-Metrics

Feb 6 2026

wiki_willy assigned T403035: Eqiad: Fr-tech expansion to VRiley-WMF.
Feb 6 2026, 9:35 PM · fundraising-tech-ops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE

Feb 4 2026

wiki_willy added a comment to T414411: cp5022 is unreachable.

Sounds good @RobH, that plan works for me as well. Do you know if Jin has access to any of these parts by any chance? If he is able to get a hold of them, he could just add the cost onto our invoice.

Feb 4 2026, 6:41 PM · SRE, DC-Ops, ops-eqsin, Traffic
wiki_willy added a comment to T414411: cp5022 is unreachable.

Hey @RobH - did Jin say what kind of initial troubleshooting he did? Like did he do a power drain, reseat certain parts, etc? I think we can go ahead and purchase parts to see if it'll help fix this, though it'll be helpful knowing what was attempted so far. Thanks, Willy

Feb 4 2026, 6:30 PM · SRE, DC-Ops, ops-eqsin, Traffic

Dec 22 2025

wiki_willy assigned T412458: root user not on newest batches of supermicro servers. to VRiley-WMF.
Dec 22 2025, 6:16 PM · Infrastructure-Foundations, SRE, DC-Ops, ops-codfw, ops-eqiad

Dec 2 2025

wiki_willy added a comment to T411533: Reclaim components from decommed servers.

Swap out R430 spare drives with newer drives (1 for 1 swap), along with memory

Dec 2 2025, 9:20 PM · SRE, DC-Ops, ops-eqiad

Nov 6 2025

wiki_willy updated subscribers of T409374: db1262 is down.

@Jclark-ctr - can you help out @Marostegui with getting a RMA for the DIMM?

Nov 6 2025, 7:43 PM · SRE, DC-Ops, ops-eqiad, Sustainability (Incident Followup), DBA

Oct 28 2025

wiki_willy assigned T408600: decommission es1031.eqiad.wmnet to VRiley-WMF.
Oct 28 2025, 9:17 PM · SRE, DC-Ops, ops-eqiad, DBA, decommission-hardware
wiki_willy assigned T408585: Unresponsive management for ms-be1090.mgmt:22 to VRiley-WMF.
Oct 28 2025, 9:16 PM · SRE, ops-eqiad, DC-Ops

Oct 7 2025

wiki_willy reassigned T406554: cr2-eqiad: fan failure on left tray [Oct 2025] from cmooney to VRiley-WMF.
Oct 7 2025, 8:50 PM · DC-Ops, ops-eqiad, netops, Infrastructure-Foundations, SRE
wiki_willy reassigned T404959: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad from cmooney to VRiley-WMF.
Oct 7 2025, 8:50 PM · DC-Ops, ops-eqiad, Traffic, Infrastructure-Foundations, netops, SRE

Oct 2 2025

wiki_willy added a comment to T401886: asw2-a4-eqiad:PEM 1 is not powered.

Hi @VRiley-WMF - the access to create RMA cases should be resolved now per Juniper, so hopefully it unblocks you on this one. Thanks, Willy

Oct 2 2025, 10:46 PM · SRE, DC-Ops, ops-eqiad

Sep 16 2025

wiki_willy added a comment to T404413: decommission kafka-jumbo100[7-9].eqiad.wmnet.

Hi @brouberol - thanks for opening this task. Is this one ready to be handed over to DC-Ops? Thanks, Willy

Sep 16 2025, 8:37 PM · SRE, DC-Ops, ops-eqiad, decommission-hardware

Sep 5 2025

wiki_willy added a comment to T403855: decommission mwmaint2002.codfw.wmnet.

Thanks @jasmine_ !

Sep 5 2025, 7:05 PM · SRE, DC-Ops, serviceops-deprecated, decommission-hardware, ops-codfw
wiki_willy added a comment to T400442: decommission mwmaint1002.eqiad.wmnet.

Awesome, thanks so much @jasmine_ !

Sep 5 2025, 6:59 PM · SRE, DC-Ops, ops-eqiad, serviceops-deprecated, decommission-hardware
wiki_willy added a comment to T400442: decommission mwmaint1002.eqiad.wmnet.

Hi @Clement_Goubert & @jasmine_ - to follow up on this one, I think we're still waiting on this task to be passed over to Dc-Ops. Can you split this into two different tasks (one for ops-eqiad and one for ops-codfw), for us to unrack the servers? Much appreciated in advance. Thanks, Willy

Sep 5 2025, 6:54 PM · SRE, DC-Ops, ops-eqiad, serviceops-deprecated, decommission-hardware
wiki_willy added a comment to T383227: decommission mw135[8-9], mw136[4-6], mw137[2-3], mw140[0-4], mw1406, mw14[11-13].

Hi @jasmine_ - just checking if you had an ETA on wrapping up wikikube-ctrl1001 for decommissioning? We're hoping to have this Phabricator task passed over to Dc-Ops, to help free up some rack space in eqiad. Much appreciated in advance. Thanks, Willy

Sep 5 2025, 6:48 PM · SRE, ops-eqiad, DC-Ops, Patch-For-Review, serviceops-deprecated, decommission-hardware
wiki_willy added a comment to T397447: Take kafka-jumbo100[7-9] out of service, ready for decom.

Hi @brouberol & @BTullis - I don't think we've seen the Phabricator task for Data Center ops to decommission these servers from the racks. Can you submit that over to us via the Decom workflow below so we can unrack these to free up some rackspace:

Sep 5 2025, 6:45 PM · SRE, DC-Ops, ops-eqiad, Data-Platform-SRE (2025.07.26 - 2025.08.15)

Sep 2 2025

wiki_willy assigned T403031: Eqiad: Replacement top-of-rack switch for rack C1 to VRiley-WMF.
Sep 2 2025, 8:34 PM · DC-Ops, ops-eqiad, Infrastructure-Foundations, SRE

Aug 29 2025

wiki_willy added a comment to T392851: Q4:rack/setup/install cp20[43-58] codfw.

Thanks @RobH. Our account team has changed quite a bit, but you can follow up with Hossam and Dawn after creating the support ticket

Aug 29 2025, 5:22 PM · User-Elukey, SRE, Patch-For-Review, Traffic, ops-codfw, DC-Ops

Aug 27 2025

wiki_willy updated subscribers of T402938: KernelErrors Server cloudcephosd1052 logged kernel errors.

++ @RobH - can you work with John on getting a 25g Broadcom NIC for this one?

Aug 27 2025, 2:58 PM · SRE, DC-Ops, ops-eqiad, cloud-services-team

Aug 26 2025

wiki_willy assigned T401678: decommission an-worker109[6-9].eqiad.wmnet to VRiley-WMF.
Aug 26 2025, 8:47 PM · Essential-Work, Data-Platform-SRE (2025.08.16 - 2025.09.05), SRE, DC-Ops, ops-eqiad, decommission-hardware

Aug 14 2025

wiki_willy updated subscribers of T401504: Degraded RAID on an-worker1128.

Adding @BTullis and @Stevemunene for feedback on an appropriate window for an-worker1128

Aug 14 2025, 7:55 PM · Essential-Work, Data-Platform-SRE (2025.08.16 - 2025.09.05), SRE, ops-eqiad, DC-Ops

Aug 12 2025

wiki_willy added a comment to T400638: Q1:rack/setup/install maps101[1-4].

Awesome, thank you!

Aug 12 2025, 4:38 PM · SRE, ops-eqiad, serviceops-deprecated, DC-Ops
wiki_willy added a comment to T400637: Q1:rack/setup/install maps201[1-4].

Thanks @MoritzMuehlenhoff!

Aug 12 2025, 4:38 PM · SRE, ops-codfw, serviceops-deprecated, DC-Ops

Aug 11 2025

wiki_willy reassigned T400637: Q1:rack/setup/install maps201[1-4] from joanna_borun to MoritzMuehlenhoff.

Hi @MoritzMuehlenhoff - are you able to help confirm the racking details and update site.pp on this one? Thanks, Willy

Aug 11 2025, 6:22 PM · SRE, ops-codfw, serviceops-deprecated, DC-Ops
wiki_willy reassigned T400638: Q1:rack/setup/install maps101[1-4] from joanna_borun to MoritzMuehlenhoff.

Hi @MoritzMuehlenhoff - are you able to confirm the racking details and update the site.pp info on this one? Thanks, Willy

Aug 11 2025, 6:21 PM · SRE, ops-eqiad, serviceops-deprecated, DC-Ops

Aug 5 2025

wiki_willy assigned T400778: Q4: eqiad: (12) PDUs for ML expansion to VRiley-WMF.
Aug 5 2025, 8:53 PM · SRE, ops-eqiad, DC-Ops
wiki_willy assigned T401210: Unresponsive management for cloudcephosd1036.mgmt:22 to VRiley-WMF.
Aug 5 2025, 8:46 PM · SRE, DC-Ops, ops-eqiad
wiki_willy assigned T400877: Install new disk controllers to SM swift backends (eqiad) to VRiley-WMF.
Aug 5 2025, 8:44 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops

Jul 31 2025

wiki_willy assigned T400876: Install new disk controllers to SM swift backends (codfw) to Jhancock.wm.
Jul 31 2025, 9:55 PM · ops-codfw, DC-Ops, SRE, SRE-swift-storage
wiki_willy updated subscribers of T400876: Install new disk controllers to SM swift backends (codfw).
Jul 31 2025, 9:54 PM · ops-codfw, DC-Ops, SRE, SRE-swift-storage
wiki_willy updated subscribers of T400877: Install new disk controllers to SM swift backends (eqiad).

Hi @Jclark-ctr - can you provide info on where the controllers from T393941 are, so that you and @VRiley-WMF can work with Matthew on the controller swap? Thanks, Willy

Jul 31 2025, 9:43 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
wiki_willy updated subscribers of T400876: Install new disk controllers to SM swift backends (codfw).
Jul 31 2025, 8:52 PM · ops-codfw, DC-Ops, SRE, SRE-swift-storage

Jul 29 2025

wiki_willy removed projects from T386860: Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts: ops-codfw, ops-eqiad.
Jul 29 2025, 9:18 PM · Data-Platform-SRE, SRE, DC-Ops
wiki_willy closed T392006: eqiad: second frack parent tracking task as Resolved.

Resolving task, we will be installing two new Fundraising cabinets as a solution instead.

Jul 29 2025, 9:16 PM · SRE, Infrastructure-Foundations, fundraising-tech-ops, netops, DC-Ops, ops-eqiad
wiki_willy removed a project from T394498: SSD firmware update for an-mariadb100[1-2]: ops-eqiad.
Jul 29 2025, 9:15 PM · Essential-Work, Data-Platform-SRE (2025.08.16 - 2025.09.05), DC-Ops
wiki_willy removed a project from T395910: cloudcephosd10[48-52] service implementation: ops-eqiad.
Jul 29 2025, 9:13 PM · cloud-services-team (FY2025/2026-Q1-Q2), Cloud-VPS, SRE, DC-Ops
wiki_willy assigned T398006: Outbound errors on interface cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) to Jclark-ctr.
Jul 29 2025, 9:08 PM · SRE, DC-Ops, ops-eqiad
wiki_willy assigned T396717: Fix PXE miss-configurations to VRiley-WMF.
Jul 29 2025, 9:05 PM · SRE, ops-eqiad, DC-Ops, ops-codfw
wiki_willy assigned T391489: Decom eqiad row B <-> cloudsw links to Jclark-ctr.
Jul 29 2025, 9:02 PM · SRE, DC-Ops, ops-eqiad
wiki_willy assigned T400161: msw1-eqiad: cable me0 dedicated mgmt port directly to the switch itself to VRiley-WMF.
Jul 29 2025, 9:02 PM · ops-eqiad, netops, Infrastructure-Foundations, DC-Ops, SRE
wiki_willy assigned T400159: Discrepencies with cableid & ports on some msw in c/d <-> msw1-eqiad to Jclark-ctr.
Jul 29 2025, 9:00 PM · Infrastructure-Foundations, netops, SRE, DC-Ops, ops-eqiad

Jul 23 2025

wiki_willy reassigned T400211: Install serial port breakout card on sretest2001 from Papaul to Jhancock.wm.

Hi @Jhancock.wm - since @Papaul is out on sabbatical, can you take a look at this one? It's related to debugging some of the Supermicro issues. Thanks, Willy

Jul 23 2025, 4:07 PM · SRE, DC-Ops, ops-codfw, Infrastructure-Foundations

Jul 22 2025

wiki_willy reopened T393042: Q4:rack/setup/install Dell Config H 1P Test Host as "Open".

Re-opening. @Jhancock.wm - per @Marostegui's previous comment:

Jul 22 2025, 10:58 PM · SRE, ops-codfw, DC-Ops

Jul 11 2025

wiki_willy added a comment to T392851: Q4:rack/setup/install cp20[43-58] codfw.

Hi @elukey - can you or @Volans send me an email summarizing everything you need from Dell? I'll add the Technical Account Rep to the email thread to loop you in with him.

If we don't find the issue we'd probably need to contact Dell to verify if we need to do something extra or not. @wiki_willy Hi! This is the task about IDRAC 10 that we were discussing the other day, we'd probably need to get in touch with DELL to figure out what we have to do :(

Jul 11 2025, 4:06 PM · User-Elukey, SRE, Patch-For-Review, Traffic, ops-codfw, DC-Ops

Jun 12 2025

wiki_willy added a comment to T244315: decommission cookbook: add support for decom spreadsheet.

Hey @Volans - I think we've come up with a couple solutions since this task was created. One is providing a monthly Netbox dump to the Accounting team, so that they can see which hosts have been set to "offline" since the previous month. And the second one is creating an ongoing EOL Server list, to track down SRE teams that haven't decommissioned their hardware after the hardware refresh. I think we can resolve this task, but maybe we can brainstorm some other ways of improving the EOL Server list on the side.
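
The monthly comparison described above can be sketched as a simple diff between two snapshots. This is only an illustration: it assumes each dump has already been loaded into a hostname-to-status mapping, it does not call any Netbox API, and the function name, hostnames, and statuses are hypothetical.

```python
def newly_offline(previous, current):
    """Return hostnames that are 'offline' now but were not in the previous snapshot."""
    return sorted(
        host
        for host, status in current.items()
        if status == "offline" and previous.get(host) != "offline"
    )


# Hypothetical snapshots: hostname -> status (made-up data for illustration).
last_month = {"host1001": "active", "host1002": "active", "host1003": "offline"}
this_month = {"host1001": "offline", "host1002": "active", "host1003": "offline"}
print(newly_offline(last_month, this_month))
```

Hosts that were already offline in the earlier snapshot are skipped, so each host would appear in only one month's report.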

Jun 12 2025, 8:05 PM · Infrastructure-Foundations, SRE-tools

Jun 4 2025

wiki_willy added a comment to T393107: Q#:rack/setup/install es104[78].

Thanks @Marostegui!

Jun 4 2025, 5:48 PM · SRE, Data-Persistence, ops-eqiad, DC-Ops

Jun 3 2025

wiki_willy added a comment to T393107: Q#:rack/setup/install es104[78].

Hey @Marostegui - we currently have limited availability on 10g switches, until the 10g switch refresh is completed (likely in Q1). Can these go on 1g switches, until the 10g refresh happens?

Jun 3 2025, 8:22 PM · SRE, Data-Persistence, ops-eqiad, DC-Ops

May 28 2025

wiki_willy added a comment to T393296: db1246 crashed yet again.

I just filled out the registration for the seed server today, so it should be arriving in the next 1-2 weeks. @VRiley-WMF - just a heads up that it won't include the hard drives, so you'll have to move the disks over to the replacement chassis. It also probably won't have the normal packing slip that you see on new procurement requests.

May 28 2025, 7:41 PM · SRE, DC-Ops, ops-eqiad, DBA

May 23 2025

wiki_willy added a comment to T393104: Q4:rack/setup/install ms-be109[2-5].

Hi @MatthewVernon - I just replied back to your email with a more in-depth explanation. The short answer though is that we need more SREs to decommission their previously refreshed hardware, particularly the ones on 10g switches. And for the longer term solution, once we refresh all our existing 1g network switches to 10g via T368959, it will free up a lot more options for Valerie and John to install new servers that require 10g.

May 23 2025, 11:23 PM · SRE, Data-Persistence, SRE-swift-storage, ops-eqiad, DC-Ops

May 22 2025

wiki_willy added a comment to T387231: missing pdu infos for magru.

++ @Papaul & @RobH - is one of you able to review the patch for Tiziano?

May 22 2025, 5:04 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics

May 19 2025

wiki_willy added a comment to T393296: db1246 crashed yet again.

Just a quick update: our Dell Account team is working on a resolution. There's a new open case for requesting a RMA and a server replacement.

May 19 2025, 7:10 PM · SRE, DC-Ops, ops-eqiad, DBA

May 16 2025

wiki_willy added a comment to T393296: db1246 crashed yet again.

Perfect, thanks @VRiley-WMF! I just sent an email out to our Dell Account team and cc'd you and John on it.

May 16 2025, 7:06 PM · SRE, DC-Ops, ops-eqiad, DBA
wiki_willy added a comment to T393296: db1246 crashed yet again.

Awesome, thanks @VRiley-WMF! Can you do me one more favor and summarize what was replaced next to each ticket for each Tech Support request?

May 16 2025, 5:58 PM · SRE, DC-Ops, ops-eqiad, DBA
wiki_willy added a comment to T393296: db1246 crashed yet again.

Hey @VRiley-WMF & @Jclark-ctr - I remember you two were working on tracking down and consolidating all the Dell Support tickets that we've opened for this server. Can you send me the full list of Dell Tech Support ticket numbers that we've created? I'll use that data to try and push for our account team to get us a replacement host. Thanks, Willy

May 16 2025, 5:49 PM · SRE, DC-Ops, ops-eqiad, DBA
wiki_willy added a comment to T394348: Dell SSD Critical Firmware Update.

Hi @BTullis - apologies for the mixup. For some reason, I had mixed up the dates with an-coord100[1,2], which are both offline. I've fixed the notes and removed the (decommissioned) part. Thanks for catching that!

May 16 2025, 5:44 PM · SRE, ops-codfw, ops-eqiad, DC-Ops
wiki_willy updated the task description for T394348: Dell SSD Critical Firmware Update.
May 16 2025, 5:42 PM · SRE, ops-codfw, ops-eqiad, DC-Ops

May 6 2025

wiki_willy updated subscribers of T393296: db1246 crashed yet again.

Hi @Papaul - do you have any other recommendations for this one?

May 6 2025, 12:43 AM · SRE, DC-Ops, ops-eqiad, DBA

May 5 2025

wiki_willy added a comment to T391854: Swap RAID controller on ms-be1091.eqiad.wmnet.

It's about $250 for the RAID controllers, so we can definitely order those to replace the existing ones for Config J. To keep things consistent though, should we order this RAID controller to replace the ones in the Config E and backup hosts also?

May 5 2025, 4:21 PM · SRE, DC-Ops, Infrastructure-Foundations, Data-Persistence, ops-eqiad

Apr 30 2025

wiki_willy added a comment to T387231: missing pdu infos for magru.

Thanks @tappof, that sounds good!

Apr 30 2025, 7:00 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics
wiki_willy added a comment to T392796: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment.

Hi @MatthewVernon - I still have some CapEx underrun, so we could bump up the refresh to this quarter instead. @RobH - can you create a Phabricator task and quote for Matthew to review?

@wiki_willy this node is currently slated for replacement in Q2 as part of "Refresh of ms-be10[60-63]"; depending on costs/timelines of getting a replacement card in, could we pull that forward to Q1?

Apr 30 2025, 5:50 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops
wiki_willy added a comment to T392751: Degraded RAID on db1171.

@VRiley-WMF & @Jclark-ctr - can you grab a spare from one of the decom'd servers for this?

Apr 30 2025, 6:37 AM · DBA, Data-Persistence, Data-Persistence-Backup, SRE, DC-Ops, ops-eqiad

Apr 29 2025

wiki_willy added a comment to T392796: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment.

Sorry, nevermind....it looks like they're HPs

Apr 29 2025, 4:58 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops
wiki_willy added a comment to T392796: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment.

@Jclark-ctr - it looks like we refreshed ms-be105[1-9] towards the end of last year via T371389. Can you check with @MatthewVernon to see if any of those are close to being decommissioned, and see if we can pull the RAID card from one of those machines?

Apr 29 2025, 4:56 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops

Apr 28 2025

wiki_willy updated subscribers of T392424: Degraded RAID on cloudcephmon1004.
Apr 28 2025, 6:00 PM · DC-Ops, SRE, ops-eqiad
wiki_willy updated subscribers of T392751: Degraded RAID on db1171.
Apr 28 2025, 5:58 PM · DBA, Data-Persistence, Data-Persistence-Backup, SRE, ops-eqiad, DC-Ops
wiki_willy added a comment to T387231: missing pdu infos for magru.

Thanks @tappof, that looks perfect. Thanks for splitting it up by rack! I went through and checked the other pop sites, and they all look good as well...except for drmrs. When you get a chance, can you get drmrs split across the two racks also? Thanks so much for your help!

Apr 28 2025, 4:04 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics

Apr 18 2025

wiki_willy added a comment to T387231: missing pdu infos for magru.

Hi @tappof - great job and thank you so much for working on this! It looks like I'm able to see all the information we need for magru in Grafana now.

Apr 18 2025, 5:35 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics
wiki_willy added a comment to T392007: eqiad: determine second frack.

Hey @ayounsi - after some feedback from my staff meeting earlier today, I reached out to Equinix to see if there's any way we'd be able to add circuits to build out a new rack for Fundraising. If everything works out with the feasibility study, we would be able to build a new rack from the ground up in the Machine Learning cage (without taking away anything dedicated to ML or in our current racks). It'll probably take 1-2 weeks though before I know for sure, so we can pause on migrating anything for a bit. Thanks, Willy

Apr 18 2025, 6:52 AM · SRE, Infrastructure-Foundations, fundraising-tech-ops, netops, DC-Ops, ops-eqiad

Mar 27 2025

wiki_willy renamed T387145: Q3:test NIC for lvs1017 from Q3:test NIC for lvs1019 to Q3:test NIC for lvs1017 or lvs1018.
Mar 27 2025, 7:48 PM · SRE, ops-eqiad, Traffic, DC-Ops
wiki_willy updated subscribers of T387145: Q3:test NIC for lvs1017.
Mar 27 2025, 7:47 PM · SRE, ops-eqiad, Traffic, DC-Ops

Mar 12 2025

wiki_willy updated subscribers of T381576: Q2:rack/setup/install ganeti105[34].eqiad.wmnet.

Hi @MoritzMuehlenhoff - the normal hardware spec for Config C is actually 2x 960gb hard drives (not 4x 960gb). I think maybe you were looking at the column for the number of DIMMs (which is 4x DIMMs for Config C) instead of the hard drives below:

Mar 12 2025, 10:31 PM · SRE, ops-eqiad, Infrastructure-Foundations, DC-Ops

Mar 7 2025

wiki_willy updated subscribers of T388221: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet.

++ @Jhancock.wm & @Papaul - per our conversation the other day, this will be the R760xd2 seed server that we received from Dell, which we'll repurpose for Matthew to test and put into production. Thanks, Willy

Mar 7 2025, 5:22 PM · Patch-For-Review, SRE-swift-storage, SRE, ops-codfw, DC-Ops
wiki_willy reassigned T387673: db1246 crashed & rebooted twice from Marostegui to VRiley-WMF.

Reassigning to Valerie to create a new Dell Support task

Mar 7 2025, 2:19 AM · DC-Ops, ops-eqiad, SRE, DBA

Mar 5 2025

wiki_willy updated subscribers of T387673: db1246 crashed & rebooted twice.

Hi @Marostegui - thanks for checking. When I look back at the previous email from Dell Support sent in November, MarcoAntonio says "we can temporarily archive the case, and if the issue reappears, you can open this case within 10days by contacting me via email or we can open a new case making reference to this case if any additional support is needed after 10 days, the record of the server is saved in the TAG history." So I have a feeling your email reply on Sunday didn't reopen the case because it was past 10 days.

Mar 5 2025, 8:17 AM · DC-Ops, ops-eqiad, SRE, DBA

Feb 26 2025

wiki_willy updated subscribers of T387231: missing pdu infos for magru.

Hi @tappof - thanks for looking into this. It looks like the PDUs are in Netbox though; they were added about a year ago in May 2024:

Feb 26 2025, 5:09 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics

Feb 20 2025

wiki_willy added a comment to T386959: Solicit Dell to investigate magru cp temperatures.

Hey @RobH - Sukhbir and I were talking at the offsite after the fix was implemented. While increasing the fan speed helped specifically in this scenario, the other sites are able to get by with just the default fan speed. So we still wanted to get a Dell technician to compare one magru server with the default fan speed to another magru server with the adjusted higher fan speed, to see if they could isolate any other root causes - whether it was something else internal within the servers contributing to the high temps or some type of external environment cause with airflow.

Feb 20 2025, 7:23 PM · ops-magru
wiki_willy assigned T386959: Solicit Dell to investigate magru cp temperatures to RobH.

Thanks for creating this task @ssingh.

Feb 20 2025, 6:47 PM · ops-magru

Dec 11 2024

wiki_willy added a comment to T380673: Kernel error Server cloudvirt1061 may have kernel errors.

@Jclark-ctr - there's nothing that I'm aware of. If there's no additional info in the original procurement task or any historical Phabricator tickets, maybe you can check with WMCS and see if you can rebalance them?

Dec 11 2024, 10:19 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops

Nov 13 2024

wiki_willy added a comment to T375842: decommission mw[1349-1413].

Ah that makes sense, thanks for the info. We'll go ahead and move the server, after the Phabricator task is created. FWIW, all servers being ordered this fiscal year and moving forward will have 10g cards...and the refresh/upgrade to 10g switches in eqiad for rows C and D is supposed to happen probably later in Q4.

The new server is already in service. The main reason I brought this up is the process we had to go through to get a 10G card in wikikube-ctrl1001, because we need the extra bandwidth. I think that to do so, we'll need to choose a server in a rack that has free 10G ports and re-cable. I'll file a separate task

Nov 13 2024, 9:42 PM · SRE, DC-Ops, ops-eqiad, serviceops-deprecated, decommission-hardware

Nov 12 2024

wiki_willy added a comment to T375842: decommission mw[1349-1413].

Hi @akosiaris - thanks for confirming. I think we already ordered the replacement host though via T368933. You're welcome to continue using wikikube-ctrl1001 for a longer period of time though, and dedicate the new server for something else in the meantime if you want?

Nov 12 2024, 9:34 PM · SRE, DC-Ops, ops-eqiad, serviceops-deprecated, decommission-hardware

Nov 6 2024

wiki_willy updated subscribers of T371984: Q1:rack/setup/install backup2012.

Hi @Jhancock.wm and @Papaul - just a heads up, it looks like the test controller kit arrived yesterday:

Nov 6 2024, 7:25 PM · SRE, Data-Persistence, Data-Persistence-Backup, ops-codfw, DC-Ops