wiki_willy
User

Projects (9)

User Details

User Since
Apr 16 2019, 9:00 PM (347 w, 4 d)
Availability
Available
LDAP User
Wpao
MediaWiki User
WPao (WMF) [ Global Accounts ]

Recent Activity

Tue, Dec 2

wiki_willy added a comment to T411533: Reclaim components from decommed servers.

Swap out R430 spare drives with newer drives (1 for 1 swap), along with memory

Tue, Dec 2, 9:20 PM · SRE, DC-Ops, ops-eqiad

Nov 6 2025

wiki_willy updated subscribers of T409374: db1262 is down.

@Jclark-ctr - can you help out @Marostegui with getting an RMA for the DIMM?

Nov 6 2025, 7:43 PM · SRE, DC-Ops, ops-eqiad, Sustainability (Incident Followup), DBA

Oct 28 2025

wiki_willy assigned T408600: decommission es1031.eqiad.wmnet to VRiley-WMF.
Oct 28 2025, 9:17 PM · SRE, DC-Ops, ops-eqiad, DBA, decommission-hardware
wiki_willy assigned T408585: Unresponsive management for ms-be1090.mgmt:22 to VRiley-WMF.
Oct 28 2025, 9:16 PM · SRE, DC-Ops, ops-eqiad

Oct 7 2025

wiki_willy reassigned T406554: cr2-eqiad: fan failure on left tray [Oct 2025] from cmooney to VRiley-WMF.
Oct 7 2025, 8:50 PM · DC-Ops, ops-eqiad, netops, Infrastructure-Foundations, SRE
wiki_willy reassigned T404959: Move lvs1020 link from ssw1-f1-eqiad to ssw1-e1-eqiad from cmooney to VRiley-WMF.
Oct 7 2025, 8:50 PM · DC-Ops, ops-eqiad, Traffic, Infrastructure-Foundations, netops, SRE

Oct 2 2025

wiki_willy added a comment to T401886: asw2-a4-eqiad:PEM 1 is not powered.

Hi @VRiley-WMF - the access to create RMA cases should be resolved now per Juniper, so hopefully it unblocks you on this one. Thanks, Willy

Oct 2 2025, 10:46 PM · SRE, DC-Ops, ops-eqiad

Sep 16 2025

wiki_willy added a comment to T404413: decommission kafka-jumbo100[7-9].eqiad.wmnet.

Hi @brouberol - thanks for opening this task. Is this one ready to be handed over to DC-Ops? Thanks, Willy

Sep 16 2025, 8:37 PM · SRE, DC-Ops, ops-eqiad, decommission-hardware

Sep 5 2025

wiki_willy added a comment to T403855: decommission mwmaint2002.codfw.wmnet.

Thanks @jasmine_ !

Sep 5 2025, 7:05 PM · SRE, DC-Ops, serviceops, decommission-hardware, ops-codfw
wiki_willy added a comment to T400442: decommission mwmaint1002.eqiad.wmnet.

Awesome, thanks so much @jasmine_ !

Sep 5 2025, 6:59 PM · SRE, DC-Ops, ops-eqiad, serviceops, decommission-hardware
wiki_willy added a comment to T400442: decommission mwmaint1002.eqiad.wmnet.

Hi @Clement_Goubert & @jasmine_ - to follow up on this one, I think we're still waiting on this task to be passed over to DC-Ops. Can you split this into two different tasks (one for ops-eqiad and one for ops-codfw), for us to unrack the servers? Much appreciated in advance. Thanks, Willy

Sep 5 2025, 6:54 PM · SRE, DC-Ops, ops-eqiad, serviceops, decommission-hardware
wiki_willy added a comment to T383227: decommission mw135[8-9], mw136[4-6], mw137[2-3], mw140[0-4], mw1406, mw14[11-13].

Hi @jasmine_ - just checking if you had an ETA on wrapping up wikikube-ctrl1001 for decommissioning? We're hoping to have this Phabricator task passed over to DC-Ops, to help free up some rack space in eqiad. Much appreciated in advance. Thanks, Willy

Sep 5 2025, 6:48 PM · SRE, DC-Ops, ops-eqiad, Patch-For-Review, serviceops, decommission-hardware
wiki_willy added a comment to T397447: Take kafka-jumbo100[7-9] out of service, ready for decom.

Hi @brouberol & @BTullis - I don't think we've seen the Phabricator task for Data Center ops to decommission these servers from the racks. Can you submit that over to us via the Decom workflow below so we can unrack these to free up some rackspace:

Sep 5 2025, 6:45 PM · SRE, DC-Ops, ops-eqiad, Data-Platform-SRE (2025.07.26 - 2025.08.15)

Sep 2 2025

wiki_willy assigned T403031: Eqiad: Replacement top-of-rack switch for rack C1 to VRiley-WMF.
Sep 2 2025, 8:34 PM · DC-Ops, ops-eqiad, Infrastructure-Foundations, SRE

Aug 29 2025

wiki_willy added a comment to T392851: Q4:rack/setup/install cp20[43-58] codfw.

Thanks @RobH. Our account team has changed quite a bit, but you can follow up with Hossam and Dawn after creating the support ticket

Aug 29 2025, 5:22 PM · User-Elukey, SRE, Patch-For-Review, Traffic, ops-codfw, DC-Ops

Aug 27 2025

wiki_willy updated subscribers of T402938: KernelErrors Server cloudcephosd1052 logged kernel errors.

++ @RobH - can you work with John on getting a 25g Broadcom NIC for this one?

Aug 27 2025, 2:58 PM · SRE, DC-Ops, ops-eqiad, cloud-services-team

Aug 26 2025

wiki_willy assigned T401678: decommission an-worker109[6-9].eqiad.wmnet to VRiley-WMF.
Aug 26 2025, 8:47 PM · Essential-Work, Data-Platform-SRE (2025.08.16 - 2025.09.05), SRE, DC-Ops, ops-eqiad, decommission-hardware

Aug 14 2025

wiki_willy updated subscribers of T401504: Degraded RAID on an-worker1128.

Adding @BTullis and @Stevemunene for feedback on an appropriate window for an-worker1128

Aug 14 2025, 7:55 PM · Essential-Work, Data-Platform-SRE (2025.08.16 - 2025.09.05), SRE, ops-eqiad, DC-Ops

Aug 12 2025

wiki_willy added a comment to T400638: Q1:rack/setup/install maps101[1-4].

Awesome, thank you!

Aug 12 2025, 4:38 PM · SRE, ops-eqiad, serviceops, DC-Ops
wiki_willy added a comment to T400637: Q1:rack/setup/install maps201[1-4].

Thanks @MoritzMuehlenhoff!

Aug 12 2025, 4:38 PM · SRE, ops-codfw, serviceops, DC-Ops

Aug 11 2025

wiki_willy reassigned T400637: Q1:rack/setup/install maps201[1-4] from joanna_borun to MoritzMuehlenhoff.

Hi @MoritzMuehlenhoff - are you able to help confirm the racking details and update site.pp on this one? Thanks, Willy

Aug 11 2025, 6:22 PM · SRE, ops-codfw, serviceops, DC-Ops
wiki_willy reassigned T400638: Q1:rack/setup/install maps101[1-4] from joanna_borun to MoritzMuehlenhoff.

Hi @MoritzMuehlenhoff - are you able to confirm the racking details and update the site.pp info on this one? Thanks, Willy

Aug 11 2025, 6:21 PM · SRE, ops-eqiad, serviceops, DC-Ops

Aug 5 2025

wiki_willy assigned T400778: Q4: eqiad: (12) PDUs for ML expansion to VRiley-WMF.
Aug 5 2025, 8:53 PM · SRE, ops-eqiad, DC-Ops
wiki_willy assigned T401210: Unresponsive management for cloudcephosd1036.mgmt:22 to VRiley-WMF.
Aug 5 2025, 8:46 PM · SRE, DC-Ops, ops-eqiad
wiki_willy assigned T400877: Install new disk controllers to SM swift backends (eqiad) to VRiley-WMF.
Aug 5 2025, 8:44 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops

Jul 31 2025

wiki_willy assigned T400876: Install new disk controllers to SM swift backends (codfw) to Jhancock.wm.
Jul 31 2025, 9:55 PM · ops-codfw, DC-Ops, SRE, SRE-swift-storage
wiki_willy updated subscribers of T400876: Install new disk controllers to SM swift backends (codfw).
Jul 31 2025, 9:54 PM · ops-codfw, DC-Ops, SRE, SRE-swift-storage
wiki_willy updated subscribers of T400877: Install new disk controllers to SM swift backends (eqiad).

Hi @Jclark-ctr - can you provide info on where the controllers from T393941 are, so that you and @VRiley-WMF can work with Matthew on the controller swap? Thanks, Willy

Jul 31 2025, 9:43 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
wiki_willy updated subscribers of T400876: Install new disk controllers to SM swift backends (codfw).
Jul 31 2025, 8:52 PM · ops-codfw, DC-Ops, SRE, SRE-swift-storage

Jul 29 2025

wiki_willy removed projects from T386860: Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts: ops-codfw, ops-eqiad.
Jul 29 2025, 9:18 PM · Data-Platform-SRE, SRE, DC-Ops
wiki_willy closed T392006: eqiad: second frack parent tracking task as Resolved.

Resolving task, we will be installing two new Fundraising cabinets as a solution instead.

Jul 29 2025, 9:16 PM · SRE, Infrastructure-Foundations, fundraising-tech-ops, netops, DC-Ops, ops-eqiad
wiki_willy removed a project from T394498: SSD firmware update for an-mariadb100[1-2]: ops-eqiad.
Jul 29 2025, 9:15 PM · Essential-Work, Data-Platform-SRE (2025.08.16 - 2025.09.05), DC-Ops
wiki_willy removed a project from T395910: cloudcephosd10[48-52] service implementation: ops-eqiad.
Jul 29 2025, 9:13 PM · cloud-services-team (FY2025/26-Q1-Q2), Cloud-VPS, SRE, DC-Ops
wiki_willy assigned T398006: Outbound errors on interface cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) to Jclark-ctr.
Jul 29 2025, 9:08 PM · SRE, DC-Ops, ops-eqiad
wiki_willy assigned T396717: Fix PXE miss-configurations to VRiley-WMF.
Jul 29 2025, 9:05 PM · SRE, ops-eqiad, DC-Ops, ops-codfw
wiki_willy assigned T391489: Decom eqiad row B <-> cloudsw links to Jclark-ctr.
Jul 29 2025, 9:02 PM · SRE, DC-Ops, ops-eqiad
wiki_willy assigned T400161: msw1-eqiad: cable me0 dedicated mgmt port directly to the switch itself to VRiley-WMF.
Jul 29 2025, 9:02 PM · ops-eqiad, netops, Infrastructure-Foundations, SRE, DC-Ops
wiki_willy assigned T400159: Discrepencies with cableid & ports on some msw in c/d <-> msw1-eqiad to Jclark-ctr.
Jul 29 2025, 9:00 PM · Infrastructure-Foundations, netops, SRE, DC-Ops, ops-eqiad

Jul 23 2025

wiki_willy reassigned T400211: Install serial port breakout card on sretest2001 from Papaul to Jhancock.wm.

Hi @Jhancock.wm - since @Papaul is out on sabbatical, can you take a look at this one? It's related to debugging some of the Supermicro issues. Thanks, Willy

Jul 23 2025, 4:07 PM · SRE, DC-Ops, ops-codfw, Infrastructure-Foundations

Jul 22 2025

wiki_willy reopened T393042: Q4:rack/setup/install Dell Config H 1P Test Host as "Open".

Re-opening. @Jhancock.wm - per @Marostegui's previous comment:

Jul 22 2025, 10:58 PM · SRE, ops-codfw, DC-Ops

Jul 11 2025

wiki_willy added a comment to T392851: Q4:rack/setup/install cp20[43-58] codfw.

Hi @elukey - can you or @Volans send me an email summarizing everything you need from Dell? I'll add the Technical Account Rep to the email thread to loop you in with him.

If we don't find the issue we'd probably need to contact Dell to verify if we need to do something extra or not. @wiki_willy Hi! This is the task about IDRAC 10 that we were discussing the other day, we'd probably need to get in touch with DELL to figure out what we have to do :(

Jul 11 2025, 4:06 PM · User-Elukey, SRE, Patch-For-Review, Traffic, ops-codfw, DC-Ops

Jun 12 2025

wiki_willy added a comment to T244315: decommission cookbook: add support for decom spreadsheet.

Hey @Volans - I think we've come up with a couple solutions since this task was created. One is providing a monthly Netbox dump to the Accounting team, so that they can see which hosts have been set to "offline" since the previous month. And the second one is creating an ongoing EOL Server list, to track down SRE teams that haven't decommissioned their hardware after the hardware refresh. I think we can resolve this task, but maybe we can brainstorm some other ways of improving the EOL Server list on the side.

Jun 12 2025, 8:05 PM · Infrastructure-Foundations, SRE-tools

Jun 4 2025

wiki_willy added a comment to T393107: Q#:rack/setup/install es104[78].

Thanks @Marostegui!

Jun 4 2025, 5:48 PM · SRE, Data-Persistence, ops-eqiad, DC-Ops

Jun 3 2025

wiki_willy added a comment to T393107: Q#:rack/setup/install es104[78].

Hey @Marostegui - we currently have limited availability on 10g switches, until the 10g switch refresh is completed (likely in Q1). Can these go on 1g switches, until the 10g refresh happens?

Jun 3 2025, 8:22 PM · SRE, Data-Persistence, ops-eqiad, DC-Ops

May 28 2025

wiki_willy added a comment to T393296: db1246 crashed yet again.

I just filled out the registration for the seed server today, so it should be arriving in the next 1-2 weeks. @VRiley-WMF - just a heads up that it won't include the hard drives, so you'll have to move the disks over to the replacement chassis. It also probably won't have the normal packing slip that you see on new procurement requests.

May 28 2025, 7:41 PM · SRE, DC-Ops, ops-eqiad, DBA

May 23 2025

wiki_willy added a comment to T393104: Q4:rack/setup/install ms-be109[2-5].

Hi @MatthewVernon - I just replied back to your email with a more in-depth explanation. The short answer though is that we need more SREs to decommission their previously refreshed hardware, particularly the ones on 10g switches. And for the longer term solution, once we refresh all our existing 1g network switches to 10g via T368959, it will free up a lot more options for Valerie and John to install new servers that require 10g.

May 23 2025, 11:23 PM · SRE, Data-Persistence, SRE-swift-storage, ops-eqiad, DC-Ops

May 22 2025

wiki_willy added a comment to T387231: missing pdu infos for magru.

++ @Papaul & @RobH - is one of you able to review the patch for Tiziano?

May 22 2025, 5:04 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics

May 19 2025

wiki_willy added a comment to T393296: db1246 crashed yet again.

Just a quick update: our Dell Account team is working on a resolution. There's a new open case for requesting an RMA and a server replacement.

May 19 2025, 7:10 PM · SRE, DC-Ops, ops-eqiad, DBA

May 16 2025

wiki_willy added a comment to T393296: db1246 crashed yet again.

Perfect, thanks @VRiley-WMF! I just sent an email out to our Dell Account team and cc'd you and John on it.

May 16 2025, 7:06 PM · SRE, DC-Ops, ops-eqiad, DBA
wiki_willy added a comment to T393296: db1246 crashed yet again.

Awesome, thanks @VRiley-WMF! Can you do me one more favor and summarize what was replaced next to each ticket for each Tech Support request?

May 16 2025, 5:58 PM · SRE, DC-Ops, ops-eqiad, DBA
wiki_willy added a comment to T393296: db1246 crashed yet again.

Hey @VRiley-WMF & @Jclark-ctr - I remember you two were working on tracking down and consolidating all the Dell Support tickets that we've opened for this server. Can you send me the full list of Dell Tech Support ticket numbers that we've created? I'll use that data to try and push for our account team to get us a replacement host. Thanks, Willy

May 16 2025, 5:49 PM · SRE, DC-Ops, ops-eqiad, DBA
wiki_willy added a comment to T394348: Dell SSD Critical Firmware Update.

Hi @BTullis - apologies for the mixup. For some reason, I had mixed up the dates with an-coord100[1,2], which are both offline. I've fixed the notes and removed the (decommissioned) part. Thanks for catching that!

May 16 2025, 5:44 PM · SRE, ops-codfw, ops-eqiad, DC-Ops
wiki_willy updated the task description for T394348: Dell SSD Critical Firmware Update.
May 16 2025, 5:42 PM · SRE, ops-codfw, ops-eqiad, DC-Ops

May 6 2025

wiki_willy updated subscribers of T393296: db1246 crashed yet again.

Hi @Papaul - do you have any other recommendations for this one?

May 6 2025, 12:43 AM · SRE, DC-Ops, ops-eqiad, DBA

May 5 2025

wiki_willy added a comment to T391854: Swap RAID controller on ms-be1091.eqiad.wmnet.

It's about $250 for the RAID controllers, so we can definitely order those to replace the existing ones for Config J. To keep things consistent though, should we order this RAID controller to replace the Config E and backup hosts also?

May 5 2025, 4:21 PM · SRE, DC-Ops, Infrastructure-Foundations, Data-Persistence, ops-eqiad

Apr 30 2025

wiki_willy added a comment to T387231: missing pdu infos for magru.

Thanks @tappof, that sounds good!

Apr 30 2025, 7:00 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics
wiki_willy added a comment to T392796: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment.

Hi @MatthewVernon - I still have some CapEx underrun, so we could bump up the refresh to this quarter instead. @RobH - can you create a Phabricator task and quote for Matthew to review?

@wiki_willy this node is currently slated for replacement in Q2 as part of "Refresh of ms-be10[60-63]"; depending on costs/timelines of getting a replacement card in, could we pull that forward to Q1?

Apr 30 2025, 5:50 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops
wiki_willy added a comment to T392751: Degraded RAID on db1171.

@VRiley-WMF & @Jclark-ctr - can you grab a spare from one of the decom'd servers for this?

Apr 30 2025, 6:37 AM · DBA, Data-Persistence, Data-Persistence-Backup, SRE, DC-Ops, ops-eqiad

Apr 29 2025

wiki_willy added a comment to T392796: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment.

Sorry, never mind... it looks like they're HPs

Apr 29 2025, 4:58 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops
wiki_willy added a comment to T392796: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment.

@Jclark-ctr - it looks like we refreshed ms-be105[1-9] towards the end of last year via T371389. Can you check with @MatthewVernon to see if any of those are close to being decommissioned, and see if we can pull the RAID card from one of those machines?

Apr 29 2025, 4:56 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops

Apr 28 2025

wiki_willy updated subscribers of T392424: Degraded RAID on cloudcephmon1004.
Apr 28 2025, 6:00 PM · DC-Ops, SRE, ops-eqiad
wiki_willy updated subscribers of T392751: Degraded RAID on db1171.
Apr 28 2025, 5:58 PM · DBA, Data-Persistence, Data-Persistence-Backup, SRE, ops-eqiad, DC-Ops
wiki_willy added a comment to T387231: missing pdu infos for magru.

Thanks @tappof, that looks perfect. Thanks for splitting it up by rack! I went through and checked the other pop sites, and they all look good as well...except for drmrs. When you get a chance, can you get drmrs split across the two racks also? Thanks so much for your help!

Apr 28 2025, 4:04 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics

Apr 18 2025

wiki_willy added a comment to T387231: missing pdu infos for magru.

Hi @tappof - great job and thank you so much for working on this! It looks like I'm able to see all the information we need for magru in Grafana now.

Apr 18 2025, 5:35 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics
wiki_willy added a comment to T392007: eqiad: determine second frack.

Hey @ayounsi - after some feedback from my staff meeting earlier today, I reached out to Equinix to see if there's any way we'd be able to add circuits to build out a new rack for Fundraising. If everything works out with the feasibility study, we would be able to build a new rack from the ground up in the Machine Learning cage (without taking away anything dedicated to ML or in our current racks). It'll probably take 1-2 weeks though before I know for sure, so we can pause on migrating anything for a bit. Thanks, Willy

Apr 18 2025, 6:52 AM · SRE, Infrastructure-Foundations, fundraising-tech-ops, netops, DC-Ops, ops-eqiad

Mar 27 2025

wiki_willy renamed T387145: Q3:test NIC for lvs1017 from Q3:test NIC for lvs1019 to Q3:test NIC for lvs1017 or lvs1018.
Mar 27 2025, 7:48 PM · SRE, ops-eqiad, Traffic, DC-Ops
wiki_willy updated subscribers of T387145: Q3:test NIC for lvs1017.
Mar 27 2025, 7:47 PM · SRE, ops-eqiad, Traffic, DC-Ops

Mar 12 2025

wiki_willy updated subscribers of T381576: Q2:rack/setup/install ganeti105[34].eqiad.wmnet.

Hi @MoritzMuehlenhoff - the normal hardware spec for Config C is actually 2x 960gb hard drives (not 4x 960gb). I think maybe you were looking at the column for the number of DIMMs (which is 4x DIMMs for Config C) instead of the hard drives below:

Mar 12 2025, 10:31 PM · SRE, ops-eqiad, Infrastructure-Foundations, DC-Ops

Mar 7 2025

wiki_willy updated subscribers of T388221: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet.

++ @Jhancock.wm & @Papaul - per our conversation the other day, this will be the R760xd2 seed server that we received from Dell, which we'll repurpose for Matthew to test and put into production. Thanks, Willy

Mar 7 2025, 5:22 PM · Patch-For-Review, SRE-swift-storage, SRE, ops-codfw, DC-Ops
wiki_willy reassigned T387673: db1246 crashed & rebooted twice from Marostegui to VRiley-WMF.

Reassigning to Valerie to create a new Dell Support task

Mar 7 2025, 2:19 AM · DC-Ops, ops-eqiad, SRE, DBA

Mar 5 2025

wiki_willy updated subscribers of T387673: db1246 crashed & rebooted twice.

Hi @Marostegui - thanks for checking. When I look back at the previous email from Dell Support sent in November, MarcoAntonio says "we can temporarily archive the case, and if the issue reappears, you can open this case within 10days by contacting me via email or we can open a new case making reference to this case if any additional support is needed after 10 days, the record of the server is saved in the TAG history." So I have a feeling your email reply on Sunday didn't reopen the case because it was past 10 days.

Mar 5 2025, 8:17 AM · DC-Ops, ops-eqiad, SRE, DBA

Feb 26 2025

wiki_willy updated subscribers of T387231: missing pdu infos for magru.

Hi @tappof - thanks for looking into this. It looks like the PDUs are in Netbox though; they were added about a year ago in May 2024:

Feb 26 2025, 5:09 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q3), ops-magru, DC-Ops, Observability-Metrics

Feb 20 2025

wiki_willy added a comment to T386959: Solicit Dell to investigate magru cp temperatures.

Hey @RobH - Sukhbir and I were talking at the offsite after the fix was implemented. While increasing the fan speed helped specifically in this scenario, the other sites are able to get by with just the default fan speed. So we still wanted to get a Dell technician to compare one magru server with the default fan speed to another magru server with the adjusted higher fan speed, to see if they could isolate any other root causes - whether it was something else internal within the servers contributing to the high temps or some type of external environment cause with airflow.

Feb 20 2025, 7:23 PM · ops-magru
wiki_willy assigned T386959: Solicit Dell to investigate magru cp temperatures to RobH.

Thanks for creating this task @ssingh.

Feb 20 2025, 6:47 PM · ops-magru

Dec 11 2024

wiki_willy added a comment to T380673: Kernel error Server cloudvirt1061 may have kernel errors.

@Jclark-ctr - there's nothing that I'm aware of. If there's no additional info in the original procurement task or any historical Phabricator tickets, maybe you can check with WMCS and see if you can rebalance them?

Dec 11 2024, 10:19 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops

Nov 13 2024

wiki_willy added a comment to T375842: decommission mw[1349-1413].

Ah that makes sense, thanks for the info. We'll go ahead and move the server, after the Phabricator task is created. FWIW, all servers being ordered this fiscal year and moving forward will have 10g cards...and the refresh/upgrade to 10g switches in eqiad for rows C and D is supposed to happen probably later in Q4.

The new server is already in service. The main reason I brought this up is the process we had to go through to get a 10G card in wikikube-ctrl1001, because we need the extra bandwidth. I think that to do so, we'll need to choose a server in a rack that has free 10G ports and re-cable. I'll file a separate task

Nov 13 2024, 9:42 PM · SRE, DC-Ops, ops-eqiad, serviceops, decommission-hardware

Nov 12 2024

wiki_willy added a comment to T375842: decommission mw[1349-1413].

Hi @akosiaris - thanks for confirming. I think we already ordered the replacement host though via T368933. You're welcome to continue using wikikube-ctrl1001 for a longer period of time though, and dedicate the new server for something else in the meantime if you want?

Nov 12 2024, 9:34 PM · SRE, DC-Ops, ops-eqiad, serviceops, decommission-hardware

Nov 6 2024

wiki_willy updated subscribers of T371984: Q1:rack/setup/install backup2012.

Hi @Jhancock.wm and @Papaul - just a heads up, it looks like the test controller kit arrived yesterday:

Nov 6 2024, 7:25 PM · SRE, Data-Persistence, Data-Persistence-Backup, ops-codfw, DC-Ops
wiki_willy updated subscribers of T371416: Q1:rack/setup/install backup1012.

Just a heads up @Jclark-ctr & @VRiley-WMF - the test controller kit should've arrived yesterday:

Nov 6 2024, 7:23 PM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops

Nov 4 2024

wiki_willy renamed T378828: Q2:rack/setup/install cloudcephosd10[42-47] from Q2:eqiad:(12) Ceph cluster expansion - custom config 10g to Q2:eqiad:(6) Ceph cluster expansion - custom config 10g.
Nov 4 2024, 8:23 PM · SRE, ops-eqiad, DC-Ops

Oct 31 2024

wiki_willy added a comment to T378584: Evaluate hw-raid controllers for Supermicro's Config J.

Met with the Supermicro team today, who believe the RAID kit should be approved either today or tomorrow, and shipped out after that. For reference, here are some details they sent us below:

Oct 31 2024, 6:32 PM · SRE-swift-storage, Infrastructure-Foundations, Data-Persistence, DC-Ops

Oct 30 2024

wiki_willy added a comment to T378584: Evaluate hw-raid controllers for Supermicro's Config J.

Meeting set with Supermicro team on October 31 at 3pm UTC, to discuss the proposed RAID controller option and address any outstanding questions that we have. @Volans, @elukey, @RobH, @Papaul, and myself are all on the invite titled "SMC/Wiki RAID Controller Discussion," but please let Richard from Supermicro know, if you need to propose a different meeting time. Thanks, Willy

Oct 30 2024, 9:15 PM · SRE-swift-storage, Infrastructure-Foundations, Data-Persistence, DC-Ops
wiki_willy added a comment to T371416: Q1:rack/setup/install backup1012.

Thanks so much @jcrespo, I appreciate your flexibility and patience on this.

Oct 30 2024, 8:12 PM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops

Oct 29 2024

wiki_willy added a comment to T371416: Q1:rack/setup/install backup1012.

Thanks for the context, Jaime. Based on your current needs and with the time constraints, it sounds like it'll be better having you continue working on the host in its current state. While we're escalating everything with Supermicro, it's been a bit difficult getting some solid ETAs in place. There's also the possibility that unexpected issues could pop up, and I don't want to potentially delay things any further.

Oct 29 2024, 11:23 PM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops
wiki_willy updated subscribers of T371416: Q1:rack/setup/install backup1012.

Hi @jcrespo - thanks for your feedback on this. My apologies that these Config J servers have been causing a lot of headaches. Unfortunately, we still have to figure out how to best resolve the performance issues from the RAID controller. In your opinion, what would work best? For example, would it work better if we set up a Config J server with the upgraded RAID controller first, and then migrated the data after? Let me know your preference, and we'll do our best to work around and accommodate that.

Oct 29 2024, 4:52 PM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops

Oct 28 2024

wiki_willy reopened T371984: Q1:rack/setup/install backup2012 as "Open".

Re-opening this task, since we have the incorrect RAID controller on the server. @RobH is currently working with Supermicro on getting an upgraded RAID controller onsite to hopefully resolve the performance issues being seen. @RobH - please continue following up with Supermicro with ETAs and statuses, and post them here for visibility. Thanks, Willy

Oct 28 2024, 8:18 PM · SRE, Data-Persistence, Data-Persistence-Backup, ops-codfw, DC-Ops
wiki_willy reopened T371984: Q1:rack/setup/install backup2012, a subtask of T376892: Expand media backup storage available space to 960 TB per datacenter, as Open.
Oct 28 2024, 8:15 PM · media-backups, Data-Persistence-Backup, SRE
wiki_willy reopened T371416: Q1:rack/setup/install backup1012 as "Open".

Re-opening this task, as the server has the incorrect RAID controller. We're working with Supermicro to get an upgraded RAID controller sent onsite, to replace and hopefully resolve the performance issues being seen. @RobH - can you provide frequent updates in this task and work closely with Supermicro on getting the part, until we have this issue resolved? Thanks, Willy

Oct 28 2024, 8:13 PM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops
wiki_willy reopened T371416: Q1:rack/setup/install backup1012, a subtask of T376892: Expand media backup storage available space to 960 TB per datacenter, as Open.
Oct 28 2024, 8:12 PM · media-backups, Data-Persistence-Backup, SRE

Oct 23 2024

wiki_willy added a project to T309598: hosts have Mutiple PTR records : Infrastructure-Foundations.
Oct 23 2024, 8:21 PM · Infrastructure-Foundations, DC-Ops
wiki_willy added a comment to T377568: wmcs codfw hardware changes proposal.

Yup, agreed. If the servers can be reallocated for something else that is currently needed, I think it makes more sense to just repurpose them vs keeping them as spares or decommissioning them.

Oct 23 2024, 6:17 PM · Cloud-VPS, User-aborrero, cloud-services-team (Hardware)

Sep 28 2024

wiki_willy added a comment to T375842: decommission mw[1349-1413].

Sure, no problem @akosiaris. I'm having trouble finding the line item though for wikikube-ctrl1001 on the procurement doc. Is it part of the "Refresh of mw[1349-1413]"?

Sep 28 2024, 1:02 AM · SRE, DC-Ops, ops-eqiad, serviceops, decommission-hardware

Sep 26 2024

wiki_willy added a comment to T373993: CPU temperature issues in cp hosts.

Thanks for providing all the details on this, @ssingh. @RobH - as we chatted about earlier today, we could ask Ascenty to double-check that there are enough perf tiles in the cold aisle, confirm that the blanking panels are in place (and if not, add them), and possibly get a temperature and humidity reading in that area. Thanks, Willy

Sep 26 2024, 4:03 AM · ops-esams, ops-magru, DC-Ops, Traffic

Sep 23 2024

wiki_willy updated subscribers of T375257: Degraded RAID on es1022.

Hi @ABran-WMF - can you check with the onsite engineers @VRiley-WMF and @Jclark-ctr? Please also keep in mind this server is due to be refreshed in Q2, so a new system will be on its way in another month or so.

Sep 23 2024, 5:04 PM · SRE, DBA, ops-eqiad, DC-Ops
wiki_willy updated subscribers of T375382: Post pc1013 crash.

++ @Jclark-ctr & @VRiley-WMF, who can see if there are any parts available from decommissioned servers

Sep 23 2024, 4:58 PM · Wikimedia-production-error, Sustainability (Incident Followup), SRE, DBA

Sep 17 2024

wiki_willy created T375000: Repurposing 2x Decommissioned Servers for Phasing Out Puppet 5.
Sep 17 2024, 6:51 PM · SRE, ops-eqiad, DC-Ops

Sep 12 2024

wiki_willy added a comment to T362922: Audit/consider enabling CPU performance governor on DPE SRE-owned hosts.

++ @Jclark-ctr and @VRiley-WMF - can you confirm if we're ok with the Data Platform team increasing power on the hosts listed above? Thanks, Willy

Sep 12 2024, 9:47 PM · Data-Platform-SRE
wiki_willy assigned T373993: CPU temperature issues in cp hosts to RobH.
Sep 12 2024, 5:28 PM · ops-esams, ops-magru, DC-Ops, Traffic

Aug 12 2024

wiki_willy assigned T372208: Degraded RAID on es1029 to VRiley-WMF.

++ @VRiley-WMF - fyi, this one looks like it's high priority

Aug 12 2024, 2:51 PM · DBA, DC-Ops, SRE, ops-eqiad

Jul 18 2024

wiki_willy added a comment to T360356: Request access to servers Dcops group.

Thanks @elukey, that sounds good!

Jul 18 2024, 12:02 AM · User-Elukey, SRE, Infrastructure-Foundations