User Details
- User Since
- Apr 16 2019, 9:00 PM (347 w, 4 d)
- Availability
- Available
- LDAP User
- Wpao
- MediaWiki User
- WPao (WMF)
Tue, Dec 2
Swap out R430 spare drives with newer drives (1 for 1 swap), along with memory
Nov 6 2025
@Jclark-ctr - can you help out @Marostegui with getting an RMA for the DIMM?
Oct 28 2025
Oct 7 2025
Oct 2 2025
Hi @VRiley-WMF - the access to create RMA cases should be resolved now per Juniper, so hopefully it unblocks you on this one. Thanks, Willy
Sep 16 2025
Hi @brouberol - thanks for opening this task. Is this one ready to be handed over to DC-Ops? Thanks, Willy
Sep 5 2025
Thanks @jasmine_ !
Awesome, thanks so much @jasmine_ !
Hi @Clement_Goubert & @jasmine_ - to follow up on this one, I think we're still waiting on this task to be passed over to DC-Ops. Can you split this into two different tasks (one for ops-eqiad and one for ops-codfw), for us to unrack the servers? Much appreciated in advance. Thanks, Willy
Hi @jasmine_ - just checking if you had an ETA on wrapping up wikikube-ctrl1001 for decommissioning? We're hoping to have this Phabricator task passed over to DC-Ops, to help free up some rack space in eqiad. Much appreciated in advance. Thanks, Willy
Hi @brouberol & @BTullis - I don't think we've seen the Phabricator task for Data Center ops to decommission these servers from the racks. Can you submit that over to us via the Decom workflow below so we can unrack these to free up some rack space:
Sep 2 2025
Aug 29 2025
Thanks @RobH. Our account team has changed quite a bit, but you can follow up with Hossam and Dawn after creating the support ticket
Aug 27 2025
++ @RobH - can you work with John on getting a 25g Broadcom NIC for this one?
Aug 26 2025
Aug 14 2025
Adding @BTullis and @Stevemunene for feedback on an appropriate window for an-worker1128
Aug 12 2025
Awesome, thank you!
Thanks @MoritzMuehlenhoff!
Aug 11 2025
Hi @MoritzMuehlenhoff - are you able to help confirm the racking details and update site.pp on this one? Thanks, Willy
Hi @MoritzMuehlenhoff - are you able to confirm the racking details and update the site.pp info on this one? Thanks, Willy
Aug 5 2025
Jul 31 2025
Hi @Jclark-ctr - can you provide info on where the controllers from T393941 are, so that you and @VRiley-WMF can work with Matthew on the controller swap? Thanks, Willy
Jul 29 2025
Resolving task, we will be installing two new Fundraising cabinets as a solution instead.
Jul 23 2025
Hi @Jhancock.wm - since @Papaul is out on sabbatical, can you take a look at this one? It's related to debugging some of the Supermicro issues. Thanks, Willy
Jul 22 2025
Re-opening. @Jhancock.wm - per @Marostegui's previous comment:
Jul 11 2025
Hi @elukey - can you or @Volans send me an email summarizing everything you need from Dell? I'll add the Technical Account Rep to the email thread to loop you in with him.
If we don't find the issue we'd probably need to contact Dell to verify if we need to do something extra or not. @wiki_willy Hi! This is the task about IDRAC 10 that we were discussing the other day, we'd probably need to get in touch with DELL to figure out what we have to do :(
Jun 12 2025
Hey @Volans - I think we've come up with a couple solutions since this task was created. One is providing a monthly Netbox dump to the Accounting team, so that they can see which hosts have been set to "offline" since the previous month. And the second one is creating an ongoing EOL Server list, to track down SRE teams that haven't decommissioned their hardware after the hardware refresh. I think we can resolve this task, but maybe we can brainstorm some other ways of improving the EOL Server list on the side.
Jun 4 2025
Thanks @Marostegui!
Jun 3 2025
Hey @Marostegui - we currently have limited availability on 10g switches, until the 10g switch refresh is completed (likely in Q1). Can these go on 1g switches, until the 10g refresh happens?
May 28 2025
I just filled out the registration for the seed server today, so it should be arriving in the next 1-2 weeks. @VRiley-WMF - just a heads up that it won't include the hard drives, so you'll have to move the disks over to the replacement chassis. It also probably won't have the normal packing slip that you see on new procurement requests.
May 23 2025
Hi @MatthewVernon - I just replied back to your email with a more in-depth explanation. The short answer though is that we need more SREs to decommission their previously refreshed hardware, particularly the ones on 10g switches. And for the longer term solution, once we refresh all our existing 1g network switches to 10g via T368959, it will free up a lot more options for Valerie and John to install new servers that require 10g.
May 22 2025
May 19 2025
Just a quick update: our Dell Account team is working on a resolution. There's a new open case for requesting an RMA and a server replacement.
May 16 2025
Perfect, thanks @VRiley-WMF! I just sent an email out to our Dell Account team and cc'd you and John on it.
Awesome, thanks @VRiley-WMF! Can you do me one more favor and summarize what was replaced next to each ticket for each Tech Support request?
Hey @VRiley-WMF & @Jclark-ctr - I remember you two were working on tracking down and consolidating all the Dell Support tickets that we've opened for this server. Can you send me the full list of Dell Tech Support ticket numbers that we've created? I'll use that data to try and push for our account team to get us a replacement host. Thanks, Willy
Hi @BTullis - apologies for the mixup. For some reason, I had mixed up the dates with an-coord100[1,2], which are both offline. I've fixed the notes and removed the (decommissioned) part. Thanks for catching that!
May 6 2025
Hi @Papaul - do you have any other recommendations for this one?
May 5 2025
It's about $250 for the RAID controllers, so we can definitely order those to replace the existing ones for Config J. To keep things consistent though, should we order this RAID controller to replace the Config E and backup hosts also?
Apr 30 2025
Thanks @tappof, that sounds good!
Hi @MatthewVernon - I still have some CapEx underrun, so we could bump up the refresh to this quarter instead. @RobH - can you create a Phabricator task and quote for Matthew to review?
@wiki_willy this node is currently slated for replacement in Q2 as part of "Refresh of ms-be10[60-63]"; depending on costs/timelines of getting a replacement card in, could we pull that forward to Q1?
@VRiley-WMF & @Jclark-ctr - can you grab a spare from one of the decom'd servers for this?
Apr 29 2025
Sorry, never mind... it looks like they're HPs
@Jclark-ctr - it looks like we refreshed ms-be105[1-9] towards the end of last year via T371389. Can you check with @MatthewVernon to see if any of those are close to being decommissioned, and see if we can pull the RAID card from one of those machines?
Apr 28 2025
Thanks @tappof, that looks perfect. Thanks for splitting it up by rack! I went through and checked the other pop sites, and they all look good as well...except for drmrs. When you get a chance, can you get drmrs split across the two racks also? Thanks so much for your help!
Apr 18 2025
Hi @tappof - great job and thank you so much for working on this! It looks like I'm able to see all the information we need for magru in Grafana now.
Hey @ayounsi - after some feedback from my staff meeting earlier today, I reached out to Equinix to see if there's any way we'd be able to add circuits to build out a new rack for Fundraising. If everything works out with the feasibility study, we would be able to build a new rack from the ground up in the Machine Learning cage (without taking away anything dedicated to ML or in our current racks). It'll probably take 1-2 weeks though before I know for sure, so we can pause on migrating anything for a bit. Thanks, Willy
Mar 27 2025
Mar 12 2025
Hi @MoritzMuehlenhoff - the normal hardware spec for Config C is actually 2x 960gb hard drives (not 4x 960gb). I think maybe you were looking at the column for the number of DIMMs (which is 4x DIMMs for Config C) instead of the hard drives below:
Mar 7 2025
++ @Jhancock.wm & @Papaul - per our conversation the other day, this will be the R760xd2 seed server that we received from Dell, which we'll repurpose for Matthew to test and put into production. Thanks, Willy
Reassigning to Valerie to create a new Dell Support task
Mar 5 2025
Hi @Marostegui - thanks for checking. When I look back at the previous email from Dell Support sent in November, MarcoAntonio says "we can temporarily archive the case, and if the issue reappears, you can open this case within 10days by contacting me via email or we can open a new case making reference to this case if any additional support is needed after 10 days, the record of the server is saved in the TAG history." So I have a feeling your email reply on Sunday didn't reopen the case because it was past 10 days.
Feb 26 2025
Hi @tappof - thanks for looking into this. It looks like the PDUs are in Netbox though; they were added about a year ago in May 2024:
Feb 20 2025
Hey @RobH - Sukhbir and I were talking at the offsite after the fix was implemented. While increasing the fan speed helped specifically in this scenario, the other sites are able to get by with just the default fan speed. So we still wanted to get a Dell technician to compare one magru server with the default fan speed to another magru server with the adjusted higher fan speed, to see if they could isolate any other root causes - whether it was something else internal within the servers contributing to the high temps or some type of external environment cause with airflow.
Thanks for creating this task @ssingh.
Dec 11 2024
@Jclark-ctr - there's nothing that I'm aware of. If there's no additional info in the original procurement task or any historical Phabricator tickets, maybe you can check with WMCS and see if you can rebalance them?
Nov 13 2024
Ah that makes sense, thanks for the info. We'll go ahead and move the server, after the Phabricator task is created. FWIW, all servers being ordered this fiscal year and moving forward will have 10g cards...and the refresh/upgrade to 10g switches in eqiad for rows C and D is supposed to happen probably later in Q4.
The new server is already in service. The main reason I brought this up is the process we had to go through to get a 10G card in wikikube-ctrl1001, because we need the extra bandwidth. I think that to do so, we'll need to choose a server in a rack that has free 10G ports and re-cable. I'll file a separate task
Nov 12 2024
Hi @akosiaris - thanks for confirming. I think we already ordered the replacement host though via T368933. You're welcome to continue using wikikube-ctrl1001 for a longer period of time though, and dedicate the new server for something else in the meantime if you want?
Nov 6 2024
Hi @Jhancock.wm and @Papaul - just a heads up, it looks like the test controller kit arrived yesterday:
Just a heads up @Jclark-ctr & @VRiley-WMF - the test controller kit should've arrived yesterday:
Nov 4 2024
Oct 31 2024
Met with the Supermicro team today, who believe the RAID kit should be approved either today or tomorrow, and shipped out after that. For reference, here are some details they sent us below:
Oct 30 2024
Meeting set with Supermicro team on October 31 at 3pm UTC, to discuss the proposed RAID controller option and address any outstanding questions that we have. @Volans, @elukey, @RobH, @Papaul, and myself are all on the invite titled "SMC/Wiki RAID Controller Discussion," but please let Richard from Supermicro know, if you need to propose a different meeting time. Thanks, Willy
Thanks so much @jcrespo, I appreciate your flexibility and patience on this.
Oct 29 2024
Thanks for the context, Jaime. Based on your current needs and with the time constraints, it sounds like it'll be better having you continue working on the host in its current state. While we're escalating everything with Supermicro, it's been a bit difficult getting some solid ETAs in place. There's also the possibility that unexpected issues could pop up, and I don't want to potentially delay things any further.
Hi @jcrespo - thanks for your feedback on this. My apologies that these Config J servers have been causing a lot of headaches. Unfortunately, we still have to figure out how to best resolve the performance issues from the RAID controller. In your opinion, what would work best? For example, would it work better if we set up a Config J server with the upgraded RAID controller first, and then migrated the data after? Let me know your preference, and we'll do our best to work around and accommodate that.
Oct 28 2024
Re-opening this task, since we have the incorrect RAID controller on the server. @RobH is currently working with Supermicro on getting an upgraded RAID controller onsite to hopefully resolve the performance issues being seen. @RobH - please continue following up with Supermicro with ETAs and statuses, and post them here for visibility. Thanks, Willy
Re-opening this task, as the server has the incorrect RAID controller. We're working with Supermicro to get an upgraded RAID controller sent onsite, to replace and hopefully resolve the performance issues being seen. @RobH - can you provide frequent updates in this task and work closely with Supermicro on getting the part, until we have this issue resolved? Thanks, Willy
Oct 23 2024
Yup, agreed. If the servers can be reallocated for something else that is currently needed, I think it makes more sense to just repurpose them vs keeping them as spares or decommissioning them.
Sep 28 2024
Sure, no problem @akosiaris. I'm having trouble finding the line item though for wikikube-ctrl1001 on the procurement doc. Is it part of the "Refresh of mw[1349-1413]"?
Sep 26 2024
Thanks for providing all the details on this, @ssingh. @RobH - as we chatted about earlier today, we could ask Ascenty to double-check that there are enough perf tiles in the cold aisle, confirm that the blanking panels are in place (and if not, add them), and possibly get a temperature and humidity reading in that area. Thanks, Willy
Sep 23 2024
Hi @ABran-WMF - can you check with the onsite engineers @VRiley-WMF and @Jclark-ctr? Please also keep in mind this server is due to be refreshed in Q2, so a new system will be on its way in another month or so.
++ @Jclark-ctr & @VRiley-WMF, who can see if there are any parts available from decommissioned servers
Sep 17 2024
Sep 12 2024
++ @Jclark-ctr and @VRiley-WMF - can you confirm if we're ok with the Data Platform team increasing power on the hosts listed above? Thanks, Willy
Aug 12 2024
++ @VRiley-WMF - fyi, this one looks like it's high priority
Jul 18 2024
Thanks @elukey, that sounds good!
