User Details
- User Since
- Apr 16 2019, 9:00 PM (365 w, 4 d)
- Availability
- Available
- LDAP User
- Wpao
- MediaWiki User
- WPao (WMF) [ Global Accounts ]
Fri, Apr 17
Thanks @Clement_Goubert! @VRiley-WMF & @Jclark-ctr - this is the hardware being repurposed that I mentioned during our DC-Ops meeting yesterday. Thanks, Willy
Tue, Apr 7
Thanks @Jclark-ctr. Hi @isarantopoulos - since we're looking to refresh this soon, do you still need us to purchase a replacement drive? Thanks, Willy
Mon, Mar 30
Thanks @RLazarus, we'll get it added to the agenda for the Infra Foundations meeting on Monday, March 30.
Thu, Mar 26
Mon, Mar 23
Adding the ops-eqiad tag and removing ops-eqdfw. @Jclark-ctr will take a look at it a bit later today.
Mar 10 2026
Hi @ssingh - I was thinking along the lines of seeing if we would be able to calculate SERT ourselves instead of installing the tool. When I dug around a bit, it looked like SERT takes the weighted geometric mean of 65% for CPU, 30% for Memory, and 5% for Storage workloads. Since we only have 25 servers, I was thinking an estimate might be good enough, and we could just let them know that's how we came up with the metric since we can't install SERT on our hosts. As another alternative, we could consolidate all our questions and concerns together, and Rob could email them to customercare@ to see if the support team can provide any other guidance.
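A rough sketch of the estimate described above, assuming the 65/30/5 weighting is applied as a weighted geometric mean over per-workload scores. The function name, the weight table, and the example scores are illustrative assumptions, not the SERT tool's actual implementation or real measurements:

```python
import math

# Weights as described in the comment above (assumed, not taken from the SERT spec).
WEIGHTS = {"cpu": 0.65, "memory": 0.30, "storage": 0.05}


def weighted_geometric_mean(scores, weights=WEIGHTS):
    """Return exp(sum(w_i * ln(s_i))) for positive scores, with weights summing to 1."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same workloads")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    if any(value <= 0 for value in scores.values()):
        raise ValueError("scores must be positive")
    return math.exp(sum(weights[name] * math.log(scores[name]) for name in weights))


# Hypothetical per-workload efficiency scores for one server.
example_scores = {"cpu": 100.0, "memory": 10.0, "storage": 1.0}
estimate = weighted_geometric_mean(example_scores)
```

Because the mean is geometric, a low score on a heavily weighted workload (CPU here) pulls the result down far more than a low storage score would, which is worth mentioning when explaining how the estimate was derived.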
Hey @ssingh - I guess technically, the way it's worded refers to just "new" servers. However, it'd also be a little weird if they're asking for just "new" because I don't think we provided this info last year either. Since our footprint at caching sites is super small, do you think providing an estimate would be possible? When we provide them the info, we could explain the reasoning behind it as well.
I was reading the notification for DRMRS a bit more closely, and it looks like March 31 is the due date for Digital Realty to report the data to the EU, but the due date for us to provide the info to Digital Realty is this Friday, March 13. Updating the subject line to reflect the date.
Feb 25 2026
Feb 24 2026
Hi @BTullis - sure, that sounds like a good test plan. One thing to keep in mind though is the data center switchover (from codfw to eqiad) will be happening on March 24 and 25. We'll likely see an overall power increase of 15-20kW at eqiad during that time, so hopefully that's enough time to observe any potential changes or findings by then. Here's what I typically use for monitoring power by racks (at the bottom of the page):
Feb 17 2026
Yeah, it could be because we're using new PDU models for cabinets E9-E14. Let's see what the Observability team recommends.
Sounds good, thanks @brouberol !
Feb 13 2026
Feb 11 2026
Feb 6 2026
Feb 4 2026
Sounds good @RobH, that plan works for me as well. Do you know if Jin has access to any of these parts by any chance? If he is able to get a hold of them, he could just add the cost onto our invoice.
Hey @RobH - did Jin say what kind of initial troubleshooting he did? Like did he do a power drain, reseat certain parts, etc? I think we can go ahead and purchase parts to see if it'll help fix this, though it'll be helpful knowing what was attempted so far. Thanks, Willy
Dec 22 2025
Dec 2 2025
Swap out R430 spare drives with newer drives (1 for 1 swap), along with memory
Nov 6 2025
@Jclark-ctr - can you help out @Marostegui with getting an RMA for the DIMM?
Oct 28 2025
Oct 7 2025
Oct 2 2025
Hi @VRiley-WMF - the access to create RMA cases should be resolved now per Juniper, so hopefully it unblocks you on this one. Thanks, Willy
Sep 16 2025
Hi @brouberol - thanks for opening this task. Is this one ready to be handed over to DC-Ops? Thanks, Willy
Sep 5 2025
Thanks @jasmine_ !
Awesome, thanks so much @jasmine_ !
Hi @Clement_Goubert & @jasmine_ - to follow up on this one, I think we're still waiting on this task to be passed over to Dc-Ops. Can you split this into two different tasks (one for ops-eqiad and one for ops-codfw), for us to unrack the servers? Much appreciated in advance. Thanks, Willy
Hi @jasmine_ - just checking if you had an ETA on wrapping up wikikube-ctrl1001 for decommissioning? We're hoping to have this Phabricator task passed over to Dc-Ops, to help free up some rack space in eqiad. Much appreciated in advance. Thanks, Willy
Hi @brouberol & @BTullis - I don't think we've seen the Phabricator task for Data Center ops to decommission these servers from the racks. Can you submit that over to us via the Decom workflow below so we can unrack these to free up some rackspace:
Sep 2 2025
Aug 29 2025
Thanks @RobH. Our account team has changed quite a bit, but you can follow up with Hossam and Dawn after creating the support ticket.
Aug 27 2025
++ @RobH - can you work with John on getting a 25g Broadcom NIC for this one?
Aug 26 2025
Aug 14 2025
Adding @BTullis and @Stevemunene for feedback on an appropriate window for an-worker1128
Aug 12 2025
Awesome, thank you!
Thanks @MoritzMuehlenhoff!
Aug 11 2025
Hi @MoritzMuehlenhoff - are you able to help confirm the racking details and update site.pp on this one? Thanks, Willy
Hi @MoritzMuehlenhoff - are you able to confirm the racking details and update the site.pp info on this one? Thanks, Willy
Aug 5 2025
Jul 31 2025
Hi @Jclark-ctr - can you provide info on where the controllers from T393941 are, so that you and @VRiley-WMF can work with Matthew on the controller swap? Thanks, Willy
Jul 29 2025
Resolving task, we will be installing two new Fundraising cabinets as a solution instead.
Jul 23 2025
Hi @Jhancock.wm - since @Papaul is out on sabbatical, can you take a look at this one? It's related to debugging some of the Supermicro issues. Thanks, Willy
Jul 22 2025
Re-opening. @Jhancock.wm - per @Marostegui's previous comment:
Jul 11 2025
Hi @elukey - can you or @Volans send me an email summarizing everything you need from Dell? I'll add the Technical Account Rep to the email thread to loop you in with him.
If we don't find the issue, we'd probably need to contact Dell to verify whether we need to do something extra. @wiki_willy Hi! This is the task about iDRAC 10 that we were discussing the other day; we'd probably need to get in touch with Dell to figure out what we have to do :(
Jun 12 2025
Hey @Volans - I think we've come up with a couple of solutions since this task was created. One is providing a monthly Netbox dump to the Accounting team, so that they can see which hosts have been set to "offline" since the previous month. The second is creating an ongoing EOL Server list, to track down SRE teams that haven't decommissioned their hardware after the hardware refresh. I think we can resolve this task, but maybe we can brainstorm some other ways of improving the EOL Server list on the side.
Jun 4 2025
Thanks @Marostegui!
Jun 3 2025
Hey @Marostegui - we currently have limited availability on 10g switches, until the 10g switch refresh is completed (likely in Q1). Can these go on 1g switches, until the 10g refresh happens?
May 28 2025
I just filled out the registration for the seed server today, so it should be arriving in the next 1-2 weeks. @VRiley-WMF - just a heads up that it won't include the hard drives, so you'll have to move the disks over to the replacement chassis. It also probably won't have the normal packing slip that you see on new procurement requests.
May 23 2025
Hi @MatthewVernon - I just replied back to your email with a more in-depth explanation. The short answer though is that we need more SREs to decommission their previously refreshed hardware, particularly the ones on 10g switches. And for the longer term solution, once we refresh all our existing 1g network switches to 10g via T368959, it will free up a lot more options for Valerie and John to install new servers that require 10g.
May 22 2025
May 19 2025
Just a quick update: our Dell Account team is working on a resolution. There's a new open case requesting an RMA and a server replacement.
May 16 2025
Perfect, thanks @VRiley-WMF! I just sent an email out to our Dell Account team and cc'd you and John on it.
Awesome, thanks @VRiley-WMF! Can you do me one more favor and summarize what was replaced next to each ticket for each Tech Support request?
Hey @VRiley-WMF & @Jclark-ctr - I remember you two were working on tracking down and consolidating all the Dell Support tickets that we've opened for this server. Can you send me the full list of Dell Tech Support ticket numbers that we've created? I'll use that data to try and push for our account team to get us a replacement host. Thanks, Willy
Hi @BTullis - apologies for the mixup. For some reason, I had mixed up the dates with an-coord100[1,2], which are both offline. I've fixed the notes and removed the (decommissioned) part. Thanks for catching that!
May 6 2025
Hi @Papaul - do you have any other recommendations for this one?
May 5 2025
It's about $250 for the RAID controllers, so we can definitely order those to replace the existing ones for Config J. To keep things consistent though, should we order this RAID controller to replace the ones in the Config E and backup hosts also?
Apr 30 2025
Thanks @tappof, that sounds good!
Hi @MatthewVernon - I still have some CapEx underrun, so we could bump up the refresh to this quarter instead. @RobH - can you create a Phabricator task and quote for Matthew to review?
@wiki_willy this node is currently slated for replacement in Q2 as part of "Refresh of ms-be10[60-63]"; depending on costs/timelines of getting a replacement card in, could we pull that forward to Q1?
@VRiley-WMF & @Jclark-ctr - can you grab a spare from one of the decom'd servers for this?
Apr 29 2025
Sorry, never mind... it looks like they're HPs.
@Jclark-ctr - it looks like we refreshed ms-be105[1-9] towards the end of last year via T371389. Can you check with @MatthewVernon to see if any of those are close to being decommissioned, and see if we can pull the RAID card from one of those machines?
Apr 28 2025
Thanks @tappof, that looks perfect. Thanks for splitting it up by rack! I went through and checked the other pop sites, and they all look good as well...except for drmrs. When you get a chance, can you get drmrs split across the two racks also? Thanks so much for your help!
Apr 18 2025
Hi @tappof - great job and thank you so much for working on this! It looks like I'm able to see all the information we need for magru in Grafana now.
Hey @ayounsi - after some feedback from my staff meeting earlier today, I reached out to Equinix to see if there's any way we'd be able to add circuits to build out a new rack for Fundraising. If everything works out with the feasibility study, we would be able to build a new rack from the ground up in the Machine Learning cage (without taking away anything dedicated to ML or in our current racks). It'll probably take 1-2 weeks though before I know for sure, so we can pause on migrating anything for a bit. Thanks, Willy
Mar 27 2025
Mar 12 2025
Hi @MoritzMuehlenhoff - the normal hardware specs for Config C are actually 2x 960gb hard drives (not 4x 960gb). I think maybe you were looking at the column for the number of DIMMs (which is 4x DIMMs for Config C) instead of the hard drives below:
Mar 7 2025
++ @Jhancock.wm & @Papaul - per our conversation the other day, this will be the R760xd2 seed server that we received from Dell, which we'll repurpose for Matthew to test and put into production. Thanks, Willy
Reassigning to Valerie to create a new Dell Support task
Mar 5 2025
Hi @Marostegui - thanks for checking. When I look back at the previous email from Dell Support sent in November, MarcoAntonio says "we can temporarily archive the case, and if the issue reappears, you can open this case within 10days by contacting me via email or we can open a new case making reference to this case if any additional support is needed after 10 days, the record of the server is saved in the TAG history." So I have a feeling your email reply on Sunday didn't reopen the case because it was past 10 days.
Feb 26 2025
Hi @tappof - thanks for looking into this. It looks like the PDUs are in Netbox though; they were added about a year ago in May 2024:
Feb 20 2025
Hey @RobH - Sukhbir and I were talking at the offsite after the fix was implemented. While increasing the fan speed helped specifically in this scenario, the other sites are able to get by with just the default fan speed. So we still wanted to get a Dell technician to compare one magru server with the default fan speed to another magru server with the adjusted higher fan speed, to see if they could isolate any other root causes - whether it was something else internal within the servers contributing to the high temps or some type of external environment cause with airflow.
Thanks for creating this task @ssingh.
Dec 11 2024
@Jclark-ctr - there's nothing that I'm aware of. If there's no additional info in the original procurement task or any historical Phabricator tickets, maybe you can check with WMCS and see if you can rebalance them?
Nov 13 2024
Ah that makes sense, thanks for the info. We'll go ahead and move the server, after the Phabricator task is created. FWIW, all servers being ordered this fiscal year and moving forward will have 10g cards...and the refresh/upgrade to 10g switches in eqiad for rows C and D is supposed to happen probably later in Q4.
The new server is already in service. The main reason I brought this up is the process we had to go through to get a 10G card in wikikube-ctrl1001, because we need the extra bandwidth. I think that to do so, we'll need to choose a server in a rack that has free 10G ports and re-cable. I'll file a separate task.
Nov 12 2024
Hi @akosiaris - thanks for confirming. I think we already ordered the replacement host though via T368933. You're welcome to continue using wikikube-ctrl1001 for a longer period of time though, and dedicate the new server for something else in the meantime if you want?
Nov 6 2024
Hi @Jhancock.wm and @Papaul - just a heads up, it looks like the test controller kit arrived yesterday:
