wiki_willy
User

User Details

User Since
Apr 16 2019, 9:00 PM (64 w, 2 d)
Availability
Available
LDAP User
Wpao
MediaWiki User
Unknown

Recent Activity

Wed, Jul 8

wiki_willy added a comment to T251632: (Need By: 2020-06-12) rack/setup/install WMCS 10G switches.

@Cmjohnson - I chatted with Arzhel a bit earlier today, and he's going to get these dedicated 10g switches for WMCS in C8 and D5 config'd either later this week or mid-late next week. The stuff happening during the data center failover will be for the switches in row D. Thanks, Willy

Wed, Jul 8, 4:16 PM · cloud-services-team (Hardware), Operations, netops, ops-eqiad, DC-Ops

Tue, Jul 7

wiki_willy assigned T257253: Degraded RAID on db1131 to Jclark-ctr.

@Jclark-ctr - can you send in the RMA for this one, when you get in later today? Thanks, Willy

Tue, Jul 7, 3:04 PM · DBA, ops-eqiad, Operations

Wed, Jul 1

wiki_willy added a comment to T241791: (Need by: 2020-04-02) rack/setup/install relforge100[34].

Hi @Gehel - when I look at the packing slip, it looks like it separates the quantity for internal components per server. Since there were qty=2 servers in the order, it's really a total of 8x SSDs, 16x RAM, 4x power supplies, etc. Thanks, Willy

Wed, Jul 1, 5:21 PM · Patch-For-Review, ops-eqiad, Discovery-Search (Current work), Operations

Mon, Jun 29

wiki_willy added a comment to T251632: (Need By: 2020-06-12) rack/setup/install WMCS 10G switches.

Cool, thanks @ayounsi. I went ahead and fixed it on the accounting spreadsheet. Thanks, Willy

Mon, Jun 29, 9:39 PM · cloud-services-team (Hardware), Operations, netops, ops-eqiad, DC-Ops
wiki_willy added a comment to T251632: (Need By: 2020-06-12) rack/setup/install WMCS 10G switches.

@Jclark-ctr or @Cmjohnson - can one of you doublecheck the s/n's in Netbox? The accounting report says they start with "STA" and Netbox says "TA" so we'll need to confirm which one is accurate.

Mon, Jun 29, 9:17 PM · cloud-services-team (Hardware), Operations, netops, ops-eqiad, DC-Ops

Fri, Jun 26

wiki_willy renamed T255518: (Due By: 2020-07-02) rack/setup/install 3 lightweight hadoop nodes from (Need By: TBD) rack/setup/install 3 lightweight hadoop nodes to (Due By: 2020-07-02) rack/setup/install 3 lightweight hadoop nodes.
Fri, Jun 26, 7:01 PM · ops-eqiad, Operations, DC-Ops
wiki_willy renamed T254892: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] from (Need By: TBD) rack/setup/install an-worker[1096-1101] to (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101].
Fri, Jun 26, 7:00 PM · ops-eqiad, Operations, DC-Ops
wiki_willy renamed T255520: (Due By: 2020-07-17) rack/setup/install <an-test-worker1001-1003> from (Need By: TBD) rack/setup/install <hadoop testing nodes> to (Due By: 2020-07-17) rack/setup/install <hadoop testing nodes>.
Fri, Jun 26, 7:00 PM · ops-eqiad, Operations, DC-Ops
wiki_willy renamed T255072: (Due By: 2020-07-25) rack/setup/install alert1001 from (Need By: TBD) rack/setup/install alert1001 to (Due By: 2020-07-25) rack/setup/install alert1001.
Fri, Jun 26, 6:59 PM · Operations, ops-eqiad, DC-Ops
wiki_willy assigned T256397: Renamed notebook1003 and notebook1004 to Jclark-ctr.
Fri, Jun 26, 6:38 PM · Analytics-Radar, Operations, ops-eqiad, Analytics-Clusters
wiki_willy assigned T170474: Decommisson and store old row D network gear. to Cmjohnson.
Fri, Jun 26, 6:38 PM · Operations, ops-eqiad
wiki_willy reassigned T255406: decommission dbproxy1008.eqiad.wmnet from wiki_willy to Jclark-ctr.
Fri, Jun 26, 6:37 PM · DC-Ops, ops-eqiad, Operations, decommission-hardware

Tue, Jun 23

wiki_willy closed T256101: scs-a8-eqiad CPU usage over 85% as Resolved.
Tue, Jun 23, 6:23 PM · Operations, ops-eqiad, DC-Ops
wiki_willy added a comment to T256101: scs-a8-eqiad CPU usage over 85%.

Yup, it's pending on the install via T228919, so we can reference that going forward. Thanks, Willy

Tue, Jun 23, 6:22 PM · Operations, ops-eqiad, DC-Ops

Mon, Jun 22

wiki_willy assigned T186625: apply hostname labels to bast1002/WMF4749 to Jclark-ctr.
Mon, Jun 22, 9:51 PM · ops-eqiad, DC-Ops, Operations
wiki_willy assigned T254892: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] to Jclark-ctr.
Mon, Jun 22, 9:40 PM · ops-eqiad, Operations, DC-Ops
wiki_willy assigned T255072: (Due By: 2020-07-25) rack/setup/install alert1001 to Jclark-ctr.
Mon, Jun 22, 9:40 PM · Operations, ops-eqiad, DC-Ops
wiki_willy assigned T255927: db1088 crashed to Jclark-ctr.

@Jclark-ctr - I think there are some bbu's leftover from the last time you requested some spares to be ordered, but let me know if not. Thanks, Willy

Mon, Jun 22, 3:27 PM · DBA, Operations

Thu, Jun 18

wiki_willy added a comment to T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory.

@Andrew and @MoritzMuehlenhoff - based on your feedback, I talked to our Dell rep again today, to figure out a different option. There's no alternative of getting credit back, but one thing they can do is to postpone that "seed server" for later on. So for example, if we're procuring some new hardware for WMCS next fiscal, we can take 1 or 2 of the servers in that order (depending on the cost), and Dell will get them to us as "seed servers" at no cost. Does that work for you? Thanks, Willy

Thu, Jun 18, 4:58 PM · cloud-services-team (Kanban), Operations, ops-eqiad, DC-Ops, User-Zppix
wiki_willy added a comment to T253154: (Need by: End of July-2020 ) codfw:rack/setup/new management switches .

@Papaul - if we're short by one msw from upgrading everything, then I would say to not upgrade the most recent msw that you have at codfw. And we can purchase another one next quarter, to replace it. Thanks, Willy

Thu, Jun 18, 4:26 PM · netops, ops-codfw, Operations

Wed, Jun 17

wiki_willy added a comment to T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory.

@Andrew - just wanted to keep you posted with the latest update on this from my bi-weekly meeting with Dell today. They're going to try and replace cloudvirt1015 with a seed server. They no longer carry the same R630 servers, so they're looking at getting us a seed server with equivalent specs. The two options they're looking at are a) an Intel seed server (which would have an Intel card) or b) an AMD seed server (which would have a Broadcom card). I'll get you the exact specs once they're provided to us, which should be around 3-4 business days. Thanks, Willy

Wed, Jun 17, 6:51 PM · cloud-services-team (Kanban), Operations, ops-eqiad, DC-Ops, User-Zppix

Tue, Jun 16

wiki_willy closed T128821: reclaim and return all cisco servers as Resolved.

Remaining Cisco servers picked up by Cisco today. Thanks @Jclark-ctr !

Tue, Jun 16, 11:42 PM · DC-Ops, decommission-hardware, Goal, Operations

Mon, Jun 15

wiki_willy added a comment to T251570: codfw: Next Gen test rack.

Thanks @Papaul - I'm going to paste this link below for future reference when purchasing:

Mon, Jun 15, 4:48 PM · ops-codfw, Operations, procurement
wiki_willy added a comment to T250053: Netbox report accounting icinga alert.

@ayounsi - I think the alert is being triggered from the Finance spreadsheet:

Mon, Jun 15, 4:45 PM · ops-eqiad, DC-Ops, Operations

Fri, Jun 12

wiki_willy added a comment to T253607: Degraded RAID on restbase-dev1004.

Thanks @Cmjohnson - T255293 created for ordering the new disk. Thanks, Willy

Fri, Jun 12, 5:23 PM · ops-eqiad, Operations
wiki_willy added a subtask for T253607: Degraded RAID on restbase-dev1004: Unknown Object (Task).
Fri, Jun 12, 5:23 PM · ops-eqiad, Operations

Thu, Jun 11

Restricted Application edited projects for T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory, added: cloud-services-team (Kanban); removed cloud-services-team (Hardware).

@Andrew and @Jclark-ctr - I met with our Dell account rep today, to try and push for a new replacement server...or at minimum, allow us to ship the server back to Dell for stress testing and fixing themselves, without having to rely solely on TSR reports. There's a couple hoops that we'll still have to get by, but he's going to dig around and see what he can do internally to get around it. @Andrew - are you ok if I forward them the kernel dump from P10788? Thanks, Willy

Thu, Jun 11, 7:43 PM · cloud-services-team (Kanban), Operations, ops-eqiad, DC-Ops, User-Zppix

Jun 9 2020

wiki_willy added a comment to T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory.

Hi @Andrew - I'll sync up with @Jclark-ctr tomorrow to get a summary of the interactions that have taken place with Dell, along with a list of components that have been replaced up to this point....then make one last attempt with our Dell account rep to see if we can get the entire server swapped. If unsuccessful, then yeah...if decommissioning the server is a doable option for your team, that's probably the last-case scenario. Will update you in a few days with the outcome of talking to Dell. Thanks, Willy

Jun 9 2020, 7:51 PM · cloud-services-team (Kanban), Operations, ops-eqiad, DC-Ops, User-Zppix

Jun 8 2020

wiki_willy assigned T254240: Decomission oresrdb2002.codfw.wmnet to Papaul.
Jun 8 2020, 10:23 PM · ops-codfw, Operations, DC-Ops
wiki_willy renamed T251618: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] from (Need By: ASAP) rack/setup/install thanos-be100[1234] to (Need By: 2020-06-20) rack/setup/install thanos-be100[1234].
Jun 8 2020, 8:26 PM · ops-eqiad, Operations, DC-Ops
wiki_willy renamed T251620: (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet from (NEED BY: ASAP) rack/setup/install thanos-fe100[123].eqiad.wmnet to (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet.
Jun 8 2020, 8:25 PM · ops-eqiad, Operations, DC-Ops
wiki_willy renamed T251627: (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet from (Need By: TDB) rack/setup/install cloudvirt10[31-39]eqiad.wmnet to (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet.
Jun 8 2020, 8:24 PM · cloud-services-team (Hardware), ops-eqiad, DC-Ops, Operations
wiki_willy renamed T251619: (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org from (Need By: TBD) rack/setup/install cloudcephosd10[04-15].wikimedia.org to (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org.
Jun 8 2020, 8:23 PM · cloud-services-team (Hardware), ops-eqiad, Operations, DC-Ops
wiki_willy renamed T251632: (Need By: 2020-06-12) rack/setup/install WMCS 10G switches from (Need By: TBD) rack/setup/install WMCS 10G switches to (Need By: 2020-06-12) rack/setup/install WMCS 10G switches.
Jun 8 2020, 8:23 PM · cloud-services-team (Hardware), Operations, netops, ops-eqiad, DC-Ops
wiki_willy renamed T241791: (Need by: 2020-04-02) rack/setup/install relforge100[34] from (Need by: TBD) rack/setup/install relforge100[34] to (Need by: 2020-04-02) rack/setup/install relforge100[34].
Jun 8 2020, 8:22 PM · Patch-For-Review, ops-eqiad, Discovery-Search (Current work), Operations

Jun 5 2020

wiki_willy updated subscribers of T128821: reclaim and return all cisco servers.

Pickup date for remaining Cisco servers at eqiad has been set for June 16. @Jclark-ctr to work with Equinix in prepping for the pickup date. Thanks, Willy

Jun 5 2020, 8:13 PM · DC-Ops, decommission-hardware, Goal, Operations
wiki_willy assigned T254392: Degraded RAID on ms-be2018 to Papaul.
Jun 5 2020, 8:08 PM · Operations, ops-codfw

Jun 3 2020

wiki_willy added a comment to T141128: determine/process/document bios firmware tracking/updating policies.

Sure, no problem @RobH . I just asked Paul to add you to the invite.

Please include me in this, as last time we evaluated this it didn't meet open source OS requirements (It used to require a windows server at some point in the network). =]

Jun 3 2020, 7:34 AM · DC-Ops, Operations

Jun 2 2020

wiki_willy added a comment to T141128: determine/process/document bios firmware tracking/updating policies.

Demo for Dell's System Management Tool set up for next Monday on June 8, to evaluate if it's something we want to use going forward or if it's something the Infra Foundations can use as a blueprint internally for firmware/bios upgrades.

Jun 2 2020, 10:48 PM · DC-Ops, Operations
wiki_willy added a comment to T128821: reclaim and return all cisco servers.

codfw Cisco servers were returned last quarter via Cisco's takeback program:

Jun 2 2020, 10:43 PM · DC-Ops, decommission-hardware, Goal, Operations
wiki_willy added a project to T254272: Update Documentation for dl360 Motherboard Swap: ops-eqiad.
Jun 2 2020, 10:37 PM · ops-eqiad, Operations, DC-Ops
wiki_willy created T254272: Update Documentation for dl360 Motherboard Swap.
Jun 2 2020, 6:31 PM · ops-eqiad, Operations, DC-Ops
wiki_willy assigned T254258: wtp1032 bootlooping on CPU error to Cmjohnson.

@Cmjohnson - looks like the warranty on this one just ended a few months ago, so just let me know whatever you find during troubleshooting, and we can order the part. Thanks, Willy

Jun 2 2020, 6:23 PM · serviceops-radar, Operations, ops-eqiad
wiki_willy added a comment to T250602: db1140 (backup source) crashed .

Thanks @jcrespo , our documentation looks to be a bit outdated, so we'll get this added in

Jun 2 2020, 6:13 PM · DC-Ops, ops-eqiad, Operations, DBA

May 28 2020

wiki_willy added a project to T253856: decom 36 old appservers in eqiad (onsite, dcops): ops-eqiad.
May 28 2020, 3:56 PM · ops-eqiad, DC-Ops, serviceops, Operations

May 27 2020

wiki_willy assigned T253607: Degraded RAID on restbase-dev1004 to Cmjohnson.

@Cmjohnson - looks like we're right on the border with the warranty for this one. Netbox shows May 12, 2017 as the install date. Can you see if the HP site allows us to RMA it? Thanks, Willy

May 27 2020, 11:36 PM · ops-eqiad, Operations
wiki_willy added a comment to T253808: db1138 (s4 master) crashed due to memory issues.

@Marostegui - will do, Papaul and John are working on pulling the TSR right now for the RMA. Thanks, Willy

May 27 2020, 9:15 PM · Wikimedia-Incident, ops-eqiad, Operations, DBA
wiki_willy assigned T253808: db1138 (s4 master) crashed due to memory issues to Jclark-ctr.
May 27 2020, 9:12 PM · Wikimedia-Incident, ops-eqiad, Operations, DBA
wiki_willy added a comment to T250602: db1140 (backup source) crashed .

Thanks @jcrespo . I don't think @Jclark-ctr has been onsite at the data center since the last update, but I'll follow up with him on this when he's out there this week. Thanks, Willy

May 27 2020, 2:56 PM · DC-Ops, ops-eqiad, Operations, DBA
wiki_willy added a comment to T251626: (Need By: TDB) rack/setup/install rdb200[78].

@Papaul - thanks for the heads up. Let me know what the cause ends up being (loose connection, bad part, etc) and I'll relay the information along to the vendor. Thanks, Willy

May 27 2020, 2:52 PM · Operations, ops-codfw, DC-Ops

May 26 2020

wiki_willy assigned T253438: an-presto1004 down to Cmjohnson.
May 26 2020, 10:30 PM · Analytics-Radar, Operations, ops-eqiad

May 20 2020

wiki_willy added a comment to T225121: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300.

Hi @faidon - one of the goals we have this quarter is to resolve all backlogged install tasks from q3 and earlier by end of June. With the limited number of onsite hours and reduced frequency of visits the past couple months, Chris and John have been focused more on other priority items lately. However, @Cmjohnson and I chatted a bit earlier today, and we can get this completed in the next 2-3 weeks.

May 20 2020, 11:49 PM · netops, Operations, ops-eqiad
wiki_willy updated subscribers of T251570: codfw: Next Gen test rack.

Chatted with @wkandek today on the proposed B3 or C3 racks, along with the June 9 or 11th dates/times for the mw servers. He'll check with his team on Monday, and confirm with us afterwards. Thanks, Willy

May 20 2020, 4:44 PM · ops-codfw, Operations, procurement

May 19 2020

wiki_willy added a comment to T251621: (Need By: ASAP) install additional SSDs into prometheus100[34].

Looks like it arrived on Friday, but could be that Equinix hasn't moved them over to the storage room yet:

May 19 2020, 6:27 PM · ops-eqiad, DC-Ops, Operations

May 18 2020

wiki_willy assigned T251618: (Need By: 2020-06-20) rack/setup/install thanos-be100[1234] to Jclark-ctr.
May 18 2020, 9:28 PM · ops-eqiad, Operations, DC-Ops
wiki_willy assigned T251627: (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet to Jclark-ctr.
May 18 2020, 9:27 PM · cloud-services-team (Hardware), ops-eqiad, Operations, DC-Ops
wiki_willy assigned T251619: (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org to Jclark-ctr.
May 18 2020, 9:27 PM · cloud-services-team (Hardware), ops-eqiad, Operations, DC-Ops
wiki_willy assigned T251620: (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet to Jclark-ctr.
May 18 2020, 9:27 PM · ops-eqiad, DC-Ops, Operations
wiki_willy assigned T252797: asw2-d1-eqiad:VCP failure to Jclark-ctr.
May 18 2020, 7:45 PM · Operations, netops, ops-eqiad

May 13 2020

wiki_willy added a project to T128821: reclaim and return all cisco servers: DC-Ops.
May 13 2020, 7:01 PM · DC-Ops, decommission-hardware, Goal, Operations
wiki_willy added a comment to T245161: Track down and replace very old HW.

Checked with Alex last week on the remaining devices missing decom tasks, who said he'd try to get to them when/if possible. @akosiaris - feel free to update this task when things free up a bit for you. Thanks, Willy

May 13 2020, 7:35 AM · DC-Ops

May 12 2020

wiki_willy moved T251077: mw1280 correctable memory errors logged in getsel from Hardware Failure / Troubleshoot to Decommission on the ops-eqiad board.

Thanks @Dzahn . @Jclark-ctr - I'll move this task over to the "decommission" column on the workboard.

May 12 2020, 4:10 PM · serviceops, Operations, ops-eqiad

May 11 2020

wiki_willy added a comment to T251077: mw1280 correctable memory errors logged in getsel.

Hi @elukey or @Dzahn - just wanted to follow up on this, to see if it's worth buying parts to keep this server online, especially with all the previous issues it's had. The 5yr server life cycle ends in April 2021, so hoping decommissioning it might be an option. Thanks, Willy

May 11 2020, 8:35 PM · serviceops, Operations, ops-eqiad
wiki_willy added a member for acl*security_sre: wiki_willy.
May 11 2020, 6:34 PM

May 9 2020

wiki_willy assigned T220144: Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet to Jclark-ctr.
May 9 2020, 12:17 AM · ops-eqiad, Operations, decommission-hardware, Data-Services
wiki_willy moved T251620: (NEED BY: 2020-06-11) rack/setup/install thanos-fe100[123].eqiad.wmnet from Procurement to Racking Tasks on the ops-eqiad board.
May 9 2020, 12:15 AM · ops-eqiad, Operations, DC-Ops

May 8 2020

wiki_willy closed T233578: hw troubleshooting: Memory correctable errors -EDAC- for elastic1029.eqiad.wmnet as Resolved.

@wiki_willy / @Cmjohnson : this server has been decommed as part of T239821, so yep, nothing to do here except get rid of it. Thanks! And sorry for the delay...

May 8 2020, 8:07 PM · Operations, ops-eqiad, DC-Ops
wiki_willy closed T233578: hw troubleshooting: Memory correctable errors -EDAC- for elastic1029.eqiad.wmnet, a subtask of T241784: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030, as Resolved.
May 8 2020, 8:07 PM · Core Platform Team Workboards (Clinic Duty Team), ops-eqiad, Operations
wiki_willy added a comment to T233578: hw troubleshooting: Memory correctable errors -EDAC- for elastic1029.eqiad.wmnet.

Hi @Gehel - just wanted to follow up on this one, to hopefully wrap up the task. I couldn't find too much on the current status of elastic1029 - does your team have this slated to be decommissioned? It's been in production for about 6yrs, so hoping there's a refresh in the works. Thanks, Willy

May 8 2020, 7:57 PM · Operations, ops-eqiad, DC-Ops
wiki_willy added a comment to T252070: Degraded RAID on analytics1055.

Thanks @elukey

May 8 2020, 6:08 PM · Patch-For-Review, Operations, Analytics, DC-Ops, ops-eqiad
wiki_willy claimed T252070: Degraded RAID on analytics1055.
May 8 2020, 5:53 PM · Patch-For-Review, Operations, Analytics, DC-Ops, ops-eqiad
wiki_willy updated subscribers of T252070: Degraded RAID on analytics1055.

It looks like the 5yr server lifecycle will be ending next month. @elukey - would it be possible to decom this server instead? Thanks, Willy

May 8 2020, 5:53 PM · Patch-For-Review, Operations, Analytics, DC-Ops, ops-eqiad
wiki_willy reassigned T251219: cp5012 memory errors from Cmjohnson to RobH.

@Vgutierrez - my apologies, I initially mistook this as a host in eqiad, instead of eqsin, so had assigned it to the wrong person last week. Re-assigning now to @RobH, who might be able to work with our 3rd party vendor on troubleshooting/fixing this. Thanks, Willy

May 8 2020, 5:48 PM · Operations, ops-eqsin, Traffic
wiki_willy closed T250053: Netbox report accounting icinga alert as Resolved.

Looks good now @Cmjohnson. Resolving task

May 8 2020, 5:17 PM · ops-eqiad, DC-Ops, Operations
wiki_willy assigned T251632: (Need By: 2020-06-12) rack/setup/install WMCS 10G switches to Jclark-ctr.
May 8 2020, 7:16 AM · cloud-services-team (Hardware), Operations, netops, ops-eqiad, DC-Ops

May 6 2020

wiki_willy updated subscribers of T141128: determine/process/document bios firmware tracking/updating policies.

Chatted with @Volans yesterday for a little bit on the best way we should approach doing firmware upgrades going forward. My preference is that service owners have ownership and the ability to do it remotely, since it would eliminate a lot of the back and forth coordination that would have to happen if dc-ops owned it. Because reboots are not required for Dells (they are for HPs), Riccardo had a good suggestion that we could potentially combine firmware upgrades along with the kernel upgrades. Dell has a tool that we could potentially use, which may have improved in recent years, so I'll set something up with them to provide us a demo in the meantime. Thanks, Willy

May 6 2020, 8:44 PM · DC-Ops, Operations

May 4 2020

wiki_willy assigned T251725: Netbox report PuppetDB PhysicalHosts critical to Cmjohnson.
May 4 2020, 8:05 PM · Operations, ops-eqiad
wiki_willy added a subtask for T241784: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030: T233578: hw troubleshooting: Memory correctable errors -EDAC- for elastic1029.eqiad.wmnet.
May 4 2020, 8:04 PM · Core Platform Team Workboards (Clinic Duty Team), ops-eqiad, Operations
wiki_willy added a parent task for T233578: hw troubleshooting: Memory correctable errors -EDAC- for elastic1029.eqiad.wmnet: T241784: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030.
May 4 2020, 8:04 PM · Operations, ops-eqiad, DC-Ops
wiki_willy added a comment to T251725: Netbox report PuppetDB PhysicalHosts critical.

Error for elastic1029.eqiad.wmnet tied into T233578 and error for restbase1029 tied in with T241784

May 4 2020, 8:03 PM · Operations, ops-eqiad
wiki_willy assigned T251586: Degraded RAID on kafka-jumbo1001 to Cmjohnson.
May 4 2020, 7:55 PM · ops-eqiad, Operations
wiki_willy updated subscribers of T251586: Degraded RAID on kafka-jumbo1001.

Looks like the warranty on kafka-jumbo1001 is going to end in a few weeks. @Jclark-ctr or @Cmjohnson - can one of you guys troubleshoot and submit the RMA for this part before then? Thanks, Willy

May 4 2020, 6:34 PM · ops-eqiad, Operations

May 1 2020

wiki_willy added a comment to T251077: mw1280 correctable memory errors logged in getsel.

@elukey - looks like we have another year left before the end of the 5yr life cycle mark. Let us know if you have enough in production to be able to decom this host, or if we should order the replacement part. Thanks, Willy

May 1 2020, 6:36 PM · serviceops, Operations, ops-eqiad

Apr 30 2020

wiki_willy added a comment to T241884: Degraded RAID on cloudvirt1024.

I'd also like to point out that we have another system purchased in the same batch T192119, and 6 more with the same configuration T201352 that are running the same workloads without any problems.

@wiki_willy This server and/or RAID card has been giving us problems since February 2019 [1]. Do we have any options here? We seem to be stuck in an infinite loop of firmware upgrades and not getting anywhere.

[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20190213-cloudvps

Apr 30 2020, 10:25 PM · cloud-services-team (Hardware), Patch-For-Review, ops-eqiad, Operations
wiki_willy added a comment to T251481: Recreate RAID on labsdb1011.

@Jclark-ctr mentioned he was going to be onsite on Thursday, so assigning this over to him, to look into tomorrow. Thanks, Willy

Apr 30 2020, 6:36 AM · Operations, ops-eqiad, DC-Ops
wiki_willy assigned T251481: Recreate RAID on labsdb1011 to Jclark-ctr.
Apr 30 2020, 6:34 AM · Operations, ops-eqiad, DC-Ops

Apr 28 2020

wiki_willy closed T250054: Netbox report coherence_rack Icinga alert as Resolved.

The remaining 10x Netbox errors (across all reports) will be handled via the following tasks per site:

Apr 28 2020, 7:04 PM · DC-Ops, ops-ulsfo, ops-eqiad, Operations
wiki_willy added a comment to T250053: Netbox report accounting icinga alert.

Fixed error in Netbox for flerovium-array2. @Jclark-ctr - once you have msw-a2-eqiad added into Julianne's spreadsheet (at the top in line 8) and fix the duplicate cable labels on https://netbox.wikimedia.org/dcim/cables/1585/ and https://netbox.wikimedia.org/dcim/cables/1587/ , then you can close out this request. Thanks, Willy

Apr 28 2020, 6:21 PM · ops-eqiad, DC-Ops, Operations
wiki_willy assigned T251219: cp5012 memory errors to Cmjohnson.

Checked Netbox and the server looks like it's still under warranty until October of this year.

Apr 28 2020, 5:48 PM · Operations, ops-eqsin, Traffic

Apr 27 2020

wiki_willy assigned T250027: restbase1025 reported DIMM issues in getsel to Cmjohnson.
Apr 27 2020, 7:29 PM · ops-eqiad, Operations
wiki_willy assigned T251077: mw1280 correctable memory errors logged in getsel to Jclark-ctr.
Apr 27 2020, 5:49 PM · serviceops, Operations, ops-eqiad

Apr 20 2020

wiki_willy added a comment to T250652: msw1-a6-eqiad flopping up and down mgmt connections on A6.

@Cmjohnson - we have a refresh for the eqiad management switches scheduled to be ordered this quarter, so I'll check with Rob to see when those are coming in. If it's going to be a while, we'll just order a couple more spares beforehand. Thanks, Willy

Apr 20 2020, 3:55 PM · Operations, ops-eqiad
wiki_willy assigned T250652: msw1-a6-eqiad flopping up and down mgmt connections on A6 to Cmjohnson.
Apr 20 2020, 3:54 PM · Operations, ops-eqiad

Apr 17 2020

wiki_willy assigned T250482: scb1001: Memory correctable errors -EDAC- to Cmjohnson.

@Dzahn - when I look at the purchase date in Netbox, it shows this server was first installed 7yrs ago in January 2013. If that's accurate, would it be possible to just decommission this server instead? Thanks, Willy

Apr 17 2020, 4:50 PM · DC-Ops, serviceops, Operations, ops-eqiad

Apr 15 2020

wiki_willy assigned T250257: Interface errors on asw2-c-eqiad - ge-3/0/9 (pc1009) to Cmjohnson.
Apr 15 2020, 5:08 PM · DC-Ops, Operations, ops-eqiad
wiki_willy added a project to T250257: Interface errors on asw2-c-eqiad - ge-3/0/9 (pc1009): DC-Ops.

i

Apr 15 2020, 5:08 PM · DC-Ops, Operations, ops-eqiad

Apr 14 2020

wiki_willy added a comment to T141128: determine/process/document bios firmware tracking/updating policies.

Update - per Dell, there's up to a 30-day delay with the factory approved bios/firmware upgrades from the time that they're posted on the web. So some of the recent bios upgrades we had to perform (like on the cp servers) most likely fell into this bucket.

Apr 14 2020, 6:59 PM · DC-Ops, Operations

Apr 13 2020

wiki_willy added a comment to T141128: determine/process/document bios firmware tracking/updating policies.

I'll take this as an action item to discuss during our next staff meeting. I gave our Dell account rep a call today inquiring about when the latest firmware/bios upgrades get flashed before new hardware is shipped out to us. He'll follow up with one of their sys admins and get back to me this week.

Apr 13 2020, 6:22 PM · DC-Ops, Operations
wiki_willy added a comment to T166368: Wipe of spare/replacement disks.

Hi @faidon - from our last conversation around this topic during the all-hands, if the onsite shredding was successful on March 20, then we could proceed with onsite shredding over drive wiping for just eqiad and codfw going forward. You are correct around spare systems though, and we definitely need to continue wiping those, along with drive wiping any decommissioned servers at the caching sites.

Apr 13 2020, 5:52 PM · DC-Ops, Operations
wiki_willy claimed T141128: determine/process/document bios firmware tracking/updating policies.
Apr 13 2020, 5:26 PM · DC-Ops, Operations