wiki_willy
User

User Details

User Since
Apr 16 2019, 9:00 PM (22 w, 6 h)
Availability
Available
LDAP User
Wpao
MediaWiki User
Unknown

Recent Activity

Yesterday

wiki_willy updated subscribers of T227025: (Need By: August 31) rack/setup/install (3) new zookeeper nodes.

@Jclark-ctr - since Chris had to use a sick day, can one of you guys take a look at this for Luca? Thanks, Willy

Tue, Sep 17, 7:36 PM · User-Elukey, Operations, ops-eqiad

Mon, Sep 16

wiki_willy added a comment to T227541: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC).

Checked with @Cmjohnson , who says he'll follow up to check the connections.

Mon, Sep 16, 8:38 PM · DC-Ops, Operations, ops-eqiad
wiki_willy updated subscribers of T232069: analytics1045 - RAID failure and /var/lib/hadoop/data/j can't be mounted.

Hi @Dzahn @jbond - looks like this host is out of warranty, and about 3/4 of a year away from a hardware refresh, so just wanted to double-check if you're considering retiring this system soon or if you'd like us to purchase the replacement hardware part? Thanks, Willy

Mon, Sep 16, 7:28 PM · ops-eqiad, DC-Ops, Analytics, Operations, Analytics-Cluster
wiki_willy assigned T232069: analytics1045 - RAID failure and /var/lib/hadoop/data/j can't be mounted to Cmjohnson.
Mon, Sep 16, 7:24 PM · ops-eqiad, DC-Ops, Analytics, Operations, Analytics-Cluster
wiki_willy assigned T227133: a8-eqiad pdu refresh (Date TBA) to Cmjohnson.

Originally scheduled for Thursday 9/19, but will reschedule for a later date, since this is a network rack.

Mon, Sep 16, 5:04 PM · DC-Ops, Operations, ops-eqiad
wiki_willy renamed T227133: a8-eqiad pdu refresh (Date TBA) from a8-eqiad pdu refresh (Thursday 9/19 @11am UTC) to a8-eqiad pdu refresh (Date TBA).
Mon, Sep 16, 5:03 PM · DC-Ops, Operations, ops-eqiad
wiki_willy assigned T227539: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) to Cmjohnson.

@Cmjohnson - good to go for tomorrow's PDU upgrade, but please confirm with @Marostegui before you start that DBs have been depooled. Thanks, Willy

Mon, Sep 16, 5:01 PM · DC-Ops, Operations, ops-eqiad
wiki_willy added a comment to T232882: backup1001 failed disk (degraded RAID).

Thanks @Jclark-ctr , can you have the drive replaced this week? Also, you might need to coordinate with @jcrespo via IRC to get a couple other things completed to get backup1001 up and running. Thanks, Willy

Mon, Sep 16, 4:28 PM · ops-eqiad, Operations

Fri, Sep 13

wiki_willy assigned T232367: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet to Jclark-ctr.
Fri, Sep 13, 10:02 PM · Operations, ops-eqiad
wiki_willy updated subscribers of T229452: db1114 crashed due to memory issues (server under warranty).

@Cmjohnson or @Jclark-ctr - can one of you guys check this out early next week? Thanks, Willy

Fri, Sep 13, 9:18 PM · ops-eqiad, DBA, Operations
wiki_willy added a comment to T229612: asw2-c-eqiad:xe-2/0/45 inbound interface errors.

@Cmjohnson - can you provide an update on this one next week? Thanks, Willy

Fri, Sep 13, 9:05 PM · netops, Operations, ops-eqiad
wiki_willy added a comment to T231525: cp1085 - IPMI not working.

Hi @Dzahn - just following up on this one, to see when the server can be taken down. Thanks, Willy

Fri, Sep 13, 9:04 PM · ops-eqiad, Traffic, Operations
wiki_willy assigned T232882: backup1001 failed disk (degraded RAID) to Jclark-ctr.
Fri, Sep 13, 6:24 PM · ops-eqiad, Operations

Wed, Sep 11

wiki_willy closed T232591: helium array has slot 3 disk failed as Resolved.
Wed, Sep 11, 4:31 PM · ops-eqiad, Operations

Tue, Sep 10

wiki_willy added a comment to T224794: Degraded RAID on helium.

Talked to @akosiaris, who will open up a new task to replace the newly failed drive. We ordered a few of them last time, so hopefully we'll have more spares lying around.

Tue, Sep 10, 7:58 PM · ops-eqiad, Operations
wiki_willy added a comment to T232505: Degraded RAID on db2060.

Thanks @Marostegui

Tue, Sep 10, 6:58 PM · Operations, ops-codfw
wiki_willy added a comment to T222950: (OoW) cloudvirt1006 - RAID battery failed.

@Cmjohnson - just following up to see if we have the correct part

Tue, Sep 10, 6:33 PM · cloud-services-team, ops-eqiad, Operations
wiki_willy added a comment to T228606: Degraded RAID on elastic1046.

@Cmjohnson - could be the drive isn't seated securely or possibly a loose cable/connection

Tue, Sep 10, 6:29 PM · Discovery-Search (Current work), ops-eqiad, Operations
wiki_willy added a comment to T232505: Degraded RAID on db2060.

Looks like the warranty expired on Jan. 14, 2018. @Papaul - let me know if you have any spares lying around or if we need to purchase a new disk. Thanks, Willy

Tue, Sep 10, 6:22 PM · Operations, ops-codfw
wiki_willy reassigned T232505: Degraded RAID on db2060 from Cmjohnson to Papaul.
Tue, Sep 10, 6:21 PM · Operations, ops-codfw
wiki_willy assigned T232505: Degraded RAID on db2060 to Cmjohnson.
Tue, Sep 10, 6:16 PM · Operations, ops-codfw

Mon, Sep 9

wiki_willy reassigned T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory from wiki_willy to Cmjohnson.

Here's the response I got from Dell (pasted below). @Cmjohnson or @Jclark-ctr : can one of you guys call Dell at 1-800-456-3355, explain to them the numerous parts we've already replaced (and that it continues to crash on load) and get them to analyze the logs for the system? Let me know how it goes.

Mon, Sep 9, 5:20 PM · Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban)
wiki_willy moved T232367: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet from Backlog to Racking Tasks on the ops-eqiad board.
Mon, Sep 9, 5:05 PM · Operations, ops-eqiad
wiki_willy renamed T226782: a1-eqiad pdu refresh (Date TBD) from a1-eqiad pdu refresh (Thursday 9/12 @11am UTC) to a1-eqiad pdu refresh (Date TBD).
Mon, Sep 9, 4:44 PM · DC-Ops, Operations, ops-eqiad
wiki_willy added a comment to T226782: a1-eqiad pdu refresh (Date TBD).

Per SRE meeting, we'll be rescheduling the PDU upgrades for this rack to a later date TBA due to a lot of the ongoing work related to the recent outages.

Mon, Sep 9, 4:44 PM · DC-Ops, Operations, ops-eqiad
wiki_willy assigned T226782: a1-eqiad pdu refresh (Date TBD) to Cmjohnson.
Mon, Sep 9, 3:57 PM · DC-Ops, Operations, ops-eqiad
wiki_willy assigned T227541: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) to Cmjohnson.
Mon, Sep 9, 3:57 PM · DC-Ops, Operations, ops-eqiad

Wed, Sep 4

wiki_willy added a comment to T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory.

Emailed our Dell account rep, who responded that they will look into what our options are and get back to us. Thanks, Willy

Wed, Sep 4, 10:33 PM · Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban)
wiki_willy reassigned T230575: Degraded RAID on cloudvirt1018 from wiki_willy to Bstorm.

Assigning to @Bstorm to follow up on the previous comment.

Wed, Sep 4, 9:17 PM · ops-eqiad, Operations
wiki_willy added a comment to T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory.

Thanks @Andrew - I'll reach out to our Account Rep, to see if something else can be done.

Wed, Sep 4, 9:15 PM · Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban)
wiki_willy added a comment to T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory.

Hi @Andrew - I mentioned the ongoing issues with this machine to our Dell account rep last week, since we've basically replaced every CPU/DIMM/MB on this box. They mentioned we could install Live Optics to evaluate load, but I'm not sure this is something we want to run on our hardware. Do you have another cloudvirt machine up and running right now on the same hardware specs? Essentially running at the same CPU usage...mainly so we can compare and try to isolate any other type of config differences between them.

Wed, Sep 4, 8:54 PM · Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban)
wiki_willy added a comment to T231066: Host decommission improvements.

Hi @Volans - I was wondering, in the meantime, would it be possible to give all the FTE dc-ops engineers the necessary permissions to install and decom hosts from beginning to end? Maybe either by adding these rights to a dc-ops group or granting root access for Papaul? He's definitely going to need the ability to do all this in the next 1.5 months, since he'll be in Amsterdam refreshing the entire site. Thanks, Willy

Wed, Sep 4, 2:05 AM · Patch-For-Review, DC-Ops, SRE-tools

Fri, Aug 30

wiki_willy reassigned T225121: (Need By: Sept 30) upgrade msw1-eqiad from EX4200 to EX4300 from Papaul to Cmjohnson.
Fri, Aug 30, 6:19 PM · netops, ops-eqiad, Operations
wiki_willy reassigned T227025: (Need By: August 31) rack/setup/install (3) new zookeeper nodes from elukey to Cmjohnson.

Assigning over to @Cmjohnson for @elukey 's question.

Fri, Aug 30, 6:17 PM · User-Elukey, Operations, ops-eqiad
wiki_willy assigned T231525: cp1085 - IPMI not working to Cmjohnson.
Fri, Aug 30, 6:15 PM · ops-eqiad, Traffic, Operations
wiki_willy added a comment to T231638: db1074 crashed: Broken BBU.

Thanks for confirming @Cmjohnson , subtask T231670 created for Rob to order the part. Thanks, Willy

Fri, Aug 30, 5:31 PM · ops-eqiad, Operations, DBA
wiki_willy assigned T231638: db1074 crashed: Broken BBU to Cmjohnson.

@Cmjohnson @Jclark-ctr - do you guys know offhand if we have a spare BBU lying around from a decom'd server by any chance? If not, let me know and we'll order the part.

Fri, Aug 30, 5:24 PM · ops-eqiad, Operations, DBA

Tue, Aug 27

wiki_willy added a comment to T230575: Degraded RAID on cloudvirt1018.

@Bstorm - I was able to confirm we originally ordered this machine to include 1.6TB drives via https://phabricator.wikimedia.org/T155075 , but wasn't able to find any other tasks that showed when/how they were replaced with 1.9TB drives (which Dell won't support). Do you have any details from previous records on where these 1.9TB disks came from? (i.e. swapped from another server, ordered separately, etc.) Thanks, Willy

Tue, Aug 27, 9:53 PM · ops-eqiad, Operations
wiki_willy closed T229134: Degraded RAID on sulfur as Resolved.
Tue, Aug 27, 8:48 PM · ops-eqiad, Operations
wiki_willy added a comment to T229134: Degraded RAID on sulfur.

@Volans - ah, that makes sense. Thanks, let's just resolve this task then.

Tue, Aug 27, 8:28 PM · ops-eqiad, Operations
wiki_willy added a comment to T224794: Degraded RAID on helium.

@Jclark-ctr - can we resolve this task? Thanks, Willy

Tue, Aug 27, 8:16 PM · ops-eqiad, Operations
wiki_willy updated subscribers of T229134: Degraded RAID on sulfur.

@Volans - hey Riccardo, not sure if you're the right person for this, but thought I'd try asking you. Is there a different output we can get for this alert, to help us isolate the disk issue a bit more?

Tue, Aug 27, 8:13 PM · ops-eqiad, Operations

Fri, Aug 23

wiki_willy updated subscribers of T200209: Decom graphite2001/WMF6160 .

@RobH - I'll leave it up to @Papaul, since he has a better idea on the chances of reusing the parts on this system. Thanks, Willy

Fri, Aug 23, 7:22 PM · Patch-For-Review, decommission, ops-codfw, Operations, observability
wiki_willy assigned T231056: Degraded RAID on db2056 to Papaul.
Fri, Aug 23, 12:07 AM · Operations, ops-codfw

Tue, Aug 20

wiki_willy added a comment to T228606: Degraded RAID on elastic1046.

Confirmed by Chris that the drive arrived on August 8

Tue, Aug 20, 6:47 PM · Discovery-Search (Current work), ops-eqiad, Operations
wiki_willy added a comment to T227539: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC).

Thanks @Marostegui , I appreciate it.

Tue, Aug 20, 7:47 AM · DC-Ops, Operations, ops-eqiad

Mon, Aug 19

wiki_willy added a comment to T227142: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC).

@Marostegui - I would say just go for it and fail out in advance, if it's not too much trouble. Master DBs are very critical, so my opinion is to just take the extra precautionary measures. Thanks, Willy

Mon, Aug 19, 6:44 PM · DC-Ops, Operations, ops-eqiad
wiki_willy added a comment to T227539: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC).

@Marostegui - I'll defer to Faidon or Mark for their opinion, but my suggestion is to go ahead and fail out in advance if it's not too much of a hassle. The success rate of us upgrading PDUs without any issues is pretty good, but unexpected accidents can occur, and master DBs are very critical to the infrastructure.

Mon, Aug 19, 6:41 PM · DC-Ops, Operations, ops-eqiad

Aug 16 2019

wiki_willy added a comment to T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory.

Thanks Chris, hopefully this will solve things.

Aug 16 2019, 4:36 PM · Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban)

Aug 15 2019

wiki_willy assigned T230575: Degraded RAID on cloudvirt1018 to Cmjohnson.
Aug 15 2019, 9:08 PM · ops-eqiad, Operations
wiki_willy renamed T227543: b8-eqiad pdu refresh (Thursday 10/31 @11am UTC) from b8-eqiad pdu refresh to b8-eqiad pdu refresh (Thursday 10/31 @11am UTC).
Aug 15 2019, 5:39 PM · DC-Ops, Operations, ops-eqiad
wiki_willy renamed T227542: b7-eqiad pdu refresh (Tuesday 11/5 @11am UTC) from b7-eqiad pdu refresh to b7-eqiad pdu refresh (Tuesday 11/5 @11am UTC).
Aug 15 2019, 5:38 PM · DC-Ops, Operations, ops-eqiad
wiki_willy renamed T227541: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) from b6-eqiad pdu refresh to b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC).
Aug 15 2019, 5:37 PM · DC-Ops, Operations, ops-eqiad
wiki_willy renamed T227540: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) from b4-eqiad pdu refresh to b4-eqiad pdu refresh (Thursday 10/24 @11am UTC).
Aug 15 2019, 5:36 PM · DC-Ops, Operations, ops-eqiad
wiki_willy renamed T227539: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) from b3-eqiad pdu refresh to b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC).
Aug 15 2019, 5:35 PM · DC-Ops, Operations, ops-eqiad
wiki_willy renamed T227538: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) from b2-eqiad pdu refresh to b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC).
Aug 15 2019, 5:34 PM · DC-Ops, Operations, ops-eqiad
wiki_willy renamed T227536: b1-eqiad pdu refresh (Thursday 10/10 @11am UTC) from b1-eqiad pdu refresh to b1-eqiad pdu refresh (Thursday 10/10 @11am UTC).
Aug 15 2019, 5:33 PM · DC-Ops, Operations, ops-eqiad
wiki_willy renamed T227133: a8-eqiad pdu refresh (Date TBA) from a8-eqiad pdu refresh to a8-eqiad pdu refresh (Thursday 9/19 @11am UTC).
Aug 15 2019, 5:32 PM · DC-Ops, Operations, ops-eqiad
wiki_willy renamed T227142: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) from a6-eqiad pdu refresh to a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC).
Aug 15 2019, 5:32 PM · DC-Ops, Operations, ops-eqiad
wiki_willy renamed T227138: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) from a2-eqiad pdu refresh to a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC).
Aug 15 2019, 5:31 PM · DC-Ops, Operations, ops-eqiad
wiki_willy renamed T226782: a1-eqiad pdu refresh (Date TBD) from a1-eqiad pdu refresh to a1-eqiad pdu refresh (Thursday 9/12 @11am UTC).
Aug 15 2019, 5:30 PM · DC-Ops, Operations, ops-eqiad
wiki_willy updated the task description for T226778: Install new PDUs in rows A/B (Top level tracking task).
Aug 15 2019, 5:28 PM · DC-Ops, Operations, ops-eqiad

Aug 14 2019

wiki_willy assigned T230518: elastic1017 lost network after reboot to Cmjohnson.
Aug 14 2019, 11:44 PM · ops-eqiad, DC-Ops, Operations, Discovery-Search (Current work)

Aug 13 2019

wiki_willy assigned T230442: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only to Cmjohnson.
Aug 13 2019, 9:14 PM · ops-eqiad, Operations

Aug 12 2019

wiki_willy assigned T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only to Cmjohnson.

Just a heads up Chris, the system is under warranty thru June 2021. Thanks, Willy

Aug 12 2019, 9:22 AM · cloud-services-team, ops-eqiad, Operations

Aug 9 2019

wiki_willy updated the task description for T223450: Triage and resolve all outstanding Netbox report errors.
Aug 9 2019, 6:44 PM · ops-codfw, ops-eqiad, Operations, SRE-tools, netbox, DC-Ops
wiki_willy closed T229680: Missing Netbox Info for New PDUs, a subtask of T223450: Triage and resolve all outstanding Netbox report errors, as Resolved.
Aug 9 2019, 6:42 PM · ops-codfw, ops-eqiad, Operations, SRE-tools, netbox, DC-Ops
wiki_willy closed T229680: Missing Netbox Info for New PDUs as Resolved.

Info entered into Netbox by @RobH. Resolving task.

Aug 9 2019, 6:42 PM · Operations, ops-eqiad, netbox, DC-Ops

Aug 8 2019

wiki_willy assigned T230088: cloudelastic1002: SMART/disk error to Cmjohnson.
Aug 8 2019, 9:59 PM · ops-eqiad, DC-Ops, Operations, cloud-services-team (Kanban)
wiki_willy added a comment to T224794: Degraded RAID on helium.

Drives received last Wed, July 31 by @Jclark-ctr

Aug 8 2019, 1:23 AM · ops-eqiad, Operations
wiki_willy added a comment to T229452: db1114 crashed due to memory issues (server under warranty).

@Cmjohnson - just following up on this one, since you were out on vacation last week when the task came in.

Aug 8 2019, 1:09 AM · ops-eqiad, Operations, DBA

Aug 7 2019

wiki_willy reassigned T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory from RobH to Cmjohnson.

Moving back to @Cmjohnson - can you try getting Dell to RMA you a motherboard? If they give you push back, let me know and I can try escalating with our account manager.

Aug 7 2019, 10:17 PM · Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban)

Aug 6 2019

wiki_willy closed T229948: hw troubleshooting: <type of hardware failre> for <fqhn of server> as Invalid.

created task as a test. resolving.

Aug 6 2019, 5:45 PM · DC-Ops
wiki_willy created T229948: hw troubleshooting: <type of hardware failre> for <fqhn of server>.
Aug 6 2019, 5:43 PM · DC-Ops
wiki_willy added a comment to T227940: (OoW) Degraded RAID on analytics1032.

@elukey , thank you

Aug 6 2019, 3:06 PM · ops-eqiad, Operations
wiki_willy added a comment to T226599: (OoW) Degraded RAID on analytics1039.

Thanks @elukey

Aug 6 2019, 3:06 PM · ops-eqiad, Operations

Aug 5 2019

wiki_willy assigned T229880: ms-be1040 - disk issues to Cmjohnson.

Confirmed server is under warranty thru March 2021.

Aug 5 2019, 8:54 PM · DC-Ops, ops-eqiad, Operations, media-storage
wiki_willy added a comment to T229778: mr1-eqsin down since ~01:50 UTC.

@Marostegui - Ha, we tied. =)

Aug 5 2019, 5:40 AM · ops-eqsin, Operations, netops
wiki_willy closed T229778: mr1-eqsin down since ~01:50 UTC as Resolved.

Cable between mr1-eqsin p4 <---> asw-0603-eqsin p23 looks like it accidentally got bumped by the contractor during the server install. Called him back and he was able to resolve the issue by reseating the cables. Link has been stable for the past 15min now. Resolving task.

Aug 5 2019, 5:37 AM · ops-eqsin, Operations, netops
wiki_willy claimed T229778: mr1-eqsin down since ~01:50 UTC.
Aug 5 2019, 5:34 AM · ops-eqsin, Operations, netops
wiki_willy added a comment to T229778: mr1-eqsin down since ~01:50 UTC.

Alright, I'm asking him to go back to the datacenter to check all the connections on mr1-eqsin.

Aug 5 2019, 4:22 AM · ops-eqsin, Operations, netops
wiki_willy added a comment to T211368: update PDUs for eqsin (asset tag and other info).

Asset tags applied by Jin from DreamICC today as follows (also emailed out via a spreadsheet):

Aug 5 2019, 4:09 AM · Operations, ops-eqsin
wiki_willy added a comment to T229778: mr1-eqsin down since ~01:50 UTC.

@CDanis - I just checked with our 3rd party contractor and he says it shouldn't have been affected by the work he was doing. However, he was working in the racks from 1:45-4:00 UTC, and if it only alerted for a few minutes, it's possible that something was accidentally bumped while he was installing the 3 servers. It's no longer alerting, right?

Aug 5 2019, 4:08 AM · ops-eqsin, Operations, netops
wiki_willy added a comment to T227911: msw1-eqsin/msw2-eqsin missing serial number.

Info gathered by Jin from DreamICC today. Here's the info below (also sent out via email):

Aug 5 2019, 4:00 AM · ops-eqsin, Operations
wiki_willy added a comment to T229243: remote hands setups for ganeti500[123].

Completed by Jin from DreamICC today. The missing IPV4 IP addresses used are the following, with the gateway set to 10.132.129.1 accordingly (instead of 10.132.128.1):

Aug 5 2019, 3:59 AM · Operations, ops-eqsin

Aug 2 2019

wiki_willy assigned T229706: helium.mgmt down to Cmjohnson.
Aug 2 2019, 11:02 PM · ops-eqiad, Operations
wiki_willy updated the task description for T223450: Triage and resolve all outstanding Netbox report errors.
Aug 2 2019, 5:15 PM · ops-codfw, ops-eqiad, Operations, SRE-tools, netbox, DC-Ops
wiki_willy created T229680: Missing Netbox Info for New PDUs.
Aug 2 2019, 5:11 PM · Operations, ops-eqiad, netbox, DC-Ops
wiki_willy added a comment to T223450: Triage and resolve all outstanding Netbox report errors.

@faidon - The majority of the influx in Netbox errors looks like it's from the new PDUs. Some of the info was updated in Netbox to fix the discrepancies reported by Accounting earlier this week, but that also created new Netbox errors, like missing purchase date and procurement ticket. I'll follow up with @RobH or @Cmjohnson next week - it'll be a good training exercise/task for John to work on. Thanks, Willy

Aug 2 2019, 4:50 PM · ops-codfw, ops-eqiad, Operations, SRE-tools, netbox, DC-Ops
wiki_willy reassigned T227408: (OoW) restbase2009 lockup from wiki_willy to Papaul.
Aug 2 2019, 1:04 AM · serviceops, ops-codfw, Operations

Aug 1 2019

wiki_willy added a comment to T227408: (OoW) restbase2009 lockup.

@Papaul - if you can't find a spare from any of those decom servers, we can order it, since it's still a while before the 5yr mark.

Aug 1 2019, 4:04 PM · serviceops, ops-codfw, Operations

Jul 31 2019

wiki_willy assigned T229452: db1114 crashed due to memory issues (server under warranty) to Cmjohnson.
Jul 31 2019, 6:25 PM · ops-eqiad, Operations, DBA
wiki_willy assigned T229453: elastic1031 - PSU status critical to Jclark-ctr.

@Jclark-ctr - whenever you have a few min free, can you see if this is just a loose cable that maybe got accidentally pulled from the PDU swap last week? If it's actually a bad PSU, I think we can leave it, since it's due to be refreshed via T221636.

Jul 31 2019, 6:24 PM · ops-eqiad, Discovery-Search (Current work), Operations
wiki_willy edited projects for T229251: (2019-08-31)rack/setup/install db2131.codfw.wmnet, added: ops-codfw; removed ops-eqiad.
Jul 31 2019, 7:53 AM · ops-codfw, Operations, DBA

Jul 30 2019

wiki_willy reassigned T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory from wiki_willy to RobH.

Assigning to @RobH for results from ePSA pre-boot system assessment, before determining the next steps.

Jul 30 2019, 10:51 PM · Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban)

Jul 29 2019

wiki_willy assigned T229283: Degraded RAID on ms-be2021 to Papaul.
Jul 29 2019, 10:25 PM · Operations, ops-codfw
wiki_willy assigned T229134: Degraded RAID on sulfur to Cmjohnson.
Jul 29 2019, 10:24 PM · ops-eqiad, Operations
wiki_willy assigned T229156: Degraded RAID on cloudvirt1018 to Cmjohnson.

System is in-warranty (doesn't expire until May 2020)

Jul 29 2019, 10:22 PM · cloud-services-team (Kanban), ops-eqiad, Operations

Jul 26 2019

wiki_willy reassigned T229124: add jclark to datacenter-ops group from wiki_willy to RobH.

Approved for the following:

Jul 26 2019, 4:18 PM · Operations, SRE-Access-Requests
wiki_willy added a comment to T229124: add jclark to datacenter-ops group.

Approved for the following:

Jul 26 2019, 4:17 PM · Operations, SRE-Access-Requests