Aug 15 2019

wiki_willy renamed T227138: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) from a2-eqiad pdu refresh to a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC).
Aug 15 2019, 5:31 PM · DC-Ops, SRE, ops-eqiad
wiki_willy renamed T226782: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) from a1-eqiad pdu refresh to a1-eqiad pdu refresh (Thursday 9/12 @11am UTC).
Aug 15 2019, 5:30 PM · DC-Ops, SRE, ops-eqiad
wiki_willy updated the task description for T226778: Install new PDUs in rows A/B (Top level tracking task).
Aug 15 2019, 5:28 PM · DC-Ops, SRE, ops-eqiad

Aug 14 2019

wiki_willy assigned T230518: elastic1017 lost network after reboot to Cmjohnson.
Aug 14 2019, 11:44 PM · decommission-hardware, ops-eqiad, DC-Ops, SRE, Discovery-Search (Current work)

Aug 13 2019

wiki_willy assigned T230442: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only to Cmjohnson.
Aug 13 2019, 9:14 PM · ops-eqiad, SRE

Aug 12 2019

wiki_willy assigned T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only to Cmjohnson.

Just a heads up Chris, the system is under warranty thru June 2021. Thanks, Willy

Aug 12 2019, 9:22 AM · Patch-For-Review, cloud-services-team, ops-eqiad, SRE

Aug 9 2019

wiki_willy closed T229680: Missing Netbox Info for New PDUs as Resolved.

Info entered into Netbox by @RobH. Resolving task.

Aug 9 2019, 6:42 PM · SRE, ops-eqiad, netbox, DC-Ops

Aug 8 2019

wiki_willy assigned T230088: cloudelastic1002: SMART/disk error to Cmjohnson.
Aug 8 2019, 9:59 PM · ops-eqiad, DC-Ops, SRE, cloud-services-team (Kanban)
wiki_willy added a comment to T224794: Degraded RAID on helium.

Drives received last Wed, July 31 by @Jclark-ctr

Aug 8 2019, 1:23 AM · ops-eqiad, SRE
wiki_willy added a comment to T229452: db1114 crashed due to memory issues (server under warranty).

@Cmjohnson - just following up on this one, since you were out on vacation last week when the task came in.

Aug 8 2019, 1:09 AM · ops-eqiad, DBA, SRE

Aug 7 2019

wiki_willy reassigned T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory from RobH to Cmjohnson.

Moving back to @Cmjohnson - can you try getting Dell to RMA you a motherboard? If they give you pushback, let me know and I can try escalating with our account manager.

Aug 7 2019, 10:17 PM · cloud-services-team (Kanban), SRE, ops-eqiad, DC-Ops, User-Zppix

Aug 6 2019

wiki_willy closed T229948: hw troubleshooting: <type of hardware failre> for <fqhn of server> as Invalid.

Created task as a test. Resolving.

Aug 6 2019, 5:45 PM · DC-Ops
wiki_willy created T229948: hw troubleshooting: <type of hardware failre> for <fqhn of server>.
Aug 6 2019, 5:43 PM · DC-Ops
wiki_willy added a comment to T227940: (OoW) Degraded RAID on analytics1032.

@elukey, thank you

Aug 6 2019, 3:06 PM · ops-eqiad, SRE
wiki_willy added a comment to T226599: (OoW) Degraded RAID on analytics1039.

Thanks @elukey

Aug 6 2019, 3:06 PM · ops-eqiad, SRE

Aug 5 2019

wiki_willy assigned T229880: ms-be1040 - disk issues to Cmjohnson.

Confirmed server is under warranty thru March 2021.

Aug 5 2019, 8:54 PM · DC-Ops, ops-eqiad, SRE-swift-storage, SRE
wiki_willy added a comment to T229778: mr1-eqsin down since ~01:50 UTC.

@Marostegui - Ha, we tied. =)

Aug 5 2019, 5:40 AM · ops-eqsin, SRE, netops
wiki_willy closed T229778: mr1-eqsin down since ~01:50 UTC as Resolved.

Cable between mr1-eqsin p4 <---> asw-0603-eqsin p23 looks like it accidentally got bumped by the contractor during the server install. Called him back and he was able to resolve the issue by reseating the cables. Link has been stable for the past 15 min now. Resolving task.

Aug 5 2019, 5:37 AM · ops-eqsin, SRE, netops
wiki_willy claimed T229778: mr1-eqsin down since ~01:50 UTC.
Aug 5 2019, 5:34 AM · ops-eqsin, SRE, netops
wiki_willy added a comment to T229778: mr1-eqsin down since ~01:50 UTC.

Alright, I'm asking him to go back to the datacenter to check all the connections on mr1-eqsin.

Aug 5 2019, 4:22 AM · ops-eqsin, SRE, netops
wiki_willy added a comment to T211368: update PDUs for eqsin (asset tag and other info).

Asset tags applied by Jin from DreamICC today as follows (also emailed out via a spreadsheet):

Aug 5 2019, 4:09 AM · SRE, ops-eqsin
wiki_willy added a comment to T229778: mr1-eqsin down since ~01:50 UTC.

@CDanis - I just checked with our 3rd party contractor and he says it shouldn't have been affected by the work he was doing. However, he was working in the racks from 1:45-4:00 UTC, and if it only alerted for a few minutes, it's possible that something was accidentally bumped while he was installing the 3 servers. It's no longer alerting, right?

Aug 5 2019, 4:08 AM · ops-eqsin, SRE, netops
wiki_willy added a comment to T227911: msw1-eqsin/msw2-eqsin missing serial number.

Info gathered by Jin from DreamICC today. Here's the info below (also sent out via email):

Aug 5 2019, 4:00 AM · SRE, ops-eqsin
wiki_willy added a comment to T229243: remote hands setups for ganeti500[123].

Completed by Jin from DreamICC today. The missing IPv4 addresses used are the following, with the gateway set to 10.132.129.1 accordingly (instead of 10.132.128.1):

Aug 5 2019, 3:59 AM · SRE, ops-eqsin

Aug 2 2019

wiki_willy assigned T229706: helium.mgmt down to Cmjohnson.
Aug 2 2019, 11:02 PM · ops-eqiad, SRE
wiki_willy created T229680: Missing Netbox Info for New PDUs.
Aug 2 2019, 5:11 PM · SRE, ops-eqiad, netbox, DC-Ops
wiki_willy reassigned T227408: (OoW) restbase2009 lockup from wiki_willy to Papaul.
Aug 2 2019, 1:04 AM · serviceops, ops-codfw, SRE

Aug 1 2019

wiki_willy added a comment to T227408: (OoW) restbase2009 lockup.

@Papaul - if you can't find a spare from any of those decom servers, we can order it, since it's still a while before the 5yr mark.

Aug 1 2019, 4:04 PM · serviceops, ops-codfw, SRE

Jul 31 2019

wiki_willy assigned T229452: db1114 crashed due to memory issues (server under warranty) to Cmjohnson.
Jul 31 2019, 6:25 PM · ops-eqiad, DBA, SRE
wiki_willy assigned T229453: elastic1031 - PSU status critical to Jclark-ctr.

@Jclark-ctr - whenever you have a few min free, can you see if this is just a loose cable that maybe got accidentally pulled from the PDU swap last week? If it's actually a bad PSU, I think we can leave it, since it's due to be refreshed via T221636.

Jul 31 2019, 6:24 PM · ops-eqiad, Discovery-Search (Current work), SRE
wiki_willy edited projects for T229251: (2019-08-31)rack/setup/install db2131.codfw.wmnet, added: ops-codfw; removed ops-eqiad.
Jul 31 2019, 7:53 AM · ops-codfw, SRE, DBA

Jul 30 2019

wiki_willy reassigned T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory from wiki_willy to RobH.

Assigning to @RobH for results from ePSA pre-boot system assessment, before determining the next steps.

Jul 30 2019, 10:51 PM · cloud-services-team (Kanban), SRE, ops-eqiad, DC-Ops, User-Zppix

Jul 29 2019

wiki_willy assigned T229283: Degraded RAID on ms-be2021 to Papaul.
Jul 29 2019, 10:25 PM · SRE, ops-codfw
wiki_willy assigned T229134: Degraded RAID on sulfur to Cmjohnson.
Jul 29 2019, 10:24 PM · ops-eqiad, SRE
wiki_willy assigned T229156: Degraded RAID on cloudvirt1018 to Cmjohnson.

System is in-warranty (doesn't expire until May 2020)

Jul 29 2019, 10:22 PM · cloud-services-team (Kanban), ops-eqiad, SRE

Jul 26 2019

wiki_willy reassigned T229124: add jclark to datacenter-ops group from wiki_willy to RobH.

Approved for the following:

Jul 26 2019, 4:18 PM · SRE, SRE-Access-Requests
wiki_willy added a comment to T229124: add jclark to datacenter-ops group.

Approved for the following:

Jul 26 2019, 4:17 PM · SRE, SRE-Access-Requests

Jul 25 2019

wiki_willy reassigned T228606: Degraded RAID on elastic1046 from wiki_willy to Cmjohnson.

Thanks @elukey, subtask #T229017 has been opened to order the replacement drive with procurement. Assigning this task back to @Cmjohnson, for when the disk arrives onsite.

Jul 25 2019, 4:10 PM · Patch-For-Review, Discovery-Search (Current work), ops-eqiad, SRE

Jul 24 2019

wiki_willy assigned T228732: Upgrade db1100 firmware and BIOS to Cmjohnson.
Jul 24 2019, 8:25 PM · DBA, ops-eqiad, SRE
wiki_willy assigned T228853: Degraded RAID on cloudvirt1024 to Cmjohnson.
Jul 24 2019, 8:23 PM · ops-eqiad, SRE
wiki_willy added a comment to T228606: Degraded RAID on elastic1046.

@elukey - since elastic1046 is just barely out of warranty (only by a few months), we'll still have to purchase a new disk for this server. Just double-checking that's the route you want to go, before we place the order.

Jul 24 2019, 6:21 PM · Patch-For-Review, Discovery-Search (Current work), ops-eqiad, SRE

Jul 23 2019

wiki_willy added a comment to T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory.

@Cmjohnson - are those errors for DIMM A3 enough info to get Dell to RMA a part to us? If not, let me know, and I'll bring it up during my next sync up meeting with them.

Jul 23 2019, 9:49 PM · cloud-services-team (Kanban), SRE, ops-eqiad, DC-Ops, User-Zppix

Jul 22 2019

wiki_willy assigned T228606: Degraded RAID on elastic1046 to Cmjohnson.
Jul 22 2019, 11:05 PM · Patch-For-Review, Discovery-Search (Current work), ops-eqiad, SRE
wiki_willy assigned T228618: Reallocate dbproxy1020 and dbproxy1021 from row D to row C to Cmjohnson.
Jul 22 2019, 11:04 PM · DC-Ops, SRE, ops-eqiad
wiki_willy added a comment to T228692: relocate/reimage cloudvirt1016 with 10G interfaces.

Per @Andrew - cloudvirt1016 is now ready for re-racking.

Jul 22 2019, 8:16 PM · Patch-For-Review, SRE, cloud-services-team (Kanban)
wiki_willy added a comment to T228691: relocate/reimage cloudvirt1017 with 10G interfaces.

Per @Andrew via IRC - cloudvirt1017 is now ready for re-racking.

Jul 22 2019, 8:15 PM · Patch-For-Review, SRE, cloud-services-team (Kanban)
wiki_willy assigned T228691: relocate/reimage cloudvirt1017 with 10G interfaces to Cmjohnson.
Jul 22 2019, 8:13 PM · Patch-For-Review, SRE, cloud-services-team (Kanban)
wiki_willy assigned T228692: relocate/reimage cloudvirt1016 with 10G interfaces to Cmjohnson.
Jul 22 2019, 8:13 PM · Patch-For-Review, SRE, cloud-services-team (Kanban)

Jul 19 2019

wiki_willy assigned T227632: Document PDU models to RobH.
Jul 19 2019, 11:02 PM · netbox, SRE, ops-codfw, ops-eqiad
wiki_willy added a comment to T221632: Storage capacity upgrade for WDQS.

Talked to @Gehel, @faidon, and @RobH, and we're going to proceed with RAID0 for this install. Although RAID0 is not ideal, we can make an exception here, as long as a drive failure that takes the system down won't result in any dc-ops onsite repair emergencies.

Jul 19 2019, 4:08 PM · Wikidata, Wikidata-Query-Service

Jul 17 2019

wiki_willy reassigned T224794: Degraded RAID on helium from wiki_willy to Cmjohnson.

Thanks for the back history, @akosiaris. We'll get the replacement drives ordered for you via procurement #T228302. ~Willy

Jul 17 2019, 4:49 PM · ops-eqiad, SRE
wiki_willy added a comment to T227335: backup1001 can't address the disk shelf's drives.

@Cmjohnson - not sure if there's a loose connection somewhere on backup1001, but can you check it out when you have a few cycles? This one needs to be up and running before data can be migrated over from helium (which is slated to be decom'd). Thanks, Willy

Jul 17 2019, 4:21 PM · ops-eqiad, SRE, DC-Ops
wiki_willy moved T227335: backup1001 can't address the disk shelf's drives from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
Jul 17 2019, 4:19 PM · ops-eqiad, SRE, DC-Ops
wiki_willy edited projects for T227335: backup1001 can't address the disk shelf's drives, added: ops-eqiad; removed ops-eqdfw.
Jul 17 2019, 4:19 PM · ops-eqiad, SRE, DC-Ops
wiki_willy assigned T227335: backup1001 can't address the disk shelf's drives to Cmjohnson.
Jul 17 2019, 4:18 PM · ops-eqiad, SRE, DC-Ops
wiki_willy added a comment to T226599: (OoW) Degraded RAID on analytics1039.

Thanks @elukey, much appreciated! ~Willy

Jul 17 2019, 6:44 AM · ops-eqiad, SRE

Jul 16 2019

wiki_willy updated subscribers of T226599: (OoW) Degraded RAID on analytics1039.

@Cmjohnson - if you have a bunch of spare 4TB SATA drives lying around onsite that match up with the disks on analytics1039, feel free to use them for this task. Thanks, Willy

Jul 16 2019, 10:55 PM · ops-eqiad, SRE
wiki_willy added a comment to T224794: Degraded RAID on helium.

@akosiaris or @Volans - we can order drive replacements for this, since it's out of warranty, but I was trying to figure out how this correlates with the new replacement, backup1001. Do you need replacement drives on helium to be able to complete the migration of data over to backup1001? I'll follow up on IRC with you later tonight as well. Thanks, Willy

Jul 16 2019, 10:36 PM · ops-eqiad, SRE
wiki_willy added a comment to T227288: eqiad: 1 misc node for the Kerberos KDC service.

@elukey @RobH - I've marked it as accelerate on the procurement doc. Rob, can you work on getting these two servers included on this procurement cycle? Much appreciated.

Jul 16 2019, 2:57 PM · Analytics-Radar, hardware-requests, SRE, User-Elukey
wiki_willy added a comment to T226274: (Need By: June 30) rack/setup/install kafka-main100[1-5].

@RobH - looks like Chris finished the racking part of this install. Can you finish up the rest of the install for these 5 Kafka hosts?

Jul 16 2019, 2:29 PM · User-herron, SRE
wiki_willy reassigned T226274: (Need By: June 30) rack/setup/install kafka-main100[1-5] from Cmjohnson to RobH.
Jul 16 2019, 2:28 PM · User-herron, SRE

Jul 15 2019

wiki_willy added a comment to T222950: (OoW) cloudvirt1006 - RAID battery failed.

Subtask opened with procurement to order RAID battery. ~Willy

Jul 15 2019, 11:48 PM · cloud-services-team (Hardware), User-jbond, ops-eqiad, SRE
wiki_willy assigned T190086: Decommission old server wmf4077 to Cmjohnson.
Jul 15 2019, 9:02 PM · decommission-hardware, SRE
wiki_willy assigned T221572: (OoW) wtp2019 shows error messages in the racadm getsel's output to Papaul.
Jul 15 2019, 8:58 PM · ops-codfw, SRE
wiki_willy renamed T221572: (OoW) wtp2019 shows error messages in the racadm getsel's output from wtp2019 shows error messages in the racadm getsel's output to (OoW) wtp2019 shows error messages in the racadm getsel's output.
Jul 15 2019, 8:57 PM · ops-codfw, SRE
wiki_willy assigned T200678: (OoW) wtp2011 memory correctable errors to Papaul.
Jul 15 2019, 8:56 PM · SRE, ops-codfw
wiki_willy renamed T200678: (OoW) wtp2011 memory correctable errors from wtp2011 memory correctable errors to (OoW) wtp2011 memory correctable errors.
Jul 15 2019, 8:55 PM · SRE, ops-codfw
wiki_willy assigned T205240: (OoW) MCE errors on mw2181 / temperature warnings to Papaul.
Jul 15 2019, 8:55 PM · serviceops, SRE, ops-codfw
wiki_willy renamed T205240: (OoW) MCE errors on mw2181 / temperature warnings from MCE errors on mw2181 / temperature warnings to (OoW) MCE errors on mw2181 / temperature warnings.
Jul 15 2019, 8:54 PM · serviceops, SRE, ops-codfw
wiki_willy assigned T194174: (OoW) wtp2013 memory correctable errors to Papaul.
Jul 15 2019, 8:54 PM · SRE, ops-codfw
wiki_willy renamed T194174: (OoW) wtp2013 memory correctable errors from wtp2013 memory correctable errors to (OoW) wtp2013 memory correctable errors.
Jul 15 2019, 8:53 PM · SRE, ops-codfw
wiki_willy assigned T194171: (OoW) rdb2002 correctable memory errors to Papaul.
Jul 15 2019, 8:52 PM · SRE, ops-codfw
wiki_willy renamed T194171: (OoW) rdb2002 correctable memory errors from rdb2002 correctable memory errors to (OoW) rdb2002 correctable memory errors.
Jul 15 2019, 8:52 PM · SRE, ops-codfw
wiki_willy closed T219854: Broken disk on ms-be2026 as Resolved.

Looks like things are resolved here, so I'm going to resolve the task, but feel free to reopen if there's still something that needs to be completed.

Jul 15 2019, 8:51 PM · Patch-For-Review, SRE, ops-codfw
wiki_willy assigned T205712: (OoW) wtp2020: correctable memory errors to Papaul.
Jul 15 2019, 8:47 PM · SRE, ops-codfw
wiki_willy renamed T205712: (OoW) wtp2020: correctable memory errors from wtp2020: correctable memory errors to (OoW) wtp2020: correctable memory errors.
Jul 15 2019, 8:47 PM · SRE, ops-codfw
wiki_willy assigned T209337: (OoW) lvs2006 crashed into (what it seems) an unrecoverable state to Papaul.
Jul 15 2019, 8:46 PM · ops-codfw, SRE, Traffic
wiki_willy renamed T209337: (OoW) lvs2006 crashed into (what it seems) an unrecoverable state from lvs2006 crashed into (what it seems) an unrecoverable state to (OoW) lvs2006 crashed into (what it seems) an unrecoverable state.
Jul 15 2019, 8:46 PM · ops-codfw, SRE, Traffic
wiki_willy assigned T170152: mc2023 / mc2025 fail to mount root partition within 90 seconds using Linux 4.9 to Papaul.
Jul 15 2019, 8:44 PM · SRE, ops-codfw
wiki_willy assigned T192082: (OoW) lvs2006 Embedded Flash/SD-CARD iLO errors to Papaul.
Jul 15 2019, 8:42 PM · Traffic, DC-Ops, SRE, ops-codfw
wiki_willy renamed T192082: (OoW) lvs2006 Embedded Flash/SD-CARD iLO errors from lvs2006 Embedded Flash/SD-CARD iLO errors to (OoW) lvs2006 Embedded Flash/SD-CARD iLO errors.
Jul 15 2019, 8:42 PM · Traffic, DC-Ops, SRE, ops-codfw
wiki_willy assigned T148017: (OoW) lvs2002 repeated usb connect/disconnect message to Papaul.
Jul 15 2019, 8:41 PM · ops-codfw, SRE
wiki_willy renamed T148017: (OoW) lvs2002 repeated usb connect/disconnect message from lvs2002 repeated usb connect/disconnect message to (OoW) lvs2002 repeated usb connect/disconnect message.
Jul 15 2019, 7:45 PM · ops-codfw, SRE
wiki_willy renamed T225131: (OoW) Degraded RAID on es2003 from Degraded RAID on es2003 to (OoW) Degraded RAID on es2003.
Jul 15 2019, 7:43 PM · SRE, ops-codfw
wiki_willy assigned T222464: PDUs with Infeed < 0.5Amps to Papaul.
Jul 15 2019, 7:41 PM · SRE, ops-codfw
wiki_willy assigned T227408: (OoW) restbase2009 lockup to Papaul.
Jul 15 2019, 7:39 PM · serviceops, ops-codfw, SRE
wiki_willy renamed T227408: (OoW) restbase2009 lockup from restbase2009 lockup to (OoW) restbase2009 lockup.
Jul 15 2019, 7:39 PM · serviceops, ops-codfw, SRE
wiki_willy renamed T227862: (OoW) db2045 failed battery from db2045 failed battery to (OoW) db2045 failed battery.
Jul 15 2019, 7:36 PM · ops-codfw, SRE, DBA
wiki_willy added a comment to T225035: cp3035 PS Redundancy Lost.

Server will be refreshed in late Q1 / early Q2, along with a hardware refresh of the entire site.

Jul 15 2019, 7:33 PM · Traffic, SRE, ops-esams
wiki_willy assigned T203520: decommission thulium.frack.eqiad.wmnet to Cmjohnson.
Jul 15 2019, 7:31 PM · decommission-hardware, ops-eqiad, SRE
wiki_willy assigned T222109: decommission frav1001.frack.eqiad.wmnet to RobH.
Jul 15 2019, 7:28 PM · decommission-hardware, SRE, fundraising-tech-ops, ops-eqiad
wiki_willy assigned T220590: Decom ms-be101[345] to Cmjohnson.
Jul 15 2019, 7:27 PM · Patch-For-Review, ops-eqiad, decommission-hardware, User-fgiunchedi, SRE-swift-storage, SRE
wiki_willy assigned T220700: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible) to Cmjohnson.
Jul 15 2019, 7:26 PM · Analytics-Radar, ops-eqiad, hardware-requests, SRE, User-Elukey
wiki_willy assigned T226599: (OoW) Degraded RAID on analytics1039 to elukey.
Jul 15 2019, 7:25 PM · ops-eqiad, SRE
wiki_willy renamed T227940: (OoW) Degraded RAID on analytics1032 from Degraded RAID on analytics1032 to (OoW) Degraded RAID on analytics1032.
Jul 15 2019, 7:22 PM · ops-eqiad, SRE
wiki_willy assigned T218751: Audit down ports to Cmjohnson.
Jul 15 2019, 7:21 PM · DC-Ops, SRE, ops-eqiad
wiki_willy claimed T227940: (OoW) Degraded RAID on analytics1032.

@Cmjohnson - looks like this server is out of warranty and just past the 5yr mark, but is also tied to a refresh order last Q2 in FY19-20 under T204177. Also, seems like it's being used as a test server now per the following:

Jul 15 2019, 7:19 PM · ops-eqiad, SRE
wiki_willy assigned T227867: mw1239 memory errors to jijiki.

Assigning to @jijiki for now. Hi Effie - let us know when it would be ok to take this server down to reseat the DIMM, and then assign the task back to @Cmjohnson when ready.

Jul 15 2019, 6:51 PM · ops-eqiad, DC-Ops, SRE, serviceops

Jul 12 2019

wiki_willy assigned T227911: msw1-eqsin/msw2-eqsin missing serial number to RobH.
Jul 12 2019, 7:04 PM · ops-eqsin, SRE
wiki_willy assigned T224794: Degraded RAID on helium to Cmjohnson.
Jul 12 2019, 8:25 AM · ops-eqiad, SRE