RobH (Rob Halsell) Administrator
Operations Engineer

Projects (23)

User Details

User Since
Nov 24 2014, 1:43 PM (220 w, 6 d)
Roles
Administrator
Availability
Available
IRC Nick
RobH
LDAP User
RobH
MediaWiki User
RobH

My GPG Key fingerprint = CB1F C7E7 0FF8 5DB2 6820 9C7E 75ED 14C7 0245 D22A

I am an Operations Engineer on Wikimedia's Datacenter Operations Team.

I am also the primary triage engineer for the hardware-requests project, as well as the private S4 procurement space and procurement project.

All questions involving allocation of hardware can be initially addressed on https://wikitech.wikimedia.org/wiki/Operations_requests.

Please note that private messages via Phabricator are not my preferred means of contact. Please feel free to contact me (robh) directly via IRC (freenode), or email my @wikimedia.org address.

Recent Activity

Fri, Feb 15

RobH added a subtask for T214024: Two test hosts for SREs: Unknown Object (Task).
Fri, Feb 15, 6:43 PM · Operations, hardware-requests
RobH added a comment to T214024: Two test hosts for SREs.

Please note T216269 tracks the order of new single cpu spare pool systems. Once we have those ordered, a second system can be allocated via this task.

Fri, Feb 15, 6:42 PM · Operations, hardware-requests
RobH added a comment to T214024: Two test hosts for SREs.

Once this is approved, assign back to me and I'll get it allocated and spun up, then stall this task until a second single cpu misc system arrives for approval for the second of the two systems.

Fri, Feb 15, 6:39 PM · Operations, hardware-requests
RobH assigned T214024: Two test hosts for SREs to faidon.

So we are down to just one single cpu spare misc host. I'm creating a task to order more spare servers, but for now I can only allocate 1 system for this.

Fri, Feb 15, 6:39 PM · Operations, hardware-requests
RobH moved T215301: codfw spare pool system for partman testing from Backlog to In Discussion / Review on the hardware-requests board.
Fri, Feb 15, 6:11 PM · Patch-For-Review, Operations, hardware-requests
RobH moved T216226: GPU upgrade for stat1005 from Backlog to In Discussion / Review on the hardware-requests board.
Fri, Feb 15, 6:10 PM · Analytics, hardware-requests, Operations
RobH added a comment to T216226: GPU upgrade for stat1005.

Thanks for the input @Shilad, it's much appreciated! That info is EXACTLY the kind of info we need (and why this task exists!)

Fri, Feb 15, 5:57 PM · Analytics, hardware-requests, Operations

Thu, Feb 14

RobH added a comment to T216175: HP Gen9 onboard controller review.

netbox list of ALL DL360 Gen9 systems: https://netbox.wikimedia.org/dcim/devices/?device_type_id=54&per_page=250

Thu, Feb 14, 8:00 PM · Operations
RobH triaged T216175: HP Gen9 onboard controller review as High priority.
Thu, Feb 14, 7:48 PM · Operations
RobH added a comment to T216172: Set up basic email infra for w.wiki domain.

Copy of the email:

Thu, Feb 14, 7:45 PM · Operations, Mail
RobH moved T216062: decom ruthenium from Backlog to Ready for Decommission on the decommission board.
Thu, Feb 14, 7:02 PM · Patch-For-Review, DC-Ops, decommission, Parsoid, Operations
RobH moved T205507: Decommission analytics100[1,2] from Blocked on Service Owners to Ready for Decommission on the decommission board.
Thu, Feb 14, 7:02 PM · Patch-For-Review, Operations, ops-eqiad, decommission, User-Elukey, Analytics
RobH moved T211826: decommission oxygen.eqiad.wmnet from Blocked on Service Owners to Ready for Decommission on the decommission board.
Thu, Feb 14, 7:02 PM · DC-Ops, decommission
RobH moved T209357: Return graphite100[13] to spares pool (or decom) from Backlog to Decommission on the ops-eqiad board.
Thu, Feb 14, 6:50 PM · ops-eqiad, decommission, User-fgiunchedi, Operations
RobH moved T209357: Return graphite100[13] to spares pool (or decom) from Ready for Decommission to pending onsite steps (eqiad) on the decommission board.
Thu, Feb 14, 6:50 PM · ops-eqiad, decommission, User-fgiunchedi, Operations
RobH reassigned T209357: Return graphite100[13] to spares pool (or decom) from RobH to Cmjohnson.
Thu, Feb 14, 6:50 PM · ops-eqiad, decommission, User-fgiunchedi, Operations
RobH updated the task description for T209357: Return graphite100[13] to spares pool (or decom).
Thu, Feb 14, 6:42 PM · ops-eqiad, decommission, User-fgiunchedi, Operations
RobH updated the task description for T209357: Return graphite100[13] to spares pool (or decom).
Thu, Feb 14, 6:35 PM · ops-eqiad, decommission, User-fgiunchedi, Operations
RobH updated the task description for T209357: Return graphite100[13] to spares pool (or decom).
Thu, Feb 14, 6:30 PM · ops-eqiad, decommission, User-fgiunchedi, Operations
RobH removed a project from T209357: Return graphite100[13] to spares pool (or decom): Patch-For-Review.
Thu, Feb 14, 6:27 PM · ops-eqiad, decommission, User-fgiunchedi, Operations
RobH updated the task description for T191362: decom promethium/WMF3571.
Thu, Feb 14, 6:15 PM · decommission, Operations, DC-Ops, ops-eqiad
RobH reassigned T191362: decom promethium/WMF3571 from RobH to ayounsi.

So, trying to disable the switch port:

Thu, Feb 14, 6:14 PM · decommission, Operations, DC-Ops, ops-eqiad
RobH updated the task description for T191362: decom promethium/WMF3571.
Thu, Feb 14, 6:06 PM · decommission, Operations, DC-Ops, ops-eqiad
RobH moved T206524: Decommission analytics1003 from Backlog to Decommission on the ops-eqiad board.

ready for wipe and unracking steps

Thu, Feb 14, 5:59 PM · Operations, ops-eqiad, decommission, DC-Ops, User-Elukey, Analytics
RobH reassigned T206524: Decommission analytics1003 from RobH to Cmjohnson.
Thu, Feb 14, 5:59 PM · Operations, ops-eqiad, decommission, DC-Ops, User-Elukey, Analytics
RobH moved T206524: Decommission analytics1003 from Ready for Decommission to pending onsite steps (eqiad) on the decommission board.
Thu, Feb 14, 5:58 PM · Operations, ops-eqiad, decommission, DC-Ops, User-Elukey, Analytics
RobH removed a project from T206524: Decommission analytics1003: Patch-For-Review.
Thu, Feb 14, 5:58 PM · Operations, ops-eqiad, decommission, DC-Ops, User-Elukey, Analytics
RobH updated the task description for T206524: Decommission analytics1003.
Thu, Feb 14, 5:51 PM · Operations, ops-eqiad, decommission, DC-Ops, User-Elukey, Analytics
RobH updated the task description for T206524: Decommission analytics1003.
Thu, Feb 14, 5:24 PM · Operations, ops-eqiad, decommission, DC-Ops, User-Elukey, Analytics
RobH updated the task description for T206524: Decommission analytics1003.
Thu, Feb 14, 5:21 PM · Operations, ops-eqiad, decommission, DC-Ops, User-Elukey, Analytics

Wed, Feb 13

RobH updated subscribers of T216004: Degraded RAID on cloudvirt1018.

update from irc chat:

Wed, Feb 13, 8:56 PM · cloud-services-team (Kanban), ops-eqiad, Operations
RobH added a comment to T214760: icinga1001 crashed.

10:12 < cmjohnson1> : robh Dell approved everything....the disks for cloudvirts and the cpu for icinga1001

Wed, Feb 13, 7:58 PM · Patch-For-Review, ops-eqiad, monitoring, Operations
RobH added a comment to T214760: icinga1001 crashed.

I requested a new CPU but w/out Dell's idrac log stating it's a CPU there is a good chance they will kick it back.

You have successfully submitted request SR986384843.

Wed, Feb 13, 6:07 PM · Patch-For-Review, ops-eqiad, monitoring, Operations
RobH reassigned T215569: mw1299 is down (jobrunner-canary, now up but depooled) from RobH to jijiki.

I've synced with @jijiki who is returning this to service and will comment on here.

Wed, Feb 13, 5:20 PM · ops-eqiad, Operations

Tue, Feb 12

RobH reassigned T214760: icinga1001 crashed from Volans to Cmjohnson.

Can you open a support request with Dell and insist on a replacement CPU due to the output of T214760#4941652 please?

Tue, Feb 12, 10:43 PM · Patch-For-Review, ops-eqiad, monitoring, Operations
RobH added a comment to T214760: icinga1001 crashed.

So with the comments from @Volans on T214760#4941652, it seems this may be an issue with CPU#27, which is the second CPU. It may be enough to get another CPU sent by Dell, since returning it to service in the other slot and hoping for failure seems problematic.

Tue, Feb 12, 10:25 PM · Patch-For-Review, ops-eqiad, monitoring, Operations
RobH added a comment to T214760: icinga1001 crashed.

Ok, so I'm going to address some of the error messages and log messages here:

Tue, Feb 12, 10:21 PM · Patch-For-Review, ops-eqiad, monitoring, Operations
RobH reassigned T214760: icinga1001 crashed from RobH to Volans.

Ok, I've run the hardware tests and nothing reports as broken.

Tue, Feb 12, 9:57 PM · Patch-For-Review, ops-eqiad, monitoring, Operations
RobH added a comment to T214760: icinga1001 crashed.

Ok, rebooted the system and watched it POST, no errors. A quick grep of SEL shows no additional entries from T214760#4945789.

Tue, Feb 12, 9:18 PM · Patch-For-Review, ops-eqiad, monitoring, Operations
RobH added a comment to T213121: Deploy cr2-eqsin.

Chris shipped this, and I just put in an inbound shipment ticket for EQ Singapore SG#: 1-185487164544
UPS tracking 1Z291X71DG27842078

Tue, Feb 12, 7:24 PM · Patch-For-Review, ops-eqiad, ops-eqsin, netops, Operations
RobH reassigned T214608: rack/setup/install logstash101[012].eqiad.wmnet from RobH to herron.

Ok, these are calling into puppet with role spare. You can apply new roles and push into service.

Tue, Feb 12, 6:28 PM · Patch-For-Review, Operations
RobH updated the task description for T214608: rack/setup/install logstash101[012].eqiad.wmnet.
Tue, Feb 12, 6:23 PM · Patch-For-Review, Operations
RobH removed a project from T214608: rack/setup/install logstash101[012].eqiad.wmnet: Patch-For-Review.
Tue, Feb 12, 6:18 PM · Patch-For-Review, Operations
RobH added a comment to T214608: rack/setup/install logstash101[012].eqiad.wmnet.

Firmware is being updated on the bios and idrac before OS installation on all three hosts:

Tue, Feb 12, 5:25 PM · Patch-For-Review, Operations
RobH removed a project from T214608: rack/setup/install logstash101[012].eqiad.wmnet: Patch-For-Review.
Tue, Feb 12, 5:19 PM · Patch-For-Review, Operations
RobH removed a project from T214608: rack/setup/install logstash101[012].eqiad.wmnet: Patch-For-Review.
Tue, Feb 12, 4:59 PM · Patch-For-Review, Operations

Mon, Feb 11

RobH assigned T214760: icinga1001 crashed to CDanis.

Icinga was failed over to icinga2001. @Cmjohnson, @RobH, we can proceed either to check if the CPU is properly mounted and/or try to get some replacement parts based on current evidence.

Mon, Feb 11, 11:48 PM · Patch-For-Review, ops-eqiad, monitoring, Operations
RobH added a comment to T214760: icinga1001 crashed.

Pulled from racadm getsel

Mon, Feb 11, 11:37 PM · Patch-For-Review, ops-eqiad, monitoring, Operations
RobH assigned T214274: Degraded RAID on cp5010 to ayounsi.

Ok, support case opened with Dell and a replacement SSD has been dispatched. Details below:

Mon, Feb 11, 9:43 PM · Traffic, ops-eqsin, Operations
RobH moved T215837: eqiad: requesting dual cpu misc host for icinga1001 replacement from Backlog to Pending Approval on the hardware-requests board.

We'll need management approval on which task to assign WMF7426.

Mon, Feb 11, 8:22 PM · Operations, hardware-requests
RobH triaged T215837: eqiad: requesting dual cpu misc host for icinga1001 replacement as Normal priority.
Mon, Feb 11, 8:21 PM · Operations, hardware-requests
RobH added a comment to T215411: thumbor1004 memory errors.

@jijiki pinged you in irc as well, can you return this system to service?

Mon, Feb 11, 6:59 PM · Thumbor, ops-eqiad, serviceops, Operations
RobH closed T215411: thumbor1004 memory errors as Resolved.

Ok, updated firmware to System BIOS Version = 2.6.0 revision date of 28 Jun 2018

Mon, Feb 11, 6:58 PM · Thumbor, ops-eqiad, serviceops, Operations
RobH reassigned T215411: thumbor1004 memory errors from RobH to jijiki.

Ok, so the dimm B1 is reporting bad:

Mon, Feb 11, 5:11 PM · Thumbor, ops-eqiad, serviceops, Operations
RobH moved T215411: thumbor1004 memory errors from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
Mon, Feb 11, 5:05 PM · Thumbor, ops-eqiad, serviceops, Operations
RobH claimed T215411: thumbor1004 memory errors.
Mon, Feb 11, 5:02 PM · Thumbor, ops-eqiad, serviceops, Operations
RobH added a comment to T213121: Deploy cr2-eqsin.

So deleting a ticket requires us to open a 'delete request' ticket. It seems easier to just keep both open; they'll receive the shipment on one or the other.

Mon, Feb 11, 4:47 PM · Patch-For-Review, ops-eqiad, ops-eqsin, netops, Operations
RobH added a comment to T213121: Deploy cr2-eqsin.
Mon, Feb 11, 4:40 PM · Patch-For-Review, ops-eqiad, ops-eqsin, netops, Operations
RobH updated the task description for T213121: Deploy cr2-eqsin.
Mon, Feb 11, 4:36 PM · Patch-For-Review, ops-eqiad, ops-eqsin, netops, Operations
RobH added a comment to T213121: Deploy cr2-eqsin.

Chris shipped this, and I just put in an inbound shipment ticket for EQ Singapore SG#: 1-185487164544
UPS tracking 1Z291X71DG27842078

Mon, Feb 11, 4:36 PM · Patch-For-Review, ops-eqiad, ops-eqsin, netops, Operations

Thu, Feb 7

RobH renamed T209101: ulsfo: setup ulsfo PDUs from ulsfo: install new PDUs in racks / phase out APC loaner PDU use to ulsfo: setup ulsfo PDUs.
Thu, Feb 7, 11:14 PM · Patch-For-Review, Operations, ops-ulsfo
RobH updated the task description for T209101: ulsfo: setup ulsfo PDUs.
Thu, Feb 7, 11:13 PM · Patch-For-Review, Operations, ops-ulsfo
RobH removed a project from T209101: ulsfo: setup ulsfo PDUs: Patch-For-Review.
Thu, Feb 7, 11:12 PM · Patch-For-Review, Operations, ops-ulsfo
RobH added a comment to T209101: ulsfo: setup ulsfo PDUs.

Ok, firmware updated and all power balanced.

Thu, Feb 7, 10:59 PM · Patch-For-Review, Operations, ops-ulsfo
RobH added a comment to T209101: ulsfo: setup ulsfo PDUs.

Ok, while updating these, I've noticed that the power feeds in ulsfo are not balanced. Tower A is around 7 amps and tower B is around 2 amps for both racks.

Thu, Feb 7, 9:38 PM · Patch-For-Review, Operations, ops-ulsfo
RobH added a comment to T209101: ulsfo: setup ulsfo PDUs.

All PDUs in ulsfo are now properly mounted. The temp/humidity leads are plugged in, but not run anywhere until AFTER we get rid of the decom systems and install blanking panels.

Thu, Feb 7, 9:21 PM · Patch-For-Review, Operations, ops-ulsfo
RobH added a comment to T214274: Degraded RAID on cp5010.

Oh, just the output from troubleshooting on the system. The system should show TWO SSDs and only sees one now:

Thu, Feb 7, 9:05 PM · Traffic, ops-eqsin, Operations
RobH moved T214274: Degraded RAID on cp5010 from Backlog to Hardware Failure / Troubleshoot on the ops-eqsin board.
Thu, Feb 7, 8:33 PM · Traffic, ops-eqsin, Operations
RobH added a comment to T214274: Degraded RAID on cp5010.

Ok, I opened a support request with dell to ship a replacement SSD to eqsin:

Thu, Feb 7, 8:33 PM · Traffic, ops-eqsin, Operations
RobH reassigned T214079: cloudstore100{8,9} - Upgrade to 10GbE from RobH to GTirloni.

Ok, these are both reinstalled and ready for use/takeover.

Thu, Feb 7, 7:23 PM · Patch-For-Review, ops-eqiad, Operations
RobH added a comment to T214079: cloudstore100{8,9} - Upgrade to 10GbE.

Ok, assisting in this I've done the following:

Thu, Feb 7, 5:28 PM · Patch-For-Review, ops-eqiad, Operations
RobH removed Due Date on T215012: cloudvirt1015: apparent hardware errors in CPU/Memory.
Thu, Feb 7, 4:58 PM · Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban)
RobH set Due Date to Thu, Feb 14, 12:00 AM on T215012: cloudvirt1015: apparent hardware errors in CPU/Memory.
Thu, Feb 7, 4:56 PM · Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban)
RobH removed a project from T215012: cloudvirt1015: apparent hardware errors in CPU/Memory: Patch-For-Review.
Thu, Feb 7, 4:54 PM · Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban)
RobH updated the task description for T215012: cloudvirt1015: apparent hardware errors in CPU/Memory.
Thu, Feb 7, 4:54 PM · Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban)
RobH reassigned T215012: cloudvirt1015: apparent hardware errors in CPU/Memory from RobH to Cmjohnson.
Thu, Feb 7, 4:54 PM · Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban)
RobH added a comment to T215012: cloudvirt1015: apparent hardware errors in CPU/Memory.

Since this host is empty we should rebuild it with Stretch before putting any real VMs back on it. Maybe best to resolve the hardware issue first though.

Thu, Feb 7, 4:53 PM · Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban)
RobH updated the task description for T215012: cloudvirt1015: apparent hardware errors in CPU/Memory.
Thu, Feb 7, 4:52 PM · Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban)
RobH added a comment to T215012: cloudvirt1015: apparent hardware errors in CPU/Memory.
root@cloudvirt1015.mgmt.eqiad.wmnet's password: 
/admin1-> racadm getsel
Record:      1
Date/Time:   10/29/2018 17:26:21
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   11/16/2018 19:16:14
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   11/16/2018 19:16:37
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   01/19/2019 01:57:00
Source:      system
Severity:    Non-Critical
Description: The PERC1 battery is low.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   01/19/2019 02:47:50
Source:      system
Severity:    Ok
Description: The PERC1 battery is operating normally.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   01/31/2019 11:08:27
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      21
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      22
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      23
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      24
Date/Time:   01/31/2019 11:11:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
/admin1->
Thu, Feb 7, 4:46 PM · Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban)
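
A log like the racadm getsel output above is mostly repeated "Ok" entries, so the few Critical and Non-Critical records are easy to miss. Below is a minimal sketch, assuming the output has been saved as plain text in exactly the layout shown (field lines separated by rows of dashes). The names parse_getsel, actionable and SAMPLE are hypothetical helpers for illustration and are not part of racadm or any Dell tooling.

```python
# Hypothetical helper: parse saved `racadm getsel` text into records.
# Assumes the "Key:   value" layout and dashed separator rows shown in the log.
FIELDS = ("Record", "Date/Time", "Source", "Severity", "Description")

def parse_getsel(text):
    """Return a list of dicts, one per SEL record."""
    records = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line and set(line) == {"-"}:  # a row of dashes closes a record
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep and key in FIELDS:
            current[key] = value.strip()
    if current:  # tolerate a missing final separator
        records.append(current)
    return records

def actionable(records):
    """Keep only records whose severity is something other than Ok."""
    return [r for r in records if r.get("Severity") != "Ok"]

# Three records copied from the log above, used as sample input.
SAMPLE = """\
/admin1-> racadm getsel
Record:      2
Date/Time:   11/16/2018 19:16:14
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   01/31/2019 11:08:26
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
/admin1->
"""

if __name__ == "__main__":
    for rec in actionable(parse_getsel(SAMPLE)):
        print(rec["Date/Time"], rec["Severity"], rec["Description"])
```

Prompt lines and the password line are ignored because their keys are not in FIELDS, so a whole session can be pasted in unedited.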

Wed, Feb 6

RobH awarded T215338: WMF7426 fails to accept racadm powercycle commands a Like token.
Wed, Feb 6, 9:24 PM · ops-eqiad, Operations
RobH closed T214516: cp4026 correctable dimm error as Resolved.

Ok, things I did to fix this system so far:

Wed, Feb 6, 7:55 PM · ops-ulsfo, Traffic, Operations
RobH added a comment to T214516: cp4026 correctable dimm error.
robh@cp4026:~$ sudo ipmi-sel
ID  | Date        | Time     | Name             | Type                     | Event
1   | Apr-23-2017 | 23:39:37 | SEL              | Event Logging Disabled   | Log Area Reset/Cleared
2   | May-03-2017 | 15:45:10 | PS Redundancy    | Power Supply             | Fully Redundant
3   | Sep-21-2017 | 03:57:01 | Status           | Power Supply             | Power Supply input lost (AC/DC)
4   | Sep-21-2017 | 03:57:01 | PS Redundancy    | Power Supply             | Redundancy Lost
5   | Sep-21-2017 | 04:28:01 | Status           | Power Supply             | Power Supply input lost (AC/DC)
6   | Sep-21-2017 | 04:28:11 | PS Redundancy    | Power Supply             | Fully Redundant
7   | Oct-05-2017 | 07:20:45 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = A0h ; OEM Event Data3 code = 01h
8   | Oct-05-2017 | 07:30:31 | Mem ECC Warning  | Memory                   | transition to Critical from less severe ; OEM Event Data2 code = A0h ; OEM Event Data3 code = 01h
9   | Oct-12-2017 | 00:03:42 | Status           | Power Supply             | Power Supply input lost (AC/DC)
10  | Oct-12-2017 | 00:03:42 | PS Redundancy    | Power Supply             | Redundancy Lost
11  | Oct-12-2017 | 00:05:07 | Status           | Power Supply             | Power Supply input lost (AC/DC)
12  | Oct-12-2017 | 00:05:17 | Status           | Power Supply             | Power Supply input lost (AC/DC)
13  | Oct-17-2017 | 18:01:23 | Status           | Power Supply             | Power Supply input lost (AC/DC)
14  | Oct-17-2017 | 18:01:28 | PS Redundancy    | Power Supply             | Fully Redundant
15  | Oct-17-2017 | 21:32:43 | Intrusion        | Physical Security        | General Chassis Intrusion ; OEM Event Data2 code = 02h
16  | Oct-17-2017 | 21:32:48 | Intrusion        | Physical Security        | General Chassis Intrusion ; OEM Event Data2 code = 02h
17  | Dec-18-2018 | 18:02:36 | PS Redundancy    | Power Supply             | Redundancy Lost
18  | Dec-18-2018 | 18:02:36 | Status           | Power Supply             | Power Supply input lost (AC/DC)
19  | Dec-18-2018 | 18:05:26 | Status           | Power Supply             | Power Supply input lost (AC/DC)
20  | Dec-18-2018 | 18:05:36 | PS Redundancy    | Power Supply             | Fully Redundant
21  | Dec-18-2018 | 18:14:47 | Status           | Power Supply             | Power Supply input lost (AC/DC)
22  | Dec-18-2018 | 18:14:52 | PS Redundancy    | Power Supply             | Redundancy Lost
23  | Dec-18-2018 | 18:19:07 | Status           | Power Supply             | Power Supply input lost (AC/DC)
24  | Dec-18-2018 | 18:19:12 | PS Redundancy    | Power Supply             | Fully Redundant
25  | Dec-22-2018 | 04:13:13 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = A1h ; OEM Event Data3 code = 04h
26  | Dec-22-2018 | 05:09:48 | Mem ECC Warning  | Memory                   | transition to Critical from less severe ; OEM Event Data2 code = A1h ; OEM Event Data3 code = 04h
robh@cp4026:~$
Wed, Feb 6, 7:16 PM · ops-ulsfo, Traffic, Operations
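
The ipmi-sel table above mixes power supply, intrusion and memory events. A small filter can pull out just the memory rows when reviewing a DIMM error. This is a sketch only, assuming the pipe-delimited layout shown with a header row first. parse_ipmi_sel, memory_events and SAMPLE_SEL are hypothetical names and are not part of FreeIPMI.

```python
# Hypothetical helper: parse saved `ipmi-sel` text (pipe-delimited table).
def parse_ipmi_sel(text):
    """Return one dict per event row, keyed by the header names."""
    # Shell prompt lines contain no pipe character and are dropped here.
    lines = [l for l in text.splitlines() if "|" in l]
    if not lines:
        return []
    header = [h.strip() for h in lines[0].split("|")]
    rows = []
    for line in lines[1:]:
        # Cap the number of splits so the last column keeps any stray pipes.
        cells = [c.strip() for c in line.split("|", len(header) - 1)]
        rows.append(dict(zip(header, cells)))
    return rows

def memory_events(rows):
    """Keep only rows whose Type column is Memory."""
    return [r for r in rows if r.get("Type") == "Memory"]

# Header plus three rows copied from the table above, used as sample input.
SAMPLE_SEL = """\
robh@cp4026:~$ sudo ipmi-sel
ID  | Date        | Time     | Name             | Type                     | Event
24  | Dec-18-2018 | 18:19:12 | PS Redundancy    | Power Supply             | Fully Redundant
25  | Dec-22-2018 | 04:13:13 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = A1h ; OEM Event Data3 code = 04h
26  | Dec-22-2018 | 05:09:48 | Mem ECC Warning  | Memory                   | transition to Critical from less severe ; OEM Event Data2 code = A1h ; OEM Event Data3 code = 04h
robh@cp4026:~$
"""

if __name__ == "__main__":
    for row in memory_events(parse_ipmi_sel(SAMPLE_SEL)):
        print(row["ID"], row["Date"], row["Time"], row["Event"])
```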
RobH moved T214516: cp4026 correctable dimm error from Backlog to Hardware Failure / Repair on the ops-ulsfo board.
Wed, Feb 6, 7:13 PM · ops-ulsfo, Traffic, Operations
RobH updated the task description for T178592: decommission/replace bast4001.wikimedia.org.
Wed, Feb 6, 7:11 PM · decommission, Operations, ops-ulsfo
RobH reassigned T215335: requesting WMF7426 as phabricator system in eqiad from faidon to Dzahn.
Wed, Feb 6, 12:56 AM · Operations, hardware-requests
RobH added a comment to T215335: requesting WMF7426 as phabricator system in eqiad.

So the original phab1002 was requested on T195623, but then @Dzahn advised (via discussion with @20after4) that it needed 64GB, not the 32GB it has.

Wed, Feb 6, 12:55 AM · Operations, hardware-requests

Tue, Feb 5

RobH reassigned T215335: requesting WMF7426 as phabricator system in eqiad from RobH to faidon.

Please approve the allocation of our last dual cpu spare pool system in eqiad as the secondary phabricator system in eqiad.

Tue, Feb 5, 10:18 PM · Operations, hardware-requests
RobH moved T215338: WMF7426 fails to accept racadm powercycle commands from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
Tue, Feb 5, 8:26 PM · Operations, ops-eqiad
RobH triaged T215338: WMF7426 fails to accept racadm powercycle commands as Normal priority.
Tue, Feb 5, 8:00 PM · Operations, ops-eqiad
RobH assigned T215335: requesting WMF7426 as phabricator system in eqiad to Dzahn.

So I filed this following a conversation with @Dzahn regarding parent task T195623.

Tue, Feb 5, 7:58 PM · Operations, hardware-requests
RobH placed T215335: requesting WMF7426 as phabricator system in eqiad up for grabs.
Tue, Feb 5, 7:54 PM · Operations, hardware-requests
RobH reassigned T215335: requesting WMF7426 as phabricator system in eqiad from RobH to faidon.
Tue, Feb 5, 7:46 PM · Operations, hardware-requests
RobH closed T195623: request to assign wmf6937 (mw1298, former imagescaler) (now: wmf4727) as phab1002 as Resolved.

T215332 and T215335 filed as followup, resolving this task.

Tue, Feb 5, 7:46 PM · Patch-For-Review, Operations, hardware-requests
RobH renamed T215335: requesting WMF7426 as phabricator system in eqiad from requesting wmf7622 as phabricator system in eqiad to requesting WMF7426 as phabricator system in eqiad.
Tue, Feb 5, 7:45 PM · Operations, hardware-requests
RobH updated the task description for T215335: requesting WMF7426 as phabricator system in eqiad.
Tue, Feb 5, 7:41 PM · Operations, hardware-requests
RobH triaged T215335: requesting WMF7426 as phabricator system in eqiad as Normal priority.
Tue, Feb 5, 7:40 PM · Operations, hardware-requests
RobH added a parent task for T215332: decommission wmf6937 as phab1002, reimage as mw1298: T192457: Reallocate former image scalers.
Tue, Feb 5, 7:31 PM · decommission, Operations, hardware-requests
RobH added a subtask for T192457: Reallocate former image scalers: T215332: decommission wmf6937 as phab1002, reimage as mw1298.
Tue, Feb 5, 7:30 PM · Patch-For-Review, Operations
RobH triaged T215332: decommission wmf6937 as phab1002, reimage as mw1298 as Normal priority.
Tue, Feb 5, 7:30 PM · decommission, Operations, hardware-requests
RobH added a comment to T195623: request to assign wmf6937 (mw1298, former imagescaler) (now: wmf4727) as phab1002.

Ok, I reviewed this in IRC with @Dzahn and have the following action items:

Tue, Feb 5, 7:24 PM · Patch-For-Review, Operations, hardware-requests