User Details
- User Since
- Dec 5 2022, 4:37 PM (175 w, 2 d)
- Availability
- Available
- LDAP User
- Jhancock.wm
- MediaWiki User
- Jhancock.wm
Today
Mon, Apr 13
np!
Fri, Apr 10
@MatthewVernon all yours
Thu, Apr 9
also papaul is on vacation and i'd like to have his weight in as well
imho, i'd prefer a rack not in A row cause of the two CR racks already taking up real estate.
D row has no specialty rack at all so we can easily work around that for future private vlan installs.
codfw's E row is 5 racks long but the F row is 4 racks + 1 Frack, so E would be the better choice. and not E-3 cause it already has less room cause of all the patch panels.
Wed, Apr 8
finally replaced all the parts that got fried in a power surge. powered up and back in the rack.
good news everybody!
Tue, Apr 7
@jcrespo would loading the disks from a foreign config be acceptable for you? or will that cause issues with recovery?
Mon, Apr 6
@MatthewVernon these came in. any objections to me racking them in the new cage, rows E and F?
there was a bad psu in T422310 that was causing power fluctuations in the cabinet. it was replaced in that ticket. then this power supply was reseated. alerts have cleared.
psu2 was still rapid blinking after reseating power, trying a new port, and replacing the cable. out of warranty, so checked decomms for a compatible psu to swap. found one and replaced it. alert cleared.
Fri, Apr 3
had the wrong cable connected for a server that's still in provisioning stage. corrected.
provisioned ports on new server. cleared.
Thu, Apr 2
Wed, Apr 1
we're having the issue that was documented in https://phabricator.wikimedia.org/T418929 with these servers. still working on a solution.
turns out these servers are also having the same issue as the servers in https://phabricator.wikimedia.org/T418929
so we've got a little to figure out if you want to rename them.
this server is having the issue found in T418929 where we can't add the root user because of hardware changes
Tue, Mar 31
Mon, Mar 30
Thu, Mar 26
Hey. The issue turned out to be more involved than i thought. I need to replace the power distribution board in the server, but that involves replacing the arm that gets power to the sliding drive bays as well. Might be next week before i can finish that out. I'll let you know.
Wed, Mar 25
@Dzahn
looks like that worked. it rebooted after the update and there wasn't a repeat of these error codes from two weeks ago.
2026-03-14 21:20:20 HWC5003 Front LED Panel is operating correctly.
2026-03-14 21:20:05 SWC5008 Unable to access Front LED Panel because of a hardware error condition.
2026-03-11 07:33:05 HWC5003 Front LED Panel is operating correctly.
2026-03-11 07:32:45 SWC5008 Unable to access Front LED Panel because of a hardware error condition.
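A minimal sketch of how the "no repeat" check above could be done against an exported log instead of by eye. It assumes each entry is one line of the form `timestamp code message`, as in the lines quoted above. The function names (`parse`, `outages`, `repeats_since`) are made up for illustration and are not part of any iDRAC tooling.

```python
from datetime import datetime

# Log lines copied from the comment above: timestamp, message ID, text.
LOG = """\
2026-03-14 21:20:20 HWC5003 Front LED Panel is operating correctly.
2026-03-14 21:20:05 SWC5008 Unable to access Front LED Panel because of a hardware error condition.
2026-03-11 07:33:05 HWC5003 Front LED Panel is operating correctly.
2026-03-11 07:32:45 SWC5008 Unable to access Front LED Panel because of a hardware error condition.
"""


def parse(log_text):
    """Return (timestamp, code, message) tuples sorted oldest first."""
    entries = []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        # Assumed layout: date, time, message ID, then free text.
        date, time, code, message = line.split(" ", 3)
        stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        entries.append((stamp, code, message))
    return sorted(entries)


def outages(entries, error_code="SWC5008", ok_code="HWC5003"):
    """Pair each error with the next recovery; return (start, seconds) tuples."""
    result, start = [], None
    for stamp, code, _ in entries:
        if code == error_code and start is None:
            start = stamp
        elif code == ok_code and start is not None:
            result.append((start, int((stamp - start).total_seconds())))
            start = None
    return result


def repeats_since(entries, since, error_code="SWC5008"):
    """Return timestamps of the error code logged after `since`."""
    return [stamp for stamp, code, _ in entries if code == error_code and stamp > since]


entries = parse(LOG)
for start, seconds in outages(entries):
    print(f"{start}: panel unreachable for {seconds}s")
print("repeats since 2026-03-15:", repeats_since(entries, datetime(2026, 3, 15)))
```

Feeding it a fresh export after the firmware update and passing the update time as `since` would show whether the error has come back.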
@herron we've received these and will try to get them installed by EoW. please update site.pp to include the newest nodes. ty!
Tue, Mar 24
I would definitely start with that and see if it clears the issue. it's the least invasive option. I am onsite most workdays from 1400 UTC to 1800 UTC but can go later if needed. pick your day and i'll get that done for you.
multiple servers found on different breakers. no correlation other than rack.
there were a few power supplies that went down in the same rack. it wasn't a breaker trip. all on different channels on the PDUs. I see no correlation or cause other than a possible power surge on the PDU. but it didn't affect the whole rack which i would expect.
I might have accidentally rebooted this one while getting the psu issues to clear for the whole rack.
Thu, Mar 19
Wed, Mar 18
you're welcome!
it's a good thing we didn't wait for dell to send us a new drive. their portal says shipped but the drive still hasn't been delivered to codfw.
Tue, Mar 17
@MoritzMuehlenhoff redid it in trixie.
yes. that matches the time. this error can be from a firmware issue.
got into the idrac/console and found the server stuck at this screen:
Booting from Hard Drive C: GRUB
rebooted and went to the same screen.
contacted @Papaul for consult. corrupted or missing config file.
drive 0 was replaced back on December 1st. T410195 might have happened then.
also found Critical: CPU 1 machine check error, but the error resolved on its own. no persistent errors in the idrac logs. that is likely what caused the reboot, which is how we found the corrupted config file.
got greenlight to reimage the host. going to update the bios and idrac firmware as well
Mar 17 2026
related to T419970. will clear soon
soft rebooted the idrac
Mar 16 2026
@Andrew i swear i didn't forget about you. This is complete.
@Aklapper this notifies on the physical server if something goes wrong. like if a power supply goes bad, it will flash a light.
It's not POSTing at the moment. I have some tricks to try today and if those don't work, i have some decommed servers i can pull parts from. It's my intention to have it back up by the end of my day. TY for letting me know that errors are okay. just need to get it to boot.
Mar 13 2026
it's definitely having some issues.
-power cables reseated with no results.
-replaced the BP1 and that error went away. system board error persists.
-updated idrac to see if i can get better error messages. update successful, but didn't help.
-reseating CPU1 cause that's the most likely cause for persistent error.
-error persists. looking up more stuff. coming back to this.
that did it. ty
Mar 12 2026
@Papaul i ran into an issue running the offline script for this one. is this a thing i can fix?
An exception occurred: IntegrityError: update or delete on table "ipam_ipaddress" violates foreign key constraint "dcim_device_oob_ip_id_5e7219c1_fk_ipam_ipaddress_id" on table "dcim_device"
DETAIL: Key (id)=(22652) is still referenced from table "dcim_device".
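For anyone reading the error above: it says the IP address row can't be removed while a device row still points at it through its OOB IP field. Below is a rough, self-contained sketch of that same class of failure. It uses SQLite as a stand-in (the real database is not SQLite), a heavily simplified schema that borrows only the table and column names from the error message, and a made-up host name and address. It illustrates the constraint only. It is not the offline script's logic and not a confirmed fix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Simplified stand-ins for the two tables named in the error message.
conn.execute("CREATE TABLE ipam_ipaddress (id INTEGER PRIMARY KEY, address TEXT)")
conn.execute(
    "CREATE TABLE dcim_device ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT,"
    " oob_ip_id INTEGER REFERENCES ipam_ipaddress(id))"
)

# Hypothetical data: one device whose OOB IP is the row from the error (id 22652).
conn.execute("INSERT INTO ipam_ipaddress VALUES (22652, '192.0.2.10')")
conn.execute("INSERT INTO dcim_device VALUES (1, 'example-host', 22652)")


def delete_ip(ip_id):
    """Try to delete an IP row and report whether the constraint blocked it."""
    try:
        conn.execute("DELETE FROM ipam_ipaddress WHERE id = ?", (ip_id,))
        return "deleted"
    except sqlite3.IntegrityError as exc:
        return f"blocked: {exc}"


# The device still references the IP, so the delete is refused.
first = delete_ip(22652)
print(first)

# Once nothing references the row, the same delete goes through.
conn.execute("UPDATE dcim_device SET oob_ip_id = NULL WHERE oob_ip_id = ?", (22652,))
second = delete_ip(22652)
print(second)
```

Whether clearing the device's OOB IP reference first is the right move for this host is an assumption here and is the question being asked above.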
yes, that's the most likely cause.
there was a loose power cable earlier this week. it might have powered off during that time. power cables have been secured since. T419462
replaced it with an official warranty disk we had on hand. replacing that warranty disk with a new one from Dell. SR223804656
Mar 11 2026
i need to find a way to wipe it without root access. I'd like to be able to fix that issue without pulling someone else in to do that. TY for getting that and explaining it!