User Details
- User Since
- Aug 2 2021, 1:52 PM (86 w, 5 d)
- Availability
- Available
- IRC Nick
- Emperor
- LDAP User
- MVernon
- MediaWiki User
- MVernon (WMF) [ Global Accounts ]
Fri, Mar 31
@Jclark-ctr ms-be1042 shut down ready for you.
@Jclark-ctr I can shut down ms-be1042 for you (or you can DIY, there's no special procedure for this host). Can I confirm you want it shut down at 14:00 UTC today, please?
Wed, Mar 29
I think in these cases, removing the incorrect thumbnail will allow it to be recreated on next GET.
Hi folks. I am happy for you to do the firmware update first if you think that's the best approach.
Please do so at your earliest convenience - a note here when you're doing so would be appreciated. I'm happy for that to involve rebooting the server.
Tue, Mar 28
The firmware update cookbook does offer a firmware update; I was going to apply it once the disks were swapped (as rebooting the system with drives in a funny state doesn't always go smoothly).
ms-be2067.codfw.wmnet (Gen 14): starting
ms-be2067.codfw.wmnet (IDRAC): update
ms-be2067.codfw.wmnet (IDRAC): current version: 5.0.20.0
poweredge-r740xd2: picking DellDriverCategory.IDRAC update file
We have found multiple entries please pick from the list below:
0: /srv/firmware/poweredge-r740xd2/IDRAC/iDRAC-with-Lifecycle-Controller_Firmware_C8NT1_WN64_6.10.30.00_A00.EXE
1: Download new file
==> Please select the entry you want
> 0
User input is: "0"
ms-be2067.codfw.wmnet (IDRAC): target_version: 6.10.30.0, current_version: 5.0.20.0
==> ms-be2067.codfw.wmnet IDRAC: About to upload /srv/firmware/poweredge-r740xd2/IDRAC/iDRAC-with-Lifecycle-Controller_Firmware_C8NT1_WN64_6.10.30.00_A00.EXE, please confirm
Type "go" to proceed or "abort" to interrupt the execution
> abort
User input is: "abort"
Fri, Mar 24
Wed, Mar 22
I wonder (but this is not a settled position) whether using an account ACL is the more elegant solution, as we do that once and it'll work for all deleted containers?
The upstream docs are a bit ... spartan; swiftstack provide slightly more information.
Yeah, that's a good question - I think there are about 21675 deleted containers.
I think there's no automation for container management (is that right @fgiunchedi ?) so the options are presumably to write a loop that adds the mw:backup account to the read ACL for each deleted container, or maybe to make a read-only account ACL so the mw:backup user has read-only access to everything the mw:media account owns?
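A minimal sketch of the per-container loop option, in Python. It assumes the python-swiftclient `swift post --read-acl` option is used to set the ACL; the helper names `add_reader` and `post_command` are hypothetical, and nothing here has been run against the production cluster.

```python
def add_reader(read_acl: str, account: str) -> str:
    """Return read_acl with account appended, unless it is already present."""
    entries = [e.strip() for e in read_acl.split(",") if e.strip()]
    if account not in entries:
        entries.append(account)
    return ",".join(entries)


def post_command(container: str, read_acl: str, account: str) -> list[str]:
    """Build (but do not run) the swift CLI call that would set the new ACL."""
    return ["swift", "post", "--read-acl", add_reader(read_acl, account), container]


# Example using the ACL reported for the deleted container in this thread.
cmd = post_command(
    "wikipedia-mediawiki-local-deleted",
    "mw:thumbor-private,mw:media",
    "mw:backup",
)
print(" ".join(cmd))
```

Building the command list separately from running it makes a dry run over all the deleted containers easy to review before anything is changed.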
I think the issue is that the deleted container has different permissions:
root@ms-fe1009:/home/mvernon# swift stat wikipedia-mediawiki-local-deleted | grep 'Read ACL'
Read ACL: mw:thumbor-private,mw:media
root@ms-fe1009:/home/mvernon# swift stat wikipedia-commons-local-public.57 | grep 'Read ACL'
Read ACL: mw:thumbor,mw:media,.r:*
So you can see the deleted container is only available to the mw:thumbor-private and mw:media accounts, whereas the public container is readable by any swift account.
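The same comparison can be sketched in Python for checking many containers at once. This assumes the `Read ACL:` line format shown in the `swift stat` output above, and treats a `.r:*` entry as "readable by anyone"; the function names `parse_read_acl` and `readable_by` are hypothetical.

```python
def parse_read_acl(stat_output: str) -> list[str]:
    """Extract the Read ACL entries from `swift stat` output."""
    for line in stat_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Read ACL":
            return [e.strip() for e in value.split(",") if e.strip()]
    return []


def readable_by(acl: list[str], account: str) -> bool:
    """True if the account is listed, or the container is world-readable."""
    return ".r:*" in acl or account in acl


deleted = parse_read_acl("Read ACL: mw:thumbor-private,mw:media")
public = parse_read_acl("Read ACL: mw:thumbor,mw:media,.r:*")
print(readable_by(deleted, "mw:backup"), readable_by(public, "mw:backup"))
```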
Tue, Mar 21
"Write up an incident report", perhaps?
Mon, Mar 20
I think the right thing would be to open a new ticket; but I note it's SRE Sprint Week, so I'm not sure whether clinic duty tasks will get much attention this week unless flagged as urgent, I'm afraid.
I'm not sure how much this will actually help from the swift side (as opposed to frontend capacity, memcached, etc).
The account dbs are tiny, though, so it seems like a cheap thing to try...
Fri, Mar 17
I think this issue is resolved now? Swift error rates have remained low since we restarted swift and thumbor.
I'm closing this task now, as I think thumbnails are now being correctly generated.
Thanks!
Thu, Mar 16
@Papaul that's the one - can you clear the Foreign state from that disk, please? I can't figure out how to do it, and I think without that config being cleared it's not available to the OS.
@Papaul thanks; the other drive has behaved itself since the reboot, so I think we're OK to leave it in place for now.
Wed, Mar 15
This is a k8s application running on the WMF OpenStack, yes?
Tue, Mar 14
@Cmjohnson these are just JBOD I think - at least that's how ms-fe1012 appears to me (and I think that's what I expect for a swift frontend) - we do software-raid on these systems.
Some logs (NDA, they have IPs in) for a particular request above - P45869.
...this may be a different disk that's failed, so maybe it just got unseated?
[for reference, I was on leave on 2022-08-24]
Hi @Papaul, sorry, but there's still a missing disk in this system - Virtual Drive 21 is absent, which I think is Slot Number: 19
Could you have another look, please?
@Papaul that's only 13 disks, not 14?
The recent activity panel in the iDRAC shows:
2023-03-12T15:08:42-0500 Virtual Disk 8 on Integrated RAID Controller 1 was deleted.
2023-03-12T15:08:37-0500 The Patrol Read operation was stopped and did not complete for Integrated RAID Controller 1.
2023-03-12T15:08:37-0500 Controller cache is preserved for missing or offline Virtual Disk 8 on Integrated RAID Controller 1.
2023-03-12T15:08:37-0500 Disk 6 in Backplane 1 of Integrated RAID Controller 1 is removed.
2023-03-12T15:08:37-0500 Disk 6 in Backplane 1 of Integrated RAID Controller 1 was reset.
2023-03-12T15:08:37-0500 Disk 6 in Backplane 1 of Integrated RAID Controller 1 is not functioning correctly.
2023-03-12T15:08:26-0500 Disk 6 in Backplane 1 of Integrated RAID Controller 1 was reset.
Mon, Mar 13
No complaints from me, thanks, drive is now 25% loaded and behaving fine.
Fri, Mar 10
I'm struggling to find any corresponding uptick in error logs on swift frontends, which is making me wonder if this isn't coming from elsewhere in the stack. But it's nearly COB on Friday and I'm running out of good ideas, sorry.
Similarly, looking at logstash searches for e.g. host: thumbor* AND message:502 doesn't show any interesting rate changes.
@sbassett done.
Happy to hold off until we're sure what the correct group is :)
@thcipriani you're the approver for the deployment group, are you happy for me to add @Htriedman to it, please?
Thu, Mar 9
Setting to medium priority, because this is probably now just a case of waiting for the queue to drain.
The swapped-in drive seems OK initially, I'll get swift to start using it shortly.
[after a reboot the drive in slot 2 was in a "Foreign" state; clearing that made it possible to reintroduce it with sudo megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0, and the filesystem recovered OK.]
Can you check the drives in slots 23 and 2 are seated properly, please? The kernel still can't see them.
Target Id 4 also missing
Looking at these drives -
sdz is bus info: scsi@0:2.25.0 Target Id: 25 is Enclosure Device ID: 32 Slot Number: 23
Something has gone a bit awry, the kernel reports problems with two other drives instead:
Mar 9 14:13:57 ms-be1066 kernel: [11683056.185701] sd 0:2:4:0: [sdf] tag#699 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=4s
Mar 9 14:14:00 ms-be1066 kernel: [11683059.173114] sd 0:2:25:0: [sdz] tag#897 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
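A rough Python sketch for pulling the affected device and SCSI target out of kernel lines like these, so they can be matched against the megacli Target Id. It assumes the third field of the `sd H:C:T:L` address corresponds to the Target Id, as it appears to for sdz (target 25) here; `failed_devices` is a hypothetical helper name.

```python
import re

# Matches e.g. "sd 0:2:25:0: [sdz] ... hostbyte=DID_NO_CONNECT"
KERNEL_SD_ERROR = re.compile(
    r"sd (?P<host>\d+):(?P<channel>\d+):(?P<target>\d+):(?P<lun>\d+): "
    r"\[(?P<device>\w+)\].*?hostbyte=(?P<hostbyte>\w+)"
)


def failed_devices(log_text: str) -> list[tuple[str, int, str]]:
    """Return (device, scsi_target, hostbyte) for each failed-command line."""
    return [
        (m["device"], int(m["target"]), m["hostbyte"])
        for m in KERNEL_SD_ERROR.finditer(log_text)
    ]


log = (
    "Mar 9 14:13:57 ms-be1066 kernel: [11683056.185701] sd 0:2:4:0: [sdf] tag#699 "
    "FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=4s\n"
    "Mar 9 14:14:00 ms-be1066 kernel: [11683059.173114] sd 0:2:25:0: [sdz] tag#897 "
    "FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s\n"
)
for device, target, hostbyte in failed_devices(log):
    print(device, target, hostbyte)
```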
Yes, please. I've unmounted a drive in ms-be1066 and turned on the locator light:
sudo megacli -PDLocate -PhysDrv [32:15] -a0
@DMartin-WMF all done.
@sbassett I've opened a CR to update your ssh key - if you can confirm it's correct and +1 the CR, I'll merge it.
Wed, Mar 8
Hi @Jclark-ctr any news on getting these frontends ready for use, please?
@S_Mukuti this is all done for you now.
I see two successful PUTs for that object (one per DC), and indeed it seems to be successfully uploaded...
...though there is a "Sarah Mukuti" wikitech user, which I think is correct?
@lwatson this is now done.
@S_Mukuti I think this is a request to join the wmf LDAP group only?
Also, can you double-check the wikitech username, please? I can't find an account by that name.
@FNavas-foundation can I double-check what access you need for what purposes, please? You say you need access to turnilo - that can be done with just wmf access (not analytics_privatedata_users) - see https://wikitech.wikimedia.org/wiki/Analytics/Data_access and/or talk to your manager.
@DMartin-WMF can I confirm you don't require kerberos access (you didn't explicitly ask for it; cf https://wikitech.wikimedia.org/wiki/Analytics/Data_access )?
One further thought - it would be nice if we could take swift's special 404-handler out of the equation, and have our object store be more "vanilla" in operation[0]. So e.g. have thumbnail requests be served from thumbor (or equivalent), and have that service worry about the caching/invalidation/etc. This task might not be the place to address that, but it seemed worth flagging if we're going to expend effort on rethinking how thumbnails are dealt with.
That looks like a DNS error (unless I'm misreading net::ERR_NAME_NOT_RESOLVED), which wasn't obviously apparent in the error that started this ticket.
Tue, Mar 7
Yeah, I saw similar on ms-be2070; the problem being the disk isn't entirely blank. I suspect
sudo wipefs -a /dev/disk/by-path/pci-0000:18:00.0-scsi-0:0:0:0-part1
sudo wipefs -a /dev/disk/by-path/pci-0000:18:00.0-scsi-0:0:0:0
will do the trick.
...the icinga warning was systemd timing out waiting for smartd to start up (takes about 2 minutes).
@Papaul I've fixed the underlying problems and you'll see ms-be2070 reimaged to successful completion now, so hopefully that's you unblocked here.
Sorry, have updated that page now.
@nickifeajika all done now.
d'oh, that seems likely, thank you!
Mon, Mar 6
@jbond I dunno if you have any thoughts about this? I've had a look at the iDRAC, and it has one of the SSDs as the boot device to try (which I'd expect to work), and all of the drives are set to non-RAID (and the convert-disks cookbook indeed says nothing to do). And it seemingly boots OK as a vanilla OS install.
@Miriam sorry, I forgot to ask: can I confirm that this is a time-limited account, and you are the contact regarding expiry, please? And can you tell me the expected expiry date, please?