User Details
- User Since
- Aug 2 2021, 1:52 PM (53 w, 6 d)
- Availability
- Available
- LDAP User
- MVernon
- MediaWiki User
MVernon (WMF)
Fri, Aug 12
Thu, Aug 11
Also related: following T309027, all the SSDs on ms-* reliably appear as non-rotational, so that could in theory be used to tell puppet (and, indeed, the installer) which drives are which.
T308677 shows an example where the installer destroys a filesystem.
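For illustration, the non-rotational flag can be read straight from sysfs; a generic sketch (nothing puppet- or installer-specific is assumed):
# 0 means non-rotational (SSD), 1 means a spinning disk.
for dev in /sys/block/sd*/queue/rotational; do
    printf '%s: %s\n' "$dev" "$(cat "$dev")"
done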
Well, I tried our usual procedure https://wikitech.wikimedia.org/wiki/Swift/How_To#Replacing_a_disk_without_touching_the_rings and the first two commands worked OK, but attempting to make a new single-disk RAID out of the new drive fails because the adaptor thinks there's no new drive:
mvernon@ms-be2067:~$ sudo megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0
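For reference, a hedged way to check what the adaptor actually sees (plain megacli, nothing site-specific):
# List the physical drives the controller knows about; a freshly inserted
# drive should appear with its own Slot Number and a sane Firmware state.
sudo megacli -PDList -a0 | grep -E 'Slot Number|Firmware state'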
Wed, Aug 10
Sorry, this is blocking on my having time to work on thanos (and that in turn is blocking on it being in a happy state to work on, complicated by power work and hardware issues). I've not forgotten!
I've sometimes seen auth failures with swift-ring-manager on thanos too, anecdotally associated with high load, but there's never AFAICS anything useful logged by swift :-/
@Papaul sorry, I don't understand your comment, but I've rechecked, and there are still kernel log errors re sdz and the idrac still thinks there's one removed drive....
Perhaps relatedly, but perhaps not, kern.log is unhappy about /dev/sdz since sdc was removed:
Aug 3 15:18:02 ms-be2067 kernel: [2595942.387928] sd 0:2:2:0: SCSI device is removed
Aug 3 15:18:05 ms-be2067 kernel: [2595945.605821] sd 0:2:25:0: [sdz] tag#250 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Aug 3 15:18:05 ms-be2067 kernel: [2595945.605827] sd 0:2:25:0: [sdz] tag#250 CDB: Write(16) 8a 00 00 00 00 00 80 97 29 60 00 00 01 b8 00 00
Aug 3 15:18:05 ms-be2067 kernel: [2595945.605832] blk_update_request: I/O error, dev sdz, sector 2157390176 op 0x1:(WRITE) flags 0x800 phys_seg 55 prio class 0
Aug 3 15:18:05 ms-be2067 kernel: [2595945.617003] iomap_finish_ioend: 39 callbacks suppressed
Aug 3 15:18:15 ms-be2067 kernel: [2595945.617006] sdz1: writeback error on inode 2149037748, offset 0, sector 2157390616
...and it's still logging errors now. (I think that's drive with Slot Number: 23 if that helps).
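In case it's useful, one hedged way to confirm the SCSI-target-to-slot mapping (generic megacli usage; each single-disk RAID-0 virtual drive reports its Target Id and the Slot Number of its member disk):
sudo megacli -LDPDInfo -a0 | grep -E 'Target Id|Slot Number'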
Hi @Papaul I may be missing something obvious, but I don't think the storage is quite right here - as far as I can see there isn't a new disk visible, and if I visit the idrac, it tells me there's one drive "Physical Disk 0:2:0" in state "removed". Could you have another look, please?
This was a "swift is trying to fill / instead of a storage device" problem, fixed following the procedure here:
https://wikitech.wikimedia.org/wiki/Swift/How_To#Cleanup_fully_used_root_filesystem
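The usual symptom is that a swift device directory isn't actually a mountpoint, so writes land on / instead; a quick hedged check (the device path below is only illustrative):
# Prints a warning if the directory is not a mountpoint.
mountpoint -q /srv/swift-storage/sdX1 || echo 'not mounted - swift would fill /'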
Tue, Aug 9
Yep, it's the battery (like ms-be2032).
Yes, it's the battery that's died (presumably related to the recent power work).
While Disk Is Cheap (TM), container listing is not, and our thumbs containers are the largest in terms of number-of-objects; I'm not entirely relaxed about the idea of never expiring any thumbnails ever.
Thu, Aug 4
C2 has been moved to today; it needs to wait until all the ms-* nodes in D2 are fully back up before starting.
All the ms-* nodes in C4 & C7 must be back and properly in service before we can start on D2, I'm afraid. I'll be on IRC, but please don't start on D2 until I've OK'd the state of the C ms nodes.
Wed, Aug 3
I don't think Swift itself has a mechanism for setting X-Content-Type-Options; so I think if we want to do this, having it done by varnish is likely to be the way to go.
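For what it's worth, a quick hedged check of what a thumb response currently carries (THUMB_URL is just a placeholder for any thumbnail URL):
curl -sI "$THUMB_URL" | grep -i '^x-content-type-options'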
Tue, Aug 2
Are you proposing to do away with the concept of "active" DC, then? e.g. currently swiftrepl runs from the active DC to fix up where MW failed to create / delete objects in both ms swift clusters. This enables us to answer "bucket a in codfw has object X, bucket a in eqiad does not - should X be added to eqiad, or removed from codfw?"
Without that, I'm not sure what we can do to work around the fact that MW doesn't reliably write/delete to both swift clusters...
moss-fe2001 will need depooling in C2 before work on that rack starts.
Fri, Jul 29
In rack D2, ms-fe2012 needs depooling before the power goes, and if you could ping me once the rack is done so I can check all the ms-be* nodes come back up again, that'd be kind.
For D7, please ping @jbond once done so he can confirm the ms-be* nodes have come back up OK.
Hi,
This drive is now unmounted, so can be swapped at your earliest convenience, please :)
Thanks!
...this behaviour has reverted, since we've gone back to using upstream swift-drive-audit, which is a cron.d entry.
swift-drive-audit needs to run systemctl daemon-reload after making changes to /etc/fstab. Thanks, systemd.
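i.e. after swift-drive-audit edits /etc/fstab, something like this is needed (trivial sketch):
# systemd generates .mount units from /etc/fstab, so edits are only
# picked up after a daemon-reload.
sudo systemctl daemon-reload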
Thu, Jul 28
I've deployed this fix now, so closing this issue.
Wed, Jul 27
Done.
Mon, Jul 25
Rings adjusted.
Hi,
In B2, ms-fe2010 and thanos-fe2002 will need depooling.
Thu, Jul 21
@tstarling thanks for finding that history. That is pushing me harder towards "just disable the request replication", particularly given it looks like it is sometimes causing us problems.
Tue, Jul 19
Are there other teams you think we should talk to before turning this off, then?
@fgiunchedi the obvious starter-for-ten approach to stopping this happening again would be to just ditch the cross-DC-thumbnail copying, since it seems to be sometimes causing problems. What would you feel about that as an approach?
The WMF rewrite middleware always makes me sad, and here it is doing so again...
Jul 15 2022
@tstarling thanks for fixing.
Jul 8 2022
Jul 6 2022
Remaining nodes done by hand during reboots for T310483:
mvernon@cumin1001:~$ sudo cumin 'ms-be*' 'cat /sys/block/md0/queue/rotational'
85 hosts will be targeted:
ms-be[2028-2069].codfw.wmnet,ms-be[1028-1033,1035-1071].eqiad.wmnet
Ok to proceed on 85 hosts? Enter the number of affected hosts to confirm or "q" to quit 85
===== NODE GROUP =====
(85) ms-be[2028-2069].codfw.wmnet,ms-be[1028-1033,1035-1071].eqiad.wmnet
----- OUTPUT of 'cat /sys/block/md0/queue/rotational' -----
0
================
Jul 5 2022
LGTM :)
Jun 21 2022
Jun 17 2022
Jun 15 2022
That seems to have done the trick, thanks :-)
@Cmjohnson Ah, OK, I sort-of assumed we had HTML5 console everywhere. Now I know better :)
Hi,
Sorry for further noise, but: is it possible there's some USB thing still connected to this system?
It won't reimage because of:
[ 12.234765] sd 0:0:0:0: [sda] Attached SCSI removable disk
(which means the SSDs end up at b and c)
This looks to be a USB device:
~ # ls -l /dev/disk/by-path/pci-0000\:00\:14.0-usb-0\:4\:1.0-scsi-0\:0\:0\:0
lrwxrwxrwx 1 root root 9 Jun 15 09:46 /dev/disk/by-path/pci-0000:00:14.0-usb-0:4:1.0-scsi-0:0:0:0 -> ../../sda
~ # lsscsi | grep 0:0:0:0
[0:0:0:0]    disk    Generic- SD/MMC CRW    1.00
Similar systems (e.g. ms-be1058) don't have such a thing.
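Another quick way to spot a USB-attached device (generic lsblk usage, nothing host-specific):
# TRAN shows the transport; a stray card reader or virtual-media device shows up as "usb".
lsblk -d -o NAME,TRAN,SIZE,MODEL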
Sorry to reopen this task, but there's a licence issue, I think. When I try and use the HTML5 console, it says:
"License Required
This iLO is not licensed to use the Integrated Remote Console after server POST is complete."
Jun 14 2022
It'll take the cookbooks a while to catch up (they back off in increasing intervals waiting for puppet to be OK), but after some deployment-related hassle, all of these nodes have had puppet run to completion on them OK, so they should be good to go for you @Eevans .
Jun 13 2022
Trying a reimage of aqs2001 with buster.
This turned out to be an incorrect config section - I've updated https://wikitech.wikimedia.org/wiki/Management_Interfaces#Is_remote_IPMI_enabled? to note this, and how to fix it on HP kit.
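For reference, a hedged example of the sort of check that shows whether remote IPMI is now working (the management hostname is a placeholder):
# Prompts for the BMC password (-a); a failure here usually means the
# IPMI-over-LAN settings on the controller need re-applying.
ipmitool -I lanplus -H <mgmt-hostname> -U root -a chassis power status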
Thanks for the update; I think the iLO needs our local configuration re-applying to it? If so, are you OK to do that, please?
Jun 9 2022
Jun 8 2022
Jun 6 2022
This should all be done now, and I've restarted all the thanos swift frontends.
Sorry they're giving you the runaround, that sounds very annoying :( Thanks for the update!
Jun 1 2022
May 30 2022
May 25 2022
Thanks for the update, and thank you for persevering!
May 24 2022
There are (at least!) 4 ways to configure the RAID controller - its own setup utility (hit ^r during boot), the general BIOS setup (F2 during boot), the web-iDRAC interface, and the megacli tool. AFAICT the system needs to be not running while this process is carried out (because / is on the relevant drives).
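For the megacli route, a hedged example of inspecting the current layout before changing anything (standard megacli invocation):
# Show each virtual drive, its RAID level, and its member count.
sudo megacli -LDInfo -Lall -a0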
May 23 2022
Sorry! I noticed that backup2008 is configured differently to all the other backup hosts (its SSDs are both 1-member RAID-0 arrays, rather than non-RAID disks), and I was wondering if this was intentional?
Given your comment, I think the answer is "no" :)
May 20 2022
All production and pre-production codfw backends done.
May 19 2022
May 18 2022
May 11 2022
May 10 2022
Thanks for the update :)
May 9 2022
FWIW, we do occasionally see this on ms-* too, but I can never repro on demand, which might support a load-related cause; I never found much in logs. Could maybe make swift-account-stats retry a couple of times?
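Something like this minimal retry wrapper is roughly what I have in mind (a sketch only; it assumes swift-account-stats exits non-zero on the transient auth failure):
# Try up to three times, pausing briefly between attempts.
for attempt in 1 2 3; do
    swift-account-stats "$@" && exit 0
    sleep 10
done
exit 1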
May 5 2022
That's ... not inspiring optimism, is it? :(
Oh, bother.
The problem is that our swift clusters can only tolerate one failed system at a time; so I can't straightforwardly do any more reimages in the eqiad ms- cluster while this system is out.
So "as soon as reasonably possible", though I appreciate this has gone from a hopefully-quick job to a much larger piece of work, and so that's going to take some time.
eqiad is blocked on T307667 (ms-be1059 being broken); that's unrelated to the reimages (it's still on stretch), but it still blocks us as it's only safe to have one host out at once.
Broadly, there should be little impact (and our monitoring suggests error rates within expected ranges); I hope any errors were infrequent and transient.