MatthewVernon (Matthew Vernon)
User

User Details

User Since
Aug 2 2021, 1:52 PM (53 w, 6 d)
Availability
Available
LDAP User
MVernon
MediaWiki User
MVernon (WMF)

Recent Activity

Fri, Aug 12

Eevans awarded T309896: Upgrade Cassandra to latest 3.x (3.11.13) a Cookie token.
Fri, Aug 12, 2:02 PM · Patch-For-Review, Cassandra

Thu, Aug 11

MatthewVernon closed T309896: Upgrade Cassandra to latest 3.x (3.11.13) as Resolved.
Thu, Aug 11, 7:22 PM · Patch-For-Review, Cassandra
MatthewVernon updated the task description for T309896: Upgrade Cassandra to latest 3.x (3.11.13).
Thu, Aug 11, 7:21 PM · Patch-For-Review, Cassandra
MatthewVernon updated the task description for T309896: Upgrade Cassandra to latest 3.x (3.11.13).
Thu, Aug 11, 7:21 PM · Patch-For-Review, Cassandra
MatthewVernon added a comment to T308644: unstable device mapping of SSDs causing swift/puppet problems - example reimage.

Also related is that following T309027, all the SSDs on ms-* reliably appear as non-rotational, so could in theory be used to tell puppet (and, indeed, the installer) which drives are which.

Thu, Aug 11, 4:05 PM · Infrastructure-Foundations, SRE-swift-storage
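
The comment above suggests using the non-rotational flag to tell drives apart. A minimal sketch of that idea follows, assuming the kernel exposes the flag at `/sys/block/<dev>/queue/rotational` (the path queried for `md0` elsewhere in this feed). The function name `classify_block_devices` is hypothetical, and this is not how puppet or the installer currently decide.

```python
from pathlib import Path


def classify_block_devices(sysfs_block="/sys/block"):
    """Split block devices into (non_rotational, rotational) name lists.

    Hypothetical helper: reads the kernel's per-device "rotational" flag.
    """
    non_rotational, rotational = [], []
    for dev in sorted(Path(sysfs_block).iterdir()):
        flag = dev / "queue" / "rotational"
        if not flag.is_file():
            continue  # no queue attributes exposed for this entry; skip it
        # The kernel writes "0" for non-rotational devices and "1" otherwise.
        if flag.read_text().strip() == "0":
            non_rotational.append(dev.name)
        else:
            rotational.append(dev.name)
    return non_rotational, rotational
```

Calling `classify_block_devices()` on a backend would return the device names the kernel currently regards as SSD-like, which could be fed into a fact or a partitioning recipe instead of relying on `sda`/`sdb` ordering.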
MatthewVernon added a comment to T308644: unstable device mapping of SSDs causing swift/puppet problems - example reimage.

T308677 shows an example where the installer destroys a filesystem.

Thu, Aug 11, 4:03 PM · Infrastructure-Foundations, SRE-swift-storage
MatthewVernon updated subscribers of T314049: Degraded RAID on ms-be2067.

Well, I tried our usual procedure https://wikitech.wikimedia.org/wiki/Swift/How_To#Replacing_a_disk_without_touching_the_rings and the first two commands work OK, but attempting to make a new single-disk RAID out of the new drive fails because the adaptor thinks there's no new drive:

mvernon@ms-be2067:~$ sudo megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0
Thu, Aug 11, 8:37 AM · SRE-swift-storage, SRE, ops-codfw

Wed, Aug 10

MatthewVernon added a comment to T310923: Install NVMe SSDs into moss-be200[1|2] & thanos-be200?.

Sorry, this is blocking on my having time to work on thanos (and that in turn is blocking on it being in a happy state to work on, complicated by power work and hardware issues). I've not forgotten!

Wed, Aug 10, 4:29 PM · SRE, ops-codfw, SRE-swift-storage, DC-Ops
MatthewVernon added a comment to T314835: wdqs space usage on thanos-swift.

I've seen auth failures with swift-ring-manager sometimes too on thanos, anecdotally associated with high load, but there's never AFAICS anything useful logged by swift :-/

Wed, Aug 10, 3:44 PM · Patch-For-Review, Data Engineering Planning, wdwb-tech, Wikidata, SRE-swift-storage, SRE, Wikidata-Query-Service
MatthewVernon added a comment to T314049: Degraded RAID on ms-be2067.

@Papaul sorry, I don't understand your comment, but I've rechecked, and there are still kernel log errors re sdz and the idrac still thinks there's one removed drive....

Wed, Aug 10, 2:42 PM · SRE-swift-storage, SRE, ops-codfw
MatthewVernon added a comment to T314049: Degraded RAID on ms-be2067.

Perhaps relatedly, but perhaps not, kern.log is unhappy about /dev/sdz since sdc was removed:

Aug  3 15:18:02 ms-be2067 kernel: [2595942.387928] sd 0:2:2:0: SCSI device is removed
Aug  3 15:18:05 ms-be2067 kernel: [2595945.605821] sd 0:2:25:0: [sdz] tag#250 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Aug  3 15:18:05 ms-be2067 kernel: [2595945.605827] sd 0:2:25:0: [sdz] tag#250 CDB: Write(16) 8a 00 00 00 00 00 80 97 29 60 00 00 01 b8 00 00
Aug  3 15:18:05 ms-be2067 kernel: [2595945.605832] blk_update_request: I/O error, dev sdz, sector 2157390176 op 0x1:(WRITE) flags 0x800 phys_seg 55 prio class 0
Aug  3 15:18:05 ms-be2067 kernel: [2595945.617003] iomap_finish_ioend: 39 callbacks suppressed
Aug  3 15:18:15 ms-be2067 kernel: [2595945.617006] sdz1: writeback error on inode 2149037748, offset 0, sector 2157390616

...and it's still logging errors now. (I think that's the drive with Slot Number: 23, if that helps.)

Wed, Aug 10, 9:34 AM · SRE-swift-storage, SRE, ops-codfw
MatthewVernon reopened T314049: Degraded RAID on ms-be2067 as "Open".

Hi @Papaul I may be missing something obvious, but I don't think the storage is quite right here - as far as I can see there isn't a new disk visible, and if I visit the idrac, it tells me there's one drive "Physical Disk 0:2:0" in state "removed". Could you have another look, please?

Wed, Aug 10, 9:31 AM · SRE-swift-storage, SRE, ops-codfw
MatthewVernon closed T314915: / full on ms-be2028 as Resolved.

This was a "swift is trying to fill / instead of a storage device" problem, fixed following the procedure here:
https://wikitech.wikimedia.org/wiki/Swift/How_To#Cleanup_fully_used_root_filesystem

Wed, Aug 10, 8:48 AM · SRE-swift-storage
MatthewVernon created T314915: / full on ms-be2028.
Wed, Aug 10, 8:27 AM · SRE-swift-storage

Tue, Aug 9

MatthewVernon added a comment to T314509: Degraded RAID on ms-be2035.

Yep, it's the battery (like ms-be2032).

Tue, Aug 9, 2:00 PM · SRE-swift-storage, SRE, ops-codfw
MatthewVernon added a comment to T314427: Degraded RAID on ms-be2032.

Yes, it's the battery that's died (presumably related to the recent power work).

Tue, Aug 9, 1:58 PM · SRE-swift-storage, SRE, ops-codfw
MatthewVernon added a comment to T211661: Automatically clean up unused thumbnails in Swift.

While Disk Is Cheap (TM), container listing is not and our thumbs containers are the largest in terms of number-of-objects; I'm not entirely relaxed about the idea of never expiring any thumbnails ever.

Tue, Aug 9, 1:51 PM · Patch-For-Review, Traffic, SRE-swift-storage, Performance-Team, SRE

Thu, Aug 4

MatthewVernon added a project to T310145: (Need By:TBD) rack/setup/install row C new PDUs: SRE-swift-storage.
Thu, Aug 4, 1:56 PM · SRE-swift-storage, DBA, SRE, ops-codfw
MatthewVernon added a comment to T310145: (Need By:TBD) rack/setup/install row C new PDUs.

Having moved C2 to today, it needs to wait until all the ms-* nodes in D2 are fully back up before starting.

Thu, Aug 4, 1:56 PM · SRE-swift-storage, DBA, SRE, ops-codfw
MatthewVernon added a comment to T310146: (Need By:TBD) rack/setup/install row D new PDUs.

All the ms-* nodes in C4 & C7 must be back and properly in service before we can start on D2, I'm afraid. I'll be on IRC, but please don't start on D2 until I've OK'd the state of the C ms nodes.

Thu, Aug 4, 1:55 PM · Patch-For-Review, SRE-swift-storage, DBA, SRE, ops-codfw

Wed, Aug 3

MatthewVernon added a comment to T309787: Remove IEContentAnalyzer.

I don't think Swift itself has a mechanism for setting X-Content-Type-Options; so I think if we want to do this, having it done by varnish is likely to be the way to go.

Wed, Aug 3, 9:28 AM · Data-Persistence (Consultation), Patch-For-Review, Technical-Debt, MediaWiki-File-management

Tue, Aug 2

MatthewVernon added a comment to T279664: Progressive Multi-DC roll out.

Are you proposing to do away with the concept of "active" DC, then? e.g. currently swiftrepl runs from the active DC to fix up where MW failed to create / delete objects in both ms swift clusters. This enables us to answer "bucket a in codfw has object X, bucket a in eqiad does not - should X be added to eqiad, or removed from codfw?"
Without that, I'm not sure what we can do to work around the fact that MW doesn't reliably write/delete to both swift clusters...

Tue, Aug 2, 10:55 AM · SRE-swift-storage, Patch-For-Review, SRE, Traffic, serviceops, Performance-Team
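
The question in the comment above can be made concrete with a small sketch. It assumes, as the comment implies, that the active DC's listing is treated as authoritative. The function name `plan_fixups` is hypothetical and this is not swiftrepl's actual code; it only illustrates why the add-or-remove decision needs one side to be the reference.

```python
def plan_fixups(active_objects, passive_objects):
    """Return (to_copy, to_delete) that would make the passive DC match the active DC.

    Assumption: the active DC's object listing is authoritative.
    """
    active, passive = set(active_objects), set(passive_objects)
    to_copy = sorted(active - passive)    # only in the active DC: add to the passive DC
    to_delete = sorted(passive - active)  # only in the passive DC: remove it there
    return to_copy, to_delete
```

With no authoritative side, an object present in only one cluster is ambiguous: the same difference could mean a failed create or a failed delete.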
MatthewVernon added a project to T279664: Progressive Multi-DC roll out: SRE-swift-storage.
Tue, Aug 2, 10:55 AM · SRE-swift-storage, Patch-For-Review, SRE, Traffic, serviceops, Performance-Team
MatthewVernon moved T314275: Expand thanos-swift sd[ab]3 SSDs from Inbox to In progress on the SRE-swift-storage board.
Tue, Aug 2, 10:31 AM · User-fgiunchedi, SRE-swift-storage
MatthewVernon added a comment to T310145: (Need By:TBD) rack/setup/install row C new PDUs.

moss-fe2001 will need depooling in C2 before work on that rack starts.

Tue, Aug 2, 10:02 AM · SRE-swift-storage, DBA, SRE, ops-codfw

Fri, Jul 29

MatthewVernon added a project to T310146: (Need By:TBD) rack/setup/install row D new PDUs: SRE-swift-storage.
Fri, Jul 29, 2:03 PM · Patch-For-Review, SRE-swift-storage, DBA, SRE, ops-codfw
MatthewVernon added a comment to T310146: (Need By:TBD) rack/setup/install row D new PDUs.

In rack D2, ms-fe2012 needs depooling before the power goes, and if you could ping me once the rack is done so I can check all the ms-be* nodes come back up again, that'd be kind.

Fri, Jul 29, 1:59 PM · Patch-For-Review, SRE-swift-storage, DBA, SRE, ops-codfw
MatthewVernon updated subscribers of T310146: (Need By:TBD) rack/setup/install row D new PDUs.

For D7, please ping @jbond once done so he can confirm the ms-be* nodes have come back up OK.

Fri, Jul 29, 1:58 PM · Patch-For-Review, SRE-swift-storage, DBA, SRE, ops-codfw
MatthewVernon created T314143: Failed disk in ms-be1066.
Fri, Jul 29, 1:07 PM · SRE, ops-eqiad, SRE-swift-storage
MatthewVernon assigned T314049: Degraded RAID on ms-be2067 to Papaul.

Hi,
This drive is now unmounted, so can be swapped at your earliest convenience, please :)
Thanks!

Fri, Jul 29, 12:28 PM · SRE-swift-storage, SRE, ops-codfw
MatthewVernon created T314123: swift-drive-audit configuration broken on >= buster.
Fri, Jul 29, 9:19 AM · SRE-swift-storage
MatthewVernon added a comment to T222362: swift-drive-audit unmounting a drive doesn't produce any alerts or notifications.

...this behaviour has reverted, since we've gone back to using upstream swift-drive-audit, which is a cron.d entry.

Fri, Jul 29, 9:13 AM · SRE-swift-storage, SRE
MatthewVernon added a comment to T265450: flip/flop mounting filesystems between systemd and swift-drive-audit.

swift-drive-audit needs to run systemctl daemon-reload after making changes to /etc/fstab. Thanks, systemd.

Fri, Jul 29, 9:12 AM · SRE-swift-storage
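
A minimal sketch of the fix described in the comment above: after changing `/etc/fstab`, run `systemctl daemon-reload` so that systemd's generated mount units match the file. It assumes the change is commenting out the entry for a failed device. The function names and the injectable `run` parameter are hypothetical and are not taken from swift-drive-audit itself.

```python
import subprocess


def comment_out_fstab_entry(fstab_text, device):
    """Return fstab text with any line whose first field is `device` commented out."""
    out = []
    for line in fstab_text.splitlines(keepends=True):
        fields = line.split()
        if fields and fields[0] == device:
            line = "#" + line
        out.append(line)
    return "".join(out)


def update_fstab(path, device, run=subprocess.run):
    """Comment out `device` in the fstab at `path`; reload systemd if the file changed."""
    with open(path) as f:
        original = f.read()
    updated = comment_out_fstab_entry(original, device)
    if updated == original:
        return False
    with open(path, "w") as f:
        f.write(updated)
    # systemd derives mount units from fstab, so tell it to regenerate them.
    run(["systemctl", "daemon-reload"], check=True)
    return True
```

The `run` parameter exists only so the reload step can be replaced in a dry run or a test; in normal use the default `subprocess.run` is what would be called.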

Thu, Jul 28

MatthewVernon updated the task description for T309896: Upgrade Cassandra to latest 3.x (3.11.13).
Thu, Jul 28, 3:23 PM · Patch-For-Review, Cassandra
MatthewVernon closed T313102: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself as Resolved.

I've deployed this fix now, so closing this issue.

Thu, Jul 28, 9:28 AM · SRE-swift-storage

Wed, Jul 27

MatthewVernon closed T313742: Import Cassandra 3.11.13 as 'dev', Stretch, a subtask of T309896: Upgrade Cassandra to latest 3.x (3.11.13), as Resolved.
Wed, Jul 27, 10:49 AM · Patch-For-Review, Cassandra
MatthewVernon closed T313742: Import Cassandra 3.11.13 as 'dev', Stretch as Resolved.

Done.

Wed, Jul 27, 10:49 AM · Cassandra, Infrastructure-Foundations

Mon, Jul 25

MatthewVernon added a subtask for T309896: Upgrade Cassandra to latest 3.x (3.11.13): T313742: Import Cassandra 3.11.13 as 'dev', Stretch.
Mon, Jul 25, 4:52 PM · Patch-For-Review, Cassandra
MatthewVernon added a parent task for T313742: Import Cassandra 3.11.13 as 'dev', Stretch: T309896: Upgrade Cassandra to latest 3.x (3.11.13).
Mon, Jul 25, 4:52 PM · Cassandra, Infrastructure-Foundations
MatthewVernon closed T312643: Adjust ms ring min_part_hours to 12 hours as Resolved.

Rings adjusted.

Mon, Jul 25, 1:21 PM · SRE-swift-storage
MatthewVernon added a comment to T309957: (Need By:TBD) rack/setup/install row A new PDUs.

I will need to check the state of the swift backends in A7 before it'll be safe to start on B2/4 (but B1/5 have no swift backends in).

Mon, Jul 25, 9:09 AM · Patch-For-Review, SRE, ops-codfw
MatthewVernon added a comment to T310070: (Need By:TBD) rack/setup/install row B new PDUs.

Hi,
In B2, ms-fe2010 and thanos-fe2002 will need depooling.

Mon, Jul 25, 9:04 AM · Patch-For-Review, DBA, SRE, ops-codfw

Thu, Jul 21

MatthewVernon added a comment to T313102: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself.

@tstarling thanks for finding that history. That is pushing me harder towards "just disable the request replication", particularly given it looks like it is sometimes causing us problems.

Thu, Jul 21, 7:54 AM · SRE-swift-storage

Tue, Jul 19

MatthewVernon added a comment to T313102: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself.

Are there other teams you think we should talk to before turning this off, then?

Tue, Jul 19, 4:00 PM · SRE-swift-storage
MatthewVernon added a comment to T313102: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself.

@fgiunchedi the obvious starter-for-ten approach to stopping this happening again would be to just ditch the cross-DC-thumbnail copying, since it seems to be sometimes causing problems. What would you feel about that as an approach?

Tue, Jul 19, 2:51 PM · SRE-swift-storage
MatthewVernon added a comment to T313102: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself.

The WMF rewrite middleware always makes me sad, and here it is doing so again...

Tue, Jul 19, 1:32 PM · SRE-swift-storage

Jul 15 2022

MatthewVernon added a comment to T313102: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself.

@tstarling thanks for fixing.

Jul 15 2022, 5:03 PM · SRE-swift-storage

Jul 8 2022

MatthewVernon moved T312643: Adjust ms ring min_part_hours to 12 hours from Inbox to In progress on the SRE-swift-storage board.
Jul 8 2022, 12:56 PM · SRE-swift-storage
MatthewVernon created T312643: Adjust ms ring min_part_hours to 12 hours.
Jul 8 2022, 12:53 PM · SRE-swift-storage
MatthewVernon updated subscribers of T299125: Swiftrepl doesn't work on bullseye (and swiftrepl.conf is deployed by hand).
Jul 8 2022, 12:48 PM · SRE-swift-storage

Jul 6 2022

MatthewVernon closed T309027: Poweredge R730xd, R740xd, R740xd2 SSDs not visible to OS as SSDs as Resolved.

Remaining nodes done by hand during reboots for T310483:

mvernon@cumin1001:~$ sudo cumin 'ms-be*' 'cat /sys/block/md0/queue/rotational'
85 hosts will be targeted:
ms-be[2028-2069].codfw.wmnet,ms-be[1028-1033,1035-1071].eqiad.wmnet
Ok to proceed on 85 hosts? Enter the number of affected hosts to confirm or "q" to quit 85
===== NODE GROUP =====
(85) ms-be[2028-2069].codfw.wmnet,ms-be[1028-1033,1035-1071].eqiad.wmnet
----- OUTPUT of 'cat /sys/block/md0/queue/rotational' -----
0
================
Jul 6 2022, 1:18 PM · DC-Ops, Infrastructure-Foundations

Jul 5 2022

MatthewVernon added a comment to T311690: Shorten Thanos retention.

LGTM :)

Jul 5 2022, 2:28 PM · User-fgiunchedi, SRE-swift-storage

Jun 21 2022

MatthewVernon updated subscribers of T311066: rsync::server::module installs an rsync server even when $ensure is absent.
Jun 21 2022, 1:46 PM · SRE-swift-storage, Infrastructure-Foundations
MatthewVernon created T311066: rsync::server::module installs an rsync server even when $ensure is absent.
Jun 21 2022, 1:37 PM · SRE-swift-storage, Infrastructure-Foundations

Jun 17 2022

MatthewVernon updated the task description for T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options.
Jun 17 2022, 2:33 PM · SRE-swift-storage

Jun 15 2022

MatthewVernon added a comment to T307667: Power drain and restart of ms-be1059.

That seems to have done the trick, thanks :-)

Jun 15 2022, 3:33 PM · SRE, SRE-swift-storage, ops-eqiad
MatthewVernon added a comment to T307667: Power drain and restart of ms-be1059.

@Cmjohnson Ah, OK, I sort-of assumed we had HTML5 console everywhere. Now I know better :)

Jun 15 2022, 1:43 PM · SRE, SRE-swift-storage, ops-eqiad
MatthewVernon added a comment to T307667: Power drain and restart of ms-be1059.

Hi,
Sorry for further noise, but: is it possible there's some USB thing still connected to this system?
It won't reimage because of:

[   12.234765] sd 0:0:0:0: [sda] Attached SCSI removable disk

(which means the SSDs end up at b and c)
This looks to be a USB device:

~ # ls -l /dev/disk/by-path/pci-0000\:00\:14.0-usb-0\:4\:1.0-scsi-0\:0\:0\:0 
lrwxrwxrwx    1 root     root             9 Jun 15 09:46 /dev/disk/by-path/pci-0000:00:14.0-usb-0:4:1.0-scsi-0:0:0:0 -> ../../sda
~ # lsscsi | grep 0:0:0:0
[0:0:0:0]       disk    Generic-        SD/MMC CRW      1.00

Similar systems (e.g. ms-be1058) don't have such a thing.

Jun 15 2022, 10:26 AM · SRE, SRE-swift-storage, ops-eqiad
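
The manual check in the comment above (looking for a `-usb-` entry under `/dev/disk/by-path` that resolves to `sda`) could be scripted roughly as follows. This is a sketch under the assumption that udev's by-path names contain `-usb-` for USB-attached devices, as in the example shown; the function name `usb_backed_devices` is hypothetical.

```python
import os


def usb_backed_devices(by_path_dir="/dev/disk/by-path"):
    """Map device names (e.g. "sda") to the by-path links that mark them as USB-attached."""
    found = {}
    for name in sorted(os.listdir(by_path_dir)):
        if "-usb-" not in name:
            continue  # not a USB path according to the link name
        # Follow the symlink to find which block device it points at.
        target = os.path.realpath(os.path.join(by_path_dir, name))
        found.setdefault(os.path.basename(target), []).append(name)
    return found
```

If `"sda"` appears in the result on a host where the SSDs are expected at `sda` and `sdb`, the device naming has been shifted by a USB device, as in the case above.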
MatthewVernon reopened T307667: Power drain and restart of ms-be1059 as "Open".

Sorry to reopen this task, but there's a licence issue, I think. When I try and use the HTML5 console, it says:
"License Required
This iLO is not licensed to use the Integrated Remote Console after server POST is complete."

Jun 15 2022, 9:32 AM · SRE, SRE-swift-storage, ops-eqiad
MatthewVernon reopened T307667: Power drain and restart of ms-be1059, a subtask of T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options, as Open.
Jun 15 2022, 9:32 AM · SRE-swift-storage

Jun 14 2022

MatthewVernon added a comment to T307801: Bootstrap new Cassandra nodes (codfw).

It'll take the cookbooks a while to catch up (they back-off in increasing intervals waiting for puppet to be OK), but after some deployment-related hassle, all of these nodes have had puppet run to completion on them OK, so they should be good to go for you @Eevans .

Jun 14 2022, 2:10 PM · Data-Engineering-Radar, Generated Data Platform, Cassandra
MatthewVernon archived P29736 (An Untitled Masterwork).
Jun 14 2022, 11:51 AM
MatthewVernon created P29736 (An Untitled Masterwork).
Jun 14 2022, 11:51 AM

Jun 13 2022

MatthewVernon added a comment to T307801: Bootstrap new Cassandra nodes (codfw).

Trying a reimage of aqs2001 with buster.

Jun 13 2022, 3:32 PM · Data-Engineering-Radar, Generated Data Platform, Cassandra
MatthewVernon added a project to T310478: hw troubleshooting: remote IPMI not working for ms-be105[7-8].eqiad.wmnet: SRE-swift-storage.
Jun 13 2022, 12:26 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops
MatthewVernon closed T310478: hw troubleshooting: remote IPMI not working for ms-be105[7-8].eqiad.wmnet, a subtask of T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options, as Resolved.
Jun 13 2022, 12:25 PM · SRE-swift-storage
MatthewVernon closed T310478: hw troubleshooting: remote IPMI not working for ms-be105[7-8].eqiad.wmnet as Resolved.

This turned out to be an incorrect config section - I've updated https://wikitech.wikimedia.org/wiki/Management_Interfaces#Is_remote_IPMI_enabled? to note this, and how to fix it on HP kit.

Jun 13 2022, 12:25 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops
MatthewVernon added a subtask for T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options: T310478: hw troubleshooting: remote IPMI not working for ms-be105[7-8].eqiad.wmnet.
Jun 13 2022, 9:50 AM · SRE-swift-storage
MatthewVernon added a parent task for T310478: hw troubleshooting: remote IPMI not working for ms-be105[7-8].eqiad.wmnet: T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options.
Jun 13 2022, 9:50 AM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops
MatthewVernon triaged T310478: hw troubleshooting: remote IPMI not working for ms-be105[7-8].eqiad.wmnet as High priority.
Jun 13 2022, 9:48 AM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops
MatthewVernon updated the task description for T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options.
Jun 13 2022, 8:35 AM · SRE-swift-storage
MatthewVernon added a comment to T307667: Power drain and restart of ms-be1059.

Thanks for the update; I think the ILO needs our local configuration re-applying to it? If so, are you OK to do that, please?

Jun 13 2022, 8:34 AM · SRE, SRE-swift-storage, ops-eqiad

Jun 9 2022

MatthewVernon archived P29589 (An Untitled Masterwork).
Jun 9 2022, 10:26 AM
MatthewVernon created P29589 (An Untitled Masterwork).
Jun 9 2022, 10:26 AM
MatthewVernon updated the task description for T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options.
Jun 9 2022, 10:12 AM · SRE-swift-storage

Jun 8 2022

MatthewVernon archived P29523 (An Untitled Masterwork).
Jun 8 2022, 11:14 AM
MatthewVernon created P29523 (An Untitled Masterwork).
Jun 8 2022, 11:14 AM

Jun 6 2022

MatthewVernon committed rLPRI325c044c6b42: profile::thanos::swift: fake creds for search_platform (authored by MatthewVernon).
Jun 6 2022, 2:23 PM
MatthewVernon closed T309715: Create swift thanos account for Search platform team, a subtask of T309648: Restore lost index in cloudelastic, as Resolved.
Jun 6 2022, 2:00 PM · MW-1.39-notes (1.39.0-wmf.18; 2022-06-27), Patch-For-Review, Discovery-Search (Current work)
MatthewVernon closed T309715: Create swift thanos account for Search platform team as Resolved.

This should all be done now, and I've restarted all the thanos swift frontends.

Jun 6 2022, 2:00 PM · Discovery-Search (Current work), SRE-swift-storage
MatthewVernon added a comment to T307667: Power drain and restart of ms-be1059.

Sorry they're giving you the runaround, that sounds very annoying :( Thanks for the update!

Jun 6 2022, 7:24 AM · SRE, SRE-swift-storage, ops-eqiad

Jun 1 2022

MatthewVernon updated the task description for T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options.
Jun 1 2022, 3:57 PM · SRE-swift-storage

May 30 2022

MatthewVernon updated the task description for T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options.
May 30 2022, 9:51 AM · SRE-swift-storage

May 25 2022

MatthewVernon added a comment to T307667: Power drain and restart of ms-be1059.

Thanks for the update, and thank you for persevering!

May 25 2022, 8:12 AM · SRE, SRE-swift-storage, ops-eqiad

May 24 2022

MatthewVernon added a comment to T309027: Poweredge R730xd, R740xd, R740xd2 SSDs not visible to OS as SSDs.

There are (at least!) 4 ways to configure the RAID controller - its own setup utility (hit ^r during boot), the general BIOS setup (F2 during boot), the web-iDRAC interface, and the megacli tool. AFAICT the system needs to be not running while this process is carried out (because / is on the relevant drives).

May 24 2022, 3:17 PM · DC-Ops, Infrastructure-Foundations

May 23 2022

MatthewVernon added a comment to T309027: Poweredge R730xd, R740xd, R740xd2 SSDs not visible to OS as SSDs.

Sorry! I noticed that backup2008 is configured differently to all the other backup hosts (its SSDs are both 1-member RAID-0 arrays, rather than non-RAID disks), and I was wondering if this was intentional?
Given your comment, I think the answer is "no" :)

May 23 2022, 2:59 PM · DC-Ops, Infrastructure-Foundations
MatthewVernon created T309027: Poweredge R730xd, R740xd, R740xd2 SSDs not visible to OS as SSDs.
May 23 2022, 2:08 PM · DC-Ops, Infrastructure-Foundations

May 20 2022

MatthewVernon added a comment to T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options.

All production and pre-production codfw backends done.

May 20 2022, 4:05 PM · SRE-swift-storage
MatthewVernon updated the task description for T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options.
May 20 2022, 4:04 PM · SRE-swift-storage

May 19 2022

MatthewVernon updated the task description for T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options.
May 19 2022, 10:01 AM · SRE-swift-storage

May 18 2022

MatthewVernon created T308677: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem.
May 18 2022, 2:27 PM · Infrastructure-Foundations, SRE-swift-storage
MatthewVernon created T308644: unstable device mapping of SSDs causing swift/puppet problems - example reimage.
May 18 2022, 9:15 AM · Infrastructure-Foundations, SRE-swift-storage

May 11 2022

MatthewVernon updated the task description for T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options.
May 11 2022, 12:36 PM · SRE-swift-storage

May 10 2022

MatthewVernon added a comment to T307667: Power drain and restart of ms-be1059.

Thanks for the update :)

May 10 2022, 6:04 PM · SRE, SRE-swift-storage, ops-eqiad

May 9 2022

MatthewVernon added a comment to T307907: swift-account-stats failures on thanos-swift.

FWIW, we do occasionally see this on ms-* too, but I can never repro on demand, which might support a load-related cause; I never found much in logs. Could maybe make swift-account-stats retry a couple of times?

May 9 2022, 2:15 PM · User-fgiunchedi, SRE, SRE-swift-storage
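
The retry suggestion in the comment above might look like the sketch below. It assumes the failing call can be wrapped in a Python callable and that a short, increasing pause between attempts is acceptable. The helper name `retry` and its parameters are hypothetical, not part of swift-account-stats.

```python
import time


def retry(func, attempts=3, delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func() up to `attempts` times, pausing longer after each failure.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the real error
            sleep(delay * attempt)  # linear back-off: delay, 2*delay, ...
```

Narrowing `exceptions` to the specific auth or connection error would keep genuine bugs from being retried silently.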
MatthewVernon updated the task description for T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options.
May 9 2022, 9:22 AM · SRE-swift-storage

May 5 2022

MatthewVernon added a comment to T307667: Power drain and restart of ms-be1059.

That's ... not inspiring optimism, is it? :(

May 5 2022, 4:27 PM · SRE, SRE-swift-storage, ops-eqiad
MatthewVernon added a comment to T307667: Power drain and restart of ms-be1059.

Oh, bother.
The problem is that our swift clusters can tolerate one failed system; so I can't straightforwardly do any more reimages in the eqiad ms- cluster while this system is out.
So "as soon as reasonably possible", though I appreciate this has gone from a hopefully-quick job to a much larger piece of work, and so that's going to take some time.

May 5 2022, 3:53 PM · SRE, SRE-swift-storage, ops-eqiad
MatthewVernon added a comment to T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options.

eqiad blocked on T307667 ms-be1059 being broken; that's unrelated to the reimages (it's still on stretch), but still blocks us as it's only safe to have one host out at once.

May 5 2022, 10:50 AM · SRE-swift-storage
MatthewVernon added a comment to T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options.

Broadly, there should be little impact (and our monitoring suggests error rates within expected ranges); I hope any errors were infrequent and transient.

May 5 2022, 10:48 AM · SRE-swift-storage