MatthewVernon (Matthew Vernon)
SRE (Data Persistence)

User Details

User Since
Aug 2 2021, 1:52 PM (227 w, 16 h)
Availability
Available
IRC Nick
Emperor
LDAP User
MVernon
MediaWiki User
MVernon (WMF) [ Global Accounts ]

Recent Activity

Thu, Dec 4

MatthewVernon added a comment to T390251: docker-registry.wikimedia.org keeps serving bad blobs.

per radosgw-admin user stats --uid=docker-registry there are only 11 objects in that account, which I think equates to it not being currently used.

Thu, Dec 4, 4:32 PM · Patch-For-Review, serviceops

Tue, Dec 2

MatthewVernon added a comment to T410959: Degraded RAID on ms-fe2014.

@Jhancock.wm RAID rebuilt OK, server back in production. Thanks for your help here :)

Tue, Dec 2, 6:33 PM · SRE, DC-Ops, ops-codfw
MatthewVernon closed T400878: Install new JBOD disk controllers into SM swift backends as Resolved.
Tue, Dec 2, 2:45 PM · SRE-swift-storage, DC-Ops, SRE

Mon, Dec 1

MatthewVernon added a comment to T371620: (some) Gitlab builds hanging.

@dancy it's been a while now, but I think we just moved trafficserver to specify tags: [wmcs] for all its jobs and that worked around the issue.

Mon, Dec 1, 5:46 PM · Release-Engineering-Team (Priority Backlog 📥), GitLab (CI & Job Runners)
MatthewVernon added a comment to T410959: Degraded RAID on ms-fe2014.

@Jhancock.wm please go ahead - server is depooled.

Mon, Dec 1, 4:32 PM · SRE, DC-Ops, ops-codfw

Tue, Nov 25

MatthewVernon added a project to T410304: Measure request frequency of thumbnail sizes: Page-Previews.
Tue, Nov 25, 4:41 PM · Page-Previews, MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon added a comment to T410304: Measure request frequency of thumbnail sizes.

Great find, thank you!

Tue, Nov 25, 4:29 PM · Page-Previews, MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic

Fri, Nov 21

MatthewVernon updated the task description for T408715: Compile a list of "canonical" thumbnail sizes.
Fri, Nov 21, 2:09 PM · MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon updated the task description for T408715: Compile a list of "canonical" thumbnail sizes.
Fri, Nov 21, 1:56 PM · MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon added a comment to T410304: Measure request frequency of thumbnail sizes.

I have solved the easy one, though: ecosia. If you image search on there (e.g. https://www.ecosia.org/images?q=cattle) and find the wikipedia hit (about fourth row down), it's hard-coding the link to https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Cow_(Fleckvieh_breed)_Oeschinensee_Slaunger_2009-07-07.jpg/480px-Cow_(Fleckvieh_breed)_Oeschinensee_Slaunger_2009-07-07.jpg (though that does bring about the question of where/how it's getting that from)

Fri, Nov 21, 10:30 AM · Page-Previews, MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon added a comment to T410304: Measure request frequency of thumbnail sizes.

Thanks! I've spent a fair chunk of time searching and have come up with nothing. My next stop is likely #no-stupid-questions...

Fri, Nov 21, 10:27 AM · Page-Previews, MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic

Thu, Nov 20

MatthewVernon added a comment to T410304: Measure request frequency of thumbnail sizes.

So 480 is quite common, but hasn't shown up in our search. I thought it might be instructive to check referer:

select referer, count(*) as hits from wmf.webrequest where webrequest_source='upload' and year=2025 and month=10 and day=24 and hour=10 and http_status='200' and uri_path like '/wikipedia/%/thumb/%' and regexp_extract(uri_path, '([0-9]+)px[^/]+$')='480' group by referer order by hits desc LIMIT 10;

Gives us

Thu, Nov 20, 5:08 PM · Page-Previews, MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon updated the task description for T408715: Compile a list of "canonical" thumbnail sizes.
Thu, Nov 20, 4:28 PM · MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon updated the task description for T408715: Compile a list of "canonical" thumbnail sizes.
Thu, Nov 20, 4:18 PM · MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon added a comment to T372165: Reduce number of bucketsizes for MediaViewer.

(to be clear, the testwiki link above does result in 500 at smallest, not 400 as I get with commons. I don't know if that's expected)

Thu, Nov 20, 4:07 PM · Readers Essential Work 2025, Reader Growth Team (Sprint 4 (Nov 12 - Nov 25) Q2 25/26)), MW-1.45-notes (1.45.0-wmf.24; 2025-10-21), MediaViewer
MatthewVernon updated the task description for T408715: Compile a list of "canonical" thumbnail sizes.
Thu, Nov 20, 3:34 PM · MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon updated the task description for T408715: Compile a list of "canonical" thumbnail sizes.
Thu, Nov 20, 3:23 PM · MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon added a comment to T372165: Reduce number of bucketsizes for MediaViewer.

Can we maybe adjust the set of sizes in the light of what is already in common use (cf T410304) - e.g. 400 is currently very uncommon compared to 500 or 330, so if we're trying to rationalise, it'd make sense to go with one of those rather than 400.

Due to T360589 they already switch to the thumbnail steps; that's why they are being served as 500px while showing 400px in the list. T410556 will fix the text lying to the user, but it's not a different size. Most of the sizes listed in T408715 are similar and have no effect on the URL of the thumbnail (what cdn/swift/thumbor sees and cares about), only on the visual size of the image.

Thu, Nov 20, 3:02 PM · Readers Essential Work 2025, Reader Growth Team (Sprint 4 (Nov 12 - Nov 25) Q2 25/26)), MW-1.45-notes (1.45.0-wmf.24; 2025-10-21), MediaViewer

Wed, Nov 19

MatthewVernon added a comment to T401832: Upgrade Traffic hosts to trixie.

@BCornwall Pcre2 was first released in 2015. Pcre3 stopped receiving any upstream support (including security fixes) back in 2021, and I filed bugs against all packages depending on the obsolete pcre3 late in 2021. The initial aim had been to not ship pcre3 in Bookworm, but there were enough stragglers that the removal of pcre3 was delayed until the trixie development cycle. So while pcre3 was dropped from Debian in February of 2025 (and thus didn't go into trixie), this has been coming for quite some time. I wouldn't want to use pcre3 in any context involving untrusted input at this point. The author of pcre3 has handed over pcre maintenance, so the current maintainers have very little exposure to the old pcre3 code base.

Wed, Nov 19, 9:22 AM · Traffic

Tue, Nov 18

MatthewVernon added a comment to T372165: Reduce number of bucketsizes for MediaViewer.

Can we maybe adjust the set of sizes in the light of what is already in common use (cf T410304) - e.g. 400 is currently very uncommon compared to 500 or 330, so if we're trying to rationalise, it'd make sense to go with one of those rather than 400.

Tue, Nov 18, 11:36 PM · Readers Essential Work 2025, Reader Growth Team (Sprint 4 (Nov 12 - Nov 25) Q2 25/26)), MW-1.45-notes (1.45.0-wmf.24; 2025-10-21), MediaViewer
MatthewVernon updated the task description for T408062: FY 25/26 WE 5.4.7 Standardize thumbnail sizes.
Tue, Nov 18, 11:32 PM · MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon added a comment to T405942: eqiad row C/D Data Persistence host migrations.

@RobH / @Jclark-ctr as I noted above, moss-be1002 can be done whenever, I'd just like to be told when you're going to do it, please.

Tue, Nov 18, 4:53 PM · media-backups, DBA, Data-Persistence, SRE, DC-Ops, ops-eqiad
MatthewVernon added a comment to T410304: Measure request frequency of thumbnail sizes.

It's about 0.5% difference in count of 250, which isn't a vast amount, but it's not nothing. And the ranking of the top-30-by-hits changes (at least 200/600 swap places, there are other shifts too, albeit not in the top 10). So I think it was worth spending a little time working on improving the query.

Tue, Nov 18, 4:17 PM · Page-Previews, MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon added a comment to T409036: Disk (sdf) failed in thanos-be2008.

@Jhancock.wm I've just jbodded the new drive, and it seems good, thanks :)

Tue, Nov 18, 2:52 PM · SRE-swift-storage, SRE, ops-codfw, DC-Ops
MatthewVernon archived P84991 cope with both variable length uri_path (might or might not have /archive/ in) and that size may be NNNpx or prefix-NNNpx.
Tue, Nov 18, 12:15 PM
MatthewVernon added a comment to P84991 cope with both variable length uri_path (might or might not have /archive/ in) and that size may be NNNpx or prefix-NNNpx.

Now obsoleted by a regexp-based approach (see https://phabricator.wikimedia.org/T410304#11383363)

Tue, Nov 18, 12:15 PM
MatthewVernon added a comment to T410304: Measure request frequency of thumbnail sizes.

A couple of notes on extracting thumbnail size from uri_path - a previous approach used

SELECT split(split(uri_path, '/')[7], 'px-')[0] as thumbsize

but this has a number of shortcomings, particularly that the array index of 7 is fragile, and incorrect for e.g. /archive/ thumbs. So I refined it somewhat to take the final path element, and then split that at px- and then split the result on - and take the final element (thus coping with prefix-NNNpx like you get with translated SVG files):

select slice(split(split(slice(split(uri_path, '/'),-1,1)[0], 'px-')[0],'-'),-1,1)[0] as thumbsize, count(*) as hits from wmf.webrequest where webrequest_source = 'upload' and year = 2025 and month = 10 and day = 24 and hour = 10 and http_status = '200' and uri_path like '/wikipedia/%/thumb/%' group by thumbsize order by hits desc limit 10;

This still left a very few stragglers (15, mostly SVG files with URL-encodings in their names), which is likely good enough, but we can do better with a simple regexp:

select regexp_extract( slice(split(uri_path, '/'),-1,1)[0], '([0-9]+)px') as thumbsize, count(*) as hits from wmf.webrequest where webrequest_source = 'upload' and year = 2025 and month = 10 and day = 24 and http_status = '200' and uri_path like '/wikipedia/%/thumb/%' group by thumbsize order by hits desc;

This produces the same answers (modulo the 15 errors), is clearer, and only takes ~10% longer to run. Finally, of course, we can just do the whole operation with a single regexp - to match for thumbsize as previous and then state that it must be followed by only not-/ characters:

select regexp_extract(uri_path, '([0-9]+)px[^/]+$') as thumbsize, count(*) as hits from wmf.webrequest where webrequest_source = 'upload' and year = 2025 and month = 10 and day = 24 and http_status = '200' and uri_path like '/wikipedia/%/thumb/%' group by thumbsize order by hits desc;
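
For anyone wanting to try the pattern outside Spark, here is a minimal Python sketch of the same single-regexp extraction. It assumes Python's re.search picks the same leftmost match that regexp_extract does, and it returns an empty string when nothing matches (which is what regexp_extract does). The function name and the sample paths are hypothetical, made up for illustration, and are not taken from wmf.webrequest.

```python
import re
from collections import Counter

# Same pattern as the final query: digits, then "px", then only
# non-"/" characters up to the end of the path.
THUMB_SIZE = re.compile(r"([0-9]+)px[^/]+$")


def thumb_size(uri_path: str) -> str:
    """Return the thumbnail width as a string, or "" when nothing matches."""
    match = THUMB_SIZE.search(uri_path)
    return match.group(1) if match else ""


# Hypothetical sample paths: a lang-prefixed SVG thumb, a plain thumb,
# and an /archive/ thumb (which has one extra path element).
paths = [
    "/wikipedia/commons/thumb/6/66/Armoiries_de_Marseille.svg/langfr-250px-Armoiries_de_Marseille.svg.png",
    "/wikipedia/commons/thumb/8/8c/Cow.jpg/480px-Cow.jpg",
    "/wikipedia/commons/thumb/archive/8/8c/20200101000000%21Cow.jpg/250px-Cow.jpg",
]

# Equivalent of "group by thumbsize" plus count(*) on the sample.
hits = Counter(thumb_size(p) for p in paths)
```

Because the pattern is anchored at the end of the path and forbids "/" after "px", neither the extra /archive/ element nor a prefix before the size changes the result.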
Tue, Nov 18, 12:14 PM · Page-Previews, MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon added a comment to T405958: Q2:rack/setup/install ms-be209[0-4].

@Jhancock.wm reimage had stalled again because puppet wasn't happy, again because of an EFI/vfat partition on one of the spinning disks

Notice: /Stage[main]/Profile::Swift::Storage::Configure_disks/Exec[mkfs-pci-0000:50:00.0-scsi-0:2:12:0]/returns: mkfs.xfs: /dev/disk/by-path/pci-0000:50:00.0-scsi-0:2:12:0-part1 appears to contain an existing filesystem (vfat).

I wiped that partition (and then the partition table of the drive), then the reimage went OK.

Tue, Nov 18, 9:28 AM · SRE, ops-codfw, SRE-swift-storage, Data-Persistence, DC-Ops
MatthewVernon updated the task description for T410304: Measure request frequency of thumbnail sizes.
Tue, Nov 18, 8:20 AM · Page-Previews, MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic

Mon, Nov 17

MatthewVernon updated the task description for T410304: Measure request frequency of thumbnail sizes.
Mon, Nov 17, 5:59 PM · Page-Previews, MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon created T410304: Measure request frequency of thumbnail sizes.
Mon, Nov 17, 5:58 PM · Page-Previews, MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon added a comment to T405958: Q2:rack/setup/install ms-be209[0-4].

@Jhancock.wm for each of ms-be209[0-3] the install was failing because puppet couldn't run, because one of the spinning disks had a vfat partition on it containing an EFI setup. In each case, wiping that filesystem (and the partition table of the drive) unwedged puppet and I could then reimage them OK.

Mon, Nov 17, 3:58 PM · SRE, ops-codfw, SRE-swift-storage, Data-Persistence, DC-Ops
MatthewVernon added a comment to T360589: De-fragment thumbnail sizes in mediawiki.

@Romaine the change was done under T408715 not this task, FWIW.

Mon, Nov 17, 9:26 AM · MW-1.44-notes (1.44.0-wmf.20; 2025-03-11), Epic, Commons, MediaWiki-File-management, Data-Persistence

Nov 6 2025

MatthewVernon added a comment to P84991 cope with both variable length uri_path (might or might not have /archive/ in) and that size may be NNNpx or prefix-NNNpx.

We want the final element of uri_path split by / (it's not a fixed length because of archive thumbs).
Then to take the string up to "px-" (which is usually just the size) hence splitting on "px-" and taking the first element.
The complication is that there are a number of prefixes that might come before the size (e.g. langfr-250px for a translated SVG file), so we then want the last element of that string split on '-'.
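
The three steps above can be sketched in plain Python as follows. This is only an illustration of the split logic, not the Spark expression from the paste. The function name and the sample paths are hypothetical.

```python
def thumb_size_by_split(uri_path: str) -> str:
    """Extract the thumbnail size using only string splits."""
    # 1. Final element of the path (its position varies for archive thumbs).
    last = uri_path.split("/")[-1]
    # 2. Everything before the first "px-", usually just the size.
    before_px = last.split("px-")[0]
    # 3. Last "-"-separated piece, which drops a prefix such as "langfr".
    return before_px.split("-")[-1]


# Hypothetical sample paths.
plain = "/wikipedia/commons/thumb/8/8c/Cow.jpg/480px-Cow.jpg"
prefixed = "/wikipedia/commons/thumb/6/66/Armoiries_de_Marseille.svg/langfr-250px-Armoiries_de_Marseille.svg.png"
```

Note that when the final element contains no "px-" at all, this returns whatever follows the last "-" in the name rather than a size, so the result still needs a sanity check.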

Nov 6 2025, 12:01 PM
MatthewVernon added a comment to P84991 cope with both variable length uri_path (might or might not have /archive/ in) and that size may be NNNpx or prefix-NNNpx.

(spark 3.3.0 gains the split_part function, which would make this rather simpler)

Nov 6 2025, 11:29 AM
MatthewVernon created P84991 cope with both variable length uri_path (might or might not have /archive/ in) and that size may be NNNpx or prefix-NNNpx.
Nov 6 2025, 11:25 AM

Nov 5 2025

MatthewVernon updated the task description for T354872: Re-IP Swift hosts to per-rack subnets in codfw rows A-D.
Nov 5 2025, 4:32 PM · SRE-swift-storage, Infrastructure-Foundations, SRE
MatthewVernon added a comment to T400878: Install new JBOD disk controllers into SM swift backends.

All completed now.

Nov 5 2025, 4:15 PM · SRE-swift-storage, DC-Ops, SRE
MatthewVernon updated the task description for T400876: Install new disk controllers to SM swift backends (codfw).
Nov 5 2025, 4:15 PM · ops-codfw, DC-Ops, SRE, SRE-swift-storage
MatthewVernon closed T400876: Install new disk controllers to SM swift backends (codfw), a subtask of T400878: Install new JBOD disk controllers into SM swift backends, as Resolved.
Nov 5 2025, 4:14 PM · SRE-swift-storage, DC-Ops, SRE
MatthewVernon closed T400876: Install new disk controllers to SM swift backends (codfw) as Resolved.

@Jhancock.wm Thanks! We're all done here now :)

Nov 5 2025, 4:14 PM · ops-codfw, DC-Ops, SRE, SRE-swift-storage
MatthewVernon updated the task description for T354872: Re-IP Swift hosts to per-rack subnets in codfw rows A-D.
Nov 5 2025, 4:13 PM · SRE-swift-storage, Infrastructure-Foundations, SRE
MatthewVernon closed T407513: Key packages missing from trixie-wikimedia as Resolved.

The necessary packages are now all available, and ms-be1088 managed to run puppet OK as a trixie host. Thanks, all :)

Nov 5 2025, 2:35 PM · Infrastructure-Foundations, SRE-swift-storage, SRE
MatthewVernon added a comment to T409036: Disk (sdf) failed in thanos-be2008.

:(
IME the iDRAC basically never notices a bad disk. The kernel log above (and the Media error reported by perccli64) are all the errors I have.

Nov 5 2025, 2:15 PM · SRE-swift-storage, SRE, ops-codfw, DC-Ops
MatthewVernon added a comment to T409253: Continuous breakages of apt-staging.

It might still be caching, but https://apt-staging.wikimedia.org/wikimedia-staging/dists/trixie-wikimedia/main/binary-amd64/ is saying the Packages file is un-updated since 28 Oct (and it lacks e.g. python3-conftool which should be there by now).

Nov 5 2025, 8:31 AM · Infrastructure-Foundations, GitLab, collaboration-services

Nov 3 2025

MatthewVernon added a comment to T409036: Disk (sdf) failed in thanos-be2008.

@Jhancock.wm that's a good question, to which I don't have a good answer :-/ I think my inclination would be to go for a like-for-like replacement (if nothing else to avoid surprising ourselves later).

Nov 3 2025, 4:15 PM · SRE-swift-storage, SRE, ops-codfw, DC-Ops
MatthewVernon added a comment to T400876: Install new disk controllers to SM swift backends (codfw).

Hi @Jhancock.wm ms-be208[5-7] are now ready for you to swap their controllers, please. I've downtimed them for a couple of days, so please go ahead whenever suits you.

Nov 3 2025, 3:52 PM · ops-codfw, DC-Ops, SRE, SRE-swift-storage
MatthewVernon updated the task description for T400876: Install new disk controllers to SM swift backends (codfw).
Nov 3 2025, 3:50 PM · ops-codfw, DC-Ops, SRE, SRE-swift-storage
MatthewVernon added a comment to T409040: Disk (sdl) failed in ms-be1074.

Cool, thank you :)

Nov 3 2025, 3:49 PM · SRE-swift-storage, ops-eqiad, SRE, DC-Ops
MatthewVernon added a comment to T409040: Disk (sdl) failed in ms-be1074.

Thanks! Do we have a suitable spare in stock still, in the meantime?

Nov 3 2025, 3:42 PM · SRE-swift-storage, ops-eqiad, SRE, DC-Ops
MatthewVernon added a comment to T404356: UEFI installer not installing grub correctly (at least on systems where / is RAID).

@elukey while I'm at it, you also have a Dell Config-J system for testing (ms-be2078, T406964); are you finished with that host now? It's fine if you still want it, I just don't want to forget about it :)

Nov 3 2025, 3:02 PM · SRE-swift-storage, Infrastructure-Foundations
MatthewVernon added a comment to T404356: UEFI installer not installing grub correctly (at least on systems where / is RAID).

Tried to reimage again, there are some HTTP boot issues that we are trying to solve in T394357 but once d-i is triggered I don't see any issue after the reboot. At this point I'd like to get a Supermicro Config J host among the ms-be ones to try some reimages and also a reimage with a specific config for the host to skip configure_swift_disks() in d-i's early_partman settings. @MatthewVernon is there an ms-be host that I can use for tests?

Nov 3 2025, 2:55 PM · SRE-swift-storage, Infrastructure-Foundations
MatthewVernon triaged T409040: Disk (sdl) failed in ms-be1074 as High priority.
Nov 3 2025, 9:50 AM · SRE-swift-storage, ops-eqiad, SRE, DC-Ops
MatthewVernon created T409040: Disk (sdl) failed in ms-be1074.
Nov 3 2025, 9:49 AM · SRE-swift-storage, ops-eqiad, SRE, DC-Ops
MatthewVernon triaged T409036: Disk (sdf) failed in thanos-be2008 as High priority.
Nov 3 2025, 9:31 AM · SRE-swift-storage, SRE, ops-codfw, DC-Ops
MatthewVernon created T409036: Disk (sdf) failed in thanos-be2008.
Nov 3 2025, 9:31 AM · SRE-swift-storage, SRE, ops-codfw, DC-Ops

Oct 31 2025

MatthewVernon added a comment to T408715: Compile a list of "canonical" thumbnail sizes.

A further complication - some wikis (I've found at least fr and de) add a lang{fr,de,...} prefix to the thumb size, e.g. https://upload.wikimedia.org/wikipedia/commons/thumb/6/66/Armoiries_de_Marseille.svg/250px-Armoiries_de_Marseille.svg.png (I think this is done to all svgs, presumably because some of them can be translated?); for multi-page documents the page number gets inserted into the thumb path too.

Oct 31 2025, 4:59 PM · MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon added a comment to T408715: Compile a list of "canonical" thumbnail sizes.

Updated in the light of review from Android and iOS folks - only change to our list of sizes is the addition of 80.

Oct 31 2025, 10:07 AM · MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon updated the task description for T408715: Compile a list of "canonical" thumbnail sizes.
Oct 31 2025, 10:06 AM · MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon added a comment to T400877: Install new disk controllers to SM swift backends (eqiad).

@VRiley-WMF looks good now, thanks!

Oct 31 2025, 9:43 AM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon closed T400877: Install new disk controllers to SM swift backends (eqiad) as Resolved.
Oct 31 2025, 9:43 AM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon closed T400877: Install new disk controllers to SM swift backends (eqiad), a subtask of T400878: Install new JBOD disk controllers into SM swift backends, as Resolved.
Oct 31 2025, 9:43 AM · SRE-swift-storage, DC-Ops, SRE

Oct 30 2025

MatthewVernon added a comment to T400877: Install new disk controllers to SM swift backends (eqiad).

@VRiley-WMF the host is up, but it can't reach any of its spinning disks (the OS sees none, and the BMC says 0 physical disks). Could you take another look, please?

Oct 30 2025, 1:58 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon created T408793: Grant Access to analytics-privatedata-users for mvernon.
Oct 30 2025, 11:48 AM · SRE-Access-Requests, Data-Engineering, SRE
MatthewVernon added a comment to T408776: megacli issues on Debian Trixie.

(as suggested in the other task, it might also/instead be worth trying to build megacli for trixie; I've just not had time to do so as yet)

Oct 30 2025, 9:19 AM · SRE, Infrastructure-Foundations

Oct 29 2025

MatthewVernon archived P84382 (An Untitled Masterwork).
Oct 29 2025, 9:55 PM
MatthewVernon created P84382 (An Untitled Masterwork).
Oct 29 2025, 6:38 PM
MatthewVernon updated the task description for T408715: Compile a list of "canonical" thumbnail sizes.
Oct 29 2025, 5:29 PM · MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon triaged T408715: Compile a list of "canonical" thumbnail sizes as High priority.
Oct 29 2025, 4:14 PM · MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon updated the task description for T408062: FY 25/26 WE 5.4.7 Standardize thumbnail sizes.
Oct 29 2025, 4:13 PM · MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic
MatthewVernon created T408715: Compile a list of "canonical" thumbnail sizes.
Oct 29 2025, 4:12 PM · MediaViewer, Data-Persistence, Thumbor, SRE-swift-storage, Traffic

Oct 27 2025

MatthewVernon added a comment to T405942: eqiad row C/D Data Persistence host migrations.

Yes, for each of those hosts you can depool by running depool from the host in question (and then pool afterwards). Thanks!

Oct 27 2025, 4:51 PM · media-backups, DBA, Data-Persistence, SRE, DC-Ops, ops-eqiad

Oct 24 2025

MatthewVernon added a comment to T404356: UEFI installer not installing grub correctly (at least on systems where / is RAID).

I've not been able to reproduce the boot failure (except by cheating), but the underlying issue remains - the installer is installing the EFI System Partition onto only one of the two OS disks, and doesn't touch the equivalent partition on the other one. So as long as drive ordering (as seen by the installer) is consistent, everything is good. We've learned in the past that this isn't something to rely upon.

Oct 24 2025, 4:45 PM · SRE-swift-storage, Infrastructure-Foundations

Oct 23 2025

MatthewVernon added a comment to T407513: Key packages missing from trixie-wikimedia.

[trixie does have the unofficial https://packages.debian.org/stable/admin/megactl but I don't know if a) that works b) we'd want to trust it ]

Oct 23 2025, 1:37 PM · Infrastructure-Foundations, SRE-swift-storage, SRE
MatthewVernon added a comment to T407513: Key packages missing from trixie-wikimedia.

megacli might have been copied to trixie, but it's useless there, because as you say it depends upon libncurses5, which isn't in trixie.

Oct 23 2025, 1:32 PM · Infrastructure-Foundations, SRE-swift-storage, SRE

Oct 22 2025

MatthewVernon added a comment to T406964: No disk boot option when moving ms-be2078 to UEFI.

(to answer the question - like all ms-* nodes, this will continue to be Debian 11 for now, although we might use it for a test install of Debian 13 before it's returned to service; it's partman/custom/ms-be_simple-efi.cfg or partman/custom/ms-be_simple.cfg as appropriate for UEFI/BIOS booting)

Oct 22 2025, 12:33 PM · User-Elukey, SRE, ops-codfw, DC-Ops, Infrastructure-Foundations
MatthewVernon added a comment to T400877: Install new disk controllers to SM swift backends (eqiad).

@VRiley-WMF the last two nodes ms-be1089 and ms-be1090 are ready for controller swap, please; I've downtimed them for a couple of days.

Oct 22 2025, 11:40 AM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon updated the task description for T400877: Install new disk controllers to SM swift backends (eqiad).
Oct 22 2025, 11:37 AM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops

Oct 17 2025

MatthewVernon closed T407589: File missing from four datacenters as Resolved.

Thanks; our weekly rclone job would have caught up with this on Monday, but it's nice to have it resolved sooner :)

Oct 17 2025, 3:12 PM · SRE-swift-storage

Oct 16 2025

MatthewVernon created T407513: Key packages missing from trixie-wikimedia.
Oct 16 2025, 4:09 PM · Infrastructure-Foundations, SRE-swift-storage, SRE
MatthewVernon updated the task description for T354872: Re-IP Swift hosts to per-rack subnets in codfw rows A-D.
Oct 16 2025, 9:24 AM · SRE-swift-storage, Infrastructure-Foundations, SRE
MatthewVernon added a comment to T406964: No disk boot option when moving ms-be2078 to UEFI.

@elukey FWIW, feel free to wipe these disks (the host isn't in the swift rings ATM).

Oct 16 2025, 8:07 AM · User-Elukey, SRE, ops-codfw, DC-Ops, Infrastructure-Foundations
MatthewVernon added a comment to T400876: Install new disk controllers to SM swift backends (codfw).

Looks good now, thanks :)

Oct 16 2025, 8:03 AM · ops-codfw, DC-Ops, SRE, SRE-swift-storage
MatthewVernon updated the task description for T400876: Install new disk controllers to SM swift backends (codfw).
Oct 16 2025, 8:03 AM · ops-codfw, DC-Ops, SRE, SRE-swift-storage

Oct 15 2025

MatthewVernon added a comment to T405942: eqiad row C/D Data Persistence host migrations.

So, the swift & Ceph nodes:

Oct 15 2025, 2:05 PM · media-backups, DBA, Data-Persistence, SRE, DC-Ops, ops-eqiad

Oct 14 2025

MatthewVernon added a comment to T400877: Install new disk controllers to SM swift backends (eqiad).

Hi @VRiley-WMF I'm afraid not (filesystems still about 25% full, so a little way to go yet).

Oct 14 2025, 3:16 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon added a comment to T400876: Install new disk controllers to SM swift backends (codfw).

Hi @Jhancock.wm ms-be2083 looks great, thank you.

Oct 14 2025, 1:50 PM · ops-codfw, DC-Ops, SRE, SRE-swift-storage
MatthewVernon updated the task description for T400876: Install new disk controllers to SM swift backends (codfw).
Oct 14 2025, 1:49 PM · ops-codfw, DC-Ops, SRE, SRE-swift-storage
MatthewVernon triaged T407198: ms-be1081 reports no disks - controller failure? as High priority.
Oct 14 2025, 10:16 AM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops
MatthewVernon created T407198: ms-be1081 reports no disks - controller failure?.
Oct 14 2025, 10:15 AM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops

Oct 8 2025

MatthewVernon added a comment to T400876: Install new disk controllers to SM swift backends (codfw).

@Jhancock.wm ms-be2083 and ms-be2084 are now ready to have their controllers swapped - can you do them, please? I've downtimed them for a couple of days.

Oct 8 2025, 12:14 PM · ops-codfw, DC-Ops, SRE, SRE-swift-storage
MatthewVernon updated the task description for T400876: Install new disk controllers to SM swift backends (codfw).
Oct 8 2025, 12:12 PM · ops-codfw, DC-Ops, SRE, SRE-swift-storage
MatthewVernon closed T404351: swift_disks fact needs to cope with change in /dev/disk/by-path in trixie as Resolved.

The new fact works; the failure is because the following key packages are not available in trixie: python3-conftool, megacli, prometheus-statsd-exporter. So we can close this out, and I'll open a ticket with infra foundations about the missing packages.

Oct 8 2025, 11:21 AM · SRE, SRE-swift-storage

Oct 7 2025

MatthewVernon closed T406246: [[commons:File:Things near the Nautical Museum of Litochoro 10.jpg]] only present in codfw as Resolved.

As expected, the Monday rclone copied this image across:

curl -o /dev/null -v --connect-to ::upload-lb.eqiad.wikimedia.org https://upload.wikimedia.org/wikipedia/commons/c/cf/Things_near_the_Nautical_Museum_of_Litochoro_10.jpg 2>&1 | grep "< HTTP"
< HTTP/2 200
Oct 7 2025, 10:16 AM · SRE-swift-storage, Commons

Oct 6 2025

MatthewVernon added a comment to T309027: Poweredge R730xd, R740xd, R740xd2 SSDs not visible to OS as SSDs.

@LSobanski I worked round it for ms-be nodes; it was later re-opened by @jbond to look at the more general issue (see his comment above); I don't know if anything got done about that...

Oct 6 2025, 2:52 PM · DC-Ops, Infrastructure-Foundations

Oct 3 2025

MatthewVernon added a comment to T406246: [[commons:File:Things near the Nautical Museum of Litochoro 10.jpg]] only present in codfw.

As expected from the report, the object is in codfw, but not eqiad:

root@ms-fe1009:~# swift stat wikipedia-commons-local-public.cf c/cf/Things_near_the_Nautical_Museum_of_Litochoro_10.jpg
Object HEAD failed: http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.cf/c/cf/Things_near_the_Nautical_Museum_of_Litochoro_10.jpg 404 Not Found
Failed Transaction ID: tx7ac2ef9f7c79486d84f80-0068dfb6b8
root@ms-fe2009:~# swift stat wikipedia-commons-local-public.cf c/cf/Things_near_the_Nautical_Museum_of_Litochoro_10.jpg
               Account: AUTH_mw
             Container: wikipedia-commons-local-public.cf
                Object: c/cf/Things_near_the_Nautical_Museum_of_Litochoro_10.jpg
          Content Type: image/jpeg
        Content Length: 7341349
         Last Modified: Tue, 30 Sep 2025 20:53:18 GMT
                  ETag: 6383419dbeec344b83cb353b472ab95f
       Meta Sha1Base36: 957witx4ns8tlo71g1o7h4qsrpng7v9
           X-Timestamp: 1759265597.26006
         Accept-Ranges: bytes
            X-Trans-Id: tx7c819eef97524e50a8617-0068dfb6b7
X-Openstack-Request-Id: tx7c819eef97524e50a8617-0068dfb6b7
Oct 3 2025, 11:50 AM · SRE-swift-storage, Commons
MatthewVernon added a parent task for T406309: requestctl clearing all fields on error is not user-friendly when combined with stricter checking: Unknown Object (Task).
Oct 3 2025, 9:37 AM · Hiddenparma, Traffic, serviceops, requestctl, Sustainability (Incident Followup)
MatthewVernon created T406309: requestctl clearing all fields on error is not user-friendly when combined with stricter checking.
Oct 3 2025, 9:36 AM · Hiddenparma, Traffic, serviceops, requestctl, Sustainability (Incident Followup)
MatthewVernon added a parent task for T406308: Link to view requestctl rule in superset no longer working: Unknown Object (Task).
Oct 3 2025, 9:31 AM · Sustainability (Incident Followup), requestctl
MatthewVernon created T406308: Link to view requestctl rule in superset no longer working.
Oct 3 2025, 9:30 AM · Sustainability (Incident Followup), requestctl