MatthewVernon (Matthew Vernon)
SRE (Data Persistence)

User Details

User Since
Aug 2 2021, 1:52 PM (245 w, 1 d)
Availability
Available
IRC Nick
Emperor
LDAP User
MVernon
MediaWiki User
MVernon (WMF) [ Global Accounts ]

Recent Activity

Yesterday

MatthewVernon added a comment to T423286: Initial puppet run makes ms-be2068 unbootable.

Same failure mode after the BIOS upgrade - post-installer boot is fine, after puppet it gets to:

Booting from Hard drive C:
GRUB

and there it remains forever.

Tue, Apr 14, 4:59 PM · SRE-swift-storage, Infrastructure-Foundations, SRE
MatthewVernon added a comment to T423286: Initial puppet run makes ms-be2068 unbootable.

Tried a BIOS upgrade from 2.12.2 to 2.24.0. That didn't cause the system to become bootable, but trying yet another reimage.

Tue, Apr 14, 3:54 PM · SRE-swift-storage, Infrastructure-Foundations, SRE
MatthewVernon added a comment to T416592: Requesting access to WMF Datalake & Superset SQL lab for Nicholusmuwonge_wmde.

@WMDE-leszek analytics_privatedata_users isn't an LDAP group, it's a shell group, so it wouldn't appear in the ldap listing (for instance - https://ldap.toolforge.org/user/mvernon is me, and you'll see it's not listed there against me either). I think you probably want data engineering for help debugging superset dashboard access.

Tue, Apr 14, 3:23 PM · SRE, SRE-Access-Requests
MatthewVernon added a comment to T423286: Initial puppet run makes ms-be2068 unbootable.

As before, post-installer boot was fine, but after puppet it gets as far as:

Booting from Hard drive C:
GRUB

Tue, Apr 14, 3:15 PM · SRE-swift-storage, Infrastructure-Foundations, SRE
MatthewVernon updated subscribers of T423286: Initial puppet run makes ms-be2068 unbootable.

At the suggestion of @elukey on IRC, I am trying a firmware downgrade to 6.10.30.20 (the version 5.0.20.0 that this system started with wasn't obviously available).

Tue, Apr 14, 2:29 PM · SRE-swift-storage, Infrastructure-Foundations, SRE
MatthewVernon created T423286: Initial puppet run makes ms-be2068 unbootable.
Tue, Apr 14, 2:18 PM · SRE-swift-storage, Infrastructure-Foundations, SRE
MatthewVernon added a comment to T409165: [SPIKE] What would it take to enable Recent Changes patrolling on all Wikipedias?.

My volunteer account has acquired the PersonalDashboard, and my volunteer-hat would appreciate a way to note that I've looked at an edit and it was OK (and was a bit surprised there wasn't a UI element to do so).

Tue, Apr 14, 10:40 AM · MediaWiki-Patrolling, Moderator-Tools-Team, PersonalDashboard

Mon, Apr 13

MatthewVernon reopened T418901: Q3:rack/setup/install apus-be100[56] as "Open".

Hi @Jclark-ctr could you take another look at the disks on these two systems, please? There should be 24 JBOD spinning disks visible to the OS, but neither host has that:
apus-be1005 has 23 (i.e. one missing)

mvernon@apus-be1005:~$ grep -c ' sd' /proc/partitions 
23

Mon, Apr 13, 10:16 AM · SRE-swift-storage, SRE, ops-eqiad, Data-Persistence, DC-Ops
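
The disk-count check in the comment above can be turned into a small reusable sketch. Note one deliberate difference: the grep in the comment counts every line containing " sd", whereas this version counts only whole-disk entries (names such as sda, not sda1). The function names and the optional file argument are hypothetical, added only for illustration.

```shell
# Count whole-disk sd* entries in /proc/partitions-style input.
# The device name is the fourth column; partitions (sda1, sdb2, ...) are skipped.
count_sd_disks() {
    awk '$4 ~ /^sd[a-z]+$/ { n++ } END { print n + 0 }' "${1:-/proc/partitions}"
}

# Hypothetical helper: compare the count against an expected number of disks.
check_disk_count() {
    expected=$1
    file=${2:-/proc/partitions}
    found=$(count_sd_disks "$file")
    if [ "$found" -eq "$expected" ]; then
        echo "ok: $found disks"
    else
        echo "mismatch: found $found, expected $expected"
        return 1
    fi
}
```

On a host like the ones discussed, `check_disk_count 24` would be the natural call, bearing in mind that any sd-named OS disks would be counted as well.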

Fri, Apr 10

MatthewVernon added a comment to T418902: Q3:rack/setup/install apus-be200[56].

Thanks @Jhancock.wm :)

Fri, Apr 10, 8:38 AM · SRE-swift-storage, SRE, Data-Persistence, ops-codfw, DC-Ops
MatthewVernon added projects to T422166: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw: SRE-swift-storage, Ceph.

I've eyeballed the discussion here - AFAICT apus is behaving as expected? I haven't seen persistent lag between the two clusters, but during bursts of activity replication between the DCs is asynchronous (by design). The problem is the registry (due to caching connections) writing to both clusters at the same time but assuming it's only writing to one, and thus being thrown by asynchronous replication between the clusters.

Fri, Apr 10, 7:50 AM · Ceph, SRE-swift-storage, Patch-For-Review, ServiceOps new, Datacenter-Switchover, SRE

Thu, Apr 9

MatthewVernon updated the task description for T421719: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets.
Thu, Apr 9, 4:12 PM · DBA, Ceph, SRE-swift-storage, User-Eevans, Data-Persistence
MatthewVernon updated the task description for T421719: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets.
Thu, Apr 9, 2:32 PM · DBA, Ceph, SRE-swift-storage, User-Eevans, Data-Persistence
MatthewVernon updated the task description for T421719: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets.
Thu, Apr 9, 1:30 PM · DBA, Ceph, SRE-swift-storage, User-Eevans, Data-Persistence
MatthewVernon updated the task description for T421719: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets.
Thu, Apr 9, 12:33 PM · DBA, Ceph, SRE-swift-storage, User-Eevans, Data-Persistence
MatthewVernon updated the task description for T421719: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets.
Thu, Apr 9, 10:57 AM · DBA, Ceph, SRE-swift-storage, User-Eevans, Data-Persistence
MatthewVernon updated the task description for T421719: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets.
Thu, Apr 9, 9:25 AM · DBA, Ceph, SRE-swift-storage, User-Eevans, Data-Persistence
MatthewVernon added a comment to T413507: Commons file not found.

@Jeff_G please open new tickets when reporting new issues, unless you're really 100% sure you've got a recurrence of exactly the same issue again - it's really easy to merge tickets that turn out to be duplicates, but very hard to un-merge when you've got two different issues on the same ticket (as has happened here).

Thu, Apr 9, 9:12 AM · SRE-swift-storage, Commons

Wed, Apr 8

MatthewVernon added a comment to T421719: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets.

A wrinkle here is that ferm doesn't get reloaded on the other swift nodes (presumably because the config for ferm hasn't actually changed, because the hostname of the node is unchanged), so you have to do that by cumin-hand before the reimaged node works again.

Wed, Apr 8, 11:11 AM · DBA, Ceph, SRE-swift-storage, User-Eevans, Data-Persistence
MatthewVernon updated the task description for T421719: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets.
Wed, Apr 8, 11:10 AM · DBA, Ceph, SRE-swift-storage, User-Eevans, Data-Persistence

Tue, Apr 7

MatthewVernon added a comment to T418902: Q3:rack/setup/install apus-be200[56].

@Jhancock.wm that should be fine, thanks!

Tue, Apr 7, 10:12 AM · SRE-swift-storage, SRE, Data-Persistence, ops-codfw, DC-Ops

Thu, Apr 2

MatthewVernon added a comment to T421986: Disk (sdw) failed in ms-be1069.

Thanks for the quick fixes @Jclark-ctr :-)

Thu, Apr 2, 12:49 PM · ops-eqiad, SRE-swift-storage, SRE, DC-Ops

Wed, Apr 1

MatthewVernon triaged T422011: Disk (sdt) failed in ms-be1065 as High priority.
Wed, Apr 1, 12:07 PM · ops-eqiad, SRE-swift-storage, SRE, DC-Ops
MatthewVernon created T422011: Disk (sdt) failed in ms-be1065.
Wed, Apr 1, 12:06 PM · ops-eqiad, SRE-swift-storage, SRE, DC-Ops
MatthewVernon triaged T421986: Disk (sdw) failed in ms-be1069 as High priority.
Wed, Apr 1, 8:36 AM · ops-eqiad, SRE-swift-storage, SRE, DC-Ops
MatthewVernon created T421986: Disk (sdw) failed in ms-be1069.
Wed, Apr 1, 8:36 AM · ops-eqiad, SRE-swift-storage, SRE, DC-Ops
MatthewVernon updated the task description for T421719: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets.
Wed, Apr 1, 8:05 AM · DBA, Ceph, SRE-swift-storage, User-Eevans, Data-Persistence
MatthewVernon added projects to T421719: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets: SRE-swift-storage, Ceph.
Wed, Apr 1, 8:01 AM · DBA, Ceph, SRE-swift-storage, User-Eevans, Data-Persistence

Mon, Mar 30

MatthewVernon updated the task description for T354872: Re-IP Swift hosts to per-rack subnets in codfw rows A-D.
Mon, Mar 30, 10:14 AM · SRE-swift-storage, Infrastructure-Foundations, SRE

Wed, Mar 25

MatthewVernon created T421226: Kartotherian dashboard links don't work.
Wed, Mar 25, 11:56 AM · SRE, Maps, Sustainability (Incident Followup)
MatthewVernon archived P89929 (An Untitled Masterwork).
Wed, Mar 25, 11:43 AM
MatthewVernon created P89929 (An Untitled Masterwork).
Wed, Mar 25, 11:31 AM
MatthewVernon changed the visibility for T421203: Bad ATS config led to large volume of 5xx from RESTBase.
Wed, Mar 25, 8:40 AM · Incident Severity 3, Traffic, Wikimedia-Incident
MatthewVernon renamed T421203: Bad ATS config led to large volume of 5xx from RESTBase from restbase outage to Bad ATS config led to large volume of 5xx from RESTBase.
Wed, Mar 25, 8:39 AM · Incident Severity 3, Traffic, Wikimedia-Incident
MatthewVernon created T421208: Dead links at https://wikitech.wikimedia.org/wiki/RESTBase#Analytics_and_metrics.
Wed, Mar 25, 8:32 AM · ServiceOps-Services-Oids, ServiceOps new, Documentation, Sustainability (Incident Followup), RESTBase
MatthewVernon created T421207: Dead links on https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Process_count.
Wed, Mar 25, 8:28 AM · Documentation, Sustainability (Incident Followup), SRE, Traffic

Tue, Mar 24

MatthewVernon added a comment to T419577: ms swift capacity for FY 26/27.

A quick back-of-the-envelope is about 73TB for commons transcoded buckets.

Tue, Mar 24, 3:38 PM · SRE-swift-storage, SRE
MatthewVernon added projects to T420978: Move the Docker Registry's /ml prefix to S3/apus: SRE-swift-storage, Ceph.
Tue, Mar 24, 8:21 AM · Ceph, SRE-swift-storage, Infrastructure-Foundations, Machine-Learning-Team

Mon, Mar 23

MatthewVernon closed T340917: Server error 500 after uploading chunk as Resolved.

Thanks, I'm going to optimistically close this ticket then :)

Mon, Mar 23, 8:37 AM · SRE-swift-storage, Commons
MatthewVernon added a comment to T420786: uploadstash-exception: Could not store upload in the stash while uploading PDF file.

I'm guessing you don't have an exact timestamp for the error? I'm afraid it's going to be almost impossible to say anything useful about this, because there's likely nothing in the logs that will be findable (since it's a stash issue, I can't even search for the object path in the logs). Sorry.

Mon, Mar 23, 8:34 AM · SRE-swift-storage, Commons, Wikimedia-production-error

Fri, Mar 20

MatthewVernon added a comment to T419713: thanos swift capacity for FY 26/27.

Right, then the existing thanos-swift infrastructure has nowhere near the SSD capacity to support that use case.

Fri, Mar 20, 4:22 PM · SRE-swift-storage, SRE, Observability-Metrics
MatthewVernon added a comment to T419713: thanos swift capacity for FY 26/27.

What sort of storage volume are we talking about here?
The thanos-swift cluster has some lowlatency storage, which is largely unused; each server has 2x200G available for the "lowlatency" storage policy, which equates to about 1TB of usable capacity (given x3 replication). Currently only the chartmuseum account is using any of that capacity.

Fri, Mar 20, 11:54 AM · SRE-swift-storage, SRE, Observability-Metrics
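
The capacity estimate in the comment above follows from simple arithmetic: raw capacity divided by the replication factor. The sketch below shows the calculation. The server count is not stated in the comment; 8 is purely an assumption, chosen because it lands close to the quoted figure of about 1TB. The function name is hypothetical.

```shell
# usable_gb SERVERS DEVICES_PER_SERVER DEVICE_GB REPLICAS
# Usable capacity in GB = raw capacity / replication factor (integer division).
usable_gb() {
    echo $(( $1 * $2 * $3 / $4 ))
}

# Assumed: 8 servers. From the comment: 2 devices of 200G each, x3 replication.
usable_gb 8 2 200 3   # prints 1066
```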

Thu, Mar 19

MatthewVernon updated subscribers of T419713: thanos swift capacity for FY 26/27.

@hnowlan can I push this up your stack, please? Willy wants all procurement requests for next FY done by end of next week (i.e. 27 March).

Thu, Mar 19, 8:41 AM · SRE-swift-storage, SRE, Observability-Metrics

Wed, Mar 18

MatthewVernon added a comment to T419817: Disk (sdm) failed in thanos-be2008.

I remain grateful that we have spare disks available, so thanks again :)

Wed, Mar 18, 3:34 PM · ops-codfw, DC-Ops, SRE, SRE-swift-storage
MatthewVernon added a comment to T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only.

@Ladsgroup they're only a tiny number of files, but XCF will probably likewise need addressing?

From a quick look, it extends TransformationalImageHandler indirectly and it's not overriding the methods so it should be fine with no issues. Do you have an example of an issue?

Wed, Mar 18, 11:44 AM · MW-1.46-notes (1.46.0-wmf.22; 2026-03-31), Patch-For-Review, MW-1.45-notes, MW-1.43-notes, MW-1.44-notes, Data-Persistence, MediaViewer, Traffic, Thumbor, SRE-swift-storage

Mar 13 2026

MatthewVernon added a comment to T419958: db1258 connection went down at 10:43Z.

I got into the host via the serial console. Some notes:

Mar 13 2026, 10:56 AM · SRE, DC-Ops, ops-eqiad, DBA, Sustainability (Incident Followup), Data-Persistence
MatthewVernon updated subscribers of T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only.

@Reedy you did the 1.43 backports (at least according to gerrit), can you have a look at this, please? I can open a new subtask for tracking if that's easier.

Mar 13 2026, 10:12 AM · MW-1.46-notes (1.46.0-wmf.22; 2026-03-31), Patch-For-Review, MW-1.45-notes, MW-1.43-notes, MW-1.44-notes, Data-Persistence, MediaViewer, Traffic, Thumbor, SRE-swift-storage

Mar 12 2026

MatthewVernon added a comment to T419817: Disk (sdm) failed in thanos-be2008.

Thanks! New disk is configured and backfilling fine.

Mar 12 2026, 4:13 PM · ops-codfw, DC-Ops, SRE, SRE-swift-storage
MatthewVernon triaged T419817: Disk (sdm) failed in thanos-be2008 as High priority.
Mar 12 2026, 9:05 AM · ops-codfw, DC-Ops, SRE, SRE-swift-storage
MatthewVernon created T419817: Disk (sdm) failed in thanos-be2008.
Mar 12 2026, 9:05 AM · ops-codfw, DC-Ops, SRE, SRE-swift-storage

Mar 11 2026

MatthewVernon added a comment to T413127: Directory Listing and Download from Object Storage.

Hi, sorry this got dropped - do feel free to poke.

Mar 11 2026, 4:50 PM · Patch-For-Review, MediaWiki-Platform-Team (Radar), Data-Persistence, Arc-Lamp
MatthewVernon added a comment to T416243: Q3:rack/setup/install ms-fe202[1-4].

Thanks :)

Mar 11 2026, 2:59 PM · SRE-swift-storage, SRE, DC-Ops, ops-codfw
MatthewVernon closed T415189: DHCP failing for at least 2 ms-be servers in codfw, a subtask of T354872: Re-IP Swift hosts to per-rack subnets in codfw rows A-D, as Resolved.
Mar 11 2026, 2:50 PM · SRE-swift-storage, Infrastructure-Foundations, SRE
MatthewVernon closed T415189: DHCP failing for at least 2 ms-be servers in codfw as Resolved.

@ayounsi I re-imaged with the --move-vlan argument 3 codfw nodes today, and everything went well, so I think this is done now, thanks!

Mar 11 2026, 2:50 PM · SRE, SRE-swift-storage, ops-codfw, DC-Ops, Infrastructure-Foundations
MatthewVernon created T419713: thanos swift capacity for FY 26/27.
Mar 11 2026, 2:48 PM · SRE-swift-storage, SRE, Observability-Metrics
MatthewVernon updated the task description for T354872: Re-IP Swift hosts to per-rack subnets in codfw rows A-D.
Mar 11 2026, 2:27 PM · SRE-swift-storage, Infrastructure-Foundations, SRE
MatthewVernon added a comment to T413088: FY2526 Q3:rack/setup/install ms-be209[56].

Imaging of both systems was OK once the relevant disk got wiped.

Mar 11 2026, 10:01 AM · SRE, SRE-swift-storage, ops-codfw, DC-Ops
MatthewVernon added a comment to T413088: FY2526 Q3:rack/setup/install ms-be209[56].

Hi @Jhancock.wm I'm afraid this is the problem we've seen with Dell before (but that I hoped they were going to correct), where they send us systems with a Windows EFI partition on one of the spinning disks.
Puppet says (amongst other things):

Notice: /Stage[main]/Profile::Swift::Storage::Configure_disks/Exec[mkfs-pci-0000:50:00.0-scsi-0:2:17:0]/returns: mkfs.xfs: /dev/disk/by-path/pci-0000:50:00.0-scsi-0:2:17:0-part1 appears to contain an existing filesystem (vfat).
Notice: /Stage[main]/Profile::Swift::Storage::Configure_disks/Exec[mkfs-pci-0000:50:00.0-scsi-0:2:17:0]/returns: mkfs.xfs: Use the -f option to force overwrite.
Error: '/usr/sbin/mkfs -t xfs -m crc=1 -m finobt=0 -i size=512 /dev/disk/by-path/pci-0000:50:00.0-scsi-0:2:17:0-part1' returned 1 instead of one of [0]
Error: /Stage[main]/Profile::Swift::Storage::Configure_disks/Exec[mkfs-pci-0000:50:00.0-scsi-0:2:17:0]/returns: change from 'notrun' to ['0'] failed: '/usr/sbin/mkfs -t xfs -m crc=1 -m finobt=0 -i size=512 /dev/disk/by-path/pci-0000:50:00.0-scsi-0:2:17:0-part1' returned 1 instead of one of [0] (corrective)

If I mount that partition and have a look:

mvernon@ms-be2095:~$ sudo mount /dev/disk/by-path/pci-0000\:50\:00.0-scsi-0\:2\:17\:0-part1 /mnt/
mvernon@ms-be2095:~$ ls /mnt/
EFI  EFI.BAK
mvernon@ms-be2095:~$ ls /mnt/EFI
Boot  Microsoft  PEBoot

Which is the pattern we've seen before. I've wiped the offending partition and disk (in this case with sudo wipefs -a /dev/sdr1 && sudo wipefs -a /dev/sdr; it was sdd on ms-be2096), and will now reimage.

Mar 11 2026, 9:08 AM · SRE, SRE-swift-storage, ops-codfw, DC-Ops
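
As a rough sketch of how the affected devices could be picked out of a Puppet run like the one quoted above, the filter below extracts the device path and filesystem type from the "appears to contain an existing filesystem" notices. The function name is hypothetical, and the pattern assumes the message wording shown in the comment.

```shell
# Read Puppet output on stdin and print "<device> <fstype>" for every
# mkfs.xfs notice about a pre-existing filesystem.
stale_filesystems() {
    sed -n 's|.*mkfs\.xfs: \(/dev/[^ ]*\) appears to contain an existing filesystem (\([^)]*\)).*|\1 \2|p'
}
```

It could be fed from a saved log, for example `stale_filesystems < puppet-run.log` (the file name is an assumption). Each reported device should still be inspected by hand before wiping, as the comment describes.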
MatthewVernon added a comment to T419647: Eqiad: lsw1-d2-eqiad BGP maintenance.

Ah, I just put 10:00 EST into date. You're probably right, but a confirmation would be helpful :)

Mar 11 2026, 8:52 AM · netops, Infrastructure-Foundations, SRE
MatthewVernon added a comment to T419647: Eqiad: lsw1-d2-eqiad BGP maintenance.

Can I check this is 15:00 UTC (particularly given daylight confusion...), please? Once it's done I'll check ms-be1091 [the frontends can just be repooled again afterwards]

Mar 11 2026, 8:22 AM · netops, Infrastructure-Foundations, SRE
MatthewVernon updated the task description for T419647: Eqiad: lsw1-d2-eqiad BGP maintenance.
Mar 11 2026, 8:21 AM · netops, Infrastructure-Foundations, SRE

Mar 10 2026

MatthewVernon updated the task description for T419577: ms swift capacity for FY 26/27.
Mar 10 2026, 4:59 PM · SRE-swift-storage, SRE
MatthewVernon created T419577: ms swift capacity for FY 26/27.
Mar 10 2026, 4:59 PM · SRE-swift-storage, SRE
MatthewVernon added a comment to T413080: Design and build the next generation of container-registry service for the WMF production realm.

I'm currently trying to get some quotes to put an expansion request in for next FY for apus, primarily to enable us to have a small solid-state-only storage pool to use for bucket indexes and similar, which should give us better performance and reliability for buckets holding larger numbers of objects.

Mar 10 2026, 3:07 PM · ServiceOps new, Epic, Ceph, Kubernetes, Infrastructure-Foundations, Data-Platform-SRE, Machine-Learning-Team

Mar 9 2026

MatthewVernon added a comment to T419394: Disk (sdg) failed on ms-be2064.

Brilliant, thanks! Replacement is back in service and refilling now.

Mar 9 2026, 2:54 PM · SRE-swift-storage, SRE, ops-codfw, DC-Ops
MatthewVernon added a comment to T416972: Move Gerrit data to CephFS.

I see from T414407 you're thinking about moving gerrit to k8s. I do wonder if this is the sort of thing that k8s persistent volume claims are intended for?

Mar 9 2026, 11:58 AM · collaboration-services, Gerrit
MatthewVernon added a comment to T416972: Move Gerrit data to CephFS.

The Apus cluster does not currently support CephFS, I'm afraid. It wouldn't be straightforward to add support either: Apus does multi-site replication at the RGW/S3 level, and the underlying Ceph clusters (1 in eqiad, 1 in codfw) don't talk to each other except via https/S3 communication between the RGWs. So even if we added MDS (metadata servers, the things you need to run CephFS on top of Ceph), you'd have two separate filesystems, one per DC. There is snapshot mirroring, but I don't think it's what you'd want here. Data Platform Engineering's cluster has CephFS (per wikitech), but I don't know if they do any sort of cross-site stuff with it.

Mar 9 2026, 11:45 AM · collaboration-services, Gerrit
MatthewVernon triaged T419394: Disk (sdg) failed on ms-be2064 as High priority.
Mar 9 2026, 10:22 AM · SRE-swift-storage, SRE, ops-codfw, DC-Ops
MatthewVernon created T419394: Disk (sdg) failed on ms-be2064.
Mar 9 2026, 10:22 AM · SRE-swift-storage, SRE, ops-codfw, DC-Ops

Mar 6 2026

MatthewVernon added a comment to T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only.

@Ladsgroup they're only a tiny number of files, but XCF will probably likewise need addressing?

Mar 6 2026, 2:41 PM · MW-1.46-notes (1.46.0-wmf.22; 2026-03-31), Patch-For-Review, MW-1.45-notes, MW-1.43-notes, MW-1.44-notes, Data-Persistence, MediaViewer, Traffic, Thumbor, SRE-swift-storage
MatthewVernon added a comment to T401966: PXE provision script needed for data-persistence hosts.

FWIW, I have no objection to your doing so.

Mar 6 2026, 8:54 AM · SRE-swift-storage, Data-Persistence, SRE, ops-eqiad, DC-Ops

Mar 4 2026

MatthewVernon closed T416721: Requesting access to "Community Wishlist" dashboard for hmonroy on Superset as Resolved.
Mar 4 2026, 5:17 PM · Data-Platform-SRE (2026-02-13 - 2026-03-06)
MatthewVernon added projects to T419010: ms-fe1013 reports a backplane error: SRE-swift-storage, SRE.
Mar 4 2026, 1:52 PM · SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops
MatthewVernon added a comment to T417594: Requesting update of Raymond Ndibe's SSH key to Yubikey-backed key.

Hi @Raymond_Ndibe - I've removed your old key now (so it'll be removed from production systems in the next 20 minutes or so).

Mar 4 2026, 11:59 AM · SRE, SRE-Access-Requests
MatthewVernon placed T418901: Q3:rack/setup/install apus-be100[56] up for grabs.

{{done}}

Mar 4 2026, 10:46 AM · SRE-swift-storage, SRE, ops-eqiad, Data-Persistence, DC-Ops
MatthewVernon placed T418902: Q3:rack/setup/install apus-be200[56] up for grabs.

{{done}}

Mar 4 2026, 10:45 AM · SRE-swift-storage, SRE, Data-Persistence, ops-codfw, DC-Ops
MatthewVernon added a comment to T418772: Eqiad: lsw1-d7-eqiad BGP maintenance.

Is this maintenance happening at 15:00 UTC today?

Mar 4 2026, 9:36 AM · Prod-Kubernetes, ServiceOps new, netops, Infrastructure-Foundations, SRE
MatthewVernon updated the task description for T418772: Eqiad: lsw1-d7-eqiad BGP maintenance.
Mar 4 2026, 9:34 AM · Prod-Kubernetes, ServiceOps new, netops, Infrastructure-Foundations, SRE
MatthewVernon closed T413089: FY2526 Q3:rack/setup/install ms-be109[67] as Resolved.

Yes, they look good now, thank you!

Mar 4 2026, 9:19 AM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops

Mar 3 2026

MatthewVernon added a comment to T413089: FY2526 Q3:rack/setup/install ms-be109[67].

Looking at 1095, the drives appear in the web-iDRAC as "NonRAID Disk 0" and the Storage Overview says 26 "Non-RAID Disks".

Mar 3 2026, 2:46 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon reopened T413089: FY2526 Q3:rack/setup/install ms-be109[67] as "Open".

@Jclark-ctr sorry, I was wrong, the disks are now set up incorrectly - it looks like you've set them up as a set of RAID-0 arrays, but these systems are meant to be JBOD - so no virtual disks at all, all non-RAID. Can you re-do both of these systems thus, please? We've moved to JBOD-only for swift (and Ceph) backends entirely.

Mar 3 2026, 2:30 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon closed T413089: FY2526 Q3:rack/setup/install ms-be109[67] as Resolved.

@Jclark-ctr yep, both look good now, thanks!

Mar 3 2026, 2:02 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon added a comment to T418745: MediaViewer (and the commons file page) should serve WebP originals not thumbnails of equivalent size.

Perhaps instead for the odd non-web format (which seems to include XCF and TIFF currently, begging the WebP question for a moment) we should only generate standard-size thumbnails, and the commons file interface should drop the "original-size-converted-to-PNG" option entirely; mediaviewer would need adjusting to request the largest-standard-size-smaller-than-original too. These aren't formats intended for images-for-display on commons (per https://commons.wikimedia.org/wiki/Commons:File_types#Images).

Mar 3 2026, 9:54 AM · MW-1.43-notes, MW-1.44-notes, MW-1.45-notes, MW-1.46-notes (1.46.0-wmf.19; 2026-03-10), Data-Persistence, MediaViewer, Traffic, Thumbor, SRE-swift-storage
MatthewVernon reopened T413089: FY2526 Q3:rack/setup/install ms-be109[67] as "Open".

@Jclark-ctr can you take another look at these, please? In neither system can the OS see any of the spinning disks, which should be available as JBOD devices - at a guess that still needs setting up in the storage controller.

Mar 3 2026, 9:39 AM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops

Mar 2 2026

MatthewVernon created T418745: MediaViewer (and the commons file page) should serve WebP originals not thumbnails of equivalent size.
Mar 2 2026, 2:50 PM · MW-1.43-notes, MW-1.44-notes, MW-1.45-notes, MW-1.46-notes (1.46.0-wmf.19; 2026-03-10), Data-Persistence, MediaViewer, Traffic, Thumbor, SRE-swift-storage
MatthewVernon added a comment to T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only.

So that's a "thumbnail the same size as original, rather than original" issue (the original image is 1074px wide) - you should be being shown the original image, but are instead getting a non-standard size thumbnail. I had thought mediaviewer had fixed this.

Mar 2 2026, 10:09 AM · MW-1.46-notes (1.46.0-wmf.22; 2026-03-31), Patch-For-Review, MW-1.45-notes, MW-1.43-notes, MW-1.44-notes, Data-Persistence, MediaViewer, Traffic, Thumbor, SRE-swift-storage

Feb 27 2026

MatthewVernon added a comment to T418178: imageinfo API requests for DJVU files don't follow thumbnail steps, allows upscaling.

@Atieno Whilst SRE is driving WE 5.4.10, we do need support from other teams in P&T as appropriate to get this work done - Is the MW interfaces team not best placed to address this issue, please?

Feb 27 2026, 11:41 AM · MW-1.46-notes (1.46.0-wmf.20; 2026-03-17), MW-Interfaces-Team, MediaWiki-Action-API, MediaWiki-File-management, MediaWiki-DjVu

Feb 26 2026

MatthewVernon created T418515: decommission moss-fe100[1-2].eqiad.wmnet.
Feb 26 2026, 4:54 PM · DC-Ops, Ceph, SRE-swift-storage, SRE, ops-eqiad, decommission-hardware

Feb 25 2026

MatthewVernon added a comment to T417655: Requesting access to analytics-private-users for maxbinderWMF.

I think he is not - the former is now self-service via IDM.

Feb 25 2026, 4:22 PM · Data-Engineering, SRE, SRE-Access-Requests
MatthewVernon added a project to T417655: Requesting access to analytics-private-users for maxbinderWMF: Data-Engineering.

OK, I've tagged Data-Engineering, since I think this is their ballpark now. Hopefully they can help :)

Feb 25 2026, 2:56 PM · Data-Engineering, SRE, SRE-Access-Requests
MatthewVernon added a comment to T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only.

This change to standardised size has also broken the "Preview Pagelist" functionality for editing Index: pagelists at English Wikisource

Feb 25 2026, 8:35 AM · MW-1.46-notes (1.46.0-wmf.22; 2026-03-31), Patch-For-Review, MW-1.45-notes, MW-1.43-notes, MW-1.44-notes, Data-Persistence, MediaViewer, Traffic, Thumbor, SRE-swift-storage

Feb 24 2026

MatthewVernon added a comment to T394476: Onboard the Docker Registry to apus.

At least so far, no issues with sync getting far behind either.

Feb 24 2026, 5:10 PM · Patch-For-Review, ServiceOps new, SRE, Ceph, SRE-swift-storage, Data-Persistence
MatthewVernon added a comment to T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only.

I spent quite a bit of time with codesearch last quarter trying to track down thumbnail size (ab)use, but we can't possibly hope to find (or fix) every single externally-written bit of software.

Feb 24 2026, 4:36 PM · MW-1.46-notes (1.46.0-wmf.22; 2026-03-31), Patch-For-Review, MW-1.45-notes, MW-1.43-notes, MW-1.44-notes, Data-Persistence, MediaViewer, Traffic, Thumbor, SRE-swift-storage
MatthewVernon created T418225: decommission moss-fe200[1-2].codfw.wmnet.
Feb 24 2026, 11:51 AM · DC-Ops, SRE, Ceph, SRE-swift-storage, ops-codfw, decommission-hardware

Feb 23 2026

MatthewVernon closed T417396: Upgrade apus' ceph to 18.2.7 (or .8 if already available), a subtask of T394476: Onboard the Docker Registry to apus, as Resolved.
Feb 23 2026, 1:44 PM · Patch-For-Review, ServiceOps new, SRE, Ceph, SRE-swift-storage, Data-Persistence
MatthewVernon closed T417396: Upgrade apus' ceph to 18.2.7 (or .8 if already available) as Resolved.

codfw cluster done, too.

Feb 23 2026, 1:44 PM · Ceph, SRE-swift-storage, Data-Persistence
MatthewVernon added a comment to T417396: Upgrade apus' ceph to 18.2.7 (or .8 if already available).

eqiad cluster done.

Feb 23 2026, 11:56 AM · Ceph, SRE-swift-storage, Data-Persistence
MatthewVernon closed T416387: Q3:rack/setup/install apus-fe200[4-5] as Resolved.

Yep, setting preseed to expect UEFI booting fixed things.

Feb 23 2026, 10:42 AM · Ceph, SRE-swift-storage, SRE, DC-Ops, ops-codfw
MatthewVernon updated the task description for T416387: Q3:rack/setup/install apus-fe200[4-5].
Feb 23 2026, 10:42 AM · Ceph, SRE-swift-storage, SRE, DC-Ops, ops-codfw
MatthewVernon added a comment to T417655: Requesting access to analytics-private-users for maxbinderWMF.

Looking at the access groups documentation, analytics-privatedata-users should be sufficient for dashboards with private data.

Feb 23 2026, 10:38 AM · Data-Engineering, SRE, SRE-Access-Requests

Feb 21 2026

MatthewVernon added a comment to T416243: Q3:rack/setup/install ms-fe202[1-4].

@Jhancock.wm these nodes are swift frontends in the ms cluster, so should be ms-fe* not moss-fe* (moss* is a legacy name that should never apply to new nodes).

Feb 21 2026, 10:28 AM · SRE-swift-storage, SRE, DC-Ops, ops-codfw