
MatthewVernon (Matthew Vernon)
SRE (Data Persistence)

User Details

User Since
Aug 2 2021, 1:52 PM (142 w, 2 d)
Availability
Available
IRC Nick
Emperor
LDAP User
MVernon
MediaWiki User
MVernon (WMF)

Recent Activity

Fri, Apr 19

MatthewVernon awarded T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes a Like token.
Fri, Apr 19, 4:54 PM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management

Wed, Apr 17

MatthewVernon added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

+1 to thanks to Papaul for getting to the bottom of this!

Wed, Apr 17, 3:26 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad

Mon, Apr 15

MatthewVernon added a comment to T357333: SystemdUnitFailed alerts are too noisy for data-persistence.

I can confirm that we're definitely getting alerts by email and IRC every 4 hours again now :(
(for wmf_auto_restart_prometheus-mysqld-exporter@x1.service on db2101:9100 - should be evident in the channel logs for #wikimedia-data-persistence)

Mon, Apr 15, 7:29 AM · Data-Persistence, Observability-Alerting

Tue, Apr 9

MatthewVernon closed T361844: Swift TLS certificates will expire soon (14 April) as Resolved.

I've added a new section to Swift/How_To that documents this process (and links to the Cergen docs), and also arranged for https://wikitech.wikimedia.org/wiki/TLS/Runbook#swift-https:443 to point to something useful (which is the link that appears in the alert).

Tue, Apr 9, 10:44 AM · Patch-For-Review, SRE-swift-storage
MatthewVernon added a comment to T361844: Swift TLS certificates will expire soon (14 April).

eqiad done, Not After : Apr 8 10:04:14 2029 GMT

Tue, Apr 9, 10:30 AM · Patch-For-Review, SRE-swift-storage
MatthewVernon added a comment to T361844: Swift TLS certificates will expire soon (14 April).

codfw done OK, cert now says Not After : Apr 8 08:00:23 2029 GMT.

Tue, Apr 9, 8:47 AM · Patch-For-Review, SRE-swift-storage

Mon, Apr 8

MatthewVernon added a comment to T357719: Consider separating Gitlab code management and deb building management.

I've also updated the CI config for pybal's bullseye-wikimedia branch; when we merged Joe's changes to wmf-debci (to support a build-off-main-branch workflow) the old job name got changed.

Mon, Apr 8, 10:49 AM · Traffic, collaboration-services
MatthewVernon added a comment to T357719: Consider separating Gitlab code management and deb building management.

What's wrong with sre/dnsdist? The pipelines page shows a successful build on 13 Jan, and then no activity since...

Mon, Apr 8, 10:42 AM · Traffic, collaboration-services

Thu, Apr 4

MatthewVernon added a comment to T351927: Decide and tweak Thanos retention.

@fgiunchedi I see (when looking for something else) that thanos is up to "Warning" for disk usage...

Thu, Apr 4, 3:44 PM · Patch-For-Review, User-fgiunchedi, Observability-Metrics
MatthewVernon triaged T361844: Swift TLS certificates will expire soon (14 April) as High priority.
Thu, Apr 4, 2:34 PM · Patch-For-Review, SRE-swift-storage
MatthewVernon created T361844: Swift TLS certificates will expire soon (14 April).
Thu, Apr 4, 2:34 PM · Patch-For-Review, SRE-swift-storage
MatthewVernon added a comment to T182085: Connect Phabricator to swift for storage of git-lfs and file uploads..

there was maybe a suggestion of using it for files uploaded to phab?

Thu, Apr 4, 9:03 AM · git-lfs, Release-Engineering-Team (Seen), User-MModell, Phabricator, SRE-swift-storage
MatthewVernon added a comment to T357333: SystemdUnitFailed alerts are too noisy for data-persistence.

@fgiunchedi did this change get undeployed somehow? we've had alerts every 4 hours about SystemdUnitFailed on db2202:9100
(since 19:32 UTC yesterday)...

Thu, Apr 4, 8:32 AM · Data-Persistence, Observability-Alerting

Mar 25 2024

MatthewVernon added a comment to T360913: Swift proxy server misbehaviour (no longer calling `accept`?).

Reported upstream as Bug #2058945.

Mar 25 2024, 4:15 PM · SRE-swift-storage
MatthewVernon renamed T360913: Swift proxy server misbehaviour (no longer calling `accept`?) from Swift server misbehaviour (no longer calling `accept`?) to Swift proxy server misbehaviour (no longer calling `accept`?).
Mar 25 2024, 3:59 PM · SRE-swift-storage
MatthewVernon created T360913: Swift proxy server misbehaviour (no longer calling `accept`?).
Mar 25 2024, 3:53 PM · SRE-swift-storage

Mar 22 2024

MatthewVernon added a comment to T360589: De-fragment thumbnail sizes in mediawiki.

This seems broadly sensible - what's the concrete proposal in terms of which thumb sizes will be supported/generated?

Mar 22 2024, 2:59 PM · Commons, MediaWiki-File-management, Data-Persistence
MatthewVernon added a comment to T345334: Cache thumbs in our caching infrastructure (e.g. ATS).

One thing that was discussed at the SRE meeting in Warsaw was looking at turnilo data (which IIRC is the last 90 days' requests) to effectively simulate a cache and ask questions about the relationship between cache size/age and hit/miss ratios and so on.
[which might be a useful KR for the forthcoming quarter]
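
A minimal sketch of that kind of simulation, assuming the request data can be reduced to an ordered list of object keys. The LRU eviction policy, the `lru_hit_ratio` name, and the toy request list are all illustrative assumptions, not a description of how ATS or any production cache behaves; a real study would also need to weight by object size and age.

```python
from collections import OrderedDict

def lru_hit_ratio(requests, capacity):
    """Replay `requests` through an LRU cache holding `capacity` objects."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for key in requests:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict least recently used
    return hits / total if total else 0.0

# Hypothetical request stream; real input would come from the request logs.
requests = ["a", "b", "a", "c", "b", "a", "d", "a"]
for size in (2, 3):
    print(size, lru_hit_ratio(requests, size))
```

Sweeping `capacity` over a range of sizes gives the hit-ratio curve the comment asks about.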

Mar 22 2024, 2:56 PM · SRE, Thumbor, SRE-swift-storage, Traffic

Mar 21 2024

MatthewVernon updated the task description for T360617: docker-pkg error messages are confusing.
Mar 21 2024, 11:54 AM · docker-pkg
MatthewVernon updated the task description for T360617: docker-pkg error messages are confusing.
Mar 21 2024, 11:54 AM · docker-pkg
MatthewVernon created T360619: docker-pkg: it would be nice to have a --rebuild option or similar.
Mar 21 2024, 11:47 AM · docker-pkg
MatthewVernon added a comment to T360617: docker-pkg error messages are confusing.

(if you try and build any other image, you still get the same error, which led me to assume my docker-pkg setup was faulty somehow - again, in hindsight, the error message does say it's unhappy with the ceph image, but the naïve user will observe the same bad request for localhost/v1.41/images/docker-registry.wikimedia.org/ and get led down the garden path.)

Mar 21 2024, 11:43 AM · docker-pkg
MatthewVernon created T360617: docker-pkg error messages are confusing.
Mar 21 2024, 11:40 AM · docker-pkg

Mar 20 2024

MatthewVernon added a comment to T357547: ☂️ Northward Datacentre Switchover (March 2024) .

Noting here for future reference - we found that thumbor was incorrectly using the global discovery record for swift, which meant that codfw-thumbor was trying to talk to eqiad-swift after codfw-swift was depooled, resulting in a rise in TempAuth errors (and 401s):

swift_tempauth_graph.png (812×1 px, 91 KB)

Mar 20 2024, 1:44 PM · Patch-For-Review, Datacenter-Switchover, Data-Persistence, SRE Observability (FY2023/2024-Q3), collaboration-services, observability, serviceops, DC-Ops, Traffic
MatthewVernon added a comment to T358738: Commons thumbnails are broken for certain large sizes of thumbnail images.

Yes, we don't replicate thumbnails between DCs any more (and this has been the case since July 2022; cf. T313102).

Mar 20 2024, 8:21 AM · SRE-swift-storage, serviceops, Commons

Mar 7 2024

MatthewVernon added a comment to T359077: 2024-2025 ms swift capacity.

Additionally, we are retiring the last 9 12x4T nodes from eqiad and the last 6 12x4T nodes from codfw and replacing them with 24x8T units.

Mar 7 2024, 5:34 PM · SRE-swift-storage

Mar 5 2024

MatthewVernon added a comment to T359176: Long-titled archived files can get its path metadata truncated due to not having enough storage space, leading to orphan, non accesible files (was: Two files on commons have invalid UTF-8 characters in path metadata).

The issue is that the path is stored as varbinary(255), and the path length is checked at upload so that it does not exceed that. But archiving then prepends "archive" and a date string to the path, so the stored value gets truncated.
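
A rough sketch of the arithmetic, assuming the column simply keeps the first 255 bytes of the UTF-8 path. The title, the `1/1e/` upload prefix, and the `stored_path` helper are hypothetical; the archive prefix follows the shape of the object names in the `swift stat` output in the neighbouring comments.

```python
LIMIT = 255  # bytes available in a varbinary(255) column

def stored_path(path, limit=LIMIT):
    """Model storage as keeping only the first `limit` bytes (assumption)."""
    return path.encode("utf-8")[:limit]

# Hypothetical long Cyrillic title: 120 two-byte characters plus ".pdf".
title = "Ф" * 120 + ".pdf"
upload_path = "1/1e/" + title
archive_path = "archive/1/1e/" + "20231203130229!" + title

print(len(upload_path.encode("utf-8")))   # within the limit at upload time
print(len(archive_path.encode("utf-8")))  # over the limit once archived

stored = stored_path(archive_path)
try:
    stored.decode("utf-8")
    truncated_cleanly = True
except UnicodeDecodeError:
    # The cut landed in the middle of a multi-byte character.
    truncated_cleanly = False
print(truncated_cleanly)
```

With multi-byte titles the cut can fall inside a character, which would be consistent with the invalid UTF-8 mentioned in the original task title.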

Mar 5 2024, 3:11 PM · Patch-For-Review, Commons, MediaWiki-File-management, media-backups
MatthewVernon added a comment to T359176: Long-titled archived files can get its path metadata truncated due to not having enough storage space, leading to orphan, non accesible files (was: Two files on commons have invalid UTF-8 characters in path metadata).

As does the second:

root@ms-fe1009:~# swift stat wikipedia-commons-local-public.1e 'archive/1/1e/20231203130229!ДАЖО_127-1-68.1897._Геодезичний_опис_ділянки_землі_вічного_чиншовика_Антона_Станіслава_Гарбовських_села_Рудня-Старики_Овруцького_повіту.pdf'
               Account: AUTH_mw
             Container: wikipedia-commons-local-public.1e
                Object: archive/1/1e/20231203130229!ДАЖО_127-1-68.1897._Геодезичний_опис_ділянки_землі_вічного_чиншовика_Антона_Станіслава_Гарбовських_села_Рудня-Старики_Овруцького_повіту.pdf
          Content Type: application/pdf
        Content Length: 23751233
         Last Modified: Sat, 09 Dec 2023 03:08:11 GMT
                  ETag: bf7ae1c816785fe887ad2846e13d8e11
       Meta Sha1Base36: a9bue5nc4oj88z3bf65tbh339kjh4un
           X-Timestamp: 1702091290.63527
         Accept-Ranges: bytes
            X-Trans-Id: tx6abde957d45f4e978361f-0065e72d4d
X-Openstack-Request-Id: tx6abde957d45f4e978361f-0065e72d4d
Mar 5 2024, 2:34 PM · Patch-For-Review, Commons, MediaWiki-File-management, media-backups
MatthewVernon added a comment to T359176: Long-titled archived files can get its path metadata truncated due to not having enough storage space, leading to orphan, non accesible files (was: Two files on commons have invalid UTF-8 characters in path metadata).

The first exists:

root@ms-fe1009:~# swift stat wikipedia-commons-local-public.16 'archive/1/16/20240116211741!Алфавітно-предметний_покажчик_за_1938_рік_до_Збірника_постанов_і_розпоряджень_Уряду_Української_Радянської_Соціалістичної_Республіки.pdf'
               Account: AUTH_mw
             Container: wikipedia-commons-local-public.16
                Object: archive/1/16/20240116211741!Алфавітно-предметний_покажчик_за_1938_рік_до_Збірника_постанов_і_розпоряджень_Уряду_Української_Радянської_Соціалістичної_Республіки.pdf
          Content Type: application/pdf
        Content Length: 1330605
         Last Modified: Tue, 16 Jan 2024 21:18:05 GMT
                  ETag: ac929ceaf65d932bf2bfe683643b47de
       Meta Sha1Base36: ja3vvtx04izk863x7mzzwc3wjkptjbn
           X-Timestamp: 1705439884.96510
         Accept-Ranges: bytes
            X-Trans-Id: tx8e6e322995b04547a920c-0065e72bf1
X-Openstack-Request-Id: tx8e6e322995b04547a920c-0065e72bf1
Mar 5 2024, 2:28 PM · Patch-For-Review, Commons, MediaWiki-File-management, media-backups

Mar 4 2024

MatthewVernon created T359077: 2024-2025 ms swift capacity.
Mar 4 2024, 4:19 PM · SRE-swift-storage

Feb 29 2024

MatthewVernon added a comment to T355872: Migrate servers in codfw rack B7 from asw-b7-codfw to lsw1-b7-codfw.

thanos and ms swift clusters OK post-move, thank you!

Feb 29 2024, 4:20 PM · SRE-swift-storage, netops, SRE, Infrastructure-Foundations, ops-codfw

Feb 26 2024

MatthewVernon added a comment to T358489: mw2420-mw2451 do have unnecessary raid controllers (configured).

If you do decide you might want to reprovision these nodes as non-RAID, there is a sre.swift.convert-disks cookbook that does most of the heavy lifting (though you'd probably need to relax the host restriction a bit).

Feb 26 2024, 2:00 PM · SRE, serviceops
MatthewVernon added a comment to T357380: Degraded RAID on mw2442.

After the reboot, you could still have made the new virtual drive with the last of those lines:

megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0
Feb 26 2024, 11:19 AM · serviceops, ops-codfw
MatthewVernon updated the task description for T358455: Primary outbound port utilisation over 80% alert muted.
Feb 26 2024, 9:41 AM · Traffic, Sustainability (Incident Followup), Infrastructure-Foundations, netops

Feb 25 2024

MatthewVernon created T358455: Primary outbound port utilisation over 80% alert muted.
Feb 25 2024, 10:51 PM · Traffic, Sustainability (Incident Followup), Infrastructure-Foundations, netops

Feb 23 2024

MatthewVernon added a comment to T200820: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0"..

Longer term, using swift large object support might be a better way to handle these files, since they are already chunked.

Feb 23 2024, 10:04 AM · MW-1.42-notes (1.42.0-wmf.23; 2024-03-19), SRE-swift-storage, MediaWiki-Uploading, User-revi, Multimedia

Feb 22 2024

MatthewVernon claimed T269108: Create a read-only swift identity for backup taking.
Feb 22 2024, 5:17 PM · Data-Persistence-Backup, SRE, SRE-swift-storage
MatthewVernon added a comment to T355868: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw.

Swift is back OK, thanks.

Feb 22 2024, 4:12 PM · SRE-swift-storage, ops-codfw, netops, Infrastructure-Foundations, SRE
MatthewVernon added a comment to T200820: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0"..

Here are the relevant logs, sorted by time:

moss-fe2001.codfw.wmnet: Feb 21 20:19:14 moss-fe2001 proxy-server: 10.192.48.105 10.192.32.51 21/Feb/2024/20/19/14 PUT /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 201 - wikimedia/multi-http-client%20v1.1 AUTH_tk7b6ed208d... - - - tx948576b6a72e4600a62d1-0065d65ac1 - 0.9693 - - 1708546753.577275038 1708546754.546539783 -
ms-fe1009.eqiad.wmnet: Feb 21 20:19:16 ms-fe1009 proxy-server: 10.192.48.105 10.64.0.166 21/Feb/2024/20/19/16 PUT /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 201 - wikimedia/multi-http-client%20v1.1 AUTH_tke5beae87e... - - - tx4683f2a851d249ea89bf1-0065d65ac2 - 1.5437 - - 1708546754.563668013 1708546756.107330084 -
ms-fe2013.codfw.wmnet: Feb 21 20:59:24 ms-fe2013 proxy-server: 10.194.152.76 10.192.0.87 21/Feb/2024/20/59/24 GET /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 200 - wikimedia/multi-http-client%20v1.1 AUTH_tk7b6ed208d... - 97498542 - tx8b68cb8506d34d1a9bb03-0065d66420 - 12.5442 - - 1708549152.229032040 1708549164.773195744 0
ms-fe2013.codfw.wmnet: Feb 21 21:00:02 ms-fe2013 proxy-server: 10.194.152.76 10.192.0.87 21/Feb/2024/21/00/02 DELETE /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 204 - wikimedia/multi-http-client%20v1.1 AUTH_tk7b6ed208d... - - - txfabad3d0f009413ea82dd-0065d66452 - 0.0458 - - 1708549202.850069761 1708549202.895848513 0
ms-fe1009.eqiad.wmnet: Feb 21 21:00:03 ms-fe1009 proxy-server: 10.194.152.76 10.64.0.166 21/Feb/2024/21/00/03 DELETE /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 204 - wikimedia/multi-http-client%20v1.1 AUTH_tke5beae87e... - - - tx26b0b6c03f1b4577a65b7-0065d66453 - 0.0454 - - 1708549203.070249796 1708549203.115652323 0
ms-fe2013.codfw.wmnet: Feb 21 21:03:12 ms-fe2013 proxy-server: 10.194.155.232 10.192.0.87 21/Feb/2024/21/03/12 GET /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 404 - wikimedia/multi-http-client%20v1.1 AUTH_tk7b6ed208d... - 70 - tx36a819fdb39c4d61bb4ee-0065d66510 - 0.0335 - - 1708549392.151752949 1708549392.185259581 0
ms-fe2013.codfw.wmnet: Feb 21 21:03:12 ms-fe2013 proxy-server: 10.194.155.232 10.192.0.87 21/Feb/2024/21/03/12 GET /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 404 - wikimedia/multi-http-client%20v1.1 AUTH_tk7b6ed208d... - 70 - txfffc016f7af4438eac05b-0065d66510 - 0.0284 - - 1708549392.332885504 1708549392.361268997 0
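
A small sketch of how a per-object timeline could be pulled out of lines like these. The field layout in the regex is an assumption read off the lines above (two addresses, a timestamp, method, path, protocol, status), and `parse` and `timeline` are illustrative helpers, not part of Swift. The sample lines are copied from the log above and trimmed after the status code.

```python
import re
from datetime import datetime

# Assumed layout after "proxy-server:": two addresses, timestamp,
# method, path, protocol, status. Later fields are ignored.
LINE_RE = re.compile(
    r"proxy-server: \S+ \S+ (?P<when>\S+) "
    r"(?P<method>[A-Z]+) (?P<path>\S+) \S+ (?P<status>\d{3})\b"
)

def parse(line):
    """Return (time, method, status, path), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    when = datetime.strptime(m["when"], "%d/%b/%Y/%H/%M/%S")
    return when, m["method"], int(m["status"]), m["path"]

def timeline(lines, path):
    """All parsed events for one object path, oldest first."""
    events = [e for e in map(parse, lines) if e and e[3] == path]
    return sorted(events)

OBJ = ("/v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/"
       "1aqaxe3dattw.u1suhi.6080484.webm.0")
HOST = "ms-fe2013.codfw.wmnet: Feb 21 "
SAMPLE = [
    HOST + "21:03:12 ms-fe2013 proxy-server: 10.194.155.232 10.192.0.87 "
           "21/Feb/2024/21/03/12 GET " + OBJ + " HTTP/1.0 404",
    HOST + "20:59:24 ms-fe2013 proxy-server: 10.194.152.76 10.192.0.87 "
           "21/Feb/2024/20/59/24 GET " + OBJ + " HTTP/1.0 200",
    HOST + "21:00:02 ms-fe2013 proxy-server: 10.194.152.76 10.192.0.87 "
           "21/Feb/2024/21/00/02 DELETE " + OBJ + " HTTP/1.0 204",
]

for when, method, status, _ in timeline(SAMPLE, OBJ):
    print(when.time(), method, status)
```

Read in order, the events show the object being fetched successfully, deleted, and only then requested again, which is where the 404 comes from.
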
Feb 22 2024, 2:40 PM · MW-1.42-notes (1.42.0-wmf.23; 2024-03-19), SRE-swift-storage, MediaWiki-Uploading, User-revi, Multimedia
MatthewVernon moved T269108: Create a read-only swift identity for backup taking from Radar to In progress on the SRE-swift-storage board.
Feb 22 2024, 1:51 PM · Data-Persistence-Backup, SRE, SRE-swift-storage
MatthewVernon added a comment to T269108: Create a read-only swift identity for backup taking.

@jcrespo can you try now, please?

Feb 22 2024, 1:17 PM · Data-Persistence-Backup, SRE, SRE-swift-storage

Feb 21 2024

MatthewVernon added a comment to T200820: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0"..

Apropos theory 3, we do run the swift object expirer, but the relevant headers are not set (except for some specific use cases e.g. phonos). So I don't think it can be that.

Feb 21 2024, 11:34 AM · MW-1.42-notes (1.42.0-wmf.23; 2024-03-19), SRE-swift-storage, MediaWiki-Uploading, User-revi, Multimedia
MatthewVernon added a comment to T200820: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0"..

I'd be surprised (and unhappy!) were swift randomly losing objects. If you have object names (ideally plus timestamps) from a recent example I could go grobbling in the logs to check.

Feb 21 2024, 11:31 AM · MW-1.42-notes (1.42.0-wmf.23; 2024-03-19), SRE-swift-storage, MediaWiki-Uploading, User-revi, Multimedia
MatthewVernon added a project to T357747: Capacity planning/estimation for Thanos: SRE-swift-storage.
Feb 21 2024, 10:37 AM · SRE-swift-storage, Observability-Metrics
MatthewVernon added a comment to T357747: Capacity planning/estimation for Thanos.

I think the proposed table should look like this?

Feb 21 2024, 10:37 AM · SRE-swift-storage, Observability-Metrics

Feb 20 2024

MatthewVernon added a comment to T355867: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw.

ms and thanos swift both OK post-move.

Feb 20 2024, 4:19 PM · SRE-swift-storage, ops-codfw, netops, Infrastructure-Foundations, SRE

Feb 16 2024

MatthewVernon closed T353149: Q3 ms backend refresh work as Resolved.
Feb 16 2024, 4:42 PM · SRE-swift-storage
MatthewVernon updated the task description for T353149: Q3 ms backend refresh work.
Feb 16 2024, 4:41 PM · SRE-swift-storage
MatthewVernon created T357790: decommission ms-be10[44-50].eqiad.wmnet.
Feb 16 2024, 4:39 PM · SRE-swift-storage, SRE, ops-eqiad, DC-Ops, decommission-hardware
MatthewVernon created T357764: spicerack.redfish needs to know about Jobs as well as Tasks.
Feb 16 2024, 12:06 PM · Spicerack, SRE-tools, SRE, Infrastructure-Foundations

Feb 14 2024

MatthewVernon added a comment to T357446: Inbound interface errors.

Great, thanks, I can confirm that swift is happy with that node.

Feb 14 2024, 3:03 PM · SRE-swift-storage, SRE, ops-codfw
MatthewVernon added a project to T357446: Inbound interface errors: SRE-swift-storage.
Feb 14 2024, 2:33 PM · SRE-swift-storage, SRE, ops-codfw
MatthewVernon added a comment to T357446: Inbound interface errors.

Yes, please go ahead whenever is convenient (if you can let me know when done I can check the node is still happy).

Feb 14 2024, 2:33 PM · SRE-swift-storage, SRE, ops-codfw
MatthewVernon added a comment to T357333: SystemdUnitFailed alerts are too noisy for data-persistence.

Thanks, this is definitely a step in the right direction :)

Feb 14 2024, 10:13 AM · Data-Persistence, Observability-Alerting

Feb 13 2024

MatthewVernon added a comment to T357441: Container Image policy for non-k8s uses.

Thanks for your comment.

Feb 13 2024, 5:22 PM · serviceops, SRE
MatthewVernon added a comment to T355863: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw.

Swift looks happy, thanks :)

Feb 13 2024, 4:52 PM · SRE-swift-storage, DBA, ops-codfw, netops, Infrastructure-Foundations, SRE
MatthewVernon created T357441: Container Image policy for non-k8s uses.
Feb 13 2024, 4:48 PM · serviceops, SRE

Feb 12 2024

MatthewVernon created T357333: SystemdUnitFailed alerts are too noisy for data-persistence.
Feb 12 2024, 5:24 PM · Data-Persistence, Observability-Alerting

Feb 10 2024

MatthewVernon added a comment to T357198: Page: cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80%.

Looking at this briefly (it's Saturday and the moment has passed): the request rate goes up somewhat (so it looks unusual, but not at a level I would expect to cause an issue), but both frontend and backend network utilisation are significantly elevated, which makes me wonder if this was a lot of hits on an original rather than a thumb or similar.

Feb 10 2024, 10:19 AM · SRE

Feb 9 2024

MatthewVernon added a project to T354872: Re-IP Swift hosts to per-rack subnets in codfw row A and B.: SRE-swift-storage.
Feb 9 2024, 3:45 PM · SRE-swift-storage, Infrastructure-Foundations, SRE
MatthewVernon added a comment to T354872: Re-IP Swift hosts to per-rack subnets in codfw row A and B..

Swift uses IP(v4) address (and then device name) as the identifier for entries in its rings.

Feb 9 2024, 3:44 PM · SRE-swift-storage, Infrastructure-Foundations, SRE

Feb 8 2024

MatthewVernon added a comment to T355544: Migrate hosts from codfw row A/B ASW to new LSW devices.

moss-be* hosts should be @MatthewVernon unless I am mistaken, in which case, please accept my apologies in advance :)

Feb 8 2024, 3:40 PM · ops-codfw, Infrastructure-Foundations, netops, SRE

Feb 7 2024

MatthewVernon added a comment to T356788: thanos-query probedown due to OOM of both eqiad titan frontends.

Turn it on at 15:55 UTC? </only-half-joking>

Feb 7 2024, 4:20 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Sustainability (Incident Followup), SRE, observability
MatthewVernon added a comment to T356788: thanos-query probedown due to OOM of both eqiad titan frontends.

We had a repeat at almost exactly the same time today, only this time neither node recovered and both needed power-cycling.
VO incident 4427.

Feb 7 2024, 4:19 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Sustainability (Incident Followup), SRE, observability
MatthewVernon added a comment to T355861: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw.

You're good to go re swift and thanos now.

Feb 7 2024, 3:47 PM · SRE-swift-storage, ops-codfw, netops, Infrastructure-Foundations, SRE
MatthewVernon updated the task description for T353149: Q3 ms backend refresh work.
Feb 7 2024, 3:06 PM · SRE-swift-storage
MatthewVernon created T356878: decommission ms-be20[44-50].
Feb 7 2024, 3:05 PM · SRE, DC-Ops, SRE-swift-storage, ops-codfw, decommission-hardware

Feb 6 2024

MatthewVernon added a comment to T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw.

swift backends look happy, thanks :)

Feb 6 2024, 4:56 PM · SRE-swift-storage, Data-Persistence-Backup, Data-Persistence, ops-codfw, netops, Infrastructure-Foundations, SRE
MatthewVernon created T356788: thanos-query probedown due to OOM of both eqiad titan frontends.
Feb 6 2024, 4:35 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Sustainability (Incident Followup), SRE, observability

Feb 1 2024

MatthewVernon added a comment to T356412: Consolidate TLS cert puppetry for ms and thanos swift frontends.

[it's not immediately obvious to me what the extra work of cfssl gets us over sslcert]

Feb 1 2024, 2:34 PM · SRE, SRE-swift-storage
MatthewVernon created T356412: Consolidate TLS cert puppetry for ms and thanos swift frontends.
Feb 1 2024, 2:33 PM · SRE, SRE-swift-storage

Jan 31 2024

MatthewVernon added a comment to T355914: Change default image thumbnail size.

I think the main issue is likely that we'll melt Thumbor if we just switch enwiki to 250, because 250 isn't a pregenerated size, and last time someone looked (T211661#8377883) only about 2% of requests were for that size. So I would assume that for the vast majority of images on enwiki we don't currently have a 250 thumb.

Jan 31 2024, 11:42 AM · Web Team Essential Work 2024, Web-Team-Backlog, Design, Wikimedia-Design, Thumbor, Traffic, SRE-swift-storage, Data-Persistence, SRE, Wikimedia-Site-requests
MatthewVernon added a project to T355914: Change default image thumbnail size: Thumbor.
Jan 31 2024, 11:25 AM · Web Team Essential Work 2024, Web-Team-Backlog, Design, Wikimedia-Design, Thumbor, Traffic, SRE-swift-storage, Data-Persistence, SRE, Wikimedia-Site-requests

Jan 30 2024

MatthewVernon added a comment to T211661: Automatically clean up unused thumbnails in Swift.

Thumbs that are being used get cached in the CDN in any case.

Jan 30 2024, 11:31 AM · MediaWiki-Platform-Team (Radar), Performance Issue, Traffic, SRE-swift-storage, SRE
MatthewVernon added a comment to T355872: Migrate servers in codfw rack B7 from asw-b7-codfw to lsw1-b7-codfw.

I'll want to check the backends once this work is complete, but it shouldn't be an issue.

Jan 30 2024, 11:27 AM · SRE-swift-storage, netops, SRE, Infrastructure-Foundations, ops-codfw
MatthewVernon added a project to T355872: Migrate servers in codfw rack B7 from asw-b7-codfw to lsw1-b7-codfw: SRE-swift-storage.
Jan 30 2024, 11:26 AM · SRE-swift-storage, netops, SRE, Infrastructure-Foundations, ops-codfw
MatthewVernon added a comment to T355868: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw.

The affected thanos frontend will need depooling.
Similarly, swift in codfw will need depooling.

Jan 30 2024, 11:25 AM · SRE-swift-storage, ops-codfw, netops, Infrastructure-Foundations, SRE
MatthewVernon added a project to T355868: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw: SRE-swift-storage.
Jan 30 2024, 11:23 AM · SRE-swift-storage, ops-codfw, netops, Infrastructure-Foundations, SRE
MatthewVernon added a comment to T355867: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw.

Once complete I'll want to check the backends, but this shouldn't be an issue.

Jan 30 2024, 11:22 AM · SRE-swift-storage, ops-codfw, netops, Infrastructure-Foundations, SRE
MatthewVernon added a project to T355867: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw: SRE-swift-storage.
Jan 30 2024, 11:21 AM · SRE-swift-storage, ops-codfw, netops, Infrastructure-Foundations, SRE
MatthewVernon added a comment to T355863: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw.

Once complete, I'll want to check the ms-be nodes are all happy (shouldn't be an issue).

Jan 30 2024, 11:21 AM · SRE-swift-storage, DBA, ops-codfw, netops, Infrastructure-Foundations, SRE
MatthewVernon added a project to T355863: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw: SRE-swift-storage.
Jan 30 2024, 11:19 AM · SRE-swift-storage, DBA, ops-codfw, netops, Infrastructure-Foundations, SRE
MatthewVernon added a comment to T355861: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw.

swift will need depooling in codfw before this work.
Likewise the affected thanos-fe node.

Jan 30 2024, 11:19 AM · SRE-swift-storage, ops-codfw, netops, Infrastructure-Foundations, SRE
MatthewVernon added a project to T355861: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw: SRE-swift-storage.
Jan 30 2024, 11:16 AM · SRE-swift-storage, ops-codfw, netops, Infrastructure-Foundations, SRE
MatthewVernon added a comment to T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw.

[I'll want to check afterwards that the ms-be nodes are happy, but this shouldn't be an issue]

Jan 30 2024, 11:14 AM · SRE-swift-storage, Data-Persistence-Backup, Data-Persistence, ops-codfw, netops, Infrastructure-Foundations, SRE
MatthewVernon added a project to T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw: SRE-swift-storage.
Jan 30 2024, 11:13 AM · SRE-swift-storage, Data-Persistence-Backup, Data-Persistence, ops-codfw, netops, Infrastructure-Foundations, SRE

Jan 29 2024

MatthewVernon closed T350020: Access request to deleted image files in the production Swift cluster as Resolved.
Jan 29 2024, 5:32 PM · Data-Persistence, UploadWizard, Structured-Data-Backlog
MatthewVernon closed T350020: Access request to deleted image files in the production Swift cluster, a subtask of T349641: [Investigation EPIC] Machine detection for media with copyright issues in Upload Wizard on Commons, as Resolved.
Jan 29 2024, 5:31 PM · UploadWizard, Epic, Structured-Data-Backlog (Current Work)
MatthewVernon added a comment to T356033: Disk (sdl) failed in ms-be1068.

@Jclark-ctr thank you for the quick swap, much appreciated :-)

Jan 29 2024, 4:41 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon added a comment to T350020: Access request to deleted image files in the production Swift cluster.

@mfossati Of the 12,000 objects you named, I could find 11,608 in the database, and was able to download 11,596 objects.
out_of_domain.tar.bz2 is 23G, and available on stat1008 like the others.

Jan 29 2024, 4:31 PM · Data-Persistence, UploadWizard, Structured-Data-Backlog
MatthewVernon added a comment to T350020: Access request to deleted image files in the production Swift cluster.

Full error message:
Access Denied: Restricted File
You do not have permission to view this object.
Users with the "Can View" capability:

Jan 29 2024, 11:29 AM · Data-Persistence, UploadWizard, Structured-Data-Backlog
MatthewVernon added a comment to T350020: Access request to deleted image files in the production Swift cluster.

@MatthewVernon, please find attached the out-of-domain sample:


Looking forward to it, thanks again!

Jan 29 2024, 11:28 AM · Data-Persistence, UploadWizard, Structured-Data-Backlog
MatthewVernon triaged T356033: Disk (sdl) failed in ms-be1068 as High priority.
Jan 29 2024, 9:55 AM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon created T356033: Disk (sdl) failed in ms-be1068.
Jan 29 2024, 9:54 AM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon added a comment to T352215: Error 503, Backend fetch failed while uploading file from Internet Archive.

That was due to an incident - T356022

Jan 29 2024, 8:59 AM · SRE-swift-storage, Internet-Archive, Commons

Jan 26 2024

MatthewVernon added a comment to T350020: Access request to deleted image files in the production Swift cluster.

12k should be perfectly doable.

Jan 26 2024, 5:25 PM · Data-Persistence, UploadWizard, Structured-Data-Backlog
MatthewVernon added a comment to T350020: Access request to deleted image files in the production Swift cluster.

[I should say: these are all originals, because we wouldn't necessarily have thumbnails for deleted objects and couldn't straightforwardly generate them either]

Jan 26 2024, 9:20 AM · Data-Persistence, UploadWizard, Structured-Data-Backlog

Jan 25 2024

MatthewVernon added a comment to T350020: Access request to deleted image files in the production Swift cluster.

I've now done logos.tar.bz2, which is a 4.7G file; of the 11,153 objects you requested, filearchive contained 10,770 of them, and I was able to download 10,764 of those.

Jan 25 2024, 2:48 PM · Data-Persistence, UploadWizard, Structured-Data-Backlog
MatthewVernon archived P55686 Download deleted objects based on storagekey and original filename.
Jan 25 2024, 2:42 PM
MatthewVernon archived P55685 Extract storagekey for deleted objects.
Jan 25 2024, 2:41 PM