
Jan 23 2024

MatthewVernon added a comment to T355433: Problem with uploading large files (2 GB).

When did you try with upload wizard and get the error message you describe here? I've checked the swift logs for 18 and 19 January, and get no hits at all for 1an8dgb0q6ow.gr4vk6.12187057.pdf.0.

Jan 23 2024, 10:36 AM · SRE-swift-storage, UploadWizard
MatthewVernon added a comment to T355610: Problem uploading 4GB FLAC file in Upload Wizard to Wikimedia Commons.

I don't think this is a result of a swift failure, so we'd need input from the upload wizard folks. Looking in the swift logs, I see:

Jan 23 2024, 10:25 AM · SRE-swift-storage, UploadWizard

Jan 22 2024

MatthewVernon added a comment to T355465: thanos-be1001 disk space alerts.

@fgiunchedi are you able to look at thanos retention again, please? [I think T351927 is related].

Jan 22 2024, 9:20 AM · observability, Grafana

Jan 18 2024

MatthewVernon triaged T355330: Disk (sda) failed in ms-be2072 as High priority.
Jan 18 2024, 2:53 PM · SRE, SRE-swift-storage, ops-codfw, DC-Ops
MatthewVernon created T355330: Disk (sda) failed in ms-be2072.
Jan 18 2024, 2:53 PM · SRE, SRE-swift-storage, ops-codfw, DC-Ops

Jan 11 2024

MatthewVernon closed T354766: Create swift account for netbox-next, a subtask of T310717: Netbox: get rid of WMF Production Patches, as Resolved.
Jan 11 2024, 4:11 PM · Patch-For-Review, netbox, Infrastructure-Foundations
MatthewVernon closed T354766: Create swift account for netbox-next as Resolved.

The new account is created for you.

Jan 11 2024, 4:11 PM · SRE-swift-storage
MatthewVernon committed rLPRI01e66179b517: hiera: add fake swift passwords for netbox_dev user.
Jan 11 2024, 3:42 PM

Jan 10 2024

MatthewVernon added a comment to T354766: Create swift account for netbox-next.

Usage will indeed be light, most likely a few cat pictures.

Jan 10 2024, 2:51 PM · SRE-swift-storage
MatthewVernon added a comment to T354766: Create swift account for netbox-next.

Hi!
I can certainly create you another swift account. Naming things is hard, but are you sure you want netbox-next rather than, say, netbox-dev? To me, netbox-next sounds like an account you plan to move prod to in due course rather than one you want to use for development/testing.

Jan 10 2024, 2:23 PM · SRE-swift-storage
MatthewVernon edited projects for T354718: Show a better error page when returning an HTTP 429, not the "Our servers are currently under maintenance" one for 5xxs, added: Traffic; removed WMF-General-or-Unknown.
Jan 10 2024, 2:18 PM · Traffic, SRE

Jan 9 2024

MatthewVernon added a comment to T354516: Requesting write access to ml-staging-codfw for ML team.

Hi. I'm the clinician on duty this week. I'm afraid I'm not quite clear what sort of access you are requesting here (ml-staging-codfw isn't a group I can see in puppet, nor is it an LDAP group)?

Jan 9 2024, 11:17 AM · Patch-For-Review, SRE, Machine-Learning-Team
MatthewVernon moved T354516: Requesting write access to ml-staging-codfw for ML team from Untriaged to Awaiting User Input on the SRE-Access-Requests board.
Jan 9 2024, 11:17 AM · Patch-For-Review, SRE, Machine-Learning-Team
MatthewVernon moved T354276: Grant Access to wmde, nda for Dima Koushha from Backlog to NDA Pending on the LDAP-Access-Requests board.
Jan 9 2024, 10:17 AM · SRE, LDAP-Access-Requests
MatthewVernon moved T354049: Requesting access to <restricted> for Arthur Taylor from Untriaged to Awaiting User Input on the SRE-Access-Requests board.
Jan 9 2024, 10:14 AM · User-ItamarWMDE, SRE, SRE-Access-Requests
MatthewVernon moved T353958: Requesting access to deployment for wenjun fan from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Jan 9 2024, 10:12 AM · SRE, SRE-Access-Requests

Jan 8 2024

MatthewVernon added a comment to T354407: Error creating thumbnail: Unknown option --no-external-files.

@hnowlan I've taken the SRE tag off; if that's incorrect do shout (but my clinician hat will then want a team who should own it...)

Jan 8 2024, 4:53 PM · Thumbor, Trust and Safety Product Sprint (Sprint Northumbrian smallpipes (8th Jan.‘24 - 19th Jan.'24)), MediaModeration (MediaModeration 2.0)
MatthewVernon removed a project from T354407: Error creating thumbnail: Unknown option --no-external-files: SRE.
Jan 8 2024, 4:52 PM · Thumbor, Trust and Safety Product Sprint (Sprint Northumbrian smallpipes (8th Jan.‘24 - 19th Jan.'24)), MediaModeration (MediaModeration 2.0)

Jan 4 2024

MatthewVernon added a project to T141756: audit / test / upgrade hp smartarray P840 firmware: SRE-swift-storage.
Jan 4 2024, 1:30 PM · SRE-swift-storage, SRE
MatthewVernon added a comment to T141756: audit / test / upgrade hp smartarray P840 firmware.

Hm, actually, that list from netbox includes servers not in the description of this task (ah, and they have manufacturer = HPE not HP) and the necessary binary is now /usr/sbin/ssacli. So the check now looks like:

mvernon@cumin2002:~$ sudo cumin "A:swift and P{F:manufacturer = HPE}" 'if [ -x /usr/sbin/ssacli ] ; then cat /sys/class/scsi_disk/*\:1\:0\:0/device/rev; fi '
15 hosts will be targeted:
ms-be[2051-2056].codfw.wmnet,ms-be[1051-1059].eqiad.wmnet
OK to proceed on 15 hosts? Enter the number of affected hosts to confirm or "q" to quit: 15
===== NODE GROUP =====                                                          
(15) ms-be[2051-2056].codfw.wmnet,ms-be[1051-1059].eqiad.wmnet                  
----- OUTPUT of 'if [ -x /usr/sbi.../device/rev; fi ' -----                     
1.98                                                                            
================                                                                
PASS |████████████████████████████████| 100% (15/15) [00:01<00:00, 14.15hosts/s]
FAIL |                                         |   0% (0/15) [00:01<?, ?hosts/s]
100.0% (15/15) success ratio (>= 100.0% threshold) for command: 'if [ -x /usr/sbi.../device/rev; fi '.
100.0% (15/15) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Jan 4 2024, 10:39 AM · SRE-swift-storage, SRE
MatthewVernon added a comment to T354180: Disk (sdh) failed in ms-be2068.

@Papaul Thanks for the quick swap :)

Jan 4 2024, 9:39 AM · SRE, SRE-swift-storage, ops-codfw, DC-Ops

Jan 3 2024

MatthewVernon added a comment to T353149: Q3 ms backend refresh work.

Similarly, 3 unhappy nodes in codfw from the install, all done now.

Jan 3 2024, 4:42 PM · SRE-swift-storage
MatthewVernon added a comment to T353149: Q3 ms backend refresh work.

Prep work (making sure all filesystems are mounted correctly) done on ms-be10[76-82]; three nodes had a filesystem left unhappy by the install.

Jan 3 2024, 4:30 PM · SRE-swift-storage
MatthewVernon added a comment to T141756: audit / test / upgrade hp smartarray P840 firmware.

[removing swift-storage tag as none of the relevant swift nodes are still in production]

Jan 3 2024, 12:04 PM · SRE-swift-storage, SRE
MatthewVernon removed a project from T141756: audit / test / upgrade hp smartarray P840 firmware: SRE-swift-storage.
Jan 3 2024, 12:04 PM · SRE-swift-storage, SRE

Jan 2 2024

MatthewVernon triaged T354180: Disk (sdh) failed in ms-be2068 as High priority.
Jan 2 2024, 11:34 AM · SRE, SRE-swift-storage, ops-codfw, DC-Ops
MatthewVernon created T354180: Disk (sdh) failed in ms-be2068.
Jan 2 2024, 11:33 AM · SRE, SRE-swift-storage, ops-codfw, DC-Ops

Dec 23 2023

MatthewVernon added a comment to T350192: On-call batphone escalation configuration holidays FY2023-24.

@lmata the UI looks a bit like a lot of individuals are on-call rather than batphone. Is that intentional, or am I confused?

Dec 23 2023, 4:06 PM · SRE Observability (FY2023/2024-Q4)

Dec 21 2023

MatthewVernon added a comment to T353498: Several people experiencing 'Internal error: Server failed to store temporary file' when trying to upload a file to Commons .

I've created T353871 to track the later failure mode (since it's something different going wrong to that reported at the start of this ticket), and subscribed the people who have complained here about that other problem.

Dec 21 2023, 10:40 AM · UploadWizard, SRE-swift-storage, Commons
MatthewVernon created T353871: Uploadwizard sometimes fails "Internal error: Server failed to publish temporary file".
Dec 21 2023, 10:38 AM · Structured-Data-Backlog, Wikimedia-production-error, Commons, SRE-swift-storage, UploadWizard
MatthewVernon added a comment to T349839: Q2:rack/setup/install ms-be refresh.

Thanks :)

Dec 21 2023, 10:25 AM · SRE-swift-storage, SRE, Data-Persistence, ops-codfw, DC-Ops
MatthewVernon moved T353149: Q3 ms backend refresh work from Inbox to In progress on the SRE-swift-storage board.
Dec 21 2023, 10:25 AM · SRE-swift-storage
MatthewVernon updated the task description for T353149: Q3 ms backend refresh work.
Dec 21 2023, 10:24 AM · SRE-swift-storage

Dec 20 2023

MatthewVernon added a project to T353498: Several people experiencing 'Internal error: Server failed to store temporary file' when trying to upload a file to Commons : UploadWizard.
Dec 20 2023, 6:09 PM · UploadWizard, SRE-swift-storage, Commons
MatthewVernon added a comment to T353498: Several people experiencing 'Internal error: Server failed to store temporary file' when trying to upload a file to Commons .

But evidently plenty of people can upload (we have metrics for this - e.g. this panel of the Swift dashboard showing a couple of thousand uploads in the last hour), and bundling different failure modes into one Phab ticket just leads to confusion.

Dec 20 2023, 4:45 PM · UploadWizard, SRE-swift-storage, Commons
MatthewVernon added a comment to T353498: Several people experiencing 'Internal error: Server failed to store temporary file' when trying to upload a file to Commons .

That's a different issue, then; and I don't know anything about the internals of the upload wizard (so what the underlying operation at that point is), nor if/what/where it logs more useful details...

Dec 20 2023, 4:38 PM · UploadWizard, SRE-swift-storage, Commons
MatthewVernon added a comment to T353797: Missing original File:Ignatyevo.jpg.

I don't know, and I suspect it is impossible to know, a number of years after the fact.

Dec 20 2023, 4:04 PM · media-backups, Data-Persistence, SRE-swift-storage
MatthewVernon closed T353498: Several people experiencing 'Internal error: Server failed to store temporary file' when trying to upload a file to Commons as Resolved.

So I tried to reproduce this just now with the PDF noted at the top of this ticket, and it works for me (it gets as far as the duplicate file check, at which point I get the error message that tells me someone has managed to successfully upload this image).
Similarly, I am able to upload the PDF linked by @Smaxims (which hasn't yet been uploaded by anyone else).

Dec 20 2023, 2:00 PM · UploadWizard, SRE-swift-storage, Commons
MatthewVernon closed T353797: Missing original File:Ignatyevo.jpg as Declined.

The object doesn't appear in the container listing either (so it's not a "ghost" as we have seen occasionally) (I checked with swift list wikipedia-en-local-public.27 --prefix 2/27/Ignatyevo.jpg). I note a briefly-added speedy deletion request in 2019 on the grounds of missing or corrupted image, so maybe it wasn't even present then...

Dec 20 2023, 1:33 PM · media-backups, Data-Persistence, SRE-swift-storage

Dec 15 2023

MatthewVernon added a comment to T308644: unstable device mapping of SSDs causing swift/puppet problems - example reimage.

I think we're at the point where we're going to move to the more reliable approach on a rolling basis as hardware gets replaced; so other than ironing out the remaining cookbook issues, probably not.

Dec 15 2023, 3:39 PM · SRE-swift-storage
MatthewVernon closed T341488: Split Thanos components from thanos-fe hosts into titan hosts as Resolved.

Done, thanks for the reminder.

Dec 15 2023, 11:42 AM · SRE Observability (FY2023/2024-Q1), User-fgiunchedi, SRE-swift-storage, Observability-Metrics
MatthewVernon added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

The plural of anecdote is not data, but: I had one system (ms-fe2013) that did this when rebooted by the reimage cookbook; I did a cold power cycle and hit F12 for PXE boot and it again didn't DHCP; I then set "PXE" from the Boot menu (via the HTML iDRAC interface) and did a warm boot, and then it PXEd OK.

Dec 15 2023, 9:17 AM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad

Dec 12 2023

MatthewVernon awarded T351927: Decide and tweak Thanos retention a Like token.
Dec 12 2023, 10:09 AM · Patch-For-Review, User-fgiunchedi, Observability-Metrics

Dec 11 2023

MatthewVernon added a comment to T352003: Create a dedicated image for Debian package builds.

@Jelto my various pipelines in repos I use for testing (pcre2 and swift) are now all passing.

Dec 11 2023, 2:47 PM · collaboration-services
MatthewVernon added a project to T349839: Q2:rack/setup/install ms-be refresh: SRE-swift-storage.
Dec 11 2023, 2:21 PM · SRE-swift-storage, SRE, Data-Persistence, ops-codfw, DC-Ops
MatthewVernon renamed T353149: Q3 ms backend refresh work from Q2 ms backend refresh work to Q3 ms backend refresh work.
Dec 11 2023, 2:19 PM · SRE-swift-storage
MatthewVernon created T353149: Q3 ms backend refresh work.
Dec 11 2023, 2:18 PM · SRE-swift-storage
MatthewVernon added a comment to T352003: Create a dedicated image for Debian package builds.

I think https://gitlab.wikimedia.org/repos/sre/wmf-debci/-/merge_requests/7 will fix this.

Dec 11 2023, 2:03 PM · collaboration-services
MatthewVernon added a comment to T352003: Create a dedicated image for Debian package builds.

I think we may be missing an apt update step now - I now get failures like https://gitlab.wikimedia.org/repos/data_persistence/swift/-/jobs/174503 :

Dec 11 2023, 1:35 PM · collaboration-services
MatthewVernon added a comment to T353091: Disk space thanos-be1001:9100 alert.

Quite significant growth in thanos disk usage over the last 6 months:
https://grafana.wikimedia.org/d/NDWQoBiGk/thanos-swift?orgId=1&from=1686482606897&to=1702293746897&var-site=eqiad&var-prometheus=thanos&var-cluster=thanos&viewPanel=9

Dec 11 2023, 11:24 AM · Observability-Metrics, Grafana, SRE-swift-storage
MatthewVernon added a comment to T353091: Disk space thanos-be1001:9100 alert.

@fgiunchedi are you in a position to reduce some thanos disk usage/retention? Most swift drives are 93-94% full now:

mvernon@thanos-fe1001:~$ sudo swift-recon -d --human-readable
===============================================================================
--> Starting reconnaissance on 112 hosts (object)
===============================================================================
[2023-12-11 11:20:34] Checking disk usage now
Distribution Graph:
  0%   17 *****************
  1%    1 *
  2%    2 **
  3%    1 *
  4%    1 *
  6%    2 **
  9%    1 *
 10%    2 **
 19%    1 *
 20%    1 *
 25%    1 *
 34%    1 *
 41%    1 *
 92%    6 ******
 93%   68 *********************************************************************
 94%   21 *********************
 95%    1 *
Disk usage: space used: 360 TB of 390 TB
Disk usage: space free: 30 TB of 390 TB
Disk usage: lowest: 0.72%, highest: 95.16%, avg: 92.19293660063543%
===============================================================================
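As a rough cross-check of the "most drives" claim, the distribution graph above can be tallied with a short Python sketch. It assumes each histogram row means "this many disks at this percentage full"; the helper name parse_distribution is made up for illustration and is not part of swift-recon.

```python
import re

# Histogram rows copied from the swift-recon output above.
RECON_OUTPUT = """\
  0%   17 *****************
  1%    1 *
  2%    2 **
  3%    1 *
  4%    1 *
  6%    2 **
  9%    1 *
 10%    2 **
 19%    1 *
 20%    1 *
 25%    1 *
 34%    1 *
 41%    1 *
 92%    6 ******
 93%   68 *********************************************************************
 94%   21 *********************
 95%    1 *
"""


def parse_distribution(text):
    """Return {percent_full: disk_count} from 'NN%  count ***' rows."""
    buckets = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\d+)%\s+(\d+)\s+\*+\s*$", line)
        if match:
            buckets[int(match.group(1))] = int(match.group(2))
    return buckets


buckets = parse_distribution(RECON_OUTPUT)
total = sum(buckets.values())
nearly_full = sum(count for pct, count in buckets.items() if pct >= 92)
print(f"{nearly_full} of {total} disks are at least 92% full")
# prints "96 of 128 disks are at least 92% full"
```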
Dec 11 2023, 11:22 AM · Observability-Metrics, Grafana, SRE-swift-storage

Dec 5 2023

MatthewVernon added a project to T352744: OpenSSL 3.x performance issues: SRE-swift-storage.
Dec 5 2023, 5:12 PM · SRE-swift-storage, Traffic
MatthewVernon added a comment to T352744: OpenSSL 3.x performance issues.

I think ms-* swift will fall foul of this too, via the wmf-rewrite middleware (which is using python's urllib.request.build_opener to talk to e.g. thumbor.svc.codfw.wmnet:8800 ) [I'm not 100% sure, that might be http rather than https?]
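Whether the OpenSSL issue applies depends on the URL scheme the opener is asked to fetch. A minimal Python sketch of that check follows; it makes no network request, the two URLs are illustrative only, and it is not taken from the wmf-rewrite code.

```python
import urllib.request
from urllib.parse import urlsplit


def needs_tls(url):
    """True if fetching this URL would go through the TLS (https) handler."""
    return urlsplit(url).scheme == "https"


# build_opener() returns an OpenerDirector with the default handlers installed;
# which handler serves a request is decided by the URL scheme at open() time.
opener = urllib.request.build_opener()

for url in (
    "http://thumbor.svc.codfw.wmnet:8800/",   # plain HTTP: no TLS involved
    "https://thumbor.svc.codfw.wmnet:8800/",  # hypothetical https variant
):
    print(url, "-> TLS" if needs_tls(url) else "-> no TLS")
```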

Dec 5 2023, 5:11 PM · SRE-swift-storage, Traffic

Dec 4 2023

MatthewVernon added a project to T349840: Q1:rack/setup/install ms-be refresh: SRE-swift-storage.
Dec 4 2023, 2:21 PM · SRE-swift-storage, SRE, Data-Persistence, ops-eqiad, DC-Ops
MatthewVernon added a comment to T349840: Q1:rack/setup/install ms-be refresh.

@Jclark-ctr sorry, there are some puppet changes that have to be made before new ms-be* nodes will install cleanly, which is why those nodes failed on Friday. I've made the relevant changes now, so you should be good to go.
I hadn't realised the new kit was quite that close to being ready, apologies for the hassle.

Dec 4 2023, 9:35 AM · SRE-swift-storage, SRE, Data-Persistence, ops-eqiad, DC-Ops

Nov 29 2023

MatthewVernon added a comment to T350924: Swift container for archived mariadb tables.

I think we've broadly agreed the "process"; do you want to put a wikitech page together with that (and the initial data set(s)) on? And suggest a name for the swift account and I'll get it made for you.

Nov 29 2023, 4:35 PM · SRE-swift-storage
MatthewVernon added a comment to T351283: Compile and package MariaDB 10.6.16 and 10.4.32.

I'm inclined to agree.

Nov 29 2023, 12:34 PM · DBA

Nov 28 2023

MatthewVernon added a comment to T191804: Allow to store files between 4 and 5 GB.

I think it's fair to say 12 5GB files a month would not be overwhelming (about 2TB of raw capacity per cluster per year given 3x replication, cf. our current growth rate of very approximately 120TB/year), and the underlying filesystems that swift sits upon could cope with some 5GB objects.
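The arithmetic behind that estimate, as a small Python sketch (decimal units, 1 TB = 1000 GB; the constant names are only for illustration):

```python
FILES_PER_MONTH = 12               # assumed upload rate from the comment above
FILE_SIZE_GB = 5                   # proposed maximum object size
REPLICAS = 3                       # 3x replication
CURRENT_GROWTH_TB_PER_YEAR = 120   # "very approximately", per the comment

raw_gb_per_year = FILES_PER_MONTH * 12 * FILE_SIZE_GB * REPLICAS
raw_tb_per_year = raw_gb_per_year / 1000
share = raw_tb_per_year / CURRENT_GROWTH_TB_PER_YEAR

print(f"{raw_tb_per_year:.2f} TB/year raw, {share:.1%} of current growth")
# prints "2.16 TB/year raw, 1.8% of current growth"
```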

Nov 28 2023, 11:45 AM · User-notice-archive, Data-Persistence-Backup, media-backups, SRE-swift-storage, MediaWiki-File-management, Commons, Multimedia
MatthewVernon added a comment to T351475: Reduce impact of Elastic snapshots.

I don't expect the change to make a difference to how anyone is using swift - moving from nginx to envoy for TLS termination was more about bringing swift more up-to-date in terms of TLS termination, and getting better observability and reliability.

Nov 28 2023, 10:23 AM · Data-Platform-SRE (2023.12.01 - 2023.12.31)

Nov 27 2023

MatthewVernon added a comment to T352003: Create a dedicated image for Debian package builds.

I don't feel strongly, but would incline to just build-essential and devscripts? [that'll pull in dpkg-dev]: that way packages will declare a correct set of Build-Depends (or not build), and git, dgit, ca-certificates are all pretty small so I don't think they'll extend build times quite as much (plus they're only needed for some of the optional jobs).

Nov 27 2023, 4:26 PM · collaboration-services
MatthewVernon added a project to T350917: Incomplete files uploaded - chunked upload drops last chunk.: UploadWizard.
Nov 27 2023, 12:30 PM · MW-1.42-notes (1.42.0-wmf.20; 2024-02-27), UploadWizard, SRE-swift-storage, Commons
MatthewVernon added a comment to T350917: Incomplete files uploaded - chunked upload drops last chunk..

Picking a recent failure:

mvernon@cumin1001:~$ sudo cumin -x --force --no-progress --no-color -o txt O:swift::proxy "zgrep -F '0/0e/Wikidata_43.jpg' /var/log/swift/proxy-access.log.3.gz" >~/junk/T350917.txt
#Cumin output elided
mvernon@cumin1001:~$ grep " PUT " junk/T350917.txt | grep 'public'
moss-fe2001.codfw.wmnet: Nov 24 19:12:12 moss-fe2001 proxy-server: 10.192.48.101 10.192.32.51 24/Nov/2023/19/12/12 PUT /v1/AUTH_mw/wikipedia-commons-local-public.0e/0/0e/Wikidata_43.jpg HTTP/1.0 201 - wikimedia/multi-http-client%20v1.1 AUTH_tk77cb529d3... 10485760 - 574193492001fdf36bdf02cbd36887a4 tx56ca15da11c2455281325-006560f58b - 0.2022 - - 1700853131.809281588 1700853132.011436224 0
ms-fe1013.eqiad.wmnet: Nov 24 19:12:12 ms-fe1013 proxy-server: 10.192.48.101 10.64.48.149 24/Nov/2023/19/12/12 PUT /v1/AUTH_mw/wikipedia-commons-local-public.0e/0/0e/Wikidata_43.jpg HTTP/1.0 201 - wikimedia/multi-http-client%20v1.1 AUTH_tk3f3803a6c... 10485760 - 574193492001fdf36bdf02cbd36887a4 tx460fd22327174e6ba0df8-006560f58c - 0.3835 - - 1700853132.315957069 1700853132.699444294 0
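To pull the interesting fields out of lines like these, a Python sketch such as the following may help. It assumes the whitespace-separated layout visible above (method, path, protocol, status, referer, user agent, token, bytes received, ...) after the "proxy-server: " prefix; the helper name parse_put is hypothetical.

```python
# First log line from the grep output above, split only for readability.
LINE = (
    "moss-fe2001.codfw.wmnet: Nov 24 19:12:12 moss-fe2001 proxy-server: "
    "10.192.48.101 10.192.32.51 24/Nov/2023/19/12/12 PUT "
    "/v1/AUTH_mw/wikipedia-commons-local-public.0e/0/0e/Wikidata_43.jpg "
    "HTTP/1.0 201 - wikimedia/multi-http-client%20v1.1 AUTH_tk77cb529d3... "
    "10485760 - 574193492001fdf36bdf02cbd36887a4 "
    "tx56ca15da11c2455281325-006560f58b - 0.2022 - - "
    "1700853131.809281588 1700853132.011436224 0"
)


def parse_put(line):
    """Extract method, path, status and received byte count from one line."""
    fields = line.split("proxy-server: ", 1)[1].split()
    return {
        "method": fields[3],
        "path": fields[4],
        "status": int(fields[6]),
        "bytes_received": int(fields[10]),  # assumed field position
    }


entry = parse_put(LINE)
print(entry["method"], entry["status"], entry["bytes_received"])
# prints "PUT 201 10485760"
print(entry["bytes_received"] == 10 * 1024 * 1024)
# prints "True"
```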
Nov 27 2023, 12:29 PM · MW-1.42-notes (1.42.0-wmf.20; 2024-02-27), UploadWizard, SRE-swift-storage, Commons
MatthewVernon added a comment to T352003: Create a dedicated image for Debian package builds.

To be clear - this needs to be one image per suite we currently support and predictably named so builddebs.yml can use these images instead of docker-registry.wikimedia.org/${SUITE}

Nov 27 2023, 9:47 AM · collaboration-services

Nov 26 2023

MatthewVernon added a comment to T350917: Incomplete files uploaded - chunked upload drops last chunk..

Perhaps worth noting that some of these images predate the move to envoy too (e.g. https://commons.wikimedia.org/wiki/File:Youngtimer_Trophy_Schwedenkreutz_2023_15.jpg ) so I don't think this is related to the recent nginx->envoy swift changes. I'm tempted to suggest a client issue?

Nov 26 2023, 1:23 PM · MW-1.42-notes (1.42.0-wmf.20; 2024-02-27), UploadWizard, SRE-swift-storage, Commons
MatthewVernon added a comment to T350917: Incomplete files uploaded - chunked upload drops last chunk..

Looking at recent uploads, there are definitely >10MB files being uploaded:
https://commons.wikimedia.org/wiki/File:Ambito_veneziano,_Martirio_di_san_Tommaso_Becket,_1260_ca.,_dal_palazzo_vescovile_di_tv,_05.jpg

Nov 26 2023, 1:20 PM · MW-1.42-notes (1.42.0-wmf.20; 2024-02-27), UploadWizard, SRE-swift-storage, Commons

Nov 24 2023

AlexisJazz awarded T351876: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB) a Like token.
Nov 24 2023, 1:29 PM · SRE-swift-storage, Traffic, Commons
MatthewVernon closed T351876: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB) as Resolved.

I can confirm that I can now download this file, even though it takes nearly 9 minutes on my rubbish internet:

matthew@aragorn:/tmp$ wget "https://upload.wikimedia.org/wikipedia/commons/3/3d/How_to_de-package_and_expose_a_GPU_flip_chip_die.webm"
--2023-11-24 13:16:32--  https://upload.wikimedia.org/wikipedia/commons/3/3d/How_to_de-package_and_expose_a_GPU_flip_chip_die.webm
Resolving upload.wikimedia.org (upload.wikimedia.org)... 2a02:ec80:300:ed1a::2:b, 185.15.59.240
Connecting to upload.wikimedia.org (upload.wikimedia.org)|2a02:ec80:300:ed1a::2:b|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1144169738 (1.1G) [video/webm]
Saving to: ‘How_to_de-package_and_expose_a_GPU_flip_chip_die.webm’
Nov 24 2023, 1:27 PM · SRE-swift-storage, Traffic, Commons
MatthewVernon added a comment to T351876: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB).

@AlexisJazz it's a timer for how long the connection has no data going over it: "Each time an encode/decode event for headers or data is processed for the stream, the timer will be reset."

Nov 24 2023, 1:14 PM · SRE-swift-storage, Traffic, Commons
MatthewVernon added a comment to T351876: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB).

Let's add it as an optional parameter, and try and pass it through.

Nov 24 2023, 12:02 PM · SRE-swift-storage, Traffic, Commons
MatthewVernon added a comment to T351876: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB).

...but profile::tlsproxy::envoy doesn't have that configuration available as far as I can see...

Nov 24 2023, 11:54 AM · SRE-swift-storage, Traffic, Commons
MatthewVernon triaged T351876: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB) as High priority.
Nov 24 2023, 11:36 AM · SRE-swift-storage, Traffic, Commons
MatthewVernon added a comment to T351876: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB).

nginx didn't enforce a timeout for the whole request but just a timeout (180s) between reads from the server so that won't be enough. To mimic the behavior you need to set the response timeout to 0 and stream_idle_timeout to 180s (dunno if the latter is supported by our puppetization)
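The difference between the two kinds of timeout can be illustrated with a toy Python model. This is not Envoy or nginx code: the function name and the traffic patterns are invented, and a value of 0 is taken to mean "disabled".

```python
def transfer_succeeds(gaps, response_timeout=0.0, idle_timeout=0.0):
    """Model a download as a list of gaps (seconds) between data events.

    response_timeout caps the whole transfer; idle_timeout caps any single
    gap with no data. A value of 0 disables the corresponding check.
    """
    elapsed = 0.0
    for gap in gaps:
        if idle_timeout and gap > idle_timeout:
            return False  # connection sat idle for too long
        elapsed += gap
        if response_timeout and elapsed > response_timeout:
            return False  # whole response took too long
    return True


steady = [1.0] * 540                           # about 9 minutes, data every second
stalled = [1.0] * 10 + [200.0] + [1.0] * 10    # one 200 s stall mid-transfer

print(transfer_succeeds(steady, response_timeout=65))   # prints "False"
print(transfer_succeeds(steady, idle_timeout=180))      # prints "True"
print(transfer_succeeds(stalled, idle_timeout=180))     # prints "False"
```

In this model a slow but steady download is cut off by a 65s whole-response limit, yet passes with only a 180s idle limit, which matches the symptom reported in this task.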

Nov 24 2023, 11:36 AM · SRE-swift-storage, Traffic, Commons
MatthewVernon added a comment to T351876: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB).

I think, per modules/tlsproxy/manifests/localssl.pp it was 180s when we were using nginx, so I'll adjust it accordingly.

Nov 24 2023, 11:31 AM · SRE-swift-storage, Traffic, Commons
MatthewVernon added a comment to T351876: Download cut off (envoy response timeout at 65s) for Commons file over CDN size threshold (1GB).

I'm trying to see what the previous nginx-based timeout was, but it's code I'm unfamiliar with

Nov 24 2023, 11:26 AM · SRE-swift-storage, Traffic, Commons

Nov 23 2023

MatthewVernon closed T317616: Revisit CDN<-->Swift communication as Resolved.

I think this is now done - ms clusters default to using envoy (I've not done anything to beta, but it should carry on using nginx just fine).

Nov 23 2023, 1:58 PM · SRE-swift-storage, SRE, Traffic

Nov 22 2023

MatthewVernon updated the task description for T317616: Revisit CDN<-->Swift communication.
Nov 22 2023, 4:57 PM · SRE-swift-storage, SRE, Traffic
MatthewVernon updated the task description for T317616: Revisit CDN<-->Swift communication.
Nov 22 2023, 4:47 PM · SRE-swift-storage, SRE, Traffic
MatthewVernon added a comment to T317616: Revisit CDN<-->Swift communication.

(perhaps the moss-fe2001 puppet failures are due to T350809 )

Nov 22 2023, 4:34 PM · SRE-swift-storage, SRE, Traffic
MatthewVernon updated the task description for T317616: Revisit CDN<-->Swift communication.
Nov 22 2023, 4:14 PM · SRE-swift-storage, SRE, Traffic
MatthewVernon updated the task description for T317616: Revisit CDN<-->Swift communication.
Nov 22 2023, 3:24 PM · SRE-swift-storage, SRE, Traffic
MatthewVernon added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

I can try if/when I get another one that fails (I'd be surprised if that were the solution, given "enough reboots" seems to have worked with the troublesome nodes I've had so far...)

Nov 22 2023, 3:23 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad
MatthewVernon changed the status of T317616: Revisit CDN<-->Swift communication from Stalled to In Progress.
Nov 22 2023, 2:36 PM · SRE-swift-storage, SRE, Traffic
MatthewVernon moved T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting from Inbox to Radar on the SRE-swift-storage board.
Nov 22 2023, 2:28 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad
MatthewVernon added projects to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting: ops-codfw, SRE-swift-storage.
Nov 22 2023, 2:25 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad
MatthewVernon added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

I hit this problem when re-imaging ms-fe* nodes (for T317616). Most of them PXE booted fine, but two didn't - ms-fe2014.codfw.wmnet needed one further reboot (which I did from the HTML console) before it would PXE, and ms-fe1013.eqiad.wmnet needed two further reboots - i.e. it wedged twice at the same point before finally PXEing properly.

Nov 22 2023, 2:25 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad

Nov 21 2023

MatthewVernon added a comment to T351653: thanos internal TLS failure after puppet 7 update.

@jbond thanks, that CR has fixed the sad services (and the openssl runes now work too).

Nov 21 2023, 11:09 AM · SRE, Infrastructure-Foundations, Puppet (Puppet 7.0), SRE-swift-storage

Nov 20 2023

MatthewVernon updated subscribers of T351653: thanos internal TLS failure after puppet 7 update.

[it was suggested I added jbond to this task]

Nov 20 2023, 5:03 PM · SRE, Infrastructure-Foundations, Puppet (Puppet 7.0), SRE-swift-storage
MatthewVernon updated the task description for T351653: thanos internal TLS failure after puppet 7 update.
Nov 20 2023, 4:25 PM · SRE, Infrastructure-Foundations, Puppet (Puppet 7.0), SRE-swift-storage
MatthewVernon added a comment to T351653: thanos internal TLS failure after puppet 7 update.

(priority set to high as we do use the swift-dispersion-stats to check for cluster health)

Nov 20 2023, 4:22 PM · SRE, Infrastructure-Foundations, Puppet (Puppet 7.0), SRE-swift-storage
MatthewVernon triaged T351653: thanos internal TLS failure after puppet 7 update as High priority.
Nov 20 2023, 4:22 PM · SRE, Infrastructure-Foundations, Puppet (Puppet 7.0), SRE-swift-storage
MatthewVernon created T351653: thanos internal TLS failure after puppet 7 update.
Nov 20 2023, 4:22 PM · SRE, Infrastructure-Foundations, Puppet (Puppet 7.0), SRE-swift-storage

Nov 17 2023

MatthewVernon reassigned T349839: Q2:rack/setup/install ms-be refresh from MatthewVernon to RobH.

Hi. I think:
hostnames: ms-be20[74-80]
racking: not more than 1 per rack, please, though they can share with existing nodes (e.g. you could put them in the racks the old systems are coming out of, which I think are A2,A7,B2,B7,C2,C7,D7)
networking: 10G private VLAN like existing ms-be* nodes
Partitioning/Raid: JBOD, please; unlike previous ms-be* nodes, we now want everything non-RAID (cf T308677)

Nov 17 2023, 3:33 PM · SRE-swift-storage, SRE, Data-Persistence, ops-codfw, DC-Ops
MatthewVernon reassigned T349840: Q1:rack/setup/install ms-be refresh from MatthewVernon to RobH.

Hi @RobH. I think:
hostnames: ms-be1076-1082
racking: no more than 1 server per rack, please (but they can go in racks that already have other ms-be* nodes in e.g. the ones that the old systems are coming out of which are A2,A4,B2,C2,D2)
networking setup: 10G private VLAN like existing ms-be* nodes
Partitioning/Raid: JBOD, please; unlike older ms-be* nodes, we now want everything non-RAID (cf T308677)
OS Distro: Bullseye

Nov 17 2023, 3:24 PM · SRE-swift-storage, SRE, Data-Persistence, ops-eqiad, DC-Ops
MatthewVernon moved T351431: Requesting access to deployment for sfaci from Awaiting User Input to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Nov 17 2023, 10:11 AM · SRE, SRE-Access-Requests
MatthewVernon added a comment to T351431: Requesting access to deployment for sfaci.

ssh pubkey confirmed OOB; this just needs group approval.

Nov 17 2023, 10:11 AM · SRE, SRE-Access-Requests
MatthewVernon updated the task description for T351431: Requesting access to deployment for sfaci.
Nov 17 2023, 10:10 AM · SRE, SRE-Access-Requests
MatthewVernon moved T351431: Requesting access to deployment for sfaci from Untriaged to Awaiting User Input on the SRE-Access-Requests board.
Nov 17 2023, 10:07 AM · SRE, SRE-Access-Requests
MatthewVernon updated subscribers of T351431: Requesting access to deployment for sfaci.

@thcipriani you're the approver for the deployment group, can you approve (or otherwise) this request, please?

Nov 17 2023, 10:07 AM · SRE, SRE-Access-Requests
MatthewVernon updated the task description for T351431: Requesting access to deployment for sfaci.
Nov 17 2023, 10:04 AM · SRE, SRE-Access-Requests

Nov 16 2023

MatthewVernon moved T351387: Requesting access to wmf and analytics-privatedata-users for EHughes (superset access with no server access) from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Nov 16 2023, 12:00 PM · SRE, SRE-Access-Requests