When did you try with the upload wizard and get the error message you describe here? I've checked the swift logs for 18 and 19 January, and get no hits at all for 1an8dgb0q6ow.gr4vk6.12187057.pdf.0.
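(For reference, a sketch of that kind of log search, following the same cumin/zgrep pattern as the Wikidata_43.jpg investigation further down this feed; the exact log paths and retention are assumptions here:)

$ sudo cumin -o txt O:swift::proxy "zgrep -F '1an8dgb0q6ow.gr4vk6.12187057.pdf.0' /var/log/swift/proxy-access.log*"   # zgrep handles both plain and rotated .gz logs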
Jan 23 2024
I don't think this is a result of a swift failure, so we'd need input from the upload wizard folks. Looking in the swift logs, I see:
Jan 22 2024
@fgiunchedi are you able to look at thanos retention again, please? [I think T351927 is related].
Jan 18 2024
Jan 11 2024
The new account has been created for you.
Jan 10 2024
In T354766#9450359, @ayounsi wrote: Usage will indeed be light, most likely a few cats pictures.
Hi!
I can certainly create you another swift account. Naming things is hard, but are you sure you want netbox-next rather than, say, netbox-dev? To me, netbox-next sounds like an account you plan to move prod to in due course rather than one you want to use for development/testing.
Jan 9 2024
Hi. I'm the clinician on duty this week. I'm afraid I'm not quite clear what sort of access you're requesting here: ml-staging-codfw isn't a group I can see in puppet, nor is it an LDAP group.
Jan 8 2024
@hnowlan I've taken the SRE tag off; if that's incorrect do shout (but my clinician hat will then want a team who should own it...)
Jan 4 2024
Hm, actually, that list from netbox includes servers not in the description of this task (ah, and they have manufacturer = HPE, not HP), and the necessary binary is now /usr/sbin/ssacli. So the check now looks like:
mvernon@cumin2002:~$ sudo cumin "A:swift and P{F:manufacturer = HPE}" 'if [ -x /usr/sbin/ssacli ] ; then cat /sys/class/scsi_disk/*\:1\:0\:0/device/rev; fi '
15 hosts will be targeted:
ms-be[2051-2056].codfw.wmnet,ms-be[1051-1059].eqiad.wmnet
OK to proceed on 15 hosts? Enter the number of affected hosts to confirm or "q" to quit: 15
===== NODE GROUP =====
(15) ms-be[2051-2056].codfw.wmnet,ms-be[1051-1059].eqiad.wmnet
----- OUTPUT of 'if [ -x /usr/sbi.../device/rev; fi ' -----
1.98
================
PASS |████████████████████████████████| 100% (15/15) [00:01<00:00, 14.15hosts/s]
FAIL |                                |   0% (0/15) [00:01<?, ?hosts/s]
100.0% (15/15) success ratio (>= 100.0% threshold) for command: 'if [ -x /usr/sbi.../device/rev; fi '.
100.0% (15/15) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
@Papaul Thanks for the quick swap :)
Jan 3 2024
Similarly, 3 unhappy nodes in codfw from the install, all done now.
Prep work (making sure all filesystems are mounted correctly) done on ms-be10[76-82]; three nodes had a filesystem unhappy from the install.
[removing swift-storage tag as none of the relevant swift nodes are still in production]
Jan 2 2024
Dec 23 2023
@lmata the UI looks a bit like a lot of individuals are on-call rather than batphone. Is that intentional, or am I confused?
Dec 21 2023
I've created T353871 to track the later failure mode (since it's something different going wrong to that reported at the start of this ticket), and subscribed the people who have complained here about that other problem.
Thanks :)
Dec 20 2023
But evidently plenty of people can upload (we have metrics for this - e.g. this panel of the Swift dashboard showing a couple of thousand uploads in the last hour), and bundling different failure modes into one Phab ticket just leads to confusion.
That's a different issue, then; and I don't know anything about the internals of the upload wizard (so I don't know what the underlying operation at that point is), nor whether/what/where it logs more useful details...
I don't know, and I suspect it is impossible to know, a number of years after the fact.
So I tried to reproduce this just now with the PDF noted at the top of this ticket, and it works for me (it gets as far as the duplicate file check, at which point I get the error message that tells me someone has managed to successfully upload this image).
Similarly, I am able to upload the PDF linked by @Smaxims (which hasn't yet been uploaded by anyone else).
The object doesn't appear in the container listing either (so it's not a "ghost" as we have seen occasionally) (I checked with swift list wikipedia-en-local-public.27 --prefix 2/27/Ignatyevo.jpg). I note a briefly-added speedy deletion request in 2019 on the grounds of missing or corrupted image, so maybe it wasn't even present then...
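(For the record, a sketch of that check; it assumes the swift auth environment variables are already set. A "ghost" would answer a HEAD on the object while being absent from the container listing:)

$ swift stat wikipedia-en-local-public.27 2/27/Ignatyevo.jpg            # 404 here => the object genuinely isn't stored
$ swift list wikipedia-en-local-public.27 --prefix 2/27/Ignatyevo.jpg   # empty output => not in the listing either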
Dec 15 2023
I think we're at the point where we're going to move to the more reliable approach on a rolling basis as hardware gets replaced; so other than ironing out the remaining cookbook issues, probably not.
Done, thanks for the reminder.
The plural of anecdote is not data, but: I had one system (ms-fe2013) that did this when rebooted by the reimage cookbook; I did a cold power cycle and hit F12 for PXE boot and it again didn't DHCP; I then set "PXE" from the Boot menu (via the HTML iDRAC interface) and did a warm boot, and then it PXEd OK.
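(In case it helps anyone else hitting this: the same dance can be done without the HTML iDRAC interface, e.g. over IPMI. This is a sketch rather than exactly what I ran, and the management hostname is illustrative:)

$ ipmitool -I lanplus -H ms-fe2013.mgmt.codfw.wmnet -U root -E chassis bootdev pxe   # one-shot PXE on next boot; -E reads the password from IPMI_PASSWORD
$ ipmitool -I lanplus -H ms-fe2013.mgmt.codfw.wmnet -U root -E power reset           # warm boot, as above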
Dec 12 2023
Dec 11 2023
@Jelto my various pipelines in repos I use for testing (pcre2 and swift) are now all passing.
I think https://gitlab.wikimedia.org/repos/sre/wmf-debci/-/merge_requests/7 will fix this.
I think we may be missing an apt update step now - I now get failures like https://gitlab.wikimedia.org/repos/data_persistence/swift/-/jobs/174503 :
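(The sort of fix I mean, as a sketch; the real wmf-debci image setup may structure this differently:)

$ apt-get update                                          # refresh package indexes first...
$ apt-get install -y --no-install-recommends devscripts   # ...so installs don't fail against stale/missing package lists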
Quite significant growth in thanos disk usage over the last 6 months:
https://grafana.wikimedia.org/d/NDWQoBiGk/thanos-swift?orgId=1&from=1686482606897&to=1702293746897&var-site=eqiad&var-prometheus=thanos&var-cluster=thanos&viewPanel=9
@fgiunchedi are you in a position to reduce some thanos disk usage/retention? Most swift drives are 93-94% full now:
mvernon@thanos-fe1001:~$ sudo swift-recon -d --human-readable
===============================================================================
--> Starting reconnaissance on 112 hosts (object)
===============================================================================
[2023-12-11 11:20:34] Checking disk usage now
Distribution Graph:
  0%   17 *****************
  1%    1 *
  2%    2 **
  3%    1 *
  4%    1 *
  6%    2 **
  9%    1 *
 10%    2 **
 19%    1 *
 20%    1 *
 25%    1 *
 34%    1 *
 41%    1 *
 92%    6 ******
 93%   68 *********************************************************************
 94%   21 *********************
 95%    1 *
Disk usage: space used: 360 TB of 390 TB
Disk usage: space free: 30 TB of 390 TB
Disk usage: lowest: 0.72%, highest: 95.16%, avg: 92.19293660063543%
===============================================================================
Dec 5 2023
I think ms-* swift will fall foul of this too, via the wmf-rewrite middleware (which is using python's urllib.request.build_opener to talk to e.g. thumbor.svc.codfw.wmnet:8800) [I'm not 100% sure, that might be http rather than https?]
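(One way to settle the http-vs-https question, as a sketch from any host that can reach the service:)

$ curl -sv -o /dev/null http://thumbor.svc.codfw.wmnet:8800/          # succeeds => the endpoint speaks plain HTTP
$ openssl s_client -connect thumbor.svc.codfw.wmnet:8800 </dev/null   # handshake failure => it's not TLS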
Dec 4 2023
@Jclark-ctr sorry, there are some puppet changes that have to be made before new ms-be* nodes will install cleanly, which is why those nodes failed on Friday. I've made the relevant changes now, so you should be good to go.
I hadn't realised the new kit was quite that close to being ready, apologies for the hassle.
Nov 29 2023
I think we've broadly agreed the "process"; do you want to put together a wikitech page with that (and the initial data set(s)) on it? And suggest a name for the swift account, and I'll get it made for you.
I'm inclined to agree.
Nov 28 2023
I think it's fair to say 12 5GB files a month would not be overwhelming (about 2TB of raw capacity per cluster per year given 3x replication, cf. our current growth rate of very approximately 120TB/year), and the underlying filesystems that swift sits upon could cope with some 5GB objects.
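(Back-of-the-envelope for that raw-capacity figure, as shell arithmetic:)

$ echo "$((12 * 5 * 12 * 3)) GB"   # 12 files/month x 5 GB/file x 12 months x 3 replicas
2160 GB

i.e. roughly 2.1 TB of raw capacity per cluster per year.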
I don't expect the change to make a difference to how anyone is using swift - moving from nginx to envoy for TLS termination was more about bringing swift up to date, and getting better observability and reliability.
Nov 27 2023
I don't feel strongly, but would incline to just build-essential and devscripts? [that'll pull in dpkg-dev]: that way packages will declare a correct set of Build-Depends (or not build), and git, dgit, ca-certificates are all pretty small so I don't think they'll extend build times quite as much (plus they're only needed for some of the optional jobs).
Picking a recent failure:
mvernon@cumin1001:~$ sudo cumin -x --force --no-progress --no-color -o txt O:swift::proxy "zgrep -F '0/0e/Wikidata_43.jpg' /var/log/swift/proxy-access.log.3.gz" >~/junk/T350917.txt
#Cumin output elided
mvernon@cumin1001:~$ grep " PUT " junk/T350917.txt | grep 'public'
moss-fe2001.codfw.wmnet: Nov 24 19:12:12 moss-fe2001 proxy-server: 10.192.48.101 10.192.32.51 24/Nov/2023/19/12/12 PUT /v1/AUTH_mw/wikipedia-commons-local-public.0e/0/0e/Wikidata_43.jpg HTTP/1.0 201 - wikimedia/multi-http-client%20v1.1 AUTH_tk77cb529d3... 10485760 - 574193492001fdf36bdf02cbd36887a4 tx56ca15da11c2455281325-006560f58b - 0.2022 - - 1700853131.809281588 1700853132.011436224 0
ms-fe1013.eqiad.wmnet: Nov 24 19:12:12 ms-fe1013 proxy-server: 10.192.48.101 10.64.48.149 24/Nov/2023/19/12/12 PUT /v1/AUTH_mw/wikipedia-commons-local-public.0e/0/0e/Wikidata_43.jpg HTTP/1.0 201 - wikimedia/multi-http-client%20v1.1 AUTH_tk3f3803a6c... 10485760 - 574193492001fdf36bdf02cbd36887a4 tx460fd22327174e6ba0df8-006560f58c - 0.3835 - - 1700853132.315957069 1700853132.699444294 0
To be clear - this needs to be one image per suite we currently support, predictably named, so builddebs.yml can use these images instead of docker-registry.wikimedia.org/${SUITE}
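(Purely by way of illustration, a sketch of what "predictably named" might look like; the wmf-debci-${SUITE} name here is hypothetical, not an existing image:)

$ SUITE=bookworm
$ docker pull "docker-registry.wikimedia.org/wmf-debci-${SUITE}:latest"   # hypothetical per-suite image name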
Nov 26 2023
Perhaps worth noting that some of these images predate the move to envoy too (e.g. https://commons.wikimedia.org/wiki/File:Youngtimer_Trophy_Schwedenkreutz_2023_15.jpg ) so I don't think this is related to the recent nginx->envoy swift changes. I'm tempted to suggest a client issue?
Looking at recent uploads, there are definitely >10MB files being uploaded:
https://commons.wikimedia.org/wiki/File:Ambito_veneziano,_Martirio_di_san_Tommaso_Becket,_1260_ca.,_dal_palazzo_vescovile_di_tv,_05.jpg
Nov 24 2023
I can confirm that I can now download this file, even though it takes nearly 9 minutes on my rubbish internet:
matthew@aragorn:/tmp$ wget "https://upload.wikimedia.org/wikipedia/commons/3/3d/How_to_de-package_and_expose_a_GPU_flip_chip_die.webm"
--2023-11-24 13:16:32--  https://upload.wikimedia.org/wikipedia/commons/3/3d/How_to_de-package_and_expose_a_GPU_flip_chip_die.webm
Resolving upload.wikimedia.org (upload.wikimedia.org)... 2a02:ec80:300:ed1a::2:b, 185.15.59.240
Connecting to upload.wikimedia.org (upload.wikimedia.org)|2a02:ec80:300:ed1a::2:b|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1144169738 (1.1G) [video/webm]
Saving to: ‘How_to_de-package_and_expose_a_GPU_flip_chip_die.webm’
@AlexisJazz it's a timeout for how long the connection has no data going over it: "Each time an encode/decode event for headers or data is processed for the stream, the timer will be reset."
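(You can also check what the running envoy actually has configured via its admin interface; a sketch, assuming the admin port is the usual one on our hosts and that jq is available:)

$ curl -s http://localhost:9631/config_dump | jq '.. | .stream_idle_timeout? // empty'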
Let's add it as an optional parameter, and try and pass it through.
...but profile::tlsproxy::envoy doesn't have that configuration available as far as I can see...
In T351876#9356447, @Vgutierrez wrote: nginx didn't enforce a timeout for the whole request but just a timeout (180s) between reads from the server so that won't be enough. To mimick the behavior you need to set the response timeout to 0 and stream_idle_timeout to 180s (dunno if the latter is supported by our puppetization)
I think, per modules/tlsproxy/manifests/localssl.pp it was 180s when we were using nginx, so I'll adjust it accordingly.
I'm trying to see what the previous nginx-based timeout was, but it's code I'm unfamiliar with
Nov 23 2023
I think this is now done - ms clusters default to using envoy (I've not done anything to beta, but it should carry on using nginx just fine).
Nov 22 2023
(perhaps the moss-fe2001 puppet failures are due to T350809)
I can try if/when I get another one that fails (I'd be surprised if that were the solution, given "enough reboots" seems to have worked with the troublesome nodes I've had so far...)
I hit this problem when re-imaging ms-fe* nodes (for T317616). Most of them PXE booted fine, but two didn't - ms-fe2014.codfw.wmnet needed one further reboot (which I did from the HTML console) before it would PXE, and ms-fe1013.eqiad.wmnet needed two further reboots - i.e. it wedged twice at the same point before finally PXEing properly.
Nov 21 2023
@jbond thanks, that CR has fixed the sad services (and the openssl runes now work too).
Nov 20 2023
[it was suggested I add jbond to this task]
(priority set to high as we do use the swift-dispersion-stats to check for cluster health)
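(For anyone unfamiliar: the dispersion stats come from swift's own tooling, along these lines, run on a frontend host; it reads /etc/swift/dispersion.conf by default:)

$ sudo swift-dispersion-report   # reports the percentage of expected container/object replicas found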
Nov 17 2023
Hi. I think:
hostnames: ms-be20[74-80]
racking: not more than 1 per rack, please, though they can share with existing nodes (e.g. you could put them in the racks the old systems are coming out of, which I think are A2, A7, B2, B7, C2, C7, D7)
networking: 10G private VLAN, like existing ms-be* nodes
Partitioning/RAID: JBOD, please; unlike previous ms-be* nodes, we now want everything non-RAID (cf. T308677)
Hi @RobH. I think:
hostnames: ms-be1076-1082
racking: no more than 1 server per rack, please (but they can go in racks that already have other ms-be* nodes in, e.g. the ones that the old systems are coming out of, which are A2, A4, B2, C2, D2)
networking setup: 10G private VLAN like existing ms-be* nodes
Partitioning/RAID: JBOD, please; unlike older ms-be* nodes, we now want everything non-RAID (cf. T308677)
OS Distro: Bullseye
ssh pubkey confirmed OOB; this just needs group approval.
@thcipriani you're the approver for the deployment group, can you approve (or otherwise) this request, please?