Reported upstream as Bug #2058945.
Mon, Mar 25
Fri, Mar 22
This seems broadly sensible - what's the concrete proposal in terms of which thumb sizes will be supported/generated?
One thing that was discussed at the SRE meeting in Warsaw was looking at turnilo data (which IIRC is the last 90 days' requests) to effectively simulate a cache and ask questions about the relationship between cache size/age and hit/miss ratios and so on.
[which might be a useful KR for the forthcoming quarter]
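A minimal sketch of what such a simulation could look like, assuming the request data can be exported as an ordered list of (object name, size in bytes) pairs. The LRU eviction policy, the `simulate_lru` name and the toy trace are all illustrative assumptions rather than a description of how the CDN or swift behave, and this version only models cache size, not object age.

```python
from collections import OrderedDict


def simulate_lru(requests, capacity_bytes):
    """Replay (key, size) requests through an LRU cache and return the hit ratio."""
    cache = OrderedDict()  # key -> size, least recently used first
    used = 0
    hits = 0
    total = 0
    for key, size in requests:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
            continue
        if size > capacity_bytes:
            continue  # object can never fit, so it is never cached
        cache[key] = size
        used += size
        while used > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)  # evict least recently used
            used -= evicted_size
    return hits / total if total else 0.0


# Hypothetical trace: six requests for three 10-byte objects.
trace = [("a", 10), ("b", 10), ("a", 10), ("c", 10), ("b", 10), ("a", 10)]
for capacity in (10, 20, 30):
    print(capacity, simulate_lru(trace, capacity))
```

Replaying the same trace at several capacities gives one point per capacity on a size-versus-hit-ratio curve, which is the kind of question described above.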
Thu, Mar 21
(if you try and build any other image, you still get the same error, which led me to assume my docker-pkg setup was faulty somehow - again, in hindsight, the error message does say it's unhappy with the ceph image, but the naïve user will observe the same bad request for localhost/v1.41/images/docker-registry.wikimedia.org/ and get led down the garden path.)
Wed, Mar 20
Noting here for future reference - we found that thumbor was incorrectly using the global discovery record for swift, which meant that codfw-thumbor was trying to talk to eqiad-swift after codfw-swift was depooled, resulting in a rise in TempAuth errors (and 401s):
Yes, we don't replicate thumbnails between DCs any more (and this has been the case since July 2022 cf. T313102)
Thu, Mar 7
Additionally, we are retiring the last 9 12x4T nodes from eqiad and the last 6 12x4T nodes from codfw and replacing them with 24x8T units.
Tue, Mar 5
The issue is that the path is stored as varbinary(255) and path length is checked at upload to not exceed that. But then archiving adds archive and a date string to the start of the path, resulting in truncation.
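The arithmetic can be illustrated with a short sketch. It assumes the archived name is formed as a 14-digit timestamp, a `!`, and the original name (modelled on the object names in this thread), and that the column silently keeps only the first 255 bytes; the function names and the sample file name are hypothetical.

```python
LIMIT = 255  # bytes, matching the varbinary(255) column described above


def fits(value: str) -> bool:
    """Length check as applied at upload time, measured in UTF-8 bytes."""
    return len(value.encode("utf-8")) <= LIMIT


def archive_name(timestamp: str, name: str) -> str:
    """Assumed archive naming: timestamp, '!', then the original name."""
    return f"{timestamp}!{name}"


def stored(value: str) -> bytes:
    """What a 255-byte column would keep if it truncates silently (an assumption)."""
    return value.encode("utf-8")[:LIMIT]


# Hypothetical name: 120 Cyrillic letters (2 bytes each) plus ".pdf" = 244 bytes.
name = "Д" * 120 + ".pdf"
archived = archive_name("20231203130229", name)

print(fits(name))                     # passes the upload check
print(len(archived.encode("utf-8")))  # 15 extra bytes push it over the limit
print(stored(archived)[-4:])          # the ".pdf" suffix no longer survives
```

Non-ASCII names reach the limit much sooner than their character count suggests, because the check is on bytes rather than characters.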
As does the second:
root@ms-fe1009:~# swift stat wikipedia-commons-local-public.1e 'archive/1/1e/20231203130229!ДАЖО_127-1-68.1897._Геодезичний_опис_ділянки_землі_вічного_чиншовика_Антона_Станіслава_Гарбовських_села_Рудня-Старики_Овруцького_повіту.pdf'
Account: AUTH_mw
Container: wikipedia-commons-local-public.1e
Object: archive/1/1e/20231203130229!ДАЖО_127-1-68.1897._Геодезичний_опис_ділянки_землі_вічного_чиншовика_Антона_Станіслава_Гарбовських_села_Рудня-Старики_Овруцького_повіту.pdf
Content Type: application/pdf
Content Length: 23751233
Last Modified: Sat, 09 Dec 2023 03:08:11 GMT
ETag: bf7ae1c816785fe887ad2846e13d8e11
Meta Sha1Base36: a9bue5nc4oj88z3bf65tbh339kjh4un
X-Timestamp: 1702091290.63527
Accept-Ranges: bytes
X-Trans-Id: tx6abde957d45f4e978361f-0065e72d4d
X-Openstack-Request-Id: tx6abde957d45f4e978361f-0065e72d4d
The first exists:
root@ms-fe1009:~# swift stat wikipedia-commons-local-public.16 'archive/1/16/20240116211741!Алфавітно-предметний_покажчик_за_1938_рік_до_Збірника_постанов_і_розпоряджень_Уряду_Української_Радянської_Соціалістичної_Республіки.pdf'
Account: AUTH_mw
Container: wikipedia-commons-local-public.16
Object: archive/1/16/20240116211741!Алфавітно-предметний_покажчик_за_1938_рік_до_Збірника_постанов_і_розпоряджень_Уряду_Української_Радянської_Соціалістичної_Республіки.pdf
Content Type: application/pdf
Content Length: 1330605
Last Modified: Tue, 16 Jan 2024 21:18:05 GMT
ETag: ac929ceaf65d932bf2bfe683643b47de
Meta Sha1Base36: ja3vvtx04izk863x7mzzwc3wjkptjbn
X-Timestamp: 1705439884.96510
Accept-Ranges: bytes
X-Trans-Id: tx8e6e322995b04547a920c-0065e72bf1
X-Openstack-Request-Id: tx8e6e322995b04547a920c-0065e72bf1
Mon, Mar 4
Thu, Feb 29
thanos and ms swift clusters OK post-move, thank you!
Feb 26 2024
If you do decide you might want to reprovision these nodes as non-RAID, there is a sre.swift.convert-disks cookbook that does most of the heavy lifting (though you'd probably need to relax the host restriction a bit).
After the reboot, you could still have made the new virtual drive with the last of those lines:
megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0
Feb 25 2024
Feb 23 2024
In T200820#9571042, @Bawolff wrote: Longer term, using swift large object support might be a better way to handle these files, since they are already chunked.
Feb 22 2024
Swift is back OK, thanks.
Here's the relevant logs, sorted by time:
moss-fe2001.codfw.wmnet: Feb 21 20:19:14 moss-fe2001 proxy-server: 10.192.48.105 10.192.32.51 21/Feb/2024/20/19/14 PUT /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 201 - wikimedia/multi-http-client%20v1.1 AUTH_tk7b6ed208d... - - - tx948576b6a72e4600a62d1-0065d65ac1 - 0.9693 - - 1708546753.577275038 1708546754.546539783 -
ms-fe1009.eqiad.wmnet: Feb 21 20:19:16 ms-fe1009 proxy-server: 10.192.48.105 10.64.0.166 21/Feb/2024/20/19/16 PUT /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 201 - wikimedia/multi-http-client%20v1.1 AUTH_tke5beae87e... - - - tx4683f2a851d249ea89bf1-0065d65ac2 - 1.5437 - - 1708546754.563668013 1708546756.107330084 -
ms-fe2013.codfw.wmnet: Feb 21 20:59:24 ms-fe2013 proxy-server: 10.194.152.76 10.192.0.87 21/Feb/2024/20/59/24 GET /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 200 - wikimedia/multi-http-client%20v1.1 AUTH_tk7b6ed208d... - 97498542 - tx8b68cb8506d34d1a9bb03-0065d66420 - 12.5442 - - 1708549152.229032040 1708549164.773195744 0
ms-fe2013.codfw.wmnet: Feb 21 21:00:02 ms-fe2013 proxy-server: 10.194.152.76 10.192.0.87 21/Feb/2024/21/00/02 DELETE /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 204 - wikimedia/multi-http-client%20v1.1 AUTH_tk7b6ed208d... - - - txfabad3d0f009413ea82dd-0065d66452 - 0.0458 - - 1708549202.850069761 1708549202.895848513 0
ms-fe1009.eqiad.wmnet: Feb 21 21:00:03 ms-fe1009 proxy-server: 10.194.152.76 10.64.0.166 21/Feb/2024/21/00/03 DELETE /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 204 - wikimedia/multi-http-client%20v1.1 AUTH_tke5beae87e... - - - tx26b0b6c03f1b4577a65b7-0065d66453 - 0.0454 - - 1708549203.070249796 1708549203.115652323 0
ms-fe2013.codfw.wmnet: Feb 21 21:03:12 ms-fe2013 proxy-server: 10.194.155.232 10.192.0.87 21/Feb/2024/21/03/12 GET /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 404 - wikimedia/multi-http-client%20v1.1 AUTH_tk7b6ed208d... - 70 - tx36a819fdb39c4d61bb4ee-0065d66510 - 0.0335 - - 1708549392.151752949 1708549392.185259581 0
ms-fe2013.codfw.wmnet: Feb 21 21:03:12 ms-fe2013 proxy-server: 10.194.155.232 10.192.0.87 21/Feb/2024/21/03/12 GET /v1/AUTH_mw/wikipedia-commons-local-temp.b8/b/b8/1aqaxe3dattw.u1suhi.6080484.webm.0 HTTP/1.0 404 - wikimedia/multi-http-client%20v1.1 AUTH_tk7b6ed208d... - 70 - txfffc016f7af4438eac05b-0065d66510 - 0.0284 - - 1708549392.332885504 1708549392.361268997 0
@jcrespo can you try now, please?
Feb 21 2024
Apropos theory 3, we do run the swift object expirer, but the relevant headers are not set (except for some specific use cases e.g. phonos). So I don't think it can be that.
I'd be surprised (and unhappy!) were swift randomly losing objects. If you have object names (ideally plus timestamps) from a recent example I could go grobbling in the logs to check.
I think the proposed table should look like this?
Feb 20 2024
ms and thanos swift both OK post-move.
Feb 16 2024
Feb 14 2024
Great, thanks, I can confirm that swift is happy with that node.
Yes, please go ahead whenever is convenient (if you can let me know when done I can check the node is still happy).
Thanks, this is definitely a step in the right direction :)
Feb 13 2024
Thanks for your comment.
Swift looks happy, thanks :)
Feb 12 2024
Feb 10 2024
Looking at this briefly (it's Saturday and the moment has passed), the request rate goes up somewhat (so looks unusual, but not at the level I would expect to cause an issue), but both frontend and backend network utilisation is significantly elevated, which makes me wonder if this was a lot of hits on an original rather than a thumb or similar.
Feb 9 2024
Swift uses IP(v4) address (and then device name) as the identifier for entries in its rings.
Feb 8 2024
In T355544#9525282, @ssingh wrote: moss-be* hosts should be @MatthewVernon unless I am mistaken, in which case, please accept my apologies in advance :)
Feb 7 2024
Turn it on at 15:55 UTC? </only-half-joking>
We had a repeat at almost exactly the same time today, only this time neither node recovered and both needed power-cycling.
VO incident 4427.
You're good to go re swift and thanos now.
Feb 6 2024
swift backends look happy, thanks :)
Feb 1 2024
[it's not immediately obvious to me what the extra work of cfssl gets us over sslcert]
Jan 31 2024
I think the main issue is likely that we'll melt Thumbor if we just switch enwiki to 250, because 250 isn't a pregenerated size, and last time someone looked (T211661#8377883) only about 2% of requests were for that size. So I would assume that for the vast majority of images on enwiki we don't currently have a 250 thumb.
Jan 30 2024
Thumbs that are being used get cached in the CDN in any case.
I'll want to check the backends once this work is complete, but it shouldn't be an issue.
The affected thanos frontend will need depooling.
Similarly, swift in codfw will need depooling.
Once complete I'll want to check the backends, but this shouldn't be an issue.
Once complete, I'll want to check the ms-be nodes are all happy (shouldn't be an issue).
swift will need depooling in codfw before this work.
Likewise the affected thanos-fe node.
[I'll want to check afterwards that the ms-be nodes are happy, but this shouldn't be an issue]
Jan 29 2024
@Jclark-ctr thank you for the quick swap, much appreciated :-)
@mfossati Of the 12,000 objects you named, I could find 11,608 in the database, and was able to download 11,596 objects.
out_of_domain.tar.bz2 is 23G, and available on stat1008 like the others.
Full error message:
Access Denied: Restricted File
You do not have permission to view this object.
Users with the "Can View" capability:
In T350020#9493534, @mfossati wrote: @MatthewVernon, please find attached the out-of-domain sample:
out_of_domain_12k.clean (496 KB)
Looking forward to it, thanks again!
That was due to an incident - T356022
Jan 26 2024
12k should be perfectly doable.
[I should say: these are all originals, because we wouldn't necessarily have thumbnails for deleted objects and couldn't straightforwardly generate them either]
Jan 25 2024
I've now done logos.tar.bz2, which is a 4.7G file; of the 11,153 objects you requested, filearchive contained 10,770 of them, and I was able to download 10,764 of those.
The sed transform is space to underscore (a standard change for object name -> database entry) and ' to '' which quotes the ' character for mysql use (in a '-quoted string); the equivalent backslash-based approach would have been s/'/\\\'/g which is uglier.
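For comparison, the same two substitutions can be sketched in Python. The `to_db_literal` name is hypothetical and this only mirrors the sed approach described above; where a database client library is available, a parameterised query would normally avoid manual quoting altogether.

```python
def to_db_literal(object_name: str) -> str:
    """Turn an object name into a single-quoted MySQL string literal.

    Mirrors the two sed substitutions: space -> underscore, then ' -> ''.
    """
    db_name = object_name.replace(" ", "_")  # object name -> database entry
    escaped = db_name.replace("'", "''")     # double each quote inside a '-quoted string
    return f"'{escaped}'"


# Hypothetical file name containing both a space and an apostrophe.
print(to_db_literal("Floyd Burroughs' cabin.tif"))
```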
I've now done books.tar.bz2; of the 2527 objects you requested, filearchive knew of 2453, and the tarball contains 2441 images.
Jan 24 2024
mvernon@stat1008:~$ ls -lsh
total 2.5G
877M -rw-r--r-- 1 root root 877M Jan 24 16:28 album_covers.tar.bz2
1.7G -rw-r--r-- 1 root root 1.7G Jan 24 17:44 screenshots.tar.bz2
Presumably we want to restrict access somewhat beyond "everything the cassandra user can do"? At which point a separate user to sudo to seems like a sensible idea unless it's a lot of hassle...
Jan 23 2024
Right, those are all too long ago to still be in the recent logs. Today's, however, I can find, and swift has done what was asked of it - that file was uploaded and subsequently deleted (before further requests got 404, which you'd expect after a successful delete).
I'm afraid I don't have the time and resources to follow discussions elsewhere, and without that information there's nothing much more I can do with this report.
That first one looks to have uploaded OK as https://commons.wikimedia.org/wiki/File:Washstand_in_the_dog_run_and_kitchen_of_Floyd_Burroughs%27_cabin._Hale_County,_Alabama,_8c52869a.tif ?