
Image 429 errors for most images on private wikis
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue:

Request from [IP address] via cp1077 cp1077, Varnish XID 125896807
Upstream caches: cp1077 int
Error: 429, Too Many Requests at Sun, 11 Jun 2023 15:23:38 GMT

Possibly this is a duplicate of (or related to) either or both of these 2 tasks. However, I couldn't reproduce the exact problem at a dozen non-free content wikis, so I'll file separately just in case.

Event Timeline

Urbanecm_WMF subscribed.

I can confirm this for both officewiki and community-related private wikis I'm privy to (stewardwiki, checkuserwiki).

In some cases I think this is a manifestation of T337649 as many of the files on officewiki are PDFs and the like, but there is something else at work here. I somewhat suspect poolcounter is failing outright, or is possibly failing and unintentionally blocking internal IP addresses.

Thumb.php hides the error because these are private wikis, but ultimately this results in Too many thumbnail requests which is another poolcounter throttle even for inexpensive formats.

This has subsided as a result of T337649#8938960; however, this behaviour is a side effect of the work required in T338297.

Change 931568 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] changeprop: reduce ThumbnailRender concurrency

https://gerrit.wikimedia.org/r/931568

Change 931568 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: reduce ThumbnailRender concurrency

https://gerrit.wikimedia.org/r/931568

This is in part caused by T339863. Kubernetes hosts are being rate limited incorrectly; however, this is a symptom. The real cause here appears to be that we do not store generated results for private wikis properly. The images are generated successfully (albeit slowly), but they are then regenerated upon next viewing, as currently we get an HTTP 403 when we attempt to PUT the generated image into Swift. This issue has been present in Thumbor perhaps since support for private wikis was added.

Change 931896 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/software/thumbor-plugins@master] swift: add logging for when private connections are used

https://gerrit.wikimedia.org/r/931896

Change 931896 merged by jenkins-bot:

[operations/software/thumbor-plugins@master] swift: add logging for when private connections are used

https://gerrit.wikimedia.org/r/931896

Change 931963 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: bump chart, swift private debug

https://gerrit.wikimedia.org/r/931963

Change 931963 merged by jenkins-bot:

[operations/deployment-charts@master] thumbor: bump chart, swift private debug

https://gerrit.wikimedia.org/r/931963

This issue is exacerbated by the fact that the integration of thumbor with private swift containers is completely broken: it can't save any thumbnail it makes and has to re-thumbnail every time someone loads an image on a private wiki.

Looking at swift, the ACL is not correct:
(I first tried reading the container as mw:thumbor-private but didn't have rights, so I tried it with mw instead)

root@ms-fe1009:~# swift stat wikipedia-office-local-public
               Account: AUTH_mw
             Container: wikipedia-office-local-public
               Objects: 7019
                 Bytes: *redacted*
              Read ACL: mw:thumbor,mw:media,.r:*
             Write ACL: mw:thumbor,mw:media
               Sync To:
              Sync Key:
          Content-Type: application/json; charset=utf-8
           X-Timestamp: 1381305313.16744
         Last-Modified: Wed, 10 Apr 2019 09:42:41 GMT
         Accept-Ranges: bytes
      X-Storage-Policy: standard
                  Vary: Accept

(I know it's named "public" but these are private... https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/thumbor/values.yaml#87)
cc @MatthewVernon

swift post wikipedia-office-local-public --read-acl "mw:thumbor,mw:thumbor-private,mw:media,.r:*" --write-acl "mw:thumbor,mw:thumbor-private,mw:media"

That should fix the officewiki one, but I won't run it without a double check by Matthew.

Yeah, if it's thumbor-private trying to write to that container, we will not go to space today.

I think, though, that the write ACL needs updating on the thumb container, not -public, which is where the images are? Currently wikipedia-office-local-public is world-readable (is that right for a private wiki?), so that shouldn't be a problem, but the thumbs will go in a different container:

root@ms-fe2009:~# swift stat wikipedia-office-local-thumb
               Account: AUTH_mw
             Container: wikipedia-office-local-thumb
               Objects: *some*
                 Bytes: *some*
              Read ACL: mw:thumbor,mw:media,.r:*
             Write ACL: mw:thumbor,mw:media
               Sync To:
              Sync Key:
          Content-Type: application/json; charset=utf-8
           X-Timestamp: 1456276405.75652
         Last-Modified: Wed, 10 Apr 2019 09:42:41 GMT
         Accept-Ranges: bytes
      X-Storage-Policy: standard
                  Vary: Accept
            X-Trans-Id: tx54e0b801c5c84858bc504-0064957c8c
X-Openstack-Request-Id: tx54e0b801c5c84858bc504-0064957c8c

So some things can write to that container, as it has stuff in it. Are we sure that thumbor-private (rather than thumbor) is the correct user? If so, then your rune would be correct if applied to wikipedia-office-local-thumb, but I don't think thumbor needs rw to the image container rather than the thumb one?

[aside: how do these get set up, and is there some infra that needs updating?]

I don't think thumbor needs rw to the image container rather than the thumb one

mw also needs to write to it for upload

What about?

swift post wikipedia-office-local-public --read-acl "mw:thumbor-private,mw:media:*" --write-acl "mw:thumbor-private,mw:media"

I don't think thumbor needs rw to the image container rather than the thumb one

mw also needs to write to it for upload

Sorry, I don't understand - mw can already write to wikipedia-office-local-public?

Can and should. Reads and writes on private wikis must go through mw, as that's the only part of the infra that knows whether the requesting party is actually authorized to read or write the file.

If your question is whether that's already the case, it is. I'm just explaining why it should stay as is.

Sorry, we may be talking past each other. I think that to make thumbs on officewiki work, we would need to add mw:thumbor-private to the write acl to wikipedia-office-local-thumb (but not wikipedia-office-local-public, which it shouldn't need to write to).
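For illustration only, here is a minimal sketch of that change. As far as I know, `swift post --write-acl` replaces the whole ACL rather than appending to it, so the existing entries have to be repeated. The helper name `add_acl_entry` is hypothetical, and the snippet only prints the command (a dry run) rather than running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append an entry to a comma-separated Swift ACL
# string unless that entry is already present.
add_acl_entry() {
    local acl="$1" entry="$2"
    case ",${acl}," in
        *",${entry},"*) printf '%s\n' "$acl" ;;            # already present: unchanged
        ",,")           printf '%s\n' "$entry" ;;          # empty ACL: entry alone
        *)              printf '%s\n' "${acl},${entry}" ;; # otherwise append
    esac
}

# Write ACL as shown by `swift stat wikipedia-office-local-thumb` in this task.
current_write_acl="mw:thumbor,mw:media"
new_write_acl="$(add_acl_entry "$current_write_acl" "mw:thumbor-private")"

# Dry run: print the command instead of executing it.
echo swift post wikipedia-office-local-thumb --write-acl "\"$new_write_acl\""
```

Only the thumb container's write ACL is touched here; wikipedia-office-local-public is left as it is.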

Ah, yeah. That's good. My main worry is that it might need to do some write that's not obvious, e.g. updating the list of thumbnails of a given image in the main container, so I'd rather avoid it; but it can't get much more broken than it is today, so we can start with not giving access.

Mentioned in SAL (#wikimedia-operations) [2023-06-23T12:10:42Z] <Emperor> updating ACLs on wikipedia-office containers T340189 T338765

Let me check if thumbor can actually store them now

This comment was removed by Ladsgroup.

and thumbor private still can't write I think:

root@ms-fe1009:~# swift list wikipedia-office-local-thumb --prefix  7/7b/Abbrev-bot.png
root@ms-fe1009:~#

Maybe swift needs a flush/reboot for the ACL to actually take effect?

At least in ms-fe1009, it's not correct:

root@ms-fe1009:~# swift stat wikipedia-office-local-public
               Account: AUTH_mw
             Container: wikipedia-office-local-public
               Objects: 7019
                 Bytes: 8861846814
              Read ACL: mw:thumbor,mw:media,.r:*
             Write ACL: mw:thumbor,mw:media
               Sync To:
              Sync Key:
          Content-Type: application/json; charset=utf-8
           X-Timestamp: 1381305313.16744
         Last-Modified: Wed, 10 Apr 2019 09:42:41 GMT
         Accept-Ranges: bytes
      X-Storage-Policy: standard
                  Vary: Accept
            X-Trans-Id: txa7993d9022d04a788df4a-0064958e01
X-Openstack-Request-Id: txa7993d9022d04a788df4a-0064958e01
root@ms-fe1009:~# swift stat wikipedia-office-local-thumb
               Account: AUTH_mw
             Container: wikipedia-office-local-thumb
               Objects: 31308
                 Bytes: 2059911583
              Read ACL: mw:thumbor,mw:media,.r:*
             Write ACL: mw:thumbor,mw:media
               Sync To:
              Sync Key:
          Content-Type: application/json; charset=utf-8
           X-Timestamp: 1381945489.22890
         Last-Modified: Wed, 10 Apr 2019 09:42:41 GMT
         Accept-Ranges: bytes
      X-Storage-Policy: standard
                  Vary: Accept
            X-Trans-Id: tx50cbd4580fb842929a7eb-0064958e07
X-Openstack-Request-Id: tx50cbd4580fb842929a7eb-0064958e07

No, I have to remember that codfw and eqiad are two different clusters, and do the same thing on both. Sorry, done now.
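Since the two clusters hold their ACLs independently, a check like the one below could be run on a frontend in each of them (ms-fe1009 and ms-fe2009 are the hosts used in this task). This is a rough sketch: the function name `write_acl_has` is hypothetical, and it only parses the `Write ACL:` line in the format of the `swift stat` output pasted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `swift stat <container>` output on stdin and
# succeed only if the Write ACL line contains the given entry.
write_acl_has() {
    local entry="$1" acl
    # Keep the value after "Write ACL:" and drop any whitespace.
    acl="$(sed -n 's/^[[:space:]]*Write ACL:[[:space:]]*//p' | tr -d '[:space:]')"
    case ",${acl}," in
        *",${entry},"*) return 0 ;;
        *)              return 1 ;;
    esac
}

# Sample lines copied from the stat output in this task (before the fix).
sample='              Read ACL: mw:thumbor,mw:media,.r:*
             Write ACL: mw:thumbor,mw:media'

if printf '%s\n' "$sample" | write_acl_has "mw:thumbor-private"; then
    echo "thumbor-private can write"
else
    echo "thumbor-private missing from Write ACL"
fi
```

On a real frontend the input would presumably come from `swift stat wikipedia-office-local-thumb | write_acl_has mw:thumbor-private`, repeated once per cluster.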

\o/

root@ms-fe1009:~# swift list wikipedia-office-local-thumb --prefix  7/7b/Abbrev-bot.png
7/7b/Abbrev-bot.png/120px-Abbrev-bot.png
7/7b/Abbrev-bot.png/800px-Abbrev-bot.png

Special:NewFiles now works just fine; should we call this done?

Let me fix collab first, and then I think we can close here.

Let me fix collab first, and then I think we can close here.

Urbanecm_WMF mentioned above that he sees the same issue at "stewardwiki [and] checkuserwiki". Possibly every private-wiki needs to be fixed?

Let me fix collab first, and then I think we can close here.

Urbanecm_WMF mentioned above that he sees the same issue at "stewardwiki [and] checkuserwiki". Possibly every private-wiki needs to be fixed?

FWIW, those two wikis now appear to be working well from a thumbnailing perspective.

Let me fix collab first, and then I think we can close here.

Urbanecm_WMF mentioned above that he sees the same issue at "stewardwiki [and] checkuserwiki". Possibly every private-wiki needs to be fixed?

Yup, we are going through them

MatthewVernon claimed this task.

Right, I think I have fixed this on all private wikis.