Phabricator

Gradually drop all thumbnails as a one-off clean up
Open, Needs Triage, Public

Description

We store all thumbnails ever requested in swift. This has many disadvantages. T211661: Automatically clean up unused thumbnails in Swift and T360589: De-fragment thumbnail sizes in mediawiki are long-term solutions, but in the meantime we are doing a one-off deletion of all thumbnails, gradually over a period of at least several months, to free up space, reduce the size of the swift databases, and allow for a change of thumbnail sizes (T355914: Change default image thumbnail size).

Exploratory notes:
Something like this would clean up the thumbnails just fine:

swift list wikipedia-ja-local-thumb.01 | xargs -I{}  swift delete wikipedia-ja-local-thumb.01 "{}"

I ran it a couple of times on small containers and it drops thumbnails at a rate of about 2.3 thumbs per second. At that rate, it'll take 13.6 years to go through all containers, so some parallelism is needed.
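The serial pipeline above can be parallelised client-side; a hedged sketch (the container name is illustrative, and the flags assume GNU xargs plus a configured python-swiftclient):

```shell
# Sketch only: batch object names (50 per delete call) and run 8 delete
# processes in parallel. swift auth is assumed to be configured as in the
# serial example above; the container name is illustrative.
CONTAINER=wikipedia-ja-local-thumb.01
swift list "$CONTAINER" \
  | xargs -P 8 -L 50 swift delete "$CONTAINER"
```

Note that `swift delete CONTAINER obj1 obj2 ...` deletes only the named objects, so the container itself stays in place.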

Another issue is to figure out why dc switchovers cause extra load on thumbor, and to make sure these deletions don't make things worse in the next switchover.

First, we start by dropping thumbnails in the local uploads of wikis (except commons), as they are quite small in comparison.

Progress:

  • codfw
    • 0x (should be re-done after other runs)
    • 1x (should be re-done after other runs)
    • 2x (should be re-done after other runs)
    • 3x
    • 4x
    • 5x
    • 6x
    • 7x: Running
    • 8x
    • 9x
    • ax
    • bx
    • cx
    • dx
    • ex
    • fx
  • eqiad
    • 0x: Partially done: 01-04 done, 05-0f not done.
    • 1x
    • 2x
    • 3x
    • 4x
    • 5x: Running
    • 6x: Running
    • 7x
    • 8x
    • 9x
    • ax
    • bx
    • cx
    • dx
    • ex
    • fx

Event Timeline

I'm deleting all thumbnails in every container except commons right now, only on codfw, in alphabetical order, and serially. Right now it's on enwikibooks (container 1/13).

My plan is to start 16 parallel cleaners for commons thumbnails, the first one cleaning up containers ending with 0 (00, 10, 20, ..., f0), the second one the containers ending with 1, and so on. That way it will take only ten months, which is slow enough not to cause any issues for thumbor, but also not so slow that it'll take 14 years to finish.
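The 16-way split could look roughly like this (a sketch, not the actual script; the backgrounding and the xargs batching are assumptions, only the container naming pattern comes from this task):

```shell
# Hypothetical launcher: one background cleaner per trailing hex digit;
# each walks the commons thumb shards ending in that digit (00, 10, ..., f0
# for digit 0, and so on through f). swift auth is assumed configured.
for d in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
  (
    for p in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
      c="wikipedia-commons-local-thumb.${p}${d}"
      swift list "$c" | xargs -L 50 swift delete "$c"
    done
  ) &
done
wait
```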

Well, it is fun. ms-fe2009 has only 15 cores, so when I started 17 parallel threads, even though they were mostly I/O bound, the host was on its knees, and I had to cut it down to just containers ending in 0 to 5 (7 threads). I wanted to run the rest on ms-fe2010 and ms-fe2011, but that didn't work well (they didn't have mw auth, and even after I added it, it still didn't run correctly, and I don't want to poke around production hosts like this too much, so I let it be). For now this is going; let's see how we can move further.

We only deploy the swift credentials to one frontend host per DC; and all the swift frontends have only 15 cores (they don't often end up CPU-bound, typically around 1/3 CPU used).

> We only deploy the swift credentials to one frontend host per DC; and all the swift frontends have only 15 cores (they don't often end up CPU-bound, typically around 1/3 CPU used).

Are creds the only difference between ms-fe2009 and the rest of the frontends? If it's just creds, we can do something about it.

> We only deploy the swift credentials to one frontend host per DC; and all the swift frontends have only 15 cores (they don't often end up CPU-bound, typically around 1/3 CPU used).
>
> Are creds the only difference between ms-fe2009 and the rest of the frontends? If it's just creds, we can do something about it.

Yes (well, and age); by policy we only copy the creds out to profile::swift::stats_reporter_host: (one per cluster), to reduce the number of hosts with plain-text credentials lying around on them...

So now 0-5 are running on ms-fe2009 (plus non-commons thumbs) and 6-a are running on ms-fe2010. I'll start b-f on ms-fe2011 tomorrow.

Now 17 different deletion scripts are ongoing: https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&var-site=codfw&viewPanel=7&from=now-24h&to=now-1m

grafik.png (812×1 px, 122 KB)

At that rate (50 thumbnails deleted every second), it'll take around a year to finish. We could probably make it faster, but we would also run out of ms-fe hosts to run the scripts from.
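A quick sanity check on the "around a year" figure (the total remaining object count isn't stated here, so this only shows what one year covers at the observed rate):

```shell
# At ~50 deletions/s, a year of continuous running clears roughly
# 1.58 billion objects:
echo $(( 50 * 86400 * 365 ))   # prints 1576800000
```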

We had a problem with codfw swift this morning, with the sort of load pattern that I'd normally expect to "just" result in swift filling a network connection up. Is it possible that the extra load from the deletion-scripts is making the frontends more vulnerable to getting overwhelmed by traffic?

I'm not saying it's impossible, but it's unlikely. The number of scripts per host is quite small (6-7) and they are mostly I/O bound, waiting for the backends to respond. On top of that, only ms-fe2009 had issues (from what I'm seeing; I might be missing something though), not ms-fe2010 or ms-fe2011, which both run scripts as well. That being said, the frontends are really tiny.

No, all frontends had problems; the entire cluster was very sad (cf. envoy on grafana), which is the sort of failure mode we've seen in the past when the frontends get overloaded and the whole thing goes into a bit of a death spiral (it shouldn't, but...).

Since this happened yesterday and has happened in the past too, maybe we should just throw a bit of hardware at it? Especially maybe some vertical expansion: 15 CPUs is borderline PC-grade hardware. I know usually it's only at 60%, but it probably needs more headroom for spikes? And I'm sure we are going to see way more spikes thanks to the AI hype.

On top of that, the memory is 32GB, which is already PC-grade and constantly runs out (including during the spike)? https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=ms-fe2009&var-datasource=thanos&var-cluster=swift&from=1732333274324&to=1732859094559

That's an interesting graph, but not what you see if you look at that node during the incident I linked to - e.g. https://grafana.wikimedia.org/goto/mNw6Ge7HR?orgId=1 there's a spike in swapins, but that's not replicated across the other nodes (and there isn't obviously vast CPU/Memory load).

[err, which is not to say we shouldn't be looking at further frontend capacity]

...though on the contrary argument (for horizontal expansion), we've also seen nodes filling their relatively modest NICs.

I came here, though, to mention T377827, since I had to do a horrible stunt VACUUM on a couple more thumbnail container dbs.

If you do a vacuum on all container dbs for wikipedia-commons-local-thumb.0x, that should save you a decent chunk already.

root@ms-fe2009:~# swift stat -v --lh wikipedia-commons-local-thumb.04
...
               Objects: 6.3M
                 Bytes: 848G
...

In comparison:

root@ms-fe2009:~# swift stat -v --lh wikipedia-commons-local-thumb.14
...
               Objects: 6.9M
                 Bytes: 941G
...

I'm not done with those deletions yet; we can wait until the containers are done.

ugh, that's eqiad. I'm only cleaning codfw now, shall I start eqiad too?

I don't think so just yet (not least because I'm a bit twitchy about the impact on the frontends); the issue arises because some DBs are way too large (2x their vacuumed size), which compounds the problem of all the thumbnail DBs being ~4G (and, I think, unevenly spread round the backends).
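Since vacuuming keeps coming up: VACUUM rewrites the sqlite file and returns freed pages to the filesystem. A throwaway demo of the size effect (the real container DBs live on the swift backend hosts; paths, sizes, and the safe procedure around live replicas are all out of scope here):

```shell
# Hedged demo on a scratch sqlite db: fill a table, delete every row
# (pages stay allocated), then VACUUM to rebuild the file.
db=$(mktemp)
sqlite3 "$db" "CREATE TABLE t(x);
  INSERT INTO t WITH RECURSIVE c(i) AS (SELECT 1 UNION ALL SELECT i+1 FROM c WHERE i<5000)
  SELECT randomblob(1000) FROM c;
  DELETE FROM t;"
before=$(wc -c < "$db")
sqlite3 "$db" "VACUUM;"        # rebuilds the file, returns free pages
after=$(wc -c < "$db")
echo "$before -> $after bytes"
rm -f "$db"
```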

> If you do a vacuum on all container dbs for wikipedia-commons-local-thumb.0x, that should save you a decent chunk already.
>
> root@ms-fe2009:~# swift stat -v --lh wikipedia-commons-local-thumb.04
> ...
>                Objects: 6.3M
>                  Bytes: 848G
> ...
>
> In comparison:
>
> root@ms-fe2009:~# swift stat -v --lh wikipedia-commons-local-thumb.14
> ...
>                Objects: 6.9M
>                  Bytes: 941G
> ...
>
> I'm not done with those deletions yet; we can wait until the containers are done.

So you've finished deleting wikipedia-commons-local-thumb.04, but it still has 6.3M objects in it??

Oh no, it's still ongoing, but 10% has been cleaned up. It'll take a while until it's fully done. (Also, I'm running the clean-up on all containers from 00 to 0f in parallel, so all of them will save some space and it will add up; that's sort of my point.)

Oh, OK, cool, sorry I misunderstood you. I think ATM I'd not want to think about vacuuming a container until the deletion in that container is done.

Sure, early Jan we can vacuum the first 16 containers.

The script is done with 0f:

root@ms-fe2009:~# swift stat -v --lh wikipedia-commons-local-thumb.0f
                   URL: http://ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-commons-local-thumb.0f
               Account: AUTH_mw
             Container: wikipedia-commons-local-thumb.0f
               Objects: 904K
                 Bytes: 80G
              Read ACL: mw:thumbor,mw:media,.r:*
             Write ACL: mw:thumbor,mw:media
               Sync To:
              Sync Key:
          Content-Type: application/json; charset=utf-8
           X-Timestamp: 1454672114.58952
         Last-Modified: Wed, 10 Apr 2019 08:17:32 GMT
         Accept-Ranges: bytes
      X-Storage-Policy: standard
                  Vary: Accept

To compare with something that hasn't been touched at all:

root@ms-fe2009:~# swift stat -v --lh wikipedia-commons-local-thumb.1f
                   URL: http://ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-commons-local-thumb.1f
               Account: AUTH_mw
             Container: wikipedia-commons-local-thumb.1f
               Objects: 6.9M
                 Bytes: 941G
              Read ACL: mw:thumbor,mw:media,.r:*
             Write ACL: mw:thumbor,mw:media
               Sync To:
              Sync Key:
          Content-Type: application/json; charset=utf-8
           X-Timestamp: 1454672114.51023
         Last-Modified: Wed, 10 Apr 2019 08:17:33 GMT
         Accept-Ranges: bytes
      X-Storage-Policy: standard
                  Vary: Accept

All 16 containers of 00 to 0f have been cleaned up. Starting 10 to 1f now.

The second batch is around 70% done now. Will probably finish in a week or so.

> The second batch is around 70% done now. Will probably finish in a week or so.

Almost done now. Probably by tomorrow.

Third wave of deletions in codfw just started (from 20 to 2f). I will start eqiad on Monday.

eqiad containers are much bigger and it'll take way more time to clean them: 24 days have passed and only roughly 30% has been removed from the 0x containers. Now that we have more frontends (are they getting traffic now?), maybe we can start 1x on eqiad at the same time as 0x (and then do two waves at the same time). It takes around two months to do each wave; if we don't do this, it'll take three years to finish eqiad.

@Ladsgroup can you let me know when one of the current batches has finished, please? Now we've done the thumbnail defrag stuff, I'd like to re-assess (for the purposes of wondering about a cache system) how big a freshly-deleted thumb container is (and how it grows).

Sure. In eqiad it's running and it'll take a while

Just apropos the sizes, I took one at random (wikipedia-commons-local-thumb.f9), and whilst eqiad is bigger, it's not a lot bigger:
eqiad: 9,285,690 objects 1,307,136,946,396 bytes
codfw: 7,434,712 objects 1,027,498,001,890 bytes

Obviously if operations scale non-linearly with container DB size, that might explain some increase in duration, but I'd be surprised if it's a lot.

I think it takes around one and a half to two months, not too much.

Looks like you've just finished the codfw 3x ones, so I looked:
wikipedia-commons-local-thumb.30 838,472 objects 96,830,126,986 bytes
wikipedia-commons-local-thumb.3f 836,193 objects 95,360,984,940 bytes

The reason I didn't ping you is that when I got to ms-fe, all screens were terminated, which might mean it was cut (and rebooted?) halfway through the deletion. I'm sure it was mostly done, if not fully done, but it's still not a good baseline to measure against. Maybe good enough?

> The reason I didn't ping you is that when I got to ms-fe, all screens were terminated, which might mean it was cut (and rebooted?) halfway through the deletion. I'm sure it was mostly done, if not fully done, but it's still not a good baseline to measure against. Maybe good enough?

Now, thinking about it, maybe I can take a random container in 3x and do another round; according to my measurements it'll take "only" 4 days and 12 hours to go through all 900K of them. Then we can be sure and measure it; it's definitely much faster than waiting for another round to finish.
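That "4 days and 12 hours" figure is consistent with the ~2.3 deletes/s measured at the start of this task:

```shell
# 900K objects at ~2.3 deletions/s, expressed in days:
awk 'BEGIN { printf "%.1f days\n", 900000 / 2.3 / 86400 }'   # prints 4.5 days
```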

root@ms-fe1009:~# swift stat --lh wikipedia-commons-local-thumb.13
               Account: AUTH_mw
             Container: wikipedia-commons-local-thumb.13
               Objects: 1.5M
                 Bytes: 214G
              Read ACL: mw:thumbor,mw:media,.r:*
             Write ACL: mw:thumbor,mw:media