
Followups for Tegola and Swift interactions
Open, Needs Triage, Public

Description

Following up from T306424: Tegola pods are crashing because swift doesn't allow connections, we have to (possibly in subtasks; listing things here so we don't lose track, please add as needed):

  • Stop copying files from the old container to the new container
  • Delete the old (big) container for tegola and thus free up space on swift SSDs
  • Revert fallocate for thanos-swift https://gerrit.wikimedia.org/r/784250
  • Get Tegola to auto-create containers as needed
  • Investigate container sharding at the swift level

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2022-05-02T13:13:20Z] <godog> start removal of 'tegola-swift-container' and its objects - T307184

@Jgiannelos something that occurred to me while deleting swift-tegola-container (still in progress, will take a while): when tile regeneration runs, the filenames are kept the same and new versions are uploaded, correct? (if basepath/prefix are kept the same that is).

I'm asking because I think S3 object versioning might come into play (i.e. new versions are uploaded at regeneration time, old versions are never deleted) and might explain why the bucket/container database got so big. This theory should be easy to verify next time tile regeneration runs!

Yes, filenames are kept the same. On each tile pregeneration we send a PUT request for the same filename but different content. I don't know the internals of swift, but if it's configurable: we will never need to access old versions of the same object, so we can disable versioning if possible.


Thank you! My understanding is that S3 API object versioning is off by default in swift (and I can't find traces of multiple versions so far), so indeed the objects should get replaced upon a new PUT.

Mentioned in SAL (#wikimedia-operations) [2022-05-09T08:09:20Z] <godog> temp stop tegola-swift-container delete - T307184

Mentioned in SAL (#wikimedia-operations) [2022-05-24T08:22:12Z] <godog> resume deletion of 'swift-tegola-container' on thanos-fe2001 - T307184

hi @Jgiannelos, I have resumed work on this and was wondering: what's the theoretical limit of tiles per container? Assuming we're going with separate containers per "schema" (or per "version", not sure); in other words, what values can the numbers after osm/ take? I'd like to understand what orders of magnitude of objects per container we need to deal with. Thank you!


hi @Jgiannelos, what do you think of the above?

Hey @fgiunchedi, the size of the current active deployment has stabilized at ~12269804 objects for quite some time now (the last month). The theoretical upper limit can be way higher: if we assume that all planet tiles are pregenerated, the count could be as high as 1431655765 (all tiles from zoom level 0 to zoom level 15). The first estimate is more realistic; we never end up generating all planet tiles.


Thank you for the information, I am trying to gauge whether we realistically need to deploy container sharding for tegola containers. In the former case (i.e. on the order of 12M objects/container) I don't think we do; in the latter (1.4B objects/container) we obviously do.

I'm sure the reality will be somewhere in the middle; did you find out whether tegola is going to auto-create containers? (i.e. https://phabricator.wikimedia.org/T306424#7864446) Once we have that in place, plus schema-based containers, I think we'll be in a good place.

Just a quick correction on the numbers: the current production container size is ~40M objects, not ~12M (I was counting the wrong container).

Change 807567 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] tegola: Point codfw to a new swift container

https://gerrit.wikimedia.org/r/807567

Change 807567 merged by jenkins-bot:

[operations/deployment-charts@master] tegola: Point codfw to a new swift container

https://gerrit.wikimedia.org/r/807567

> Just a quick correction on the numbers: the current production container size is ~40M objects, not ~12M (I was counting the wrong container).

Thank you for those numbers! It looks to me like the object count in the current production container (tegola-swift-new) grows at a steady rate: we were at 50M a few days ago, now at 53M. Is this steady growth expected?

I think a good estimate is this graph:
https://grafana.wikimedia.org/goto/LgX0j2e7k?orgId=1

This is the rate of new tiles (per 5mins)

Also just a heads up, current production is using bucket: tegola-swift-v001 in case you want to cleanup the old containers.

Regarding auto-creating containers: the Tegola codebase doesn't auto-create new containers on start, but we can manually create them from the maps nodes. In case we need to automate it in the future we can implement this on k8s pod initialization, but I am not expecting too many changes either way. Manual should be OK for now.

> I think a good estimate is this graph:
> https://grafana.wikimedia.org/goto/LgX0j2e7k?orgId=1
>
> This is the rate of new tiles (per 5mins)

Thank you for the dashboard link! What do you mean "per 5mins" here? I'm asking because the graph linked shows a per-second rate, calculated over five minutes "slices" of time.

From the graph above it looks like the "normal" rate of cache misses (i.e. writes to object storage) on average seems to be on the order of 3-4/s (I'm looking at this graph for eqiad over the last 90d: https://grafana.wikimedia.org/goto/adfggT6nk?orgId=1). Is this rate of cache misses something you were expecting and/or can it be considered normal operation?

In other words about a million objects every 3-5 days, which I don't think is sustainable long term in terms of object count growth (as we've seen in the parent task)
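That "million objects every 3-5 days" can be sanity-checked with quick arithmetic (a sketch, assuming the ~3.5 writes/s average read off the Grafana graphs linked above):

```python
# Rough growth arithmetic, assuming an average cache-miss (write) rate
# of ~3.5 objects/s as read off the Grafana graphs linked above.
SECONDS_PER_DAY = 86_400

rate_per_s = 3.5
objects_per_day = rate_per_s * SECONDS_PER_DAY   # 302,400 objects/day
days_per_million = 1e6 / objects_per_day         # ~3.3 days per million

print(f"{objects_per_day:,.0f} objects/day; 1M objects every {days_per_million:.1f} days")
```

At that rate a container passes the ~40M mark in a matter of months, consistent with the growth observed above.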

In terms of solutions I see at least two (not mutually exclusive):

  • expire tile objects so the object count is somewhat bounded
  • deploy swift container sharding to support containers with a larger object count
  • some other solution I'm missing?

I think the former might be easier to achieve, either at the tegola level (i.e. swift is unaware of expiration dates) or at the swift/s3 level (tegola sets a ttl for the object, swift/s3 eventually deletes the object when the ttl is up).

The latter solution requires more invasive / larger scale changes to deploy container sharding to swift (and thus likely longer lead times to complete).

What do you think? (cc @MatthewVernon too)

> Also just a heads up, current production is using bucket: tegola-swift-v001 in case you want to cleanup the old containers.

Thank you, will do

> Regarding auto-creating containers: Tegola codebase doesn't autocreate new containers on start but we can manually create them from the maps nodes. In case we need to automate it in the future we can implement this on k8s pod initialization but I am not expecting too many changes either way. Manual should be OK for now.

ACK, creating the containers manually from maps nodes SGTM for now.

> Thank you for the dashboard link! What do you mean "per 5mins" here? I'm asking because the graph linked shows a per-second rate, calculated over five minutes "slices" of time.

You are right, it's per second; I misread the queries.

Do we have an upper limit where the object count is going to end up being problematic in our container?

From a tegola development point of view I think it will be complicated to implement some sort of custom sharding logic with multiple containers, especially given the priority of maps in my current team's scope.
I think it would be fairly easy to put some sort of cache control in place if the s3 client translates the cache headers to swift (that needs some time to be investigated, but we can try it on staging).

> Do we have an upper limit where the object count is going to end up being problematic in our container?

Good question! It is complicated to give an exact/hard number though, because there are multiple factors at play. The original issue was the single container database growing larger than half of the available filesystem space on the SSDs while swift was trying to pre-allocate the new database. Even without the preallocation, though, we'd eventually run out of space on the SSDs hosting the container databases (which are shared with other users of thanos-swift). There are also performance considerations with big container databases, though I think that is less of a concern in this case since the workload is read-heavy.

Anyway, I think "low tens of millions" of objects in a single container (without underlying swift sharding) is a conservative rule of thumb.

> From a tegola development point of view I think it will be complicated to implement some sort of custom sharding logic with multiple containers, especially given the priority of maps in my current team's scope.

Note the "swift sharding" solution did not entail changes to tegola, but rather the implementation of what I suggested here https://phabricator.wikimedia.org/T306424#7868672

> I think it would be fairly easy to put some sort of cache control if the s3 client translates the cache headers to swift (that needs some time to be investigated but we can try it on staging).

Agreed, I'll find out what the support for this is on the swift/s3 side (from a quick glance the official s3 api doesn't seem to have per-object expiration support)

> > From a tegola development point of view I think it will be complicated to implement some sort of custom sharding logic with multiple containers, especially given the priority of maps in my current team's scope.
>
> Note the "swift sharding" solution did not entail changes to tegola, but rather the implementation of what I suggested here https://phabricator.wikimedia.org/T306424#7868672

Sounds good, thanks for the clarification.

Regarding cache-control: here is the output of my local setup with the swift API running on http://127.0.0.1:8080

bash-4.2# aws --endpoint=http://127.0.0.1:8080 s3api create-bucket --bucket testing-swift-cache-control
{
    "Location": "/testing-swift-cache-control"
}
bash-4.2# aws --endpoint=http://127.0.0.1:8080 s3api put-object --bucket testing-swift-cache-control --key testing-no-cache-control.txt --body testing-no-cache-control.txt
{
    "ETag": "\"168306795fa6f71e0ac45c1cfdcb5351\""
}
bash-4.2# aws --endpoint=http://127.0.0.1:8080 s3api put-object --bucket testing-swift-cache-control --key testing-cache-control.txt --body testing-cache-control.txt --cache-control max-age=604800
{
    "ETag": "\"5d4d65f5583b539a8ca9013993ecc193\""
}

On the swift side:

$ swift -A http://127.0.0.1:8080/auth/v1.0 -U test:tester -K testing stat testing-swift-cache-control testing-no-cache-control.txt
               Account: AUTH_test
             Container: testing-swift-cache-control
                Object: testing-no-cache-control.txt
          Content Type: binary/octet-stream
        Content Length: 25
         Last Modified: Fri, 08 Jul 2022 10:32:43 GMT
                  ETag: 168306795fa6f71e0ac45c1cfdcb5351
           X-Timestamp: 1657276362.57306
         Accept-Ranges: bytes
            X-Trans-Id: txf5b4434d1f1b406680eb5-0062c807d8
X-Openstack-Request-Id: txf5b4434d1f1b406680eb5-0062c807d8
$ swift -A http://127.0.0.1:8080/auth/v1.0 -U test:tester -K testing stat testing-swift-cache-control testing-cache-control.txt
               Account: AUTH_test
             Container: testing-swift-cache-control
                Object: testing-cache-control.txt
          Content Type: binary/octet-stream
        Content Length: 22
         Last Modified: Fri, 08 Jul 2022 10:34:45 GMT
                  ETag: 5d4d65f5583b539a8ca9013993ecc193
         Cache-Control: max-age=604800
           X-Timestamp: 1657276484.94083
         Accept-Ranges: bytes
            X-Trans-Id: tx7b45bf724c6a4107bb19b-0062c80856
X-Openstack-Request-Id: tx7b45bf724c6a4107bb19b-0062c80856

It looks like the Cache-Control header gets passed through for the 2nd object when using the s3api compatibility layer and is reflected on the swift side.

On a second thought, this is for serving cache-control headers so not very relevant to our problem.

Overall the idea of sending additional headers is the right one @Jgiannelos: specifically for swift, x-delete-after or x-delete-at is what we want (https://docs.openstack.org/swift/rocky/api/object-expiration.html). I'll find out if talking the s3 api to swift and sending those headers will do the right thing.


I can confirm that in the o11y Pontoon/testing stack I can expire objects talking directly to the S3 API by sending the x-delete-after header.

s3cmd --add-header 'x-delete-after: 60' put <localfile> s3://bucket/file/path

The semantics of the operation are as follows:

  • GET requests for file/path will return 200 for 60s, and 404 after that.
  • The object is still present on disk and appears in file listings for the bucket, but it is not accessible by clients.
  • When swift-object-expirer (a background process) runs, the file will actually be deleted from disk and won't show up in file listings anymore.

We have to deploy swift-object-expirer to one of the thanos-be hosts, but other than that I think we're good to go.

@Jgiannelos on the Tegola side, in practice the above would mean sending an additional header with the tile's TTL in seconds. What do you think?
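A minimal sketch of the client side of that (in Python for brevity; tegola itself is Go). The helper name is hypothetical, and whether the S3 compatibility middleware forwards X-Delete-After to Swift untouched still has to be verified, e.g. on staging:

```python
def tile_put_headers(ttl_seconds: int) -> dict:
    """Extra headers to attach to a tile PUT so that Swift expires the
    object ttl_seconds after upload (per the object-expiration docs above).

    Hypothetical helper: whether the S3 compatibility layer passes
    X-Delete-After through to Swift needs verifying per deployment.
    """
    if ttl_seconds <= 0:
        raise ValueError("TTL must be a positive number of seconds")
    return {"X-Delete-After": str(ttl_seconds)}

# e.g. a one-week TTL, matching the max-age used earlier in this thread
headers = tile_put_headers(7 * 24 * 3600)
```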

Technically we can do this (although it doesn't look trivial from a quick look at the s3 go sdk). Maybe it's worth revisiting using envoy between tegola and swift, to avoid forking the tegola codebase from upstream, and pass the headers there.
The problem I see though is that the TTL is not based on last access but on file creation date, so eventually we might delete all files at once and end up with empty storage. This can happen because we need to bootstrap the storage with enough tiles, else requests will overload the DB and eventually fail.

> Technically we can do this (although it wasn't very trivial from a quick look at the s3 go sdk). Maybe it's worth revisiting using envoy between tegola and swift to avoid forking the tegola codebase from upstream and pass the headers there.

Understandable not wanting to fork tegola; might be worth reaching out to upstream and seeing if they are interested in getting such a feature merged? This way we'd have a forked tegola only until the next release that includes the feature. Failing that, I can see how envoy might set the headers instead (though that might be trickier to get right, I think, because envoy wouldn't have access to tegola's logic, see below).

> The problem I see though is that TTL is not based on last access but on file creation date so eventually we might delete all files at once and end up with an empty storage. This can happen because we need to bootstrap the storage with enough tiles else requests will overload the DB and eventually fail.

Good point re: access time; my understanding is that the header can be refreshed on access by sending the header again. This, I think, would solve the bootstrap problem too, because only the non-accessed tiles will be expired. Hope that makes sense!

> > The problem I see though is that TTL is not based on last access but on file creation date so eventually we might delete all files at once and end up with an empty storage. This can happen because we need to bootstrap the storage with enough tiles else requests will overload the DB and eventually fail.
>
> Good point re: access time, my understanding is that the header can be refreshed on access by sending the header again. This I think would solve the bootstrap problem too because only the non-accessed tiles will be expired, hope that makes sense!

From a brief check on swift I didn't find a way to send the delete header on object GET but I could be wrong. Can you verify if that's an option?

> > > The problem I see though is that TTL is not based on last access but on file creation date so eventually we might delete all files at once and end up with an empty storage. This can happen because we need to bootstrap the storage with enough tiles else requests will overload the DB and eventually fail.
>
> > Good point re: access time, my understanding is that the header can be refreshed on access by sending the header again. This I think would solve the bootstrap problem too because only the non-accessed tiles will be expired, hope that makes sense!
>
> From a brief check on swift I didn't find a way to send the delete header on object GET but I could be wrong. Can you verify if that's an option?

My apologies, I was too hasty there with my wishful thinking! x-delete-at can be refreshed by POSTing to the object, not on GET indeed.

At any rate, since the main goal of expiring objects in this case is to bound the overall number of objects in the container, I think we'd be fine with either of the following (on bootstrap/cache warmup only):

  • Not setting expiration, since my understanding is that the tiles will get refreshed anyways eventually
  • Smearing expiration time over say one/multiple days (i.e. TTL + random(86400) more or less)
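The smearing option could be as simple as (a sketch of the TTL + random(86400) idea above; function name and the one-week base TTL are illustrative):

```python
import random

def smeared_ttl(base_ttl: int, smear_window: int = 86_400) -> int:
    """Spread expiry times over up to smear_window extra seconds, so
    tiles uploaded together during bootstrap don't all expire at once."""
    return base_ttl + random.randrange(smear_window)

ttl = smeared_ttl(604_800)  # a one-week base TTL, smeared over one day
```

The resulting value would then go into the expiry header sent on upload.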

> Also just a heads up, current production is using bucket: tegola-swift-v001 in case you want to cleanup the old containers.

I see eqiad using the old name and codfw the new name, is that expected?

helmfile.d/services/tegola-vector-tiles/values-codfw.yaml:    bucket: tegola-swift-v001
helmfile.d/services/tegola-vector-tiles/values-eqiad.yaml:    bucket: tegola-swift-new

> Regarding auto-creating containers: Tegola codebase doesn't autocreate new containers on start but we can manually create them from the maps nodes. In case we need to automate it in the future we can implement this on k8s pod initialization but I am not expecting too many changes either way. Manual should be OK for now.

ack re: manually creating containers, seems workable for now

Eqiad is not serving live traffic at the moment. We need to re-import planet and switch over to the last active swift container, but currently it's depooled.


Thank you, I'll kick off the deletion of tegola-swift-new as well

Status update on this: there's a root screen on thanos-fe2001 to delete tegola-swift-new and tegola-swift-container (in a loop, because deletes can time out):

  • while sleep 1m ; do swift delete --prefix eqiad- --versions tegola-swift-container ; done
  • while sleep 1m ; do swift delete --prefix eqiad- --versions tegola-swift-new ; done

Those will eventually finish, thus unlocking the rest of this task.
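For reference, the same prefix-scoped cleanup could be scripted against the python-swiftclient Connection API instead of the CLI loop (a sketch under that assumption; FakeConn below is only an in-memory stand-in to show the call shape, not a real connection):

```python
def delete_with_prefix(conn, container: str, prefix: str) -> int:
    """Delete every object in `container` whose name starts with `prefix`.

    `conn` is anything exposing the python-swiftclient Connection calls
    get_container(container, prefix=..., marker=...) and
    delete_object(container, name); listings are paginated via `marker`.
    Returns the number of objects deleted.
    """
    deleted = 0
    marker = ""
    while True:
        _, objects = conn.get_container(container, prefix=prefix, marker=marker)
        if not objects:
            return deleted
        for obj in objects:
            conn.delete_object(container, obj["name"])
            deleted += 1
        marker = objects[-1]["name"]


# In-memory stand-in for a swiftclient Connection, only to demo the shape:
class FakeConn:
    def __init__(self, names):
        self.names = sorted(names)

    def get_container(self, container, prefix="", marker=""):
        page = [{"name": n} for n in self.names
                if n.startswith(prefix) and n > marker]
        return {}, page[:2]  # tiny page size to exercise pagination

    def delete_object(self, container, name):
        self.names.remove(name)


conn = FakeConn(["eqiad-1", "eqiad-2", "eqiad-3", "codfw-1"])
removed = delete_with_prefix(conn, "tegola-swift-new", "eqiad-")
```

A retry wrapper around the whole call (like the `while sleep 1m` loop above) would still be needed, since individual deletes can time out.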