
Tegola pods are crashing because swift doesn't allow connections
Closed, Resolved · Public

Description

From eqiad k8s:

2022-04-19 09:15:02 [INFO] providers.go:82: registering provider(type): osm (mvt_postgis)
Error: could not register cache: cache: error setting to (s3) cache: ServiceUnavailable: Please reduce your request rate.
	status code: 503, request id: txc9d9ff821d9a4bffac1ab-00625e7d9e, host id: txc9d9ff821d9a4bffac1ab-00625e7d9e
Usage:
  tegola serve [flags]

Aliases:
  serve, server

Flags:
  -h, --help          help for serve
  -n, --no-cache      turn off the cache
  -p, --port string   port to bind tile server to (default ":8080")

Global Flags:
      --config string   path to config file (default "config.toml")

could not register cache: cache: error setting to (s3) cache: ServiceUnavailable: Please reduce your request rate.
	status code: 503, request id: txc9d9ff821d9a4bffac1ab-00625e7d9e, host id: txc9d9ff821d9a4bffac1ab-00625e7d9e

Event Timeline

Jgiannelos added subscribers: hnowlan, MSantos.

It looks like swift is rate limiting tegola, which causes pods to fail.

Change 784230 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] Temporarily disable tile pregeneration on eqiad

https://gerrit.wikimedia.org/r/784230

Change 784231 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] tegola: Temporarily disable tile pregeneration

https://gerrit.wikimedia.org/r/784231

Change 784230 abandoned by Jgiannelos:

[operations/deployment-charts@master] tegola: Temporarily disable tile pregeneration

Reason:

https://gerrit.wikimedia.org/r/784230

Change 784231 merged by jenkins-bot:

[operations/deployment-charts@master] tegola: Temporarily disable tile pregeneration

https://gerrit.wikimedia.org/r/784231

The error we're seeing:

Error: could not register cache: cache: error setting to (s3) cache: ServiceUnavailable: Please reduce your request rate.
	status code: 503, request id: REQ_ID, host id: HOST_ID
could not register cache: cache: error setting to (s3) cache: ServiceUnavailable: Please reduce your request rate.
	status code: 503, request id: REQ_ID, host id: HOST_ID

Possibly related to some swift issues; still under investigation.

Here is the output when running a manual stat command to swift:

> swift -A $ST_AUTH -U $ST_USER -K $ST_KEY stat tegola-swift-container
Container HEAD failed: https://thanos-swift.discovery.wmnet/v1/AUTH_tegola/tegola-swift-container 503 Service Unavailable
Failed Transaction ID: txc2d1d8cdaf3f40fe84353-00625e8a73

It looks like the rate-limiting error log is probably misleading.

A side note for which we should create a follow-up action: these logs are registered as notices rather than errors, which makes them harder to debug.

The current/immediate plan of action is:

  • disable pregen/caching of tiles on swift
  • temporarily reduce fallocate_reserve in thanos-swift to allow cleanup/deletion of old tegola objects (see the sketch after this list)
  • if the above fails or is impractical, switch to a different bucket/container and cache/pregen tiles there
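
For reference, a minimal sketch of the cleanup step using the swift CLI's prefix-scoped delete, assuming the same $ST_AUTH/$ST_USER/$ST_KEY credentials as the stat command above:

> swift -A $ST_AUTH -U $ST_USER -K $ST_KEY delete tegola-swift-container --prefix tegola-cache/osm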

Change 784246 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] tegola: increase replicas

https://gerrit.wikimedia.org/r/784246

Change 784250 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: add disable_fallocate config option

https://gerrit.wikimedia.org/r/784250

Change 784254 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] tegola: increase memory usage

https://gerrit.wikimedia.org/r/784254

Change 784254 merged by jenkins-bot:

[operations/deployment-charts@master] tegola: increase memory usage

https://gerrit.wikimedia.org/r/784254

Change 784257 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] tegola: increase memory limit further

https://gerrit.wikimedia.org/r/784257

Change 784250 merged by Filippo Giunchedi:

[operations/puppet@production] swift: add disable_fallocate config option

https://gerrit.wikimedia.org/r/784250

Change 784258 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: temp disable fallocate for thanos-swift

https://gerrit.wikimedia.org/r/784258

Change 784257 merged by jenkins-bot:

[operations/deployment-charts@master] tegola: increase memory limit further

https://gerrit.wikimedia.org/r/784257

Change 784258 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: temp disable fallocate for thanos-swift

https://gerrit.wikimedia.org/r/784258

Mentioned in SAL (#wikimedia-operations) [2022-04-19T14:06:41Z] <godog> start deleting tegola-cache/osm prefix from tegola-swift-container - T306424

I have temporarily disabled fallocate in thanos-swift with https://gerrit.wikimedia.org/r/c/operations/puppet/+/784258, which means we can clean up tegola-swift-container now.
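
For reference, a sketch of what that amounts to in swift's object-server configuration (the exact puppet wiring is in the linked change; the section placement here is an assumption):

  [DEFAULT]
  # Skip preallocation and its free-space reserve check, so small cleanup
  # writes such as delete tombstones succeed on nearly-full disks.
  disable_fallocate = true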

There are a few follow-ups and questions I'd like to see/discuss regarding how tegola uses swift:

  1. Tegola should be able to create its own containers/buckets (e.g. at startup, if the container is missing, without involvement from SRE). Permissions-wise this is already possible with tegola's credentials.
  2. Using a single (unsharded) container is not sustainable. There are a few fixes for this, including sharding at the tegola level (at a very basic level, one container per eqiad/codfw, but see below) and sharding at the swift level (i.e. transparent to the application, though container sharding isn't deployed/active/tested yet in our swift deployment).
  3. There were distinct prefixes for eqiad/codfw within the same container, but thanos-swift is shared between eqiad and codfw, leading to an effective duplication of the tiles. In other words, at least when using thanos-swift, we should use the same prefix and bucket/container regardless of whether tegola runs in codfw or eqiad (if my understanding is correct; please let me know if not!)

What do you think?

  1. Good to know, I thought we needed manual intervention to create new containers. I have to check the s3 API behaviour to see if the current codebase handles it gracefully.
  2. It's fairly straightforward to do some basic sharding at the region level (one container per region), but based on your third bullet we might not even need it. We have a staging deployment where we can test the swift-level sharding in case we need a test env.
  3. The way we use prefixes is a bit misleading. I understand that thanos is shared between codfw/eqiad and that we could potentially deduplicate the storage. The idea behind the prefixes is that up until now we had only one active environment at a time, and we used the non-active one for failovers, schema changes, etc. Maybe moving forward it's worth deduplicating the storage used in both codfw/eqiad and just using some sort of naming pattern in containers in order to apply schema changes. I guess using different basepaths (instead of containers) was partially based on the convenience that we already had access to one container, but it's good to know that we can automatically create new ones, as you mentioned.

Regarding next steps:

Currently we have an interim swift container, just to help a bit with caching in our current state.
From what I understand, you are going to clean up the tegola-cache/osm prefix and hope that the object DB will be small enough that we can bring the service back to its previous state. Is this right?
Should we prepare a fallback scenario in case this doesn't work?

  1. Good to know, I thought we needed manual intervention to create new containers. I have to check the s3 API behaviour to see if the current codebase handles it gracefully.

Great! Yes, the application self-managing its buckets/containers is the preferred mode.
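
As a hedged sketch of checking that from the S3 side (the endpoint flags and s3cmd configuration here are assumptions, and the bucket name is hypothetical), one can try creating a bucket through the same S3 API tegola uses:

> s3cmd --host=thanos-swift.discovery.wmnet --host-bucket=thanos-swift.discovery.wmnet mb s3://tegola-create-test

This assumes tegola's S3 access/secret keys are configured in ~/.s3cfg; if it succeeds without SRE involvement, tegola only needs to issue the equivalent create-bucket call at startup and tolerate "already exists" responses.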

  2. It's fairly straightforward to do some basic sharding at the region level (one container per region), but based on your third bullet we might not even need it. We have a staging deployment where we can test the swift-level sharding in case we need a test env.

Ack, thank you

  3. The way we use prefixes is a bit misleading. I understand that thanos is shared between codfw/eqiad and that we could potentially deduplicate the storage. The idea behind the prefixes is that up until now we had only one active environment at a time, and we used the non-active one for failovers, schema changes, etc. Maybe moving forward it's worth deduplicating the storage used in both codfw/eqiad and just using some sort of naming pattern in containers in order to apply schema changes. I guess using different basepaths (instead of containers) was partially based on the convenience that we already had access to one container, but it's good to know that we can automatically create new ones, as you mentioned.

"versioning" containers based on schema (and sharing the same container eqiad/codfw) sounds good to me, at least in the current thanos-swift configuration.

Regarding next steps:

Currently we have an interim swift container, just to help a bit with caching in our current state.
From what I understand, you are going to clean up the tegola-cache/osm prefix and hope that the object DB will be small enough that we can bring the service back to its previous state. Is this right?

That's correct, although deleting tegola-cache/osm "just" created tombstones AFAICS, so the container database didn't effectively shrink. Plus, updates to the objects are backfilling in the background anyway. I think we're best served by abandoning the container altogether; once we're ready, I'll delete it.

Should we prepare a fallback scenario in case this doesn't work?

I think we should; how about going with the versioned container schema from above while we're at it?

In other words, something like eqiad/codfw using tegola-prod-v0.0.1 as the container and shared-cache as the basepath (for example); for staging, tegola-staging-v0.0.1 as the container (if that makes sense) with the same basepath. If that works I can create the containers right away, e.g. as sketched below.
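
Concretely, with the same credential variables as the stat command earlier (container names as proposed above; treat this as a sketch — swift post simply creates the container if it doesn't already exist):

> swift -A $ST_AUTH -U $ST_USER -K $ST_KEY post tegola-prod-v0.0.1
> swift -A $ST_AUTH -U $ST_USER -K $ST_KEY post tegola-staging-v0.0.1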

The problem with starting a new container from scratch is that we rely on the pregenerated data to serve map tiles, so given the current status we would have an unhealthy service for days until we pregenerate a good amount of the planet.

Is it an option to bootstrap a new container from backups?

Is it an option to bootstrap a new container from backups?

There are no backups for this data; however, what we can do is copy files from the last-known prefix in the old container to a new prefix in a new container. I don't know offhand of a tool that can do that, but I'm sure it is possible.

If I've understood correctly, and this is a question of "contents of S3 bucket A into S3 bucket B", then rclone (available in Debian) ought to be able to do so.

If I've understood correctly, and this is a question of "contents of S3 bucket A into S3 bucket B", then rclone (available in Debian) ought to be able to do so.

Essentially yes, restricting the source selection to a given prefix. And ideally stripping that prefix and replacing it with another one, though I believe we can live with the old/outdated prefix (the basepath in tegola's config) for now.
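
A hedged sketch of such a copy with rclone (the remote name thanos and NEW_CONTAINER are placeholders; the remote would be a swift- or s3-type remote configured for thanos-swift). The prefix restriction falls out of the path arguments, and a prefix rename would simply be a different destination path:

> rclone copy thanos:tegola-swift-container/eqiad-v0.0.1 thanos:NEW_CONTAINER/eqiad-v0.0.1 --progress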

I think a good way forward could be to try to bootstrap the new container (shared between the codfw/eqiad tegola deployments) with the last known state from the eqiad-v0.0.1 prefix, which is the latest known good state for the maps stack, and then point tegola to the new container.
@MSantos @hnowlan thoughts?

Change 784651 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] tegola: bump memory and CPU limit

https://gerrit.wikimedia.org/r/784651

Change 784651 merged by jenkins-bot:

[operations/deployment-charts@master] tegola: bump memory and CPU limit

https://gerrit.wikimedia.org/r/784651

I think a good way forward could be to try to bootstrap the new container (shared between the codfw/eqiad tegola deployments) with the last known state from the eqiad-v0.0.1 prefix, which is the latest known good state for the maps stack, and then point tegola to the new container.
@MSantos @hnowlan thoughts?

Sounds good to me.

Change 784656 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/puppet@production] maps: Disable replication and make postgres config on codfw/eqiad identical

https://gerrit.wikimedia.org/r/784656

Change 784656 merged by Hnowlan:

[operations/puppet@production] maps: Disable replication and make postgres config on codfw/eqiad identical

https://gerrit.wikimedia.org/r/784656

Change 784660 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] tegola: increase number of replicas

https://gerrit.wikimedia.org/r/784660

I think a good way forward could be to try to bootstrap the new container (shared between the codfw/eqiad tegola deployments) with the last known state from the eqiad-v0.0.1 prefix, which is the latest known good state for the maps stack, and then point tegola to the new container.
@MSantos @hnowlan thoughts?

Sounds good to me.

SGTM too. Re: the prefix stripping, I think we can skip that for now; in other words, we'll be copying tegola-swift-container/eqiad-v0.0.1/ to NEW CONTAINER/eqiad-v0.0.1, since that appears simpler tool-wise. What do you think?

Change 784660 merged by jenkins-bot:

[operations/deployment-charts@master] tegola: increase number of replicas

https://gerrit.wikimedia.org/r/784660

Sounds good

I've begun a copy of the eqiad-v0.0.1/ prefix only, from tegola-swift-container to tegola-swift-new (I'm not good with names!). There are ~57M files to copy, so this will take a while. The transfer is running as my user on netmon2001 under screen. Why netmon2001? Because using thanos-fe would mean connections aren't load-balanced (i.e. they'd go to localhost).

Once the transfer is complete we can point tegola to the new container and basepath, at least for the time being, then reconvene on the longer-term fixes for sustainability.
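
As a sketch, copy progress on the destination side can be watched with a container stat (its output includes an object count); credentials as in the earlier examples:

> swift -A $ST_AUTH -U $ST_USER -K $ST_KEY stat tegola-swift-new | grep -i objects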

@fgiunchedi is there a mitigation for the underlying object-DB issue in the new container? Or do we rely on deduplicating objects between codfw/eqiad being good enough?

@fgiunchedi is there a mitigation for the underlying object-DB issue in the new container? Or do we rely on deduplicating objects between codfw/eqiad being good enough?

One mitigation is swift's container sharding, though it isn't deployed in our production and will certainly require some thought once we're in a more stable place, including whether we want to go for it as opposed to application-level sharding like mediawiki does.

Hope that helps!

Sounds good

I've begun a copy of the eqiad-v0.0.1/ prefix only, from tegola-swift-container to tegola-swift-new (I'm not good with names!). There are ~57M files to copy, so this will take a while. The transfer is running as my user on netmon2001 under screen. Why netmon2001? Because using thanos-fe would mean connections aren't load-balanced (i.e. they'd go to localhost).

Status update: we're ~17M files in, or in other words about 1/3 of the way.

Change 784246 abandoned by Hnowlan:

[operations/deployment-charts@master] tegola: increase replicas

Reason:

https://gerrit.wikimedia.org/r/784246

Change 787435 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] tegola: Use new container for maps tiles

https://gerrit.wikimedia.org/r/787435

Change 787435 merged by jenkins-bot:

[operations/deployment-charts@master] tegola: Use new swift container for maps tiles

https://gerrit.wikimedia.org/r/787435

Jgiannelos claimed this task.

Closing this ticket since production tegola has looked stable for some time now.