
Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing
Closed, Resolved · Public

Description

Hi,

Event Platform needs to operationalise an Apache Flink streaming application on k8s (DSE cluster and wikikube). We need a storage solution for checkpointing state and supporting highly available application lifecycles. This storage would be accessed via the s3 protocol and will not need cross-DC replication.

In developing this application we iterated on the WDQS updater service, which currently runs on wikikube and uses Thanos S3 for checkpointing.

Currently the application is deployed in dse-k8s-eqiad, but we plan to move to wikikube. The dse-k8s-eqiad deployment is single DC, but we are going to deploy this in wikikube as active/active single compute, similar to other multi-DC services.

The application - in Flink terms - is stateless, and will only need to checkpoint per-partition Kafka offsets to handle restarts. Current experiments suggest a checkpoint size (write) of 10s of MBs, at a frequency of once every 1-3 minutes. This might change as we gain more experience operating Flink. Reads will be sporadic: they will happen at application restarts caused either by scheduled maintenance or recovery from failure. Data will not need to be stored indefinitely and will be pruned (cutoff not yet known - but we can start with strict policies).
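For illustration, a minimal sketch of how this kind of checkpointing could be wired up (assuming a recent PyFlink and the DataStream API; the interval and bucket name are placeholders, not decided values):

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Checkpoint roughly once every 1-3 minutes, per the estimate above (interval is in milliseconds).
env.enable_checkpointing(2 * 60 * 1000)

# Checkpoints would be written over the s3 protocol; the bucket name here is a placeholder.
env.get_checkpoint_config().set_checkpoint_storage_dir(
    "s3://example-checkpoint-bucket/checkpoints"
)
```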

We don't have a lot of metrics yet. If we wanted to collect actual metrics from our development target (DSE k8s), would it be possible for you to create a throwaway (non-replicated) eqiad bucket with a quota (<1 GB)?

The application currently does not have a SLO, and is not yet supporting feature or production use cases.

Scalability needs

While this request is specific to mediawiki-page-content-change-enrichment, we expect the need to support similar use cases in the future (estimated in the order of 1-10 in the next 2-4 quarters). The abstraction is multi-tenant, in the sense that each application owns a helmfile deployment and Docker image. Each application will execute in its own k8s namespace. Application deployment and lifecycle management is handled by a [flink k8s operator](https://phabricator.wikimedia.org/T324576). The operator service user is the same across application k8s namespaces.

Done is
  • mediawiki-page-content-change-enrichment in DSE can store Flink checkpoints in an object store

Event Timeline


To loop in some of the discussion from the sre-data-persistence@ email thread:

Nothing about this jumps out as unreasonable to me, but some basic questions that come to mind are things like:

  • Why did search choose to do it this way? What other choices were discussed/considered and/or evaluated (either by you or search)?
  • In what ways is your use case similar to the search one, and in what ways is it different?

Precedent can be a trap; If a prior decision would no longer hold up, or if the use-cases differ sufficiently, it's easy to be led down the wrong path. -- @Eevans

I would like to understand the need for search using Flink, the problem space and requirements. Is there a document for this? -- @Ladsgroup

My *assumption*: Flink needs file-oriented storage to checkpoint state. On Hadoop, the default solution would be HDFS. However, HDFS is not considered a production system, leaving Swift as the only option at the time for applications deployed on wikikube. -- @gmodena

Ok, this sounds like: "we needed a place to put it, and Swift was easiest/quickest." That may not change anything in the end, but if true, I'd prefer that it be reflected in the decision of record. And I'd still like to see us attempt to characterize this use-case better, even if it seems like doing so won't change our approach in the near-term.

My understanding thus far: Network accessible, file-oriented storage, in the 10s of MB range (x10 over the coming year). Primarily writes, ~20/hour (x10 over the coming year). Not currently replicated (but maybe eventually).

It is also my understanding that this differs from Search's use (at a minimum) in that they are storing bigger files, and do require replication across data-centers?

The abstraction is multi-tenant (each team/application owns a Helm chart and Docker image), but applications will be managed by the same service user. I need to validate how namespaces will be managed per application. -- @gmodena

Let me frame it this way. Is this being set up as a platform? Will disparate WMF teams implement these on a platform you are providing? If so, would checkpoint storage be exposed to those implementers, or is that something that Platform and/or DSE would manage for them? I assume if it is the former, then you'd probably want to ensure that a misconfiguration didn't allow one job to trample on the checkpoint of another, or inadvertently have one job erroneously attempting to use the checkpoint of another, etc. Do we need them separated by accounts (including different credentials)? Different containers?

Since this is an 'enrichment' job, we are going to deploy this in wikikube as 'active/active single compute', similar to other multi DC services. This is actually different from what search does, which is 'active/active double compute'.

Enrichment apps doing active/active single compute won't need any replication of checkpoints in Swift. The input here is MW page changes (edits, etc.), so generally one DC is doing more (all?) work than the other. The app only consumes from its local (main-eqiad or main-codfw) Kafka cluster, and the only real things stored in checkpoints are the Kafka topic partition offsets. -- @Ottomata

Ok, so our media storage cluster isn't replicated between DCs (per se), but it also does not run the s3 interface. The Thanos cluster does run the s3 interface, but it is replicated across DCs. I assume replicated will be OK, even if it isn't a requirement (presumably we'll need DC-specific containers, and each DC can ignore the replicated checkpoints of the other).

And this kind of highlights a problem; Each of these clusters was put up for a specific purpose, and these sorts of use-cases are an afterthought. This is why I believe it's important that we take these opportunities to properly work through the engineering, so that we can evolve our infrastructure to satisfy requirements like this, instead of shoehorning them where we can.

Hi,

Regarding the use of Flink for streaming use cases, our tech evaluation can be found at https://www.mediawiki.org/wiki/Platform_Engineering_Team/Event_Platform_Value_Stream/Stream_Processing_Framework_Evaluation.
Search evaluation (which we reference) can be found at https://docs.google.com/document/d/1NWYnbvktbMxdsztOd6h_aGUDMWdPEf8QMbor3ATd0Wo/edit cc / @dcausse

Ok, this sounds like: "we needed a place to put it, and Swift was easiest/quickest."

Note that we are not looking for a shortcut, but for a sustainable path to production. My understanding is that our deployment environment leaves Swift / Thanos as the only options for file-oriented storage accessible from k8s. Are there other options we should look into?

That may not change anything in the end, but if true, I'd prefer that it be reflected in the decision of record. And I'd still like to see us attempt to characterize this use-case better, even if it seems like doing so won't change our approach in the near-term.

My understanding thus far: Network accessible, file-oriented storage, in the 10s of MB range (x10 over the coming year). Primarily writes, ~20/hour (x10 over the coming year). Not currently replicated (but maybe eventually).

Sounds right, based on what we have been able to measure so far. We would also need an s3 interface, which is the protocol flink connectors speak natively. In the email thread you mentioned, @MatthewVernon suggested targeting Thanos (as Search did).
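For reference, these are the kinds of knobs we'd expect to set on the Flink side to point its s3:// filesystem at a Swift/Thanos endpoint (a sketch only: the endpoint hostname is an assumption, and these keys would live in flink-conf.yaml / chart values rather than in application code):

```python
# Sketch, not a working deployment config; all values are placeholders.
s3_filesystem_options = {
    # S3-compatible endpoint exposed by Swift/Thanos (assumed hostname, for illustration).
    "s3.endpoint": "https://thanos-swift.discovery.wmnet",
    # Path-style addressing is typically needed for non-AWS S3 gateways.
    "s3.path.style.access": "true",
    # Credentials would be sourced from private.git at deploy time, never hardcoded.
    "s3.access-key": "<swift-account>",
    "s3.secret-key": "<swift-secret>",
}
```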

It is also my understanding that this differs from Search's use (at a minimum) in that they are storing bigger files, and do require replication across data-centers?

Our footprint is smaller because we only store kafka offsets into a partition, and not full application state. The application itself is stateless.
cc / @dcausse for Search's replication needs.

The abstraction is multi-tenant (each team/application owns a Helm chart and Docker image), but applications will be managed by the same service user. I need to validate how namespaces will be managed per application. -- @gmodena

Let me frame it this way. Is this being set up as a platform? Will disparate WMF teams implement these on a platform you are providing? If so, would checkpoint storage be exposed to those implementers, or is that something that Platform and/or DSE would manage for them? I assume if it is the former, then you'd probably want to ensure that a misconfiguration didn't allow one job to trample on the checkpoint of another, or inadvertently have one job erroneously attempting to use the checkpoint of another, etc. Do we need them separated by accounts (including different credentials)? Different containers?

While we don't really have an SLO, I think your assumption is correct. Separation by accounts / credentials would be good to have (fwiw, in k8s each application will run within a dedicated namespace). What do you mean by containers in this context?

Since this is an 'enrichment' job, we are going to deploy this in wikikube as 'active/active single compute', similar to other multi DC services. This is actually different from what search does, which is 'active/active double compute'.

Enrichment apps doing active/active single compute won't need any replication of checkpoints in Swift. The input here is MW page changes (edits, etc.), so generally one DC is doing more (all?) work than the other. The app only consumes from its local (main-eqiad or main-codfw) Kafka cluster, and the only real things stored in checkpoints are the Kafka topic partition offsets. -- @Ottomata

Ok, so our media storage cluster isn't replicated between DCs (per se), but it also does not run the s3 interface. The Thanos cluster does run the s3 interface, but it is replicated across DCs. I assume replicated will be OK, even if it isn't a requirement (presumably we'll need DC-specific containers, and each DC can ignore the replicated checkpoints of the other).

We'll need s3, and replication should be ok. As you point out, we'd need DC-specific storage (we won't be able to use codfw Kafka offsets to restart in eqiad, and vice versa).
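To make the DC-specific part concrete, a purely illustrative helper (container names are hypothetical placeholders) showing one checkpoint location per DC, so an eqiad restart can never pick up codfw offsets:

```python
def checkpoint_dir(datacenter: str) -> str:
    # Hypothetical per-DC containers; one per deployment so offsets are never shared across DCs.
    if datacenter not in ("eqiad", "codfw"):
        raise ValueError(f"unknown datacenter: {datacenter}")
    return f"s3://example-enrichment-checkpoints-{datacenter}/checkpoints"


print(checkpoint_dir("eqiad"))  # s3://example-enrichment-checkpoints-eqiad/checkpoints
```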

And this kind of highlights a problem; Each of these clusters was put up for a specific purpose, and these sorts of use-cases are an afterthought. This is why I believe it's important that we take these opportunities to properly work through the engineering, so that we can evolve our infrastructure to satisfy requirements like this, instead of shoehorning them where we can.

How would you suggest we move forward?

This is a k8s application running on the WMF OpenStack, yes?

Might it be appropriate to use a persistent volume claim, backed by the OpenStack storage itself for this? At my last job that's the sort of solution we'd have looked at for this kind of workflow. That would give you small amounts of fast storage local to your compute environment.

Ok, this sounds like: "we needed a place to put it, and Swift was easiest/quickest."

Note that we are not looking for a shortcut, but for a sustainable path to production.

I know; I didn't mean for this to come across as an indictment (apologies if it seemed that way).

My understanding is that our deployment environment leaves Swift / Thanos as the only options for file-oriented storage accessible from k8s.

It sounds like there are many things that could work for this, but —as you say— Swift was chosen for Search because it was the only one of them readily available. On the basis of that decision (precedent) we're prepared to do it again (and again, and again). That doesn't by any means make it a bad decision (and I'd be loath to proliferate new systems unnecessarily), but it's why I'd like to make sure we understand the concerns.

It's also possible that I'm over-explaining my motivations here (vis-a-vis discovery) and leading you to believe I plan to obstruct (I'm not). 🙂

Are there other options we should look into?

@MatthewVernon brought up persistent volume claim. I have no experience with it, but it sounds like a good fit; I'd like to explore that a bit.

[ ... ]

The abstraction is multi-tenant (each team/application owns a Helm chart and Docker image), but applications will be managed by the same service user. I need to validate how namespaces will be managed per application. -- @gmodena

Let me frame it this way. Is this being set up as a platform? Will disparate WMF teams implement these on a platform you are providing? If so, would checkpoint storage be exposed to those implementers, or is that something that Platform and/or DSE would manage for them? I assume if it is the former, then you'd probably want to ensure that a misconfiguration didn't allow one job to trample on the checkpoint of another, or inadvertently have one job erroneously attempting to use the checkpoint of another, etc. Do we need them separated by accounts (including different credentials)? Different containers?

While we don't really have an SLO, I think your assumption is correct. Separation by accounts / credentials would be good to have (fwiw, in k8s each application will run within a dedicated namespace). What do you mean by containers in this context?

Swift containers; A Swift account can have an arbitrary number of containers, a container can hold an arbitrary number of files (so it's sort of like a directory). Objects themselves can use a delimiter (/) to create hierarchical namespaces (think: directories), but —for example— you can place ACLs on a container, whereas you cannot on objects.
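To make the account / container / ACL layering concrete, a minimal python-swiftclient sketch (the auth endpoint, account, container name, and ACL grant are all hypothetical):

```python
from swiftclient.client import Connection

# Hypothetical tempauth-style credentials, for illustration only.
conn = Connection(
    authurl="https://swift.example.wmnet/auth/v1.0",
    user="example-account:example-user",
    key="<secret>",
)

# One container per application (roughly analogous to a top-level directory).
conn.put_container("example-app-checkpoints-eqiad")

# ACLs are set on the container, not on individual objects.
conn.post_container(
    "example-app-checkpoints-eqiad",
    headers={"X-Container-Read": "other-account:reader"},  # hypothetical read grant
)
```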

[ ... ]

How would you suggest we move forward?

I think we are, no?

Hi everyone and sorry to jump into this conversation, but I just wanted to add a quick note on the usefulness of cross-DC replication, as it's something that was not obvious to us when we started to use swift containers in thanos for our search jobs.
If you consider all the dependent services & platforms:

  • wikikube
  • kafka-main
  • MW mw-async* (hopefully a new mw-async-ro should be available at some point)
  • thanos*

(* for services that I think have an automatic failover via dns discovery in place)
Since Thanos is replicated, only two of these services/platforms could cause downtime for our job if they themselves go down.

@gmodena for your job, and if the plan is still to go with active/active - single compute (à la changeprop), I think it is even more important for you to have cross-DC replication at the object store level. The activity of your job will be bound to the status of eventgate-main, and this might create a strong entanglement between the object store & eventgate-main. E.g. if eventgate-main is pooled in eqiad and there's downtime on your object store in eqiad, this will cause downtime of your job without an easy way to resume it.

Re: k8s PV, I believe it might be possible to use them with Flink as long as the path remains stable over time and the volume they point to is shared by all Flink pods (like an NFS share). But I'm not sure this is an option here unless there are plans to have them in wikikube?

This is a k8s application running on the WMF OpenStack, yes?

Might it be appropriate to use a persistent volume claim, backed by the OpenStack storage itself for this? At my last job that's the sort of solution we'd have looked at for this kind of workflow. That would give you small amounts of fast storage local to your compute environment.

It's running in the production realm (owned by Service Ops SRE). We are currently targeting DSE and Wikikube.

I need to investigate persistent volumes. Thanks for the pointer.

Hi Eric,

I know; I didn't mean for this to come across as an indictment (apologies if it seemed that way).

No worries! I just wanted to be explicit that we are trying to do the right thing, and not work around process / systems.

My understanding is that our deployment environment leaves Swift / Thanos as the only options for file-oriented storage accessible from k8s.

It sounds like there are many things that could work for this, but —as you say— Swift was chosen for Search because it was the only one of them readily available. On the basis of that decision (precedent) we're prepared to do it again (and again, and again). That doesn't by any means make it a bad decision (and I'd be loath to proliferate new systems unnecessarily), but it's why I'd like to make sure we understand the concerns.

@MatthewVernon brought up persistent volume claim. I have no experience with it, but it sounds like a good fit; I'd like to explore that a bit.

I will investigate. I'm also aware of efforts on MOSS, Ceph, and making HDFS accessible to k8s (DSE), but afaik they are WIPs. That does not discount them as viable solutions once ready.

Swift containers; A Swift account can have an arbitrary number of containers, a container can hold an arbitrary number of files (so it's sort of like a directory). Objects themselves can use a delimiter (/) to create hierarchical namespaces (think: directories), but —for example— you can place ACLs on a container, whereas you cannot on objects.

Thanks for clarifying. I'll do some investigation to understand our options here.

[ ... ]

How would you suggest we move forward?

I think we are, no?

I was a bit cryptic, sorry. IIRC we discussed (in the email thread) streamlining project onboarding / cross-team requirements gathering. That seems related to the issue you highlighted.
Happy to contribute to any requirements / design doc that might help your team better understand (and capture) concerns.

[ ... ]

How would you suggest we move forward?

I think we are, no?

I was a bit cryptic, sorry. IIRC we discussed (in the email thread) streamlining project onboarding / cross-team requirements gathering. That seems related to the issue you highlighted.
Happy to contribute to any requirements / design doc that might help your team better understand (and capture) concerns.

Oh, right. I think the answer is the same though (i.e. "we are"):

Broadly, we need to establish (or review/refine as the case may be) a methodology for requests like these, and document it. That should make subsequent requests easier, because you'll have a better idea when (at what stage) to engage, and the sorts of questions that will need answering.

More specifically, I can imagine us coming away from this with some language for this use-case: Durable, file-oriented, small-to-medium-sized, low-throughput, replicated shared state for k8s or similar (though hopefully pithier, something that rolls off the tongue better 😛). We (DP) can then find better ways of accommodating it, be that a different technology, a different cluster, or just documentation and related collateral (scripts, puppet code, etc) for provisioning. And we'll be better prepared to recognize requests that match these criteria when they come in.

usefulness of cross-DC replication

After asking @dcausse, I understand now why this is useful. If the backing object store can be transparently failed over for both writes and reads, then SRE can do downtime maintenance on an object store in either DC without manual intervention. This is pretty desirable, so if we go with one of the existing clusters, Thanos would be a better fit after all.

Hi, sorry, I just came back from ooo. I want to take a step back and see what the actual problem is, what the solution is, and what the alternatives and trade-offs are. I'm not talking about thanos vs PVC. I'm more inclined to understand the Search Platform's use case and problem. Is it the WDQS update mechanism?

[...]

@MatthewVernon brought up persistent volume claim. I have no experience with it, but it sounds like a good fit; I'd like to explore that a bit.

I will investigate.

The Flink docs do suggest that their k8s HA implementation could work with persistent volumes. This assumes that PVs are enabled in k8s, which to the best of my knowledge is not the case on our clusters.
It also assumes that Flink will use k8s HA. We are leaning towards using Zookeeper instead. https://phabricator.wikimedia.org/T331283 goes into more detail, but the reason for this choice is to be able to persist snapshot/savepoint metadata across k8s restarts (ConfigMaps are dropped during updates).

Here are the notes from today's meeting.

Action items:

  • Event Platform will investigate k8s persistent volume claims a little more for the long term. Can we really not use Zookeeper if we use PV? It really does sound like PV would be a good long-term solution for internal / ephemeral application state.
  • Data Persistence to follow up about long term plan. @Eevans and Amir will talk with @KOfori and @MatthewVernon.
  • @pmiazga reach out to @Ottomata about arch notebooks & documenting temp solutions / tech debt.
  • For short term solution, we will keep corresponding on ticket here. @Ottomata + @gmodena to answer questions in this ticket.

Answering some specific questions from Eric:

Will disparate WMF teams implement these on a platform you are providing?

yes

If so, would checkpoint storage be exposed to those implementers, or is that something that Platform and/or DSE would manage for them?

Generally implementers would not directly think about checkpoint storage; it would be handled by our platform abstractions. The only case I can think of would be the need to 'rewind' or 'reset' Kafka offsets in state. We don't have practice doing this ourselves yet. I think @dcausse and others have some tools for this? For now, let's assume that implementers will not directly interact with checkpoints. As platform providers, we would need to. In the future we might build tooling to allow implementers to modify checkpoints, but that is far off.

you'd probably want to ensure that a misconfiguration didn't allow one job to trample on the checkpoint of another, or inadvertently have one job erroneously attempting to use the checkpoint of another, etc. Do we need them separated by accounts (including different credentials)? Different containers?

Good question. Each Flink app is functionally distinct, so separate containers per Flink app (job) makes sense. However, it is possible that some apps naturally group together in a deployment unit (a helmfile service and/or k8s namespace), and in that case I'd expect a single swift (or whatever) deploy user account would be used to manage the checkpoints of those apps. I believe @dcausse is working on 2 apps in the same k8s namespace now (streaming updaters for WDQS and for commons query services?).

To keep things a little simpler, I think a container / account per k8s namespace would be sufficient.


@Eevans what else would be helpful for you to know now? (aside from more info on k8s PVs)

Generally implementers would not directly think about checkpoint storage; it would be handled by our platform abstractions. The only case I can think of would be the need to 'rewind' or 'reset' Kafka offsets in state. We don't have practice doing this ourselves yet. I think @dcausse and others have some tools for this? For now, let's assume that implementers will not directly interact with checkpoints. As platform providers, we would need to. In the future we might build tooling to allow implementers to modify checkpoints, but that is far off.

We have tools to generate bootstrap states and manage Kafka offsets manually, but I would consider this very advanced, possibly dangerous, and probably out of scope for the users of the platform you're building.

To keep things a little simpler, I think a container / account per k8s namespace would be sufficient.

+1, most swift metrics that we monitor are labelled by account not container.

The Flink docs do suggest that their k8s HA implementation could work with persistent volumes. This assumes that PVs are enabled in k8s, which to the best of my knowledge is not the case on our clusters.

It also assumes that Flink will use k8s HA

Doing a little research, I'm not sure about this. I believe that the configuration of the checkpoint storage location is independent of where the HA pointer to that location is stored. I think we could use Zookeeper even if we were to use k8s PersistentVolumes for checkpoint storage.
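For what it's worth, a sketch of that separation using option names from the Flink docs (the values are placeholders, and these settings would normally live in flink-conf.yaml rather than in code):

```python
# The HA backend (ZooKeeper here) only tracks leader election and pointers/metadata;
# where checkpoint data itself lives is configured independently.
flink_conf = {
    "high-availability": "zookeeper",
    "high-availability.zookeeper.quorum": "zk-example:2181",   # hypothetical quorum
    "high-availability.storageDir": "file:///flink-ha",        # HA metadata blobs (could be a PV path)
    "state.checkpoints.dir": "file:///flink-checkpoints",      # checkpoint data, e.g. on a PV or s3://
}
```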

I asked @JMeybohm about potential PV support in wikikube. He said:

we're not going to support persistent storage in our k8s clusters in the foreseeable future, sorry. IIUC there are some experiments(?) (@BTullis maybe) but those will probably not reach wikikube

Here are relevant notes from a k8s-sig meeting.

So, it looks like PVs are not an option for us, for now.

After discussions with serviceops about use of Persistent Volume Claims, it's clear that if we go that route, it won't be on a timeline that accommodates this. We're going to move forward with a Swift account for this; I will pick up the task and try to have something ready this week.

Per a discussion with @gmodena on IRC, I'll create an account named mediawiki-event-enrichment (or AUTH_mediawiki-event-enrichment specifically).

Change 918582 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] hierdata: add swift (thanos) mw-event-enrichment account

https://gerrit.wikimedia.org/r/918582

Change 918583 had a related patch set uploaded (by Eevans; author: Eevans):

[labs/private@master] hierdata: add mw_event_enrichment swift account (thanos)

https://gerrit.wikimedia.org/r/918583

Change 918583 merged by Eevans:

[labs/private@master] hierdata: add mw_event_enrichment swift account (thanos)

https://gerrit.wikimedia.org/r/918583

Change 918582 merged by Eevans:

[operations/puppet@production] hierdata: add swift (thanos) mw-event-enrichment account

https://gerrit.wikimedia.org/r/918582

Ok, this is set up and has been tested. I created the two containers discussed as well (mediawiki-page-content-change-enrichment-{eqiad,codfw}).

One thing we didn't discuss is sourcing of credentials: is it safe to assume, since this is being deployed to wikikube, that it can be templated from private.git during deployment?

Ok, this is set up and has been tested. I created the two containers discussed as well (mediawiki-page-content-change-enrichment-{eqiad,codfw}).

Terrific, thanks for this!

One thing we didn't discuss is sourcing of credentials: is it safe to assume, since this is being deployed to wikikube, that it can be templated from private.git during deployment?

I need to review how private.git templating works. We follow deployment pipeline practices, so I'm tempted to say yes.

We might want to test things out on DSE first, but that deployment follows deployment pipeline practices too.

[ ... ]

One thing we didn't discuss is sourcing of credentials: is it safe to assume, since this is being deployed to wikikube, that it can be templated from private.git during deployment?

I need to review how private.git templating works. We follow deployment pipeline practices, so I'm tempted to say yes.

We might want to test things out on DSE first, but that deployment follows deployment pipeline practices too.

Ok, that should definitely be doable then (i.e. it's done all the time). I'm probably not the best person to ask how, but whoever is setting up the deployment charts will know what to do (/cc @hnowlan because I know he does :)).

macro-deployed

[ ... ]

One thing we didn't discuss is sourcing of credentials: is it safe to assume, since this is being deployed to wikikube, that it can be templated from private.git during deployment?

I need to review how private.git templating works. We follow deployment pipeline practices, so I'm tempted to say yes.

We might want to test things out on DSE first, but that deployment follows deployment pipeline practices too.

Ok, that should definitely be doable then (i.e. it's done all the time). I'm probably not the best person to ask how, but whoever is setting up the deployment charts will know what to do (/cc @hnowlan because I know he does :)).

Yep, there's a standard process for secrets via puppet private using role::common::deployment_server::kubernetes. Happy to help with it whenever it's needed!

Thanks @hnowlan, it took me a bit to find this, but I did, and we added it to puppet private yesterday. Work happening now in T336656: mediawiki-page-content-change-enrichment checkpoints should be stored in Swift

Change 922595 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] mw-page-content-change-enrich - use bucket names created by T330693

https://gerrit.wikimedia.org/r/922595

Change 922595 merged by Ottomata:

[operations/deployment-charts@master] mw-page-content-change-enrich - use bucket names created by T330693

https://gerrit.wikimedia.org/r/922595

Per a discussion with @gmodena on IRC, I'll create an account named mediawiki-event-enrichment (or AUTH_mediawiki-event-enrichment specifically).

For posterity's sake, this was actually created as AUTH_mw-event-enrichment. The username is: mw-event-enrichment:prod.

Sorry for any confusion.

Hopping on this thread to confirm that we are now able to store snapshots / savepoints in swift using the provided containers.

@Eevans @Ottomata could you help me rubber duck something? I created a new container in our swift account with (I assume) default policies:

$ swift stat mw_page_content_change_enrich__dse-k8s-eqiad
                      Account: AUTH_mw-event-enrichment
                    Container: mw_page_content_change_enrich__dse-k8s-eqiad
                      Objects: 0
                        Bytes: 0
                     Read ACL:
                    Write ACL:
                      Sync To:
                     Sync Key:
                 Content-Type: application/json; charset=utf-8
                  X-Timestamp: 1684875444.58154
                Last-Modified: Wed, 24 May 2023 07:53:49 GMT
                Accept-Ranges: bytes
             X-Storage-Policy: standard
                         Vary: Accept
                   X-Trans-Id: tx1c6701bdacbb4084a973c-00646ddcfe
       X-Openstack-Request-Id: tx1c6701bdacbb4084a973c-00646ddcfe
X-Envoy-Upstream-Service-Time: 27
                       Server: envoy

When I set it as the checkpoint storage location, Flink fails with the following error:

Caused by: com.facebook.presto.hive.s3.PrestoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: tx8c60ec5e0ca74f8f851de-00646dbe61; S3 Extended Request ID: tx8c60ec5e0ca74f8f851de-00646dbe61; Proxy: null), S3 Extended Request ID: tx8c60ec5e0ca74f8f851de-00646dbe61 (Path: s3://mw_page_content_change_enrich__dse-k8s-eqiad/checkpoints/480beab788c847640ccb07ea26669e8c/chk-11/_metadata)

Full stack trace is available in logstash.

Is there any container-specific access policy that I should set?

This log entry is also relevant to the error above:

com.amazonaws.services.s3.model.AmazonS3Exception: The specified bucket is not valid. (Service: Amazon S3; Status Code: 400; Error Code: InvalidBucketName; Request ID: tx07b821592df14024bd0fb-00646dbe61; S3 Extended Request ID: null; Proxy: null), S3 Extended Request ID: null

mw_page_content_change_enrich__dse-k8s-eqiad is not a valid s3 bucket name, because the protocol does not allow _ in names: https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucketnamingrules.html.
As a test, I replaced _ with - and checkpoints are now stored.

@Ottomata: we need to revisit this naming convention.
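For whatever convention we settle on, a quick sketch of the kind of name check we could run before creating containers that will be addressed over s3 (a rough regex, not an exhaustive implementation of the AWS rules):

```python
import re

# 3-63 chars, lowercase letters, digits, dots and hyphens, starting/ending with a letter or digit.
S3_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def is_valid_s3_bucket_name(name: str) -> bool:
    return bool(S3_BUCKET_RE.match(name))


assert not is_valid_s3_bucket_name("mw_page_content_change_enrich__dse-k8s-eqiad")  # underscores
assert is_valid_s3_bucket_name("mw-page-content-change-enrich-dse-k8s-eqiad")       # hyphenated variant
```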