Hi,
Event Platform needs to operationalise an Apache Flink [streaming application](https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment) on k8s (DSE cluster and wikikube). We need a storage solution for checkpointing state and [supporting highly available application lifecycles](https://phabricator.wikimedia.org/T328563). This storage would be accessed via the S3 protocol and will not need cross-DC replication.
In developing this application we iterated on the [WDQS updater service](https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Flink_On_Kubernetes), which currently runs on wikikube and uses Thanos S3 for checkpointing.
Currently the application is deployed in dse-k8s-eqiad, but we plan to [move to wikikube](https://phabricator.wikimedia.org/T330507). The dse-k8s-eqiad deployment is single DC, but we are going to deploy this in wikikube as [active/active single compute](https://docs.google.com/presentation/d/1xxnFcxFJQGfbxnlmgwzCGyS8ULG0clP2QHcZ2OeeL1c/edit#slide=id.g1277d24d13d_0_0), similar to other multi-DC services.
The application - in Flink terms - is stateless, and will only need to checkpoint Kafka offsets into a partition [to handle restarts](https://www.mediawiki.org/wiki/Platform_Engineering_Team/Event_Platform_Value_Stream/Pyflink_Enrichment_Service_Deployment#Application_restarts). Current experiments suggest a checkpoint size (write) of tens of MB, written once every 1-3 minutes. This might change as we gain more experience with operating Flink. Reads will be sporadic: they will happen only at application restarts, caused either by scheduled maintenance or by recovery from failure. Data will not need to be stored indefinitely and will be pruned (the cutoff is not yet known, but we can start with strict policies).
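To make the figures above easier to reason about, here is a rough back-of-the-envelope sketch in Python. It is illustrative only: the 10 MB and 90 MB bounds are our reading of "tens of MB", the `retained` count stands in for a pruning policy that is not yet decided, and `checkpoint_volume` is a hypothetical helper, not part of the application.

```python
def checkpoint_volume(size_mb: float, interval_s: float, retained: int = 1) -> dict:
    """Estimate daily write volume and retained footprint for periodic checkpoints.

    size_mb    -- assumed size of one checkpoint, in MB
    interval_s -- assumed time between checkpoints, in seconds
    retained   -- assumed number of checkpoints kept after pruning
    """
    checkpoints_per_day = 86_400 / interval_s
    return {
        "checkpoints_per_day": checkpoints_per_day,
        # Cumulative bytes written per day, ignoring pruning.
        "written_gb_per_day": size_mb * checkpoints_per_day / 1000,
        # Space occupied at any one time if only `retained` checkpoints are kept.
        "retained_mb": size_mb * retained,
    }


# Low end: 10 MB every 3 minutes, keeping only the latest checkpoint.
low = checkpoint_volume(size_mb=10, interval_s=180, retained=1)
# High end: 90 MB every minute, keeping the last 3 checkpoints.
high = checkpoint_volume(size_mb=90, interval_s=60, retained=3)

print(low)
print(high)
```

Under these assumptions the retained footprint stays in the tens to hundreds of MB, while cumulative writes per day are in the GB range, so how much space is actually occupied depends mainly on how aggressively old checkpoints are pruned.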
We don't have a lot of metrics yet. If we wanted to collect actual metrics from our development target (DSE k8s), would it be possible for you to create a throwaway (non-replicated) eqiad bucket with a quota (<1 GB)?
The application currently does not have an SLO, and does not yet support feature or production use cases.
===== Scalability needs
While this request is specific to mediawiki-page-content-change-enrichment, we expect to need to support similar use cases in the future (estimated on the order of 1-10 in the next 2-4 quarters). The abstraction is multi-tenant (each team/application owns a Helm chart and Docker image), but applications will be managed by the same service user. We will need to validate how namespaces can be managed per application.