Hi,
Event Platform needs to operationalise an Apache Flink streaming application on k8s (DSE cluster and wikikube). We need a storage solution for checkpointing state and for supporting highly available application lifecycles. This storage would be accessed via the S3 protocol and will not need cross-DC replication.
In developing this application we iterated on the wdqs updater service, which currently runs on wikikube and uses Thanos S3 for checkpointing.
Currently the application is deployed in dse-k8s-eqiad, but we plan to move to wikikube. The dse-k8s-eqiad deployment is single DC, but we are going to deploy this in wikikube as active/active single compute, similar to other multi-DC services.
The application is stateless in Flink terms, and will only need to checkpoint Kafka offsets into a partition to handle restarts. Current experiments suggest a checkpoint size (write) in the tens of MB, at a frequency of once every 1-3 minutes. This might change as we gain more experience with operating Flink. Reads will be sporadic: they will happen at application restarts caused either by scheduled maintenance or by recovery from failure. Data will not need to be stored indefinitely and will be pruned (the cutoff is not yet known, but we can start with strict policies).
We don't have many metrics yet. If we wanted to collect actual metrics from our development target (DSE k8s), would it be possible for you to create a throwaway (non-replicated) eqiad bucket with a quota (<1 GB)?
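To put the figures above in context, here is a rough sizing sketch. It is only back-of-the-envelope arithmetic, not a measurement: the function names are hypothetical, the sample sizes (10, 50 and 90 MB) are assumed stand-ins for "tens of MB", and the quota is taken as 1000 MB even though the request is for something below 1 GB.

```python
# Back-of-the-envelope sizing for checkpoint storage.
# All concrete numbers are illustrative assumptions, not measurements.

def daily_write_mb(checkpoint_mb: float, interval_minutes: float) -> float:
    """MB written per day if one checkpoint is taken every interval."""
    checkpoints_per_day = 24 * 60 / interval_minutes
    return checkpoint_mb * checkpoints_per_day


def max_retained(quota_mb: float, checkpoint_mb: float) -> int:
    """How many checkpoints of a given size fit under the quota."""
    return int(quota_mb // checkpoint_mb)


if __name__ == "__main__":
    quota_mb = 1000  # assumed upper bound for the throwaway bucket
    for size_mb in (10, 50, 90):      # assumed sizes for "tens of MB"
        for interval in (1, 3):       # minutes, from the range above
            written = daily_write_mb(size_mb, interval)
            kept = max_retained(quota_mb, size_mb)
            print(
                f"{size_mb} MB every {interval} min: "
                f"{written:.0f} MB written/day, "
                f"up to {kept} checkpoints fit in {quota_mb} MB"
            )
```

Under these assumptions the daily write volume is well above the quota, so the bucket would only hold the most recent checkpoints and pruning would need to be in place from the start.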
The application currently does not have an SLO, and is not yet supporting feature or production use cases.
Scalability needs
While this request is specific to mediawiki-page-content-change-enrichment, we expect the need to support similar use cases in the future (estimated in the order of 1-10 in the next 2-4 quarters). The abstraction is multi-tenant, in the sense that each application owns a helmfile deployment and a Docker image. Each application will execute in its own k8s namespace. Application deployment and lifecycle management is handled by a [flink k8s operator](https://phabricator.wikimedia.org/T324576). The operator service user is the same across application k8s namespaces.
Done is
- mediawiki-page-content-change-enrichment in DSE can store Flink checkpoints in an object store