As a maintainer of the W[DC]QS Streaming Updater I want to monitor, and be alerted when, the space usage of these Flink jobs reaches a certain threshold so that I can act quickly to investigate any issues and perform the required cleanups.
(This is a follow-up to T314835)
Today we use 3 containers in Thanos:
- rdf-streaming-updater-codfw
- rdf-streaming-updater-eqiad
- rdf-streaming-updater-staging
Given that:
- a wikidata savepoint is 3Gb
- a commons savepoint is 2Gb
- incremental checkpoints consume less than 1.5Gb for each job
- flink_ha_storage should not need more than 200Mb
If we keep a couple of savepoints per job we should be able to operate with 50Gb.
The number of objects should be relatively small as well: 12 per savepoint and a bit more per checkpoint, so a container holding more than 500 objects might require some investigation.
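As a rough sanity check of the 50Gb figure, here is a minimal sketch of the arithmetic. It assumes that each container holds both the wikidata and the commons job, that "a couple" means 2 savepoints per job, and that flink_ha_storage is counted once per container. None of these assumptions is stated above, and the function and constant names are hypothetical.

```python
# Sizes in Gb, taken from the figures listed above.
SAVEPOINT_SIZE_GB = {"wikidata": 3.0, "commons": 2.0}
CHECKPOINT_MAX_GB = 1.5  # upper bound for incremental checkpoints, per job
HA_STORAGE_MAX_GB = 0.2  # 200Mb, counted once per container (assumption)
BUDGET_GB = 50.0


def estimate_container_usage_gb(savepoints_per_job: int = 2) -> float:
    """Worst-case space estimate for one container hosting both jobs."""
    total = HA_STORAGE_MAX_GB
    for savepoint_size in SAVEPOINT_SIZE_GB.values():
        # Retained savepoints plus the checkpoint upper bound for this job.
        total += savepoints_per_job * savepoint_size + CHECKPOINT_MAX_GB
    return total


if __name__ == "__main__":
    estimate = estimate_container_usage_gb()
    print(f"estimated usage: {estimate:.1f}Gb of {BUDGET_GB:.0f}Gb budget")
```

Under these assumptions the estimate is about 13.2Gb, so 50Gb leaves room for extra savepoints kept around during maintenance before the alert would fire.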
AC:
- update the dashboard on https://grafana-rw.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?orgId=1 and add a graph showing the space and object usage of these containers
- create an alert if the space usage is above 50Gb per container
- create an alert if the number of objects is above 500 per container
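
The two alert conditions could be sketched as below. This only illustrates the threshold logic and is not the actual alerting configuration: the shape of the `stats` input and the function name are hypothetical, and 50Gb is interpreted as 50 * 10^9 bytes, which is an assumption.

```python
from typing import Dict, List

MAX_BYTES = 50 * 10**9  # 50Gb per container, read as decimal bytes (assumption)
MAX_OBJECTS = 500       # object-count threshold per container


def check_containers(stats: Dict[str, Dict[str, int]]) -> List[str]:
    """Return one message per threshold exceeded, per container.

    `stats` maps a container name to {"bytes": ..., "objects": ...};
    this input shape is hypothetical.
    """
    alerts = []
    for name, usage in sorted(stats.items()):
        if usage["bytes"] > MAX_BYTES:
            alerts.append(f"{name}: space usage above 50Gb")
        if usage["objects"] > MAX_OBJECTS:
            alerts.append(f"{name}: more than {MAX_OBJECTS} objects")
    return alerts


if __name__ == "__main__":
    # Made-up sample values, only to exercise the function.
    sample = {
        "rdf-streaming-updater-codfw": {"bytes": 13 * 10**9, "objects": 120},
        "rdf-streaming-updater-staging": {"bytes": 61 * 10**9, "objects": 640},
    }
    for message in check_containers(sample):
        print(message)
```

Both thresholds are strict ("above"), so a container at exactly 50Gb or exactly 500 objects does not alert.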