Add monitoring and alerting on the usage of the rdf-streaming-updater swift containers in thanos
Closed, Resolved · Public · 5 Estimated Story Points

Description

As a maintainer of the W[DC]QS Streaming Updater I want to monitor, and be alerted when, the space usage of these Flink jobs reaches a certain threshold, so that I can act quickly to investigate any issues and perform the required cleanups.

(This is a followup of T314835)

Today we use 3 containers in thanos:

  • rdf-streaming-updater-codfw
  • rdf-streaming-updater-eqiad
  • rdf-streaming-updater-staging

Given that:

  • a wikidata savepoint is 3 GB
  • a commons savepoint is 2 GB
  • incremental checkpoints consume less than 1.5 GB for each job
  • flink_ha_storage should not need more than 200 MB

If we keep a couple of savepoints per job we should be able to operate within 50 GB.
The number of objects should be relatively small as well: 12 per savepoint and a bit more per checkpoint, so consuming more than 500 objects might require some investigation.
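
As a rough sanity check of the 50 GB figure, the arithmetic can be sketched as below. This is only an estimate under assumptions the description does not state: each of the 3 containers is assumed to host both the wikidata and commons jobs, each job is assumed to keep 2 savepoints and its own flink_ha_storage, and the sizes are read as decimal gigabytes. The function names are hypothetical.

```python
# Upper-bound figures taken from the task description, in GB.
SAVEPOINT_GB = {"wikidata": 3.0, "commons": 2.0}
CHECKPOINT_GB_PER_JOB = 1.5  # incremental checkpoints, per job
HA_STORAGE_GB = 0.2          # flink_ha_storage (assumed to be per job)

CONTAINERS = [
    "rdf-streaming-updater-codfw",
    "rdf-streaming-updater-eqiad",
    "rdf-streaming-updater-staging",
]


def container_budget_gb(savepoints_per_job=2):
    """Worst-case space for one container, assuming it hosts both jobs."""
    total = 0.0
    for savepoint_gb in SAVEPOINT_GB.values():
        total += savepoints_per_job * savepoint_gb  # retained savepoints
        total += CHECKPOINT_GB_PER_JOB              # incremental checkpoints
        total += HA_STORAGE_GB                      # HA metadata
    return total


def total_budget_gb(savepoints_per_job=2):
    """Worst-case space across all containers."""
    return container_budget_gb(savepoints_per_job) * len(CONTAINERS)


if __name__ == "__main__":
    print(f"per container: {container_budget_gb():.1f} GB")
    print(f"all containers: {total_budget_gb():.1f} GB")
```

Under these assumptions the estimate stays below 50 GB, which leaves some headroom for an extra savepoint or a checkpoint that is not cleaned up immediately.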

AC:

Event Timeline

Gehel set the point value for this task to 5. (Aug 29 2022, 3:28 PM)

Change 834008 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/alerts@master] rdf-streaming-updater: alert on thanos-swift space usage

https://gerrit.wikimedia.org/r/834008

The above patch adds a quick alert on the space used by auth_WDQS. It does not address all the ACs of this ticket, but I think it is the minimal requirement to make sure we react quickly if similar problems occur in the future.
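
For illustration only, the kind of check such an alert performs can be sketched as follows. This is not the content of change 834008: the `Usage` type and `check_usage` function are hypothetical, the usage numbers would have to come from whatever metrics source is available, and the thresholds simply reuse the 50 GB and 500 object figures from the description, applied here to the whole account as an assumption.

```python
from dataclasses import dataclass

GB = 10**9  # decimal gigabyte, assumed


@dataclass
class Usage:
    """Hypothetical snapshot of the account's swift usage."""
    bytes_used: int
    object_count: int


def check_usage(usage, max_bytes=50 * GB, max_objects=500):
    """Return a list of human-readable problems; empty means all is fine."""
    problems = []
    if usage.bytes_used > max_bytes:
        problems.append(
            f"space usage {usage.bytes_used / GB:.1f} GB exceeds "
            f"{max_bytes / GB:.0f} GB"
        )
    if usage.object_count > max_objects:
        problems.append(
            f"object count {usage.object_count} exceeds {max_objects}"
        )
    return problems


if __name__ == "__main__":
    for problem in check_usage(Usage(bytes_used=60 * GB, object_count=120)):
        print("ALERT:", problem)
```

Keeping the space and object-count checks separate means the space check alone matches the minimal alert, and the object-count check can be added later if the remaining ACs are picked up.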

Change 834008 merged by jenkins-bot:

[operations/alerts@master] rdf-streaming-updater: alert on thanos-swift space usage

https://gerrit.wikimedia.org/r/834008

Gehel claimed this task.
Gehel subscribed.

We have a (minimal) alert available. Let's close this and reopen if we have an incident where this isn't enough.