Add monitoring and alerting on the usage of the rdf-streaming-updater swift containers in thanos
Closed, Resolved · Public · 5 Estimated Story Points

Authored By

	dcausse
	Aug 23 2022, 1:18 PM

Description

As a maintainer of the W[DC]QS Streaming Updater, I want to monitor and be alerted when the space usage of these Flink jobs reaches a certain threshold, so that I can act quickly to investigate any issues and perform the required cleanups.

(This is a followup of T314835)

Today we use 3 containers in thanos:

rdf-streaming-updater-codfw
rdf-streaming-updater-eqiad
rdf-streaming-updater-staging

Given that:

a wikidata savepoint is 3 GB
a commons savepoint is 2 GB
incremental checkpoints consume less than 1.5 GB for each job
flink_ha_storage should not need more than 200 MB

If we keep a couple of savepoints per job, we should be able to operate within 50 GB.
The number of objects should be relatively small as well: 12 per savepoint and a bit more per checkpoint, so consuming more than 500 objects might require some investigation.
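
To make the sizing argument above concrete, here is a rough back-of-the-envelope sketch. The sizes come from the estimates in this description; everything else is an assumption made for illustration: that each container hosts both jobs, that "a couple" means two retained savepoints per job, and that flink_ha_storage is counted once per job. The function names are hypothetical.

```python
# Sizes in GB, taken from the estimates in the task description.
SAVEPOINT_GB = {"wikidata": 3.0, "commons": 2.0}
CHECKPOINTS_GB_PER_JOB = 1.5  # upper bound for incremental checkpoints
HA_STORAGE_GB = 0.2           # upper bound for flink_ha_storage

# Assumptions (not stated in the task): two savepoints kept per job,
# both jobs in the same container, HA storage counted once per job.
SAVEPOINTS_KEPT = 2
OBJECTS_PER_SAVEPOINT = 12


def estimate_container_usage_gb(savepoints_kept=SAVEPOINTS_KEPT):
    """Estimated steady-state space usage of one container, in GB."""
    total = 0.0
    for savepoint_size in SAVEPOINT_GB.values():
        total += savepoints_kept * savepoint_size
        total += CHECKPOINTS_GB_PER_JOB + HA_STORAGE_GB
    return total


def estimate_savepoint_objects(savepoints_kept=SAVEPOINTS_KEPT):
    """Number of objects held by savepoints alone (checkpoints excluded)."""
    return len(SAVEPOINT_GB) * savepoints_kept * OBJECTS_PER_SAVEPOINT


if __name__ == "__main__":
    print(f"estimated usage: {estimate_container_usage_gb():.1f} GB")
    print(f"savepoint objects: {estimate_savepoint_objects()}")
```

Under these assumptions the estimate is about 13.4 GB and 48 savepoint objects per container, which leaves a wide margin below the proposed 50 GB and 500 object thresholds.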

AC:

update the dashboard on https://grafana-rw.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?orgId=1 and add a graph showing the space and object usage of these containers
create an alert if the space usage is above 50 GB per container
create an alert if the number of objects is above 500 per container
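
The two alerting criteria above can be sketched as a simple threshold check. This is only an illustration of the logic, not the rule that was merged in operations/alerts: the `ContainerUsage` type, the `check_container` function and the choice of decimal gigabytes are all assumptions, and how the usage figures are collected from thanos-swift is left out.

```python
from dataclasses import dataclass

GB = 10**9  # assumption: decimal gigabytes

MAX_BYTES = 50 * GB  # space threshold from the acceptance criteria
MAX_OBJECTS = 500    # object-count threshold from the acceptance criteria


@dataclass
class ContainerUsage:
    """Hypothetical usage snapshot for one swift container."""
    name: str
    bytes_used: int
    object_count: int


def check_container(usage, max_bytes=MAX_BYTES, max_objects=MAX_OBJECTS):
    """Return a list of alert messages; empty when usage is within limits."""
    alerts = []
    if usage.bytes_used > max_bytes:
        alerts.append(
            f"{usage.name}: space usage {usage.bytes_used / GB:.1f} GB "
            f"is above {max_bytes / GB:.0f} GB"
        )
    if usage.object_count > max_objects:
        alerts.append(
            f"{usage.name}: {usage.object_count} objects "
            f"is above {max_objects}"
        )
    return alerts


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    sample = ContainerUsage("rdf-streaming-updater-staging", 62 * GB, 120)
    for message in check_container(sample):
        print(message)
```

Both checks are strict comparisons, matching the "above" wording of the criteria, so a container sitting exactly at 50 GB or 500 objects does not alert.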

Details

	Subject	Repo	Branch	Lines +/-
	rdf-streaming-updater: alert on thanos-swift space usage	operations/alerts	master	+30 -0


Related Objects

Mentioned In: T314835: wdqs space usage on thanos-swift
Mentioned Here: T314835: wdqs space usage on thanos-swift

Event Timeline

dcausse created this task. Aug 23 2022, 1:18 PM

Restricted Application added a subscriber: Aklapper. Aug 23 2022, 1:18 PM

Maintenance_bot added a project: Wikidata. Aug 23 2022, 1:29 PM

Gehel moved this task from Incoming to Current work on the Wikidata-Query-Service board. Aug 29 2022, 3:20 PM

Gehel added a project: Discovery-Search (Current work).

Gehel set the point value for this task to 5. Aug 29 2022, 3:28 PM

Gehel moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

dcausse mentioned this in T314835: wdqs space usage on thanos-swift. Sep 12 2022, 8:25 PM

Change 834008 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/alerts@master] rdf-streaming-updater: alert on thanos-swift space usage

https://gerrit.wikimedia.org/r/834008

gerritbot added a project: Patch-For-Review. Sep 22 2022, 10:17 AM

The above patch adds a quick alert on the space used by auth_WDQS. It does not address all the ACs of this ticket, but I think it is the minimal requirement to make sure we react quickly if similar problems occur in the future.

Change 834008 merged by jenkins-bot:

[operations/alerts@master] rdf-streaming-updater: alert on thanos-swift space usage

https://gerrit.wikimedia.org/r/834008

Maintenance_bot removed a project: Patch-For-Review. Sep 22 2022, 6:30 PM

dcausse removed a project: Discovery-Search (Current work). Sep 26 2022, 3:48 PM

dcausse moved this task from Current work to Operations/SRE on the Wikidata-Query-Service board.

We have a (minimal) alert available. Let's close this and reopen if we have an incident where this isn't enough.
