Track and clean up object storage used by rdf-streaming-updater
Closed, Resolved · Public

Assigned To

Authored By

	bking
	Oct 11 2023, 7:38 PM

Description

The rdf-streaming-updater uses Flink, which creates checkpoints in Thanos-swift object storage. A recent audit by @dcausse discovered ~1 TB of data. After removing stale/unnecessary data, total usage was down to ~20 GB.

This suggests that we need to be more aggressive about removing data, particularly because we will soon be moving the Search Update Pipeline to Flink.

Creating this ticket to:

- ~~Create monitoring/alerts for object storage usage~~ These were already created by @dcausse; see this dashboard for an example of metrics use, and the alerts live here.
- ~~Decide whether or not we need an automated cleanup process~~ We have decided to script this process. Automation is possible in the future, but it is out of scope for this ticket.
- Design/implement cleanup.

Details

Subject	Repo	Branch	Lines +/-
rdf-streaming-updater: raise storage alert threshold	operations/alerts	master	+6 -6
rdf-streaming-updater: restore from savepoint (WIP)	operations/deployment-charts	master	+10 -10
rdf-streaming-updater: Trigger savepoints in production envs	operations/deployment-charts	master	+6 -6
rdf-streaming-updater: Change notification from email to task	operations/alerts	master	+2 -3


Title	Reference	Author	Source Branch	Dest Branch
Add docstrings and minor formatting changes	repos/search-platform/sre/cleanup-flink-object-storage!3	bking	docstring	main
Add kubernetes logic	repos/search-platform/sre/cleanup-flink-object-storage!2	bking	k8s	main
Add README	repos/search-platform/sre/cleanup-flink-object-storage!1	bking	readme	main


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		bking	T348685 Track and clean up object storage used by rdf-streaming-updater
		Invalid		None	T356283 Clean up object storage in response to latest alert

Event Timeline

bking created this task.Oct 11 2023, 7:38 PM

Restricted Application added a subscriber: Aklapper.Oct 11 2023, 7:38 PM

Space usage alerts for the Search team added in this PR

bking added a subtask: T340548: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s.Oct 11 2023, 8:54 PM

Gehel removed a subtask: T340548: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s.Oct 16 2023, 1:04 PM

Gehel moved this task from needs triage to Ops / SRE on the Discovery-Search board.

Gehel triaged this task as Medium priority.Oct 18 2023, 8:40 AM

Gehel moved this task from Incoming to Misc on the Data-Platform-SRE board.

bking moved this task from Misc to In Progress on the Data-Platform-SRE board.Nov 1 2023, 7:17 PM

bking updated the task description.

bking updated the task description.Nov 1 2023, 7:22 PM

We got another alert for Swift disk usage today. Rather than have an automated cleanup process, I wonder if it would be useful to set a TTL on our objects (possibly we do this already?). Both S3 and Swift support this concept, although this isn't directly available from the Flink S3 plugin. Maybe a batch job to set TTLs daily?
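To make the TTL idea concrete, here is a minimal sketch of an S3-style lifecycle rule that expires objects under a prefix after a number of days. It assumes a boto3 S3 client is passed in; the function names, the rule ID, and the bucket and prefix in the usage note are hypothetical. Whether the Thanos-swift S3 endpoint honors lifecycle configuration has not been confirmed here. On the native Swift API, the equivalent is a per-object `X-Delete-After` header.

```python
from typing import Any, Dict


def build_expiration_rule(
    prefix: str, days: int, rule_id: str = "expire-flink-checkpoints"
) -> Dict[str, Any]:
    """Build an S3 lifecycle configuration that expires objects under a prefix.

    The rule ID default is a made-up name for illustration.
    """
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_expiration_rule(s3_client: Any, bucket: str, prefix: str, days: int) -> None:
    """Apply the rule using a boto3 S3 client created by the caller.

    Note: this call replaces any lifecycle configuration already set on the bucket.
    """
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=build_expiration_rule(prefix, days),
    )
```

A call might look like `apply_expiration_rule(client, "some-bucket", "checkpoints/", 30)`. One caveat with any age-based TTL: it expires objects regardless of whether Flink still references them, so a job that has been stopped for longer than the TTL, or a savepoint kept on purpose, could lose state it needs.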

Gehel moved this task from In Progress to Misc on the Data-Platform-SRE board.Dec 5 2023, 4:39 PM

Change 994758 had a related patch set uploaded (by Bking; author: Bking):

[operations/alerts@master] rdf-streaming-updater: Change notification from email to task

https://gerrit.wikimedia.org/r/994758

gerritbot added a project: Patch-For-Review.Jan 31 2024, 4:25 PM

Change 994758 merged by jenkins-bot:

[operations/alerts@master] rdf-streaming-updater: Change notification from email to task

https://gerrit.wikimedia.org/r/994758

Adding some notes on this topic based on conversation with @dcausse yesterday. Feel free to correct this if I missed anything.

What causes Flink to use so much storage space?
There are a few situations that may disrupt Flink's normal cleanup processes:

- Transient issues reaching the object storage endpoint
- Pods restarting or being killed/replaced

Both situations could cause Flink to lose track of its checkpoints, which means it won't delete them automatically. We could automate a deletion process, but we need to spend some time determining how to distinguish whether the data is expendable. A good first step might be to generate a report on the current state of Flink and its object storage usage.
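As a rough illustration of what such a report could look like, the sketch below groups an object listing by key prefix and shows how much space each prefix uses and when it was last written. This is not the script that was eventually written for this task; `ObjectInfo`, `PrefixReport`, `summarize`, and the 30-day default are all hypothetical. The listing itself (for example from boto3's `list_objects_v2`) is assumed to be fetched elsewhere.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple


class ObjectInfo(NamedTuple):
    key: str
    size: int  # bytes
    last_modified: datetime


class PrefixReport(NamedTuple):
    prefix: str
    object_count: int
    total_bytes: int
    newest: datetime
    stale: bool  # True if nothing under the prefix was written within max_age


def summarize(
    objects: Iterable[ObjectInfo],
    now: datetime,
    max_age: timedelta = timedelta(days=30),
    depth: int = 1,
) -> List[PrefixReport]:
    """Group objects by the first `depth` path segments and report usage.

    `now` must match the timezone-awareness of `last_modified`
    (boto3 returns timezone-aware datetimes).
    """
    grouped = defaultdict(list)
    for obj in objects:
        prefix = "/".join(obj.key.split("/")[:depth])
        grouped[prefix].append(obj)

    reports = []
    for prefix, members in grouped.items():
        newest = max(m.last_modified for m in members)
        reports.append(
            PrefixReport(
                prefix=prefix,
                object_count=len(members),
                total_bytes=sum(m.size for m in members),
                newest=newest,
                stale=(now - newest) > max_age,
            )
        )
    # Largest consumers first, since those matter most for the alert.
    return sorted(reports, key=lambda r: r.total_bytes, reverse=True)
```

The `stale` flag only means that nothing under the prefix has been written recently. It does not prove the data is expendable, so a report like this would be a starting point for a human decision, not a deletion list.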

I created some acceptance criteria (AC) for Flink object storage cleanup here. I'll ask my team for feedback, but anyone reading this is welcome to share their opinion as well.

Gehel mentioned this in T356313: RdfStreamingUpdaterSpaceUsageTooHigh.Feb 5 2024, 4:25 PM

bking closed subtask T356283: Clean up object storage in response to latest alert as Invalid.Feb 6 2024, 2:31 PM

Maintenance_bot removed a project: Patch-For-Review.Feb 6 2024, 3:32 PM

JMeybohm subscribed.Feb 7 2024, 5:16 PM

Mentioned in SAL (#wikimedia-operations) [2024-02-09T21:36:48Z] <inflatador> bking@deploy2002 install 'python3-plac' pkg T348685

Mentioned in SAL (#wikimedia-operations) [2024-02-09T21:38:59Z] <inflatador> bking@deploy2002 install 'python3-boto3' pkg T348685

bking claimed this task.Feb 16 2024, 2:50 PM

bking updated the task description.

bking edited projects, added Data-Platform-SRE (2024.02.12 - 2024.03.03); removed Data-Platform-SRE.

Change 1004199 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] rdf-streaming-updater: Trigger savepoints in production envs

https://gerrit.wikimedia.org/r/1004199

gerritbot added a project: Patch-For-Review.Feb 16 2024, 5:03 PM

bking opened https://gitlab.wikimedia.org/repos/search-platform/sre/cleanup-flink-object-storage/-/merge_requests/1

Add README

bking merged https://gitlab.wikimedia.org/repos/search-platform/sre/cleanup-flink-object-storage/-/merge_requests/1

Add README

Change 1004199 merged by Bking:

[operations/deployment-charts@master] rdf-streaming-updater: Trigger savepoints in production envs

https://gerrit.wikimedia.org/r/1004199

Maintenance_bot removed a project: Patch-For-Review.Feb 20 2024, 3:31 PM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.Feb 20 2024, 4:42 PM

Change 1005572 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] rdf-streaming-updater: restore from savepoint (WIP)

https://gerrit.wikimedia.org/r/1005572

gerritbot added a project: Patch-For-Review.Feb 21 2024, 7:07 PM

Mentioned in SAL (#wikimedia-operations) [2024-02-21T19:38:41Z] <inflatador> bking@deploy2002 deleting old flink data from thanos-swift T348685

Change 1005791 had a related patch set uploaded (by Bking; author: Bking):

[operations/alerts@master] rdf-streaming-updater: raise storage alert threshold

https://gerrit.wikimedia.org/r/1005791

Change 1005572 abandoned by Bking:

[operations/deployment-charts@master] rdf-streaming-updater: restore from savepoint (WIP)

Reason:

No longer needed, see task for more details.

https://gerrit.wikimedia.org/r/1005572

Change 1005791 merged by jenkins-bot:

[operations/alerts@master] rdf-streaming-updater: raise storage alert threshold

https://gerrit.wikimedia.org/r/1005791

Maintenance_bot removed a project: Patch-For-Review.Feb 22 2024, 7:30 PM

bking opened https://gitlab.wikimedia.org/repos/search-platform/sre/cleanup-flink-object-storage/-/merge_requests/2

Add kubernetes logic

bking merged https://gitlab.wikimedia.org/repos/search-platform/sre/cleanup-flink-object-storage/-/merge_requests/2

Add kubernetes logic

Maintenance_bot removed a project: Patch-For-Review.Feb 22 2024, 8:31 PM

bking mentioned this in T357330: RdfStreamingUpdaterSpaceUsageTooHigh.Feb 22 2024, 9:12 PM

bking opened https://gitlab.wikimedia.org/repos/search-platform/sre/cleanup-flink-object-storage/-/merge_requests/3

Add docstrings and minor formatting changes

bking merged https://gitlab.wikimedia.org/repos/search-platform/sre/cleanup-flink-object-storage/-/merge_requests/3

Add docstrings and minor formatting changes

Maintenance_bot removed a project: Patch-For-Review.Feb 22 2024, 9:30 PM

The script is good enough to run as a one-off job. We may want to have it run automatically one day, but its current state satisfies our AC. As such, I'm closing out this ticket.

bking mentioned this in T357932: RdfStreamingUpdaterSpaceUsageTooHigh.Mar 4 2024, 4:24 PM
