
Track and clean up object storage used by rdf-streaming-updater
Closed, Resolved, Public

Description

The rdf-streaming-updater uses Flink, which creates checkpoints in Thanos-swift object storage. A recent audit by @dcausse discovered ~1 TB of data. After removing stale/unnecessary data, total usage was down to ~20 GB.

This suggests that we need to be more aggressive about removing data, particularly because we will soon be moving the Search Update Pipeline to Flink.

Creating this ticket to:

  • Create monitoring/alerts for object storage usage. These were already created by @dcausse; see this dashboard for an example of metrics use. The alerts live here.
  • Decide whether or not we need an automated cleanup process. We have decided to script this process. Automation is possible in the future, but out of scope for this ticket.
    • Design/implement cleanup.

Event Timeline

Space usage alerts for the Search team added in this PR

Gehel triaged this task as Medium priority. Oct 18 2023, 8:40 AM
Gehel moved this task from Incoming to Misc on the Data-Platform-SRE board.
bking updated the task description.

We got another alert for Swift disk usage today. Rather than have an automated cleanup process, I wonder if it would be useful to set a TTL on our objects (possibly we do this already?). Both S3 and Swift support this concept, although this isn't directly available from the Flink S3 plugin. Maybe a batch job to set TTLs daily?
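
To make the TTL idea concrete, here is a minimal sketch of what a per-object expiry request could look like against Swift's native API, which accepts an `X-Delete-After` header (a lifetime in seconds). The endpoint URL, container, object name, and token below are made up for illustration, and the helper `swift_ttl_request` is hypothetical. The request is only built here, not sent. As far as I know, a Swift object POST also replaces the object's existing custom metadata, so that would need checking before running anything like this in bulk.

```python
from datetime import timedelta
from urllib.request import Request


def swift_ttl_request(object_url: str, auth_token: str, ttl: timedelta) -> Request:
    """Build (but do not send) a POST asking Swift to expire one object.

    X-Delete-After is Swift's relative expiry header, in whole seconds.
    """
    seconds = int(ttl.total_seconds())
    if seconds <= 0:
        raise ValueError("ttl must be positive")
    headers = {
        "X-Auth-Token": auth_token,
        "X-Delete-After": str(seconds),
    }
    return Request(object_url, headers=headers, method="POST")


# Hypothetical endpoint, container, and object name, for illustration only.
req = swift_ttl_request(
    "https://swift.example.org/v1/AUTH_demo/flink-checkpoints/job-a/chk-1/_metadata",
    auth_token="not-a-real-token",
    ttl=timedelta(days=30),
)
print(req.get_method(), req.full_url)
```

A daily batch job would list the objects and send one such request per object (for example with `urllib.request.urlopen(req)`). Whether a blanket TTL is safe depends on whether any long-lived savepoints share the container.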

Change 994758 had a related patch set uploaded (by Bking; author: Bking):

[operations/alerts@master] rdf-streaming-updater: Change notification from email to task

https://gerrit.wikimedia.org/r/994758

Change 994758 merged by jenkins-bot:

[operations/alerts@master] rdf-streaming-updater: Change notification from email to task

https://gerrit.wikimedia.org/r/994758

Adding some notes on this topic based on conversation with @dcausse yesterday. Feel free to correct this if I missed anything.

What causes Flink to use so much storage space?
There are a few situations that may disrupt Flink's normal cleanup processes:

  • Transient issues reaching object storage endpoint
  • Pods restarting or being killed/replaced

Both situations could cause Flink to lose track of its checkpoints, which means it won't delete them automatically. We could automate a deletion process, but we first need to spend some time working out how to tell which data is expendable. A good first step might be to generate a report on the current state of Flink and its object storage usage.
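
As a rough idea of what such a report could look like, the sketch below sums object count and size per top-level prefix and flags prefixes that have not been written to recently. It assumes the bucket is reachable through an S3-compatible endpoint via boto3 and that each Flink job writes under its own top-level prefix. The bucket layout, the 30-day threshold, and all function names are assumptions, not something taken from our deployment.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from typing import Iterable, NamedTuple


class PrefixUsage(NamedTuple):
    prefix: str
    objects: int
    total_bytes: int
    newest: datetime


def summarize(listing: Iterable[tuple[str, int, datetime]], depth: int = 1) -> list[PrefixUsage]:
    """Aggregate (key, size, last_modified) tuples by their leading path segments."""
    counts = defaultdict(int)
    sizes = defaultdict(int)
    newest = {}
    for key, size, modified in listing:
        prefix = "/".join(key.split("/")[:depth])
        counts[prefix] += 1
        sizes[prefix] += size
        if prefix not in newest or modified > newest[prefix]:
            newest[prefix] = modified
    rows = [PrefixUsage(p, counts[p], sizes[p], newest[p]) for p in counts]
    # Largest consumers first.
    return sorted(rows, key=lambda r: r.total_bytes, reverse=True)


def quiet_prefixes(rows: list[PrefixUsage], now: datetime, max_age: timedelta) -> list[str]:
    """Prefixes with no writes within max_age: candidates for review, not for blind deletion."""
    return [r.prefix for r in rows if now - r.newest > max_age]


def list_bucket(bucket: str, endpoint_url: str):
    """Yield (key, size, last_modified) via boto3; bucket and endpoint are assumptions."""
    import boto3  # imported lazily so the aggregation above works without boto3

    client = boto3.client("s3", endpoint_url=endpoint_url)
    for page in client.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            yield obj["Key"], obj["Size"], obj["LastModified"]


# Made-up listing standing in for list_bucket(...) output.
now = datetime(2024, 2, 21, tzinfo=timezone.utc)
sample = [
    ("job-a/chk-10/_metadata", 2_000, now - timedelta(hours=1)),
    ("job-a/shared/f1", 8_000, now - timedelta(days=2)),
    ("job-b/chk-3/_metadata", 50_000, now - timedelta(days=40)),
]
rows = summarize(sample)
for row in rows:
    print(f"{row.prefix}: {row.objects} objects, {row.total_bytes} bytes, newest {row.newest:%Y-%m-%d}")
stale = quiet_prefixes(rows, now, timedelta(days=30))
print("no recent writes:", stale)
```

A prefix with no recent writes is only a hint that a job has stopped checkpointing there. It says nothing about whether a savepoint under it is still needed, so the output is meant for a human to review.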

I created some acceptance criteria (AC) for Flink object storage cleanup here. I'll ask my team for feedback, but anyone reading this is welcome to share their opinion as well.

Mentioned in SAL (#wikimedia-operations) [2024-02-09T21:36:48Z] <inflatador> bking@deploy2002 install 'python3-plac' pkg T348685

Mentioned in SAL (#wikimedia-operations) [2024-02-09T21:38:59Z] <inflatador> bking@deploy2002 install 'python3-boto3' pkg T348685

Change 1004199 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] rdf-streaming-updater: Trigger savepoints in production envs

https://gerrit.wikimedia.org/r/1004199

Change 1004199 merged by Bking:

[operations/deployment-charts@master] rdf-streaming-updater: Trigger savepoints in production envs

https://gerrit.wikimedia.org/r/1004199

Change 1005572 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] rdf-streaming-updater: restore from savepoint (WIP)

https://gerrit.wikimedia.org/r/1005572

Mentioned in SAL (#wikimedia-operations) [2024-02-21T19:38:41Z] <inflatador> bking@deploy2002 deleting old flink data from thanos-swift T348685

Change 1005791 had a related patch set uploaded (by Bking; author: Bking):

[operations/alerts@master] rdf-streaming-updater: raise storage alert threshold

https://gerrit.wikimedia.org/r/1005791

Change 1005572 abandoned by Bking:

[operations/deployment-charts@master] rdf-streaming-updater: restore from savepoint (WIP)

Reason:

No longer needed, see task for more details.

https://gerrit.wikimedia.org/r/1005572

Change 1005791 merged by jenkins-bot:

[operations/alerts@master] rdf-streaming-updater: raise storage alert threshold

https://gerrit.wikimedia.org/r/1005791

The script is good enough to run as a one-off job. We may want to have it run automatically one day, but its current state satisfies our AC. As such, I'm closing out this ticket.
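
For anyone repeating this kind of one-off cleanup, here is a minimal sketch of the deletion step with a dry-run default. It is not the script used for this task. It assumes an S3-compatible client such as one from boto3, whose `delete_objects` call accepts at most 1000 keys per request. The helper names and the stand-in client are hypothetical, and I have not verified how thanos-swift handles multi-object deletes.

```python
from typing import Iterable, Iterator

MAX_KEYS_PER_REQUEST = 1000  # S3 multi-object delete limit per request


def delete_batches(keys: Iterable[str], batch_size: int = MAX_KEYS_PER_REQUEST) -> Iterator[dict]:
    """Yield payloads shaped for the `Delete` argument of boto3's delete_objects."""
    batch = []
    for key in keys:
        batch.append({"Key": key})
        if len(batch) == batch_size:
            yield {"Objects": batch, "Quiet": True}
            batch = []
    if batch:
        yield {"Objects": batch, "Quiet": True}


def delete_keys(client, bucket: str, keys: Iterable[str], dry_run: bool = True) -> int:
    """Delete keys in batches; with dry_run=True, only count what would be removed."""
    total = 0
    for payload in delete_batches(keys):
        total += len(payload["Objects"])
        if not dry_run:
            client.delete_objects(Bucket=bucket, Delete=payload)
    return total


class RecordingClient:
    """Stand-in for a boto3 S3 client that records batch sizes instead of deleting."""

    def __init__(self):
        self.calls = []

    def delete_objects(self, Bucket, Delete):
        self.calls.append((Bucket, len(Delete["Objects"])))


# Hypothetical bucket name and keys, for illustration only.
keys = [f"job-b/chk-3/part-{i}" for i in range(2500)]
client = RecordingClient()
would_delete = delete_keys(client, "example-bucket", keys)  # dry run, no calls made
deleted = delete_keys(client, "example-bucket", keys, dry_run=False)
print(would_delete, deleted, client.calls)
```

Defaulting to a dry run means a mistake in key selection is caught by reading the count (or a log of keys) before anything is removed.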