
Clean up the rdf-streaming-updater-codfw container from thanos-swift.
Closed, Resolved · Public · 3 Estimated Story Points

Description

See parent task.

Empty and remove the rdf-streaming-updater-codfw container.

I may have to fire up swiftly but will use low concurrency.

Event Timeline

bking updated the task description.

Per conversation with @dcausse, we need to keep all objects within the T314835 (pseudo-)folder in the 'rdf-streaming-updater-codfw' container (not to be confused with the 'rdf-streaming-updater-codfw-T314835' container).

Using swiftly's fordo feature, I think we can get this accomplished.

Example read-only command using head:
swiftly for -p commons/checkpoints/1475a2038f088807f9d695aea3e1c7e3/ rdf-streaming-updater-codfw do head "<item>"
Note that "<item>" is literal.
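For illustration, here is a rough Python sketch of what the for/do expansion amounts to. This is not swiftly's actual code, and the listing below is a hypothetical example: swiftly lists the container, filters by the -p prefix, and substitutes each object name for the literal "<item>" token in the per-item command.

```python
# Illustrative sketch (NOT swiftly internals): expand a
# `swiftly for -p PREFIX CONTAINER do CMD "<item>"` invocation into
# one concrete command per matching object.
def expand_fordo(object_names, prefix, command_template):
    """Yield one concrete command per object name under the prefix."""
    for name in object_names:
        if name.startswith(prefix):
            yield command_template.replace("<item>", name)

# Hypothetical container listing for demonstration.
listing = [
    "commons/checkpoints/1475a2038f088807f9d695aea3e1c7e3/_metadata",
    "wikidata/checkpoints/aa/_metadata",
]
commands = list(expand_fordo(listing, "commons/", 'head "<item>"'))
```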

Mentioned in SAL (#wikimedia-operations) [2022-08-23T17:33:07Z] <inflatador> 'bking@cumin starting thanos-swift cleanup for wdqs T316031'

Swiftly is running in a tmux window on cumin1001. Command run:

swiftly --cache-auth --eventlet --concurrency=5 for -p commons/ rdf-streaming-updater-codfw do delete "<item>"

Using the command swiftly head rdf-streaming-updater-codfw, we can see the deletes happening at a rate of ~200 objects/minute. We could probably bump up the concurrency, but not without approval from the data persistence team.

Per the output of swiftly head rdf-streaming-updater-codfw, the rdf-streaming-updater-codfw swift container is down to 2,853,739,944 bytes, which works out to about 2.85 GB (2.66 GiB). Will verify with stakeholders and close with their approval.
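As a quick arithmetic check on that figure:

```python
# Sanity-check the size reported by `swiftly head`: convert the byte
# count to decimal gigabytes and binary gibibytes.
reported_bytes = 2_853_739_944
gb = reported_bytes / 10**9    # decimal GB
gib = reported_bytes / 2**30   # binary GiB
```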

bking triaged this task as Medium priority.
Gehel moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.
Gehel moved this task from Incoming to Current work on the Wikidata-Query-Service board.

@bking thanks for running the cleanup!

I can confirm that the wikidata and commons pseudo-folders are empty; the flink_ha_storage folder still needs to be emptied.

Something I don't fully understand yet is why https://thanos.wikimedia.org/graph?g0.deduplicate=1&g0.expr=swift_account_stats_bytes_total{account%3D%22AUTH_wdqs%22}&g0.max_source_resolution=0s&g0.partial_response=0&g0.range_input=8w&g0.stacked=0&g0.store_matches=[]&g0.tab=0 still reports 19TB of usage.

Even adding up the stats of rdf-streaming-updater-codfw, rdf-streaming-updater-eqiad, and rdf-streaming-updater-staging, I don't see anywhere near that much usage.

rdf-streaming-updater-codfw:

  Account: AUTH_wdqs
Container: rdf-streaming-updater-codfw
  Objects: 1177
    Bytes: 3031622935

rdf-streaming-updater-eqiad:

  Account: AUTH_wdqs
Container: rdf-streaming-updater-eqiad
  Objects: 2666
    Bytes: 60190521403

rdf-streaming-updater-staging:

  Account: AUTH_wdqs
Container: rdf-streaming-updater-staging
  Objects: 1395
    Bytes: 5870706686
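Summing the three byte counts above confirms the discrepancy: the visible containers hold only about 69 GB, nowhere near the 19 TB the dashboard reports.

```python
# Sum the byte counts of the three containers listed above and compare
# against the ~19 TB the Thanos dashboard reports for AUTH_wdqs.
codfw = 3_031_622_935
eqiad = 60_190_521_403
staging = 5_870_706_686
total = codfw + eqiad + staging
total_gb = total / 10**9   # ~69 GB, far short of 19 TB
```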

I cleaned out the flink_ha_storage pseudo-folder from the rdf-streaming-updater-codfw container as requested above. However, per @dcausse's dashboard link above, it appears that there's still ~19 TB used on Thanos.

Will circle back with @fgiunchedi to see if the wdqs user is still using too much space on the Thanos cluster.

Thank you for following up. I think the culprit is that the S3-compatibility API stores the chunks of big files in a separate container (suffixed with +segments). See also the audit below, which I ran logged into swift as wdqs:flink:

# swift list | xargs -n1 swift stat | grep -e Container -e Objects -e Bytes
                    Container: rdf-streaming-updater-codfw
                      Objects: 125
                        Bytes: 336259981
                    Container: rdf-streaming-updater-codfw+segments
                      Objects: 3752575
                        Bytes: 19043034820255
                    Container: rdf-streaming-updater-codfw-T314835
                      Objects: 0
                        Bytes: 0
                    Container: rdf-streaming-updater-eqiad
                      Objects: 3079
                        Bytes: 61140716537
                    Container: rdf-streaming-updater-eqiad+segments
                      Objects: 13832
                        Bytes: 55454475470
                    Container: rdf-streaming-updater-staging
                      Objects: 1423
                        Bytes: 5878764535
                    Container: thanos-swift
                      Objects: 48
                        Bytes: 25856176
                    Container: updater
                      Objects: 2921
                        Bytes: 86172646251
                    Container: updater+segments
                      Objects: 552
                        Bytes: 2856364302
                    Container: updater-zbyszko
                      Objects: 31
                        Bytes: 1622577
                    Container: updater-zbyszko-v2
                      Objects: 36
                        Bytes: 120047231
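The audit makes the gap obvious: the rdf-streaming-updater-codfw+segments container alone holds essentially all of the reported usage.

```python
# The +segments container alone accounts for the dashboard's ~19 TB:
# convert its byte count from the audit above to TB and TiB.
segment_bytes = 19_043_034_820_255
tb = segment_bytes / 10**12   # decimal terabytes
tib = segment_bytes / 2**40   # binary tebibytes
```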

Per @fgiunchedi 's comment above, I started the delete task on cumin1001 again, this time targeting swift container rdf-streaming-updater-codfw+segments.
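A back-of-the-envelope ETA for this pass, assuming the ~200 deletes/minute observed earlier holds at the same concurrency (a rough sketch, not a measured projection):

```python
# Rough ETA: ~3.75M segment objects at the previously observed
# ~200 deletes/minute works out to roughly two weeks.
objects = 3_752_575        # object count from the +segments audit above
rate_per_min = 200         # delete throughput observed at concurrency 5
days = objects / rate_per_min / (60 * 24)
```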

bking set the point value for this task to 3. Sep 19 2022, 3:17 PM

Swiftly dies every few days due to 404s (a fairly common response from Swift when you ask it to delete an object that's already gone), but it picks back up again when I re-run the same command. I just tried a fordo against the wikidata/ pseudo-folder structure and it finished within a minute or two.
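Those 404s are harmless because delete is idempotent: the desired end state is simply that the object not exist. A hypothetical wrapper (names invented for illustration, not a swiftly or swiftclient API) shows the pattern of treating "already gone" as success:

```python
# Hypothetical sketch: treat a 404 ("object already gone") on delete
# as success rather than a fatal error.
class NotFound(Exception):
    """Stand-in for the client's 404 error."""

def delete_idempotent(delete_fn, name):
    """Run delete_fn(name); a NotFound means someone beat us to it."""
    try:
        delete_fn(name)
    except NotFound:
        return "already-gone"   # e.g. deleted by a concurrent worker
    return "deleted"
```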

I think that's good enough, so I'm closing this ticket. If anyone else disagrees, feel free to re-open and ping us again.

bking lowered the priority of this task from Medium to Lowest.

Re-opening so I can document the above swiftly commands on Wikitech.

@bking I see that the doc has been updated, can we move this ticket to the Needs reporting column?