wdqs space usage on thanos-swift
Closed, Resolved · Public · 8 Estimated Story Points

Description

It looks like wdqs more than tripled its storage space usage in the span of 10 days (from ~6T to ~21T). Is this expected? We should rein in its disk usage or risk running out of disk space on the whole thanos-swift cluster.

see also the account's space usage: https://thanos.wikimedia.org/graph?g0.expr=swift_account_stats_bytes_total%7Baccount%3D%22AUTH_wdqs%22%7D&g0.tab=0&g0.stacked=0&g0.range_input=2w&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D

Event Timeline

Gehel triaged this task as High priority.Aug 9 2022, 7:49 AM

Hi, I don't know much about this, but I did a little bit of digging.

I can see that the flink session cluster jobmanager is taking checkpoints every few seconds, for each of the jobs it is running:

[@deploy1002:/srv/deployment-charts/helmfile.d/services/rdf-streaming-updater] (master+)[64d48331] ± kubectl logs -l component=jobmanager -c flink-session-cluster-main -f

{"@timestamp":"2022-08-09T13:02:12.987Z", "log.level": "INFO", "message":"Completed checkpoint 889862 for job ca62db1723c0dd82afa1fd846c9923bc (447282465 bytes in 11509 ms).", "ecs.version": "1.2.0","service.name":"main","event.dataset":"main.log","process.thread.name":"jobmanager-future-thread-1","log.logger":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator"}
{"@timestamp":"2022-08-09T13:02:21.485Z", "log.level": "INFO", "message":"Triggering checkpoint 592244 (type=CHECKPOINT) @ 1660050141460 for job 0cd81dd86866a8575cc187d92c98eb49.", "ecs.version": "1.2.0","service.name":"main","event.dataset":"main.log","process.thread.name":"Checkpoint Timer","log.logger":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator"}

...

I can also see that the taskmanagers seem to be re-instantiating and connecting a new Kafka producer fairly frequently:

[@deploy1002:/srv/deployment-charts/helmfile.d/services/rdf-streaming-updater] (master+)[64d48331] ± kubectl logs -l component=taskmanager -c flink-session-cluster-main-taskmanager -f --since=48h --max-log-requests=6 | jq .

However, I don't know if either of those behaviors is abnormal. Taking checkpoints every few seconds seems like a lot, and I'd expect Kafka producers to stay alive unless there is some problem, but I don't know for sure. We'll need @dcausse here, I think.
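
If it helps future digging, the configured checkpoint interval can be read back from the Flink REST API. A minimal sketch, assuming the jobmanager serves the standard REST port 8081 and has curl available in the container (the pod lookup is illustrative; the job id is copied from the log line above):

JM_POD=$(kubectl get pods -l component=jobmanager -o jsonpath='{.items[0].metadata.name}')
# /checkpoints/config returns the checkpointing mode, interval, timeout, etc. for that job
kubectl exec "$JM_POD" -c flink-session-cluster-main -- \
  curl -s localhost:8081/jobs/ca62db1723c0dd82afa1fd846c9923bc/checkpoints/config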

root@thanos-fe1001:/home/elukey# source /etc/swift/account_AUTH_wdqs.env
root@thanos-fe1001:/home/elukey# swift list 
rdf-streaming-updater-codfw
rdf-streaming-updater-codfw+segments
rdf-streaming-updater-eqiad
rdf-streaming-updater-eqiad+segments
rdf-streaming-updater-staging
thanos-swift
updater
updater+segments
updater-zbyszko
updater-zbyszko-v2

root@thanos-fe1001:/home/elukey# swift stat rdf-streaming-updater-codfw | egrep 'Objects|Bytes'
                      Objects: 854053
                        Bytes: 2383274480337
root@thanos-fe1001:/home/elukey# swift stat rdf-streaming-updater-eqiad | egrep 'Objects|Bytes'
                      Objects: 2605
                        Bytes: 60359444619
root@thanos-fe1001:/home/elukey# swift stat rdf-streaming-updater-codfw+segments | egrep 'Objects|Bytes'
                      Objects: 3742134
                        Bytes: 18993738619543
root@thanos-fe1001:/home/elukey# swift stat rdf-streaming-updater-eqiad+segments | egrep 'Objects|Bytes'
                      Objects: 10275
                        Bytes: 40649663164

So the rdf-streaming-updater-codfw+segments bucket/container seems to be storing around 18T, if I made the conversion correctly, across ~4M objects.
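
As a quick sanity check on that conversion (the byte count is the one reported by swift stat above; numfmt is from GNU coreutils):

numfmt --to=iec 18993738619543   # roughly 18T in binary units (TiB)
numfmt --to=si  18993738619543   # roughly 19T in decimal units (TB)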

root@thanos-fe1001:/home/elukey# swift list rdf-streaming-updater-codfw+segments | head
commons/checkpoints/XXXXXXXX/shared/XXXXXXX/1 (redacted just in case)

It seems to contain a ton of checkpoints.
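
To get a rough idea of which checkpoint directories dominate, the listing could be grouped by its first two path components, along these lines (a sketch; the grouping depth is an assumption based on the redacted path above):

swift list rdf-streaming-updater-codfw+segments \
  | awk -F/ '{count[$1"/"$2]++} END {for (p in count) print count[p], p}' \
  | sort -rn | head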

It seems (still not 100% sure yet, but I'm seeing a lot of failures related to this) that the repeated failures are caused by the bad swift client we are still using for the flink ha storage; we stopped using this client for the job states (T302494) but we haven't fully removed it from the image (T304914).
The current immediate plan is as follows:

Change 821768 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/deployment-charts@master] rdf-streaming-updater: use the S3 client for flink ha

https://gerrit.wikimedia.org/r/821768

Current status:

  • all flink jobs are stopped in codfw
  • wdqs traffic is routed to eqiad
  • wikidata maxlag is only checking eqiad
  • the rdf-streaming-updater namespace in k8s@codfw has been wiped out in preparation for the deployment of https://gerrit.wikimedia.org/r/821768
  • the swift cleanup is in progress but is having issues (see below)

I'm having difficulties doing a mass cleanup. The swift client is failing with:
Error Deleting: rdf-streaming-updater-codfw/commons/checkpoints/1475a2038f088807f9d695aea3e1c7e3/shared/4d1afbc4-2413-420c-9f60-4be2f4790445: ('Connection broken: IncompleteRead(6 bytes read)', IncompleteRead(6 bytes read))
when I use the command:
swift -A https://thanos-swift.svc.eqiad.wmnet/auth/v1.0 -U wdqs:flink -K PASS delete rdf-streaming-updater-codfw --prefix commons/checkpoints/1475a2038f088807f9d695aea3e1c7e3

I'm now using the list command with -p path_to_delete piped to xargs in a loop, but it might take a while as well.
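
For the record, that workaround looks roughly like this (a sketch of the list + xargs approach; the batch size is arbitrary and the prefix is the one from the failing delete above):

swift -A https://thanos-swift.svc.eqiad.wmnet/auth/v1.0 -U wdqs:flink -K PASS \
    list rdf-streaming-updater-codfw -p commons/checkpoints/1475a2038f088807f9d695aea3e1c7e3 \
  | xargs -r -n 500 \
      swift -A https://thanos-swift.svc.eqiad.wmnet/auth/v1.0 -U wdqs:flink -K PASS \
      delete rdf-streaming-updater-codfw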

Next steps:

  • once the flink_ha_storage folder is empty in the container rdf-streaming-updater-codfw: resume the jobs with https://gerrit.wikimedia.org/r/821768 deployed
  • verify that it fixes the issue

Regarding the cleanup of the old checkpoints, we might need to find a way to do this more efficiently; we could possibly (if this is even possible) drop the container altogether, after making a copy of the last savepoints to HDFS.
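
If we do go the route of dropping the container, copying the last savepoints out first could look roughly like this (a sketch; the savepoint prefix and the HDFS destination are assumptions, not actual paths):

# pull the savepoints down locally, then push them to HDFS
swift download rdf-streaming-updater-codfw --prefix wikidata/savepoints --output-dir /srv/tmp/savepoints
hdfs dfs -mkdir -p /wmf/data/rdf-streaming-updater/savepoints-backup
hdfs dfs -put /srv/tmp/savepoints/* /wmf/data/rdf-streaming-updater/savepoints-backup/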

Change 821788 had a related patch set uploaded (by DCausse; author: DCausse):

[wikidata/query/deploy@master] Use temporary rdf-streaming-updater-codfw-T314835 swift container

https://gerrit.wikimedia.org/r/821788

Change 821768 merged by jenkins-bot:

[operations/deployment-charts@master] rdf-streaming-updater: use the S3 client for flink ha

https://gerrit.wikimedia.org/r/821768

Unfortunately I could not finish the cleanup of the flink_ha_storage folder in order to properly resume operations from k8s.
I resumed the jobs from YARN using the same swift container rdf-streaming-updater-codfw (I had tried to resume the jobs from a fresh container, rdf-streaming-updater-codfw-T314835, hoping that it would ease the cleanup, but for some reason it did not work).

Concerning the cleanup, everything that is not under the path T314835 can be deleted; the top-level dirs should be (see the sketch after this list):

  • flink_ha_storage
  • wikidata
  • commons
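
A sketch of that cleanup, iterating over the three top-level prefixes (credentials as in the earlier commands; illustrative rather than a tested one-liner):

for prefix in flink_ha_storage wikidata commons; do
  swift -A https://thanos-swift.svc.eqiad.wmnet/auth/v1.0 -U wdqs:flink -K PASS \
    delete rdf-streaming-updater-codfw --prefix "$prefix/"
done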

Relatedly, I'm getting random failures when accessing swift from codfw using the swift client, e.g.:

dcausse@search-loader2001:~$ swift -A https://thanos-swift.discovery.wmnet/auth/v1.0 -U wdqs:flink -K PASS list rdf-streaming-updater-codfw -p T314835
Container GET failed: https://thanos-swift.discovery.wmnet/v1/AUTH_wdqs/rdf-streaming-updater-codfw?format=json&prefix=T314835 401 Unauthorized  [first 60 chars of response] b'<html><h1>Unauthorized</h1><p>This server could not verify t'
Failed Transaction ID: tx0e1d7d9152644a15a0523-0062f2cd3d

Thank you @dcausse for diving deep into this issue and mitigating it! I can confirm that the space has stopped growing at the same rate (i.e. not growing ATM).

I can confirm that I've seen the same failures from the swift client when doing mass deletes, though I'm not sure why. I am also looking into the auth failures in codfw and can confirm those too, only on thanos-fe2001 though! I have depooled that host as a precaution for now.

I noticed this independently when trying to delete big Tegola containers in T307184: Followups for Tegola and Swift interactions. While some deletes time out, the swift bulk delete continues with the remaining files; I think once the first pass of deletes is done, it should be sufficient to repeat the command as many times as needed.
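
In other words, something along these lines (a sketch; it assumes the client exits non-zero when some deletes fail, so the loop keeps retrying until a pass completes cleanly):

until swift -A https://thanos-swift.svc.eqiad.wmnet/auth/v1.0 -U wdqs:flink -K PASS \
      delete rdf-streaming-updater-codfw --prefix commons/checkpoints/; do
  echo "some deletes failed, retrying..." >&2
  sleep 60
done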

I have mitigated the auth failures for now (permanent fix in https://phabricator.wikimedia.org/T314914)

I've sometimes seen auth failures with swift-ring-manager on thanos too, anecdotally associated with high load, but AFAICS there's never anything useful logged by swift :-/

The 3 tasks above should be the followups of this incident.
The root cause of the incident is, I think, a mix of the poor swift client used by the flink H/A component and possibly the instability of thanos-fe2001, which exacerbated the poor behaviors of this swift client.
The checkpoints were stored properly, but the flink H/A component was not able to fully acknowledge that each checkpoint had succeeded. The job is configured to keep only the last successful checkpoints, but the H/A issue caused this cleanup to fail, so old checkpoints were never removed.
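
For context, the retention and storage locations involved are controlled by standard Flink options along these lines; the key names are real Flink configuration options, but the values below are illustrative assumptions (written as a shell heredoc to match the other snippets here):

cat >> flink-conf.yaml <<'EOF'
# keep only the most recent successful checkpoint (illustrative value)
state.checkpoints.num-retained: 1
# where completed checkpoints and HA metadata are written (illustrative URIs)
state.checkpoints.dir: s3://rdf-streaming-updater-codfw/commons/checkpoints
high-availability.storageDir: s3://rdf-streaming-updater-codfw/flink_ha_storage
EOF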

Moving forward we will:

  • stop using the presto-swift client in favor of an S3 connector
  • clean up the rdf-streaming-updater-codfw container
  • monitor and alert on the space usage of these containers (if there's also a way to implement a quota per container, I'd be in favor of doing so; see the sketch after this list)
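
On the quota point: Swift supports per-container quotas via the container_quotas middleware, so if that is enabled on thanos-swift (an assumption on my part), capping a container would look like this, with an illustrative value:

# sets X-Container-Meta-Quota-Bytes on the container (~5 TB here, purely illustrative)
swift post rdf-streaming-updater-codfw -m 'quota-bytes:5000000000000'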

Unless I missed something, or we want to keep tracking some remaining work here, I believe we can close this task.

Thank you @dcausse for the write up and action items, all looks good to me!

Change 821788 abandoned by DCausse:

[wikidata/query/deploy@master] Use temporary rdf-streaming-updater-codfw-T314835 swift container

Reason:

we're going to use the same container

https://gerrit.wikimedia.org/r/821788

@dcausse Are these action items filed into appropriate places such that this ticket, which seems "finished", can be closed?

dcausse claimed this task.

@BCornwall yes, this ticket can be closed, remaining work is tracked here: