The WDQS Streaming Updater should use S3 to access thanos-swift instead of the native swift protocol
Closed, Resolved · Public · 3 Estimated Story Points

Description

Followup of T302396.

The thanos-swift cluster is S3-compatible, so we should use the S3 protocol instead of the native swift client. We customized that client to implement tempauth, and it has since been removed from the official flink distribution: https://issues.apache.org/jira/browse/FLINK-21819.

Migration plan:

  • Preflight checks: test that S3 actually fixes T302396
    • deploy a new image with both S3 and swift enabled to codfw
    • save a savepoint to S3 from the updater running in yarn, then stop the updater (requires restarting this session cluster with S3 enabled)
    • start the application from this S3 savepoint
  • Migrate jobs from swift to S3
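
The savepoint and restart steps of the preflight check could be sketched as follows. This is only an illustration, not the commands actually used for this task: it builds argument lists for the standard flink CLI (`flink stop --savepointPath` and `flink run --fromSavepoint`) without executing them. The bucket name, savepoint paths, job ID and jar path are all hypothetical placeholders, and it assumes savepoints are addressed with the `s3://` scheme.

```python
import shlex
from typing import List

S3_SCHEME = "s3://"  # assumption: savepoints are addressed with the s3:// scheme


def _require_s3(path: str) -> None:
    """Reject paths that would still go through the native swift client."""
    if not path.startswith(S3_SCHEME):
        raise ValueError(f"expected an {S3_SCHEME} path, got: {path}")


def stop_with_savepoint_cmd(job_id: str, savepoint_dir: str) -> List[str]:
    """Arguments for `flink stop`: take a savepoint under savepoint_dir, then stop the job."""
    _require_s3(savepoint_dir)
    return ["flink", "stop", "--savepointPath", savepoint_dir, job_id]


def run_from_savepoint_cmd(savepoint_path: str, jar_path: str) -> List[str]:
    """Arguments for `flink run`: start the job detached from an existing savepoint."""
    _require_s3(savepoint_path)
    return ["flink", "run", "--fromSavepoint", savepoint_path, "--detached", jar_path]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    stop = stop_with_savepoint_cmd("0123456789abcdef", "s3://example-bucket/savepoints")
    start = run_from_savepoint_cmd(
        "s3://example-bucket/savepoints/savepoint-example", "streaming-updater.jar"
    )
    print(shlex.join(stop))
    print(shlex.join(start))
```

Building the commands as lists keeps the sketch free of side effects. The scheme check is a cheap guard against accidentally pointing a job back at a `swift://` path during the migration.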

AC:

  • W[DC]QS Streaming Updater is using thanos-swift through the s3 protocol

Event Timeline

Change 766072 had a related patch set uploaded (by ZPapierski; author: ZPapierski):

[wikidata/query/flink-rdf-streaming-updater@master] Replace Swift tempauth with S3

https://gerrit.wikimedia.org/r/766072

Change 766123 had a related patch set uploaded (by ZPapierski; author: ZPapierski):

[operations/deployment-charts@master] Replace Swift native API with S3

https://gerrit.wikimedia.org/r/766123

Gehel triaged this task as High priority. Feb 28 2022, 4:34 PM
Gehel moved this task from Incoming to Current work on the Wikidata-Query-Service board.

Change 769075 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/deployment-charts@master] flink-session-cluster: add thanos S3 config

https://gerrit.wikimedia.org/r/769075

Change 766072 merged by jenkins-bot:

[wikidata/query/flink-rdf-streaming-updater@master] Add S3 support with s3-presto

https://gerrit.wikimedia.org/r/766072

Change 769075 merged by jenkins-bot:

[operations/deployment-charts@master] flink-session-cluster: add thanos S3 config

https://gerrit.wikimedia.org/r/769075

Change 769697 had a related patch set uploaded (by DCausse; author: DCausse):

[wikidata/query/deploy@master] Switch to S3 scheme by default instead of swift

https://gerrit.wikimedia.org/r/769697

Change 769699 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/deployment-charts@master] flink-session-cluster: fix swift API key for s3

https://gerrit.wikimedia.org/r/769699

Change 769699 merged by jenkins-bot:

[operations/deployment-charts@master] flink-session-cluster: fix swift API key for s3

https://gerrit.wikimedia.org/r/769699

Change 769697 merged by DCausse:

[wikidata/query/deploy@master] Switch to S3 scheme by default instead of swift

https://gerrit.wikimedia.org/r/769697

Change 770508 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] [wdqs] adapt updateQueryServiceLag...

https://gerrit.wikimedia.org/r/770508

Change 770508 merged by Bking:

[operations/puppet@production] [wdqs] adapt updateQueryServiceLag...

https://gerrit.wikimedia.org/r/770508

Mentioned in SAL (#wikimedia-operations) [2022-03-14T21:38:30Z] <inflatador> T302494 bking@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=wdqs-internal,name=codfw

Mentioned in SAL (#wikimedia-operations) [2022-03-14T22:03:20Z] <inflatador> T302494 bking@puppetmaster1001 depooling eqiad in DNS-discovery for wdqs and wdqs-internal services

Per messages above, we have completely failed over the wdqs and wdqs-internal services from eqiad to codfw.

Mentioned in SAL (#wikimedia-operations) [2022-03-28T14:46:59Z] <inflatador> 'bking@cumin1001 repooling wdqs services in IAD ref T302494'