Page MenuHomePhabricator

mediawiki-page-content-change-enrichment checkpoints should be stored in Swift
Closed, ResolvedPublic

Description

We have been assigned a Swift account to manage mediawiki-event-enrichment snapshots.

We should update the application helmfile to enable HA and checkpointing.

As part of https://phabricator.wikimedia.org/T330693, a set of credentials has been provided in private.git.

Success criteria

  • The application helmfile has been updated to source Swift auth credentials from private.git.
  • Checkpointing to Swift (S3 protocol) has been enabled.

Event Timeline

Change 919885 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] dse-k8s-services/mediawiki-page-content-change-enrichment - fix private helmfile values

https://gerrit.wikimedia.org/r/919885

Change 919885 merged by Ottomata:

[operations/deployment-charts@master] dse-k8s-services/mediawiki-page-content-change-enrichment - fix private helmfile values

https://gerrit.wikimedia.org/r/919885

Ottomata renamed this task from mediawiki-event-enrichment checkpoints should be stored in Swift to mediawiki-page-content-change-enrichment checkpoints should be stored in Swift.May 15 2023, 8:06 PM
Ottomata reassigned this task from Ottomata to gmodena.
Ottomata updated the task description. (Show Details)

Checkpointing to Swift (S3 protocol) has been enabled.

Here's a summary of a CR thread with @dcausse and @Ottomata wrt bucket layout:

s3://<namespace>-<k8s-cluster-name-env>/<release>/flink/

  • ha_storage
  • checkpoints
  • savepoints

<namespace> and <release> are relevant to Search, because they plan to re-use namespaces for multiple apps, and use different helmfile releases for those.

For our application on dse this would currently be:
s3://stream-enrichment-poc-dse-k8s-eqiad/main/flink.

To normalize naming with wikikube, IMHO we should rename/redo the DSE namespace and go for a bucket named:
s3://mediawiki-page-content-change-enrichment-dse-k8s-eqiad/main/flink.

@Ottomata unless there's no objections, I'd like to already stick with the latter naming. We could use a more generic namespace (mediawiki-event-enrichment), but I'd not change working assumption at this stage and stick to having a single application for namespace (till we learn more from @dcausse experience).

+1! And yes stick with the same name in DSE, even though the namespace is stream-enrichment-poc there. We'll (hopefully) be deleting that namespace shortly.

Change 920268 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/deployment-charts@master] mediawiki-page-content-change-enrichment: enable HA

https://gerrit.wikimedia.org/r/920268

s3://<namespace>-<k8s-cluster-name-env>/<release>/flink/

Let's go with flink <job_name> instead of <namespace>. In https://phabricator.wikimedia.org/T330507#8860655 I just ran into an issue where the helmfile/namespace name is too long for our WMF prod k8s/helm conventions, and in that case, let's use double underscores for a separator where we can't use /. Also, in Slack, we decided to drop <release> from the convention, because most of the time it will be redundant (we'll only have one release per namespace). If Search does do more than one release per namespace, they can override this convention as needed.

So:

s3://<flink_job_name>__<k8s-cluster-name-env>/flink/

s3://<namespace>-<k8s-cluster-name-env>/<release>/flink/

Let's go with flink <job_name> instead of <namespace>. In https://phabricator.wikimedia.org/T330507#8860655 I just ran into an issue where the helmfile/namespace name is too long for our WMF prod k8s/helm conventions, and in that case,

Ack.

For "standard" deployments (one job per namespace), IIRC the agreement was to use a consistent naming convention for namespace and job name. DSE was an odd situation because we picked namespace early on in development, and stuck with it.

Can normalise job name and namespace to mw-page-content-change-enrich? Or would using dashes for a job name break some system integration? I don't have strong feelings about _ vs -, but my assumption would be that changing job name is easier than namespace.

Should we also adopt the same naming convention (mw-page-content-change-enrich) also for Kafka's consumer group?

let's use double underscores for a separator where we can't use /.

[...]

s3://<flink_job_name>__<k8s-cluster-name-env>/flink/

Sounds good.

? I don't have strong feelings about _ vs -,

I have a semi strong feeling, mostly around the fact that - is often normalized to _ in both metrics in prometheus and in SQL systems. So I lean towards _ whenever possible (and find it a little annoying that we can't use _ in k8s stuff).

IIRC the agreement was to use a consistent naming convention for namespace and job name.
but my assumption would be that changing job name is easier than namespace.

Hm, indeed. Hm. That would also make the error topic be mw_page_content_change_enrich.error, which I find a wee annoying since it won't match up with mediawiki.page_change.v1, but perhaps that is good? You'll remember I did for a second consider prefixing with 'error' instead, so that 'mediawiki_page_content_change_enrichment_error' Hive table wouldn't be mixed up with 'mediawiki_page_content_change_v1'. Perhaps error Hive table named "mw_page_content_change_enrich_error" is fine.

Okay, I'm convinced, let's set job name to match the namespace...but with underscores instead(?) :)

Change 920268 merged by jenkins-bot:

[operations/deployment-charts@master] mw-page-content-change-enrich: enable HA and checkpointing

https://gerrit.wikimedia.org/r/920268

Change 922568 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] mw-page-content-change-enrich - disable ZK based HA

https://gerrit.wikimedia.org/r/922568

Change 922568 merged by jenkins-bot:

[operations/deployment-charts@master] mw-page-content-change-enrich - disable ZK based HA

https://gerrit.wikimedia.org/r/922568

Change 922570 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] Rename page content change enrich error stream to match convention

https://gerrit.wikimedia.org/r/922570

Change 922570 merged by jenkins-bot:

[operations/mediawiki-config@master] Rename page content change enrich error stream to match convention

https://gerrit.wikimedia.org/r/922570

Mentioned in SAL (#wikimedia-operations) [2023-05-23T16:31:40Z] <otto@deploy1002> Synchronized wmf-config/ext-EventStreamConfig.php: EventStreamConfig - Rename page content change enrich error stream to match convention - T336656 (duration: 06m 58s)

Change 922576 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] mw-page-content-change-enrich - use proper config for flink state.backend.type

https://gerrit.wikimedia.org/r/922576

Change 922576 merged by Ottomata:

[operations/deployment-charts@master] mw-page-content-change-enrich - use proper config for flink state.backend.type

https://gerrit.wikimedia.org/r/922576

@gmodena I think(?) I've deployed in dse-k8s-eqiad staging. HA has been disabled, but swift checkpointing should be enabled? I'm not entirely sure how to check though.

@gmodena I think(?) I've deployed in dse-k8s-eqiad staging. HA has been disabled, but swift checkpointing should be enabled? I'm not entirely sure how to check though.

Ack. There's info on listing containers content at
https://wikitech.wikimedia.org/wiki/Swift/How_To#list_the_contents_of_one_container.

Change 922595 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] mw-page-content-change-enrich - use bucket names created by T330693

https://gerrit.wikimedia.org/r/922595

Change 922595 merged by Ottomata:

[operations/deployment-charts@master] mw-page-content-change-enrich - use bucket names created by T330693

https://gerrit.wikimedia.org/r/922595

Change 922600 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] mw-page-content-change-enrich - set proper s3.access-key for swift

https://gerrit.wikimedia.org/r/922600

Change 922600 merged by Ottomata:

[operations/deployment-charts@master] mw-page-content-change-enrich - set proper s3.access-key for swift

https://gerrit.wikimedia.org/r/922600

Change 922831 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/deployment-charts@master] mw-page-content-change-enrich: revert checkpoint dir

https://gerrit.wikimedia.org/r/922831

I think I'm missing something. Is it possible to enable checkpoints without a "HA storage"? I always thought that it is an requirement because flink needs to store the path to the latest checkpoint somewhere (somewhere else than the checkpoint).

IIUC, checkpointing without HA allows the JobManager to restart the full job pipeline and TaskManagers with checkpoint state.

It also allows us to take savepoints when stopping the job. Without HA state (path to latest checkpoint), we have to manually intervene to restart the job: take a savepoint, record the savepoint location, then when restarting, restart from that savepoint.

I think this is what is done for the rdf-streaming-updater in the Flink session cluster now. @dcausse is ^^ correct?

We haven't yet figured out what we will do for T331283: [Event Platform] [NEEDS GROOMING] Store Flink HA metadata in Zookeeper since we need a newer ZK. For HA for now, we might want to use k8s ConfigMaps. This would allow us to restart the job without the savepoint dance, but since the ConfigMap is ephemeral, we'd need to do the savepoint dance during k8s cluster restarts.

Anyway, one thing at a time I guess.

Change 922831 merged by jenkins-bot:

[operations/deployment-charts@master] mw-page-content-change-enrich: enable checkpointing

https://gerrit.wikimedia.org/r/922831