
mw-page-content-change-enrich should enable HA with k8s ConfigMaps
Closed, ResolvedPublic

Description

We'd much prefer to do T331283: [Event Platform] Store Flink HA metadata in Zookeeper, but until we have Zookeeper clusters with a newer version, we won't be able to.

As a stop gap, we should enable HA using k8s ConfigMaps, as the Search team does for rdf-streaming-udpater. Let's verify this with Search and SRE ServiceOps, but this should be better than running without HA.

  • Enable Flink JobManager HA with state stored in ConfigMaps
  • Documentation on how to redeploy jobs from previous savepoints
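
For reference, Flink's native Kubernetes HA mode is switched on through a handful of configuration options. A minimal sketch follows; the storage path and cluster id are illustrative placeholders, not the actual deployment values:

```yaml
# Sketch of the Flink configuration that enables Kubernetes HA.
# All values below are illustrative placeholders.
high-availability: kubernetes
high-availability.storageDir: s3://example-bucket/flink-ha   # durable storage for HA metadata blobs
kubernetes.cluster-id: mw-page-content-change-enrich         # HA ConfigMap names are derived from this id
```

With this in place, JobManager leader election and job metadata pointers live in ConfigMaps in the app's namespace, while the actual checkpoint data goes to the storage directory.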

Event Timeline

Restricted Application added a subscriber: Aklapper.
Ottomata moved this task from Backlog to Sprint 14 B on the Event-Platform board.
Ottomata edited projects, added Event-Platform (Sprint 14 B); removed Event-Platform.

Let's verify this with Search and SRE ServiceOps

@JMeybohm @dcausse we'd like to pick this up in sprint 14B. Would you have any concern about this approach?

@gmodena no concerns from my side. The main question I'd have is: when do we consider the flink-app chart ready enough to switch the WDQS updater to it and remove the flink-session-cluster deployment model?

The main question I'd have is: when do we consider the flink-app chart ready enough to switch the WDQS updater to it and remove the flink-session-cluster deployment model?

The testing we've done so far is limited to Python jobs, but we have not encountered any blockers. I can't guarantee everything will go smoothly with Scala, though. Would you like to coordinate some SPIKE work to explore feasibility once we have gained some hands-on experience with HA? cc / @lbowmaker

The main question I'd have is: when do we consider the flink-app chart ready enough to switch the WDQS updater to it and remove the flink-session-cluster deployment model?

I'd consider it ready enough!

mw-page-content-change-enrich has been deployed with Kubernetes HA (ConfigMaps) on staging. So far so good during routine application restarts.
We should test the full sequence (undeploy, save the ConfigMaps, delete the namespace, restore the ConfigMaps, re-deploy) before moving forward to main.
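
The save/restore part of that sequence could look roughly like the following, assuming kubectl access. The namespace and file name are illustrative; Flink labels its HA ConfigMaps with `configmap-type=high-availability`:

```shell
# Save the HA ConfigMaps before tearing the namespace down (names are examples).
kubectl -n mw-page-content-change-enrich get configmaps \
  -l 'configmap-type=high-availability' -o yaml > flink-ha-configmaps.yaml

# ... undeploy the app, delete and recreate the namespace ...

# Restore the ConfigMaps before re-deploying. Server-generated fields
# (resourceVersion, uid) may need to be stripped from the saved YAML first.
kubectl -n mw-page-content-change-enrich apply -f flink-ha-configmaps.yaml
```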

Let's verify this with Search and SRE ServiceOps

@JMeybohm @dcausse we'd like to pick this up in sprint 14B. Would you have any concern about this approach?

I was on PTO, sorry. No concerns or objections from my side and I see you already started working on it, nice!

@JMeybohm the changes have been rolled out to staging and tests are fine. We are ready to merge - and apply - changes for codfw / eqiad main. How and where would you like us to document the need to save / restore ConfigMaps?

I'd assume this is not specific to mw-page-content-change-enrich but rather generic to all Flink apps deployed using the operator, right? I would suggest putting that somewhere next to the flink-operator docs on Wikitech and probably linking it from the mw-page-content-change-enrich page.

I'd assume this is not specific to mw-page-content-change-enrich but rather generic to all Flink apps deployed using the operator, right?

You are correct. Would the following doc be enough (cc / @Ottomata @dcausse)?
https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Processing/Flink#If_using_k8s_ConfigMaps_to_store_HA_state

@Ottomata we don't need to depool mw enrichment jobs, right?

MW enrichment runs active/active single compute, and there are no downstream applications to 'depool'.

If the mw enrichment jobs are off in the active MW DC for a very long time, we will have issues. If they are off in the active MW DC for a short amount of time, we will just have late events. If they are off in the inactive MW DC, there shouldn't be any real issues (unless somehow page changes are processed by MW in the inactive DC).

@Ottomata ack. Just wanted to validate and have it documented. I added a comment to wikitech.

My understanding from the SLO of mw-page-content-change-enrich was that re-deploying the application (e.g. losing the ConfigMap/HA state) is okay. The wikitech doc now says that we (as in ServiceOps) are required to save and restore the state.

My understanding from the SLO of mw-page-content-change-enrich was that re-deploying the application (e.g. losing the ConfigMap/HA state) is okay.

It's ok for staging, but for prod deployments we would need HA state. Its implementation is the goal of this task.

We hoped ZooKeeper would have been an option, but unfortunately Flink HA requires a more recent version than what we have deployed. Upgrading ZooKeeper is not straightforward either, because the version required by Flink is not currently packaged in Debian.

The wikitech doc now says that we (as in service ops) are required to save and restore the state.

This would only be required after cluster updates (like for WDQS), not for regular application lifecycles.

The wikitech doc now says that we (as in service ops) are required to save and restore the state.

This would only be required after cluster updates (like for WDQS), not for regular application lifecycles.

By cluster updates, you mean kubernetes cluster, right? My understanding is that this is only needed if the ConfigMap will be lost, which happens when ServiceOps does k8s upgrades. Right?

By cluster updates, you mean kubernetes cluster, right?

Correct; k8s clusters. Flink (cluster) updates should not require manual intervention to save/restore ConfigMaps.

It's ok for staging, but for prod deployments we would need HA state

To clarify: we should be able to recover even if the ConfigMaps are wiped and not restored. In that case we would restart consuming from the last offset committed to Kafka (for the given consumer group). There _should_ be no data loss, but we can expect duplicate messages to be produced.

Having HA state preserved between restarts would be cleaner.
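
The committed offsets that recovery falls back on can be inspected from the Kafka side with the stock kafka-consumer-groups CLI; the broker host and group name below are illustrative assumptions, not the actual production values:

```shell
# Show committed offsets and lag for the enrichment job's consumer group.
# Broker and group names are examples only.
kafka-consumer-groups.sh --bootstrap-server kafka.example.org:9092 \
  --describe --group mw-page-content-change-enrich
```

A restart without HA state resumes from these committed offsets, i.e. at-least-once delivery: everything since the last checkpointed offset is reprocessed, hence the expected duplicates.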

Change 933914 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/deployment-charts@master] page-content-change: fix error sink stream name.

https://gerrit.wikimedia.org/r/933914

Change 933914 merged by jenkins-bot:

[operations/deployment-charts@master] page-content-change: fix error sink stream name.

https://gerrit.wikimedia.org/r/933914