
Provision Zookeeper Cluster for storing Flink HA data
Closed, Resolved | Public | 5 Estimated Story Points

Description

Per T331283 and related tickets, Data Engineering (in consultation with Event Platform and Service Ops) has settled on Zookeeper to implement Flink HA, as described on Flink's website.

AC:

  • Puppet code committed and working
  • Alerting/dashboards created and working

Event Timeline


Change 938000 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] zookeeper: prepare for new zk cluster

https://gerrit.wikimedia.org/r/938000

Mentioned in SAL (#wikimedia-operations) [2023-07-13T20:59:43Z] <inflatador> bking@cumin1001 'disable puppet on hosts using zookeeper class T341792'

Change 938000 merged by Bking:

[operations/puppet@production] zookeeper: prepare for new zk cluster

https://gerrit.wikimedia.org/r/938000

Gehel triaged this task as High priority. Jul 17 2023, 3:30 PM
Gehel moved this task from Incoming to Ready for Work on the Data-Platform-SRE board.
Gehel moved this task from needs triage to Current work on the Discovery-Search board.
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.
Gehel set the point value for this task to 5. Jul 17 2023, 3:51 PM

Change 940243 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] flink-zk: Initiate new flink::zookeeper role

https://gerrit.wikimedia.org/r/940243

Change 940243 merged by Bking:

[operations/puppet@production] flink-zk: Initiate new flink::zookeeper role

https://gerrit.wikimedia.org/r/940243

Change 942428 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] flink-zk: use correct variable for firewall defs

https://gerrit.wikimedia.org/r/942428

Change 942428 merged by Bking:

[operations/puppet@production] flink-zk: use correct variable for firewall defs

https://gerrit.wikimedia.org/r/942428

The cluster is up and all nodes appear to have joined correctly; my compliments to whoever wrote the puppet code.
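One way to double-check that every node has joined is to ask each server for its role with ZooKeeper's `srvr` four-letter command and confirm there is exactly one leader. The following is a minimal sketch, not the check that was actually run here: it assumes the default client port 2181 and that `srvr` is permitted by the servers' four-letter-word whitelist. The helper names are hypothetical.

```python
import socket
from typing import List, Optional


def parse_mode(srvr_output: str) -> Optional[str]:
    """Return the value of the 'Mode:' line from 'srvr' output, if present."""
    for line in srvr_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Mode":
            return value.strip()
    return None


def query_mode(host: str, port: int = 2181, timeout: float = 5.0) -> Optional[str]:
    """Send 'srvr' to one server and return the mode it reports.

    Port 2181 is the ZooKeeper default and an assumption for this cluster.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"srvr")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return parse_mode(b"".join(chunks).decode("utf-8", errors="replace"))


def ensemble_looks_healthy(modes: List[Optional[str]]) -> bool:
    """True when exactly one member is the leader and the rest are followers."""
    return modes.count("leader") == 1 and all(
        mode in ("leader", "follower") for mode in modes
    )
```

Calling `query_mode` once per cluster member and passing the results to `ensemble_looks_healthy` gives a quick yes/no answer. A member that reports `standalone`, or no mode at all, has not joined the ensemble.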

The next step is to get metrics from the flink-zk cluster into the Zookeeper dashboard.

I've confirmed the zookeeper exporter is up and listening on 12181. I'm going to check the software firewall rules next.
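For reference, a check like this can be scripted as a plain TCP connection test. This is only a sketch: the function name is hypothetical, and the host to probe is left to the caller. Port 12181 is the exporter port mentioned above.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False
```

Running `is_listening("localhost", 12181)` on the node itself only shows that the exporter is bound to the port. Running it from the machine that scrapes the metrics also exercises the firewall rules in between.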

Change 942457 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] flink-zk: allow analytics network

https://gerrit.wikimedia.org/r/942457

Change 942494 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] flink-zk: Enable prometheus scrapes

https://gerrit.wikimedia.org/r/942494

Change 942457 abandoned by Bking:

[operations/puppet@production] flink-zk: allow analytics network

Reason:

Confirmed that firewall is not the issue.

https://gerrit.wikimedia.org/r/942457

Change 942494 merged by Bking:

[operations/puppet@production] flink-zk: Enable prometheus scrapes

https://gerrit.wikimedia.org/r/942494

Change 945640 had a related patch set uploaded (by Bking; author: Bking):

[operations/alerts@master] prometheus-analytics: create alerts for new ZK cluster

https://gerrit.wikimedia.org/r/945640

Per today's Data Platform meeting, Ben provided an example of existing zookeeper alerts; we should be able to use this for inspiration.

It might be the case that the alerts are already in place, since we have the metrics (for the eqiad cluster) in Grafana now.

You can also see that flink-eqiad appears here.

What about stopping a couple of the zookeeper servers and seeing whether an alert is raised? We might want to announce the test first :-)
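When planning such a test, it helps to know how many servers can be stopped before the ensemble itself loses quorum, since ZooKeeper needs a strict majority of members to stay available. A small sketch of that arithmetic follows. The function names are hypothetical, and the three-member ensemble in the usage note is an assumption based on the three codfw hosts named in this task.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size: int, servers_down: int) -> bool:
    """True if the servers still running form a majority."""
    return ensemble_size - servers_down >= quorum_size(ensemble_size)
```

For a three-member ensemble, `tolerated_failures(3)` is 1 and `has_quorum(3, 2)` is `False`. Stopping one server tests the alert while the cluster keeps serving, whereas stopping two takes the cluster down.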

I suppose that it might also depend a bit on how far we should go with T342578: Ensure Data Platform SREs have a contact group in puppet/alerting

Perhaps we want to have this zookeeper cluster notify ourselves, rather than the core SRE team.

Oh! I see that you've already done this in https://gerrit.wikimedia.org/r/c/operations/alerts/+/945640 but we haven't onboarded team-data-platform-sre to Alertmanager yet, have we?


Good catch, I updated the AC of that ticket and will work on it as time permits.

Change 948615 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] flink-zk: add cluster info for codfw

https://gerrit.wikimedia.org/r/948615

Change 948615 merged by Bking:

[operations/puppet@production] flink-zk: add cluster info for codfw

https://gerrit.wikimedia.org/r/948615

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk2001.codfw.wmnet with OS bookworm executed with errors:

  • flink-zk2001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk2002.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk2002.codfw.wmnet with OS bookworm executed with errors:

  • flink-zk2002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk2002.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk2002.codfw.wmnet with OS bookworm executed with errors:

  • flink-zk2002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk2003.codfw.wmnet with OS bookworm executed with errors:

  • flink-zk2003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

cookbooks.sre.hosts.decommission executed by bking@cumin1001 for hosts: flink-zk2002.codfw.wmnet

  • flink-zk2002.codfw.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox

Mentioned in SAL (#wikimedia-operations) [2023-08-18T17:18:10Z] <inflatador> bking@cumin1001 temporarily enabling alerts for flink-zk hosts to see if they work T341792

Mentioned in SAL (#wikimedia-operations) [2023-08-18T17:25:06Z] <inflatador> bking@ganeti1024 shutting off flink-zk1001 to check alerting T341792

To check alerting, I removed suppressions and shut off flink-zk1001 via the ganeti master. I saw flink-zk1001 turn red in Icinga, failing all of its checks, but I didn't see any alerts appear in AlertManager.

That suggests to me we're missing some important config in Puppet and/or the alerts repo. Will re-read Wikidata docs on alerting today and escalate Monday if I can't make progress.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk2001.codfw.wmnet with OS bookworm executed with errors:

  • flink-zk2001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

cookbooks.sre.hosts.decommission executed by bking@cumin1001 for hosts: flink-zk2001.codfw.wmnet

  • flink-zk2001.codfw.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox

cookbooks.sre.hosts.decommission executed by bking@cumin1001 for hosts: flink-zk2003.codfw.wmnet

  • flink-zk2003.codfw.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox

Change 954134 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] flink-zk: Move codfw hosts back to insetup

https://gerrit.wikimedia.org/r/954134

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk2001.codfw.wmnet with OS bookworm executed with errors:

  • flink-zk2001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Change 954134 merged by Bking:

[operations/puppet@production] flink-zk: Move codfw hosts back to insetup

https://gerrit.wikimedia.org/r/954134

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk2001.codfw.wmnet with OS bookworm executed with errors:

  • flink-zk2001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

cookbooks.sre.hosts.decommission executed by bking@cumin1001 for hosts: flink-zk2001.codfw.wmnet

  • flink-zk2001.codfw.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk2001.codfw.wmnet with OS bookworm executed with errors:

  • flink-zk2001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk2002.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk2002.codfw.wmnet with OS bookworm completed:

  • flink-zk2002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309062042_bking_2324207_flink-zk2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk2003.codfw.wmnet with OS bookworm completed:

  • flink-zk2003 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309062156_bking_2505065_flink-zk2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB


OK, let's look at this again together, because I was under the impression that this check should fail with one zookeeper server down:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-sre/zookeeper.yaml#19

Change 958991 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the analytics and search-platform teams to flink zk contacts

https://gerrit.wikimedia.org/r/958991

Change 958991 merged by Btullis:

[operations/puppet@production] Add the analytics and search-platform teams to flink zk contacts

https://gerrit.wikimedia.org/r/958991

bking moved this task from In Progress to Done on the Data-Platform-SRE board.

The patch above ensures that Data Platform SREs will be alerted if there's a problem with the flink-zk cluster. Thus, I'm happy to close this one out. Thanks to everyone who contributed to this effort.


Change 958991 was reverted because the search-platform Icinga contactgroup does not exist, which broke the Icinga config: https://gerrit.wikimedia.org/r/c/operations/puppet/+/959015

Change 962660 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] flink: Add correct contactgroups

https://gerrit.wikimedia.org/r/962660

Change 962660 merged by Bking:

[operations/puppet@production] flink: Add correct contactgroups

https://gerrit.wikimedia.org/r/962660

Change 945640 abandoned by Bking:

[operations/alerts@master] prometheus-analytics: create alerts for new ZK cluster

Reason:

WIP patch is blocking this repo's gitlab migration

https://gerrit.wikimedia.org/r/945640