
Eq: 5 VM request for kafka-test-eqiad cluster
Closed, Resolved, Public

Description

Cloud VPS Project Tested: deployment-prep
Site/Location: EQIAD
Number of systems: 5
Service: kafka-test-eqiad
Networking Requirements: internal IP
Processor Requirements: 4 vcpu
Memory: 8GB
Disks: 100GB
Other Requirements: N/A

See https://phabricator.wikimedia.org/T268074#6630570 for discussion. If this is too much for Ganeti, we can use some spare physical hosts. Let us know!

Event Timeline

razzi added a subscriber: akosiaris.

@akosiaris does this seem like a reasonable request?

Just to verify, total is 20 vCPUs, 40GB RAM and 500GB disk space?


By the way, that's totally ok, it's about 3%-4% of our current free capacity, just making sure I haven't misunderstood.
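For reference, the totals quoted above follow directly from the per-VM figures in the description. A minimal sketch of that arithmetic (the dictionary keys and the function name are illustrative, not part of any tooling):

```python
# Per-VM figures taken from the request description.
VM_COUNT = 5
PER_VM = {"vcpus": 4, "memory_gb": 8, "disk_gb": 100}


def cluster_totals(count, per_vm):
    """Multiply each per-VM resource by the number of VMs requested."""
    return {name: amount * count for name, amount in per_vm.items()}


totals = cluster_totals(VM_COUNT, PER_VM)
print(totals)  # {'vcpus': 20, 'memory_gb': 40, 'disk_gb': 500}
```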

Correct! The CPUs and disk space can probably be adjusted, but RAM is going to be a bit important. We could probably make do with 6GB or 4GB if we had to.

We plan on mirroring some real traffic from kafka jumbo-eqiad to this test-eqiad cluster for testing T255973: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers. But having a Kafka test cluster with real data will also make it easier to practice Kafka version upgrades and other dangerous changes in the future.

Oh, and actually, we could use this cluster as the target for Kafka client services in the staging k8s cluster. Right now the services in k8s staging use either main-eqiad or kafka-jumbo; it'd be nice to separate them from production clusters too.

OK then. +1 from my side (and my role as a rubber-stamper is done here). Feel free to create those VMs. Docs if you need them are at https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_VM

Everything, including DNS, is now done by the sre.ganeti.makevm cookbook. You no longer have to start with DNS as before; you can go straight to the cookbook, which will pick an IP and call the DNS cookbook. Once that finishes you get the MAC address, which you still have to add to DHCP manually, and then you continue from there as you normally would with physical hardware.

I plan to put these machines on the same ganeti host, since as a test use case we don't need high availability. Let me know if they should be distributed instead.

@razzi It's based on a "node group" or "row" rather than host (1). (While there is just a single primary ganeti server you'll use to control things)

There are rows: A, B, C and D in eqiad nowadays. They align with networks used for physical rows. Row D and B are newer than A and C and therefore the number of VMs on them looks like this:

34 row_A
11 row_B
36 row_C
12 row_D

So I'd say you can do as you like, but it would be nice if you could put them in B or D first to keep things a bit balanced.
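To illustrate the balancing suggestion, here is a rough sketch that places each new VM on whichever row currently has the fewest VMs, using the counts listed above. This is only an illustration of the idea: the greedy rule, the tie-break by row name and the function name are assumptions, and it is not how Ganeti itself allocates instances.

```python
def place_vms(row_counts, new_vms):
    """Greedily assign each new VM to the row with the fewest VMs.

    Ties are broken alphabetically by row name so the result is deterministic.
    Returns the chosen row for each VM and the resulting per-row counts.
    """
    counts = dict(row_counts)  # copy so the caller's data is left untouched
    placement = []
    for _ in range(new_vms):
        row = min(counts, key=lambda name: (counts[name], name))
        counts[row] += 1
        placement.append(row)
    return placement, counts


# VM counts per row as listed in the comment above.
current = {"row_A": 34, "row_B": 11, "row_C": 36, "row_D": 12}
placement, after = place_vms(current, 5)
print(placement)  # ['row_B', 'row_B', 'row_D', 'row_B', 'row_D']
print(after)      # {'row_A': 34, 'row_B': 14, 'row_C': 36, 'row_D': 14}
```

With these numbers all five VMs land in rows B and D, which matches the suggestion to fill those rows first.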

@Ottomata and I are planning to create a new small standalone node to be the zookeeper, requiring 2GB RAM, 20GB disk, and 2 vCPUs.

Change 642168 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] Add zookeeper-test virtual machine

https://gerrit.wikimedia.org/r/642168

Change 642168 merged by Razzi:
[operations/puppet@production] Add zookeeper-test virtual machine

https://gerrit.wikimedia.org/r/642168

Change 642497 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] zookeeper-test: Give zookeeper-test1001 zookeeper role

https://gerrit.wikimedia.org/r/642497

Change 642563 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] Add kafka-test virtual machine mac addresses

https://gerrit.wikimedia.org/r/642563

I originally created these virtual machines in the analytics VLAN, but they should be in the default private network instead, so I'm decommissioning the nodes that I created and remaking them.

cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: zookeeper-test1001.eqiad.wmnet

  • zookeeper-test1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: kafka-test1001.eqiad.wmnet

  • kafka-test1001.eqiad.wmnet (WARN)
    • Failed downtime host on Icinga (likely already removed)
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

@razzi can you copy/paste in here what failed for the dns netbox step? There might be some follow-ups to do to avoid an inconsistent state.

@razzi in general, on FAIL it's always better to investigate what happened. In this case it left some changes in Netbox that were not propagated to the DNS (see https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=netbox1001&service=Uncommitted+DNS+changes+in+Netbox for which the runbook is https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes )

I've run sudo cookbook sre.dns.netbox -t T268202 "Remove kafka-test100[12] records" to fix it and propagate the deletion of the kafka-test100[12] DNS records.
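The decommission summaries posted on this task mark each section as PASS, WARN or FAIL. As a minimal sketch of how one could pull out the failed sections before moving on, assuming the bullet layout shown in those summaries (the function name and regular expressions are illustrative and not part of spicerack or the cookbooks):

```python
import re

# A section header looks like "  • COMMON_STEPS (FAIL)".
SECTION = re.compile(r"^\s*•\s*(?P<name>\S+) \((?P<status>PASS|WARN|FAIL)\)\s*$")
# Any other bulleted line is treated as a step inside the current section.
STEP = re.compile(r"^\s*•\s*(?P<text>.+)$")


def failed_steps(summary):
    """Return {section name: [step lines]} for every section marked FAIL."""
    failures = {}
    current = None  # name of the FAIL section being read, if any
    for line in summary.splitlines():
        section = SECTION.match(line)
        if section:
            current = section["name"] if section["status"] == "FAIL" else None
            if current is not None:
                failures[current] = []
        elif current is not None:
            step = STEP.match(line)
            if step:
                failures[current].append(step["text"])
    return failures


# Abbreviated copy of one of the summaries posted on this task.
summary = """\
  • zookeeper-test1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • VM removed
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above
"""

print(failed_steps(summary))
```

A non-empty result is a signal to stop and investigate, as suggested above, rather than to continue with the next host.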

Here's the cumin output for the kafka-test1001 decommission:

razzi@cumin1001:~$ sudo cookbook sre.hosts.decommission kafka-test1001.eqiad.wmnet -t T268202
START - Cookbook sre.hosts.decommission
ATTENTION: the query does not match any host in PuppetDB or failed
Hostname expansion matches 1 hosts: kafka-test1001.eqiad.wmnet
Do you want to proceed anyway?
Type "done" to proceed
> done
ATTENTION: destructive action for 1 hosts: kafka-test1001.eqiad.wmnet
Are you sure to proceed?
Type "done" to proceed
> done
Looking for matches in puppetmaster1001.eqiad.wmnet:/var/lib/git/operations/puppet
Looking for matches in puppetmaster1001.eqiad.wmnet:/srv/private
Looking for matches in deploy1001.eqiad.wmnet:/srv/mediawiki-staging
No matches found in the Puppet or mediawiki-config repositories
Looking for Kerberos credentials on KDC kadmin node.
No Kerberos credentials found.
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: ['kafka-test1001.eqiad.wmnet']
**Failed downtime host on Icinga (likely already removed)**
Found Ganeti VM
Shutting down VM kafka-test1001.eqiad.wmnet in cluster ganeti01.svc.eqiad.wmnet
VM shutdown
Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
Sleeping for 20s to avoid race conditions...
Host kafka-test1001.eqiad.wmnet already missing on Debmonitor
Removed from DebMonitor
Removed from Puppet master and PuppetDB
Issuing Ganeti remove command, it can take up to 15 minutes...
Removing VM kafka-test1001.eqiad.wmnet in cluster ganeti01.svc.eqiad.wmnet. This may take a few minutes.
VM removed
Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
Generating the DNS records from Netbox data. It will take a couple of minutes.
Failed to run the sre.dns.netbox cookbook
Traceback (most recent call last):
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 351, in run
    dns_netbox_run(dns_netbox_args, spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/dns/netbox.py", line 74, in run
    results = netbox_host.run_sync(command, is_safe=True)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 476, in run_sync
    batch_sleep=batch_sleep, is_safe=is_safe)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 646, in _execute
    raise RemoteExecutionError(ret, 'Cumin execution failed')
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
**Failed to run the sre.dns.netbox cookbook**: Cumin execution failed (exit_code=2)
ERROR: some step failed, check the task updates.
Updated Phabricator task T268202
END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)

I didn't know what to make of the error, and should have asked around. Thanks @Volans for stepping in.

cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: kafka-test1003.eqiad.wmnet

  • kafka-test1003.eqiad.wmnet (WARN)
    • Failed downtime host on Icinga (likely already removed)
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

Change 644607 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] Add kafka-test1006.eqiad.wmnet virtual machine

https://gerrit.wikimedia.org/r/644607

Change 642563 abandoned by Razzi:
[operations/puppet@production] Add kafka-test virtual machine mac addresses

Reason:

https://gerrit.wikimedia.org/r/642563

Change 644620 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] Add kafka-test1006 as start of test kafka cluster

https://gerrit.wikimedia.org/r/644620

Change 644607 merged by Razzi:
[operations/puppet@production] Add kafka-test1006.eqiad.wmnet virtual machine

https://gerrit.wikimedia.org/r/644607

Change 644679 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] install_server: fix dhcp config for kafka-test1006

https://gerrit.wikimedia.org/r/644679

Change 644679 merged by Elukey:
[operations/puppet@production] install_server: fix dhcp config for kafka-test1006

https://gerrit.wikimedia.org/r/644679

Change 642497 merged by Razzi:
[operations/puppet@production] zookeeper: configure test-eqiad single-node cluster

https://gerrit.wikimedia.org/r/642497

Change 644620 merged by Razzi:
[operations/puppet@production] Add kafka-test1006 as start of test kafka cluster

https://gerrit.wikimedia.org/r/644620

Change 645169 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] Add kafka_test and zookeeper_test clusters

https://gerrit.wikimedia.org/r/645169

Change 645169 merged by Razzi:
[operations/puppet@production] Add kafka_test and zookeeper_test clusters

https://gerrit.wikimedia.org/r/645169

Change 645188 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] Add back test-eqiad zookeeper cluster

https://gerrit.wikimedia.org/r/645188

Icinga downtime for 40 days, 0:00:00 set by razzi@cumin1001 on 1 host(s) and their services with reason: new_install

kafka-test1006.eqiad.wmnet

Change 645188 merged by Elukey:
[operations/puppet@production] role::zookeeper:test: set cluster name and prometheus instance

https://gerrit.wikimedia.org/r/645188

Icinga downtime for 40 days, 0:00:00 set by razzi@cumin1001 on 1 host(s) and their services with reason: new_install

kafka-test1006.eqiad.wmnet

@razzi what's the end goal here? Long-lasting downtimes tend to be the wrong solution because they will bite us in the future one way or another. Either the downtime expires and triggers some unwanted alarm, or the downtime gets forgotten and the host doesn't alarm when it should.

For example, we have a hiera setting that can disable notifications for a given host; its alerts will still appear in Icinga but will not alert on IRC or page.

Ah, we should for sure not page on this. I just looked, and if monitoring is enabled we set critical => true for the Kafka Broker Server process: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/kafka/broker/monitoring.pp#L53

@razzi, can you add a $is_critical boolean parameter to profile::kafka::broker::monitoring and default it to false, set nrpe::monitor_service critical => $is_critical, and then also set profile::kafka::broker::monitoring::is_critical: true in all non-test kafka role hieras? I think that should keep this test cluster (and any other new cluster) from paging unless we explicitly want it to.
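The point of the proposal is that paging becomes opt-in: a role pages only if its hiera data sets the key explicitly. The following is a small Python model of that lookup behaviour, not Puppet code; the role dictionaries are hypothetical stand-ins for hiera data, and only the key name comes from the comment above.

```python
# Hiera key named in the comment above.
HIERA_KEY = "profile::kafka::broker::monitoring::is_critical"


def is_critical(role_hiera):
    """Model a class parameter that defaults to false unless hiera sets it."""
    return bool(role_hiera.get(HIERA_KEY, False))


# Hypothetical hiera data: a production role opts in, a test role sets nothing.
production_role = {HIERA_KEY: True}
test_role = {}

print(is_critical(production_role))  # True
print(is_critical(test_role))        # False
```

Defaulting to false means a newly created cluster cannot page by accident; forgetting the key only costs a missing page, which is the safer failure mode for a test cluster.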

Sorry I didn't catch this in my review yesterday!

Change 645371 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] zookeeper: Support a standalone server's mbeans in the JMX exporter's conf

https://gerrit.wikimedia.org/r/645371

@Ottomata Yeah, I'll add an $is_critical parameter.

Change 645398 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] kafka: make alerts not critical for test cluster

https://gerrit.wikimedia.org/r/645398

Change 645398 merged by Razzi:
[operations/puppet@production] kafka: make alerts not critical for test cluster

https://gerrit.wikimedia.org/r/645398

Change 646757 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] kafka: configure kafka test broker

https://gerrit.wikimedia.org/r/646757

Change 646819 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] kafka: allow accessing kafka-jumbo from kafka-test

https://gerrit.wikimedia.org/r/646819

Change 646757 merged by Razzi:
[operations/puppet@production] kafka: configure kafka test broker

https://gerrit.wikimedia.org/r/646757

Change 646819 merged by Razzi:
[operations/puppet@production] kafka: allow accessing kafka-jumbo from kafka-test

https://gerrit.wikimedia.org/r/646819

Change 647109 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] Add kafka-test1007 virtual machine

https://gerrit.wikimedia.org/r/647109

Change 645371 merged by Elukey:
[operations/puppet@production] zookeeper: Support a standalone server's mbeans in the JMX exporter's conf

https://gerrit.wikimedia.org/r/645371

Change 647109 merged by Razzi:
[operations/puppet@production] Add kafka-test1007 virtual machine

https://gerrit.wikimedia.org/r/647109

Change 647758 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] kafka: add kafka-test1007 to kafka-test cluster

https://gerrit.wikimedia.org/r/647758

Change 647758 merged by Razzi:
[operations/puppet@production] kafka: add kafka-test1007 to kafka-test cluster

https://gerrit.wikimedia.org/r/647758

Change 648342 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] kafka: Add kafka-test1008 - 1010

https://gerrit.wikimedia.org/r/648342

Change 648342 merged by Razzi:
[operations/puppet@production] kafka: Add kafka-test1008 - 1010

https://gerrit.wikimedia.org/r/648342

Change 649894 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] kafka: add remaining nodes to kafka test cluster

https://gerrit.wikimedia.org/r/649894

Change 649894 merged by Razzi:
[operations/puppet@production] kafka: add remaining nodes to kafka test cluster

https://gerrit.wikimedia.org/r/649894

cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: kafka-test1004.eqiad.wmnet

  • kafka-test1004.eqiad.wmnet (WARN)
    • Failed downtime host on Icinga (likely already removed)
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: kafka-test1005.eqiad.wmnet

  • kafka-test1005.eqiad.wmnet (WARN)
    • Failed downtime host on Icinga (likely already removed)
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

Cluster is up and running!