
cloudservices1006: put into service
Closed, Resolved (Public)

Description

Put the server cloudservices1006 https://netbox.wikimedia.org/dcim/devices/4790/ into service.

This is the first designate host in the new network setup in eqiad1. It is also the first step of the transition tracked in T342621: eqiad1: cloudlb: transition DNS clients (VMs) to the new BGP-based recursor VIP.

Details

Repo               Branch      Lines +/-
operations/puppet  production  +4 -5
operations/dns     master      +0 -4
operations/puppet  production  +7 -8
operations/puppet  production  +6 -6
operations/puppet  production  +5 -4
operations/puppet  production  +8 -0
operations/puppet  production  +7 -2
operations/puppet  production  +2 -0
operations/puppet  production  +5 -0
operations/puppet  production  +0 -2
operations/puppet  production  +2 -0
operations/puppet  production  +6 -0
operations/puppet  production  +4 -0
operations/puppet  production  +18 -0
operations/puppet  production  +20 -3
operations/puppet  production  +14 -8
operations/puppet  production  +4 -0
operations/puppet  production  +39 -1

Event Timeline

Change 941383 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices1006: prepare service

https://gerrit.wikimedia.org/r/941383

aborrero changed the task status from Open to In Progress. (Aug 30 2023, 10:46 AM)
aborrero triaged this task as Medium priority.
aborrero moved this task from Backlog to Doing on the User-aborrero board.

Change 941383 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudservices1006: prepare service

https://gerrit.wikimedia.org/r/941383

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudservices1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudservices1006.eqiad.wmnet with OS bullseye executed with errors:

  • cloudservices1006 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudservices1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudservices1006.eqiad.wmnet with OS bullseye executed with errors:

  • cloudservices1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudservices1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudservices1006.eqiad.wmnet with OS bullseye executed with errors:

  • cloudservices1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudservices1006.eqiad.wmnet with OS bullseye

Change 953595 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices: enable ns0-next.openstack.eqiad1.wikimediacloud.org

https://gerrit.wikimedia.org/r/953595

Change 953605 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices1006: make it aware of ns0-next.openstack.eqiad1.wikimediacloud.org

https://gerrit.wikimedia.org/r/953605

Change 953605 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudservices1006: make it aware of ns0-next.openstack.eqiad1.wikimediacloud.org

https://gerrit.wikimedia.org/r/953605

Change 953685 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: eqiad1: pdns: use modern recursor setting for cloudservices1006

https://gerrit.wikimedia.org/r/953685

Change 953685 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: eqiad1: pdns: use modern recursor setting for cloudservices1006

https://gerrit.wikimedia.org/r/953685

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudservices1006.eqiad.wmnet with OS bullseye completed:

  • cloudservices1006 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202308301216_aborrero_3102738_cloudservices1006.out, asking the operator what to do
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202308301251_aborrero_3102738_cloudservices1006.out, asking the operator what to do
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202308301251_aborrero_3102738_cloudservices1006.out, asking the operator what to do
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202308301520_aborrero_3102738_cloudservices1006.out, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308311049_aborrero_3102738_cloudservices1006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

The reimage of cloudservices1006 is now completed. I followed the instructions at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/DNS/Designate#Initial_designate/pdns_node_setup

However, the service has not been properly enrolled as a designate worker. I'll leave that to @Andrew.

Icinga downtime and Alertmanager silence (ID=34bcbdf9-bb21-499b-89d2-556ebea9203e) set by aborrero@cumin1001 for 5 days, 0:00:00 on 1 host(s) and their services with reason: service bootstrap

cloudservices1006.eqiad.wmnet

Today @cmooney enabled the BGP VIPs for this server.

I can ping them:

arturo@nostromo:~ $ ping -c1 185.15.56.162
PING 185.15.56.162 (185.15.56.162) 56(84) bytes of data.
64 bytes from 185.15.56.162: icmp_seq=1 ttl=50 time=106 ms

--- 185.15.56.162 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 105.712/105.712/105.712/0.000 ms

aborrero@tools-sgebastion-11:~$ ping -c1 172.20.255.1
PING 172.20.255.1 (172.20.255.1) 56(84) bytes of data.
64 bytes from 172.20.255.1: icmp_seq=1 ttl=61 time=0.496 ms

--- 172.20.255.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.496/0.496/0.496/0.000 ms

The DNS recursor also works right away:

aborrero@tools-sgebastion-11:~$ dig @172.20.255.1 www.google.com +short
142.251.167.104
[..]
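
The same check can be scripted. Below is a minimal sketch, assuming only the Python standard library, that sends one A query over UDP to the recursor VIP and reports whether an answer came back. The function names `build_a_query` and `recursor_answers` are hypothetical and not part of any WMCS tooling, and 172.20.255.1 is only reachable from inside the cloud network, as in the `dig` run above.

```python
import random
import socket
import struct


def build_a_query(hostname: str, query_id: int) -> bytes:
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is length-prefixed; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def recursor_answers(server: str, hostname: str, timeout: float = 2.0) -> bool:
    """Return True if server replies with RCODE 0 and at least one answer record."""
    query_id = random.randint(0, 0xFFFF)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_a_query(hostname, query_id), (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts and unreachable networks alike.
            return False
    if len(reply) < 12:
        return False
    reply_id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", reply[:12])
    return reply_id == query_id and (flags & 0x000F) == 0 and ancount > 0


if __name__ == "__main__":
    # Internal recursor VIP taken from the dig test above.
    print("172.20.255.1", recursor_answers("172.20.255.1", "www.google.com"))
```

The sketch only inspects the reply header; it does not parse the returned addresses, so `dig` remains the better tool for reading actual answers.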

Change 954317 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudlb: haproxy: mysql: expose tcp port to all internal networks

https://gerrit.wikimedia.org/r/954317

Change 954317 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudlb: haproxy: mysql: expose tcp port to cloud-private networks only

https://gerrit.wikimedia.org/r/954317

Mentioned in SAL (#wikimedia-cloud) [2023-09-04T08:40:41Z] <arturo> stopped all designate services on cloudservices1006 T345240

Mentioned in SAL (#wikimedia-cloud) [2023-09-04T10:46:22Z] <arturo> added designate galera DB grants for cloudlb T345240

Added some grants in the galera DB like this:

# for cloudlb1001.private.eqiad.wikimedia.cloud
GRANT ALL PRIVILEGES ON designate.* TO 'designate'@'172.20.1.2'  IDENTIFIED BY '<redacted>';
# for cloudlb1002.private.eqiad.wikimedia.cloud
GRANT ALL PRIVILEGES ON designate.* TO 'designate'@'172.20.2.2'  IDENTIFIED BY '<redacted>';
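
If more cloudlb hosts need the same grant later, the statements can be generated rather than typed by hand. This is a minimal sketch, not the procedure used here: `designate_grants` is a hypothetical helper, and the password placeholder is kept as `<redacted>` so the output is meant for review, not for piping straight into the database.

```python
def designate_grants(hosts: dict[str, str], password: str = "<redacted>") -> list[str]:
    """Return comment and GRANT lines for each (fqdn -> address) pair.

    The values are interpolated as plain strings, so this is only suitable
    for trusted, hand-maintained input such as a short host list.
    """
    lines = []
    for fqdn, address in hosts.items():
        lines.append(f"# for {fqdn}")
        lines.append(
            "GRANT ALL PRIVILEGES ON designate.* "
            f"TO 'designate'@'{address}' IDENTIFIED BY '{password}';"
        )
    return lines


if __name__ == "__main__":
    # Host names and addresses copied from the grants above.
    cloudlbs = {
        "cloudlb1001.private.eqiad.wikimedia.cloud": "172.20.1.2",
        "cloudlb1002.private.eqiad.wikimedia.cloud": "172.20.2.2",
    }
    print("\n".join(designate_grants(cloudlbs)))
```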

Change 954654 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: designate: override to enable cloud-private for designate

https://gerrit.wikimedia.org/r/954654

Change 954679 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: cloudcontro1l005: open memcached to cloud-private

https://gerrit.wikimedia.org/r/954679

Change 954654 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: designate: override to enable cloud-private for designate

https://gerrit.wikimedia.org/r/954654

Change 954679 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: cloudcontro1l005: open memcached to cloud-private

https://gerrit.wikimedia.org/r/954679

Change 954692 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices1006: make it talk to cloudcontrol via cloud-private

https://gerrit.wikimedia.org/r/954692

Change 954692 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudservices1006: make it talk to cloudcontrol via cloud-private

https://gerrit.wikimedia.org/r/954692

Mentioned in SAL (#wikimedia-cloud) [2023-09-04T14:19:28Z] <arturo> started all designate services on cloudservices1006 T345240

Change 954698 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices1006: additional keystone overrides for cloud-private migration

https://gerrit.wikimedia.org/r/954698

Change 954698 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudservices1006: additional keystone overrides for cloud-private migration

https://gerrit.wikimedia.org/r/954698

Change 954704 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: eqiad1: cloudlb: disable older designate backends

https://gerrit.wikimedia.org/r/954704

Change 954704 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: eqiad1: cloudlb: disable older designate backends

https://gerrit.wikimedia.org/r/954704

Change 954708 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices1006: allow cloudlb's haproxy connectivity

https://gerrit.wikimedia.org/r/954708

Change 954708 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudservices1006: allow cloudlb's haproxy connectivity

https://gerrit.wikimedia.org/r/954708

Change 954712 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices1006: override deisngate servers

https://gerrit.wikimedia.org/r/954712

Change 954712 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudservices1006: override deisngate servers

https://gerrit.wikimedia.org/r/954712

<taavi> but isn't 1006 running designate-sink already? so that would mean it would be processing vm creation and deletion events, which would have the same issue
<taavi> it seems like the easiest solution would be to add a homer term to allow all traffic between cloudservices nodes, and then update ::designate_hosts to contain all three nodes on all three nodes

> add a homer term to allow all traffic between cloudservices nodes

Hm, actually, I think this is not needed as cr-labs only default-blocks traffic to private addresses. So just setting ::designate_hosts should be all that's needed.

Change 954726 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] hieradata: add cloudservices1006 to all designate fw rules

https://gerrit.wikimedia.org/r/954726

Change 954726 merged by Majavah:

[operations/puppet@production] hieradata: add cloudservices1006 to all designate fw rules

https://gerrit.wikimedia.org/r/954726

Change 954891 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: eqiad1: rabbitmq: allow access from new designate node

https://gerrit.wikimedia.org/r/954891

Change 954891 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: eqiad1: rabbitmq: allow access from new designate node

https://gerrit.wikimedia.org/r/954891

Mentioned in SAL (#wikimedia-cloud) [2023-09-05T10:54:48Z] <arturo> running SQL command update domains set master="208.80.154.11:5354 208.80.154.148:5354 10.64.151.4:5354"; on all 3 cloudservices nodes (T345240)

Mentioned in SAL (#wikimedia-cloud) [2023-09-05T12:18:52Z] <arturo> updating pools.yaml in all cloudservices designate nodes (T345240)

New pools.yaml file:

- also_notifies: []
  attributes: {}
  description: Pool for pdns backing designate
  id: 794ccc2c-d751-44fe-b57f-8894c9f5c842
  name: default
  nameservers:
  - host: 208.80.154.11
    port: 53
  - host: 208.80.154.148
    port: 53
  ns_records:
  - hostname: ns1.openstack.eqiad1.wikimediacloud.org.
    priority: 10
  - hostname: ns0.openstack.eqiad1.wikimediacloud.org.
    priority: 10
  targets:
  - masters:
    - host: 208.80.154.148
      port: 5354
    - host: 208.80.154.11
      port: 5354
    # added
    - host: 172.20.1.5
      port: 5354
    options:
      api_endpoint: http://208.80.154.148:8081
      api_token: redacted
      host: 208.80.154.148
      port: '53'
    type: pdns4
  - masters:
    - host: 208.80.154.11
      port: 5354
    - host: 208.80.154.148
      port: 5354
    # added
    - host: 172.20.1.5
      port: 5354
    options:
      api_endpoint: http://208.80.154.11:8081
      api_token: redacted
      host: 208.80.154.11
      port: '53'
    type: pdns4
  # added
  - masters:
    - host: 208.80.154.11
      port: 5354
    - host: 208.80.154.148
      port: 5354
    - host: 172.20.1.5
      port: 5354
    options:
      api_endpoint: http://172.20.1.5:8081
      api_token: redacted
      host: 172.20.1.5
      port: '53'
    type: pdns4

The pool update command hangs. Let me try with the .eqiad.wmnet address instead:

- also_notifies: []
  attributes: {}
  description: Pool for pdns backing designate
  id: 794ccc2c-d751-44fe-b57f-8894c9f5c842
  name: default
  nameservers:
  - host: 208.80.154.11
    port: 53
  - host: 208.80.154.148
    port: 53
  ns_records:
  - hostname: ns1.openstack.eqiad1.wikimediacloud.org.
    priority: 10
  - hostname: ns0.openstack.eqiad1.wikimediacloud.org.
    priority: 10
  targets:
  - masters:
    - host: 208.80.154.148
      port: 5354
    - host: 208.80.154.11
      port: 5354
    - host: 10.64.151.4
      port: 5354
    options:
      api_endpoint: http://208.80.154.148:8081
      api_token: redacted
      host: 208.80.154.148
      port: '53'
    type: pdns4
  - masters:
    - host: 208.80.154.11
      port: 5354
    - host: 208.80.154.148
      port: 5354
    - host: 10.64.151.4
      port: 5354
    options:
      api_endpoint: http://208.80.154.11:8081
      api_token: redacted
      host: 208.80.154.11
      port: '53'
    type: pdns4
  - masters:
    - host: 208.80.154.11
      port: 5354
    - host: 208.80.154.148
      port: 5354
    - host: 10.64.151.4
      port: 5354
    options:
      api_endpoint: http://10.64.151.4:8081
      api_token: redacted
      host: 10.64.151.4
      port: '53'
    type: pdns4
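
The `targets` section follows a regular pattern: every pdns host gets one target, and each target lists all designate hosts as masters. A minimal sketch of that pattern follows, assuming a hypothetical helper `build_targets`; it is not how the file was produced here. Unlike the hand-edited file, it lists the masters in the same order for every target, and it prints JSON for inspection to avoid a non-stdlib YAML dependency.

```python
import json


def build_targets(pdns_hosts, mdns_port=5354, api_port=8081, api_token="redacted"):
    """Return one pdns4 target per host, each listing every host as a master."""
    targets = []
    for host in pdns_hosts:
        targets.append(
            {
                # Fresh dicts per target so entries are never shared between targets.
                "masters": [{"host": h, "port": mdns_port} for h in pdns_hosts],
                "options": {
                    "api_endpoint": f"http://{host}:{api_port}",
                    "api_token": api_token,
                    "host": host,
                    # Kept as a string to match the quoted value in pools.yaml.
                    "port": "53",
                },
                "type": "pdns4",
            }
        )
    return targets


if __name__ == "__main__":
    # Addresses taken from the pools.yaml above.
    hosts = ["208.80.154.11", "208.80.154.148", "10.64.151.4"]
    print(json.dumps(build_targets(hosts), indent=2))
```

Adding a fourth node would then be a one-line change to the host list rather than an edit to every `masters` block.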

Mentioned in SAL (#wikimedia-cloud) [2023-09-05T12:34:47Z] <arturo> synced pdns database from cloudservices1004 to cloudservices1006 (T345240)

Mentioned in SAL (#wikimedia-cloud) [2023-09-05T12:45:57Z] <arturo> moved all VMs to ns-recursor.openstack.eqiad1.wikimediacloud.org via project puppet (T345240, T342621)

Mentioned in SAL (#wikimedia-cloud) [2023-09-06T08:47:02Z] <arturo> switch project to new DNS recursor via horizon project hiera (T345240, T342621)

Change 956415 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wmcs: drop cloudservices1004 addresses

https://gerrit.wikimedia.org/r/956415

Change 956417 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: refresh cloudservices1006 ns address

https://gerrit.wikimedia.org/r/956417

Change 956419 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] wmcs: refresh DNS addresses

https://gerrit.wikimedia.org/r/956419

Change 956417 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: refresh cloudservices1006 ns address

https://gerrit.wikimedia.org/r/956417

Change 956419 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] wmcs: refresh DNS addresses

https://gerrit.wikimedia.org/r/956419

Change 956415 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wmcs: drop cloudservices1004 addresses

https://gerrit.wikimedia.org/r/956415

Change 956423 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] wmcs: eqiad1: drop ns1-next and use ns1

https://gerrit.wikimedia.org/r/956423

Change 956423 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] wmcs: eqiad1: drop ns1-next and use ns1

https://gerrit.wikimedia.org/r/956423