Upgrade Prometheus VMs in PoPs to Bullseye
Closed, Resolved · Public

Description

At the moment we're supporting both Buster and Bullseye for Prometheus (ops instance, deployed to every site). However, for things like newer blackbox-exporter configuration options it would be nice to have all hosts on Bullseye.

Procedure to move data from the old hosts to the new hosts (to be tested, adjusted, and then moved to wikitech)

Preflight checks

  1. hosts are running the prometheus role and show up in all Prometheus host lists (e.g. prometheus_all_nodes)
  2. ACLs on network devices have been updated
  3. mysqld grants are updated (to be verified)
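
A minimal spot-check sketch for the grants, run on a database host (the monitoring user name is an assumption, adjust to whatever user the mysqld exporters actually use; 10.20.0.8 is an IP taken from the prometheus_group additions in the ACL diff further down):

sudo mysql -e "SELECT user, host FROM mysql.user WHERE user LIKE 'prometheus%';"
sudo mysql -e "SHOW GRANTS FOR 'prometheus'@'10.20.0.8';"   # errors out if the grant is missing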

Migration

  1. [new host] stop puppet / prometheus / thanos-sidecar@ops
  2. [new host] remove any data accumulated so far: rm -rf /srv/prometheus/ops/metrics
  3. Initial rsync of data old -> new (see the shell sketch after this list)
  4. [old host] stop puppet / thanos-sidecar@ops / prometheus@ops. Note that once thanos-sidecar@ops is stopped here, Thanos won't be able to query data for the PoP
  5. Final rsync of data old -> new
  6. [new host] chown -R prometheus:prometheus /srv/prometheus/ops
  7. [new host] set the replica label in puppet to match the old host's, then merge the change
  8. [new host] re-enable and run puppet; this will restart prometheus and thanos-sidecar@ops, so Thanos will be able to query data from the new host again
  9. (applicable on PoPs only) Flip DNS for prometheus.svc record to point to the new host
  10. [old host] make sure puppet stays disabled and thanos-sidecar@ops does not run. Ideally decommission the host ASAP.
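
A rough shell sketch of steps 1-6, assuming a plain rsync over SSH (in practice the puppetized sync-prometheus-migration-* wrapper added later in this task is what gets used; host names below are the esams pair, as an example):

# [new host] steps 1-2: stop services and clear accumulated data
sudo disable-puppet "migrating prometheus data - T309979"
sudo systemctl stop prometheus@ops thanos-sidecar@ops
sudo rm -rf /srv/prometheus/ops/metrics
# step 3: initial sync old -> new while the old host keeps serving
sudo rsync -aH --delete --info=progress2 prometheus3001.esams.wmnet:/srv/prometheus/ops/metrics/ /srv/prometheus/ops/metrics/
# step 5: after stopping puppet / prometheus@ops / thanos-sidecar@ops on the old host, re-run the same rsync for the final delta
# step 6: fix ownership
sudo chown -R prometheus:prometheus /srv/prometheus/ops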

Followups

  • Move the final migration procedure to wikitech
  • Make sure a prolonged Prometheus outage actually triggers a page
  • Make sure there is a sensible default for replica_label in Puppet/Prometheus (see the check sketched below)
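
For the replica label, a quick way to compare the old and new host is to look at the external labels in the generated config (file path assumed from the /srv/prometheus/ops layout above; the actual label name is whatever Puppet sets):

grep -A3 'external_labels' /srv/prometheus/ops/prometheus.yml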

Event Timeline

After yesterday's partial Bullseye in-place upgrade of prometheus3001, I noticed Puppet was failing because git pull failed on /srv/alerts.git:

root@prometheus3001:/srv/alerts.git# git fetch 
fatal: unable to access 'https://gerrit.wikimedia.org/r/operations/alerts/': Failed sending HTTP request

After a bit of digging I fixed it with:

root@prometheus3001:/srv/alerts.git# apt install libcurl3-gnutls=7.64.0-4+deb10u5
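
(For reference, a generic way to check the installed vs. candidate versions of the library on such a partially upgraded host; a sketch, not taken from this task:)

apt policy libcurl3-gnutls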

Change 905705 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] prometheus: Apply prometheus::pop role to prometheus3002

https://gerrit.wikimedia.org/r/905705

Change 907984 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] prometheus: Apply prometheus::pop role to prometheus4002

https://gerrit.wikimedia.org/r/907984

Change 907985 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] prometheus: Apply prometheus::pop role to prometheus5002

https://gerrit.wikimedia.org/r/907985

Change 907987 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] prometheus: Apply prometheus::pop role to prometheus6002

https://gerrit.wikimedia.org/r/907987

Change 905705 merged by Andrea Denisse:

[operations/puppet@production] prometheus: Apply prometheus::pop role to prometheus3002

https://gerrit.wikimedia.org/r/905705

Change 907984 merged by Andrea Denisse:

[operations/puppet@production] prometheus: Apply prometheus::pop role to prometheus4002

https://gerrit.wikimedia.org/r/907984

Change 907987 merged by Andrea Denisse:

[operations/puppet@production] prometheus: Apply prometheus::pop role to prometheus6002

https://gerrit.wikimedia.org/r/907987

Change 907985 merged by Andrea Denisse:

[operations/puppet@production] prometheus: Apply prometheus::pop role to prometheus5002

https://gerrit.wikimedia.org/r/907985

Change 909738 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] prometheus: Add script to sync prometheus instances in esams

https://gerrit.wikimedia.org/r/909738

We need to push new ACLs to network devices to allow mgmt network access for the new prometheus hosts (cf. https://phabricator.wikimedia.org/T334785#8806688)

I'm following https://wikitech.wikimedia.org/wiki/Homer#Capirca_%28ACL_generation%29, thus:

cumin1001:~$ homer  mr1-eqsin* diff
INFO:homer.devices:Initialized 56 devices
INFO:homer:Generating diff for query mr1-eqsin*
INFO:homer:Gathering global Netbox data
INFO:homer.devices:Matched 1 device(s) for query 'mr1-eqsin*'
INFO:homer:Generating configuration for mr1-eqsin.wikimedia.org
INFO:homer.transports.junos:Running commit check on mr1-eqsin.wikimedia.org
Changes for 1 devices: ['mr1-eqsin.wikimedia.org']
             
[edit security address-book global]
     address bast_group_4 { ... }
-    address bast_group_5 208.80.155.110/32;
+    address bast_group_5 208.80.153.110/32;
-    address bast_group_6 2001:df2:e500:1:103:102:166:11/128;
+    address bast_group_6 208.80.155.110/32;
-    address bast_group_7 2620:0:860:2:208:80:153:54/128;
+    address bast_group_7 2001:df2:e500:1:103:102:166:11/128;
-    address bast_group_8 2620:0:861:4:208:80:155:110/128;
+    address bast_group_8 2620:0:860:2:208:80:153:54/128;
-    address bast_group_9 2620:0:862:1:91:198:174:9/128;
+    address bast_group_9 2620:0:860:4:208:80:153:110/128;
-    address bast_group_10 2620:0:863:1:198:35:26:11/128;
+    address bast_group_10 2620:0:861:4:208:80:155:110/128;
-    address bast_group_11 2a02:ec80:600:2:185:15:58:42/128;
+    address bast_group_11 2620:0:862:1:91:198:174:9/128;
+    address bast_group_12 2620:0:863:1:198:35:26:11/128;
+    address bast_group_13 2a02:ec80:600:2:185:15:58:42/128;
     address cumin_group_0 { ... }
[edit security address-book global]
     address install_group_0 { ... }
-    address install_group_1 91.198.174.63/32;
+    address install_group_1 103.102.166.12/32;
-    address install_group_2 185.15.58.7/32;
+    address install_group_2 185.15.58.12/32;
-    address install_group_3 185.15.58.12/32;
+    address install_group_3 198.35.26.13/32;
-    address install_group_4 208.80.153.51/32;
+    address install_group_4 208.80.153.105/32;
-    address install_group_5 208.80.153.105/32;
+    address install_group_5 208.80.154.74/32;
-    address install_group_6 208.80.154.32/32;
+    address install_group_6 2001:df2:e500:1:103:102:166:12/128;
-    address install_group_7 208.80.154.74/32;
+    address install_group_7 2620:0:860:4:208:80:153:105/128;
-    address install_group_8 2620:0:860:2:208:80:153:51/128;
+    address install_group_8 2620:0:861:3:208:80:154:74/128;
-    address install_group_9 2620:0:860:4:208:80:153:105/128;
+    address install_group_9 2620:0:862:1:91:198:174:10/128;
-    address install_group_10 2620:0:861:1:208:80:154:32/128;
+    address install_group_10 2620:0:863:1:198:35:26:13/128;
-    address install_group_11 2620:0:861:3:208:80:154:74/128;
+    address install_group_11 2a02:ec80:600:1:185:15:58:12/128;
     address netmon_group_0 { ... }
[edit security address-book global]
     address network-infra_2 { ... }
-    address prometheus_group_0 10.20.0.104/32;
+    address prometheus_group_0 10.20.0.8/32;
-    address prometheus_group_1 10.64.0.82/32;
+    address prometheus_group_1 10.20.0.104/32;
-    address prometheus_group_2 10.64.16.62/32;
+    address prometheus_group_2 10.64.0.82/32;
-    address prometheus_group_3 10.128.0.34/32;
+    address prometheus_group_3 10.64.16.62/32;
-    address prometheus_group_4 10.132.0.33/32;
+    address prometheus_group_4 10.128.0.16/32;
-    address prometheus_group_5 10.136.1.18/32;
+    address prometheus_group_5 10.128.0.34/32;
-    address prometheus_group_6 10.192.16.75/32;
+    address prometheus_group_6 10.132.0.12/32;
-    address prometheus_group_7 10.192.32.67/32;
+    address prometheus_group_7 10.132.0.33/32;
-    address prometheus_group_8 2001:df2:e500:101:10:132:0:33/128;
+    address prometheus_group_8 10.136.1.18/32;
-    address prometheus_group_9 2620:0:860:102:10:192:16:75/128;
+    address prometheus_group_9 10.136.1.24/32;
-    address prometheus_group_10 2620:0:860:103:10:192:32:67/128;
+    address prometheus_group_10 10.192.16.75/32;
-    address prometheus_group_11 2620:0:861:101:10:64:0:82/128;
+    address prometheus_group_11 10.192.32.67/32;
-    address prometheus_group_12 2620:0:861:102:10:64:16:62/128;
+    address prometheus_group_12 2001:df2:e500:101:10:132:0:12/128;
-    address prometheus_group_13 2620:0:862:102:10:20:0:104/128;
+    address prometheus_group_13 2001:df2:e500:101:10:132:0:33/128;
-    address prometheus_group_14 2620:0:863:101:10:128:0:34/128;
+    address prometheus_group_14 2620:0:860:102:10:192:16:75/128;
-    address prometheus_group_15 2a02:ec80:600:102:10:136:1:18/128;
+    address prometheus_group_15 2620:0:860:103:10:192:32:67/128;
+    address prometheus_group_16 2620:0:861:101:10:64:0:82/128;
+    address prometheus_group_17 2620:0:861:102:10:64:16:62/128;
+    address prometheus_group_18 2620:0:862:102:10:20:0:8/128;
+    address prometheus_group_19 2620:0:862:102:10:20:0:104/128;
+    address prometheus_group_20 2620:0:863:101:10:128:0:16/128;
+    address prometheus_group_21 2620:0:863:101:10:128:0:34/128;
+    address prometheus_group_22 2a02:ec80:600:102:10:136:1:18/128;
+    address prometheus_group_23 2a02:ec80:600:102:10:136:1:24/128;
-    address install4001_0 198.35.26.12/31;
-    address install4001_1 2620:0:863:1:198:35:26:12/127;
-    address install5002_0 103.102.166.12/31;
-    address install5002_1 2001:df2:e500:1:103:102:166:12/127;
-    address install_group_12 2620:0:862:1:91:198:174:10/128;
-    address install_group_13 2620:0:862:1:91:198:174:63/128;
-    address install_group_14 2a02:ec80:600:1:185:15:58:7/128;
-    address install_group_15 2a02:ec80:600:1:185:15:58:12/128;
[edit security address-book global address-set bast_group]
      address bast_group_11 { ... }
+     address bast_group_12;
+     address bast_group_13;
[edit security address-book global address-set install_group]
-     address install_group_12;
-     address install_group_13;
-     address install_group_14;
-     address install_group_15;
[edit security address-book global address-set prometheus_group]
      address prometheus_group_15 { ... }
+     address prometheus_group_16;
+     address prometheus_group_17;
+     address prometheus_group_18;
+     address prometheus_group_19;
+     address prometheus_group_20;
+     address prometheus_group_21;
+     address prometheus_group_22;
+     address prometheus_group_23;
[edit security address-book global]
-    address-set install4001 {
-        address install4001_0;
-        address install4001_1;
-    }       
-    address-set install5002 {
-        address install5002_0;
-        address install5002_1;
-    }       
[edit security policies from-zone production to-zone mgmt policy dhcp match]
-      source-address [ install4001 install5002 install_group ];
+      source-address install_group;
             
---------------
INFO:homer:Homer run completed successfully on 1 devices: ['mr1-eqsin.wikimedia.org']
  • check with netops that the above is correct/expected, since there are unrelated changes too
  • commit to all mr devices: homer mr* commit "Add Prometheus hosts - T309979"

This is done; the Prometheus hosts have access to the mgmt network now:

INFO:homer:Homer run completed successfully on 6 devices: ['mr1-codfw.wikimedia.org', 'mr1-drmrs.wikimedia.org', 'mr1-eqiad.wikimedia.org', 'mr1-eqsin.wikimedia.org', 'mr1-esams.wikimedia.org', 'mr1-ulsfo.wikimedia.org']

Change 909738 merged by Andrea Denisse:

[operations/puppet@production] prometheus: Add support for syncing data between Prometheus hosts

https://gerrit.wikimedia.org/r/909738

Change 912937 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] prometheus: Show transfer progress when migrating data

https://gerrit.wikimedia.org/r/912937

Change 912937 merged by Andrea Denisse:

[operations/puppet@production] prometheus: Show transfer progress when migrating data

https://gerrit.wikimedia.org/r/912937

Hello Filippo, here are the steps I think are required to accomplish the migrations:

1. Migrate data

On the new host:

sudo disable-puppet "Disabling Puppet, Prometheus, and Thanos sidecar on the Bullseye host to migrate Prometheus hosts to Bullseye - T309979"
sudo systemctl stop prometheus@ops.service
sudo systemctl stop thanos-sidecar@ops
sudo systemctl status prometheus@ops.service
sudo systemctl status thanos-sidecar@ops
sudo /usr/local/sbin/sync-prometheus-migration-esams

On the old host:

sudo disable-puppet "Disabling Puppet, Prometheus, and Thanos sidecar on the Buster host to migrate Prometheus hosts to Bullseye - T309979"
sudo systemctl stop thanos-sidecar@ops
sudo systemctl status thanos-sidecar@ops

2. Final sync

On the new host:

sudo /usr/local/sbin/sync-prometheus-migration-esams

3. Re-enable and run Puppet on the new host

sudo run-puppet-agent -e "Re-enabling Puppet, Prometheus, and Thanos sidecar on the Bullseye host to migrate Prometheus hosts to Bullseye - T309979"

4. Failover

Applying changes

SSH into dns1001.wikimedia.org

sudo -i authdns-update

Query all three DNS servers to ensure that the change has been deployed correctly:

for i in 0 1 2; do
   ns=ns${i}.wikimedia.org
   echo $ns
   dig +short @${ns} -t srv _etcd-server-ssl._tcp.dse-k8s-etcd.eqiad.wmnet
done
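
(The SRV record in the loop above is copied from another runbook; for this migration the record to verify is the per-site prometheus service record mentioned in the task description. A hedged adaptation, where the exact record name is an assumption:)

for i in 0 1 2; do
   ns=ns${i}.wikimedia.org
   echo $ns
   dig +short @${ns} prometheus.svc.esams.wmnet   # record name is an assumption
done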

5. Decommission tasks:

Change 913192 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/dns@master] prometheus: Failover DNS from prometheus3001 to prometheus3002 in esams

https://gerrit.wikimedia.org/r/913192

Change 913194 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/dns@master] prometheus: Failover DNS from prometheus4001 to prometheus4002 in ulsfo

https://gerrit.wikimedia.org/r/913194

Change 913196 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/dns@master] prometheus: Failover DNS from prometheus5001 to prometheus5002 in eqsin

https://gerrit.wikimedia.org/r/913196

Change 913198 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/dns@master] prometheus: Failover DNS from prometheus6001 to prometheus6002 in drmrs

https://gerrit.wikimedia.org/r/913198

andrea.denisse changed the task status from Open to In Progress. Apr 28 2023, 3:07 PM

Change 913192 merged by Andrea Denisse:

[operations/dns@master] prometheus: Failover DNS from prometheus3001 to prometheus3002 in esams

https://gerrit.wikimedia.org/r/913192

Change 914359 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] prometheus: Add UID/GID mappings support for prometheus data sync

https://gerrit.wikimedia.org/r/914359

Change 914359 merged by Andrea Denisse:

[operations/puppet@production] prometheus: Add UID/GID mappings support for prometheus data sync

https://gerrit.wikimedia.org/r/914359

Change 913194 merged by Andrea Denisse:

[operations/dns@master] prometheus: Failover DNS from prometheus4001 to prometheus4002 in ulsfo

https://gerrit.wikimedia.org/r/913194

Change 914400 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] prometheus: Synchronize only the /srv/prometheus folder instead of the entire /srv directory

https://gerrit.wikimedia.org/r/914400

Change 914400 merged by Andrea Denisse:

[operations/puppet@production] prometheus: Synchronize only the /srv/prometheus directory when migrating data

https://gerrit.wikimedia.org/r/914400

Change 913196 merged by Andrea Denisse:

[operations/dns@master] prometheus: Failover DNS from prometheus5001 to prometheus5002 in eqsin

https://gerrit.wikimedia.org/r/913196

Change 913198 merged by Andrea Denisse:

[operations/dns@master] prometheus: Failover DNS from prometheus6001 to prometheus6002 in drmrs

https://gerrit.wikimedia.org/r/913198

Mentioned in SAL (#wikimedia-operations) [2023-05-05T08:15:19Z] <godog> delete wal and chunks_head from prometheus5002 and prometheus4002 to let prometheus start back up and not crashloop - T309979

These two Prometheus instances had been down (crashlooping) since around midnight UTC. I've added followups to the task description, including making sure that a Prometheus being unavailable for a prolonged period of time actually triggers a page.
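
As a minimal sketch of the kind of alert that followup asks for (rule name, job label value and the 30m threshold are all assumptions, not the actual rule in operations/alerts; promtool is only used here to validate the syntax):

cat <<'EOF' > /tmp/prometheus_replica_down.yaml
groups:
  - name: prometheus_replica_down
    rules:
      - alert: PrometheusReplicaDown
        expr: up{job="prometheus"} == 0   # self-scrape job; label value is an assumption
        for: 30m
        labels:
          severity: page
        annotations:
          summary: 'Prometheus {{ $labels.instance }} has been down for 30 minutes'
EOF
promtool check rules /tmp/prometheus_replica_down.yaml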

@andrea.denisse FYI I noticed prometheus[346]001 were still in PuppetDB, so I've manually run puppet node clean and puppet node deactivate for those hosts. I noticed because:

root@prometheus3002:/srv/prometheus/ops# grep -ir prometheus3001 .
./targets/thanos_sidecar_esams.yaml:  - prometheus3001:19900
./targets/rsyslog_esams.yaml:  - prometheus3001:9105
./targets/node_site_esams.yaml:  - prometheus3001:9100
./targets/envoy_esams.yaml:  - prometheus3001:9631
  • I am assuming that between the puppet node clean/deactivate run by the cookbook and the manual gnt-instance remove there was a Puppet run on the host; this effectively undoes the puppet node deactivate.

I thought, though, that puppet node clean would prevent further Puppet runs; maybe that's not the case? At any rate it seems the decommission cookbook failed for three VMs out of four (always at the gnt-instance remove step, as far as I can tell). Pinging @Volans in case this is unexpected or hasn't been seen before (namely, gnt-instance remove failing semi-consistently).
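
For reference, the manual puppet node clean / deactivate mentioned above boils down to something like this on the puppetmaster (a sketch; prometheus3001 is the host from the grep output above, repeat for the other stale entries):

sudo puppet node clean prometheus3001.esams.wmnet
sudo puppet node deactivate prometheus3001.esams.wmnet
sudo run-puppet-agent    # on the new host, to drop the stale targets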

@andrea.denisse, @fgiunchedi are you sure it happened for prometheus6001 too? I don't see any failure in T335588#8846592

> @andrea.denisse, @fgiunchedi are you sure it happened for prometheus6001 too? I don't see any failure in T335588#8846592

My mistake: it indeed happened in ulsfo and esams only

@andrea.denisse @fgiunchedi

For prometheus3001 and prometheus4001 the cookbook was called with the wrong FQDN (wment vs wmnet):

Executing cookbook sre.hosts.decommission with args: ['-t', 'T335585', 'prometheus4001.ulsfo.wment']
...
spicerack.remote.RemoteError: No hosts provided

At that point the cookbook offers the user the option to proceed anyway (because the host might already be absent from PuppetDB for whatever reason):

ask_confirmation(
    'ATTENTION: the query does not match any host in PuppetDB or failed\n'
    'Hostname expansion matches {n} hosts: {hosts}\n'
    'Do you want to proceed anyway?'
    .format(n=len(decom_hosts), hosts=decom_hosts))

and the logs have:

User input is: "go"

and so it continued to decommission the "wrong" hostname.
I can check whether the cookbook really needs the FQDN and move it to accept a hostname only, but that would not prevent issues with other kinds of typos.

> @andrea.denisse @fgiunchedi
>
> For prometheus3001 and prometheus4001 the cookbook was called with the wrong FQDN (wment vs wmnet):

doh! thank you, I completely missed that, which explains the symptoms

> [...]
> I can check whether the cookbook really needs the FQDN and move it to accept a hostname only, but that would not prevent issues with other kinds of typos.

Yeah, that's fair. Probably the ultimate check would be whether the hostname/FQDN is in Netbox, which it should always be, but that's probably overkill (?)

Actually it's already like that: it takes a Cumin query as a parameter, so prometheus3001* or A:prometheus and A:esams would have worked just the same.
The reason it allows specifying an FQDN and proceeding even when it's not in PuppetDB is to cover the cases where the host is already gone from PuppetDB (e.g. broken for more than 2 weeks) or where a first decommissioning run failed some step and you want to retry. To keep the cookbook idempotent it needs to allow running based only on an FQDN provided manually by the user, e.g. prometheus3001.esams.wmnet, after the confirmation prompt.
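
For example (assuming the usual cookbook wrapper on the cumin hosts), either of these would have matched the host without relying on a hand-typed FQDN:

sudo cookbook sre.hosts.decommission -t T335585 'prometheus3001*'
sudo cookbook sre.hosts.decommission -t T335585 'A:prometheus and A:esams'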

lmata triaged this task as Medium priority. Jul 10 2023, 4:10 AM

That's correct, yes; all done.