Set up puppet configuration for new WCQS cluster
Closed, Resolved | Public | 5 Estimated Story Points

Description

As a WCQS maintainer I want new instances of WCQS to be configurable via Puppet so I can efficiently maintain a WCQS production cluster.

Most of the configuration is similar to WDQS, but there's the additional matter of setting up the authentication (not sure how much work is required on this).

AC:

  • All WCQS instances are configured in Puppet, and the configuration is applied on at least one of the WCQS servers
  • Code duplication does not increase

Event Timeline

Change 715570 merged by Ryan Kemper:

[labs/private@master] wcqs: add wcqs.discovery.wmnet dummy key

https://gerrit.wikimedia.org/r/715570

Change 715569 merged by Ryan Kemper:

[operations/puppet@production] wcqs: create tls cert

https://gerrit.wikimedia.org/r/715569

With respect to https://gerrit.wikimedia.org/r/c/operations/puppet/+/713946, we need to know whether sdcquery01 is still in use. If it is not, and we can "decommission" it (or at least not worry about breaking it), we can require oauth_settings to be defined here instead of leaving it as an Optional: https://github.com/wikimedia/puppet/blob/ae7ad4bc10f65ce260c623a1f4c06835d26f1c52/modules/profile/manifests/query_service/wcqs.pp#L21

(This is because for WCQS we always want oauth_settings to be set, so we want to disallow null/undefined.)

Update: sdcquery01 has been deleted, so we can proceed with removing the Optional type.

Change 713958 merged by Ryan Kemper:

[operations/puppet@production] blazegraph: Setup tls termination for wcqs

https://gerrit.wikimedia.org/r/713958

Change 713929 merged by Ryan Kemper:

[operations/dns@master] Add wcqs.svc.{codfw,eqiad}.wmnet

https://gerrit.wikimedia.org/r/713929

Mentioned in SAL (#wikimedia-operations) [2021-09-21T18:45:31Z] <ryankemper> T280001 ryankemper@authdns1001:~$ sudo authdns-update

Mentioned in SAL (#wikimedia-operations) [2021-09-21T18:48:08Z] <ryankemper> T280001 OK - authdns-update successful on all nodes!

[DNS changes -> authdns update]
ryankemper@authdns1001:~$ sudo authdns-update
Updating authdns1001.wikimedia.org (self)...
Pulling the current revision from https://gerrit.wikimedia.org/r/operations/dns.git
Reviewing 7c5c3049786e6102ca1101319de714290be9f4f4...

 templates/10.in-addr.arpa | 2 ++
 templates/wmnet           | 2 ++
 2 files changed, 4 insertions(+)

diff --git templates/10.in-addr.arpa templates/10.in-addr.arpa
index 5fc8abb3..d90c4062 100644
--- templates/10.in-addr.arpa
+++ templates/10.in-addr.arpa
@@ -65,6 +65,7 @@ $ORIGIN 1.2.@Z
 58  1H  IN PTR  miscweb.svc.codfw.wmnet.
 61  1H  IN PTR  shellbox-constraints.svc.codfw.wmnet.
 62  1H  IN PTR  toolhub.svc.codfw.wmnet.
+63  1h  IN PTR  wcqs.svc.codfw.wmnet.

 ; 10.2.2.0/24 - eqiad LVS low-traffic (internal) services

@@ -124,6 +125,7 @@ $ORIGIN 2.2.@Z
 61  1H  IN PTR  shellbox-constraints.svc.eqiad.wmnet.
 62  1H  IN PTR  toolhub.svc.eqiad.wmnet.
 63  1H  IN PTR  inference.svc.eqiad.wmnet.
+64  1H  IN PTR  wcqs.svc.eqiad.wmnet.

 ; 10.2.3.0/24 - esams LVS low-traffic (internal) services

diff --git templates/wmnet templates/wmnet
index 7c21d8f8..aafa7ea7 100644
--- templates/wmnet
+++ templates/wmnet
@@ -442,6 +442,7 @@ tegola-vector-tiles         1H  IN A    10.2.2.60
 shellbox-constraints        1H  IN A    10.2.2.61
 toolhub         1H  IN A    10.2.2.62
 inference       1H  IN A    10.2.2.63
+wcqs            1H  IN A    10.2.2.67
 ganeti01    1H  IN A        10.64.32.173
 nfs-tools-project    1H  IN A        10.64.37.18

@@ -543,6 +544,7 @@ mwdebug         1H  IN A    10.2.1.59
 tegola-vector-tiles         1H  IN A    10.2.1.60
 shellbox-constraints        1H  IN A    10.2.1.61
 toolhub         1H  IN A    10.2.1.62
+wcqs            1H  IN A    10.2.1.67

 ; K8S CODFW STAGING SERVICES


Merge these changes? (yes/no)? yes
Updating 8307641d..7c5c3049
Fast-forward
 templates/10.in-addr.arpa | 2 ++
 templates/wmnet           | 2 ++
 2 files changed, 4 insertions(+)
Deploying via utils/deploy-check.py...
Assembling and testing data in /tmp/dns-check.rapvt1c5
 -- Generating zonefiles from zone templates
 -- Processed 211 zones into directory /tmp/dns-check.rapvt1c5/zones
OK: No tabs
Summary of violations:
    W001|MISSING_IP_FOR_NAME_AND_PTR: 37
    W002|MISSING_PTR_FOR_NAME_AND_IP: 26
    W103|MISSING_MGMT_FOR_NAME: 64
    W105|TOO_MANY_PUBLIC_NAMES: 3
RESULT: 0 Errors, 130 Warnings, 0 Ignored violations, 0 Ignored lines
 -- Copying automatically generated zone files under target tree
 -- Copying repo-driven real config files and admin_state
 -- Copying puppetized config and GeoIP from /etc/gdnsd
 -- Checking for illegal tabs in zonefiles
 -- Running zone_validator to check WMF rules
 -- Running /usr/sbin/gdnsd checkconf on /tmp/dns-check.rapvt1c5
 -- Preflight checkconf is OK
Deploying from /tmp/dns-check.rapvt1c5 to system dirs
 -- Zone changed: wmnet
 -- Descending to subdirectory: netbox
 -- Done with subdir: netbox
 -- Zone changed: 10.in-addr.arpa
Reloading gdnsd zonefiles
info: Zone data reloaded
OK
---------------
authdns2001.wikimedia.org,dns[1001-1002,2001-2002,3001-3002,4001-4002,5001-5002].wikimedia.org (11)
---------------
OK - authdns updated successfully

OK - authdns-update successful on all nodes!
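
The diff above adds both forward (A) and reverse (PTR) records for the new service addresses. As a rough aid for cross-checking the two by hand, the sketch below derives the reverse-lookup owner name from an IPv4 address. The function name ptr_name is hypothetical and is not part of authdns-update or the dns repository.

```shell
#!/bin/sh
# Build the reverse-lookup (PTR) owner name for a dotted-quad IPv4 address.
# Hypothetical helper for illustration only; not part of authdns-update.
ptr_name() {
    # Temporarily split on dots so the four octets become $1..$4.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    # Reverse the octet order and append the in-addr.arpa suffix.
    printf '%s.%s.%s.%s.in-addr.arpa.\n' "$4" "$3" "$2" "$1"
}

ptr_name 10.2.1.67   # codfw service address from the diff above
ptr_name 10.2.2.67   # eqiad service address from the diff above
```

Against a live server, dig -x 10.2.1.67 performs the same address-to-name mapping and queries the PTR record directly.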
[DNS change -> eqiad validation]
ryankemper@authdns1001:~$ for i in 0 1 2 ; do dig @ns${i}.wikimedia.org -t any wcqs.svc.eqiad.wmnet ; done

; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> @ns0.wikimedia.org -t any wcqs.svc.eqiad.wmnet
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48639
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1024
; COOKIE: 2488f3a7fd0c5bbe01d8534adfbac196 (good)
; OPT=11: 01 72 (".r")
;; QUESTION SECTION:
;wcqs.svc.eqiad.wmnet.          IN      ANY

;; ANSWER SECTION:
wcqs.svc.eqiad.wmnet.   3600    IN      HINFO   "RFC8482" ""

;; Query time: 0 msec
;; SERVER: 208.80.154.238#53(208.80.154.238)
;; WHEN: Tue Sep 21 18:49:13 UTC 2021
;; MSG SIZE  rcvd: 96


; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> @ns1.wikimedia.org -t any wcqs.svc.eqiad.wmnet
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 43542
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1024
; COOKIE: 4ea9c8d14a719114988252f4b7a24024 (good)
; OPT=11: 01 72 (".r")
;; QUESTION SECTION:
;wcqs.svc.eqiad.wmnet.          IN      ANY

;; ANSWER SECTION:
wcqs.svc.eqiad.wmnet.   3600    IN      HINFO   "RFC8482" ""

;; Query time: 0 msec
;; SERVER: 208.80.153.231#53(208.80.153.231)
;; WHEN: Tue Sep 21 18:49:13 UTC 2021
;; MSG SIZE  rcvd: 96


; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> @ns2.wikimedia.org -t any wcqs.svc.eqiad.wmnet
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1138
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1024
; COOKIE: 0e7f90b6a29651b34d8115b3a817a4cd (good)
; OPT=11: 01 72 (".r")
;; QUESTION SECTION:
;wcqs.svc.eqiad.wmnet.          IN      ANY

;; ANSWER SECTION:
wcqs.svc.eqiad.wmnet.   3600    IN      HINFO   "RFC8482" ""

;; Query time: 0 msec
;; SERVER: 91.198.174.239#53(91.198.174.239)
;; WHEN: Tue Sep 21 18:49:13 UTC 2021
;; MSG SIZE  rcvd: 96
[DNS change -> codfw validation]
ryankemper@authdns1001:~$ for i in 0 1 2 ; do dig @ns${i}.wikimedia.org -t any wcqs.svc.codfw.wmnet ; done

; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> @ns0.wikimedia.org -t any wcqs.svc.codfw.wmnet
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36183
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1024
; COOKIE: 8dad192df87e3fd3a2874c513054386d (good)
; OPT=11: 01 72 (".r")
;; QUESTION SECTION:
;wcqs.svc.codfw.wmnet.          IN      ANY

;; ANSWER SECTION:
wcqs.svc.codfw.wmnet.   3600    IN      HINFO   "RFC8482" ""

;; Query time: 0 msec
;; SERVER: 208.80.154.238#53(208.80.154.238)
;; WHEN: Tue Sep 21 18:49:39 UTC 2021
;; MSG SIZE  rcvd: 96


; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> @ns1.wikimedia.org -t any wcqs.svc.codfw.wmnet
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10558
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1024
; COOKIE: 8b49e3f1d8cc32489ef6822def8f7e99 (good)
; OPT=11: 01 72 (".r")
;; QUESTION SECTION:
;wcqs.svc.codfw.wmnet.          IN      ANY

;; ANSWER SECTION:
wcqs.svc.codfw.wmnet.   3600    IN      HINFO   "RFC8482" ""

;; Query time: 0 msec
;; SERVER: 208.80.153.231#53(208.80.153.231)
;; WHEN: Tue Sep 21 18:49:39 UTC 2021
;; MSG SIZE  rcvd: 96


; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> @ns2.wikimedia.org -t any wcqs.svc.codfw.wmnet
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19570
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1024
; COOKIE: 1092f615eef30577ebf68567b27e1666 (good)
; OPT=11: 01 72 (".r")
;; QUESTION SECTION:
;wcqs.svc.codfw.wmnet.          IN      ANY

;; ANSWER SECTION:
wcqs.svc.codfw.wmnet.   3600    IN      HINFO   "RFC8482" ""

;; Query time: 0 msec
;; SERVER: 91.198.174.239#53(91.198.174.239)
;; WHEN: Tue Sep 21 18:49:39 UTC 2021
;; MSG SIZE  rcvd: 96
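
In the responses above, each -t any query is answered with the RFC 8482 HINFO placeholder. That confirms the name exists on all three nameservers, but it does not show the address itself. A minimal sketch for pulling just the answer records out of dig output follows; the helper name answer_section is hypothetical, and the sample input is copied from the ns0 response above.

```shell
#!/bin/sh
# Print only the ANSWER SECTION records from dig output read on stdin.
# Hypothetical helper for illustration only.
answer_section() {
    awk '
        /^;; ANSWER SECTION:/ { in_answer = 1; next }  # start after the header
        in_answer && /^$/     { in_answer = 0 }        # a blank line ends the section
        in_answer             { print }
    '
}

# Sample input copied from the ns0 codfw response above.
answer_section <<'EOF'
;; QUESTION SECTION:
;wcqs.svc.codfw.wmnet.          IN      ANY

;; ANSWER SECTION:
wcqs.svc.codfw.wmnet.   3600    IN      HINFO   "RFC8482" ""

;; Query time: 0 msec
EOF
```

To see the address record rather than the placeholder, the same loop could be run with -t A in place of -t any and piped through this helper.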

Mentioned in SAL (#wikimedia-operations) [2021-09-21T18:53:18Z] <ryankemper> T280001 for i in 0 1 2 ; do dig @ns${i}.wikimedia.org -t any wcqs.svc.[eqiad,codfw].wmnet ; done looks as expected

Mentioned in SAL (#wikimedia-operations) [2021-09-21T18:56:55Z] <ryankemper> T280001 Running sudo -i cookbook sre.dns.netbox -t T280001 'Added wcqs.svc.[eqiad,codfw].wmnet' per final step of https://wikitech.wikimedia.org/wiki/LVS#DNS_changes_(svc_zone_only)...

[DNS change -> sre.dns.netbox cookbook]
ryankemper@cumin1001:~$ sudo -i cookbook sre.dns.netbox -t T280001 'Added wcqs.svc.[eqiad,codfw].wmnet'
START - Cookbook sre.dns.netbox
Generating the DNS records from Netbox data. It will take a couple of minutes.
----- OUTPUT of 'cd /tmp && runus...ad,codfw].wmnet"' -----
2021-09-21 18:57:04,823 [INFO] Gathering devices, interfaces, addresses and prefixes from Netbox
2021-09-21 19:00:22,214 [WARNING] Device frqueue1002 of IP 10.64.40.204/26 not in devices, skipping.
2021-09-21 19:00:22,219 [WARNING] Device phab1003 of IP 10.65.1.16/16 not in devices, skipping.
2021-09-21 19:00:22,262 [INFO] Gathered 2307 devices from Netbox
2021-09-21 19:00:22,262 [INFO] Generating DNS records
2021-09-21 19:00:32,939 [INFO] Generated 13348 direct and reverse records (6674 each) in 28 direct zones and 171 reverse zones
2021-09-21 19:00:32,940 [INFO] Cloning /srv/netbox-exports/dns.git/ to /tmp/dns-c25pcHBldHM-5qrdksyv ...
2021-09-21 19:00:33,215 [INFO] Generating zonefile snippets to directory /tmp/dns-c25pcHBldHM-5qrdksyv
2021-09-21 19:00:34,249 [INFO] Committed changes: 4a37a231dbab6bfb2f424feffef4d4f411b8dd49
2021-09-21 19:00:34,285 [INFO] Validating generated data
2021-09-21 19:00:34,286 [INFO] Commit details: {'insertions': 4, 'deletions': 0, 'lines': 4, 'files': 4}
commit 4a37a231dbab6bfb2f424feffef4d4f411b8dd49
Author: generate-dns-snippets <noc@wikimedia.org>
Date:   Tue Sep 21 19:00:34 2021 +0000

    ryankemper@cumin1001: Added wcqs.svc.[eqiad,codfw].wmnet

diff --git a/1.2.10.in-addr.arpa b/1.2.10.in-addr.arpa
index 16f68f5..0d9297a 100644
--- a/1.2.10.in-addr.arpa
+++ b/1.2.10.in-addr.arpa
@@ -53,3 +53,4 @@
 64  1H IN PTR shellbox-media.svc.codfw.wmnet.
 65  1H IN PTR shellbox-syntaxhighlight.svc.codfw.wmnet.
 66  1H IN PTR shellbox-timeline.svc.codfw.wmnet.
+67  1H IN PTR wcqs.svc.codfw.wmnet.
diff --git a/2.2.10.in-addr.arpa b/2.2.10.in-addr.arpa
index adeb0c3..6904f40 100644
--- a/2.2.10.in-addr.arpa
+++ b/2.2.10.in-addr.arpa
@@ -58,3 +58,4 @@
 64  1H IN PTR shellbox-media.svc.eqiad.wmnet.
 65  1H IN PTR shellbox-syntaxhighlight.svc.eqiad.wmnet.
 66  1H IN PTR shellbox-timeline.svc.eqiad.wmnet.
+67  1H IN PTR wcqs.svc.eqiad.wmnet.
diff --git a/svc.codfw.wmnet b/svc.codfw.wmnet
index 23001c2..965466b 100644
--- a/svc.codfw.wmnet
+++ b/svc.codfw.wmnet
@@ -50,6 +50,7 @@ thanos-swift                             1H IN A 10.2.1.54
 thumbor                                  1H IN A 10.2.1.24
 toolhub                                  1H IN A 10.2.1.62
 videoscaler                              1H IN A 10.2.1.5
+wcqs                                     1H IN A 10.2.1.67
 wdqs                                     1H IN A 10.2.1.32
 wdqs-internal                            1H IN A 10.2.1.41
 wikifeeds                                1H IN A 10.2.1.47
diff --git a/svc.eqiad.wmnet b/svc.eqiad.wmnet
index 3dd0aa1..185c684 100644
--- a/svc.eqiad.wmnet
+++ b/svc.eqiad.wmnet
@@ -57,6 +57,7 @@ thanos-swift                             1H IN A 10.2.2.54
 thumbor                                  1H IN A 10.2.2.24
 toolhub                                  1H IN A 10.2.2.62
 videoscaler                              1H IN A 10.2.2.5
+wcqs                                     1H IN A 10.2.2.67
 wdqs                                     1H IN A 10.2.2.32
 wdqs-internal                            1H IN A 10.2.2.41
 wikifeeds                                1H IN A 10.2.2.47
METADATA: {"path": "/tmp/dns-c25pcHBldHM-5qrdksyv", "sha1": "4a37a231dbab6bfb2f424feffef4d4f411b8dd49", "insertions": 4, "deletions": 0, "lines": 4, "files": 4}
================
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [03:30<00:00, 210.69s/hosts]
FAIL |                                                                                                                                                                                                                                                                                                                          |   0% (0/1) [03:30<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'cd /tmp && runus...ad,codfw].wmnet"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
>>> Have you checked that the diff is OK?
Type "go" to proceed or "abort" to interrupt the execution
> go
----- OUTPUT of 'cd /tmp && runus...ef4d4f411b8dd49"' -----
2021-09-21 19:03:14,256 [INFO] Pushed with bitflags 256: f2989eb..4a37a23
2021-09-21 19:03:14,382 [INFO] Temporary directory /tmp/dns-c25pcHBldHM-5qrdksyv removed.
================
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.00hosts/s]
FAIL |                                                                                                                                                                                                                                                                                                                          |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'cd /tmp && runus...ef4d4f411b8dd49"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Updating the Netbox passive copies of the repository on netbox2001.wikimedia.org
----- OUTPUT of 'runuser -u netbo...rg master:master' -----
From https://netbox1001.wikimedia.org/dns
   f2989eb..4a37a23  master     -> master
   f2989eb..4a37a23  master     -> netbox1001.wikimedia.org/master
================
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.44s/hosts]
FAIL |                                                                                                                                                                                                                                                                                                                          |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'runuser -u netbo...rg master:master'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Updating the authdns copies of the repository on authdns[1001,2001].wikimedia.org,dns[1001-1002,2001-2002,3001-3002,4001-4002,5001-5002].wikimedia.org
===== NODE GROUP =====
(12) authdns[1001,2001].wikimedia.org,dns[1001-1002,2001-2002,3001-3002,4001-4002,5001-5002].wikimedia.org
----- OUTPUT of 'runuser -u netbo...fef4d4f411b8dd49' -----
From https://netbox-exports.wikimedia.org/dns
   f2989eb..4a37a23  master     -> origin/master
Updating f2989eb..4a37a23
Fast-forward
 1.2.10.in-addr.arpa | 1 +
 2.2.10.in-addr.arpa | 1 +
 svc.codfw.wmnet     | 1 +
 svc.eqiad.wmnet     | 1 +
 4 files changed, 4 insertions(+)
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (12/12) [00:07<00:00,  1.13hosts/s]
FAIL |                                                                                                                                                                                                                                                                                                                         |   0% (0/12) [00:07<?, ?hosts/s]
100.0% (12/12) success ratio (>= 100.0% threshold) for command: 'runuser -u netbo...fef4d4f411b8dd49'.
100.0% (12/12) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Deploying the updated zonefiles on authdns[1001,2001].wikimedia.org,dns[1001-1002,2001-2002,3001-3002,4001-4002,5001-5002].wikimedia.org
===== NODE GROUP =====
(1) dns5002.wikimedia.org
----- OUTPUT of 'cd /srv/authdns/...nippets --deploy' -----
Assembling and testing data in /tmp/dns-check.et0f4m6b
 -- Generating zonefiles from zone templates
 -- Processed 211 zones into directory /tmp/dns-check.et0f4m6b/zones
OK: No tabs
Summary of violations:
    W001|MISSING_IP_FOR_NAME_AND_PTR: 37
    W002|MISSING_PTR_FOR_NAME_AND_IP: 26
    W103|MISSING_MGMT_FOR_NAME: 64
    W105|TOO_MANY_PUBLIC_NAMES: 3
RESULT: 0 Errors, 130 Warnings, 0 Ignored violations, 0 Ignored lines
 -- Copying automatically generated zone files under target tree
 -- Copying repo-driven real config files and admin_state
 -- Copying puppetized config and GeoIP from /etc/gdnsd
 -- Checking for illegal tabs in zonefiles
 -- Running zone_validator to check WMF rules
 -- Running /usr/sbin/gdnsd checkconf on /tmp/dns-check.et0f4m6b
 -- Preflight checkconf is OK
Deploying from /tmp/dns-check.et0f4m6b to system dirs
 -- Descending to subdirectory: netbox
 -- Zone changed: 1.2.10.in-addr.arpa
 -- Zone changed: 2.2.10.in-addr.arpa
 -- Zone changed: svc.codfw.wmnet
 -- Zone changed: svc.eqiad.wmnet
 -- Done with subdir: netbox
Reloading gdnsd zonefiles
info: Zone data reloaded
OK
===== NODE GROUP =====
(1) dns5001.wikimedia.org
----- OUTPUT of 'cd /srv/authdns/...nippets --deploy' -----
Assembling and testing data in /tmp/dns-check.yzuv53gu
 -- Generating zonefiles from zone templates
 -- Processed 211 zones into directory /tmp/dns-check.yzuv53gu/zones
OK: No tabs
Summary of violations:
    W001|MISSING_IP_FOR_NAME_AND_PTR: 37
    W002|MISSING_PTR_FOR_NAME_AND_IP: 26
    W103|MISSING_MGMT_FOR_NAME: 64
    W105|TOO_MANY_PUBLIC_NAMES: 3
RESULT: 0 Errors, 130 Warnings, 0 Ignored violations, 0 Ignored lines
 -- Copying automatically generated zone files under target tree
 -- Copying repo-driven real config files and admin_state
 -- Copying puppetized config and GeoIP from /etc/gdnsd
 -- Checking for illegal tabs in zonefiles
 -- Running zone_validator to check WMF rules
 -- Running /usr/sbin/gdnsd checkconf on /tmp/dns-check.yzuv53gu
 -- Preflight checkconf is OK
Deploying from /tmp/dns-check.yzuv53gu to system dirs
 -- Descending to subdirectory: netbox
 -- Zone changed: svc.codfw.wmnet
 -- Zone changed: 1.2.10.in-addr.arpa
 -- Zone changed: 2.2.10.in-addr.arpa
 -- Zone changed: svc.eqiad.wmnet
 -- Done with subdir: netbox
Reloading gdnsd zonefiles
info: Zone data reloaded
OK
===== NODE GROUP =====
(1) dns3002.wikimedia.org
----- OUTPUT of 'cd /srv/authdns/...nippets --deploy' -----
Assembling and testing data in /tmp/dns-check.01rsyr3x
 -- Generating zonefiles from zone templates
 -- Processed 211 zones into directory /tmp/dns-check.01rsyr3x/zones
OK: No tabs
Summary of violations:
    W001|MISSING_IP_FOR_NAME_AND_PTR: 37
    W002|MISSING_PTR_FOR_NAME_AND_IP: 26
    W103|MISSING_MGMT_FOR_NAME: 64
    W105|TOO_MANY_PUBLIC_NAMES: 3
RESULT: 0 Errors, 130 Warnings, 0 Ignored violations, 0 Ignored lines
 -- Copying automatically generated zone files under target tree
 -- Copying repo-driven real config files and admin_state
 -- Copying puppetized config and GeoIP from /etc/gdnsd
 -- Checking for illegal tabs in zonefiles
 -- Running zone_validator to check WMF rules
 -- Running /usr/sbin/gdnsd checkconf on /tmp/dns-check.01rsyr3x
 -- Preflight checkconf is OK
Deploying from /tmp/dns-check.01rsyr3x to system dirs
 -- Descending to subdirectory: netbox
 -- Zone changed: 1.2.10.in-addr.arpa
 -- Zone changed: 2.2.10.in-addr.arpa
 -- Zone changed: svc.codfw.wmnet
 -- Zone changed: svc.eqiad.wmnet
 -- Done with subdir: netbox
Reloading gdnsd zonefiles
info: Zone data reloaded
OK
===== NODE GROUP =====
(1) dns3001.wikimedia.org
----- OUTPUT of 'cd /srv/authdns/...nippets --deploy' -----
Assembling and testing data in /tmp/dns-check._npmsdv0
 -- Generating zonefiles from zone templates
 -- Processed 211 zones into directory /tmp/dns-check._npmsdv0/zones
OK: No tabs
Summary of violations:
    W001|MISSING_IP_FOR_NAME_AND_PTR: 37
    W002|MISSING_PTR_FOR_NAME_AND_IP: 26
    W103|MISSING_MGMT_FOR_NAME: 64
    W105|TOO_MANY_PUBLIC_NAMES: 3
RESULT: 0 Errors, 130 Warnings, 0 Ignored violations, 0 Ignored lines
 -- Copying automatically generated zone files under target tree
 -- Copying repo-driven real config files and admin_state
 -- Copying puppetized config and GeoIP from /etc/gdnsd
 -- Checking for illegal tabs in zonefiles
 -- Running zone_validator to check WMF rules
 -- Running /usr/sbin/gdnsd checkconf on /tmp/dns-check._npmsdv0
 -- Preflight checkconf is OK
Deploying from /tmp/dns-check._npmsdv0 to system dirs
 -- Descending to subdirectory: netbox
 -- Zone changed: 1.2.10.in-addr.arpa
 -- Zone changed: 2.2.10.in-addr.arpa
 -- Zone changed: svc.eqiad.wmnet
 -- Zone changed: svc.codfw.wmnet
 -- Done with subdir: netbox
Reloading gdnsd zonefiles
info: Zone data reloaded
OK
===== NODE GROUP =====
(1) dns4002.wikimedia.org
----- OUTPUT of 'cd /srv/authdns/...nippets --deploy' -----
Assembling and testing data in /tmp/dns-check.ed44fb1l
 -- Generating zonefiles from zone templates
 -- Processed 211 zones into directory /tmp/dns-check.ed44fb1l/zones
OK: No tabs
Summary of violations:
    W001|MISSING_IP_FOR_NAME_AND_PTR: 37
    W002|MISSING_PTR_FOR_NAME_AND_IP: 26
    W103|MISSING_MGMT_FOR_NAME: 64
    W105|TOO_MANY_PUBLIC_NAMES: 3
RESULT: 0 Errors, 130 Warnings, 0 Ignored violations, 0 Ignored lines
 -- Copying automatically generated zone files under target tree
 -- Copying repo-driven real config files and admin_state
 -- Copying puppetized config and GeoIP from /etc/gdnsd
 -- Checking for illegal tabs in zonefiles
 -- Running zone_validator to check WMF rules
 -- Running /usr/sbin/gdnsd checkconf on /tmp/dns-check.ed44fb1l
 -- Preflight checkconf is OK
Deploying from /tmp/dns-check.ed44fb1l to system dirs
 -- Descending to subdirectory: netbox
 -- Zone changed: svc.codfw.wmnet
 -- Zone changed: 1.2.10.in-addr.arpa
 -- Zone changed: svc.eqiad.wmnet
 -- Zone changed: 2.2.10.in-addr.arpa
 -- Done with subdir: netbox
Reloading gdnsd zonefiles
info: Zone data reloaded
OK
===== NODE GROUP =====
(1) dns4001.wikimedia.org
----- OUTPUT of 'cd /srv/authdns/...nippets --deploy' -----
Assembling and testing data in /tmp/dns-check.wj4__tz9
 -- Generating zonefiles from zone templates
 -- Processed 211 zones into directory /tmp/dns-check.wj4__tz9/zones
OK: No tabs
Summary of violations:
    W001|MISSING_IP_FOR_NAME_AND_PTR: 37
    W002|MISSING_PTR_FOR_NAME_AND_IP: 26
    W103|MISSING_MGMT_FOR_NAME: 64
    W105|TOO_MANY_PUBLIC_NAMES: 3
RESULT: 0 Errors, 130 Warnings, 0 Ignored violations, 0 Ignored lines
 -- Copying automatically generated zone files under target tree
 -- Copying repo-driven real config files and admin_state
 -- Copying puppetized config and GeoIP from /etc/gdnsd
 -- Checking for illegal tabs in zonefiles
 -- Running zone_validator to check WMF rules
 -- Running /usr/sbin/gdnsd checkconf on /tmp/dns-check.wj4__tz9
 -- Preflight checkconf is OK
Deploying from /tmp/dns-check.wj4__tz9 to system dirs
 -- Descending to subdirectory: netbox
 -- Zone changed: 1.2.10.in-addr.arpa
 -- Zone changed: svc.codfw.wmnet
 -- Zone changed: svc.eqiad.wmnet
 -- Zone changed: 2.2.10.in-addr.arpa
 -- Done with subdir: netbox
Reloading gdnsd zonefiles
info: Zone data reloaded
OK
===== NODE GROUP =====
(1) dns2001.wikimedia.org
----- OUTPUT of 'cd /srv/authdns/...nippets --deploy' -----
Assembling and testing data in /tmp/dns-check.w_a9p6gs
 -- Generating zonefiles from zone templates
 -- Processed 211 zones into directory /tmp/dns-check.w_a9p6gs/zones
OK: No tabs
Summary of violations:
    W001|MISSING_IP_FOR_NAME_AND_PTR: 37
    W002|MISSING_PTR_FOR_NAME_AND_IP: 26
    W103|MISSING_MGMT_FOR_NAME: 64
    W105|TOO_MANY_PUBLIC_NAMES: 3
RESULT: 0 Errors, 130 Warnings, 0 Ignored violations, 0 Ignored lines
 -- Copying automatically generated zone files under target tree
 -- Copying repo-driven real config files and admin_state
 -- Copying puppetized config and GeoIP from /etc/gdnsd
 -- Checking for illegal tabs in zonefiles
 -- Running zone_validator to check WMF rules
 -- Running /usr/sbin/gdnsd checkconf on /tmp/dns-check.w_a9p6gs
 -- Preflight checkconf is OK
Deploying from /tmp/dns-check.w_a9p6gs to system dirs
 -- Descending to subdirectory: netbox
 -- Zone changed: svc.codfw.wmnet
 -- Zone changed: 2.2.10.in-addr.arpa
 -- Zone changed: 1.2.10.in-addr.arpa
 -- Zone changed: svc.eqiad.wmnet
 -- Done with subdir: netbox
Reloading gdnsd zonefiles
info: Zone data reloaded
OK
===== NODE GROUP =====
(1) dns2002.wikimedia.org
----- OUTPUT of 'cd /srv/authdns/...nippets --deploy' -----
Assembling and testing data in /tmp/dns-check.k7qigeg8
 -- Generating zonefiles from zone templates
 -- Processed 211 zones into directory /tmp/dns-check.k7qigeg8/zones
OK: No tabs
Summary of violations:
    W001|MISSING_IP_FOR_NAME_AND_PTR: 37
    W002|MISSING_PTR_FOR_NAME_AND_IP: 26
    W103|MISSING_MGMT_FOR_NAME: 64
    W105|TOO_MANY_PUBLIC_NAMES: 3
RESULT: 0 Errors, 130 Warnings, 0 Ignored violations, 0 Ignored lines
 -- Copying automatically generated zone files under target tree
 -- Copying repo-driven real config files and admin_state
 -- Copying puppetized config and GeoIP from /etc/gdnsd
 -- Checking for illegal tabs in zonefiles
 -- Running zone_validator to check WMF rules
 -- Running /usr/sbin/gdnsd checkconf on /tmp/dns-check.k7qigeg8
 -- Preflight checkconf is OK
Deploying from /tmp/dns-check.k7qigeg8 to system dirs
 -- Descending to subdirectory: netbox
 -- Zone changed: 2.2.10.in-addr.arpa
 -- Zone changed: svc.eqiad.wmnet
 -- Zone changed: svc.codfw.wmnet
 -- Zone changed: 1.2.10.in-addr.arpa
 -- Done with subdir: netbox
Reloading gdnsd zonefiles
info: Zone data reloaded
OK
===== NODE GROUP =====
(1) authdns2001.wikimedia.org
----- OUTPUT of 'cd /srv/authdns/...nippets --deploy' -----
Assembling and testing data in /tmp/dns-check.eroj7ylw
 -- Generating zonefiles from zone templates
 -- Processed 211 zones into directory /tmp/dns-check.eroj7ylw/zones
OK: No tabs
Summary of violations:
    W001|MISSING_IP_FOR_NAME_AND_PTR: 37
    W002|MISSING_PTR_FOR_NAME_AND_IP: 26
    W103|MISSING_MGMT_FOR_NAME: 64
    W105|TOO_MANY_PUBLIC_NAMES: 3
RESULT: 0 Errors, 130 Warnings, 0 Ignored violations, 0 Ignored lines
 -- Copying automatically generated zone files under target tree
 -- Copying repo-driven real config files and admin_state
 -- Copying puppetized config and GeoIP from /etc/gdnsd
 -- Checking for illegal tabs in zonefiles
 -- Running zone_validator to check WMF rules
 -- Running /usr/sbin/gdnsd checkconf on /tmp/dns-check.eroj7ylw
 -- Preflight checkconf is OK
Deploying from /tmp/dns-check.eroj7ylw to system dirs
 -- Descending to subdirectory: netbox
 -- Zone changed: svc.codfw.wmnet
 -- Zone changed: 1.2.10.in-addr.arpa
 -- Zone changed: svc.eqiad.wmnet
 -- Zone changed: 2.2.10.in-addr.arpa
 -- Done with subdir: netbox
Reloading gdnsd zonefiles
info: Zone data reloaded
OK
===== NODE GROUP =====
(1) authdns1001.wikimedia.org
----- OUTPUT of 'cd /srv/authdns/...nippets --deploy' -----
Assembling and testing data in /tmp/dns-check.j6b2_yq6
 -- Generating zonefiles from zone templates
 -- Processed 211 zones into directory /tmp/dns-check.j6b2_yq6/zones
OK: No tabs
Summary of violations:
    W001|MISSING_IP_FOR_NAME_AND_PTR: 37
    W002|MISSING_PTR_FOR_NAME_AND_IP: 26
    W103|MISSING_MGMT_FOR_NAME: 64
    W105|TOO_MANY_PUBLIC_NAMES: 3
RESULT: 0 Errors, 130 Warnings, 0 Ignored violations, 0 Ignored lines
 -- Copying automatically generated zone files under target tree
 -- Copying repo-driven real config files and admin_state
 -- Copying puppetized config and GeoIP from /etc/gdnsd
 -- Checking for illegal tabs in zonefiles
 -- Running zone_validator to check WMF rules
 -- Running /usr/sbin/gdnsd checkconf on /tmp/dns-check.j6b2_yq6
 -- Preflight checkconf is OK
Deploying from /tmp/dns-check.j6b2_yq6 to system dirs
 -- Descending to subdirectory: netbox
 -- Zone changed: svc.codfw.wmnet
 -- Zone changed: 2.2.10.in-addr.arpa
 -- Zone changed: 1.2.10.in-addr.arpa
 -- Zone changed: svc.eqiad.wmnet
 -- Done with subdir: netbox
Reloading gdnsd zonefiles
info: Zone data reloaded
OK
===== NODE GROUP =====
(1) dns1001.wikimedia.org
----- OUTPUT of 'cd /srv/authdns/...nippets --deploy' -----
Assembling and testing data in /tmp/dns-check._wrnetwx
 -- Generating zonefiles from zone templates
 -- Processed 211 zones into directory /tmp/dns-check._wrnetwx/zones
OK: No tabs
Summary of violations:
    W001|MISSING_IP_FOR_NAME_AND_PTR: 37
    W002|MISSING_PTR_FOR_NAME_AND_IP: 26
    W103|MISSING_MGMT_FOR_NAME: 64
    W105|TOO_MANY_PUBLIC_NAMES: 3
RESULT: 0 Errors, 130 Warnings, 0 Ignored violations, 0 Ignored lines
 -- Copying automatically generated zone files under target tree
 -- Copying repo-driven real config files and admin_state
 -- Copying puppetized config and GeoIP from /etc/gdnsd
 -- Checking for illegal tabs in zonefiles
 -- Running zone_validator to check WMF rules
 -- Running /usr/sbin/gdnsd checkconf on /tmp/dns-check._wrnetwx
 -- Preflight checkconf is OK
Deploying from /tmp/dns-check._wrnetwx to system dirs
 -- Descending to subdirectory: netbox
 -- Zone changed: svc.codfw.wmnet
 -- Zone changed: svc.eqiad.wmnet
 -- Zone changed: 2.2.10.in-addr.arpa
 -- Zone changed: 1.2.10.in-addr.arpa
 -- Done with subdir: netbox
Reloading gdnsd zonefiles
info: Zone data reloaded
OK
===== NODE GROUP =====
(1) dns1002.wikimedia.org
----- OUTPUT of 'cd /srv/authdns/...nippets --deploy' -----
Assembling and testing data in /tmp/dns-check.z46qq795
 -- Generating zonefiles from zone templates
 -- Processed 211 zones into directory /tmp/dns-check.z46qq795/zones
OK: No tabs
Summary of violations:
    W001|MISSING_IP_FOR_NAME_AND_PTR: 37
    W002|MISSING_PTR_FOR_NAME_AND_IP: 26
    W103|MISSING_MGMT_FOR_NAME: 64
    W105|TOO_MANY_PUBLIC_NAMES: 3
RESULT: 0 Errors, 130 Warnings, 0 Ignored violations, 0 Ignored lines
 -- Copying automatically generated zone files under target tree
 -- Copying repo-driven real config files and admin_state
 -- Copying puppetized config and GeoIP from /etc/gdnsd
 -- Checking for illegal tabs in zonefiles
 -- Running zone_validator to check WMF rules
 -- Running /usr/sbin/gdnsd checkconf on /tmp/dns-check.z46qq795
 -- Preflight checkconf is OK
Deploying from /tmp/dns-check.z46qq795 to system dirs
 -- Descending to subdirectory: netbox
 -- Zone changed: 1.2.10.in-addr.arpa
 -- Zone changed: 2.2.10.in-addr.arpa
 -- Zone changed: svc.eqiad.wmnet
 -- Zone changed: svc.codfw.wmnet
 -- Done with subdir: netbox
Reloading gdnsd zonefiles
info: Zone data reloaded
OK
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (12/12) [00:35<00:00,  1.45s/hosts]
FAIL |                                                                                                                                                                                                                                                                                                                         |   0% (0/12) [00:35<?, ?hosts/s]
100.0% (12/12) success ratio (>= 100.0% threshold) for command: 'cd /srv/authdns/...nippets --deploy'.
100.0% (12/12) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
END (PASS) - Cookbook sre.dns.netbox (exit_code=0)

Mentioned in SAL (#wikimedia-operations) [2021-09-21T19:10:43Z] <ryankemper> T280001 sre.dns.netbox completed successfully

Change 713959 had a related patch set uploaded (by Ryan Kemper; author: Ebernhardson):

[operations/puppet@production] blazegraph: LVS for WCQS step 1

https://gerrit.wikimedia.org/r/713959

Change 713959 merged by Ryan Kemper:

[operations/puppet@production] query_service: LVS for WCQS step 1

https://gerrit.wikimedia.org/r/713959

[conftool changes following merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/713959]
Now running conftool-merge to sync any changes to conftool data
Running conftool-sync on /etc/conftool/data
2021-09-23 16:38:10 [INFO] conftool::load_files: Loading data for entity node from /etc/conftool/data
2021-09-23 16:38:10 [INFO] conftool::load_files: Parsing file /etc/conftool/data/node/codfw.yaml
2021-09-23 16:38:10 [INFO] conftool::load_files: Parsing file /etc/conftool/data/node/eqsin.yaml
2021-09-23 16:38:10 [INFO] conftool::load_files: Parsing file /etc/conftool/data/node/eqiad.yaml
2021-09-23 16:38:10 [INFO] conftool::load_files: Parsing file /etc/conftool/data/node/esams.yaml
2021-09-23 16:38:10 [INFO] conftool::load_files: Parsing file /etc/conftool/data/node/ulsfo.yaml
2021-09-23 16:38:10 [INFO] conftool::load_files: Loading data for entity discovery from /etc/conftool/data
2021-09-23 16:38:10 [INFO] conftool::load_files: Parsing file /etc/conftool/data/discovery/services.yaml
2021-09-23 16:38:10 [INFO] conftool::load_files: Parsing file /etc/conftool/data/discovery/mediawiki.yaml
2021-09-23 16:38:10 [INFO] conftool::load_files: Loading data for entity mwconfig from /etc/conftool/data
2021-09-23 16:38:10 [INFO] conftool::load_files: Parsing file /etc/conftool/data/mwconfig/data.yaml
2021-09-23 16:38:10 [INFO] conftool::load_files: Loading data for entity dbconfig-instance from /etc/conftool/data
2021-09-23 16:38:10 [INFO] conftool::load_files: Parsing file /etc/conftool/data/dbconfig-instance/instances.yaml
2021-09-23 16:38:10 [INFO] conftool::load_files: Loading data for entity dbconfig-section from /etc/conftool/data
2021-09-23 16:38:10 [INFO] conftool::load_files: Parsing file /etc/conftool/data/dbconfig-section/sections.yaml
2021-09-23 16:38:10 [INFO] conftool::load: Adding objects for node
2021-09-23 16:38:10 [INFO] conftool::load: Creating node with tags codfw/wcqs/wcqs/wcqs2003.codfw.wmnet
2021-09-23 16:38:10 [INFO] conftool::load: Creating node with tags codfw/wcqs/wcqs/wcqs2002.codfw.wmnet
2021-09-23 16:38:10 [INFO] conftool::load: Creating node with tags eqiad/wcqs/wcqs/wcqs1001.eqiad.wmnet
2021-09-23 16:38:10 [INFO] conftool::load: Creating node with tags eqiad/wcqs/wcqs/wcqs1003.eqiad.wmnet
2021-09-23 16:38:10 [INFO] conftool::load: Creating node with tags eqiad/wcqs/wcqs/wcqs1002.eqiad.wmnet
2021-09-23 16:38:10 [INFO] conftool::load: Creating node with tags codfw/wcqs/wcqs/wcqs2001.codfw.wmnet
2021-09-23 16:38:10 [INFO] conftool::load: Adding objects for discovery
2021-09-23 16:38:10 [INFO] conftool::load: Creating discovery with tags wcqs/eqiad
2021-09-23 16:38:10 [INFO] conftool::load: Creating discovery with tags wcqs/codfw
2021-09-23 16:38:10 [INFO] conftool::load: Adding objects for mwconfig
2021-09-23 16:38:10 [INFO] conftool::load: Adding objects for dbconfig-instance
2021-09-23 16:38:10 [INFO] conftool::load: Adding objects for dbconfig-section
2021-09-23 16:38:10 [INFO] conftool::load: Removing stale objects for dbconfig-section
2021-09-23 16:38:10 [INFO] conftool::load: Removing stale objects for dbconfig-instance
2021-09-23 16:38:10 [INFO] conftool::load: Removing stale objects for mwconfig
2021-09-23 16:38:10 [INFO] conftool::load: Removing stale objects for discovery
2021-09-23 16:38:10 [INFO] conftool::load: Removing stale objects for node

And here's how the DNS discovery side of conftool looks after the deploy of https://gerrit.wikimedia.org/r/c/operations/puppet/+/713959:

ryankemper@puppetmaster1001:~$ sudo -i confctl --quiet --object-type discovery select 'dnsdisc=wcqs' get
{"eqiad": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=wcqs"}
{"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=wcqs"}
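The pooled state shown above can also be checked mechanically. This is a minimal sketch, assuming the output is exactly as pasted: one JSON object per line, with a single datacenter key next to `tags`. The helper name `pooled_by_dc` is hypothetical and is not part of conftool.

```python
import json


def pooled_by_dc(confctl_output: str) -> dict:
    """Map each datacenter name to its `pooled` flag.

    Assumes one JSON object per line, as in the pasted confctl output.
    """
    result = {}
    for line in confctl_output.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for key, value in record.items():
            if key == "tags":
                # "tags" carries the selector, not a datacenter entry
                continue
            result[key] = value["pooled"]
    return result


# The two lines pasted in the comment above
sample = """\
{"eqiad": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=wcqs"}
{"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=wcqs"}
"""

print(pooled_by_dc(sample))  # {'eqiad': False, 'codfw': False}
```

Both datacenters report `pooled: false`, which is what you would expect for a discovery record that has only just been created.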

Change 723254 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wcqs: go from service_setup to lvs_setup

https://gerrit.wikimedia.org/r/723254

Change 721089 had a related patch set uploaded (by Ryan Kemper; author: Ebernhardson):

[operations/puppet@production] Declare wikimedia_cluster for wcqs

https://gerrit.wikimedia.org/r/721089

Change 721089 merged by Ryan Kemper:

[operations/puppet@production] Declare wikimedia_cluster for wcqs

https://gerrit.wikimedia.org/r/721089

[Steps for deploy of https://gerrit.wikimedia.org/r/723254]
merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/723254
(on cumin) => sudo cumin 'O:lvs::balancer' 'sudo run-puppet-agent'
ack alerts

!log T280001 Restarting pybal on backup low-traffic hosts `lvs2010` and `lvs1016`
sudo cumin 'P{lvs2010*,lvs1016*}' 'sudo systemctl restart pybal'


Sanity check of `sudo ipvsadm -L -n` on low-traffic backups `lvs2010` and `lvs1016` => contains wcqs and has sane list of backends (`wcqs*`)

!log T280001 Sanity check of `sudo ipvsadm -L -n` on backup `lvs2010` and `lvs1016` looks good, proceeding

Wait 120s while checking https://icinga.wikimedia.org/alerts
!log T280001 Waited 120s and checked https://icinga.wikimedia.org/alerts, proceeding to primary low-traffic hosts `lvs2009` and `lvs1015`

!log T280001 Restarting pybal on low-traffic primaries `lvs2009` and `lvs1015`
sudo cumin 'P{lvs2009*,lvs1015*}' 'sudo systemctl restart pybal'

Run a test like `curl -v -k https://wcqs.svc.eqiad.wmnet`
!log T280001 Sanity check of `wcqs.svc.{eqiad,codfw}.wmnet` looks good; declaring this a success
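The ordering in the steps above (puppet on all balancers, then backups, check, wait, then primaries) can be written down as a dry-run plan. This is only a sketch: the function name `pybal_rollout_plan` is hypothetical, the commands are the ones quoted in the steps, and `sleep` is an assumed stand-in for the manual 120s wait. Nothing is executed; the function only returns strings.

```python
def pybal_rollout_plan(backups, primaries, wait_seconds=120):
    """Return the ordered commands for a rolling pybal restart.

    Backups are restarted and checked before any primary is touched.
    """
    def restart(hosts):
        # Builds a cumin selector such as P{lvs2010*,lvs1016*}
        selector = ",".join(f"{host}*" for host in hosts)
        return f"sudo cumin 'P{{{selector}}}' 'sudo systemctl restart pybal'"

    return [
        "sudo cumin 'O:lvs::balancer' 'sudo run-puppet-agent'",
        restart(backups),
        "sudo ipvsadm -L -n  # on each backup: expect wcqs with wcqs* backends",
        f"sleep {wait_seconds}  # meanwhile watch https://icinga.wikimedia.org/alerts",
        restart(primaries),
    ]


plan = pybal_rollout_plan(["lvs2010", "lvs1016"], ["lvs2009", "lvs1015"])
for step in plan:
    print(step)
```

Keeping the backups first means a bad configuration shows up on hosts that are not carrying traffic before the primaries are restarted.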

Mentioned in SAL (#wikimedia-operations) [2021-09-23T20:47:23Z] <ryankemper> T280001 Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/723254 to proceed with lvs_setup state change; will be restarting low-traffic lvs hosts shortly

Change 723254 merged by Ryan Kemper:

[operations/puppet@production] wcqs: go from service_setup to lvs_setup

https://gerrit.wikimedia.org/r/723254

Mentioned in SAL (#wikimedia-operations) [2021-09-23T20:53:10Z] <ryankemper> T280001 Ran puppet on all lvs hosts => ryankemper@cumin1001:~$ sudo cumin 'O:lvs::balancer' 'sudo run-puppet-agent'

Mentioned in SAL (#wikimedia-operations) [2021-09-23T20:53:55Z] <ryankemper> T280001 Restarting pybal on backup low-traffic hosts lvs2010 and lvs1016...

Mentioned in SAL (#wikimedia-operations) [2021-09-23T20:54:37Z] <ryankemper> T280001 Restarted pybal on backup low-traffic hosts: ryankemper@cumin1001:~$ sudo cumin 'P{lvs2010*,lvs1016*}' 'sudo systemctl restart pybal'

Mentioned in SAL (#wikimedia-operations) [2021-09-23T21:00:05Z] <ryankemper> T280001 TCP 10.2.1.67:443 wrr shows up on ryankemper@lvs1016:~$ sudo ipvsadm -L -n and TCP 10.2.2.67:443 wrr shows up on ryankemper@lvs2010:~$ sudo ipvsadm -L -n as expected

Mentioned in SAL (#wikimedia-operations) [2021-09-23T21:00:14Z] <ryankemper> T280001 Sanity check of sudo ipvsadm -L -n on backup lvs2010 and lvs1016 looks good, proceeding

Mentioned in SAL (#wikimedia-operations) [2021-09-23T21:00:55Z] <ryankemper> T280001 Sanity check of sudo ipvsadm -L -n on low-traffic backups lvs2010 and lvs1016 looks good, proceeding

Mentioned in SAL (#wikimedia-operations) [2021-09-23T21:04:12Z] <ryankemper> T280001 Waited 120s and checked https://icinga.wikimedia.org/alerts, proceeding to primary low-traffic hosts lvs2009 and lvs1015

Mentioned in SAL (#wikimedia-operations) [2021-09-23T21:04:40Z] <ryankemper> T280001 Restarting pybal on low-traffic primaries lvs2009 and lvs1015...

Mentioned in SAL (#wikimedia-operations) [2021-09-23T21:05:44Z] <ryankemper> T280001 Restarted pybal on low-traffic primaries: ryankemper@cumin1001:~$ sudo cumin 'P{lvs2009*,lvs1015*}' 'sudo systemctl restart pybal'

Mentioned in SAL (#wikimedia-operations) [2021-09-23T21:23:28Z] <ryankemper> T280001 Swapped IPs of https://netbox.wikimedia.org/ipam/ip-addresses/9062/ and https://netbox.wikimedia.org/ipam/ip-addresses/9063; this should fix the issue where eqiad and codfw were swapped in netbox (my error)...still need to run netbox cookbook and possibly a manual sudo authdns-update

Mentioned in SAL (#wikimedia-operations) [2021-09-23T21:26:59Z] <ryankemper> T280001 ryankemper@cumin1001:~$ sudo -i cookbook sre.dns.netbox -t T280001 'Fix swapped wcqs.svc.[eqiad,codfw].wmnet' in progress (note: no sudo authdns-update will be necessary because that's just for operations/dns repo changes; we only need to run the netbox cookbook)

Mentioned in SAL (#wikimedia-operations) [2021-09-23T21:36:04Z] <ryankemper> T280001 sre.dns.netbox run complete, netbox IP mixup *should* be resolved

Change 723315 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wcqs: fix swapped codfw / eqiad ip defaults

https://gerrit.wikimedia.org/r/723315

Change 723315 merged by Ryan Kemper:

[operations/puppet@production] wcqs: fix swapped codfw / eqiad ip defaults

https://gerrit.wikimedia.org/r/723315

Mentioned in SAL (#wikimedia-operations) [2021-09-23T21:59:08Z] <ryankemper> T280001 Swapped the netbox IPAM addresses back, after erroneously swapping them earlier. sre.dns.netbox cookbook run complete as well

Mentioned in SAL (#wikimedia-operations) [2021-09-23T21:59:49Z] <ryankemper> T280001 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/723315, ran puppet agent on wcqs* to fix local lo:LVS destination IPs

Mentioned in SAL (#wikimedia-operations) [2021-09-23T22:00:52Z] <ryankemper> T280001 Running puppet on all lvs hosts: ryankemper@cumin1001:~$ sudo cumin 'O:lvs::balancer' 'sudo run-puppet-agent'...

Mentioned in SAL (#wikimedia-operations) [2021-09-23T22:03:18Z] <ryankemper> T280001 Ran puppet on all lvs hosts: ryankemper@cumin1001:~$ sudo cumin 'O:lvs::balancer' 'sudo run-puppet-agent'

Mentioned in SAL (#wikimedia-operations) [2021-09-23T22:03:51Z] <ryankemper> T280001 Restarting pybal on low-traffic backups lvs2010 and lvs1016...

Mentioned in SAL (#wikimedia-operations) [2021-09-23T22:04:09Z] <ryankemper> T280001 Restarted pybal on low-traffic backups: ryankemper@cumin1001:~$ sudo cumin 'P{lvs2010*,lvs1016*}' 'sudo systemctl restart pybal'

Mentioned in SAL (#wikimedia-operations) [2021-09-23T22:05:32Z] <ryankemper> T280001 [Sanity check] TCP 10.2.2.67:443 wrr shows up on ryankemper@lvs1016:~$ sudo ipvsadm -L -n and TCP 10.2.1.67:443 wrr shows up on ryankemper@lvs2010:~$ sudo ipvsadm -L -n as expected

Mentioned in SAL (#wikimedia-operations) [2021-09-23T22:05:50Z] <ryankemper> T280001 [Cleanup required] TCP 10.2.1.67:443 wrr shows up on ryankemper@lvs1016:~$ sudo ipvsadm -L -n and TCP 10.2.2.67:443 wrr shows up on ryankemper@lvs2010:~$ sudo ipvsadm -L -n (erroneous)
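The expected-versus-stale comparison in the two entries above can be sketched as a small parser. It assumes that each virtual service in `ipvsadm -L -n` output is a line whose first field is `TCP` and whose second field is `address:port`, as in the quoted `TCP 10.2.2.67:443 wrr`. The sample listing is hypothetical (not captured from a real host), and the function names are made up for illustration.

```python
def virtual_services(ipvsadm_output: str) -> set:
    """Collect the address:port of every TCP virtual service line."""
    services = set()
    for line in ipvsadm_output.splitlines():
        fields = line.split()
        # Backend lines start with "->" and are skipped
        if len(fields) >= 2 and fields[0] == "TCP":
            services.add(fields[1])
    return services


def check_vip(ipvsadm_output: str, expected: str, stale: str) -> tuple:
    """Return (expected_present, stale_present) for one LVS host."""
    found = virtual_services(ipvsadm_output)
    return expected in found, stale in found


# Hypothetical listing for an eqiad host that still has the stale entry
sample = """\
TCP  10.2.1.67:443 wrr
TCP  10.2.2.67:443 wrr
"""

print(check_vip(sample, expected="10.2.2.67:443", stale="10.2.1.67:443"))
# (True, True): the correct VIP is present, and the stale one needs cleanup
```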

Mentioned in SAL (#wikimedia-operations) [2021-09-23T22:06:33Z] <ryankemper> T280001 Waited 120s and checked https://icinga.wikimedia.org/alerts, proceeding to primary low-traffic hosts lvs2009 and lvs1015

Mentioned in SAL (#wikimedia-operations) [2021-09-23T22:06:54Z] <ryankemper> T280001 Restarted pybal on low-traffic primaries: ryankemper@cumin1001:~$ sudo cumin 'P{lvs2009*,lvs1015*}' 'sudo systemctl restart pybal'

Mentioned in SAL (#wikimedia-operations) [2021-09-23T22:13:10Z] <ryankemper> T280001 [eqiad] root@lvs1016:/home/ryankemper# ipvsadm -Dt 10.2.1.67:443 and root@lvs1015:/home/ryankemper# ipvsadm -Dt 10.2.1.67:443

Mentioned in SAL (#wikimedia-operations) [2021-09-23T22:13:17Z] <ryankemper> T280001 [codfw] root@lvs2010:/home/ryankemper# ipvsadm -Dt 10.2.2.67:443 and root@lvs2009:/home/ryankemper# ipvsadm -Dt 10.2.2.67:443

Mentioned in SAL (#wikimedia-operations) [2021-09-23T22:18:21Z] <ryankemper> T280001 ryankemper@puppetmaster1001:/srv$ sudo confctl select 'name=wcqs.*' set/pooled=yes:weight=10

Mentioned in SAL (#wikimedia-operations) [2021-09-23T22:27:01Z] <ryankemper> T280001 The pooling of the wcqs* hosts has gotten /srv/config-master/pybal/${DC}/wcqs to render, but we need to clear away the stale error files to get rid of the associated warnings Stale template error files present for '/srv/config-master/pybal/${DC}/wcqs' => sudo rm -fv /var/run/confd-template/.wcqs*

Mentioned in SAL (#wikimedia-operations) [2021-09-23T22:27:50Z] <ryankemper> T280001 ryankemper@cumin1001:~$ sudo cumin 'P{puppetmaster*}' 'sudo rm -fv /var/run/confd-template/.wcqs*' complete, forcing recheck

Change 723314 had a related patch set uploaded (by Ryan Kemper; author: Ebernhardson):

[operations/puppet@production] query_service: Add monitoring::groups for wcqs

https://gerrit.wikimedia.org/r/723314

Change 723314 merged by Ryan Kemper:

[operations/puppet@production] query_service: Add monitoring::groups for wcqs

https://gerrit.wikimedia.org/r/723314

Change 721600 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] Add dsh targets for the new wcqs cluster

https://gerrit.wikimedia.org/r/721600

Change 721600 merged by Ryan Kemper:

[operations/puppet@production] Add dsh targets for the new wcqs cluster

https://gerrit.wikimedia.org/r/721600

Mentioned in SAL (#wikimedia-operations) [2021-09-28T18:54:19Z] <ryankemper> T280001 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/721600 (add wcqs scap dsh groups), running puppet on scap::dsh hosts: ryankemper@cumin1001:~$ sudo cumin 'P:scap::dsh' 'sudo run-puppet-agent'

Change 724533 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wcqs: state change: lvs_setup -> monitoring_setup

https://gerrit.wikimedia.org/r/724533

Change 724533 merged by Ryan Kemper:

[operations/puppet@production] wcqs: state change: lvs_setup -> monitoring_setup

https://gerrit.wikimedia.org/r/724533

Mentioned in SAL (#wikimedia-operations) [2021-09-28T23:45:15Z] <ryankemper> T280001 Changing wcqs state from lvs_setup to monitoring_setup: ryankemper@cumin1001:~$ sudo cumin 'A:icinga' 'run-puppet-agent'

Mentioned in SAL (#wikimedia-operations) [2021-09-28T23:49:52Z] <ryankemper> T280001 New icinga alerts showing up as expected following wcqs state change to monitoring_setup: LVS wcqs codfw port 443/tcp - Wikimedia Commons Query Service IPv4 and LVS wcqs eqiad port 443/tcp - Wikimedia Commons Query Service IPv4

Mentioned in SAL (#wikimedia-operations) [2021-09-28T23:53:05Z] <ryankemper> T280001 New icinga checks are green, will proceed to next step of moving wcqs state from monitoring_setup -> production_setup

Mentioned in SAL (#wikimedia-operations) [2021-09-28T23:53:42Z] <ryankemper> T280001 New icinga checks are green, will proceed to next step of moving wcqs state from monitoring_setup -> production

Change 724536 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wcqs: state change: monitoring_setup -> production

https://gerrit.wikimedia.org/r/724536

Change 724536 merged by Ryan Kemper:

[operations/puppet@production] wcqs: state change: monitoring_setup -> production

https://gerrit.wikimedia.org/r/724536

Mentioned in SAL (#wikimedia-operations) [2021-09-29T00:15:05Z] <ryankemper> T280001 ryankemper@cumin1001:~$ sudo cumin 'A:icinga or A:dns-auth' run-puppet-agent per https://wikitech.wikimedia.org/wiki/LVS#Make_the_service_page,_add_discovery_resources

Mentioned in SAL (#wikimedia-operations) [2021-09-29T00:21:18Z] <ryankemper> T280001 ryankemper@authdns1001:~$ sudo -i authdns-update following merge of https://gerrit.wikimedia.org/r/c/operations/dns/+/724538

Change 724545 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wcqs: add disc desired state

https://gerrit.wikimedia.org/r/724545

Change 724545 merged by Ryan Kemper:

[operations/puppet@production] wcqs: add disc desired state

https://gerrit.wikimedia.org/r/724545

Change 742841 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wcqs: move back to lvs_setup

https://gerrit.wikimedia.org/r/742841

Mentioned in SAL (#wikimedia-operations) [2021-12-02T01:00:40Z] <ryankemper> T280001 About to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/742841 to bring wcqs into state lvs_setup, after which I'll perform a rolling restart of pybal

Change 742841 merged by Ryan Kemper:

[operations/puppet@production] wcqs: move back to lvs_setup

https://gerrit.wikimedia.org/r/742841

Mentioned in SAL (#wikimedia-operations) [2021-12-02T01:02:53Z] <ryankemper> T280001 ryankemper@cumin1001:~$ sudo cumin 'O:lvs::balancer' 'sudo run-puppet-agent'

Mentioned in SAL (#wikimedia-operations) [2021-12-02T01:07:45Z] <ryankemper> T280001 Restarting pybal on low-traffic backups: ryankemper@cumin1001:~$ sudo cumin 'P{lvs2010*,lvs1016*}' 'sudo systemctl restart pybal'

Mentioned in SAL (#wikimedia-operations) [2021-12-02T01:08:32Z] <ryankemper> T280001 Sanity check of sudo ipvsadm -L -n on backup lvs2010 and lvs1016 looks good (for ex lvs1016 has TCP 10.2.2.67:443 wrr)

Mentioned in SAL (#wikimedia-operations) [2021-12-02T01:11:30Z] <ryankemper> T280001 Waited 120s and checked https://icinga.wikimedia.org/alerts, proceeding to primary low-traffic hosts lvs2009 and lvs1015

Mentioned in SAL (#wikimedia-operations) [2021-12-02T01:12:07Z] <ryankemper> T280001 Restarting pybal on low-traffic primaries lvs2009 and lvs1015: ryankemper@cumin1001:~$ sudo cumin 'P{lvs2009*,lvs1015*}' 'sudo systemctl restart pybal'

Mentioned in SAL (#wikimedia-operations) [2021-12-02T01:12:17Z] <ryankemper> T280001 Restarting pybal on low-traffic primaries lvs2009 and lvs1015: ryankemper@cumin1001:~$ sudo cumin 'P{lvs2009*,lvs1015*}' 'sudo systemctl restart pybal'

Mentioned in SAL (#wikimedia-operations) [2021-12-02T01:16:45Z] <ryankemper> T280001 Pooled wcqs200[1-3] (had been left unpooled from when we last removed wcqs from production)

Mentioned in SAL (#wikimedia-operations) [2021-12-02T01:21:06Z] <ryankemper> T280001 Rolling restart of low-traffic pybal hosts complete. All of wcqs is pooled and the pybal / ipvs related alerts have cleared

Change 755806 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/dns@master] wcqs: add discovery record

https://gerrit.wikimedia.org/r/755806

Change 755810 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wcqs: Move back from lvs_setup to monitoring_setup

https://gerrit.wikimedia.org/r/755810

Change 755810 merged by Bking:

[operations/puppet@production] wcqs: Move back from lvs_setup to monitoring_setup

https://gerrit.wikimedia.org/r/755810

Change 755806 merged by Bking:

[operations/dns@master] wcqs: add discovery record

https://gerrit.wikimedia.org/r/755806

Change 756713 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wcqs: move service into production status

https://gerrit.wikimedia.org/r/756713

Change 756713 merged by Bking:

[operations/puppet@production] wcqs: move service into production status

https://gerrit.wikimedia.org/r/756713

Mentioned in SAL (#wikimedia-operations) [2022-01-24T22:48:19Z] <ryankemper> T280001 Moved wcqs service state into production by merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/756713; running puppet on authdns/alert hosts

Mentioned in SAL (#wikimedia-operations) [2022-01-24T22:54:13Z] <ryankemper> T280001 Removed downtime on wcqs*