We need to set up DNS records to enable ingress traffic to the spark-history server pods via the dse k8s cluster ingress gateway.
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | brouberol | T330176 [Data Platform] Deploy Spark History Service |
| Resolved | | brouberol | T352639 Configure ingress to the spark history servers |
Event Timeline
Change 979891 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/dns@master] Define a DNS A record for the dse k8s ingress gateway
Change 979892 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/dns@master] Enable ingress for the spark-history server services via the dse ingress gw
We have configured the following two IP addresses in netbox for the ingress gateway service on dse-k8s:
- 10.2.2.91/32 k8s-ingress-dse.svc.eqiad.wmnet (active)
- 10.2.1.91/32 k8s-ingress-dse.svc.codfw.wmnet (reserved)
We will now run the sre.dns.netbox cookbook to generate the records and then follow up with a change to the DNS zone as per these instructions and T270071
Change 979910 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/puppet@production] Add an entry related to the dse k8s cluster ingress gateway to conftool
Change 979911 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/puppet@production] Add the k8s-ingress-dse LVS service to the service list
I think that you will also need to add k8s-ingress-dse: {} to the profile::lvs::realserver::pools hash in hieradata/role/common/dse_k8s/worker.yaml
That is for the step mentioned here.
Add this IP to the loopback interface on all the servers where the service is present
Change 979910 merged by Brouberol:
[operations/puppet@production] Add an entry related to the dse k8s cluster ingress gateway to conftool
Change 979891 merged by Brouberol:
[operations/dns@master] Define a DNS A record for the dse k8s ingress gateway
Mentioned in SAL (#wikimedia-operations) [2023-12-05T09:37:31Z] <brouberol> running authdns-update on dns1004.wikimedia.org - T352639
DNS records and reverse DNS are in place:
brouberol@dns1004:~$ for i in 0 1 2; do ns=ns${i}.wikimedia.org; echo $ns; dig +short @${ns} k8s-ingress-dse.svc.eqiad.wmnet; done
ns0.wikimedia.org
10.2.2.91
ns1.wikimedia.org
10.2.2.91
ns2.wikimedia.org
10.2.2.91
brouberol@dns1004:~$ for i in 0 1 2; do ns=ns${i}.wikimedia.org; echo $ns; dig +short @${ns} -x 10.2.2.91; done
ns0.wikimedia.org
k8s-ingress-dse.svc.eqiad.wmnet.
ns1.wikimedia.org
k8s-ingress-dse.svc.eqiad.wmnet.
ns2.wikimedia.org
k8s-ingress-dse.svc.eqiad.wmnet.
Change 979911 merged by Brouberol:
[operations/puppet@production] Add the k8s-ingress-dse LVS service to the service list
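The forward lookups shown earlier (one `dig +short` query per authoritative nameserver) can be wrapped in a small helper that fails loudly when any nameserver disagrees. This is only a sketch: the function name `check_a_record` and its output format are hypothetical and not part of any Wikimedia tooling. It assumes `bash` and `dig` are available on the host.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any existing tooling): query each listed
# nameserver for NAME and compare the answer with the EXPECTED address.
# Usage: check_a_record NAME EXPECTED NAMESERVER...
check_a_record() {
    local name="$1" expected="$2"
    shift 2
    local ns answer rc=0
    for ns in "$@"; do
        # Same query as in the task: dig +short @<nameserver> <name>
        answer="$(dig +short "@${ns}" "${name}")"
        if [ "${answer}" = "${expected}" ]; then
            echo "OK ${ns} ${name} -> ${answer}"
        else
            echo "MISMATCH ${ns} ${name} -> '${answer}' (expected ${expected})"
            rc=1
        fi
    done
    # Non-zero exit status if at least one nameserver returned something else.
    return "${rc}"
}

# Example invocation, mirroring the check performed on dns1004:
# check_a_record k8s-ingress-dse.svc.eqiad.wmnet 10.2.2.91 \
#     ns0.wikimedia.org ns1.wikimedia.org ns2.wikimedia.org
```

Returning a non-zero status on any mismatch makes the check usable in a script or CI step, rather than relying on reading the output by eye.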
Change 980347 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/puppet@production] Include the profile::lvs::realserver profile on the dse-k8s-roles
Change 980353 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/puppet@production] Remove the inference realserver pool from the dse cluster
Change 980353 merged by Brouberol:
[operations/puppet@production] Remove the inference realserver pool from the dse cluster
Change 980347 merged by Brouberol:
[operations/puppet@production] Include the profile::lvs::realserver profile on the dse-k8s-roles
Change 980368 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/puppet@production] Switch the k8s-ingress-dse LVS service in lvs_setup state
Change 980404 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/dns@master] Add discovery records for the k8s-ingress-dse LVS service
Change 980368 merged by Brouberol:
[operations/puppet@production] Switch the k8s-ingress-dse LVS service in lvs_setup state
Mentioned in SAL (#wikimedia-operations) [2023-12-05T14:54:45Z] <brouberol> adding k8s-ingress-dse backend to LVS - T352639
Mentioned in SAL (#wikimedia-operations) [2023-12-05T15:01:21Z] <cgoubert@cumin1001> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[1018,1020].eqiad.wmnet} and A:lvs (T352639)
Mentioned in SAL (#wikimedia-operations) [2023-12-05T15:11:57Z] <cgoubert@cumin1001> END (FAIL) - Cookbook sre.loadbalancer.restart-pybal (exit_code=1) rolling-restart of pybal on P{lvs[1018,1020].eqiad.wmnet} and A:lvs (T352639)
Mentioned in SAL (#wikimedia-operations) [2023-12-05T15:16:05Z] <claime> Manually restarting pybal on lvs1020 - T352639
Mentioned in SAL (#wikimedia-operations) [2023-12-05T15:22:14Z] <claime> Manually restarting pybal on lvs1019 - T352639
Change 980417 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/puppet@production] Fix cluster conftool selector for the k8s-ingress-dse LVS service
Change 980417 merged by Brouberol:
[operations/puppet@production] Fix cluster conftool selector for the k8s-ingress-dse LVS service
Mentioned in SAL (#wikimedia-operations) [2023-12-05T15:42:31Z] <claime> Manually restarting pybal on lvs1020 - T352639
Thanks to @Clement_Goubert, the k8s-ingress-dse LVS service is now deployed. However, all backends appear down.
We're currently setting them as pooled: inactive to quiet the alert. We first need to make sure that port 30443 is open on each host before we re-enable this pool.
Mentioned in SAL (#wikimedia-operations) [2023-12-05T15:49:44Z] <claime> sudo confctl select "service=kubesvc,cluster=dse-k8s" set/pooled=inactive - T352639
Change 980428 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/puppet@production] Rollback state of LVS k8s-ingress-dse to service_setup
Change 980428 merged by Brouberol:
[operations/puppet@production] Rollback state of LVS k8s-ingress-dse to service_setup
Mentioned in SAL (#wikimedia-operations) [2023-12-05T16:18:24Z] <claime> Rolling back k8s-ingress-dse - restarting pybal on lvs1020 - T352639
Mentioned in SAL (#wikimedia-operations) [2023-12-05T16:24:02Z] <claime> Rolling back k8s-ingress-dse - restarting pybal on lvs1019 - T352639
We ended up rolling back because alerts persisted even with the backends pooled as inactive.
The service was put back in service_setup status, puppet ran on the lvs servers, and pybal was restarted.
Icinga is green. I'm leaving the hosts with @ssingh to check whether additional action needs to be taken.
Change 980404 merged by Brouberol:
[operations/dns@master] Add discovery records for the k8s-ingress-dse LVS service
The next attempt to enable the LVS service for the dse k8s ingress gateway should work, as ports are now open:
brouberol@lvs1019:~$ for i in $(seq 1 8); do echo dse-k8s-worker100${i}.eqiad.wmnet && nc -z -v -w5 $(dig +short dse-k8s-worker100${i}.eqiad.wmnet) 30443; done
dse-k8s-worker1001.eqiad.wmnet
Connection to 10.64.0.38 30443 port [tcp/*] succeeded!
dse-k8s-worker1002.eqiad.wmnet
Connection to 10.64.16.47 30443 port [tcp/*] succeeded!
dse-k8s-worker1003.eqiad.wmnet
Connection to 10.64.32.178 30443 port [tcp/*] succeeded!
dse-k8s-worker1004.eqiad.wmnet
Connection to 10.64.48.52 30443 port [tcp/*] succeeded!
dse-k8s-worker1005.eqiad.wmnet
Connection to 10.64.130.6 30443 port [tcp/*] succeeded!
dse-k8s-worker1006.eqiad.wmnet
Connection to 10.64.132.8 30443 port [tcp/*] succeeded!
dse-k8s-worker1007.eqiad.wmnet
Connection to 10.64.134.5 30443 port [tcp/*] succeeded!
dse-k8s-worker1008.eqiad.wmnet
Connection to 10.64.136.7 30443 port [tcp/*] succeeded!
FYI, we have deployed a dummy echoserver service behind egress in dse-k8s-eqiad, to make sure that the spark-history server isn't responsible for the up/down status of the LVS service for the dse-eqiad-k8s ingress. See T353004 for details.
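The port check run from lvs1019 above can be expressed as a reusable function that reports every host and exits non-zero if any port is closed. This is a minimal sketch: the name `check_port` and its output format are hypothetical, and it assumes an `nc` that supports the same `-z` and `-w` flags used in the original one-liner.

```shell
#!/usr/bin/env bash
# Hypothetical helper: test whether PORT accepts TCP connections on each host.
# Usage: check_port PORT HOST...
check_port() {
    local port="$1"
    shift
    local host failed=0
    for host in "$@"; do
        # -z: only probe the port, send no data; -w5: give up after 5 seconds.
        if nc -z -w5 "${host}" "${port}" >/dev/null 2>&1; then
            echo "open ${host}:${port}"
        else
            echo "closed ${host}:${port}"
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every host accepted the connection.
    [ "${failed}" -eq 0 ]
}

# Example invocation for the eight workers checked in the task:
# check_port 30443 $(for i in $(seq 1 8); do echo "dse-k8s-worker100${i}.eqiad.wmnet"; done)
```

Unlike the original loop, this version keeps going after a failure, so a single run lists every worker whose port is still closed.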
Change 981944 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/puppet@production] Switch the k8s-ingress-dse LVS service in lvs_setup state (#2)
Mentioned in SAL (#wikimedia-operations) [2023-12-11T10:37:23Z] <claime> Repooling dse-k8s-worker nodes - sudo confctl select "service=kubesvc,cluster=dse-k8s" set/pooled=yes - T352639
Change 981944 merged by Brouberol:
[operations/puppet@production] Switch the k8s-ingress-dse LVS service in lvs_setup state (#2)
Mentioned in SAL (#wikimedia-operations) [2023-12-11T10:45:52Z] <claime> Disabling puppet on O:lvs::balancer - T352639
Mentioned in SAL (#wikimedia-operations) [2023-12-11T10:46:16Z] <claime> Running puppet on O:lvs::balancer - T352639
Mentioned in SAL (#wikimedia-operations) [2023-12-11T10:50:16Z] <cgoubert@cumin1001> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[1019-1020].eqiad.wmnet} and A:lvs (T352639)
Mentioned in SAL (#wikimedia-operations) [2023-12-11T10:54:48Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs[1019-1020].eqiad.wmnet} and A:lvs (T352639)
Change 981733 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/dns@master] Revert "Revert "Add discovery records for the k8s-ingress-dse LVS service""
Change 982045 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/puppet@production] Switch state of k8s-ingress-dse LVS service to production
Change 982045 merged by Brouberol:
[operations/puppet@production] Switch state of k8s-ingress-dse LVS service to production
Change 981733 merged by Brouberol:
[operations/dns@master] Revert "Revert "Add discovery records for the k8s-ingress-dse LVS service""
Mentioned in SAL (#wikimedia-operations) [2023-12-11T11:12:51Z] <brouberol> Add discovery records for the k8s-ingress-dse LVS service - T352639
Mentioned in SAL (#wikimedia-operations) [2023-12-11T11:18:38Z] <claime> sudo confctl --object-type discovery select 'name=eqiad,dnsdisc=k8s-ingress-dse' set/pooled=true - T352639
brouberol@cumin1001:~$ host k8s-ingress-dse.svc.eqiad.wmnet
k8s-ingress-dse.svc.eqiad.wmnet has address 10.2.2.91
brouberol@cumin1001:~$ host k8s-ingress-dse.discovery.wmnet
k8s-ingress-dse.discovery.wmnet has address 10.2.2.91
Change 979892 merged by Brouberol:
[operations/dns@master] Enable ingress for the spark-history server services via the dse ingress gw
Mentioned in SAL (#wikimedia-operations) [2023-12-11T12:11:25Z] <brouberol> Adding spark-history(-test).svc.eqiad.wmnet CNAMEs pointing to k8s-ingress-dse.svc.eqiad.wmnet. - T352639
brouberol@dns1004:~$ host spark-history.svc.eqiad.wmnet
spark-history.svc.eqiad.wmnet is an alias for k8s-ingress-dse.svc.eqiad.wmnet.
k8s-ingress-dse.svc.eqiad.wmnet has address 10.2.2.91
brouberol@dns1004:~$ host spark-history-test.svc.eqiad.wmnet
spark-history-test.svc.eqiad.wmnet is an alias for k8s-ingress-dse.svc.eqiad.wmnet.
k8s-ingress-dse.svc.eqiad.wmnet has address 10.2.2.91
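The alias check above can also be scripted so that each service name is confirmed to be a CNAME for the ingress gateway record. This is a sketch under assumptions: the function name `check_cname` is hypothetical, and it uses `dig +short <name> CNAME` in place of the `host` command shown in the task.

```shell
#!/usr/bin/env bash
# Hypothetical helper: confirm that each alias is a CNAME pointing at TARGET.
# Usage: check_cname TARGET ALIAS...
check_cname() {
    local target="$1"
    shift
    local alias answer rc=0
    for alias in "$@"; do
        # dig prints the CNAME target as a fully qualified name with a trailing dot.
        answer="$(dig +short "${alias}" CNAME)"
        answer="${answer%.}"
        if [ "${answer}" = "${target}" ]; then
            echo "OK ${alias} -> ${answer}"
        else
            echo "MISMATCH ${alias} -> '${answer}' (expected ${target})"
            rc=1
        fi
    done
    return "${rc}"
}

# Example invocation for the two CNAMEs added in this task:
# check_cname k8s-ingress-dse.svc.eqiad.wmnet \
#     spark-history.svc.eqiad.wmnet spark-history-test.svc.eqiad.wmnet
```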
Change 982106 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] Fix: make sure to generate a TLS certificate for the namespace
Change 982106 merged by Brouberol:
[operations/deployment-charts@master] admin_ng: fix gateway TLS setting for dse-k8s-eqiad
Mentioned in SAL (#wikimedia-operations) [2023-12-11T15:25:55Z] <brouberol> provisioning TLS certificates for the spark-history and spark-history-test namespaces in dse-k8s-eqiad - T352639
