Update Kubernetes cluster staging-eqiad to kubernetes 1.16
Closed, Resolved, Public

Assigned To

Authored By

	JMeybohm
	Mar 3 2021, 9:47 AM

Description

Update the staging Kubernetes cluster in eqiad to kubernetes 1.16

Create new masters: https://phabricator.wikimedia.org/T276204

Switch to using kubestagemaster in codfw by changing the neon reference in modules/profile/manifests/kubernetes/deployment_server.pp to kubestagemaster.svc.codfw.wmnet: https://gerrit.wikimedia.org/r/q/topic:%22switch_k8s_staging%22

Next steps are described in https://etherpad.wikimedia.org/p/k8s-reinitialize

Clean up etcd in eqiad
Decommission the old master
Restart etcd servers
Image the new master
Reimage the nodes
Start master, controller, scheduler
helmfile sync admin_ng
deploy all services

Details

Subject	Repo	Branch	Lines +/-
kubernetes staging-eqiad: Enable notifications	operations/puppet	production	+0 -1
Add eventstreams clusters to monitor_services.pp	operations/puppet	production	+12 -0
Bump eventstreams and -internal to 2021-03-11-200606-production	operations/deployment-charts	master	+2 -2
Set x-monitor: false for all /v2/stream routes	mediawiki/services/eventstreams	master	+3 -11
Move k8s controllermanager_token to common staging	labs/private	master	+0 -0
sre.hosts.decommission: temporary fix for Netbox	operations/cookbooks	master	+2 -0
Switch active kubernetes staging cluster to eqiad	operations/puppet	production	+4 -3
Switch staging.svc.eqiad.wmnet back to eqiad	operations/dns	master	+2 -4
Remove old kubernetes staging master neon.eqiad.wmnet	operations/puppet	production	+1 -12
admin_ng: Enable staging-eqiad	operations/deployment-charts	master	+5 -5
kubernetes staging: Move k8s, docker and calico version to common	operations/puppet	production	+7 -12
Switch from neon.eqiad.wmnet to kubestagemaster.svc.eqiad.wmnet	operations/puppet	production	+4 -6
Change kubestagemaster.svc.equiad.wmnet to point to new master	operations/dns	master	+2 -2
files/ssl: Add certificate for kubestagemaster.svc.eqiad.wmnet.crt	operations/puppet	production	+25 -0
staging-eqiad: Apply role/hiera to new master	operations/puppet	production	+12 -30
ci/releases: Switch to using the codfw staging cluster	operations/puppet	production	+2 -2
deployment_server: Switch k8s staging to codfw	operations/puppet	production	+1 -1
Switch staging.svc.eqiad.wmnet to point to codfw k8s	operations/dns	master	+4 -3
Set role on new kubernetes staging master in eqiad	operations/puppet	production	+1 -1

Related Objects

Status	Assigned	Task
Resolved	akosiaris	T244335 Upgrade kubernetes clusters to v1.16
Resolved	JMeybohm	T276305 Update Kubernetes cluster staging-eqiad to kubernetes 1.16
Resolved	akosiaris	T276204 EQIAD and CODFW : 5of VMs requested for kubernetes master

Event Timeline

JMeybohm created this task. Mar 3 2021, 9:47 AM

Restricted Application added a subscriber: Aklapper. Mar 3 2021, 9:47 AM

JMeybohm triaged this task as Medium priority. Mar 3 2021, 9:47 AM

Change 667996 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] deployment_server: Switch k8s staging to codfw

https://gerrit.wikimedia.org/r/667996

Change 668011 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] Remove old kubernetes saging master neon.eqiad.wmnet

https://gerrit.wikimedia.org/r/668011

JMeybohm added a subtask: T276204: EQIAD and CODFW : 5of VMs requested for kubernetes master. Mar 3 2021, 9:49 AM

Change 668012 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] Set role on new kubernetes staging master in eqiad

https://gerrit.wikimedia.org/r/668012

Change 668015 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] Switch from neon.eqiad.wmnet to kubestagemaster.svc.eqiad.wmnet

https://gerrit.wikimedia.org/r/668015

Change 667982 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/dns@master] Switch staging.svc.ediad.wmnet to point to codfw k8s

https://gerrit.wikimedia.org/r/667982

Change 667983 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/dns@master] Change kubestagemaster.svc.equiad.wmnet to point to new master

https://gerrit.wikimedia.org/r/667983

Change 668012 abandoned by JMeybohm:
[operations/puppet@production] Set role on new kubernetes staging master in eqiad

Reason:
In favor of I2b3fa247dbe2cc9b6e9453d2e0b3ad3aa8cdadd6

https://gerrit.wikimedia.org/r/668012

Change 667867 had a related patch set uploaded (by JMeybohm; owner: Alexandros Kosiaris):
[operations/puppet@production] staging-eqiad: Apply role/hiera to new master

https://gerrit.wikimedia.org/r/667867

Change 668081 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] kubernetes staging: Move k8s, docker and calico version to common

https://gerrit.wikimedia.org/r/668081

Change 667982 merged by JMeybohm:
[operations/dns@master] Switch staging.svc.eqiad.wmnet to point to codfw k8s

https://gerrit.wikimedia.org/r/667982

Change 667996 merged by JMeybohm:
[operations/puppet@production] deployment_server: Switch k8s staging to codfw

https://gerrit.wikimedia.org/r/667996

Change 668114 had a related patch set uploaded (by JMeybohm; owner: Alexandros Kosiaris):
[operations/puppet@production] ci/releases: Switch to using the codfw staging cluster

https://gerrit.wikimedia.org/r/668114

Change 668114 merged by Alexandros Kosiaris:
[operations/puppet@production] ci/releases: Switch to using the codfw staging cluster

https://gerrit.wikimedia.org/r/668114

Change 667867 merged by JMeybohm:
[operations/puppet@production] staging-eqiad: Apply role/hiera to new master

https://gerrit.wikimedia.org/r/667867

Change 668122 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] files/ssl: Add cetificate for kubestagemaster.svc.eqiad.wmnet.crt

https://gerrit.wikimedia.org/r/668122

Change 668122 merged by JMeybohm:
[operations/puppet@production] files/ssl: Add certificate for kubestagemaster.svc.eqiad.wmnet.crt

https://gerrit.wikimedia.org/r/668122

Change 667983 merged by JMeybohm:
[operations/dns@master] Change kubestagemaster.svc.equiad.wmnet to point to new master

https://gerrit.wikimedia.org/r/667983

Change 668015 merged by JMeybohm:
[operations/puppet@production] Switch from neon.eqiad.wmnet to kubestagemaster.svc.eqiad.wmnet

https://gerrit.wikimedia.org/r/668015

Change 668081 merged by JMeybohm:
[operations/puppet@production] kubernetes staging: Move k8s, docker and calico version to common

https://gerrit.wikimedia.org/r/668081

Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts:

['kubestage1001.eqiad.wmnet', 'kubestage1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103031708_jayme_31144.log.

Change 668143 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] admin_ng: Enable staging-eqiad

https://gerrit.wikimedia.org/r/668143

Change 668143 merged by jenkins-bot:
[operations/deployment-charts@master] admin_ng: Enable staging-eqiad

https://gerrit.wikimedia.org/r/668143

Completed auto-reimage of hosts:

['kubestage1002.eqiad.wmnet', 'kubestage1001.eqiad.wmnet']

and were ALL successful.

The actual steps were:

disable puppet on master and nodes
stop apiserver, controller manager, scheduler on master
Clean up etcd in eqiad
Restart etcd servers
Image the new master
Reimage the nodes
Start master, controller, scheduler
helmfile sync admin_ng
deploy all services

todo is:

Decommission the old master
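
For anyone repeating this on another cluster, the order of the steps matters more than the tooling. A minimal sketch of the list above as an ordered checklist that stops at the first failing step follows. The step names are copied from the comment; the run_steps helper and the dry-run action are hypothetical and do not call any real tooling.

```python
from typing import Callable, List

# Step names copied from the comment above, in the order they were performed.
STEPS: List[str] = [
    "disable puppet on master and nodes",
    "stop apiserver, controller manager, scheduler on master",
    "Clean up etcd in eqiad",
    "Restart etcd servers",
    "Image the new master",
    "Reimage the nodes",
    "Start master, controller, scheduler",
    "helmfile sync admin_ng",
    "deploy all services",
]


def run_steps(steps: List[str], action: Callable[[str], bool]) -> List[str]:
    """Call action(step) in order and stop at the first step that returns False.

    Returns the list of steps that completed successfully.
    """
    done: List[str] = []
    for step in steps:
        if not action(step):
            break
        done.append(step)
    return done


def dry_run(step: str) -> bool:
    """Hypothetical action: only print the step and report success."""
    print(f"[dry-run] {step}")
    return True


if __name__ == "__main__":
    completed = run_steps(STEPS, dry_run)
    print(f"{len(completed)}/{len(STEPS)} steps completed")
```

Stopping at the first failure keeps later steps (for example reimaging the nodes) from running against a half-prepared etcd or master.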

Change 668011 merged by JMeybohm:
[operations/puppet@production] Remove old kubernetes staging master neon.eqiad.wmnet

https://gerrit.wikimedia.org/r/668011

cookbooks.sre.hosts.decommission executed by jayme@cumin1001 for hosts: neon.eqiad.wmnet

neon.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox

COMMON_STEPS (FAIL)
- Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
Generating the DNS records from Netbox data. It will take a couple of minutes.
----- OUTPUT of 'cd /tmp && runus...e asset tag one"' -----
2021-03-04 14:19:38,906 [INFO] Gathering devices, interfaces, addresses and prefixes from Netbox
2021-03-04 14:23:16,886 [ERROR] Failed to run
Traceback (most recent call last):
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 686, in main
    batch_status, ret_code = run_commit(args, config, tmpdir)
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 590, in run_commit
    netbox.collect()
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 156, in collect
    self._collect_device(device, True)
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 197, in _collect_device
    if self.addresses[primary.id].dns_name:
KeyError: 5979

Ran sre.dns.netbox again and it completed just fine. Mission accomplished then.
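
The traceback boils down to a plain dictionary lookup, self.addresses[primary.id], failing for id 5979. The task does not state the root cause; one plausible reading (an assumption) is that devices and addresses were gathered at slightly different moments while the VM was being removed, which would fit the rerun succeeding. The sketch below only illustrates the failure mode and a tolerant lookup. The Address and PrimaryIP classes and both functions are hypothetical, and this is not the code of generate_dns_snippets.py or of the later cookbook fix.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Address:
    """Hypothetical stand-in for an address record with a DNS name."""
    dns_name: str


@dataclass
class PrimaryIP:
    """Hypothetical stand-in for a device's primary IP reference."""
    id: int


def dns_name_strict(addresses: Dict[int, Address], primary: PrimaryIP) -> Optional[str]:
    """Same shape as the failing line: raises KeyError if the id is unknown."""
    return addresses[primary.id].dns_name or None


def dns_name_tolerant(addresses: Dict[int, Address], primary: PrimaryIP) -> Optional[str]:
    """Return None instead of raising when the address is not in the map."""
    address = addresses.get(primary.id)
    if address is None:
        return None
    return address.dns_name or None


if __name__ == "__main__":
    # "host.example" is a made-up name used only for illustration.
    known = {1: Address(dns_name="host.example")}
    stale = PrimaryIP(id=5979)
    try:
        dns_name_strict(known, stale)
    except KeyError as exc:
        print(f"strict lookup failed with KeyError: {exc}")
    print(f"tolerant lookup returned: {dns_name_tolerant(known, stale)}")
```

Whether skipping such a device silently is acceptable depends on the tool; for DNS generation, failing loudly and rerunning (as was done here) may well be the safer behaviour.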

Missing points are docs on how to switch the staging cluster back and forth and actually switching back from codfw to eqiad.

Change 668505 had a related patch set uploaded (by Volans; owner: Volans):
[operations/cookbooks@master] sre.hosts.decommission: temporary fix for Netbox

https://gerrit.wikimedia.org/r/668505

Change 668646 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] Switch active kubernetes staging cluster to eqiad

https://gerrit.wikimedia.org/r/668646

Change 668647 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/dns@master] Switch staging.svc.eqiad.wmnet back to eqiad

https://gerrit.wikimedia.org/r/668647

Change 668647 merged by JMeybohm:
[operations/dns@master] Switch staging.svc.eqiad.wmnet back to eqiad

https://gerrit.wikimedia.org/r/668647

Change 668646 merged by JMeybohm:
[operations/puppet@production] Switch active kubernetes staging cluster to eqiad

https://gerrit.wikimedia.org/r/668646

Switched back to staging-eqiad with no surprises.
Docs at https://wikitech.wikimedia.org/wiki/Kubernetes#Switch_the_active_staging_cluster_%28eqiad%3C-%3Ecodfw%29

List of services in staging and happiness by service checker

Through some automation (see P14643) and some manual followup I got the following table:

service	service-checker-status	notes
apertium	-	OK, apertium doesn't have swagger/openapi specs
api-gateway	-	no spec, service-checker badly fails
blubberoid	✓	-
changeprop	✓	No TLS
changeprop-jobqueue	✓	No TLS
citoid	✓	-
cxserver	✓	-
echostore	✓	spec under /openapi
eventgate-analytics	✓	-
eventgate-analytics-external	✓	-
eventgate-logging-external	✓	-
eventgate-main	✓	-
eventstreams	x	multiple failures
eventstreams-internal	x	multiple failures
linkrecommendation	x	timeout
mathoid	✓	-
mobileapps	✓	-
proton	x	timeout
push-notifications	✓	-
recommendation-api	✓	-
sessionstore	✓	spec under /openapi
similar-users	-	No spec? should be added?
termbox	✓	-
wikifeeds	✓	-
zotero	-	OK, no swagger/openapi specs
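
The per-service checks in the table can be scripted around service-checker-swagger. The sketch below only assembles the command line, using the argument shape and the -t (timeout) and -s (spec URL) options that appear in the commands quoted elsewhere in this task. The build_check_command helper is hypothetical and is not the content of P14643.

```python
import shlex
from typing import List, Optional


def build_check_command(
    host: str,
    port: int,
    timeout: Optional[int] = None,
    spec_path: Optional[str] = None,
) -> List[str]:
    """Assemble a service-checker-swagger invocation as an argument list.

    Only options that appear in this task are used: -t and -s.
    """
    cmd = ["service-checker-swagger", host, f"https://{host}:{port}"]
    if timeout is not None:
        cmd += ["-t", str(timeout)]
    if spec_path is not None:
        cmd += ["-s", spec_path]
    return cmd


if __name__ == "__main__":
    # Host, port, timeout and spec path are taken from the linkrecommendation
    # command quoted in this task.
    cmd = build_check_command(
        "staging.svc.codfw.wmnet", 4005, timeout=2, spec_path="/apispec_1.json"
    )
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run on a host where service-checker-swagger is installed; the exit code then indicates whether the check passed.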

Change 668505 merged by jenkins-bot:
[operations/cookbooks@master] sre.hosts.decommission: temporary fix for Netbox

https://gerrit.wikimedia.org/r/668505

Change 670167 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[labs/private@master] Move k8s controllermanager_token to common staging

https://gerrit.wikimedia.org/r/670167

Change 670167 merged by JMeybohm:
[labs/private@master] Move k8s controllermanager_token to common staging

https://gerrit.wikimedia.org/r/670167

JMeybohm mentioned this in rLPRIb133ecca6f58: Move k8s controllermanager_token to common staging. Mar 9 2021, 2:02 PM

linkrecommendation

Works with custom SPEC_URL: service-checker-swagger staging.svc.codfw.wmnet https://staging.svc.codfw.wmnet:4005 -t 2 -s /apispec_1.json
The cronjob object is missing the spec.jobTemplate.spec.template.metadata.labels added with I087099a6abd9f817014c15f47e80b77374e1dbb8, although the object shown by helm-diff on helmfile sync looks okay and tiller stores the right object in its configmap.
- This leads to the job failing as it does not get egress rules applied.
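
A quick way to spot this kind of problem is to look at the labels on the pod template of the CronJob object actually stored in the cluster, since network policies select pods by label. The sketch below is a minimal check on a manifest that has already been loaded into a dict (for example from kubectl output in JSON form). The helper names and the sample label are hypothetical.

```python
from typing import Any, Dict, Optional, Tuple

# Path named in the comment above.
LABELS_PATH: Tuple[str, ...] = (
    "spec", "jobTemplate", "spec", "template", "metadata", "labels",
)


def get_path(obj: Any, path: Tuple[str, ...]) -> Optional[Any]:
    """Walk nested dicts along path; return None if any key is missing."""
    current = obj
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def pod_template_labels(cronjob: Dict[str, Any]) -> Dict[str, str]:
    """Return the pod template labels of a CronJob manifest, or {} if absent."""
    labels = get_path(cronjob, LABELS_PATH)
    return labels if isinstance(labels, dict) else {}


if __name__ == "__main__":
    # Made-up manifests for illustration only.
    broken = {"spec": {"jobTemplate": {"spec": {"template": {"metadata": {}}}}}}
    fixed = {
        "spec": {"jobTemplate": {"spec": {"template": {
            "metadata": {"labels": {"app": "example"}},
        }}}}
    }
    for name, manifest in (("broken", broken), ("fixed", fixed)):
        labels = pod_template_labels(manifest)
        print(f"{name}: {'labels ' + str(labels) if labels else 'NO pod template labels'}")
```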

proton

works with "-t 15" (as we do in prod)
The instance in staging-eqiad seems kind of stuck, though: https://grafana.wikimedia.org/goto/u4Q9X08Gk

eventstreams and eventstreams-internal

Broken in production as well (in terms of service-checker), so no blocker I would say. Maybe @Ottomata knows more?

In T276305#6900578, @JMeybohm wrote:

linkrecommendation

The cronjob object is missing the spec.jobTemplate.spec.template.metadata.labels added with I087099a6abd9f817014c15f47e80b77374e1dbb8, although the object shown by helm-diff on helmfile sync looks okay and tiller stores the right object in its configmap.

This leads to the job failing as it does not get egress rules applied.

This was due to a spec error in the cronjob yaml (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/670791) that went undetected because of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/670806 - both fixed now.

In T276305#6901095, @JMeybohm wrote:

eventstreams and eventstreams-internal

Broken in production as well (in terms of service-checker), so no blocker I would say. Maybe @Ottomata knows more?

fwiw here is the output:

$ service-checker-swagger -t 15 eventstreams.discovery.wmnet https://eventstreams.discovery.wmnet:4892 
/v2/stream/{streams} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200); /v2/stream/eventgate-main.test.event (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.page-create (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.page-delete (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.page-links-change (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.page-move (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.page-properties-change (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.page-undelete (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.recentchange (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.revision-create (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.revision-score (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.revision-visibility-change (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/page-create (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/page-delete (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/page-links-change (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 
(expecting: 200); /v2/stream/page-move (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/page-properties-change (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/page-undelete (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/recentchange (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/revision-create (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/revision-score (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/test (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200)
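
The output above is a single semicolon-separated line, which makes the pattern hard to see. A small parser, sketched below, pulls out route and status code per failure so they can be tallied. The regular expression is written against the message format shown above only and is an assumption about that format, not something taken from service-checker itself.

```python
import re
from collections import Counter
from typing import List, Tuple

# Matches one "<route> (<test name>) is CRITICAL: ... status N (expecting: M)" entry.
FAILURE_RE = re.compile(
    r"(?P<route>\S+) \((?P<test>[^)]*)\) is CRITICAL: .*?"
    r"returned the unexpected status (?P<got>\d+)\s+\(expecting: (?P<want>\d+)\)"
)


def parse_failures(output: str) -> List[Tuple[str, int, int]]:
    """Return (route, received status, expected status) for each failure."""
    return [
        (m["route"], int(m["got"]), int(m["want"]))
        for m in FAILURE_RE.finditer(output)
    ]


def count_by_status(output: str) -> Counter:
    """Count failures per received HTTP status code."""
    return Counter(got for _route, got, _want in parse_failures(output))


if __name__ == "__main__":
    # Two entries copied from the output above, including its line break.
    sample = (
        "/v2/stream/{streams} (Untitled test) is CRITICAL: Test Untitled test "
        "returned the unexpected status 404 (expecting: 200); "
        "/v2/stream/page-links-change (Untitled test) is CRITICAL: Test Untitled "
        "test returned the unexpected status 400 \n(expecting: 200)"
    )
    for route, got, want in parse_failures(sample):
        print(f"{route}: got {got}, expected {want}")
    print(dict(count_by_status(sample)))
```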

eventstreams returns that even in production currently, so there is no way to tell from service-checker whether an endpoint is misbehaving.

However the service seems to be working quite fine per https://grafana.wikimedia.org/d/znIuUcsWz/eventstreams?orgId=1&refresh=1m

Huh interesting. /v2/stream/{streams} is a parameterized route, why would servicechecker try to hit that one? 404 there makes sense, it is not providing a parameter.

For all the others, I'm not sure why a 400 would happen, but they are unending streams, so server never finishes the http response. EventStreams code actually disables the api spec based CI tests for the /v2/stream/* routes for that reason.

Can servicechecker be configured to do the same? Skip checking all /v2/stream routes?

JMeybohm mentioned this in T277191: Update Kubernetes cluster codfw to kubernetes 1.16. Mar 11 2021, 4:22 PM

In T276305#6905240, @Ottomata wrote:

Can servicechecker be configured to do the same? Skip checking all /v2/stream routes?

Hmm. service-checker seems to call all GET routes with a "default_example" request unless you explicitly set x-monitor: false to skip them: https://github.com/wikimedia/operations-software-service-checker/blob/master/servicechecker/swagger.py#L129-L138
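
To make the selection rule concrete, here is a simplified model of the behaviour described above: every path with a GET operation is checked unless that operation sets x-monitor to false. This is a sketch of the rule as stated in this comment, not the code in servicechecker/swagger.py. The monitored_get_routes function and the sample spec (including the /example route) are hypothetical.

```python
from typing import Any, Dict, List


def monitored_get_routes(spec: Dict[str, Any]) -> List[str]:
    """Return paths whose GET operation is not opted out via x-monitor: false."""
    routes: List[str] = []
    for path, operations in spec.get("paths", {}).items():
        get_op = operations.get("get")
        if get_op is None:
            continue  # no GET operation on this path
        if get_op.get("x-monitor", True) is False:
            continue  # explicitly excluded from monitoring
        routes.append(path)
    return routes


if __name__ == "__main__":
    # Made-up spec fragment as a dict; only the x-monitor key matters here.
    spec = {
        "paths": {
            "/v2/stream/{streams}": {"get": {"x-monitor": False}},
            "/v2/stream/recentchange": {"get": {"x-monitor": False}},
            "/example": {"get": {}},
            "/example-post-only": {"post": {}},
        }
    }
    print(monitored_get_routes(spec))
```

With x-monitor: false set on the /v2/stream routes, only routes that return a finite response remain in the check.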

ah! ok we can add x-monitor: false. I wonder if the service-template-node tests would also respect that. It looks like it does.

In T276305#6905531, @Ottomata wrote:

ah! ok we can add x-monitor: false. I wonder if the service-template-node tests would also respect that. It looks like it does.

Cool! When that's done, deployed and working you could also add both eventstreams incarnations to modules/lvs/manifests/monitor_services.pp in puppet to have them monitored in prod.

Change 670924 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/services/eventstreams@master] Set x-monitor: false for all /v2/stream routes

https://gerrit.wikimedia.org/r/670924

Change 670924 merged by Ottomata:
[mediawiki/services/eventstreams@master] Set x-monitor: false for all /v2/stream routes

https://gerrit.wikimedia.org/r/670924

Change 670936 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Bump eventstreams and -internal to 2021-03-11-200606-production

https://gerrit.wikimedia.org/r/670936

Change 670936 merged by Ottomata:
[operations/deployment-charts@master] Bump eventstreams and -internal to 2021-03-11-200606-production

https://gerrit.wikimedia.org/r/670936

Change 670960 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add eventstreams clusters to monitor_services.pp

https://gerrit.wikimedia.org/r/670960

Change 671175 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] kubernetes staging-eqiad: Enable notifications

https://gerrit.wikimedia.org/r/671175

Change 670960 merged by Ottomata:
[operations/puppet@production] Add eventstreams clusters to monitor_services.pp

https://gerrit.wikimedia.org/r/670960

Change 671175 merged by Alexandros Kosiaris:
[operations/puppet@production] kubernetes staging-eqiad: Enable notifications

https://gerrit.wikimedia.org/r/671175

Volans mentioned this in rCCKB27fa31b7eccb: sre.hosts.decommission: temporary fix for Netbox. Dec 14 2022, 3:27 PM
