
Update Kubernetes cluster staging-eqiad to kubernetes 1.16
Closed, Resolved (Public)

Description

Update the staging Kubernetes cluster in eqiad to kubernetes 1.16

Next steps are described in https://etherpad.wikimedia.org/p/k8s-reinitialize

  • Clean up etcd in eqiad
  • Decommission the old master
  • Restart etcd servers
  • Image the new master
  • Reimage the nodes
  • Start master, controller, scheduler
  • helmfile sync admin_ng
  • deploy all services

Details

Repo                             Branch      Lines +/-
operations/puppet                production  +0 -1
operations/puppet                production  +12 -0
operations/deployment-charts     master      +2 -2
mediawiki/services/eventstreams  master      +3 -11
labs/private                     master      +0 -0
operations/cookbooks             master      +2 -0
operations/puppet                production  +4 -3
operations/dns                   master      +2 -4
operations/puppet                production  +1 -12
operations/deployment-charts     master      +5 -5
operations/puppet                production  +7 -12
operations/puppet                production  +4 -6
operations/dns                   master      +2 -2
operations/puppet                production  +25 -0
operations/puppet                production  +12 -30
operations/puppet                production  +2 -2
operations/puppet                production  +1 -1
operations/dns                   master      +4 -3
operations/puppet                production  +1 -1

Event Timeline

JMeybohm triaged this task as Medium priority. Mar 3 2021, 9:47 AM

Change 667996 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] deployment_server: Switch k8s staging to codfw

https://gerrit.wikimedia.org/r/667996

Change 668011 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] Remove old kubernetes staging master neon.eqiad.wmnet

https://gerrit.wikimedia.org/r/668011

Change 668012 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] Set role on new kubernetes staging master in eqiad

https://gerrit.wikimedia.org/r/668012

Change 668015 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] Switch from neon.eqiad.wmnet to kubestagemaster.svc.eqiad.wmnet

https://gerrit.wikimedia.org/r/668015

Change 667982 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/dns@master] Switch staging.svc.eqiad.wmnet to point to codfw k8s

https://gerrit.wikimedia.org/r/667982

Change 667983 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/dns@master] Change kubestagemaster.svc.equiad.wmnet to point to new master

https://gerrit.wikimedia.org/r/667983

Change 668012 abandoned by JMeybohm:
[operations/puppet@production] Set role on new kubernetes staging master in eqiad

Reason:
In favor of I2b3fa247dbe2cc9b6e9453d2e0b3ad3aa8cdadd6

https://gerrit.wikimedia.org/r/668012

Change 667867 had a related patch set uploaded (by JMeybohm; owner: Alexandros Kosiaris):
[operations/puppet@production] staging-eqiad: Apply role/hiera to new master

https://gerrit.wikimedia.org/r/667867

Change 668081 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] kubernetes staging: Move k8s, docker and calico version to common

https://gerrit.wikimedia.org/r/668081

Change 667982 merged by JMeybohm:
[operations/dns@master] Switch staging.svc.eqiad.wmnet to point to codfw k8s

https://gerrit.wikimedia.org/r/667982

Change 667996 merged by JMeybohm:
[operations/puppet@production] deployment_server: Switch k8s staging to codfw

https://gerrit.wikimedia.org/r/667996

Change 668114 had a related patch set uploaded (by JMeybohm; owner: Alexandros Kosiaris):
[operations/puppet@production] ci/releases: Switch to using the codfw staging cluster

https://gerrit.wikimedia.org/r/668114

Change 668114 merged by Alexandros Kosiaris:
[operations/puppet@production] ci/releases: Switch to using the codfw staging cluster

https://gerrit.wikimedia.org/r/668114

Change 667867 merged by JMeybohm:
[operations/puppet@production] staging-eqiad: Apply role/hiera to new master

https://gerrit.wikimedia.org/r/667867

Change 668122 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] files/ssl: Add certificate for kubestagemaster.svc.eqiad.wmnet.crt

https://gerrit.wikimedia.org/r/668122

Change 668122 merged by JMeybohm:
[operations/puppet@production] files/ssl: Add certificate for kubestagemaster.svc.eqiad.wmnet.crt

https://gerrit.wikimedia.org/r/668122

Change 667983 merged by JMeybohm:
[operations/dns@master] Change kubestagemaster.svc.equiad.wmnet to point to new master

https://gerrit.wikimedia.org/r/667983

Change 668015 merged by JMeybohm:
[operations/puppet@production] Switch from neon.eqiad.wmnet to kubestagemaster.svc.eqiad.wmnet

https://gerrit.wikimedia.org/r/668015

Change 668081 merged by JMeybohm:
[operations/puppet@production] kubernetes staging: Move k8s, docker and calico version to common

https://gerrit.wikimedia.org/r/668081

Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts:

['kubestage1001.eqiad.wmnet', 'kubestage1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103031708_jayme_31144.log.

Change 668143 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] admin_ng: Enable staging-eqiad

https://gerrit.wikimedia.org/r/668143

Change 668143 merged by jenkins-bot:
[operations/deployment-charts@master] admin_ng: Enable staging-eqiad

https://gerrit.wikimedia.org/r/668143

Completed auto-reimage of hosts:

['kubestage1002.eqiad.wmnet', 'kubestage1001.eqiad.wmnet']

and were ALL successful.

The actual steps were:

  • disable puppet on master and nodes
  • stop apiserver, controller manager, scheduler on master
  • Clean up etcd in eqiad
  • Restart etcd servers
  • Image the new master
  • Reimage the nodes
  • Start master, controller, scheduler
  • helmfile sync admin_ng
  • deploy all services

Still to do:

  • Decommission the old master

Change 668011 merged by JMeybohm:
[operations/puppet@production] Remove old kubernetes staging master neon.eqiad.wmnet

https://gerrit.wikimedia.org/r/668011

cookbooks.sre.hosts.decommission executed by jayme@cumin1001 for hosts: neon.eqiad.wmnet

  • neon.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
Generating the DNS records from Netbox data. It will take a couple of minutes.
----- OUTPUT of 'cd /tmp && runus...e asset tag one"' -----
2021-03-04 14:19:38,906 [INFO] Gathering devices, interfaces, addresses and prefixes from Netbox
2021-03-04 14:23:16,886 [ERROR] Failed to run
Traceback (most recent call last):
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 686, in main
    batch_status, ret_code = run_commit(args, config, tmpdir)
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 590, in run_commit
    netbox.collect()
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 156, in collect
    self._collect_device(device, True)
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 197, in _collect_device
    if self.addresses[primary.id].dns_name:
KeyError: 5979

Ran sre.dns.netbox again and it completed just fine. Mission accomplished then.
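
The traceback shows a plain dictionary lookup failing because key 5979 was absent from self.addresses; why it was absent is not stated in this task. As a minimal sketch of that failure mode only, the snippet below uses hypothetical stand-in types (Address, Device, primary_dns_name are all made up for illustration and are not the generate_dns_snippets.py code) to show how a defensive lookup could tolerate a primary address ID that is missing from the collected map.

```python
from dataclasses import dataclass
from typing import Dict, Optional


# Hypothetical stand-ins for the data gathered from Netbox; the real
# structures used by generate_dns_snippets.py are not shown in this task.
@dataclass
class Address:
    id: int
    dns_name: str


@dataclass
class Device:
    name: str
    primary_ip_id: Optional[int]


def primary_dns_name(device: Device, addresses: Dict[int, Address]) -> Optional[str]:
    """Return the DNS name of the device's primary address, or None.

    addresses[device.primary_ip_id] raises KeyError when the ID is not in
    the map; .get() lets the caller decide how to handle that case.
    """
    if device.primary_ip_id is None:
        return None
    address = addresses.get(device.primary_ip_id)
    if address is None:
        return None
    return address.dns_name or None


addresses = {1: Address(id=1, dns_name="host1.example.org")}
known = Device(name="host1", primary_ip_id=1)
stale = Device(name="host2", primary_ip_id=5979)  # ID not in the address map

print(primary_dns_name(known, addresses))  # host1.example.org
print(primary_dns_name(stale, addresses))  # None
```

Whether silently skipping such a device is acceptable depends on the tool; failing loudly and retrying, as was done here, is an equally valid choice.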

Still missing: docs on how to switch the staging cluster back and forth, and the actual switch back from codfw to eqiad.

Change 668505 had a related patch set uploaded (by Volans; owner: Volans):
[operations/cookbooks@master] sre.hosts.decommission: temporary fix for Netbox

https://gerrit.wikimedia.org/r/668505

Change 668646 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] Switch active kubernetes staging cluster to eqiad

https://gerrit.wikimedia.org/r/668646

Change 668647 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/dns@master] Switch staging.svc.eqiad.wmnet back to eqiad

https://gerrit.wikimedia.org/r/668647

Change 668647 merged by JMeybohm:
[operations/dns@master] Switch staging.svc.eqiad.wmnet back to eqiad

https://gerrit.wikimedia.org/r/668647

Change 668646 merged by JMeybohm:
[operations/puppet@production] Switch active kubernetes staging cluster to eqiad

https://gerrit.wikimedia.org/r/668646

List of services in staging and happiness by service checker

Through some automation (see P14643) and some manual followup I got the following table:

service                       service-checker-status  notes
apertium                      -                       OK, apertium doesn't have swagger/openapi specs
api-gateway                   -                       no spec, service-checker badly fails
blubberoid                    -
changeprop                                            No TLS
changeprop-jobqueue                                   No TLS
citoid                        -
cxserver                      -
echostore                                             spec under /openapi
eventgate-analytics           -
eventgate-analytics-external  -
eventgate-logging-external    -
eventgate-main                -
eventstreams                  x                       multiple failures
eventstreams-internal         x                       multiple failures
linkrecommendation            x                       timeout
mathoid                       -
mobileapps                    -
proton                        x                       timeout
push-notifications            -
recommendation-api            -
sessionstore                                          spec under /openapi
similar-users                 -                       No spec? should be added?
termbox                       -
wikifeeds                     -
zotero                        -                       OK, no swagger/openapi specs

Change 668505 merged by jenkins-bot:
[operations/cookbooks@master] sre.hosts.decommission: temporary fix for Netbox

https://gerrit.wikimedia.org/r/668505

Change 670167 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[labs/private@master] Move k8s controllermanager_token to common staging

https://gerrit.wikimedia.org/r/670167

Change 670167 merged by JMeybohm:
[labs/private@master] Move k8s controllermanager_token to common staging

https://gerrit.wikimedia.org/r/670167

linkrecommendation

  • Works with custom SPEC_URL: service-checker-swagger staging.svc.codfw.wmnet https://staging.svc.codfw.wmnet:4005 -t 2 -s /apispec_1.json
  • The cronjob object is missing the spec.jobTemplate.spec.template.metadata.labels added with I087099a6abd9f817014c15f47e80b77374e1dbb8, although the object shown by helm-diff on helmfile sync looks okay and tiller stores the right object in its configmap.
    • This leads to the job failing as it does not get egress rules applied.

proton

eventstreams and eventstreams-internal

  • Broken in production as well (in terms of service-checker), so no blocker I would say. Maybe @Ottomata knows more?

linkrecommendation

  • The cronjob object is missing the spec.jobTemplate.spec.template.metadata.labels added with I087099a6abd9f817014c15f47e80b77374e1dbb8, although the object shown by helm-diff on helmfile sync looks okay and tiller stores the right object in its configmap.
    • This leads to the job failing as it does not get egress rules applied.

This was due to a spec error in the cronjob yaml (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/670791) that went undetected because of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/670806 - both fixed now.
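
Since Kubernetes network policies select pods by label, a pod template without labels will not match any policy, which fits the missing egress rules described for the linkrecommendation cronjob. The following is a rough sketch of a check for that condition on a manifest that has already been parsed into a dict. The function name and the example label app: linkrecommendation are hypothetical; this is not part of the deployment-charts tooling.

```python
from typing import Any, Dict, Optional

# Path to the pod template labels inside a CronJob object.
LABELS_PATH = ("spec", "jobTemplate", "spec", "template", "metadata", "labels")


def pod_template_labels(cronjob: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Return the pod template labels of a CronJob dict, or None if absent or empty."""
    node: Any = cronjob
    for key in LABELS_PATH:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node if isinstance(node, dict) and node else None


# Hypothetical manifests, reduced to the fields relevant here.
with_labels = {
    "kind": "CronJob",
    "spec": {"jobTemplate": {"spec": {"template": {
        "metadata": {"labels": {"app": "linkrecommendation"}},
        "spec": {},
    }}}},
}
without_labels = {
    "kind": "CronJob",
    "spec": {"jobTemplate": {"spec": {"template": {"spec": {}}}}},
}

print(pod_template_labels(with_labels))     # {'app': 'linkrecommendation'}
print(pod_template_labels(without_labels))  # None
```

Running a check like this against the object returned by the API server, rather than against the rendered chart, is what would catch a case where the diff looks fine but the stored object does not.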

eventstreams and eventstreams-internal

  • Broken in production as well (in terms of service-checker), so no blocker I would say. Maybe @Ottomata knows more?

fwiw here is the output:

$ service-checker-swagger -t 15 eventstreams.discovery.wmnet https://eventstreams.discovery.wmnet:4892 
/v2/stream/{streams} (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200); /v2/stream/eventgate-main.test.event (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.page-create (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.page-delete (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.page-links-change (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.page-move (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.page-properties-change (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.page-undelete (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.recentchange (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.revision-create (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.revision-score (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/mediawiki.revision-visibility-change (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/page-create (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/page-delete (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/page-links-change (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/page-move (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/page-properties-change (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/page-undelete (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/recentchange (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/revision-create (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/revision-score (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200); /v2/stream/test (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 400 (expecting: 200)
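
Output like the above is hard to scan by eye. As a small sketch, assuming the message format stays exactly as shown in this paste (the regular expression and the tally_statuses helper are hypothetical and not part of service-checker), the failures can be tallied by returned status code:

```python
import re
from collections import Counter

# Matches one "<path> (<title>) is CRITICAL: ... unexpected status NNN (expecting: NNN)"
# entry; the format is inferred from the pasted output only.
CHECK_RE = re.compile(
    r"(?P<path>/\S+) \((?P<title>[^)]*)\) is CRITICAL: .*?"
    r"unexpected status (?P<status>\d{3}) \(expecting: (?P<expected>\d{3})\)"
)


def tally_statuses(output: str) -> Counter:
    """Count failing checks per returned HTTP status code."""
    return Counter(m.group("status") for m in CHECK_RE.finditer(output))


# Shortened sample with three of the entries from the output above.
sample = (
    "/v2/stream/{streams} (Untitled test) is CRITICAL: Test Untitled test "
    "returned the unexpected status 404 (expecting: 200); "
    "/v2/stream/recentchange (Untitled test) is CRITICAL: Test Untitled test "
    "returned the unexpected status 400 (expecting: 200); "
    "/v2/stream/test (Untitled test) is CRITICAL: Test Untitled test "
    "returned the unexpected status 400 (expecting: 200)"
)

print(tally_statuses(sample))  # Counter({'400': 2, '404': 1})
```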

eventstreams currently returns that even in production, so there is no way to know whether an endpoint is misbehaving.

However, the service seems to be working fine per https://grafana.wikimedia.org/d/znIuUcsWz/eventstreams?orgId=1&refresh=1m

Huh interesting. /v2/stream/{streams} is a parameterized route, why would servicechecker try to hit that one? 404 there makes sense, it is not providing a parameter.

For all the others, I'm not sure why a 400 would happen, but they are unending streams, so the server never finishes the HTTP response. EventStreams code actually disables the api spec based CI tests for the /v2/stream/* routes for that reason.

Can servicechecker be configured to do the same? Skip checking all /v2/stream routes?

Hmm. service-checker seems to call all GET routes with a "default_example" request unless you explicitly set x-monitor: false to skip them: https://github.com/wikimedia/operations-software-service-checker/blob/master/servicechecker/swagger.py#L129-L138
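
To illustrate the behaviour described here (every GET route is checked unless it sets x-monitor: false), below is a simplified sketch of such a selection over a parsed spec. It is not the service-checker implementation: the function name is hypothetical and the spec is a made-up fragment, with /_info added purely as an example of a route that stays monitored.

```python
from typing import Any, Dict, List


def monitored_get_paths(spec: Dict[str, Any]) -> List[str]:
    """Return the paths whose GET operation would be checked.

    A GET operation is skipped only when it explicitly sets
    "x-monitor" to False; anything else counts as monitored.
    """
    selected = []
    for path, methods in spec.get("paths", {}).items():
        get_op = methods.get("get")
        if get_op is None:
            continue  # no GET operation on this path
        if get_op.get("x-monitor", True) is False:
            continue  # explicitly opted out of monitoring
        selected.append(path)
    return selected


# Made-up spec fragment for illustration.
spec = {
    "paths": {
        "/v2/stream/{streams}": {"get": {"x-monitor": False}},
        "/v2/stream/recentchange": {"get": {}},
        "/_info": {"get": {}},
        "/v2/other": {"post": {}},
    }
}

print(monitored_get_paths(spec))  # ['/v2/stream/recentchange', '/_info']
```

With the opt-out applied to every /v2/stream route, none of the unending stream endpoints would be requested by the checker.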

ah! ok we can add x-monitor: false. I wonder if the service-template-node tests would also respect that. It looks like it does.

Cool! When that's done, deployed and working, you could also add both eventstreams incarnations to modules/lvs/manifests/monitor_services.pp in puppet to have them monitored in prod.

Change 670924 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/services/eventstreams@master] Set x-monitor: false for all /v2/stream routes

https://gerrit.wikimedia.org/r/670924

Change 670924 merged by Ottomata:
[mediawiki/services/eventstreams@master] Set x-monitor: false for all /v2/stream routes

https://gerrit.wikimedia.org/r/670924

Change 670936 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Bump eventstreams and -internal to 2021-03-11-200606-production

https://gerrit.wikimedia.org/r/670936

Change 670936 merged by Ottomata:
[operations/deployment-charts@master] Bump eventstreams and -internal to 2021-03-11-200606-production

https://gerrit.wikimedia.org/r/670936

Change 670960 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add eventstreams clusters to monitor_services.pp

https://gerrit.wikimedia.org/r/670960

Change 671175 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] kubernetes staging-eqiad: Enable notifications

https://gerrit.wikimedia.org/r/671175

Change 670960 merged by Ottomata:
[operations/puppet@production] Add eventstreams clusters to monitor_services.pp

https://gerrit.wikimedia.org/r/670960

Change 671175 merged by Alexandros Kosiaris:
[operations/puppet@production] kubernetes staging-eqiad: Enable notifications

https://gerrit.wikimedia.org/r/671175