
Deploy and migrate tools to a Kubernetes v1.15 or newer cluster
Open, High, Public

Description

Replace the aging Kubernetes cluster used in Toolforge with a modern deployment of Kubernetes using v1.15 or newer. This will involve:

  • Examining the method we use to deploy Kubernetes
  • Replacing custom code changes made in our legacy Kubernetes cluster with modern equivalents
  • Implementing RBAC to replace ABAC
  • Examining how we handle Ingress traffic to tools
  • ...


Event Timeline

GTirloni removed a subscriber: GTirloni. Mar 21 2019, 9:06 PM
Xinbenlv added a subscriber: Xinbenlv.

Waiting for K8s to be upgraded to a modern version.

Hugely important work. Fingers crossed!

Because the version of Kubernetes in Toolforge was related to some lousy error messages during an outage, and this is now one of the actionables from that incident, adding the Incident tag.

bd808 moved this task from On-going to Follow-up on the Wikimedia-Incident board. Sep 16 2019, 4:35 PM
Base added a subscriber: Base. Oct 19 2019, 5:43 PM

Change 547504 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: rename hiera keys for consistency

https://gerrit.wikimedia.org/r/547504

Change 547504 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: rename hiera keys for consistency

https://gerrit.wikimedia.org/r/547504

Change 547509 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: rename node to worker

https://gerrit.wikimedia.org/r/547509

Change 547509 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: rename node to worker

https://gerrit.wikimedia.org/r/547509

Change 549613 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/software/tools-webservice@master] new k8s: Fix ingress object and enable toolsbeta ingress creation

https://gerrit.wikimedia.org/r/549613

Change 549661 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] kubectl: upgrade /usr/bin/kubectl to 1.15.5

https://gerrit.wikimedia.org/r/549661

Change 549613 merged by Bstorm:
[operations/software/tools-webservice@master] new k8s: Fix ingress object and enable toolsbeta ingress creation

https://gerrit.wikimedia.org/r/549613

Change 549661 merged by Bstorm:
[operations/puppet@production] kubectl: upgrade /usr/bin/kubectl to 1.15.5

https://gerrit.wikimedia.org/r/549661

aborrero removed a subscriber: chasemp. Nov 19 2019, 10:20 AM

Started the first run to create the backbone of user RBAC, etc. in tools.

So far, I've run into a problem:

root@tools-k8s-control-3:~# kubectl -n maintain-kubeusers exec -it maintain-kubeusers-ops -- /bin/ash
/app # source venv/bin/activate
(venv) /app # python maintain_kubeusers.py --gentle-mode --once
starting a run
Homedir already exists for /data/project/mirador
Wrote config in /data/project/mirador/.kube/config
Provisioned creds for tool mirador
Homedir already exists for /data/project/misc2svg
Traceback (most recent call last):
  File "maintain_kubeusers.py", line 1056, in <module>
    main()
  File "maintain_kubeusers.py", line 1030, in main
    api_server, ca_data, args.gentle_mode
  File "maintain_kubeusers.py", line 866, in write_kubeconfig
    self.write_config_file(config)
  File "maintain_kubeusers.py", line 691, in write_config_file
    f = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_NOFOLLOW)
PermissionError: [Errno 1] Operation not permitted: '/data/project/misc2svg/.kube/config'

I didn't encounter that previously in toolsbeta, and notably, this worked for the tool before it. I wonder whether some configs have the immutable bit set.

That is exactly what I just saw:

[bstorm@labstore1004]:~ $ sudo lsattr /srv/tools/shared/tools/project/misc2svg/.kube/config
----i--------e-- /srv/tools/shared/tools/project/misc2svg/.kube/config

😭

Mentioned in SAL (#wikimedia-cloud) [2019-12-17T01:05:34Z] <bstorm_> beginning the first run of the new maintain-kubeusers in gentle-mode -- but it was just killed by some files setting the immutable bit T214513

1704 are immutable--just about half the tools.
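
A count like this can be produced by parsing `lsattr` output. The snippet below is only a sketch of that idea, not the command that was actually run: it assumes the flags are the first whitespace-separated field of each line (as in the output above), and the second sample path is made up for illustration.

```python
def is_immutable(lsattr_line):
    """Return True if an `lsattr` output line carries the immutable ('i') flag."""
    # The flags field comes first, followed by the path.
    flags, _, _path = lsattr_line.strip().partition(" ")
    # Lowercase 'i' is the immutable flag; the check is case-sensitive.
    return "i" in flags


def count_immutable(lsattr_lines):
    """Count the non-empty `lsattr` lines that have the immutable flag set."""
    return sum(1 for line in lsattr_lines if line.strip() and is_immutable(line))


sample = [
    # Line taken from the output above.
    "----i--------e-- /srv/tools/shared/tools/project/misc2svg/.kube/config",
    # Hypothetical line for a file without the immutable bit.
    "-------------e-- /path/to/other-tool/.kube/config",
]
print(count_immutable(sample))  # prints 1
```

Only the flags field is inspected, so an "i" appearing in a path does not cause a false positive.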

Mentioned in SAL (#wikimedia-cloud) [2019-12-17T01:25:13Z] <bstorm_> unset the immutable bit from 1704 tool kubeconfigs T214513

Mentioned in SAL (#wikimedia-cloud) [2019-12-17T01:26:01Z] <bstorm_> running the first run of maintain-kubeusers 2.0 for the new cluster T214513 (more successfully this time)

It failed again after working through more than a thousand tools:

Wrote config in /data/project/commons_describer/.kube/config
Could not create podsecuritypolicy for <__main__.User object at 0x7f01c0221210>
Traceback (most recent call last):
  File "maintain_kubeusers.py", line 1056, in <module>
    main()
  File "maintain_kubeusers.py", line 1032, in main
    k8s_api.add_user_access(tools[tool_name])
  File "maintain_kubeusers.py", line 644, in add_user_access
    self.generate_psp(user)
  File "maintain_kubeusers.py", line 497, in generate_psp
    _ = self.extensions.create_pod_security_policy(policy)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 756, in create_pod_security_policy
    (data) = self.create_pod_security_policy_with_http_info(body, **kwargs)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 841, in create_pod_security_policy_with_http_info
    collection_formats=collection_formats)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
    _request_timeout=_request_timeout)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
    body=body)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/rest.py", line 266, in POST
    body=body)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (422)
Reason: Unprocessable Entity
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'Date': 'Tue, 17 Dec 2019 02:21:59 GMT', 'Content-Length': '980'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"PodSecurityPolicy.extensions \"tool-commons_describer-psp\" is invalid: metadata.name: Invalid value: \"tool-commons_describer-psp\": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')","reason":"Invalid","details":{"name":"tool-commons_describer-psp","group":"extensions","kind":"PodSecurityPolicy","causes":[{"reason":"FieldValueInvalid","message":"Invalid value: \"tool-commons_describer-psp\": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')","field":"metadata.name"}]},"code":422}

So we apparently have at least one tool with a '_' in the name. That will break this every time. This also would be invalid as a namespace name, so I have no clue how it would work in the old cluster to begin with. It must have been manually hacked in?
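
Offending tool names could be found ahead of a run with a check along these lines. This is a sketch, not part of maintain_kubeusers: the regex is copied from the API error above, the `tool-<name>-psp` pattern is inferred from the rejected object name, and the helper names are hypothetical.

```python
import re

# Regex copied from the API server's validation message above.
DNS1123_SUBDOMAIN = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)
MAX_LEN = 253  # Kubernetes length limit for DNS-1123 subdomain names


def is_valid_object_name(name):
    """Return True if `name` is a valid DNS-1123 subdomain."""
    return len(name) <= MAX_LEN and DNS1123_SUBDOMAIN.fullmatch(name) is not None


def invalid_tools(tool_names):
    """Return the tools whose derived PSP name would be rejected."""
    # Assumption: the PSP is named tool-<name>-psp, as in the error message.
    return [t for t in tool_names if not is_valid_object_name("tool-%s-psp" % t)]


print(invalid_tools(["mirador", "misc2svg", "commons_describer"]))
# prints ['commons_describer']
```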

Mentioned in SAL (#wikimedia-cloud) [2019-12-17T04:48:40Z] <bstorm_> completed first run of maintain-kubeusers 2 in the new cluster T214513

Change 558523 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: bastion: raise default value for nproc

https://gerrit.wikimedia.org/r/558523

Change 558523 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: bastion: raise default value for nproc

https://gerrit.wikimedia.org/r/558523

Mentioned in SAL (#wikimedia-cloud) [2019-12-17T16:53:33Z] <bstorm_> maintain-kubeusers app deployed fully in tools for new kubernetes cluster T214513 T228499

> So we apparently have at least one tool with a '_' in the name. That will break this every time. This also would be invalid as a namespace name, so I have no clue how it would work in the old cluster to begin with. It must have been manually hacked in?

lol? How did that happen?

> lol? How did that happen?

Apparently, I can never do anything large in Toolforge without discovering a fascinating past mistake of my forebears (and/or myself). I hacked around that and am thankful that it is only four tools.

zhuyifei1999 added a subscriber: zhuyifei1999.
bd808 added a subtask: Restricted Task. Dec 29 2019, 5:04 AM
Bstorm closed subtask Restricted Task as Resolved. Jan 6 2020, 7:05 PM
Legoktm added a subscriber: Legoktm. Jan 7 2020, 9:54 PM

For the opt-in manual migration period, it would be nice if there could be some kind of helper script like the following (untested):

#!/usr/bin/env python3
# (C) 2020 Kunal Mehta <legoktm@member.fsf.org> under Apache-2.0

import argparse
import os
import subprocess
import sys

import yaml


MANIFEST = os.path.expanduser('~/service.manifest')


def current_webservice():
    """Return the webservice type recorded in the tool's service.manifest."""
    if not os.path.exists(MANIFEST):
        print('This tool does not currently have a webservice running')
        sys.exit(1)
    with open(MANIFEST) as f:
        # An empty manifest parses to None; treat it as an empty mapping.
        data = yaml.safe_load(f) or {}
    if data.get('backend') != 'kubernetes':
        print('This tool is not running its webservice on kubernetes')
        sys.exit(1)
    if 'web' not in data:
        print('This tool does not currently have a webservice running')
        sys.exit(1)

    return data['web']


def main():
    parser = argparse.ArgumentParser(description='Migrate your webservice to the new k8s cluster')
    parser.add_argument('--revert', action='store_true', help='Revert back to the old k8s cluster')
    args = parser.parse_args()
    # TODO: check that it already hasn't migrated over to the new cluster...how?
    current = current_webservice()
    subprocess.check_call(['webservice', 'stop'])
    # Select the kubectl context for the target cluster.
    context = 'default' if args.revert else 'toolforge'
    subprocess.check_call(['/usr/bin/kubectl', 'config', 'use-context', context])
    subprocess.check_call(['webservice', current, '--backend=kubernetes', 'start'])
    # Log which cluster the tool ended up on.
    cluster = 'old' if args.revert else 'new'
    subprocess.check_call(['dologmsg', 'Switched over to use {} k8s cluster'.format(cluster)])


if __name__ == '__main__':
    main()
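
One possible answer to the TODO in the script is to compare the active kubectl context with the desired one before doing anything. The following is a sketch under assumptions: the context names `default` and `toolforge` come from the script above, the helper names are hypothetical, and the active context only says which cluster kubectl targets, not where the webservice is actually running.

```python
import subprocess

# Context names taken from the script above; whether they match the real
# kubeconfig layout is an assumption.
OLD_CONTEXT = 'default'
NEW_CONTEXT = 'toolforge'


def current_context(kubectl='/usr/bin/kubectl'):
    """Return the name of the active kubectl context."""
    out = subprocess.check_output([kubectl, 'config', 'current-context'])
    return out.decode().strip()


def needs_switch(active, revert=False):
    """Return True if the active context differs from the desired one."""
    wanted = OLD_CONTEXT if revert else NEW_CONTEXT
    return active != wanted
```

In `main()`, a call such as `needs_switch(current_context(), args.revert)` could then gate the stop/start sequence so that a tool already on the target cluster is left alone.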

bd808 renamed this task from Upgrade Toolforge Kubernetes to Deploy and migrate tools to a Kubernetes v1.15 or newer cluster. Jan 8 2020, 9:54 PM
bd808 updated the task description.