Deploy and migrate tools to a Kubernetes v1.15 or newer cluster
Closed, Resolved (Public)

Assigned To
Authored By
Bstorm
Jan 23 2019, 7:46 PM

Description

Replace the aging Kubernetes cluster used in Toolforge with a modern deployment of Kubernetes using v1.15 or newer. This will involve:

  • Examining the method we use to deploy Kubernetes
  • Replacing custom code changes made in our legacy Kubernetes cluster with modern equivalents
  • Implementing RBAC to replace ABAC
  • Examining how we handle Ingress traffic to tools
  • ...

Event Timeline

Started the first run to create the backbone of user RBAC, etc. in tools.

So far, I've run into a problem:

root@tools-k8s-control-3:~# kubectl -n maintain-kubeusers exec -it maintain-kubeusers-ops -- /bin/ash
/app # source venv/bin/activate
(venv) /app # python maintain_kubeusers.py --gentle-mode --once
starting a run
Homedir already exists for /data/project/mirador
Wrote config in /data/project/mirador/.kube/config
Provisioned creds for tool mirador
Homedir already exists for /data/project/misc2svg
Traceback (most recent call last):
  File "maintain_kubeusers.py", line 1056, in <module>
    main()
  File "maintain_kubeusers.py", line 1030, in main
    api_server, ca_data, args.gentle_mode
  File "maintain_kubeusers.py", line 866, in write_kubeconfig
    self.write_config_file(config)
  File "maintain_kubeusers.py", line 691, in write_config_file
    f = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_NOFOLLOW)
PermissionError: [Errno 1] Operation not permitted: '/data/project/misc2svg/.kube/config'

I didn't encounter that previously in toolsbeta, and notably, this worked for the tool before it. I wonder if there aren't some configs with an immutable bit set.
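The traceback shows a single PermissionError from os.open ending the whole run. As a rough sketch only (this is not the actual maintain-kubeusers code, and try_write_config is a hypothetical helper name), a writer could keep the same O_NOFOLLOW protection but report the refusal to its caller, so that one unwritable kubeconfig does not stop provisioning for every tool after it:

```python
import os


def try_write_config(path: str, data: bytes) -> bool:
    """Write data to path without following symlinks.

    Returns False instead of raising when the kernel refuses the write
    (PermissionError covers both EPERM and EACCES), so the caller can
    log the tool and carry on with the next one.
    """
    flags = os.O_CREAT | os.O_WRONLY | os.O_TRUNC | os.O_NOFOLLOW
    try:
        fd = os.open(path, flags, 0o600)
    except PermissionError:
        return False
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return True
```

Whether skipping is the right policy is a judgment call: failing loudly, as the real run did, is what surfaced the immutable files in the first place.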

That is exactly what I just saw:

[bstorm@labstore1004]:~ $ sudo lsattr /srv/tools/shared/tools/project/misc2svg/.kube/config
----i--------e-- /srv/tools/shared/tools/project/misc2svg/.kube/config

😭

Mentioned in SAL (#wikimedia-cloud) [2019-12-17T01:05:34Z] <bstorm_> beginning the first run of the new maintain-kubeusers in gentle-mode -- but it was just killed by some files setting the immutable bit T214513

1704 of the kubeconfigs are immutable, which is just about half the tools.

Mentioned in SAL (#wikimedia-cloud) [2019-12-17T01:25:13Z] <bstorm_> unset the immutable bit from 1704 tool kubeconfigs T214513
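The log entry does not record how the bit was cleared. A minimal sketch of one way to do it, assuming the standard lsattr and chattr utilities are available on the NFS server and the caller has root (the function names are illustrative):

```python
import subprocess


def is_immutable(lsattr_line: str) -> bool:
    """True if an lsattr output line has the 'i' (immutable) flag set."""
    # The flags field is everything before the first space; the path follows.
    flags, _, _path = lsattr_line.strip().partition(" ")
    return "i" in flags


def find_immutable(paths):
    """Yield the regular files that lsattr reports as immutable."""
    for path in paths:
        out = subprocess.check_output(["lsattr", path], text=True)
        if is_immutable(out):
            yield path


def clear_immutable(path: str) -> None:
    """Clear the immutable bit on one file (requires root)."""
    subprocess.check_call(["chattr", "-i", path])
```

For example, is_immutable("----i--------e-- /srv/tools/shared/tools/project/misc2svg/.kube/config") is true for the lsattr output shown above.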

Mentioned in SAL (#wikimedia-cloud) [2019-12-17T01:26:01Z] <bstorm_> running the first run of maintain-kubeusers 2.0 for the new cluster T214513 (more successfully this time)

Failed again after working through more than a thousand tools:

Wrote config in /data/project/commons_describer/.kube/config
Could not create podsecuritypolicy for <__main__.User object at 0x7f01c0221210>
Traceback (most recent call last):
  File "maintain_kubeusers.py", line 1056, in <module>
    main()
  File "maintain_kubeusers.py", line 1032, in main
    k8s_api.add_user_access(tools[tool_name])
  File "maintain_kubeusers.py", line 644, in add_user_access
    self.generate_psp(user)
  File "maintain_kubeusers.py", line 497, in generate_psp
    _ = self.extensions.create_pod_security_policy(policy)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 756, in create_pod_security_policy
    (data) = self.create_pod_security_policy_with_http_info(body, **kwargs)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 841, in create_pod_security_policy_with_http_info
    collection_formats=collection_formats)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
    _request_timeout=_request_timeout)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
    body=body)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/rest.py", line 266, in POST
    body=body)
  File "/app/venv/lib/python3.7/site-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (422)
Reason: Unprocessable Entity
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'Date': 'Tue, 17 Dec 2019 02:21:59 GMT', 'Content-Length': '980'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"PodSecurityPolicy.extensions \"tool-commons_describer-psp\" is invalid: metadata.name: Invalid value: \"tool-commons_describer-psp\": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')","reason":"Invalid","details":{"name":"tool-commons_describer-psp","group":"extensions","kind":"PodSecurityPolicy","causes":[{"reason":"FieldValueInvalid","message":"Invalid value: \"tool-commons_describer-psp\": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')","field":"metadata.name"}]},"code":422}

So we apparently have at least one tool with a '_' in the name. That will break this every time. This also would be invalid as a namespace name, so I have no clue how it would work in the old cluster to begin with. It must have been manually hacked in?
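A pre-flight check could find these names before a long run starts. The sketch below reuses the validation regex quoted in the API server's error message and the "tool-<name>-psp" naming visible in that message; the 253-character limit is the usual Kubernetes cap for DNS-1123 subdomain names, and the helper names are made up for illustration:

```python
import re

# Regex copied from the API server's 422 response above.
DNS1123_SUBDOMAIN = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)
MAX_LEN = 253  # Kubernetes length limit for DNS-1123 subdomain names


def is_dns1123_subdomain(name: str) -> bool:
    """True if name would pass the metadata.name validation shown above."""
    return len(name) <= MAX_LEN and DNS1123_SUBDOMAIN.fullmatch(name) is not None


def invalid_tool_names(tool_names):
    """Return the tools whose derived PSP name would be rejected."""
    return [t for t in tool_names if not is_dns1123_subdomain(f"tool-{t}-psp")]
```

Running invalid_tool_names over the full tool list before provisioning would list every offender at once instead of one per failed run.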

Mentioned in SAL (#wikimedia-cloud) [2019-12-17T04:48:40Z] <bstorm_> completed first run of maintain-kubeusers 2 in the new cluster T214513

Change 558523 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: bastion: raise default value for nproc

https://gerrit.wikimedia.org/r/558523

Change 558523 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: bastion: raise default value for nproc

https://gerrit.wikimedia.org/r/558523

Mentioned in SAL (#wikimedia-cloud) [2019-12-17T16:53:33Z] <bstorm_> maintain-kubeusers app deployed fully in tools for new kubernetes cluster T214513 T228499

So we apparently have at least one tool with a '_' in the name. That will break this every time. This also would be invalid as a namespace name, so I have no clue how it would work in the old cluster to begin with. It must have been manually hacked in?

lol? How did that happen?

Apparently, I can never do anything large in Toolforge without discovering a fascinating past mistake of my forebears (and/or myself). I hacked around that and am thankful that it is only four tools.

Bstorm closed subtask Restricted Task as Resolved. Jan 6 2020, 7:05 PM

For the opt-in manual migration period, it would be nice if there could be some kind of helper script like the following (untested):

#!/usr/bin/env python3
# (C) 2020 Kunal Mehta <legoktm@member.fsf.org> under Apache-2.0

import argparse
import os
import subprocess
import sys

import yaml


MANIFEST = os.path.expanduser('~/service.manifest')


def current_webservice():
    """Return the webservice type recorded in service.manifest, or exit."""
    if not os.path.exists(MANIFEST):
        print('This tool does not currently have a webservice running')
        sys.exit(1)
    with open(MANIFEST) as f:
        data = yaml.safe_load(f)
    if data['backend'] != 'kubernetes':
        print('This tool is not running its webservice on kubernetes')
        sys.exit(1)
    if 'web' not in data:
        print('This tool does not currently have a webservice running')
        sys.exit(1)

    return data['web']


def main():
    parser = argparse.ArgumentParser(description='Migrate your webservice to the new k8s cluster')
    parser.add_argument('--revert', action='store_true', help='Revert back to the old k8s cluster')
    args = parser.parse_args()
    # TODO: check that it already hasn't migrated over to the new cluster...how?
    current = current_webservice()
    subprocess.check_call(['webservice', 'stop'])
    # --revert selects the 'default' context; otherwise use 'toolforge'
    context = 'default' if args.revert else 'toolforge'
    subprocess.check_call(['/usr/bin/kubectl', 'config', 'use-context', context])
    subprocess.check_call(['webservice', current, '--backend=kubernetes', 'start'])
    if args.revert:
        subprocess.check_call(['dologmsg', 'Switched back to the old k8s cluster'])
    else:
        subprocess.check_call(['dologmsg', 'Switched over to use new k8s cluster'])


if __name__ == '__main__':
    main()
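One possible answer to the TODO in the script, offered as an untested idea rather than an agreed approach: compare kubectl's active context with the one the script is about to select. The context names 'default' and 'toolforge' are taken from the script itself; the helper names are hypothetical. Note that the active context only shows where kubectl is pointed, which is an assumption about where the webservice is actually running.

```python
import subprocess

OLD_CONTEXT = "default"    # context the script selects for --revert
NEW_CONTEXT = "toolforge"  # context the script selects otherwise


def target_context(revert: bool) -> str:
    """Context the migration script would switch to."""
    return OLD_CONTEXT if revert else NEW_CONTEXT


def needs_switch(active: str, revert: bool) -> bool:
    """False when kubectl is already using the context we would select."""
    return active.strip() != target_context(revert)


def active_context() -> str:
    """Ask kubectl which context is currently active."""
    return subprocess.check_output(
        ["kubectl", "config", "current-context"], text=True
    ).strip()
```

In main(), a guard such as "if not needs_switch(active_context(), args.revert): sys.exit(0)" placed before the webservice is stopped would avoid an unnecessary restart.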

bd808 renamed this task from "Upgrade Toolforge Kubernetes" to "Deploy and migrate tools to a Kubernetes v1.15 or newer cluster". Jan 8 2020, 9:54 PM
bd808 updated the task description.

Change 574568 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[labs/tools/maintain-kubeusers@master] release: remove "gentle mode" from deployment

https://gerrit.wikimedia.org/r/574568

Change 574625 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[labs/tools/maintain-kubeusers@master] refactor: split up the huge file

https://gerrit.wikimedia.org/r/574625

I just noticed when testing maintain-kubeusers against a 1.17 cluster that we have a deprecated API version for podsecuritypolicies baked into it. That needs to be fixed. It simply will not run on 1.17 this way.
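For context, the earlier traceback shows the PodSecurityPolicy being created through the client's extensions_v1beta1 API. Kubernetes stopped serving PodSecurityPolicy from extensions/v1beta1 in v1.16, with policy/v1beta1 as the replacement group. The actual fix in maintain-kubeusers is not shown in this task and would presumably switch the client API class; the sketch below only illustrates the version change on a plain manifest dictionary, with a hypothetical helper name:

```python
import copy

OLD_API = "extensions/v1beta1"  # no longer served for PSP as of Kubernetes 1.16
NEW_API = "policy/v1beta1"      # replacement API group/version for PSP


def migrate_psp_api_version(manifest: dict) -> dict:
    """Return a copy of a PSP manifest that uses the policy/v1beta1 API.

    Manifests of other kinds, or already on another version, come back
    unchanged. The input is never modified.
    """
    updated = copy.deepcopy(manifest)
    if (
        updated.get("kind") == "PodSecurityPolicy"
        and updated.get("apiVersion") == OLD_API
    ):
        updated["apiVersion"] = NEW_API
    return updated
```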

bd808 added a parent task: Restricted Task. Feb 25 2020, 6:01 PM

Change 575322 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge-kubernetes: shut down the old maintain-kubeusers

https://gerrit.wikimedia.org/r/575322

Change 575325 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: remove the ancient version of kubectl

https://gerrit.wikimedia.org/r/575325

Change 575322 merged by Bstorm:
[operations/puppet@production] toolforge-kubernetes: shut down the old maintain-kubeusers

https://gerrit.wikimedia.org/r/575322

Change 574568 merged by Bstorm:
[labs/tools/maintain-kubeusers@master] release: remove "gentle mode" from deployment

https://gerrit.wikimedia.org/r/574568

Change 575325 merged by Bstorm:
[operations/puppet@production] toolforge: remove the ancient version of kubectl

https://gerrit.wikimedia.org/r/575325

Change 576469 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: remove special configuration for kubernetes on proxy servers

https://gerrit.wikimedia.org/r/576469

Change 576469 merged by Bstorm:
[operations/puppet@production] toolforge: remove special configuration for kubernetes on proxy servers

https://gerrit.wikimedia.org/r/576469

Change 574625 merged by Bstorm:
[labs/tools/maintain-kubeusers@master] refactor: split up the huge file

https://gerrit.wikimedia.org/r/574625

Change 577364 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[labs/tools/maintain-kubeusers@master] refactor: split up the huge file

https://gerrit.wikimedia.org/r/577364

Change 577364 merged by Bstorm:
[labs/tools/maintain-kubeusers@master] refactor: split up the huge file

https://gerrit.wikimedia.org/r/577364

Bstorm removed a parent task: Restricted Task. Mar 10 2020, 4:37 PM
bd808 assigned this task to Bstorm.

There are some small bits of this which are currently stalled, but the core work of the project is completed. Lots of folks helped, but I think @Bstorm deserves the credit not only for her technical work but also for the project management role she played.

nskaggs closed subtask Restricted Task as Resolved. Jan 25 2022, 4:12 PM