toolforge: some additional testing before final migration
Closed, Declined | Public

Assigned To

None

Authored By

	aborrero
	Nov 19 2019, 10:19 AM

Description

It could be interesting to do some real-world testing on the new Kubernetes cluster in toolsbeta before starting the final migration in the tools project.

The toolsbeta cluster is currently scaled to 3 control nodes, 2 worker nodes, 3 etcd servers, 1 haproxy, 1 front proxy (dynamicproxy).

Some ideas and questions I would like to see answered:

How many requests can the north-south proxy setup handle, i.e. front proxy (dynamicproxy) + haproxy + ingress? We have a Grafana dashboard to measure this: https://grafana-labs.wikimedia.org/d/R7BPaEbWk/toolforge-ingress?refresh=1m&orgId=1 but we don't have a test tool running there yet.
How many pods can we run on just a couple of worker nodes? How does memory and CPU oversubscription work in this new cluster? We have a Grafana dashboard to measure this: https://grafana-labs.wikimedia.org/d/toolforge-kubernetes/toolforge-kubernetes?refresh=1m&orgId=1 but we don't have any test tool running there yet to do actual tests.
How does nginx-ingress behave when hundreds of ingress objects are being created/removed? This was answered in T239405: toolforge: new k8s: evaluate ingress controller reload behaviour
What happens when we scale the cluster up/down? Is service interrupted in any way? Specifically when adding/removing control and worker nodes.
The haproxy setup is not HA. We have a cold standby server. How long (and how badly) is the service interrupted in case of failover?
The frontproxy (dynamicproxy) setup is not HA. We have a cold standby server. How long (and how badly) is the service interrupted in case of failover?
Estimate/test the service impact of relocating a VM to a different cloudvirt (etcd, worker, control, haproxy, frontproxy).
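
One possible way to put a number on the failover and relocation questions is to probe a test tool's URL at a fixed interval while the failover is triggered by hand, then collapse the failed probes into outage windows. The following is only a minimal sketch under assumptions: the helper names (`probe`, `collect`, `outage_windows`), the probe interval, and the choice to count only HTTP 5xx and connection errors as an outage are all hypothetical and not part of any existing Toolforge tooling.

```python
import time
import urllib.error
import urllib.request
from typing import Iterable, List, Optional, Tuple

Sample = Tuple[float, bool]  # (timestamp in seconds, probe succeeded?)


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with an HTTP status below 500.

    Assumption: a 4xx answer still proves the proxy chain is up, so only
    5xx answers and connection-level errors are counted as failures.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        return err.code < 500
    except (urllib.error.URLError, OSError):
        return False


def collect(url: str, duration: float, interval: float = 0.5) -> List[Sample]:
    """Probe the URL every `interval` seconds for roughly `duration` seconds."""
    samples: List[Sample] = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), probe(url)))
        time.sleep(interval)
    return samples


def outage_windows(samples: Iterable[Sample]) -> List[Tuple[float, float]]:
    """Collapse runs of failed probes into (first_failure, first_recovery) pairs."""
    windows: List[Tuple[float, float]] = []
    start: Optional[float] = None
    last: Optional[float] = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts  # an outage begins
        elif ok and start is not None:
            windows.append((start, ts))  # the outage ends at the first good probe
            start = None
        last = ts
    if start is not None and last is not None:
        # Still failing when sampling stopped: close the window at the last sample.
        windows.append((start, last))
    return windows
```

Running `outage_windows(collect(url, 120))` against a test tool while switching to the cold standby would give a rough interruption length. The resolution is bounded by the probe interval plus the timeout, so short blips may go unnoticed.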

Details

	Subject	Repo	Branch	Lines +/-
	dynamicproxy: add backend information to access log entries	operations/puppet	production	+1 -1
	toolforge: Update toolviews.py nginx log parser	operations/puppet	production	+3 -1


Related Objects

Status	Assigned	Task
		Restricted Task
Resolved	• Bstorm	T246122 Upgrade the Toolforge Kubernetes cluster to v1.16
		Restricted Task
Resolved	bd808	T232536 Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail
Resolved	• Bstorm	T236565 "tools" Cloud VPS project jessie deprecation
Resolved	aborrero	T101651 Set up toolsbeta more fully to help make testing easier
Resolved	• Bstorm	T166949 Homedir/UID info breaks after a while in Tools Kubernetes (can't read replica.my.cnf)
Resolved	• Bstorm	T246059 Add admin account creation to maintain-kubeusers
Resolved	• Bstorm	T154504 Make webservice backend default to kubernetes
Declined	None	T245230 Investigate cpu/ram requests and limits for DaemonSets pods
Resolved	• Bstorm	T214513 Deploy and migrate tools to a Kubernetes v1.15 or newer cluster
Resolved	aborrero	T215531 Deploy upgraded Kubernetes to toolsbeta
Declined	None	T238641 toolforge: some additional testing before final migration
Resolved	aborrero	T239403 toolforge: new k8s: scale up a bit the cluster before final tests and initial migrations
Open	None	T239404 toolforge: new k8s: evaluate DNS (coredns) autoscale options
Resolved	aborrero	T239405 toolforge: new k8s: evaluate ingress controller reload behaviour
Stalled	None	T239406 toolforge: new k8s: evalute and test firewalling via calico

Event Timeline

aborrero created this task. Nov 19 2019, 10:19 AM

aborrero triaged this task as Medium priority. Nov 22 2019, 10:14 AM

aborrero closed subtask T239403: toolforge: new k8s: scale up a bit the cluster before final tests and initial migrations as Resolved. Nov 29 2019, 12:18 PM

Change 554041 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] dynamicproxy: add backend information to access log entries

https://gerrit.wikimedia.org/r/554041

gerritbot added a project: Patch-For-Review. Dec 2 2019, 11:23 AM

aborrero updated the task description. Dec 5 2019, 12:59 PM

Change 558676 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: Update toolviews.py nginx log parser

https://gerrit.wikimedia.org/r/558676

Change 558676 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: Update toolviews.py nginx log parser

https://gerrit.wikimedia.org/r/558676

Change 554041 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] dynamicproxy: add backend information to access log entries

https://gerrit.wikimedia.org/r/554041

Maintenance_bot removed a project: Patch-For-Review. Dec 20 2019, 12:10 PM

aborrero closed subtask T239405: toolforge: new k8s: evaluate ingress controller reload behaviour as Resolved. Jan 2 2020, 1:39 PM

The new k8s cluster is fully working now.

We had no time to test/answer many of the items stated in the task description, but things look good so far anyway!

aborrero changed the status of subtask T239406: toolforge: new k8s: evalute and test firewalling via calico from Open to Stalled. Jul 6 2020, 11:16 AM
