
[toolsbeta] Rebuild servers to learn how to take down the services without downtime (and use affinities)
Closed, Resolved (Public)

Description

Working guide

https://etherpad.wikimedia.org/p/toolsbeta_refresh_notes

Possibly useful references

etcd

Example of cli command:

root@toolsbeta-test-k8s-etcd-1:~# etcdctl  --endpoints "https://toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2379,https://toolsbeta-test-k8s-etcd-3.toolsbeta.eqiad.wmflabs:2379,https://toolsbeta-test-k8s-etcd-2.toolsbeta.eqiad.wmflabs:2379" --ca-file /etc/etcd/ssl/ca.pem --key-file /etc/etcd/ssl/toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs.priv --cert-file /etc/etcd/ssl/toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs.pem member list
67a7255628c1f89f: name=toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs peerURLs=https://toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2380 clientURLs=https://toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2379 isLeader=false
822c4bd670e96cb1: name=toolsbeta-test-k8s-etcd-3.toolsbeta.eqiad.wmflabs peerURLs=https://toolsbeta-test-k8s-etcd-3.toolsbeta.eqiad.wmflabs:2380 clientURLs=https://toolsbeta-test-k8s-etcd-3.toolsbeta.eqiad.wmflabs:2379 isLeader=true
cacc7abd354d7bbf: name=toolsbeta-test-k8s-etcd-2.toolsbeta.eqiad.wmflabs peerURLs=https://toolsbeta-test-k8s-etcd-2.toolsbeta.eqiad.wmflabs:2380 clientURLs=https://toolsbeta-test-k8s-etcd-2.toolsbeta.eqiad.wmflabs:2379 isLeader=false
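When replacing members it helps to know which node is currently the leader. Here is a minimal sketch, not part of the original notes, that parses the `member list` output shown above and picks out the leader. The function names (`parse_member_list`, `find_leader`) are made up for illustration, and the parser assumes the older `key=value` output format seen in this paste.

```python
import re

# Sample output copied from the `member list` call above.
SAMPLE = """\
67a7255628c1f89f: name=toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs peerURLs=https://toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2380 clientURLs=https://toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2379 isLeader=false
822c4bd670e96cb1: name=toolsbeta-test-k8s-etcd-3.toolsbeta.eqiad.wmflabs peerURLs=https://toolsbeta-test-k8s-etcd-3.toolsbeta.eqiad.wmflabs:2380 clientURLs=https://toolsbeta-test-k8s-etcd-3.toolsbeta.eqiad.wmflabs:2379 isLeader=true
cacc7abd354d7bbf: name=toolsbeta-test-k8s-etcd-2.toolsbeta.eqiad.wmflabs peerURLs=https://toolsbeta-test-k8s-etcd-2.toolsbeta.eqiad.wmflabs:2380 clientURLs=https://toolsbeta-test-k8s-etcd-2.toolsbeta.eqiad.wmflabs:2379 isLeader=false
"""

# "<hex id>: key=value key=value ..." (assumed format, taken from the paste).
LINE_RE = re.compile(r"^(?P<id>[0-9a-f]+): (?P<fields>.*)$")


def parse_member_list(text):
    """Return one dict per member line; lines that do not match are skipped."""
    members = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        fields = dict(
            token.split("=", 1)
            for token in match.group("fields").split()
            if "=" in token
        )
        members.append(
            {
                "id": match.group("id"),
                "name": fields.get("name", ""),
                "peer_url": fields.get("peerURLs", ""),
                "client_url": fields.get("clientURLs", ""),
                "is_leader": fields.get("isLeader") == "true",
            }
        )
    return members


def find_leader(members):
    """Return the member flagged as leader, or None if no line says so."""
    for member in members:
        if member["is_leader"]:
            return member
    return None


if __name__ == "__main__":
    leader = find_leader(parse_member_list(SAMPLE))
    print(leader["name"], leader["id"])
```

For the sample above this reports toolsbeta-test-k8s-etcd-3 as the leader, matching the `isLeader=true` line.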

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2020-11-04T15:42:06Z] <dcaro> re-creating the toolsbeta-proxy-03, used wrong image on the first try (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-05T15:53:58Z] <dcaro> Adding toolsbeta-proxy-3 to the list of slave proxies in hiera (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-05T16:40:51Z] <dcaro> Moving active proxy from proxy-1 to proxy-3 (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-06T13:18:48Z] <dcaro> bringing up a new proxy-4 instance as slave (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-06T15:53:57Z] <dcaro> Removing proxy-1 and proxy-3 from hiera, proxy-3 stays as active and proxy-4 as backup (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-06T15:56:50Z] <dcaro> Deleting instances proxy-1 and proxy-2, that will finish the proxy rebuild (T267140)

@dcaro: I updated the requirements on T267082 slightly, in case that changes your strategies here.

Added some notes in the etherpad that I hope are helpful.

Mentioned in SAL (#wikimedia-cloud) [2020-11-10T14:44:48Z] <dcaro> taking down one of the test-k8s etcd nodes to reimage (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-10T17:15:18Z] <dcaro> removing unused toolsbeta-k8s-etcd prefix (we use toolsbeta-test-k8s-etcd) (T267140)

For the toolsbeta-test-k8s-haproxy nodes: see the notes in the Toolforge task around keepalived, btw https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Keepalived

I don't think we put that on toolsbeta yet? It would be cool if we did :)

Mentioned in SAL (#wikimedia-cloud) [2020-11-10T17:18:27Z] <dcaro> launching instance toolsbeta-test-k8s-etcd-4 (T267140)

@dcaro In case you haven't found them yet and they are useful to you, we did write some docs from the last time we tried adjusting a live etcd cluster https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Etcd#Adding_a_Member

Based on what worked at the time anyway :)

That doc is clearly from before we added client auth. The client auth would be required to make any of that work on this cluster.

Mentioned in SAL (#wikimedia-cloud) [2020-11-16T11:27:46Z] <dcaro> Creating instance toolsbeta-test-k8s-etcd5 and adding to the etcd cluster (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-16T11:44:54Z] <dcaro> etcd5 member added, creating instance toolsbeta-test-k8s-etcd6 and adding to the etcd cluster (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-17T08:54:03Z] <dcaro> etcd-4,5 and 6 are up and running, removing 1,2 and 3 (T267140)
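For reference, a rough sketch of the quorum arithmetic behind this kind of swap. It assumes the new members are added one at a time and the old ones are then removed one at a time, which is one reading of the log entries above and not a confirmed procedure. The short member names are placeholders. The only hard fact used is that an etcd cluster of n members needs floor(n/2) + 1 of them for quorum.

```python
def quorum(size):
    """Members that must be healthy for a cluster of `size` to accept writes."""
    return size // 2 + 1


def fault_tolerance(size):
    """Members that can be lost while the cluster still has quorum."""
    return size - quorum(size)


def replacement_plan(old_members, new_members):
    """List each add/remove step with the resulting size, quorum and tolerance.

    Assumed ordering: add every new member first, then remove the old ones.
    """
    members = list(old_members)
    steps = []
    for name in new_members:
        members.append(name)
        size = len(members)
        steps.append(("add", name, size, quorum(size), fault_tolerance(size)))
    for name in old_members:
        members.remove(name)
        size = len(members)
        steps.append(("remove", name, size, quorum(size), fault_tolerance(size)))
    return steps


if __name__ == "__main__":
    # Placeholder names, not the real instance FQDNs.
    plan = replacement_plan(
        ["etcd-1", "etcd-2", "etcd-3"],
        ["etcd-4", "etcd-5", "etcd-6"],
    )
    for action, name, size, needed, tolerated in plan:
        print(f"{action:6} {name}: size={size} quorum={needed} tolerates={tolerated}")
```

With this ordering the cluster never drops below three members, so it can lose at least one node at every step. The even-sized intermediate states (4 and 6) raise the quorum without tolerating more failures than the odd size just below them.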

Mentioned in SAL (#wikimedia-cloud) [2020-11-17T10:40:09Z] <arturo> hand-edited /etc/kubernetes/manifests/kube-apiserver.yaml in all 3 k8s control nodes to account for new etcd servers (T267140)
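As an illustration of the kind of change involved: kube-apiserver takes its etcd endpoints as a comma-separated `--etcd-servers` flag, so pointing it at new members means rewriting that value. The sketch below builds the flag and swaps it into a command list. The hostnames and the sample command are assumptions for illustration (the task does not record the exact FQDNs of the new members), and the actual edit here was done by hand in the manifest.

```python
def etcd_servers_flag(hosts, port=2379):
    """Build the --etcd-servers value from a list of etcd hostnames."""
    return "--etcd-servers=" + ",".join(f"https://{host}:{port}" for host in hosts)


def update_command(command, hosts):
    """Return a copy of `command` with its --etcd-servers flag replaced.

    If the flag is missing, it is appended at the end.
    """
    flag = etcd_servers_flag(hosts)
    updated = [flag if arg.startswith("--etcd-servers=") else arg for arg in command]
    if flag not in updated:
        updated.append(flag)
    return updated


if __name__ == "__main__":
    # Hypothetical hostnames and command, for illustration only.
    old_command = [
        "kube-apiserver",
        "--etcd-servers=https://etcd-1.example:2379,https://etcd-2.example:2379",
        "--secure-port=6443",
    ]
    new_hosts = ["etcd-4.example", "etcd-5.example", "etcd-6.example"]
    for arg in update_command(old_command, new_hosts):
        print(arg)
```

All control nodes need the same value, otherwise some apiservers keep talking to members that have been removed.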

Mentioned in SAL (#wikimedia-cloud) [2020-11-17T12:09:08Z] <Lucas_WMDE> <dcaro> 11:59:36 UTC – toolsbeta up and running again, documented on the live doc for now, apiserver had the wrong config (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-17T15:32:01Z] <dcaro> Creating new toolsbeta-test-k8s-control-4 node and adding it to the cluster (T267140)

aborrero renamed this task from "[toolsbeta] Rebuild servers to learn how to take down the services without downtime" to "[toolsbeta] Rebuild servers to learn how to take down the services without downtime (and use affinities)". Nov 17 2020, 3:44 PM

Mentioned in SAL (#wikimedia-cloud) [2020-11-18T10:50:31Z] <dcaro_> Adding new control-4 node to the control cluster (T267140)

Can u tell me what this means bc I'm dumb well I guess u think I am

Mentioned in SAL (#wikimedia-cloud) [2020-11-18T11:46:25Z] <dcaro_> Modifying the security groups to mirror tools (T267140)

@ByrdJessByrd42: Please ask general questions in support forums instead, not here. Thanks.

I keep getting email from u telling me this stuff why my name's is
ByrdJessByrf42 and has been for 13 years so what's all this about

Wrong place, as explained before.

Mentioned in SAL (#wikimedia-cloud) [2020-11-23T09:58:27Z] <dcaro> Remove control-1 node from the pool (was replaced by control-4) (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-23T10:32:09Z] <dcaro> Creating new control-5 node (will replace control-2) (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-23T10:45:53Z] <dcaro> Taking out control-2 node, replaced by control-5 (I saw one 503 reply on the proxy when creating control-5, fyi) (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-23T11:12:01Z] <dcaro> Launching control-6, to replace control-3 (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-23T14:08:15Z] <dcaro> Taking control-3 node out as control-6 is up and running (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-23T14:17:39Z] <dcaro> All control nodes re-imaged (T267140)

dcaro triaged this task as Medium priority. Dec 1 2020, 10:22 AM
dcaro moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.
dcaro moved this task from Doing to Clinic Duty on the cloud-services-team (Kanban) board.
dcaro moved this task from Clinic Duty to Needs discussion on the cloud-services-team (Kanban) board.
dcaro moved this task from Needs discussion to Blocked on the cloud-services-team (Kanban) board.
dcaro moved this task from Blocked to Watching on the cloud-services-team (Kanban) board.
dcaro moved this task from Watching to Graveyard on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-cloud) [2021-06-29T08:57:09Z] <wm-bot> Depooling and removing worker , will pick the oldest. (T267140) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud) [2021-06-29T08:57:16Z] <wm-bot> Draining node toolsbeta-test-k8s-worker-1... (T267140) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud) [2021-06-29T08:58:47Z] <wm-bot> Depooling and removing worker , will pick the oldest. (T267140) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud) [2021-06-29T08:58:53Z] <wm-bot> Draining node toolsbeta-test-k8s-worker-1... (T267140) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud) [2021-06-29T08:59:38Z] <wm-bot> Drained node toolsbeta-test-k8s-worker-1. (T267140) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud) [2021-06-29T08:59:56Z] <wm-bot> Depooled and removed worker toolsbeta-test-k8s-worker-1.toolsbeta.eqiad1.wikimedia.cloud. (T267140) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud) [2021-06-29T09:12:16Z] <wm-bot> Depooling and removing worker , will pick the oldest. (T267140) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud) [2021-06-29T09:12:19Z] <wm-bot> Draining node toolsbeta-test-k8s-worker-2... (T267140) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud) [2021-06-29T09:13:12Z] <wm-bot> Drained node toolsbeta-test-k8s-worker-2. (T267140) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud) [2021-06-29T09:13:29Z] <wm-bot> Depooled and removed worker toolsbeta-test-k8s-worker-2.toolsbeta.eqiad1.wikimedia.cloud. (T267140) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud) [2021-06-29T09:27:38Z] <wm-bot> Depooling and removing worker , will pick the oldest. (T267140) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud) [2021-06-29T09:27:41Z] <wm-bot> Draining node toolsbeta-test-k8s-worker-3... (T267140) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud) [2021-06-29T09:28:26Z] <wm-bot> Drained node toolsbeta-test-k8s-worker-3. (T267140) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud) [2021-06-29T09:28:44Z] <wm-bot> Depooled and removed worker toolsbeta-test-k8s-worker-3.toolsbeta.eqiad1.wikimedia.cloud. (T267140) - cookbook ran by dcaro@vulcanus

dcaro removed dcaro as the assignee of this task. Aug 10 2021, 5:06 PM
dcaro raised the priority of this task from Medium to Needs Triage.
taavi subscribed.

Anything left to do here? I think most toolforge things have cookbooks now.

dcaro claimed this task.

I think we can close it, and if/when we need to do anything new specifically we can open a new task for it. There's no point in keeping this super-wide task around.