
[toolsbeta] Rebuild servers to learn how to take down the services without downtime (and use affinities)
Open, Medium, Public

Description

Working guide

https://etherpad.wikimedia.org/p/toolsbeta_refresh_notes

Possibly useful references

etcd

Example of cli command:

root@toolsbeta-test-k8s-etcd-1:~# etcdctl  --endpoints "https://toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2379,https://toolsbeta-test-k8s-etcd-3.toolsbeta.eqiad.wmflabs:2379,https://toolsbeta-test-k8s-etcd-2.toolsbeta.eqiad.wmflabs:2379" --ca-file /etc/etcd/ssl/ca.pem --key-file /etc/etcd/ssl/toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs.priv --cert-file /etc/etcd/ssl/toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs.pem member list
67a7255628c1f89f: name=toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs peerURLs=https://toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2380 clientURLs=https://toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2379 isLeader=false
822c4bd670e96cb1: name=toolsbeta-test-k8s-etcd-3.toolsbeta.eqiad.wmflabs peerURLs=https://toolsbeta-test-k8s-etcd-3.toolsbeta.eqiad.wmflabs:2380 clientURLs=https://toolsbeta-test-k8s-etcd-3.toolsbeta.eqiad.wmflabs:2379 isLeader=true
cacc7abd354d7bbf: name=toolsbeta-test-k8s-etcd-2.toolsbeta.eqiad.wmflabs peerURLs=https://toolsbeta-test-k8s-etcd-2.toolsbeta.eqiad.wmflabs:2380 clientURLs=https://toolsbeta-test-k8s-etcd-2.toolsbeta.eqiad.wmflabs:2379 isLeader=false
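When replacing members one at a time it helps to know which node is currently the leader before taking anything down. A minimal sketch that parses the `member list` output shown above; it assumes the v2-style line format from that example, and the function names (`parse_member_list`, `find_leader`) are made up for illustration, not part of any existing tooling:

```python
import re

# One line of `etcdctl member list`, in the format of the example above:
#   <id>: name=<name> peerURLs=<url> clientURLs=<url> isLeader=<true|false>
# Assumption: anything not in this shape (shell prompt, blank lines,
# members in other states) is simply skipped.
MEMBER_RE = re.compile(r"^(?P<id>[0-9a-f]+): (?P<fields>.+)$")


def parse_member_list(output):
    """Return one dict per member line; non-matching lines are ignored."""
    members = []
    for line in output.splitlines():
        match = MEMBER_RE.match(line.strip())
        if not match:
            continue
        member = {"id": match.group("id")}
        # Fields are space-separated key=value pairs; split on the first "=".
        for field in match.group("fields").split():
            key, _, value = field.partition("=")
            member[key] = value
        member["isLeader"] = member.get("isLeader") == "true"
        members.append(member)
    return members


def find_leader(members):
    """Return the member dict reporting isLeader=true, or None."""
    for member in members:
        if member["isLeader"]:
            return member
    return None


# Output copied from the example above.
SAMPLE = """\
67a7255628c1f89f: name=toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs peerURLs=https://toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2380 clientURLs=https://toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2379 isLeader=false
822c4bd670e96cb1: name=toolsbeta-test-k8s-etcd-3.toolsbeta.eqiad.wmflabs peerURLs=https://toolsbeta-test-k8s-etcd-3.toolsbeta.eqiad.wmflabs:2380 clientURLs=https://toolsbeta-test-k8s-etcd-3.toolsbeta.eqiad.wmflabs:2379 isLeader=true
cacc7abd354d7bbf: name=toolsbeta-test-k8s-etcd-2.toolsbeta.eqiad.wmflabs peerURLs=https://toolsbeta-test-k8s-etcd-2.toolsbeta.eqiad.wmflabs:2380 clientURLs=https://toolsbeta-test-k8s-etcd-2.toolsbeta.eqiad.wmflabs:2379 isLeader=false
"""

if __name__ == "__main__":
    leader = find_leader(parse_member_list(SAMPLE))
    print(leader["name"] if leader else "no leader reported")
```

The parser only reads text that has already been captured, so it can be tried against saved output without touching the cluster.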

Event Timeline

dcaro created this task. Nov 3 2020, 4:50 PM
Restricted Application added a subscriber: Aklapper. Nov 3 2020, 4:50 PM
aborrero updated the task description. Nov 4 2020, 10:04 AM
dcaro updated the task description. Nov 4 2020, 12:12 PM

Mentioned in SAL (#wikimedia-cloud) [2020-11-04T15:42:06Z] <dcaro> re-creating the toolsbeta-proxy-03, used wrong image on the first try (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-05T15:53:58Z] <dcaro> Adding toolsbeta-proxy-3 to the list of slave proxies in hiera (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-05T16:40:51Z] <dcaro> Moving active proxy from proxy-1 to proxy-3 (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-06T13:18:48Z] <dcaro> bringing up a new proxy-4 instance as slave (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-06T15:53:57Z] <dcaro> Removing proxy-1 and proxy-3 from hiera, proxy-3 stays as active and proxy-4 as backup (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-06T15:56:50Z] <dcaro> Deleting instances proxy-1 and proxy-2, that will finish the proxy rebuild (T267140)

Bstorm added a comment. Nov 6 2020, 5:17 PM

@dcaro: I updated the requirements on T267082 slightly, in case that changes your strategies here.

Bstorm added a comment. Nov 6 2020, 7:24 PM

Added some notes in the etherpad that I hope are helpful.

Mentioned in SAL (#wikimedia-cloud) [2020-11-10T14:44:48Z] <dcaro> taking down one of the test-k8s etcd nodes to reimage (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-10T17:15:18Z] <dcaro> removing unused toolsbeta-k8s-etcd prefix (we use toolsbeta-test-k8s-etcd) (T267140)

For the toolsbeta-test-k8s-haproxy nodes: see the notes in the Toolforge task around keepalived, btw https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Keepalived

I don't think we put that on toolsbeta yet? It would be cool if we did :)

Mentioned in SAL (#wikimedia-cloud) [2020-11-10T17:18:27Z] <dcaro> launching instance toolsbeta-test-k8s-etcd-4 (T267140)

@dcaro In case you haven't found them yet and they are useful to you, we did write some docs the last time we tried adjusting a live etcd cluster: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Etcd#Adding_a_Member

Based on what worked at the time anyway :)

That doc is clearly from before we added client auth. The client auth would be required to make any of that work on this cluster.

Mentioned in SAL (#wikimedia-cloud) [2020-11-16T11:27:46Z] <dcaro> Creating instance toolsbeta-test-k8s-etcd5 and adding to the etcd cluster (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-16T11:44:54Z] <dcaro> etcd5 member added, creating instance toolsbeta-test-k8s-etcd6 and adding to the etcd cluster (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-17T08:54:03Z] <dcaro> etcd-4,5 and 6 are up and running, removing 1,2 and 3 (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-17T08:58:53Z] <dcaro> etcd hosts reimaged (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-17T10:40:09Z] <arturo> hand-edited /etc/kubernetes/manifests/kube-apiserver.yaml in all 3 k8s control nodes to account for new etcd servers (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-17T12:09:08Z] <Lucas_WMDE> <dcaro> 11:59:36 UTC – toolsbeta up and running again, documented on the live doc for now, apiserver had the wrong config (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-17T15:32:01Z] <dcaro> Creating new toolsbeta-test-k8s-control-4 node and adding it to the cluster (T267140)

aborrero renamed this task from [toolsbeta] Rebuild servers to learn how to take down the services without downtime to [toolsbeta] Rebuild servers to learn how to take down the services without downtime (and use affinities). Nov 17 2020, 3:44 PM

Mentioned in SAL (#wikimedia-cloud) [2020-11-18T10:50:31Z] <dcaro_> Adding new control-4 node to the control cluster (T267140)

Can u tell me what this means bc I'm dumb well I guess u think I am

Mentioned in SAL (#wikimedia-cloud) [2020-11-18T11:46:25Z] <dcaro_> Modifying the security groups to mirror tools (T267140)

@ByrdJessByrd42: Please ask general questions in support forums instead, not here. Thanks.

So can u show me how to

I keep getting email from u telling me this stuff why my name's is
ByrdJessByrf42 and has been for 13 years so what's all this about

Wrong place, as explained before.

Mentioned in SAL (#wikimedia-cloud) [2020-11-23T09:58:27Z] <dcaro> Remove control-1 node from the pool (was replaced by control-4) (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-23T10:32:09Z] <dcaro> Creating new control-5 node (will replace control-2) (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-23T10:45:53Z] <dcaro> Taking out control-2 node, replaced by control-5 (I saw one 503 reply on the proxy when creating control-5, fyi) (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-23T11:12:01Z] <dcaro> Launching control-6, to replace control-3 (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-23T14:08:15Z] <dcaro> Taking control-3 node out as control-6 is up and running (T267140)

Mentioned in SAL (#wikimedia-cloud) [2020-11-23T14:17:39Z] <dcaro> All control nodes re-imaged (T267140)

dcaro triaged this task as Medium priority. Dec 1 2020, 10:22 AM
dcaro moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.
dcaro moved this task from Doing to Clinic Duty on the cloud-services-team (Kanban) board.
dcaro moved this task from Clinic Duty to Needs discussion on the cloud-services-team (Kanban) board.
dcaro moved this task from Needs discussion to Blocked on the cloud-services-team (Kanban) board.
dcaro moved this task from Blocked to Watching on the cloud-services-team (Kanban) board.
dcaro moved this task from Watching to Graveyard on the cloud-services-team (Kanban) board.