Bstorm (Brooke)
Ops Witch -- Wikimedia Cloud Services Team

User Details

User Since
Jan 22 2018, 10:09 PM (77 w, 3 d)
Availability
Available
IRC Nick
bstorm_
LDAP User
Bstorm
MediaWiki User
BStorm (WMF) [ Global Accounts ]

On the wikis, I'm BStorm (WMF), bstorm_ on IRC and Bstorm on gerrit and WikiTech.

I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.

Recent Activity

Yesterday

Bstorm added a comment to T227290: Design and document how to integrate the new Toolforge k8s cluster with PodSecurityPolicy.

Ok, the toolsbeta-test-k8s cluster now has PSP enabled and was built from scratch that way. Updated the build procedure here: T215531

Thu, Jul 18, 8:28 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm closed T228450: Mount Dumps NFS share on instances in the wcdo Cloud VPS project as Resolved.

Now mounted at /public/dumps/ on wcdo.wcdo.eqiad.wmflabs

Thu, Jul 18, 8:24 PM · cloud-services-team (Kanban), VPS-Projects, Data-Services
Bstorm moved T228450: Mount Dumps NFS share on instances in the wcdo Cloud VPS project from Backlog to Shared Storage on the Data-Services board.
Thu, Jul 18, 8:03 PM · cloud-services-team (Kanban), VPS-Projects, Data-Services
Bstorm added a comment to T228192: Change database password for tool toolforge: gyan:.

I reset your password to a new one @Jnanaranjan_sahu. Please confirm everything works as expected.

Thu, Jul 18, 8:00 PM · cloud-services-team (Kanban), Data-Services, Toolforge
Bstorm closed T227377: Request creation of Linkwatcher and COIBot VPS project, a subtask of T224154: Reduce size of linkwatcher db on toolsdb if at all possible, as Resolved.
Thu, Jul 18, 7:53 PM · Data-Services
Bstorm closed T227377: Request creation of Linkwatcher and COIBot VPS project as Resolved.
Thu, Jul 18, 7:53 PM · cloud-services-team (Kanban), Cloud-VPS (Project-requests)
Bstorm added a comment to T227377: Request creation of Linkwatcher and COIBot VPS project.

The project is now available in https://horizon.wikimedia.org to spin up virtual machines and such (see https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS).
You should be able to make a couple of VMs and start setting up the systems on there. You won't have anywhere near the disk space right now to run the database in the project, so you'll want to continue to connect to the Toolsdb like you do now until we are able to work out another solution. If you need more RAM or CPU, etc. please request more quota and we can take a look!

Thu, Jul 18, 7:53 PM · cloud-services-team (Kanban), Cloud-VPS (Project-requests)
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

To explain this patch and the one where I changed the docker service class:
The docker service class was being left out of master since it was easy to forget. I made it an include at the module level (to make the module functional and internally consistent) instead of declaring it in class context in the profile. Separating it out like that is how we manage roles to keep them flexible (which I get), but doing it at the module level makes modules require unusual quirks and insider knowledge just to make them work. Modules are developed elsewhere with a primary init.pp gateway that accepts all options, with most else configured by that interface. I'm fine not using the init pattern in modules, but I'd rather not make it more confusing as well by splitting it out too much.

Thu, Jul 18, 7:35 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Since this works perfectly now (for whatever reason--I have theories that don't ultimately matter much now), the final form of the build process now looks like this:

Thu, Jul 18, 7:10 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Ok, the cluster is now using PSP on init, and it works fine. I have no idea what caused our problem before, but a clean rebuild works great.

Thu, Jul 18, 6:40 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T227290: Design and document how to integrate the new Toolforge k8s cluster with PodSecurityPolicy.

When putting this out there, it was broken until I concatenated the whole file to the kubeadm-init.yaml. However, some parts are not applied that way:

Thu, Jul 18, 3:31 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes

Wed, Jul 17

Bstorm added a comment to T215678: Replace each of the custom controllers with something in a new Toolforge Kubernetes setup.

The uidenforcer admission controller appears to be a combination of "don't run as root" and RBAC when RBAC doesn't exist.

Wed, Jul 17, 9:57 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T227290: Design and document how to integrate the new Toolforge k8s cluster with PodSecurityPolicy.

Ok, so that's the basic cluster build and enablement portion. Now we need to sort out any additional RBAC and PSPs for the rest of the pods.

Wed, Jul 17, 9:35 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T227377: Request creation of Linkwatcher and COIBot VPS project.

Nope, basically, I just have to create the project for now. I'll note here when it's ready to go.

Wed, Jul 17, 9:30 PM · cloud-services-team (Kanban), Cloud-VPS (Project-requests)
Bstorm added a comment to T153068: Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads.

That's contrary to what @Bstorm said, but fine.

Wed, Jul 17, 8:44 PM · cloud-services-team (Kanban), Data-Services, Operations, video2commons
Bstorm added a comment to T153068: Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads.

I really must decline this request if that's the reason. My thinking on this is:

Wed, Jul 17, 8:28 PM · cloud-services-team (Kanban), Data-Services, Operations, video2commons
Bstorm added a subtask for T215678: Replace each of the custom controllers with something in a new Toolforge Kubernetes setup: T227290: Design and document how to integrate the new Toolforge k8s cluster with PodSecurityPolicy.
Wed, Jul 17, 7:45 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a parent task for T227290: Design and document how to integrate the new Toolforge k8s cluster with PodSecurityPolicy: T215678: Replace each of the custom controllers with something in a new Toolforge Kubernetes setup.
Wed, Jul 17, 7:45 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T227290: Design and document how to integrate the new Toolforge k8s cluster with PodSecurityPolicy.

Alright, experimenting a bit, I'm able to use kubeadm deploy with pod security policy. I'll tweak the policies where appropriate and see about adding them to puppet.

Wed, Jul 17, 7:45 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T153068: Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads.

We cross mount dumps NFS I believe to stats hosts (which might be production-ish), but those are read-only mounts.

Wed, Jul 17, 7:25 PM · cloud-services-team (Kanban), Data-Services, Operations, video2commons
Bstorm added a comment to T153068: Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads.

This seems like a bad idea. Scratch is writable by all of cloud. I do not want that mounted in production, if that's what we are ultimately talking about.

Wed, Jul 17, 7:24 PM · cloud-services-team (Kanban), Data-Services, Operations, video2commons
Bstorm added a comment to T227290: Design and document how to integrate the new Toolforge k8s cluster with PodSecurityPolicy.

This has more depth (noting for my reference--I don't expect blog type posts to work 100%, but the context is good). https://octetz.com/posts/setting-up-psps
I'll see if I can cook up a patch that will allow us to start playing with them after trying some things on minikube first.

Wed, Jul 17, 1:45 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T227290: Design and document how to integrate the new Toolforge k8s cluster with PodSecurityPolicy.

Good reference as well. I may start by trying this on my local minikube https://github.com/kubernetes/kubeadm/issues/791

Wed, Jul 17, 1:35 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T227290: Design and document how to integrate the new Toolforge k8s cluster with PodSecurityPolicy.

Helpful person who also couldn't figure this out documented how they made it work on their blog: https://pmcgrath.net/using-pod-security-policies-with-kubeadm
Basically, we have to create RBACs and PSPs for the control plane pods so they will be created during init. I'll keep looking around and we can try different approaches to this. I tend to imagine that the api-server pods can be restarted with new settings by changing the right config maps. However, we've done a good job at making clusters fully reproducible/rebuildable so far. I tend to think that the right rbac policy files might be just the thing.
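As a rough illustration of the "RBACs and PSPs for the control plane pods" idea, the sketch below builds a permissive PodSecurityPolicy, a ClusterRole granting `use` on it, and a RoleBinding in `kube-system`, then prints them as a JSON list that `kubectl apply -f` can read. The object names (`privileged-psp`, `privileged-psp-user`, `privileged-psp-kube-system`) are hypothetical, and this is not the policy actually deployed on the cluster, only the general shape such manifests take under `policy/v1beta1`.

```python
import json


def privileged_psp(name):
    """A wide-open PSP, suitable only for trusted control plane pods."""
    return {
        "apiVersion": "policy/v1beta1",
        "kind": "PodSecurityPolicy",
        "metadata": {"name": name},
        "spec": {
            "privileged": True,
            "allowPrivilegeEscalation": True,
            "allowedCapabilities": ["*"],
            "hostNetwork": True,
            "hostPID": True,
            "hostIPC": True,
            "hostPorts": [{"min": 0, "max": 65535}],
            "volumes": ["*"],
            # These four rule blocks are required fields of a PSP spec.
            "runAsUser": {"rule": "RunAsAny"},
            "seLinux": {"rule": "RunAsAny"},
            "supplementalGroups": {"rule": "RunAsAny"},
            "fsGroup": {"rule": "RunAsAny"},
        },
    }


def psp_use_role(role_name, psp_name):
    """ClusterRole that lets its subjects 'use' one named PSP."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": role_name},
        "rules": [{
            "apiGroups": ["policy"],
            "resources": ["podsecuritypolicies"],
            "resourceNames": [psp_name],
            "verbs": ["use"],
        }],
    }


def kube_system_binding(binding_name, role_name):
    """Bind the role to nodes and kube-system service accounts only."""
    rbac = "rbac.authorization.k8s.io"
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": binding_name, "namespace": "kube-system"},
        "roleRef": {"apiGroup": rbac, "kind": "ClusterRole", "name": role_name},
        "subjects": [
            {"apiGroup": rbac, "kind": "Group", "name": "system:nodes"},
            {"apiGroup": rbac, "kind": "Group",
             "name": "system:serviceaccounts:kube-system"},
        ],
    }


def control_plane_manifests():
    """Bundle the three objects (hypothetical names) into a v1 List."""
    items = [
        privileged_psp("privileged-psp"),
        psp_use_role("privileged-psp-user", "privileged-psp"),
        kube_system_binding("privileged-psp-kube-system", "privileged-psp-user"),
    ]
    return {"apiVersion": "v1", "kind": "List", "items": items}


print(json.dumps(control_plane_manifests(), indent=2))
```

Using a RoleBinding scoped to `kube-system` rather than a ClusterRoleBinding keeps the permissive policy away from pods in other namespaces, which would need their own, tighter policies.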

Wed, Jul 17, 1:34 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm created T228238: Remove nfsiostat collector for diamond if possible, which may be broken on tools workers.
Wed, Jul 17, 3:07 AM · cloud-services-team (Kanban)

Tue, Jul 16

Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

I say that partly because we have a lot of work to do to get this "toolforge ready" now that we've got a handle on a process for kubeadm itself.

Tue, Jul 16, 8:15 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

I went ahead and tried this:

Tue, Jul 16, 8:15 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm updated the task description for T224188: rack/setup/install (3) new osd ceph nodes.
Tue, Jul 16, 7:48 PM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services
Bstorm added a comment to T224188: rack/setup/install (3) new osd ceph nodes.

Per what was decided by WMCS in T228102, the hostname proposal is now cloudcephosd100* for the three. Updating the description with that much at least.

Tue, Jul 16, 7:47 PM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services
Bstorm moved T228102: rack/setup/install cloudcephmon100[123] from Needs discussion to Inbox on the cloud-services-team (Kanban) board.
Tue, Jul 16, 7:46 PM · cloud-services-team (Kanban), Operations, Cloud-Services, ops-eqiad
Bstorm added a comment to T228102: rack/setup/install cloudcephmon100[123].

After talking in the weekly meeting, it's now cloudcephmon100*, updating the description.

Tue, Jul 16, 7:46 PM · cloud-services-team (Kanban), Operations, Cloud-Services, ops-eqiad
Bstorm renamed T228102: rack/setup/install cloudcephmon100[123] from rack/setup/install cloudmon100[123] to rack/setup/install cloudcephmon100[123].
Tue, Jul 16, 7:45 PM · cloud-services-team (Kanban), Operations, Cloud-Services, ops-eqiad
Bstorm triaged T227377: Request creation of Linkwatcher and COIBot VPS project as Normal priority.
Tue, Jul 16, 7:27 PM · cloud-services-team (Kanban), Cloud-VPS (Project-requests)
Bstorm added a comment to T227377: Request creation of Linkwatcher and COIBot VPS project.

This was discussed at our weekly meeting. We decided to approve creation of the project to allow build out of the app, but we also agreed to keep the data on toolsdb for now. More consideration is needed to properly manage the data set in the future, but that will be out of scope for this, I think. That will help with toolforge performance concerns, but there is a need to work out more tooling and so forth on our end to provide a reasonable, reliable service for that much data.

Tue, Jul 16, 7:27 PM · cloud-services-team (Kanban), Cloud-VPS (Project-requests)
Bstorm triaged T228174: Function Call, wrong number of arguments (4 for 5) when a puppet master is connected to labs puppetmaster as Normal priority.
Tue, Jul 16, 3:21 PM · cloud-services-team (Kanban), Cloud-Services
Bstorm added a project to T228174: Function Call, wrong number of arguments (4 for 5) when a puppet master is connected to labs puppetmaster: cloud-services-team (Kanban).
Tue, Jul 16, 3:20 PM · cloud-services-team (Kanban), Cloud-Services
Bstorm added a comment to T228174: Function Call, wrong number of arguments (4 for 5) when a puppet master is connected to labs puppetmaster.

Very odd.

Tue, Jul 16, 3:20 PM · cloud-services-team (Kanban), Cloud-Services
Bstorm added a comment to T228174: Function Call, wrong number of arguments (4 for 5) when a puppet master is connected to labs puppetmaster.

So I read that as something wrong in $dnsconfig = hiera_hash('labsdnsconfig', {})

Tue, Jul 16, 3:18 PM · cloud-services-team (Kanban), Cloud-Services
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

No, that seems more like etcd needs cleanup to me.

Tue, Jul 16, 12:11 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T228102: rack/setup/install cloudcephmon100[123].

Great point @aborrero! I almost half wanted to name all of these "cloudstore" and figure it out from there, but that's not great. cloudstoremon perhaps just to keep the brand out of the name. The OSDs are literally slated to be cloudosd.

Tue, Jul 16, 11:33 AM · cloud-services-team (Kanban), Operations, Cloud-Services, ops-eqiad

Mon, Jul 15

Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.
NOTE: puppet is disabled on master-1 where I was livehacking--for when you try things in your morning. Feel free to re-enable and mess with things, of course. I didn't un-hack anything on the puppetmaster itself
Mon, Jul 15, 10:47 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

The alternative is, obviously, to use a client cert that is held in common by all the nodes (with each of their names on it) and turn on client cert checking. That cert can be made using the puppet cert generate command with the --dns_alt_names option including the names of all three master nodes. I tested that process in another project. It's kind of weird (puts the resulting files in /var/lib/puppet/ssl/server/ where it keeps the original master certs made during bootstrap, but it didn't seem to break anything where I tested it). I can't say I like it, but it might be good. I mean, with access to the puppetmaster, one can also just use the openssl CLI to make and sign a cert for this that will be trusted by etcd.

Mon, Jul 15, 10:45 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.
# kubectl get nodes
NAME                          STATUS   ROLES    AGE     VERSION
toolsbeta-test-k8s-master-1   Ready    master   7m24s   v1.15.0
toolsbeta-test-k8s-master-2   Ready    master   3m46s   v1.15.0
toolsbeta-test-k8s-master-3   Ready    master   2m55s   v1.15.0
Mon, Jul 15, 10:02 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Victory! Now I'll try to join another control plane node.

Mon, Jul 15, 9:56 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Reset etcd with ETCDCTL_API=3 etcdctl --endpoints https://toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2379 del "" --from-key=true as normal user.

Mon, Jul 15, 9:52 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Without it I see this problem with calico:

Mon, Jul 15, 9:50 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

I added a command to quickly get the ca-cert-hash, btw, in the wiki page of notes.

Mon, Jul 15, 9:36 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

It does *not* work as merged.

Mon, Jul 15, 9:36 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Going to try to bootstrap the other two cluster nodes using the --upload-certs thing since I haven't tried it myself yet :)

Mon, Jul 15, 8:51 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Changed my mind on that last bit because you can specify certs with etcdctl :) No need to skip ssl whether we enable the verification or not.
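To make the "specify certs with etcdctl" point concrete, here is a small sketch that assembles an etcdctl v3 command line with the TLS flags `--cacert`, `--cert` and `--key`. The endpoint and the `get / --prefix --keys-only` query are the ones used elsewhere in this task; the certificate paths are placeholders, not the real locations on the hosts, and the helper name is made up for illustration. The command is only printed, not executed.

```python
import shlex


def etcdctl_command(endpoint, args, cacert=None, cert=None, key=None):
    """Build an etcdctl v3 argv list, adding TLS flags only when paths are given."""
    cmd = ["etcdctl", "--endpoints", endpoint]
    if cacert:
        cmd += ["--cacert", cacert]
    if cert:
        cmd += ["--cert", cert]
    if key:
        cmd += ["--key", key]
    return cmd + list(args)


# Placeholder paths: substitute wherever the CA, client cert and key really live.
cmd = etcdctl_command(
    "https://toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2379",
    ["get", "/", "--prefix", "--keys-only"],
    cacert="/path/to/ca.pem",
    cert="/path/to/client-cert.pem",
    key="/path/to/client-key.pem",
)

# ETCDCTL_API=3 selects the v3 API, as in the commands quoted in this task.
print("ETCDCTL_API=3 " + shlex.join(cmd))
```

To actually run it, the list can be passed to `subprocess.run` with `ETCDCTL_API=3` set in the environment.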

Mon, Jul 15, 7:25 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Finally figured out how to query this version of etcd: ETCDCTL_API=3 etcdctl --endpoints https://toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2379 get / --prefix --keys-only

Mon, Jul 15, 7:07 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Wait, no. That'll work or the peer cert file as long as that ca.pem is the puppet one. Never mind :-p

Mon, Jul 15, 6:44 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

I went ahead to try and answer my own questions and noticed a problem. We have etcd cert authentication enabled, and that cert is the puppet cert of the etcd server we've spun up.
from /etc/default/etcd

Mon, Jul 15, 6:42 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Wait, does it work without the client cert matching on every node (reading it more)? I was expecting it to add that client cert to a configuration map in kubernetes, and I was worried it wouldn't work if they were different. Maybe it doesn't matter because it just stores the location and all of them are valid individually :) That'd be awesome!

Mon, Jul 15, 5:05 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Just a comment on the certs issue:
With an external etcd cluster, the external cluster is in control of the server certs. If it requires the certs to be in our config, then this version of etcd does require authentication (or this version of k8s does), which is honestly the right thing to do. I was figuring we could continue to use it without auth and use the puppet certs like we used to.

Mon, Jul 15, 4:59 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T166949: Homedir/UID info breaks after a while in Tools Kubernetes (can't read replica.my.cnf).

A describe on a running pod that displays that would tell us the pod spec.

Mon, Jul 15, 3:38 PM · Patch-For-Review, Tool-Global-user-contributions, cloud-services-team (Kanban), Kubernetes, Toolforge, Cloud-VPS
Bstorm added a comment to T227395: tools-worker-1022 k8s duplicate node.
NOTE: I've been lazy and just deleted the bad node entry in the past.
Mon, Jul 15, 3:07 PM · cloud-services-team
Bstorm added a project to T208690: create revision_commentindex: cloud-services-team (Kanban).
Mon, Jul 15, 2:13 PM · cloud-services-team (Kanban), Data-Services
Bstorm updated subscribers of T208690: create revision_commentindex.
Mon, Jul 15, 2:12 PM · cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T208690: create revision_commentindex.

I'd rather not make a slippery slope argument, but I must admit it's there in my mind. Revision is the table where I'd want this kind of thing vs. probably anywhere else.

Mon, Jul 15, 2:12 PM · cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T212231: Remove Diamond from production.

Thank you!!

Mon, Jul 15, 1:10 PM · observability, Operations
Bstorm added a comment to T212231: Remove Diamond from production.

I was on vacation last week, so I wasn't following the code reviews.

Mon, Jul 15, 1:04 PM · observability, Operations
Bstorm added a comment to T212231: Remove Diamond from production.

@MoritzMuehlenhoff How were the Cloud NFS servers handled? They won't remove the diamond software unless told to I imagine. I don't see that here, though?

Mon, Jul 15, 1:04 PM · observability, Operations

Fri, Jul 5

Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Yeah, it doesn't seem possible to join a master via the config file (on your earlier comment). In at least one bug (https://github.com/kubernetes/kubeadm/issues/1485), the developers stated that this is "by design" that --control-plane is only available for the CLI. The only use in having the join config on a control plane node seems to be if we want to spin up later without the ca verification option.

Fri, Jul 5, 4:10 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes

Thu, Jul 4

Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

My thinking is that maybe it all works except when we do the upload-certs later to try to rebuild? This would suggest we still need a manual cert copy for a rebuild, which isn't the end of the world. Just more docs and/or scripts (or even possibly adding the certs to labs/private later on for that, which would work fine as well--and the way I did it in puppet, isn't used without a hiera trigger). I don't know for sure without some logs or testing, though, obviously. Just a guess.

Thu, Jul 4, 9:30 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Is that happening in the one deployed with puppet distributed certs or using CLI only? If using CLI only, it would be a bug in kubeadm, I think. I wasn't seeing that with the puppet distributed (which is basically manual distribution)--at least as long as I watched it for, it may have snuck up later.

Thu, Jul 4, 9:12 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Lemme know what you think! I kind of struggled with the whole idea myself, and that's where I ended up. The logic could possibly be turned in the opposite direction and one could say we need a script for rebuilding a node (that copies the certs for us) instead, figuring *that* won't happen often either?

Thu, Jul 4, 2:32 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

My reason for avoiding the upload-certs option is that they are self-destructing. The secret that contains them lasts 24 hours by default, and then you cannot rebuild a control plane node. If we keep our certs in a secret (which may be kind of clunky at best, but it is our supported mechanism for maintaining secrets) then we can rebuild a control plane node whenever needed. We won't be wiping the nodes regularly. I think a rebuild of one node seems more likely to me than a full cluster wipe after deployment; even a healthy thing to do now and again. To do this via the native mechanism, the certs need to be manually copied.
Random note: --experimental-control-plane is deprecated in this version. It's just --control-plane now--but they do the same thing.

Thu, Jul 4, 2:28 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes

Wed, Jul 3

Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

As another note: we should make sure that the final form of this has a floating IP for the LB to provide a certain sort of redundancy where we can always move it to a new server. All stuff to get on a wiki page. Stacked control plane works great though, now! :)

Wed, Jul 3, 9:23 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.
root@toolsbeta-test-k8s-master-3:~# kubectl get nodes
NAME                          STATUS   ROLES    AGE    VERSION
toolsbeta-test-k8s-master-1   Ready    master   124m   v1.15.0
toolsbeta-test-k8s-master-2   Ready    master   28m    v1.15.0
toolsbeta-test-k8s-master-3   Ready    master   35s    v1.15.0
Wed, Jul 3, 9:16 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Another good reason to do it on the instance: on the initial puppet run against labspuppetmaster, it'd fail because the secrets would be on the standalone :)

Wed, Jul 3, 9:01 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

I am adding profile::toolforge::k8s::existing_certs: true on the instance puppet config in horizon so that the certs are distributed only to the new masters. It probably wouldn't break anything to put it in the prefix, but honestly, I think it is reasonable to just document for rebuilding controlplane nodes.

Wed, Jul 3, 8:59 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

With the second node up, I am able to confirm that the LB was able to reload and all that right from the hiera change :)

Wed, Jul 3, 8:56 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

So the join config for control plane nodes is basically useless.
https://github.com/kubernetes/kubeadm/issues/1485 makes it clear that we have to use the command line. Luckily, not much is required for that if you figure the certs are distributed by puppet. The CLI is:

Wed, Jul 3, 8:46 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm moved T227224: update Windows docs from Inbox to Doing on the cloud-services-team (Kanban) board.
Wed, Jul 3, 8:27 PM · cloud-services-team (Kanban)
Bstorm triaged T227224: update Windows docs as Normal priority.
Wed, Jul 3, 8:27 PM · cloud-services-team (Kanban)
Bstorm created T227224: update Windows docs.
Wed, Jul 3, 8:26 PM · cloud-services-team (Kanban)
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

The first master is now using the puppet config and is running fine.

Wed, Jul 3, 7:07 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Saved the current working config aside (overwriting the .working file) and am now enabling puppet on the master.

Wed, Jul 3, 7:01 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm created T227215: Document WMCS puppet practices specific to what we do.
Wed, Jul 3, 5:49 PM · cloud-services-team (Kanban)
Bstorm created T227214: Add a section about our workboard to the onboarding -- document on wiki.
Wed, Jul 3, 5:47 PM · cloud-services-team (Kanban)
Bstorm created T227213: Replace Whatsapp with Telegram.
Wed, Jul 3, 5:46 PM · cloud-services-team (Kanban)
Bstorm created T227212: Schedule regular failover and restore testing -- design this at least roughly and add to quarterly goals.
Wed, Jul 3, 5:44 PM · cloud-services-team (Kanban)
Bstorm added a subtask for T227211: Action items and work for retro 20190703: T220051: Puppet cleanup around OpenStack.
Wed, Jul 3, 5:42 PM · cloud-services-team (Kanban)
Bstorm added a parent task for T220051: Puppet cleanup around OpenStack: T227211: Action items and work for retro 20190703.
Wed, Jul 3, 5:42 PM · cloud-services-team (Kanban)
Bstorm triaged T227211: Action items and work for retro 20190703 as Normal priority.
Wed, Jul 3, 5:41 PM · cloud-services-team (Kanban)
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

And that results in this config after merging in kubeadm:

Wed, Jul 3, 3:00 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

The logs of the scheduler showed it was affected by https://github.com/kubernetes/kubeadm/issues/1285. Trying to set those as well.

Wed, Jul 3, 2:24 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Yeah, I'm not quite sure coredns and the scheduler are happy with everything yet

Wed, Jul 3, 2:10 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Watching the pods with my latest version....it seems stable. Things have been running for 3m with pods still live. I won't declare victory yet...

Wed, Jul 3, 2:09 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Ah, I noticed a cidr mask was added in the local config for the controller manager, which did not match other values either. Removed that. Overall, I think the serviceSubnet was a big problem.

Wed, Jul 3, 2:06 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

noticed this:

E0703 13:49:37.284017       1 driver-call.go:267] Failed to unmarshal output for command: init, output: "2019/07/03 13:49:37 Unix syslog delivery error\n", error: invalid character '/' after top-level value
W0703 13:49:37.284186       1 driver-call.go:150] FlexVolume: driver call failed: executable: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds, args: [init], error: exit status 1, output: "2019/07/03 13:49:37 Unix syslog delivery error\n"
E0703 13:49:37.284283       1 plugins.go:746] Error dynamically probing plugins: Error creating Flexvolume plugin from directory nodeagent~uds, skipping. Error: invalid character '/' after top-level value
Wed, Jul 3, 1:51 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

It's recommended that the service IPs not even come close to overlapping and even stay at the default 10.96.whatever. Helpful hints here as well: https://docs.projectcalico.org/v3.8/getting-started/kubernetes/installation/config-options
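A quick way to sanity-check that kind of separation before running kubeadm is to compare the ranges with Python's standard `ipaddress` module. This is only a sketch: 192.168.0.0/24 is the cluster CIDR quoted in this task and 10.96.0.0/12 is kubeadm's default service subnet, but the `hostNetwork` range is invented for the example and the function name is hypothetical.

```python
import ipaddress


def overlapping_pairs(subnets):
    """Return sorted (name, name) pairs whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    names = sorted(nets)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]


example = {
    "podSubnet": "192.168.0.0/24",     # cluster CIDR seen in this task
    "serviceSubnet": "10.96.0.0/12",   # kubeadm default
    "hostNetwork": "172.16.0.0/21",    # made-up value for illustration
}

print(overlapping_pairs(example))  # prints [] because none of these ranges overlap
```

Any non-empty result names the two settings that collide, for example a service subnet that was accidentally set inside the pod range.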

Wed, Jul 3, 1:43 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

It seems to end up set to the same value as our podSubnet: - --cluster-cidr=192.168.0.0/24

Wed, Jul 3, 1:39 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

What is dogging us (besides the podsecuritypolicy fun) is the pod IP settings. They all have to be within the --cluster-cidr. I found where that damned setting is. It's on the kube-controller-manager:

Wed, Jul 3, 1:38 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes

Tue, Jul 2

Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Current status: the node never becomes ready because CNI isn't up. CNI isn't up because the daemonset for calico doesn't actually start running for some reason. daemonset.apps/calico-node is the daemonset.

Tue, Jul 2, 10:47 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

That seems to have it up consistently. I re-applied the calico config.

Tue, Jul 2, 9:59 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Nah, that's consistent. running kubeadm reset and then kubeadm init --config again.

Tue, Jul 2, 9:56 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

I think I see the problem. Kubeadm designated the node by its short name: kubelet.go:2248] node "toolsbeta-test-k8s-master-1" not found.
Perhaps that's not in the cert for the kubelet? Checking.

Tue, Jul 2, 9:39 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215531: Deploy upgraded Kubernetes to toolsbeta.

Something is flapping, I think.

Tue, Jul 2, 9:27 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes