I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.
Fri, Sep 20
Oddly, since most serviceaccounts are namespaced, this may be easier to do with a simple "user" object with an x509 and a custom role that gives just the perms you need. We'll think more about it...
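A rough sketch of what that "user plus custom role" shape could look like, purely illustrative -- the role name, verbs, and resource list here are invented, and the User name would be whatever CN ends up in the x.509 client cert:

```yaml
# Hypothetical: a cluster-scoped role bound to an x.509 "user"
# (identified by the cert CN) instead of a namespaced serviceaccount.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: maintain-kubeusers-limited   # invented name
rules:
- apiGroups: [""]
  resources: ["namespaces", "serviceaccounts"]
  verbs: ["get", "list", "create"]
- apiGroups: ["rbac.authorization.k8s.io"]
  resources: ["rolebindings"]
  verbs: ["get", "list", "create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: maintain-kubeusers-limited-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: maintain-kubeusers-limited
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: maintain-kubeusers   # CN of the x.509 client cert (assumed)
```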
Thu, Sep 19
Started design doc...needs pictures: https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Toolforge_Kubernetes_RBAC_and_PSP
Ok, I've run maintain-meta_p on all 4 replicas. Docs don't need an update because what I changed is hardcoded anyway.
The problem is apparently a lot of settings being moved from InitializeSettings.php to VariantSettings.php. I'm going to make sure the function will correctly parse the new file and, if so, document a command line that will use that instead.
Looks like https://gerrit.wikimedia.org/r/c/operations/puppet/+/538030 fixes it, but I don't know if we need that timestamp index.
I'm not seeing anything changing in the meta_p script. Wonder if it talks to a server that's moved/down/etc.
To provide context, we did a lot to prevent breakage of the tables during that refactor, but I didn't even think to check the indexes. The meta_p issue looks like a change that hasn't been tested since it was merged? That'll take more digging.
That meta_p and maintain-indexes breakage will make the wiki non-functional for some user purposes. The meta_p piece is linked into tooling, and the maintain-indexes breakage will make some queries horribly slow (those indexes exist for the joins).
Wed, Sep 18
I've adapted my current test environment to use the PSPs in that patch and the proposed role above. So far, it behaves exactly as intended. A user with these credentials is nicely blind to the goings-on in any other namespace, but enjoys relative freedom to act within their own.
If this project ends up integrating with WMCS-managed stuff at all (Beta cluster -- does that mean deployment-prep?), I'd at least be interested in being a fly on the wall. I'm generally curious what people come up with, for our use or understanding, but if we are doing any peering or VPN with things in Cloud, I'd definitely like to know so I can see how it impacts things.
I think we may need to add a PodPreset injection for the automounter. However, I'm more concerned about restricting mounts than forcing them. I'll test that.
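For the "restricting mounts" side, a PodSecurityPolicy can allowlist volume types and hostPath prefixes. This is only a sketch of the shape -- the paths and policy name are guesses, not the tested config:

```yaml
# Illustrative PSP fragment restricting which volume types and host
# paths tool pods may mount; exact allowlist still to be decided.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: toolforge-volumes-example   # invented name
spec:
  privileged: false
  volumes:
  - configMap
  - secret
  - emptyDir
  - hostPath          # needed for the NFS automount paths
  allowedHostPaths:
  - pathPrefix: /data/project       # assumed NFS path
  - pathPrefix: /public/dumps       # assumed NFS path
    readOnly: true
  seLinux:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
```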
To correctly mimic the behavior of the UID enforcer controller, T215678: Replace each of the custom controllers with something in a new Toolforge Kubernetes setup, maintain-kubeusers must apply a UID restriction to each user and namespaced default service account. Going to test that notion.
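One way maintain-kubeusers could pin each tool to its own UID is a per-tool PSP with a single-value MustRunAs range. This is a sketch of the idea, not the implementation; 52503 is an invented example LDAP uid:

```yaml
# Hypothetical per-tool PSP that maintain-kubeusers might generate,
# pinning the tool's pods to its own LDAP uid/gid (52503 is made up).
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: tool-example-uid   # invented name
spec:
  privileged: false
  runAsUser:
    rule: MustRunAs
    ranges:
    - min: 52503
      max: 52503
  supplementalGroups:
    rule: MustRunAs
    ranges:
    - min: 52503
      max: 52503
  fsGroup:
    rule: MustRunAs
    ranges:
    - min: 52503
      max: 52503
  seLinux:
    rule: RunAsAny
  volumes:
  - '*'
```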
The workaround seems to be working so far. I haven't seen evidence of a single hang, the logs show it creating new accounts and running successfully. The only place it might have an issue is if it ever needs to create enough accounts to make it take longer than a minute.
This might need a procurement, so I'm going to rebuild the task.
Tue, Sep 17
Mon, Sep 16
Fri, Sep 13
Shifting gears on this a bit. Since we don't need/want tendril, we should make sure that we are collecting appropriate metrics for grafana/prometheus on the cloud side of the house. A bit of a review.
Thu, Sep 12
So on investigation, it doesn't look like it actually removed them all, but at the same time, I'm seeing some odd behavior. It adds every service to the redis backend on every loop, which is wrong.
The related cert for the outage was on the server itself, here: https://phabricator.wikimedia.org/T148929#2817428
Because the version of Kubernetes in Toolforge was related to some lousy error messages during an outage, and this is now one of the actionables from that incident, adding the Incident tag.
Tue, Sep 10
Mon, Sep 9
The docker-registry.tools.wmflabs.org/toollabs-python35-sssd-web:testing image worked today in testing on tools-worker-1029.tools.eqiad.wmflabs (which is cordoned and runs sssd).
As of k8s 1.8, I think there's a prometheus metric for cert expiry https://github.com/kubernetes/kubernetes/pull/51031
Sat, Sep 7
Fri, Sep 6
@sbassett Can I get a +1 from security on this column? Looking through backlogged tickets, I noticed this one.
That's why it suddenly stopped working and then started again. I was wondering about that (and commented on the merged task).
I just saw wikiwho get created:
Thu, Sep 5
So it appears the safest way to really test sssd is still to do something like https://gerrit.wikimedia.org/r/c/operations/docker-images/toollabs-images/+/527258
Then build the image and tag it with testing. I can test things locally, but I suspect it is easier to do this and actually test it on tools-worker-1029 (which is still in place as a jessie sssd test node). To do it locally, I'd have to make minikube work with sssd, which sounds like a lot of fussing.
So one thing learned from testing this with NFS: volume claims are namespaced. The only way to share an NFS mount across namespaces at this time is as a hostPath.
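In practice that means the pod spec mounts the node's NFS path directly rather than going through a PVC. A minimal sketch, assuming NFS is mounted on every node at /data/project (the namespace and image tag here are just examples):

```yaml
# Hedged example: sharing the NFS-backed tool storage via hostPath,
# since PersistentVolumeClaims are namespaced and can't be shared.
apiVersion: v1
kind: Pod
metadata:
  name: example-tool-pod          # invented name
  namespace: tool-example         # hypothetical tool namespace
spec:
  containers:
  - name: webservice
    image: docker-registry.tools.wmflabs.org/toollabs-python35-sssd-web:testing
    volumeMounts:
    - name: home
      mountPath: /data/project
  volumes:
  - name: home
    hostPath:
      path: /data/project         # assumes NFS is mounted here on the node
      type: Directory
```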
Fri, Aug 30
I have managed a live test of this in a Kubernetes cluster with LDAP. New permissions were needed for the clusterrole. Additionally, the serviceaccount running this needs all the permissions that it grants to other users, because those are the rules in Kubernetes. Since it currently grants the clusterrole "edit" in a namespace, I had to give the sa that permission as a clusterrolebinding (because it must be able to do all those things in the target namespace). I kind of hate that, but it is necessary.
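The clusterrolebinding in question would look roughly like this -- the sa name and namespace are placeholders, but "edit" is the built-in clusterrole described above:

```yaml
# Roughly what the test needed: the sa running maintain-kubeusers must
# itself hold "edit" cluster-wide because it grants "edit" to others.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: maintain-kubeusers-edit    # invented name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit                       # built-in aggregated clusterrole
subjects:
- kind: ServiceAccount
  name: maintain-kubeusers         # hypothetical sa name
  namespace: maintain-kubeusers    # hypothetical namespace
```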
Thu, Aug 29
That sounds like a great idea!
Wed, Aug 28
No, I think this is only resolved if "new kubernetes worker nodes" can export metrics. They'll fail if we spin up another one. I'm perfectly fine with just documenting that the package needs an upgrade (since there are packages that need downgrades as well), but a puppet pin of the package would resolve it as well. The reason I'm ok with just updating the docs is that this is re: Jessie nodes, and we are going to deprecate Jessie. Otherwise, we'd surely insist on fixing this in puppet so the build is reproducible.
Tue, Aug 27
We discussed the matter and felt as a team that these are not the right way to be monitoring the customer experience tools we have for Toolforge. We decided to remove the icinga monitors and create a subtask to implement a more sensible monitor for this.
Mon, Aug 26
Scripts finished. Validated that the views are reachable in Toolforge.
Created the database and the grant on the replicas, running scripts now to get it all set.
Aug 21 2019
Looks like there are a number of fixes in this update of the controller firmware, but I don't see any very specific to our issue (lots of INTERNAL_DEVICE_RESET, etc). Can we try that before putting it back in service? I can reimage it if that's required to update the firmware (I'm sure we'll need to at this point anyway).
copied from T230442#5413070
Versions
================
Product Name    : PERC H730P Adapter
Serial No       : 87U048Y
FW Package Build: 25.5.3.0005
Reboot sent it into a re-image (stalled at confirmation about writing partitioning scheme to disk). It's not healthy. :) Feel free to muck around in the console.
It wasn't showing the right number of disks when I was running things. It was missing four, I believe? Two have failed and tickets were logged, but it would have had to lose two more to go read-only (and I seem to recall this was a 10-disk machine)--would need to check to be sure.
Yup, I can do that. I'm not sure which either, per T230442#5429068
It dropped the failures from the list, and I'm not even entirely convinced the disks are bad with how it behaved. It's not accepting ssh connections anymore, so I'll have to do this via mgmt.
Did Dell only send a replacement SSD? This has lost 4 disks in a very short time (all are failed now and most are missing from the list of disks). I highly suspect there is another issue that isn't the disks themselves (controller firmware, etc. maybe?). This is also not the first time this server did this (failed out multiple disks until the filesystem failed), see:
T216218: Cloud VPS outage on cloudvirt1024 and cloudvirt1018 due to storage failure
I mean, it might be fine, and coincidences do happen, but I'm curious.
Aug 15 2019
Looks like the exact same thing as T229156: Degraded RAID on cloudvirt1018. Same disk, same error and even same hot spare rebuilding.
Looks like a bad disk here:
Another proposal is enabling automatic rotation for kubelet certs so we don't have to manually re-issue them if we don't upgrade during the course of a year. Since upgrading via kubeadm does rotate the certs for all nodes, as long as there is at least one upgrade during a year, we'll be ok, but why chance it? https://kubernetes.io/docs/tasks/tls/certificate-rotation/#enabling-client-certificate-rotation
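Per the linked docs, this is a kubelet configuration toggle (equivalent to the --rotate-certificates flag). A minimal config fragment -- the serving-cert line is an optional extra, not part of the proposal above:

```yaml
# Kubelet config fragment enabling client certificate rotation.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
rotateCertificates: true
# serverTLSBootstrap: true   # optionally also bootstrap/rotate the serving cert
```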
Closing since I was able to test it with @MaxSem's tool account/venv.
Removing Toolforge access is an effective removal as well, of course, as long as the API server is not publicly accessible. However, ensuring that the RBAC can be removed too helps. Rebuilding the CA sounds like a poor option.
Ok, for Toolforge users, I currently have maintain-kubeusers generating individual role-bindings to the default "edit" clusterrole for new users in their namespace (T228499: Toolforge: changes to maintain-kubeusers). Since "edit" is a blank check of read/write access while preventing changes to RBAC/PSP, I thought it called for some modification. The biggest things I think we should remove from "edit" are:
I was just thinking about the fact that we are applying RBAC to the user and not to the group (which seems more efficient). At this time, here is why (and this needs to be documented in the script): I do not support using the group annotation for overall cluster access when using x.509 certificates because this issue is not resolved yet. An issued cert exists until it expires, so RBAC is our primary means of immediately shutting down user access. This means we can tie some things to the group, but other things (write access to resources, at least) must be tied to the user until there is a reasonable mechanism for invalidating client certs.
Currently this is using the "edit" default clusterrole for new toolforge users. That is absolutely not what I'd like it to use for now. So I've dumped out the permissions that grants and have commented out the pieces I'd like to further restrict for toolforge. I'll include this in the PSP/RBAC ticket as well with a bit more explanation (T227290: Design and document how to integrate the new Toolforge k8s cluster with PodSecurityPolicy)
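The approach described above could be expressed roughly like this -- the rules here are abbreviated and partly invented, not the actual dump of "edit"; the commented-out stanza just illustrates the "comment out what we want to restrict" workflow:

```yaml
# Hypothetical: a Toolforge-specific clusterrole built by copying the
# rules from the default "edit" clusterrole and trimming them down.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: toolforge-tool   # invented name; would replace "edit" in bindings
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log", "services", "configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
# Candidate for restriction (illustrative only):
# - apiGroups: [""]
#   resources: ["secrets"]
#   verbs: ["get", "list", "create", "update", "patch", "delete"]
```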
This seems like it is fixed now. I don't need a bash wrapper in my test case.
Aug 14 2019
I did a little research to make sure I'm not being unhelpful on this ticket by commenting (and yes, some of my comments were probably useless).
At this point, I'm going to leave this task open basically for just long enough for us to build new control plane nodes in toolsbeta. I don't think it requires us tearing down the test cluster for now.
At this point, we have ended up puppetizing the copying of puppet certs to act as etcd client certs as well as server certs in T215531: Deploy upgraded Kubernetes to toolsbeta with an "unstacked" control plane (separate etcd servers) because we found the process of dealing with node failure with a stacked control plane to be kind of awful.
Adding this to the discussion in order to raise the proposal for admin users, since that is a change from the behavior of the original system, and to open the design proposal for comment/questions/rejection/redo in general.
The admin users mentioned in that diagram are still theoretical, but there is no reason to require root to interact with a k8s API. It should be straightforward to add a service or a manually run script that maps the <project>.admin group to admin user accounts and places them in the appropriate locations. That will allow Toolforge admins to interact with k8s as easily as they can with Grid Engine (and nobody else--they need to use tool accounts). This should simplify playbooks and procedures for dealing with jobs and services that are misbehaving, etc.
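Purely speculative, since the admin users are still theoretical: the mapping script could emit a binding to the built-in "admin" clusterrole with User subjects generated from the members of <project>.admin (all names below are invented):

```yaml
# Speculative sketch of what the mapping script might generate; the
# subjects would be regenerated from the <project>.admin LDAP group.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: toolforge-admins   # invented name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: admin              # built-in clusterrole
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: exampleadmin       # invented; one entry per group member
```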
This describes essentially what we are now doing. Etcd client and server certs are simply the puppet certs (which should keep etcd flexible in case we need to set up routing into calico somewhere), while certs for users are x.509s generated using the certificates API of k8s. Node certs are generated by k8s as well using kubeadm (which interacts with the certs API using tokens). The certs to manage the CA and PKI are copied between k8s control plane nodes at build time. A new cluster will have a new CA, which honestly prevents leakage nicely.
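The certificates-API flow for user certs mentioned above looks roughly like this (the object name is a placeholder and the request field is left elided rather than filled with fake data):

```yaml
# Sketch of a CertificateSigningRequest for one tool user; after
# approval and signing, the issued cert appears in status.certificate.
apiVersion: certificates.k8s.io/v1beta1
kind: CertificateSigningRequest
metadata:
  name: tool-example        # invented name
spec:
  request: <base64-encoded PKCS#10 CSR goes here>
  usages:
  - digital signature
  - key encipherment
  - client auth
```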
In the meantime, I did confirm separately that what I documented about using a wrapper explicitly with an activate does work with jsub. I very much like the idea of having it fixed so that isn't necessary, though :)