Fri, Sep 13
Shifting gears on this a bit. Since we don't need/want tendril, we should make sure that we are collecting appropriate metrics for grafana/prometheus on the cloud side of the house. This will take a bit of review.
Thu, Sep 12
On investigation, it doesn't look like it actually removed them all, but I do see some odd behavior: it adds every service to the redis backend on every loop, which is wrong.
The cert related to the outage was on the server itself, in this place: https://phabricator.wikimedia.org/T148929#2817428
The version of Kubernetes in Toolforge was related to some lousy error messages during an outage, and this is now one of the actionables from that incident, so I'm adding the Incident tag.
Tue, Sep 10
Mon, Sep 9
The docker-registry.tools.wmflabs.org/toollabs-python35-sssd-web:testing image worked today in testing on tools-worker-1029.tools.eqiad.wmflabs (which is cordoned and runs sssd).
As of k8s 1.8, I think there's a prometheus metric for cert expiry https://github.com/kubernetes/kubernetes/pull/51031
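If I'm reading that PR right, the apiserver exposes a histogram for client cert expiry; a quick sanity check could be something like the below. The exact metric name is my reading of the PR, not something I've verified on our cluster yet:
```
# Hedged sketch: dump the apiserver's metrics endpoint and grep for the
# cert-expiry histogram that PR 51031 appears to add. The metric name is
# an assumption until we check a real cluster.
kubectl get --raw /metrics | grep apiserver_client_certificate_expiration_seconds
```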
Sat, Sep 7
Fri, Sep 6
@sbassett Can I get a +1 from security on this column? Looking through backlogged tickets, I noticed this one.
That's why it suddenly stopped and then started working again. I was wondering about that (and commented on the merged task).
I just saw wikiwho get created:
Thu, Sep 5
So it appears the safest way to really test sssd is still to do something like https://gerrit.wikimedia.org/r/c/operations/docker-images/toollabs-images/+/527258
Then build the image and tag it with testing (a sketch of the build-and-push flow is below). I can test things locally, but I suspect it is easier to do this and actually test it on tools-worker-1029 (which is still in place as a jessie sssd test node). To do it locally, I'd have to make minikube work with sssd, which sounds like a lot of fussing.
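For the record, the build-and-push flow would be roughly this; the build context path inside the repo is a placeholder, since I'm not checking the toollabs-images layout here:
```
# Rough sketch of building/tagging the test image. The build context
# directory is an assumption; adjust to wherever the Dockerfile lives in
# operations/docker-images/toollabs-images.
git clone "https://gerrit.wikimedia.org/r/operations/docker-images/toollabs-images"
cd toollabs-images
docker build -t docker-registry.tools.wmflabs.org/toollabs-python35-sssd-web:testing .
docker push docker-registry.tools.wmflabs.org/toollabs-python35-sssd-web:testing
```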
So one thing learned from testing this with NFS: volume claims are namespaced. The only way to share an NFS mount across namespaces at this time is as a hostPath.
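To make that concrete, here's a minimal sketch of the hostPath variant. The pod/namespace names and the mount path are hypothetical placeholders; the only real point is that hostPath sidesteps the namespacing of volume claims by riding on the node's own NFS mount:
```
# Minimal pod sketch sharing an NFS mount across namespaces via hostPath.
# All names and paths below are illustrative placeholders.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nfs-hostpath-demo
  namespace: tool-example
spec:
  containers:
  - name: web
    image: docker-registry.tools.wmflabs.org/toollabs-python35-sssd-web:testing
    volumeMounts:
    - name: project-nfs
      mountPath: /data/project
  volumes:
  - name: project-nfs
    hostPath:
      path: /data/project   # NFS is already mounted on the node itself
      type: Directory
EOF
```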
Fri, Aug 30
I have managed a live test of this in a Kubernetes cluster with LDAP. New permissions were needed for the clusterrole. Additionally, the serviceaccount running this needs every permission that it grants to other users, because Kubernetes RBAC won't let you grant permissions you don't hold yourself (escalation prevention). Since it currently grants the clusterrole "edit" in a namespace, I had to give the sa that permission as a clusterrolebinding (because it must be able to do all those things in the target namespace). I kind of hate that, but it is necessary.
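For reference, the binding amounts to something like this; the serviceaccount's namespace and name here are assumptions, not the exact values in the deployment:
```
# Grant the maintain-kubeusers serviceaccount the "edit" clusterrole so
# RBAC's escalation prevention lets it hand "edit" to users. The sa
# namespace:name below is a placeholder.
kubectl create clusterrolebinding maintain-kubeusers-edit \
  --clusterrole=edit \
  --serviceaccount=maintain-kubeusers:maintain-kubeusers
```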
Thu, Aug 29
That sounds like a great idea!
Wed, Aug 28
No, I think this is only resolved if "new kubernetes worker nodes" can export metrics. They'll fail if we spin up another one. I'm perfectly fine with just documenting that the package needs an upgrade (since there are packages that need downgrades as well), but a puppet pin of the package would resolve it as well. The reason I'm OK with just updating the docs is that this is re: Jessie nodes, and we are going to deprecate Jessie. Otherwise, we'd surely insist on fixing this in puppet so the build is reproducible.
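If we did want the pin, a puppet-managed apt pin would just render something like this on the node. The package name and version are illustrative placeholders, not the actual Jessie package:
```
# Illustrative apt pin that a puppet apt::pin resource would manage.
# Package name and version are placeholders, not the real values.
cat > /etc/apt/preferences.d/kubernetes-node.pref <<'EOF'
Package: kubernetes-node
Pin: version 1.4.*
Pin-Priority: 1001
EOF
apt-get update
```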
Tue, Aug 27
We discussed the matter and felt as a team that these are not the right way to monitor the customer experience tools we have for Toolforge. We decided to remove the icinga monitors and create a subtask to implement a more sensible monitor for this.
Mon, Aug 26
Scripts finished. Validated that the views are reachable in Toolforge.
Created the database and the grant on the replicas, running scripts now to get it all set.
Wed, Aug 21
Looks like there are a number of fixes in this update of the controller firmware, but I don't see anything very specific to our issue (lots of INTERNAL_DEVICE_RESET, etc). Can we try that before putting it back in service? I can reimage it if that is required to update the firmware (I'm sure we'll need to at this point anyway).
copied from T230442#5413070
```
Versions
================
Product Name    : PERC H730P Adapter
Serial No       : 87U048Y
FW Package Build: 25.5.3.0005
```
Reboot sent it into a re-image (stalled at confirmation about writing partitioning scheme to disk). It's not healthy. :) Feel free to muck around in the console.
It wasn't showing the right number of disks when I was running things; it was missing four, I believe? Two had failed and logged tickets, but it would have to have lost two more to go read-only (and I seem to recall this was a 10-disk machine); I'd need to check to be sure.
Yup, I can do that. I'm not sure which either, per T230442#5429068
It dropped the failures from the list, and given how it behaved, I'm not even entirely convinced the disks are bad. It's not accepting ssh connections anymore, so I'll have to do this via mgmt.
Did Dell only send a replacement SSD? This has lost 4 disks in a very short time (all are failed now and most are missing from the list of disks). I highly suspect there is another issue that isn't the disks themselves (maybe controller firmware, etc.?). This is also not the first time this server did this (fail out multiple disks until the filesystem failed), see:
T216218: Cloud VPS outage on cloudvirt1024 and cloudvirt1018 due to storage failure
I mean, it might be fine, and coincidences do happen, but I'm curious.
Aug 15 2019
Looks like the exact same thing as T229156: Degraded RAID on cloudvirt1018. Same disk, same error and even same hot spare rebuilding.
Looks like a bad disk here:
Another proposal is enabling automatic rotation for kubelet certs so we don't have to manually re-issue them if we don't upgrade during the course of a year. Since upgrading via kubeadm does rotate the certs for all nodes, as long as there is at least one upgrade during a year, we'll be ok, but why chance it? https://kubernetes.io/docs/tasks/tls/certificate-rotation/#enabling-client-certificate-rotation
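Per those docs, enabling it is just a kubelet flag/config field. A sketch of the config-file form follows; /var/lib/kubelet/config.yaml is the kubeadm default path and worth double-checking on our nodes:
```
# Sketch: turn on client-cert rotation in the kubelet config and restart.
# The config path is the kubeadm default; verify it on the actual node.
grep -q '^rotateCertificates:' /var/lib/kubelet/config.yaml \
  || echo 'rotateCertificates: true' >> /var/lib/kubelet/config.yaml
systemctl restart kubelet
```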
Closing since I was able to test it with @MaxSem's tool account/venv.
Removing Toolforge access is an effective removal as well, provided the API server is not publicly accessible, of course. However, ensuring that the RBAC can be removed too helps. Rebuilding the CA sounds like a poor option.
OK, for Toolforge users, I currently have maintain-kubeusers generating individual role-bindings to the default "edit" clusterrole for new users in their namespace (T228499: Toolforge: changes to maintain-kubeusers). Since "edit" is a blank check of read/write access while preventing changes to RBAC/PSP, I thought it called for some modification. The biggest things I think we should remove from "edit" are:
I was just thinking about the fact that we are applying RBAC to the user and not to the group (which seems like it would be more efficient). At this time, here is why (and this needs to be documented in the script): I do not support using the group annotation for overall cluster access when using x.509 certificates because this issue is not resolved yet. An issued cert exists until it expires, so RBAC is our primary means of immediately shutting down user access. This means we can tie some things to the group, but other things (write access to resources, at least) must be tied to the user until there is a reasonable mechanism for invalidating client certs.
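So per user, maintain-kubeusers ends up creating the equivalent of the following; the tool name and the tool-<name> namespace scheme are placeholders for illustration:
```
# Per-user binding equivalent to what maintain-kubeusers generates.
# Tool name and namespace naming scheme are placeholders.
TOOL=example
kubectl create rolebinding "${TOOL}-edit" \
  --clusterrole=edit \
  --user="${TOOL}" \
  --namespace="tool-${TOOL}"
```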
Currently this is using the "edit" default clusterrole for new Toolforge users. That is absolutely not what I'd like it to use. So I've dumped out the permissions that role grants and have commented out the pieces I'd like to further restrict for Toolforge. I'll include this in the PSP/RBAC ticket as well with a bit more explanation (T227290: Design and document how to integrate the new Toolforge k8s cluster with PodSecurityPolicy)
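For anyone following along, the dump itself is just this (the output file name is arbitrary):
```
# Dump the default "edit" clusterrole so its rules can be trimmed down
# for Toolforge.
kubectl get clusterrole edit -o yaml > toolforge-tool-role.yaml
```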
This seems like it is fixed now. I don't need a bash wrapper in my test case.
Aug 14 2019
I did a little research to make sure I'm not being unhelpful on this ticket by commenting (and yes, some of my comments were probably useless).
At this point, I'm going to leave this task open basically for just long enough for us to build new control plane nodes in toolsbeta. I don't think it requires us tearing down the test cluster for now.
At this point, we have ended up puppetizing the copying of puppet certs to act as etcd client certs as well as server certs in T215531: Deploy upgraded Kubernetes to toolsbeta with an "unstacked" control plane (separate etcd servers) because we found the process of dealing with node failure with a stacked control plane to be kind of awful.
Adding this to discussion to raise the proposal for admin users, since that is a change from the behavior of the original system, and to open the design proposal for comment/questions/rejection/redo in general.
The admin users mentioned in that diagram are still theoretical, but there is no reason to require root to interact with a k8s API. It should be straightforward to add a service or a manually run script that maps the <project>.admin group to admin user accounts and places them in the appropriate locations (a sketch follows below). That will allow Toolforge admins to interact with k8s as easily as they can with Grid Engine (and nobody else; everyone else needs to use tool accounts). This should simplify playbooks and procedures for dealing with jobs and services that are misbehaving, etc.
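A manually run version of that script could be as simple as the sketch below. The LDAP query details, the attribute parsing, and the choice of clusterrole are all assumptions for illustration; a real version would live in puppet or alongside maintain-kubeusers:
```
# Hypothetical sketch of mapping <project>.admin members to k8s access.
# LDAP filter/parsing and the clusterrole granted are assumptions.
PROJECT=tools
for user in $(ldapsearch -x -LLL "(cn=${PROJECT}.admin)" member \
              | sed -n 's/^member: uid=\([^,]*\),.*/\1/p'); do
  kubectl create clusterrolebinding "admin-${user}" \
    --clusterrole=admin --user="${user}"
done
```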
This describes essentially what we are now doing. Etcd client and server certs are simply the puppet certs (which should keep etcd flexible in case we need to set up routing into calico somewhere), while certs for users are x.509s generated using the certificates API of k8s. Node certs are generated by k8s as well using kubeadm (which interacts with the certs API using tokens). The certs to manage the CA and PKI are copied between k8s control plane nodes at build time. A new cluster will have a new CA, which honestly prevents leakage nicely.
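The user-cert piece, concretely, is roughly the standard certificates-API dance. In this sketch the CN/O values and file names are placeholders, and certificates.k8s.io/v1beta1 is assumed to be the right API version for this cluster generation:
```
# Rough sketch of issuing a user cert via the k8s certificates API.
# Subject values and file names are placeholders.
openssl req -new -newkey rsa:2048 -nodes \
  -keyout tool-example.key -out tool-example.csr \
  -subj "/O=toolforge/CN=tool-example"
cat <<EOF | kubectl apply -f -
apiVersion: certificates.k8s.io/v1beta1
kind: CertificateSigningRequest
metadata:
  name: tool-example
spec:
  request: $(base64 -w0 < tool-example.csr)
  usages: ["digital signature", "key encipherment", "client auth"]
EOF
kubectl certificate approve tool-example
kubectl get csr tool-example -o jsonpath='{.status.certificate}' \
  | base64 -d > tool-example.crt
```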
In the meantime, I did confirm separately that what I documented about using a wrapper that explicitly runs activate does work with jsub. I very much like the idea of having it fixed so that isn't necessary, though :)
@zhuyifei1999 was kind enough to put things back so I could prove myself good and solidly wrong about the character set interfering. It is definitely the resolving of symlinks...and that's why a bash wrapper is a good idea here. Thanks @zhuyifei1999 :)
Ah, OK. But that doesn't make sense: my venv works fine, and it is also a symlink.
The character set changes on the grid seem to affect the resolution of the python search path.
Root cause is the character set, @zhuyifei1999.
This tells me that you should definitely use a wrapper script.
Just to re-emphasize: this system does not have any load on it at this time, so it's a wonderful time for it to blow up. It can be repaired and rebooted as needed.
Per T230442, there appears to be something strange going on, possibly a controller freaking out. It lost 4 disks in a very short time and now has a read-only volume. Feel free to reboot it or whatever, @Cmjohnson. I included some troubleshooting info on the other ticket.
Nothing in the eventlog when I tried to retrieve it.
Some controller info:
Since the filesystem has gone read-only, I was only able to get part of the firmware terminal logs.
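For reference, the retrieval was along these lines; the exact binary name/path is install-dependent (MegaCli vs MegaCli64 vs perccli), and the flags here are from memory:
```
# Roughly how the controller logs were pulled; binary name and flags
# are from memory and vary by install.
megacli -FwTermLog -Dsply -aALL > /tmp/fwtermlog.txt
megacli -AdpEventLog -GetEvents -f /tmp/events.log -aALL
```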
It seems that this is showing a loss of 4 disks. We may want to check the controller in this case.
Aug 13 2019
I will say that I can import that when I run this on an exec node directly, so this isn't a difference between the nodes. It could be a difference in the environment, though, which is what a wrapper might fix.
The virtualenv is clearly well-formed, but the environment of the grid can be a bit weird, so I know I have to use a shell wrapper around python to set a few things, similar to what is mentioned here: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#An_error_with_%22ascii%22_codepage,_%22file_not_found%22,_or_UnicodeEncodeError
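In other words, a wrapper along these lines; the locale and the tool path are illustrative placeholders, so adjust for the actual tool:
```
#!/bin/bash
# Hypothetical wrapper per the wikitech doc: force a UTF-8 locale and call
# the venv's python by its real path so the grid environment and symlink
# resolution don't break the module search path. Paths are placeholders.
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
exec /data/project/mytool/venv/bin/python "$@"
```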
That said, this one seems like it might actually be fixable with a pin. :)
Just for information, there's more than one quirk in building new Jessie K8s nodes. It may be worth it to just document the problem, because pinning doesn't always prevent chicken/egg issues: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Building_new_nodes
We declined this project in the team meeting because personal testing projects are not really supported. Though we sympathize with the difficulty of testing patches to ops/puppet, we generally do puppet testing for our projects in the projects where the other work is being done.
It seems @tstarling might know how to go about this sort of thing?
Fair enough. I'm concerned we may need to change to the community-supported one at some point (which doesn't need to be now, since there are bound to be similarities). Once this is working, we can try stuff and will know more. If the community-supported one supports dynamic changes of endpoints (as that chart suggests), it may be a better fit for many reasons.