I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.
Thu, Jan 6
It's back. No space left on device. I didn't get any time to look myself today. TF must have grown a bit. Maybe the disk needs to be bigger?
Wed, Jan 5
Thanks @dcaro! My inbox is restored to proper function. I guess the rest of the mystery is the email bouncing.
Tue, Jan 4
Nov 4 2021
Oct 20 2021
Just an FYI, the testing was historically a matter of checking that all custom controllers continue working (by exercising them, like running a basic set of webservice commands or something) and ensuring that maintain-kubeusers is functioning. If there are more concerning changes in the upgrade, I'd also use the utility https://github.com/toolforge/toolsctl to create a toolsbeta tool to make sure everything got created in our automation toolchain without anything breaking. No way would we have considered upgrading as often as y'all have been, because that's too much manual work for the time given. We'd been aiming in the past at a 6-month cycle, but we actually got behind because of the other work going on.
Oct 15 2021
The good disk looks like this:
The bad one is not responding, naturally :)
The sort of vhost routing this does is the common use of haproxy, and it is where they've done the most work on the software itself. It has an advanced and extensive ACL interface, so you can share one port among many FQDNs and also apply fine-grained access control if you want it, since people use haproxy to handle edge traffic. It will probably be how LBaaS works in OpenStack, and it is how I've used haproxy elsewhere to keep entire businesses behind a cluster of them (behind a caching layer). I might suggest the ultimate goal probably ought to be allowing access to the APIs at least inside the cloud instead of firewalling quite so much. That would allow for automation and more standard capabilities.
Oct 12 2021
Overall, this is how Jupyterhub normally works: jupyter--45-56-413-2e0-5f-28bot-29 1/1 Running 0 133m. Capitals get converted to hex. So why is that such a problem on our local env...
So after figuring out a bunch of things, it seems that even with using an oauth grant from actual metawiki, we (naturally) get capital letters from oauth and the most recent version of jupyterhub (and possibly k8s) is choking on the capitals. They get converted into the hex representations of the letters. That *could* still somehow be affected by running in Minikube, but it seems unlikely.
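For anyone following along, here is a minimal sketch of the escaping behavior described above. It is an approximation for illustration, not the actual JupyterHub or KubeSpawner code: it assumes lowercase ASCII letters and digits pass through unchanged and every other character becomes a dash followed by the hex value of each UTF-8 byte. The username is simply what the pod name above decodes to under that assumption.

```python
import string

# Assumption: only lowercase ASCII letters and digits are left untouched.
SAFE_CHARS = set(string.ascii_lowercase + string.digits)


def escape_username(name: str, escape_char: str = "-") -> str:
    """Replace each unsafe character with escape_char + two hex digits per UTF-8 byte."""
    parts = []
    for char in name:
        if char in SAFE_CHARS:
            parts.append(char)
        else:
            parts.extend(f"{escape_char}{byte:02x}" for byte in char.encode("utf-8"))
    return "".join(parts)


# Prints the same pod name seen in the kubectl output above.
print("jupyter-" + escape_username("EVA3.0_(bot)"))
```

Capitals, dots, underscores and parentheses all end up as hex pairs, which is why a name with capital letters produces such a long pod name.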
Oh yeah, please don't remove docker-ce from the repos unless you account for the harbor use of it, also. It's running in docker compose and currently using our kubeadm components to do it.
Looks like https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/#kubelet-config-k8s-io-v1beta1-KubeletConfiguration has options (not the snazziest when it comes to puppetizing, but you can).
For this, we currently rely on docker settings to manage log length in containerd, much like prod does. We will want to find an equivalent later because some tools are otherwise very good at filling worker nodes (an old problem around here, see T148487). logrotate can handle it, but docker was quite good at it with fewer failures waiting for a logrotate run (yes, people crashed k8s nodes between logrotate runs regularly, typically using java).
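A rough sketch of what the kubelet-side equivalent could look like, rendered from Python so it can be templated. The field names containerLogMaxSize and containerLogMaxFiles come from the KubeletConfiguration v1beta1 reference; the values and the helper function name are placeholders I made up, not anything we have deployed.

```python
import json


def kubelet_log_rotation_config(max_size: str = "10Mi", max_files: int = 5) -> dict:
    """Build a KubeletConfiguration fragment that caps per-container log size.

    The default values here are illustrative assumptions, not tested settings.
    """
    return {
        "apiVersion": "kubelet.config.k8s.io/v1beta1",
        "kind": "KubeletConfiguration",
        # Maximum size of a container log file before it is rotated.
        "containerLogMaxSize": max_size,
        # Maximum number of log files kept per container.
        "containerLogMaxFiles": max_files,
    }


# The kubelet config file may be JSON or YAML, so JSON output is usable as-is.
print(json.dumps(kubelet_log_rotation_config(), indent=2))
```

These settings only take effect when the kubelet manages logs through a CRI runtime, so this would be for after the move off docker.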
Oct 9 2021
Oct 8 2021
Current status: host is up, mariadb is not yet
Oct 7 2021
I think networking.k8s.io/v1beta1 might be ok...just definitely not extensions/v1beta1.
Warning: ingress-nginx 1.0 will refuse to work with extensions/v1beta1 ingresses regardless of cluster version. @mdipietro and I figured this out experimenting with T291589: Upgrade paws jupyterhub. That should be no issue for most tools per se, as long as it reads existing ingress objects (worth checking if any still exist...they probably do), but jupyterhub 0.9.0 still uses that ingress version.
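If it helps with the "worth checking" part, here is a small sketch for scanning manifests (for example, rendered chart templates already loaded into dicts) for the ingress version that gets refused. The manifest contents and names below are invented for the example, and this only looks at manifest text, since the API server will hand back stored objects in whichever version you ask for.

```python
# API versions the comment above says ingress-nginx 1.0 will not work with.
REJECTED_API_VERSIONS = {"extensions/v1beta1"}


def ingresses_needing_migration(manifests):
    """Return 'namespace/name' for every Ingress manifest using a rejected apiVersion."""
    flagged = []
    for manifest in manifests:
        if manifest.get("kind") != "Ingress":
            continue
        if manifest.get("apiVersion") in REJECTED_API_VERSIONS:
            meta = manifest.get("metadata", {})
            flagged.append(f"{meta.get('namespace', 'default')}/{meta.get('name', '<unnamed>')}")
    return flagged


# Hypothetical manifests, for illustration only.
example_manifests = [
    {"apiVersion": "extensions/v1beta1", "kind": "Ingress",
     "metadata": {"name": "old-tool", "namespace": "tool-old"}},
    {"apiVersion": "networking.k8s.io/v1", "kind": "Ingress",
     "metadata": {"name": "new-tool", "namespace": "tool-new"}},
]
print(ingresses_needing_migration(example_manifests))
```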
@Marostegui I think this host is ready to get moving again. Would you like to check it and try getting replication up again? I'm hanging back in case you'd rather I don't mess with the state for those purposes.
Oct 6 2021
I've got a patch set that I'm going to test in Minikube in a few.
Very similar to what people experienced here https://github.com/evanphx/json-patch/issues/138. The json representation of the structure of the pod object is not fixed in k8s, so if there is no volume at all on the pod, this mutator fails because it is using json-patch as its strategy. Since the default service account mounts its own creds in most pods, this isn't a problem unless you create your own object that disables that feature.
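To make the failure mode concrete, here is a sketch of how a mutator could pick its JSON Patch operation depending on whether the pod already has a volumes list. This is not the webhook's actual code, and the volume definition is a made-up example. It relies only on the RFC 6902 rule that appending with "/-" requires the target array to exist.

```python
def volume_patch_ops(pod: dict, volume: dict) -> list:
    """Return JSON Patch operations that add a volume to a pod spec.

    Appending to "/spec/volumes/-" fails when "/spec/volumes" is absent,
    so in that case the whole list is created in one "add" operation.
    """
    spec = pod.get("spec", {})
    if spec.get("volumes"):
        # The list exists and is non-empty: append to the end of it.
        return [{"op": "add", "path": "/spec/volumes/-", "value": volume}]
    # No list (or an empty one): create it with the new volume as its only entry.
    return [{"op": "add", "path": "/spec/volumes", "value": [volume]}]


# Hypothetical volume, for illustration only.
example_volume = {"name": "dumps", "hostPath": {"path": "/public/dumps"}}

pod_without_volumes = {"spec": {"automountServiceAccountToken": False, "containers": []}}
pod_with_volumes = {"spec": {"volumes": [{"name": "token"}], "containers": []}}

print(volume_patch_ops(pod_without_volumes, example_volume))
print(volume_patch_ops(pod_with_volumes, example_volume))
```

The same existence check would be needed for each container's volumeMounts list.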
In our team meeting today, we figured that a straight refresh of cloudmetrics1001/2 as the systems were provisioned previously might be best for now. Taking over 10G space for 1G hosts doesn't seem sensible.
For now, I am willing to bet, you can just remove that line, and your tool will work via the label again. We also should probably upgrade the web hook so that it can function without a pre-existing volumes list as well.
I see your problem with the new setup:
On both pods for the controller I see one of these on different dates: 2021/09/30 17:22:52 http: TLS handshake error from 192.168.48.128:39842: EOF, but it seems to be functioning for the most part, so I'm not sure what that's about. That's not currently the pod's IP address, and I don't even see that IP in the current environment, so I presume that's just some old stuff.
Oct 5 2021
@nskaggs and @aborrero Just checking on this, do we want to do a straight refresh exactly as it is? The cloud-support vlan was going away last I checked. If we do a refresh exactly as the originals are deployed, it would be in racks C and A rather than on the cloud-dedicated areas. That would probably be the cloud-hosts vlan (which is not where we've got cloudmetrics1001/2 today).
Oct 4 2021
Here, this works (except for hosts that are hard down like toolsbeta-sgewebgrid-generic-0901):
sudo cumin --force --timeout 500 "A:all" "dmesg | grep -q -m 1 'since last fsck'"
With that, "success" means the filesystem had errors (grep -q exits 0 only when it finds a match).
bstorm@cloud-cumin-01:~$ grep 'error count' corruption-search.txt | wc -l
9
Plus 1 for toolsbeta-sgewebgrid-generic-0901 puts us definitely at 10. I don't know whether that is growth or I just captured a couple more in my list. Since json output is possible, maybe I can try to come up with a command that can actually be rerun to show a delta. Otherwise, there's going to be a lot of guesswork in this ticket.
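A sketch of the delta idea, assuming the affected hostnames from each run have already been pulled out into plain lists (how to extract them from cumin's output is left open here). Apart from toolsbeta-sgewebgrid-generic-0901, the hostnames are placeholders.

```python
def host_delta(previous, current):
    """Compare two runs and return (newly_affected, no_longer_reported), both sorted."""
    prev_hosts, cur_hosts = set(previous), set(current)
    return sorted(cur_hosts - prev_hosts), sorted(prev_hosts - cur_hosts)


# Placeholder host lists standing in for two separate runs.
first_run = ["host-a.example", "host-b.example"]
second_run = ["host-b.example", "toolsbeta-sgewebgrid-generic-0901"]

newly_affected, no_longer_reported = host_delta(first_run, second_run)
print("new:", newly_affected)
print("gone:", no_longer_reported)
```

A host dropping off the list does not prove it is healthy. It may just be unreachable, like the hard-down host mentioned above.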
Marking toolforge containers done since there is no hope for the Jessie containers.
Ok! Running that again with fewer distractions on board: sudo cumin --force -x --timeout 500 "A:all" "dmesg | grep -m 1 'since last fsck'" > corruption-search.txt
Ah...because that's actually in the instance's hiera in horizon. Changing that.
We actually tried to exclude trove from the all alias. It didn't work: https://gerrit.wikimedia.org/r/c/operations/puppet/+/715245 should have done it, but the config file clearly doesn't include that change.
Yeah, 13 hosts. commonsarchive-mwtest is just exploded (kernel panics). However, we can take two off the list because it was an ssh issue ((2) gerrit-prod-1001.devtools.eqiad1.wikimedia.cloud,mwv-builder-03.mediawiki-vagrant.eqiad1.wikimedia.cloud). I also removed wcdo because that thing just likes to have python OOM panics it looks like. I didn't see the actual issue in dmesg.
Running sudo cumin --force --timeout 500 "A:all" "dmesg | grep 'since last fsck'" quit quite early on me, so I ran it again a little differently. This needs to exclude trove from the "all" set somehow. It's annoying. I ran: sudo cumin --force -x --timeout 500 "A:all" "dmesg | grep -m 1 'since last fsck'" to make sure it didn't quit and to minimize the output from grep.
Likely related: T292264: Loss of access to parsing-qa-01.eqiad.wmflabs
toolsbeta-sgewebgrid-0901 is similar:
Ok, so from what you just said, that sounds to me like the OSMDB needs to be rebuilt to make sure we don't have gaps after dumping the appropriate databases, since it is on VMs. That also suggests it is a good time to consider building the service inside the maps project instead of in the special "admin only" space of clouddb-services. I don't know the implications of syncing up the design of this sync with the production one, but that might be worth considering as well.
Oct 3 2021
"may be affected" I should have said on buster.
New buster images should be up now if you need to use that.
Tagged this on my rebuilds in case the openssl library needed updating (bullseye fixes the issue either way, but in case you need to roll back to buster). The images are being pushed still, so don't test that rollback just yet if 3.9 doesn't work.
Oct 1 2021
Wait, are you seeing the toolforge.org domain "expired"?
Sep 30 2021
+1, This will be a much better setup for the next time such a problem happens anyway!
@mdipietro We should put up a patch to remove this at the same time as T291806. I wouldn't worry about announcing this if the table is already dropped. The data would not be changing anyway at best, and it's already gone and throwing errors at worst.
Listeria is using the image docker-registry.tools.wmflabs.org/toolforge-php73-sssd-web. That's a buster-based image, so the usual upgrade advice doesn't necessarily seem to apply there unless the buster images had an old SSL stack at some point. If so, restarting should fix it.
Adding more documentation about the deployment of what has been done to https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Toolforge_Buildpack_Implementation#Deployment
The overall issue is that existing certificates/v1 signers don't include a pod serving signer. You cannot make it use the kubelet serving signer (which is the closest you can come). This should not be an issue for maintain-kubeusers since there's a signer for that use case. The certs that signer makes cannot be serving certs, though. They can only be used for client auth.
All better now! Thanks @Majavah
Using trove for Postgres in the most recent iteration is terrible. You cannot control it much, and it doesn't actually allow you access to the Postgres account to create a database. This means you can have exactly one database and user. I also doubt the replication still works. Maybe it will be improved as they settle into their more containerized setup.
A login may succeed with sssd when ldap is down due to caching behavior. A simple connection doesn't always actually suggest ldap is healthy. A getent that is expressly told to dodge the cache and go straight to ldap is not a bad notion for catching the whole chain quickly (which is sort of what is done on the cloudstore servers with the useldap script), but a script that does an ldap search for a should-be-stable group like tools.admin might be even better and more clear as far as what it is testing...if it connects only to what the VMs connect to. maintain-dbusers is usually also killed by an LDAP outage because it does an LDAP list of users, but it uses a route VMs don't use.
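Roughly what I have in mind for the group-search check, as a sketch only. It assumes an anonymous simple bind is allowed and that the ldapsearch client is installed. The URI and base DN in the usage line are placeholders and would need to be whatever the VMs actually connect to.

```python
import subprocess


def ldap_group_search_command(uri: str, base_dn: str, group: str) -> list:
    """Build an ldapsearch command that looks up one group by cn."""
    # -x: simple bind, -LLL: plain LDIF output, -H: server URI, -b: search base.
    return ["ldapsearch", "-x", "-LLL", "-H", uri, "-b", base_dn, f"(cn={group})", "cn"]


def ldap_is_healthy(uri, base_dn, group="tools.admin", run=subprocess.run, timeout=10):
    """Return True only if the search succeeds and the group comes back.

    The run parameter exists so the check can be exercised without a real server.
    """
    try:
        result = run(
            ldap_group_search_command(uri, base_dn, group),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing client binary or an unresponsive server both count as unhealthy.
        return False
    return result.returncode == 0 and f"cn: {group}" in result.stdout
```

Usage would be something like ldap_is_healthy("ldap://ldap.example.invalid", "dc=example,dc=org"). Because it talks to LDAP directly, a warm sssd cache cannot mask an outage the way a login test can.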
Sweet! Instantly better ingresses.
As is, whatever changed is going to interfere with the upgrade to 1.20 for T280402.
We need it in stretch for bastions...and it clearly used to be there. :)
Sep 29 2021
@aborrero I haven't looked much at what's up with that so far. I can say that deleting the sources file and letting puppet put it back didn't work. This is just a quick report of the problem.
This sort of script is pretty simple if you have admin (kubectl sudo if you are using your local account, etc). I did this to fix a problem in the presets.
#!/bin/bash
# Run this script with your root/cluster admin account as appropriate.
# This will fix the dumps mounts for all existing tools.
Sep 28 2021
That's a big nope from the server on restarting via console. It has a processor reporting bad voltage and other fun. System Event Log is attached.
This does not seem related to T289159 as it is a different rack, but you never know.