
Bstorm (Brooke)
Operations

Projects (10)

User Details

User Since
Jan 22 2018, 10:09 PM (254 w, 1 d)
Availability
Available
IRC Nick
bstorm
LDAP User
Bstorm
MediaWiki User
BStorm (WMF) [ Global Accounts ]

On the wikis, I'm BStorm (WMF), bstorm on IRC and Bstorm on gerrit and WikiTech.

I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.

Recent Activity

Jan 6 2022

Bstorm added a comment to T298501: Email flood due to a some email issue and a full disk on tools prometheus.

It's back. No space left on device. I didn't get any time to look myself today. TF must have grown a bit. Maybe the disk needs to be bigger?

Jan 6 2022, 6:15 AM · User-dcaro, cloud-services-team (Kanban), Toolforge

Jan 5 2022

Bstorm added a comment to T298501: Email flood due to a some email issue and a full disk on tools prometheus.

Thanks @dcaro! My inbox is restored to proper function. I guess the rest of the mystery is the email bouncing.

Jan 5 2022, 5:09 AM · User-dcaro, cloud-services-team (Kanban), Toolforge

Jan 4 2022

Bstorm renamed T298501: Email flood due to a some email issue and a full disk on tools prometheus from Email flood due to a naming issue and a full disk on tools prometheus to Email flood due to a some email issue and a full disk on tools prometheus.
Jan 4 2022, 4:58 AM · User-dcaro, cloud-services-team (Kanban), Toolforge
Bstorm added a project to T298501: Email flood due to a some email issue and a full disk on tools prometheus: cloud-services-team (Kanban).
Jan 4 2022, 4:58 AM · User-dcaro, cloud-services-team (Kanban), Toolforge
Bstorm edited projects for T298501: Email flood due to a some email issue and a full disk on tools prometheus, added: Toolforge; removed Cloud-Services.
Jan 4 2022, 4:57 AM · User-dcaro, cloud-services-team (Kanban), Toolforge
Bstorm created T298501: Email flood due to a some email issue and a full disk on tools prometheus.
Jan 4 2022, 4:56 AM · User-dcaro, cloud-services-team (Kanban), Toolforge

Nov 4 2021

Bstorm added a comment to T294888: `webservice restart` isn't actually restarting the pods.

I think it is unfortunate that webservice restart has different semantics than webservice stop && webservice start, maybe that should be its own task.

Nov 4 2021, 6:03 PM · Documentation, cloud-services-team (Kanban), Toolforge

Oct 20 2021

Bstorm added a comment to T293675: Proposal to move kubernetes upgrades to blue green deploy.

Just an FYI, the testing was historically a matter of checking that all custom controllers continue working (by exercising them, like running a basic set of webservice commands or something) and ensuring that maintain-kubeusers is functioning. If there are more concerning changes in the upgrade, I'd also use the utility https://github.com/toolforge/toolsctl to create a toolsbeta tool to make sure everything got created in our automation toolchain without anything breaking. No way would we consider upgrading as often as y'all have been, because that's too much manual work for the time given. We'd been aiming in the past at a 6 month cycle, but we actually got behind because of the other work going on.

Oct 20 2021, 12:00 AM · Cloud Services Proposals, User-dcaro, cloud-services-team (FY2021/2022-Q3), Toolforge, Kubernetes

Oct 15 2021

Bstorm added a comment to T293428: Degraded RAID on labweb1002.

The good disk looks like this:

Oct 15 2021, 11:19 PM · cloud-services-team (Kanban), SRE, ops-eqiad
Bstorm added a comment to T293428: Degraded RAID on labweb1002.

The bad one is not responding, naturally :)

Oct 15 2021, 11:19 PM · cloud-services-team (Kanban), SRE, ops-eqiad
Bstorm removed a watcher for cloud-services-team (Kanban): Bstorm.
Oct 15 2021, 9:12 PM
Bstorm changed IRC Nick from bstorm_ to bstorm on Bstorm.
Oct 15 2021, 9:11 PM
Bstorm updated Bstorm.
Oct 15 2021, 9:07 PM
Bstorm updated Bstorm.
Oct 15 2021, 9:07 PM
Bstorm added a comment to T267194: CloudVPS: enable TLS in openstack API endpoints.

The sort of vhost routing this does is the common use of haproxy and where they've done the most work on the software itself. It has an advanced and extensive ACL interface, so you can share one port among many FQDNs and still get fine-grained access control if you want it, since people use haproxy to handle edge traffic. It will probably be how LBaaS works in Openstack and is how I've used haproxy elsewhere to keep entire businesses behind a cluster of them (behind a caching layer). I might suggest the ultimate goal probably ought to be allowing access to the APIs at least inside the cloud instead of firewalling quite so much. That would allow for automation and more standard capabilities.

Oct 15 2021, 3:07 PM · cloud-services-team (Kanban), Patch-For-Review, Cloud-VPS

Oct 12 2021

Bstorm added a comment to T291589: Upgrade paws jupyterhub.

Overall, this is how Jupyterhub normally works: jupyter--45-56-413-2e0-5f-28bot-29 1/1 Running 0 133m. Capitals get converted to hex. So why is that such a problem on our local env...

Oct 12 2021, 9:42 PM · cloud-services-team (FY2021/2022-Q3), PAWS
Bstorm added a comment to T291589: Upgrade paws jupyterhub.

So after figuring out a bunch of things, it seems that even with using an oauth grant from actual metawiki, we (naturally) get capital letters from oauth and the most recent version of jupyterhub (and possibly k8s) is choking on the capitals. They get converted into the hex representations of the letters. That *could* still somehow be affected by running in Minikube, but it seems unlikely.
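The escaping described here can be illustrated with a minimal sketch. It assumes that only lowercase ASCII letters and digits pass through unchanged and that every other character becomes `-` followed by the lowercase hex value of each of its UTF-8 bytes; the function names and the `jupyter-` prefix are hypothetical stand-ins modelled on the shape of the pod name seen in this task, not the actual jupyterhub code.

```python
import string

# Assumption: only lowercase ASCII letters and digits are left untouched.
SAFE_CHARS = set(string.ascii_lowercase + string.digits)


def escape_username(name: str, escape_char: str = "-") -> str:
    """Replace each unsafe character with escape_char plus the lowercase
    hex value of each of its UTF-8 bytes."""
    out = []
    for ch in name:
        if ch in SAFE_CHARS:
            out.append(ch)
        else:
            out.extend(f"{escape_char}{byte:x}" for byte in ch.encode("utf-8"))
    return "".join(out)


def pod_name(name: str) -> str:
    """Hypothetical 'jupyter-<escaped name>' naming template."""
    return "jupyter-" + escape_username(name)
```

Under these assumptions a capital letter at the start of a username produces the double dash seen after `jupyter`, because the escape character directly follows the prefix separator.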

Oct 12 2021, 9:41 PM · cloud-services-team (FY2021/2022-Q3), PAWS
Bstorm committed rPAWSe378ec148681: remove all local mediawiki (authored by Bstorm).
remove all local mediawiki
Oct 12 2021, 8:35 PM
Bstorm added a comment to T284656: Toolforge k8s: Migrate from Docker to Containerd.

Oh yeah, please don't remove docker-ce from the repos unless you account for the harbor use of it, also. It's running in docker compose and currently using our kubeadm components to do it.

Oct 12 2021, 3:47 PM · Kubernetes, Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T284656: Toolforge k8s: Migrate from Docker to Containerd.

Looks like https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/#kubelet-config-k8s-io-v1beta1-KubeletConfiguration has options (not the snazziest when it comes to puppetizing, but you can).

Oct 12 2021, 3:40 PM · Kubernetes, Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T284656: Toolforge k8s: Migrate from Docker to Containerd.

For this, we currently rely on docker settings to manage log length in containerd, much like prod does. We will want to find an equivalent later because some tools are otherwise very good at filling worker nodes (an old problem around here T148487). logrotate can handle it, but docker was quite good at it with fewer failures waiting for a logrotate run (yes people crashed k8s nodes between logrotate runs regularly, typically using java).

Oct 12 2021, 3:35 PM · Kubernetes, Toolforge, cloud-services-team (Kanban)

Oct 9 2021

Bstorm committed rPAWS56d77977901b: update NOTES.txt for the new way of adding the ingress name in mediawiki chart (authored by Bstorm).
update NOTES.txt for the new way of adding the ingress name in mediawiki chart
Oct 9 2021, 12:47 AM

Oct 8 2021

Bstorm committed rPAWS28a550e49b1c: fixing the ingress (hopefully) and resetting defaults in git to avoid confusion (authored by Bstorm).
fixing the ingress (hopefully) and resetting defaults in git to avoid confusion
Oct 8 2021, 11:57 PM
Bstorm committed rPAWSe3e5c2576721: install oauth from mediawiki 1.36 (authored by Bstorm).
install oauth from mediawiki 1.36
Oct 8 2021, 11:57 PM
Bstorm updated subscribers of T292850: Re-enable clouddb1020 wikireplica (analytics s5 and s8).
Oct 8 2021, 3:35 PM · cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T292850: Re-enable clouddb1020 wikireplica (analytics s5 and s8).

Current status: host is up, mariadb is not yet

Oct 8 2021, 3:13 PM · cloud-services-team (Kanban), Data-Services
Bstorm moved T292850: Re-enable clouddb1020 wikireplica (analytics s5 and s8) from Backlog to Wiki replicas on the Data-Services board.
Oct 8 2021, 3:11 PM · cloud-services-team (Kanban), Data-Services
Bstorm triaged T292850: Re-enable clouddb1020 wikireplica (analytics s5 and s8) as Medium priority.
Oct 8 2021, 3:11 PM · cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T291963: hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet.

@Bstorm Should this task be reopened or is there another task for follow up?

Oct 8 2021, 3:06 PM · SRE, ops-eqiad, Data-Services, cloud-services-team (Hardware), DC-Ops

Oct 7 2021

Bstorm added a comment to T292771: upgrade to ingress-nginx 1.0.

I think networking.k8s.io/v1beta1 might be ok...just definitely not extensions/v1beta1.

Oct 7 2021, 11:56 PM · cloud-services-team (Kanban), Toolforge
Bstorm updated subscribers of T292771: upgrade to ingress-nginx 1.0.

Warning, ingress nginx 1.0 will refuse to work with extensions/v1beta1 ingresses regardless of cluster version. @mdipietro and I figured this out experimenting with T291589: Upgrade paws jupyterhub. That should be no issue for most tools per se, as long as it reads existing ingress objects (worth checking if any still exist...they probably do), but jupyterhub 0.9.0 still uses that ingress version.
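As a rough way to do the "worth checking if any still exist" step, here is a minimal sketch that scans manifests already parsed into dictionaries (for example a chart's rendered output) and lists any Ingress still declared as extensions/v1beta1. The function name and the input shape are assumptions for illustration; it inspects manifests as written and says nothing about how a live cluster reports stored objects.

```python
DEPRECATED_INGRESS_API = "extensions/v1beta1"


def find_deprecated_ingresses(objects):
    """Return 'namespace/name' for every Ingress manifest that still
    declares the extensions/v1beta1 API version."""
    flagged = []
    for obj in objects:
        if obj.get("kind") != "Ingress":
            continue
        if obj.get("apiVersion") == DEPRECATED_INGRESS_API:
            meta = obj.get("metadata", {})
            namespace = meta.get("namespace", "default")
            flagged.append(f"{namespace}/{meta.get('name', '?')}")
    return flagged
```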

Oct 7 2021, 11:55 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T291963: hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet.

@Marostegui I think this host is ready to get moving again. Would you like to check it and try getting replication up again? I'm hanging back in case you'd rather I don't mess with the state for those purposes.

Oct 7 2021, 7:13 PM · SRE, ops-eqiad, Data-Services, cloud-services-team (Hardware), DC-Ops
Bstorm changed the status of T292672: Issue creating pods after migration away from PodPresets from Open to In Progress.
Oct 7 2021, 12:47 AM · User-Majavah, Toolforge, cloud-services-team (Kanban)
Bstorm changed the status of T292672: Issue creating pods after migration away from PodPresets, a subtask of T279106: Establish replacement for PodPresets in Toolforge Kubernetes, from Open to In Progress.
Oct 7 2021, 12:47 AM · User-Majavah, Toolforge, cloud-services-team (Kanban)
Bstorm claimed T292672: Issue creating pods after migration away from PodPresets.
Oct 7 2021, 12:47 AM · User-Majavah, Toolforge, cloud-services-team (Kanban)

Oct 6 2021

Bstorm added a comment to T292672: Issue creating pods after migration away from PodPresets.

I've got a patch set that I'm going to test in Minikube in a few.

Oct 6 2021, 11:46 PM · User-Majavah, Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T292672: Issue creating pods after migration away from PodPresets.

Very similar to what people experienced here https://github.com/evanphx/json-patch/issues/138. The json representation of the structure of the pod object is not fixed in k8s, so if there is no volume at all on the pod, this mutator fails because it is using json-patch as its strategy. Since the default service account mounts its own creds in most pods, this isn't a problem unless you create your own object that disables that feature.
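A minimal sketch of one way a mutator could cope with a pod that has no volumes list: a JSON Patch `add` to `/spec/volumes/-` only works when the array already exists, so the patch creates the array when it is missing and appends otherwise. The function name and the dictionary shapes are hypothetical; this is not the webhook's actual code.

```python
def build_volume_patch(pod: dict, volume: dict) -> list:
    """Build JSON Patch operations that add `volume` to a pod spec,
    whether or not the pod already has a volumes list."""
    volumes = pod.get("spec", {}).get("volumes")
    if not isinstance(volumes, list):
        # No list to append to: create it with the new volume inside.
        return [{"op": "add", "path": "/spec/volumes", "value": [volume]}]
    # The list exists: "-" addresses the position after the last element.
    return [{"op": "add", "path": "/spec/volumes/-", "value": volume}]
```

The same branching would apply to each container's `volumeMounts`, which can also be absent.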

Oct 6 2021, 11:00 PM · User-Majavah, Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T289888: Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet.

In our team meeting today, we figured that a straight refresh of cloudmetrics1001/2 as the systems were provisioned previously might be best for now. Taking over 10G space for 1G hosts doesn't seem sensible.

Oct 6 2021, 10:57 PM · Patch-For-Review, SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
Bstorm added a comment to T292672: Issue creating pods after migration away from PodPresets.

To be really specific, this line: https://github.com/lucaswerkmeister/notwikilambda-k8s/blob/cef7948b15922240cf113dc1e7621ee74715d95d/function-orchestrator/deployment.yaml#L21

Oct 6 2021, 9:44 PM · User-Majavah, Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T292672: Issue creating pods after migration away from PodPresets.

For now, I am willing to bet, you can just remove that line, and your tool will work via the label again. We also should probably upgrade the web hook so that it can function without a pre-existing volumes list as well.

Oct 6 2021, 9:42 PM · User-Majavah, Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T292672: Issue creating pods after migration away from PodPresets.

I see your problem with the new setup:

Oct 6 2021, 9:41 PM · User-Majavah, Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T292672: Issue creating pods after migration away from PodPresets.

On both pods for the controller I see one of these on different dates: 2021/09/30 17:22:52 http: TLS handshake error from 192.168.48.128:39842: EOF, but it seems to be functioning for the most part, so I'm not sure what that's about. That's not currently the pod's IP address, and I don't even see that IP in the current environment, so I presume that's just some old stuff.

Oct 6 2021, 9:36 PM · User-Majavah, Toolforge, cloud-services-team (Kanban)

Oct 5 2021

Bstorm updated subscribers of T289888: Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet.

@nskaggs and @aborrero Just checking on this, do we want to do a straight refresh of exactly as it is? The cloud-support vlan was going away last I checked. If we do a refresh exactly as the originals are deployed, it would be in racks C and A rather than on the cloud-dedicated areas. That would probably be the cloud-hosts vlan (which is not where we've got cloudmetrics1001/2 today).

Oct 5 2021, 9:55 PM · Patch-For-Review, SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
Bstorm committed rPAWS12715e2e8603: remove the commented out and deprecated part (authored by Bstorm).
remove the commented out and deprecated part
Oct 5 2021, 5:27 PM
Bstorm committed rPAWS7ad51aeb0882: update the readme to install ingress addon (authored by Bstorm).
update the readme to install ingress addon
Oct 5 2021, 5:27 PM
Bstorm committed rPAWSb2927502e608: dev environ: Fix for more diverse minikube drivers (authored by Bstorm).
dev environ: Fix for more diverse minikube drivers
Oct 5 2021, 5:27 PM
Bstorm committed rPAWS199144d1a3d8: Change to api version 2 (authored by Bstorm).
Change to api version 2
Oct 5 2021, 5:27 PM

Oct 4 2021

Bstorm added a project to T290970: File System corruption on cloud-vps instances: Cloud-VPS.
Oct 4 2021, 6:15 PM · Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-Alert, User-dcaro, cloud-services-team (FY2021/2022-Q3), Cloud-VPS
Bstorm updated the task description for T290970: File System corruption on cloud-vps instances.
Oct 4 2021, 6:14 PM · Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-Alert, User-dcaro, cloud-services-team (FY2021/2022-Q3), Cloud-VPS
Bstorm added a comment to T290970: File System corruption on cloud-vps instances.

Here, this works (except for hosts that are hard down like toolsbeta-sgewebgrid-generic-0901):
sudo cumin --force --timeout 500 "A:all" "dmesg | grep -q -m 1 'since last fsck'"
With that, "success" means the filesystem had errors.

Oct 4 2021, 6:12 PM · Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-Alert, User-dcaro, cloud-services-team (FY2021/2022-Q3), Cloud-VPS
Bstorm added a comment to T290970: File System corruption on cloud-vps instances.
bstorm@cloud-cumin-01:~$ grep 'error count' corruption-search.txt | wc -l
9

plus 1 for toolsbeta-sgewebgrid-generic-0901 says we are definitely at 10. I don't know if that is a growth or I just captured a couple more in my list. Since json output is possible, maybe I can try to come up with a command that actually can be rerun to show a delta. Otherwise, there's going to be a lot of guesswork in this ticket.
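A rerunnable delta could look something like the following minimal sketch. It assumes each run has already been reduced to a plain list of affected hostnames, one per line; that file shape and the function names are hypothetical, and nothing here depends on cumin's own output format.

```python
def parse_hosts(text: str) -> set:
    """Collect hostnames from text with one host per line, skipping
    blank lines and '#' comments."""
    hosts = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            hosts.add(line)
    return hosts


def host_delta(previous: set, current: set) -> dict:
    """Compare two runs: hosts newly affected, hosts no longer reported,
    and hosts present in both."""
    return {
        "new": sorted(current - previous),
        "cleared": sorted(previous - current),
        "unchanged": sorted(previous & current),
    }
```

Saving the host list from each run and feeding consecutive pairs to `host_delta` would show whether the count is growing or the earlier list was simply incomplete.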

Oct 4 2021, 6:08 PM · Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-Alert, User-dcaro, cloud-services-team (FY2021/2022-Q3), Cloud-VPS
Bstorm added a comment to T291387: Ensure Cloud Services platforms will accept new LE issuance chain.

Marking toolforge containers done since there is no hope for the Jessie containers.

Oct 4 2021, 6:03 PM · PAWS, Cloud-VPS, Toolforge, cloud-services-team (Kanban)
Bstorm updated the task description for T291387: Ensure Cloud Services platforms will accept new LE issuance chain.
Oct 4 2021, 6:02 PM · PAWS, Cloud-VPS, Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T290970: File System corruption on cloud-vps instances.

Ok! Running that again with fewer distractions on board: sudo cumin --force -x --timeout 500 "A:all" "dmesg | grep -m 1 'since last fsck'" > corruption-search.txt

Oct 4 2021, 6:01 PM · Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-Alert, User-dcaro, cloud-services-team (FY2021/2022-Q3), Cloud-VPS
Bstorm added a comment to T290970: File System corruption on cloud-vps instances.

Ah...because that's actually in the instance's hiera in horizon. Changing that.

Oct 4 2021, 5:58 PM · Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-Alert, User-dcaro, cloud-services-team (FY2021/2022-Q3), Cloud-VPS
Bstorm added a comment to T290970: File System corruption on cloud-vps instances.

We actually tried to exclude trove from the all alias. It didn't work: https://gerrit.wikimedia.org/r/c/operations/puppet/+/715245 should have done it, but the config file clearly doesn't include that change.

Oct 4 2021, 5:57 PM · Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-Alert, User-dcaro, cloud-services-team (FY2021/2022-Q3), Cloud-VPS
Bstorm added a comment to T290970: File System corruption on cloud-vps instances.

Yeah, 13 hosts. commonsarchive-mwtest is just exploded (kernel panics). However, we can take two off the list because it was an ssh issue ((2) gerrit-prod-1001.devtools.eqiad1.wikimedia.cloud,mwv-builder-03.mediawiki-vagrant.eqiad1.wikimedia.cloud). I also removed wcdo because that thing just likes to have python OOM panics it looks like. I didn't see the actual issue in dmesg.

Oct 4 2021, 5:53 PM · Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-Alert, User-dcaro, cloud-services-team (FY2021/2022-Q3), Cloud-VPS
Bstorm added a comment to T290970: File System corruption on cloud-vps instances.

Running sudo cumin --force --timeout 500 "A:all" "dmesg | grep 'since last fsck'" quit quite early on me. I ran it again a little differently. This needs to exclude trove from the "all" set somehow. It's annoying. I ran: sudo cumin --force -x --timeout 500 "A:all" "dmesg | grep -m 1 'since last fsck'" to make sure it didn't quit and minimized output from grep.

Oct 4 2021, 5:45 PM · Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-Alert, User-dcaro, cloud-services-team (FY2021/2022-Q3), Cloud-VPS
Bstorm added a comment to T290970: File System corruption on cloud-vps instances.

Likely related: T292264: Loss of access to parsing-qa-01.eqiad.wmflabs

Oct 4 2021, 5:26 PM · Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-Alert, User-dcaro, cloud-services-team (FY2021/2022-Q3), Cloud-VPS
Bstorm created T292465: Automate rebuild and rebuild toolsbeta-sgewebgrid-generic-0901.
Oct 4 2021, 5:23 PM · cloud-services-team (Kanban), Patch-For-Review
Bstorm added a comment to T290970: File System corruption on cloud-vps instances.

toolsbeta-sgewebgrid-0901 is similar:

Oct 4 2021, 5:20 PM · Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-Alert, User-dcaro, cloud-services-team (FY2021/2022-Q3), Cloud-VPS
Bstorm added a comment to T285668: Labs OSMdb is outdated/outofsync.

Ok, so from what you just said, that sounds to me like the OSMDB needs to be rebuilt to make sure we don't have gaps after dumping the appropriate databases. Since it is on VMs, that also suggests it is a good time to consider building the service inside the maps project instead of in the special "admin only" space of clouddb-services. I don't know the implications of syncing up the design of this sync with the production one, but that might be worth considering as well.

Oct 4 2021, 4:23 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban), Maps

Oct 3 2021

Bstorm added a reverting change for rODIT47592b707f09: Add yarn to node images: rODIT807b52ab64ea: Partially Revert "Add yarn to node images".
Oct 3 2021, 9:52 PM
Bstorm committed rODIT807b52ab64ea: Partially Revert "Add yarn to node images" (authored by Bstorm).
Partially Revert "Add yarn to node images"
Oct 3 2021, 9:52 PM
Bstorm committed rODIT5380b4449e86: openssl: update stretch container TLS libraries before using LE certs (authored by Bstorm).
openssl: update stretch container TLS libraries before using LE certs
Oct 3 2021, 9:52 PM
Bstorm added a comment to T291387: Ensure Cloud Services platforms will accept new LE issuance chain.

"may be affected" I should have said on buster.

Oct 3 2021, 9:46 PM · PAWS, Cloud-VPS, Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T292355: video2commons login is broken by the LE cert expiry (py2.7).

New buster images should be up now if you need to use that.

Oct 3 2021, 9:35 PM · video2commons
Bstorm added a comment to T292355: video2commons login is broken by the LE cert expiry (py2.7).

Tagged this on my rebuilds in case the openssl library needed updating (bullseye fixes the issue either way, but in case you need to roll back to buster). The images are being pushed still, so don't test that rollback just yet if 3.9 doesn't work.

Oct 3 2021, 9:33 PM · video2commons

Oct 1 2021

Bstorm added a comment to T292217: User browser complains about SSL certificate expired for several toolforge webservices.

Wait, are you seeing the toolforge.org domain "expired"?

Oct 1 2021, 3:27 PM · cloud-services-team (Kanban), Security, Security-Team

Sep 30 2021

Bstorm added a comment to T292265: Request increased quota for wikitextexp Cloud VPS project.

+1, This will be a much better setup for the next time such a problem happens anyway!

Sep 30 2021, 10:28 PM · User-bd808, cloud-services-team (Kanban), Cloud-VPS (Quota-requests)
Bstorm added a comment to T292043: Remove views from flaggedimages.

@mdipietro We should put up a patch to remove this at the same time as T291806. I wouldn't worry about announcing this if the table is already dropped. The data would not be changing anyway at best, and it's already gone and throwing errors at worst.

Sep 30 2021, 10:22 PM · cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T292217: User browser complains about SSL certificate expired for several toolforge webservices.

Listeria is using the image docker-registry.tools.wmflabs.org/toolforge-php73-sssd-web. That's a buster-based image, so the usual upgrade advice doesn't necessarily seem to apply there unless the buster images had an old SSL stack at some point. If so, restarting should fix it.

Sep 30 2021, 10:16 PM · cloud-services-team (Kanban), Security, Security-Team
Bstorm added a comment to T291544: Transition Toolforge Build Service.

Adding more documentation about the deployment of what has been done to https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Toolforge_Buildpack_Implementation#Deployment

Sep 30 2021, 7:22 PM · cloud-services-team (zz-archived1), Toolforge
Bstorm updated the task description for T286856: Upgrade Toolforge Kubernetes to latest 1.22.
Sep 30 2021, 7:05 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T292238: Figure out certificate generation for admission webhooks before we lose the certificates/v1beta1.

The overall issue is that existing certificates/v1 signers don't include a pod serving signer. You cannot make it use the kubelet serving signer (which is the closest you can come). This should not be an issue for maintain-kubeusers since there's a signer for that use case. The certs that signer makes cannot be serving certs, though. They can only be used for client auth.

Sep 30 2021, 7:03 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm created T292238: Figure out certificate generation for admission webhooks before we lose the certificates/v1beta1.
Sep 30 2021, 6:57 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
Bstorm closed T292131: Something is up with the kubeadm component on stretch VMs as Resolved.

All better now! Thanks @Majavah

Sep 30 2021, 5:44 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T267616: [tbs.harbor] Puppetize the toolsbeta installation.

Using trove for Postgres in the most recent iteration is terrible. You cannot control it much, and it doesn't actually allow you access to the Postgres account to create a database. This means you can have exactly one database and user. I doubt the replication still works as well. Maybe it will be improved as they settle in to their more containerized setup.

Sep 30 2021, 5:29 PM · Toolforge Build Service (Iteration 05), Cloud-Services-Worktype-Project, Cloud-Services-Origin-Team, User-dcaro, Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a comment to T292205: monitor ldap functionality from within cloud-vps.

A login may succeed with sssd when ldap is down due to caching behavior. A simple connection doesn't always actually suggest ldap is healthy. A getent that is expressly told to dodge the cache and go straight to ldap is not a bad notion for catching the whole chain quickly (which is sort of what is done on the cloudstore servers with the useldap script), but a script that does an ldap search for a should-be-stable group like tools.admin might be even better and more clear as far as what it is testing...if it connects only to what the VMs connect to. maintain-dbusers is usually also killed by an LDAP outage because it does an LDAP list of users, but it uses a route VMs don't use.
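The "ldap search for a should-be-stable group" idea might be sketched as below, using the standard `ldapsearch` client through `subprocess`. The server URI and search base are placeholders, and the function names are hypothetical; a real check would point at whatever endpoint the VMs themselves use. Because `ldapsearch` can exit successfully with zero results, the sketch looks for a returned `dn:` line instead of trusting the exit code alone.

```python
import subprocess

# Placeholder values: substitute the endpoint and base the VMs really use.
LDAP_URI = "ldap://ldap.example.org"
SEARCH_BASE = "ou=groups,dc=example,dc=org"


def build_ldapsearch_cmd(group: str, uri: str = LDAP_URI, base: str = SEARCH_BASE) -> list:
    """Anonymous simple-bind search for one group, LDIF output without comments."""
    return ["ldapsearch", "-x", "-LLL", "-H", uri, "-b", base, f"(cn={group})", "cn"]


def result_has_entry(stdout: str) -> bool:
    """True when the LDIF output contains at least one entry."""
    return any(line.startswith("dn:") for line in stdout.splitlines())


def ldap_group_visible(group: str = "tools.admin", timeout: int = 10) -> bool:
    """Run the search and report whether the group came back."""
    try:
        proc = subprocess.run(
            build_ldapsearch_cmd(group),
            capture_output=True, text=True, timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0 and result_has_entry(proc.stdout)
```

Since this talks to the directory directly, sssd caching does not mask an outage the way a login test can.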

Sep 30 2021, 4:58 PM · cloud-services-team (Kanban)
Bstorm added a comment to T292105: Remove deprecated ingress objects from existing web services.

Sweet! Instantly better ingresses.

Sep 30 2021, 4:51 PM · User-Majavah, cloud-services-team (Kanban), Toolforge
Bstorm awarded T291976: Upgrade toolsbeta to k8s 1.20 a Party Time token.
Sep 30 2021, 4:21 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T292131: Something is up with the kubeadm component on stretch VMs.

As is, whatever changed is going to interfere with the upgrade to 1.20 for T280402

Sep 30 2021, 4:06 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T292131: Something is up with the kubeadm component on stretch VMs.

We need it in stretch for bastions...and it clearly used to be there. :)

Sep 30 2021, 3:38 PM · cloud-services-team (Kanban), Toolforge

Sep 29 2021

Bstorm updated subscribers of T292131: Something is up with the kubeadm component on stretch VMs.

@aborrero I haven't looked much at what's up with that so far. I can say that deleting the sources file and letting puppet put it back didn't work. This is just a quick report of the problem.

Sep 29 2021, 10:45 PM · cloud-services-team (Kanban), Toolforge
Bstorm created T292131: Something is up with the kubeadm component on stretch VMs.
Sep 29 2021, 10:44 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T292105: Remove deprecated ingress objects from existing web services.

This sort of script is pretty simple if you have admin (kubectl sudo if you are using your local account, etc). I did this to fix a problem in the presets.

#!/bin/bash
# Run this script with your root/cluster admin account as appropriate.
# This will fix the dumps mounts for all existing tools.
Sep 29 2021, 6:54 PM · User-Majavah, cloud-services-team (Kanban), Toolforge
Bstorm created T292105: Remove deprecated ingress objects from existing web services.
Sep 29 2021, 6:52 PM · User-Majavah, cloud-services-team (Kanban), Toolforge
Bstorm moved T292043: Remove views from flaggedimages from Backlog to Wiki replicas on the Data-Services board.
Sep 29 2021, 3:26 PM · cloud-services-team (Kanban), Data-Services

Sep 28 2021

Bstorm updated the task description for T291963: hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet.
Sep 28 2021, 4:26 PM · SRE, ops-eqiad, Data-Services, cloud-services-team (Hardware), DC-Ops
Bstorm updated the task description for T291963: hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet.
Sep 28 2021, 4:26 PM · SRE, ops-eqiad, Data-Services, cloud-services-team (Hardware), DC-Ops
Bstorm added a comment to T291963: hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet.

That's a big nope from the server on restarting via console. It has a processor reporting bad voltage and other fun. System Event Log is attached.

Sep 28 2021, 4:25 PM · SRE, ops-eqiad, Data-Services, cloud-services-team (Hardware), DC-Ops
Bstorm added a comment to T291963: hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet.

This does not seem related to T289159 as it is a different rack, but you never know.

Sep 28 2021, 4:17 PM · SRE, ops-eqiad, Data-Services, cloud-services-team (Hardware), DC-Ops
Bstorm moved T291963: hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet from Backlog to Wiki replicas on the Data-Services board.
Sep 28 2021, 4:14 PM · SRE, ops-eqiad, Data-Services, cloud-services-team (Hardware), DC-Ops
Bstorm merged T291961: clouddb1020 crash into T291963: hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet.
Sep 28 2021, 4:14 PM · SRE, ops-eqiad, Data-Services, cloud-services-team (Hardware), DC-Ops
Bstorm merged task T291961: clouddb1020 crash into T291963: hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet.
Sep 28 2021, 4:13 PM · Data-Services, cloud-services-team (Hardware)
Bstorm updated the task description for T291963: hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet.
Sep 28 2021, 4:13 PM · SRE, ops-eqiad, Data-Services, cloud-services-team (Hardware), DC-Ops
Bstorm added a subtask for T291961: clouddb1020 crash: T291963: hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet.
Sep 28 2021, 4:08 PM · Data-Services, cloud-services-team (Hardware)
Bstorm added a parent task for T291963: hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet: T291961: clouddb1020 crash.
Sep 28 2021, 4:08 PM · SRE, ops-eqiad, Data-Services, cloud-services-team (Hardware), DC-Ops
Bstorm created T291963: hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet.
Sep 28 2021, 4:07 PM · SRE, ops-eqiad, Data-Services, cloud-services-team (Hardware), DC-Ops
Bstorm updated the task description for T291961: clouddb1020 crash.
Sep 28 2021, 4:01 PM · Data-Services, cloud-services-team (Hardware)