
Bstorm (Brooke)
Ops Witch -- Wikimedia Cloud Services Team

Projects (7)

User Details

User Since
Jan 22 2018, 10:09 PM (123 w, 3 d)
Availability
Available
IRC Nick
bstorm_
LDAP User
Bstorm
MediaWiki User
BStorm (WMF) [ Global Accounts ]

On the wikis, I'm BStorm (WMF), bstorm_ on IRC and Bstorm on gerrit and WikiTech.

I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.

Recent Activity

Tue, Jun 2

Bstorm added a comment to T253738: ToolsDB: master crashed, replica having consistency issues.

So far so good. It's still replicating. It hasn't fully caught up yet, though.

Tue, Jun 2, 11:03 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
bd808 awarded T246122: Upgrade the Toolforge Kubernetes cluster to v1.16 a Party Time token.
Tue, Jun 2, 10:50 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm closed T246122: Upgrade the Toolforge Kubernetes cluster to v1.16 as Resolved.

I think we are done with this one!

Tue, Jun 2, 10:34 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm updated the task description for T246122: Upgrade the Toolforge Kubernetes cluster to v1.16.
Tue, Jun 2, 10:33 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm closed T250874: Refresh external certs for the toolforge k8s cluster after the upgrade, a subtask of T246122: Upgrade the Toolforge Kubernetes cluster to v1.16, as Resolved.
Tue, Jun 2, 10:32 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm closed T250874: Refresh external certs for the toolforge k8s cluster after the upgrade as Resolved.

I think we are done then!

Tue, Jun 2, 10:32 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm created T254293: Change to admission controller readme.md failed to pass gate-and-submit jobs.
Tue, Jun 2, 10:06 PM · Continuous-Integration-Config, Release-Engineering-Team (CI & Testing services), Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm closed T253698: Attach 'skierpage' svn era LDAP account to wikitech to allow password recovery as Resolved.

Please reopen if this isn't fixed! Thanks

Tue, Jun 2, 4:50 PM · cloud-services-team (Kanban), wikitech.wikimedia.org
Bstorm assigned T222411: Decommission former cloud IPs and vlans to aborrero.
Tue, Jun 2, 4:47 PM · cloud-services-team (Kanban)
Bstorm moved T226052: Google OAuth verification for tools require domain verification from Inbox to Watching on the cloud-services-team (Kanban) board.
Tue, Jun 2, 4:45 PM · Toolforge, cloud-services-team (Kanban), Google-Drive-to-Commons
Bstorm changed the status of T224510: Document the new NFS setup on cloudstore1008/9, a subtask of T209527: Set up scratch and maps NFS services on cloudstore1008/9, from Open to Stalled.
Tue, Jun 2, 4:42 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm changed the status of T224510: Document the new NFS setup on cloudstore1008/9 from Open to Stalled.

This needs an overhaul before more docs are done.

Tue, Jun 2, 4:42 PM · cloud-services-team (Kanban)
Bstorm moved T175668: Investigate mirrorbits for possible use for dumps mirrors traffic from Inbox to Watching on the cloud-services-team (Kanban) board.
Tue, Jun 2, 4:41 PM · cloud-services-team (Kanban), User-ArielGlenn, Datasets-General-or-Unknown
Bstorm moved T249022: Track and list the services that Cloud Services that connect to internal network endpoints from Inbox to Doing on the cloud-services-team (Kanban) board.
Tue, Jun 2, 4:39 PM · cloud-services-team (Kanban)
Bstorm moved T249787: Create Docker image for Toolforge that is purpose built to run pywikibot scripts from Inbox to Doing on the cloud-services-team (Kanban) board.
Tue, Jun 2, 4:37 PM · Pywikibot, cloud-services-team (Kanban), Toolforge
Bstorm moved T251719: Quarry or the Analytics wikireplicas role creates lots of InnoDB Purge Lag from Inbox to Watching on the cloud-services-team (Kanban) board.
Tue, Jun 2, 4:33 PM · Quarry, Data-Services, cloud-services-team (Kanban)
Bstorm removed a project from T253134: Find an alternative solution for the mysql-proxy in PAWS: cloud-services-team (Kanban).
Tue, Jun 2, 4:30 PM · PAWS
Bstorm moved T250706: Prepare and check storage layer for gomwiktionary from Inbox to Clinic Duty on the cloud-services-team (Kanban) board.
Tue, Jun 2, 4:28 PM · cloud-services-team (Kanban), Data-Services, DBA
Bstorm moved T251410: Prepare and check storage layer for awawiki from Inbox to Clinic Duty on the cloud-services-team (Kanban) board.
Tue, Jun 2, 4:27 PM · cloud-services-team (Kanban), Data-Services, DBA
Bstorm edited projects for T169286: labstore1005 A PCIe link training failure error on boot, added: cloud-services-team (Hardware); removed cloud-services-team (Kanban).
Tue, Jun 2, 4:26 PM · cloud-services-team (Hardware), DC-Ops, Operations
Bstorm moved T253412: webservice 0.69+ fills /tmp with k8s ca cert files from Inbox to Soon! on the cloud-services-team (Kanban) board.
Tue, Jun 2, 4:25 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)
Bstorm moved T253424: Determine if templatetiger is abandoned from Inbox to Doing on the cloud-services-team (Kanban) board.
Tue, Jun 2, 4:24 PM · Data-Services, cloud-services-team (Kanban), Tools
Bstorm moved T253738: ToolsDB: master crashed, replica having consistency issues from Inbox to Doing on the cloud-services-team (Kanban) board.
Tue, Jun 2, 4:24 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm moved T243730: DBQueryError on Wikitech Static Search from Inbox to Doing on the cloud-services-team (Kanban) board.
Tue, Jun 2, 4:23 PM · cloud-services-team (Kanban), CirrusSearch, wikitech.wikimedia.org, Discovery-Search
Bstorm assigned T243730: DBQueryError on Wikitech Static Search to Andrew.
Tue, Jun 2, 4:23 PM · cloud-services-team (Kanban), CirrusSearch, wikitech.wikimedia.org, Discovery-Search
Bstorm triaged T251030: Enable innodb_rollback_on_timeout for wmflabs databases as Medium priority.
Tue, Jun 2, 4:21 PM · cloud-services-team (Kanban), Data-Services
Bstorm triaged T252734: Consider moving tools away from acme-chief as Low priority.
Tue, Jun 2, 4:19 PM · cloud-services-team (Kanban), Tools
Bstorm triaged T253267: Configure the soft anti-affinity (and presumably the soft affinity) server policy as Low priority.
Tue, Jun 2, 4:18 PM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm triaged T250206: Deploy a proof of concept prometheus server in cloudvps to replace shinken as High priority.
Tue, Jun 2, 4:16 PM · cloud-services-team (Kanban), Cloud-VPS (Debian Jessie Deprecation)

Mon, Jun 1

Bstorm added a comment to T250874: Refresh external certs for the toolforge k8s cluster after the upgrade.

That leaves what else? Prometheus? @aborrero
Or did we have something else where we made certs by hand?

Mon, Jun 1, 11:52 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T250874: Refresh external certs for the toolforge k8s cluster after the upgrade.

Using an operations-pod for maintain-kubeusers (so that I could install the openssl package in the pod):

# echo | openssl s_client -showcerts -servername registry-admission.registry-admission.svc -connect registry-admission.registry-admission.svc:443 2>/dev/null | openssl x509 -noout -dates
notBefore=Jun  1 23:23:00 2020 GMT
notAfter=Jun  1 23:23:00 2021 GMT

That tells me it worked!
I'll do it in tools. Also I'll document the process in the README files of the controllers. I did restart the pods just in case.

Mon, Jun 1, 11:47 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
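
For anyone repeating the check in the comment above, here is a minimal Python sketch that reads the two date lines openssl prints and computes how long the renewed cert is valid. The helper name parse_openssl_dates is hypothetical (it does not come from the task), and the parsing assumes an English/C locale for the month abbreviation.

```python
from datetime import datetime, timezone


def parse_openssl_dates(text):
    """Parse the notBefore=/notAfter= lines printed by `openssl x509 -noout -dates`.

    Hypothetical helper for illustration; assumes an English/C locale for %b.
    """
    dates = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key in ("notBefore", "notAfter"):
            # strptime treats runs of whitespace alike, so "Jun  1" parses fine.
            parsed = datetime.strptime(value.strip(), "%b %d %H:%M:%S %Y GMT")
            dates[key] = parsed.replace(tzinfo=timezone.utc)
    return dates


# Sample copied from the comment above.
sample = """notBefore=Jun  1 23:23:00 2020 GMT
notAfter=Jun  1 23:23:00 2021 GMT"""

dates = parse_openssl_dates(sample)
validity = dates["notAfter"] - dates["notBefore"]
print(validity.days)  # 365
```

A notBefore matching the time the scripts were re-run is what indicates the service is presenting the refreshed cert rather than the old one.
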
Bstorm added a comment to T250874: Refresh external certs for the toolforge k8s cluster after the upgrade.

Cool thing, I can just re-run the scripts I've got for the controllers. It works great on minikube. I'll run it in toolsbeta and delete the pods to restart as well if it needs it.

Mon, Jun 1, 11:23 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
zhuyifei1999 awarded T251786: Changes to fontcdn would be helpful for the new toolforge domains a Like token.
Mon, Jun 1, 12:17 PM · cloud-services-team (Kanban), Toolforge, Tools

Fri, May 29

Bstorm closed T154504: Make webservice backend default to kubernetes as Resolved.
Fri, May 29, 11:49 PM · Patch-For-Review, Kubernetes, Tracking-Neverending, Tools-Kubernetes, Toolforge
Bstorm added a comment to T250874: Refresh external certs for the toolforge k8s cluster after the upgrade.

I was about to just use my cert scripts, but they won't do. I need to mess with them a bit to get the admission controller certs renewed...ideally with a simple argument or something to say "just renew the certs". A second "create" with the same name will fail. It should update the existing secrets with new certs with all the appropriate alt-names for doing SSL termination.

Fri, May 29, 11:45 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T253738: ToolsDB: master crashed, replica having consistency issues.

Seconds_Behind_Master: 213890 sounds a lot better. Let's see if it actually catches up.

Fri, May 29, 11:43 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
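
To put a raw Seconds_Behind_Master value into human terms, a small Python sketch follows. parse_lag and format_lag are hypothetical helper names; the NULL case mirrors what the replica reports when its replication threads are not running, as seen elsewhere in this task.

```python
def parse_lag(status_line):
    """Extract the lag in seconds from a 'Seconds_Behind_Master: N' line.

    Returns None when the value is NULL (replication is not running).
    """
    field, _, value = status_line.partition(":")
    if field.strip() != "Seconds_Behind_Master":
        raise ValueError("not a Seconds_Behind_Master line")
    value = value.strip()
    return None if value == "NULL" else int(value)


def format_lag(seconds):
    """Render a lag in seconds as days, hours, minutes and seconds."""
    if seconds is None:
        return "replication stopped"
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


print(format_lag(parse_lag("Seconds_Behind_Master: 213890")))  # 2d 11h 24m 50s
print(format_lag(parse_lag("Seconds_Behind_Master: NULL")))    # replication stopped
```
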
Bstorm added a comment to T253738: ToolsDB: master crashed, replica having consistency issues.

So after merging that, I realized that the escapes don't seem to be correct for a table vs a wild_table, so I stopped puppet and made it a wildcard entry. That seems to have got replication moving again.

Fri, May 29, 11:41 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T236974: Establish a process for increasing a toolforge tool's connections to the wiki replicas.

Finally circled back and added that information! What do folks think now?

Fri, May 29, 10:45 PM · cloud-services-team (Kanban), Data-Services
Bstorm edited Description on Data-Services (Quota-requests).
Fri, May 29, 10:45 PM
Bstorm added a comment to T252219: Drop MCR-obsoleted fields from the wiki replicas.

Surprisingly, only labsdb1009 and db1141 now remain.

Fri, May 29, 10:38 PM · Data-Services, cloud-services-team (Kanban)
Bstorm updated the task description for T195217: Sane ingress methods for PAWS.
Fri, May 29, 10:17 PM · cloud-services-team (Kanban), PAWS (zero-to-jupyterhub-k8s 0.8.0)
Bstorm closed T251598: Clean up wb_terms related views as Resolved.
Fri, May 29, 9:12 PM · cloud-services-team (Kanban), Data-Services
Bstorm closed T251598: Clean up wb_terms related views, a subtask of T248086: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki), as Resolved.
Fri, May 29, 9:12 PM · DBA
Bstorm added a comment to T251598: Clean up wb_terms related views.

Actually, just succeeded in running on labsdb1010 by chance.

Fri, May 29, 9:11 PM · cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T251598: Clean up wb_terms related views.

This just needs a run on labsdb1010. When I do the run for T252219 against 1009 and 1010, that will finish this off.

Fri, May 29, 9:06 PM · cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T252219: Drop MCR-obsoleted fields from the wiki replicas.

I can! I will do that.

Fri, May 29, 8:51 PM · Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T252217: Public-Link of Notebooks of users with non-ASCII characters return 500 error.

If we end up not merging all of my upcoming pull request into the paws repo, I'll make a separate PR with just the new image.

Fri, May 29, 7:41 PM · PAWS
Bstorm closed T252217: Public-Link of Notebooks of users with non-ASCII characters return 500 error as Resolved.

Apparently my version works! That makes this stable and I'll close the ticket.

Fri, May 29, 7:40 PM · PAWS
Bstorm added a comment to T252217: Public-Link of Notebooks of users with non-ASCII characters return 500 error.

Ok so they work again for now, and I don't plan on restarting the pod for a while. I may try building my newer version of the image (not Debian Jessie-based) and testing it with a tag that will allow quick rollback if it doesn't work.

Fri, May 29, 6:14 PM · PAWS
Bstorm updated subscribers of T252219: Drop MCR-obsoleted fields from the wiki replicas.

@Marostegui I think 1009 is the most likely one to succeed without depooling, so I'll make a first attempt there. After that, I don't expect to be able to squeeze this in on 1009-11 (or db1141) without depooling. You've been doing a lot of work on 1011 and friends, so I thought I'd check in before I start that rotation on Monday. Ok to proceed?

Fri, May 29, 5:55 PM · Data-Services, cloud-services-team (Kanban)
Bstorm updated the task description for T252219: Drop MCR-obsoleted fields from the wiki replicas.
Fri, May 29, 5:25 PM · Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T253837: Batches in QuickStatements not working.

It likely would have lost its database connection during T253738, and it would have been restarted when we upgraded its k8s nodes yesterday in T246122. If it's working again, I'd be willing to bet that's why, and I think this can be closed.

Fri, May 29, 5:05 PM · Tools
Bstorm added a comment to T253738: ToolsDB: master crashed, replica having consistency issues.

So stupid question, @Marostegui: if I do that patch and restart mysqld on the slave without restarting the master, will that work, or do I need to CHANGE REPLICATION FILTER with the whole mess?

Fri, May 29, 5:03 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)

Thu, May 28

Bstorm updated the task description for T246122: Upgrade the Toolforge Kubernetes cluster to v1.16.
Thu, May 28, 11:44 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T246122: Upgrade the Toolforge Kubernetes cluster to v1.16.
NOTE: I copied the admin.conf to .kube/config for the root account on each control plane node because I realized our upgrade renewed that cert :)
Thu, May 28, 8:54 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T246122: Upgrade the Toolforge Kubernetes cluster to v1.16.

Never mind! The blasted config is the default on this version: RotateKubeletClientCertificate=true|false (BETA - default=true) from https://v1-16.docs.kubernetes.io/docs/reference/command-line-tools-reference/kubelet/

Thu, May 28, 6:18 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T246122: Upgrade the Toolforge Kubernetes cluster to v1.16.

I did confirm our control plane certs look right.

Thu, May 28, 5:57 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T246122: Upgrade the Toolforge Kubernetes cluster to v1.16.

Looking deeper into things, I think kubeadm is confusingly documented (we knew that). In order to upgrade the client cert for kubelet, we can simply set the kubelets to do it for us with a feature gate. The settings are here https://kubernetes.io/docs/tasks/tls/certificate-rotation/
This is distinct from *serving certificate rotation*, which we deliberately avoided. I'll make another task and a patch to add the args to our kubelets.

Thu, May 28, 5:57 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T246122: Upgrade the Toolforge Kubernetes cluster to v1.16.

We discovered that there is a bug in kubeadm < 1.17 that sets renew-certs to false on node upgrades. The control plane certs rotated fine, but the kubelet certs of worker nodes did not. https://github.com/kubernetes/kubeadm/issues/1818 This is also referenced in the docs here https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/

Thu, May 28, 4:38 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T252217: Public-Link of Notebooks of users with non-ASCII characters return 500 error.

The backports repo is gone. It needs to change to the debian archive. I'll get that fixed shortly.

Thu, May 28, 4:11 PM · PAWS
Bstorm added a comment to T252217: Public-Link of Notebooks of users with non-ASCII characters return 500 error.

Bad news. The image cannot build.

Thu, May 28, 4:05 PM · PAWS
Bstorm added a comment to T246122: Upgrade the Toolforge Kubernetes cluster to v1.16.

It's looking good after a short problem:

Thu, May 28, 3:42 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T252217: Public-Link of Notebooks of users with non-ASCII characters return 500 error.

Turns out that we are planning on restarting that pod today. I'll rebuild the image.

Thu, May 28, 2:51 PM · PAWS
Bstorm added a comment to T253738: ToolsDB: master crashed, replica having consistency issues.

Ahh, yeah, oops. I see that Seconds_Behind_Master: NULL and the threads are dead all over the logs. So @Marostegui were the earlier ones that I think are timestamped before T253738#6169013 fixed already? If so, then I'm only aware of s51245__totoazero.maj_articles_recents.

Thu, May 28, 2:17 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm updated the task description for T246122: Upgrade the Toolforge Kubernetes cluster to v1.16.
Thu, May 28, 1:55 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T246122: Upgrade the Toolforge Kubernetes cluster to v1.16.

I'd forgotten to check deprecated objects by the end of the day yesterday, but I checked this morning in Toolsbeta...and there may not be any there. As I recall, I already replaced all the PSPs in tools and TB, and the deployments there are replaced.

Thu, May 28, 1:53 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes

Wed, May 27

Bstorm updated subscribers of T253738: ToolsDB: master crashed, replica having consistency issues.

Circling back around to this. It looks like s51245__totoazero.maj_articles_recents is the major source of trouble. Timing-wise, I don't think I see any other issues.

Wed, May 27, 10:04 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Bstorm updated the task description for T195217: Sane ingress methods for PAWS.
Wed, May 27, 9:55 PM · cloud-services-team (Kanban), PAWS (zero-to-jupyterhub-k8s 0.8.0)
Bstorm closed T253702: Figure out storing images for PAWS, a subtask of T211096: PAWS: Rebuild and upgrade Kubernetes, as Resolved.
Wed, May 27, 9:55 PM · Patch-For-Review, Toolforge, Epic, Goal, cloud-services-team (Kanban), PAWS
Bstorm closed T253702: Figure out storing images for PAWS, a subtask of T195217: Sane ingress methods for PAWS, as Resolved.
Wed, May 27, 9:55 PM · cloud-services-team (Kanban), PAWS (zero-to-jupyterhub-k8s 0.8.0)
Bstorm closed T253702: Figure out storing images for PAWS as Resolved.

Got it! I'll close the task since I can then add others if needed. Thank you very much!

Wed, May 27, 9:55 PM · cloud-services-team (Kanban), PAWS
Bstorm added a comment to T253702: Figure out storing images for PAWS.

@aborrero I presume you'll want access as well?

Wed, May 27, 7:13 PM · cloud-services-team (Kanban), PAWS
Bstorm added a comment to T253702: Figure out storing images for PAWS.

Thank you for doing more than my web drive-by, @bd808! @Chicocvenancio please give me access!
Then I can push up my custom tags for testing in the new cluster. 💃

Wed, May 27, 7:12 PM · cloud-services-team (Kanban), PAWS
Bstorm added a comment to T253702: Figure out storing images for PAWS.

I made a quay.io user https://quay.io/user/brookestorm, but I still definitely am interested in the answers to those questions.

Wed, May 27, 7:03 PM · cloud-services-team (Kanban), PAWS
Bstorm moved T253702: Figure out storing images for PAWS from Inbox to Doing on the cloud-services-team (Kanban) board.
Wed, May 27, 7:00 PM · cloud-services-team (Kanban), PAWS
Bstorm triaged T253702: Figure out storing images for PAWS as High priority.
Wed, May 27, 6:59 PM · cloud-services-team (Kanban), PAWS
Bstorm added a comment to T253702: Figure out storing images for PAWS.

@Chicocvenancio What are the terms on that registry? Is it free? It looks like somebody is paying for it.
Overall, it looks like a good place to keep things, and I'd like access. I just don't want to inadvertently cost someone money if that's charging someone other than the Foundation. I also don't want someone to stop paying for it suddenly if we continue depending on it :)

Wed, May 27, 6:59 PM · cloud-services-team (Kanban), PAWS
Bstorm added a comment to T252219: Drop MCR-obsoleted fields from the wiki replicas.

@daniel I'm all set to get this closed up, I'm just waiting on a final review of the patch. I think it's good to go. I'll test it locally first because it's really hard to fix sometimes if it goes badly.

Wed, May 27, 6:53 PM · Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T253738: ToolsDB: master crashed, replica having consistency issues.

Is there a method you recommend for recloning the tables, @Marostegui? I cannot say I have done a single table clone on mysql to a replica this decade 🙂.

Wed, May 27, 2:16 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)

Tue, May 26

Bstorm added a comment to T218380: paws-deploy-hook is unreachable.

@Chicocvenancio Good to hear! Overall, I haven't enabled the repo for travis yet (nothing stops it from working as is with the current file) because too many balls are in the air to automate it from there. I don't want to automatically break the existing cluster. The travis config points back at the old repo location right now...and that's just fine until a couple more pieces are done! Thanks!

Tue, May 26, 11:48 PM · cloud-services-team (Kanban), PAWS
Bstorm added a parent task for T253702: Figure out storing images for PAWS: T195217: Sane ingress methods for PAWS.
Tue, May 26, 11:10 PM · cloud-services-team (Kanban), PAWS
Bstorm added a subtask for T195217: Sane ingress methods for PAWS: T253702: Figure out storing images for PAWS.
Tue, May 26, 11:10 PM · cloud-services-team (Kanban), PAWS (zero-to-jupyterhub-k8s 0.8.0)
Bstorm created T253702: Figure out storing images for PAWS.
Tue, May 26, 11:10 PM · cloud-services-team (Kanban), PAWS
Bstorm updated the task description for T195217: Sane ingress methods for PAWS.
Tue, May 26, 11:01 PM · cloud-services-team (Kanban), PAWS (zero-to-jupyterhub-k8s 0.8.0)
Bstorm added a comment to T195217: Sane ingress methods for PAWS.

As this is now being pursued as a quarter goal in T211096 with the effort to reuse much of the design and testing work done for Toolforge k8s, there will be significant updates to this soon. I suspect that we will likely be able to use the haproxy (in Gio's model above it was nginx)->ingress model of Toolforge, using the existing front proxy to temporarily smooth the transition. So far, the cluster is all up and ready. We are close to sorting out the last steps of actually deploying a paws in parallel there. I'm also hoping that "paws beta" can become simply the beta namespace inside this cluster.

Tue, May 26, 10:50 PM · cloud-services-team (Kanban), PAWS (zero-to-jupyterhub-k8s 0.8.0)
Bstorm added a comment to T188912: Puppetize PAWS k8s cluster .

At this point, I'm just keeping this open until we've moved over the cluster. As long as we actually use what we've built so far, this is effectively done.

Tue, May 26, 10:38 PM · Patch-For-Review, cloud-services-team (Kanban), PAWS (zero-to-jupyterhub-k8s 0.8.0)
Bstorm added a comment to T246122: Upgrade the Toolforge Kubernetes cluster to v1.16.

I think this should be unblocked and the upgrade might work on the next try. We should probably depool control plane nodes before upgrading and then repool them, per https://v1-16.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/ since that is the newer procedure (in case that fixes anything that my fix didn't; the thing I fixed would have stopped the upgrade no matter what). I don't think we should worry a bit about fussing with haproxy during the upgrade because the tooling should all be compatible between the two versions. The big thing we must check before the tools upgrade is that all the objects created with old definitions are still working on the upgraded cluster...presuming we get the upgrade rolling.

Tue, May 26, 10:14 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm updated the task description for T211096: PAWS: Rebuild and upgrade Kubernetes.
Tue, May 26, 10:07 PM · Patch-For-Review, Toolforge, Epic, Goal, cloud-services-team (Kanban), PAWS
Bstorm added a comment to T211096: PAWS: Rebuild and upgrade Kubernetes.

Ok, while that waits for things, now PAWS just needs the ingress setup finished, testing if the roles and all that work for the pods it needs, ideally using all the upgraded images, and all of this deployed in the new cluster.

Tue, May 26, 9:57 PM · Patch-For-Review, Toolforge, Epic, Goal, cloud-services-team (Kanban), PAWS
Bstorm committed rLTMK99bd246e759f: admins: intialize variables to zero for reporting (authored by Bstorm).
Tue, May 26, 9:48 PM
Bstorm added a comment to T211096: PAWS: Rebuild and upgrade Kubernetes.
root@paws-k8s-control-3:~# kubectl get nodes
NAME                 STATUS   ROLES    AGE     VERSION
paws-k8s-control-1   Ready    master   28m     v1.16.10
paws-k8s-control-2   Ready    master   21m     v1.16.10
paws-k8s-control-3   Ready    master   12m     v1.16.10
paws-k8s-worker-1    Ready    <none>   3m22s   v1.16.10
paws-k8s-worker-2    Ready    <none>   2m26s   v1.16.10
paws-k8s-worker-3    Ready    <none>   105s    v1.16.10
paws-k8s-worker-4    Ready    <none>   31s     v1.16.10
Tue, May 26, 6:36 PM · Patch-For-Review, Toolforge, Epic, Goal, cloud-services-team (Kanban), PAWS
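
A quick way to confirm a listing like the one above programmatically is sketched below in Python. parse_nodes and all_ready_on are hypothetical helper names, and the parsing assumes the default five-column layout of kubectl get nodes shown in the comment.

```python
def parse_nodes(table):
    """Turn the text table into a list of dicts keyed by the header columns."""
    lines = [line for line in table.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def all_ready_on(nodes, version):
    """True when every node reports Ready and runs the expected version."""
    return all(n["STATUS"] == "Ready" and n["VERSION"] == version for n in nodes)


# Sample copied from the comment above.
sample = """\
NAME                 STATUS   ROLES    AGE     VERSION
paws-k8s-control-1   Ready    master   28m     v1.16.10
paws-k8s-control-2   Ready    master   21m     v1.16.10
paws-k8s-control-3   Ready    master   12m     v1.16.10
paws-k8s-worker-1    Ready    <none>   3m22s   v1.16.10
paws-k8s-worker-2    Ready    <none>   2m26s   v1.16.10
paws-k8s-worker-3    Ready    <none>   105s    v1.16.10
paws-k8s-worker-4    Ready    <none>   31s     v1.16.10
"""

nodes = parse_nodes(sample)
print(len(nodes), all_ready_on(nodes, "v1.16.10"))  # 7 True
```
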
Bstorm added a comment to T211096: PAWS: Rebuild and upgrade Kubernetes.

It works now!

Tue, May 26, 6:06 PM · Patch-For-Review, Toolforge, Epic, Goal, cloud-services-team (Kanban), PAWS
Bstorm added a comment to T211096: PAWS: Rebuild and upgrade Kubernetes.

Ah, it was a copy/paste error in the hiera! Fixing.

Tue, May 26, 5:17 PM · Patch-For-Review, Toolforge, Epic, Goal, cloud-services-team (Kanban), PAWS
Bstorm added a comment to T211096: PAWS: Rebuild and upgrade Kubernetes.

It doesn't look like it is base64 in the config. I wonder why.

Tue, May 26, 5:03 PM · Patch-For-Review, Toolforge, Epic, Goal, cloud-services-team (Kanban), PAWS
Bstorm added a comment to T211096: PAWS: Rebuild and upgrade Kubernetes.

So we have progress! There's a new error that is particular to paws:
running docker ps -a to get the container ID and docker logs <hash>

Tue, May 26, 5:02 PM · Patch-For-Review, Toolforge, Epic, Goal, cloud-services-team (Kanban), PAWS
Bstorm added a comment to T246122: Upgrade the Toolforge Kubernetes cluster to v1.16.

To be clear, this would prevent the api-server pod from starting after upgrade. I suspect that's exactly what caused the error you saw (partly because it is very similar to my kubeadm init error and because the pod cannot start with that value for a volume name).

Tue, May 26, 4:11 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T246122: Upgrade the Toolforge Kubernetes cluster to v1.16.

@aborrero I think I know what is wrong in Toolsbeta. It is the same thing that I saw just now on paws. There is an error in the kubeadm config (which becomes the kubeadm configmap). The name of the extra volume needed for encryption and some other important config for the apiserver is wrong. I must have done this by mistake somewhere during that very long security eval. I made changes in place instead of rebuilding clusters, so I never saw the discrepancy.

Tue, May 26, 4:10 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T211096: PAWS: Rebuild and upgrade Kubernetes.

So the problem was:

May 26 15:46:53 paws-k8s-control-1 kubelet[28450]: E0526 15:46:53.546760   28450 file.go:187] Can't process manifest file "/etc/kubernetes/manifests/kube-apiserver.yaml": invalid pod: [spec.volumes[0].name: Invalid value: "/etc/kubernetes/admission": a DNS-1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?') spec.containers[0].volumeMounts[0].name: Not found: "/etc/kubernetes/admission"]

So it's the puppetization somewhere. I'll dig that up.

Tue, May 26, 3:57 PM · Patch-For-Review, Toolforge, Epic, Goal, cloud-services-team (Kanban), PAWS
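
The validation rule quoted in the kubelet error above can be reproduced with a short Python sketch, which shows why a path cannot be used as a volume name. The regex is copied from the log; the 63-character limit is the usual DNS-1123 label length limit and is an addition not shown in the log. is_dns1123_label is a hypothetical helper, and "admission" is only an illustrative valid name, not necessarily the one used in the fix.

```python
import re

# Regex copied from the kubelet error message.
DNS1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")
# Assumption: the standard DNS-1123 label length limit (not shown in the log).
MAX_LABEL_LENGTH = 63


def is_dns1123_label(name):
    """Return True if name is a valid DNS-1123 label."""
    return (
        len(name) <= MAX_LABEL_LENGTH
        and DNS1123_LABEL.fullmatch(name) is not None
    )


print(is_dns1123_label("/etc/kubernetes/admission"))  # False: slashes are not allowed
print(is_dns1123_label("admission"))                  # True
```
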
Bstorm added a comment to T246122: Upgrade the Toolforge Kubernetes cluster to v1.16.

I see there were psp changes around 1.16 https://github.com/kubernetes/kubernetes/pull/77792
That isn't likely to be our issue, but something to be aware of.

Tue, May 26, 3:44 PM · Patch-For-Review, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T211096: PAWS: Rebuild and upgrade Kubernetes.

We got an error! Whatever it is may help with the upgrades in toolsbeta.

Tue, May 26, 3:37 PM · Patch-For-Review, Toolforge, Epic, Goal, cloud-services-team (Kanban), PAWS