
Bstorm (Brooke)
Ops Witch -- Wikimedia Cloud Services Team

User Details

User Since
Jan 22 2018, 10:09 PM (86 w, 5 d)
Availability
Available
IRC Nick
bstorm_
LDAP User
Bstorm
MediaWiki User
BStorm (WMF) [ Global Accounts ]

On the wikis, I'm BStorm (WMF), bstorm_ on IRC and Bstorm on gerrit and WikiTech.

I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.

Recent Activity

Fri, Sep 20

Bstorm added a comment to T233372: Create a "novaobserver" equivalent for Toolforge Kubernetes cluster inspection.

Oddly, since most serviceaccounts are namespaced, this may be easier to do with a simple "user" object with an x509 and a custom role that gives just the perms you need. We'll think more about it...
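
As a rough illustration of the idea in this comment (not part of the original thread): with x509 client certificates, Kubernetes takes the username from the certificate's CN, so a cluster-wide read-only observer could be a `User` subject bound to a custom ClusterRole. The function name, role name and resource list below are hypothetical.

```python
def observer_manifests(username, resources):
    """Build a read-only ClusterRole plus a ClusterRoleBinding for an x509 user.

    `username` must match the CN of the client certificate (assumption:
    the cluster uses x509 client-cert authentication).
    """
    role_name = f"{username}-readonly"  # hypothetical naming scheme
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": role_name},
        "rules": [
            {
                "apiGroups": [""],
                "resources": sorted(resources),
                "verbs": ["get", "list", "watch"],  # read-only verbs
            }
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": role_name},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": role_name,
        },
        "subjects": [
            {
                "apiGroup": "rbac.authorization.k8s.io",
                "kind": "User",  # a plain user, not a namespaced serviceaccount
                "name": username,
            }
        ],
    }
    return role, binding
```

Because the binding is cluster-scoped and the subject is a user rather than a serviceaccount, nothing here is tied to a single namespace.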

Fri, Sep 20, 12:01 AM · Kubernetes, Toolforge

Thu, Sep 19

Bstorm added a comment to T227290: Design and document how to integrate the new Toolforge k8s cluster with PodSecurityPolicy.

Started design doc...needs pictures: https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Toolforge_Kubernetes_RBAC_and_PSP

Thu, Sep 19, 10:47 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm closed T219374: Prepare and check storage layer for hi.wikisource, a subtask of T218155: Create Wikisource Hindi, as Resolved.
Thu, Sep 19, 6:05 PM · Operations, User-Ladsgroup, Patch-For-Review, Wiki-Setup (Create), User-Urbanecm
Bstorm closed T219374: Prepare and check storage layer for hi.wikisource as Resolved.

Ok, I've run maintain-meta_p on all 4 replicas. Docs don't need an update because what I changed is hardcoded anyway.

Thu, Sep 19, 6:05 PM · Core Platform Team Workboards (Clinic Duty Team), cloud-services-team, Analytics, Data-Services, DBA
Bstorm edited projects for T233335: Change maintain-meta_p script to use the sitematrix API, added: cloud-services-team (Kanban); removed cloud-services-team.
Thu, Sep 19, 5:31 PM · cloud-services-team (Kanban), Data-Services
Bstorm triaged T233335: Change maintain-meta_p script to use the sitematrix API as Low priority.
Thu, Sep 19, 5:30 PM · cloud-services-team (Kanban), Data-Services
Bstorm created T233335: Change maintain-meta_p script to use the sitematrix API.
Thu, Sep 19, 5:10 PM · cloud-services-team (Kanban), Data-Services
Bstorm added a comment to T219374: Prepare and check storage layer for hi.wikisource.

The problem is apparently a lot of settings being moved from InitializeSettings.php to VariantSettings.php. I'm going to make sure the function will correctly parse the new file and, if so, document a command line that will use that instead.

Thu, Sep 19, 5:08 PM · Core Platform Team Workboards (Clinic Duty Team), cloud-services-team, Analytics, Data-Services, DBA
Bstorm added a comment to T219374: Prepare and check storage layer for hi.wikisource.

Thanks @Marostegui

Thu, Sep 19, 4:45 PM · Core Platform Team Workboards (Clinic Duty Team), cloud-services-team, Analytics, Data-Services, DBA
Bstorm added a comment to T219374: Prepare and check storage layer for hi.wikisource.

Looks like https://gerrit.wikimedia.org/r/c/operations/puppet/+/538030 fixes it, but I don't know if we need that timestamp index.
I'm not seeing anything changing in the meta_p script. Wonder if it talks to a server that's moved/down/etc.

Thu, Sep 19, 4:44 PM · Core Platform Team Workboards (Clinic Duty Team), cloud-services-team, Analytics, Data-Services, DBA
Bstorm added a comment to T219374: Prepare and check storage layer for hi.wikisource.

To provide context, we did a lot to prevent breakage of the tables during that refactor, but I didn't even think to check the indexes. The meta_p thing looks like an error that wasn't tested since it was merged? That'll take more digging.

Thu, Sep 19, 4:40 PM · Core Platform Team Workboards (Clinic Duty Team), cloud-services-team, Analytics, Data-Services, DBA
Bstorm added a comment to T219374: Prepare and check storage layer for hi.wikisource.

That meta_p and maintain-indexes thing will make the wiki non-functional for some purposes to users. The meta_p bit will link in tooling and the maintain-indexes breakage will make some queries horribly slow (it's for the joins).

Thu, Sep 19, 4:37 PM · Core Platform Team Workboards (Clinic Duty Team), cloud-services-team, Analytics, Data-Services, DBA

Wed, Sep 18

Bstorm added a comment to T227290: Design and document how to integrate the new Toolforge k8s cluster with PodSecurityPolicy.

I've adapted my current test environment to use the PSPs in that patch and the proposed role above. So far, it behaves exactly as intended. A user with these credentials is nicely blind to the goings-on in any other namespace, but enjoys relative freedom to act within their own.

Wed, Sep 18, 9:00 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T222820: Experiment with hosted kubernetes solutions for Beta.

If this project ends up integrating with WMCS-managed stuff at all (Beta cluster -- does that mean deployment-prep?), I'd at least be interested in being a fly on the wall. I'm kind of curious what people come up with in general for our use or understanding, but if we are doing any peering or VPN with things in Cloud, I definitely would like to know to see how it impacts things.

Wed, Sep 18, 7:06 PM · Release-Engineering-Team-TODO, Beta-Cluster-Infrastructure, Release Pipeline
Bstorm added a comment to T215678: Replace each of the custom controllers with something in a new Toolforge Kubernetes setup.

I think we may need to add a PodPreset injection for the automounter. However, I'm more concerned about restricting mounts than forcing them. I'll test that.

Wed, Sep 18, 4:42 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T228499: Toolforge: changes to maintain-kubeusers.

To correctly mimic the behavior of the UID enforcer controller, T215678: Replace each of the custom controllers with something in a new Toolforge Kubernetes setup, maintain-kubeusers must apply a UID restriction to each user and namespaced default service account. Going to test that notion.
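
A minimal sketch of what such a per-user UID restriction might look like, assuming it is expressed as a PodSecurityPolicy (the mechanism discussed in T227290). The policy name, function name and the permissive values for the other required fields are illustrative assumptions, not the actual maintain-kubeusers output.

```python
def uid_restricted_psp(tool_name, uid):
    """Return a PodSecurityPolicy manifest pinning pods to a single UID."""
    return {
        "apiVersion": "policy/v1beta1",
        "kind": "PodSecurityPolicy",
        "metadata": {"name": f"tool-{tool_name}-psp"},  # hypothetical name
        "spec": {
            # min == max forces every container to run as exactly this UID
            "runAsUser": {
                "rule": "MustRunAs",
                "ranges": [{"min": uid, "max": uid}],
            },
            # The remaining required strategies are left open in this sketch.
            "seLinux": {"rule": "RunAsAny"},
            "supplementalGroups": {"rule": "RunAsAny"},
            "fsGroup": {"rule": "RunAsAny"},
        },
    }
```

A policy only takes effect for a user or serviceaccount that is granted the `use` verb on it through RBAC, which is why it would have to be applied to both the user and the namespace's default serviceaccount.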

Wed, Sep 18, 4:41 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T194859: Toolforge maintain-kubeusers doesn't fail well when LDAP servers are unreachable.

The workaround seems to be working so far. I haven't seen evidence of a single hang; the logs show it creating new accounts and running successfully. The only place it might have an issue is if it ever needs to create enough accounts to make it take longer than a minute.

Wed, Sep 18, 3:59 PM · cloud-services-team (Kanban), Toolforge
Bstorm closed T233176: ssl renewal: *.wmflabs.org expires 2019-11-16 as Declined.

This might need procurement, so I'm going to rebuild the task.

Wed, Sep 18, 12:13 AM · procurement, cloud-services-team (Kanban), Cloud-Services, Operations
Bstorm created T233176: ssl renewal: *.wmflabs.org expires 2019-11-16.
Wed, Sep 18, 12:11 AM · procurement, cloud-services-team (Kanban), Cloud-Services, Operations

Tue, Sep 17

Bstorm assigned T233120: 2019-09-17: tools share cleanup (high usage) to Phamhi.
Tue, Sep 17, 6:41 PM · Data-Services, cloud-services-team (Kanban)
Bstorm moved T233120: 2019-09-17: tools share cleanup (high usage) from Backlog to Shared Storage on the Data-Services board.
Tue, Sep 17, 4:13 PM · Data-Services, cloud-services-team (Kanban)
Bstorm moved T233120: 2019-09-17: tools share cleanup (high usage) from Inbox to Important on the cloud-services-team (Kanban) board.
Tue, Sep 17, 4:13 PM · Data-Services, cloud-services-team (Kanban)
Bstorm raised the priority of T233120: 2019-09-17: tools share cleanup (high usage) from Normal to High.
Tue, Sep 17, 4:13 PM · Data-Services, cloud-services-team (Kanban)
Bstorm triaged T233120: 2019-09-17: tools share cleanup (high usage) as Normal priority.
Tue, Sep 17, 4:12 PM · Data-Services, cloud-services-team (Kanban)

Mon, Sep 16

Bstorm committed rLPRI13e5629c1c73: password: rotate cloudwide root key for bstorm (authored by Bstorm).
password: rotate cloudwide root key for bstorm
Mon, Sep 16, 9:00 PM
Bstorm updated the task description for T227133: a8-eqiad pdu refresh (Date TBA).
Mon, Sep 16, 5:05 PM · DC-Ops, Operations, ops-eqiad
Bstorm updated the task description for T227539: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC).
Mon, Sep 16, 5:00 PM · DC-Ops, Operations, ops-eqiad

Fri, Sep 13

Bstorm added a comment to T220530: Ensure clouddb1001 is monitored appropriately from the tendril/prometheus side.

Shifting gears on this a bit. Since we don't need/want tendril, we should make sure that we are collecting appropriate metrics for grafana/prometheus on the cloud side of the house. A bit of a review.

Fri, Sep 13, 4:40 PM · Data-Services, cloud-services-team (Kanban)

Thu, Sep 12

Bstorm added a comment to T232770: Ensure kube2proxy handles apiserver failure gracefully.

So on investigation, it doesn't look like it actually removed them all, but at the same time, there's some odd behavior I see. It adds every service to the redis backend on every loop, which is wrong.
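
One way to avoid rewriting everything on each loop is to compare the desired routes with what is already stored and write only the difference. The sketch below is a generic reconciliation step, not kube2proxy's actual code; `plan_updates` and the dict-based state are hypothetical, and treating a failed apiserver fetch as `None` is an assumption about how the failure would be signalled.

```python
def plan_updates(current, desired):
    """Compute the minimal writes to move `current` to `desired`.

    Both arguments map a route key to a backend value. `desired` is None
    when the apiserver could not be reached; in that case nothing changes,
    so an outage never wipes the existing routes.
    """
    if desired is None:
        return {}, []
    # Only keys that are new or whose value changed need to be written.
    to_set = {k: v for k, v in desired.items() if current.get(k) != v}
    # Keys that no longer exist upstream should be removed.
    to_delete = sorted(k for k in current if k not in desired)
    return to_set, to_delete
```

With this shape, a loop in which nothing changed produces no writes at all.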

Thu, Sep 12, 11:11 PM · cloud-services-team (Kanban), Wikimedia-Incident, Toolforge
Bstorm added a comment to T232772: Audit tools project puppet CA certs to ensure that they are all consistent.

The related cert for the outage was on the server itself in this place https://phabricator.wikimedia.org/T148929#2817428

Thu, Sep 12, 7:27 PM · Wikimedia-Incident, cloud-services-team (Kanban), Toolforge
Bstorm added a subtask for T232536: Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail: T232770: Ensure kube2proxy handles apiserver failure gracefully.
Thu, Sep 12, 7:11 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a parent task for T232770: Ensure kube2proxy handles apiserver failure gracefully: T232536: Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail.
Thu, Sep 12, 7:11 PM · cloud-services-team (Kanban), Wikimedia-Incident, Toolforge
Bstorm added a parent task for T232769: Document some etcd cluster operations for Toolforge: T232536: Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail.
Thu, Sep 12, 7:10 PM · cloud-services-team (Kanban), Toolforge, Wikimedia-Incident
Bstorm added a subtask for T232536: Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail: T232769: Document some etcd cluster operations for Toolforge.
Thu, Sep 12, 7:10 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a parent task for T214513: Upgrade Toolforge Kubernetes: T232536: Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail.
Thu, Sep 12, 7:10 PM · Wikimedia-Incident, Goal, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a subtask for T232536: Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail: T214513: Upgrade Toolforge Kubernetes.
Thu, Sep 12, 7:10 PM · cloud-services-team (Kanban), Toolforge
Bstorm triaged T232772: Audit tools project puppet CA certs to ensure that they are all consistent as Normal priority.
Thu, Sep 12, 7:08 PM · Wikimedia-Incident, cloud-services-team (Kanban), Toolforge
Bstorm created T232772: Audit tools project puppet CA certs to ensure that they are all consistent.
Thu, Sep 12, 7:07 PM · Wikimedia-Incident, cloud-services-team (Kanban), Toolforge
Bstorm added a project to T214513: Upgrade Toolforge Kubernetes: Wikimedia-Incident.

Because the version of Kubernetes in Toolforge was related to some lousy error messages during an outage, and this is now one of the actionables from that incident, adding the Incident tag.

Thu, Sep 12, 6:59 PM · Wikimedia-Incident, Goal, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm triaged T232770: Ensure kube2proxy handles apiserver failure gracefully as Normal priority.
Thu, Sep 12, 6:58 PM · cloud-services-team (Kanban), Wikimedia-Incident, Toolforge
Bstorm triaged T232769: Document some etcd cluster operations for Toolforge as Normal priority.
Thu, Sep 12, 6:50 PM · cloud-services-team (Kanban), Toolforge, Wikimedia-Incident

Tue, Sep 10

Bstorm updated subscribers of T232264: Change db password for tools.machtsinn.
Tue, Sep 10, 3:27 PM · Data-Services, cloud-services-team (Kanban)

Mon, Sep 9

Bstorm added a comment to T229058: Replace the nslcd mount in containers from the old Toolforge cluster with something that will work with sssd in the new one.

The docker-registry.tools.wmflabs.org/toollabs-python35-sssd-web:testing image worked today in testing on tools-worker-1029.tools.eqiad.wmflabs (which is cordoned and runs sssd).

Mon, Sep 9, 11:57 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm committed rODITb40cdcdf5621: sssd: Add some new images to test sssd in containers (authored by Bstorm).
sssd: Add some new images to test sssd in containers
Mon, Sep 9, 10:40 PM
Bstorm committed rODITe168a3aef8a8: tagging: Add the tag to the templates (authored by Bstorm).
tagging: Add the tag to the templates
Mon, Sep 9, 5:41 PM
Bstorm added a comment to T215553: Figure out cert management for Toolforge kubernetes and make it clear in documents, etc. for the upgrade.

As of k8s 1.8, I think there's a prometheus metric for cert expiry https://github.com/kubernetes/kubernetes/pull/51031

Mon, Sep 9, 4:23 PM · Toolforge, cloud-services-team (Kanban), Kubernetes

Sat, Sep 7

Bstorm triaged T232264: Change db password for tools.machtsinn as Normal priority.
Sat, Sep 7, 9:03 PM · Data-Services, cloud-services-team (Kanban)

Fri, Sep 6

Bstorm updated subscribers of T221272: Expose new ipblocks.ipb_sitewide column to the replicas.

@sbassett Can I get a +1 from security on this column? Looking through backlogged tickets, I noticed this one.

Fri, Sep 6, 7:06 PM · cloud-services-team (Kanban), Data-Services, Security-Team, Anti-Harassment
Bstorm added a comment to T194859: Toolforge maintain-kubeusers doesn't fail well when LDAP servers are unreachable.

That's why it suddenly stopped and started working. I was wondering (and commented on the merged task).

Fri, Sep 6, 3:17 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T228529: Become failed for newly created tool .

I just saw wikiwho get created:

Fri, Sep 6, 3:15 PM · cloud-services-team (Kanban), Toolforge

Thu, Sep 5

Bstorm added a comment to T229058: Replace the nslcd mount in containers from the old Toolforge cluster with something that will work with sssd in the new one.

So it appears the safest way to really test sssd is still to do something like https://gerrit.wikimedia.org/r/c/operations/docker-images/toollabs-images/+/527258
Then build the image and tag it with testing. I can test things locally, but I suspect it is easier to do this and actually test it on tools-worker-1029 (which is still in place as a jessie sssd test node). To do it locally, I'd have to make minikube work with sssd, which sounds like a lot of fussing.

Thu, Sep 5, 6:49 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T228499: Toolforge: changes to maintain-kubeusers.

So one thing learned from testing this with NFS: volume claims are namespaced. The only way to share an NFS mount across namespaces at this time is as a hostPath.
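
For illustration only, a hostPath volume and its matching container mount could be generated as below. This assumes the NFS share is already mounted on every worker node at the same path; the helper name and the example path are hypothetical.

```python
def nfs_hostpath_volume(name, path):
    """Return (volume, volumeMount) entries exposing a node directory to a pod.

    Assumes `path` is an NFS mount that already exists on each node.
    """
    volume = {
        "name": name,
        # "Directory" makes the pod fail to start if the path is missing
        "hostPath": {"path": path, "type": "Directory"},
    }
    mount = {"name": name, "mountPath": path}
    return volume, mount
```

Unlike a PersistentVolumeClaim, this references nothing namespaced, so pods in any namespace can use the same definition.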

Thu, Sep 5, 6:39 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes

Fri, Aug 30

Bstorm added a comment to T228499: Toolforge: changes to maintain-kubeusers.

I have managed a live test of this in a Kubernetes cluster with LDAP. New permissions were needed for the clusterrole. Additionally, the serviceaccount running this needs all permissions that it grants to other users because that's the rules in Kubernetes. Since it currently grants the clusterrole "edit" in a namespace, I had to give the sa that permission as a clusterrolebinding (because it must be able to do all those things in the target namespace). I kind of hate that, but it is necessary.
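
The rule described here is Kubernetes' RBAC escalation prevention: a subject can create a binding to a role only if it already holds every permission in that role. A deliberately simplified model of that check follows; it ignores wildcards, `resourceNames` and non-resource URLs, and the function names are made up for the example.

```python
def expand(rules):
    """Flatten RBAC rules into a set of (apiGroup, resource, verb) triples."""
    return {
        (group, resource, verb)
        for rule in rules
        for group in rule["apiGroups"]
        for resource in rule["resources"]
        for verb in rule["verbs"]
    }


def covers(held_rules, granted_rules):
    """True if the granter already holds every permission it wants to hand out."""
    return expand(granted_rules) <= expand(held_rules)
```

Under this model, a serviceaccount that grants "edit" in a namespace passes the check only when its own rules are a superset of "edit" for that namespace.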

Fri, Aug 30, 12:46 AM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes

Thu, Aug 29

Bstorm added a comment to T229448: showmount not working on labstore1004 & labstore1005.

That sounds like a great idea!

Thu, Aug 29, 2:14 PM · Data-Services, cloud-services-team (Kanban)

Wed, Aug 28

Bstorm moved T231500: Investigate and develop a rough maturity model for Tools and Cloud VPS from Inbox to Doing on the cloud-services-team (Kanban) board.
Wed, Aug 28, 9:28 PM · Cloud-VPS, Toolforge, cloud-services-team (Kanban)
Bstorm triaged T231500: Investigate and develop a rough maturity model for Tools and Cloud VPS as Normal priority.
Wed, Aug 28, 9:28 PM · Cloud-VPS, Toolforge, cloud-services-team (Kanban)
Bstorm edited projects for T231499: Create a feature that allows a tool to be surfaced as "co-maintainer wanted", added: Striker; removed Toolforge.
Wed, Aug 28, 9:24 PM · Striker
Bstorm updated subscribers of T231499: Create a feature that allows a tool to be surfaced as "co-maintainer wanted".
Wed, Aug 28, 9:23 PM · Striker
Bstorm created T231499: Create a feature that allows a tool to be surfaced as "co-maintainer wanted".
Wed, Aug 28, 9:23 PM · Striker
Bstorm triaged T231498: Wrap continuity documentation into new OKR work for the team as Normal priority.
Wed, Aug 28, 9:18 PM · cloud-services-team
Bstorm added a comment to T230147: Toolforge: collect prometheus node exporter metrics from new k8s worker nodes.

No, I think this is only resolved if "new kubernetes worker nodes" can export metrics. They'll fail if we spin up another one. I'm perfectly fine with just documenting that the package needs an upgrade (since there are packages that need downgrades as well), but a puppet pin of the package would resolve it as well. The reason I'm ok with just updating the docs is that this concerns Jessie nodes, and we are going to deprecate Jessie. Otherwise, we'd surely insist on fixing this in puppet so the build is reproducible.

Wed, Aug 28, 12:47 AM · cloud-services-team (Kanban)

Tue, Aug 27

Bstorm committed rODIT6ca7fedb953a: Apply black formatting (authored by Bstorm).
Apply black formatting
Tue, Aug 27, 10:34 PM
Bstorm committed rLTMKc81664a1248e: maintain_kubeusers: authenticate in k8s and out -- fix tests and docker (authored by Bstorm).
maintain_kubeusers: authenticate in k8s and out -- fix tests and docker
Tue, Aug 27, 9:08 PM
Bstorm committed rLTMKf738f772b72a: maintain_kubeusers: add basic expiration detection (authored by Bstorm).
maintain_kubeusers: add basic expiration detection
Tue, Aug 27, 9:08 PM
Bstorm moved T215553: Figure out cert management for Toolforge kubernetes and make it clear in documents, etc. for the upgrade from Needs discussion to Doing on the cloud-services-team (Kanban) board.
Tue, Aug 27, 5:05 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

We discussed the matter and felt as a team that these are not the right way to be monitoring the customer experience tools we have for Toolforge. We decided to remove the icinga monitors and create a subtask to implement a more sensible monitor for this.

Tue, Aug 27, 4:36 PM · Toolforge, cloud-services-team (Kanban)

Mon, Aug 26

Bstorm closed T210762: Prepare and check storage layer for nap.wikisource, a subtask of T210752: Create Wikisource Neapolitan, as Resolved.
Mon, Aug 26, 5:58 PM · MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), Patch-For-Review, Google-Code-in-2018, User-Urbanecm, Wiki-Setup (Create)
Bstorm closed T210762: Prepare and check storage layer for nap.wikisource as Resolved.

Scripts finished. Validated that the views are reachable in Toolforge.

Mon, Aug 26, 5:58 PM · Data-Services, cloud-services-team, DBA
Bstorm added a comment to T210762: Prepare and check storage layer for nap.wikisource.

Created the database and the grant on the replicas, running scripts now to get it all set.

Mon, Aug 26, 5:50 PM · Data-Services, cloud-services-team, DBA
Bstorm claimed T210762: Prepare and check storage layer for nap.wikisource.
Mon, Aug 26, 5:35 PM · Data-Services, cloud-services-team, DBA

Aug 21 2019

Bstorm added a comment to T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

https://www.dell.com/support/home/en/en/sebsdt1/drivers/driversdetails?driverid=f675y
Looks like there are a number of fixes in this update of the controller firmware, but I don't see any very specific to our issue (lots of INTERNAL_DEVICE_RESET, etc). Can we try that before putting it back in service? I can reimage it if that is required to update the firmware (I'm sure we'll need to at this point anyway).

Aug 21 2019, 6:11 PM · cloud-services-team, ops-eqiad, Operations
Bstorm added a comment to T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

copied from T230442#5413070

                    Versions
                ================
Product Name    : PERC H730P Adapter
Serial No       : 87U048Y
FW Package Build: 25.5.3.0005
Aug 21 2019, 5:56 PM · cloud-services-team, ops-eqiad, Operations
Bstorm added a comment to T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

Reboot sent it into a re-image (stalled at confirmation about writing partitioning scheme to disk). It's not healthy. :) Feel free to muck around in the console.

Aug 21 2019, 5:24 PM · cloud-services-team, ops-eqiad, Operations
Bstorm added a comment to T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

It wasn't showing the right number of disks when I was running things. It was missing four, I believe? Two have failed and logged tickets, but it would have to have lost two more to go read-only (and I seem to recall this was a 10 disk machine)--would need to check to be sure.

Aug 21 2019, 5:20 PM · cloud-services-team, ops-eqiad, Operations
Bstorm added a comment to T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

Yup, I can do that. I'm not sure which either, per T230442#5429068
It dropped the failures from the list, and I'm not even entirely convinced the disks are bad with how it behaved. It's not accepting ssh connections anymore, so I'll have to do it via mgmt.

Aug 21 2019, 5:15 PM · cloud-services-team, ops-eqiad, Operations
Bstorm added a comment to T230442: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

Did Dell only send replacement SSD? This has lost 4 disks in a very short time (all are failed now and most missing in the list of disks). I highly suspect there is another issue that isn't the disks themselves (controller firmware, etc. maybe?). This is also not the first time this server did this (fail out multiple disks until the filesystem failed), see:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20190213-cloudvps
T216218: Cloud VPS outage on cloudvirt1024 and cloudvirt1018 due to storage failure
I mean, it might be fine, and coincidences do happen, but I'm curious.

Aug 21 2019, 4:43 PM · ops-eqiad, Operations

Aug 15 2019

Bstorm added a comment to T230575: Degraded RAID on cloudvirt1018.

Looks like the exact same thing as T229156: Degraded RAID on cloudvirt1018. Same disk, same error and even same hot spare rebuilding.

Aug 15 2019, 8:37 PM · ops-eqiad, Operations
Bstorm added a comment to T230575: Degraded RAID on cloudvirt1018.

Looks like a bad disk here:

Aug 15 2019, 8:32 PM · ops-eqiad, Operations
Bstorm added a comment to T215553: Figure out cert management for Toolforge kubernetes and make it clear in documents, etc. for the upgrade.

Another proposal is enabling automatic rotation for kubelet certs so we don't have to manually re-issue them if we don't upgrade during the course of a year. Since upgrading via kubeadm does rotate the certs for all nodes, as long as there is at least one upgrade during a year, we'll be ok, but why chance it? https://kubernetes.io/docs/tasks/tls/certificate-rotation/#enabling-client-certificate-rotation

Aug 15 2019, 8:18 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T229551: Database-reports can't see packages in its virtualenv on the grid.

Closing since I was able to test it with @MaxSem's tool account/venv.

Aug 15 2019, 6:10 PM · Community-Tech (Kanban (Q1 2019-20)), Patch-For-Review, Tools, Toolforge
Bstorm closed T229551: Database-reports can't see packages in its virtualenv on the grid, a subtask of T219472: Port database reports to Python 3, as Resolved.
Aug 15 2019, 6:08 PM · Tools, Community-Tech (Kanban (Q1 2019-20))
Bstorm closed T229551: Database-reports can't see packages in its virtualenv on the grid as Resolved.
Aug 15 2019, 6:08 PM · Community-Tech (Kanban (Q1 2019-20)), Patch-For-Review, Tools, Toolforge
Bstorm added a comment to T228499: Toolforge: changes to maintain-kubeusers.

Removing toolforge access is an effective removal as well if the API server is not publicly accessible of course. However, ensuring that the RBAC can be removed as well helps. Rebuilding the CA sounds like a poor option.

Aug 15 2019, 5:48 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T227290: Design and document how to integrate the new Toolforge k8s cluster with PodSecurityPolicy.

Ok, for Toolforge users, I currently have maintain-kubeusers generating individual role-bindings to the default "edit" clusterrole for new users in their namespace (T228499: Toolforge: changes to maintain-kubeusers). Since "edit" is a blank check of read/write access while preventing changes to RBAC/PSP, I thought it called for some modification. The biggest things I think we should remove from "edit" are:

Aug 15 2019, 5:40 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T228499: Toolforge: changes to maintain-kubeusers.

I was just thinking about the fact that we are applying RBAC to the user and not to the group (which seems more efficient). At this time, here is why (and this needs to be documented in the script): I do not support using the group annotation for overall cluster access when using x.509 certificates because this issue is not resolved yet. An issued cert exists until it expires, so RBAC is our primary means of immediately shutting down user access. This means we can tie some things to the group, but other things (write access to resources, at least) must be tied to the user until there is a reasonable mechanism for invalidating client certs.
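
A toy model of the reasoning above, not real Kubernetes code: access effectively requires both an unexpired client certificate and an RBAC binding for that specific user, so deleting the per-user binding cuts access even though the certificate stays valid. All names and dates are hypothetical.

```python
from datetime import datetime, timezone


def is_authorized(username, cert_not_after, now, bound_users):
    """Access needs a still-valid cert AND an RBAC binding for this user."""
    cert_valid = now < cert_not_after
    return cert_valid and username in bound_users


expiry = datetime(2020, 8, 15, tzinfo=timezone.utc)  # cert is good for a year
now = datetime(2019, 8, 15, tzinfo=timezone.utc)
bound = {"tool-a", "tool-b"}

before = is_authorized("tool-a", expiry, now, bound)
bound.discard("tool-a")  # remove only this user's binding
after = is_authorized("tool-a", expiry, now, bound)
```

If the binding were attached to a group named in the certificate instead, there would be no per-user entry to remove, and the certificate would keep working until it expired.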

Aug 15 2019, 5:27 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T228499: Toolforge: changes to maintain-kubeusers.

Currently this is using the "edit" default clusterrole for new toolforge users. That is absolutely not what I'd like it to use for now. So I've dumped out the permissions that grants and have commented out the pieces I'd like to further restrict for toolforge. I'll include this in the PSP/RBAC ticket as well with a bit more explanation (T227290: Design and document how to integrate the new Toolforge k8s cluster with PodSecurityPolicy).

# RBAC minimum perms for toolforge users, based on the "edit" system clusterrole with some permissions commented out:
# verbs for R/O
# ["get", "list", "watch"]

# verbs for R/W
# ["get", "list", "watch", "create", "update", "patch", "delete"]

# resources needed by Toolforge users for sure:
# pods, deployments,

rules:
- apiGroups:
  - ""
  resources:
  - pods/attach
  - pods/exec
  - pods/portforward
  - pods/proxy
  - secrets
  - services/proxy
  verbs:
  - get
  - list
  - watch
# - apiGroups:
#   - ""
#   resources:
#   - serviceaccounts
#   verbs:
#   - impersonate
- apiGroups:
  - ""
  resources:
  - pods
  - pods/attach
  - pods/exec
  - pods/portforward
  - pods/proxy
  verbs:
  - create
  - delete
  - deletecollection
  - patch
  - update
- apiGroups:
  - ""
  resources:
  - configmaps
  - endpoints
  # - persistentvolumeclaims
  - replicationcontrollers
  - replicationcontrollers/scale
  - secrets
  # - serviceaccounts
  - services
  - services/proxy
  verbs:
  - create
  - delete
  - deletecollection
  - patch
  - update
- apiGroups:
  - apps
  resources:
  # - daemonsets
  - deployments
  - deployments/rollback
  - deployments/scale
  - replicasets
  - replicasets/scale
  - statefulsets
  - statefulsets/scale
  verbs:
  - create
  - delete
  - deletecollection
  - patch
  - update
# - apiGroups:
#   - autoscaling
#   resources:
#   - horizontalpodautoscalers
#   verbs:
#   - create
#   - delete
#   - deletecollection
#   - patch
#   - update
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - create
  - delete
  - deletecollection
  - patch
  - update
- apiGroups:
  - extensions
  resources:
  # - daemonsets
  - deployments
  - deployments/rollback
  - deployments/scale
  - ingresses
  - networkpolicies
  - replicasets
  - replicasets/scale
  - replicationcontrollers/scale
  verbs:
  - create
  - delete
  - deletecollection
  - patch
  - update
# - apiGroups:
#   - policy
#   resources:
#   - poddisruptionbudgets
#   verbs:
#   - create
#   - delete
#   - deletecollection
#   - patch
#   - update
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  - networkpolicies
  verbs:
  - create
  - delete
  - deletecollection
  - patch
  - update
- apiGroups:
  - ""
  resources:
  - configmaps
  - endpoints
  - persistentvolumeclaims
  - pods
  - replicationcontrollers
  - replicationcontrollers/scale
  # - serviceaccounts
  - services
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - bindings
  - events
  - limitranges
  - namespaces/status
  - pods/log
  - pods/status
  - replicationcontrollers/status
  - resourcequotas
  - resourcequotas/status
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - controllerrevisions
  - daemonsets
  - deployments
  - deployments/scale
  - replicasets
  - replicasets/scale
  - statefulsets
  - statefulsets/scale
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - deployments
  - deployments/scale
  - ingresses
  - networkpolicies
  - replicasets
  - replicasets/scale
  - replicationcontrollers/scale
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  - networkpolicies
  verbs:
  - get
  - list
  - watch
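
The hand-commented listing above could also be derived programmatically from the stock "edit" rules. The sketch below is one possible approach under that assumption, not how the paste was actually produced: it drops a chosen set of resources from any rule that carries write verbs and leaves read-only rules untouched. The function name and the denied set are hypothetical.

```python
READ_ONLY = {"get", "list", "watch"}


def strip_write_access(rules, denied_resources):
    """Remove `denied_resources` from every rule that grants write verbs.

    Rules whose verbs are all read-only pass through unchanged; a write
    rule left with no resources is dropped entirely.
    """
    result = []
    for rule in rules:
        if set(rule["verbs"]) <= READ_ONLY:
            result.append(rule)
            continue
        kept = [r for r in rule["resources"] if r not in denied_resources]
        if kept:
            result.append({**rule, "resources": kept})
    return result
```

This covers the write-side removals such as daemonsets and horizontalpodautoscalers; a resource that is also hidden from read rules in the listing (serviceaccounts) would need a second pass without the read-only exemption.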

Aug 15 2019, 5:04 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm created P8918 RBAC for Toolforge suggestion.
Aug 15 2019, 5:01 PM
Bstorm added a comment to T229551: Database-reports can't see packages in its virtualenv on the grid.

This seems like it is fixed now. I don't need a bash wrapper in my test case.

Aug 15 2019, 3:52 PM · Community-Tech (Kanban (Q1 2019-20)), Patch-For-Review, Tools, Toolforge

Aug 14 2019

Bstorm added a comment to T228500: Toolforge: evaluate ingress mechanism.

I did a little research to make sure I'm not being unhelpful on this ticket by commenting (and yes, some of my comments were probably useless).

Aug 14 2019, 10:54 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm triaged T229009: Proposal: ditching the master name in kubernetes servers as Low priority.
Aug 14 2019, 5:42 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T229009: Proposal: ditching the master name in kubernetes servers.

At this point, I'm going to leave this task open just long enough for us to build new control plane nodes in toolsbeta. I don't think it requires us to tear down the test cluster for now.

Aug 14 2019, 5:42 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm closed T169287: etcd config depends on puppet certs, but puppet doesn't know, a subtask of T169289: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues, as Resolved.
Aug 14 2019, 5:39 PM · cloud-services-team (Kanban), Toolforge
Bstorm closed T169287: etcd config depends on puppet certs, but puppet doesn't know, a subtask of T215553: Figure out cert management for Toolforge kubernetes and make it clear in documents, etc. for the upgrade, as Resolved.
Aug 14 2019, 5:39 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm closed T169287: etcd config depends on puppet certs, but puppet doesn't know as Resolved.

At this point, we have puppetized copying the puppet certs so that they act as both etcd client certs and server certs. That happened in T215531: Deploy upgraded Kubernetes to toolsbeta with an "unstacked" control plane (separate etcd servers), because we found dealing with node failure on a stacked control plane to be kind of awful.

Aug 14 2019, 5:39 PM · cloud-services-team (Kanban), Kubernetes, Cloud-Services
Bstorm moved T215553: Figure out cert management for Toolforge kubernetes and make it clear in documents, etc. for the upgrade from Doing to Needs discussion on the cloud-services-team (Kanban) board.

Moving this to Needs discussion to talk through the proposal for admin users, since that changes the behavior of the original system, and to open the design proposal in general for comments, questions, rejection, or a redo.

Aug 14 2019, 5:33 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm merged T144153: Move kubernetes authentication to using X.509 client certs into T228499: Toolforge: changes to maintain-kubeusers.
Aug 14 2019, 5:23 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm merged task T144153: Move kubernetes authentication to using X.509 client certs into T228499: Toolforge: changes to maintain-kubeusers.
Aug 14 2019, 5:23 PM · cloud-services-team (Kanban), Kubernetes, Toolforge, Tools-Kubernetes
Bstorm added a comment to T215553: Figure out cert management for Toolforge kubernetes and make it clear in documents, etc. for the upgrade.

The admin users mentioned in that diagram are still theoretical, but there is no reason to require root to interact with a k8s API. It should be straightforward to add a service or a manually run script that maps the <project>.admin group to admin user accounts and places them in the appropriate locations. That will allow Toolforge admins to interact with k8s as easily as they can with Grid Engine (and nobody else--they need to use tool accounts). This should simplify playbooks and procedures for dealing with misbehaving jobs and services, etc.
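
As a rough illustration of the group-to-admin mapping described above, here is a minimal Python sketch. It is not the script that was eventually written: the function name, the sample usernames, the `/home` root, and the idea that "appropriate locations" means a per-user `.kube/config` file are all assumptions made for the example. Only the `<project>.admin` group naming comes from the comment.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable


def admin_kubeconfig_paths(
    project: str,
    group_members: Dict[str, Iterable[str]],
    home_root: str = "/home",  # assumed location of admin home directories
) -> Dict[str, str]:
    """Map each member of the <project>.admin group to a kubeconfig path.

    group_members is a plain dict of group name -> usernames; a real
    script would presumably read this from LDAP instead (assumption).
    """
    admins = group_members.get(f"{project}.admin", ())
    return {
        user: str(PurePosixPath(home_root) / user / ".kube" / "config")
        for user in sorted(set(admins))
    }


# Hypothetical group data, for illustration only.
groups = {
    "toolsbeta.admin": ["alice", "bob"],
    "toolsbeta.users": ["carol"],
}
paths = admin_kubeconfig_paths("toolsbeta", groups)
for user, path in paths.items():
    print(f"{user} -> {path}")
```

Only members of the admin group get an entry; everyone else would keep using tool accounts, as the comment says.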

Aug 14 2019, 5:08 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215553: Figure out cert management for Toolforge kubernetes and make it clear in documents, etc. for the upgrade.

This describes essentially what we are now doing. Etcd client and server certs are simply the puppet certs (which should keep etcd flexible in case we need to set up routing into calico somewhere), while certs for users are X.509 certs generated using the certificates API of k8s. Node certs are generated by k8s as well, using kubeadm (which interacts with the certs API using tokens). The certs to manage the CA and PKI are copied between k8s control plane nodes at build time. A new cluster will have a new CA, which honestly prevents leakage nicely.
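
To make the user-cert step a little more concrete, the sketch below builds the kind of CertificateSigningRequest object that the k8s certificates API accepts. It is a sketch under assumptions, not the tooling referred to here: the function name, the object name `tool-example`, and the placeholder PEM are hypothetical, and it uses the `certificates.k8s.io/v1` schema with the `kubernetes.io/kube-apiserver-client` signer, which may differ from the API version the cluster was running at the time.

```python
import base64
from typing import Any, Dict


def build_client_csr(name: str, csr_pem: bytes) -> Dict[str, Any]:
    """Wrap a PEM-encoded PKCS#10 request in a CertificateSigningRequest.

    The API expects spec.request to be the base64 encoding of the PEM
    text. Generating the key pair and the PEM itself is out of scope.
    """
    return {
        "apiVersion": "certificates.k8s.io/v1",
        "kind": "CertificateSigningRequest",
        "metadata": {"name": name},
        "spec": {
            "request": base64.b64encode(csr_pem).decode("ascii"),
            # Signer for client certs that authenticate to the API server.
            "signerName": "kubernetes.io/kube-apiserver-client",
            "usages": ["client auth"],
        },
    }


# Placeholder bytes only; this is NOT a valid certificate request.
placeholder_pem = (
    b"-----BEGIN CERTIFICATE REQUEST-----\n"
    b"...\n"
    b"-----END CERTIFICATE REQUEST-----\n"
)
manifest = build_client_csr("tool-example", placeholder_pem)
print(manifest["kind"], manifest["metadata"]["name"])
```

The request still has to be submitted to the API and approved before a signed cert is available in its status.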

Aug 14 2019, 5:05 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T229551: Database-reports can't see packages in its virtualenv on the grid.

In the meantime, I did confirm separately that what I documented about using a wrapper explicitly with an activate does work with jsub. I very much like the idea of having it fixed so that isn't necessary, though :)

Aug 14 2019, 2:31 AM · Community-Tech (Kanban (Q1 2019-20)), Patch-For-Review, Tools, Toolforge