Bstorm (Brooke)
Ops Witch -- Wikimedia Cloud Services Team

User Details

User Since
Jan 22 2018, 10:09 PM (85 w, 5 d)
Availability
Available
IRC Nick
bstorm_
LDAP User
Bstorm
MediaWiki User
BStorm (WMF)

On the wikis, I'm BStorm (WMF), bstorm_ on IRC and Bstorm on gerrit and WikiTech.

I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.

Recent Activity

Fri, Sep 13

Bstorm added a comment to T220530: Ensure clouddb1001 is monitored appropriately from the tendril/prometheus side.

Shifting gears on this a bit. Since we don't need/want tendril, we should make sure that we are collecting appropriate metrics for grafana/prometheus on the cloud side of the house. A bit of a review.

Fri, Sep 13, 4:40 PM · Data-Services, cloud-services-team (Kanban)

Thu, Sep 12

Bstorm added a comment to T232770: Ensure kube2proxy handles apiserver failure gracefully.

So on investigation, it doesn't look like it actually removed them all, but at the same time, there's some odd behavior I see. It adds every service to the redis backend on every loop which is wrong.
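The behavior described above (rewriting every service on every loop) can be contrasted with a diff-based update. The following is a minimal sketch, not kube2proxy's actual code: the `Routes` shape, the `plan_updates` name, and the example URLs are all hypothetical. It writes only entries that changed and leaves everything alone when the API server could not be listed.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical shape: service name -> backend URL stored in the proxy backend.
Routes = Dict[str, str]


def plan_updates(current: Routes, desired: Optional[Routes]) -> Tuple[Routes, Set[str]]:
    """Return (routes to write, service names to delete).

    `desired` is None when the API server could not be reached. In that
    case nothing is changed, so existing routes survive the outage.
    """
    if desired is None:
        return {}, set()
    # Only touch entries that are new or whose backend changed.
    to_write = {name: url for name, url in desired.items() if current.get(name) != url}
    # Remove entries that no longer exist on the API server side.
    to_delete = set(current) - set(desired)
    return to_write, to_delete


current = {"tool-a": "http://10.0.0.1:8000", "tool-b": "http://10.0.0.2:8000"}
desired = {"tool-a": "http://10.0.0.1:8000", "tool-c": "http://10.0.0.3:8000"}

writes, deletes = plan_updates(current, desired)
outage_writes, outage_deletes = plan_updates(current, None)
```

Treating a failed listing as "no information" rather than "no services" is the design choice that matters here: an empty but successful listing still removes routes, while a failed one does not.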

Thu, Sep 12, 11:11 PM · cloud-services-team (Kanban), Wikimedia-Incident, Toolforge
Bstorm added a comment to T232772: Audit tools project puppet CA certs to ensure that they are all consistent.

The related cert for the outage was on the server itself in this place https://phabricator.wikimedia.org/T148929#2817428

Thu, Sep 12, 7:27 PM · Wikimedia-Incident, cloud-services-team (Kanban), Toolforge
Bstorm added a subtask for T232536: Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail: T232770: Ensure kube2proxy handles apiserver failure gracefully.
Thu, Sep 12, 7:11 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a parent task for T232770: Ensure kube2proxy handles apiserver failure gracefully: T232536: Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail.
Thu, Sep 12, 7:11 PM · cloud-services-team (Kanban), Wikimedia-Incident, Toolforge
Bstorm added a parent task for T232769: Document some etcd cluster operations for Toolforge: T232536: Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail.
Thu, Sep 12, 7:10 PM · cloud-services-team (Kanban), Toolforge, Wikimedia-Incident
Bstorm added a subtask for T232536: Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail: T232769: Document some etcd cluster operations for Toolforge.
Thu, Sep 12, 7:10 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a parent task for T214513: Upgrade Toolforge Kubernetes: T232536: Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail.
Thu, Sep 12, 7:10 PM · Wikimedia-Incident, Goal, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a subtask for T232536: Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail: T214513: Upgrade Toolforge Kubernetes.
Thu, Sep 12, 7:10 PM · cloud-services-team (Kanban), Toolforge
Bstorm triaged T232772: Audit tools project puppet CA certs to ensure that they are all consistent as Normal priority.
Thu, Sep 12, 7:08 PM · Wikimedia-Incident, cloud-services-team (Kanban), Toolforge
Bstorm created T232772: Audit tools project puppet CA certs to ensure that they are all consistent.
Thu, Sep 12, 7:07 PM · Wikimedia-Incident, cloud-services-team (Kanban), Toolforge
Bstorm added a project to T214513: Upgrade Toolforge Kubernetes: Wikimedia-Incident.

Because the version of Kubernetes in Toolforge was related to some lousy error messages during an outage, and this is now one of the actionables from that incident, adding the Incident tag.

Thu, Sep 12, 6:59 PM · Wikimedia-Incident, Goal, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm triaged T232770: Ensure kube2proxy handles apiserver failure gracefully as Normal priority.
Thu, Sep 12, 6:58 PM · cloud-services-team (Kanban), Wikimedia-Incident, Toolforge
Bstorm triaged T232769: Document some etcd cluster operations for Toolforge as Normal priority.
Thu, Sep 12, 6:50 PM · cloud-services-team (Kanban), Toolforge, Wikimedia-Incident

Tue, Sep 10

Bstorm updated subscribers of T232264: Change db password or tools.machtsinn.
Tue, Sep 10, 3:27 PM · Data-Services, cloud-services-team (Kanban)

Mon, Sep 9

Bstorm added a comment to T229058: Replace the nslcd mount in containers from the old Toolforge cluster with something that will work with sssd in the new one.

The docker-registry.tools.wmflabs.org/toollabs-python35-sssd-web:testing image worked today in testing on tools-worker-1029.tools.eqiad.wmflabs (which is cordoned and runs sssd).

Mon, Sep 9, 11:57 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm committed rODITb40cdcdf5621: sssd: Add some new images to test sssd in containers (authored by Bstorm).
sssd: Add some new images to test sssd in containers
Mon, Sep 9, 10:40 PM
Bstorm committed rODITe168a3aef8a8: tagging: Add the tag to the templates (authored by Bstorm).
tagging: Add the tag to the templates
Mon, Sep 9, 5:41 PM
Bstorm added a comment to T215553: Figure out cert management for Toolforge kubernetes and make it clear in documents, etc. for the upgrade.

As of k8s 1.8, I think there's a prometheus metric for cert expiry https://github.com/kubernetes/kubernetes/pull/51031
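Independently of how that metric is exposed, the underlying check is small. A minimal sketch, assuming the certificate's expiry timestamp has already been parsed into a timezone-aware datetime; the function names and the 30-day warning window are illustrative choices, not anything taken from Kubernetes or Prometheus.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_expiry(not_after: datetime, now: datetime) -> float:
    """Seconds remaining before the certificate expires (negative if expired)."""
    return (not_after - now).total_seconds()


def needs_renewal(not_after: datetime, now: datetime,
                  warn_window: timedelta = timedelta(days=30)) -> bool:
    """True when the certificate expires within the warning window."""
    return not_after - now <= warn_window


# Fixed example timestamps so the result does not depend on the wall clock.
now = datetime(2019, 9, 9, tzinfo=timezone.utc)
not_after = datetime(2019, 9, 19, tzinfo=timezone.utc)

remaining = seconds_until_expiry(not_after, now)
renew_soon = needs_renewal(not_after, now)
renew_within_week = needs_renewal(not_after, now, warn_window=timedelta(days=7))
```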

Mon, Sep 9, 4:23 PM · Toolforge, cloud-services-team (Kanban), Kubernetes

Sat, Sep 7

Bstorm triaged T232264: Change db password or tools.machtsinn as Normal priority.
Sat, Sep 7, 9:03 PM · Data-Services, cloud-services-team (Kanban)

Fri, Sep 6

Bstorm updated subscribers of T221272: Expose new ipblocks.ipb_sitewide column to the replicas.

@sbassett Can I get a +1 from security on this column? Looking through backlogged tickets, I noticed this one.

Fri, Sep 6, 7:06 PM · cloud-services-team (Kanban), Patch-For-Review, Data-Services, Security-Team, Anti-Harassment
Bstorm added a comment to T194859: Toolforge maintain-kubeusers doesn't fail well when LDAP servers are unreachable.

That's why it suddenly stopped and started working. I was wondering (and commented on the merged task).

Fri, Sep 6, 3:17 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T228529: Become failed for newly created tool.

I just saw wikiwho get created:

Fri, Sep 6, 3:15 PM · cloud-services-team (Kanban), Toolforge

Thu, Sep 5

Bstorm added a comment to T229058: Replace the nslcd mount in containers from the old Toolforge cluster with something that will work with sssd in the new one.

So it appears the safest way to really test sssd is still to do something like https://gerrit.wikimedia.org/r/c/operations/docker-images/toollabs-images/+/527258
Then build the image and tag it with testing. I can test things locally, but I suspect it is easier to do this and actually test it on tools-worker-1029 (which is still in place as a jessie sssd test node). To do it locally, I would have to make minikube work with sssd, which sounds like a lot of fussing.

Thu, Sep 5, 6:49 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T228499: Toolforge: changes to maintain-kubeusers.

So one thing learned from testing this with NFS: volume claims are namespaced. The only way to share an NFS mount across namespaces at this time is as a hostPath.

Thu, Sep 5, 6:39 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes

Fri, Aug 30

Bstorm added a comment to T228499: Toolforge: changes to maintain-kubeusers.

I have managed a live test of this in a Kubernetes cluster with LDAP. New permissions were needed for the clusterrole. Additionally, the serviceaccount running this needs all permissions that it grants to other users, because those are the rules in Kubernetes. Since it currently grants the clusterrole "edit" in a namespace, I had to give the sa that permission as a clusterrolebinding (because it must be able to do all those things in the target namespace). I kind of hate that, but it is necessary.
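The rule that a grantor must already hold everything it grants can be illustrated with a set comparison. This is a simplified sketch, not the API server's algorithm: it ignores wildcards, resourceNames, namespace scoping, and the `bind`/`escalate` verbs, and the names `expand` and `missing_to_grant` are hypothetical.

```python
from typing import Iterable, Set, Tuple

# One flattened permission: (apiGroup, resource, verb).
Permission = Tuple[str, str, str]


def expand(rules: Iterable[dict]) -> Set[Permission]:
    """Flatten RBAC-style rules into individual (group, resource, verb) triples."""
    perms = set()
    for rule in rules:
        for group in rule["apiGroups"]:
            for resource in rule["resources"]:
                for verb in rule["verbs"]:
                    perms.add((group, resource, verb))
    return perms


def missing_to_grant(grantor_rules: Iterable[dict],
                     granted_rules: Iterable[dict]) -> Set[Permission]:
    """Permissions the grantor lacks but is trying to hand out."""
    return expand(granted_rules) - expand(grantor_rules)


grantor = [{"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list"]}]
granted = [{"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "create"]}]

missing = missing_to_grant(grantor, granted)
```

An empty result means the grant would not escalate privileges under this simplified model; here `missing` contains the `create` permission on pods, so the binding would be refused until the grantor holds it too.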

Fri, Aug 30, 12:46 AM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes

Thu, Aug 29

Bstorm added a comment to T229448: showmount not working on labstore1004 & labstore1005.

That sounds like a great idea!

Thu, Aug 29, 2:14 PM · Data-Services, cloud-services-team (Kanban)

Wed, Aug 28

Bstorm moved T231500: Investigate and develop a rough maturity model for Tools and Cloud VPS from Inbox to Doing on the cloud-services-team (Kanban) board.
Wed, Aug 28, 9:28 PM · Cloud-VPS, Toolforge, cloud-services-team (Kanban)
Bstorm triaged T231500: Investigate and develop a rough maturity model for Tools and Cloud VPS as Normal priority.
Wed, Aug 28, 9:28 PM · Cloud-VPS, Toolforge, cloud-services-team (Kanban)
Bstorm edited projects for T231499: Create a feature that allows a tool to be surfaced as "co-maintainer wanted", added: Striker; removed Toolforge.
Wed, Aug 28, 9:24 PM · Striker
Bstorm updated subscribers of T231499: Create a feature that allows a tool to be surfaced as "co-maintainer wanted".
Wed, Aug 28, 9:23 PM · Striker
Bstorm created T231499: Create a feature that allows a tool to be surfaced as "co-maintainer wanted".
Wed, Aug 28, 9:23 PM · Striker
Bstorm triaged T231498: Wrap continuity documentation into new OKR work for the team as Normal priority.
Wed, Aug 28, 9:18 PM · cloud-services-team
Bstorm added a comment to T230147: Toolforge: collect prometheus node exporter metrics from new k8s worker nodes.

No, I think this is only resolved if "new kubernetes worker nodes" can export metrics. They'll fail if we spin up another one. I'm perfectly fine with just documenting that the package needs an upgrade (since there are packages that need downgrades as well), but a puppet pin of the package would resolve it as well. The reason I'm ok with just updating the docs is that this is re: Jessie nodes. We are going to deprecate Jessie. Otherwise, we'd surely insist on fixing this in puppet so the build is reproducible.

Wed, Aug 28, 12:47 AM · cloud-services-team (Kanban)

Tue, Aug 27

Bstorm committed rODIT6ca7fedb953a: Apply black formatting (authored by Bstorm).
Apply black formatting
Tue, Aug 27, 10:34 PM
Bstorm committed rLTMKc81664a1248e: maintain_kubeusers: authenticate in k8s and out -- fix tests and docker (authored by Bstorm).
maintain_kubeusers: authenticate in k8s and out -- fix tests and docker
Tue, Aug 27, 9:08 PM
Bstorm committed rLTMKf738f772b72a: maintain_kubeusers: add basic expiration detection (authored by Bstorm).
maintain_kubeusers: add basic expiration detection
Tue, Aug 27, 9:08 PM
Bstorm moved T215553: Figure out cert management for Toolforge kubernetes and make it clear in documents, etc. for the upgrade from Needs discussion to Doing on the cloud-services-team (Kanban) board.
Tue, Aug 27, 5:05 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

We discussed the matter and felt as a team that these are not the right way to be monitoring the customer experience tools we have for Toolforge. We decided to remove the icinga monitors and create a subtask to implement a more sensible monitor for this.

Tue, Aug 27, 4:36 PM · Toolforge, cloud-services-team (Kanban)

Mon, Aug 26

Bstorm closed T210762: Prepare and check storage layer for nap.wikisource, a subtask of T210752: Create Wikisource Neapolitan, as Resolved.
Mon, Aug 26, 5:58 PM · MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), Patch-For-Review, Google-Code-in-2018, User-Urbanecm, Wiki-Setup (Create)
Bstorm closed T210762: Prepare and check storage layer for nap.wikisource as Resolved.

Scripts finished. Validated that the views are reachable in Toolforge.

Mon, Aug 26, 5:58 PM · Data-Services, cloud-services-team, DBA
Bstorm added a comment to T210762: Prepare and check storage layer for nap.wikisource.

Created the database and the grant on the replicas, running scripts now to get it all set.

Mon, Aug 26, 5:50 PM · Data-Services, cloud-services-team, DBA
Bstorm claimed T210762: Prepare and check storage layer for nap.wikisource.
Mon, Aug 26, 5:35 PM · Data-Services, cloud-services-team, DBA

Wed, Aug 21

Bstorm added a comment to T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

https://www.dell.com/support/home/en/en/sebsdt1/drivers/driversdetails?driverid=f675y
Looks like there are a number of fixes in this update of the controller firmware, but I don't see any very specific to our issue (lots of INTERNAL_DEVICE_RESET, etc). Can we try that before putting it back in service? I can reimage it if that is required to update the firmware (I'm sure we'll need to at this point anyway).

Wed, Aug 21, 6:11 PM · cloud-services-team, ops-eqiad, Operations
Bstorm added a comment to T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

copied from T230442#5413070

                    Versions
                ================
Product Name    : PERC H730P Adapter
Serial No       : 87U048Y
FW Package Build: 25.5.3.0005
Wed, Aug 21, 5:56 PM · cloud-services-team, ops-eqiad, Operations
Bstorm added a comment to T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

Reboot sent it into a re-image (stalled at confirmation about writing partitioning scheme to disk). It's not healthy. :) Feel free to muck around in the console.

Wed, Aug 21, 5:24 PM · cloud-services-team, ops-eqiad, Operations
Bstorm added a comment to T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

It wasn't showing the right number of disks when I was running things. It was missing four, I believe? Two have failed and logged tickets, but it would have to have lost two more to go read-only (and I seem to recall this was a 10 disk machine)--would need to check to be sure.

Wed, Aug 21, 5:20 PM · cloud-services-team, ops-eqiad, Operations
Bstorm added a comment to T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

Yup, I can do that. I'm not sure which either, per T230442#5429068
It dropped the failures from the list, and I'm not even entirely convinced the disks are bad with how it behaved. It's not accepting ssh connections anymore, so I'll have to do it via mgmt.

Wed, Aug 21, 5:15 PM · cloud-services-team, ops-eqiad, Operations
Bstorm added a comment to T230442: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

Did Dell only send a replacement SSD? This has lost 4 disks in a very short time (all are failed now and most are missing from the list of disks). I highly suspect there is another issue that isn't the disks themselves (controller firmware, etc. maybe?). This is also not the first time this server did this (fail out multiple disks until the filesystem failed), see:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20190213-cloudvps
T216218: Cloud VPS outage on cloudvirt1024 and cloudvirt1018 due to storage failure
I mean, it might be fine, and coincidences do happen, but I'm curious.

Wed, Aug 21, 4:43 PM · ops-eqiad, Operations

Aug 15 2019

Bstorm added a comment to T230575: Degraded RAID on cloudvirt1018.

Looks like the exact same thing as T229156: Degraded RAID on cloudvirt1018. Same disk, same error and even same hot spare rebuilding.

Aug 15 2019, 8:37 PM · ops-eqiad, Operations
Bstorm added a comment to T230575: Degraded RAID on cloudvirt1018.

Looks like a bad disk here:

Aug 15 2019, 8:32 PM · ops-eqiad, Operations
Bstorm added a comment to T215553: Figure out cert management for Toolforge kubernetes and make it clear in documents, etc. for the upgrade.

Another proposal is enabling automatic rotation for kubelet certs so we don't have to manually re-issue them if we don't upgrade during the course of a year. Since upgrading via kubeadm does rotate the certs for all nodes, as long as there is at least one upgrade during a year, we'll be ok, but why chance it? https://kubernetes.io/docs/tasks/tls/certificate-rotation/#enabling-client-certificate-rotation

Aug 15 2019, 8:18 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T229551: Database-reports can't see packages in its virtualenv on the grid.

Closing since I was able to test it with @MaxSem's tool account/venv.

Aug 15 2019, 6:10 PM · Community-Tech (Kanban (Q1 2019-20)), Patch-For-Review, Tools, Toolforge
Bstorm closed T229551: Database-reports can't see packages in its virtualenv on the grid, a subtask of T219472: Port database reports to Python 3, as Resolved.
Aug 15 2019, 6:08 PM · Tools, Community-Tech (Kanban (Q1 2019-20))
Bstorm closed T229551: Database-reports can't see packages in its virtualenv on the grid as Resolved.
Aug 15 2019, 6:08 PM · Community-Tech (Kanban (Q1 2019-20)), Patch-For-Review, Tools, Toolforge
Bstorm added a comment to T228499: Toolforge: changes to maintain-kubeusers.

Removing toolforge access is an effective removal as well if the API server is not publicly accessible of course. However, ensuring that the RBAC can be removed as well helps. Rebuilding the CA sounds like a poor option.

Aug 15 2019, 5:48 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T227290: Design and document how to integrate the new Toolforge k8s cluster with PodSecurityPolicy.

Ok, for Toolforge users, I currently have maintain-kubeusers generating individual role-bindings to the default "edit" clusterrole for new users in their namespace (T228499: Toolforge: changes to maintain-kubeusers). Since "edit" is a blank check of read/write access while preventing changes to RBAC/PSP, I thought it called for some modification. The biggest things I think we should remove from "edit" are:

Aug 15 2019, 5:40 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T228499: Toolforge: changes to maintain-kubeusers.

I was just thinking about the fact that we are applying RBAC to the user and not to the group (which seems more efficient). At this time, here is why (and this needs to be documented in the script): I do not support using the group annotation for overall cluster access when using x.509 certificates because this issue is not resolved yet. An issued cert exists until it expires, so RBAC is our primary means of immediately shutting down user access. This means we can tie some things to the group, but other things (write access to resources, at least) must be tied to the user until there is a reasonable mechanism for invalidating client certs.

Aug 15 2019, 5:27 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T228499: Toolforge: changes to maintain-kubeusers.

Currently this is using the "edit" default clusterrole for new toolforge users. That is absolutely not what I'd like it to use for now. So I've dumped out the permissions that grants and have commented out the pieces I'd like to further restrict for toolforge. I'll include this in the PSP/RBAC ticket as well with a bit more explanation (T227290: Design and document how to integrate the new Toolforge k8s cluster with PodSecurityPolicy).

# RBAC minimum perms for toolforge users, based on the "edit" system clusterrole with some permissions commented out:
# verbs for R/O
# ["get", "list", "watch"]

# verbs for R/W
# ["get", "list", "watch", "create", "update", "patch", "delete"]

# resources needed by Toolforge users for sure:
# pods, deployments,

rules:
- apiGroups:
  - ""
  resources:
  - pods/attach
  - pods/exec
  - pods/portforward
  - pods/proxy
  - secrets
  - services/proxy
  verbs:
  - get
  - list
  - watch
# - apiGroups:
#   - ""
#   resources:
#   - serviceaccounts
#   verbs:
#   - impersonate
- apiGroups:
  - ""
  resources:
  - pods
  - pods/attach
  - pods/exec
  - pods/portforward
  - pods/proxy
  verbs:
  - create
  - delete
  - deletecollection
  - patch
  - update
- apiGroups:
  - ""
  resources:
  - configmaps
  - endpoints
  # - persistentvolumeclaims
  - replicationcontrollers
  - replicationcontrollers/scale
  - secrets
  # - serviceaccounts
  - services
  - services/proxy
  verbs:
  - create
  - delete
  - deletecollection
  - patch
  - update
- apiGroups:
  - apps
  resources:
  # - daemonsets
  - deployments
  - deployments/rollback
  - deployments/scale
  - replicasets
  - replicasets/scale
  - statefulsets
  - statefulsets/scale
  verbs:
  - create
  - delete
  - deletecollection
  - patch
  - update
# - apiGroups:
#   - autoscaling
#   resources:
#   - horizontalpodautoscalers
#   verbs:
#   - create
#   - delete
#   - deletecollection
#   - patch
#   - update
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - create
  - delete
  - deletecollection
  - patch
  - update
- apiGroups:
  - extensions
  resources:
  # - daemonsets
  - deployments
  - deployments/rollback
  - deployments/scale
  - ingresses
  - networkpolicies
  - replicasets
  - replicasets/scale
  - replicationcontrollers/scale
  verbs:
  - create
  - delete
  - deletecollection
  - patch
  - update
# - apiGroups:
#   - policy
#   resources:
#   - poddisruptionbudgets
#   verbs:
#   - create
#   - delete
#   - deletecollection
#   - patch
#   - update
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  - networkpolicies
  verbs:
  - create
  - delete
  - deletecollection
  - patch
  - update
- apiGroups:
  - ""
  resources:
  - configmaps
  - endpoints
  - persistentvolumeclaims
  - pods
  - replicationcontrollers
  - replicationcontrollers/scale
  # - serviceaccounts
  - services
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - bindings
  - events
  - limitranges
  - namespaces/status
  - pods/log
  - pods/status
  - replicationcontrollers/status
  - resourcequotas
  - resourcequotas/status
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - controllerrevisions
  - daemonsets
  - deployments
  - deployments/scale
  - replicasets
  - replicasets/scale
  - statefulsets
  - statefulsets/scale
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - deployments
  - deployments/scale
  - ingresses
  - networkpolicies
  - replicasets
  - replicasets/scale
  - replicationcontrollers/scale
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  - networkpolicies
  verbs:
  - get
  - list
  - watch

Aug 15 2019, 5:04 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm created P8918 RBAC for Toolforge suggestion.
Aug 15 2019, 5:01 PM
Bstorm added a comment to T229551: Database-reports can't see packages in its virtualenv on the grid.

This seems like it is fixed now. I don't need a bash wrapper in my test case.

Aug 15 2019, 3:52 PM · Community-Tech (Kanban (Q1 2019-20)), Patch-For-Review, Tools, Toolforge

Aug 14 2019

Bstorm added a comment to T228500: Toolforge: evaluate ingress mechanism.

I did a little research to make sure I'm not being unhelpful on this ticket by commenting (and yes, some of my comments were probably useless).

Aug 14 2019, 10:54 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm triaged T229009: Proposal: ditching the master name in kubernetes servers as Low priority.
Aug 14 2019, 5:42 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T229009: Proposal: ditching the master name in kubernetes servers.

At this point, I'm going to leave this task open basically for just long enough for us to build new control plane nodes in toolsbeta. I don't think it requires us tearing down the test cluster for now.

Aug 14 2019, 5:42 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm closed T169287: etcd config depends on puppet certs, but puppet doesn't know, a subtask of T169289: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues, as Resolved.
Aug 14 2019, 5:39 PM · cloud-services-team (Kanban), Toolforge
Bstorm closed T169287: etcd config depends on puppet certs, but puppet doesn't know, a subtask of T215553: Figure out cert management for Toolforge kubernetes and make it clear in documents, etc. for the upgrade, as Resolved.
Aug 14 2019, 5:39 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm closed T169287: etcd config depends on puppet certs, but puppet doesn't know as Resolved.

At this point, we have ended up puppetizing the copying of puppet certs to act as etcd client certs as well as server certs in T215531: Deploy upgraded Kubernetes to toolsbeta with an "unstacked" control plane (separate etcd servers) because we found the process of dealing with node failure with a stacked control plane to be kind of awful.

Aug 14 2019, 5:39 PM · cloud-services-team (Kanban), Kubernetes, Cloud-Services
Bstorm moved T215553: Figure out cert management for Toolforge kubernetes and make it clear in documents, etc. for the upgrade from Doing to Needs discussion on the cloud-services-team (Kanban) board.

Adding to discussion in order to discuss the proposal for admin users since that is a change from the behavior of the original system as well as to open the design proposal for comment/questions/rejection/redo in general.

Aug 14 2019, 5:33 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm merged T144153: Move kubernetes authentication to using X.509 client certs into T228499: Toolforge: changes to maintain-kubeusers.
Aug 14 2019, 5:23 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm merged task T144153: Move kubernetes authentication to using X.509 client certs into T228499: Toolforge: changes to maintain-kubeusers.
Aug 14 2019, 5:23 PM · cloud-services-team (Kanban), Kubernetes, Toolforge, Tools-Kubernetes
Bstorm added a comment to T215553: Figure out cert management for Toolforge kubernetes and make it clear in documents, etc. for the upgrade.

The admin users mentioned in that diagram are still theoretical, but there is no reason to require root to interact with a k8s API. It should be straightforward to add a service or a manually run script that maps the <project>.admin group to admin user accounts and places them in the appropriate locations. That will allow Toolforge admins to interact with k8s as easily as they can Grid Engine (and nobody else--they need to use tool accounts). This should simplify playbooks and procedures for dealing with jobs and services that are misbehaving, etc.

Aug 14 2019, 5:08 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T215553: Figure out cert management for Toolforge kubernetes and make it clear in documents, etc. for the upgrade.

This describes essentially what we are now doing. Etcd client and server certs are simply the puppet certs (which should keep etcd flexible in case we need to set up routing into calico somewhere), while certs for users are x.509s generated using the certificates API of k8s. Node certs are generated by k8s as well using kubeadm (which interacts with the certs API using tokens). The certs to manage the CA and PKI are copied between k8s control plane nodes at build time. A new cluster will have a new CA, which honestly prevents leakage nicely.

Aug 14 2019, 5:05 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T229551: Database-reports can't see packages in its virtualenv on the grid.

In the meantime, I did confirm separately that what I documented about using a wrapper explicitly with an activate does work with jsub. I very much like the idea of having it fixed so that isn't necessary, though :)

Aug 14 2019, 2:31 AM · Community-Tech (Kanban (Q1 2019-20)), Patch-For-Review, Tools, Toolforge
Bstorm added a comment to T229551: Database-reports can't see packages in its virtualenv on the grid.

Updated the doc https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Python3_virtualenv/venv_doesn't_seem_to_work_on_Gridengine

Aug 14 2019, 2:05 AM · Community-Tech (Kanban (Q1 2019-20)), Patch-For-Review, Tools, Toolforge
Bstorm added a comment to T229551: Database-reports can't see packages in its virtualenv on the grid.

@zhuyifei1999 was kind enough to put things back so I could prove myself good and solidly wrong about the character set interfering. It is definitely the resolving of symlinks...and that's why a bash wrapper is a good idea here. Thanks @zhuyifei1999 :)
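A small illustration of why symlink resolution matters here. This is a sketch under assumptions, not a reproduction of the grid's behavior: it assumes a POSIX system and a virtualenv whose `bin/python3` is a symlink to a system interpreter, and it uses placeholder files in a temporary directory rather than a real interpreter. If something resolves the symlink before starting Python, the path that leads back to the virtualenv is lost.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Resolve the temp dir itself first so the comparison below is not
    # confused by platform-level symlinks in the temp path.
    tmp = os.path.realpath(tmp)
    system_bin = os.path.join(tmp, "usr", "bin")
    venv_bin = os.path.join(tmp, "venv", "bin")
    os.makedirs(system_bin)
    os.makedirs(venv_bin)

    # Placeholder standing in for the system interpreter.
    real_python = os.path.join(system_bin, "python3")
    open(real_python, "w").close()

    # The virtualenv's interpreter is only a symlink to it.
    venv_python = os.path.join(venv_bin, "python3")
    os.symlink(real_python, venv_python)

    # Two levels up from the executable, with and without resolving the link.
    as_invoked = os.path.dirname(os.path.dirname(venv_python))
    after_resolving = os.path.dirname(os.path.dirname(os.path.realpath(venv_python)))

    venv_root = os.path.join(tmp, "venv")
    usr_root = os.path.join(tmp, "usr")
```

Following the unresolved path leads to the virtualenv directory, while the resolved path leads to the system prefix. A wrapper that activates the virtualenv explicitly sidesteps the question of which path the launcher ends up using.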

Aug 14 2019, 1:56 AM · Community-Tech (Kanban (Q1 2019-20)), Patch-For-Review, Tools, Toolforge
Bstorm added a comment to T229551: Database-reports can't see packages in its virtualenv on the grid.

Ah ok. But that doesn't make sense. My venv works fine. It also is a symlink.

Aug 14 2019, 1:41 AM · Community-Tech (Kanban (Q1 2019-20)), Patch-For-Review, Tools, Toolforge
Bstorm added a comment to T229551: Database-reports can't see packages in its virtualenv on the grid.

The character set changes on the grid seem to affect the resolution of the python search path.

Aug 14 2019, 1:40 AM · Community-Tech (Kanban (Q1 2019-20)), Patch-For-Review, Tools, Toolforge
Bstorm added a comment to T229551: Database-reports can't see packages in its virtualenv on the grid.

Root cause is the character set @zhuyifei1999

Aug 14 2019, 1:40 AM · Community-Tech (Kanban (Q1 2019-20)), Patch-For-Review, Tools, Toolforge
Bstorm added a comment to T229551: Database-reports can't see packages in its virtualenv on the grid.

This tells me that you should definitely use a wrapper script

Aug 14 2019, 1:35 AM · Community-Tech (Kanban (Q1 2019-20)), Patch-For-Review, Tools, Toolforge
Bstorm added a comment to T230442: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

Just to re-emphasize: this system does not have any loads on it at this time, so it's a wonderful time for it to blow up. It can be repaired and rebooted as needed.

Aug 14 2019, 1:33 AM · ops-eqiad, Operations
Bstorm renamed T230442: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only from Degraded RAID on cloudvirt1024 to Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.
Aug 14 2019, 1:32 AM · ops-eqiad, Operations
Bstorm renamed T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only from Degraded RAID on cloudvirt1024 to Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.
Aug 14 2019, 1:32 AM · cloud-services-team, ops-eqiad, Operations
Bstorm added a comment to T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

Per T230442, this appears to be something strange going on, possibly a controller freaking out. It lost 4 disks in a very short time and is now a read-only volume. Feel free to reboot or whatever @Cmjohnson . I included some troubleshooting info on the other ticket.

Aug 14 2019, 1:31 AM · cloud-services-team, ops-eqiad, Operations
Bstorm added a comment to T230442: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

Nothing in the eventlog when I tried to retrieve it.

Aug 14 2019, 1:13 AM · ops-eqiad, Operations
Bstorm added a comment to T230442: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

Some controller info:

Aug 14 2019, 1:12 AM · ops-eqiad, Operations
Bstorm created P8906 (An Untitled Masterwork).
Aug 14 2019, 1:10 AM
Bstorm added a comment to T230442: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

Since the filesystem has gone read-only, I was only able to get part of the firmware terminal logs.

Aug 14 2019, 1:09 AM · ops-eqiad, Operations
Bstorm added a comment to T230442: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

It seems that this is showing a loss of 4 disks. We may want to check a controller in this case.

Aug 14 2019, 1:07 AM · ops-eqiad, Operations

Aug 13 2019

Bstorm added a comment to T229551: Database-reports can't see packages in its virtualenv on the grid.

I will say that I can import that when I run this on an exec node directly, so this isn't a difference in the nodes. It could be a difference in the environment, though, which is what a wrapper might fix.
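
One way to check whether the environment differs is to dump the same settings from an interactive shell on an exec node and from inside a grid job, then diff the two outputs. The following is only a sketch: the function name `environment_report` and the particular set of environment variables are assumptions chosen for illustration, not something taken from the task.

```python
import json
import locale
import os
import sys

# Environment variables that plausibly influence encoding and module lookup.
# This selection is an assumption made for illustration.
WATCHED_VARS = ("LANG", "LC_ALL", "LC_CTYPE", "PYTHONIOENCODING",
                "PYTHONPATH", "VIRTUAL_ENV")


def environment_report():
    """Collect interpreter settings that may differ between a shell and a grid job."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "sys_path": list(sys.path),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "preferred_encoding": locale.getpreferredencoding(False),
        "env": {name: os.environ.get(name) for name in WATCHED_VARS},
    }


if __name__ == "__main__":
    # Run once interactively and once as a job, then compare the two JSON dumps.
    print(json.dumps(environment_report(), indent=2, sort_keys=True))
```

If the two dumps differ only in the locale-related entries, that would point at the environment rather than at the nodes.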

Aug 13 2019, 10:21 PM · Community-Tech (Kanban (Q1 2019-20)), Patch-For-Review, Tools, Toolforge
Bstorm added a comment to T229551: Database-reports can't see packages in its virtualenv on the grid.

The virtualenv is clearly well-formed, but the environment of the grid can be a bit weird, so I know I have to run Python through a shell wrapper that sets a few things, similar to what is mentioned here: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#An_error_with_%22ascii%22_codepage,_%22file_not_found%22,_or_UnicodeEncodeError
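
The comment refers to a shell wrapper. As a rough illustration of the same idea, here it is written in Python instead: set a UTF-8 locale in the environment, then hand over to the virtualenv's interpreter. The locale name `en_US.UTF-8`, the choice of variables, and the helper names are all assumptions for this sketch and should be checked against the linked help page.

```python
import os

# Assumed locale name; use whichever UTF-8 locale is available on the grid.
UTF8_LOCALE = "en_US.UTF-8"


def utf8_environment(base_env):
    """Return a copy of base_env with UTF-8 locale settings forced on."""
    env = dict(base_env)
    env["LANG"] = UTF8_LOCALE
    env["LC_ALL"] = UTF8_LOCALE
    env["PYTHONIOENCODING"] = "utf-8"
    return env


def run_in_virtualenv(venv_python, script, args=()):
    """Replace the current process with the virtualenv interpreter.

    venv_python is the path to the interpreter inside the virtualenv
    (hypothetical example: "/path/to/venv/bin/python").
    """
    argv = [venv_python, script, *args]
    os.execve(venv_python, argv, utf8_environment(os.environ))
```

The job would then be submitted with this wrapper as its entry point rather than the script itself, so the locale is set before the target interpreter starts.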

Aug 13 2019, 10:19 PM · Community-Tech (Kanban (Q1 2019-20)), Patch-For-Review, Tools, Toolforge
Bstorm added a comment to T230147: Toolforge: collect prometheus node exporter metrics from new k8s worker nodes.

That said, this one seems like it might actually be fixable with a pin. :)

Aug 13 2019, 5:01 PM · cloud-services-team (Kanban)
Bstorm added a comment to T230147: Toolforge: collect prometheus node exporter metrics from new k8s worker nodes.

Just for information, there's more than one quirk in building new Jessie K8s nodes. It may be worth it to just document the problem, because pinning doesn't always prevent chicken/egg issues: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Building_new_nodes

Aug 13 2019, 4:51 PM · cloud-services-team (Kanban)
Bstorm moved T230247: Increase VCPU quota for wikidata-query project from Inbox to Clinic Duty on the cloud-services-team (Kanban) board.
Aug 13 2019, 4:42 PM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests), Discovery-Wikidata-Query-Service-Sprint, Wikidata, Wikidata-Query-Service
Bstorm added a project to T230247: Increase VCPU quota for wikidata-query project: cloud-services-team (Kanban).
Aug 13 2019, 4:42 PM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests), Discovery-Wikidata-Query-Service-Sprint, Wikidata, Wikidata-Query-Service
Bstorm moved T230247: Increase VCPU quota for wikidata-query project from Inbox to Approved on the Cloud-VPS (Quota-requests) board.
Aug 13 2019, 4:41 PM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests), Discovery-Wikidata-Query-Service-Sprint, Wikidata, Wikidata-Query-Service
Bstorm added a comment to T230247: Increase VCPU quota for wikidata-query project.

+1 approved

Aug 13 2019, 4:41 PM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests), Discovery-Wikidata-Query-Service-Sprint, Wikidata, Wikidata-Query-Service
Bstorm closed T229955: Request creation of viztest VPS project as Declined.

We declined this project in the team meeting because personal testing projects are not really supported. Though we sympathize with the difficulty of testing patches to ops/puppet, we generally do Puppet testing for our projects in the projects where the other work is being done.

Aug 13 2019, 4:36 PM · Cloud-VPS (Project-requests)
Bstorm moved T228499: Toolforge: changes to maintain-kubeusers from Needs discussion to Doing on the cloud-services-team (Kanban) board.
Aug 13 2019, 4:29 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm updated subscribers of T229936: Examine and prioritize work to change labs namespace to cloud in gerrit as well as groups.

It seems @tstarling might know how to go about this sort of thing?

Aug 13 2019, 3:13 PM · cloud-services-team (Kanban)
Bstorm added a comment to T228500: Toolforge: evaluate ingress mechanism.

Fair enough. I'm concerned we may need to change to the community-supported one at some point (which doesn't need to be now, since there are bound to be similarities). Once this is working, we can try stuff and will know more. If the community-supported one supports dynamic changes of endpoints (as is suggested by that chart), it may be a better fit for many reasons.

Aug 13 2019, 2:38 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes