
Bstorm (Brooke)
Ops Witch -- Wikimedia Cloud Services Team

Projects

User Details

User Since
Jan 22 2018, 10:09 PM (105 w, 1 d)
Availability
Available
IRC Nick
bstorm_
LDAP User
Bstorm
MediaWiki User
BStorm (WMF) [ Global Accounts ]

On the wikis, I'm BStorm (WMF), bstorm_ on IRC and Bstorm on gerrit and WikiTech.

I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.

Recent Activity

Yesterday

Bstorm added projects to T243459: Plan for GSOC or Outreachy 2020 for PAWS: Outreachy, Google-Summer-of-Code.

Adding tags per chats here

Tue, Jan 28, 5:39 PM · Outreachy (Round 20), Google-Summer-of-Code, PAWS
Bstorm added a comment to T243459: Plan for GSOC or Outreachy 2020 for PAWS.

Sorry, All Hands isn't helping me focus on this. I'll check on how it relates to T241019 today (and talk to people on that who might help me know what to do next here). I am finding my limited involvement in PAWS is not helping me scope these tasks out.

Tue, Jan 28, 3:46 PM · Outreachy (Round 20), Google-Summer-of-Code, PAWS

Mon, Jan 27

Bstorm added a comment to T243775: Dumps NFS mounts not available on stat1006.

It looks like fstab needs a change at the get-go. Should just be /, not /dumps now.
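A minimal sketch of that fstab change, assuming the entry uses the usual `host:/export mountpoint fstype options` layout. The host name `nfs.example.org`, the mount point in the sample line, and the helper name are placeholders, not the real stat1006 configuration.

```python
def fix_export_path(fstab_text, old_export="/dumps", new_export="/"):
    """Return fstab text with NFS sources `host:/dumps` rewritten to `host:/`."""
    fixed = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_comment = line.lstrip().startswith("#")
        # Only touch real entries whose filesystem type is nfs or nfs4.
        if not is_comment and len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            host, sep, export = fields[0].partition(":")
            if sep and export == old_export:
                fields[0] = f"{host}:{new_export}"
                line = " ".join(fields)
        fixed.append(line)
    return "\n".join(fixed)


# Hypothetical entry; the real server and mount options are not given in the task.
sample = "nfs.example.org:/dumps /mnt/data nfs ro,noatime 0 0"
print(fix_export_path(sample))
```

Comment lines and non-NFS entries pass through unchanged, so the helper can be run over a whole file.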

Mon, Jan 27, 3:58 PM · Analytics
Bstorm added a comment to T243775: Dumps NFS mounts not available on stat1006.

Shoot, should have checked all these. It may be partially resolved from T243328: "stale file handle" error on notebook1003 when trying to access /mnt/data
I suspect /etc/fstab needs an update, followed by a umount and mount (or a umount and a puppet run, depending on the setup).
This is caused by T242798.

Mon, Jan 27, 3:58 PM · Analytics

Sat, Jan 25

Bstorm added a comment to T243580: "413 Request Entity Too Large" / Kubernetes 2020?.

That's a bag of all kinds of possibilities.

Sat, Jan 25, 12:18 AM · cloud-services-team (Kanban), Tools, Kubernetes, Toolforge

Fri, Jan 24

Bstorm closed T243465: refill-api memory usage is somewhat dangerous to the Kubernetes cluster on Toolforge as Resolved.

I've edited the refill.yaml file to include the new setting and saved the old file to refill.yaml.old.

Fri, Jan 24, 9:30 PM · Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T243465: refill-api memory usage is somewhat dangerous to the Kubernetes cluster on Toolforge.

After waiting a couple days, I took the liberty of modifying the deployment in place.

Fri, Jan 24, 9:24 PM · Toolforge, cloud-services-team (Kanban)
Bstorm closed T241668: Wiki Replicas: `spamblacklist` in `allowed_logtypes` but a comment says that it should be excluded as Resolved.

Comment removed

Fri, Jan 24, 4:30 PM · Security, Documentation, Data-Services, SpamBlacklist, User-DannyS712

Thu, Jan 23

Bstorm added a subtask for T203254: labstore1004 and labstore1005 high load issues following upgrades: T224582: Migrate labstore1004/labstore1005 to Stretch/Buster.
Thu, Jan 23, 9:24 PM · Data-Services, Patch-For-Review, cloud-services-team (Kanban)
Bstorm added a parent task for T224582: Migrate labstore1004/labstore1005 to Stretch/Buster: T203254: labstore1004 and labstore1005 high load issues following upgrades.
Thu, Jan 23, 9:24 PM · Cloud-VPS (Debian Jessie Deprecation), cloud-services-team (Kanban), Operations
Bstorm added a project to T243533: Improve Dell server hardware monitoring: observability.
Thu, Jan 23, 6:14 PM · observability, cloud-services-team (Kanban)
Bstorm triaged T243533: Improve Dell server hardware monitoring as Medium priority.
Thu, Jan 23, 5:27 PM · observability, cloud-services-team (Kanban)
Bstorm created T243533: Improve Dell server hardware monitoring.
Thu, Jan 23, 5:26 PM · observability, cloud-services-team (Kanban)

Wed, Jan 22

Bstorm triaged T222209: Cleanup logging and curl use in wikitech post-block hooks as Medium priority.
Wed, Jan 22, 10:49 PM · cloud-services-team (Kanban), wikitech.wikimedia.org
Bstorm triaged T212012: Create phab tag/form/workflow/something for volunteer priv requests as Medium priority.
Wed, Jan 22, 10:48 PM · cloud-services-team (Kanban)
Bstorm placed T194953: Decide on how to host Rust tools on Toolforge up for grabs.
Wed, Jan 22, 10:47 PM · cloud-services-team (Kanban), Toolforge
Bstorm triaged T194953: Decide on how to host Rust tools on Toolforge as Low priority.

Current tooling would seem to allow anyone to use rust in a tool's home dir. Establishing some kind of supported procedure for launching a rust service is another matter.

Wed, Jan 22, 10:46 PM · cloud-services-team (Kanban), Toolforge
Bstorm closed T215704: webservice command not available when running a job on the task queue of the job grid as Resolved.

This ticket looks like it can be closed. Many of the mentioned issues were resolved elsewhere, and I do not think we actually want to install webservice on the job grid. If I'm wrong, please reopen!

Wed, Jan 22, 10:40 PM · cloud-services-team (Kanban), Toolforge
Bstorm triaged T209992: Document on wikitech the general process of getting a table/column exposed to Wiki Replica users as Medium priority.
Wed, Jan 22, 10:34 PM · cloud-services-team (Kanban), Documentation, Data-Services
Bstorm moved T204359: Investigate and/or deploy LACP to NFS servers for Cloud Services from Doing to Graveyard on the cloud-services-team (Kanban) board.

With the current understanding of the setup that I have, it is likely that we could move some of this to 10G, but the interfaces on the systems that would benefit most are not configured well for LACP right now anyway. I think there are larger issues to resolve around NFS, so putting this in the Graveyard for now.

Wed, Jan 22, 10:33 PM · Data-Services, cloud-services-team (Kanban)
Bstorm triaged T234656: Systems and service continuity and availability constraints as Medium priority.
Wed, Jan 22, 10:31 PM · cloud-services-team (Kanban)
Bstorm assigned T181375: Revamp first boot process for new VMs to Andrew.

Assigning "low" only because I think the really serious things were sorted?

Wed, Jan 22, 10:30 PM · Patch-For-Review, cloud-services-team (Kanban)
Bstorm triaged T238162: Establish a process for renewing TLS certs for the 2 webhook controllers as Medium priority.
Wed, Jan 22, 10:29 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm triaged T240718: Perform failover tests on Ceph storage cluster as Medium priority.
Wed, Jan 22, 10:28 PM · Epic, cloud-services-team (Kanban)
Bstorm added a comment to T242824: Tool account cannot list all namespaced objects in its Kubernetes namespace.

@bd808 How are we feeling on this? With the exception of the podpresets, I think I'm feeling pretty strongly on keeping the others un-listable at the tool level to prevent opportunistic and hijacked accounts from listing things (that are mostly available information somewhere, but not in their live form). By live form, I mean they can be changed on the fly by Toolforge administrators should anything be happening that warrants it, and that would not be documented publicly to non-admins unless that was done intentionally.

Wed, Jan 22, 10:26 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm triaged T225373: `qstat -xml` failures lead to incomprehensible error messages from jsub/jstart/job as Medium priority.
Wed, Jan 22, 10:20 PM · cloud-services-team (Kanban), Toolforge
Bstorm triaged T240969: shinken: all puppet reports showing as 'unknown' as Medium priority.
Wed, Jan 22, 10:18 PM · cloud-services-team (Kanban)
Bstorm triaged T242637: Create a "health check" for Kubernetes worker nodes which validates local Toolforge config as Medium priority.
Wed, Jan 22, 10:18 PM · cloud-services-team (Kanban), Kubernetes, Toolforge
Bstorm triaged T243468: Add smarter resourcing logic to kubernetes backend of webservice as Medium priority.
Wed, Jan 22, 10:17 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm created T243468: Add smarter resourcing logic to kubernetes backend of webservice.
Wed, Jan 22, 10:16 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm closed T236202: Modify webservice and maintain-kubeusers to allow switching to the new cluster, a subtask of T214513: Deploy and migrate tools to a Kubernetes v1.15 or newer cluster, as Resolved.
Wed, Jan 22, 10:10 PM · Wikimedia-Incident, Goal, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm closed T236202: Modify webservice and maintain-kubeusers to allow switching to the new cluster as Resolved.

This piece is really done

Wed, Jan 22, 10:10 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)
Bstorm moved T243465: refill-api memory usage is somewhat dangerous to the Kubernetes cluster on Toolforge from Inbox to Watching on the cloud-services-team (Kanban) board.
Wed, Jan 22, 10:07 PM · Toolforge, cloud-services-team (Kanban)
Bstorm added projects to T243465: refill-api memory usage is somewhat dangerous to the Kubernetes cluster on Toolforge: cloud-services-team (Kanban), Toolforge.
Wed, Jan 22, 10:07 PM · Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T243465: refill-api memory usage is somewhat dangerous to the Kubernetes cluster on Toolforge.

If this is moved to the new Kubernetes cluster (as described here: https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration), there are automatic limits to how much RAM the worker can consume, but it still must implement a higher requests value than the default or it can overrun the RAM of the node. The requests part of a container definition helps determine if there is room on the node it is getting placed on.
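A simplified sketch of why the requests value matters, assuming a scheduler that only sums memory requests when deciding placement. The helper names and all numbers are illustrative and do not describe the real refill-api deployment.

```python
def container_resources(request_mib, limit_mib):
    """Build the `resources` stanza of a container spec (memory only)."""
    return {
        "resources": {
            "requests": {"memory": f"{request_mib}Mi"},
            "limits": {"memory": f"{limit_mib}Mi"},
        }
    }


def fits_on_node(allocatable_mib, existing_requests_mib, new_request_mib):
    """Mimic the placement check: requests are summed, real usage is not."""
    return sum(existing_requests_mib) + new_request_mib <= allocatable_mib


# Illustrative numbers only: an 8 GiB node already holding 7 GiB of requests.
node, placed = 8192, [4096, 3072]
# A small request is admitted even if the pod later uses far more RAM.
print(fits_on_node(node, placed, 256))
# A request closer to real usage is turned away from this node.
print(fits_on_node(node, placed, 2048))
print(container_resources(2048, 4096))
```

With a low default request, the first case is how a memory-hungry pod can land on a node that has no real room for it.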

Wed, Jan 22, 9:52 PM · Toolforge, cloud-services-team (Kanban)
Bstorm triaged T243465: refill-api memory usage is somewhat dangerous to the Kubernetes cluster on Toolforge as High priority.
Wed, Jan 22, 9:49 PM · Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T243159: Request to enable node version 12.14.1 in toolforge to deploy VideoCutTool.

You can customize $PATH per tool on the pod level by simply adding some standard areas into the search path for that type of pod. If it doesn't exist, it won't harm anything. If we add an env array, we may have to also add the $HOME var in case we overwrite the pod preset (which would need a quick test to be sure).
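A rough sketch of what such an env array could look like, assuming the container spec takes the standard list of `name`/`value` pairs. The tool home, the extra directories, and the base search path are placeholders, and whether HOME really has to be restated still needs the quick test mentioned above.

```python
def tool_env(home, extra_dirs, base_path="/usr/local/bin:/usr/bin:/bin"):
    """Build a container `env` array that extends PATH and keeps HOME set."""
    search_path = ":".join(list(extra_dirs) + [base_path])
    return [
        {"name": "HOME", "value": home},
        {"name": "PATH", "value": search_path},
    ]


# Hypothetical tool home; directories that do not exist are simply skipped at lookup time.
home = "/data/project/example-tool"
print(tool_env(home, [f"{home}/bin", f"{home}/node_modules/.bin"]))
```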

Wed, Jan 22, 3:20 PM · Toolforge (Software install/update), VideoCutTool

Tue, Jan 21

Bstorm closed T243328: "stale file handle" error on notebook1003 when trying to access /mnt/data as Resolved.
[bstorm@notebook1003]:~ $ ls -al /mnt/data/xmldatadumps/public/enwiki/20200101/enwiki-20200101-pages-meta-current.xml.bz2
-rw-r--r-- 1 400 400 32529073303 Jan  3 05:58 /mnt/data/xmldatadumps/public/enwiki/20200101/enwiki-20200101-pages-meta-current.xml.bz2

Looks good now :)

Tue, Jan 21, 10:09 PM · Analytics
Bstorm added a comment to T243159: Request to enable node version 12.14.1 in toolforge to deploy VideoCutTool.

Ah, ok.

Tue, Jan 21, 10:07 PM · Toolforge (Software install/update), VideoCutTool
Bstorm added a comment to T243328: "stale file handle" error on notebook1003 when trying to access /mnt/data.

I see, the puppetization needs updating.

Tue, Jan 21, 9:30 PM · Analytics
Bstorm added a comment to T243328: "stale file handle" error on notebook1003 when trying to access /mnt/data.

We changed the exports. The mounts need to be unmounted and remounted. I can do that.
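A minimal sketch of that unmount and remount, assuming the mount point is listed in /etc/fstab so that `mount <mountpoint>` can find its source. The helper names are made up for illustration, and the default is a dry run that executes nothing.

```python
import subprocess


def remount_commands(mount_point):
    """Build the umount/mount pair for one mount point listed in /etc/fstab."""
    return [["umount", mount_point], ["mount", mount_point]]


def remount(mount_point, dry_run=True):
    """Run the commands in order; with dry_run=True they are only returned."""
    commands = remount_commands(mount_point)
    if not dry_run:
        for command in commands:
            # check=True stops at the first failure (for example a busy mount).
            subprocess.run(command, check=True)
    return commands


print(remount("/mnt/data"))
```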

Tue, Jan 21, 7:52 PM · Analytics
Bstorm updated subscribers of T243159: Request to enable node version 12.14.1 in toolforge to deploy VideoCutTool.

The best way to use a more recent version on Toolforge would be NVM to my knowledge. @bd808 that works on k8s as well as the grid, right?

Tue, Jan 21, 7:43 PM · Toolforge (Software install/update), VideoCutTool
Bstorm moved T243304: Create a script to remount cloud NFS clients from Backlog to Shared Storage on the Data-Services board.
Tue, Jan 21, 6:22 PM · Data-Services, cloud-services-team (Kanban)
Bstorm added a comment to T243304: Create a script to remount cloud NFS clients.

There may be some work already done here...this may be a new plugin for cumin really. I am not 100% sure, but targeting NFS clients is the goal with admin tasks that can be quickly deployed.

Tue, Jan 21, 6:22 PM · Data-Services, cloud-services-team (Kanban)
Bstorm triaged T243304: Create a script to remount cloud NFS clients as Medium priority.
Tue, Jan 21, 6:21 PM · Data-Services, cloud-services-team (Kanban)

Sun, Jan 19

Bstorm added a comment to T229920: WMCS: migrate python2 scripts to python3.

We are on Pike now. I might argue that ignoring the others is not the worst idea.

Sun, Jan 19, 8:00 PM · Patch-For-Review, Epic, cloud-services-team (Kanban)

Fri, Jan 17

Bstorm closed T242256: Request creation of ocrtoy VPS project as Resolved.

Project should be available at https://horizon.wikimedia.org
Let us know if there are any problems.

Fri, Jan 17, 5:06 PM · cloud-services-team (Kanban), Cloud-VPS (Project-requests)
Bstorm closed T242812: Request creation of centralnotice-staging VPS project as Resolved.

You should be able to access the project in horizon now. Please let us know if there are any issues.

Fri, Jan 17, 4:56 PM · cloud-services-team (Kanban), Cloud-VPS (Project-requests)
Bstorm claimed T242812: Request creation of centralnotice-staging VPS project.
Fri, Jan 17, 4:39 PM · cloud-services-team (Kanban), Cloud-VPS (Project-requests)
Bstorm added a comment to T242719: https://tools.wmflabs.org/{toolname} no longer redirects to https://tools.wmflabs.org/{toolname}/ on new k8s cluster.

Note: because of a change in the way restarts work (they are lighter now and don't destroy ingresses), anyone looking to use the new ingress setting should run `webservice stop` and then `webservice start --backend kubernetes <whatever>`.

Fri, Jan 17, 3:14 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes

Thu, Jan 16

Bstorm closed T140415: `webservice restart` does not always wait for service to stop before trying to start again, a subtask of T139107: Issues with 'webservice' kubernetes backend (tracking), as Resolved.
Thu, Jan 16, 11:59 PM · Toolforge, Kubernetes, Tracking-Neverending
Bstorm closed T140415: `webservice restart` does not always wait for service to stop before trying to start again as Resolved.

At this point, the restart function is a simple killing of pods. The new cluster also responds differently in general.

Thu, Jan 16, 11:59 PM · cloud-services-team (Kanban), Kubernetes, Toolforge, Tools-Kubernetes
Bstorm added a comment to T197930: Build or backport .deb for kubernetes python client for Stretch and Buster.

For the near term, I feel like doing a "just enough custom code to support existing use cases" approach to replacing pykube in webservice is the right thing to do. We need to unblock Kubernetes upgrades now. We do not need to solve all of our other dreams about a better cli tool to do that.

Thu, Jan 16, 8:23 PM · cloud-services-team (Kanban), Toolforge, Kubernetes
Bstorm added a comment to T197930: Build or backport .deb for kubernetes python client for Stretch and Buster.

Proposal: Cancel this chain of tasks based on the discussion above. Create a new task to modify webservice's Kubernetes backend for the new cluster only with design being the first step.

Thu, Jan 16, 8:17 PM · cloud-services-team (Kanban), Toolforge, Kubernetes
Bstorm added a comment to T197930: Build or backport .deb for kubernetes python client for Stretch and Buster.

Ooooorrr, that could be a service in every tool namespace. That means it isn't a monolith with access to anything but itself. It runs as the default service account and responds to its owner's cert for the simple commands of start, stop and restart, handling all communication with k8s on its own. That would fix the auth problem without creating a global sudo of any kind. It could be expanded to include a token auth system from CI in the future, even....@bd808

Thu, Jan 16, 6:56 PM · cloud-services-team (Kanban), Toolforge, Kubernetes
Bstorm added a comment to T242397: Make webservice grid jobs "non-rerunable".

Note: I don't have any idea why the output says "queue none":
hard_queue_list: webgrid-lighttpd
Also
2937052 0.32584 lighttpd-b tools.bd808- r 12/12/2019 12:48:52 webgrid-li MASTER

Thu, Jan 16, 6:39 PM · Toolforge, cloud-services-team (Kanban)
Bstorm closed T242397: Make webservice grid jobs "non-rerunable" as Resolved.

Where job 2937052 happens to be bd808-test2:

Thu, Jan 16, 6:36 PM · Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T197930: Build or backport .deb for kubernetes python client for Stretch and Buster.

random idea: would it make sense to have the webservice mechanism be a REST API that users call with a very simple script, and hide all the gory details behind a service under our control? That would greatly reduce some of the complex things you are mentioning (versioning matrix, etc), and offer other benefits. It has some challenges, like auth, etc.

Thu, Jan 16, 6:31 PM · cloud-services-team (Kanban), Toolforge, Kubernetes
Bstorm closed T242817: m5 ran out of connections after openstack upgrade to "Pike", a subtask of T188589: m5-master overloaded by idle connections to the nova database, as Resolved.
Thu, Jan 16, 4:08 PM · Operations, Cloud-Services, DBA
Bstorm closed T242817: m5 ran out of connections after openstack upgrade to "Pike" as Resolved.

Neutron has 159 open connections now. I think this is fixed for the time being.

Thu, Jan 16, 4:08 PM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm added a comment to T197930: Build or backport .deb for kubernetes python client for Stretch and Buster.

TLDR: I agree. Let's do that instead.

Thu, Jan 16, 4:06 PM · cloud-services-team (Kanban), Toolforge, Kubernetes
Bstorm added a comment to T242817: m5 ran out of connections after openstack upgrade to "Pike".

Neutron is now at 160, and things seem fairly stable. I'm going to reduce the max_connections again.

Thu, Jan 16, 12:38 AM · Cloud-VPS, cloud-services-team (Kanban)

Wed, Jan 15

Bstorm added a comment to T242817: m5 ran out of connections after openstack upgrade to "Pike".

Still holding at 154 total neutron connections.

Wed, Jan 15, 9:45 PM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm added a comment to T242824: Tool account cannot list all namespaced objects in its Kubernetes namespace.

I could be convinced on podpresets, but it is an alpha API. I'm not sure it's a good idea to expose it much.

Wed, Jan 15, 7:03 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T242824: Tool account cannot list all namespaced objects in its Kubernetes namespace.

In general, most of those are not listable in order to remove unnecessary or disallowed APIs from shell users. networkpolicies is the other one, besides events, that you can list because you are able to interact with them. Interestingly, you cannot list events on the old cluster.

Wed, Jan 15, 7:00 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T242824: Tool account cannot list all namespaced objects in its Kubernetes namespace.

Some of this is a quirk of the query. It might be better to test using the auth can-i method.
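A small sketch of that kind of check, assuming `kubectl auth can-i <verb> <resource> --namespace <ns>` is run as the tool account and answers yes or no. The namespace name is a placeholder; the resource names are the ones discussed in this task.

```python
def can_i_commands(namespace, resources, verb="list"):
    """Build one `kubectl auth can-i` invocation per resource type."""
    return [
        ["kubectl", "auth", "can-i", verb, resource, "--namespace", namespace]
        for resource in resources
    ]


# Placeholder namespace; the commands are only printed here, not executed.
for command in can_i_commands("tool-example", ["events", "networkpolicies", "podpresets"]):
    print(" ".join(command))
```

Asking per resource and verb avoids the ambiguity of a single bulk listing query.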

Wed, Jan 15, 6:32 PM · Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T242893: puppetmaster broken in the cloudstore project.

Hrm. Now I cannot seem to ssh to it. :)

Wed, Jan 15, 5:15 PM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm updated the task description for T242893: puppetmaster broken in the cloudstore project.
Wed, Jan 15, 5:13 PM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm triaged T242893: puppetmaster broken in the cloudstore project as Medium priority.
Wed, Jan 15, 5:12 PM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm added a comment to T197930: Build or backport .deb for kubernetes python client for Stretch and Buster.

Some responses and thoughts as well:

  • It seems reasonable to me to stop producing webservice packages for jessie after moving to this library, leaving jessie containers to use the version they have as part of deprecation processes. I mean, Debian is doing that, right? If we think of it that way, that would kind of stop all concern about jessie with regard to the webservice package. If a new feature in webservice is needed, run it outside a container, as long as the webservice-runner still works. I do highly question how much it matters to support running the webservice frontend command inside a container anyway (as convenient as it may be).
  • https://pypi.org/project/kubernetes/ <-- recent versions of the official client still support python2, so we might be able to do this task for future proofing/scripting and just tack on py2 for webservice. However, I don't expect them to support it for long, and staying up-to-date on this library is something I consider a serious priority for security and sustainability.
  • It is also important to remember that this general topic blocks Kubernetes upgrades past the current 1.15 minor version, which is not good (they are already at 1.17 upstream), and that adds weight to avoiding python3 purity for now.
Wed, Jan 15, 3:50 PM · cloud-services-team (Kanban), Toolforge, Kubernetes
Bstorm added a comment to T242817: m5 ran out of connections after openstack upgrade to "Pike".

After @Andrew merged that last change, it's looking a bit better.

Wed, Jan 15, 3:22 PM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm added a comment to T242817: m5 ran out of connections after openstack upgrade to "Pike".

It is a bit scary that with each openstack upgrade we seem to be seeing these issues, along with the need for more open connections. Is that something OpenStack assumes? (The fact that with every new version there is a need to keep increasing connections on the database.)

Wed, Jan 15, 3:14 PM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm added a comment to T242817: m5 ran out of connections after openstack upgrade to "Pike".

Just after services died (reducing connections a bit), I saw this, so we know it is neutron that is the problem:

Wed, Jan 15, 12:49 AM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm added a comment to T242817: m5 ran out of connections after openstack upgrade to "Pike".

Connections are currently at 340 after the above actions, so we have some wiggle room.
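A back-of-the-envelope sketch of that wiggle room, assuming the temporary max_connections of 600 set on the m5 master. The per-user tally works on rows shaped like `SHOW PROCESSLIST` output, and the sample rows are made up purely to show the shape of the input.

```python
from collections import Counter


def connections_by_user(processlist_rows):
    """Count open connections per database user from processlist-style rows."""
    return Counter(row["User"] for row in processlist_rows)


def headroom(max_connections, current_connections):
    """Connections still available before the server refuses new ones."""
    return max_connections - current_connections


# 600 and 340 are the figures quoted in this task; the rows below are invented.
print(headroom(600, 340))
rows = [{"User": "neutron"}, {"User": "neutron"}, {"User": "nova"}]
print(connections_by_user(rows).most_common(1))
```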

Wed, Jan 15, 12:38 AM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm added a comment to T242817: m5 ran out of connections after openstack upgrade to "Pike".

Connecting this to the saga of the DB connections and Openstack such as T237196: openstack-nova running out of database connections

Wed, Jan 15, 12:38 AM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm added a subtask for T188589: m5-master overloaded by idle connections to the nova database: T242817: m5 ran out of connections after openstack upgrade to "Pike".
Wed, Jan 15, 12:36 AM · Operations, Cloud-Services, DBA
Bstorm added a parent task for T242817: m5 ran out of connections after openstack upgrade to "Pike": T188589: m5-master overloaded by idle connections to the nova database.
Wed, Jan 15, 12:36 AM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm updated subscribers of T242817: m5 ran out of connections after openstack upgrade to "Pike".

In the course of this, @JHedden restarted several services, which reduced current connection usage to sane levels, and I set the max_connections on the m5 master to 600 to give more breathing room for troubleshooting (note to @Marostegui and @jcrespo that I did that and don't intend to keep it that way).

Wed, Jan 15, 12:34 AM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm updated the task description for T242817: m5 ran out of connections after openstack upgrade to "Pike".
Wed, Jan 15, 12:31 AM · Cloud-VPS, cloud-services-team (Kanban)
Bstorm triaged T242817: m5 ran out of connections after openstack upgrade to "Pike" as High priority.
Wed, Jan 15, 12:31 AM · Cloud-VPS, cloud-services-team (Kanban)

Tue, Jan 14

Bstorm added a comment to T240852: CloudVPS: horizon giving http/500 intermitently.

@Andrew and I are going to pair up on this in case that helps at all soon

Tue, Jan 14, 5:28 PM · Horizon, cloud-services-team (Kanban)
Bstorm added a comment to T145703: Horizon loses credentials every day.

@Andrew and I are going to pair up on this in case that helps at all soon

Tue, Jan 14, 5:27 PM · Security, cloud-services-team (Kanban), Horizon
Bstorm added a comment to T242719: https://tools.wmflabs.org/{toolname} no longer redirects to https://tools.wmflabs.org/{toolname}/ on new k8s cluster.

I apparently did not add this same logic to the fourohfour handler tool.

Tue, Jan 14, 4:39 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm added a comment to T242397: Make webservice grid jobs "non-rerunable".

Ok, so while I knew the jobs were "rerunable" because I'd done it, @bd808 wisely looked at an individual job and found that it was marked "not rerunable" per the default. The problem is that the queue config for this marks *everything* as rerunable, and we cannot override it at the job level, apparently.

Tue, Jan 14, 1:08 AM · Toolforge, cloud-services-team (Kanban)
Bstorm added a comment to T242637: Create a "health check" for Kubernetes worker nodes which validates local Toolforge config.

TaintNodesByCondition is luckily default on 1.15, so basic things are checked on the new cluster (not on the old, btw). The one thing that "puppet works and contributes a full config" doesn't satisfy here is monitoring the current state of our special needs like sssd. A node-tainting daemonset might still be worth it from that perspective. (a very basic idea is https://github.com/uswitch/nidhogg...and then we'd just need a daemonset that should be running on all webservice nodes that mounts things and connects to sssd). This (or any daemonset that notices a problem and applies a taint) would effectively drop the node from the pool (without "cordon"...and our monitoring would need a bit more nuance.)
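A minimal sketch of the decision such a daemonset might make, assuming it already has the results of its local checks. The taint key and the check names are hypothetical; only the `key`/`value`/`effect` shape and the `NoSchedule` effect are standard Kubernetes.

```python
def taint_for(checks, key="example.org/toolforge-not-ready"):
    """Return a NoSchedule taint if any local check failed, else None."""
    failed = sorted(name for name, ok in checks.items() if not ok)
    if not failed:
        return None
    # The value records which checks failed, joined with '.' to stay a valid label value.
    return {"key": key, "value": ".".join(failed), "effect": "NoSchedule"}


# Hypothetical check results for one worker node.
print(taint_for({"sssd": True, "nfs-mounts": True}))
print(taint_for({"sssd": False, "nfs-mounts": True}))
```

Applying or removing the returned taint through the API is left out here; pods without a matching toleration would stop being scheduled onto a tainted node.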

Tue, Jan 14, 1:05 AM · cloud-services-team (Kanban), Kubernetes, Toolforge
Bstorm closed T242632: Apparent issues in Toolforge Kubernetes as Invalid.

I think that in toolsbeta, we had a filesystem issue on the only old-cluster node and those nodes were also causing issues. This ticket is basically not needed.

Tue, Jan 14, 12:57 AM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T242559: Partialy setup tools-k8s-worker instances created by novaadmin causing problems.

This feels icky - such nodes would then be reserved only for webservice things and other pods that know to tolerate that taint. Maybe it should be the other way around - nodes are tainted until we know they are able to run webservice and co.
I also think that /data/project itself (the missing thing here) is probably useful for things that don't have web services, it seems pretty fundamental to the tools project - shouldn't it be available on all nodes all the time?

Tue, Jan 14, 12:55 AM · cloud-services-team (Kanban), Toolforge

Mon, Jan 13

dduvall awarded T214513: Deploy and migrate tools to a Kubernetes v1.15 or newer cluster a Love token.
Mon, Jan 13, 7:45 PM · Wikimedia-Incident, Goal, Epic, Toolforge, cloud-services-team (Kanban), Kubernetes
Bstorm awarded T242642: Cleanup unsigned puppet client certs on tools-puppetmaster-01 a The World Burns token.
Mon, Jan 13, 5:51 PM · Puppet, cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T242632: Apparent issues in Toolforge Kubernetes.

The issues with glamtools are a failure to read a config from what I can tell, with consistent errors across php versions and clusters.

Mon, Jan 13, 5:50 PM · cloud-services-team (Kanban), Toolforge
Bstorm lowered the priority of T242632: Apparent issues in Toolforge Kubernetes from High to Medium.

I'm now convinced that what I've seen so far has nothing to do with images or the infrastructure at this point. I'm very much hoping that this gets closed as invalid.

Mon, Jan 13, 5:50 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T242637: Create a "health check" for Kubernetes worker nodes which validates local Toolforge config.

TaintNodesByCondition is luckily default on 1.15, so basic things are checked on the new cluster (not on the old, btw). The one thing that "puppet works and contributes a full config" doesn't satisfy here is monitoring the current state of our special needs like sssd. A node-tainting daemonset might still be worth it from that perspective. (a very basic idea is https://github.com/uswitch/nidhogg...and then we'd just need a daemonset that should be running on all webservice nodes that mounts things and connects to sssd)

Mon, Jan 13, 5:31 PM · cloud-services-team (Kanban), Kubernetes, Toolforge
Bstorm renamed T242632: Apparent issues in Toolforge Kubernetes from Issue with some buster docker images in Toolforge Kubernetes to Apparent issues in Toolforge Kubernetes.
Mon, Jan 13, 4:51 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T242632: Apparent issues in Toolforge Kubernetes.

Got it running on the new cluster as well. The image is ok.

Mon, Jan 13, 4:50 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T242632: Apparent issues in Toolforge Kubernetes.

Ok, now that I rebooted that node, it's fine on the old cluster. The image may be fine.

Mon, Jan 13, 4:46 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T242637: Create a "health check" for Kubernetes worker nodes which validates local Toolforge config.

That's the mechanism we use already to configure lots about the kubelet

Mon, Jan 13, 4:45 PM · cloud-services-team (Kanban), Kubernetes, Toolforge
Bstorm added a comment to T242637: Create a "health check" for Kubernetes worker nodes which validates local Toolforge config.

Got an idea here: --register-with-taints []api.Taint is a CLI option
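A small sketch of how that option's value could be assembled, assuming the usual `key=value:effect` form with commas between taints. The taint key and value are placeholders.

```python
def register_with_taints_flag(taints):
    """Format taints as the kubelet's --register-with-taints value."""
    parts = [f"{t['key']}={t['value']}:{t['effect']}" for t in taints]
    return "--register-with-taints=" + ",".join(parts)


# Hypothetical taint that keeps new pods off a node until it is cleared.
print(register_with_taints_flag(
    [{"key": "example.org/toolforge-not-ready", "value": "true", "effect": "NoSchedule"}]
))
```

A node registered this way starts out tainted, so something still has to remove the taint once the node is known to be ready.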

Mon, Jan 13, 4:44 PM · cloud-services-team (Kanban), Kubernetes, Toolforge
Bstorm added a comment to T242637: Create a "health check" for Kubernetes worker nodes which validates local Toolforge config.

From T242559#5798302

I might suggest we look at adding a taint to nodes that run webservice that only gets added when we are sure a node is ready to run a webservice process. Automating such a taint is tricky without puppetdb and with puppet in general, but it would be a way we could gate things at the end of a "checklist", if you will, like a message at the end of puppet to add the taint. Unless it can be added via the kubelet API (something to look at).

Mon, Jan 13, 4:40 PM · cloud-services-team (Kanban), Kubernetes, Toolforge
Bstorm added a comment to T242632: Apparent issues in Toolforge Kubernetes.

Some of what I'm seeing appears to be related to a bad filesystem on the worker node. Well, that really doesn't help or relate to toolforge in any way.

Mon, Jan 13, 4:38 PM · cloud-services-team (Kanban), Toolforge
Bstorm added a comment to T242632: Apparent issues in Toolforge Kubernetes.

Fun thing...on the old cluster, the pods aren't fully deleting. They are ending up stuck in "terminating". Note that I'm testing this in toolsbeta, so this has nothing to do with the particular clusters. These are just set up similarly. It's the container images.

Mon, Jan 13, 4:35 PM · cloud-services-team (Kanban), Toolforge