I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.
Adding tags per chats here
Sorry, All Hands isn't helping me focus on this. I'll check on how it relates to T241019 today (and talk to people on that who might help me know what to do next here). I am finding my limited involvement in PAWS is not helping me scope these tasks out.
Mon, Jan 27
It looks like /etc/fstab needs a change at the get-go: the export path should just be /, not /dumps, now.
Shoot, should have checked all these. It may be partially resolved from T243328: "stale file handle" error on notebook1003 when trying to access /mnt/data
I suspect /etc/fstab needs an update, followed by a umount and mount (or a umount and a puppet run, depending on the setup).
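In concrete terms, something like this on the affected host (the actual server name in the mount source will differ; `<nfs-server>` is just a placeholder):
```
# swap the old :/dumps export for :/ in fstab, then remount /mnt/data
sudo sed -i 's#<nfs-server>:/dumps#<nfs-server>:/#' /etc/fstab
sudo umount /mnt/data
sudo mount /mnt/data    # or let a puppet run handle the remount, depending on the setup
```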
This is caused by T242798.
Sat, Jan 25
That's a bag of all kinds of possibilities.
Fri, Jan 24
I've edited the refill.yaml file to include the new setting and saved the old file to refill.yaml.old.
After waiting a couple days, I took the liberty of modifying the deployment in place.
Thu, Jan 23
Wed, Jan 22
Current tooling would seem to allow anyone to use rust in a tool's home dir. Establishing some kind of supported procedure for launching a rust service is another matter.
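As a rough sketch of what that looks like today (untested here; rustup's defaults do all the work since everything lands under the tool's $HOME):
```
# become the tool, then install the toolchain into $HOME/.rustup and $HOME/.cargo
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"
cargo new mytool && cd mytool && cargo build --release   # "mytool" is just an example name
```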
This ticket looks like it can be closed. Many of the mentioned issues were resolved elsewhere, and I do not think we actually want to install webservice on the job grid. If I'm wrong, please reopen!
With my current understanding of the setup, it is likely that we could move some of this to 10G, but the interfaces on the systems that would benefit most are not well configured for LACP right now anyway. I think there are larger issues to resolve around NFS, so I'm putting this in the Graveyard for now.
Assigning "low" only because I think the really serious things were sorted?
@bd808 How are we feeling on this? With the exception of the podpresets, I feel pretty strongly about keeping the others un-listable at the tool level to prevent opportunistic and hijacked accounts from listing things (which are mostly available information somewhere, but not in their live form). By live form, I mean they can be changed on the fly by Toolforge administrators should anything be happening that warrants it, and that change would not be documented publicly to non-admins unless it was done intentionally.
This piece is really done
If this is moved to the new Kubernetes cluster (as described here: https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration), there are automatic limits on how much RAM the worker can consume, but it still needs to set a higher requests value than the default or it can overrun the RAM of the node. The requests part of a container definition is what the scheduler uses to determine whether there is room on the node it is being placed on.
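To illustrate (deployment name and sizes are made up; in practice webservice or the tool's own manifest would set this):
```
# bump the memory request (and limit) on a tool's deployment so the scheduler
# reserves enough room on the node for the worker
kubectl -n tool-example set resources deployment/example \
  --requests=memory=2Gi --limits=memory=3Gi
```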
You can customize $PATH per tool at the pod level simply by adding some standard locations to the search path for that type of pod. If a directory doesn't exist, it won't harm anything. If we add an env array, we may also have to add the $HOME var in case we overwrite the pod preset (which would need a quick test to be sure).
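Roughly what the end result would look like on a tool's deployment (names and paths are made up; webservice would actually bake this into the pod spec it generates):
```
# add PATH (and HOME, since overriding env can clobber what the pod preset provides)
kubectl -n tool-example set env deployment/example \
  PATH=/usr/local/bin:/usr/bin:/bin:/data/project/example/bin \
  HOME=/data/project/example
```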
Tue, Jan 21
```
[bstorm@notebook1003]:~ $ ls -al /mnt/data/xmldatadumps/public/enwiki/20200101/enwiki-20200101-pages-meta-current.xml.bz2
-rw-r--r-- 1 400 400 32529073303 Jan 3 05:58 /mnt/data/xmldatadumps/public/enwiki/20200101/enwiki-20200101-pages-meta-current.xml.bz2
```
Looks good now :)
I see the puppetization needs updating.
We changed the exports. The mounts need to be unmounted and remounted. I can do that.
The best way to use a more recent version on Toolforge would be nvm, to my knowledge. @bd808, that works on k8s as well as the grid, right?
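Roughly what that looks like from the tool account (version numbers are just examples; everything installs under $HOME, so it behaves the same on the grid and in a k8s pod):
```
# install nvm into $HOME/.nvm, then install/use whatever Node version the tool needs
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.35.2/install.sh | bash
export NVM_DIR="$HOME/.nvm" && . "$NVM_DIR/nvm.sh"
nvm install 12
nvm use 12
```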
There may be some work already done here... this may really be a new plugin for cumin. I am not 100% sure, but the goal is to target NFS clients with admin tasks that can be deployed quickly.
Sun, Jan 19
We are on Pike now. I might argue that ignoring the others is not the worst idea.
Fri, Jan 17
Project should be available at https://horizon.wikimedia.org
Let us know if there are any problems.
You should be able to access the project in horizon now. Please let us know if there are any issues.
Note: because of a change in the way restarts work (they are lighter now and don't destroy ingresses), anyone looking to use the new ingress setting should run `webservice stop` and then `webservice start --backend kubernetes <whatever>`.
Thu, Jan 16
At this point, the restart function is a simple killing of pods. The new cluster also responds differently in general.
Proposal: Cancel this chain of tasks based on the discussion above. Create a new task to modify webservice's Kubernetes backend for the new cluster only with design being the first step.
Ooooorrr, that could be a service in every tool namespace. That way it isn't a monolith with access to anything but itself. It would run as the default service account and respond to its owner's cert with the simple commands of start, stop, and restart, handling all communication with k8s on its own. That would fix the auth problem without creating a global sudo of any kind. It could even be expanded to include a token auth system from CI in the future.... @bd808
Note: I don't have any idea why the output says "queue none":
```
2937052 0.32584 lighttpd-b tools.bd808- r 12/12/2019 12:48:52 webgrid-li MASTER
```
Where job 2937052 happens to be bd808-test2:
Neutron has 159 open connections now. I think this is fixed for the time being.
TLDR: I agree. Let's do that instead.
Neutron is now at 160, and things seem fairly stable. I'm going to reduce the max_connections again.
Wed, Jan 15
Still holding at 154 total neutron connections.
I could be convinced on podpresets, but it is an alpha API. I'm not sure it's a good idea to expose it much.
In general, most of those are not listable in order to keep unnecessary or disallowed APIs away from shell users. networkpolicies is the other one, besides events, that you can list, because you are able to interact with them. Interestingly, you cannot list events on the old cluster.
Some of this is a quirk of the query. It might be better to test using the auth can-i method.
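i.e. something like this from a bastion with the tool's kubeconfig, to see what the cert can actually do rather than what the earlier query implies:
```
# check list access resource by resource
kubectl auth can-i list pods
kubectl auth can-i list networkpolicies
kubectl auth can-i list podpresets
```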
Hrm. Now I cannot seem to ssh to it. :)
Some responses and thoughts as well:
- It seems reasonable to me to stop producing webservice packages for jessie after moving to this library, leaving jessie containers on the version they already have as part of the deprecation process. I mean, Debian is doing that, right? Thought of that way, it would pretty much end any concern about jessie with regard to the webservice package. If a new feature in webservice is needed, run it outside a container, as long as webservice-runner still works. I do highly question how much it matters to support running the webservice frontend command inside a container anyway (as convenient as it may be).
- https://pypi.org/project/kubernetes/ <-- recent versions of the official client still support python2, so we might be able to do this task for future-proofing/scripting and just tack py2 support onto webservice. However, I don't expect them to support it for long, and staying up to date on this library is something I consider a serious priority for security and sustainability.
- It is also important to remember that this general topic blocks Kubernetes upgrades past the current 1.15 minor version, which is not good (they are already at 1.17 upstream) and adds weight to avoiding python3 purity for now.
After @Andrew merged that last change, it's looking a bit better.
Just after services died (reducing connections a bit), I saw this, so we know it is neutron that is the problem:
Connections are currently at 340 after the above actions, so we have some wiggle room.
Connecting this to the saga of DB connections and OpenStack, such as T237196: openstack-nova running out of database connections.
In the course of this, @JHedden restarted several services, which reduced current connection usage to sane levels, and I set max_connections on the m5 master to 600 to give more breathing room for troubleshooting (note to @Marostegui and @jcrespo: I did that and don't intend to keep it that way).
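Roughly the change that was made (exact invocation may have differed; it's a runtime-only tweak):
```
# on the m5 master; SET GLOBAL does not survive a restart unless it is also persisted in config
sudo mysql -e "SET GLOBAL max_connections = 600;"
sudo mysql -e "SHOW GLOBAL VARIABLES LIKE 'max_connections';"
```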
Tue, Jan 14
@Andrew and I are going to pair up on this in case that helps at all soon
Ok, so while I knew the jobs were "rerunable" because I'd done it, @bd808 wisely looked at an individual job and found that it was marked "not rerunable", per the default. The problem is that the queue config for this marks *everything* as rerunable, and apparently we cannot override that at the job level.
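For reference, roughly where each setting lives (queue name and job id are just placeholders here):
```
# queue-level config, which is what marks everything as rerunnable:
qconf -sq <queue> | grep -i rerun
# full per-job details, where the "not rerunable" default presumably showed up:
qstat -j <jobid>
```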
TaintNodesByCondition is luckily on by default in 1.15, so basic things are checked on the new cluster (not on the old, btw). The one thing that "puppet works and contributes a full config" doesn't satisfy here is monitoring the current state of our special needs like sssd. A node-tainting daemonset might still be worth it from that perspective (a very basic idea is https://github.com/uswitch/nidhogg... and then we'd just need a daemonset running on all webservice nodes that mounts things and connects to sssd). This (or any daemonset that notices a problem and applies a taint) would effectively drop the node from the pool without "cordon"... and our monitoring would need a bit more nuance.
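For illustration, the effect we're after is basically this (taint key, effect, and node name are all made up):
```
# what the daemonset would do when it notices sssd/NFS on a node aren't healthy:
kubectl taint nodes tools-k8s-worker-12 example.org/node-not-ready=true:NoSchedule
# ...and clear again once the node checks out (the trailing "-" removes the taint):
kubectl taint nodes tools-k8s-worker-12 example.org/node-not-ready:NoSchedule-
```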
I think that in toolsbeta, we had a filesystem issue on the only old-cluster node and those nodes were also causing issues. This ticket is basically not needed.
Mon, Jan 13
The issues with glamtools look like a failure to read a config, from what I can tell, with consistent errors across php versions and clusters.
I'm now convinced that what I've seen so far has nothing to do with the images or the infrastructure. I'm very much hoping that this gets closed as invalid.
Got it running on the new cluster as well. The image is ok.
Ok, now that I rebooted that node, it's fine on the old cluster. The image may be fine.
That's the mechanism we already use to configure a lot about the kubelet.
Got an idea here: `--register-with-taints` (api.Taint) is a kubelet CLI option.
I might suggest we look at adding a taint to nodes that run webservice, one that only gets added when we are sure a node is ready to run a webservice process. Automating such a taint is tricky without puppetdb (and with puppet in general), but it would let us gate things at the end of a "checklist", if you will, like a step at the end of the puppet run that adds the taint. Unless it can be added via the kubelet API (something to look at).
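For the kubelet route, the flag would look something like this (file path per the kubeadm packaging, taint key made up):
```
# sketch: /etc/default/kubelet -- the node registers with the taint already in place
KUBELET_EXTRA_ARGS="--register-with-taints=example.org/node-not-ready=true:NoSchedule"
```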
Some of what I'm seeing appears to be related to a bad filesystem on the worker node. Well, that really doesn't help or relate to Toolforge in any way.
Fun thing... on the old cluster, the pods aren't fully deleting; they end up stuck in "Terminating". Note that I'm testing this in toolsbeta, so this has nothing to do with the particular clusters (these are just set up similarly). It's the container images.