I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.
Sat, Jan 19
PHP 7.0 has now been properly replaced with 7.2 in the new grid, I think.
Fri, Jan 18
A few need a reinstall, which I'm going to do, but installing phpunit pulls 7.0 back in. There's no phpunit in the thirdparty repo, and the stretch version is old and unsupported upstream.
In fact, I think it's better to strip them from the environment and add them back deliberately if there's a reason to. I'll do that.
I could uninstall the php7 libraries where they exist, on the bastion at least, to prevent issues.
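A rough sketch of what that could look like on a bastion, in case someone beats me to it (package names are whatever the usual Debian php7.0 set pulled in, not a checked list from these hosts):

```
# see which php7.0 packages are actually installed
dpkg -l 'php7.0*' | awk '/^ii/ {print $2}'

# then purge them, assuming nothing we care about depends on them
sudo apt-get purge 'php7.0-*' php7.0
```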
The stretch grid nodes now have both php7.0 and php7.2 installed. That may cause confusion when adding new nodes in the future, but it looks good for now.
Added that to the patch. Now it's only missing a deprecated package, which I'm more ok with.
I totally missed the xdebug package! Thanks
xdebug is not in the upstream repo either. If we want that, we may have to build it ourselves and put it in the tools repo or something, then?
Ahh, that's because mcrypt is deprecated in 7.1. xdebug might still be a thing, though...
Looks like the prod repo is missing:
All new grid exec hosts (stretch) are now submit hosts.
Concurrent jobs are now limited by https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/toolforge/grid-scheduler-config#6 and by https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/toolforge/grid-global-config#43
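For anyone wanting to check the live values rather than the puppet files, the standard qconf invocations are:

```
# scheduler configuration (the per-user running-job limit is maxujobs)
qconf -ssconf | grep -i maxujobs

# global cluster configuration (overall caps like max_jobs / max_u_jobs)
qconf -sconf | grep -iE 'max_jobs|max_u_jobs'
```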
Thu, Jan 17
Added some notes in a few places. Also moved exec-manage to the master on SGE (and fixed it). I think that's it for this one?
Wed, Jan 16
Ok, I set you up with a project in horizon. You should be able to log in and set up an instance. I increased the project RAM so that a bigdisk2 will work.
I might just make the master and shadow submit hosts. That seems like it might be the best compromise unless I can find a way around how this acts.
Tue, Jan 15
That said, I've built nfsd-ldap on stretch and have it in my home dir on install1002.
@GTirloni I'm waiting until tomorrow morning to add it to reprepro when I can make sure ops people are around to stop me from doing something stupid. Feel free to take over and grab it from my home dir there before I'm on! I put all the needed build files in my home dir and basically nothing else is in there on that server.
Couldn't we just remove these users from ldap? Hmm, that's a different matter though because that's not shell access, that's cloud access.
That makes sense. nfsd-ldap only works for jessie at the moment, though :-p. Might have to rebuild the package.
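If anyone does need to rebuild it, it's just the usual Debian dance from the unpacked source tree on a stretch host (a sketch, not the exact commands I ran):

```
# pull in the declared build dependencies, then build unsigned binary packages
sudo apt-get build-dep ./
dpkg-buildpackage -us -uc -b
```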
Most of this seems to have been in the admin module for like four years, though....
So the obvious problem is with modules/admin/data/data.yaml
Mon, Jan 14
Restarted flannel on proxy-04, and it got its flanneld interface up.
Set my hosts file to point at 03, and it works like a charm. This is ready to go.
Ok, tools-proxy-03 can now reach flannel network nodes in eqiad!
I had to add UDP port 8472 to both ends of the setup.
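(8472/udp is the kernel VXLAN port that flannel's vxlan backend uses. A quick sanity check on a node, assuming the vxlan backend so the device is flannel.1:)

```
# show the flannel VXLAN device and its encapsulation details
ip -d link show flannel.1

# confirm the kernel has the VXLAN UDP socket open
sudo ss -uln | grep 8472
```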
Flannel and kube-proxy both work now, but the connection is still not possible. Checking kube workers.
Flannel interface is now there. ferm needed a restart on the flannel etcd servers (and I added a sec group to allow the port across regions)
Flannel interface isn't showing up right now:
Sun, Jan 13
Fri, Jan 11
If other root emails come through at this point, we'll be good with things as they are, and that'll be nice (some parts will be baffling, but it'll be nice).
Specifically for gridengine toolforge root emails, I may have found a small part of the problem: an encoding issue that might be getting things flagged. If I can fail some jobs badly enough, I should be able to produce some related emails. This won't tell us anything about other VPS projects or other types of toolforge root emails that might be affected.
Thu, Jan 10
Looks good now :)
All set for generics
In the past, I've found that not removing it beforehand can break things; it didn't always work well. My question here is also whether neutron handles that better. I think we should reconsider that edit slightly to include the remove, just in case. I believe I discovered this on the static server failovers.
The issue is an email that isn't well formatted. An email from root for cron, for instance.
I'll also point out that toolforge is currently whitelisted at the MX. We are only discussing the new toolforge cluster here.
Just connecting that change to this task, since this is really what it was hoping to resolve.
Note that this will have no effect on nodes in the old grid. That would need to be done manually, or perhaps with a one-liner script doing a lot of qconfs in the write dir.
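For illustration, the sort of one-liner I mean, assuming the per-host change were, say, registering each old-grid exec host as a submit host (the actual qconf calls would depend on what we're applying):

```
# loop over the old grid's exec hosts and apply the same qconf change to each
for h in $(qconf -sel); do qconf -as "$h"; done
```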
Added 24 more nodes. They still need the puppet dance and such.
Started up two servers. Beginning the dance for puppet.
They all have clear, ready status now.
Wed, Jan 9
42 stretch exec nodes are up. Working through the puppet cert dances.
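For anyone following along, the "dance" per node is roughly the usual standalone-puppetmaster routine (a sketch; paths assume Puppet 4-era defaults, and the hostname is just an example):

```
# on the new instance: drop any stale client certs and re-run the agent
sudo rm -rf /var/lib/puppet/ssl
sudo puppet agent -tv

# on the project puppetmaster: sign the pending request
sudo puppet cert list
sudo puppet cert sign tools-sgeexec-0901.tools.eqiad.wmflabs
```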
Yeah. There are a lot of checkers that should never be running.
Got the lock. This is done on all replica hosts.
Running with the --clean option
That's the view updates he means. I'll run the update today.
Tue, Jan 8
Well, my end can't use 'em very effectively. It really comes down to whether we are going to build it around analytics, then. I seem to recall there's another solution there for them as well? Dropping them would make some of our other tickets a bit easier (but we can also make that script skip those tables if we aren't dropping them).
The grid script that uses puppet's files on NFS configures most of the functional grid environment, but not the global and scheduler config. Since qconf can take input from files for both of those, it is possible to build them from templates and then smuggle them in with python, like I did for the rest of the grid. Should be pretty easy to extend it like that.
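A minimal sketch of what the extension would end up running once the templates are rendered (paths here are placeholders, not the real puppet-managed ones):

```
# load the rendered global cluster configuration and scheduler configuration
qconf -Mconf /path/to/rendered/global-config
qconf -Msconf /path/to/rendered/scheduler-config
```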
Mon, Jan 7
I do believe we have decided to leave these limits in place only on the new grid.
Ok, so perhaps I can set that to 50 and maxujobs to 16.
I am curious if the scheduler will simply dump user jobs into a long tail of qw state if we restrict it a lot. So what I'm thinking of trying is reducing the main grid to 50 first to see what happens and then tighten to 16.
I figure the new grid can start with 16 to see how and where that creates problems at least.
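For the record, tightening that on a live grid is a one-line change to the scheduler configuration, roughly:

```
# opens the scheduler configuration in $EDITOR; set the maxujobs line to 16,
# i.e.   maxujobs   16
qconf -msconf
```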
The thing here is just that you have to keep your db names really clear and separate, obviously :)
It seems like you could do it using a sensible name for the DB on ToolsDB. The tables there are replicated, which isn't ideal for this purpose, but it depends on how this is used. A db that pops up and is then torn down once in a while seems like it would be fine.
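For what it's worth, a sketch of the naming convention I mean (the credential user and db name are made up, and the host is the ToolsDB service name as I recall it):

```
# ToolsDB databases are conventionally named <credential-user>__<something>,
# using the credentials from the tool's replica.my.cnf
mysql --defaults-file="$HOME/replica.my.cnf" -h tools.db.svc.eqiad.wmflabs \
  -e 'CREATE DATABASE s51234__staging;'
```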
Sat, Jan 5
🎉 That was something I was sure we were going to hear about. Great!
Fri, Jan 4
To put it another way: if you find something cool to fix this using the test environment we set up (by simplifying the mounts, finding some hidden options, trying a different DRBD module, or some such nonsense), that'd be great; but if CephFS does it better in the course of that testing, let's put the effort there instead. That would be a subtask of this no matter what, just in case it fixes this along with other issues.
The general notion of this ticket (although we've been distracted by the load problems on the tools/project NFS cluster) is that our current failover scheme is fundamentally broken. If you fail over, you have to reboot things just to get it to let go of the bind mounts. On the dumps cluster, failing over happens via symlinks, but if something goes down, the mounts end up so badly broken anyway that it's really hard to recover toolforge.
I honestly don't think there is a Cloud-wide use case for these tables until we have a solution that keeps them in close sync with the live data. Trying to explain to our end users that there are parallel tables with different names that are up to a month behind the live data will be a customer relations headache.
@Banyek The script has no real way to work with tables as written. It runs DROP VIEW and intentionally avoids changing any tables, so having the experimental materialized table there does break its logic. This can be worked around if we choose to keep that materialized table around, but I imagine we could also just remove the experimental table for now, right?
Thu, Dec 27
Dec 21 2018
Added this rule to the webserver security group: Ingress IPv4 TCP 1024 - 65535 10.68.16.0/21. That group is used for all webgrid nodes (and I think k8s as well). The rule will be needed during moves of servers to the new region no matter what; once we move the proxy server we will have to reverse it.
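(The equivalent CLI form, for reference; treat the group name as a placeholder for whatever the webserver group is actually called in the project:)

```
openstack security group rule create webserver \
  --ingress --ethertype IPv4 --protocol tcp \
  --dst-port 1024:65535 --remote-ip 10.68.16.0/21
```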
We had the wrong proxy node set for the new grid in hiera. Fixed that. Looks MUCH better.
identd seems to work from the proxy to the exec node...
Found the python script that is supposed to record that information in redis then reply with 'ok'. Looking for some kind of log or something from it.
Inserted some debugging print statements on the exec node and found that the job fails after connecting and sending the port info.
Also note that once the queues have errored, they won't re-run the job. They acquire an E status that won't clear even if you reboot the instance, so we can get faked out by that blocking a run after we fix things. qmod -c '*' will clear all errors; that caused me grief last night.
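Handy pair for that situation (standard gridengine commands):

```
# show queues in error state and why they errored
qstat -f -explain E

# clear the error state on all queues so jobs can run again
qmod -c '*'
```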
Added 172.16.0.0/21 on port 5669 to the proxy security group for project-proxy. I figure basically anywhere in the proxy mess that is specifically opened for the 10. network is going to need the new region eventually, even if that likely isn't the problem here.
Dec 19 2018
From the OpenStack docs, it looks like we are most likely to break designate, but that shouldn't fail if we do what @chasemp said: https://docs.openstack.org/designate/pike/admin/upgrades/newton.html
The easy solution for now seems to me to be stretch+newton as the next step (especially as we're upgrading all the clients to that anyway), but for The Future, a nicely HA OpenStack cluster on k8s using helm could track releases that are actually actively supported upstream: http://superuser.openstack.org/articles/build-openstack-kubernetes/
https://gerrit.wikimedia.org/r/c/operations/puppet/+/480590 is merged and this is ready.