I work for or provide services to the Wikimedia Foundation, but this is my only Phabricator account. Edits, statements, or other contributions made from this account are my own, and may not reflect the views of the Foundation.
Note: labsdb1004's remote serial terminal seems broken. labsdb1006 looked bad, but recovered after reboot.
Mon, Nov 19
It's quiet as a mouse now. It should still spit out logs when it actually does something.
Unless we want to wait for the subtasks.
For us that's pretty good. We could probably just close this one for now.
@Banyek I think as long as it works for you, and they are all on different days, it's fine for the wiki replicas.
Sat, Nov 17
@aborrero I dare say you can be. We will probably both need to mirror updated k8s stretch packages and docker-ce stretch packages into tools aptly and then hack some puppet around them so that our setup can be maintained. To unblock the grid upgrade, all we need is a kubernetes-client with all it needs to get by. Part of that is likely a flannel package (which we'd have to invent) or flannel installed via kubeadm. I'll have to dig deeper to be sure exactly what is required to just get a bastion talking to both existing k8s and sonofgridengine with minimal tech debt for the next phases of k8s upgrades.
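Roughly the kind of aptly workflow the mirroring would involve; everything below (upstream URL, filter, repo names) is an illustrative placeholder rather than the actual plan:

# Illustrative only: pull selected stretch packages into a local aptly via a filtered mirror.
$ aptly mirror create -filter='kubernetes-client|docker-ce' -filter-with-deps \
    k8s-stretch https://apt.example.org/debian stretch main
$ aptly mirror update k8s-stretch
$ aptly snapshot create k8s-stretch-20181117 from mirror k8s-stretch
$ aptly publish snapshot k8s-stretch-20181117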
Fri, Nov 16
@Banyek Just looking to confirm that you will be available during the Toolsdb primary and secondary reboots as support to verify things are working correctly and help if not for 11/20 @ 17:15 for labsdb1004 and 11/21 @ 17:15 for labsdb1005.
Thu, Nov 15
Honestly, we've had no queue waiters for a while since we enabled all disabled exec hosts. I'm going to reject this task for now, pending the new grid build.
I'm aiming to write tests for this script shortly because it is too complex not to have them. Overloading the script's functionality with something it wasn't written for makes me a bit nervous. It is already very easy to introduce mistakes when I make updates, requiring very careful review and manual QA in test dbs.
So far it looks like replication is picking up where it left off nicely on labsdb1007 (done).
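For reference, a typical way to confirm a replica is catching up (not necessarily the exact check used here):

$ sudo mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'   # both threads Yes, lag trending to 0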
@aborrero I'd say it's worthy of notifying users for toolsdb/wikilabels (labsdb1004/5) and possibly osmdb (labsdb1006/7) masters but not the wiki replicas or the secondaries (except wikilabels). The users won't see any significant issue on the replicas.
That and create the _p db. Silly bugs.
Wed, Nov 14
Done so far:
The patch is deployed throughout. Should that be it for this task, @Anomie?
@Halfak labsdb1004/5 would affect wikilabels. We may just do reboots in place like last time due to the tables that don't replicate per: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#ToolsDB_Backups_and_Replication
- labsdb1004 is the replica for most tables on 1005, but it is the only server for wikilabels (just so that information is out there).
This was just stuck at a prompt. Stupid mistake, the output after that stage of boot was redirected to the other console. Proceeding.
Nothing. I guess this is just more digging, then, unless both systems are somehow broken.
Redirection settings are confirmed correct. Looking around other settings in the docs.
There is a cap on the user connections (the user that eats connections being OpenStack aka Cloud VPS). It just has burst capabilities and can briefly go over what we have set. I suspect that with the limits we have in place, it cannot go all that much higher. A slightly higher limit would help us get through Neutron migrations (which is almost certainly what is causing the bursts in connections).
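If the cap in question is MariaDB's per-user connection limit (an assumption on my part; the actual mechanism may live elsewhere), checking and raising it looks roughly like this, with the account name and number purely illustrative:

$ sudo mysql -e "SELECT user, host, max_user_connections FROM mysql.user WHERE user = 'example_openstack_user';"   # hypothetical account name
$ sudo mysql -e "GRANT USAGE ON *.* TO 'example_openstack_user'@'%' WITH MAX_USER_CONNECTIONS 300;"                # hypothetical new limit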
Tue, Nov 13
Note: I'm not done reading back yet--but yeah, that's what I was thinking of. There's a lot here.
- Specialized views - Views for comments from each of revision, archive, and logging, separately. We have to test whether or not sqooping from these views would be fast enough, but it seems they would be useful for cloud db users in general.
- Access to underlying tables - We could query the underlying tables, and that would bypass any performance problems we have with the views. We would have to duplicate the sanitizing logic from the views and keep it in sync with what is in cloud db. This would also require special permissions on the cloud db.
- Materialized views - This sounds like the best choice, as suggested by @Anomie. We thought they were discouraged by DBAs due to the implied slow-downs in replication. But if that's not a concern, let's do it!
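For what it's worth, MariaDB has no native materialized views, so in practice this would be a real table refreshed on a schedule. A purely illustrative sketch (table and view names are placeholders, not the actual replica schema):

# Placeholder names only: build the table once, then refresh it periodically (cron or a MariaDB event).
$ sudo mysql -e "CREATE TABLE comment_mat AS SELECT * FROM comment_example_view;"
$ sudo mysql -e "TRUNCATE comment_mat; INSERT INTO comment_mat SELECT * FROM comment_example_view;"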
No, misctools is at the latest version. It looks like it is trying to downgrade?
apt-get purging it :)
I wonder if this is some buried pinning thing. Checking that.
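Standard ways to check for that, for the record:

$ apt-cache policy misctools                                          # shows candidate version and any pin priority
$ grep -r misctools /etc/apt/preferences.d/ /etc/apt/preferences 2>/dev/null   # any explicit pin entries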
Error: /Stage[main]/Profile::Toolforge::Grid::Exec_environ/Package[misctools]/ensure: change from 1.32 to 1.31 failed: Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install misctools' returned 100: Reading package lists...
The following packages have unmet dependencies:
 misctools : Depends: mariadb-client-core-5.5 but it is not installable
E: Unable to correct problems, you have held broken packages.
I think we are good on this. Please re-open if I am wrong. More growing pains with the new region.
Looks like it's working again.
Added the CIDR to pg_hba.conf, which is not overridden by puppet. Reloaded postgres.
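The added entry is along these lines (the CIDR and auth method shown are illustrative; the real values match the Cloud VPS network in use), followed by the reload:

# pg_hba.conf format: TYPE  DATABASE      USER          ADDRESS        METHOD  (illustrative values)
host    u_wikilabels  u_wikilabels  172.16.0.0/21  md5
$ sudo systemctl reload postgresql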
2018-11-13 16:34:14 GMT FATAL: no pg_hba.conf entry for host "172.16.4.244", user "u_wikilabels", database "u_wikilabels", SSL off
Yeah, the repo was fine. The issue was the dependency declaration for the one package. Jobutils works I think.
Fri, Nov 9
I'm now also poking around at what it would look like if all the python my team uses ended up in separate packages (debs etc), and I don't hate it...🤔
Thu, Nov 8
@dschwen Do you have some idea what time it gets killed at? Can you set a timer in your script (if you haven't already)?
@Banyek This appears to be affected by the query killer wmf-pt-kill. What is the current limit placed on things? I wonder if this just goes too long or if it should be changed/tuned to allow longer queries?
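A quick way to see what limit the killer is actually running with, assuming wmf-pt-kill wraps stock pt-kill and runs as a systemd unit of the same name (--busy-time is pt-kill's own flag):

$ ps aux | grep [p]t-kill                      # running command line shows the flags in effect
$ systemctl cat wmf-pt-kill | grep -i busy     # or check the unit definition for the busy-time setting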
On the other hand, toolsbeta-sgebastion-03 is actually able to complete puppet runs! Exec nodes are still held up by the incorrect misctools deps, but I thought I'd add the happy note.
Fixed the network access issue cross-region. However:
To put it a different way, I'm uncomfortable deploying untested python to the cloud environment, and I prefer to test infrastructure code in general (rspec/unittest or whatever). At least within the scope of cloud materials in the repo, I'm trying to bring it all into a testable form (even if just linting). I got python3 into the containers, the python3 tests work great where they are (only in my tests right now), and I made sure the tests are run conditionally. (Note: the tests I put up are in the sonofgridengine module, if you are curious and have thoughts.)
I recently added python tests to one of my modules with a significant python script in it. On the flip side of this, I honestly don't see why a python script that does something complicated enough to merit being in python should live in puppet without tests. I have been on a personal crusade to start implementing testing discipline for code managed by the cloud team so that taking over existing projects is safer and more consistent.
I think we may want to add some nodes in general as well. The number of waiting jobs has dropped to a much more reasonable level, but it's been a while since new nodes joined. Let's make a subtask for that. Tasks in qw are now down to 2.
I'll enable the other disabled nodes, either way. :)
Is -mem 8g the setting it has always had? I think that's the total physical RAM on an exec node, which would mean it needs a totally quiet node to run on.
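Easy enough to compare against the node's actual memory with standard gridengine tooling (hostname is a placeholder):

$ qhost -h tools-exec-example   # MEMTOT column shows the node's physical RAM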
I can say that my own submissions (which are not CPU heavy, but they are very RAM heavy) are going through fine once a day, so it is flowing at least.
We can use this:
$ qstat -u "*" | grep qw | wc -l
There are several exec nodes disabled, probably from forgotten rebalancing efforts. I've just enabled one of them and can enable more, though I'm poking around in case I can find anything that's really busted.
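For the record, roughly what that looks like with standard gridengine commands (the queue instance name is a placeholder):

$ qstat -f                            # disabled queue instances show 'd' in the states column
$ qmod -e 'task@tools-exec-example'   # re-enable a specific queue instance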
Can you give me the commands you are using for this so I can do a straight one-to-one comparison? Also, which jobs are yours? I'm poking around at this.
Do we know if email submissions are tagged with the queue "mailq" by any of the scripts or settings you are stripping out?
Wed, Nov 7
I still get errors for toollabs-webservice as well on stretch.
Oh yes, bastions need misctools as well (confirmed)
Error: /Stage[main]/Gridengine::Submit_host/Package[jobutils]/ensure: change from purged to present failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install jobutils' returned 100: Reading package lists...
Building dependency tree...
E: Unable to locate package jobutils
After a puppet change earlier today, the OpenStack Neutron servers were both brought up running as masters at the same time--split brain. We resolved this by simply rebooting one of them so it took on the standby role, but removing the puppet service restart is in order until we can make it handle the failover setup better.
Tue, Nov 6
labstore1001/2 are the walking dead, FYI. They are blank spares to be decommissioned when 8/9 replace 1003. 1003 is only still there because we don't have replacements up yet. 1003 is scratch and misc, which is more transient data in some cases.
Oh thanks! I'll take a look at that. I figure it must either be a BIOS config or possibly kernel option issue.
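If it turns out to be the kernel side, the usual suspect is the console= parameter handed to grub (values below are illustrative; the right port and speed depend on the hardware):

$ grep GRUB_CMDLINE /etc/default/grub
GRUB_CMDLINE_LINUX="console=ttyS1,115200n8 console=tty0"   # example serial console redirection settings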
Thu, Nov 1
So would the replicated stuff on the replica servers be ready for me to run the scripts? Just double-checking before I go on with depooling and regenerating the views.
Wed, Oct 31
Aaaand same freeze when installing on 4.14. That's fun. I can try the kernel on cloudstore1009 as well, but cloudstore1009 so far behaves the same as 08 in general.
Yes, the timestamps for GET requests for the right scripts on install1002 are there when I reset one of the servers.
Bigbrother seems like a false sense of security in some ways because it doesn't trigger alerts for reboot loops and things like that (which I've seen happen before). So I'm not sure it is providing very good service in the first place.