The current state of this is: I rebooted labvirt1001 and it got better. I've migrated a handful of tools exec nodes back to labvirt1001 and I'm going to keep an eye on it for a few weeks. If everything is still good come mid-April, we can just shrug and repool it.
This seems to actually work properly now.
I must've done the cleanup out of order... I removed things again and they seem to be actually gone now.
(And we should have some kind of validation for the script, probably by putting a hash in the nova metadata.)
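Something like this sketch, maybe (the property name, path, and use of the CLI here are all just illustrative):

  # compute the script's checksum and stash it as instance metadata:
  HASH=$(sha256sum /usr/local/sbin/setup-script.sh | awk '{print $1}')
  openstack server set --property setup_script_sha256="$HASH" <instance>
  # a validation pass can then fetch the property and compare before running:
  openstack server show <instance> -f value -c properties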
Thu, Mar 23
Wed, Mar 22
And post-reboot it's fast again, dammit.
The one symptom I'm fixating on is puppet runs. A puppet run on labvirt1001 takes 811.99 seconds.
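That figure is puppet's own catalog-run timing:

  puppet agent --test
  # ...
  # Notice: Finished catalog run in 811.99 seconds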
@hashar, it's nothing to do with load. there are no VMs running on labvirt1001 and it still has the problem.
Any update on this?
Tue, Mar 21
It is actually possible to explicitly tell the scheduler to not put multiple nodepool instances on the same labvirt. That would work if the total number of nodepool instances is always < the number of labvirts, which I'm not sure is true.
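As a sketch (group, flavor, and image names are invented here, and nodepool would have to pass the hint itself when it boots nodes):

  # requires ServerGroupAntiAffinityFilter in nova's scheduler filters
  openstack server group create --policy anti-affinity nodepool-spread
  # each instance booted with the matching hint lands on a distinct host;
  # boots start failing with NoValidHost once instances outnumber hosts:
  openstack server create --hint group=<group-uuid> \
      --flavor m1.medium --image debian-jessie ci-jessie-01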
Fixing dschwen's login was a bit of a hack... I'd like to keep this open until the actual cause of the issue is addressed.
Mon, Mar 20
This only barely warrants a script, since I just now did it with a single command:
I think that the already-existing openstack::observerenv class is just what we want here. I just forgot that it was separate.
It sounds like you don't need a quota change, so I'm closing this for now. Feel free to open a quota request if you turn out to really need the IP -- I suspect you'll find the proxy system better though :)
Fri, Mar 17
@dschwen, I think I increased your quota enough for what you need... let me know if it's not enough. Otherwise, let me know when you're cleaned up and I'll drop the quota back down.
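For the record, the bump itself is just something along these lines (numbers and project name are examples only):

  openstack quota set --instances 12 --cores 24 --ram 49152 <project>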
This is because of a case mismatch between ldap and mediawiki. The mediawiki user table had the user_name 'Dschwen' but ldap had the cn as 'dschwen'.
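The mismatch is easy to see with something like this (database name is a placeholder):

  # mediawiki stores the name with its canonical capitalization:
  mysql -e "SELECT user_name FROM user WHERE user_name = 'Dschwen'" <wikidb>
  # ldap matches cn case-insensitively but returns the stored lowercase value:
  ldapsearch -x -LLL '(cn=dschwen)' cn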
With changes to our provisioning ratio, this is no longer an issue.
Yes, I think this is resolved.
Thu, Mar 16
Most often this is a result of clock drift on the device providing the 2fa code. Rebooting your phone might help.
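TOTP codes are just a hash of the shared secret and the current 30-second time window, so a drifted clock produces the wrong code. A quick illustration with oathtool (the base32 secret is a placeholder):

  oathtool --totp -b "GEZDGNBVGY3TQOJQ"
  # the same secret evaluated a couple of minutes off gives a different code,
  # which is effectively what a drifted phone clock sends:
  oathtool --totp -b "GEZDGNBVGY3TQOJQ" --now "2 minutes ago"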
@Petrb, I'm online now (which is, I believe, 10AM EDT) and will be for six hours at least. I should be available tomorrow during at least the same window.
Wed, Mar 15
I removed all of the ldap host entries.
@Petrb, I completed the in-place upgrade on Huggle and it looks ok to me...
Tue, Mar 14
I moved the related tests to labnet and nrpe -- they seem to be working fine.
I've increased your quotas to allow one additional 'bigram' instance. Let me know if I missed anything.
I just tinkered with my .ssh/config and now this works fine.
This is done on Californium and seems fine.
Are there still pending tasks here, or is this resolved?
Mon, Mar 13
@Chippyy Any progress on this? There are two weeks remaining until we start deleting Precise instances.
This instance was deleted and replaced by utrs-database and utrs-production.
Fri, Mar 10
I wasn't able to get this change merged upstream. Removing it on a labvirt seems to slightly increase disk usage, but I think it's worth it to have a more-canonical install.
Thu, Mar 9
I just built four different jessie instances, ran 'apt-get update && apt-get upgrade' on them and rebooted. All four came up, no problems.
Tue, Mar 7
There doesn't seem to be a good way to win this one. I already added 'delaycompress' to the logrotate script to prevent cronspam (https://gerrit.wikimedia.org/r/#/c/313558/) but now upstart just writes to the .1 file forever.
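For reference, the stanza in play looks roughly like this (path per stock upstart, options abbreviated):

  /var/log/upstart/*.log {
      daily
      rotate 7
      compress
      delaycompress
      missingok
  }

The usual workaround would be 'copytruncate' -- copy then truncate in place, so upstart keeps writing to the same fd -- at the cost of possibly losing a few lines during the copy.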
I'm not sure this is a real bug so much as me misunderstanding yaml.
I granted this. Will rename the bug to reflect future quota reduction after the precise instance is cleaned up.
This is for one Large size instance: 16G and 8 CPUs. And, yes, we'll lower the quota after the corresponding precise instance is gone.
We have a new project, wm-bot, to which the wm-bot service is being moved.
Mon, Mar 6
@Petrb I can't tell what you're saying due to lack of punctuation... maybe "Don't! Instances in huggle project host some essential services!" or maybe "Don't instances in huggle project host some essential services?"
If no one lays claim this week I'll probably shut this instance off next Monday, just to see if anyone notices and/or cares.
I'm afraid that the camelcase name isn't supported, but I've created the 'glampipe' project with Zache-tool as the project admin. @Zache, you can add additional users or admins as appropriate.
Sorry for the delay in creation, I was sick most of last week. This project has been created, with @Hydriz as the projectadmin.
Fri, Mar 3
There's definitely no need to back up labtestweb. Silver is important to back up since it contains our technical documentation... we have an offsite backup of it at https://wikitech-static.wikimedia.org/wiki/Main_Page but as far as I know that does not preserve all the edit history.
Mon, Feb 27
It's not a memory issue, that page is just too damn big if 'tools' is selected in the filter. If I increase PHP's max_execution_time then it loads just fine.
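Bumping it for a test is something like this (path and value are assumptions; the stock default is 30 seconds):

  sed -i 's/^max_execution_time.*/max_execution_time = 300/' /etc/php5/apache2/php.ini
  service apache2 reload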
Email nag sent to labs-announce on 2017-02-27
(Project request is approved but we need more info re: the floating IP request)
Can you explain more about needing a public IP for parsoid? Can't the parsoid service run behind a port-specific proxy? It's http, right?
@jkroll, I just now created this project. You are the only member at the moment but you can add other members or projectadmins as needed.
Sat, Feb 25
Fri, Feb 24
Aaron, I know we talked about this already, but the clock is ticking and I'd appreciate an update. Thanks!