Wed, Aug 9
Here are some user-facing things that I'd like to have metrics for:
Tue, Aug 8
When all is well, silver is fine -- if something breaks and increases the logging rate, I get alerts.
Mon, Aug 7
clush is back up and running on a new host. The only action-item remaining here is the deletion of the old tools-puppetmaster-02.
OK, after a quick chat with Aaron, I've created two big VMs for you:
This is up and working.
Resolved by https://gerrit.wikimedia.org/r/#/c/368454/
Sat, Aug 5
Fri, Aug 4
Thu, Aug 3
As far as I can tell, paws.wmflabs.org (the record) can only be in paws.wmflabs.org (the domain). So having the stand-alone record be in a different project from the domain-with-subrecords won't work. That means that #2 is probably impossible to do within tools if we want *.paws.wmflabs.org to be owned by the paws project.
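The constraint above comes down to longest-suffix matching: a record always lands in the most specific zone whose name is a suffix of the record's name, regardless of project. A minimal sketch of that matching rule (the zone list and function are mine, just for illustration):

```shell
# A record belongs to the most specific zone whose name is a suffix of it,
# so paws.wmflabs.org (the record) can only live in paws.wmflabs.org
# (the zone), whichever project owns that zone.
zone_for() {
    record="$1"; best=""
    for zone in paws.wmflabs.org wmflabs.org; do
        case "$record" in
            "$zone"|*."$zone")
                [ "${#zone}" -gt "${#best}" ] && best="$zone" ;;
        esac
    done
    echo "$best"
}

zone_for paws.wmflabs.org    # -> paws.wmflabs.org
zone_for tools.wmflabs.org   # -> wmflabs.org
```

So any `*.paws.wmflabs.org` record necessarily sorts into the paws zone, never into a stand-alone record in another project.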
so, ok, clearly not unused.
btw, it turns out the init script does not observe the START=no setting in /etc/default/puppet, which I tried ages ago and which is why I decided that this wasn't a puppet problem :(
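For reference, this is the guard a Debian-style init script would normally apply -- source /etc/default/puppet and bail out unless START=yes. The script on these images evidently skips it. A sketch (the function name is mine, and it takes the defaults file as an argument so it can be exercised standalone; a real script would hardcode the path):

```shell
# Guard an init script would normally run before starting the daemon.
should_start_puppet() {
    START=yes                           # default when the file says nothing
    defaults="${1:-/etc/default/puppet}"
    [ -f "$defaults" ] && . "$defaults"
    [ "$START" = "yes" ]
}

# In the init script's start) branch:
#   should_start_puppet || exit 0       # honor START=no and do nothing
```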
- Old images use puppet 3.4.3
- New images use puppet 3.8.5
Bisect blames 2b18741526dff42582d84c25e0a8a7fddec080f0 which is very surprising! I need to test more but it seems to be right.
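The actual bisect ran over the puppet repo with an image build-and-boot test at each step, which can't be reproduced here; as a self-contained illustration of the same workflow, here's `git bisect run` locating a planted bad commit in a throwaway repo:

```shell
# Toy bisect: a disposable repo where one commit breaks a "boot" check;
# `git bisect run` finds it automatically. The real case ran this loop
# with an image build + boot test as the check command.
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email demo@example.org
git config user.name demo
echo ok > boots;     git add boots; git commit -qm 'baseline (good)'
echo a > other;      git add other; git commit -qm 'unrelated (good)'
echo broken > boots; git commit -qam 'breaks boot (bad)'   # the culprit
echo b > other;      git commit -qam 'later work (also bad)'
git bisect start HEAD HEAD~3             # mark tip bad, baseline good
out=$(git bisect run sh -c 'grep -q ok boots')
echo "$out" | grep 'is the first bad commit'
git bisect reset >/dev/null
```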
Wed, Aug 2
I am able to build booting images if I roll back to puppet patch ffdfa2821bca02a0ec013d1e618d4d9690f7ec7d
Mon, Jul 31
- removing all custom-installed packages
- building on a fresh build instance
Due to various concerns we're going to just disable these passwords for now.
that was easy
Sun, Jul 30
There are quite a few google hits about hangs after that 'random: nonblocking pool is initialized' message. Things I've tried:
- building with a 4.x kernel instead of the default trusty kernel
- rearranging the disk volumes to be /sda rather than /vda
Here are similar log snippets from an old, working image.
Fri, Jul 28
Thu, Jul 27
The attached patch should resolve the root cause.
Wed, Jul 26
Tue, Jul 25
Mon, Jul 24
Here are some things that need to be thought about/figured out before we can go forward:
I'm pretty sure that #1 is moot -- at least, anytime we discuss it we conclude that the 'labs-support' vlan isn't really a useful concept and should be eliminated.
Is this tagged with cloud-services-team in error, or is there something you need from us?
Thank you, Chris! This is new hardware and we can live without it... can we leave this in your hands to follow up with Dell? Is there any additional info you need?
(I should note that there's no data of interest on that box -- reimaging is just fine)
Fri, Jul 21
I have a fix to prevent this from happening again... in the meantime I've added novaadmin back to everything.
So currently I think this was caused by a misfire in OpenStackManager's removeUserFromBastionProject():
The first sign of trouble in the keystone log is:
In total it was removed from 53 projects. I'm now checking to see if any one user is in all of those projects (other than novaadmin).
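That cross-check is just an intersection of the projects' member lists; a minimal sketch with invented sample data (real lists would come from ldap or keystone, and the project names here are placeholders):

```shell
# Find users present in every affected project, excluding novaadmin.
# The three member lists below are made-up sample data.
d=$(mktemp -d)
printf '%s\n' alice bob novaadmin   | sort > "$d/deployment-prep"
printf '%s\n' alice carol novaadmin | sort > "$d/tools"
printf '%s\n' alice novaadmin       | sort > "$d/paws"
comm -12 "$d/deployment-prep" "$d/tools" \
  | comm -12 - "$d/paws" \
  | grep -vx novaadmin
# -> alice
```

A user who shows up in the intersection of all 53 projects would be the prime suspect for whatever triggered the removals.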
We've replaced novaadmin in deployment-prep; now it's missing from the following:
I just can't think of any reason why those roles would've been removed :( Investigating.
There was a brief period when novaadmin couldn't log in, is it possible you just caught it at a bad moment? The above curl seems ok to me now.
Thu, Jul 20
Yep, all looks good to me.
Ok, I renamed MABot to 'MABot former'. I think when you retry you should log out entirely and create the account as though you are a new user -- that's the path that is most tested.
The extension is now installed and loading on wikitech-static. It looks terrible, for now -- waiting to see if a re-sync fixes things.
Now running 1.29.0 (52abe24)
I resolved this by running the query in https://ask.openstack.org/en/question/494/how-to-reset-incorrect-quota-count/
Wed, Jul 19
I see the MABot account in the wikitech user table but don't see an ldap record. It might be that the creation process you followed just doesn't work right, or it might be that there was some kind of unreported collision during creation (possibly due to the re-used email address, although that would surprise me).
It's possible that you were unlucky and hit us in the middle of an ldap outage... does the same happen if you try now?
Jessie and Stretch are updated. There are unexpected issues with the Trusty build which I'm working on.
Tue, Jul 18
Is this still meaningful now that there's no puppet config on wikitech? Does Horizon have the same issue?