Fri, Mar 22
fyi @jcrespo, access levels (in particular 'cloud-wide root') are defined in the policy document here: https://wikitech.wikimedia.org/wiki/Help:Access_policies
this might need a bit of indexing work -- I see this in syslog:
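If those syslog lines are slapd's "not indexed" warnings, the fix would presumably be index directives along these lines in slapd.conf (a sketch only; the attribute names below are assumptions, so substitute whichever attributes the warnings actually name, then re-run slapindex):

```
# Hypothetical slapd.conf index directives; replace these attributes
# with the ones named in the "not indexed" syslog warnings.
index uid eq
index cn eq,sub
index memberUid eq
```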
There are now two read-only replicas in eqiad behind the endpoint ldap-ro.eqiad.wikimedia.org
ldap-eqiad-replica01.wikimedia.org and ldap-eqiad-replica02.wikimedia.org are online now and seem to be working
Thu, Mar 21
This is probably a side-effect of issues with Striker; they most likely share the same consumer token. Horizon should really have its own.
On the keystone server (cloudcontrol2003):
Tue, Mar 19
Mon, Mar 18
Jessie creation is now disabled in most projects (including deployment-prep). I'd prefer to leave it that way in order to provide some mild resistance to new Jessie VMs showing up in the cloud.
This image is now private to the 'admin' project, with shared access extended to the 'integration' and 'testlabs' projects.
The Integration project will need to keep creating Jessie VMs for a bit.
This is working with acme now.
I don't really know what that database is about. But perhaps we want to do it at the same time as T218569: Openstack codfw DBs: move to m5-master.eqiad.wmnet. Would you mind updating that ticket so we have all the DB-relocation info in a single place?
Sun, Mar 17
Sometime when it's not the weekend let's audit all instances for stuck dpkg processes. This might be happening all over the place.
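A per-instance check could look something like this (a sketch; the one-hour threshold and the process-name list are arbitrary assumptions, and it would need to be run fleet-wide with whatever mass-execution tool is handy, e.g. cumin):

```shell
#!/bin/bash
# Flag dpkg/apt processes that have been running for more than an hour,
# which on an otherwise idle instance usually means a stuck run.
# The 3600s threshold is an arbitrary assumption; tune as needed.
stuck=$(ps -eo pid,etimes,comm --no-headers \
  | awk '$2 > 3600 && $3 ~ /^(dpkg|apt|apt-get|unattended-upgr)$/ {print $1" "$3" ("$2"s)"}')
if [ -n "$stuck" ]; then
  echo "possible stuck package processes:"
  echo "$stuck"
else
  echo "no stuck dpkg/apt processes found"
fi
```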
Fri, Mar 15
Right now labtestservices2001 is the only host for the labtest LDAP DB, so we should move that someplace before we decommission it, unless we want to start with a fresh DB entirely.
Wed, Mar 13
These are now created and puppetized:
Oh, I was confused -- these do need to be on public IPs.
LDAP isn't really a cloud-specific service, so I propose ldap-replica01 and ldap-replica02. My instinct is to put them on private IPs (because LVS will provide a public endpoint anyway), but I'm open to suggestions.
Alex, I'm hoping that you have time to do some/all of this. If not, then please answer these questions and refer back to me so I can muddle through:
That sounds like a plan! I've re-titled this task to fit that plan. We'll need two more Ganeti VMs. The current LDAP servers have 8 cores and 4 GB of RAM; more RAM would be great, and fewer CPUs would probably be tolerable.
Current plan is to add two new read-only hosts (on internal IPs) and put LVS in front of them, then use that endpoint exclusively for cloud VMs access.
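For illustration, the LVS layout for that plan would be something like the following (purely a sketch with hypothetical addresses; this is generic keepalived.conf syntax, whereas the real endpoint would of course be managed by our usual load-balancer tooling):

```
# Hypothetical LVS service for the read-only LDAP endpoint.
# Addresses are made up; DR forwarding and wrr scheduling are assumptions.
virtual_server 10.0.0.10 389 {
    delay_loop 10
    lb_algo wrr
    lb_kind DR
    protocol TCP
    real_server 10.0.0.11 389 {   # ldap-replica01
        TCP_CHECK { connect_timeout 3 }
    }
    real_server 10.0.0.12 389 {   # ldap-replica02
        TCP_CHECK { connect_timeout 3 }
    }
}
```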
We haven't bothered to rebuild this host since it's pending some datacenter work. The alert is silenced, with a description, I believe?
Tue, Mar 12
I'd like to set aside the issue of ldap-on-cloud for now and just get a couple more servers up on Ganeti. I don't immediately know how to do that, but I bet @GTirloni does.
We would also want labtestwiki here (the MW database for labtestwikitech)
Sat, Mar 9
root@serpens:/etc/ldap# grep deref slapd.conf
moduleload deref
overlay deref
I'm sorry @abian, that VM does not appear to have survived the hardware failure. Its disk image is entirely missing so there's nothing to salvage.
Fri, Mar 8
Here's the kind of thing nslcd is doing on stretch:
It seems moderately possible that we need to install the 'deref' overlay on OpenLDAP to catch up with nslcd's new expectations. That overlay is only barely supported, but one brave soul seems to have made it work: https://www.openldap.org/lists/openldap-technical/201401/msg00025.html
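If we go that route, the slapd.conf change would presumably be along these lines (a sketch based on that mailing-list thread; it assumes slapd was built with the deref module available, and the module path may need adjusting):

```
# Load the deref overlay module and enable it for the database.
# Assumes a deref module is present under the configured modulepath.
moduleload deref
overlay deref
```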
I've been watching LDAP traffic (with tcpdump) while starting and stopping jobs on particular grid nodes. The Stretch grid produces roughly 50-60 times as many LDAP requests at job start as the old grid did.
10G NICs are disabled in the BIOS on every one of our cloudvirts that isn't already running at 10G. Details at https://phabricator.wikimedia.org/T216195
Thu, Mar 7
I'm comparing the ldap/pam behavior of test stretch, jessie and trusty VMs. So far I don't see any significant difference.
Wed, Mar 6
@Avicennasis, I have disabled 2fa on your wikitech account. Sorry for the slow response time!
If you would like to test, we now have a VPS base image for buster in wmcs. You can use it in the 'testlabs' project, or I can add it to a project of your choice.
I've merged buster patches upstream and built a +wmf release with those patches. It's in reprepro, named 'python-bootstrap-vz'. I presume that eventually we can go back to using the upstream package if it's rebuilt for future buster releases.
Tue, Mar 5
I have a pending pull request for bootstrap-vz which should allow us to build without local hacks: https://github.com/andsens/bootstrap-vz/pull/496
Mon, Mar 4
That spreadsheet looks quite up-to-date to me; I'm not sure there's more to be done here.
Fri, Mar 1
I'm less sure that we need drives on hand now. We seem to be able to get replacements more-or-less overnight, and adding spare drives to the RAIDs will reduce the urgency of replacement.
First pass at this is https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Virt_capacity
To resolve this, I tried to abolish the mixed-case username 'EBernhardson':
Thu, Feb 28
This turned out to be a case of T165795
I am 90% sure this was cloudvirt1024 and not 1025, so changing title...
the next step is moving this to a new rack for 10G connections (either 2, 4 or 7 in row B) so I'm tagging dc-ops. You can hand it back to me for the rename/rebuild once it's in place.
Wed, Feb 27
I'll pay more attention to this later, but quick drive-by thoughts:
Tue, Feb 26
@ayounsi I assume you're talking about this?
@SamanthaNguyen I've created this project and added you as a projectadmin. You can assign other members as needed.
Hello! The title of this looks like an action item ('delete qna project') but I'm not clear if that's right -- are you ready for me to delete the project or are there other things that need to happen first?