Fri, Nov 9
This should be fixed, thanks to Giuseppe's changes.
Wed, Nov 7
It's unlikely that moving the clients would break that. It's /possible/ that moving the master itself broke things but I haven't seen that happen before.
Tue, Nov 6
This seems to work now, but I haven't tested the Icinga integration yet.
Mon, Nov 5
In order to keep this ball rolling, I propose that we schedule this move for November 27th, 28th, and 29th. Any objections? We could try to cram it in the week before but then we'd run up against Thanksgiving if it takes longer than expected.
If cloudvirtanalyticsXXXX is really too long, then let's go with cloudvirtdlXXXX.
@Cmjohnson, please name these 5 boxes 'cloudvirtanalyticsXXXX' starting with cloudvirtanalytics1001. And rack them in row B with normal cloudvirt cabling. (If need be I can figure out in better detail what I mean by 'cloudvirt cabling' but @ayounsi is probably the best to ask about that.)
*bump* we're still getting these and my team is increasingly bleary and disoriented by all the middle-of-the-night pages. The most recent one was last night:
Sun, Nov 4
I built a couple of ntp servers in the cloudinfra project and we pointed all VMs at those servers.
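(For anyone who wants to verify that a VM is actually syncing from the new servers, here's a minimal check using Python's ntplib. The hostname below is a placeholder, not necessarily the real cloudinfra server name.)

```python
import ntplib  # pip install ntplib
from time import ctime

# Placeholder -- substitute the actual cloudinfra NTP server hostname.
NTP_SERVER = 'ntp-01.cloudinfra.eqiad.wmflabs'

client = ntplib.NTPClient()
response = client.request(NTP_SERVER, version=3, timeout=5)

# offset = local clock minus server clock, in seconds; should be near zero.
print('server time:', ctime(response.tx_time))
print('clock offset: %.3fs' % response.offset)
```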
Fri, Nov 2
Thu, Nov 1
Krenair is right, sorry
Yep, eventmetrics-prod01.eventmetrics.eqiad.wmflabs works for me. The 0.0.0.0/0 policy is a bit broad; you might want to narrow it to 172.16.0.0/21 for ssh.
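If it helps, here's a sketch of that narrowing with openstacksdk (the cloud name is illustrative, and I'm assuming the 'default' security group is the one carrying the ssh rule):

```python
import openstack  # pip install openstacksdk

conn = openstack.connect(cloud='eqiad1-r')  # cloud name is illustrative

# Find the security group carrying the ssh rule ('default' assumed here).
sg = conn.network.find_security_group('default')

# Add an ssh rule scoped to the internal range instead of 0.0.0.0/0.
conn.network.create_security_group_rule(
    security_group_id=sg.id,
    direction='ingress',
    ethertype='IPv4',
    protocol='tcp',
    port_range_min=22,
    port_range_max=22,
    remote_ip_prefix='172.16.0.0/21',
)
# (You'd then delete the old 0.0.0.0/0 rule via delete_security_group_rule.)
```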
Wed, Oct 31
A new project is fine.
Option one (easiest for cloud team):
@Volans, I don't have timestamps, but I do have this from our weekly meeting alert summary:
Tue, Oct 30
When you have a moment, could you retest this in eqiad1-r? I suspect that it has roughly the same failure cases as in eqiad, but the steps should make it clearer if/why it's failing.
*bump* -- I'm interested in whether anyone is working on fixing these issues. If not, that's fine, but I'll put some more time into ensuring that we don't get pages for them :)
Associated upstream patch: https://review.openstack.org/#/c/614328/
I just rolled out a new version of Horizon and (at a different time) restarted apache on both of the labweb boxes; in both cases my session persisted.
Now in Horizon I see "Host cloudvirt1018" in the server overview page. So I think this is done!
It looks like user_id is already public; revealing the virt host may just be a policy change to extended_server_attributes. I'll run some tests.
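(Once that policy change lands, the host should be visible via the extended server attributes. A quick way to check from the API, as an openstacksdk sketch; the server name is hypothetical:)

```python
import openstack  # pip install openstacksdk

conn = openstack.connect(cloud='eqiad1-r')  # cloud name is illustrative

# 'example-instance' is a hypothetical server name.
server = conn.compute.find_server('example-instance')
server = conn.compute.get_server(server.id)

print('user_id:', server.user_id)
# Populated only if policy lets the caller see extended server attributes:
print('host:', server.hypervisor_hostname)  # OS-EXT-SRV-ATTR:hypervisor_hostname
```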
Mon, Oct 29
Running an NTP server or two on a cloud VM is probably not a big deal. But, before I go down that road... does anyone want to argue against us just using pool.ntp.org for VMs? And what is the external source of NTP authority that the production NTP servers use?
I've attached patches that propose running a cloud-specific NTP server. I'd also be OK with changing the network ACLs to allow the new region to access the standard NTP servers (which were being used by the old region).
Fri, Oct 26
@Krenair just rattled off a list of things we'll probably have to tweak by hand:
How about noon CST on that Monday? (that's probably 17:00 UTC although that week is the week-of-timezone-slip so I can't make any promises)
OK, I'll go first :) How about if we schedule downtime for Monday the 5th?
Sorry, that last bug was attached in error.
Thu, Oct 25
I confess that I didn't have a super strong case for killing things just now; one of the labvirts was under strain and I saw several (maybe 4-5?) jembot processes running there and ran straight for the hatchet. It would be useful to know how many procs is a normal amount.
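(If someone wants to baseline that, a tiny psutil sketch could log the count over time; matching on 'jembot' in the command line is a guess at how those processes show up in the process table:)

```python
import psutil  # pip install psutil

def count_jembot_procs():
    """Count processes whose command line mentions 'jembot'."""
    count = 0
    for proc in psutil.process_iter(['cmdline']):
        try:
            if any('jembot' in arg for arg in (proc.info['cmdline'] or [])):
                count += 1
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return count

print('jembot processes:', count_jembot_procs())
```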
I just now killed off all jembot processes and restarted again.
Wed, Oct 24
I believe that this is happening but I don't think it has to do with load-balancing, at least directly. The session keys are held in a memcached pool that is shared between the two hosts. To verify (at least the most obvious case) I just tried this:
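(The exact steps weren't preserved in this log, but the general shape of the check is something like this pymemcache sketch, with a placeholder address for the shared pool: write a key before the restart, read it back afterwards.)

```python
from pymemcache.client.base import Client  # pip install pymemcache

# Placeholder address for the shared memcached pool.
client = Client(('memcached-host.example.eqiad.wmflabs', 11211))

# Write a key the way a session entry would be stored...
client.set('session-smoke-test', 'still-here', expire=3600)

# ...then, after the Horizon deploy / apache restart, read it back:
print(client.get('session-smoke-test'))  # expect b'still-here'
```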
Does this mean that we no longer need the IP aliaser in eqiad1-r?
Ah, OK. So it sounds like this works! Do you have any concerns?
You should now be able to create new VMs in eqiad1-r. Let me know if you run into any trouble.
Hm, I vaguely think that we should always use the recursors rather than the auth in this case since we're generating IPs for use on a VM, so any IP-swizzling that we do in puppet should be the same as on the VM (which only knows about the recursors).
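(To make the distinction concrete, puppet-side lookups would query the recursors explicitly rather than whatever the host's default resolver is. A dnspython sketch; the recursor IP and hostname are placeholders:)

```python
import dns.resolver  # pip install dnspython

# Placeholder recursor IP -- point at the cloud recursors, not the auth servers.
resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ['10.0.0.53']

# Resolve the way a VM would, so any IP aliasing matches what the VM sees.
for record in resolver.query('example-01.example.eqiad.wmflabs', 'A'):
    print(record.address)
```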
Tue, Oct 23
Another issue is that we typically ssh via a bastion -- if the bastion is unable to resolve the target host then the connection will fail.
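(That's because the name lookup for the target happens on the bastion: with a direct-tcpip forward, the bastion's sshd is what resolves the final hostname. A paramiko sketch shows the shape of it; hostnames and username are placeholders:)

```python
import paramiko  # pip install paramiko

BASTION = 'bastion.wmflabs.org'              # placeholder bastion host
TARGET = 'example-01.example.eqiad.wmflabs'  # placeholder target VM

# First hop: connect to the bastion.
bastion = paramiko.SSHClient()
bastion.set_missing_host_key_policy(paramiko.AutoAddPolicy())
bastion.connect(BASTION, username='youruser')

# Ask the bastion's sshd for a channel to the target. The *bastion*
# resolves TARGET here -- if its resolver can't find the name, this fails.
channel = bastion.get_transport().open_channel(
    'direct-tcpip', (TARGET, 22), ('127.0.0.1', 0))

# Second hop: ssh to the target over that channel.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(TARGET, username='youruser', sock=channel)
client.close()
bastion.close()
```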
There are currently 23 projects running in the new region, and we're moving more over every day. This would have been a reasonable request when we were originally setting up the Neutron network, but it is far from trivial now.
Can I ask one of you to put up the maintenance message and suggest a window for this move? Anytime during US work hours (let's say after 14:00 UTC) will suit me. Thank you!
Since your VMs will have to be moved to the new region soon anyway, I suggest that you build these fresh instances over there (where you have plenty of quota anyway). That will save us a move later on.
Spoke too soon, got another failure overnight.
Mon, Oct 22
This should be fixed -- thanks for noticing @Paladox. Let me know if you find other things like this.
I've created a new VM, t206636-2.wikidata-query.eqiad.wmflabs. This is in the older region, on a host that is not super busy but is supporting quite a few other VMs. If your tests look good there too then we're probably in good shape and can avoid needing special hardware just for you.
Things seem better this week! Is that my imagination?
Sun, Oct 21
My only concern about this is that those recursors are used about every second on every VM, so they're a huge, vital point of failure and I'm a bit reluctant to rock the boat.
Thanks, Stas. There are two ways I think we can go forward with this:
Fri, Oct 19
Oh, also, what would you like the VM to be named?
I can create a VM with a large disk allocation any time now. The default for requests like this would be a VM with 24GB of RAM, a 300GB disk, and 4 cores. Will that work for you in the near term?
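(For the record, creating that via the API would look roughly like this openstacksdk sketch; the cloud, image, flavor, network, and server names are all illustrative, not the real ones:)

```python
import openstack  # pip install openstacksdk

conn = openstack.connect(cloud='eqiad1-r')  # cloud name is illustrative

# Names below are illustrative -- pick whatever maps to
# 4 cores / 24GB RAM / 300GB disk in this deployment.
image = conn.compute.find_image('debian-9.0-stretch')
flavor = conn.compute.find_flavor('bigdisk')
network = conn.network.find_network('lan-flat-cloudinstances2b')

server = conn.compute.create_server(
    name='wdqs-test',  # hypothetical name
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{'uuid': network.id}],
)
server = conn.compute.wait_for_server(server)
print(server.status)
```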
Thu, Oct 18
note to self, I can merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/468377/ after Stas releases this VM (or at least stops caring about resource contention)
Great! Thank you!
@faidon, I'm not sure I understand your response here. We have an agreed-upon date for the removal of Trusty dependencies, and we are working as fast as we can to hit that date. You seem to have mentally moved that date into the past, which doesn't seem very realistic.
VM networking does not work properly for this host, so something is still missing.
shinken-01 is still active and needs to work for now. We're hoping to rebuild it but that's a work in progress.
No real need to coordinate, you can just do it anytime -- I'll keep any real load off that host until after it's moved.
Wed, Oct 17
shinken-01.shinken.eqiad.wmflabs might be a good test.
Both these hosts are now up and running VMs.
This is working for now. In a future version we can switch from sink to the neutron integration API.
I think this is done.
I've created this project. Make sure that you have Horizon switched to the 'eqiad1-r' region before you try to create things.