So... @hashar, is the workaround for this just to wait a while after building a new compiler node? Or is that not adequate?
Disappointingly, fixing that mistaken newline issue doesn't resolve anything.
Thanks to an extremely tedious binary search, I've determined that this is not a puppetmaster issue. It correlates with this hiera setting:
Note that puppet installs a bunch of files happily before getting to this point. So it's not a total failure of file-serving, it's something specific.
Fri, Jul 12
Since there have been some other unexpected monitoring issues today related to monitoring refactors, pinging @fgiunchedi before I dig in too much.
Thu, Jul 11
<mutante> JobTimeoutSec=, JobRunningTimeoutSec=
5:40 PM when i asked if it looks at return codes i was told "what if it's still running". but this ^
5:40 PM and then there is "write another timer that is scheduled to run at the shutdown times you would like. They can run a service which stops the service you want to stop, by running ExecStart=/bin/systemctl stop other.service in the service file called by your shutdown timer"
5:42 PM but if it times out that also does not mean it goes into "failed" state. "When a job for this unit is queued, a timeout JobTimeoutSec= may be configured. Similarly, JobRunningTimeoutSec= starts counting when the queued job is actually started. If either time limit is reached, the job will be cancelled, the unit however will not change state or even enter the "failed" mode."
5:43 PM ah, you can kill it with "JobTimeoutAction= optionally configures an additional action to take when the timeout is hit, "
5:44 PM so you could have a script that does both (kill the process and tell monitoring about it) and use that as your JobTimeoutAction command
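To make that concrete, here's a minimal sketch of the timer-stops-a-service pattern described above (the unit names and schedule are made up for illustration):

    # stop-other.timer -- fires at the desired shutdown time
    [Unit]
    Description=Scheduled stop of other.service

    [Timer]
    OnCalendar=*-*-* 23:00:00

    [Install]
    WantedBy=timers.target

    # stop-other.service -- activated by the timer above (matching name)
    [Unit]
    Description=Stop other.service

    [Service]
    Type=oneshot
    ExecStart=/bin/systemctl stop other.service

Note that JobTimeoutSec=, JobRunningTimeoutSec=, and JobTimeoutAction= all go in the [Unit] section of the unit whose jobs they should guard.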
Regarding the failure to alert:
There's an alias for zerowiki, which suggests that it's replicated. There's no actual database of that name present on the replicas, though, so maybe it was never actually set up. In any case, I'm running the cleanup steps.
Wed, Jul 10
Tue, Jul 9
A VM of this size will be quite difficult for us to manage -- among other things, it would take many hours to move off of a hypervisor. Generally when we create large VMs (although so far we have never created one of this size) it's with the understanding that I may need to delete it as part of routine maintenance and leave it to be rebuilt by the users.
Mon, Jul 8
Thu, Jul 4
Wed, Jul 3
I've made a new dashboard, https://grafana.wikimedia.org/d/ebJoA6VWz/nova-fullstack -- once I'm convinced that it's doing what I expect I'll delete the older labs-nova-fullstack board.
Tue, Jul 2
How is corosync/pacemaker going to work then with a single VIP?
Most likely from https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/520147/
(Let's use Buster for this if it's available on ganeti)
I suggest that rather than a global block we think about this as a tooling issue -- maybe provide a default 'block everything' robots.txt (or even an actual service block) and a well-documented way for users to manage this.
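For reference, a 'block everything' robots.txt is just:

    User-agent: *
    Disallow: /

Shipping that as the default, with a documented way for users to override it, would keep this a tooling decision rather than a global one.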
Mon, Jul 1
btw, I'm happy to actually set up the VMs; I'm assigning this to Alex only to approve the resource usage.
If we want single-purpose proxies we can create ganeti VMs for that. I'm still trying to determine if we can have proper three-server redundancy that way...
We can try to buy more hardware or we can just declare that our HA cluster is [cloudcontrol1003, cloudcontrol1004, cloudservices1003]. The puppet would be a bit ugly but they're all on public IPs.
fwiw I'm totally down with deciding we need a third cloudcontrol. As I understand it, it's only the front-end proxy that we'd need three of, not each API backend, right?
I have never used HAproxy, so there is no plan as yet -- that's up to you :)
For the short term I've been assuming we'd just put a proxy in front of the existing endpoints. That means two each:
Fri, Jun 28
btw, if you update the live config, please adjust the docs here accordingly: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Rabbitmq
yep, I definitely just did what the HA guide said to do :/ If we can do active/active with two disk nodes that seems fine!
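As a sketch of what active/active would look like with classic mirrored queues (the approach the upstream HA guide describes; the policy name and scope here are just an example to verify against our setup):

    rabbitmqctl set_policy ha-all '.*' '{"ha-mode":"all","ha-sync-mode":"automatic"}'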
Thu, Jun 27
I'm certainly in favor of replacing custom code with upstream code! In particular it seems like we'll need this in order to make live-migration work sensibly between different CPU-typed cloudvirts, right? I do have a few concerns:
This is resolved -- we now use puppet certs rather than the libvirt* cert. I removed the monitor for the old cert.
Wed, Jun 26
Note that ever since https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/418945/, hypervisor<->hypervisor communication has been broken, so these certs are moot (we really need libvirtd -l for that). I'm going to try to fix the certs anyway, because we'll want them working to get live migration going eventually, but at present this is a very low-stakes issue.
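For reference, a rough sketch of what turning that on would involve on a Debian cloudvirt (paths and option names from memory, so treat this as an assumption to verify):

    # /etc/default/libvirtd -- the -l (listen) flag mentioned above
    libvirtd_opts="-l"

    # /etc/libvirt/libvirtd.conf -- serve TLS using the puppet certs
    listen_tls = 1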
Tue, Jun 25
Approved! We'll create this shortly.
Mon, Jun 24
Haven't seen this in ages.
Fri, Jun 21
I have an (ironically) unpuppetized example of a dual-run setup running now:
And here is some info about the base classes applied to a VM vs a production machine:
Here are some usage stats:
Thu, Jun 20
I added a new flavor named 'mediumram' to the integration project. Thanks for conserving RAM!
Tue, Jun 18
There are files in /tmp/calibre_2.75.1_tmp_imITvt on tools-sgewebgrid-lighttpd-0921.tools.eqiad.wmflabs but it looks like they're being actively created/deleted so that's maybe fine... I don't see /tmp litter on other hosts.
Mon, Jun 17
Just to double-check: IIRC, back in the day we avoided this because we had multiple controllers attached to a shared shelf and if two controllers ran at the same time then terrible, terrible things happened. Is it safe to say that there's no current situation where having 'too many' nfs services running at once causes harm?
The remaining task here is to make/update a wiki page about this.
We had a session about this during the SRE summit. The conclusions were:
Sun, Jun 16
I've put eight test VMs on 1015; I'll let them run for a few days and then see if they're still up :)
Jun 13 2019
Jun 7 2019
I wouldn't say that the role is necessarily broken; it may just be that it needs the hiera args provided if the class is applied. You could provide a good default in the git puppet tree, though, if there is a good universal default. Easier would be to just add it to the project-wide hiera setting on Horizon.
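As a sketch, the project-wide route is just a hiera key in Horizon (the class and parameter names here are hypothetical):

    # project-wide hiera setting on Horizon, e.g.:
    role::example::some_param: 'sensible-default'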
Some extensions were not bundled with the branch, and so the usual 'check them out as submodules' trick did not work for updating them; they had to be cloned manually. @Andrew, do you want to add a description of what you did?
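For context, roughly what the usual trick vs. the manual fallback looks like (the extension name is hypothetical):

    # the usual trick: extensions bundled with the branch come in as submodules
    git submodule update --init

    # extensions missing from the branch had to be cloned by hand instead
    git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/SomeExtension extensions/SomeExtension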