Fri, Jul 21
I've put this server back into the load-balancer rotation.
I agree with @MoritzMuehlenhoff - in general, assume the server assignment is correct unless we state otherwise.
Thu, Jul 20
Please apply the same role/profile we use in production to beta too.
Wed, Jul 19
As far as the conftool part is concerned, this seems correct.
I ran the script and it worked fine, but we also need to add a redis slave on ocg1003, as right now we have lost the replica of the redis master, which according to puppet is on ocg1002.
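For reference, re-establishing the replica by hand would look roughly like the following; a minimal redis-py sketch where the hostnames and port are assumptions, and in practice this should of course go through puppet rather than being done manually:

import redis

# Hypothetical sketch: make the redis instance on ocg1003 replicate the master
# on ocg1002. Hostnames and port below are assumptions, not taken from the task.
replica = redis.StrictRedis(host='ocg1003', port=6379)
replica.slaveof('ocg1002', 6379)            # start replicating from the master
print(replica.info('replication'))          # check role:slave and master_link_status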
Please note that if no one has run the script that removes the entries corresponding to ocg1001 from the cache, we're still serving failing requests, as mediawiki will try to connect to ocg1001 directly.
Tue, Jul 18
Thu, Jul 13
I think you need to fix the logic inside the set_pooled_state function as per my comment. Apart from that, it seems correct.
I finished deploying the service, but I strongly urge you to have it respond to its root URL with something other than a 404.
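Even a trivial handler would do; a hypothetical Flask-style sketch (the framework, names and port are assumptions, not the service's actual code):

from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical sketch: answer the root URL with a 200 and a small informative
# body instead of a 404, so load-balancer checks and humans get a sane response.
@app.route('/')
def root():
    return jsonify(name='example-service', status='ok')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)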
Wed, Jul 12
I'd add another PRO to the second option (which I prefer for backwards compatibility): it allows us to be more concise.
Tue, Jul 11
Mon, Jul 10
This thing has been alerting for 4 days, as it's apparently using the default azure SSL cert.
I did deactivate those nodes, and filippo and I already added a "puppet node deactivate" to wmf-reimage, so I guess it's all right now.
@Krinkle sure, we can enable reusing TC in beta for now and test if the feature is stable and working as expected.
Aren't we collecting all server metrics via prometheus? If that's the case, shouldn't we just drop the diamond collector for those metrics?
Fri, Jul 7
@schana I will get into the details early next week - I just have one additional question: who exactly will be the consumer of the service? The user's browser? mediawiki via an extension? Another piece of software?
Thu, Jul 6
One option to support reconnections, SRV records and the rest is to use the (blocking) python-etcd library via defer.deferToThread, as etcd-mirror does.
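Something along these lines; a minimal sketch assuming a twisted reactor is already running, with the host, port and key made up for illustration:

import etcd
from twisted.internet import threads

# The python-etcd client is blocking, so run its calls in the reactor's thread
# pool via deferToThread; allow_reconnect lets it fail over between cluster
# members. Host, port and key below are assumptions.
client = etcd.Client(host='etcd.example.org', port=2379, allow_reconnect=True)

def read_key(key):
    return threads.deferToThread(client.read, key)

d = read_key('/some/key')
d.addCallback(lambda result: print(result.value))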
Tue, Jul 4
Mon, Jul 3
So, all my attempts at restarting one or more instances of the puppetmaster backends luckily didn't cause any puppet errors, but it takes up to 4 minutes for a restart to take effect. I'm not sure that's acceptable with our current practices.
The passenger docs about this are pretty clear: in the FLOSS version of passenger, a restart is blocking - that is, passenger will wait for all currently spawned workers to stop, and for at least one new worker to be spawned, before serving new requests. Those requests get queued, but if an expensive catalog is being built, that means ~30 seconds of blocking.
First hurdle: puppetlabs advises setting environment_timeout to unlimited and restarting the puppetmaster at every code deploy, for performance reasons.
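For reference, that advice boils down to a couple of lines in puppet.conf on the master; a sketch, and the exact section is an assumption ([main] works as well):

[master]
environment_timeout = unlimited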
Fri, Jun 30
Wed, Jun 28
Jun 21 2017
Hey sorry I was on vacation last week.
Jun 20 2017
So what I did:
Caused by: java.nio.file.FileSystemException: .../_1ut5m.nvm: Too many open files
Jun 19 2017
@Cmjohnson please proceed to decom/derack these servers and rack new ones in their place.
Jun 9 2017
Both maps and restbase are now monitored at the load-balancers of the SSL terminators in all datacenters. Resolving.
Jun 8 2017
@faidon at first I was thinking of implementing the checks on the LVS hosts (in the end, the puppetization is mostly the same), but I thought the nrpe checks on the caches were better, simply because they monitor each cache host rather than round-robining over every host in a pool. It might also help spot problems on individual caches.
Jun 7 2017
In order to do that, I want to add a local nrpe check on the cache edge servers that calls the SSL terminator, so that we cover as many logical layers as possible. Sadly, there is a bug in service-checker that I need to fix before this can go live, but apart from that it should be pretty straightforward.
Jun 6 2017
I would start monitoring restbase on text-lb and maps on text-upload.