The real solution for this is to dedicate real developer time to pybal, to move it to an FSM and a netlink-based Python ipvs client.
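To sketch what I mean by an FSM here (a purely illustrative sketch, not pybal's actual code or API; all names invented):

```python
# Hypothetical sketch only -- not pybal's actual code or API.
from enum import Enum


class ServerState(Enum):
    UP_POOLED = "up/pooled"
    UP_DEPOOLED = "up/depooled"
    DOWN_DEPOOLED = "down/depooled"


# Allowed transitions: (current state, event) -> next state.
TRANSITIONS = {
    (ServerState.UP_POOLED, "monitor_down"): ServerState.DOWN_DEPOOLED,
    (ServerState.UP_POOLED, "admin_depool"): ServerState.UP_DEPOOLED,
    (ServerState.UP_DEPOOLED, "admin_pool"): ServerState.UP_POOLED,
    (ServerState.UP_DEPOOLED, "monitor_down"): ServerState.DOWN_DEPOOLED,
    (ServerState.DOWN_DEPOOLED, "monitor_up"): ServerState.UP_DEPOOLED,
}


def transition(state: ServerState, event: str) -> ServerState:
    """Return the next state, or raise on an illegal transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event} in state {state}")
```

The point of making the states and transitions explicit is that illegal state changes fail loudly instead of leaving the pool in an inconsistent condition.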
Mon, Mar 27
This is now a blocker (sort of) for the current work on using DNS for discovery: as soon as I switched the restbase URL parameter to the discovery one (restbase.svc.codfw.wmnet to restbase.discovery.wmnet, both resolving to the same IP), cxserver and mobileapps started complaining, and investigation showed the issue was that requests were being directed to the proxy instead of going direct.
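As a sanity check, the equivalence of the two names can be verified with something like this (a sketch; it assumes both records resolve from the host you run it on):

```python
# Quick check that the discovery name and the per-DC name resolve
# to the same address (sketch; run from a host that can resolve both).
import socket

svc = socket.gethostbyname("restbase.svc.codfw.wmnet")
disc = socket.gethostbyname("restbase.discovery.wmnet")
print(svc, disc, "same" if svc == disc else "DIFFERENT")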
What still needs to be done:
Thu, Mar 23
@elukey looking at the numbers, the only slightly worrying situation is for the APIs in codfw: if we lose row B we lose more than half of the capacity. We might want to add API servers in row A or row C once we get new hardware.
@Marostegui is it depooled from mediawiki-config? If not, we might want to do so.
Tue, Mar 21
Sun, Mar 19
Thu, Mar 16
@akosiaris are you sure about that? If replication is broken the rdb file is transferred, and from what I see only some are larger than 500 MB.
Wed, Mar 15
Thinking of a general way to represent any mediawiki-config variable left me with awkward, over-complicated objects.
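A hypothetical sketch of the kind of object I mean (all names invented for illustration):

```python
# Hypothetical sketch of an over-generic config-variable wrapper
# (all names invented for illustration).
class ConfigVariable:
    def __init__(self, name, default, overrides=None, merge_strategy="replace",
                 per_wiki=True, per_dc=False, validator=None):
        self.name = name                      # e.g. "wgSomeSetting"
        self.default = default                # global default value
        self.overrides = overrides or {}      # wiki/dc -> value
        self.merge_strategy = merge_strategy  # "replace" or "merge"
        self.per_wiki = per_wiki
        self.per_dc = per_dc
        self.validator = validator            # optional callable

    def value_for(self, wiki, dc=None):
        # The precedence rules quickly become hard to reason about.
        for key in ((wiki, dc), wiki, dc):
            if key in self.overrides:
                return self.overrides[key]
        return self.default
```

Every variable ends up carrying flags and precedence logic that only a handful of variables actually need, which is exactly the awkwardness I want to avoid.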
Tue, Mar 14
@Ciencia_Al_Poder care to explain why you removed the "easy" tag?
Fri, Mar 10
Wed, Mar 8
This is now solved with the latest version of conftool
@akosiaris you are correct, but I think that's inevitable.
Tue, Mar 7
Mon, Mar 6
Just to give some context: it might be possible to have a true multi-DC cluster for etcd, but that would need:
@Lydia_Pintscher not really, I'm monitoring the jobqueue and it's constantly decreasing in size. We should be ok.
As a general comment on the rest of the thread:
Fri, Mar 3
So I thought a bit about it and came up with the following alternative solution:
An example of an output file is:
So, I just found out that the DNS cache feature we were supposedly using in HHVM was removed from it some time ago, so while we still have the ini setting in our setup, it's having no effect.
@tstarling I agree, dblists is one of the things that could be stored in etcd and read from there. On the other hand, it's such a simple and relatively stable list that we could also decide to maintain it as a simple configuration file that we distribute across the cluster in a standard format, and expect every application to read it from disk; see the sketch below.
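The consumer side of the file-on-disk option would be trivial. A sketch, assuming the usual one-dbname-per-line format with # comments (the path is just an example):

```python
# Minimal dblist reader: one database name per line, '#' starts a comment.
# Sketch only; the path below is an example.
def read_dblist(path):
    wikis = []
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()
            if line:
                wikis.append(line)
    return wikis

# e.g. all_wikis = read_dblist("/srv/mediawiki/dblists/all.dblist")
```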
Thu, Mar 2
@Jgreen any idea when it will happen? (moving all of FR to jessie, I mean).
Again regarding precaching (which is surely duplicable): do we *really* need it?
Wed, Mar 1
Tue, Feb 28
I took what @Krinkle did in his patchset and fixed a couple of things in order to implement the "clone mode" and be able to simulate the full procedure:
Mon, Feb 27
Feb 24 2017
The code is done and a package has been created, although it is still only in experimental. This task can be considered resolved, though.
Feb 22 2017
Answering some of my questions:
My preference for standalone tools is always the GPL v3, because there is no reason for people to use them in different contexts.
Feb 20 2017
Feb 17 2017
Hi! I'm the one who suggested most of those timeout changes. Some have different historical reasons, but I think we can safely raise the connect timeout for the jobrunners (NOT for the common appservers).
Feb 15 2017
Feb 14 2017
Also note that while it is advisable to reimage the videoscalers and jobrunners, in the other cases a simple change of role in puppet is OK.
Feb 10 2017
So basically either the connection is kept open on the client side and the name is never looked up again, or the application caches DNS indefinitely.
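The second failure mode looks roughly like this (an illustrative sketch, not any specific client's code; the name and port are examples):

```python
# Sketch of the "cache DNS forever" anti-pattern: the name is resolved
# once at startup and the same IP is reused for every connection.
import socket

BACKEND = "restbase.svc.codfw.wmnet"  # example name
_backend_ip = socket.gethostbyname(BACKEND)  # resolved exactly once

def connect():
    # Every call reuses the possibly stale IP; a DNS change is never
    # picked up until the process restarts.
    return socket.create_connection((_backend_ip, 7231))
```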
The prioritized queue is working well, but I'll probably raise the number of non-prioritized workers today as we're now underutilizing the systems.
Feb 9 2017
The codfw cluster is getting replicated data from eqiad under /eqiad.wmnet/conftool.
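The replicated keys can be inspected with python-etcd, for example (a sketch; host and port are placeholders):

```python
# Sketch: list the replicated conftool keys on the codfw cluster.
# Host and port are placeholders.
import etcd

client = etcd.Client(host="conf2001.codfw.wmnet", port=2379, protocol="https")
root = client.read("/eqiad.wmnet/conftool", recursive=True)
for node in root.leaves:
    print(node.key, node.value)
```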
Another interesting possibility we might want to explore:
Feb 6 2017
Looking into it more closely, the API user wasn't a red herring after all; I am going to ban the use of oresscores from the MW API, since:
Scratch what I said; the counter for etwiki is most likely broken.
So, graphing ores.*.scores_request.*.count shows that most requests seem to come from etwiki; investigating this further. RecentChanges suggests this is not coming from any form of bot activity.
From my further analysis of logs:
So after taking a quick look at ORES's logs: around 70% of requests come from changepropagation for "precaching". Also
Before raising the number of workers for ORES:
Feb 2 2017
Correct me if I'm wrong, but I think the Main page call can be skipped for all machines not serving standard wiki traffic (API servers, image/video scalers). Also: do we really need to warm up APC for all of the wikis, or could we target only the ones doing 99% of the traffic (which I guess are far fewer)? See the sketch below.
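Something along these lines could work for the targeted warm-up (a sketch; the wiki list, URL path, and port are illustrative, not our actual warm-up script):

```python
# Sketch: warm APC only for the wikis that serve the bulk of the traffic.
# The wiki list and URLs are illustrative.
import urllib.request

TOP_WIKIS = [
    "en.wikipedia.org",
    "commons.wikimedia.org",
    "de.wikipedia.org",
    # ... the handful of projects covering ~99% of traffic
]

def warmup(host, appserver="localhost"):
    # Hit the local appserver directly, setting the Host header,
    # so the request warms that server's APC for the given wiki.
    req = urllib.request.Request(
        f"http://{appserver}/wiki/Main_Page",
        headers={"Host": host},
    )
    urllib.request.urlopen(req, timeout=10).read()

for wiki in TOP_WIKIS:
    warmup(wiki)
```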
Feb 1 2017
Duplicate of T149617
Jan 30 2017
The cluster in codfw is installed and tested to work correctly with conftool. The performance of the cluster using nginx for TLS termination and proxy auth also seems to be much better.
https://commons.wikimedia.org/wiki/File:Asynchronous_processing_on_the_WMF_cluster.pdf is the uploaded file.
Jan 27 2017
Jan 26 2017
@hashar rolled back to wmf.8 and I can confirm the pages I was looking at now render correctly.
The error is the following:
I can reproduce the problem. Any idea since when this has been happening?