My preference for standalone tools is always the GPLv3, because there is no reason for people to need to use the code in different contexts.
Mon, Feb 20
Fri, Feb 17
Hi! I'm the one who suggested most of those timeout changes. Some have different historical reasons, but I think we can safely raise the connect timeout for the jobrunners (NOT for the common appservers).
Wed, Feb 15
Tue, Feb 14
Also note that while for videoscalers and jobrunners it is advisable to reimage, in the other cases a simple change of role in puppet is ok.
Fri, Feb 10
So basically either the connection is kept open on the client side and the name is never looked up again, or the applications cache DNS lookups indefinitely.
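A minimal Python sketch of what I mean (the hostname is made up, not one of ours):

```
import socket

# Hypothetical service name, resolved once at startup (as long-lived clients do).
addr = socket.getaddrinfo("service.example.wmnet", 443, socket.AF_INET)[0][4]

# Case 1: the connection is kept open and reused, so the name is never
# looked up again; repointing the DNS record has no visible effect.
conn = socket.create_connection(addr)

# Case 2: the application caches the resolved address indefinitely and keeps
# reconnecting to the stale IP even after the record changes.
cached_addr = addr
```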
The prioritized queue is working well, but I'll probably raise the number of non-prioritized workers today as we're now underutilizing the systems.
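As a rough sketch of how the two pools relate (hypothetical worker counts and plain threads, not the actual ORES setup):

```
import queue
import threading

# Two separate pools: prioritized requests (e.g. changeprop precaching) and everything else.
priority_q = queue.Queue()
normal_q = queue.Queue()

# Hypothetical sizes; "raising the number of non-prioritized workers" just
# means bumping NORMAL_WORKERS now that the systems are underutilized.
PRIORITY_WORKERS = 8
NORMAL_WORKERS = 24

def worker(q):
    while True:
        job = q.get()
        job()          # e.g. score a revision
        q.task_done()

for _ in range(PRIORITY_WORKERS):
    threading.Thread(target=worker, args=(priority_q,), daemon=True).start()
for _ in range(NORMAL_WORKERS):
    threading.Thread(target=worker, args=(normal_q,), daemon=True).start()
```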
Thu, Feb 9
The codfw cluster is getting replicated data from eqiad under /eqiad.wmnet/conftool.
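A quick way to eyeball the replicated keys (a sketch against the etcd v2 HTTP API; the endpoint hostname and port here are assumptions):

```
import requests

# Hypothetical codfw etcd endpoint; adjust host/port/TLS as appropriate.
resp = requests.get(
    "https://conf1001.codfw.wmnet:2379/v2/keys/eqiad.wmnet/conftool",
    params={"recursive": "true"},
)
resp.raise_for_status()

# Print every replicated key so it can be compared against the eqiad source.
def walk(node):
    for child in node.get("nodes", []):
        if child.get("dir"):
            walk(child)
        else:
            print(child["key"], child["value"])

walk(resp.json()["node"])
```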
Another interesting possibility we might want to explore:
Mon, Feb 6
Looking into it more closely, the API user wasn't a red herring after all; I am going to ban the use of oresscores from the MW API since:
Scratch what I said; the counter for etwiki is most likely broken.
So, graphing ores.*.scores_request.*.count shows that most requests seem to come from etwiki; I'm investigating this further. RecentChanges suggests this is not coming from any form of bot activity.
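For reference, this is roughly how I'm pulling the numbers (a sketch against the standard Graphite render API; the host and time window are assumptions):

```
import requests

# Hypothetical Graphite host; the target is the per-wiki request counter.
resp = requests.get(
    "https://graphite.wikimedia.org/render",
    params={
        "target": "ores.*.scores_request.*.count",
        "from": "-24h",
        "format": "json",
    },
)
resp.raise_for_status()

# Sum each series over the window to see which wiki dominates.
totals = {}
for series in resp.json():
    totals[series["target"]] = sum(
        v for v, _ts in series["datapoints"] if v is not None
    )

for target, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{target}: {total:.0f}")
```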
From my further analysis of logs:
So after taking a quick look at ORES's logs: around 70% of requests come from changepropagation for "precaching". Also
Before raising the number of workers for ORES:
Thu, Feb 2
Correct me if I'm wrong, but I think the Main page call can be skipped for all non-standard-wiki-serving machines, so API servers and image/video scalers; also: do we really need to warm up APC for all of the wikis, or could we target only the ones doing 99% of the traffic (which I guess are far fewer than that)?
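A sketch of what I mean by targeting only the top-traffic wikis (the shares below are made-up numbers; the real list would come from our traffic stats):

```
# Hypothetical per-wiki share of total traffic.
traffic_share = {
    "enwiki": 0.52, "dewiki": 0.09, "jawiki": 0.07, "eswiki": 0.06,
    "frwiki": 0.06, "ruwiki": 0.05, "itwiki": 0.04, "ptwiki": 0.03,
    "zhwiki": 0.03, "plwiki": 0.02,
    # ...hundreds of smaller wikis making up the remainder
}

def wikis_to_warm(shares, coverage=0.99):
    """Return the smallest set of wikis covering `coverage` of the traffic."""
    picked, total = [], 0.0
    for wiki, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        if total >= coverage:
            break
        picked.append(wiki)
        total += share
    return picked

print(wikis_to_warm(traffic_share))
```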
Wed, Feb 1
Duplicate of T149617
Mon, Jan 30
The cluster in codfw is installed and tested to work correctly with conftool. The performance of the cluster when using nginx as a TLS/auth proxy seems to be much better too.
https://commons.wikimedia.org/wiki/File:Asynchronous_processing_on_the_WMF_cluster.pdf is the uploaded file.
Fri, Jan 27
Thu, Jan 26
@hashar rolled back to wmf.8 and I can confirm the pages I was looking at now render correctly.
The error is the following:
I can reproduce the problem. Any idea since when this has been happening?
Tue, Jan 24
Mon, Jan 23
@Gilles will do today or tomorrow
@Cmjohnson any news on this?
Jan 22 2017
I would suggest a few things:
Jan 21 2017
Jan 20 2017
Today I wanted to go through Horizon to check and refactor hiera keys before merging https://gerrit.wikimedia.org/r/#/c/332355/.
Problem is now fixed and not just for parsoid.
Extracting from the session outcomes:
So, mystery solved.
@mobrovac when I read the task I was as surprised as you, given I remember we did create those rules correctly (although I think the copytruncate is on purpose).
Jan 19 2017
@Nemo_bis a blank page usually means something other than a timeout has happened. Probably a memory limit was hit; if we want to be able to import tens of thousands of revisions, we might want to turn that into an async job instead, too.
@elukey apparently this needs a code deploy, which means accepting a pull request on github (sic) where not everyone from ops has the ability to merge a PR (I do, as I'm an admin of the wikimedia github org, but YMMV). Then you need to check that into the gerrit-based deploy repo, and finally restbase uses some ansible recipe (sic, again) to be deployed instead of scap3 or trebuchet.
@brion before the change to TMH goes into production, we also need to tweak the jobrunner setup in operations/puppet.
Jan 18 2017
Strace gives little more information, besides the fact that for each of these pages parsoid makes hundreds of preprocessing requests to the MW API. Maybe some recursion limit is being reached?
Isolating a single request, I see that most of the time is spent in executing
Jan 17 2017
Jan 11 2017
Jan 9 2017
Slides for starting the discussion are available here: https://docs.google.com/presentation/d/1DCofLYbP1dWnTb1JWNNnsb0Zp_da8sBhDzlwjCXRoq8/edit?usp=sharing
Jan 7 2017
De-assigning from me as I'm going to hop on a plane in a few hours and I won't be able to follow through on Monday.
There was a huge error log for apache caused by an error in inserting a ticket into the history; I stopped apache, removed the file that was filling up the root filesystem, and started apache/otrs back up again. Things look healthier from a server-side perspective, but I'm no expert on the application, so some error messages I see in the logs don't really make sense to me.
Jan 5 2017
@Gehel any updates on this? I guess it's going to impact our switchover this time as well?