I think I've found the correct configuration file now, at mediawiki/services/change-propagation/jobqueue-deploy/scap/vars.yaml . I couldn't tell whether the concurrency limits are normally reached, and I couldn't figure out how they add up to a global connection count. Looking at current connection counts from scb* to jobrunner.svc with netstat, I see counts of 113, 318, 52, 107. MediaWiki has 60 job types; is it correct to multiply that by 30, which is the top-level concurrency in vars.yaml, and then to adjust for the overridden queue types? 55 classes with 30 connections each, plus the 5 overrides, would make 1970 connections. Multiplying that by 4 scb servers gives a total of 7880 maximum connections. Is this correct?
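To spell out the arithmetic I'm assuming (the per-type override values aren't written out here, so their sum of 320 is just what the stated 1970 total implies):

```
// Back-of-the-envelope connection estimate, per the reading of vars.yaml above.
$defaultTypes  = 55 * 30;                        // 1650 connections at the default concurrency
$overrideTypes = 1970 - $defaultTypes;           // 320 implied across the 5 overridden types
$perServer     = $defaultTypes + $overrideTypes; // 1970 per scb host
$clusterMax    = $perServer * 4;                 // 7880 maximum across the 4 scb servers
```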
Thu, Aug 16
Current status: everything is done except enwiki and the T202032 wikis. enwiki has about another 49 hours to run.
Wed, Aug 15
So how do we end up trying to insert a row for revision 3003 twice?
Tue, Aug 14
You can see the full logs at mwmaint1001:/var/log/mediawiki/populateContentTables/ . On both aawikibooks and gotwikibooks, the error occurred on the second batch of the archive table, starting at ar_rev_id 2001. In both cases it was also the last batch, with the maximum ar_rev_id being 3275 and 3175 respectively.
I tried importing a file into testwiki with curl, forcing a centralauth DB connection in the same request by first deleting the global:centralauth-user:... cache key, but I still could not reproduce it.
Mon, Aug 13
It doesn't have to be a LoadBalancer bug; it could just be some other extension calling reuseConnection() inappropriately. It's hard to debug without a reproduction procedure. I see in the logs that there was a series of these on 2018-08-06 with the URL https://sat.wikipedia.org/w/index.php?title=%E1%B1%9F%E1%B1%A5%E1%B1%9A%E1%B1%A0%E1%B1%9F%E1%B1%AD:Import&action=submit , and the failed query indicates that the user was @MF-Warburg , who did have successful file imports around that time according to the import log: https://sat.wikipedia.org/wiki/%E1%B1%9F%E1%B1%A5%E1%B1%9A%E1%B1%A0%E1%B1%9F%E1%B1%AD:Log/import
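For clarity, the kind of inappropriate reuseConnection() call I have in mind would look roughly like this (a hypothetical sketch, not code from any particular extension):

```
use MediaWiki\MediaWikiServices;

$lb = MediaWikiServices::getInstance()->getDBLoadBalancer();
$dbr = $lb->getConnection( DB_REPLICA );
// ... do some reads ...
$lb->reuseConnection( $dbr ); // the handle goes back into the pool here

// Bug: the stale handle is still used afterwards. Another caller (say
// CentralAuth) can be handed the same connection and switch its domain, so
// this query ends up on the wrong wiki or inside someone else's transaction.
$dbr->select( 'page', 'page_id', [ 'page_namespace' => 0 ], __METHOD__ );
```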
It's important to avoid running it on requests that don't need it. In particular, requests that only call $wgParser->setHook() but not Parser::parse() should not call firstCallInit(). Maybe the risk of that is fading but my understanding is that it's not quite gone yet.
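Roughly the distinction I mean, using the global $wgParser (the hook name and callback are made up):

```
// Registering a tag hook is cheap and is fine to do on every request.
$wgParser->setHook( 'mytag', [ MyExtensionHooks::class, 'renderMyTag' ] );

// firstCallInit() is the expensive one-time setup; it should only be reached
// on requests that actually go on to parse something.
$wgParser->firstCallInit();
$out = $wgParser->parse( $wikitext, $title, ParserOptions::newFromAnon() );
```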
Sat, Aug 11
Thu, Aug 9
The initial report showed a query which didn't even use ORES, so it seems unfair to assign it to them.
Here's my proposal.
Wed, Aug 8
We have debug logs for this request. On mwlog1001, run: zgrep W2XVZApAAC4AAEKMbQAAAAAV /srv/mw-log/archive/test2wiki.log-20180805.gz
db1071, the master, had no writes
Tue, Aug 7
The drop may have been caused by the API maxlag parameter. Wikidata:Bots recommends using a maxlag parameter, and some client libraries set maxlag=5 by default. The point of this feature is to make bots pause during replication lag, to prioritise human users and avoid worsening the situation.
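For reference, a minimal sketch of what that looks like from the bot side (the retry policy here is just illustrative):

```
$url = 'https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&format=json&maxlag=5';
do {
    // With maxlag=5, the API refuses the request while replication lag exceeds
    // 5 seconds and returns an error with code "maxlag" instead of acting.
    $body = file_get_contents( $url, false, stream_context_create(
        [ 'http' => [ 'ignore_errors' => true ] ]
    ) );
    $result = json_decode( $body, true );
    $lagged = isset( $result['error']['code'] ) && $result['error']['code'] === 'maxlag';
    if ( $lagged ) {
        sleep( 5 ); // back off and let the replicas catch up before retrying
    }
} while ( $lagged );
```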
Mon, Aug 6
@greg The WN31 things are done now; it took only 1081 seconds for mediawikiwiki and 9252 seconds for metawiki. For metawiki the rate was about the same as anomie got for testwiki: 2000 rows per second for the revision table and 600 rows per second for the archive table. At that rate, we can expect wikidatawiki to take about 91 hours and commonswiki about 48 hours. We can run them concurrently since they are on different DB clusters, and that way maybe get them done by the end of the week.
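The extrapolation, spelled out (the row counts here are what the estimates imply at the measured rates, not numbers I re-checked):

```
$revisionRate = 2000;  // rows/sec measured for the revision table on metawiki
$archiveRate  = 600;   // rows/sec measured for the archive table on metawiki

// hours to populate ≈ rows / rate / 3600, so the 91-hour wikidatawiki estimate
// corresponds to roughly 91 * 3600 * 2000 ≈ 655 million revision rows, and the
// 48-hour commonswiki estimate to roughly 346 million.
function estimateHours( int $rows, int $ratePerSecond ): float {
    return $rows / $ratePerSecond / 3600;
}
```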
Back up to ~60% loss now, due to a slow drop in capacity on logstash1008 and logstash1009. And there was a similar event on August 4, which was fixed when @fgiunchedi restarted logstash. Can we have a daily restart cron job now?
The core patch is enough to kill the warning, at the expense of extra memory usage. Merging the RemexHtml patch, releasing and updating composer will reduce the memory impact.
Sat, Aug 4
Fri, Aug 3
Short version: RemexCompatMunger case B/b calls endTag() on a p-wrapper which still has children in the stack. There is in fact an effect on the output.
Thu, Aug 2
Reduced test case:
The logspam patches removed the extra demand caused by the 1.32.0-wmf.10 deployment. Packet loss is now down to ~20%, i.e. still bad but comparable to the long-term average of the available Prometheus data, which starts in May. Successfully consumed packets dropped along with the demand, which should leave logstash with some capacity headroom. So the next most obvious theory is stalling, and yet the 4MB receive buffer didn't help reduce the packet loss, which is amazing since the bitrate is low: 4MB should correspond to a stall of about 2.7 seconds. It seems to be slow, or stalling for even longer periods of time than that.
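The 2.7 second figure is just the buffer size divided by the arrival rate (assuming an average inbound rate of roughly 1.5 MB/s, which is what that figure implies; I have not re-measured the bitrate here):

```
$bufferBytes    = 4 * 1024 * 1024;   // net.core.rmem_default after the change
$bytesPerSecond = 1.5 * 1024 * 1024; // assumed average UDP arrival rate
echo $bufferBytes / $bytesPerSecond; // ≈ 2.7 seconds of stall absorbed before drops
```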
I tried restarting logstash on logstash1007 with no other change, to avoid confounding the test; then I quadrupled the default receive buffer size (net.core.rmem_default) and restarted it again. The restart alone increased throughput by a factor of 4.7, which is nice but hardly enough to put a dent in the packet loss graph. The receive buffer change had no effect, or a negative one.
test2wiki and testwikidatawiki are complete.
Daniel proposed the following schedule:
There is T200362 for exporting logstash metrics.
I ran the schema change on labswiki and labtestwiki. I confirmed that no other wiki in all.dblist is missing this schema change. Edits work now. Stashbot died (it quit all channels), so I restarted it.
Wed, Aug 1
Tue, Jul 31
I gave her WMF-NDA access just now for the same reason.
Mon, Jul 30
Fri, Jul 27
You can talk to @Legoktm about it on IRC; he should be around (9:30pm for you would be 11:30am for him), and he has a similar outlook to mine on these questions.
Thu, Jul 26
Wed, Jul 25
Tue, Jul 24
Seems fixed now. Please reopen if necessary.
This should be fixed with https://gerrit.wikimedia.org/r/c/mediawiki/core/+/447041
Fri, Jul 20
I'm imagining that we would do it without a progress bar, just a message like you say. It's a UX improvement compared to an exception message. But I wonder whether we should still require the bigdelete right before launching a job? Currently $wgDeleteRevisionsLimit is 5000, and above that number of revisions, the bigdelete right is required. Is 5000 also an appropriate threshold for queueing a job?
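Roughly the check I'm describing (a simplified sketch, not the actual core code path):

```
$wgDeleteRevisionsLimit = 5000;
$revisionCount = 12345; // placeholder: revisions on the page being deleted
$canBigDelete  = false; // placeholder for $user->isAllowed( 'bigdelete' )

if ( $revisionCount > $wgDeleteRevisionsLimit && !$canBigDelete ) {
    // Current behaviour: refuse the deletion. Proposed behaviour: queue a job
    // and show a "deletion scheduled" message instead of an exception. The
    // question is whether 5000 is also the right threshold for the job path.
}
```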
Jul 19 2018
To serve read traffic correctly, $wgReadOnly needs to be false. $wgReadOnly is mostly a UI-layer concept which shows an informative message to the user, not just on POST but also on confirmation pages. So it's not really necessary to fix this to implement active-active support; we can just set $wgReadOnly to false. So I don't think this is high priority, it is just logspam.
Poor but specific. The old script was just guessing when it said "Error looking up DB"; it would have said that even if the PHP binary was missing. In the new script I put in different messages for the different error cases. Feel free to submit a patch if you want to reword it, and then I can +2 it. It's on line 91 of maintenance/mysql.php.
In production, mwscript/sudo is causing its own set of problems with Ctrl+Z, even when running eval.php. To test it locally I skipped my own wrappers and ran php directly, since I figured the (evident) problems with the wrappers were out of scope.
It can't be done in the request because request threads can die at any time. But it could be done in the job queue, which has the means to retry jobs which do not complete successfully, so it could be reasonably reliable.
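A minimal sketch of what I mean, with a hypothetical job name; the point is just that the queue re-runs jobs whose run() fails, unlike a web request thread that can die silently part-way through:

```
class PopulateSomethingJob extends Job {
    public function __construct( Title $title, array $params ) {
        parent::__construct( 'populateSomething', $title, $params );
    }

    public function run() {
        // ... do the work in small, resumable batches ...
        return true; // returning false (or throwing) lets the queue retry the job
    }
}
```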
Jul 18 2018
A lesser issue is that Ctrl+Z properly suspends both mysql and the wrapper, but executing fg does not seem to send SIGCONT through to mysql.
As detailed in T105378, this was fixed by reducing the connect timeout.