Wed, Oct 10
Thu, Sep 27
I think there should also be ExcimerProfiler::flush(), which detaches the log and returns it, similar to what happens on an implicit flush. The theory is that ExcimerProfiler::stop() will leave the log still attached to the profiler, so this:
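A minimal sketch of the usage this would allow (assuming flush() returns the detached log and stop() leaves the log attached; the sampling period and the other method names here are just illustrative, nothing is final):

```
$profiler = new ExcimerProfiler;
$profiler->setPeriod( 0.01 ); // illustrative 10 ms sampling period
$profiler->start();

doSomeWork(); // placeholder
// flush(): detach and return the samples collected so far;
// profiling continues with a fresh, empty log.
$firstLog = $profiler->flush();

doMoreWork(); // placeholder
$profiler->stop();
// stop() leaves the log attached, so the samples collected since the
// flush are still available here.
$secondLog = $profiler->getLog();
```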
Wed, Sep 26
Tue, Sep 25
Mon, Sep 24
I'm planning the timer backend component. An interesting wrinkle is ZTS support. As in LuaSandbox, we can have an integer ID (sival_int) with our timer struct stored in a hashtable, with a lock protecting it from concurrent updates. Instead of setting a hook in a lua_State, we need to store &EG(vm_interrupt) in the timer struct, since in PHP 7.0+ it is declared with __thread, so taking the address of it is the only way to transport it to the handler thread. Then when the zend_interrupt_function() hook is called, the hook function will need to find all the ExcimerTimer/ExcimerProfiler instances associated with the local thread that have pending events -- this was not a problem with LuaSandbox which only had one "timer set" per lua_State.
I started the process of adding LuaSandbox to PECL: https://marc.info/?l=pecl-dev&m=153776610925078&w=2
The main reason to use a flush callback is for real-time analysis of overload events. The problem we've had in the past is that if profiling data is only logged at the end of the request, the requests that are timing out are invisible. If we log once every 10 seconds, we can get a realistic snapshot of what the cluster is doing.
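A rough sketch of the kind of thing I mean, with a hypothetical callback signature, flush threshold, and logging destination (none of this is final):

```
$profiler = new ExcimerProfiler;
$profiler->setPeriod( 0.01 ); // illustrative sampling period
// Hand partial results to a callback during the request, so a request
// that later times out has already reported what it was doing.
$profiler->setFlushCallback( function ( $log ) {
    // $log would be the detached sample log; formatCollapsed() stands in
    // for whatever serialisation the log class ends up providing, and
    // error_log() stands in for shipping it to central logging.
    error_log( $log->formatCollapsed() );
}, 1000 ); // hypothetical threshold: flush after every 1000 samples
$profiler->start();
```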
Sat, Sep 22
Fri, Sep 21
Wed, Sep 19
Sep 12 2018
For createaccount/autocreateaccount filtering, shouldn't the log performer always be anonymous? It doesn't make sense to use a half-created, non-existent user as the performer. That logic shouldn't depend on User::isSafeToLoad(), which is implemented in a hackish way; it should just depend on $action.
Sep 11 2018
It's very mysterious. My best guess is that "PHP fatal error: unlink(1): No such file or directory" was actually a suppressed warning, and was later misinterpreted as a fatal. There are definitely no entries in fatal.log with "unlink" in the message.
It was webVideoTranscodePrioritized. This is the JobExecutor log line for one of the affected jobs:
You can find the logs by going to https://logstash.wikimedia.org, selecting "Dev Tools" and entering the query
Sep 10 2018
I'm not especially concerned about wikidiff2 since it's written in a defensive way and has mostly been reviewed for security. We are passing user input to much more horrifying C code, like exif which has the worst pointer arithmetic tricks I have ever seen and has been the subject of multiple security vulnerabilities.
The diff in question has a large number of diff ops, but only one section. So for every diff op, getSectionStartIndex() and getSectionEndIndex() have to run a regex match on almost every line in the page, looking for headings. I think a better algorithm would be to have a cache giving the section start index for each line. If a line is not in the cache, do the preg_match() on that line; if it is not a section header, repeat for the previous line, checking the cache again before matching. Do the same for getSectionEndIndex(), scanning forwards.
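A hypothetical sketch of the backwards-scanning half of this (class name, heading regex, and line storage are all made up for illustration):

```
class SectionIndexCache {
    /** @var string[] Lines of the page being diffed */
    private $lines;
    /** @var int[] Map of line index => index of its section heading */
    private $sectionStartCache = [];

    public function __construct( array $lines ) {
        $this->lines = $lines;
    }

    public function getSectionStartIndex( $index ) {
        $visited = [];
        // Walk backwards until we hit a cached answer or a heading line,
        // running the regex at most once per line overall.
        while ( $index >= 0 && !isset( $this->sectionStartCache[$index] ) ) {
            if ( preg_match( '/^==+[^=].*==+\s*$/', $this->lines[$index] ) ) {
                $this->sectionStartCache[$index] = $index;
                break;
            }
            $visited[] = $index;
            $index--;
        }
        // No heading above this line means the section starts at line 0.
        $start = ( $index < 0 ) ? 0 : $this->sectionStartCache[$index];
        // Fill the cache for every line we walked past, so later diff ops
        // in the same section resolve in O(1).
        foreach ( $visited as $i ) {
            $this->sectionStartCache[$i] = $start;
        }
        return $start;
    }
}
```

getSectionEndIndex() would do the same with a second cache, scanning forwards.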
Sep 7 2018
Sep 6 2018
It's not fixed, or has regressed. I noticed this today due to T203628 and confirmed it by placing a simple infinite loop script in mwdebug1002's /w directory.
Sep 5 2018
Sep 4 2018
Aug 28 2018
Aug 27 2018
@CCicalese_WMF asked me to look at this. Tracking all NameTableStore objects looked easy enough, and is beneficial for other reasons, so I did that in the patch linked above. However, without existing calling code, I'm not sure exactly what @Addshore needs for a reset/reload interface and where it would need to be called, so maybe he could add that in a dependent patch.
Aug 24 2018
I approve of this request. As the task description says, I recommended it. I note that @Fjalapeno is Bill's manager, so he should sign off on it; I'm just a tech lead now.
Aug 23 2018
Aug 22 2018
BlobStoreFactory has a single LoadBalancer injected into its constructor, but allows the caller to choose the wiki ID in newSqlBlobStore(). So that's wrong. Wikidata just gets the BlobStoreFactory from MediaWikiServices::getBlobStoreFactory(), which doesn't allow you to specify the wiki ID and just gives a BlobStoreFactory with the LoadBalancer for the current wiki. BlobStoreFactory could take an LBFactory instead, which would allow newSqlBlobStore() to fetch the correct LoadBalancer.
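Roughly like this (a sketch only; the constructor arguments and the exact SqlBlobStore signature are assumed, not checked against the current code):

```
use Wikimedia\Rdbms\LBFactory;

class BlobStoreFactory {
    /** @var LBFactory */
    private $lbFactory;
    /** @var WANObjectCache */
    private $cache;

    public function __construct( LBFactory $lbFactory, WANObjectCache $cache ) {
        $this->lbFactory = $lbFactory;
        $this->cache = $cache;
    }

    /**
     * @param string|bool $wikiId
     * @return SqlBlobStore
     */
    public function newSqlBlobStore( $wikiId = false ) {
        // Ask the LBFactory for the LoadBalancer of the requested wiki,
        // instead of always using the one injected for the local wiki.
        $lb = $this->lbFactory->getMainLB( $wikiId );
        return new SqlBlobStore( $lb, $this->cache, $wikiId );
    }
}
```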
Passing the wrong LoadBalancer into the SqlBlobStore constructor would have approximately this effect. The "previous.trace" is:
The above patches only address BagOStuff. There is also:
You could measure concurrency by summing a timing metric, like I did for the "load" metrics in the API dashboard: https://grafana.wikimedia.org/dashboard/db/api-backend-summary?refresh=5m&orgId=1 . I used:
Does it have to use the same table definition? To measure lag, MediaWiki uses
Aug 21 2018
Can anyone recall the rationale for this? @Anomie thinks the task was created as an action item from a MediaWiki Core Team meeting. If there were notes from that meeting, then that might help.
For testing after the new feature reaches group0, I note that there are a few pages with a large number of revisions on test.wikipedia.org:
Aug 20 2018
enwiki is complete now, so only the T202032 wikis remain.
Aug 17 2018
I think I've found the correct configuration file now, at mediawiki/services/change-propagation/jobqueue-deploy/scap/vars.yaml . I couldn't tell if the concurrency limits are normally reached, and I couldn't figure out how they add up to a global connection count. Looking at current connection counts from scb* to jobrunner.svc with netstat, I see counts of 113, 318, 52, 107. MediaWiki has 60 job types; is it correct to multiply that by 30, which is the top-level concurrency in vars.yaml, and then to adjust for the overridden queue types? 55 classes with 30 connections each plus the 5 overrides would make 1970 connections. Then I multiply this by 4 scb servers, for a total of 7880 maximum connections. Is this correct?
Aug 16 2018
Current status: everything is done except enwiki and the T202032 wikis. enwiki has about another 49 hours to run.
Aug 15 2018
So how do we end up trying to insert a row for revision 3003 twice?
Aug 14 2018
You can see the full logs at mwmaint1001:/var/log/mediawiki/populateContentTables/ . On both aawikibooks and gotwikibooks, the error occurred on the second batch of the archive table, starting at ar_rev_id 2001. In both cases it was also the last batch, with the maximum ar_rev_id being 3275 and 3175 respectively.
I tried importing a file into testwiki with curl, forcing a centralauth DB connection in the same request by first deleting the global:centralauth-user:... cache key, but still could not reproduce it.
Aug 13 2018
It doesn't have to be a LoadBalancer bug; it could just be some other extension calling reuseConnection() inappropriately. It's hard to debug without a reproduction procedure. I see in the logs that there was a series of these on 2018-08-06 with URL https://sat.wikipedia.org/w/index.php?title=%E1%B1%9F%E1%B1%A5%E1%B1%9A%E1%B1%A0%E1%B1%9F%E1%B1%AD:Import&action=submit , and the failed query indicates that the user was @MF-Warburg , who did have successful file upload imports in the logs at that time: https://sat.wikipedia.org/wiki/%E1%B1%9F%E1%B1%A5%E1%B1%9A%E1%B1%A0%E1%B1%9F%E1%B1%AD:Log/import
It's important to avoid running it on requests that don't need it. In particular, requests that only call $wgParser->setHook() but not Parser::parse() should not call firstCallInit(). Maybe the risk of that is fading but my understanding is that it's not quite gone yet.
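For what it's worth, the usual way for extensions to stay on the right side of this is to defer tag registration to the ParserFirstCallInit hook, so the registration work only happens when the parser is actually initialised. An illustrative handler (names are made up):

```
class MyExtensionHooks {
    /**
     * Runs only when Parser::firstCallInit() actually happens, i.e. when
     * something in this request is really going to be parsed.
     */
    public static function onParserFirstCallInit( Parser $parser ) {
        $parser->setHook( 'mytag', [ self::class, 'renderMyTag' ] );
        return true;
    }

    public static function renderMyTag( $input, array $args, Parser $parser, PPFrame $frame ) {
        return htmlspecialchars( $input );
    }
}
```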
Aug 11 2018
Aug 9 2018
The initial report showed a query which didn't even use ORES, so it seems unfair to assign it to them.
Here's my proposal.
Aug 8 2018
We have debug logs for this request. On mwlog1001 do zgrep W2XVZApAAC4AAEKMbQAAAAAV /srv/mw-log/archive/test2wiki.log-20180805.gz
db1071, the master, had no writes
Aug 7 2018
The drop may have been caused by the API maxlag parameter. Wikidata:Bots recommends using a maxlag parameter, and some client libraries set maxlag=5 by default. The point of this feature is to make bots pause during replication lag, to prioritise human users and avoid worsening the situation.
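To illustrate (a generic sketch, not taken from any particular bot; the endpoint and parameters are just an example): a client that sends maxlag=5 gets an error response with code "maxlag" whenever replication lag exceeds 5 seconds, and is expected to wait and retry.

```
$params = [
    'action' => 'query',
    'meta' => 'siteinfo',
    'format' => 'json',
    'maxlag' => 5, // refuse to run this request if replica lag > 5 s
];
do {
    $response = json_decode( file_get_contents(
        'https://www.wikidata.org/w/api.php?' . http_build_query( $params )
    ), true );
    $lagged = isset( $response['error'] ) && $response['error']['code'] === 'maxlag';
    if ( $lagged ) {
        // Back off so human editors get priority while the replicas catch up.
        sleep( 5 );
    }
} while ( $lagged );
```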
Aug 6 2018
@greg The WN31 things are done now, only 1081 seconds for mediawikiwiki and 9252 seconds for metawiki. For metawiki the rate was about the same as anomie got for testwiki, 2000 rows per second for the revision table and 600 rows per second for the archive table. At that rate, we can expect wikidatawiki to take about 91 hours and commonswiki to take about 48 hours. We can run them concurrently since they are on different DB clusters, and that way maybe get them done by the end of the week.
Back up to ~60% loss now, due to a slow drop in capacity on logstash1008 and logstash1009. And there was a similar event on August 4, which was fixed when @fgiunchedi restarted logstash. Can we have a daily restart cron job now?