In the getMasterDatabase() method posted above, I noticed that the database domain (e.g. DB/schema/prefix) is missing from getConnection(). Instead that should be:
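A minimal sketch of the intended call, assuming the method holds an ILoadBalancer and knows the target wiki's domain ID (the property names here are made up):

```php
// Pass the domain ID (DB/schema/prefix) explicitly so the connection
// is not silently bound to the local wiki's database.
return $this->loadBalancer->getConnection( DB_MASTER, [], $this->dbDomain );
```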
Fri, Oct 19
Wed, Oct 17
Fixed in master.
Tue, Oct 16
Mon, Oct 15
Sat, Oct 13
I still see 100-200 per 3-hour interval.
Fri, Oct 12
Thu, Oct 11
Looking at https://performance.wikimedia.org/xhgui/run/view?id=5bbfdc7c3f3dfaea44b5847c after a null edit on https://en.wikipedia.org/wiki/1857_in_Sweden, I see MediaWiki\Revision\RenderedRevision::getSlotParserOutputUncached being hit 4 times, even though normal pages have only 1 slot...
Wed, Oct 10
Tue, Oct 9
https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/465300/ also mentions the ID in the message.
Does this occur in master? I mostly wonder whether https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/452878/ happens to help.
Fri, Oct 5
Thu, Oct 4
Are these jobs that also try to move user subpages?
Wed, Oct 3
CAS errors on user rows might also help pinpoint some causes.
Yes and yes. I think if COMMIT takes a few seconds, then even with this UPDATE near the end of the transaction, multiple writes can still pile up if enough tabs are open or other things are locking user rows.
Tue, Oct 2
If you go that route, then something like this might work: have getFileSha1() and the sha1 stat field be null for certain containers, and have doOperations() and friends pass a flag to getFileStat()/getFileSha1() to keep the current lazy-loading behavior rather than using null.
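As a loose sketch of that flag idea (the 'requireSHA1' parameter is made up, not an existing FileBackend option):

```php
// External stat callers on such containers would just see a null sha1
// field, while doOperations() internals opt back into lazy-loading.
$stat = $backend->getFileStat( [
	'src' => $path,
	'requireSHA1' => true // hypothetical flag for internal callers
] );
```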
Is it that much space? If you add an option, you have to have getFileStat() return some dummy value for the SHA-1 and also keep that from messing up the logic in doOperations(), which is why it seemed easier to just include the header.
I don't see many of these in the logs for the last 7 days. This is likely caused by editing in parallel (multiple rollback tabs at once).
It looks like there is no way to say "Level 2 (reviewed for quality) is not allowed as a tag on pages outside namespace 0". Right now, I suppose it is just convention that reviewers only mark template revisions at level 1 (basic review). If $wgFlaggedRevsTags included a 'namespaces' field with (NS => level) as the value (defaulting to all of $wgFlaggedRevsNamespaces at the highest level, the status quo), then this could be configured.
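A hypothetical configuration shape for that (the 'namespaces' key does not exist today, and the other keys are simplified):

```php
$wgFlaggedRevsTags = [
	'accuracy' => [
		'levels' => 2,
		// Made-up field: level 2 (quality) allowed only in the main
		// namespace, while templates can only be marked at level 1.
		'namespaces' => [ NS_MAIN => 2, NS_TEMPLATE => 1 ]
	]
];
```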
Mon, Oct 1
It's used for originals. I don't think it matters much for thumbnails, but it's hard to cleanly tell that to SwiftFileBackend. It seems like it might be easiest to have Thumbor hash the local file and include the metadata in the PUT request, avoiding these errors (and the slowness of triggering a GET in order to POST the missing data).
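For reference, a sketch (in PHP, just to show the value) of the metadata SwiftFileBackend expects, assuming its usual convention of storing a base-36 SHA-1 in object metadata:

```php
// SwiftFileBackend stores the SHA-1 as base-36, zero-padded to 31
// characters, under the x-object-meta-sha1base36 object header.
$sha1Base36 = \Wikimedia\base_convert( sha1_file( $localPath ), 16, 36, 31 );
$headers = [ 'x-object-meta-sha1base36' => $sha1Base36 ];
```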
I think any callers should use '', not '*'; the latter doesn't make much sense to me. That said, we already started that pattern, so it may as well work here too.
Closing per comments on patch about MCR refactor.
Sometimes callers might want I/O (Swift/Elasticsearch/Blazegraph) near DB transactions, so even if we use setTransactionListener() (like Maintenance does) and listen for points where no transaction is active anywhere (kind of like DeferredUpdates), we'd want to be careful about waiting too long for lag or erroring out. Then again, mixed-source I/O code should generally follow the guidelines (https://www.mediawiki.org/wiki/Database_transactions#Updating_secondary_non-RDBMS_stores) and use patterns like doing the key/value writes first and committing, or using commit hooks/deferred updates. So... maybe a callback registered via setTransactionListener() could be given the affected row count, and a MergeableUpdate could be added to DeferredUpdates when the recent count across DBs is high (tracking last-time and running-count in pass-by-reference listener callback variables or something). The update could wait for replication, and would do so after any related I/O updates tied to the DB writes.
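Very roughly, the update half could look like this (hypothetical class; the threshold and timeout are illustrative):

```php
use MediaWiki\MediaWikiServices;

// Made-up MergeableUpdate: accumulates affected row counts fed in by
// the transaction listener and waits for replication once the recent
// total across DBs is high.
class ReplicationWaitUpdate implements MergeableUpdate {
	/** @var int */
	private $rowCount;

	public function __construct( $rowCount ) {
		$this->rowCount = $rowCount;
	}

	public function merge( MergeableUpdate $update ) {
		/** @var self $update */
		$this->rowCount += $update->rowCount;
	}

	public function doUpdate() {
		if ( $this->rowCount >= 100 ) { // illustrative threshold
			MediaWikiServices::getInstance()->getDBLoadBalancerFactory()
				->waitForReplication( [ 'timeout' => 3 ] );
		}
	}
}
```

Since DeferredUpdates merges MergeableUpdate instances, repeated listener firings would collapse into a single replication wait at the end of the request.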
Sep 2 2018
Aug 31 2018
perf-roots seems appropriate. If anything extra is needed, that can always be discussed in the future (probably by adding to perf-roots).
Aug 29 2018
All calls to incEditCountImmediate() currently move it to the end of the transaction. According to logstash << +channel:DBPerformance +"user_editcount=user_editcount+N" +"sub-optimal" >>, it seems to usually be very fast, though I see occasional entries a little over 1 second. I suppose in that case, a fast enough edit rate by a single user could make a pile-up. I wonder if the delay comes from COMMIT itself?
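The pattern in question looks roughly like this (a sketch, not the actual incEditCountImmediate() code):

```php
// Defer the counter bump to just before COMMIT so the user row lock is
// held briefly; a slow COMMIT itself would still extend that window.
$dbw->onTransactionPreCommitOrIdle( function () use ( $dbw, $userId ) {
	$dbw->update(
		'user',
		[ 'user_editcount=user_editcount+1' ],
		[ 'user_id' => $userId ],
		__METHOD__
	);
} );
```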
Aug 28 2018
https://en.wikipedia.org/wiki/User:Sam_Sailor/CSD_log seems to be an offending page (many links, possible parallel updates).
Aside from using a narrower exception type and catching it, it's probably even easier to make acquirePageLock() return a boolean and log the error to a channel (possibly at INFO level). The page_id should be extra logstash metadata, to make grouping easier. I suspect certain pages (like Commonist gallery subpages or such) are more likely to be offenders than others.
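A sketch of that shape, based on the current implementation (the channel name is made up; details approximate):

```php
use MediaWiki\Logger\LoggerFactory;

// Return false instead of throwing, and log page_id as structured
// metadata so logstash can group by the offending page.
public static function acquirePageLock( IDatabase $dbw, $pageId, $why = 'atomicity' ) {
	$key = "LinksUpdate:{$why}:pageid:{$pageId}";
	$scopedLock = $dbw->getScopedLockAndFlush( $key, __METHOD__, 15 );
	if ( !$scopedLock ) {
		LoggerFactory::getInstance( 'SecondaryDataUpdates' )->info(
			"Could not acquire lock '{key}' for page ID {page_id}",
			[ 'key' => $key, 'page_id' => $pageId ]
		);
		return false;
	}
	return $scopedLock;
}
```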
Aug 27 2018
I can't seem to reproduce this slowness (using mwdebug1002).
Aug 23 2018
I don't recall. It's been long enough that it's worth testing how queries run without it.
Aug 21 2018
Aug 18 2018
Aug 14 2018
Aug 13 2018
Where are the Jenkins jobs defined?
Aug 10 2018
Can this task be closed?
Aug 8 2018
Something like that approach seems worth trying.
Is it possible to just update pagetriage_page_tags on page saves (and other relevant POST requests), when there are already master connections? Anything that depends on data updated via the job queue (like backlinks) would have to be attached to such LinksUpdates (which already run in POSTs/jobs). Why do things have to be updated on page views?
Aug 4 2018
I noticed that regular memcached counts ADD the same way it does SET (as cmd_set), for both the STORED and NOT_STORED cases; there is no cmd_add. However, mcrouter does seem to expose a cmd_add counter. Perhaps there could be an mcrouter dashboard similar to the Memcache one in Grafana?
Aug 3 2018
Are there any tasks here that remain and are blockers to multi-DC?
Aug 2 2018
Aug 1 2018
Jul 30 2018
Regression from fb51330084b4bde1880c76589e55e7cd87ed0c6d, I assume.
Jul 27 2018
Jul 26 2018
From https://logstash.wikimedia.org/goto/0b9191830a12ab3d15bce062cdb36a93, this seems to be better. But we should wait longer.
At a glance, it looks like XtraDB Cluster is built on Galera (which is itself something to consider in the future). Use of GET_LOCK is tricky there, since it would have to use wsrep or have such queries directed to a dedicated master (perhaps with some HA in front that doesn't split-brain).
Ah, right, I read that ternary backwards, <<$maxTime < PHP_INT_MAX ? PHP_INT_MAX : 1>>.
Jul 19 2018
Normally, it would be odd to let jobs pile up but not execute them, though the multi-DC use case of $wgReadOnly in one of the DCs wasn't considered in T130795. Ideally, jobs enqueued on GET/HEAD wouldn't be a thing...but that's not going away anytime soon.
Jul 17 2018
My first inclination is to try to reduce the refreshCounts() calls.
Jul 16 2018
Jul 11 2018
This was fixed by the 61a7e1acd0af4a5386df03335733accfde179fa1 backport.
Fixed with the 61a7e1acd0af4a5386df03335733accfde179fa1 backport.
Given how low server_failure_limit is, it might help to lower server_retry_timeout from 30s to something < 5s. Consistent hash ejections seem like the most obvious thing that could cause an acknowledged write to be seen as not being there for any of the next 5 seconds.
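Assuming this refers to nutcracker's (twemproxy's) pool settings, the change would look roughly like this (values illustrative, not tested):

```
pool:
  distribution: ketama
  auto_eject_hosts: true
  server_failure_limit: 3       # ejection already triggers quickly
  server_retry_timeout: 4000    # ms; down from 30000 (30s) to < 5s
```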