Sat, Dec 8
Sun, Dec 2
To clarify, the $useMutex logic in WAN cache never triggers due to minAsOf=INF, resulting in stampedes when someone invalidates the cache. Instead, invalidation should be treated like a regular TTL expiration, with one thread at a time doing the regeneration.
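For reference, a minimal sketch of one-at-a-time regeneration via getWithSetCallback()'s 'lockTSE' option (the key name, TTL, and callback here are illustrative):

```php
// Illustrative: with 'lockTSE', one thread regenerates after expiry or
// invalidation while other threads briefly reuse the stale value
// instead of stampeding.
$value = $cache->getWithSetCallback(
	$cache->makeKey( 'example-entity', $id ), // hypothetical key
	$cache::TTL_HOUR,
	function ( $oldValue, &$ttl, &$setOpts ) use ( $id ) {
		return regenerateExampleEntity( $id ); // hypothetical regeneration
	},
	[ 'lockTSE' => 30 ] // seconds the stale value may cover regeneration
);
```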
Fri, Nov 30
Mon, Nov 26
Thu, Nov 22
@Gilles: Comcast only has cable infrastructure in terms of what the ISP provides itself. Customers with cable can also get XFinity Mobile (https://www.tomsguide.com/us/xfinity-mobile-faq,news-25223.html). That's basically just a bunch of Wi-Fi hotspots built on top of Verizon's network. I don't know how many people are using that, and it seems new-ish. Also, the latency figures are quite low, which makes me doubt that it is XFinity Mobile; it is more likely regular wireless/xfinity.
It looks sane, though I wonder why Comcast usage is so high for mobile. Is that mostly from touchpad devices rather than smartphones?
Tue, Nov 20
It definitely seems like something worth doing. The potential for high-use cache keys to become unusable for undefined periods of time is too much of a stability concern.
Mon, Nov 19
Wed, Nov 14
Since CategoryMembershipChangeJob runs via the job queue, wouldn't that have little effect on save timing itself? I guess it wouldn't hurt to optimize.
Nov 9 2018
Nov 8 2018
wl_notificationtimestamp is not meant to store the time the article was watched, but rather the last revision the user saw on the page (NULL if they saw the latest revision). This would require a new column. Ideally, if watchlist sizes were limited, this wouldn't need an index, but they are not.
Nov 7 2018
Keys are set by add/cas normally, so it seems like some key that takes a long time to regenerate might have expired (there are two data points at the elevated value over more than just a few seconds) or a class of many keys expired. The other possibility is some sudden change in access patterns for keys, which seems less likely, especially the more periodic this is.
Nov 6 2018
Nov 5 2018
Fixed in bf30fcb71427d673f7c83a067b3241040d3470b6. Rollback is used instead and uses $ignoreErrors so as not to trigger the exception in reportQueryError().
Cleaned up in 633eb437a3b808518469c6eaf4e86a436941d837
Nov 2 2018
Nov 1 2018
openConnection() is badly named and still reuses connections. You'd probably want getConnection() with CONN_TRX_AUTOCOMMIT.
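A minimal sketch of the suggested call, assuming a Wikimedia\Rdbms\ILoadBalancer instance (the $lb and $domain variables are illustrative):

```php
use Wikimedia\Rdbms\ILoadBalancer;

// An autocommit-mode handle: not enrolled in the current transaction
// round, and not shared with handles that have pending writes.
$dbw = $lb->getConnection(
	DB_MASTER,
	[], // query groups
	$domain, // database domain, e.g. 'wikidb' (illustrative)
	ILoadBalancer::CONN_TRX_AUTOCOMMIT
);
```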
Oct 29 2018
What about our use of register_postsend_function? Is there anything equivalent?
Oct 28 2018
Oct 27 2018
Closing, per "The Error Occurs if the memcache is too slow".
This will be better with a3d6c1411dad3e057b if there are many message pages that exist for extension use.
4b1db1190bb8f2a115c6a81a5ee487b7d18cd303 seems more likely.
Note that git master (19dd28798163) installs fine with postgres, which has the same DB domain patches as 1.32.
Oct 26 2018
It looks like the errors come from some tool (JS?) that fires a bunch of API requests from a Special:Search tab to edit numerous pages in parallel. Each burst is always for a certain user ID with a single referrer URL.
Does this really need to call commitAndWaitForReplication() when there is only one batch? Is it ever called thousands of times in a row?
Oct 25 2018
Oct 24 2018
Oct 22 2018
In the getMasterDatabase() method posted above, I noticed that the database domain (e.g. DB/schema/prefix) is missing from getConnection(). Instead that should be:
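A minimal sketch of the corrected call, assuming the target domain is available as $this->domain:

```php
// Pass the database domain explicitly so the load balancer does not
// return a connection pinned to the local wiki's domain.
return $this->loadBalancer->getConnection( DB_MASTER, [], $this->domain );
```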
Oct 19 2018
Oct 17 2018
Fixed in master.
Oct 16 2018
Oct 15 2018
Oct 13 2018
I still see 100-200 per 3 hour interval.
Oct 12 2018
Oct 11 2018
Looking at https://performance.wikimedia.org/xhgui/run/view?id=5bbfdc7c3f3dfaea44b5847c after a null edit on https://en.wikipedia.org/wiki/1857_in_Sweden I see MediaWiki\Revision\RenderedRevision::getSlotParserOutputUncached being hit 4 times even though normal pages have only 1 slot...
Oct 10 2018
Oct 9 2018
https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/465300/ also mentions the ID in the message.
Does this occur in master? I rather wonder whether https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/452878/ happens to help.
Oct 5 2018
Oct 4 2018
Are these jobs that try to also move user subpages?
Oct 3 2018
CAS errors on user might also help pinpoint some causes.
Yes and yes. I think if COMMIT takes a few seconds, then even with this UPDATE near the transaction end, multiple writes can still pile up if enough tabs are opened or other things locking user rows are going on.
Oct 2 2018
If you go that route, then something like this might work: have getFileSha1() and the sha1 stat field be null for certain containers, plus have doOperations() and friends pass a flag to getFileStat()/getFileSha1() that keeps the current behavior of lazy-loading rather than returning null.
Is it that much space? If you add an option, you have to have getFileStat return some dummy value for the SHA1 and also not have that mess up the logic in doOperations(), which is why it seemed easier to just include the header.
I don't see many of these in the logs for the last 7 days. This is likely caused by editing in parallel (multiple rollback tabs at once).
It looks like there is no way to say "Level 2 (reviewed for quality) is not allowed as a tag on pages outside namespace 0". Right now, I suppose it is just convention that reviewers only mark template revisions as level 1 (basic review). If $wgFlaggedRevsTags included a 'namespaces' field with (NS => level) as the value (defaulting to all of $wgFlaggedRevsNamespaces at the highest level, the status quo), then this could be configured.
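A hypothetical configuration sketch of that idea; the 'namespaces' field does not currently exist, and the tag name and levels are illustrative:

```php
// Hypothetical: cap the allowed level per namespace for a tag.
$wgFlaggedRevsTags = [
	'accuracy' => [
		'levels' => 2,
		// Proposed field: (namespace => highest allowed level). Namespaces
		// not listed would default to all of $wgFlaggedRevsNamespaces at
		// the highest level (the status quo).
		'namespaces' => [
			NS_MAIN => 2,     // "reviewed for quality" allowed
			NS_TEMPLATE => 1, // only "basic review" allowed
		],
	],
];
```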
Oct 1 2018
It's used for originals. I don't think it matters much for thumbnails, but it's hard to cleanly tell that to SwiftFileBackend. It seems like it might be easiest to have thumbor hash the local file and send the metadata in the PUT request, avoiding both these errors and the slowness of triggering a GET in order to POST the missing data.
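A sketch of the metadata in question, assuming the x-object-meta-sha1base36 header that SwiftFileBackend reads; the hash computation follows MediaWiki's base-36 SHA-1 convention, and the upload plumbing is omitted:

```php
// Illustrative: compute the SHA-1 in MediaWiki's base-36 form and attach
// it as Swift object metadata on the initial PUT, so later stat calls do
// not need a GET followed by a POST to backfill it.
$sha1Base36 = \Wikimedia\base_convert( sha1_file( $localPath ), 16, 36, 31 );
$headers = [ 'x-object-meta-sha1base36' => $sha1Base36 ];
// ...attach $headers to the PUT request for the object...
```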
I think any callers should use '', not '*'; the latter doesn't make much sense to me. That said, we already started the pattern, so it may as well work here too.
Closing per comments on patch about MCR refactor.
Sometimes callers might want I/O (swift/elastic/blazegraph) near DB transactions, so even if we use setTransactionListener() (like Maintenance does) and listen for points where no transaction is active anywhere (kind of like DeferredUpdates), we'd want to be careful about waiting too long for lag or erroring out. Then again, code mixing I/O sources should generally follow the guidelines (https://www.mediawiki.org/wiki/Database_transactions#Updating_secondary_non-RDBMS_stores) and use patterns like doing the key/value writes first and committing, or using commit hooks/deferred updates. So... maybe a callback could listen via setTransactionListener(), be given the affected row count, and add a deferred MergeableUpdate to DeferredUpdates when the recent count across DBs is high (using pass-by-reference listener callback variables for last-time and running-count, or something). The update could wait for replication, and would do so after any related I/O updates tied to those DB writes.
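A very rough sketch of that last idea, with illustrative names and thresholds; the counting scheme is hypothetical, and a real version would use the affected row count rather than a commit tally:

```php
use Wikimedia\Rdbms\IDatabase;

// Hypothetical sketch: tally recent commits and, past a threshold,
// queue a deferred update that waits for replication.
$lastTime = 0;
$runningCount = 0;
$dbw->setTransactionListener(
	'replication-backoff', // listener name (illustrative)
	function ( $trigger ) use ( &$lastTime, &$runningCount, $lbFactory ) {
		if ( $trigger !== IDatabase::TRIGGER_COMMIT ) {
			return;
		}
		$runningCount = ( time() - $lastTime ) < 60 ? $runningCount + 1 : 1;
		$lastTime = time();
		if ( $runningCount >= 100 ) { // threshold is illustrative
			DeferredUpdates::addCallableUpdate( static function () use ( $lbFactory ) {
				$lbFactory->waitForReplication();
			} );
		}
	}
);
```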
Sep 2 2018
Aug 31 2018
perf-roots seems appropriate. If anything extra is needed, that can always be discussed in the future (probably by adding to perf-roots).
Aug 29 2018
All calls to incEditCountImmediate currently move it to the end of the transaction. According to logstash << +channel:DBPerformance +"user_editcount=user_editcount+N" +"sub-optimal" >>, it seems to usually be very fast, though I see occasional entries a little over 1 second. I suppose in that case, a fast enough edit rate by a single user could cause a pile-up. I wonder if the delay comes from COMMIT itself?
Aug 28 2018
https://en.wikipedia.org/wiki/User:Sam_Sailor/CSD_log seems to be an offending page (many links, possible parallel updates).
Aside from using a narrower exception type and catching it, it's probably even easier to make acquirePageLock() return a boolean and log the error to a channel (possibly at INFO level). The page_id should be extra logstash metadata, to make grouping easier. I suspect certain pages (like Commonist gallery subpages or such) are more likely to be offenders than others.
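A minimal sketch of the boolean variant, assuming the method currently throws on lock timeout; the lock key format and logging channel name are illustrative:

```php
use MediaWiki\Logger\LoggerFactory;
use Wikimedia\Rdbms\IDatabase;

// Hypothetical variant: log and return false instead of throwing.
public static function acquirePageLock( IDatabase $dbw, $pageId, $why = 'atomicity' ) {
	$key = "LinksUpdate:$why:pageid:$pageId"; // lock key format is illustrative
	if ( !$dbw->lock( $key, __METHOD__, 15 ) ) {
		LoggerFactory::getInstance( 'SecondaryDataUpdate' )->info(
			'Could not acquire lock for page ID {page_id}',
			[ 'page_id' => $pageId ] // structured metadata for logstash grouping
		);
		return false;
	}
	return true;
}
```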
Aug 27 2018
I can't seem to reproduce this slowness (using mwdebug1002).
Aug 23 2018
I don't recall. It's been long enough that it's worth testing how queries run without it.