Regarding RedisLockManager: it only needs 2 of the 3 hosts to be reachable. If one of them is depooled or refuses connections, no one should notice any disruption. For servers that are otherwise unreachable, there is a 2-second timeout (and that Redis server will be avoided for the rest of the request).
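For reference, a minimal sketch of the kind of $wgLockManagers entry this describes; the host names, bucket layout, and option values below are illustrative placeholders, not the actual production configuration:

$wgLockManagers[] = [
	'name' => 'redisLockManager',
	'class' => 'RedisLockManager',
	// Three lock servers; only a quorum (2 of 3) needs to be reachable
	'lockServers' => [
		'rdb1' => 'redis1001.example.org',
		'rdb2' => 'redis1002.example.org',
		'rdb3' => 'redis1003.example.org'
	],
	'srvsByBucket' => [ 0 => [ 'rdb1', 'rdb2', 'rdb3' ] ],
	'redisConfig' => [
		'connectTimeout' => 2 // unreachable servers time out after ~2s
	]
];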
Mon, Oct 26
Added cache hit latency panel
It would be good to get cleanup patches like https://gerrit.wikimedia.org/r/c/mediawiki/core/+/596082 merged first.
Wed, Oct 21
Tue, Oct 20
Fri, Oct 16
Tue, Oct 13
Tue, Oct 6
Mon, Oct 5
The default timeout for waitForReplication() on web requests is 1 second. JobRunner passes 3 seconds to its calls and also calls setDefaultReplicationWaitTimeout( 3 ). Maybe TranslateRenderJob should use a different timeout?
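A minimal sketch of the pattern in question (the 3-second value matches what JobRunner uses; whether TranslateRenderJob should pass something different is the open question):

$lbFactory = \MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
// JobRunner raises the default wait from 1s to 3s for the whole run...
$lbFactory->setDefaultReplicationWaitTimeout( 3 );
// ...and also passes an explicit timeout to its waitForReplication() calls
$lbFactory->waitForReplication( [ 'timeout' => 3 ] );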
Thu, Oct 1
This only happens with paratest, so I suspect that some tests accidentally depend on each other having run.
Wed, Sep 30
Are the jobs failing near the end and getting retried? Also, job B can still be enqueued if a duplicate job A is marked as running. That has long been the intended semantics (and the Kafka queue does this too, last I checked). If job A fails, it seems like the redis job queue (https://github.com/wikimedia/mediawiki-services-jobrunner/blob/master/redisJobChronService) would then allow duplicate job C, though JobQueueDB would not. Not sure about the Kafka queue in that situation. The behavior there should be standardized in any case.
The Grafana dashboards will still need updating.
Sep 28 2020
The problem with LocalClusterCache is that there can be cycles or odd behavior for early calls. As long as the wiring works, I'd love to have such a method.
Sep 17 2020
Declined, assuming that the DB main stash has enough GB of space.
It's not meaningfully used, though it is NOT NULL, so the code still references it during insertion (using -1).
Sep 14 2020
By "showed up as lagged", do you mean that there were mosly logstash messages with wiki:enwiki or did you see this in some graph of MW statsd data or such? I could probably tweak the "all replica DBs lagged" entries to have some extra fields.
Sep 10 2020
Aug 12 2020
Aug 10 2020
Aug 6 2020
Aug 5 2020
Jul 29 2020
Yeah, technically, all sorts of anomalies are possible, so callers should always (a) avoid DB updates based on cache data, (b) choose reasonable, possibly adaptive, TTLs for the given use case (e.g. would the world suffer if cache was wrong for X days/hours/minutes?), (c) be cognizant of CDN/ObjectCache TTL interactions (race conditions already dictate this awareness anyway, even without lost updates).
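As a concrete illustration of (b), WANObjectCache::getWithSetCallback() lets the callback pick a TTL per value, e.g. via adaptiveTTL(); the key name and table lookup below are hypothetical, just a sketch of the pattern:

$cache = \MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();
$row = $cache->getWithSetCallback(
	$cache->makeKey( 'example-entity', $id ), // hypothetical key
	$cache::TTL_DAY, // the most staleness this use case could tolerate
	function ( $oldValue, &$ttl, &$setOpts ) use ( $cache, $id ) {
		$dbr = wfGetDB( DB_REPLICA );
		// Track replica lag/snapshot state so stale data is not cached for long
		$setOpts += \Wikimedia\Rdbms\Database::getCacheSetOptions( $dbr );
		$row = $dbr->selectRow( 'page', '*', [ 'page_id' => $id ], __METHOD__ );
		if ( !$row ) {
			$ttl = $cache::TTL_MINUTE; // cache misses only briefly
			return null;
		}
		// Recently-touched rows get a shorter TTL than stable ones
		$ttl = $cache->adaptiveTTL( wfTimestamp( TS_UNIX, $row->page_touched ), $cache::TTL_DAY );
		return $row;
	}
);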
Jul 23 2020
Twemproxy uses libketama-style consistent hashing, and, AFAIK, CentralAuth sessions can be regenerated (notwithstanding one-off CSRF token failures and such) from the presence of the long-term centralauth_Token cookie. That would at least prevent logouts.
Jul 22 2020
In my KeyValueStore refactor patches, a consume() method was added, since implementing it via TTL changes is not, in general, a convenient way to consume a key in a best-effort atomic manner.
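For contrast, a sketch of the kind of workaround being referred to (illustrative only, not the actual patch; $store is any BagOStuff). A "consume" built from separate primitives is racy, since two requests can both read the value before either one expires or deletes the key:

$value = $store->get( $key );
if ( $value !== false ) {
	$store->delete( $key ); // or expire it via a TTL change
}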
Jul 17 2020
We don't really need purges to go to the gutter cache, given the low TTL there.
Jul 16 2020
For the foreseeable future, this entry point will just be treated by the CDN the same way POST is with respect to multi-DC.
BannerMessageGroup::registerGroupHook is triggering master queries (a) on HTTP GET requests and (b) inside a getWithSetCallback() call, which should use DB_REPLICA data.
Jul 8 2020
Jul 6 2020
Jul 2 2020
Jun 30 2020
Jun 24 2020
It is still useful to have a gauge of connectivity, in some cases, before attempting to use DB handles. That makes me lean towards keeping both out of convenience.
Some more info:
aaron@mwmaint1002:~$ mwscript eval.php --wiki=testwiki
> echo strlen(json_encode(['paths'=>'ref:af18298daea159b6ca5283c0c1aa45e7155e4412', 'asOf' => time()]));
74
Jun 22 2020
Jun 16 2020
Jun 15 2020
Jun 10 2020
Jun 4 2020
I suspect that the keys that cause trouble are big text/JSON blobs and ParserOutput objects, none of which directly get purged. READ_VERIFIED is already used by MultiWriteBagOStuff when deciding whether it is safe to do cache-aside backfill into lower cache tiers. This could probably be integrated into mcrouter by having such calls use a route prefix that maps to a warmup route. A similar flag could be added/used by WANObjectCache for other blob keys that don't receive delete()/touchCheckKey() calls. This would add a lot of I/O resistance without making hard-to-reason-about changes to cache invalidation.
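A minimal sketch of how that flag reads today, assuming a MultiWriteBagOStuff-style tiered setup (the key variable is a placeholder):

// READ_VERIFIED signals that the returned value is authoritative enough to
// backfill into lower/colder tiers; the idea above is to extend the same
// semantics to mcrouter by routing such gets through a warmup route prefix.
$blob = $cache->get( $blobKey, BagOStuff::READ_VERIFIED );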
Jun 3 2020
Unfortunately, I'm not seeing any xenon .log files going back that far. In terms of backend contribution, I don't see anything to do here.
Jun 2 2020
From the perspective of popular/major articles, which are likely to have infoboxes, the extra 42.1 KB for loading the "app" JS doesn't seem crazy. I've looked through the code several times and it seems reasonable. Testing with fast/slow 3G doesn't reveal obnoxious reflows or delays either. Having the edit link go directly to a Q<X> page when the JS hasn't fully loaded felt somewhat jarring, though I don't imagine that happening often. I don't see much editing at all given how discreet the icon is (a good thing).
May 30 2020
I've been looking at this from time to time, and haven't found any real problems yet. Some of the things I'm looking out for are:
May 26 2020
Per-function flame graphs are tracked in T253679.
May 23 2020
Added two panels to the regeneration row.
May 21 2020
I'm not fond of the idea of not sending purges for indirect edits, nor of using RefreshLinksJob instead of HtmlCacheUpdateJob (too slow IMO).
May 11 2020
May 8 2020
May 4 2020
Apr 22 2020
I agree that jobs are not always "events" or "event handler hooks" (what about jobs that subdivide, or that are added by maintenance scripts, and so on?). There is probably a lot of stuff that can indeed be moved in that direction though.
Apr 20 2020
Apr 18 2020
Apr 17 2020
Apr 16 2020
Apr 15 2020
Apr 14 2020
Apr 11 2020
I think the Wikibase improvement simply avoided a bunch of repetitive duplicate queries, which technically would lower the hit rate.
Apr 10 2020
Apr 9 2020
Apr 7 2020
Hmm, it would perhaps help if cacheGetTree()/cacheSetTree() were replaced by getWithSetCallback(). Lots of optimizations are not used atm due to that fact.
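Roughly what that refactor could look like (a sketch only; the key name, buildTree(), and the check key are hypothetical): folding the separate get/set steps into one getWithSetCallback() call so that options like 'lockTSE' and 'checkKeys' can apply:

$tree = $cache->getWithSetCallback(
	$cache->makeKey( 'some-tree', $rootId ), // hypothetical key
	$cache::TTL_DAY,
	function ( $oldValue, &$ttl, &$setOpts ) use ( $rootId ) {
		return $this->buildTree( $rootId ); // hypothetical regeneration method
	},
	[
		'lockTSE' => 30, // one runner regenerates while others serve stale data
		// Invalidated via touchCheckKey() instead of explicit delete()
		'checkKeys' => [ $cache->makeKey( 'some-tree-epoch', $rootId ) ]
	]
);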
I wonder if it would be useful for the template name to appear in the key when possible. Right now it's just an opaque hash. I doubt that many invocations of different templates have the same top-level text, so I don't see it adding much fragmentation. Maybe some statsd logging could be done instead though.
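If statsd logging were done instead, it could be something in the spirit of the following (the metric name is made up for illustration):

$stats = \MediaWiki\MediaWikiServices::getInstance()->getStatsdDataFactory();
$stats->increment( 'preprocessor_cache.frame.hit' ); // or '.miss', as appropriate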
Apr 3 2020
It stores the serialized naive "top frame" (e.g. headings, paragraphs, template invocation parameters) of the wikitext of pages, as well as the "sub-frames" from template invocations, upon recursive expansion. This all happens on page parse. Note that these keys do not need purges. If template X is invoked the same way on multiple pages, then parses of those pages will reuse a common sub-frame cache key for those template invocations, and likewise for templates invoked from that template. So, I suppose a popular template invoked with a low enough cardinality of parameter/context bundles would trigger traffic spikes upon invalidation. The traffic would come from either (a) refreshLinks jobs or (b) page views to backlink pages that got purged via htmlCacheUpdate jobs.
Mar 30 2020
The timing of this SAL entry suggests some relation:
Mar 29 2020
Mar 25 2020
$titles = [ 'COVID-19', 'Pandemia di COVID-19 del 2020 in Italia', 'Pandemia di COVID-19 del 2019-2020', 'Influenza spagnola', 'Pandemia di COVID-19 del 2019-2020 nel mondo' ];