Thu, Oct 10
Indeed, the logging is based on the *whole* raw, unfiltered position...I should add a logstash key for the filtered one too.
Tue, Oct 8
@jcrespo @Marostegui What do you think of the idea of having another cluster of mysql servers set up just like the parser cache ones? That would be nice from an HA perspective, and it would avoid adding extra load to any existing DB cluster (e.g. the objectcache table of metawiki, or extension1). Traffic would be modest given that it would start out being used for WikimediaEvents, LoginNotify, and perhaps AbuseFilter stats too (see https://docs.google.com/document/d/1tX8ekiYb3xYgpNJsmA1SiKqzkWc0F-_E4SGx6BI72vA/edit#heading=h.bdt9mhl3o7k5).
Mon, Sep 30
Not seeing this in the logs anymore.
Wed, Sep 18
Seems like some kind of merge conflict.
Sep 10 2019
Odd, the constant seems to be there.
Sep 9 2019
So, getting this test merged depends on redoing the Wikibase schema hook application order for update.php. In CI, there seems to be a problem when it interacts with Flow hooks trying to create pages.
Sep 5 2019
Should be fixed now.
Aug 30 2019
It looks like WebStart.php sets ignore_user_abort() for POSTs, and the major entry points set wfTransactionalTimeLimit() for POSTs. In the case of module_deps updates for load.php, that happens on GET.
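For illustration, a minimal sketch of that kind of guard (assumed shape, not the actual WebStart.php code):

```php
// Keep PHP running on client disconnect for write requests, so a POST
// isn't killed mid-transaction; GET paths like load.php's module_deps
// writes do not get this guard.
if ( $_SERVER['REQUEST_METHOD'] === 'POST' ) {
	ignore_user_abort( true );
}
```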
Aug 29 2019
Client disconnects (HTTP 499) are interesting...before the ignore_user_abort() in doPostOutputShutdown(), I suppose it's possible to end up with stuff like this (and long has been possible). https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/519741/ would help this particular case by avoiding DB writes.
I wonder if some entry point lacks proper shutdown.
Aug 28 2019
What is the value of apc.enable_cli? I don't seem to have that problem.
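For reference, a quick way to check (run via `php -r` or a maintenance script); APCu only works on the CLI when this is On:

```php
// Prints e.g. string(1) "1" when enabled, string(1) "0" when not.
var_dump( ini_get( 'apc.enable_cli' ) );
```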
I do worry about the risk of data loss if swiftrepl is also deleting files based on container list differences.
Aug 26 2019
I'd love to have a simplified version of WebRequest as a service. One that would be useful for dealing with the issue that https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/532367/ is about. Optimization hacks like https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/526801/ could be avoided too. It could be injected with pathinfo/cookie settings, but would not deal with complex encoding stuff that uses $wgContLang and so on.
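Something like this hypothetical sketch (the interface name and methods are illustrative, not an existing MediaWiki API):

```php
// A slimmed-down, injectable request service: constructed from explicit
// pathinfo/cookie settings, with no $wgContLang-dependent normalization.
interface SimpleRequest {
	public function getPathInfo(): string;
	public function getCookie( string $name ): ?string;
	/** Raw value only; callers handle any complex encoding themselves. */
	public function getRawVal( string $name ): ?string;
}
```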
Aug 23 2019
Still, only a file was uploaded, and no other operations were done...I'm not sure why the DB would commit if the file store failed in one of the FileBackendMultiWrite backends and 'replication' is 'sync'...
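For context, an illustrative FileBackendMultiWrite configuration fragment for the 'sync' case (backend names and details here are placeholders):

```php
$wgFileBackends[] = [
	'name' => 'shared-multiwrite', // hypothetical name
	'class' => FileBackendMultiWrite::class,
	'backends' => [
		[ /* master Swift backend config */ 'isMultiMaster' => true ],
		[ /* mirror Swift backend config */ ],
	],
	// 'sync': writes must succeed on every backend before the operation
	// reports success, which is why a commit after a failed store is odd.
	'replication' => 'sync',
];
```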
Isn't there a swiftrepl background process to fix this?
Aug 22 2019
Note that CdnCacheUpdate queues a purge to happen X seconds later to help deal with lag (mediawiki-config has $wgCdnReboundPurgeDelay at 11). If lag gets near that amount, then $wgCdnMaxageLagged will kick in.
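Roughly, in configuration terms (the 11s value is the mediawiki-config one mentioned above; the $wgCdnMaxageLagged value here is just an example):

```php
$wgCdnReboundPurgeDelay = 11; // second "rebound" purge sent ~11s later
$wgCdnMaxageLagged = 30;      // lowered CDN s-maxage once replica lag is high
```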
Aug 21 2019
Seems to be resolved, likely by vary-revision refactoring from T226785.
Aug 17 2019
I don't think so.
Aug 15 2019
Does this still occur?
Aug 12 2019
Per my comment above, this is the expected behavior.
It's an optional table, not installed by update.php.
Aug 9 2019
They were obsoleted by flaggedrevs_statistics.
Aug 8 2019
The remaining vary-revision instances are basic self-transclusions (https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/526157/ should handle those).
Jul 31 2019
Is https://phabricator.wikimedia.org/T212881#5195101 the error that still happens or is it the read-only one too?
Jobs are fine...though this case is complicated, since people want their "latest views" to be immediately reflected...so it would have to do something like what WatchedItemStore does.
How much of this is distinct from T205936?
Jul 23 2019
I wonder if this is fixed in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/519565/
The logs for doSelectDomain() look quiet for the last 7 days.
959daa2ca44c039e72c8a9a5199d4c74dd05caba added the << $status->value = [ 'warnings' => $upload->checkWarnings() ]; >> line. It seems like checkWarnings() can potentially have all kinds of File objects inside of it. Given that, some callback could easily slip in.
Jul 22 2019
ObjectCache has always described getMainStashInstance() as "ephemeral global storage". It was just supposed to *try harder* than memcached to be persistent (RDB snapshots, the expectation that stuff will *probably* still be there a week or so later). That redis evictions, and consistent re-hashing on host failure, can make data disappear or go stale was well known at the time redis was picked as the original "stash".
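In other words, the contract is roughly this sketch (illustrative usage, with $key/$value as stand-ins):

```php
$stash = ObjectCache::getMainStashInstance();
$stash->set( $key, $value, $stash::TTL_WEEK ); // *probably* still there in a week
$value = $stash->get( $key );
if ( $value === false ) {
	// Eviction or re-hash loss is an accepted failure mode; callers
	// must tolerate the miss rather than assume durability.
}
```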
Jul 19 2019
JobQueueException should be thrown from push(), with nothing catching it other than MWExceptionHandler or site-specific callers. Things like RenameUser *depend* on knowing whether something was enqueued or not in order to function correctly. Typically, push() should be used pre-send, before preOutputCommit, so everything would just roll back anyway. Jobs pushed after that point are enqueued during DeferrableUpdates (directly, or indirectly via lazyPush()); in that case, DeferredUpdates should (already) catch any exceptions (not just job queue ones) and roll back on an update-by-update basis. The exceptions are logged in the DeferredUpdates channel (previously the Exception channel).
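A sketch of the two paths described above (illustrative, with $job/$otherJob as stand-ins for job instances):

```php
// Pre-send: push directly; a JobQueueException propagates, so callers
// like RenameUser can tell that the job was not actually enqueued.
JobQueueGroup::singleton()->push( $job );

// Post-send: lazyPush() defers the enqueue into a DeferrableUpdate,
// where DeferredUpdates catches any exception and rolls back that
// update by itself.
JobQueueGroup::singleton()->lazyPush( $otherJob );
```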
Also, the timeout exceptions themselves were redis, not LBFactory. The latter seemed to just have errors related to the improper shutdown.
Jul 18 2019
Dropping the field doesn't make sense, but dropping the whole table does. We do not use that class in production (and it is optional within MW core).
The redis bug is at T228303.
The timeouts correspond with the redis problems:
The timeout aspect seems strange. The huge "idle" time increase at https://grafana.wikimedia.org/d/000000273/mysql looks like PageEditStash::parseAndCache() has an infinite lock timeout instead of 0 seconds (a bug; it should be 0, as in non-blocking), and the parsing may have been slowed down for some reason, making more threads wait on the lock. Maybe the concurrent nutcracker issues were also affecting mcrouter (since the same hosts are used). It could also be something adding memcached write load: https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&from=1563458818482&to=1563464680644 looks a little unusual, though not unlike the result of key version changes that happen from release to release (including the slow return to a normal set() rate).
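For clarity, the non-blocking intent would look something like this sketch (not the actual PageEditStash code; $cache/$key are stand-ins):

```php
// A 0-second timeout means "fail immediately if another thread holds
// the lock" instead of blocking; an infinite timeout makes every
// contending thread pile up waiting, matching the "idle" increase.
$gotLock = $cache->lock( $key, 0 /* timeout */, 30 /* lock expiry */ );
if ( $gotLock ) {
	try {
		// ... parse and cache the stashed edit ...
	} finally {
		$cache->unlock( $key );
	}
}
```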
OK, replication for SET/DELETE seems fine on mw1261/mw2224 for me, and the STORED/NOT_STORED and FOUND/NOT_FOUND replies are what I expect when using no prefix, /otherdc/mw-wan, and /thisdc/mw-wan.
Err, more PEBCAK. I put the * in the wrong spot...
So, I've noticed that on mw1261/mw2224, as *well* as on plain old mwmaint1002/mwmaint2001, broadcasting keys doesn't seem to work, e.g.: