Fri, Jun 15
Fixed in daf0514345f03189187606ba2323794588c79dc9.
Thu, Jun 14
For web requests, the lock timeout should be 5 min:
Wed, Jun 13
That said, from mc1019, I see:
I suggest that cookie set/receive round-tripping should be tested for encoding/truncation issues with @ or # for these apps, as well as for letter-case changes and the like. The above patch simply discards cookie headers for cpPosIndex that are botched.
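A round-trip check along those lines could be sketched roughly as below. The value shape (`index@timestamp#clientId`), the lowercase-hex client id, and the helper names are my assumptions for illustration, not the actual MediaWiki code; Python is used only to keep it short.

```python
import re
from http.cookies import SimpleCookie

# Assumed shape of a cpPosIndex value: "<index>@<unix time>#<lowercase hex id>".
CP_VALUE = re.compile(r"(\d+)@(\d+)#([0-9a-f]+)")

def parse_cp_pos_index(raw):
    """Return (index, timestamp, client_id), or None if the value is botched."""
    m = CP_VALUE.fullmatch(raw or "")
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), m.group(3)

def round_trip(value):
    """Serialize the cookie to a header string and parse it back again."""
    out = SimpleCookie()
    out["cpPosIndex"] = value
    header = out["cpPosIndex"].OutputString()
    back = SimpleCookie()
    back.load(header)
    return back["cpPosIndex"].value
```

Values that come back truncated at the @ or #, or with the case of the id changed, fail the parse and can be discarded the same way the patch discards botched headers.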
There are only ~10/min of these now. I still see no 'redis' channel errors, but I wonder if the random eviction model of redis is at play. redis 3.0 is a bit better at LRU per https://redis.io/topics/lru-cache than our 2.8.
Does this still happen after https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/440031/ ?
Seems to have gotten better after https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/440031/ was deployed and likewise for logstash (+"ChronologyProtector::initPositions").
Tue, Jun 12
Mon, Jun 11
Sun, Jun 10
Sat, Jun 9
Fri, Jun 8
Wed, Jun 6
Tue, Jun 5
Mon, Jun 4
Fri, Jun 1
Tue, May 29
Thu, May 24
I suspect that some RESTBase service forwards a user's cookies (for permissions) but uses a local IP, judging from the logs. Since the CP position redis key is based on the client IP/agent hash, it will not be found and the wait for it will time out. I don't know if the agent is passed through or not.
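To illustrate the mismatch, here is a minimal sketch. The hash function, the key prefix, and the example addresses are assumptions made up for the illustration; only the idea that the key is derived from the client IP/agent comes from the above.

```python
import hashlib

def cp_client_id(ip, user_agent):
    """Hypothetical client id: a hash over the IP and agent the server sees."""
    return hashlib.md5(f"{ip}\n{user_agent}".encode("utf-8")).hexdigest()

def cp_position_key(ip, user_agent):
    """Hypothetical store key under which the positions are saved."""
    return "ChronologyProtector:" + cp_client_id(ip, user_agent)

# Positions are written under the key derived from the browser's own request...
store = {cp_position_key("203.0.113.7", "Mozilla/5.0"): {"pos": "db1:1234"}}

# ...but a service forwarding the same cookies from a local IP derives another key.
direct = store.get(cp_position_key("203.0.113.7", "Mozilla/5.0"))
forwarded = store.get(cp_position_key("10.64.0.12", "Mozilla/5.0"))
```

Here `direct` finds the stored positions while `forwarded` comes back empty, which is the case where the reader would sit waiting until the timeout.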
Wed, May 23
May 23 2018
If SET/DELETE go to all mc* servers in the wancache-(eqiad/codfw) pools (as mediawiki_wancache is configured to do in puppet), then Option B would still work since the consistent hashing wouldn't matter. Having broadcasted operations go to all mc* servers rather than just 1-per-DC (based on hash) is not required for WANCache though. Keeping it this way wouldn't scale well if the rate of those (purge) operations increased hugely for some reason. I do like the conceptual simplicity though.
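A toy model of the trade-off, with made-up server names and plain modulo hashing standing in for the real consistent-hashing ring (both are assumptions, not the production setup):

```python
import hashlib

SERVERS = ["mc-a", "mc-b", "mc-c", "mc-d"]  # hypothetical pool members

def pick_server(key, servers):
    """Modulo hashing as a stand-in for consistent hashing: one server per key."""
    digest = int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)
    return servers[digest % len(servers)]

def broadcast_delete(caches, key):
    """Send the purge to every server, so where the key hashes to is irrelevant."""
    for cache in caches.values():
        cache.pop(key, None)
    return len(caches)  # operations issued: one per server

caches = {name: {} for name in SERVERS}
key = "wancache:page:42"
caches[pick_server(key, SERVERS)][key] = "cached value"
ops = broadcast_delete(caches, key)
```

The purge always lands regardless of the hash, but each one costs as many operations as there are servers in the pool, which is the scaling concern.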
May 14 2018
May 10 2018
The logging level went from INFO to WARNING. I suppose this has been happening for a very long time then.
May 8 2018
May 2 2018
Correct usage of ForkController (which has logic that ttmserver-export is mostly doing) works fine.
Why doesn't that script use ForkController btw?
LBFactory does not implement DestructibleService, though it has a destroy() method. This is due to it being in /libs. It relies on reference counting, where the old service container instance falls out of scope in resetGlobalInstance() with $oldInstance dying, then LBFactory following suit and triggering __destruct()=>destroy(), and so on. If something has a ServiceContainer instance (with LBFactory loaded) pre-fork and tries to use it later, it will get ContainerDisabledException.
May 1 2018
Apr 30 2018
This has now been running for a while (since Apr 17) with the new packages (both Debian versions, though the stretch server isn't there anymore afaik).
Apr 27 2018
refreshLinks2 is not used anymore. Since it is not in $wgJobClasses anymore, those jobs probably won't get cleared in recycleAndDeleteStaleJobs().
Apr 23 2018
Apr 19 2018
I assume daf0514345f03 exposed this bug.
The warnings are pointless; the patch above adds an isset() check.
Apr 12 2018
This is related to T149847 in that we would *have* to stop moving file content around in Special:MovePage just to rename files.
Apr 11 2018
I suspect the transactions are just empty ones with SELECT statements, which don't need to give errors here.
Apr 10 2018
The message index code could do with a large amount of rework. In the meantime, I can't tell why the MessageIndexRebuildJob::newJob() instance must run immediately in isValid()... it's not like the method rechecks what it did before after the rebuild. If nothing else depends on it being immediate, then it should use a DeferredUpdate. If it has to be immediate... then CONN_TRX_AUTO can be considered (as long as it doesn't deadlock by having two transactions updating the same rows).
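The deferred variant would look roughly like this. It is a toy queue, not MediaWiki's DeferredUpdates API; the class, the `is_valid()` stand-in, and its hard-coded return value are assumptions for illustration.

```python
class DeferredUpdates:
    """Hypothetical queue of callbacks that run after the main work is done."""
    def __init__(self):
        self._pending = []
    def add(self, callback):
        self._pending.append(callback)
    def run_all(self):
        # Drain in FIFO order; callbacks queued while running are picked up too.
        while self._pending:
            self._pending.pop(0)()

events = []
updates = DeferredUpdates()

def is_valid():
    """Report staleness now; schedule the rebuild instead of running it inline."""
    events.append("checked")
    updates.add(lambda: events.append("rebuilt"))
    return False  # hard-coded: pretend the index looked stale

result = is_valid()
events.append("main transaction committed")
updates.run_all()
```

The rebuild then runs after the caller's transaction is out of the way, which only works if nothing in the caller depends on the rebuilt index being there immediately.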