Since the value is "false", the callback runs, unless it's running somewhere else and there is no interim value. When this happens a lot in a short time, there will be interim values (lasting up to 30 sec) used, unless they also return false due to some memcached error. If everything returns false, then the callback runs all the time, regardless of the mutex. It won't be empty though.
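For context, a minimal sketch of the kind of caller this behavior applies to (the key name and computeValue() helper are just illustrative); lockTSE is what enables the mutex and interim-value path described above:

```
$value = $cache->getWithSetCallback(
	$cache->makeKey( 'example-thing', $id ),
	$cache::TTL_MINUTE,
	function ( $oldValue, &$ttl, array &$setOpts ) use ( $id ) {
		// Expensive recomputation; note that a return value of false
		// looks like a cache miss to callers, per the comment above
		return computeValue( $id );
	},
	[ 'lockTSE' => 30 ]
);
```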
I noticed that too yesterday. Note that there is a PECL memcached bug that causes things to say TIMEOUT after a KEY TOO LONG or VALUE TOO LARGE error, which makes for confusing failures and logs. I'm not sure if that is at play, but it wouldn't surprise me, and statistically it would affect the most-fetched keys (whatever they are).
The PHP warning is noise. The "Database is read-only" flood is an actual bug...no idea why that happened.
$dbr->getLag() and $lb->getLagTimes() work fine in eval.php on wmf22 wikis as well.
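For reference, the sort of eval.php session I mean (a sketch; wfGetLB()/wfGetDB() are the standard helpers, and the output notes are illustrative):

```
// In eval.php on a wmf.22 wiki
$lb = wfGetLB();
$dbr = wfGetDB( DB_REPLICA );
var_dump( $dbr->getLag() );      // replica lag in seconds (false on error)
var_dump( $lb->getLagTimes() );  // map of server index => lag
```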
I've been looking at the 21->22 logs and changes, and trying things on mw.org. I don't see a read-only problem there and https://www.mediawiki.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb= looks fine. I don't see any lag in the DBs themselves or as seen by MW in that time frame (LoadBalancer graph in Grafana, though the resolution is low).
Tue, Feb 20
Sat, Feb 17
Thu, Feb 15
Wed, Feb 14
Thu, Feb 8
Wed, Feb 7
I see, HHVM works with and without the flags, so they could be set in the background.
Lots of keys use no TTL value, 0, or TTL_INDEFINITE (all meaning indefinite), so there will be a lot of old keys.
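To illustrate what I mean by "no TTL value, 0, or TTL_INDEFINITE" (a sketch; all three end up as non-expiring keys given BagOStuff's default $exptime handling):

```
$cache->set( 'key-a', $value );                             // no TTL argument
$cache->set( 'key-b', $value, 0 );                          // explicit 0
$cache->set( 'key-c', $value, BagOStuff::TTL_INDEFINITE );  // constant, also 0
```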
MEMORY tables were kind of lame last time anyone checked, though I suppose someone can take a look. I doubt it would be too useful given a good innodb buffer pool size.
Sorry about the slow review...this extension has a bit of an ownership problem, with random people stepping in for CR. I was thinking someone else would have merged this by now.
Tue, Feb 6
Verified by local selenium test runs (passes with the fix and fails without the fix).
Fri, Jan 26
Do these tests actually use replication or is it a single DB server? Header logs would also be useful.
Jan 17 2018
So I cannot contact redis via nutcracker on tin. I noticed the password was not actually set for redis (trying to AUTH when no password is set results in an error); using CONFIG SET requirepass <x> didn't make a difference though. In any case, I can use redis-cli to talk to the local redis instances on 01/02 themselves. I'm not sure how much of this is nutcracker vs redis. Restarting either does not help.
Jan 13 2018
Jan 10 2018
I fixed a stupid hostname var bug. Now I get numbers that make sense:
Same-DC (db2070.codfw.wmnet):
string(57) "0.001196186542511 sec/conn (non-SSL) [db2070.codfw.wmnet]"
string(60) "0.00027136325836182 sec/query (non-SSL) [db2070.codfw.wmnet]"
string(53) "0.059528641700745 sec/conn (SSL) [db2070.codfw.wmnet]"
string(56) "0.00028834581375122 sec/query (SSL) [db2070.codfw.wmnet]"

Cross-DC (db1055.eqiad.wmnet):
string(56) "0.10918385744095 sec/conn (non-SSL) [db1055.eqiad.wmnet]"
string(57) "0.03636349439621 sec/query (non-SSL) [db1055.eqiad.wmnet]"
string(52) "0.25189030647278 sec/conn (SSL) [db1055.eqiad.wmnet]"
string(54) "0.036419949531555 sec/query (SSL) [db1055.eqiad.wmnet]"
Jan 9 2018
I see wiki IDs as a type of "domain ID" that just uses two ASCII components, (dbname, prefix), neither using slashes, to avoid the ugliness of things like "mysite?hnewswiki-en" having to appear in config or in "table_wiki" DB fields. For B/C, the non-slash rule can't be a hard rule that throws errors. Given that, the getWiki() functions should use known-to-be-encoded wiki ID values or use DatabaseDomain to derive them. There could be a stricter WikiDatabaseDomain subclass. Changing those methods would probably both fix and break things for the slash scenario; maybe the "doesn't use domain hierarchy delimiter character" restriction could then be enforced by default, behind a flag that could be disabled for legacy mode.
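Roughly what I have in mind, as a sketch (DatabaseDomain and its methods already exist; WikiDatabaseDomain and the example values are hypothetical):

```
use Wikimedia\Rdbms\DatabaseDomain;

// Encode a (dbname, prefix) pair as a wiki/domain ID string
$domain = new DatabaseDomain( 'hnewswiki', null, 'en_' );
$wikiId = $domain->getId(); // e.g. "hnewswiki-en_"

// Derive the components back from a known-to-be-encoded ID
$parsed = DatabaseDomain::newFromId( $wikiId );

// A stricter WikiDatabaseDomain subclass could reject IDs containing
// the "/" hierarchy delimiter, unless a legacy-mode flag is enabled.
```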
Dec 14 2017
I keep coming up with times like:
Dec 12 2017
I started a quick dashboard at https://grafana.wikimedia.org/dashboard/db/job-queue-alerts?orgId=1&from=now-12h&to=now with some alerts.
Dec 11 2017
I suppose we can use jobrunner.runner-status.error.rate, sumSeries(jobrunner.pop.*.failed.*.rate), and sumSeries(jobrunner.pop.*.ok.*.rate) to make alerts in a Grafana dashboard.
Yeah, same thing.
Dec 8 2017
I'm not sure why the time check logic is so complicated; I guess it got prematurely generalized from the single-DB case.
Dec 7 2017
Dec 6 2017
Dec 5 2017
Probably some MW fixes actually reaching production.
Does this still occur?
By reducing the max lock wait times and pushing the brunt of the lag waits out of the critical section, less real time should be wasted.
Dec 4 2017
Dec 2 2017
How long do these run? The sample rate in config is set to be extremely low. So perhaps:
- The buffering class buffers things that won't even be saved
- The buffering could be disabled in CLI mode (see the sketch below)
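Something along these lines for the CLI part (a sketch only; the actual buffering class and wiring differ, this is just the gating idea):

```
// Skip buffering for maintenance scripts, where the sampled data would
// rarely (or never) end up being saved anyway
$useBuffering = ( PHP_SAPI !== 'cli' );
```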
Nov 29 2017
I noticed a worse bug of cpPosTime cookies not being used (not related to WAN cache). The patch for that is above.
The simple thing is to not set INTERIM keys in the same request that purged them. The duration of that rule would be HOLDOFF_TTL, so that the array holding the purged keys doesn't get too big for long-running maintenance scripts. This can be done easily enough with a HashBagOStuff nested in the WAN cache object.
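A sketch of that idea (variable names and placement are illustrative; the real change would live inside WANObjectCache, which already defines HOLDOFF_TTL):

```
// Nested in the WAN cache object: keys purged by this request/process
$recentPurges = new HashBagOStuff( [ 'maxKeys' => 1000 ] );

// When a delete()/purge happens for $key:
$recentPurges->set( $key, true, WANObjectCache::HOLDOFF_TTL );

// Later, before writing an INTERIM key for $key:
if ( $recentPurges->get( $key ) !== false ) {
	return; // this request purged the key recently; skip the interim value
}
```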
This looks like an integration issue with ChronologyProtector vs WANObjectCache.
Nov 28 2017
I guess we will need MW-side logging now. We can probably just add it to the wmf branch.
Nov 27 2017
There is some caller that is not making keys correctly, which causes this. I can't find it though, even after looking through all of core, extensions, and mediawiki-config.
Nov 23 2017
They were mentioned in https://www.mediawiki.org/wiki/Requests_for_comment/Master-slave_datacenter_strategy_for_MediaWiki#Job_queuing though it was never set up (partly due to people being busy with other things). In general, jobs are enqueued on POST requests or from other jobs, all in the master datacenter. In rare cases, jobs are enqueued on GET, or possibly POST (if the api-promise-nonwrite thing is set up in VCL). This should work in a way where the cross-DC propagation is async, rather than having JobQueue::push() block on cross-DC traffic.
Nov 21 2017
Nov 20 2017
Nov 17 2017
Nov 16 2017
So, the post_as_copy = true case works if SwiftFileBackend is changed to no longer blacklist Content-Type on non-PUTs. It would always re-assert the old value if nothing was passed in by the describe() caller.
We should be mindful of the Swift post_as_copy option when set to false. At the moment, that does *not* allow changing Content-Type via POST.
Nov 8 2017
Oct 31 2017
So, running mcrouter via screen -r with the config in /etc/mcrouter/mcrouter.json on tin seems to work fine. The pool replication works and the timings are comparable to twemproxy -- often better than twemproxy.
Oct 30 2017
Oct 26 2017
Oct 24 2017
That "cannot merge" message is mostly useless and overly-technical in a Gerrit specific way (e.g. you can't "submit" without "+2", which is obvious anyway). Just look for "merge conflict" on the changeset page or where the patch shows up in listings, since that actually matters and is common.
There have always been a lot of feature requests and bug reports due to misconfiguration, version mismatches, and so on. I haven't really had the time (for a while now, in fact) to sift through them and find the serious bugs. When I become aware of one I try to fix it, but if it's not major then I probably won't look at it.
Oct 23 2017
Actually, I just moved them to https://grafana-admin.wikimedia.org/dashboard/db/backend-save-timing-breakdown?refresh=5m&orgId=1 .
They are on the main dashboard. If more are added, it would be good to split them out, since the main save timing board is getting long.
Oct 20 2017
Probably hotTTR is way too high. It's really "expected time till refresh given 1 hit/sec". With 50/min, you'd get maybe 2 updates (new values) per regex. I'll put up a patch for that.
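i.e. something like this in the patch ($key, the callback, and the 900 are stand-ins, not the final values):

```
$value = $cache->getWithSetCallback(
	$key,
	$cache::TTL_DAY,
	$regexComputeCallback,
	// hotTTR: expected time till refresh for a key getting ~1 hit/sec;
	// lowering it yields more regenerations for frequently-hit keys
	[ 'hotTTR' => 900 ]
);
```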
Oct 19 2017
Oct 18 2017
Oct 17 2017
Oct 12 2017
Oct 6 2017
Oct 5 2017
We discussed proxies in the last performance meeting and we're OK with that (it would cut down on handshake latency anyway).
JobRunner always starts an LBFactory transaction.
This was actually fixed for new installs before that patch by moving the object cache table to a separate DB.
Oct 4 2017
Also, there is https://bugs.php.net/bug.php?id=74445 :)
You can always do what extensions/CentralAuth/includes/LocalRenameJob/LocalRenameJob.php does AFAIK.
Oct 3 2017
I'd look for the new method calls that are being reached, whether they show up, and how large their profile is if they do. Note that you can use Ctrl-F on the SVG images to highlight matches in purple.
Oct 2 2017
I think it's fine to roll out there as long as you are watching https://grafana.wikimedia.org/dashboard/db/save-timing?refresh=5m&orgId=1 and check the -index.svg flamegraph at https://performance.wikimedia.org/xenon/svgs/daily/ for the day of deployment, the next day (current-day values are always useless/incomplete).