The warnings are pointless; the patch above adds an isset() check.
Thu, Apr 12
This is related to T149847 in that we would *have* to stop moving file content around in Special:MovePage just to rename files.
Wed, Apr 11
I suspect the transactions are just empty ones with SELECT statements, which don't need to give errors here.
Tue, Apr 10
The message index code could do with a large amount of rework. In the meantime, I can't tell why the MessageIndexRebuildJob::newJob() instance must run immediately in isValid()...it's not like the method rechecks what it did before after the rebuild. If nothing else depends on it being immediate, then it should use a DeferredUpdate. If it has to be immediate...then CONN_TRX_AUTO can be considered (as long as it doesn't deadlock by having two transactions updating the same rows).
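For context, the deferred-update pattern suggested here just means queueing the work and flushing it after the main logic finishes, instead of running it inline. A language-agnostic sketch in Python (all names are illustrative, not MediaWiki's actual API):

```python
# Minimal sketch of a deferred-update queue: callers enqueue work
# instead of running it inline, and the framework flushes the queue
# once the primary work (e.g. building the web response) is done.
deferred_updates = []

def add_deferred_update(fn):
    """Queue a callable to run at the end of the request."""
    deferred_updates.append(fn)

def do_deferred_updates():
    """Run and clear all queued updates, in FIFO order."""
    while deferred_updates:
        fn = deferred_updates.pop(0)
        fn()

# Usage: instead of rebuilding the index immediately...
log = []
add_deferred_update(lambda: log.append("rebuild message index"))
log.append("finish validation")  # the main work completes first
do_deferred_updates()            # ...then the rebuild runs
```

This only works if nothing later in the same request depends on the rebuild having already happened, which is exactly the open question above.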
Mon, Apr 9
Wed, Apr 4
Thu, Mar 29
I don't mean "noise" as "unrelated to deploy", rather "expected, but doesn't matter".
The temp table stat increases just seem like noise due to some queries going from "SELECT @@" to "SHOW GLOBAL VARIABLES LIKE 'gtid_%'". E.g.:
Sat, Mar 24
DBO_IGNORE can only be enabled through the config or Database::factory directly. This flag is now largely irrelevant and could probably be finished off with a deprecation.
Thu, Mar 22
It's kind of hard to do this in practice, given the use of load balancers and so on. Some stuff can be removed, deprecated, or moved to IMaintainableDatabase though.
Mar 19 2018
Mar 14 2018
Mar 10 2018
Mar 8 2018
Note that the code making the "Expectation (readQueryTime <= 30) by JobRunner::run" logs does not roll anything back.
Mar 4 2018
Reconnecting in the case of rollback is a corner case, since normally just closing like that should error out. If ROLLBACK fails due to connection loss, there really isn't a need to reconnect, since everything should have rolled back on connection loss in the first place. Some sort of flag to disable reconnection during rollback would be needed.
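The flag idea could look roughly like this — a hypothetical sketch in Python (the class and attribute names are made up for illustration; this is not the actual Database code):

```python
# Sketch of suppressing auto-reconnect while a ROLLBACK is in flight:
# if the connection drops during rollback, the server has already
# discarded the transaction, so reconnecting buys nothing.
class Connection:
    def __init__(self):
        self.alive = True
        self.in_rollback = False
        self.reconnects = 0

    def _reconnect(self):
        self.reconnects += 1
        self.alive = True

    def query(self, sql):
        if not self.alive:
            if self.in_rollback:
                # Connection loss already rolled everything back.
                raise ConnectionError("lost during rollback; not reconnecting")
            self._reconnect()
        return "ok"

    def rollback(self):
        self.in_rollback = True
        try:
            return self.query("ROLLBACK")
        finally:
            self.in_rollback = False
```

Normal queries still reconnect transparently; only the rollback path treats connection loss as "already done".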
Mar 3 2018
Don't null revisions just reuse the same rev_text_id and insert no new blob? At least that's how it used to work.
Is there anything actionable here?
Mar 2 2018
Feb 27 2018
For reference, there is T156938 , for evaluating dynomite.
Feb 25 2018
Feb 22 2018
Since the value is "false", the callback runs, unless it's already running somewhere else and there is an interim value. When this happens a lot in a short time, interim values (lasting up to 30 sec) will be used, unless they also return false due to some memcached error. If everything returns false, then the callback runs all the time, regardless of the mutex. It won't be empty though.
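The flow described above can be sketched roughly as follows — a simplification of WANObjectCache-style logic in Python, with illustrative names and an in-process dict standing in for memcached:

```python
import time

cache = {}      # main value store; missing is treated as "false"
interim = {}    # short-lived interim values: key -> (value, expiry)
locks = set()   # per-key regeneration mutexes

INTERIM_TTL = 30

def get_with_set(key, callback):
    value = cache.get(key, False)
    if value is not False:
        return value
    # Value is "false": try to be the one regenerator.
    if key not in locks:
        locks.add(key)
        try:
            value = callback()
            if value is not False:
                cache[key] = value
                interim[key] = (value, time.time() + INTERIM_TTL)
            return value
        finally:
            locks.discard(key)
    # Someone else holds the mutex: use an interim value if one exists...
    entry = interim.get(key)
    if entry and entry[1] > time.time() and entry[0] is not False:
        return entry[0]
    # ...otherwise run the callback anyway, regardless of the mutex.
    return callback()
```

The "everything returns false" failure mode is visible in the last branch: with no cached value and no usable interim value, every caller ends up running the callback.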
I noticed that too yesterday. Note that there is a PECL memcached bug that causes things to say TIMEOUT after a KEY TOO LONG or VALUE TOO LARGE error, which makes for confusing failures and logs. I'm not sure if that is at play here, but it wouldn't surprise me, and statistically it would affect the most-fetched keys (whatever they are).
The php warning is noise. The "Database is read-only" flood is an actual bug...no idea why that happened.
$dbr->getLag() and $lb->getLagTimes() work fine in eval.php on wmf22 wikis as well.
I've been looking at the 21->22 logs and changes, and trying things on mw.org. I don't see a read-only problem there, and https://www.mediawiki.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb= looks fine. I don't see any lag in the DBs or seen by MW in that time (LoadBalancer graph at Grafana, though the resolution is low).
Feb 20 2018
Feb 17 2018
Feb 15 2018
Feb 14 2018
Feb 8 2018
Feb 7 2018
I see; hhvm works with and without the flags, so they could be set in the background.
Lots of keys use no TTL value, 0, or TTL_INDEFINITE (all meaning infinite), so there will be a lot of old keys.
MEMORY tables were kind of lame last time anyone checked, though I suppose someone can take a look. I doubt it would be too useful given a good innodb buffer pool size.
Sorry about the slow review...this extension has a bit of an ownership problem, with random people stepping in for CR. I was thinking someone else would have merged this by now.
Feb 6 2018
Verified by local selenium test runs (passes with the fix and fails without the fix).
Jan 26 2018
Do these tests actually use replication, or is it a single DB server? Header logs would also be useful.
Jan 17 2018
So I cannot contact redis via nutcracker on tin. I noticed the password was not actually set for redis (trying to AUTH when no password set results in an error); using CONFIG SET requirepass <x> didn't make a difference though. In any case, I can use redis-cli to talk to the local redis instance on 01/02 themselves. I'm not sure how much of this is nutcracker vs redis. Restarting either does not help.
Jan 13 2018
Jan 10 2018
I fixed a stupid hostname var bug. Now I get numbers that make sense:
Same-DC (db2070.codfw.wmnet):
- 0.001196186542511 sec/conn (non-SSL)
- 0.00027136325836182 sec/query (non-SSL)
- 0.059528641700745 sec/conn (SSL)
- 0.00028834581375122 sec/query (SSL)

Cross-DC (db1055.eqiad.wmnet):
- 0.10918385744095 sec/conn (non-SSL)
- 0.03636349439621 sec/query (non-SSL)
- 0.25189030647278 sec/conn (SSL)
- 0.036419949531555 sec/query (SSL)
Jan 9 2018
I see wiki IDs as a type of "domain ID" that just uses two ASCII components, (dbname, prefix), neither using slashes, to avoid the ugliness of having things like "mysite?hnewswiki-en" appear in config or in "table_wiki" DB fields. For B/C, the non-slash rule can't be a hard rule that throws errors. Given that, the getWiki() functions should use known-to-be-encoded wiki ID values or use DatabaseDomain to derive them. There could be a stricter WikiDatabaseDomain subclass. Changing those methods would probably both fix and break things for the slash scenario; maybe the "doesn't use the domain hierarchy delimiter character" restriction could then be enforced by default, behind a flag that could be disabled for legacy mode.
Dec 14 2017
I keep coming up with times like:
Dec 12 2017
I started a quick dashboard at https://grafana.wikimedia.org/dashboard/db/job-queue-alerts?orgId=1&from=now-12h&to=now with some alerts.
Dec 11 2017
I suppose we can use jobrunner.runner-status.error.rate, sumSeries(jobrunner.pop.*.failed.*.rate), and sumSeries(jobrunner.pop.*.ok.*.rate) to make alerts in a Grafana dashboard.
Yeah, same thing.
Dec 8 2017
I'm not sure why the time check logic is so complicated, I guess it got prematurely generalized from the single-DB case.
Dec 7 2017
Dec 6 2017
Dec 5 2017
Probably some MW fixes actually reaching production.
Does this still occur?
By reducing the lock max wait times and pushing the brunt of lag waits out of the critical section, less real time should be wasted.
Dec 4 2017
Dec 2 2017
How long do these run? The sample rate in config is set to be extremely low. So perhaps:
- The buffering class buffers things that won't even be saved
- The buffering could be disabled in CLI mode
Nov 29 2017
I noticed a worse bug of cpPosTime cookies not being used (not related to WAN cache). The patch for that is above.
The simple thing is to not set INTERIM keys in the same request that purged them. The duration of that rule would be HOLDOFF_TTL, so that the array holding the purged keys doesn't get too big for long-running maintenance scripts. This can be done easily enough with a HashBagOStuff nested in the WAN cache object.
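A minimal sketch of that rule, in Python (a plain dict stands in for the in-process HashBagOStuff; function names and the TTL value are illustrative):

```python
import time

HOLDOFF_TTL = 11  # illustrative; the real holdoff constant may differ

purged_in_request = {}  # key -> time it was purged (in-process only)

def note_purge(key):
    """Record that this process purged the key."""
    purged_in_request[key] = time.time()

def may_set_interim(key):
    """Disallow interim sets for keys this process purged recently."""
    purged_at = purged_in_request.get(key)
    if purged_at is None:
        return True
    if time.time() - purged_at > HOLDOFF_TTL:
        # Stale entry: prune it so the map stays small in
        # long-running maintenance scripts.
        del purged_in_request[key]
        return True
    return False
```

Pruning expired entries on lookup is what keeps the map bounded over a long process lifetime.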
This looks like an integration issue with ChronologyProtector vs WANObjectCache.
Nov 28 2017
I guess we will need MW-side logging now. We can probably just add it to the wmf branch.
Nov 27 2017
There is some caller that is not making keys correctly, which causes this. I can't find it, even after looking through all of core, extensions, and mediawiki-config.
Nov 23 2017
They were mentioned in https://www.mediawiki.org/wiki/Requests_for_comment/Master-slave_datacenter_strategy_for_MediaWiki#Job_queuing though it was never set up (partly due to people being busy with other things). In general, jobs are enqueued on POST requests or from other jobs, all in the master datacenter. In rare cases, jobs are enqueued on GET (or possibly POST, if the api-promise-nonwrite thing is set up in VCL). This should work in a way where the cross-DC propagation is async, rather than having JobQueue::push() block on cross-DC traffic.
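The async-propagation idea can be sketched like this — a toy Python model where lists stand in for the queues and the relay step runs out of band (everything here is illustrative, not the actual JobQueue code):

```python
# Sketch of non-blocking cross-DC job propagation: push() enqueues
# locally and records the job in an outbox; a separate relay drains
# the outbox to the remote DC, so push() never waits on WAN latency.
local_queue = []
outbox = []        # pending cross-DC copies
remote_queue = []  # stands in for the other datacenter's queue

def push(job):
    local_queue.append(job)  # fast, local-only
    outbox.append(job)       # relayed later, asynchronously

def relay_outbox():
    """Run periodically by a daemon, never inside push()."""
    while outbox:
        remote_queue.append(outbox.pop(0))
```

The point is that the caller of push() returns as soon as the local enqueue succeeds; cross-DC delivery lag only affects how soon the remote DC sees the job.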
Nov 21 2017
Nov 20 2017
Nov 17 2017
Nov 16 2017
So, the post_as_copy = true case works if SwiftFileBackend no longer blacklists Content-Type for non-PUTs. It would always re-assert the old value if nothing was passed in by the describe() caller.
We should be mindful of the Swift post_as_copy option when set to false. At the moment that does *not* allow changing Content-Type via POST.
Nov 8 2017
Oct 31 2017
So, running mcrouter via screen -r with the config in /etc/mcrouter/mcrouter.json on tin seems to work fine. The pool replication works and the timings are comparable to twemproxy -- often better than twemproxy.
Oct 30 2017
Oct 26 2017
Oct 24 2017
That "cannot merge" message is mostly useless and overly technical in a Gerrit-specific way (e.g. you can't "submit" without "+2", which is obvious anyway). Just look for "merge conflict" on the changeset page or where the patch shows up in listings, since that actually matters and is common.
There have always been a lot of feature requests or bug reports due to misconfiguration/version mismatch and so on. I haven't really had the time (for some time now, in fact) to sift through and find the serious bugs. When I become aware of one I try to fix it, but if it's not major then I probably won't look at it.
Oct 23 2017
Actually, I just moved them to https://grafana-admin.wikimedia.org/dashboard/db/backend-save-timing-breakdown?refresh=5m&orgId=1 .