A quarter or two has rolled over since; this is not currently in my goals any more and I likely won't have time for it in what's left of December. I'll put it up first thing for FY2019–20 Q3 in January.
Sat, Dec 7
@Nikerabbit @Addshore When your respective teams investigate this, if you need help with how the constraint works and how it is typically dealt with, you'll probably want to tag Core Platform Team, as they know this code fairly well (and better than I do).
Thu, Dec 5
Note that s7 is metawiki + centralauth, which are involved in requests for all wikis pretty much all the time. That includes group0, but its traffic may be too small to cause issues and/or to make them noticeable. So the sheer volume of group1 (commons, wikidata) switching to the new branch seems like a possible trigger, regardless of whether the s7 wikis and group1 intersect.
Thanks. Continuing at T239877 about the other two, which seem unrelated at this point (they are still "normal" replica dbs afaik).
The related issue at T239874 about db1062 (10.64.48.15, s7 wikis + centralauth) was much louder than the one reported here and turned out to be unrelated to the train. The s7 host had been decommissioned but was never depooled, which @colewhite has now done. I've closed that task now.
Two hosts both claim to be the master / have 0 read load assigned. The first one is correct, but the second one presumably isn't meant to be there. I don't actually know whether it's valid to have a slave with 0 load. I guess a follow-up could be to reject that in the dbctl schema to prevent future issues, so that one is forced to either remove the host or assign it at least 1 load (which is effectively the same as 0, given that 0 doesn't mean "no interaction at all", per the current issue). Having said that, I also don't know under which circumstances (aside from replag wait) a 0-load slave would be used. Perhaps when the others are lagged or unused, which is quite likely here given that it is the only host with no other read queries, so it might regularly win when load is weighed against least-lagged metadata. To be continued :)
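For illustration only, a minimal sketch of the shape I mean, assuming the familiar LBFactory "sectionLoads"-style layout as published via noc's db.php, where (if I understand the convention correctly) the first entry is the master. All host names except db1062 and all weights below are made up:

```php
// Hypothetical s7 read-load map, for illustration only.
$sectionLoads = [
	's7' => [
		'db1086' => 0,   // current master (hypothetical host name)
		'db1062' => 0,   // decommissioned host that was never removed
		'db1090' => 300, // regular replicas with positive read weights
		'db1098' => 400,
	],
];
```

The second 0-weight entry is what I'd want the schema to reject; per the speculation above, such a host may still be picked when the weighted replicas are lagged.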
I'm not sure but I don't think the db host being unreachable is caused by the train. I spotted it during the first train attempt today and reported it earlier at T239874: MediaWiki: "host db1062 is unreachable" (Connection refused).
According to https://noc.wikimedia.org/db.php, db1062 is still part of s7.
(Following from IRC). Feel free to pass back to me after initial investigation if you have other work (I can patch it).
Wed, Dec 4
If I recall correctly, HHVM had a DNS cache. This is among the reasons that, over the years, we gradually switched wmf-config to using hostnames for services instead of hardcoding IP addresses. I guess we lost that in the PHP 7 transition. Does the OS not cache this at all? Does PHP 7 do something to bypass it?
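A quick way to probe this from a PHP 7 CLI shell (just a sketch; the hostname below is a placeholder, not necessarily the one involved here) would be to time a burst of identical lookups with and without a local caching resolver such as nscd in play:

```php
<?php
// If something on the host caches DNS for us, only the first lookup
// should be slow; if every iteration takes a similar time, each PHP 7
// request is paying the full resolver round-trip.
$host = 'example.svc.eqiad.wmnet'; // placeholder service hostname

for ( $i = 0; $i < 5; $i++ ) {
	$start = microtime( true );
	gethostbyname( $host );
	printf( "lookup %d: %.2f ms\n", $i, ( microtime( true ) - $start ) * 1000 );
}
```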
Deployments happen once a week through the train. This is expected to roll out later this week, unless it gets backported sooner, which I proposed with the above cherry-pick, though I'm not sure that will happen. Is this currently a volume risk? In terms of MW functionality the message is harmless, and the fix didn't change behavior.
Usage within MediaWiki (e.g. libs/objectcache), which this task was for, has been resolved in master, deployed, and backported.
See also T207217 which triggers the same bug in ActorMigration.php from SpecialNewFiles.
Not tracking as prod-error given it is too generic / tracking other issues.
Tue, Dec 3
Some notes from an informal discussion about this task in TechCom:
I'm merging this into T220056 for now given both are about the same issue and have the same outcome as far as TechCom territory goes.
Other issues related to the (lack of) purging of HTTP caches for File description pages:
- T26575: Purge Category and File description pages from HTTP/File cache when members/usage changes via LinkUpdate
- T104711: File description page should be purged after deleting file
- T38380: After re-uploading a file, users still see the browser-cached thumbnail for the old version
- T109214: File upload should clear parser and file cache on usage pages
Looks like it. I'll find out next time; git-pull works fine now, but it already worked when there were no changes.
Note that mediawiki-quibble-vendor-postgres-php72-docker is passing and enforced on every commit, so those are presumably disabled tests, of which we have many outside Postgres as well. Fixing the disabled ones is a new goalpost. Moving it seems fine, but perhaps we can list the tests currently known to be disabled for that reason, so that there is a clear endpoint for this task (i.e. not including new tests that may be disabled for similar reasons after today).
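For context, a hypothetical example (class and method names invented) of what such a disabled test typically looks like, which is why the Postgres job can stay green while coverage is silently reduced:

```php
class HypotheticalPostgresSkipTest extends MediaWikiTestCase {
	public function testSomethingInvolvingTheDatabase() {
		if ( $this->db->getType() === 'postgres' ) {
			// Skipped rather than fixed: the postgres CI job still
			// passes, but this behaviour is never exercised there.
			$this->markTestSkipped( 'Not yet supported on Postgres' );
		}
		// ... assertions that currently only run on MySQL/SQLite ...
	}
}
```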
jquery.tipsy does not have jquery.effects.* modules as dependencies. Is this task meant to be about "jquery.tipsy" instead of "jquery.effects", or is it meant to be about jQuery (UI) effects, with Tipsy as an unrelated subtask?
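For reference, module dependencies are declared in the ResourceLoader module definitions in MediaWiki core's resources/Resources.php; a stripped-down sketch (paths illustrative, not the real definition) of the kind of entry that would put Tipsy in scope for a jquery.effects task, which it does not have:

```php
// Sketch only; not the actual definition from resources/Resources.php.
'jquery.tipsy' => [
	'scripts' => 'resources/src/jquery.tipsy/jquery.tipsy.js',
	'styles' => 'resources/src/jquery.tipsy/jquery.tipsy.css',
	// No line like the following exists for jquery.tipsy:
	// 'dependencies' => [ 'jquery.effects.highlight' ],
],
```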