Wed, May 5
Note that sql.php should still work.
Unassigned, unless there is a clear maintainer to review that patch and do any future upkeep.
Tue, May 4
Thu, Apr 29
Note that the backing store can be moved again later on, making it easy to use mcrouter first.
Wed, Apr 28
Running it again since the screen is gone...
Tue, Apr 27
Mon, Apr 19
Using env vars seems OK for now.
Wed, Apr 14
There could be a FileBackendTestBase with subclasses for each backend. The "proxy" backend classes (FileBackendMultiWrite) could just use MemoryFileBackend instances. The tests for MemoryFileBackend would not need any config. The FSFileBackend subclass could just use the tmp directory. The other FileBackendStore subclass would need site config pointing to a real backend...
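The layout described above could be sketched roughly as follows. This is a Python illustration of the pattern only, not MediaWiki code: `MemoryBackend`, `FSBackend`, `MultiWriteBackend`, and `BackendTestBase` are hypothetical stand-ins for the PHP classes named above, and their two-method interface is an assumption made for brevity.

```python
import os
import tempfile
import unittest


class MemoryBackend:
    """Hypothetical in-memory backend; needs no configuration."""

    def __init__(self):
        self._files = {}

    def create(self, path, content):
        self._files[path] = content

    def read(self, path):
        return self._files.get(path)


class FSBackend:
    """Hypothetical filesystem backend rooted in one directory."""

    def __init__(self, root):
        self._root = root

    def create(self, path, content):
        with open(os.path.join(self._root, path), "wb") as f:
            f.write(content)

    def read(self, path):
        try:
            with open(os.path.join(self._root, path), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None


class MultiWriteBackend:
    """Hypothetical proxy: writes go to every sub-backend, reads to the first."""

    def __init__(self, backends):
        self._backends = backends

    def create(self, path, content):
        for backend in self._backends:
            backend.create(path, content)

    def read(self, path):
        return self._backends[0].read(path)


class BackendTestBase:
    """Shared test cases; subclasses only decide how the backend is built."""

    def make_backend(self):
        raise NotImplementedError

    def setUp(self):
        self.backend = self.make_backend()

    def test_round_trip(self):
        self.backend.create("a.txt", b"hello")
        self.assertEqual(self.backend.read("a.txt"), b"hello")

    def test_missing_file(self):
        self.assertIsNone(self.backend.read("nope.txt"))


class MemoryBackendTest(BackendTestBase, unittest.TestCase):
    def make_backend(self):
        return MemoryBackend()


class FSBackendTest(BackendTestBase, unittest.TestCase):
    def make_backend(self):
        # Uses the system tmp directory, so no site config is required.
        tmp = tempfile.TemporaryDirectory()
        self.addCleanup(tmp.cleanup)
        return FSBackend(tmp.name)


class MultiWriteBackendTest(BackendTestBase, unittest.TestCase):
    def make_backend(self):
        # The proxy is exercised purely against in-memory sub-backends.
        return MultiWriteBackend([MemoryBackend(), MemoryBackend()])
```

A subclass for a real remote store would follow the same shape, with `make_backend()` reading site config and skipping the tests when none is present.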
Apr 13 2021
Apr 9 2021
Possibly related is https://gerrit.wikimedia.org/r/c/mediawiki/core/+/659617 (Last-Write-Wins updates for subkeys within a key).
Apr 5 2021
Apr 2 2021
The headers_sent() checks should handle those, though maybe something is checked in the wrong place.
Mar 30 2021
Closing given the updates to git master and REL1_35
Mar 25 2021
Keep in mind that, strictly speaking, some of these problems are not even solved in MediaWiki for "core" DB shards. These include the "main" s[1-8] shards and the "extension" x1 shard. For example, a web request might update S1 (enwiki) and S7 (centralauth) in one "transaction round", which just means that each of the relevant DB connections is checked for connectivity (pinged if there was no activity within the last second) and, once that passes, the COMMITs are made in rapid succession. It is still possible, though very unlikely, that a proper subset of the transactions fails. Also, some events might be triggered from onTransaction() callbacks or PRESEND deferred updates.
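To make the failure window concrete, here is a minimal Python sketch of a "transaction round" as described above. It is not the MediaWiki implementation: the `Conn` class, `commit_round()`, and the failure flags are hypothetical, and only the one-second idle threshold comes from the description.

```python
import time


class Conn:
    """Hypothetical DB connection handle used only for this sketch."""

    def __init__(self, name, alive=True, fail_commit=False):
        self.name = name
        self.alive = alive
        self.fail_commit = fail_commit
        self.last_activity = time.monotonic()
        self.committed = False

    def ping(self):
        return self.alive

    def commit(self):
        if self.fail_commit:
            raise RuntimeError(f"commit failed on {self.name}")
        self.committed = True


def commit_round(conns, idle_threshold=1.0):
    """Ping idle connections, then commit each one back to back.

    This is not two-phase commit: a failure partway through the second
    loop leaves the earlier connections committed.
    Returns (names committed, name that failed or None).
    """
    now = time.monotonic()
    for conn in conns:
        idle = now - conn.last_activity >= idle_threshold
        if idle and not conn.ping():
            raise ConnectionError(f"{conn.name} unreachable; nothing committed")

    done = []
    for conn in conns:
        try:
            conn.commit()
        except RuntimeError:
            return done, conn.name  # the rare partial-commit case
        done.append(conn.name)
    return done, None
```

The connectivity pass shrinks the window in which a COMMIT can fail but does not close it, which is why a proper subset of the transactions can still end up committed.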
Mar 24 2021
Mar 22 2021
I rebased the patch above. Once this is merged, I can consider this task closed.
I finished running this on labs via:
Mar 19 2021
Mar 15 2021
Mar 13 2021
While we gzip memcached values, it would still be larger than the 1 MB limit. Even if I let WANCache use BagOStuff::WRITE_SEGMENTABLE, that is still a lot of I/O (even with "pcTTL" enabled).
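As a rough way to reason about this, the Python sketch below compresses a value and counts how many 1 MB segments it would need. The `segments_needed()` helper is hypothetical, zlib stands in for whatever compression the cache client applies, and real segmentation would add per-segment key and metadata overhead that is ignored here.

```python
import os
import zlib

LIMIT = 1024 * 1024  # the 1 MB item limit mentioned above


def segments_needed(raw, limit=LIMIT):
    """Return (compressed size in bytes, number of limit-sized segments)."""
    packed = zlib.compress(raw)
    count = -(-len(packed) // limit)  # ceiling division
    return len(packed), count


# Highly repetitive data compresses well; random bytes do not compress at all.
repetitive = b"a" * (3 * LIMIT)
random_like = os.urandom(3 * LIMIT)
print(segments_needed(repetitive))
print(segments_needed(random_like))
```

Every segment is a separate read and write, so a value that stays large after compression multiplies cache traffic on each regeneration.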
Mar 12 2021
The second one alone should be enough for a quick fix.
Mar 11 2021
About how large is the IP list that will be stored in cache?
Mar 5 2021
Ideally the SqlBagOStuff hashing would use HashRing, though any naive transition would involve a lot of misses/churn at first.
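The churn concern can be illustrated with a small Python sketch. It is a generic consistent-hash ring, not MediaWiki's HashRing class or SqlBagOStuff's actual placement logic; the server names, the MD5-based hash, and the virtual node count are all assumptions.

```python
import bisect
import hashlib


def _h(value):
    """Stable integer hash of a string."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def modulo_server(key, servers):
    """Naive placement: hash modulo the number of servers."""
    return servers[_h(key) % len(servers)]


class Ring:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, servers, vnodes=100):
        self._points = sorted(
            (_h(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self._hashes = [point[0] for point in self._points]

    def server(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        i = bisect.bisect(self._hashes, _h(key)) % len(self._points)
        return self._points[i][1]


def moved_fraction(place_old, place_new, keys):
    """Fraction of keys whose server differs between two placements."""
    moved = sum(1 for k in keys if place_old(k) != place_new(k))
    return moved / len(keys)


keys = [f"key{i}" for i in range(5000)]
old = ["pc1", "pc2", "pc3"]          # hypothetical server names
new = ["pc1", "pc2", "pc3", "pc4"]

# Adding one server: modulo placement versus ring placement.
mod_churn = moved_fraction(
    lambda k: modulo_server(k, old), lambda k: modulo_server(k, new), keys
)
ring_old, ring_new = Ring(old), Ring(new)
ring_churn = moved_fraction(ring_old.server, ring_new.server, keys)

# Switching scheme on an unchanged server list: the naive transition.
switch_churn = moved_fraction(
    lambda k: modulo_server(k, old), ring_old.server, keys
)
print(mod_churn, ring_churn, switch_churn)
```

The ring moves far fewer keys than modulo placement when a server is added, but the one-time switch from modulo to ring placement relocates most keys, which is the initial burst of misses mentioned above.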
Playing around with
mwscript shell.php aawiki
...I noticed that SHOW SLAVE STATUS is empty in eqiad for the 'pc3' slot server. Both have SHOW MASTER STATUS output and read_only = 0. Any reason the eqiad DBs are not listening to the codfw DB binlogs?
Feb 26 2021
Feb 25 2021
Feb 23 2021
I see two places in ConvertibleTimestamp.php that recast generic "Exception" errors into "TimestampException", which would make convert() return false, which could cause this problem.
Feb 22 2021
Feb 19 2021
Feb 16 2021
Feb 12 2021
Feb 11 2021
Possibly triggered by MediaWiki\Auth\AuthManager->autoCreateUser calls, which I also see in the logging for the same reqId.
Feb 10 2021
The only things that currently update the replication positions on HTTP GET, and that are not easily spotted entry points like rollback/createaccount/login (which can be routed like HTTP POST), are:
- UpdateHitCountWatcher from AbuseFilter on edit form views (this should use the main stash or at least the job queue); does not need chronology-protector
- CentralAuthUser->lazyImportLocalNames(); this should be solved by the migration script (T150506)
- SpecialContentTranslation setting global preferences just to store "user has seen X" state; this should use the main stash (or the job queue at least)
- ShortUrlHooks on page views; this will be fixed by T256993
Feb 8 2021
Feb 4 2021
I wonder what output_buffering value is being used in php.ini here. If I set it to off (not my distro default), I can trigger the MW_SETUP_CALLBACK use of ob_start()/OutputHandler::handle(). I definitely see some mismatch between the logic of OutputHandler::handle and MediaWiki::outputResponsePayload.
Jan 29 2021
Jan 28 2021
Jan 27 2021
Logstash no longer shows the errors after the deploy.
It would be interesting to see if the rate of occurrence changes after T266055 is deployed.
I can repro with
./srv_paratest --filter BagOStuff --use-bagostuff=redis
Jan 26 2021
It should use DB Domains, which can always be converted to wiki IDs (though not 100% the other way around in some messy legacy edge cases that do not affect WMF). This is also what LoadBalancer has always expected in its methods.
Jan 25 2021
Jan 24 2021
Jan 21 2021
Jan 12 2021
I don't think it would be worth using pt-heartbeat for LoadBalancer::waitFor() unless the precision was much higher (likely problematically high in terms of spammy heartbeat table updates).
Jan 7 2021
Jan 5 2021
At first, I suspected a timeout causing a failed deferred update, but it seems that no failure was logged, likely due to NullLogger being used by sub-backends of FileBackendMultiWrite.
Jan 4 2021
A subset of the log entries are bogus though (should be DEBUG, not ERROR).
See also: T264735
Dec 14 2020
Dec 12 2020
Dec 11 2020
Dec 10 2020
Dec 9 2020
Dec 7 2020
Dec 3 2020
Related task: T269325
Dec 1 2020
Nov 25 2020
I'm seeing timeouts associated with these entries, e.g.:
Nov 20 2020
Nov 18 2020
Looked like a config error.
Nov 17 2020
This seems to have gotten better over the weeks. Not sure why.
Nov 13 2020
What is the state of this now? Are there any query graphs specific to this table?
No objections here.
Nov 6 2020
I see plenty of timeouts where the error is not logged twice. Those that happen twice seem to be about 200 ms apart.
It could be related if the problem has to do with regenerations (since it was intermittent). In any case, excessive regeneration is a problem in itself.
Nov 5 2020
Searching for << +channel:memcached +message:/.*deps.*/ >> I still see this sometimes.