
Deprecation warning client-repo wikitext link
Closed, ResolvedPublic

Description

image.png (271×1 px, 89 KB)

A deprecation warning now shows up on master when referencing an item using wikitext on the client.

Steps to reproduce:

  1. Setup repo/client (followed this guide)
  2. Create an item on repo
  3. Reference the item on client with something like this
{{#statements:P1|from=Q1}}

The deprecation warning shows up as

Deprecated: Deprecated cross-wiki access to MediaWiki\Revision\RevisionRecord. Expected: the local wiki, Actual: 'default'. Pass expected $wikiId. [Called from MediaWiki\Revision\RevisionRecord::getId in /var/www/mediawiki/includes/Revision/RevisionRecord.php at line 286] in /var/www/mediawiki/includes/debug/MWDebug.php on line 376

This seems to stem from a change in core: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/666519. After some discussion yesterday, @hoo had a chat with @daniel:

[17:25:23] <hoo> So, we were just looking into https://gerrit.wikimedia.org/r/c/mediawiki/core/+/666519/21/includes/Revision/RevisionRecord.php#114
[17:27:02] <duesen> did it cause any problems?
[17:27:09] <hoo> So Wikibase is tripping these off, by calling RevisionStore::getRevisionById
[17:27:28] <hoo> AFAICT we have the wiki ids set correctly throughout
[17:28:19] <duesen> Are you passing Title objects? If yes, that would be the issue. Title is always local. You need to use a PageIdentityValue instead. 
[17:28:40] <hoo> We don't, but something in core is creating titles here
[17:28:56] <hoo> I had the snippet handy… one sec
[17:29:31] <hoo> RevisionStore::newRevisionFromRowAndSlots has $page = $this->getTitle( $pageId, $revId, $queryFlags );
[17:30:33] <hoo> So that will never work for non-local wikis (if it's not supposed to, I guess we should warn right there)
[17:30:50] <duesen> Ok, let me have a look. 
[17:33:18] <duesen> The simple solution is to pass a PageIdentity yourself, instead of relying on the default.
[17:33:33] <hoo> I guess RevisionStore should be creating PageIdentityValue instead of Title
[17:33:33] <hoo>                 // TODO: WikiPageEntityMetaDataLookup should use RevisionStore::getQueryInfo,
[17:33:33] <hoo>                 // then we could use RevisionStore::newRevisionFromRow here!
[17:33:41] <duesen> But you are right that falling back to getTitle() isn't working correctly.
[17:33:58] <hoo> That's from Wikibase
[17:34:13] <hoo> so doing it right(tm) there will be a bit more involved
[17:34:30] <hoo> Or we allow passing a page to getRevisionById
[17:36:09] <duesen> hm yea, that would actually be kind of nice
[17:36:30] <duesen> I'll poke around a bit, I'll have a patch up in half an hour or so I hope.
[17:36:37] <duesen> will ping you then
[17:36:46] <hoo> Cool, thanks :)
[18:22:13] <duesen> Making progress, writing tests now
[19:02:38] <duesen> hoo: i'm running into trouble with how the tests are written. Nearly there...
[19:15:11] <duesen> hoo: I made two patches, one allowing a page to be passed to getRevisionById, and one that makes getRevisionById work correctly cross-wiki even when no page is provided.
[19:15:14] <duesen> https://gerrit.wikimedia.org/r/c/mediawiki/core/+/672499/1
[19:15:24] <duesen> https://gerrit.wikimedia.org/r/c/mediawiki/core/+/672462/1
[19:16:22] <duesen> I see odd test failures. Will deal with them in a bit
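Duesen's suggestion in the log above -- passing a page explicitly instead of relying on the Title-based fallback -- can be sketched with simplified, hypothetical stand-ins for MediaWiki's PageIdentityValue and RevisionStore::getRevisionById. The real classes have richer signatures; this only illustrates why carrying the wiki ID on the page object avoids the local-only fallback:

```php
<?php
// Simplified, hypothetical stand-in for MediaWiki\Page\PageIdentityValue:
// a page reference that knows which wiki it belongs to.
// As in MediaWiki core, false conventionally means "the local wiki".
class FakePageIdentity {
    public function __construct(
        public int $pageId,
        public string $dbKey,
        public string|false $wikiId
    ) {
    }
}

// Simplified stand-in for RevisionStore::getRevisionById(): if the
// caller supplies a page, its wiki ID is used; otherwise we fall back
// to building a *local* page, which is wrong for foreign wikis.
function getRevisionById( int $revId, string|false $storeWikiId, ?FakePageIdentity $page = null ): array {
    if ( $page === null ) {
        // The problematic fallback: always produces a local page.
        $page = new FakePageIdentity( 0, 'Unknown', false );
    }
    if ( $page->wikiId !== $storeWikiId ) {
        throw new RuntimeException(
            'Expected wiki ' . var_export( $storeWikiId, true ) .
            ', got ' . var_export( $page->wikiId, true )
        );
    }
    return [ 'revId' => $revId, 'wiki' => $page->wikiId ];
}

// A store for the foreign wiki 'repowiki' works when the caller
// passes a page carrying the matching wiki ID...
$rev = getRevisionById( 42, 'repowiki', new FakePageIdentity( 1, 'Q1', 'repowiki' ) );

// ...but the default local fallback trips the cross-wiki check.
try {
    getRevisionById( 42, 'repowiki' );
    $fallbackOk = true;
} catch ( RuntimeException $e ) {
    $fallbackOk = false;
}
```

This mirrors the two patches linked above: one lets callers pass a page to getRevisionById, the other makes the no-page path behave correctly cross-wiki.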

Event Timeline

toan renamed this task from Un-break repo-client tests to Deprecation warning client-repo wikitext link. Mar 16 2021, 8:17 AM

I also notice that MediaWiki core has again hard-deprecated code that is still used in Wikimedia-maintained code even though the stable interface policy forbids this:

> Hard deprecation MUST NOT be applied to code still used in Wikimedia maintained code. Such usage MUST be removed first.

Last time this happened, the explanation (T273622#6797449) was that it was not possible to catch these usages without hard deprecation. Maybe this is a similar case, but is it too much to ask that such deprecations are at least called out or explained somewhere, e.g. in the Gerrit change introducing them or in the related Phabricator task? How are developers supposed to flag up these occurrences if the deprecation message doesn’t even mention the related task?

Mentioned in SAL (#wikimedia-operations) [2021-03-16T20:00:48Z] <brennen> 1.36.0-wmf.35 train status (T274939): blocked at group0 on T277362

> I also notice that MediaWiki core has again hard-deprecated code that is still used in Wikimedia-maintained code even though the stable interface policy forbids this:
>
>> Hard deprecation MUST NOT be applied to code still used in Wikimedia maintained code. Such usage MUST be removed first.
>
> Last time this happened, the explanation (T273622#6797449) was that it was not possible to catch these usages without hard deprecation. Maybe this is a similar case, but is it too much to ask that such deprecations are at least called out or explained somewhere, e.g. in the Gerrit change introducing them or in the related Phabricator task? How are developers supposed to flag up these occurrences if the deprecation message doesn’t even mention the related task?

That's not a valid reason to bypass the deprecation policy. In the past we just added logging for it (e.g. T176526: Remove $wgTitle fallback from EditPage in MW1.36) until we were satisfied we had caught everything. I would propose reverting the core change on the basis that it did not follow the deprecation policy, and then moving the train forward again. Holding it because someone else broke the policy doesn't make sense to me.

Change 672810 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/core@master] Revert "Deprecate constructing revision with non-proper page"

https://gerrit.wikimedia.org/r/672810

Change 672811 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/core@wmf/1.36.0-wmf.35] Revert "Deprecate constructing revision with non-proper page"

https://gerrit.wikimedia.org/r/672811

> I also notice that MediaWiki core has again hard-deprecated code that is still used in Wikimedia-maintained code even though the stable interface policy forbids this

This was of course not intentional - the policy asks for hard deprecation to happen as soon as all callers have been fixed, which in practice means as soon as we think all callers have been fixed. The problem is that such code is sometimes hard to find - and sometimes it's only found when it starts logging deprecation warnings in production.

> That's not a valid reason to bypass the deprecation policy. In the past we just added logging for it (e.g. T176526: Remove $wgTitle fallback from EditPage in MW1.36) until we were satisfied we had caught everything.

Technically, deprecation warnings are logging. I can see that it would be preferable to start out with a lower log level, so these don't show up as production errors. But using "soft" logging means these issues don't show up as test failures. The fact that deprecation warnings make tests fail is very helpful for finding any cases we missed by looking at the code.

Perhaps the solution is to handle deprecation warnings separate from other production issues?
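As a rough, self-contained illustration of the point that deprecation warnings surface as test failures: MediaWiki's wfDeprecated() ultimately raises a PHP deprecation notice, and a test harness can convert such notices into exceptions. The function names below are made up for the sketch; the conversion mechanism itself is standard PHP:

```php
<?php
// Sketch: a test harness installs an error handler that turns
// E_USER_DEPRECATED notices (the kind emitted on hard deprecation)
// into exceptions, so a single deprecated call fails the test.
set_error_handler( function ( int $errno, string $errstr ): bool {
    if ( $errno === E_USER_DEPRECATED ) {
        throw new LogicException( "Deprecation: $errstr" );
    }
    return false; // let PHP handle all other error levels
} );

// Hypothetical legacy function; the trigger_error() call stands in
// for what wfDeprecated( __METHOD__, '1.36' ) does in MediaWiki.
function legacyHelper(): string {
    trigger_error( 'Use of legacyHelper was deprecated in 1.36.', E_USER_DEPRECATED );
    return 'result';
}

try {
    legacyHelper();
    $testFailed = false;
} catch ( LogicException $e ) {
    // The deprecated call surfaces as a test failure.
    $testFailed = true;
}
restore_error_handler();
```

Soft logging would skip the exception and merely record the notice, which is exactly why it doesn't show up in CI.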

> How are developers supposed to flag up these occurrences if the deprecation message doesn’t even mention the related task?

That is a good point - we have not included task IDs in deprecations so far. But perhaps it would be a good idea to do so.

Anyway, the "proper" fix is done; I'll clean up the remaining issues in the tests tomorrow: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/672499

For context - the description of T275531 is indeed rather short, and the real background is in T273284. The idea is that, to avoid data corruption such as T260485, we have to ensure that cross-wiki operations on entities such as users, pages, and revisions are safe. We do that by asserting that they belong to the correct wiki. Title is not cross-wiki aware, so using it in a cross-wiki context was deprecated.
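A minimal, self-contained sketch of this wiki-ownership assertion (hypothetical class names; the real checks live in RevisionRecord and related classes): the record stores which wiki it belongs to, and accessors compare that against the wiki the caller expects, emitting a deprecation notice on mismatch:

```php
<?php
// Hypothetical, minimal model of the wiki-ownership guard. As in
// MediaWiki core, false conventionally denotes "the local wiki".
class FakeRevisionRecord {
    public function __construct(
        private int $id,
        private string|false $wikiId = false
    ) {
    }

    // The caller states which wiki it expects; a mismatch means a
    // revision object is being used cross-wiki, which is unsafe.
    public function getId( string|false $expectedWikiId = false ): int {
        if ( $expectedWikiId !== $this->wikiId ) {
            trigger_error(
                'Deprecated cross-wiki access to revision. Expected: ' .
                var_export( $expectedWikiId, true ) . ', Actual: ' .
                var_export( $this->wikiId, true ),
                E_USER_DEPRECATED
            );
        }
        return $this->id;
    }
}

$local = new FakeRevisionRecord( 7 );
$foreign = new FakeRevisionRecord( 9, 'repowiki' );

$okId = $local->getId();        // expected wiki matches: silent
$crossId = @$foreign->getId();  // mismatch: deprecation notice
```

This is why the warning in the description reads "Expected: the local wiki, Actual: 'default'" - the caller never said which wiki it expected.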

I agree that it was impossible to discover this idea from the deprecation message, and in this case it was also not discoverable from the task linked in the commit message.

I propose we think about how we can minimize disruption by improving how we log deprecation messages, and improve communication by making the relevant information more discoverable in context. Please let me know if you have ideas.

Change 672810 merged by jenkins-bot:
[mediawiki/core@master] Revert "Deprecate constructing revision with non-proper page"

https://gerrit.wikimedia.org/r/672810

Change 672811 merged by jenkins-bot:
[mediawiki/core@wmf/1.36.0-wmf.35] Revert "Deprecate constructing revision with non-proper page"

https://gerrit.wikimedia.org/r/672811

Mentioned in SAL (#wikimedia-operations) [2021-03-16T23:56:32Z] <krinkle@deploy1002> Synchronized php-1.36.0-wmf.35/includes/Revision/: I8619ab9e92b, T277362, T275531 (duration: 00m 58s)

Ladsgroup claimed this task.

>> That's not a valid reason to bypass the deprecation policy. In the past we just added logging for it (e.g. T176526: Remove $wgTitle fallback from EditPage in MW1.36) until we were satisfied we had caught everything.

> Technically, deprecation warnings are logging. I can see that it would be preferable to start out with a lower log level, so these don't show up as production errors. But using "soft" logging means these issues don't show up as test failures. The fact that deprecation warnings make tests fail is very helpful for finding any cases we missed by looking at the code.

It just pushes the problem onto someone else, by stopping their development and forcing them to drop everything just to unbreak their repo - in this case it hit everyone by stopping the entire train. That Wikimedia-deployed code was still using this hard-deprecated code was identified on Saturday; once that identification was done, why did it have to rise all the way to a train blocker on Tuesday to get a revert?

> It just pushes the problem onto someone else, by stopping their development and forcing them to drop everything just to unbreak their repo

I was not aware this broke any tests in any repo - did it?

My understanding was that there are no automated tests for the cross-wiki case in Wikibase. Because there are no such tests, the issue did not show up in CI. (Not blaming anyone but myself - I originally wrote the cross-wiki revision loading code in Wikibase, and then I rewrote it for RevisionStore).

I have been wanting to improve the testing situation for quite a while. Perhaps incidents like this one can help to get more attention and resources for improving our testing platform. Some relevant tickets: T261848: Simulate databases for sister sites in phpunit, T248683: Create and run a suite of end-to-end tests for the Wikimedia environment, T195932: Test all MediaWiki tarball extensions in gate for all changes to MediaWiki and each other.

> in this case it hit everyone by stopping the entire train.

I don't see why deprecation warnings should stop the train at all. I agree that they should be worked on with high priority, but why should they be treated as production errors?

> That Wikimedia-deployed code was still using this hard-deprecated code was identified on Saturday; once that identification was done, why did it have to rise all the way to a train blocker on Tuesday to get a revert?

I learned about the issue from Hoo on IRC at 17:30 on Monday. I immediately started work on a fix. On Tuesday, that fix got a couple of rounds of review; getting the tests right wasn't trivial. However, the urgency wasn't clear at all: I didn't hear from the Wikidata team all day.

>> It just pushes the problem onto someone else, by stopping their development and forcing them to drop everything just to unbreak their repo

> I was not aware this broke any tests in any repo - did it?

> My understanding was that there are no automated tests for the cross-wiki case in Wikibase. Because there are no such tests, the issue did not show up in CI. (Not blaming anyone but myself - I originally wrote the cross-wiki revision loading code in Wikibase, and then I rewrote it for RevisionStore).

It is indeed very unfortunate that there have not been automated tests for this functionality. They would have given the issue more visibility, and sooner.
WMDE has created tests for this recently -- this is how we identified the problem on Monday -- but we have not yet integrated them into Wikimedia Jenkins CI. I expect us to do this in a couple of weeks.

>> in this case it hit everyone by stopping the entire train.

> I don't see why deprecation warnings should stop the train at all. I agree that they should be worked on with high priority, but why should they be treated as production errors?

If blocking the train on this issue -- which, BTW, was ultimately triggered by T277593; WMDE spent the whole day going back and forth in internal chats on whether this is a train blocker or whether all would be fine once the code landed in production -- was too extreme, it would be good to have some clarity on how such deprecation warnings, and the exceptions/errors logged stemming from them, should be treated. In my eyes, treating them as a train blocker has been in line with the recently announced "new line on logspam".

>> That Wikimedia-deployed code was still using this hard-deprecated code was identified on Saturday; once that identification was done, why did it have to rise all the way to a train blocker on Tuesday to get a revert?

> I learned about the issue from Hoo on IRC at 17:30 on Monday. I immediately started work on a fix. On Tuesday, that fix got a couple of rounds of review; getting the tests right wasn't trivial. However, the urgency wasn't clear at all: I didn't hear from the Wikidata team all day.

Point taken; we should have been more active in communicating about this problem outside of the WMDE (virtual) office. Something we shall improve in the future.

> It is indeed very unfortunate that there have not been automated tests for this functionality. They would have given the issue more visibility, and sooner.
> WMDE has created tests for this recently -- this is how we identified the problem on Monday -- but we have not yet integrated them into Wikimedia Jenkins CI. I expect us to do this in a couple of weeks.

Oh, that's excellent! I'm curious what approach you took for these tests.
Having them in CI would have prevented the offending code from being merged.

> If blocking the train on this issue -- which, BTW, was ultimately triggered by T277593; WMDE spent the whole day going back and forth in internal chats on whether this is a train blocker or whether all would be fine once the code landed in production -- was too extreme, it would be good to have some clarity on how such deprecation warnings, and the exceptions/errors logged stemming from them, should be treated. In my eyes, treating them as a train blocker has been in line with the recently announced "new line on logspam".

I generally agree with the stricter line on logspam. But I do think that halting the train for deprecation warnings does more harm than good. Perhaps @brennen and @thcipriani can chime in on that.

Generally, halting/reverting deployments should be done to prevent damage to the site. Halting/reverting on logspam seems like a desperate measure, intended to make sure that warnings in production are actually prioritized. Which shouldn't be necessary, but I suppose it is.

In this case however, the warning was clearly harmless (deprecation warnings indicate that things are still working properly), and the issue was already being worked on. Halting the train just caused more work for more people.

I guess part of the problem was that my fix wasn't tagged with this ticket, which I didn't know about until late last night.

> Point taken; we should have been more active in communicating about this problem outside of the WMDE (virtual) office. Something we shall improve in the future.

I notice that neither this task nor T277593 is on any PET board, even though it quickly became clear that the cause was in core. Pinging the relevant teams and components (in this case, MediaWiki-Core-Revision-backend and Platform Engineering) helps get attention on issues like this more quickly, and helps coordinate on the way forward.

A handful of responses:

  1. Deployers have no way to know in the moment that something is only a deprecation warning with no other consequences. If a warning isn't material to production health, it should probably either be labeled in a way that makes that extremely clear, or be sent to some channel that isn't surfaced where deployers are looking for signs of production breakage.

    1a. What is obvious to MediaWiki developers is frequently not obvious to train deployers. We aren't MediaWiki developers.
  2. If there's ever a conversation about whether something might become a train blocker, RelEng would greatly appreciate if we were looped into the discussion by having it surfaced on the blocker task. Please see Deployments/Risky change template. We can and will improve our documentation about this. (See T273802.)
  3. It is undeniably true that we sometimes halt the train on things which are "only" logspam. Partly this is because it's hard for us to tell the difference without developer input. Partly this is because halting the train is the only way we can get enough of them fixed to keep production logspam down to tolerable levels. If you think this is a broken system, you are probably right, but please remember that our incentives include maintaining a viable signal about production health week over week and month over month. A process which is currently manual in the extreme, and centralized on our team. We're working on changing that, but it's a hard problem.

Thank you for your response, @brennen!

> Deployers have no way to know in the moment that something is only a deprecation warning with no other consequences. If a warning isn't material to production health, it should probably either be labeled in a way that makes that extremely clear, or be sent to some channel that isn't surfaced where deployers are looking for signs of production breakage.

In my mind, deprecation warnings should never be material to production health. They exist in order to warn before things break. So perhaps they should be channeled elsewhere. But we would have to make sure that they are still noticed, and bugs still get filed. They should definitely not be ignored or allowed to pile up.

> If there's ever a conversation about whether something might become a train blocker, RelEng would greatly appreciate if we were looped into the discussion by having it surfaced on the blocker task. Please see Deployments/Risky change template. We can and will improve our documentation about this. (See T273802.)

Duly noted. We had a couple of such risky patches a while ago, around ParserCache. We did not think of this patch here as risky, precisely because it only introduced warnings, and did not break behavior.

> It is undeniably true that we sometimes halt the train on things which are "only" logspam. Partly this is because it's hard for us to tell the difference without developer input. Partly this is because halting the train is the only way we can get enough of them fixed to keep production logspam down to tolerable levels.

I'm not complaining that it's broken, I'm thinking about ways to improve it. And there are certainly different kinds of "warnings" - e.g. a "notice" about accessing an undefined array key may have very serious implications, depending on how that value is being used. I'm trying to think of a better channel for things that are known to be non-critical, things that should be fixed asap but are not a danger to the live site. Fixing deprecation warnings isn't just nice to have, it should be high priority. But it's not UBN, and it should not block the train. Maybe a separate log channel would be the way to go.
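One possible shape for that idea, as a hedged sketch: MediaWiki emits deprecation warnings on a dedicated 'deprecated' logging channel, so a LocalSettings.php fragment like the following could route them away from the main error logs. The target path is an assumption for illustration; Wikimedia production uses a Monolog-based configuration rather than simple file targets.

```php
<?php
// LocalSettings.php excerpt (sketch): send the 'deprecated' log
// channel to its own file, separate from production error channels,
// so deprecation warnings can be triaged without blocking the train.
// The log path here is an assumption, not a production value.
$wgDebugLogGroups['deprecated'] = '/var/log/mediawiki/deprecated.log';
```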