Page MenuHomePhabricator

Page Curation fails to create AFD page
Closed, ResolvedPublicBUG REPORT

Description

Steps to reproduce:
Go to a page in the New Page Feed
Nominate for deletion

Expected outcome:
Page is tagged, various logs updated, Articles for Deletion page is created with nominating statement

Actual result:
Page is tagged, various logs updated, Articles for Deletion page is NOT created

An example of where this happened is https://en.wikipedia.org/w/index.php?title=Ilsenburg_Factory&action=history

@ComplexRational experienced it and so they can share other information that might help in reproducing

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
DannyS712 changed the subtype of this task from "Task" to "Bug Report".Nov 11 2019, 11:35 PM

It seems that @Elmidae just experienced this as well: https://en.wikipedia.org/w/index.php?title=Wikipedia:Articles_for_deletion/Alien_Cabal&oldid=933049194 came three minutes after the attempted tagging, see contribs and edit summary.

It seems that @Elmidae just experienced this as well: https://en.wikipedia.org/w/index.php?title=Wikipedia:Articles_for_deletion/Alien_Cabal&oldid=933049194 came three minutes after the attempted tagging, see contribs and edit summary.

Note that in this case, the results were not just "page is tagged" (as stated above), but the deletion log was also updated. The one thing that did not happen was the creation of the deletion discussion. So if it's the same bug, then some fractional fix seems to have happened in between.

It seems that this bug has never been processed. It's driving reviewers away from the process just when NPP has a massive backlog and needs all the help it can get. Would it be possible to do something about it now? @MarioGom , @Joe Roe , @DanCherek

Anyone is free to investigate. (Please do not ping random folks. Thank you.)

@Aklapper could you please ensure that this bug is addressed as quickly as possible. Thanks.

@Kudpung: No, as I do not have a drawer with a bunch of developers who have nothing to do, as I am not in a position to tell other people what to work on, as this seems to be about 1 (one) single production wiki out of hundreds of wikis, and as I don't see any sudden (!) increased urgency.

I went through the diffs above and created some links to groups of diffs:

Since we have examples of both writing to "Wikipedia:Articles for deletion/Log/2019 December 29" log and the Special:Log page curation log, I've edited this ticket to just focus on the missing AFD page instead of missing logs. The missing AFD page seems to be what these all have in common.

It's possible this is the same bug as T239712

Novem_Linguae renamed this task from Page Curation Incorrectly Process AfD to Page Curation fails to create AFD page.Jul 6 2022, 1:05 AM
Novem_Linguae updated the task description. (Show Details)

@Aklapper This is quite urgent because this fix is long overdue and NPP is in crisis with a huge backlog. Many are now using Twinkle instead which does not provide the essential information and features, therefore the accuracy of the patrolling by less experienced reviewers is not good and creates more work for other patrollers who have to correct them and explain what they are doing wrong (diffs available). Of the 700+ reviewers, only a tiny fraction will do any patrolling and more are abandoning NPP due to frustration (diffs available).

Please see my previous comment again. "long overdue" does not imply urgency - nearly every issue is "long overdue" given that resources are limited. Thanks.

@Aklapper Long overdue by several years. This is a software bug. It's not a request for a new cosmetic or convenience feature. It does need to be addressed.

@Kudpung: Anybody who thinks this needs to be addressed and is overdue or urgent is and has been very welcome to provide a patch. Please also see https://www.mediawiki.org/wiki/Bug_management/Development_prioritization - thanks.

@Aklapper Many of us are aware of that page - that's why we keep asking here for cooperation and support for NPP which is not only an essential function on the en.Wiki, the WMF's flagship project, but also a software that was developed not by volunteers, but by paid WMF engineers.

I spent a couple hours on this tonight. Was unable to reproduce the #4 test case using the same article text on a localhost wiki.

I will note however if there is an error writing to the AFD log (for example, if the AFD log hasn't been created yet, it will throw an error), then the AFD page will not be written because the code is sequential. AFD log first, then AFD page is written. An error writing the AFD log aborts the AFD page.

Some relevant code is located at

Alert popups can be a good hint. If you guys get alert popups, please take note of what they say and include them with your bug report. Here is a random PageTriage alert popup so you know what it looks like:

image.png (207×591 px, 11 KB)

Just to clarify, we don't have a way of replicating this issue reliably, right?

@TheresNoTime I lack any technical knowledge, but I assume that to replicate it you would need to find an article in the feed (or create one) that should be sent to AfD and initiate the process from the Curation tool.

Just to clarify, we don't have a way of replicating this issue reliably, right?

Having spoken to @Novem_Linguae about this, they came up with the great idea to add some logging to handleError with the aim to getting these errors visible in mw-client-error on logstash — that way, each time this happens we can get some insight as to why and hopefully resolve this

Change 814294 had a related patch set uploaded (by Samtar; author: Samtar):

[mediawiki/extensions/WikimediaEvents@master] clientError.js: Add 'error.pagetriage' topic

https://gerrit.wikimedia.org/r/814294

Alright, I think I cracked this one. The problem is the UX wants you to type your AFD message, then click "Add details" to lock it in, then click "Mark for deletion". If you click "Mark for deletion" without clicking "Add details", then you get errors and bugs.

The fix is to disable the "Mark for deletion" button until the "Add details" button is clicked.

Logic needs to be added or fixed to disable the "Mark for deletion" button whenever there is an unclicked "Add details" and its associated textinput/textarea open. And logic to enable "Mark for deletion" once all of these are handled.

The following fields in "Mark for deletion" use an "Add details" button:

  • Speedy deletion -> Unambiguous copyright infringement
  • Speedy deletion -> Recreation of a page that was deleted per a deletion discussion
  • Speedy deletion -> Recently created article that duplicates an existing topic
  • Speedy deletion -> Foreign language articles that exist on another Wikimedia project
  • Proposed deletion
  • Articles for deletion

This should fix the prod bug too. T313108

2022-07-16_185208.png (749×1 px, 116 KB)

Change 814350 had a related patch set uploaded (by Novem Linguae; author: Novem Linguae):

[mediawiki/extensions/PageTriage@master] Fix AFD and PROD bugs

https://gerrit.wikimedia.org/r/814350

Change 814350 merged by jenkins-bot:

[mediawiki/extensions/PageTriage@master] Fix AFD and PROD bugs

https://gerrit.wikimedia.org/r/814350

@Novem_Linguae -- thank you for working on this! Please ping @KStoller-WMF if you need code review help on this or other PageTriage work.

Hey @MMiller_WMF. Thanks for the response. I already asked the Growth Team to review the above patch, and I received an ambiguous reply on their talk page that I interpreted to mean "we don't have time, see if you can find a volunteer". So I went ahead and found a volunteer. Would be nice to get a commitment from them to help in the future though.

@KStoller-WMF. It's very nice to meet you.

It’s very nice to meet you too, @Novem_Linguae .

Thank you so much for spending the time to troubleshoot this bug and work on a patch! And that’s great to see it’s already been reviewed and merged! (My understanding is that it should be in production by the end of next week for anyone following this task and interested in testing in production.)

The Growth engineering team is quite short-staffed currently, and we are trying to maintain focus on GrowthExperiments. This has been a challenge, as we are the stewards of many other components which all have competing priorities. So as of now, community review is likely the fastest way to move along PageTriage & NPP work. But if that’s not possible for some reason, please ping me in the task and I will see if there is anything I can do to assist. Thanks again for your work on this!

@ComplexRational reports this bug is still present even after my patch. Maybe it's a race condition. I'll release an additional patch shortly.

Change 818601 had a related patch set uploaded (by Novem Linguae; author: Novem Linguae):

[mediawiki/extensions/PageTriage@master] Prevent race condition when tagging for deletion

https://gerrit.wikimedia.org/r/818601

Test wiki created on Patch demo by TheresNoTime using patch(es) linked to this task:
https://patchdemo.wmflabs.org/wikis/f89d3f9550/w

Test wiki on Patch demo by TheresNoTime using patch(es) linked to this task was deleted:

https://patchdemo.wmflabs.org/wikis/f89d3f9550/w/

Change 818601 merged by jenkins-bot:

[mediawiki/extensions/PageTriage@master] Prevent race condition when tagging for deletion

https://gerrit.wikimedia.org/r/818601

I just encountered this bug for https://en.wikipedia.org/wiki/Mor_Luang (tool creates a tag on the page, notifies the creator, adds an entry to AfD log, but does not create the AfD page itself), so it’s not really fixed. I also remember having had the same problem a month or so ago (possibly before the patch mentioned above - sorry, I did not keep trace of what article it was). (System info : Windows 10 / Firefox 103.2.1 64b.)

Unless someone has a brilliant idea, I would suggest to remove the button from the page curation toolbar as a temporary "fix", because

  • A proper fix might take a long time. I understand this might be a complex issue to diagnose, especially since Novem_Linguae gave it a go and the patch was not sufficient. I also understand that WMF resources are limited.
  • However, having the option to create a broken-state deletion nomination is worse than not having that option at all. The best-case scenario in the present state is that the curation tool user finds out by themselves that it is broken, cleans up their broken nomination, then resolves to use another tool (such as Twinkle) to do the job. In the "no-button" scenario, the median or worst case is that the user goes to see the manual which says to use Twinkle, but the lack of a button to break things means there is no cleanup to do.
  • I assume every user is meeting the same problems, which make it a clear net negative to keep the button (see previous point). However, if the tool is broken for 5/20/50% of users, the other 95/80/50% still view it as a positive. If only some users have the problem, maybe it’s worth a quick poll on https://en.wikipedia.org/wiki/Wikipedia_talk:New_pages_patrol/Reviewers to try to get a number.
  • I assume removing / hiding the button itself is not a lot of coding work, probably a one-liner in the interface code. Changing the documentation on en-wp is probably going to be the hardest part of that change.

@Tigraan. Thanks for the ideas. The problem is intermittent, so AFD works sometimes. I have a second patch that will deploy in a couple hours that may fix this completely. For now I will be keeping a close eye on if the second patch works. If this bug happens again tomorrow or later, please do post again in this thread so that I know it's not fixed and can keep working on it. This bug is high priority for me.

AFD patch is now live if you want to do more testing.

@ComplexRational reports that this is still happening. Says it happened 2 out of 2 times they tried to AFD recently. Here's a diff.

Idea from the other bug report T239712: Does this only happen if the 1st nomination AFD page already exists? Or when there is an {{afd}} tag in the page history?

We really need to get a consistent steps to reproduce in order to solve this bug. As of now it is intermittent and not consistently reproducible.

I am not doing many NPP patrols but for me it occurs 100% of the time (steps to reproduce: see first post of T239712). Latest instance today (August 30th) at https://en.wikipedia.org/wiki/Baja_Rally, but that was a second nomination (which may shake things up).

I suppose it is environment-specific which makes it hard to specify the exact conditions to trigger. I am running Windows 10 / Firefox 103.2.1 64b (I do have a funny add-on, NoScript, but it’s set to allow anything from wikimedia domains, and it does not say it blocked anything). @Novem_Linguae , if you have any ideas of additional diagnostics that could help, I can run them before/after doing the tagging. I am not a javascript coder, but I am a coder, so I am not afraid to look at a stacktrace / inspect the browser console / install some debugging tool.

A slight "gotcha" — if you use Firefox and have "Enhanced Tracking Protection" turned on (it is on by default), any helpful diagnostic logs sent to intake-analytics.wikimedia.org will be blocked, which has heavily frustrated the efforts at T313303: [Request/Consult] PageTriage EventLogging 🙃

Could you (or anyone else attempting to replicate this) please ensure "Enhanced Tracking Protection" is turned off before retrying?

I have deactivated Firefox’s ETP on en.wikipedia.org and nominated https://en.wikipedia.org/wiki/A%27_Design_Award for deletion using the tool.

At first, it seemed to fail: there was a popup box with some message which I forgot to screenshot, I clicked OK, and the AfD box on the article had a red link to the nomination page. However, it actually worked: the nomination page was created, and a refresh of the page turned the link blue.

Now, I have not looked at the source code, nor the logs, and I am not good at parallel programming, but here’s some speculation anyway. Could it be that a failure in the logging event causes a crash of later non-logging actions? In pseudocode, if the logic is

tag_page_async()
log_page_tagging_async()
create_afd_page()
...

then if log_page_tagging_async() is executed and errors before create_afd_page() comes up, the latter is not executed.

The fix would be something like

try:
  tag_page_async()
  log_page_tagging_delayed()
  create_afd_page()
  ...
except:
  cleanup_on_errors()
finally:
  flush_log_entries()

(NB that’s some Python-style pseudocode, the finally clause is executed in all branches)

@Tigraan. Awesome info, thanks for sharing.

Next time, please try to screenshot the error message, or write it down and repost it here. That could be an essential clue.

Note to self: install Firefox and see if I can reproduce the error there, both with ETP off and ETP on. Maybe ETP is the problem.

There is no try/catch in modules/ext.pageTriage.views.toolbar. If there's an error (mostly detected using logic and .fail()), handleError() is called, which logs the error to our servers, then displays a JavaScript alert, then returns execution back to the caller. Sometimes the caller aborts everything, sometimes it continues, depending on the code that comes after it.

That section of the code is complicated because it is async and has two paths simultaneously. I recently rewrote it to wait for both paths to finish, then execute some final code. So the program flow is something like...

modules/ext.pageTriage.views.toolbar/delete.js -> submit()

image.png (922×849 px, 44 KB)

Just FYI I couldn't see any related error logs (searching the last 48 hours)

Edit: though I did see some failures relating to API logins for @Tigraan... save me looking, if a login failed for whatever reason how would PageTriage react?

Recap of ideas/things to test:

  • Does this only happen if the 1st nomination AFD page already exists?
  • Or when there is an {{afd}} tag in the page history?
  • install Firefox and see if I can reproduce the error there, both with ETP (enhanced tracking protection) off and ETP on. Maybe ETP is the problem.
    • Firefox for Linux
    • Firefox for Mac
    • Firefox for Windows
  • Do the above, but on testwiki instead of localhost, in case centralauth is involved.
    • Firefox for Linux
    • Firefox for Mac
    • Firefox for Windows
  • Bunch of login errors in the log. Try adding debugger; to the beginning of submit();, logging out in a different tab, then resuming. See what breaks.
  • Carefully duplicate the wikicode/state of the AFD daily log page to match the state of it during a recent example of the bug. See if the error "pagetriage-del-log-page-adding-error": "Failed to find target spot for the discussion", gets triggered, messing up the AFD process
  • Timezone bug of some kind, causing the wrong log page to attempt to be written to?
  • Bug involving a freshly created AFD log page with no entries yet?
  • Try going back and forth between maintenance tag tab and deletion tag tab a couple times. Click various things. Does the act of switching create a bug by leaving some data in a variable?

Well, here we go again, this time at https://en.wikipedia.org/wiki/Abdul_Jabbar_Al_Rifai. There was no previous AfD. The script failed to create the AfD page.

The failure mode was the same as for "Mor Luang" case of Aug 18 above, except... the error message did appear, and flashed for half a second, before vanishing without action on my part. I am pretty sure the last time it appeared, it stayed until I clicked on it. But then maybe it would have gone away by itself if I had waited for ten seconds - methinks it could be an asynchronous thing?

Next time, please try to screenshot the error message, or write it down and repost it here. That could be an essential clue.

Well it is a bit hard to get the exact message in those conditions, but it was something about not being able to create the target page, with an "ok" button in blue at the bottom right. (I know this sort of partial information is mighty annoying when one tries to debug... sorry)

@Tigraan Some information is better than none. Thank you and please feel free to keep reporting. When my day job calms down and I have time to sit down and take a look at this again, the info in this ticket will provide the clues I need to possibly crack the case :)

Well it is a bit hard to get the exact message in those conditions, but it was something about not being able to create the target page, with an "ok" button in blue at the bottom right. (I know this sort of partial information is mighty annoying when one tries to debug... sorry)

Was it this error message? This is the only error message with the word "target" in it:

"pagetriage-del-log-page-adding-error": "Failed to find target spot for the discussion",

Recap of ideas/things to test:

  • Or when there is an {{afd}} tag in the page history?
  • install Firefox and see if I can reproduce the error there, both with ETP (enhanced tracking protection) off and ETP on. Maybe ETP is the problem.
    • Do the above, but on testwiki instead of localhost, in case centralauth is involved.

On the test wiki, I tried the following:

  • ETF on and off on Firefox. No issues during either. The AfD page was created in both cases.
  • nominated a page for AfD. removed the afd tag, then nominated again. No issues again.

@Sideswipe9th had the bug today at https://en.wikipedia.org/w/index.php?title=New_Manifesto_of_Arts&diff=1113721053&oldid=1104041020 and https://en.wikipedia.org/w/index.php?title=Wikipedia:Articles_for_deletion/Log/2022_October_2&diff=prev&oldid=1113721058. Reports using Firefox + Enhanced Tracking Protection. That's a pretty unique fingerprint, interesting that it's cropping up again.

Sideswipe also reports that the page refreshed, indicating that PageTriage finished its sequence of events, even though the AFD page was not created.

Note to self: try changing the code to full stop on error. Might be easier to look at the console and see if there's a JavaScript error.

Note to self: rewrite the code to be fully synchronous. Still getting race condition vibes from these symptoms. Can't race condition if there's no async.

Has anybody had this bug happen on Chrome, or on Firefox without Enhanced Tracking Protection?

Idea: Do some kind of userspace error logging, since our other logging may not be working.

@TheresNoTime invited me on this bug from the Wikimedia Discord. From first glance based on the discussions there between Sammy, @Novem_Linguae and @Sideswipe9th, I got a whiff that this is a race condition. After looking at the code that Novem provided, my suspicions grew worse.

If you need the specifics of JavaScript promise handling, read on, otherwise you can skip to the next paragraph. The issue seems to reside in a misuse of .done and .fail on jQuery Deferreds. .done and .fail both return the original Deferred, but in cases like these where we have multiple promises that should run in sequence (in reality, these can run in parallel but that requires refactoring), this is not what we want. To properly execute all promises involved in a chain, .then should be used while passing both the "done" callback and the "fail" callback. The reason that .done and .fail shouldn't be used is that once .done creates (and even returns) the promise, everything is over and it won't wait for the returned promise (since, again, it waits on the Deferred to finish, not anything else that's attached to it with .done or .then). Using .then will cause it to return a new promise that will resolve when the promise returned inside the callback is also resolved. The callback for failures is also put in .then so that it'll stop immediately if an error happens on the Deferred. Placing the .fail only after .then is called will cause it to only catch errors from within the .then's "done" callback. Placing it before will cause the .then could work, but creates unnecessary lines of code that we don't need and is also against convention (linked ahead).

The solution, therefore, is moving everything in .done and .fail to .then. This really should have been done earlier — MediaWiki's JS code convention recommends .then over .done and .fail. The entirety of my debugging process was logged at P35364 for those interested. Will submit a patch fixing this specific error (but not any other issues relating to .done/.fail misuse, particularly another yet unproblematic one the .tagPage fork) in a bit; stay tuned.

Whoops, I forgot to mention the actual problem in my previous comment. The actual problem is that the page reloads faster than the AfD page can get created. Theoretically with a slow enough internet connection (or one request being extremely slow while others being really fast; unlikely due to HTTP Keep-Alive), even adding the page to the AfD log would also fail.

Change 838851 had a related patch set uploaded (by Chlod Alejandro; author: Chlod Alejandro):

[mediawiki/extensions/PageTriage@master] Replace promise handling when AfD'ing pages

https://gerrit.wikimedia.org/r/838851

Change 838851 merged by jenkins-bot:

[mediawiki/extensions/PageTriage@master] Replace promise handling when AfD'ing pages

https://gerrit.wikimedia.org/r/838851

Change 839575 had a related patch set uploaded (by Samtar; author: Chlod Alejandro):

[mediawiki/extensions/PageTriage@wmf/1.40.0-wmf.4] Replace promise handling when AfD'ing pages

https://gerrit.wikimedia.org/r/839575

Change 839576 had a related patch set uploaded (by Samtar; author: Chlod Alejandro):

[mediawiki/extensions/PageTriage@wmf/1.40.0-wmf.3] Replace promise handling when AfD'ing pages

https://gerrit.wikimedia.org/r/839576

Change 839575 merged by jenkins-bot:

[mediawiki/extensions/PageTriage@wmf/1.40.0-wmf.4] Replace promise handling when AfD'ing pages

https://gerrit.wikimedia.org/r/839575

Change 839576 merged by jenkins-bot:

[mediawiki/extensions/PageTriage@wmf/1.40.0-wmf.3] Replace promise handling when AfD'ing pages

https://gerrit.wikimedia.org/r/839576

Mentioned in SAL (#wikimedia-operations) [2022-10-06T20:37:14Z] <samtar@deploy1002> Started scap: Backport for [[gerrit:839575|Replace promise handling when AfD'ing pages (T238025)]], [[gerrit:839576|Replace promise handling when AfD'ing pages (T238025)]]

Mentioned in SAL (#wikimedia-operations) [2022-10-06T20:37:37Z] <samtar@deploy1002> samtar and samtar: Backport for [[gerrit:839575|Replace promise handling when AfD'ing pages (T238025)]], [[gerrit:839576|Replace promise handling when AfD'ing pages (T238025)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-10-06T20:45:11Z] <samtar@deploy1002> Finished scap: Backport for [[gerrit:839575|Replace promise handling when AfD'ing pages (T238025)]], [[gerrit:839576|Replace promise handling when AfD'ing pages (T238025)]] (duration: 07m 56s)

Anyone still getting this bug since our patch deployed on October 6? Last call, I'm closing the ticket soon :)

Novem_Linguae claimed this task.
Novem_Linguae moved this task from Waiting for enwiki deploy to Done on the PageTriage board.