
Bot1058 is unable to perform a link update using the API
Closed, Resolved, Public

Description

@Wbm1058's bot Bot1058 is unable to perform a link update using the API, and I can't figure out why. Neither the API responses nor the server logs indicate any failures. Yet the page_links_updated database field is not getting bumped, which should always happen at the end of a link update.

Discussion so far: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_209#Purge_API_seems_to_still_be_broken

We tried enabling verbose logs for a few attempts. It's clearly scheduling a DeferredUpdate to do it (note the "Recursive link purge enqueued for {title}" entries), but then that update never happens, and I don't see anything that would indicate why. https://logstash.wikimedia.org/goto/db796e584a83d2528d4569bb55bfdf8a

For comparison, here's a similar request performed from my account, which succeeds (note the DeferredUpdates::run: started AutoCommitUpdate_MediaWiki\Deferred\LinksUpdate\LinksUpdate::doUpdate entries, missing in the previous log). https://logstash.wikimedia.org/goto/f080a980922ea6c958620a6823236a46

(Note for Logstash viewing: entries within the same millisecond may be shown out of order; also, I hid a few logging channels so that the more interesting entries for the whole request can be shown without hitting the 500-result limit.)

Event Timeline

Umherirrender subscribed.

forcerecursivelinkupdate does not handle the page itself, just templates, but the help page indicates that it should work.

https://en.wikipedia.org/w/api.php?modules=purge

forcelinkupdate
Update the links tables and do other secondary data updates.
Type: boolean (details)

forcerecursivelinkupdate
Same as forcelinkupdate, and update the links tables for any page that uses this page as a template.
Type: boolean (details)

With the plain forcelinkupdate it works. Before:

MariaDB [enwiki_p]> select * from page where page_namespace = 0 and page_title = 'Soygu';
+----------+----------------+------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
| page_id  | page_namespace | page_title | page_is_redirect | page_is_new | page_random    | page_touched   | page_links_updated | page_latest | page_len | page_content_model | page_lang |
+----------+----------------+------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
| 25247276 |              0 | Soygu      |                1 |           0 | 0.573656683018 | 20231120220014 | 20230817082405     |  1075321009 |       27 | wikitext           | NULL      |
+----------+----------------+------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
1 row in set (0.002 sec)

Running https://en.wikipedia.org/wiki/Special:ApiSandbox#action=purge&format=json&forcelinkupdate=1&titles=Soygu&formatversion=2 updates the row:

MariaDB [enwiki_p]> select * from page where page_namespace = 0 and page_title = 'Soygu';
+----------+----------------+------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
| page_id  | page_namespace | page_title | page_is_redirect | page_is_new | page_random    | page_touched   | page_links_updated | page_latest | page_len | page_content_model | page_lang |
+----------+----------------+------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
| 25247276 |              0 | Soygu      |                1 |           0 | 0.573656683018 | 20231121183032 | 20231121183032     |  1075321009 |       27 | wikitext           | NULL      |
+----------+----------------+------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
1 row in set (0.002 sec)

Maybe some of the recursive handling is broken.
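
For reference, the two purge variants discussed above can be exercised from a client with something like the following (a hypothetical sketch, not the bot's actual script; login via bot password and error handling are omitted):

```
import requests

API = "https://en.wikipedia.org/w/api.php"
session = requests.Session()  # a real bot would log in (e.g. with a bot password) first

def purge(titles, recursive=False):
    params = {
        "action": "purge",
        "titles": "|".join(titles),
        "format": "json",
        "formatversion": "2",
    }
    # forcelinkupdate: refresh the links tables of the listed pages themselves.
    # forcerecursivelinkupdate: additionally enqueue updates for pages that
    # transclude them (this is where the job queue gets involved).
    params["forcerecursivelinkupdate" if recursive else "forcelinkupdate"] = "1"
    return session.post(API, data=params).json()

print(purge(["Soygu"]))                  # matches the ApiSandbox example above
print(purge(["Soygu"], recursive=True))  # the recursive variant
```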

Per https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&dtnewcommentssince=c-Wbm1058-20231121200100-Matma_Rex-20231121152800&dtinthread=1#Purge_API_seems_to_still_be_broken this doesn't seem specific to the account, but to the method (the same account can purge in ApiSandbox, and another account can't purge using the bot script). I'm confused why the script worked for me, though.

wbm1058 mentions using a bot password, maybe that's somehow related? But I think if the bot password wasn't allowed to perform this action, it would fail earlier…

I think it hits this condition and gets converted to a job queue entry (which doesn't happen during manual testing because you don't purge many articles at the same time). The job queue should probably write something to the log when that happens.

I suspect that changes related to Phabricator T265749 broke my bot's purging. Right about the time when this was marked as closed and resolved (~September 10-11) is when my bot stopped working.

What's the difference between "DeferredUpdates" and the "Job Queue"? Are they the same thing?

The odds of losing updates due to an error increases when executing long queues
and when large amounts of time pass while tasks are queued. Mitigate this by
trying to eagerly move updates to the JobQueue when possible.

Losing updates is not a big deal for my bot. If an update is "lost", my bot will just re-request the same update until it isn't lost.

What is an issue is when ALL of my bot's updates are ALWAYS lost! I have no idea how many of my bot's updates were lost before whatever system changes were implemented in late August or early September. I think it was some of them, some of the time, but overall I think the vast majority of my bot's updates were promptly handled and not deferred for any large amount of time or lost. If the system were kind enough to inform my bot via its API when its requests were being "job queued" rather than immediately executed, then I would be kind enough to the system to make my bot sleep while the job queue was being emptied.

What's the difference between "DeferredUpdates" and the "Job Queue"? Are they the same thing?

Deferreds execute in the same request, after the response has been sent. Jobs are serialized as a job specification and stored in a job queue, and popped and executed some time later by some dedicated job runner process. Among other things, this means that finding the relevant log entries is harder because they aren't associated with the original web request. Some deferreds can be turned into jobs and queued for later if MediaWiki decides it's not a good time to run a deferred update.
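
To illustrate the distinction, here is a conceptual sketch only (not MediaWiki's actual code); the 100-update threshold is taken from the later comments in this task:

```
import json
from collections import deque

DEFERRED_CONVERSION_LIMIT = 100   # rough threshold, per later comments in this task
deferred_updates = []             # callables, run in-process after the response
job_queue = deque()               # serialized job specifications

def add_deferred(update_fn, job_spec):
    if len(deferred_updates) >= DEFERRED_CONVERSION_LIMIT:
        # "Not a good time" to run it post-send: enqueue a job spec instead.
        job_queue.append(json.dumps(job_spec))
    else:
        deferred_updates.append(update_fn)

def after_response_sent():
    # Deferred updates: same request, same process, just after the output is flushed.
    while deferred_updates:
        deferred_updates.pop(0)()

def job_runner():
    # Jobs: popped some time later by a dedicated runner process, detached from
    # the original web request (which is why the logs are harder to correlate).
    while job_queue:
        spec = json.loads(job_queue.popleft())
        print("running job", spec)
```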

It seems like we started inserting refreshLinksPrioritized jobs on Sept. 14 (the group 2 train day for 1.41.0-wmf.26):

Screenshot Capture - 2023-11-27 - 12-54-59.png (112 KB)

(although there are later periods where they aren't inserted, maybe that's when the bot didn't run?)
So maybe the job just doesn't work for some reason; this wasn't an issue before because we didn't use the job queue (there are jobs before that date, but far fewer), and the issue doesn't occur in manual tests because the 100+ deferred limit for converting deferreds into jobs isn't hit? How many articles does the bot purge in a single request?

Seems like enqueuing deferreds as jobs was not done for web requests until rMWe56552557f0d: deferred: Decouple DeferredUpdates from MediaWikiServices (note the differences in when $wgCommandLineMode gets checked), which was indeed deployed with wmf.26. So that explains why matmarex is not getting any useful debug output. (It would be nice if such "unintended" jobs inherited the request ID.) Not sure why the job is failing without error (at least according to grafana it's not being retried), but RefreshSecondaryDataUpdate::getAsJobSpecification() seems pretty different from RefreshLinksJob parameter handling (useRecursiveLinksUpdate instead of recursive, no table parameter...).

Seems like enqueuing deferreds as jobs was not done for web requests until rMWe56552557f0d: deferred: Decouple DeferredUpdates from MediaWikiServices (note the differences in when $wgCommandLineMode gets checked), which was indeed deployed with wmf.26.

@Krinkle can you review this? It's not the root cause of the problem, but the commit summary suggests it might have been an unintentional change.

It seems like we started inserting refreshLinksPrioritized jobs on Sept. 14 (the group 2 train day for 1.41.0-wmf.26):

Screenshot Capture - 2023-11-27 - 12-54-59.png (112 KB)

(although there are later periods where they aren't inserted, maybe that's when the bot didn't run?)
So maybe the job just doesn't work for some reason; this wasn't an issue before because we didn't use the job queue (there are jobs before that date, but far fewer), and the issue doesn't occur in manual tests because the 100+ deferred limit for converting deferreds into jobs isn't hit? How many articles does the bot purge in a single request?

Some time after I became aware that my bot wasn't working, I shut it down. After seeing other Phabricator tasks fix issues that I thought were causing the problem (but weren't), I restarted the bot. Then, seeing that those fixes didn't address my bot's problem, I shut it down again. Recent, temporary bot runs have just been for testing theories. I had the bot configured to process 10,000 pages on a single run. The API documentation suggests bundling multiple pages into a single API call, so I bundled 20 pages into each call. After the system responds to indicate that it handled those 20 pages, my bot sleeps for one second and then sends another request to refresh 20 more pages. This algorithm seemed to work well, without putting any obvious stress on the servers that I could see, up until, I suppose, probably September 14. I didn't notice the problem right away, as just a few days after that I was taking time off to do a multi-day bike ride along the Erie Canal in New York State.
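
For context, the loop described above would look roughly like this (hypothetical code; the real bot script is not attached to this task, and the login step is omitted):

```
import time
import requests

API = "https://en.wikipedia.org/w/api.php"
session = requests.Session()  # the real bot logs in with a bot password first

def refresh_run(titles, batch_size=20, delay=1.0):
    """Purge `titles` in batches, sleeping between API calls as described above."""
    for i in range(0, len(titles), batch_size):
        batch = titles[i:i + batch_size]
        resp = session.post(API, data={
            "action": "purge",
            "titles": "|".join(batch),
            "forcelinkupdate": "1",      # or forcerecursivelinkupdate, as above
            "format": "json",
            "formatversion": "2",
        }).json()
        # The response reports per-title results, but (per this task) it does not
        # say whether the link update ran inline or was pushed to the job queue.
        if "warnings" in resp:
            print(resp["warnings"])
        time.sleep(delay)
```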

I can run a test that attempts to skate under the 100-page limit by making the bot process just 80 pages, rather than 10,000, and see whether that works.

I ran a quick bot test on Toolforge Kubernetes to purge 80 pages. It finished in 14 seconds. Alas, it too failed. My Quarry query of two pages that the bot attempted to purge shows their page_links_updated dates to still be August 20, 2023. That's 25 days before September 14. This bot was endeavoring to keep pages refreshed every 30 days or so, and would have been running really well if it had actually worked the backlog down to just 25. Those pages likely would have been purged around September 15 or 16 if the issue hadn't cropped up, give or take a day or few.

@Wbm1058 I think the limit @Tgr is talking about is per-request – try changing your API calls to pass just one page title per call, rather than 20, and my guess is that this will make it work. (This is just a workaround, it's definitely a bug that it doesn't work with 20 titles at once.)

Change 978719 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/core@master] DeferredUpates: Mention it in the logs when enqueuing deferreds

https://gerrit.wikimedia.org/r/978719

Change 978720 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/core@master] DeferredUpdates: Inherit request ID when enqueued

https://gerrit.wikimedia.org/r/978720

Change 978719 merged by jenkins-bot:

[mediawiki/core@master] DeferredUpates: Mention it in the logs when enqueuing deferreds

https://gerrit.wikimedia.org/r/978719

I ran a quick bot test on Toolforge Kubernetes to purge 80 pages. It finished in 14 seconds. Alas, it too failed. My Quarry query of two pages that the bot attempted to purge shows their page_links_updated dates to still be August 20, 2023. That's 25 days before September 14. This bot was endeavoring to keep pages refreshed every 30 days or so, and would have been running really well if it had actually worked the backlog down to just 25. Those pages likely would have been purged around September 15 or 16 if the issue hadn't cropped up, give or take a day or few.

There is no example URL of what the bot requests that could be used for testing; the only working example is the one in my earlier comment. It seems action=purge works in most cases, but not when run from this bot code, from that IP (Toolforge?), or by that specific bot account.

Using forcerecursivelinkupdate uses the job queue internally to process the pages that use the requested page as a template; I'm not sure whether the page itself is processed via the job queue or directly. It should only be used in special situations where it makes sense to also enqueue the pages using the template/module.
Using forcelinkupdate runs the necessary update directly, which can time out when many expensive-to-parse pages are requested.

The refreshLinksPrioritized job queue is mostly used to process cascade-protection updates as a high-priority job.
The call to tryOpportunisticExecute with the queue-or-run condition is only done on CLI runs.

Change 978720 merged by jenkins-bot:

[mediawiki/core@master] JobSpecification: Inherit request ID

https://gerrit.wikimedia.org/r/978720

Seems like enqueuing deferreds as jobs was not done for web requests until rMWe56552557f0d: deferred: Decouple DeferredUpdates from MediaWikiServices (note the differences in when $wgCommandLineMode gets checked), which was indeed deployed with wmf.26.

@Krinkle can you review this? It's not the root cause of the problem, but the commit summary suggests it might have been an unintentional change.

I'm not sure I see the difference, but I wrote it so I'm likely blind to it. The $wgCommandLineMode check that moved from the caller (addUpdate/tryOpportunisticExecute) to the scope integration (tryOpportunisticExecute/allowOpportunisticUpdates) is an early return and should thus behave identically.

I left a more detailed comment at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/978719/3/includes/deferred/DeferredUpdates.php.

My first two suspicions would be:

  • a regression happened elsewhere that caused more updates to fail (and thus trigger the catch-and-enqueue block),
  • or a regression happened that caused the maintenance script in question to fail to commit its database transaction regularly (thus preventing any updates from running and triggering the large-buffer-to-enqueue block).

Having said that, the refactor being in the same branch is definitely suspect. If you see where the logic is different or if the added log message reveals this being the case, let me know. Happy to patch or review as-needed.

I'm not sure I see the difference, but I wrote it so I'm likely blind to it. The $wgCommandLineMode check that moved from the caller (addUpdate/tryOpportunisticExecute) to the scope integration (tryOpportunisticExecute/allowOpportunisticUpdates) is an early return and should thus behave identically.

I left a comment on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/952271 which is hopefully more clear.

Change 980953 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/core@master] DeferredUpdates: Fix logging for opportunistic enqueue

https://gerrit.wikimedia.org/r/980953

Change 980953 merged by jenkins-bot:

[mediawiki/core@master] DeferredUpdates: Fix logging for opportunistic enqueue

https://gerrit.wikimedia.org/r/980953

The patches for better logging have reached production, so maybe we should try the experiment with X-Wikimedia-Debug once more?

OK, I turned X-Wikimedia-Debug back on, and ran a one-off test that attempted to refresh 10,000 links. Hopefully that's given you the data you need to track it down. Let me know if you want me to run more tests.

Thanks! Here are the logs from one of the requests: https://logstash.wikimedia.org/goto/8fc4991e8476abd2a21040db1515631c

The deferred updates are indeed being queued as jobs. The jobs seem to work - there are no errors, there are a bunch of Saved in parser cache log entries (20 per API request), and SQL commands that do look like parser cache insertions. I'm not sure how to tell whether the links tables were updated.
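
One way to tell is to check page_links_updated on the Wiki Replicas, as Umherirrender did earlier. A sketch assuming Toolforge replica access (the hostname and credential path follow the usual conventions; adjust as needed):

```
import pymysql

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",  # assumed replica host
    database="enwiki_p",
    read_default_file="~/replica.my.cnf",            # Toolforge tool credentials
)

def links_updated(title, namespace=0):
    """Return the page_links_updated timestamp for a page, or None if missing."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT page_links_updated FROM page "
            "WHERE page_namespace = %s AND page_title = %s",
            (namespace, title.replace(" ", "_")),
        )
        row = cur.fetchone()
        return row[0] if row else None

print(links_updated("Soygu"))  # should move forward after a successful purge
```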

RefreshSecondaryDataUpdate::getAsJobSpecification() seems pretty different from RefreshLinksJob parameter handling (useRecursiveLinksUpdate instead of recursive, no table parameter...).

After staring at the code for a very long time, I think the parameter handling here is actually correct. Which is to say, I have no idea what's going on with the bot.

I think the next thing to improve is that the X-Wikimedia-Debug header does not percolate through the job queue. JobExecutor has some decent logging which could be used to understand which jobs are and aren't scheduled (there are three levels of job recursion involved here: ApiPurge schedules a RefreshSecondaryDataUpdate, which gets turned into a RefreshLinksJob with useRecursiveLinksUpdate=true, which schedules a job with recursive=true, which schedules a number of non-recursive jobs), but in production we only log it at warning level.

The rootJobTimestamp of the refreshLinksPrioritized job is set to the timestamp of the current revision. RefreshLinksJob::getLagAwareRootTimestamp returns this timestamp, and in RefreshLinksJob::isAlreadyRefreshed the queued job is ignored because the last links update happened after the last edit. That is fine from the job's point of view when it was queued after an edit to the page (for example, when a template gets edited twice in a short timeframe, the job queued for the first edit, running after the second edit, does the update, and the second job is skipped to reduce duplicate workload). So RefreshLinksJob::isAlreadyRefreshed needs to return false, by getting a newer timestamp in rootJobTimestamp, when the job is queued by an API purge.

Another solution could be to avoid queuing the jobs at all; it is unclear whether this is wanted or a regression from e5655255. Without queuing, the links update runs as part of the bot's web request; there is already a comment about this at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/952271/16/includes/deferred/DeferredUpdates.php#349.
Or the bot could reduce the number of requested titles to keep the internal queue small, but there is no way to say what a good size would be.
Even with a fresher timestamp in rootJobTimestamp, some work could still be done by the job queue. Before the change this was synchronous; now it can be asynchronous, which could be unwanted for some use cases (a bot purging the main page to refresh cascade protection on a new day would expect the update to be done now, or the request to fail, not for it to be done seconds later by the job queue).
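
A simplified model of the skip condition being described (not MediaWiki's actual implementation; the lag allowance in getLagAwareRootTimestamp is omitted):

```
# Simplified model of the deduplication check described above: a queued
# refreshLinks job is skipped when the page's links were already updated
# after the job's root timestamp (normally the timestamp of the last edit).
def is_already_refreshed(page_links_updated: str, root_job_timestamp: str) -> bool:
    # Timestamps in MediaWiki's TS_MW format (YYYYMMDDHHMMSS) compare lexically.
    return page_links_updated >= root_job_timestamp

# A purge-triggered job that inherits the revision timestamp as its root
# timestamp gets skipped, because the last links update is newer than the edit:
print(is_already_refreshed("20230817082405", "20220301120000"))  # True -> skipped
# With a fresh root timestamp (e.g. the time of the purge request), the job runs:
print(is_already_refreshed("20230817082405", "20231121183032"))  # False -> runs
```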

Change 985341 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/core@master] deferred: Add a freshness option to RefreshSecondaryDataUpdate

https://gerrit.wikimedia.org/r/985341

I think the next thing to improve is that the X-Wikimedia-Debug header does not percolate through the job queue.

Filed as T354042: Forward X-Wikimedia-Debug header to MediaWiki jobs.

Change 985341 merged by jenkins-bot:

[mediawiki/core@master] deferred: Add a freshness option to RefreshSecondaryDataUpdate

https://gerrit.wikimedia.org/r/985341

Please retest; an improvement has been deployed with this week's train. Hopefully this helps with the requests made by the bot.

I've finished three 10,000-page runs. They still have the X-Wikimedia-Debug header set. Yes, the new deployment helped. At least some of the pages are now purged.

If everything looks good on the back-end then I'll remove the debug parameter and restart normal processing of one of my two processes and monitor it to see if it begins to catch up towards my goal of refreshing mainspace links every 30 days, give or take.

Logs from a sample request: https://logstash.wikimedia.org/goto/8ad5692096c39f0511410bc5bff1847b
I think it looks normal (although until T354042 gets fixed, we don't have a lot of visibility into what's happening in jobs). If the bot seems to be working, I'd call the task done.

I've finished three 10,000-page runs. They still have the X-Wikimedia-Debug header set. Yes, the new deployment helped. At least some of the pages are now purged.

Thanks for the feedback! It should work on all pages in the main namespace; it can take a few seconds when processed by the job queue rather than directly in the request.

Sorry, I have found one more scenario in which Bot1058 is unable to perform a link update using the API, although it was able to before August/September 2023. I'm still not totally up to speed on Phabricator etiquette, so I'll let y'all who are still subscribed to this decide whether it is appropriate to reopen this Phab, open a new one, or just comment here on this closed Phab without reopening it.

Page links last updated before 20230705 (and not NULL) finds nothing. This is because up to this date, my bot was able to successfully purge all pages.
Page links last updated before 20230706 (and not NULL) finds 1335 rows; all .js and .css pages in user space.
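
For reference, a sketch of the kind of Quarry/replica query being described (assuming the standard enwiki_p replica schema and Toolforge access, as in the earlier example):

```
import pymysql

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",  # assumed replica host
    database="enwiki_p",
    read_default_file="~/replica.my.cnf",
)

# Find pages whose links were last updated before a given date (and not NULL).
with conn.cursor() as cur:
    cur.execute(
        "SELECT page_namespace, page_title, page_links_updated "
        "FROM page "
        "WHERE page_links_updated < %s AND page_links_updated IS NOT NULL "
        "ORDER BY page_links_updated LIMIT 50",
        ("20230706000000",),
    )
    for ns, title, updated in cur.fetchall():
        print(ns, title.decode(), updated.decode())
```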

An API sandbox test shows the problem. Under "warnings": the message "You do not have permission to edit this JavaScript page because it contains another user's personal settings." is repeated three times.

  • Why is the message repeated 3 times?
  • My bot is not trying to edit these user javascripts, merely to purge them. Before last August, the bot was allowed to do that. Now, it's not. Was this change intentional?
  • Does purging .js and .css pages actually ever do anything, other than update the page_links_updated database field?
  • Why is the message repeated 3 times?

The warning gets added once for each affected page in the batch.

  • My bot is not trying to edit these user javascripts, merely to purge them. Before last August, the bot was allowed to do that. Now, it's not. Was this change intentional?

That is tracked under T354460: purge actions are affected by permission check via $wgNamespaceProtection or config pages (.js/.css)

  • Does purging .js and .css pages actually ever do anything, other than update the page_links_updated database field?

JS and CSS pages are parsed like wikitext ($wgTextModelsToParse), so the link tables are affected (pagelinks, templatelinks, categorylinks, etc.).