
Spike of DBTransactionSizeError exceptions from /w/api.php from Special:Watchlist
Closed, Resolved · Public · PRODUCTION ERROR

Description

There has been an increase in transaction size errors, as seen in the trend at:

https://logstash.wikimedia.org/goto/8f62fe5730a9a938ba5a0f9985abe3bd

They appear to be mostly commonswiki API calls (with a referer of Commons categories, so maybe JS-related). It could just be traffic-pattern related, but it may be worth investigating whether there is a root cause.

[12:38] <icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen

Event Timeline

jcrespo updated the task description. (Show Details)
jcrespo added a subscriber: Krinkle.
Krinkle renamed this task from Increasing exceptions DBTransactionSizeError specially on commonswiki to Spike of DBTransactionSizeError exceptions from /w/api.php from Special:Watchlist. Nov 7 2018, 8:25 PM

Most of them seem to be from /w/api.php on either de.wikipedia.org or commons.wikimedia.org with a referer of Special:Watchlist (or its localised name).

I have not done any research that suggests they are related, but we have had Watchlist regressions in the past on certain wikis when recentchanges grew a lot on long watchlists, at T171027#3667090. I am not assuming it is Wikidata at all; it could be any other process making recentchanges grow (something that is easy to check or rule out).

However, "API with a referrer from the watchlist" most likely points to a JS gadget doing too many slow requests in an almost abusive manner (beyond the recommended API limits). This needs more research.

We'll need better logging to figure out what's causing this. The stack trace comes from MediaWiki::preOutputCommit so we have no idea which API module this came from or what kind of query it tried to issue.
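A minimal sketch of what such extra log context could look like, assuming a plain Monolog logger; the field names, variables, and setup are illustrative, not the actual change proposed below:

<?php
// Illustrative only: attach the acting user id and the POST body to the log
// context so a DBTransactionSizeError can be traced back to the API module
// and parameters that triggered it.
require 'vendor/autoload.php';

use Monolog\Logger;
use Monolog\Handler\StreamHandler;

$logger = new Logger('exception');
$logger->pushHandler(new StreamHandler('php://stderr', Logger::WARNING));

$logger->warning('DBTransactionSizeError', [
    'user_id' => $currentUserId ?? 0, // hypothetical variable for the acting user
    'post'    => $_POST,              // e.g. action=setnotificationtimestamp&entirewatchlist=1
]);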

Change 474933 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@master] Add user id and POST values to Monolog context

https://gerrit.wikimedia.org/r/474933

Change 474933 abandoned by Gilles:
Add user id and POST values to Monolog context

https://gerrit.wikimedia.org/r/474933

Some recent instances in api.log:

mw1282 dewiki 1.33.0-wmf.4 api INFO: API POST Bertramz x.x.x.x T=3209ms action=setnotificationtimestamp format=json entirewatchlist=true token=[redacted] formatversion=2  
mw1339 dewiki 1.33.0-wmf.4 api INFO: API POST Bertramz x.x.x.x T=3787ms action=setnotificationtimestamp format=json entirewatchlist=true token=[redacted] formatversion=2
mw1290 commonswiki 1.33.0-wmf.4 api INFO: API POST Pyb x.x.x.x T=3514ms action=setnotificationtimestamp format=json entirewatchlist=true token=[redacted] formatversion=2

It seems obvious that it's particular users being affected.

If I'm reading the DB records correctly, Bertramz has 20k+ watched pages on dewiki and Pyb has 187k+ on Commons.

The API parameter "entirewatchlist=true" does indeed make it sound like some code blows up when trying to process entire watchlists that are too large.
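For reference, a request like the ones in api.log above could be reproduced along these lines (a sketch only; the CSRF token and session cookies are omitted, so it will not succeed as written):

<?php
// Reproduces the shape of the logged request: a POST to the API with
// action=setnotificationtimestamp and entirewatchlist=true.
$ch = curl_init('https://commons.wikimedia.org/w/api.php');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POSTFIELDS     => http_build_query([
        'action'          => 'setnotificationtimestamp',
        'format'          => 'json',
        'entirewatchlist' => true,
        'formatversion'   => 2,
        'token'           => 'REDACTED', // a valid CSRF token is required
    ]),
]);
$response = curl_exec($ch);
curl_close($ch);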

@Anomie you were the last to touch this API (ApiSetNotificationTimestamp) for MCR. Does it look to you like something that could be caused by recent MCR changes?

If I'm following the code correctly based on the API parameters, it ends up calling setNotificationTimestampsForUser in WatchedItemStore, which seems to update all rows at once for the user calling this API in the watchlist table, setting the wl_notificationtimestamp column to null. It's doing the whole thing inside the POST, as the return status of the API call seems to depend on the success or failure of that update call.
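A rough sketch of what that unbatched path amounts to, using an assumed schema and a plain PDO connection rather than the actual WatchedItemStore code: one UPDATE covers every watchlist row for the user inside the web request, which is what exceeds the transaction write-time limit for very large watchlists.

<?php
// Assumed connection details and schema; illustrative only.
$dbw = new PDO('mysql:host=localhost;dbname=wiki', 'wikiuser', 'secret');

function setNotificationTimestampsForUser(PDO $dbw, int $userId): void {
    // A single statement over all (possibly 100k+) rows for this user.
    $stmt = $dbw->prepare(
        'UPDATE watchlist SET wl_notificationtimestamp = NULL WHERE wl_user = ?'
    );
    $stmt->execute([$userId]);
}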

Interestingly, just below in WatchedItemStore we find a "resetAllNotificationTimestampsForUser" that seems to do the same thing async, post-send, in chunks.

I think the obvious fix would be to make the setnotificationtimestamp API call with those parameters use resetAllNotificationTimestampsForUser, but I'm not familiar with any of that code or those features and I don't know what the implications would be for making the API call in those conditions "fire and forget" instead of returning a real status that corresponds to the status of the DB request.

The less risky alternative would be to share the chunking logic currently found in the job with the setNotificationTimestampsForUser function.
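A sketch of what that chunking could look like, again with an assumed schema, batch size, and PDO connection rather than the real code: the same rows are updated in fixed-size batches so that no single transaction spends long in writes.

<?php
// Assumed connection details, schema, and batch size; illustrative only.
$dbw = new PDO('mysql:host=localhost;dbname=wiki', 'wikiuser', 'secret');

function resetNotificationTimestampsInBatches(PDO $dbw, int $userId, int $batchSize = 1000): void {
    do {
        // Only rows that still have a notification timestamp are touched,
        // so each pass picks up where the previous one left off.
        $stmt = $dbw->prepare(
            'UPDATE watchlist
                SET wl_notificationtimestamp = NULL
              WHERE wl_user = ? AND wl_notificationtimestamp IS NOT NULL
              LIMIT ' . $batchSize
        );
        $stmt->execute([$userId]);
    } while ($stmt->rowCount() === $batchSize);
}

Each statement here runs in its own auto-committed transaction, which is the point: no single write burst can exceed the limit.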

Looking at potential callers, it seems to be coming from the "Mark all changes as seen" button on Special:Watchlist, which, when clicked, fires the API call we're seeing and then triggers an update of the UI by fetching the changes list from the server. I'm guessing that this is probably why the API call tries to do all this work synchronously. I think this calls for the API call doing chunking while remaining synchronous, if we want to avoid drastic UI+API changes.

Who's responsible for maintaining Watchlist/API code? At this point, now that the issue has been tracked down, this shouldn't require the Performance team's attention, other than providing further advice if needed.

Jdforrester-WMF subscribed.

> Who's responsible for maintaining Watchlist/API code? At this point, now that the issue has been tracked down, this shouldn't require the Performance team's attention, other than providing further advice if needed.

That's the Growth-Team. Pinging @Catrope.

> @Anomie you were the last to touch this API (ApiSetNotificationTimestamp) for MCR. Does it look to you like something that could be caused by recent MCR changes?

It's certainly not rMW07842be379ca: API: Update query modules for MCR. All that did was replace calls to stub methods on Revision with calls to RevisionStore directly. It seems pretty unlikely to be anything to do with MCR. With this sort of thing, it could well be that people with huge watchlists just happened to start using the button.

> If I'm following the code correctly based on the API parameters, it ends up calling setNotificationTimestampsForUser in WatchedItemStore, which seems to update all rows at once for the user calling this API in the watchlist table, setting the wl_notificationtimestamp column to null. It's doing the whole thing inside the POST, as the return status of the API call seems to depend on the success or failure of that update call.

The API module in the "entirewatchlist" code path doesn't check whether the call to WatchedItemStore succeeds or read the database to check for the effect of the update, so the only way the return status would be affected is if the WatchedItemStore call throws an exception.

> Interestingly, just below in WatchedItemStore we find a "resetAllNotificationTimestampsForUser" that seems to do the same thing async, post-send, in chunks.

> I think the obvious fix would be to make the setnotificationtimestamp API call with those parameters use resetAllNotificationTimestampsForUser, but I'm not familiar with any of that code or those features and I don't know what the implications would be for making the API call in those conditions "fire and forget" instead of returning a real status that corresponds to the status of the DB request.

From an API perspective I don't think there'd be much of a problem with having the API call be asynchronous when "entirewatchlist" is used. When I wrote the module (long before it was used for the watchlist rewrite) it was only made synchronous because that was what SpecialWatchlist's reset button was doing at the time. OTOH, I have no idea whether that would be a problem for the new watchlist UI that's now using the API module.

T188801: Migrate wl_notificationtimestamp updates to the job queue may also be relevant.
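A plain-PHP analogy of that "fire and forget" idea (this is not the MediaWiki DeferredUpdates/JobQueue machinery, just a sketch of doing the heavy work after the response is sent):

<?php
// Send the API response first, then do the expensive watchlist reset after the
// client has already received it; a failure at that point can no longer change
// the API result.
echo json_encode(['setnotificationtimestamp' => ['notificationtimestamp' => null]]);

if (function_exists('fastcgi_finish_request')) {
    fastcgi_finish_request(); // flush the response to the client (PHP-FPM)
}

$dbw = new PDO('mysql:host=localhost;dbname=wiki', 'wikiuser', 'secret');
resetNotificationTimestampsInBatches($dbw, 12345); // batched helper from the earlier sketch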

Change 481978 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/core@master] ApiSetNotificationTimestamp: Make entirewatchlist more efficient

https://gerrit.wikimedia.org/r/481978

Change 484778 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/core@master] WatchedItemStore: Use batching in setNotificationTimestampsForUser

https://gerrit.wikimedia.org/r/484778

Change 481978 merged by jenkins-bot:
[mediawiki/core@master] ApiSetNotificationTimestamp: Make entirewatchlist more efficient

https://gerrit.wikimedia.org/r/481978

Note for testing: needs to be checked in wmf.14. Presently (wmf.13), there are numerous errors on https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor?_g=h@8525ade&_a=h@fea007a with referrer https://de.wikipedia.org/wiki/Spezial:Beobachtungsliste

[XEdohwpAAEcAAImkB-MAAACB] /w/api.php   Wikimedia\Rdbms\DBTransactionSizeError from line 1359 of /srv/mediawiki/php-1.33.0-wmf.13/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Transaction spent 9.9563910961151 second(s) in writes, exceeding the limit of 3.

Checked in wmf.14: 'Mark all changes as seen' on the Watchlist with 8,635 pages does not trigger any exceptions, and no Watchlist-related errors were present (except the ones mentioned above with the Spezial:Beobachtungsliste referrer).

kostajh subscribed.

https://gerrit.wikimedia.org/r/484778 has some feedback. Also, T130067: Add wl_id to watchlist tables on production dbs is done, so that may impact the approach in the patch.
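With wl_id available, the batching could walk the primary key instead of relying on UPDATE ... LIMIT; a sketch under the same assumptions as above (schema, batch size, and connection are illustrative, not the final patch):

<?php
$dbw = new PDO('mysql:host=localhost;dbname=wiki', 'wikiuser', 'secret');

function resetByWlIdBatches(PDO $dbw, int $userId, int $batchSize = 1000): void {
    $lastId = 0;
    while (true) {
        // Fetch the next slice of primary keys for this user, in wl_id order.
        $ids = $dbw->prepare(
            'SELECT wl_id FROM watchlist
              WHERE wl_user = ? AND wl_id > ?
              ORDER BY wl_id LIMIT ' . $batchSize
        );
        $ids->execute([$userId, $lastId]);
        $batch = $ids->fetchAll(PDO::FETCH_COLUMN);
        if (!$batch) {
            break;
        }
        // Update exactly that slice; each transaction touches a bounded range.
        $in = implode(',', array_fill(0, count($batch), '?'));
        $update = $dbw->prepare(
            "UPDATE watchlist SET wl_notificationtimestamp = NULL WHERE wl_id IN ($in)"
        );
        $update->execute($batch);
        $lastId = (int) end($batch);
    }
}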

Change 484778 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/core@master] WatchedItemStore: Use batching in setNotificationTimestampsForUser

https://gerrit.wikimedia.org/r/484778

Change 484778 merged by jenkins-bot:
[mediawiki/core@master] WatchedItemStore: Use batching in setNotificationTimestampsForUser

https://gerrit.wikimedia.org/r/484778

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:08 PM