
Spike of DBTransactionSizeError exceptions from /w/api.php from Special:Watchlist
Closed, Resolved · Public · PRODUCTION ERROR

Description

There has been an increase in transaction size errors, as seen in the trend at:

https://logstash.wikimedia.org/goto/8f62fe5730a9a938ba5a0f9985abe3bd

They appear to be mostly commonswiki API calls (with a referer of Commons categories, so maybe JS-related). It could just be traffic-pattern related, but it may be worth investigating whether there is a root cause.

[12:38] <icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen

Event Timeline

jcrespo updated the task description. (Show Details)
jcrespo added a subscriber: Krinkle.
Krinkle renamed this task from Increasing exceptions DBTransactionSizeError specially on commonswiki to Spike of DBTransactionSizeError exceptions from /w/api.php from Special:Watchlist. Nov 7 2018, 8:25 PM

Most of them seem to be from /w/api.php on either de.wikipedia.org or commons.wikimedia.org with a referer of Special:Watchlist (or its localised name).

I have not done any research that suggests they are related, but we have had Watchlist regressions in the past on certain wikis when recentchanges grew a lot on long watchlists, at T171027#3667090. I am not assuming it is Wikidata at all; it could be any other process making recentchanges grow (something that is easy to check or rule out).

However, "API with a referrer from the watchlist" most likely points to a JS gadget doing too many slow requests in an almost abusive manner (beyond the recommended API limits). This needs more research.

We'll need better logging to figure out what's causing this. The stack trace comes from MediaWiki::preOutputCommit so we have no idea which API module this came from or what kind of query it tried to issue.
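A minimal sketch of what such extra log context could look like, assuming a plain Monolog logger; the field names, variables, and setup are illustrative, not the actual change proposed below:

<?php
// Illustrative only: attach the acting user id and the POST body to the log
// context so a DBTransactionSizeError can be traced back to the API module
// and parameters that triggered it.
require 'vendor/autoload.php';

use Monolog\Logger;
use Monolog\Handler\StreamHandler;

$logger = new Logger('exception');
$logger->pushHandler(new StreamHandler('php://stderr', Logger::WARNING));

$logger->warning('DBTransactionSizeError', [
    'user_id' => $currentUserId ?? 0, // hypothetical variable for the acting user
    'post'    => $_POST,              // e.g. action=setnotificationtimestamp&entirewatchlist=1
]);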

Change 474933 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@master] Add user id and POST values to Monolog context

https://gerrit.wikimedia.org/r/474933

Change 474933 abandoned by Gilles:
Add user id and POST values to Monolog context

https://gerrit.wikimedia.org/r/474933

Some recent instances in api.log:

mw1282 dewiki 1.33.0-wmf.4 api INFO: API POST Bertramz x.x.x.x T=3209ms action=setnotificationtimestamp format=json entirewatchlist=true token=[redacted] formatversion=2  
mw1339 dewiki 1.33.0-wmf.4 api INFO: API POST Bertramz x.x.x.x T=3787ms action=setnotificationtimestamp format=json entirewatchlist=true token=[redacted] formatversion=2
mw1290 commonswiki 1.33.0-wmf.4 api INFO: API POST Pyb x.x.x.x T=3514ms action=setnotificationtimestamp format=json entirewatchlist=true token=[redacted] formatversion=2

It seems obvious that it's particular users being affected.

If I'm reading the DB records correctly, Bertramz has 20k+ watched pages on dewiki and Pyb has 187k+ on Commons.

The API parameter "entirewatchlist=true" does indeed make it sound like some code blows up when trying to process entire watchlists that are too large.
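For reference, a request like the ones in api.log above could be reproduced along these lines (a sketch only; the CSRF token and session cookies are omitted, so it will not succeed as written):

<?php
// Reproduces the shape of the logged request: a POST to the API with
// action=setnotificationtimestamp and entirewatchlist=true.
$ch = curl_init('https://commons.wikimedia.org/w/api.php');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POSTFIELDS     => http_build_query([
        'action'          => 'setnotificationtimestamp',
        'format'          => 'json',
        'entirewatchlist' => true,
        'formatversion'   => 2,
        'token'           => 'REDACTED', // a valid CSRF token is required
    ]),
]);
$response = curl_exec($ch);
curl_close($ch);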

@Anomie you were the last to touch this API (ApiSetNotificationTimestamp) for MCR. Does it look to you like something that could be caused by recent MCR changes?

If I'm following the code correctly based on the API parameters, it ends up calling setNotificationTimestampsForUser in WatchedItemStore, which seems to update all rows at once for the user calling this API in the watchlist table, setting the wl_notificationtimestamp column to null. It's doing the whole thing inside the POST, as the return status of the API call seems to depend on the success or failure of that update call.
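A rough sketch of what that unbatched path amounts to, using an assumed schema and a plain PDO connection rather than the actual WatchedItemStore code: one UPDATE covers every watchlist row for the user inside the web request, which is what exceeds the transaction write-time limit for very large watchlists.

<?php
// Assumed connection details and schema; illustrative only.
$dbw = new PDO('mysql:host=localhost;dbname=wiki', 'wikiuser', 'secret');

function setNotificationTimestampsForUser(PDO $dbw, int $userId): void {
    // A single statement over all (possibly 100k+) rows for this user.
    $stmt = $dbw->prepare(
        'UPDATE watchlist SET wl_notificationtimestamp = NULL WHERE wl_user = ?'
    );
    $stmt->execute([$userId]);
}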

Interestingly, just below in WatchedItemStore we find a "resetAllNotificationTimestampsForUser" that seems to do the same thing async, post-send, in chunks.

I think the obvious fix would be to make the setnotificationtimestamp API call with those parameters use resetAllNotificationTimestampsForUser, but I'm not familiar with any of that code or those features and I don't know what the implications would be for making the API call in those conditions "fire and forget" instead of returning a real status that corresponds to the status of the DB request.

The less risky alternative would be to share the chunking logic currently found in the job with the setNotificationTimestampsForUser function.
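A sketch of what that chunking could look like, again with an assumed schema, batch size, and PDO connection rather than the real code: the same rows are updated in fixed-size batches so that no single transaction spends long in writes.

<?php
// Assumed connection details, schema, and batch size; illustrative only.
$dbw = new PDO('mysql:host=localhost;dbname=wiki', 'wikiuser', 'secret');

function resetNotificationTimestampsInBatches(PDO $dbw, int $userId, int $batchSize = 1000): void {
    do {
        // Only rows that still have a notification timestamp are touched,
        // so each pass picks up where the previous one left off.
        $stmt = $dbw->prepare(
            'UPDATE watchlist
                SET wl_notificationtimestamp = NULL
              WHERE wl_user = ? AND wl_notificationtimestamp IS NOT NULL
              LIMIT ' . $batchSize
        );
        $stmt->execute([$userId]);
    } while ($stmt->rowCount() === $batchSize);
}

Each statement here runs in its own auto-committed transaction, which is the point: no single write burst can exceed the limit.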

Looking at potential callers, it seems to be coming from the "Mark all changes as seen" button on Special:Watchlist, which, when clicked, fires the API call we're seeing and then triggers an update of the UI by fetching the changes list from the server. I'm guessing that this is probably why the API call tries to do all this work synchronously. I think this calls for the API call doing chunking while remaining synchronous, if we want to avoid drastic UI+API changes.

Who's responsible for maintaining Watchlist/API code? At this point, now that the issue has been tracked down, this shouldn't require the Performance team's attention, other than providing further advice if needed.

Jdforrester-WMF subscribed.

> Who's responsible for maintaining Watchlist/API code? At this point, now that the issue has been tracked down, this shouldn't require the Performance team's attention, other than providing further advice if needed.

That's the Growth-Team. Pinging @Catrope.

> @Anomie you were the last to touch this API (ApiSetNotificationTimestamp) for MCR. Does it look to you like something that could be caused by recent MCR changes?

It's certainly not rMW07842be379ca: API: Update query modules for MCR. All that did was replace calls to stub methods on Revision with calls to RevisionStore directly. It seems pretty unlikely to be anything to do with MCR. With this sort of thing, it could well be that people with huge watchlists just happened to start using the button.

> If I'm following the code correctly based on the API parameters, it ends up calling setNotificationTimestampsForUser in WatchedItemStore, which seems to update all rows at once for the user calling this API in the watchlist table, setting the wl_notificationtimestamp column to null. It's doing the whole thing inside the POST, as the return status of the API call seems to depend on the success or failure of that update call.

The API module in the "entirewatchlist" code path doesn't check whether the call to WatchedItemStore succeeds or read the database to check for the effect of the update, so the only way the return status would be affected is if the WatchedItemStore call throws an exception.

> Interestingly, just below in WatchedItemStore we find a "resetAllNotificationTimestampsForUser" that seems to do the same thing async, post-send, in chunks.

> I think the obvious fix would be to make the setnotificationtimestamp API call with those parameters use resetAllNotificationTimestampsForUser, but I'm not familiar with any of that code or those features and I don't know what the implications would be for making the API call in those conditions "fire and forget" instead of returning a real status that corresponds to the status of the DB request.

From an API perspective I don't think there'd be much of a problem with having the API call be asynchronous when "entirewatchlist" is used. When I wrote the module (long before it was used for the watchlist rewrite) it was only made synchronous because that was what SpecialWatchlist's reset button was doing at the time. OTOH, I have no idea whether that would be a problem for the new watchlist UI that's now using the API module.

T188801: Migrate wl_notificationtimestamp updates to the job queue may also be relevant.
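A plain-PHP analogy of that "fire and forget" idea (this is not the MediaWiki DeferredUpdates/JobQueue machinery, just a sketch of doing the heavy work after the response is sent):

<?php
// Send the API response first, then do the expensive watchlist reset after the
// client has already received it; a failure at that point can no longer change
// the API result.
echo json_encode(['setnotificationtimestamp' => ['notificationtimestamp' => null]]);

if (function_exists('fastcgi_finish_request')) {
    fastcgi_finish_request(); // flush the response to the client (PHP-FPM)
}

$dbw = new PDO('mysql:host=localhost;dbname=wiki', 'wikiuser', 'secret');
resetNotificationTimestampsInBatches($dbw, 12345); // batched helper from the earlier sketch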

Change 481978 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/core@master] ApiSetNotificationTimestamp: Make entirewatchlist more efficient

https://gerrit.wikimedia.org/r/481978

Change 484778 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/core@master] WatchedItemStore: Use batching in setNotificationTimestampsForUser

https://gerrit.wikimedia.org/r/484778

Change 481978 merged by jenkins-bot:
[mediawiki/core@master] ApiSetNotificationTimestamp: Make entirewatchlist more efficient

https://gerrit.wikimedia.org/r/481978

Note for testing: needs to be checked in wmf.14. Presently (wmf.13), there are numerous errors on https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor?_g=h@8525ade&_a=h@fea007a with referrer https://de.wikipedia.org/wiki/Spezial:Beobachtungsliste

[XEdohwpAAEcAAImkB-MAAACB] /w/api.php   Wikimedia\Rdbms\DBTransactionSizeError from line 1359 of /srv/mediawiki/php-1.33.0-wmf.13/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Transaction spent 9.9563910961151 second(s) in writes, exceeding the limit of 3.

Checked in wmf.14: 'Mark all changes as seen' on the Watchlist with 8,635 pages does not trigger any exceptions, and no Watchlist-related errors were present (except the ones mentioned above with the Spezial:Beobachtungsliste referrer).

kostajh subscribed.

https://gerrit.wikimedia.org/r/484778 has some feedback. Also, T130067: Add wl_id to watchlist tables on production dbs is done, so that may impact the approach in the patch.
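With wl_id available, the batching could walk the primary key instead of relying on UPDATE ... LIMIT; a sketch under the same assumptions as above (schema, batch size, and connection are illustrative, not the final patch):

<?php
$dbw = new PDO('mysql:host=localhost;dbname=wiki', 'wikiuser', 'secret');

function resetByWlIdBatches(PDO $dbw, int $userId, int $batchSize = 1000): void {
    $lastId = 0;
    while (true) {
        // Fetch the next slice of primary keys for this user, in wl_id order.
        $ids = $dbw->prepare(
            'SELECT wl_id FROM watchlist
              WHERE wl_user = ? AND wl_id > ?
              ORDER BY wl_id LIMIT ' . $batchSize
        );
        $ids->execute([$userId, $lastId]);
        $batch = $ids->fetchAll(PDO::FETCH_COLUMN);
        if (!$batch) {
            break;
        }
        // Update exactly that slice; each transaction touches a bounded range.
        $in = implode(',', array_fill(0, count($batch), '?'));
        $update = $dbw->prepare(
            "UPDATE watchlist SET wl_notificationtimestamp = NULL WHERE wl_id IN ($in)"
        );
        $update->execute($batch);
        $lastId = (int) end($batch);
    }
}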

Change 484778 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/core@master] WatchedItemStore: Use batching in setNotificationTimestampsForUser

https://gerrit.wikimedia.org/r/484778

Change 484778 merged by jenkins-bot:
[mediawiki/core@master] WatchedItemStore: Use batching in setNotificationTimestampsForUser

https://gerrit.wikimedia.org/r/484778

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:08 PM