
CentralNotice: DB timeouts when enabling more than one campaign at once from Special:CentralNotice
Closed, Resolved · Public · 4 Estimated Story Points

Description

This happened several times, from around 16:10 UTC 2016-03-04, over a period of about 20-30 min. Apparently it's happened before, too.

However, it seems that disabling a lot of campaigns at once hasn't caused problems.

Event Timeline

The error message was something like this:

To avoid excessive replication lag, your transaction has been cancelled after exceeding the 6 second timeout (6.05 s). If you are changing many items, try multiple smaller transactions.

fluorine:/a/mw-log$ grep "Special:CentralNotice" exception.log 
2016-03-04 16:15:28 mw1220 metawiki 1.27.0-wmf.15 exception ERROR: [cff71ab3] /wiki/Special:CentralNotice   DBTransactionError from line 245 of /srv/mediawiki/php-1.27.0-wmf.15/includes/db/loadbalancer/LBFactory.php: To avoid creating high replication lag, this transaction was aborted because the write duration (6.1663353443146) exceeded the 6 seconds limit.
2016-03-04 16:18:16 mw1072 metawiki 1.27.0-wmf.15 exception ERROR: [e29716e0] /wiki/Special:CentralNotice   DBTransactionError from line 245 of /srv/mediawiki/php-1.27.0-wmf.15/includes/db/loadbalancer/LBFactory.php: To avoid creating high replication lag, this transaction was aborted because the write duration (7.2294545173645) exceeded the 6 seconds limit.
2016-03-04 16:22:07 mw1209 metawiki 1.27.0-wmf.15 exception ERROR: [17025733] /wiki/Special:CentralNotice   DBTransactionError from line 245 of /srv/mediawiki/php-1.27.0-wmf.15/includes/db/loadbalancer/LBFactory.php: To avoid creating high replication lag, this transaction was aborted because the write duration (6.3699862957001) exceeded the 6 seconds limit.
2016-03-04 16:29:51 mw1255 metawiki 1.27.0-wmf.15 exception ERROR: [86810c9f] /wiki/Special:CentralNotice   DBTransactionError from line 245 of /srv/mediawiki/php-1.27.0-wmf.15/includes/db/loadbalancer/LBFactory.php: To avoid creating high replication lag, this transaction was aborted because the write duration (6.0277230739594) exceeded the 6 seconds limit.
2016-03-04 16:30:50 mw1171 metawiki 1.27.0-wmf.15 exception ERROR: [d8fd9d42] /wiki/Special:CentralNotice   DBTransactionError from line 245 of /srv/mediawiki/php-1.27.0-wmf.15/includes/db/loadbalancer/LBFactory.php: To avoid creating high replication lag, this transaction was aborted because the write duration (7.4766249656677) exceeded the 6 seconds limit.
2016-03-04 16:40:57 mw1074 metawiki 1.27.0-wmf.15 exception ERROR: [35c0dbff] /wiki/Special:CentralNotice   DBTransactionError from line 245 of /srv/mediawiki/php-1.27.0-wmf.15/includes/db/loadbalancer/LBFactory.php: To avoid creating high replication lag, this transaction was aborted because the write duration (6.3603825569153) exceeded the 6 seconds limit.

It seems pretty clear that CentralNotice::handleNoticePostFromList() is long overdue for a rewrite and some basic "optimization" (if that's what you call making something slightly less heinous). As pointed out by @AlexMonk-WMF on IRC, the code might be doing something odd that isn't obvious, and despite all the needless updating, the table being needlessly updated isn't huge. So I also don't know whether there's an issue worth checking out on the DB-infrastructure side here.

That is a MediaWiki error; MySQL itself (the infrastructure, not the logic/lag) has no problem with long transactions, but MediaWiki kills them (rightfully).

With advances in hardware (SSDs, large caches) and in MySQL (parallel replication), a large number of short transactions is strongly preferred over a small number of long-running commits. Ideally, a transaction should not take more than one second to execute. If something requires long-running "logical" transactions (atomic actions), that logic should be kept on the application side.

With this I am not distancing myself from the issue; please do ask for help with optimization. I am just confirming that there were no unusual issues, maintenance, or hardware problems on centralauth (s7) on the dates given.
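
To illustrate the pattern jcrespo describes, here is a minimal sketch (not CentralNotice code; maintenance-script style, using the MW 1.27-era DB API, with batchUpdateCampaigns, $changes, and the not_id column being placeholders of mine) of splitting a bulk update into small write batches with replication waits between them. In a normal web request MediaWiki wraps everything in one implicit transaction, so this shape fits a maintenance script or job better than the special page itself.

// Illustrative sketch only, not CentralNotice code: apply per-campaign
// changes in small batches rather than in one long transaction.
// $changes is a hypothetical [ not_id => settings-array ] map.
function batchUpdateCampaigns( array $changes, $batchSize = 50 ) {
	$dbw = wfGetDB( DB_MASTER );
	$count = 0;
	foreach ( $changes as $noticeId => $settings ) {
		$dbw->update( 'cn_notices', $settings, [ 'not_id' => $noticeId ], __METHOD__ );
		if ( ++$count % $batchSize === 0 ) {
			// Pause and let the slaves catch up before the next batch.
			wfWaitForSlaves();
		}
	}
	wfWaitForSlaves();
}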

Here's the mysql log from enabling multiple campaigns at once.

I just ran into this error attempting to unarchive a single campaign.

Database error
To avoid creating high replication lag, this transaction was aborted because the write duration (6.2203695774078) exceeded the 6 seconds limit.
If you are changing many items at once, try doing multiple smaller operations instead.

I think someone reduced, or proposed reducing, the timeout for this in MediaWiki, but I didn't follow up. @aaron, do you know if there was any related change recently?

The config has not changed recently, though the value is on the high side still.
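
For reference, assuming the limit being hit here is the one MediaWiki 1.27 exposes as $wgMaxUserDBWriteDuration (that is my assumption; I may be misremembering the exact setting), it is a site-configuration value along these lines:

// LocalSettings.php sketch (assumption, not the actual WMF config): abort a
// user web request whose database write time exceeds this many seconds.
$wgMaxUserDBWriteDuration = 6;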

I think someone is already trying to work on optimizing this (cannot remember who).

Indeed, 6 seconds seems like a long time to block DB_MASTER; is that what would be happening? Any CentralNotice database queries that come in on a POST request (as is the case with these) use DB_MASTER.

As per @jcrespo's instructions, I profiled the operations involved locally. The log and report are attached. Total exec time, 34ms. My local cn_notices table has only 3 rows. I wish I knew how to interpret this report better... any further assistance would be greatly appreciated!

If it's useful, I could probably find a way to locally simulate conditions more similar to production (i.e., put a similar number of rows in cn_notices).

Thanks much!! :)
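
In case it helps, this is roughly how I'd pad cn_notices locally (a rough sketch only; the column names are guesses based on the CentralNotice schema, and other NOT NULL columns may need values too):

// Local-testing sketch: insert ~900 dummy campaigns so cn_notices is about
// the same size as production. Not meant to run anywhere near production.
$dbw = wfGetDB( DB_MASTER );
for ( $i = 0; $i < 900; $i++ ) {
	$dbw->insert(
		'cn_notices',
		[
			'not_name' => "dummy_campaign_$i",          // assumed column names
			'not_start' => $dbw->timestamp(),
			'not_end' => $dbw->timestamp( time() + 86400 ),
			'not_enabled' => 0,
		],
		__METHOD__
	);
}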


@AndyRussG: I am not an expert on MediaWiki, but POST requests can be handled on a slave, with due precautions (writing/locking on the master). However, master operations should be as short as possible.
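
To make that concrete, here is a minimal sketch (my illustration only, not how CentralNotice is actually structured; the column and campaign names are placeholders) of the split being suggested, using the MW 1.27-era wfGetDB() API:

// Illustration only: reads can go to a slave even during a POST, as long as
// any read that guards a subsequent write happens on the master (e.g. with a
// row lock). Everything else stays off the master, and master work is short.
$dbr = wfGetDB( DB_SLAVE );    // cheap reads: list campaigns, build the form
$dbw = wfGetDB( DB_MASTER );   // writes and locking reads only

// Plain read, fine on the slave:
$names = $dbr->selectFieldValues( 'cn_notices', 'not_name', [], __METHOD__ );

// Read that guards a write: do it on the master, with FOR UPDATE.
$row = $dbw->selectRow(
	'cn_notices',
	[ 'not_id', 'not_enabled' ],
	[ 'not_name' => 'example_campaign' ],   // hypothetical campaign
	__METHOD__,
	[ 'FOR UPDATE' ]
);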

This is a summary of the sizes of the cn_* tables on meta in production:

+--------------------------------+--------+----------------+----------------+
| Name                           |  Rows  | Avg_row_length | Data_length    |
+--------------------------------+--------+----------------+----------------+
| cn_assignments                 |   1919 |             59 |         114688 |
| cn_known_devices               |      5 |           3276 |          16384 |  
| cn_known_mobile_carriers       |      0 |              0 |          16384 |
| cn_notice_countries            |  20860 |             76 |        1589248 |
| cn_notice_languages            |  57099 |             45 |        2621440 |
| cn_notice_log                  |  22013 |           1742 |       38354944 |
| cn_notice_mixin_params         |   1217 |             53 |          65536 |
| cn_notice_mixins               |    348 |             70 |          24576 |
| cn_notice_mobile_carriers      |      0 |              0 |          16384 |
| cn_notice_projects             |   5085 |             38 |         196608 |
| cn_notices                     |    924 |            124 |         114688 |
| cn_template_devices            |  13191 |             36 |         475136 |
| cn_template_log                |  35344 |            133 |        4734976 |
| cn_template_mixins             |      0 |              0 |          16384 |
| cn_templates                   |  10476 |            151 |        1589248 |
+--------------------------------+--------+----------------+----------------+

Assuming the same queries are executed on production, I can run them on a production slave similar to the master and give you an estimate of the time they really take. Can you identify which block of queries in the file corresponds exactly to the HTTP request/transaction that fails?

FYI, this error only counts time spent in write queries, not SELECTs.

Can you identify which block of queries in the file corresponds exactly to the HTTP request/transaction that fails?

You mean in the attached log? All of them are from that single POST. I removed the parts of the log from other requests, and also the processing that generates the page displayed after the POST.

FYI, this error only counts time spent in write queries, not SELECTs.

Ah OK, that's important to know. So this would be the fault of one specific write query? Or is it the aggregate time of all write queries performed in the context of a single request? (Apologies that my understanding of this is limited...)

So, as you can see, the code that handles this POST is patently awful; it writes to every row in cn_notices no matter what. I'll start coding up a long-overdue de-awfulization. As mentioned above, I just wanted to check that nothing very unexpected was going on.
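
The general shape I have in mind is something like the following (a sketch only, not the eventual patch; getEnabledFlagsFromPost() and the column names are placeholders, not real CentralNotice identifiers):

// Sketch: figure out which campaigns the POST actually changes, and write
// only those rows, instead of updating every row in cn_notices.
$requested = getEnabledFlagsFromPost( $request );   // hypothetical: [ not_id => 0|1 ]

$dbw = wfGetDB( DB_MASTER );
$res = $dbw->select( 'cn_notices', [ 'not_id', 'not_enabled' ], [], __METHOD__ );

$changed = [];
foreach ( $res as $row ) {
	$id = (int)$row->not_id;
	if ( isset( $requested[$id] ) && (int)$row->not_enabled !== $requested[$id] ) {
		$changed[ $requested[$id] ][] = $id;
	}
}

// At most two UPDATEs (newly enabled, newly disabled), touching only the
// rows whose state really changed.
foreach ( $changed as $newValue => $ids ) {
	$dbw->update(
		'cn_notices',
		[ 'not_enabled' => $newValue ],
		[ 'not_id' => $ids ],
		__METHOD__
	);
}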

Change 277453 had a related patch set uploaded (by AndyRussG):
Admin UI: Move JS and CSS for campaign pager to RL module

https://gerrit.wikimedia.org/r/277453

AndyRussG set the point value for this task to 4. (Mar 15 2016, 2:52 PM)

Change 277792 had a related patch set uploaded (by AndyRussG):
[WIP] Admin UI: Optimize handling of changes to campaigns via list

https://gerrit.wikimedia.org/r/277792

Here's the slow.log and digest with the above patches, on my local install with 3 rows in cn_notices. It looks like an improvement to me. Unless someone notices something new, I guess we should try to get this through code review, deploy, and hope for the best.

Thanks!!


Change 277453 merged by jenkins-bot:
Admin UI: Move JS and CSS for campaign pager to RL module

https://gerrit.wikimedia.org/r/277453

Change 277792 merged by Ejegg:
Admin UI: Optimize handling of changes to campaigns via list

https://gerrit.wikimedia.org/r/277792

The fix for this has been deployed to production. However, we might wait for some bulk changes to go through successfully before closing. :) Thanks!!

I just tested disabling and then enabling 5 campaigns at once, and then archiving 4 campaigns at once. Everything ran much more quickly than before, with no timeouts, and everything in CentralNoticeLogs looks fine too. Thanks a lot!