
CentralNotice: DB timeouts when enabling more than one campaign at once from Special:CentralNotice
Closed, Resolved · Public · 4 Estimated Story Points

Description

This happened several times, from around 16:10 UTC 2016-03-04, over a period of about 20-30 min. Apparently it's happened before, too.

However, it seems that disabling a lot of campaigns at once hasn't caused problems.

Event Timeline

The error message was something like this:

To avoid excessive replication lag, your transaction has been cancelled after exceeding the 6 second timeout (6.05 s). If you are changing many items, try multiple smaller transactions.

fluorine:/a/mw-log$ grep "Special:CentralNotice" exception.log 
2016-03-04 16:15:28 mw1220 metawiki 1.27.0-wmf.15 exception ERROR: [cff71ab3] /wiki/Special:CentralNotice   DBTransactionError from line 245 of /srv/mediawiki/php-1.27.0-wmf.15/includes/db/loadbalancer/LBFactory.php: To avoid creating high replication lag, this transaction was aborted because the write duration (6.1663353443146) exceeded the 6 seconds limit.
2016-03-04 16:18:16 mw1072 metawiki 1.27.0-wmf.15 exception ERROR: [e29716e0] /wiki/Special:CentralNotice   DBTransactionError from line 245 of /srv/mediawiki/php-1.27.0-wmf.15/includes/db/loadbalancer/LBFactory.php: To avoid creating high replication lag, this transaction was aborted because the write duration (7.2294545173645) exceeded the 6 seconds limit.
2016-03-04 16:22:07 mw1209 metawiki 1.27.0-wmf.15 exception ERROR: [17025733] /wiki/Special:CentralNotice   DBTransactionError from line 245 of /srv/mediawiki/php-1.27.0-wmf.15/includes/db/loadbalancer/LBFactory.php: To avoid creating high replication lag, this transaction was aborted because the write duration (6.3699862957001) exceeded the 6 seconds limit.
2016-03-04 16:29:51 mw1255 metawiki 1.27.0-wmf.15 exception ERROR: [86810c9f] /wiki/Special:CentralNotice   DBTransactionError from line 245 of /srv/mediawiki/php-1.27.0-wmf.15/includes/db/loadbalancer/LBFactory.php: To avoid creating high replication lag, this transaction was aborted because the write duration (6.0277230739594) exceeded the 6 seconds limit.
2016-03-04 16:30:50 mw1171 metawiki 1.27.0-wmf.15 exception ERROR: [d8fd9d42] /wiki/Special:CentralNotice   DBTransactionError from line 245 of /srv/mediawiki/php-1.27.0-wmf.15/includes/db/loadbalancer/LBFactory.php: To avoid creating high replication lag, this transaction was aborted because the write duration (7.4766249656677) exceeded the 6 seconds limit.
2016-03-04 16:40:57 mw1074 metawiki 1.27.0-wmf.15 exception ERROR: [35c0dbff] /wiki/Special:CentralNotice   DBTransactionError from line 245 of /srv/mediawiki/php-1.27.0-wmf.15/includes/db/loadbalancer/LBFactory.php: To avoid creating high replication lag, this transaction was aborted because the write duration (6.3603825569153) exceeded the 6 seconds limit.

It seems pretty clear that CentralNotice::handleNoticePostFromList() is long overdue for a rewrite and some basic "optimization" (if that's what you call making something slightly less heinous). As pointed out by @AlexMonk-WMF on IRC, the code might be doing something odd that isn't obvious, and despite all the needless updating, the table being needlessly updated isn't huge. So I also don't know whether there's an issue worth checking out on the DB-infrastructure side here.

That is a MediaWiki error; MySQL itself (the infrastructure, not the logic/lag) has no problem with long transactions, but MediaWiki kills them (rightfully).

With advances in hardware (SSDs, large caches) and in MySQL (parallel replication), a large number of short transactions is strongly preferred over a small number of long-running commits. Ideally, a transaction should not take more than one second to execute. If something requires long-running "logical" transactions (atomic actions), that logic should be kept on the application side.

With this I am not distancing myself from the issue; please do ask for help with optimization. I am just confirming that there were no unusual issues, maintenance, or hardware problems on centralauth (s7) on the dates given.
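
To illustrate the pattern jcrespo describes, here is a minimal sketch (not CentralNotice code; maintenance-script style, using the MW 1.27-era DB API, with batchUpdateCampaigns, $changes, and the not_id column being placeholders of mine) of splitting a bulk update into small write batches with replication waits between them. In a normal web request MediaWiki wraps everything in one implicit transaction, so this shape fits a maintenance script or job better than the special page itself.

// Illustrative sketch only, not CentralNotice code: apply per-campaign
// changes in small batches rather than in one long transaction.
// $changes is a hypothetical [ not_id => settings-array ] map.
function batchUpdateCampaigns( array $changes, $batchSize = 50 ) {
	$dbw = wfGetDB( DB_MASTER );
	$count = 0;
	foreach ( $changes as $noticeId => $settings ) {
		$dbw->update( 'cn_notices', $settings, [ 'not_id' => $noticeId ], __METHOD__ );
		if ( ++$count % $batchSize === 0 ) {
			// Pause and let the slaves catch up before the next batch.
			wfWaitForSlaves();
		}
	}
	wfWaitForSlaves();
}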

Here's the mysql log from enabling multiple campaigns at once.

I just ran into this error attempting to unarchive a single campaign.

Database error
To avoid creating high replication lag, this transaction was aborted because the write duration (6.2203695774078) exceeded the 6 seconds limit.
If you are changing many items at once, try doing multiple smaller operations instead.

I think someone reduced, or proposed reducing, the timeout for this in MediaWiki, but I didn't follow up. @aaron, do you know if there was any related change recently?

The config has not changed recently, though the value is on the high side still.
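
For reference, assuming the limit being hit here is the one MediaWiki 1.27 exposes as $wgMaxUserDBWriteDuration (that is my assumption; I may be misremembering the exact setting), it is a site-configuration value along these lines:

// LocalSettings.php sketch (assumption, not the actual WMF config): abort a
// user web request whose database write time exceeds this many seconds.
$wgMaxUserDBWriteDuration = 6;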

I think someone is already trying to work on optimizing this (cannot remember who).

Indeed, 6 seconds seems like a long time to block DB_MASTER; is that what would be happening? Any CentralNotice database queries that come in on a POST request (as is the case with these) use DB_MASTER.

As per @jcrespo's instructions, I profiled the operations involved locally. The log and report are attached. Total exec time, 34ms. My local cn_notices table has only 3 rows. I wish I knew how to interpret this report better... any further assistance would be greatly appreciated!

If it's useful, I could probably find a way to locally simulate conditions more similar to production (i.e., put a similar number of rows in cn_notices).

Thanks much!! :)
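
In case it helps, this is roughly how I'd pad cn_notices locally (a rough sketch only; the column names are guesses based on the CentralNotice schema, and other NOT NULL columns may need values too):

// Local-testing sketch: insert ~900 dummy campaigns so cn_notices is about
// the same size as production. Not meant to run anywhere near production.
$dbw = wfGetDB( DB_MASTER );
for ( $i = 0; $i < 900; $i++ ) {
	$dbw->insert(
		'cn_notices',
		[
			'not_name' => "dummy_campaign_$i",          // assumed column names
			'not_start' => $dbw->timestamp(),
			'not_end' => $dbw->timestamp( time() + 86400 ),
			'not_enabled' => 0,
		],
		__METHOD__
	);
}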


@AndyRussG: I am not an expert on MediaWiki, but POST requests can be handled on a slave, with due precautions (writing/locking on the master). However, master operations should be as short as possible.
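
To make that concrete, here is a minimal sketch (my illustration only, not how CentralNotice is actually structured; the column and campaign names are placeholders) of the split being suggested, using the MW 1.27-era wfGetDB() API:

// Illustration only: reads can go to a slave even during a POST, as long as
// any read that guards a subsequent write happens on the master (e.g. with a
// row lock). Everything else stays off the master, and master work is short.
$dbr = wfGetDB( DB_SLAVE );    // cheap reads: list campaigns, build the form
$dbw = wfGetDB( DB_MASTER );   // writes and locking reads only

// Plain read, fine on the slave:
$names = $dbr->selectFieldValues( 'cn_notices', 'not_name', [], __METHOD__ );

// Read that guards a write: do it on the master, with FOR UPDATE.
$row = $dbw->selectRow(
	'cn_notices',
	[ 'not_id', 'not_enabled' ],
	[ 'not_name' => 'example_campaign' ],   // hypothetical campaign
	__METHOD__,
	[ 'FOR UPDATE' ]
);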

This is a summary of the sizes of the cn_* tables on meta in production:

+--------------------------------+--------+----------------+----------------+
| Name                           |  Rows  | Avg_row_length | Data_length    |
+--------------------------------+--------+----------------+----------------+
| cn_assignments                 |   1919 |             59 |         114688 |
| cn_known_devices               |      5 |           3276 |          16384 |  
| cn_known_mobile_carriers       |      0 |              0 |          16384 |
| cn_notice_countries            |  20860 |             76 |        1589248 |
| cn_notice_languages            |  57099 |             45 |        2621440 |
| cn_notice_log                  |  22013 |           1742 |       38354944 |
| cn_notice_mixin_params         |   1217 |             53 |          65536 |
| cn_notice_mixins               |    348 |             70 |          24576 |
| cn_notice_mobile_carriers      |      0 |              0 |          16384 |
| cn_notice_projects             |   5085 |             38 |         196608 |
| cn_notices                     |    924 |            124 |         114688 |
| cn_template_devices            |  13191 |             36 |         475136 |
| cn_template_log                |  35344 |            133 |        4734976 |
| cn_template_mixins             |      0 |              0 |          16384 |
| cn_templates                   |  10476 |            151 |        1589248 |
+--------------------------------+--------+----------------+----------------+

Assuming the same queries are executed on production, I can run them on a production slave similar to the master and give you an estimate of the time they really take. Can you identify which block of queries in the file corresponds exactly to the HTTP request/transaction that fails?

FYI, this error only counts time spent in write queries, not SELECTs.

Can you identify which block of queries in the file corresponds exactly to the HTTP request/transaction that fails?

You mean in the attached log? All of them are from that single POST. I removed the parts of the log from other requests, and also the processing that generates the page displayed after the POST.

FYI, this error only counts time spent in write queries, not SELECTs.

Ah OK, that's important to know. So this would be the fault of one specific write query? Or is it the aggregate time of all write queries performed in the context of a single request? (Apologies that my understanding of this is limited...)

So, as you can see, the code that handles this POST is patently awful; it writes to every row in cn_notices no matter what. I'll start coding up a long-overdue de-awfulization. As mentioned above, I just wanted to check that nothing very unexpected was going on.
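
The general shape I have in mind is something like the following (a sketch only, not the eventual patch; getEnabledFlagsFromPost() and the column names are placeholders, not real CentralNotice identifiers):

// Sketch: figure out which campaigns the POST actually changes, and write
// only those rows, instead of updating every row in cn_notices.
$requested = getEnabledFlagsFromPost( $request );   // hypothetical: [ not_id => 0|1 ]

$dbw = wfGetDB( DB_MASTER );
$res = $dbw->select( 'cn_notices', [ 'not_id', 'not_enabled' ], [], __METHOD__ );

$changed = [];
foreach ( $res as $row ) {
	$id = (int)$row->not_id;
	if ( isset( $requested[$id] ) && (int)$row->not_enabled !== $requested[$id] ) {
		$changed[ $requested[$id] ][] = $id;
	}
}

// At most two UPDATEs (newly enabled, newly disabled), touching only the
// rows whose state really changed.
foreach ( $changed as $newValue => $ids ) {
	$dbw->update(
		'cn_notices',
		[ 'not_enabled' => $newValue ],
		[ 'not_id' => $ids ],
		__METHOD__
	);
}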

Change 277453 had a related patch set uploaded (by AndyRussG):
Admin UI: Move JS and CSS for campaign pager to RL module

https://gerrit.wikimedia.org/r/277453

AndyRussG set the point value for this task to 4. (Mar 15 2016, 2:52 PM)

Change 277792 had a related patch set uploaded (by AndyRussG):
[WIP] Admin UI: Optimize handling of changes to campaigns via list

https://gerrit.wikimedia.org/r/277792

Here's the slow.log and digest with the above patches, on my local install with 3 rows in cn_notices. It looks like an improvement to me. Unless someone notices something new, I guess we should try to get this through code review, deploy, and hope for the best.

Thanks!!


Change 277453 merged by jenkins-bot:
Admin UI: Move JS and CSS for campaign pager to RL module

https://gerrit.wikimedia.org/r/277453

Change 277792 merged by Ejegg:
Admin UI: Optimize handling of changes to campaigns via list

https://gerrit.wikimedia.org/r/277792

The fix for this has been deployed to production. However, we might wait for some bulk changes to go through successfully before closing. :) Thanks!!

I just tested disabling and then enabling 5 campaigns at once, and then archiving 4 campaigns at once. Everything ran much more quickly than before, with no timeouts, and everything in CentralNoticeLogs looks fine too. Thanks a lot!