WMDE banners failing to save - Timing out on save
Closed, DuplicatePublic
Actions

Description

When I save our new banners https://meta.wikimedia.org/wiki/Special:CentralNoticeBanners/Edit/B17WMDE_01_170706_var and https://meta.wikimedia.org/wiki/Special:CentralNoticeBanners/Edit/B17WMDE_01_170706_ctrl there are instances where saving takes a very long time until it times out.

I've tried to check if it makes difference whether I edit template parameters (CN form fields) or banner code, but the timeout was always the same. "Edit banner on-wiki" (just changing banner wiki page) works without timeouts, so it its something in CN.

The banner values themselves are always saved: my workaround for the timeouts is to hit enter in the address bar to reload the page and my changes were always there.

I can't reproduce it at the moment to show the error messages that come after 1-2 minutes of "saving", but the error occurred for nearly every save that was done in the banner history. Maybe you have some logs where you can correlate banner saving time with incidents/timeouts in the calls that saving a banner in CN makes.

Related Objects

Mentioned In: T203925: Save times for changes to translation variable text in centralnotice paralysingly slow
T170968: Translate: Large number of rows affected when rebuilding message index, causing DB timeouts
T167442: Adding translatable message to banner FR2015_translations takes a long time and shows error
Mentioned Here: T170968: Translate: Large number of rows affected when rebuilding message index, causing DB timeouts
rETRA1e9bcb0d048a: Try to avoid huge inserts in DatabaseMessageIndex
T168926: MessageGroupStats caused database query issues
T158084: Mediawiki namespace pages, including CentralNotice banners, are slow to save
T167442: Adding translatable message to banner FR2015_translations takes a long time and shows error

Event Timeline

• DStrine created this task.Jul 13 2017, 3:45 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 13 2017, 3:45 PM

• Jseddon renamed this task from WMDE banners failing to save to WMDE banners failing to save - Timing out on save.Jul 13 2017, 3:46 PM

gabriel-wmde updated the task description. (Show Details)Jul 13 2017, 4:54 PM

Thanks!!!

There are results for a search for the banner names in LogStash:

B17WMDE_01_170706_ctrl

B17WMDE_01_170706_var

I don't know enough about many parts of our infrastructure to understand many of the log entries... What's most notable and understandable to me is the database timeouts. @jcrespo?

For other banners edited that day, I don't see errors, though I do see warnings about laggy database servers. For example, here are log searches for B1718_0712_enIL_ipd_p1_lg_nag_lglink and robin_20170713_message_banner.

A notable difference is the heavy use of translatable messages in the WMDE banner. That might point to something related to T158084. (Also, just saw T167442 again--looks like the same issue!)

Are these messages worked on via Translatewiki? If not, you might wish to try including the message text directly in the banner, eliminating the translatable messages (or as many as possible). Could be useful to see if the situation improves then...

Thanks again!!

AndyRussG added subscribers: aaron, Pcoombe, spatton.Jul 14 2017, 12:10 AM

The queries seem to be writing 113800 rows in a single query. I cannot help you with that, that is not a server issue that a dba can help you with- please notify the code owner and tell him/her to fix the queries performed or things will break every time that functionality is used.

If the person that handles that code is not responsive, we should just disable the extension because it breaks production.

In T170591#3438513, @jcrespo wrote:

The queries seem to be writing 113800 rows in a single query.

Ouch! Ok, that's terrible. Is there a log of the queries somewhere?

If the person that handles that code is not responsive, we should just disable the extension because it breaks production.

Don't worry, I'm here. :) CentralNotice is the extension that shows banners on (almost) all production wikis. It's used heavily for community projects and announcements, as well as fundraising.

Fortunately, it's rare that banners include so many i18n messages (assuming that's really what triggers this). How would you suggest we check how often this is really happening, and look at details of any the other occurrences?

Also, it seems likely the queries themselves are not run directly by CentralNotice. It could be core, or maybe the Translate extension. The code that saves banner content is Banner::saveBodyContent(). This calls core's WikiPage::doEditContent(), which has special handling for i18n messages (Mediawiki: namespace pages).

Maybe we could check that this really isn't happening when you save the same banner by editing it on-wiki? Perhaps it could be happening then too, but the error isn't visible to the user because of how updates are run?

• DStrine triaged this task as Unbreak Now! priority.Jul 14 2017, 4:36 PM

Restricted Application added subscribers: Jay8g, TerraCodes. · View Herald TranscriptJul 14 2017, 4:36 PM

AndyRussG added projects: Performance Issue, SRE.Jul 14 2017, 4:49 PM

Is there a log of the queries somewhere?

This is what I got from the log you pointed:

Query affected 113800 row(s):
query-m: REPLACE INTO `translate_messageindex` (tmi_key,tmi_value) VALUES ('X')

Ah OK @jcrespo... Yeah, I see it was right there, sorry... And thanks!!!

AndyRussG edited projects, added MediaWiki-extensions-Translate; removed SRE, Performance Issue.Jul 14 2017, 6:31 PM

AndyRussG edited subscribers, added: Nikerabbit; removed: aaron.Jul 14 2017, 6:36 PM

AndyRussG mentioned this in T167442: Adding translatable message to banner FR2015_translations takes a long time and shows error.Jul 14 2017, 7:30 PM

Maybe the same bug as is causing T168926?

Restricted Application added a subscriber: Liuxinyu970226. · View Herald TranscriptJul 17 2017, 10:40 AM

The queries come from Translate/utils/MessageIndex.php. Since 1e9bcb0d, it is doing incremental updates with replace + delete (instead of delete all + insert). Such a high number could only happen if

the messageindex was wiped
the message index was never created
the old message index failed to load and silently ignored
a big change has been made which affects a lot of messages (e.g. a new group or making changes to an aggregate message group)
a bug that affects huge number of (real or not real) messages

1 and 2 sound unlikely, but I suggest instrumenting DatabaseMessageIndex::store or MessageIndex::getArrayDiff to figure out what those 100k messages actually are to determine whether it is any of 1-5 or even something else.

KartikMistry subscribed.Jul 18 2017, 12:22 PM

A few more tidbits:

This still happens on production. I was able to reproduce it on production by changing the content (rather than settings) of the banner B17WMDE_01_170706_ctrl. (Just added a little whitespace.) I got a "Service Temporarily Unavailable" error after a long wait. Then, when I reloaded the banner edit page, the changes were correctly shown.

Searching for "translate_messageindex" in LogStash brings up 22 log entries this month, many showing a high count of rows changed. Most are not part of CentralNotice requests. I don't know how the infrastructure may be affected.

AndyRussG mentioned this in T170968: Translate: Large number of rows affected when rebuilding message index, causing DB timeouts.Jul 18 2017, 6:41 PM

T170968 is about the more general Translate extension issue... :) Thanks!!!!!

• DStrine moved this task from Triage to Blocked or not fr-tech on the Fundraising-Backlog board.Jul 18 2017, 7:14 PM

Nemo_bis subscribed.Jul 19 2017, 10:51 AM

(Subscribing, but adding to Radar for now - per IRC conversation with Aaron, this seems mostly related to the known issues with Translate, that Language Engineering / Global Collaboration are probably already more familiar with. Feel free to move back to our Inbox if/when any further input or help is needed from us.)

Krinkle moved this task from Inbox, needs triage to Radar on the Performance-Team board.Jul 25 2017, 5:59 PM

Is this in the hands of the Translate extension team at this point? FR Tech has put it in their "blocked" column, but I want to be clear that someone else plans to work on it at some point.

It's also not clear to me whether T170968 is a more general case of this problem, or a related but distinct problem.

In T170591#3479903, @ksmith wrote:

Is this in the hands of the Translate extension team at this point? FR Tech has put it in their "blocked" column, but I want to be clear that someone else plans to work on it at some point.

It's also not clear to me whether T170968 is a more general case of this problem, or a related but distinct problem.

Hi...!! Mmm to answer your questions out of order... Yes, it seems likely that both T170968 and the banner-saving timeouts described here are caused by the same bug in the Translate extension. T170968 describes Translate-related server-side error messages that show up when this bug manifests itself, and also in other circumstances. So, T170968 should be the first thing we focus on, I think. :)

My understanding is that Language Engineering will try to look at it, though they're short-staffed right now. I agree with @Krinkle (as I understand his comment :) ) that it'd probably be most efficient for people familiar with the Translate code to dive in.

Yep, it's quite important that someone works on it!!!

Thanks much! :D

+ @Amire80 @K4-713 for visibility given importance of this task expressed by @AndyRussG.

@AndyRussG @ksmith @ggellerman @Amire80 @Arrbee @KartikMistry

I am now back from my vacation :)

T170968 is only a description of symptoms. It does not describe the cause.

In the middle of my vacation, I gave instructions in T170591#3445570 (as well as in IRC where this was first discussed) how someone familiar with CentralNotice but not with Translate can debug this issue in production (or locally, if it can be reproduced locally) to determine the cause. I was hoping someone would follow the instructions to determine the cause while I am away.

In T170591#3479903, @ksmith wrote:

Is this in the hands of the Translate extension team at this point?

Practically speaking, I am the only active maintainer and I have many other things on my plate as well. I used to get support from the Language team for this work, but given the changes to teams, I don't know the current status. See below for my suggestion how to proceed.

In T170591#3483772, @AndyRussG wrote:

Hi...!! Mmm to answer your questions out of order... Yes, it seems likely that both T170968 and the banner-saving timeouts described here are caused by the same bug in the Translate extension. T170968 describes Translate-related server-side error messages that show up when this bug manifests itself, and also in other circumstances. So, T170968 should be the first thing we focus on, I think. :)

My understanding is that Language Engineering will try to look at it, though they're short-staffed right now. I agree with @Krinkle (as I understand his comment :) ) that it'd probably be most efficient for people familiar with the Translate code to dive in.

In this case Translate extension is a bit like operating system. It seems likely, that given this seems to only happen with CentralNotice, that CentralNotice is doing something unexpected that then causes issues when Translate does what it regularly does. I am not saying that this is the case here, but is rather an argument why the cause should be determined first.

And figuring out the cause doesn't need my expertise and thus the fastest way forward would be for someone else to attempt that. I will of course help, but if I don't have steps to reproduce this locally, it will likely take longer for me than for someone else.

I hope this clarifies expectations for both sides.

Hi... I'll be digging into this right away. I'll let you know if I have any questions... Thanks much!!! :)

Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.Aug 8 2017, 3:14 AM

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.Aug 16 2017, 6:56 PM

• DStrine moved this task from Blocked or not fr-tech to Current Sprint and Completed in Q1 2017-18 on the Fundraising-Backlog board.Aug 31 2017, 7:59 PM

• DStrine added a project: Fundraising Sprint Quill Pencil.

In T170591#3501942, @AndyRussG wrote:

Hi... I'll be digging into this right away. I'll let you know if I have any questions... Thanks much!!! :)

Hi there. This has been UBN! for a while (since July 14th) and no update in over a month. This doesn't seem UBN! :)

(I look at the list of UBN! tasks a few times a week to make sure nothing misses my radar/doesn't get resolved in a timely manner.)

Yes, agreed that UBN isn't right at this point...

AndyRussG lowered the priority of this task from Unbreak Now! to High.Sep 7 2017, 3:43 AM

In T170591#3501942, @AndyRussG wrote:

Hi... I'll be digging into this right away.

Can the task be still considered assigned to you (I understand that only after your investigation others may be involved in fixing)?

In T170591#3587309, @Nemo_bis wrote:

In T170591#3501942, @AndyRussG wrote:

Hi... I'll be digging into this right away.

Can the task be still considered assigned to you (I understand that only after your investigation others may be involved in fixing)?

It's still on my plate, though help from anywhere else is certainly appreciated. :) I had to put it aside before really getting very far--apologies for not mentioning that here, and for not keeping the priority in line with how we were dealing with it. Thanks!!!!! ;)

• DStrine added a project: Fundraising Sprint RadioActivewear.Sep 13 2017, 8:28 PM

• DStrine added a project: Fundraising Sprint Synchronized Screaming.Sep 26 2017, 8:46 PM

• DStrine moved this task from Current Sprint and Completed in Q1 2017-18 to Q3 2021-2022 on the Fundraising-Backlog board.Oct 10 2017, 8:04 PM

• ggellerman unsubscribed.Nov 21 2017, 10:55 PM

• DStrine moved this task from Q3 2021-2022 to Sprint +1 on the Fundraising-Backlog board.Jan 11 2018, 3:55 PM

Krinkle unsubscribed.Jan 12 2018, 10:47 PM

• DStrine added a project: Fr-CentralNotice-Translation-Bugs.Feb 7 2018, 3:08 PM

Nikerabbit moved this task from Backlog to maintenance and operational issues on the MediaWiki-extensions-Translate board.Jun 3 2018, 10:45 AM

Nikerabbit mentioned this in T203925: Save times for changes to translation variable text in centralnotice paralysingly slow.Sep 11 2018, 7:21 AM

AndyRussG closed this task as a duplicate of T203925: Save times for changes to translation variable text in centralnotice paralysingly slow.Sep 25 2018, 8:12 PM

Restricted Application removed a subscriber: Liuxinyu970226. · View Herald TranscriptSep 25 2018, 8:12 PM