Page MenuHomePhabricator

WMDE banners failing to save - Timing out on save
Closed, DuplicatePublic

Description

When I save our new banners https://meta.wikimedia.org/wiki/Special:CentralNoticeBanners/Edit/B17WMDE_01_170706_var and https://meta.wikimedia.org/wiki/Special:CentralNoticeBanners/Edit/B17WMDE_01_170706_ctrl there are instances where saving takes a very long time until it times out.

I've tried to check if it makes difference whether I edit template parameters (CN form fields) or banner code, but the timeout was always the same. "Edit banner on-wiki" (just changing banner wiki page) works without timeouts, so it its something in CN.

The banner values themselves are always saved: my workaround for the timeouts is to hit enter in the address bar to reload the page and my changes were always there.

I can't reproduce it at the moment to show the error messages that come after 1-2 minutes of "saving", but the error occurred for nearly every save that was done in the banner history. Maybe you have some logs where you can correlate banner saving time with incidents/timeouts in the calls that saving a banner in CN makes.

Event Timeline

Jseddon renamed this task from WMDE banners failing to save to WMDE banners failing to save - Timing out on save.Jul 13 2017, 3:46 PM

Thanks!!!

There are results for a search for the banner names in LogStash:

B17WMDE_01_170706_ctrl

B17WMDE_01_170706_var

I don't know enough about many parts of our infrastructure to understand many of the log entries... What's most notable and understandable to me is the database timeouts. @jcrespo?

For other banners edited that day, I don't see errors, though I do see warnings about laggy database servers. For example, here are log searches for B1718_0712_enIL_ipd_p1_lg_nag_lglink and robin_20170713_message_banner.

A notable difference is the heavy use of translatable messages in the WMDE banner. That might point to something related to T158084. (Also, just saw T167442 again--looks like the same issue!)

Are these messages worked on via Translatewiki? If not, you might wish to try including the message text directly in the banner, eliminating the translatable messages (or as many as possible). Could be useful to see if the situation improves then...

Thanks again!!

The queries seem to be writing 113800 rows in a single query. I cannot help you with that, that is not a server issue that a dba can help you with- please notify the code owner and tell him/her to fix the queries performed or things will break every time that functionality is used.

If the person that handles that code is not responsive, we should just disable the extension because it breaks production.

The queries seem to be writing 113800 rows in a single query.

Ouch! Ok, that's terrible. Is there a log of the queries somewhere?

If the person that handles that code is not responsive, we should just disable the extension because it breaks production.

Don't worry, I'm here. :) CentralNotice is the extension that shows banners on (almost) all production wikis. It's used heavily for community projects and announcements, as well as fundraising.

Fortunately, it's rare that banners include so many i18n messages (assuming that's really what triggers this). How would you suggest we check how often this is really happening, and look at details of any the other occurrences?

Also, it seems likely the queries themselves are not run directly by CentralNotice. It could be core, or maybe the Translate extension. The code that saves banner content is Banner::saveBodyContent(). This calls core's WikiPage::doEditContent(), which has special handling for i18n messages (Mediawiki: namespace pages).

Maybe we could check that this really isn't happening when you save the same banner by editing it on-wiki? Perhaps it could be happening then too, but the error isn't visible to the user because of how updates are run?

DStrine triaged this task as Unbreak Now! priority.Jul 14 2017, 4:36 PM

Is there a log of the queries somewhere?

This is what I got from the log you pointed:

Query affected 113800 row(s):
query-m: REPLACE INTO `translate_messageindex` (tmi_key,tmi_value) VALUES ('X')

Ah OK @jcrespo... Yeah, I see it was right there, sorry... And thanks!!!

The queries come from Translate/utils/MessageIndex.php. Since 1e9bcb0d, it is doing incremental updates with replace + delete (instead of delete all + insert). Such a high number could only happen if

  1. the messageindex was wiped
  2. the message index was never created
  3. the old message index failed to load and silently ignored
  4. a big change has been made which affects a lot of messages (e.g. a new group or making changes to an aggregate message group)
  5. a bug that affects huge number of (real or not real) messages

1 and 2 sound unlikely, but I suggest instrumenting DatabaseMessageIndex::store or MessageIndex::getArrayDiff to figure out what those 100k messages actually are to determine whether it is any of 1-5 or even something else.

A few more tidbits:

  • This still happens on production. I was able to reproduce it on production by changing the content (rather than settings) of the banner B17WMDE_01_170706_ctrl. (Just added a little whitespace.) I got a "Service Temporarily Unavailable" error after a long wait. Then, when I reloaded the banner edit page, the changes were correctly shown.
  • Searching for "translate_messageindex" in LogStash brings up 22 log entries this month, many showing a high count of rows changed. Most are not part of CentralNotice requests. I don't know how the infrastructure may be affected.

T170968 is about the more general Translate extension issue... :) Thanks!!!!!

Krinkle subscribed.

(Subscribing, but adding to Radar for now - per IRC conversation with Aaron, this seems mostly related to the known issues with Translate, that Language Engineering / Global Collaboration are probably already more familiar with. Feel free to move back to our Inbox if/when any further input or help is needed from us.)

Is this in the hands of the Translate extension team at this point? FR Tech has put it in their "blocked" column, but I want to be clear that someone else plans to work on it at some point.

It's also not clear to me whether T170968 is a more general case of this problem, or a related but distinct problem.

Is this in the hands of the Translate extension team at this point? FR Tech has put it in their "blocked" column, but I want to be clear that someone else plans to work on it at some point.

It's also not clear to me whether T170968 is a more general case of this problem, or a related but distinct problem.

Hi...!! Mmm to answer your questions out of order... Yes, it seems likely that both T170968 and the banner-saving timeouts described here are caused by the same bug in the Translate extension. T170968 describes Translate-related server-side error messages that show up when this bug manifests itself, and also in other circumstances. So, T170968 should be the first thing we focus on, I think. :)

My understanding is that Language Engineering will try to look at it, though they're short-staffed right now. I agree with @Krinkle (as I understand his comment :) ) that it'd probably be most efficient for people familiar with the Translate code to dive in.

Yep, it's quite important that someone works on it!!!

Thanks much! :D

+ @Amire80 @K4-713 for visibility given importance of this task expressed by @AndyRussG.

@AndyRussG @ksmith @ggellerman @Amire80 @Arrbee @KartikMistry

I am now back from my vacation :)

T170968 is only a description of symptoms. It does not describe the cause.

In the middle of my vacation, I gave instructions in T170591#3445570 (as well as in IRC where this was first discussed) how someone familiar with CentralNotice but not with Translate can debug this issue in production (or locally, if it can be reproduced locally) to determine the cause. I was hoping someone would follow the instructions to determine the cause while I am away.

Is this in the hands of the Translate extension team at this point?

Practically speaking, I am the only active maintainer and I have many other things on my plate as well. I used to get support from the Language team for this work, but given the changes to teams, I don't know the current status. See below for my suggestion how to proceed.

Hi...!! Mmm to answer your questions out of order... Yes, it seems likely that both T170968 and the banner-saving timeouts described here are caused by the same bug in the Translate extension. T170968 describes Translate-related server-side error messages that show up when this bug manifests itself, and also in other circumstances. So, T170968 should be the first thing we focus on, I think. :)

My understanding is that Language Engineering will try to look at it, though they're short-staffed right now. I agree with @Krinkle (as I understand his comment :) ) that it'd probably be most efficient for people familiar with the Translate code to dive in.

In this case Translate extension is a bit like operating system. It seems likely, that given this seems to only happen with CentralNotice, that CentralNotice is doing something unexpected that then causes issues when Translate does what it regularly does. I am not saying that this is the case here, but is rather an argument why the cause should be determined first.

And figuring out the cause doesn't need my expertise and thus the fastest way forward would be for someone else to attempt that. I will of course help, but if I don't have steps to reproduce this locally, it will likely take longer for me than for someone else.

I hope this clarifies expectations for both sides.

Hi... I'll be digging into this right away. I'll let you know if I have any questions... Thanks much!!! :)

Hi... I'll be digging into this right away. I'll let you know if I have any questions... Thanks much!!! :)

Hi there. This has been UBN! for a while (since July 14th) and no update in over a month. This doesn't seem UBN! :)

(I look at the list of UBN! tasks a few times a week to make sure nothing misses my radar/doesn't get resolved in a timely manner.)

Yes, agreed that UBN isn't right at this point...

AndyRussG lowered the priority of this task from Unbreak Now! to High.Sep 7 2017, 3:43 AM

Hi... I'll be digging into this right away.

Can the task be still considered assigned to you (I understand that only after your investigation others may be involved in fixing)?

Hi... I'll be digging into this right away.

Can the task be still considered assigned to you (I understand that only after your investigation others may be involved in fixing)?

It's still on my plate, though help from anywhere else is certainly appreciated. :) I had to put it aside before really getting very far--apologies for not mentioning that here, and for not keeping the priority in line with how we were dealing with it. Thanks!!!!! ;)