
MessageGroupStats::clear performs the same deletion on all mediawiki servers, locking itself
Closed, ResolvedPublic

Description

From time to time, we see on the WMF logs something like this:

MessageGroupStats::clear <ip-redacted> 
1205 Lock wait timeout exceeded; try restarting transaction (<ip-redacted>) 
DELETE FROM `translate_groupstats` WHERE tgs_group = '<one group here>' AND tgs_lang = 'en'

While the records are probably deleted eventually, this creates database stress. It seems the code was designed to run on only one application server, whereas at the WMF it runs on every one of them, failing "(number of mediawikis - 1) * N" times and at some points being the top error appearing in the logs.

An exclusive-write mechanism should be implemented so that not all servers execute the same query; it should be done only once.
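One possible shape for such a mechanism (a hypothetical sketch, not the actual MediaWiki/Translate code): use an atomic "add" on a shared cache such as memcached as a cheap mutex, so only the first server to attempt the purge actually issues the DELETE.

```python
# Hypothetical sketch: deduplicate the purge across servers with an
# atomic cache "add". All names here are illustrative, not real
# Translate extension APIs.

class FakeCache:
    """Stands in for a shared cache such as memcached."""
    def __init__(self):
        self._data = {}

    def add(self, key, value):
        # Atomic in real memcached: succeeds only if the key is absent.
        if key in self._data:
            return False
        self._data[key] = value
        return True

def clear_group_stats(cache, run_delete, group, lang):
    """Run the DELETE only if no other server has claimed this purge."""
    key = f"translate-groupstats-clear:{group}:{lang}"
    if cache.add(key, 1):
        run_delete(group, lang)
        return True
    return False  # another server already did (or is doing) the purge
```

With a shared cache, sixty app servers calling this concurrently for the same group and language would result in a single DELETE instead of sixty contending ones. In real code the key would need a short TTL so later, legitimate purges are not suppressed.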

Details

Related Gerrit Patches:
mediawiki/extensions/Translate (master): Centralize post translation save updates
mediawiki/extensions/Translate (master): Skip stats cache purges on no-op edits

Event Timeline

jcrespo created this task. Jun 12 2015, 8:04 AM
jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description. (Show Details)
jcrespo added a subscriber: jcrespo.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jun 12 2015, 8:04 AM

What do you mean by "it runs on every of them"? It is most likely caused by some kind of user action, but I can't think of what would cause masses of them for English. Do you have examples of the group names?

Every time a mass lock is detected, it is because it runs on *all mediawiki frontends* of the Wikimedia Foundation (Apache/HHVM):
This set of servers:
http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=API%2520application%2520servers%2520eqiad&tab=m&vn=&hide-hf=false

I do not see a reason to execute the same delete query more than once, as there is only one master database per shard.

Here are the full details of one such error from today:

{
  "_index": "logstash-2015.06.16",
  "_type": "mediawiki",
  "_id": "ztDCk3sGRQ6FZbHWEn3bow",
  "_score": null,
  "_source": {
    "message": "MessageGroupStats::clear\t10.64.16.22\t1205\tLock wait timeout exceeded; try restarting transaction (10.64.16.22)\tDELETE FROM `translate_groupstats` WHERE tgs_group = 'page-History of the Wikimedia Foundation Board of Trustees' AND tgs_lang = 'en'",
    "@version": 1,
    "@timestamp": "2015-06-16T02:19:35.006Z",
    "type": "mediawiki",
    "host": "mw1001",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "es",
      "normalized_message_untrimmed"
    ],
    "channel": "wfLogDBError",
    "url": "/rpc/RunJobs.php?wiki=metawiki&type=MessageUpdateJob&maxtime=30&maxmem=300M",
    "ip": "127.0.0.1",
    "http_method": "POST",
    "server": "127.0.0.1",
    "referrer": null,
    "uid": "627769a",
    "process_id": 12672,
    "wiki": "metawiki",
    "db_server": "10.64.16.22",
    "db_name": "metawiki",
    "db_user": "wikiuser",
    "method": "DatabaseBase::reportQueryError",
    "errno": 1205,
    "error": "Lock wait timeout exceeded; try restarting transaction (10.64.16.22)",
    "sql1line": "DELETE FROM `translate_groupstats` WHERE tgs_group = 'page-History of the Wikimedia Foundation Board of Trustees' AND tgs_lang = 'en'",
    "fname": "MessageGroupStats::clear",
    "normalized_message": "MessageGroupStats::clear\t10.64.16.22\t1205\tLock wait timeout exceeded; try restarting transaction (10.64.16.22)\tDELETE FROM `translate_groupstats` WHERE tgs_group = 'page-History of the Wikimedia Foundation Board of Trustees' AND tgs_lang = 'en'"
  },
  "sort": [
    1434421175006
  ]
}

The same query, DELETE FROM translate_groupstats WHERE tgs_group = 'page-History of the Wikimedia Foundation Board of Trustees' AND tgs_lang = 'en', produced an error 18 times, each on a different frontend (and I assume it was successful on the other 50), within the same second at 2015-06-16T02:19:35.003Z. The number of errors varies between 20 and 80 per batch, I assume depending on when MediaWiki or MySQL decides to kill the query.

Variations of the same delete query are seen regularly in the MediaWiki logs, with different parameters (tgs_group values): page-Help:Recent changes, page-Commons:Overwriting existing files, page-Wikivoyage. I do not think this is a one-time problem; it happens every time, or at least regularly. As you can see, it is a job, so I assume (I haven't seen the code) it can be limited to run only once.

I can help with the database side of things if needed, as well as provide more error log details, but MySQL/InnoDB here is doing what it can (not allowing two concurrent writes on the same rows at the same time); I think the query model could be improved with some kind of synchronization. Even if it doesn't affect other queries, the extension could potentially hide other, more problematic lock timeout errors.

Thank you for taking the time.

So after looking at the code, it seems it tries to purge the stats cache for every translation unit of the page (potentially hundreds), even if they have not changed. This is an easy optimization to make, even though it doesn't completely solve the problem when hundreds of units actually are changed (marking a new page for translation).
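The optimization described above, skipping the purge when an edit doesn't actually change anything, can be sketched roughly as follows (hypothetical names, not the actual patch):

```python
# Hypothetical sketch of the "skip stats cache purges on no-op edits"
# idea: only purge group stats when the saved text actually differs
# from what was already stored.

def save_translation(store, purge_stats, unit, new_text):
    """Save a translation unit; purge stats only on a real change."""
    old_text = store.get(unit)
    if old_text == new_text:
        return False          # no-op edit: nothing changed, skip the purge
    store[unit] = new_text
    purge_stats(unit)         # content changed: a purge is warranted
    return True
```

For a page re-marked for translation where most units are unchanged, this turns hundreds of identical DELETEs into a handful.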

Change 218667 had a related patch set uploaded (by Nikerabbit):
Skip stats cache purges on no-op edits

https://gerrit.wikimedia.org/r/218667

jcrespo added a comment. (Edited) Jun 17 2015, 4:46 PM

Another one of those, this time on metawiki today at 16:29 UTC, with 25 deletes:

DELETE FROM `translate_groupstats` WHERE tgs_group IN ('page-100wikidays','agg-Wikimedia_movement') AND tgs_lang = 'en'

/rpc/RunJobs.php?wiki=metawiki&type=MessageGroupStatesUpdaterJob&maxtime=30&maxmem=300M

Nikerabbit triaged this task as High priority. Jun 18 2015, 7:19 AM
Nikerabbit added a subscriber: Language-Team.

Change 218667 merged by jenkins-bot:
Skip stats cache purges on no-op edits

https://gerrit.wikimedia.org/r/218667

Was this fixed by merging "Skip stats cache purges on no-op edits"?
Or is there more work to do here?

This is an easy optimization to do, even though it doesn't completely solve the problem

I don't know how to check whether this issue is still happening. @jcrespo can you check or tell us how?

I am not sure. The number has at least gone down, but only after the 26th, not after the 23rd. Let me present the raw numbers:

  • This happened 60 times for the last time on 2015-06-26T14:42:36.625Z
{
  "_index": "logstash-2015.06.26",
  "_type": "mediawiki",
  "_id": "j1WvaQezTguMnmn-PmPIww",
  "_score": null,
  "_source": {
    "message": "MessageGroupStats::clear\t10.64.16.22\t1205\tLock wait timeout exceeded; try restarting transaction (10.64.16.22)\tDELETE FROM `translate_groupstats` WHERE tgs_group IN ('page-100wikidays','agg-Wikimedia_movement')  AND tgs_lang = 'uk'",
    "@version": 1,
    "@timestamp": "2015-06-26T14:42:36.625Z",
    "type": "mediawiki",
    "host": "mw1224",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "es",
      "normalized_message_untrimmed"
    ],
    "channel": "wfLogDBError",
    "url": "/w/api.php",
    "ip": "10.64.0.102",
    "http_method": "POST",
    "server": "meta.wikimedia.org",
    "referrer": "https://meta.wikimedia.org/w/index.php?title=Special:Translate&group=page-100wikidays&action=page&filter=&language=uk",
    "uid": "f944387",
    "process_id": 11174,
    "wiki": "metawiki",
    "db_server": "10.64.16.22",
    "db_name": "metawiki",
    "db_user": "wikiuser",
    "method": "DatabaseBase::reportQueryError",
    "errno": 1205,
    "error": "Lock wait timeout exceeded; try restarting transaction (10.64.16.22)",
    "sql1line": "DELETE FROM `translate_groupstats` WHERE tgs_group IN ('page-100wikidays','agg-Wikimedia_movement')  AND tgs_lang = 'uk'",
    "fname": "MessageGroupStats::clear",
    "normalized_message": "MessageGroupStats::clear\t10.64.16.22\t1205\tLock wait timeout exceeded; try restarting transaction (10.64.16.22)\tDELETE FROM `translate_groupstats` WHERE tgs_group IN ('page-100wikidays','agg-Wikimedia_movement')  AND tgs_lang = 'uk'"
  },
  "sort": [
    1435329756625
  ]
}

Before that, the lock errors seemed to happen every 30 minutes or so.

After that last one, the only similar deletion error I see is 5 occurrences on 2015-06-29 of:

{
  "_index": "logstash-2015.06.29",
  "_type": "mediawiki",
  "_id": "_3lIG1MUQzemkMX8--WA7Q",
  "_score": null,
  "_source": {
    "message": "MessageGroupStats::clear\t10.64.16.27\t1213\tDeadlock found when trying to get lock; try restarting transaction (10.64.16.27)\tDELETE FROM `translate_groupstats` WHERE tgs_group IN ('page-Help:Extension:ContentTranslation','agg-Content_translation','agg-Help_pages_for_extensions')  AND tgs_lang = 'en'",
    "@version": 1,
    "@timestamp": "2015-06-29T19:53:12.413Z",
    "type": "mediawiki",
    "host": "mw1009",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "es",
      "normalized_message_trimmed"
    ],
    "channel": "wfLogDBError",
    "url": "/rpc/RunJobs.php?wiki=mediawikiwiki&type=MessageUpdateJob&maxtime=30&maxmem=300M",
    "ip": "127.0.0.1",
    "http_method": "POST",
    "server": "127.0.0.1",
    "referrer": null,
    "uid": "b95bf27",
    "process_id": 22691,
    "wiki": "mediawikiwiki",
    "db_server": "10.64.16.27",
    "db_name": "mediawikiwiki",
    "db_user": "wikiuser",
    "method": "DatabaseBase::reportQueryError",
    "errno": 1213,
    "error": "Deadlock found when trying to get lock; try restarting transaction (10.64.16.27)",
    "sql1line": "DELETE FROM `translate_groupstats` WHERE tgs_group IN ('page-Help:Extension:ContentTranslation','agg-Content_translation','agg-Help_pages_for_extensions')  AND tgs_lang = 'en'",
    "fname": "MessageGroupStats::clear",
    "normalized_message": "MessageGroupStats::clear\t10.64.16.27\t1213\tDeadlock found when trying to get lock; try restarting transaction (10.64.16.27)\tDELETE FROM `translate_groupstats` WHERE tgs_group IN ('page-Help:Extension:ContentTranslation','agg-Content_translation','agg-Help_"
  },
  "sort": [
    1435607592413
  ]
}

This is right now the number one cause of database errors, coming in spikes of 1000:

{
  "_index": "logstash-2016.02.10",
  "_type": "mediawiki",
  "_id": "AVLLPW4dMRv_gmyxPwgg",
  "_score": null,
  "_source": {
    "message": "MessageGroupStats::clear\t10.64.16.27\t1205\tLock wait timeout exceeded; try restarting transaction (10.64.16.27)\tDELETE FROM `translate_groupstats` WHERE tgs_group = 'page-Help:Tables' AND tgs_lang = 'en'",
    "@version": 1,
    "@timestamp": "2016-02-10T12:52:26.000Z",
    "type": "mediawiki",
    "host": "mw1002",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "es"
    ],
    "channel": "wfLogDBError",
    "normalized_message": "{fname}\t{db_server}\t{errno}\t{error}\t{sql1line}",
    "url": "/rpc/RunJobs.php?wiki=mediawikiwiki&type=MessageUpdateJob&maxtime=60&maxmem=300M",
    "ip": "127.0.0.1",
    "http_method": "POST",
    "server": "127.0.0.1",
    "referrer": null,
    "uid": "997ba56",
    "process_id": 26556,
    "wiki": "mediawikiwiki",
    "mwversion": "1.27.0-wmf.13",
    "db_server": "10.64.16.27",
    "db_name": "mediawikiwiki",
    "db_user": "wikiuser",
    "method": "DatabaseBase::reportQueryError",
    "errno": 1205,
    "error": "Lock wait timeout exceeded; try restarting transaction (10.64.16.27)",
    "sql1line": "DELETE FROM `translate_groupstats` WHERE tgs_group = 'page-Help:Tables' AND tgs_lang = 'en'",
    "fname": "MessageGroupStats::clear"
  },
  "sort": [
    1455108746000
  ]
}

Seems to correspond with user activity: https://www.mediawiki.org/w/index.php?title=Special:Log&page=Help%3ATables

Just checking: Does the table have expected indices on production?

The given page has about 400 translation units and it was marked for translation twice. Even if each one of the sections always triggered MessageGroupStats::clear, that would only explain ~800 errors, not thousands. Of course, if some other updates queue up, that might add to the pile. Relevant log: https://www.mediawiki.org/w/index.php?title=Special%3ALog&type=pagetranslation&user=&page=&year=&month=-1&tagfilter=&hide_thanks_log=1&hide_patrol_log=1&hide_tag_log=1

In any case, something is seriously broken, I just don't know what. It would be helpful if the title and other parameters of the job were included in the details.

showjobs.php for mediawikiwiki shows zero jobs, so it is probably too late to find details of those jobs if they aren't logged anywhere.

@Nikerabbit - I do not know for sure, but I suppose the queue manager retries jobs if they fail; that could contribute to it.
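The retry-amplification hypothesis above can be illustrated with a trivial sketch (purely illustrative; the real job queue's retry policy is not shown in this task):

```python
# Hypothetical illustration of retry amplification: if each failed
# purge job is re-run, one contended DELETE can generate several
# attempts (and thus several lock-wait errors) per originating edit.

def run_with_retries(job, max_attempts):
    """Run a job, retrying on failure; return the number of attempts."""
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        if job():
            break
    return attempts

# A job that times out twice before succeeding takes three attempts,
# i.e. it hits the database three times for one logical purge.
outcomes = iter([False, False, True])
attempts = run_with_retries(lambda: next(outcomes), max_attempts=3)
```

Under this model, N edits with an average of k retries each would explain roughly N * k errors, which could account for the gap between ~800 expected and the thousands observed.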

Just checking: Does the table have expected indices on production?

Can you link me to the committed MediaWiki schema? I will check it against production immediately.

In any case, something is seriously broken, I just don't know what.

It is OK, I do not expect you to know or fix this immediately :-). I would just like you to give it the required priority due to the high number of failures. We can start by enabling more verbose logging.

Via GitHub: https://github.com/wikimedia/mediawiki-extensions-Translate/blob/master/sql/translate_groupstats.sql

The primary key is (tgs_group, tgs_lang), which I assume the deletes are also using.

Maybe there are logs or graphs somewhere already for number of jobs per type, that could be a start.

There is indeed https://grafana.wikimedia.org/dashboard/db/job-queue-rate (notice also drop down menu).

Let me check the structure and that the index is indeed being used.

Is it just me or does the job type filter dropdown not have any effect on that page?

This is the production master for mediawikiwiki, which I think is the one that failed last:

mysql> SHOW CREATE TABLE translate_groupstats\G
*************************** 1. row ***************************
       Table: translate_groupstats
Create Table: CREATE TABLE `translate_groupstats` (
  `tgs_group` varbinary(100) NOT NULL,
  `tgs_lang` varbinary(20) NOT NULL,
  `tgs_total` int(5) unsigned DEFAULT NULL,
  `tgs_translated` int(5) unsigned DEFAULT NULL,
  `tgs_fuzzy` int(5) unsigned DEFAULT NULL,
  `tgs_proofread` int(5) unsigned DEFAULT NULL,
  PRIMARY KEY (`tgs_group`,`tgs_lang`),
  KEY `tgs_lang` (`tgs_lang`)
) ENGINE=InnoDB DEFAULT CHARSET=binary
1 row in set (0.00 sec)

As far as I can see the structure is the same as in the code, including the indexes.

The index is used correctly for the delete, according to explain:

mysql> EXPLAIN DELETE FROM `translate_groupstats` WHERE tgs_group = 'page-Help:Tables' AND tgs_lang = 'en'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: translate_groupstats
         type: range
possible_keys: PRIMARY,tgs_lang
          key: PRIMARY
      key_len: 124
          ref: NULL
         rows: 1
        Extra: Using where

This is right now the second most common cause of database errors over the last week, accounting for half of the errors that are not caused by a generic "could not connect".

Over 7000 errors over the last week:

https://logstash.wikimedia.org/#dashboard/temp/AVOfvZK4O3D718AO3AX5

It would help a lot if we stopped clearing the stats for each translation unit when pages are marked for translation and just did it once when the page is marked. Not sure about MessageGroupStatesUpdaterJob though.
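That centralization idea, collapsing the per-unit purges into one purge per page, could look roughly like this (a hypothetical sketch, not the merged patch):

```python
# Hypothetical sketch of centralizing post-save updates: collect the
# affected groups while the units are processed, then issue a single
# purge per group at the end instead of one per unit.

def mark_page_for_translation(units, purge_group):
    """Process all units, then purge each affected group exactly once."""
    affected = set()
    for unit in units:
        # ... save/process the unit here ...
        affected.add(unit["group"])
    # One purge call per distinct group instead of one per unit.
    for group in sorted(affected):
        purge_group(group)
    return len(affected)
```

For the Help:Tables case (~400 units in one group), this would reduce ~400 DELETEs per marking to a single one.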

Change 282378 had a related patch set uploaded (by Glaisher):
Improve translation statistics regeneration

https://gerrit.wikimedia.org/r/282378

Change 282378 had a related patch set uploaded (by Glaisher):
Centralize post-page saves updates

https://gerrit.wikimedia.org/r/282378

Change 282378 merged by jenkins-bot:
Centralize post translation save updates

https://gerrit.wikimedia.org/r/282378

There is already a visible downward trend: 1700 in the past 7 days, 28 in the past 2 days. Maybe the latest patches will make it disappear altogether in a couple of weeks.

Nikerabbit changed the task status from Open to Stalled. Apr 20 2016, 1:28 PM

Well, not completely. For the last 7 days there were 62, but 53 of them were last Wednesday.

@jcrespo Is this rate acceptable for now? I'd like to investigate the cause of the remaining spikes, but I am short on time.

Yes, 62 << 7000. The main issue was log noise. Thank you very much for working on it.

I would close this ticket now as resolved, although it would be nice if there were a long-term, lower-priority commitment to the remaining errors (as they could uncover architectural issues), but not at the expense of other, more important work.