Page MenuHomePhabricator

Frequent "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock" errors on WMF mediawiki logs
Closed, ResolvedPublic

Description

I think I need help from Performance-Team to try to debug a frequent occurrence of an error found on mediawiki logs:

https://logstash.wikimedia.org/goto/8635d4bbe6af47f6a8e36444fa50fa65

(I've deliverately chosen a time range that didn't contain other know outages/degradations).

It seems that daily there are 1000-2000 events of this kind, and I think they could be related to "Edit" actions, so that is why I thought of Performance, as it may be making edit times slower. This is an example trace:

{
  "_index": "logstash-2017.11.15",
  "_type": "mediawiki",
  "_id": "AV_AMQzc-DyidI0FSVub",
  "_score": null,
  "_source": {
    "server": "www.wikidata.org",
    "unique_id": "WgxWRwpAIEQAAHmVURMAAAAP",
    "level": "WARNING",
    "ip": "10.64.32.105",
    "wiki": "wikidatawiki",
    "channel": "DBQuery",
    "mwversion": "1.31.0-wmf.7",
    "message": "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock 'wikidatawiki:prepared-edit:a86c7dff21e0a19ca4cb9038f91886e1:b0b622d893bc3afa199c2a958f12f49c054acf58:1b290b154b85db95fde079d11dd3f0ca'",
    "type": "mediawiki",
    "normalized_message": "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock '{lockname}'",
    "url": "/w/api.php",
    "tags": [
      "syslog",
      "es",
      "es"
    ],
    "reqId": "WgxWRwpAIEQAAHmVURMAAAAP",
    "lockname": "wikidatawiki:prepared-edit:a86c7dff21e0a19ca4cb9038f91886e1:b0b622d893bc3afa199c2a958f12f49c054acf58:1b290b154b85db95fde079d11dd3f0ca",
    "referrer": "https://www.wikidata.org/w/index.php?title=Wikidata:Do_not_merge/11&action=submit",
    "@timestamp": "2017-11-15T14:59:20.000Z",
    "http_method": "POST",
    "@version": 1,
    "host": "mw1198",
    "shard": "s5"
  },
  "fields": {
    "@timestamp": [
      1510757960000
    ]
  },
  "sort": [
    1510757960000
  ]
}

It happens on all wikis proportionally to the traffic they normally have.

I wonder if that is supposed to happen, or it is impossible to avoid, and it is transparent to the user. While it may not be the most common database-related error we have, the others (mosty api-related errors and replication lag errors) are known and being worked on.

Event Timeline

jcrespo created this task.Nov 17 2017, 3:07 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 17 2017, 3:07 PM

The first occurrence found seems to start at 2017-11-03T17:57:04, but that is taken from logstash, I cannot be 100% sure the events before then haven't been already purged.

prepared-edit indicates that this is part of edit stashing. Which is preemptive processing of an edit a user is making before they actually hit save. If it fails, the worst that happens is that the edit processing has to be redone from scratch when the user hits "save".

@aaron knows more about edit stashing and maybe why that particular warning may happen. But worst case scenario this is only a missed optimization, it shouldn't have any functional impact for the user.

Ok, you can triage as low/close if it should happen, but maybe some wait timeout can be tuned down.

Note CategoryMembershipUpdates are around half of the errors right now.

Gilles assigned this task to aaron.Nov 20 2017, 9:11 PM
Gilles moved this task from Inbox to Next In This Quarter on the Performance-Team board.
Krinkle triaged this task as High priority.Nov 20 2017, 9:19 PM

Change 392558 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Reduce lag waiting time spent in CategoryMembershipUpdateJob critical section

https://gerrit.wikimedia.org/r/392558

Change 392559 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Make CategoryMembershipChangeJob query more readable and use exactly equality

https://gerrit.wikimedia.org/r/392559

Change 392560 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Reduce chance of pointless lock acquistion failures during edit stashing

https://gerrit.wikimedia.org/r/392560

Really interesting patch- MySQL 8.0 will introduce SKIP LOCKED and NOWAIT which maybe interesting in the future for similar cases: https://mysqlserverteam.com/mysql-8-0-1-using-skip-locked-and-nowait-to-handle-hot-rows/

Change 392559 merged by jenkins-bot:
[mediawiki/core@master] Make CategoryMembershipChangeJob query more readable

https://gerrit.wikimedia.org/r/392559

Change 392558 merged by jenkins-bot:
[mediawiki/core@master] Reduce lag waiting time in CategoryMembershipUpdateJob critical section

https://gerrit.wikimedia.org/r/392558

Change 392560 merged by jenkins-bot:
[mediawiki/core@master] Avoid blocking locks during API edit stash generation

https://gerrit.wikimedia.org/r/392560

Thanks, if you could have from time to time a look to mediawiki logs and look for "Lock wait timeout" or "failed to acquire lock" you can check if those are one-time, doing-the-right thing errors or actual contention issues, and then sending them to the right owners- it is not easy for me to see if they are something I should be worried about.

Change 394833 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Lower DatabaseMysqlBase::lock loggin to INFO

https://gerrit.wikimedia.org/r/394833

Change 394833 merged by jenkins-bot:
[mediawiki/core@master] Lower DatabaseMysqlBase::lock logging to INFO

https://gerrit.wikimedia.org/r/394833

aaron added a comment.Dec 5 2017, 4:17 AM

By reducing the lock max wait times and pushing the brunt of lag waits out if the critical section, then less real time should be wasted.

The last change pushes the log level down to just INFO, so it should stop spamming as much when that goes live.

aaron closed this task as Resolved.Dec 5 2017, 4:17 AM
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:09 PM