Page MenuHomePhabricator

RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons
Closed, ResolvedPublic

Description

Error

Request URL: jobrunner.discovery.wmnet/rpc/RunSingleJob.php
Request ID: XKOs1gpAAEMAAIpwETMAAAAK

trace
{
  "_index": "logstash-2019.04.03",
  "_type": "mediawiki",
  "_id": "AWnkxwrc8aQffZ3Hxr2k",
  "_score": 1,
  "_source": {
    "exception": {
      "trace": "#0 /srv/mediawiki/php-1.33.0-wmf.24/includes/jobqueue/jobs/RefreshLinksJob.php(150): Wikimedia\\Rdbms\\LBFactory->beginMasterChanges(string)\n#1 /srv/mediawiki/php-1.33.0-wmf.24/includes/jobqueue/jobs/RefreshLinksJob.php(124): RefreshLinksJob->runForTitle(Title)\n#2 /srv/mediawiki/php-1.33.0-wmf.24/extensions/EventBus/includes/JobExecutor.php(65): RefreshLinksJob->run()\n#3 /srv/mediawiki/rpc/RunSingleJob.php(77): JobExecutor->execute(array)\n#4 {main}",
      "code": 0,
      "file": "/srv/mediawiki/php-1.33.0-wmf.24/includes/libs/rdbms/lbfactory/LBFactory.php:246",
      "message": "RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started.",
      "class": "Wikimedia\\Rdbms\\DBTransactionError"
    },
    "server": "jobrunner.discovery.wmnet",
    "phpversion": "5.6.99-hhvm",
    "wiki": "commonswiki",
    "channel": "exception",
    "exception_id": "XKOs1gpAAEMAAIpwETMAAAAK",
    "program": "mediawiki",
    "type": "mediawiki",
    "message_checksum": "78b933215170c2b6b34a496d28f237c5",
    "caught_by": "mwe_handler",
    "exception_url": "/rpc/RunSingleJob.php",
    "http_method": "POST",
    "host": "mw1304",
    "@version": 1,
    "shard": "s4",
    "timestamp": "2019-04-03T19:57:23+00:00",
    "severity": "err",
    "level": "ERROR",
    "ip": "10.64.16.69",
    "mwversion": "1.33.0-wmf.24",
    "logsource": "mw1304",
    "message": "[XKOs1gpAAEMAAIpwETMAAAAK] /rpc/RunSingleJob.php   Wikimedia\\Rdbms\\DBTransactionError from line 246 of /srv/mediawiki/php-1.33.0-wmf.24/includes/libs/rdbms/lbfactory/LBFactory.php: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started.",
    "normalized_message": "[{exception_id}] {exception_url}   Wikimedia\\Rdbms\\DBTransactionError from line 246 of /srv/mediawiki/php-1.33.0-wmf.24/includes/libs/rdbms/lbfactory/LBFactory.php: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started.",
    "url": "/rpc/RunSingleJob.php",
    "reqId": "XKOs1gpAAEMAAIpwETMAAAAK",
    "tags": [
      "input-kafka-rsyslog-udp-localhost",
      "rsyslog-udp-localhost",
      "kafka",
      "es"
    ],
    "referrer": null,
    "@timestamp": "2019-04-03T19:57:23.000Z",
    "facility": "user"
  },
  "fields": {
    "@timestamp": [
      1554321443000
    ]
  }
}

Impact

Unknown. Around 150,000 errors messages logged in a 5 minute span prior to rollback. 1.33.0-wmf.24 is holding at group0.

Event Timeline

dduvall triaged this task as Unbreak Now! priority.Apr 3 2019, 8:06 PM
dduvall created this task.
Restricted Application added a subscriber: TerraCodes. · View Herald TranscriptApr 3 2019, 8:07 PM

Mentioned in SAL (#wikimedia-operations) [2019-04-03T20:09:17Z] <marxarelli> 1.33.0-wmf.24 is holding at group0 following rollback. filed T220037. cc: T206678

dduvall renamed this task from Spike in DBTransactionError following 1.33.0-wmf.24 group1 promotion to RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons.Apr 3 2019, 8:16 PM
Restricted Application added a project: Analytics. · View Herald TranscriptApr 3 2019, 8:17 PM
13:25:40 <+Pchelolo> James_F: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/499958/ is an obvious suspect..
13:27:19 <+Pchelolo> ok. I know what happened James_F
13:29:20 <+Pchelolo> seems like the gerrit I've posted before requires https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/427019/ to work, however that one was never ported to the jobExecutor class that we're using in production

Change 501051 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] Respect the job execution flags regarding TRX.

https://gerrit.wikimedia.org/r/501051

The patch above should fix it.

This makes me reprioritize the merging of the JobRunner and JobExecutor to avoid situations like this in the future. Will work on it shortly.

fdans moved this task from Incoming to Radar on the Analytics board.Apr 4 2019, 4:47 PM

Change 501051 merged by jenkins-bot:
[mediawiki/extensions/EventBus@master] Respect the job execution flags regarding TRX.

https://gerrit.wikimedia.org/r/501051

Change 501372 had a related patch set uploaded (by Jforrester; owner: Ppchelko):
[mediawiki/extensions/EventBus@wmf/1.33.0-wmf.24] Respect the job execution flags regarding TRX.

https://gerrit.wikimedia.org/r/501372

Change 501372 merged by jenkins-bot:
[mediawiki/extensions/EventBus@wmf/1.33.0-wmf.24] Respect the job execution flags regarding TRX.

https://gerrit.wikimedia.org/r/501372

Mentioned in SAL (#wikimedia-operations) [2019-04-04T19:06:34Z] <marxarelli> fetch/rebase looks good, incorporates fixes for T220037, T219510. deploying

Mentioned in SAL (#wikimedia-operations) [2019-04-04T19:13:15Z] <marxarelli> large spike in DBTransactionError errors. rolling back. cc: T220037

Mentioned in SAL (#wikimedia-operations) [2019-04-04T20:11:35Z] <marxarelli> error rates look good after proper syncs and re-deploy. cc: T220037

Following syncs and promotion to all wikis, the error rate is much lower than the previously seen spike, but it's still rather high, average of 800/min.

Example:

{
  "_index": "logstash-2019.04.04",
  "_type": "mediawiki",
  "_id": "AWnqCdqV8aQffZ3HyCjK",
  "_version": 1,
  "_score": null,
  "_source": {
    "exception": {
      "trace": "#0 /srv/mediawiki/php-1.33.0-wmf.24/extensions/EventBus/includes/JobExecutor.php(289): Wikimedia\\Rdbms\\LBFactory->commitMasterChanges(string, array)\n#1 /srv/mediawiki/php-1.33.0-wmf.24/extensions/EventBus/includes/JobExecutor.php(67): JobExecutor->commitMasterChanges(Wikimedia\\Rdbms\\LBFactoryMulti, string)\n#2 /srv/mediawiki/rpc/RunSingleJob.php(77): JobExecutor->execute(array)\n#3 {main}",
      "code": 0,
      "file": "/srv/mediawiki/php-1.33.0-wmf.24/includes/libs/rdbms/lbfactory/LBFactory.php:261",
      "message": "RefreshLinksJob::run: transaction round 'RefreshLinksJob::runForTitle' still running.",
      "class": "Wikimedia\\Rdbms\\DBTransactionError"
    },
    "server": "jobrunner.discovery.wmnet",
    "phpversion": "5.6.99-hhvm",
    "wiki": "enwiki",
    "channel": "exception",
    "exception_id": "XKW0twpAAEUAAFjwWmIAAAAA",
    "program": "mediawiki",
    "type": "mediawiki",
    "message_checksum": "40dee3e70084b7a6f50133fd69b72f74",
    "caught_by": "mwe_handler",
    "exception_url": "/rpc/RunSingleJob.php",
    "http_method": "POST",
    "host": "mw1302",
    "@version": 1,
    "shard": "s1",
    "timestamp": "2019-04-04T20:28:27+00:00",
    "severity": "err",
    "level": "ERROR",
    "ip": "10.64.16.67",
    "mwversion": "1.33.0-wmf.24",
    "logsource": "mw1302",
    "message": "[XKW0twpAAEUAAFjwWmIAAAAA] /rpc/RunSingleJob.php   Wikimedia\\Rdbms\\DBTransactionError from line 261 of /srv/mediawiki/php-1.33.0-wmf.24/includes/libs/rdbms/lbfactory/LBFactory.php: RefreshLinksJob::run: transaction round 'RefreshLinksJob::runForTitle' still running.",
    "normalized_message": "[{exception_id}] {exception_url}   Wikimedia\\Rdbms\\DBTransactionError from line 261 of /srv/mediawiki/php-1.33.0-wmf.24/includes/libs/rdbms/lbfactory/LBFactory.php: RefreshLinksJob::run: transaction round 'RefreshLinksJob::runForTitle' still running.",
    "url": "/rpc/RunSingleJob.php",
    "reqId": "XKW0twpAAEUAAFjwWmIAAAAA",
    "tags": [
      "input-kafka-rsyslog-udp-localhost",
      "rsyslog-udp-localhost",
      "kafka",
      "es"
    ],
    "referrer": null,
    "@timestamp": "2019-04-04T20:28:27.000Z",
    "facility": "user"
  },
  "fields": {
    "@timestamp": [
      1554409707000
    ]
  },
  "sort": [
    1554409707000
  ]
}

Mentioned in SAL (#wikimedia-operations) [2019-04-04T20:45:10Z] <marxarelli> promotion of 1.33.0-wmf.24 rolled back to group0 and holding. cc: T206678, T220037

Change 501447 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Add missing transaction round commit calls to RefreshLinksJob

https://gerrit.wikimedia.org/r/501447

Change 501459 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.33.0-wmf.24] Add missing transaction round commit calls to RefreshLinksJob

https://gerrit.wikimedia.org/r/501459

Change 501447 merged by jenkins-bot:
[mediawiki/core@master] Add missing transaction round commit calls to RefreshLinksJob

https://gerrit.wikimedia.org/r/501447

Change 501459 merged by jenkins-bot:
[mediawiki/core@wmf/1.33.0-wmf.24] Add missing transaction round commit calls to RefreshLinksJob

https://gerrit.wikimedia.org/r/501459

Mentioned in SAL (#wikimedia-operations) [2019-04-05T16:02:11Z] <krinkle@deploy1001> Synchronized php-1.33.0-wmf.24/includes/jobqueue/jobs/RefreshLinksJob.php: Ib1ac31365f9c / T220037 (duration: 00m 59s)

Mentioned in SAL (#wikimedia-operations) [2019-04-08T19:48:18Z] <marxarelli> promoting 1.33.0-wmf.24 to all wikis. cc: T220037, T206678

Mentioned in SAL (#wikimedia-operations) [2019-04-08T19:59:29Z] <marxarelli> promotion of 1.33.0-wmf.24 to all wikis completed. error rates nominal aside from usual timeouts. cc: T206678, T220037

dduvall closed this task as Resolved.Apr 8 2019, 8:00 PM
dduvall claimed this task.