Page MenuHomePhabricator

MediaWiki\Extension\MachineVision\Maintenance\FetchSuggestions::execute: no transaction to commit, something got out of sync regularly erroring since 2019-11-13
Closed, ResolvedPublic

Description

https://logstash.wikimedia.org/goto/fafee5f10a58dd74e015a52a7dc06635

Screenshot_20191114_080242.png (873×2 px, 75 KB)

{
  "_index": "logstash-mediawiki-2019.11.14",
  "_type": "mediawiki",
  "_id": "AW5oppJRh3Uj6x1zPIxj",
  "_version": 1,
  "_score": null,
  "_source": {
    "severity": "err",
    "level": "ERROR",
    "phpversion": "7.2.22-1+0~20190902.26+debian9~1.gbpd64eb7+wmf1",
    "wiki": "commonswiki",
    "channel": "DBQuery",
    "mwversion": "1.35.0-wmf.5",
    "program": "mediawiki",
    "logsource": "mwmaint1002",
    "message": "MediaWiki\\Extension\\MachineVision\\Maintenance\\FetchSuggestions::execute: no transaction to commit, something got out of sync",
    "type": "mediawiki",
    "normalized_message": "MediaWiki\\Extension\\MachineVision\\Maintenance\\FetchSuggestions::execute: no transaction to commit, something got out of sync",
    "reqId": "11cf3eb958a76e2b9a37cd6a",
    "tags": [
      "input-kafka-rsyslog-udp-localhost",
      "rsyslog-udp-localhost",
      "kafka",
      "es"
    ],
    "@timestamp": "2019-11-14T06:42:57.081Z",
    "cli_argv": "/srv/mediawiki-staging/multiversion/MWScript.php /srv/mediawiki/php-1.35.0-wmf.5/extensions/MachineVision/maintenance/fetchSuggestions.php --wiki commonswiki --filelist /home/mholloway-shell/mw/quality.txt",
    "host": "mwmaint1002",
    "@version": 1,
    "shard": "s4",
    "facility": "user",
    "timestamp": "2019-11-14T06:42:57+00:00"
  },
  "fields": {
    "@timestamp": [
      1573713777081
    ]
  },
  "sort": [
    1573713777081
  ]
}

Event Timeline

The problem is this line, calling commitTransaction when no transaction has been opened: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/MachineVision/+/master/maintenance/fetchSuggestions.php#81

The current run will last probably three more days. Is this serious enough that we should halt the script and deploy a fix immediately, or can it wait until after the current run finishes?

Change 550920 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@master] Fix: Begin a transaction before fetching labels for a batch of files

https://gerrit.wikimedia.org/r/550920

Is this serious enough that we should halt the script and deploy a fix immediately, or can it wait until after the current run finishes?

The purpose of me opening the ticket is precisely to ask you that :-D. If it is causing data loss/performance issues, it should be fixed ASAP. If it is log spam, but no impact to production, it should be eventually fixed (log spam is bad as it may hide other more important issues, specially if frequent), but it is not in a hurry. It is difficult for SREs to tell at first glance.

In a different, but related topic- We do not recommend running long running scripts continuously, mostly for one reason- if the configuration, code or database topology changes, it will not be able to see the changes, and it may fail. It is very common to kill long running database connections and depool and restart database replicas and application servers. It will also not survive a server restart if unpuppetized. The recommendation is to run short batches with some kind of control (e.g. writing a "last row modified id" to a file.

Thanks, @jcrespo. It looks like the only practical effect in this case is the logspam, so I'll let the script run its course this time. Thanks also for the recommendations — that makes sense. We'll update accordingly for the next run.

Change 550920 merged by jenkins-bot:
[mediawiki/extensions/MachineVision@master] Fix: Begin a transaction before fetching labels for a batch of files

https://gerrit.wikimedia.org/r/550920

The fix for this is merged, and should roll out before the next script run.