Page MenuHomePhabricator

MediaWiki\Extension\MachineVision\Maintenance\FetchSuggestions::execute: no transaction to commit, something got out of sync regularly erroring since 2019-11-13
Closed, ResolvedPublic

Description

https://logstash.wikimedia.org/goto/fafee5f10a58dd74e015a52a7dc06635

{
  "_index": "logstash-mediawiki-2019.11.14",
  "_type": "mediawiki",
  "_id": "AW5oppJRh3Uj6x1zPIxj",
  "_version": 1,
  "_score": null,
  "_source": {
    "severity": "err",
    "level": "ERROR",
    "phpversion": "7.2.22-1+0~20190902.26+debian9~1.gbpd64eb7+wmf1",
    "wiki": "commonswiki",
    "channel": "DBQuery",
    "mwversion": "1.35.0-wmf.5",
    "program": "mediawiki",
    "logsource": "mwmaint1002",
    "message": "MediaWiki\\Extension\\MachineVision\\Maintenance\\FetchSuggestions::execute: no transaction to commit, something got out of sync",
    "type": "mediawiki",
    "normalized_message": "MediaWiki\\Extension\\MachineVision\\Maintenance\\FetchSuggestions::execute: no transaction to commit, something got out of sync",
    "reqId": "11cf3eb958a76e2b9a37cd6a",
    "tags": [
      "input-kafka-rsyslog-udp-localhost",
      "rsyslog-udp-localhost",
      "kafka",
      "es"
    ],
    "@timestamp": "2019-11-14T06:42:57.081Z",
    "cli_argv": "/srv/mediawiki-staging/multiversion/MWScript.php /srv/mediawiki/php-1.35.0-wmf.5/extensions/MachineVision/maintenance/fetchSuggestions.php --wiki commonswiki --filelist /home/mholloway-shell/mw/quality.txt",
    "host": "mwmaint1002",
    "@version": 1,
    "shard": "s4",
    "facility": "user",
    "timestamp": "2019-11-14T06:42:57+00:00"
  },
  "fields": {
    "@timestamp": [
      1573713777081
    ]
  },
  "sort": [
    1573713777081
  ]
}

Details

Related Gerrit Patches:

Event Timeline

jcrespo created this task.Thu, Nov 14, 7:03 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptThu, Nov 14, 7:03 AM

The problem is this line, calling commitTransaction when no transaction has been opened: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/MachineVision/+/master/maintenance/fetchSuggestions.php#81

The current run will last probably three more days. Is this serious enough that we should halt the script and deploy a fix immediately, or can it wait until after the current run finishes?

Change 550920 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@master] Fix: Begin a transaction before fetching labels for a batch of files

https://gerrit.wikimedia.org/r/550920

Is this serious enough that we should halt the script and deploy a fix immediately, or can it wait until after the current run finishes?

The purpose of me opening the ticket is precisely to ask you that :-D. If it is causing data loss/performance issues, it should be fixed ASAP. If it is log spam, but no impact to production, it should be eventually fixed (log spam is bad as it may hide other more important issues, specially if frequent), but it is not in a hurry. It is difficult for SREs to tell at first glance.

In a different, but related topic- We do not recommend running long running scripts continuously, mostly for one reason- if the configuration, code or database topology changes, it will not be able to see the changes, and it may fail. It is very common to kill long running database connections and depool and restart database replicas and application servers. It will also not survive a server restart if unpuppetized. The recommendation is to run short batches with some kind of control (e.g. writing a "last row modified id" to a file.

Thanks, @jcrespo. It looks like the only practical effect in this case is the logspam, so I'll let the script run its course this time. Thanks also for the recommendations — that makes sense. We'll update accordingly for the next run.

Change 550920 merged by jenkins-bot:
[mediawiki/extensions/MachineVision@master] Fix: Begin a transaction before fetching labels for a batch of files

https://gerrit.wikimedia.org/r/550920

Mholloway closed this task as Resolved.Thu, Nov 21, 3:35 PM

The fix for this is merged, and should roll out before the next script run.