Job queue broken for labswiki (jobs for wikitech.wikimedia.org are not running)
Closed, Resolved (Public)

Description

This appears to be happening since 31 Oct: https://logstash.wikimedia.org/#dashboard/temp/AVDHmj_HptxhN1XarUng with peaks of 150 errors/hour.

The runners fail to connect to the database, likely due to VLAN restrictions. However, I am not sure the runner should be running there in the first place.

Event Timeline

jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description.
jcrespo subscribed.

IIRC, labswiki jobs are supposed to be running locally on silver only...

demon triaged this task as Medium priority. Nov 3 2015, 10:47 PM

Checking https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors shows these connection errors from RunJobs for labswiki are now trending at the top of the dashboard.

About 3500 in the last 24h.

method: DatabaseMysqlBase::open
Error connecting to 208.80.154.136: Can't connect to MySQL server on '208.80.154.136' (4)

method: LoadBalancer::reportConnectionError
Connection error: Unknown error (208.80.154.136)

Example url: /rpc/RunJobs.php?wiki=*&type=*&maxtime=*&maxmem=*
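
A minimal sketch of how one might check, from a job runner host, whether the database address in these errors is reachable at the TCP level. The helper name `can_connect` is hypothetical, and the port (3306, the MySQL default) and timeout are assumptions; the port actually used by the labswiki database is not stated in this task.

```python
import socket


def can_connect(host: str, port: int = 3306, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

Running `can_connect("208.80.154.136")` on an affected runner and on silver would show whether the failure is specific to the runner's network segment. A `False` result alone does not distinguish a firewall rule from a stopped server.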

IIRC, labswiki jobs are supposed to be running locally on silver only...

Actually, we might have changed this at some point to get more jobs run in the normal way (centrally).

These were the first occurrences:

{
  "_index": "logstash-2015.10.31",
  "_type": "mediawiki",
  "_id": "AVC8f0N1lAIL90ZzMe6V",
  "_score": null,
  "_source": {
    "message": "Connection error: Unknown error (208.80.154.136)",
    "@version": 1,
    "@timestamp": "2015-10-31T06:04:27.000Z",
    "type": "mediawiki",
    "host": "mw1015",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "es",
      "normalized_message_untrimmed"
    ],
    "channel": "wfLogDBError",
    "url": "/rpc/RunJobs.php?wiki=labswiki&type=EchoNotificationDeleteJob&maxtime=60&maxmem=300M",
    "ip": "127.0.0.1",
    "http_method": "POST",
    "server": "127.0.0.1",
    "referrer": null,
    "uid": "fa60c66",
    "process_id": 9763,
    "wiki": "labswiki",
    "method": "LoadBalancer::reportConnectionError",
    "last_error": "Unknown error",
    "db_server": "208.80.154.136",
    "normalized_message": "Connection error: Unknown error (208.80.154.136)"
  },
  "sort": [
    1446271467000
  ]
}
{
  "_index": "logstash-2015.10.31",
  "_type": "mediawiki",
  "_id": "AVC9HgUUMRv_gmyxRqFQ",
  "_score": null,
  "_source": {
    "message": "Connection error: Unknown error (208.80.154.136)",
    "@version": 1,
    "@timestamp": "2015-10-31T08:57:51.000Z",
    "type": "mediawiki",
    "host": "mw1015",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "es",
      "normalized_message_untrimmed"
    ],
    "channel": "wfLogDBError",
    "url": "/rpc/RunJobs.php?wiki=labswiki&type=refreshLinksPrioritized&maxtime=60&maxmem=300M",
    "ip": "127.0.0.1",
    "http_method": "POST",
    "server": "127.0.0.1",
    "referrer": null,
    "uid": "e8373ce",
    "process_id": 9763,
    "wiki": "labswiki",
    "method": "LoadBalancer::reportConnectionError",
    "last_error": "Unknown error",
    "db_server": "208.80.154.136",
    "normalized_message": "Connection error: Unknown error (208.80.154.136)"
  },
  "sort": [
    1446281871000
  ]
}

This may help pinpoint the relevant commits.
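
For correlating these records with deployments, a small sketch that reads concatenated JSON objects like the two above and extracts the time, host, and job type of each. The function names are hypothetical, and `SAMPLE` holds trimmed copies of the two records (only the fields used here), not a real Logstash export format.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlsplit, parse_qs


def iter_json_objects(text: str):
    """Yield each top-level JSON value from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace between objects; raw_decode does not accept it.
        if text[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def summarize(hit: dict) -> dict:
    """Pull out the fields useful for correlating an error with a deploy."""
    src = hit["_source"]
    query = parse_qs(urlsplit(src["url"]).query)
    return {
        # "sort" holds the event time as epoch milliseconds.
        "time": datetime.fromtimestamp(hit["sort"][0] / 1000, tz=timezone.utc),
        "host": src["host"],
        "wiki": query["wiki"][0],
        "job_type": query["type"][0],
        "db_server": src["db_server"],
    }


# Trimmed copies of the two records above (only the fields used here).
SAMPLE = """
{"_source": {"host": "mw1015", "db_server": "208.80.154.136",
  "url": "/rpc/RunJobs.php?wiki=labswiki&type=EchoNotificationDeleteJob&maxtime=60&maxmem=300M"},
 "sort": [1446271467000]}
{"_source": {"host": "mw1015", "db_server": "208.80.154.136",
  "url": "/rpc/RunJobs.php?wiki=labswiki&type=refreshLinksPrioritized&maxtime=60&maxmem=300M"},
 "sort": [1446281871000]}
"""

first = min((summarize(h) for h in iter_json_objects(SAMPLE)), key=lambda s: s["time"])
print(first["time"].isoformat(), first["job_type"])
# prints: 2015-10-31T06:04:27+00:00 EchoNotificationDeleteJob
```

The earliest timestamp gives a lower bound for when the offending change went live, which narrows the range of config commits worth inspecting.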
Krinkle renamed this task from RunJobs.php fails to be executed on labswiki to Job queue broken for labswiki (jobs for wikitech.wikimedia.org are not running). Dec 3 2015, 2:40 PM

This is causing problems on wikitech because link updates are not running. For example, pages added to or removed from categories do not actually show up in (or disappear from) their respective categories. This makes it hard to find up-to-date documentation, and also hard to write or update documentation, because nothing seems to take effect after saving: refreshLinks just fails with a MySQL connection error.

FWIW, labswiki is not supposed to use the cluster's jobqueue at all:

https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings.php#L15903-L15906

so something very strange is going on.

I just confirmed with tcpdump: silver (wikitech) is submitting jobs to the jobqueue even though it shouldn't.

Not sure where the bug lies but this seems pretty serious, given how clearly the rule is bypassed.

Found the problem: the jobqueue file gets included regardless of the fact that we're on labswiki.

https://gerrit.wikimedia.org/r/#/c/250170/5/wmf-config/CommonSettings.php,cm line 193
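
To illustrate the class of bug described here, a toy model in Python. This is not the PHP from CommonSettings.php, whose contents are not reproduced in this task. All names (`load_settings_buggy`, `load_settings_fixed`, `JOBQUEUE_SETTINGS`, `LOCAL_JOB_WIKIS`, the `job_queue` key and its values) are hypothetical, and `enwiki` in the usage note is only an example of a wiki that uses the cluster queue. The only facts taken from the thread are that labswiki should not use the cluster job queue and that the jobqueue file was being included anyway.

```python
# Hypothetical stand-in for the settings contributed by the jobqueue file.
JOBQUEUE_SETTINGS = {"job_queue": "cluster"}

# Wikis that should keep running jobs locally (per the thread: labswiki).
LOCAL_JOB_WIKIS = {"labswiki"}


def load_settings_buggy(wiki: str) -> dict:
    """Jobqueue settings are merged in for every wiki, including labswiki."""
    settings = {"wiki": wiki, "job_queue": "local"}
    settings.update(JOBQUEUE_SETTINGS)  # unconditional include
    return settings


def load_settings_fixed(wiki: str) -> dict:
    """Jobqueue settings are merged in only for wikis on the cluster queue."""
    settings = {"wiki": wiki, "job_queue": "local"}
    if wiki not in LOCAL_JOB_WIKIS:
        settings.update(JOBQUEUE_SETTINGS)  # conditional include
    return settings
```

In this model, `load_settings_buggy("labswiki")` ends up pointing at the cluster queue, which would match the symptom of silver submitting jobs centrally, while `load_settings_fixed("labswiki")` keeps the local queue and `load_settings_fixed("enwiki")` still uses the cluster.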

Change 256698 had a related patch set uploaded (by Giuseppe Lavagetto):
Inclusion of jobqueue files is not unconditional

https://gerrit.wikimedia.org/r/256698

Change 256698 merged by Giuseppe Lavagetto:
Inclusion of jobqueue files is not unconditional

https://gerrit.wikimedia.org/r/256698

I can confirm this is fixed; the last error has timestamp 2015-12-03T21:11:58.000Z.

I see a couple of reverts and re-reverts. I hope this was a mere deployment issue rather than a general one (@Joe)?