Some code is trying to connect to aawiki on the wrong shard, generating 40 errors per minute
Closed, Declined | Public | PRODUCTION ERROR

Description

{
  "_index": "logstash-2017.06.19",
  "_type": "mediawiki",
  "_id": "AVzBSFRqrHtES4E5g7xd",
  "_score": null,
  "_source": {
    "db_server": "10.64.48.22",
    "method": "Wikimedia\\Rdbms\\DatabaseMysqlBase::open",
    "level": "ERROR",
    "wiki": "aawiki",
    "channel": "DBConnection",
    "mwversion": "1.30.0-wmf.5",
    "message": "Error connecting to 10.64.48.22: :real_connect(): (42000/1049): Unknown database 'aawiki'",
    "type": "mediawiki",
    "error": ":real_connect(): (42000/1049): Unknown database 'aawiki'",
    "normalized_message": "Error connecting to {db_server}: {error}",
    "tags": [
      "syslog",
      "es",
      "es"
    ],
    "reqId": "5ffae18505b5fa49bbe86b0d",
    "@timestamp": "2017-06-19T16:56:01.000Z",
    "db_name": "aawiki",
    "db_user": "wikiadmin",
    "@version": 1,
    "host": "terbium",
    "shard": "s3"
  },
  "fields": {
    "@timestamp": [
      1497891361000
    ]
  },
  "sort": [
    1497891361000
  ]
}

As the log shows, something is trying to connect to aawiki on an enwiki (s1) shard, where that database does not exist. Could this be a maintenance task that runs as aawiki when in reality it works on a specific wiki and tries to connect to it? The error does not give me enough information to track down where this is coming from.

Event Timeline

Aklapper renamed this task from "Some code is trying to connecto to aawiki on the wrong shard, generating 40 errors per minute" to "Some code is trying to connect to aawiki on the wrong shard, generating 40 errors per minute". Jun 20 2017, 12:20 PM
Aklapper edited projects, added WMF-General-or-Unknown; removed MediaWiki-General.

Does anybody know a way to debug which code generates that? The trace gives no clue about the origin of the issue.
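One way to narrow this down, sketched here in Python, is to export the matching Logstash hits and group them by the fields the log entry already carries (`host`, `db_server`, `db_user`), keeping the `reqId` values so they can be searched for in other channels. The function name `summarize_connection_errors` and the export step are assumptions for illustration. Only the field names come from the entry in the description.

```python
from collections import Counter, defaultdict


def summarize_connection_errors(hits):
    """Count DBConnection errors per (host, db_server, db_user).

    `hits` is a list of Logstash documents shaped like the one in the
    task description. Hypothetical helper, not part of MediaWiki.
    """
    counts = Counter()
    req_ids = defaultdict(set)
    for hit in hits:
        src = hit.get("_source", {})
        if src.get("channel") != "DBConnection":
            continue  # only look at connection errors
        key = (src.get("host"), src.get("db_server"), src.get("db_user"))
        counts[key] += 1
        if src.get("reqId"):
            # Searching these IDs in other channels may reveal the caller.
            req_ids[key].add(src["reqId"])
    return counts, req_ids


# Trimmed copy of the entry from the description.
sample = {
    "_source": {
        "db_server": "10.64.48.22",
        "channel": "DBConnection",
        "wiki": "aawiki",
        "reqId": "5ffae18505b5fa49bbe86b0d",
        "db_user": "wikiadmin",
        "host": "terbium",
    }
}

counts, req_ids = summarize_connection_errors([sample])
for key, n in counts.most_common():
    print(key, n, sorted(req_ids[key]))
```

If all hits share one host and `db_user`, as the sample suggests (`terbium`, `wikiadmin`), that would point at a script running on the maintenance server rather than at web requests.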

Krinkle subscribed.

Not seen in the logs for at least 7 days. The only matches for "Error connecting" under type:mediawiki in Logstash are for these two cases:

  1. labweb1001/labweb1002 (as labswiki) trying to connect to various IPs (10.64.0.93, 10.64.48.150, 10.64.48.11:3314, and more).
  2. Various app servers in both DCs having lost connections due to reading authorization packet or waiting for initial communication packet.

We may need to file tasks for those, but I'll close this for now given none of them are about aawiki on s1 or anything else relating to maintenance servers (mwmaint1001/terbium).
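For reference, the check above could be expressed as an Elasticsearch query body along these lines. This is a minimal sketch: `build_error_query` is a hypothetical helper, and it assumes that `type`, `wiki` and `host` are indexed as exact-match (keyword) fields, which depends on the Logstash mapping in use.

```python
def build_error_query(phrase="Error connecting", wiki=None, days=7):
    """Build a query body for recent connection errors (hypothetical helper)."""
    filters = [
        {"term": {"type": "mediawiki"}},
        {"range": {"@timestamp": {"gte": f"now-{days}d"}}},
    ]
    if wiki is not None:
        # Optionally restrict to a single wiki, e.g. "aawiki".
        filters.append({"term": {"wiki": wiki}})
    return {
        "size": 0,  # only the aggregation is of interest
        "query": {
            "bool": {
                "must": [{"match_phrase": {"message": phrase}}],
                "filter": filters,
            }
        },
        # Bucket the matches by originating host.
        "aggs": {"by_host": {"terms": {"field": "host"}}},
    }


query = build_error_query(wiki="aawiki")
```

An empty `by_host` aggregation for `wiki="aawiki"` over the last 7 days would correspond to the "not seen" result reported above.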

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:10 PM