Untangle wikitech/labtestwikitech and s7 DBs and networking and mysql grants
Closed, Resolved, Public

Description

I was trying to work out why Special:RecentChanges on labtestwikitech (labtestweb2001) wasn't working (it was trying to connect to the s7 database cluster). Turns out, we have global AbuseFilter support turned on and pointing at metawiki on s7. It's trying to load global AbuseFilter revision tags from there.

silver seems to be able to connect to some s7 eqiad slaves (db1039 and db1062), but not to db1041 or to any of the codfw ones (error 110, "Connection timed out"). labtestweb2001 can't connect to any of the ones I tested (also 110, with a much longer timeout and a slightly different error message?).
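A check like the one described above can be scripted as a plain TCP probe run from each web host. This is only a minimal sketch: the fully qualified host names and port 3306 (the MySQL default) are assumptions for illustration, and a successful TCP connect says nothing about grants.

```python
import socket


def probe(host, port=3306, timeout=3.0):
    """Try a TCP connect and return 'ok' or a short failure description."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        # No answer within the timeout; typical when packets are dropped.
        return "timed out"
    except OSError as exc:
        # Refused, unreachable, DNS failure, etc.
        return f"failed: {exc}"


if __name__ == "__main__":
    # Hypothetical host list: names assumed to follow the
    # dbNNNN.eqiad.wmnet pattern; adjust to the servers being tested.
    for name in ("db1039.eqiad.wmnet", "db1041.eqiad.wmnet", "db1062.eqiad.wmnet"):
        print(name, probe(name))
```

Running the same script on silver and on labtestweb2001 would show whether the two hosts are treated differently at the network level.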

@Andrew, given that these servers host their own DBs, we don't want to depend on the central DB clusters, right? So should this functionality be turned off? If not, we'll need to get the networking and grants sorted out.

Event Timeline

Krenair assigned this task to Andrew.
Krenair raised the priority of this task to Needs Triage.
Krenair updated the task description.
Krenair added subscribers: Krenair, Andrew.

I think I'm fine with turning it off... it isn't working now anyway, is it?

It's working from wikitech (silver) because of some special grants that labtestwikitech (labtestweb2001) doesn't appear to have.

Alex has pointed out that this dependency means that history pages will break on wikitech if/when the main db servers go down. That seems potentially problematic, so let's disable the feature.

Change 264980 had a related patch set uploaded (by Alex Monk):
Disable global abuse filters on nonglobal wikis

https://gerrit.wikimedia.org/r/264980

Change 264980 merged by jenkins-bot:
Disable global abuse filters on nonglobal wikis

https://gerrit.wikimedia.org/r/264980

So now I think we should work out why silver seems to be able to connect to some s7 slaves and not others, and why different rules are applied to labtestweb2001.

It was causing the same kind of problems, but from a different source: T121866 (left here for reference).

Example error (only db1041 is failing):

{
  "_index": "logstash-2016.01.19",
  "_type": "mediawiki",
  "_id": "AVJbqY0KptxhN1XaX8G7",
  "_score": null,
  "_source": {
    "message": "Error connecting to 10.64.16.30: :real_connect(): (HY000/2003): Can't connect to MySQL server on '10.64.16.30' (4)",
    "@version": 1,
    "@timestamp": "2016-01-19T20:53:04.000Z",
    "type": "mediawiki",
    "host": "silver",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "es",
      "normalized_message_untrimmed"
    ],
    "channel": "wfLogDBError",
    "url": "/w/index.php?title=Database_snapshots&action=history",
    "ip": "[SANITIZED]",
    "http_method": "GET",
    "server": "wikitech.wikimedia.org",
    "referrer": null,
    "uid": "822282b",
    "process_id": 25538,
    "wiki": "labswiki",
    "db_server": "10.64.16.30",
    "db_name": "metawiki",
    "db_user": "wikiuser",
    "method": "DatabaseMysqlBase::open",
    "error": ":real_connect(): (HY000/2003): Can't connect to MySQL server on '10.64.16.30' (4)",
    "normalized_message": "Error connecting to 10.64.16.30: :real_connect(): (HY000/2003): Can't connect to MySQL server on '10.64.16.30' (4)"
  },
  "sort": [
    1453236784000
  ]
}

I do not think it is grant-related: the error would be permission denied, and db1041 is a "generic group" server, sharing the same grants for wikiuser as db1039 and db1062, the other generic MySQL servers, which have even larger weights.

My bet is on firewall: db1041 is the only server on s7 with the new firewall rules deployed.
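The grants-versus-network reasoning can be applied mechanically to logged messages by looking at the MySQL error code. The sketch below is an illustration, not part of any existing tooling: `classify` is a hypothetical helper, and it only knows the standard codes 2002/2003 (client could not connect) and 1044/1045 (access denied).

```python
import re

# Client-side "can't connect" errors: the TCP/socket connection never came up.
NETWORK_CODES = {2002, 2003}
# Server-side "access denied" errors: the connection worked, the grants did not.
GRANT_CODES = {1044, 1045}

# Matches the "(SQLSTATE/code)" part, e.g. "(HY000/2003)".
CODE_RE = re.compile(r"\([0-9A-Z]{5}/(\d+)\)")


def classify(message):
    """Return 'network', 'grants', 'other' or 'unknown' for a logged error."""
    match = CODE_RE.search(message)
    if match is None:
        return "unknown"
    code = int(match.group(1))
    if code in NETWORK_CODES:
        return "network"
    if code in GRANT_CODES:
        return "grants"
    return "other"


if __name__ == "__main__":
    msg = (
        "Error connecting to 10.64.16.30: :real_connect(): (HY000/2003): "
        "Can't connect to MySQL server on '10.64.16.30' (4)"
    )
    print(classify(msg))  # prints "network"
```

For the error logged on silver this yields "network", which is consistent with a firewall rather than a grants problem.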

From sites.pp:

node /^db10(41)\.eqiad\.wmnet/ {

    $cluster = 'mysql'
    class { 'role::mariadb::core':
        shard => 's7',
        p_s   => 'on',
    }
    include base::firewall
}

Change 265127 had a related patch set uploaded (by Andrew Bogott):
Disable abusefilter on wikitech

https://gerrit.wikimedia.org/r/265127

Change 265127 abandoned by Andrew Bogott:
Disable abusefilter on wikitech

Reason:
Krenair> andrewbogott, I don't think we should disable the whole of abusefilter

https://gerrit.wikimedia.org/r/265127

Change 265135 had a related patch set uploaded (by Alex Monk):
Really disable global abusefilters on the nonglobal wikis

https://gerrit.wikimedia.org/r/265135

Change 265135 merged by jenkins-bot:
Really disable global abusefilters on the nonglobal wikis

https://gerrit.wikimedia.org/r/265135

Should this be closed now (no more db errors), or should the firewall be changed?

Don't the servers still have inconsistent database access (whether by network restrictions or grants)?

@Krenair yes, but that is only because T120122 is ongoing (and will take ~6 months to fully cover all servers). Only 1 server out of every shard has the new configuration, precisely to detect potential issues like this before applying it to all servers.

I guess this is resolved.