s5-master contention caused (?) by refreshlinksprioritized/addUsagesForPage jobs running for all wikis
Closed, Resolved · Public · PRODUCTION ERROR

Description

There is a high number of connection errors to 10.64.16.144 (db1049, the s5-master), probably caused by a high number of connections such as:

{
  "_index": "logstash-2016.09.19",
  "_type": "mediawiki",
  "_id": "AVdD7nZU2B4w3SKhWGu0",
  "_score": null,
  "_source": {
    "message": "Error connecting to 10.64.16.144: Can't connect to MySQL server on '10.64.16.144' (4)",
    "@version": 1,
    "@timestamp": "2016-09-19T19:31:23.000Z",
    "type": "mediawiki",
    "host": "mw1299",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "es"
    ],
    "channel": "wfLogDBError",
    "normalized_message": "Error connecting to {db_server}: {error}",
    "url": "/rpc/RunJobs.php?wiki=dewiki&type=wikibase-addUsagesForPage&maxtime=60&maxmem=300M",
    "ip": "127.0.0.1",
    "http_method": "POST",
    "server": "127.0.0.1",
    "referrer": null,
    "wiki": "dewiki",
    "mwversion": "1.28.0-wmf.18",
    "reqId": "a7a0b44622b72ad31d160f12",
    "db_server": "10.64.16.144",
    "db_name": "dewiki",
    "db_user": "wikiuser",
    "method": "DatabaseMysqlBase::open",
    "error": "Can't connect to MySQL server on '10.64.16.144' (4)"
  },
  "fields": {
    "@timestamp": [
      1474313483000
    ]
  },
  "highlight": {
    "channel.raw": [
      "@kibana-highlighted-field@wfLogDBError@/kibana-highlighted-field@"
    ]
  },
  "sort": [
    1474313483000
  ]
}

This could be the cause or just a consequence, since this job is very common.

Here is a sample of errors: https://logstash.wikimedia.org/goto/11cd5759017b61d371035de09a41531c

The error rate was already elevated before, but at 17:00 UTC today there was a spike of 2000 errors in 5 minutes, followed by a continuous tail of ~100 errors per 5 minutes. This could be just a spike in activity that will disappear, or it could be something substantial (a change in code patterns).

75% of the current database errors are connection errors to this server, which is not normal. Looking at https://grafana-admin.wikimedia.org/dashboard/db/mysql?from=1473709162759&to=1474313962759&var-dc=eqiad%20prometheus%2Fops&var-server=db1049 I can see a pattern change at 6-8 UTC today, increasing sharply at 17 UTC, but I see nothing strange on the infrastructure side.

Event Timeline

jcrespo renamed this task from "s5-master contention caused (?) by refreshlinksprioritized job running for all wikis" to "s5-master contention caused (?) by refreshlinksprioritized/addUsagesForPage jobs running for all wikis". (Sep 19 2016, 7:44 PM)

Since 98bd2437ae38f395a8b47e6895793e88ca3ae6b4 we use larger transactions in EntityUsageTable: one for the whole set (up to 400(?) rows), instead of one per batch (up to 100 rows).
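
Roughly, the difference is whether each batch of inserts runs in its own transaction or everything is wrapped in a single atomic section. A minimal sketch of the two shapes, assuming EntityUsageTable-style batching into wbc_entity_usage (chunk sizes and row contents are illustrative, not the actual code):

// Before: one transaction per batch of up to 100 rows
foreach ( array_chunk( $rows, 100 ) as $batch ) {
    $dbw->insert( 'wbc_entity_usage', $batch, __METHOD__, [ 'IGNORE' ] );
}

// After 98bd2437: one atomic section around the whole set (up to ~400 rows),
// so the transaction on the master stays open until every batch has been written
$dbw->startAtomic( __METHOD__ );
foreach ( array_chunk( $rows, 100 ) as $batch ) {
    $dbw->insert( 'wbc_entity_usage', $batch, __METHOD__, [ 'IGNORE' ] );
}
$dbw->endAtomic( __METHOD__ );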

If this is indeed the cause, we presumably have to do what I suggested on the change: "Should this cause problems with long lasting transactions, the atomic sections should be removed instead."

Does addUsages() get called when no other writes are pending commit? If so, you can do the usual getEmptyTransactionTicket/commitAndWaitForReplication dance. If not, you'd have to pass the ticket down from above...
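
For reference, a minimal sketch of that dance, assuming the caller can reach the LBFactory and that no other writes are pending when addUsages() runs (the table name and chunking are illustrative; only the two ticket methods are the actual API):

$lbFactory = \MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
// Only yields a usable ticket when no transaction with pending writes is open
$ticket = $lbFactory->getEmptyTransactionTicket( __METHOD__ );

foreach ( array_chunk( $rows, 100 ) as $batch ) {
    $dbw->insert( 'wbc_entity_usage', $batch, __METHOD__, [ 'IGNORE' ] );
    // Commit this batch and wait for replicas to catch up before writing the next one,
    // keeping transactions on the master short and replication lag bounded
    $lbFactory->commitAndWaitForReplication( __METHOD__, $ticket );
}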

The errors have slowed down to the point that they have disappeared. I will close this this afternoon as invalid / a one-time spike unless they reappear.

jcrespo claimed this task.

There are still some spikes here and there, but the level of errors is low and it is only 30% of the total, so I am resolving this. Probably there was an ongoing import causing higher load than usual. The current state is not worrying.

mmodell changed the subtype of this task from "Task" to "Production Error". (Aug 28 2019, 11:11 PM)