
High replication lag to dewiki
Closed, ResolvedPublic

Description

tools.taxonbot has been reporting high replication lag since May 12, 2016, 03:30 UTC, when reading from and writing to dewiki.

I don't know which MariaDB it is. ...

Event Timeline

Restricted Application added subscribers: Zppix, Luke081515, Aklapper.
doctaxon triaged this task as Unbreak Now! priority. May 12 2016, 4:00 AM

dewiki reports high database utilization when showing recent changes or user contributions.

mzmcbride@tools-bastion-03:~$ mysql -hdewiki.labsdb dewiki_p -e "select max(rc_timestamp) from recentchanges;"
+-------------------+
| max(rc_timestamp) |
+-------------------+
| 20160512042602    |
+-------------------+
mzmcbride@tools-bastion-03:~$ date
Thu May 12 04:26:07 UTC 2016

What replication lag are you seeing?
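The check above compares the newest recentchanges timestamp on the replica with the wall clock. A minimal sketch of that arithmetic, using the two values shown (the function name is hypothetical, and the result is only an upper bound on lag, since it also includes the time since the last edit was made):

```python
from datetime import datetime, timezone

def replica_lag_seconds(max_rc_timestamp: str, now: datetime) -> float:
    """Seconds between the newest replicated change and `now`.

    `max_rc_timestamp` uses the 14-digit YYYYMMDDHHMMSS form that
    the query above returned; it is interpreted as UTC.
    """
    newest = datetime.strptime(max_rc_timestamp, "%Y%m%d%H%M%S")
    newest = newest.replace(tzinfo=timezone.utc)
    return (now - newest).total_seconds()

# Values copied from the mysql and date output above.
now = datetime(2016, 5, 12, 4, 26, 7, tzinfo=timezone.utc)
print(replica_lag_seconds("20160512042602", now))  # 5.0
```

By this measure the labsdb replica was about 5 seconds behind at the time of the check.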

@MZMcBride Try to read or write content via the API from tools-bastion-03 on dewiki, and you'll see long-lasting replication lag. Also look at the error line on dewiki user contributions.

Looking at https://de.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb= currently, I see:

{
    "batchcomplete": "",
    "query": {
        "dbrepllag": [
            {
                "host": "db1049",
                "lag": 0
            },
            {
                "host": "db1026",
                "lag": 93
            },
            {
                "host": "db1045",
                "lag": 0
            },
            {
                "host": "db1070",
                "lag": 0
            },
            {
                "host": "db1071",
                "lag": 0
            }
        ]
    }
}
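For bots that want to check this automatically, here is a rough sketch of picking the lagged hosts out of a response shaped like the one above. The function name and the threshold parameter are assumptions for illustration; the sample data is abbreviated from the output shown.

```python
import json

def lagged_hosts(response: dict, threshold: int = 0) -> list:
    """Return (host, lag) pairs whose lag exceeds `threshold` seconds."""
    entries = response["query"]["dbrepllag"]
    return [(e["host"], e["lag"]) for e in entries if e["lag"] > threshold]

# Abbreviated copy of the siteinfo response shown above.
sample = json.loads("""
{
    "batchcomplete": "",
    "query": {
        "dbrepllag": [
            {"host": "db1049", "lag": 0},
            {"host": "db1026", "lag": 93},
            {"host": "db1045", "lag": 0}
        ]
    }
}
""")

print(lagged_hosts(sample))  # [('db1026', 93)]
```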

@MZMcBride

dewiki user contributions error line gives: Due to high database server lag, changes newer than 98 seconds may not be shown in this list.

Looks like it's all okay again. It lasted about 2 hours. What was going on?

I don't see a delay on rc or other special pages and dbrepllag seems 0 at the moment.

{
    "batchcomplete": "",
    "query": {
        "dbrepllag": [
            {
                "host": "db1049",
                "lag": 0
            },
            {
                "host": "db1026",
                "lag": 0
            },
            {
                "host": "db1045",
                "lag": 0
            },
            {
                "host": "db1070",
                "lag": 0
            },
            {
                "host": "db1071",
                "lag": 0
            }
        ]
    }
}

Yes, since 05:10 UTC everything is okay again. But I'd like to know what was going on there.

valhallasw edited projects, added DBA, SRE; removed Cloud-Services, Toolforge, Tool-Database-Queries.

Presumably there was a high write load and db1026 couldn't keep up.

Because this bug is about the production database, please tag it as SRE rather than Cloud-Services. Apart from that, please do not set priority yourself -- the priority conveys the priority of the people working on it, not the priority for those that are waiting for the issue to be resolved.

valhallasw lowered the priority of this task from Unbreak Now! to Needs Triage. May 12 2016, 2:01 PM

@doctaxon one of the largest dewiki servers went down a few days ago. New servers have already been requested and will be up soon. In particular, to mitigate the load issues, I decided to sacrifice the "recentchanges" capacity (instead of all other traffic for all other functions), as you can see on our config https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php :

'db1026' => 0,   # 1.4TB  64GB, watchlist, recentchanges, contributions, logpager; vslow, dump until new servers are in production

Hopefully this is only a very occasional problem, and it will not happen again once the new, more powerful servers are up.

Hi!

Thanks to jcrespo for reopening:
On May 15th, from 17:50 to 20:30 UTC, host db1026 on dewiki was very laggy again. @hoo wrote on IRC that it was caused by huge amounts of edits on Wikidata, and that he will probably block the bot.

My problem is that I run bots on dewiki that have to be started at fixed times. If host db1026 has replication lag, API queries and token queries are not possible, so the bots cannot work. The Wikipedia users who depend on the bots' work are not amused, and I am the one left with the problem, which I am trying to solve with this task.

@jcrespo and I think that dewiki and the Wikidata database on host db1026 should be separated onto different hosts. Please solve this issue, so that my bots and I can work more efficiently and reliably.

Thank you very much indeed ...

The user who caused the db lag yesterday is now using the api's maxlag parameter, so I hope this no longer is a problem.
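As an illustration of what using maxlag means on the client side, here is a minimal sketch. It assumes the documented MediaWiki behaviour that a request sent with maxlag=N is rejected with error code "maxlag" and a Retry-After header while replicas are more than N seconds behind. The names with_maxlag, call_with_backoff and the fetch callable are hypothetical; fetch stands in for whatever HTTP client the bot uses and is expected to return the parsed JSON body plus the response headers.

```python
import time
from typing import Callable, Dict, Tuple

Fetch = Callable[[Dict[str, str]], Tuple[dict, Dict[str, str]]]

def with_maxlag(params: Dict[str, str], maxlag: int = 5) -> Dict[str, str]:
    """Copy of `params` with maxlag and JSON output requested."""
    return {**params, "maxlag": str(maxlag), "format": "json"}

def call_with_backoff(fetch: Fetch, params: Dict[str, str],
                      retries: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> dict:
    """Retry a request while the API answers with a maxlag error.

    `fetch` is a hypothetical stand-in for the bot's HTTP layer.
    """
    for _ in range(retries):
        body, headers = fetch(with_maxlag(params))
        if body.get("error", {}).get("code") != "maxlag":
            return body
        # Wait as long as the server asks; fall back to 5 seconds.
        sleep(int(headers.get("Retry-After", 5)))
    raise RuntimeError("replicas still lagged after %d attempts" % retries)
```

A bot written this way pauses its edits while the replicas are behind instead of adding to the write load.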

If immediate issues are not likely to happen, I would wait for the new servers to be set up and then reevaluate. If wikidata is very prone to bot edits/imports/etc., it may make sense to separate it the same way that commons is already.

Joe changed the task status from Open to Stalled. May 16 2016, 11:54 AM
Joe triaged this task as High priority.

Change 289147 had a related patch set uploaded (by Jcrespo):
Empty db1026 except for vslow, dump

https://gerrit.wikimedia.org/r/289147

Change 289147 merged by Jcrespo:
Empty db1026 except for vslow, dump

https://gerrit.wikimedia.org/r/289147

Change 289149 had a related patch set uploaded (by Jcrespo):
Set db1026 back as rc node; move roles around

https://gerrit.wikimedia.org/r/289149

Change 289149 merged by Jcrespo:
Set db1026 back as rc node; move roles around

https://gerrit.wikimedia.org/r/289149

Mentioned in SAL [2016-05-17T08:15:15Z] <jynus> reducing durability and enabling GTID on db1026 T135100

jcrespo changed the task status from Stalled to Open. May 17 2016, 1:17 PM
jcrespo claimed this task.
jcrespo moved this task from Triage to In progress on the DBA board.