
Replication lag on multiple databases on tool-labs
Closed, Resolved, Public

Description

https://tools.wmflabs.org/betacommand-dev/cgi-bin/replag

MariaDB [dewiki_p]>  SELECT UNIX_TIMESTAMP() - UNIX_TIMESTAMP(MAX(rc_timestamp)) FROM recentchanges;
+------------------------------------------------------+
| UNIX_TIMESTAMP() - UNIX_TIMESTAMP(MAX(rc_timestamp)) |
+------------------------------------------------------+
|                                         32647.000000 |
+------------------------------------------------------+
1 row in set (0.00 sec)

MariaDB [dewiki_p]> USE commonswiki_p;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [commonswiki_p]>  SELECT UNIX_TIMESTAMP() - UNIX_TIMESTAMP(MAX(rc_timestamp)) FROM recentchanges;
+------------------------------------------------------+
| UNIX_TIMESTAMP() - UNIX_TIMESTAMP(MAX(rc_timestamp)) |
+------------------------------------------------------+
|                                         32659.000000 |
+------------------------------------------------------+
1 row in set (0.00 sec)
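The lag values above are in seconds. As a rough sketch, MariaDB's `SEC_TO_TIME` function can turn them into hours, minutes and seconds; the aliases `dewiki_lag`, `commonswiki_lag` and `replag_hms` are made up for illustration and are not part of the original report.

```sql
-- Convert the two lag values reported above from seconds to HH:MM:SS.
-- 32647 s = 9 h 4 min 7 s, i.e. roughly nine hours behind.
SELECT SEC_TO_TIME(32647) AS dewiki_lag,       -- 09:04:07
       SEC_TO_TIME(32659) AS commonswiki_lag;  -- 09:04:19

-- The same conversion applied directly to the replag query.
-- Assumes the current database is a wiki replica with a recentchanges table.
SELECT SEC_TO_TIME(UNIX_TIMESTAMP() - UNIX_TIMESTAMP(MAX(rc_timestamp))) AS replag_hms
FROM recentchanges;
```

Both reported values work out to a little over nine hours.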

Event Timeline

Steinsplitter raised the priority of this task from to Unbreak Now!.
Steinsplitter updated the task description.
Steinsplitter added a project: Toolforge.
Steinsplitter subscribed.
Restricted Application added a subscriber: Aklapper.
Steinsplitter renamed this task from Replications lag on multiple databases to Replication lag on multiple databases on tool-labs. Jul 11 2015, 11:34 AM
Steinsplitter set Security to None.

labsdb1002 crashed yesterday at 2:38 UTC due to excessive memory usage. I've restarted replication, there is not much to do now but wait.

ah, replag has been corrected now. It's over. Thanks!

But if it crashed yesterday at 2:38 UTC, it was still running until 10:15 UTC today.

jcrespo claimed this task.

db was running, replication wasn't. Replication being stopped for 8 hours is consistent with the effects seen. Does that answer your question?

However, there seems to be corruption in some user-created tables (T105503), as some people use unsafe engines such as MyISAM.

Jcrespo: Why could my tasks work until 10:15 AM UTC today if, as you say, it crashed at 2:38 UTC yesterday?

When mysql crashes, mysqld_safe (the watchdog process) restarts it automatically. To avoid replication errors, replication is configured not to restart automatically and requires human intervention.
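As a hedged illustration of that behaviour (not the exact procedure used on labsdb1002), an administrator with replication privileges would typically inspect and resume replication by hand once the server is back up. The statements below are standard MariaDB; the `skip-slave-start` server option named in the comments is an assumption about how such a setup is commonly configured, not something confirmed in this task.

```sql
-- After mysqld_safe has restarted the server, the database answers queries
-- again, but the replication threads stay stopped if the server was started
-- with an option such as skip-slave-start (assumed here, not confirmed).

-- Inspect the replication threads. Slave_IO_Running and Slave_SQL_Running
-- show "No" while replication is stopped; Last_Error shows any failure.
SHOW SLAVE STATUS\G

-- Once the state has been checked by a human, resume replication.
START SLAVE;

-- Seconds_Behind_Master in the status output should then shrink over time
-- as the replica catches up.
SHOW SLAVE STATUS\G
```

These commands need administrative privileges on the replica, so ordinary tool accounts can only watch the lag from the outside with queries like the one in the task description.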

While I am ok with answering questions on IRC, please do not reopen a task unless it has been closed incorrectly. If there is another issue with the databases, open a new task. Thank you!

This task is resolved only for dewiki_p but not for commonswiki_p; there is still replication lag.

> labsdb1002 crashed yesterday at 2:38 UTC due to excessive memory usage. I've restarted replication, there is not much to do now but wait.

Are you aware that there is still a replag on commonswiki_p which is blocking a lot of stuff on commons?

MariaDB [commonswiki_p]> SELECT UNIX_TIMESTAMP() - UNIX_TIMESTAMP(MAX(rc_timestamp)) AS replag FROM recentchanges;
+--------------+
| replag       |
+--------------+
| 45998.000000 |
+--------------+
1 row in set (0.01 sec)
Nemo_bis updated the task description.
Nemo_bis subscribed.

Regardless of what happened to the primary MariaDB server (WMF switched to this a while back), the actual database replication is still non-functional. As of now we are at one day 4 hours of lag and growing.

Lag seems to have suddenly resolved: it is mostly under a few minutes, with few exceptions.

This issue seems to have caused some serious problems on replica servers. See T105713.

Superyetkin, your issue is not about replication and is already tracked in its own report; please don't reopen this one.