
labsdb1002 and labsdb1003 crashed, affecting replication
Closed, Resolved · Public

Assigned To
jcrespo
Authored By
Luke081515
Nov 22 2015, 12:07 AM

Description

The replag keeps growing; since 17:15-17:30 UTC it looks like replication has stopped completely.

It's now Sunday evening and the last edit is still the same. Replication is broken:

MariaDB [wikidatawiki_p]> SELECT rc_timestamp FROM recentchanges ORDER BY rc_timestamp DESC LIMIT 1;
+----------------+
| rc_timestamp   |
+----------------+
| 20151121171514 |
+----------------+
1 row in set (0.01 sec)

Event Timeline

Luke081515 raised the priority of this task to Unbreak Now!.
Luke081515 updated the task description. (Show Details)
Luke081515 added projects: DBA, Cloud-Services.
Luke081515 subscribed.

labsdb1001 is ok, labsdb1002 is not. I didn't check any others.

hoo added a subscriber: jcrespo.

s3 is also experiencing serious issues.

Multichill renamed this task from "High replag at labs db" to "Replication broken om labsdb1002.". Nov 22 2015, 9:31 PM
Multichill updated the task description. (Show Details)
Multichill added projects: SRE, Toolforge.
Giftpflanze renamed this task from "Replication broken om labsdb1002." to "Replication broken on labsdb1002.". Nov 22 2015, 9:37 PM
Giftpflanze updated the task description. (Show Details)

And this is why your DBA cannot take a single day of vacation.

Waiting for replication lag to go back to 0 to close this ticket.

@jcrespo Replag at dewiki and commons is still growing :-(

Sadly, after a server crashes it starts with empty caches, so I would expect the lag to keep growing for a while. I am monitoring it, just in case there are more issues; I will keep you updated if I identify other problems.

There are actually some issues; investigating.

jcrespo renamed this task from "Replication broken on labsdb1002." to "labsdb1002 and labsdb1003 crashed, affecting replication". Nov 23 2015, 12:58 PM
jcrespo claimed this task.
jcrespo moved this task from Done to In progress on the DBA board.
jcrespo updated the task description. (Show Details)

So, the recovery thread was still running, effectively putting the hosts in read-only mode (not literally, but for enough tables to block the replication process fully). labsdb1002 has finally started its replication process and is very slowly catching up. I expect labsdb1003 to do the same soon; if not, I will switch over users of the third server to the others.

Corruption is still present; I am trying to repair the tables manually. For that, I need to take the labsdb servers down for some time. This may take a while, as there are thousands of tables to check and the logs do not indicate exactly which tables are affected. I cannot give an ETA yet.
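
Roughly, the kind of per-table sweep this means on MariaDB (a sketch with placeholder names; the engine filter is an assumption, not the exact commands being run):

-- list candidate tables (Aria/MyISAM are the engines that can end up flagged as crashed)
MariaDB LABS localhost (none) > SELECT table_schema, table_name
    ->   FROM information_schema.tables
    ->  WHERE engine IN ('Aria', 'MyISAM');
-- then, for each candidate (placeholder name shown):
MariaDB LABS localhost (none) > CHECK TABLE `some_db`.`some_table`;
MariaDB LABS localhost (none) > REPAIR TABLE `some_db`.`some_table`;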

I think I may have fixed the corruption on labsdb1003, but sadly some data was lost in the process. I cannot guarantee the accuracy of its contents right now, but at least replication is working, so I think we have only lost a few users and revisions, which can be backfilled later.

MariaDB LABS localhost (none) > pager grep Seconds
PAGER set to 'grep Seconds'
MariaDB LABS localhost (none) > SHOW ALL SLAVES STATUS\G
        Seconds_Behind_Master: 105834
        Seconds_Behind_Master: 100679
        Seconds_Behind_Master: 103366
        Seconds_Behind_Master: 105613
        Seconds_Behind_Master: 88615
        Seconds_Behind_Master: 105311
        Seconds_Behind_Master: 99352
7 rows in set (0.00 sec)

MariaDB LABS localhost (none) > SHOW ALL SLAVES STATUS\G
        Seconds_Behind_Master: 105826
        Seconds_Behind_Master: 100675
        Seconds_Behind_Master: 103359
        Seconds_Behind_Master: 105611
        Seconds_Behind_Master: 88604
        Seconds_Behind_Master: 105302
        Seconds_Behind_Master: 99340
7 rows in set (0.00 sec)

I will now focus on labsdb1002 (so I will restart that server and make it unavailable for some time).

Short-term fix done on labsdb1002, too:

MariaDB LABS localhost (none) > pager grep Seconds
PAGER set to 'grep Seconds'
MariaDB LABS localhost (none) > SHOW ALL SLAVES STATUS\G
        Seconds_Behind_Master: 110965
        Seconds_Behind_Master: 171886
        Seconds_Behind_Master: 174191
        Seconds_Behind_Master: 173584
        Seconds_Behind_Master: 170065
        Seconds_Behind_Master: 169644
        Seconds_Behind_Master: 172763
7 rows in set (0.00 sec)

MariaDB LABS localhost (none) > SHOW ALL SLAVES STATUS\G
        Seconds_Behind_Master: 110953
        Seconds_Behind_Master: 171869
        Seconds_Behind_Master: 174172
        Seconds_Behind_Master: 173565
        Seconds_Behind_Master: 170045
        Seconds_Behind_Master: 169607
        Seconds_Behind_Master: 172752

@jcrespo: Do you update DNS entries before you kill a database server? We use names like "wikidatawiki.labsdb", and that way the impact is reduced a lot.

@Multichill No, that was done on purpose. The servers were killed because of an OOM; with 2 of our 3 servers down, redirecting everything to a single server would bring down the only one we have left.

I know this is more work, but I think the community would prefer that you do a fresh import of the data. Having an unknown volume of data corruption in a table is not a good idea. Starting with a fresh copy would ensure the reliability of the data in question.

@Betacommand I hear you, but doing a full import will take down the servers for 5 days. Having only 1 server available and up to date is not a good state. Once we get to a normal state in terms of availability, we can consider it.

But always in a scheduled way, with enough warning and a proper switchover, not during an unbreak-now event.

So, to give you a general update:

  1. After an upgrade and a configuration change to work around a bug in the storage engine/parallel replication (see the sketch after this list), replag is slowly catching up. The ETA to get back to 0 lag is about 10 hours for both servers.
  2. Some data might be missing or corrupt and should be reported if important, especially after those 10 hours (I do not think the amount is large).
  3. When replication has caught up, we will reevaluate the state. We can correct small issues online by doing consistency checks and syncs from production. We will consider doing a reimport, as we did some months ago, but that will require taking down one server at a time in a scheduled window.
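
As a sketch of the configuration change in point 1, assuming the workaround is to disable parallel apply on the slaves (the actual variable values used are not listed here):

MariaDB LABS localhost (none) > STOP ALL SLAVES;
-- slave_parallel_threads can only be changed while the slaves are stopped
MariaDB LABS localhost (none) > SET GLOBAL slave_parallel_threads = 0;
MariaDB LABS localhost (none) > START ALL SLAVES;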

I will give you a broader update on why this happened once everything is fixed, but I can already tell you it is due to the constraints we have for labs in terms of hardware and human resources. In order to prevent this in the future, I would like to discuss enforcing stronger constraints per user/tool.

On the good side, I think the upgrade I had to do as part of the fix has also fixed the replication lag table update, and that will be available soon.

Edit: To clarify about corruption: there is no physical (hardware, filesystem) or logical corruption anymore; I have manually checked every single table. However, some tables could have a few missing or extra rows, differing from our production data. Sadly, importing is not a straightforward process, because the replicas always differ from production to some extent due to the filtering that is done for privacy purposes. That is why I say we need to evaluate after getting back to a normal state.
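
As an illustration of the consistency checks mentioned above, a sketch of a bounded row-count spot check (placeholder wiki, table and cutoff; some difference from production is always expected because of the filtered views):

-- run the same count on production and on the labs replica for the same wiki,
-- bounded by a timestamp well before the crash so replication lag does not skew it
MariaDB [dewiki_p]> SELECT COUNT(*) FROM revision WHERE rev_timestamp < '20151121000000';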

jcrespo lowered the priority of this task from Unbreak Now! to High. Nov 23 2015, 7:32 PM
jcrespo raised the priority of this task from High to Unbreak Now!. Nov 23 2015, 8:36 PM

Labsdb1002 crashed again, creating new corruption on some user tables:

151123 20:02:14 [ERROR] mysqld: Table './p50380g50803__mp/mentee_mentor' is marked as crashed and should be repaired
151123 20:02:14 [Warning] Checking table:   './p50380g50803__mp/mentee_mentor'
151123 20:10:09 [ERROR] mysqld: Table './p50380g50803__mp/mentee' is marked as crashed and should be repaired
151123 20:10:09 [Warning] Checking table:   './p50380g50803__mp/mentee'

Looking into it.
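
For reference, the usual immediate fix for tables flagged like the ones in that log is a check/repair pass over them; a sketch (assuming they can be repaired in place, which is not guaranteed if the underlying files are badly damaged):

MariaDB LABS localhost (none) > CHECK TABLE p50380g50803__mp.mentee_mentor, p50380g50803__mp.mentee;
MariaDB LABS localhost (none) > REPAIR TABLE p50380g50803__mp.mentee_mentor, p50380g50803__mp.mentee;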

After fixing this issue (and configuring the databases so that they can recover automatically if they crash), I will temporarily enforce stricter per-user query limits, until we can agree on appropriate usage rules that allow everybody proper use of the databases without affecting other users or the stability of the system.
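
Per-user limits of that kind can be expressed with MariaDB account resource limits; a sketch with a hypothetical tool account and made-up numbers, not the actual limits that will be applied:

MariaDB LABS localhost (none) > GRANT USAGE ON *.* TO 's51234'@'%'
    ->   WITH MAX_USER_CONNECTIONS 10
    ->        MAX_QUERIES_PER_HOUR 50000;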

jcrespo lowered the priority of this task from Unbreak Now! to High. Nov 24 2015, 11:16 AM

The previous tables have been checked. Things seem back to normal.

This is not 100% fixed for me; some of the checks and actionables I mentioned are still pending, although at lower priority. I suppose we can track those on separate tickets, which I will reference here.