
labsdb1010 crashed
Closed, Resolved · Public

Description

Replication on labsdb1010 for s4 is broken. This is most likely a consequence of the crash it had:

Feb 06 05:51:45 labsdb1010 mysqld[3351]: 180206  5:51:45 [ERROR] mysqld got signal 11 ;
Feb 06 05:51:45 labsdb1010 mysqld[3351]: 2018-02-06  5:51:45 140303701526272 [Warning] Event Scheduler: [root@localhost][ops.wmf_labs_slow_duplicates] Data truncated for column 'STATE' at row 1
Feb 06 05:51:45 labsdb1010 mysqld[3351]: This could be because you hit a bug. It is also possible that this binary
Feb 06 05:51:45 labsdb1010 mysqld[3351]: or one of the libraries it was linked against is corrupt, improperly built,
Feb 06 05:51:45 labsdb1010 mysqld[3351]: or misconfigured. This error can also be caused by malfunctioning hardware.
Feb 06 05:51:45 labsdb1010 mysqld[3351]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Feb 06 05:51:45 labsdb1010 mysqld[3351]: 2018-02-06  5:51:45 140303701526272 [Warning] Event Scheduler: [root@localhost][ops.wmf_labs_slow_duplicates] Data truncated for column 'STATE' at row 1
Feb 06 05:51:45 labsdb1010 mysqld[3351]: We will try our best to scrape up some info that will hopefully help
Feb 06 05:51:45 labsdb1010 mysqld[3351]: diagnose the problem, but since we have already crashed,
Feb 06 05:51:45 labsdb1010 mysqld[3351]: something is definitely wrong and this may fail.
Feb 06 05:51:45 labsdb1010 mysqld[3351]: 2018-02-06  5:51:45 140303701526272 [Warning] Event Scheduler: [root@localhost][ops.wmf_labs_slow_duplicates] Data truncated for column 'STATE' at row 1
Feb 06 05:51:45 labsdb1010 mysqld[3351]: Server version: 10.1.30-MariaDB
Feb 06 05:51:45 labsdb1010 mysqld[3351]: key_buffer_size=134217728
Feb 06 05:51:45 labsdb1010 mysqld[3351]: 2018-02-06  5:51:45 140303701526272 [Warning] Event Scheduler: [root@localhost][ops.wmf_labs_slow_duplicates] Data truncated for column 'STATE' at row 1
Feb 06 05:51:45 labsdb1010 mysqld[3351]: read_buffer_size=131072
Feb 06 05:51:45 labsdb1010 mysqld[3351]: max_used_connections=9
Feb 06 05:51:45 labsdb1010 mysqld[3351]: 2018-02-06  5:51:45 140303701526272 [Warning] Event Scheduler: [root@localhost][ops.wmf_labs_slow_duplicates] Data truncated for column 'STATE' at row 17
Feb 06 05:51:45 labsdb1010 mysqld[3351]: max_threads=1026
Feb 06 05:51:45 labsdb1010 mysqld[3351]: thread_count=6
Feb 06 05:51:45 labsdb1010 mysqld[3351]: 2018-02-06  5:51:45 140303701526272 [Warning] Event Scheduler: [root@localhost][ops.wmf_labs_slow_duplicates] Data truncated for column 'STATE' at row 17
Feb 06 05:51:45 labsdb1010 mysqld[3351]: It is possible that mysqld could use up to
Feb 06 05:51:45 labsdb1010 mysqld[3351]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2384972 K  bytes of memory
Feb 06 05:51:45 labsdb1010 mysqld[3351]: Hope that's ok; if not, decrease some variables in the equation.
Feb 06 05:51:45 labsdb1010 mysqld[3351]: 2018-02-06  5:51:45 140303701526272 [Warning] Event Scheduler: [root@localhost][ops.wmf_labs_slow_duplicates] Data truncated for column 'STATE' at row 17
Feb 06 05:51:45 labsdb1010 mysqld[3351]: Thread pointer: 0x7f349083a008
Feb 06 05:51:45 labsdb1010 mysqld[3351]: 2018-02-06  5:51:45 140303701526272 [Warning] Event Scheduler: [root@localhost][ops.wmf_labs_slow_duplicates] Data truncated for column 'STATE' at row 17
Feb 06 05:51:45 labsdb1010 mysqld[3351]: Attempting backtrace. You can use the following information to find out
Feb 06 05:51:45 labsdb1010 mysqld[3351]: where mysqld died. If you see no messages after this, something went
Feb 06 05:51:45 labsdb1010 mysqld[3351]: terribly wrong...
Feb 06 05:51:45 labsdb1010 mysqld[3351]: 2018-02-06  5:51:45 140303701526272 [Warning] Event Scheduler: [root@localhost][ops.wmf_labs_slow_duplicates] Data truncated for column 'STATE' at row 17
Feb 06 05:51:45 labsdb1010 mysqld[3351]: stack_bottom = 0x7f9b09097c88 thread_stack 0x30000
Feb 06 05:51:45 labsdb1010 mysqld[3351]: (my_addr_resolve failure: fork)
Feb 06 05:51:45 labsdb1010 mysqld[3351]: /opt/wmf-mariadb101/bin/mysqld(my_print_stacktrace+0x2e) [0x55fa236a120e]
Feb 06 05:51:45 labsdb1010 mysqld[3351]: /opt/wmf-mariadb101/bin/mysqld(handle_fatal_signal+0x3bd) [0x55fa231e5a5d]
Feb 06 05:51:45 labsdb1010 mysqld[3351]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x110c0) [0x7f9b18c3a0c0]
Feb 06 05:51:45 labsdb1010 mysqld[3351]: /lib/x86_64-linux-gnu/libc.so.6(strlen+0x26) [0x7f9b17369646]
Feb 06 05:52:03 labsdb1010 systemd[1]: mariadb.service: Main process exited, code=killed, status=11/SEGV
Feb 06 05:52:03 labsdb1010 systemd[1]: mariadb.service: Unit entered failed state.
Feb 06 05:52:03 labsdb1010 systemd[1]: mariadb.service: Failed with result 'signal'.
Feb 06 05:52:09 labsdb1010 systemd[1]: mariadb.service: Service hold-off time over, scheduling restart.
Feb 06 05:52:09 labsdb1010 systemd[1]: Stopped MariaDB database server.
Feb 06 05:52:09 labsdb1010 systemd[1]: Starting MariaDB database server...

And this is the replication error:

Last_Error: Could not execute Update_rows_v1 event on table commonswiki.logging; Can't find record in 'logging', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1102-bin.000638, end_log_pos 830192000

Event Timeline

Restricted Application added a subscriber: Aklapper. · Feb 6 2018, 6:41 AM
Marostegui triaged this task as Normal priority. · Feb 6 2018, 6:41 AM
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2018-02-06T06:49:23Z] <marostegui> Fix replication on labsdb1010 - T186579

I have fixed the missing row there. Replication is now flowing again on s4. Time will tell whether we actually need to rebuild this host, should more errors show up after the crash.
We will see...
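
For reference, the usual alternative to re-inserting a missing row is to skip the single failing event. That is not what was done here (re-inserting the row keeps the data consistent), but a minimal sketch of it with MariaDB's multi-source syntax would look like this:

-- Not the fix applied in this task (the missing row was re-inserted instead),
-- but a commonly used alternative for a single broken row event. In MariaDB
-- multi-source replication, sql_slave_skip_counter acts on the connection
-- selected by default_master_connection.
STOP SLAVE 's4';
SET @@default_master_connection = 's4';
SET GLOBAL sql_slave_skip_counter = 1;
START SLAVE 's4';
SHOW SLAVE 's4' STATUS\G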

jcrespo added a subscriber: jcrespo. · Edited · Feb 6 2018, 9:47 AM

Because of errors that could already be present in both the data and the filesystem (corruption), we should try to rebuild, or at least schedule a rebuild of, the whole instance whenever one crashes without persistent InnoDB replication control (GTID). We have been stung in the past whenever we didn't (with worse issues), so this was a kind of "DBA rule" even before I joined here: not following it provoked many issues that we are still fixing nowadays.

We (I) can do this relatively quickly by copying it from the web one while depooled, for example.

We discussed this in private and it was clarified that we actually were on the same page; I just expressed myself in a way that was not very clear :-)
It is all clarified and yes, we will rebuild this host at some point :-)

Could not execute Delete_rows_v1 event on table hywiki.geo_tags; Can't find record in 'geo_tags', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1095-bin.004423, end_log_pos 400824621

Didn't last long...

Mentioned in SAL (#wikimedia-operations) [2018-02-08T11:37:26Z] <marostegui> Fix replication on labsdb1010 - T186579

I have inserted the missing row to keep replication flowing till we rebuild this host.
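
A rough sketch of what "inserting the missing row" involves, with entirely hypothetical values: the real row contents would be read from the upstream sanitarium host (db1095) first, and the section connection name and column subset below are placeholders, not taken from this task.

-- Sketch only: values, connection name and column subset are made up for
-- illustration; the real row is fetched from the upstream host before the
-- insert so the data stays consistent.
STOP SLAVE 'sX';              -- 'sX' = the section connection hywiki replicates under (placeholder)
SET SESSION sql_log_bin = 0;  -- keep the manual write out of the local binlog
INSERT INTO hywiki.geo_tags (gt_id, gt_page_id, gt_globe, gt_primary, gt_lat, gt_lon)
VALUES (123456, 654321, 'earth', 1, 40.18111100, 44.51361100);
SET SESSION sql_log_bin = 1;
START SLAVE 'sX';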

Last_SQL_Error: Could not execute Delete_rows_v1 event on table srwiki.geo_tags; Can't find record in 'geo_tags', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1095-bin.004435, end_log_pos 27446482

Mentioned in SAL (#wikimedia-operations) [2018-02-09T06:52:13Z] <marostegui> Fix replication on labsdb1010 - T186579

Last_SQL_Error: Could not execute Delete_rows_v1 event on table wikidatawiki.wb_changes; Can't find record in 'wb_changes', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1095-bin.004435, end_log_pos 155412031

Replication is now flowing again, but it will most likely fail again over the weekend :-(, so we need to rebuild this host as soon as we have some time.

jcrespo moved this task from In progress to Next on the DBA board. · Feb 14 2018, 11:56 AM
mysql:root@localhost [(none)]> show slave 's7' status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: db1102.eqiad.wmnet
                  Master_User: repl
                  Master_Port: 3317
                Connect_Retry: 60
              Master_Log_File: db1102-bin.000688
          Read_Master_Log_Pos: 1022302149
               Relay_Log_File: labsdb1010-relay-bin-s7.008142
                Relay_Log_Pos: 31769431
        Relay_Master_Log_File: db1102-bin.000688
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1032
                   Last_Error: Could not execute Delete_rows_v1 event on table fawiki.page_props; Can't find record in 'page_props', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1102-bin.000688, end_log_pos 765771676
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 765771504
              Relay_Log_Space: 393159125
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Delete_rows_v1 event on table fawiki.page_props; Can't find record in 'page_props', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1102-bin.000688, end_log_pos 765771676
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 171978764
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: No
                  Gtid_IO_Pos:
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative

Mentioned in SAL (#wikimedia-operations) [2018-02-25T07:35:52Z] <marostegui> Fix s7 replication on labsdb1010 - T186579

Replication is now flowing on labsdb1010 for s7.

This is just an idea, but maybe we could have the load balancer depool a server not only when the host is down, but also when its replication lag is greater than X seconds. This would be very easy work. If you think it is a good idea, I can file a ticket and put it in a low-priority backlog.

Of course, the right way forward is to rebuild labsdb1010, which I will do this week.

This is just an idea, but maybe we could have the load balancer depool a server not only when the host is down, but also when its replication lag is greater than X seconds. This would be very easy work. If you think it is a good idea, I can file a ticket and put it in a low-priority backlog.

I was thinking about that when we set up load balancing between all the hosts. On one hand I like it (it gives a better experience to the replica users, as they avoid a lagged server), but on the other hand, what if we end up with all the servers lagging, not because of broken replication but simply because of heavy queries that make them lag?
Then again, the query killer should prevent the really big queries that might make a server lag.

A good load balancer, like the one we use for MediaWiki or Varnish, doesn't allow depooling more than X hosts or a certain percentage of them. In this case, if 2 of them are degraded, it should send queries to the one that is least behind. I think HAProxy would allow that through configuration alone, but if we substituted the checks with custom scripts we would get even more flexibility, very similar to this: https://github.com/jayjanssen/percona-xtradb-cluster-tutorial/blob/master/instructions/Load%20Balancing.rst
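
For what it's worth, a custom check along those lines could be as simple as comparing replication lag against a threshold. A hypothetical example of the query such a script might run, assuming the usual pt-heartbeat setup (the heartbeat.heartbeat table and its ts column are an assumption here, not something described in this task):

-- Worst lag, in seconds, across the upstream servers with heartbeat rows on
-- this host; the calling check script would depool the server when this
-- exceeds X (in practice one would also filter by section/datacenter).
SELECT MAX(TIMESTAMPDIFF(SECOND, ts, UTC_TIMESTAMP())) AS max_lag_seconds
  FROM heartbeat.heartbeat;

-- The MariaDB-native alternative; Seconds_Behind_Master turns NULL when a
-- connection's SQL thread has stopped, which the script should treat as unhealthy.
SHOW ALL SLAVES STATUS\G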

jcrespo claimed this task. · Feb 28 2018, 12:00 PM
jcrespo moved this task from Next to In progress on the DBA board.

Change 415265 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Depool labsdb1011 to copy its data away

https://gerrit.wikimedia.org/r/415265

Change 415265 merged by Jcrespo:
[operations/puppet@production] Depool labsdb1011 to copy its data away

https://gerrit.wikimedia.org/r/415265

Mentioned in SAL (#wikimedia-operations) [2018-02-28T14:53:13Z] <jynus> stopping labsdb1011 to clone it to labsdb1010 T186579

labsdb1011 is back up, but it needs to be upgraded and to catch up with replication. The next steps after that are:

  • Pool labsdb1011 back
  • Depool labsdb1010
  • Recover labsdb1010 from the generated copy

Is it worth fixing the current s7 replication breakage on labsdb1010, or are you planning to take it down soon?

The replication catch-up and the copy will take more than 24 hours at the very least, and that is assuming there is no data loss forcing us to start from the beginning.

The replication catch-up and the copy will take more than 24 hours at the very least, and that is assuming there is no data loss forcing us to start from the beginning.

Sure, my question was basically whether it is worth fixing it or not, as it will be wiped anyway to get the data from labsdb1011 :)

jcrespo added a comment. · Edited · Mar 1 2018, 3:57 PM

whether it is worth fixing it or not

I don't know; that is why I was giving information about how much time is left before labsdb1010 is restored (>24 hours).

whether it is worth fixing it or not

I don't know; that is why I was giving information about how much time is left before labsdb1010 is restored (>24 hours).

Ah, cool. I think there was a misunderstanding, as I thought you'd wipe it and then send the data back to it.
I will get it fixed now.

I plan to transfer the compressed package first (I am not 100% sure it is in a good state) and will try to start decompressing it without deleting anything, for as long as there is space left.

Mentioned in SAL (#wikimedia-operations) [2018-03-01T16:06:03Z] <marostegui> Fix s7 replication on labsdb1010 - T186579

replication is now flowing on s7

Change 415774 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] labsdb: Depool labsdb1010 in preparation for its recovery

https://gerrit.wikimedia.org/r/415774

Change 415774 merged by Jcrespo:
[operations/puppet@production] labsdb: Depool labsdb1010 in preparation for its recovery

https://gerrit.wikimedia.org/r/415774

jcrespo closed this task as Resolved. · Mar 2 2018, 7:57 PM

labsdb1010 is up and running with a copy of labsdb1011 data.
https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&from=now%2Fd&to=now%2Fd&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1010&var-port=9104

Note for @jcrespo: when copying these hosts around, you need to delete all the master.info and master-multisource.info files (maybe the status ones, too) so that MariaDB does not crash at start. This is probably a bug, as I think it should at least start without crashing, but this works for now.
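
For completeness: after such a copy (and with the stale master*.info files removed), each replication connection has to be re-pointed by hand. A minimal sketch for one section; the master host and port for s7 match the SHOW SLAVE 's7' STATUS output above, while the binlog file, position and password are placeholders that would come from the coordinates recorded when the source host was stopped.

-- Sketch only: binlog file, position and password are placeholders.
CHANGE MASTER 's7' TO
  MASTER_HOST = 'db1102.eqiad.wmnet',
  MASTER_PORT = 3317,
  MASTER_USER = 'repl',
  MASTER_PASSWORD = '********',
  MASTER_LOG_FILE = 'db1102-bin.000700',
  MASTER_LOG_POS = 4;
START SLAVE 's7';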

Considering this resolved, despite the host needing to catch up after 2 days.