Page MenuHomePhabricator

labsdb1009 crashed
Closed, ResolvedPublic

Description

Mar 31 19:19:46 labsdb1009 mysqld[1430]: 180331 19:19:46 [ERROR] mysqld got signal 11 ;
Mar 31 19:19:46 labsdb1009 mysqld[1430]: This could be because you hit a bug. It is also possible that this binary
Mar 31 19:19:46 labsdb1009 mysqld[1430]: or one of the libraries it was linked against is corrupt, improperly built,
Mar 31 19:19:46 labsdb1009 mysqld[1430]: or misconfigured. This error can also be caused by malfunctioning hardware.
Mar 31 19:19:46 labsdb1009 mysqld[1430]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Mar 31 19:19:46 labsdb1009 mysqld[1430]: We will try our best to scrape up some info that will hopefully help
Mar 31 19:19:46 labsdb1009 mysqld[1430]: diagnose the problem, but since we have already crashed,
Mar 31 19:19:46 labsdb1009 mysqld[1430]: something is definitely wrong and this may fail.
Mar 31 19:19:46 labsdb1009 mysqld[1430]: Server version: 10.1.31-MariaDB
Mar 31 19:19:46 labsdb1009 mysqld[1430]: key_buffer_size=134217728
Mar 31 19:19:46 labsdb1009 mysqld[1430]: read_buffer_size=131072
Mar 31 19:19:46 labsdb1009 mysqld[1430]: max_used_connections=82
Mar 31 19:19:46 labsdb1009 mysqld[1430]: max_threads=1026
Mar 31 19:19:46 labsdb1009 mysqld[1430]: thread_count=29
Mar 31 19:19:46 labsdb1009 mysqld[1430]: It is possible that mysqld could use up to
Mar 31 19:19:46 labsdb1009 mysqld[1430]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2384980 K  bytes of memory
Mar 31 19:19:46 labsdb1009 mysqld[1430]: Hope that's ok; if not, decrease some variables in the equation.
Mar 31 19:19:46 labsdb1009 mysqld[1430]: Thread pointer: 0x7f3cc170d008
Mar 31 19:19:46 labsdb1009 mysqld[1430]: Attempting backtrace. You can use the following information to find out
Mar 31 19:19:46 labsdb1009 mysqld[1430]: where mysqld died. If you see no messages after this, something went
Mar 31 19:19:46 labsdb1009 mysqld[1430]: terribly wrong...
Mar 31 19:19:46 labsdb1009 mysqld[1430]: stack_bottom = 0x7f3d90e9dc88 thread_stack 0x30000
Mar 31 19:19:47 labsdb1009 mysqld[1430]: /opt/wmf-mariadb101/bin/mysqld(my_print_stacktrace+0x2e)[0x557ff718112e]
Mar 31 19:19:47 labsdb1009 mysqld[1430]: /opt/wmf-mariadb101/bin/mysqld(handle_fatal_signal+0x3bd)[0x557ff6cc463d]
Mar 31 19:19:48 labsdb1009 mysqld[1430]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x110c0)[0x7fa29f6d40c0]
Mar 31 19:19:48 labsdb1009 mysqld[1430]: /opt/wmf-mariadb101/bin/mysqld(_Z45fix_semijoin_strategies_for_picked_join_orderP4JOIN+0x98)
[0x557ff6c41878]
Mar 31 19:19:49 labsdb1009 mysqld[1430]: /opt/wmf-mariadb101/bin/mysqld(_Z20get_best_combinationP4JOIN+0x5f)[0x557ff6b6182f]
Mar 31 19:19:49 labsdb1009 mysqld[1430]: /opt/wmf-mariadb101/bin/mysqld(+0x469787)[0x557ff6b7f787]
Mar 31 19:19:50 labsdb1009 mysqld[1430]: /opt/wmf-mariadb101/bin/mysqld(_ZN4JOIN14optimize_innerEv+0x794)[0x557ff6b868d4]
Mar 31 19:19:50 labsdb1009 mysqld[1430]: /opt/wmf-mariadb101/bin/mysqld(_ZN4JOIN8optimizeEv+0x2b)[0x557ff6b8928b]
Mar 31 19:19:51 labsdb1009 mysqld[1430]: /opt/wmf-mariadb101/bin/mysqld(_Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_jP8st_orderSB_S2_SB_yP13select_resultP18st_select_lex_unitP13st_select_lex+0x8f8)[0x557ff6b89c48]
Mar 31 19:19:51 labsdb1009 mysqld[1430]: /opt/wmf-mariadb101/bin/mysqld(_Z13handle_selectP3THDP3LEXP13select_resultm+0x13f)[0x557ff6b89e1f]
Mar 31 19:19:52 labsdb1009 mysqld[1430]: /opt/wmf-mariadb101/bin/mysqld(+0x3609e4)[0x557ff6a769e4]
Mar 31 19:19:52 labsdb1009 mysqld[1430]: /opt/wmf-mariadb101/bin/mysqld(_Z21mysql_execute_commandP3THD+0x700f)[0x557ff6b3435f]
Mar 31 19:19:53 labsdb1009 mysqld[1430]: /opt/wmf-mariadb101/bin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x311)[0x557ff6b36471]
Mar 31 19:19:53 labsdb1009 mysqld[1430]: /opt/wmf-mariadb101/bin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x27d7)[0x557ff6b39de7]
Mar 31 19:19:54 labsdb1009 mysqld[1430]: /opt/wmf-mariadb101/bin/mysqld(_Z10do_commandP3THD+0x148)[0x557ff6b3a5f8]
Mar 31 19:19:54 labsdb1009 mysqld[1430]: /opt/wmf-mariadb101/bin/mysqld(_Z24do_handle_one_connectionP3THD+0x1d2[0x557ff6c09112]
Mar 31 19:19:55 labsdb1009 mysqld[1430]: /opt/wmf-mariadb101/bin/mysqld(handle_one_connection+0x40)[0x557ff6c09270]
Mar 31 19:19:55 labsdb1009 mysqld[1430]: /opt/wmf-mariadb101/bin/mysqld(+0x8126d1)[0x557ff6f286d1]
Mar 31 19:19:56 labsdb1009 mysqld[1430]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7494)[0x7fa29f6ca494]
Mar 31 19:19:56 labsdb1009 mysqld[1430]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fa29de6bacf]
Mar 31 19:19:56 labsdb1009 mysqld[1430]: Trying to get some variables.
Mar 31 19:19:56 labsdb1009 mysqld[1430]: Some pointers may be invalid and cause the dump to abort.
Mar 31 19:19:56 labsdb1009 mysqld[1430]: Query (0x7f3c1b0f9020): SELECT DISTINCT p.page_id,p.page_title,p.page_namespace,p.page_touched,p.page_len FROM ( SELECT * from categorylinks where cl_to IN ('Kateg  ria:Lektor  land  _lapok_2004_augusztus  b  l')) cl0 INNER JOIN categorylinks cl1 on cl0.cl_from=cl1.cl_from and cl1.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2004_december  b  l') INNER JOIN categorylinks cl2 on cl0.cl_from=cl2.cl_from and cl2.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2004_szeptember  b  l') INNER JOIN categorylinks cl3 on cl0.cl_from=cl3.cl_from and cl3.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2005_augusztus  b  l') INNER JOIN categorylinks cl4 on cl0.cl_from=cl4.cl_from and cl4.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2005_december  b  l') INNER JOIN categorylinks cl5 on cl0.cl_from=cl5.cl_from and cl5.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2005_febru  rj  b  l') INNER JOIN categorylinks cl6 on cl0.cl_from=cl6.cl_from and cl6.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2005_janu  rj  b  l') INNER JOIN categorylinks cl7 on cl0.cl_from=cl7.cl_from and cl7.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2005_j  lius  b  l') INNER JOIN categorylinks cl8 on cl0.cl_from=cl8.cl_from and cl8.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2005_j  nius  b  l') INNER JOIN categorylinks cl9 on cl0.cl_from=cl9.cl_from and cl9.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2005_m  jus  b  l') INNER JOIN categorylinks cl10 on cl0.cl_from=cl10.cl_from and cl10.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2005_november  b  l') INNER JOIN categorylinks cl11 on cl0.cl_from=cl11.cl_from and cl11.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2005_okt  ber  b  l') INNER JOIN categorylinks cl12 on cl0.cl_from=cl12.cl_from and cl12.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2005_szeptember  b  l') INNER JOIN categorylinks cl13 on cl0.cl_from=cl13.cl_from and cl13.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2005_  prilis  b  l') INNER JOIN categorylinks cl14 on cl0.cl_from=cl14.cl_from and cl14.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2006_augusztus  b  l') INNER
Mar 31 19:19:56 labsdb1009 mysqld[1430]: JOIN categorylinks cl15 on cl0.cl_from=cl15.cl_from and cl15.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2006_december  b  l') INNER JOIN categorylinks cl16 on cl0.cl_from=cl16.cl_from and cl16.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2006_febru  rj  b  l') INNER JOIN categorylinks cl17 on cl0.cl_from=cl17.cl_from and cl17.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2006_janu  rj  b  l') INNER JOIN categorylinks cl18 on cl0.cl_from=cl18.cl_from and cl18.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2006_j  lius  b  l') INNER JOIN categorylinks cl19 on cl0.cl_from=cl19.cl_from and cl19.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2006_j  nius  b  l') INNER JOIN categorylinks cl20 on cl0.cl_from=cl20.cl_from and cl20.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2006_m  jus  b  l') INNER JOIN categorylinks cl21 on cl0.cl_from=cl21.cl_from and cl21.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2006_m  rcius  b  l') INNER JOIN categorylinks cl22 on cl0.cl_from=cl22.cl_from and cl22.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2006_november  b  l') INNER JOIN categorylinks cl23 on cl0.cl_from=cl23.cl_from and cl23.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2006_okt  ber  b  l') INNER JOIN categorylinks cl24 on cl0.cl_from=cl24.cl_from and cl24.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2006_szeptember  b  l') INNER JOIN categorylinks cl25 on cl0.cl_from=cl25.cl_from and cl25.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2006_  prilis  b  l') INNER JOIN categorylinks cl26 on cl0.cl_from=cl26.cl_from and cl26.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2007_augusztus  b  l') INNER JOIN categorylinks cl27 on cl0.cl_from=cl27.cl_from and cl27.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2007_febru  rj  b  l') INNER JOIN categorylinks cl28 on cl0.cl_from=cl28.cl_from and cl28.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2007_janu  rj  b  l') INNER JOIN categorylinks cl29 on cl0.cl_from=cl29.cl_from and cl29.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2007_j  lius  b  l') INNER JOIN categorylinks cl30 on cl0.cl_from=cl30.cl_from and cl30.cl_to I
Mar 31 19:19:56 labsdb1009 mysqld[1430]: N ('Kateg  ria:Lektor  land  _lapok_2007_j  nius  b  l') INNER JOIN categorylinks cl31 on cl
0.cl_from=cl31.cl_from and cl31.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2007_m  jus  b  l') INNER JOIN categorylinks cl32 on cl0.c
l_from=cl32.cl_from and cl32.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2007_m  rcius  b  l') INNER JOIN categorylinks cl33 on cl0.cl_from=cl33.cl_from and cl33.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2007_november  b  l') INNER JOIN categorylinks cl34 on cl0.cl_from=cl34.cl_from and cl34.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2007_okt  ber  b  l') INNER JOIN categorylinks cl35 on cl0.cl_from=cl35.cl_from and cl35.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2007_szeptember  b  l') INNER JOIN categorylinks cl36 on cl0.cl_from=cl36.cl_from and cl36.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2007_  prilis  b  l') INNER JOIN categorylinks cl37 on cl0.cl_from=cl37.cl_from and cl37.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2008_augusztus  b  l') INNER JOIN categorylinks cl38 on cl0.cl_from=cl38.cl_from and cl38.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2008_j  lius  b  l') INNER JOIN categorylinks cl39 on cl0.cl_from=cl39.cl_from and cl39.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2008_okt  ber  b  l') INNER JOIN categorylinks cl40 on cl0.cl_from=cl40.cl_from and cl40.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2009_febru  rj  b  l') INNER JOIN categorylinks cl41 on cl0.cl_from=cl41.cl_from and cl41.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2009_janu  rj  b  l') INNER JOIN categorylinks cl42 on cl0.cl_from=cl42.cl_from and cl42.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2009_j  nius  b  l') INNER JOIN categorylinks cl43 on cl0.cl_from=cl43.cl_from and cl43.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2009_m  rcius  b  l') INNER JOIN categorylinks cl44 on cl0.cl_from=cl44.cl_from and cl44.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2009_november  b  l') INNER JOIN categorylinks cl45 on cl0.cl_from=cl45.cl_from and cl45.cl_to IN ('Kateg  ria:Lektor  land  _lapok_2010_j  nius  b  l') INNER JOIN c
Mar 31 19:20:13 labsdb1009 systemd[1]: mariadb.service: Main process exited, code=killed, status=11/SEGV
Mar 31 19:20:13 labsdb1009 systemd[1]: mariadb.service: Unit entered failed state.
Mar 31 19:20:13 labsdb1009 systemd[1]: mariadb.service: Failed with result 'signal'.
Mar 31 19:20:18 labsdb1009 systemd[1]: mariadb.service: Service hold-off time over, scheduling restart.
Mar 31 19:20:18 labsdb1009 systemd[1]: Stopped MariaDB database server.
Mar 31 19:20:18 labsdb1009 systemd[1]: Starting MariaDB database server...

T179244 again? probably related to T190627

Event Timeline

@jcrespo LMK if you'd like me to do anything about it during the weekend.

Just for the record - replication is broken on a few threads:

root@labsdb1009:~# mysql --skip-ssl -e "show all slaves status\G" | egrep "Connection|Seconds|Last"
              Connection_name: db1095
                   Last_Errno: 1032
                   Last_Error: Could not execute Delete_rows_v1 event on table enwiki.pagelinks; Can't find record in 'pagelinks', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1095-bin.004966, end_log_pos 302054150
        Seconds_Behind_Master: NULL
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Delete_rows_v1 event on table enwiki.pagelinks; Can't find record in 'pagelinks', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1095-bin.004966, end_log_pos 302054150
              Connection_name: s2
                   Last_Errno: 1032
                   Last_Error: Could not execute Update_rows_v1 event on table nlwiki.recentchanges; Can't find record in 'recentchanges', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1102-bin.000957, end_log_pos 390825731
        Seconds_Behind_Master: NULL
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Update_rows_v1 event on table nlwiki.recentchanges; Can't find record in 'recentchanges', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1102-bin.000957, end_log_pos 390825731
              Connection_name: s4
                   Last_Errno: 1032
                   Last_Error: Could not execute Update_rows_v1 event on table commonswiki.page; Can't find record in 'page', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1102-bin.000831, end_log_pos 409279866
        Seconds_Behind_Master: NULL
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Update_rows_v1 event on table commonswiki.page; Can't find record in 'page', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1102-bin.000831, end_log_pos 409279866
              Connection_name: s6
                   Last_Errno: 0
                   Last_Error:
        Seconds_Behind_Master: 0
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
              Connection_name: s7
                   Last_Errno: 1032
                   Last_Error: Could not execute Update_rows_v1 event on table fawiki.page; Can't find record in 'page', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1102-bin.000773, end_log_pos 730567608
        Seconds_Behind_Master: NULL
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Update_rows_v1 event on table fawiki.page; Can't find record in 'page', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1102-bin.000773, end_log_pos 730567608
jcrespo triaged this task as High priority.Apr 2 2018, 8:01 AM

Change 423448 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] labsdb: depool labsdb1011 to copy it to labsdb1009

https://gerrit.wikimedia.org/r/423448

Change 423448 merged by Jcrespo:
[operations/puppet@production] labsdb: depool labsdb1011 to copy it to labsdb1009

https://gerrit.wikimedia.org/r/423448

Change 423450 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] labsdb: Depool labsdb1009 to copy it from labsdb1011

https://gerrit.wikimedia.org/r/423450

Change 423450 merged by Jcrespo:
[operations/puppet@production] labsdb: Depool labsdb1009 to copy it from labsdb1011

https://gerrit.wikimedia.org/r/423450

@ayounsi : As part of this emergency recovery, we are copying 7T at ~240MBytes/s from labsdb1011 to labsdb1009, it will take around 8 hours.

@chasemp No user impact -yet- due to the proxy working as intended and failing over to labsdb1010.

@JAllemandou Please stop sqoop, we are close to a full service outage.

Change 423495 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbproxy: Move analytics wikireplica service to labsdb1011

https://gerrit.wikimedia.org/r/423495

Change 423495 merged by Jcrespo:
[operations/puppet@production] dbproxy: Move analytics wikireplica service to labsdb1011

https://gerrit.wikimedia.org/r/423495

Change 423517 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbproxy: Repool labsdb1009 as part of the analytics hosts

https://gerrit.wikimedia.org/r/423517

Change 423517 merged by Jcrespo:
[operations/puppet@production] dbproxy: Repool labsdb1009 as part of the analytics hosts

https://gerrit.wikimedia.org/r/423517

This should be fixed now, labsdb1009 was loaded with a copy of labsdb1011. I also took the time to upgrade kernels and mariadb versions. There will be lag for some time and will need tuning( 9 and 11 are on analytics, 10 is on web(, but this should be working now.

Change 423741 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbproxy-labsdb: set labsdb1009 as the main web, the others analytics

https://gerrit.wikimedia.org/r/423741

Change 423741 merged by Jcrespo:
[operations/puppet@production] dbproxy-labsdb: set labsdb1009 as the main web, the others analytics

https://gerrit.wikimedia.org/r/423741