labsdb1009 crashed - OOM
Closed, ResolvedPublic

Description

labsdb1009 crashed last night due to OOM (https://grafana.wikimedia.org/dashboard/db/server-board?refresh=1m&panelId=14&fullscreen&orgId=1&var-server=labsdb1009&var-network=eth0&from=1509213927891&to=1509225331129)

Oct 28 20:45:25 labsdb1009 mysqld[24005]: 2017-10-28 20:45:25 140063807833856 [ERROR] mysqld: Out of memory (Needed 28896192 bytes)
Oct 28 20:45:25 labsdb1009 mysqld[24005]: 2017-10-28 20:45:25 140063807833856 [ERROR] Out of memory; check if mysqld or some other process uses all available memory; if not, you may have to use 'ulimit' to allow mysqld to use more memory or you can add more swap space
Oct 28 20:45:25 labsdb1009 mysqld[24005]: 2017-10-28 20:45:25 140063807833856 [ERROR] mysqld: Out of memory (Needed 28896192 bytes)
Oct 28 20:45:25 labsdb1009 mysqld[24005]: 2017-10-28 20:45:25 140063807833856 [ERROR] Out of memory; check if mysqld or some other process uses all available memory; if not, you may have to use 'ulimit' to allow mysqld to use more memory or you can add more swap space
Oct 28 20:45:25 labsdb1009 mysqld[24005]: 171028 20:45:25 [ERROR] mysqld got signal 11 ;
Oct 28 20:45:25 labsdb1009 mysqld[24005]: This could be because you hit a bug. It is also possible that this binary
Oct 28 20:45:25 labsdb1009 mysqld[24005]: or one of the libraries it was linked against is corrupt, improperly built,
Oct 28 20:45:25 labsdb1009 mysqld[24005]: or misconfigured. This error can also be caused by malfunctioning hardware.
Oct 28 20:45:25 labsdb1009 mysqld[24005]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Oct 28 20:45:25 labsdb1009 mysqld[24005]: We will try our best to scrape up some info that will hopefully help
Oct 28 20:45:25 labsdb1009 mysqld[24005]: diagnose the problem, but since we have already crashed,
Oct 28 20:45:25 labsdb1009 mysqld[24005]: something is definitely wrong and this may fail.
Oct 28 20:45:25 labsdb1009 mysqld[24005]: Server version: 10.1.28-MariaDB
Oct 28 20:45:25 labsdb1009 mysqld[24005]: key_buffer_size=134217728
Oct 28 20:45:25 labsdb1009 mysqld[24005]: read_buffer_size=131072
Oct 28 20:45:25 labsdb1009 mysqld[24005]: max_used_connections=27
Oct 28 20:45:25 labsdb1009 mysqld[24005]: max_threads=1026
Oct 28 20:45:25 labsdb1009 mysqld[24005]: thread_count=9
Oct 28 20:45:25 labsdb1009 mysqld[24005]: It is possible that mysqld could use up to
Oct 28 20:45:25 labsdb1009 mysqld[24005]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2384788 K  bytes of memory
Oct 28 20:45:25 labsdb1009 mysqld[24005]: Hope that's ok; if not, decrease some variables in the equation.
Oct 28 20:45:25 labsdb1009 mysqld[24005]: Thread pointer: 0x7efcf1e9d008
Oct 28 20:45:25 labsdb1009 mysqld[24005]: Attempting backtrace. You can use the following information to find out
Oct 28 20:45:25 labsdb1009 mysqld[24005]: where mysqld died. If you see no messages after this, something went
Oct 28 20:45:25 labsdb1009 mysqld[24005]: terribly wrong...
Oct 28 20:45:25 labsdb1009 mysqld[24005]: stack_bottom = 0x7f632582fc88 thread_stack 0x30000
Oct 28 20:45:25 labsdb1009 mysqld[24005]: (my_addr_resolve failure: fork)
Oct 28 20:45:25 labsdb1009 mysqld[24005]: /opt/wmf-mariadb101//bin/mysqld(my_print_stacktrace+0x2e) [0x56441c724e3e]
Oct 28 20:45:25 labsdb1009 mysqld[24005]: /opt/wmf-mariadb101//bin/mysqld(handle_fatal_signal+0x3bd) [0x56441c2699ad]
Oct 28 20:45:25 labsdb1009 mysqld[24005]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x110c0) [0x7f63285a00c0]
Oct 28 20:45:25 labsdb1009 mysqld[24005]: /opt/wmf-mariadb101//bin/mysqld(st_join_table::save_explain_data(Explain_table_access*, unsigned long long, bool, st_join_table*)+0x6d) [0x56441c1278dd]
Oct 28 20:45:25 labsdb1009 mysqld[24005]: /opt/wmf-mariadb101//bin/mysqld(st_join_table::update_explain_data(unsigned int)+0x37b) [0x56441c128c9b]
Oct 28 20:45:25 labsdb1009 mysqld[24005]: /opt/wmf-mariadb101//bin/mysqld(+0x462ea9) [0x56441c128ea9]
Oct 28 20:46:04 labsdb1009 systemd[1]: mariadb.service: Main process exited, code=killed, status=11/SEGV
Oct 28 20:46:04 labsdb1009 systemd[1]: mariadb.service: Unit entered failed state.
Oct 28 20:46:04 labsdb1009 systemd[1]: mariadb.service: Failed with result 'signal'.
Oct 28 20:46:09 labsdb1009 systemd[1]: mariadb.service: Service hold-off time over, scheduling restart.
Oct 28 20:46:09 labsdb1009 systemd[1]: Stopped MariaDB database server.
Oct 28 20:46:09 labsdb1009 systemd[1]: Starting MariaDB database server...
Oct 28 20:46:09 labsdb1009 mysqld[31730]: 2017-10-28 20:46:09 139666438564096 [Note] /opt/wmf-mariadb101//bin/mysqld (mysqld 10.1.28-MariaDB) starting as process 31730 ...
Oct 28 20:46:09 labsdb1009 mysqld[31730]: 2017-10-28 20:46:09 139666438564096 [Warning] No argument was provided to --log-bin and neither --log-basename or --log-bin-index where used;  This may cause repliction to break when this server acts as a master and has its hostname changed! Please use '--log-basename=labsdb1009' or '--log-bin=labsdb1009-bin' to avoid this problem.
Oct 28 20:46:09 labsdb1009 mysqld[31730]: 2017-10-28 20:46:09 139666438564096 [ERROR] Plugin 'unix_socket' already installed

The table wikidatawiki.wb_changes got corrupted and broke replication.

Event Timeline

Marostegui triaged this task as Medium priority. Oct 29 2017, 6:34 AM
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

I have restarted replication on all the threads and am now working on getting that table re-imported.
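
For reference, this is roughly the shape of that operation on a multi-source MariaDB replica; a minimal sketch only, assuming the mysql client can connect locally with the usual defaults file (credentials and connection names are not shown):

# Restart every replication connection, then confirm the threads are running again
mysql -e "STOP ALL SLAVES; START ALL SLAVES;"
mysql -e "SHOW ALL SLAVES STATUS\G" | grep -E 'Connection_name|Slave_(IO|SQL)_Running|Seconds_Behind_Master'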

I have reimported the table on labsdb1009; replication is now up to date and the table is getting new records fine.
Also reloaded the proxy, so labsdb1009 is back to being the analytics host after labsdb1010 took over when labsdb1009 crashed.
I have checked which queries were running at the time, and there was one scanning around 700M rows: https://phabricator.wikimedia.org/P6213
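
A minimal sketch of how a single-table reimport and the query check look in general, assuming direct access from labsdb1009 to labsdb1010 and that the affected replication connection is stopped first (hosts, flags and the sql_log_bin handling are assumptions, not the exact procedure used here):

# Dump wikidatawiki.wb_changes from the healthy replica and load it on labsdb1009,
# keeping the load out of the local binlog
mysqldump -h labsdb1010 --single-transaction wikidatawiki wb_changes > wb_changes.sql
mysql -h labsdb1009 --init-command="SET sql_log_bin=0" wikidatawiki < wb_changes.sql

# Inspect currently running queries, e.g. to spot a scan of ~700M rows as in P6213
mysql -h labsdb1009 -e "SHOW FULL PROCESSLIST;"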

jcrespo added a subscriber: jcrespo.

We have to reimport labsdb1009 from labsdb1010.

For the record, this is the first table that broke replication, probably corrupted because of the crash:

Nov 13 13:45:45 labsdb1009 mysqld[31730]: 2017-11-13 13:45:45 139666025916160 [ERROR] Master 'db1095': Slave SQL: Could not execute Delete_rows_v1 event on table cebwiki.geo_tags; Can't find record in 'geo_tags', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1095-bin.003688, end_log_pos 129839964, Gtid 171966669-171966669-1334350810, Internal MariaDB error code: 1032

This table was reimported from labsdb1010 and replication resumed fine.
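
Only the replication connection that hit the error needs to be resumed after such a reimport; a sketch, assuming the connection is named after its master as the log above suggests ('db1095'):

# Resume the connection that stopped on cebwiki.geo_tags and check it is healthy
mysql -e "START SLAVE 'db1095';"
mysql -e "SHOW SLAVE 'db1095' STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Last_SQL_Error'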

It has happened twice; we cannot trust labsdb1009. Copying labsdb1010 over and failing it over is a day's work, with very little human intervention.

Agreed :-)

Mentioned in SAL (#wikimedia-operations) [2017-11-14T16:23:09Z] <jynus> stop labsdb1010 mariadb to clone it later to labsdb1009 T179244
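
The clone itself amounts to a cold copy of the whole datadir from labsdb1010 to labsdb1009 while traffic is served by the other hosts; a rough sketch of that shape only (the datadir path and the use of rsync instead of WMF's own transfer tooling are assumptions):

# On labsdb1010: stop MariaDB so the datadir is consistent on disk
systemctl stop mariadb

# On labsdb1009: stop MariaDB, wipe the old data and pull the cold copy (path assumed)
systemctl stop mariadb
rsync -a --delete labsdb1010:/srv/sqldata/ /srv/sqldata/

# Start both hosts back up; labsdb1009 then needs its replication connections repointed
systemctl start mariadb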

Change 391551 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] labsdb: point analytics replica to labsdb1010 for labsdb1009 maintenance

https://gerrit.wikimedia.org/r/391551

Change 391551 merged by Jcrespo:
[operations/puppet@production] labsdb: point analytics replica to labsdb1010 for labsdb1009 maintenance

https://gerrit.wikimedia.org/r/391551

Mentioned in SAL (#wikimedia-operations) [2017-11-15T15:43:05Z] <jynus> shutting down labsdb1009 for maintenance T179244

Change 391783 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] wikireplicas: Point all labsdb replica hosts to labsdb1011

https://gerrit.wikimedia.org/r/391783

Change 391783 merged by Jcrespo:
[operations/puppet@production] wikireplicas: Point all labsdb replica hosts to labsdb1011

https://gerrit.wikimedia.org/r/391783

Change 392020 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] wikireplicas: Repoint web and analytics back to the original servers

https://gerrit.wikimedia.org/r/392020

Change 392020 merged by Jcrespo:
[operations/puppet@production] wikireplicas: Repoint web and analytics back to the original servers

https://gerrit.wikimedia.org/r/392020

This should now be fixed.