labsdb1009 crashed - OOM
Closed, ResolvedPublic

Description

labsdb1009 crashed last night due to OOM (https://grafana.wikimedia.org/dashboard/db/server-board?refresh=1m&panelId=14&fullscreen&orgId=1&var-server=labsdb1009&var-network=eth0&from=1509213927891&to=1509225331129)

Oct 28 20:45:25 labsdb1009 mysqld[24005]: 2017-10-28 20:45:25 140063807833856 [ERROR] mysqld: Out of memory (Needed 28896192 bytes)
Oct 28 20:45:25 labsdb1009 mysqld[24005]: 2017-10-28 20:45:25 140063807833856 [ERROR] Out of memory; check if mysqld or some other process uses all available memory; if not, you may have to use 'ulimit' to allow mysqld to use more memory or you can add more swap space
Oct 28 20:45:25 labsdb1009 mysqld[24005]: 2017-10-28 20:45:25 140063807833856 [ERROR] mysqld: Out of memory (Needed 28896192 bytes)
Oct 28 20:45:25 labsdb1009 mysqld[24005]: 2017-10-28 20:45:25 140063807833856 [ERROR] Out of memory; check if mysqld or some other process uses all available memory; if not, you may have to use 'ulimit' to allow mysqld to use more memory or you can add more swap space
Oct 28 20:45:25 labsdb1009 mysqld[24005]: 171028 20:45:25 [ERROR] mysqld got signal 11 ;
Oct 28 20:45:25 labsdb1009 mysqld[24005]: This could be because you hit a bug. It is also possible that this binary
Oct 28 20:45:25 labsdb1009 mysqld[24005]: or one of the libraries it was linked against is corrupt, improperly built,
Oct 28 20:45:25 labsdb1009 mysqld[24005]: or misconfigured. This error can also be caused by malfunctioning hardware.
Oct 28 20:45:25 labsdb1009 mysqld[24005]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Oct 28 20:45:25 labsdb1009 mysqld[24005]: We will try our best to scrape up some info that will hopefully help
Oct 28 20:45:25 labsdb1009 mysqld[24005]: diagnose the problem, but since we have already crashed,
Oct 28 20:45:25 labsdb1009 mysqld[24005]: something is definitely wrong and this may fail.
Oct 28 20:45:25 labsdb1009 mysqld[24005]: Server version: 10.1.28-MariaDB
Oct 28 20:45:25 labsdb1009 mysqld[24005]: key_buffer_size=134217728
Oct 28 20:45:25 labsdb1009 mysqld[24005]: read_buffer_size=131072
Oct 28 20:45:25 labsdb1009 mysqld[24005]: max_used_connections=27
Oct 28 20:45:25 labsdb1009 mysqld[24005]: max_threads=1026
Oct 28 20:45:25 labsdb1009 mysqld[24005]: thread_count=9
Oct 28 20:45:25 labsdb1009 mysqld[24005]: It is possible that mysqld could use up to
Oct 28 20:45:25 labsdb1009 mysqld[24005]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2384788 K  bytes of memory
Oct 28 20:45:25 labsdb1009 mysqld[24005]: Hope that's ok; if not, decrease some variables in the equation.
Oct 28 20:45:25 labsdb1009 mysqld[24005]: Thread pointer: 0x7efcf1e9d008
Oct 28 20:45:25 labsdb1009 mysqld[24005]: Attempting backtrace. You can use the following information to find out
Oct 28 20:45:25 labsdb1009 mysqld[24005]: where mysqld died. If you see no messages after this, something went
Oct 28 20:45:25 labsdb1009 mysqld[24005]: terribly wrong...
Oct 28 20:45:25 labsdb1009 mysqld[24005]: stack_bottom = 0x7f632582fc88 thread_stack 0x30000
Oct 28 20:45:25 labsdb1009 mysqld[24005]: (my_addr_resolve failure: fork)
Oct 28 20:45:25 labsdb1009 mysqld[24005]: /opt/wmf-mariadb101//bin/mysqld(my_print_stacktrace+0x2e) [0x56441c724e3e]
Oct 28 20:45:25 labsdb1009 mysqld[24005]: /opt/wmf-mariadb101//bin/mysqld(handle_fatal_signal+0x3bd) [0x56441c2699ad]
Oct 28 20:45:25 labsdb1009 mysqld[24005]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x110c0) [0x7f63285a00c0]
Oct 28 20:45:25 labsdb1009 mysqld[24005]: /opt/wmf-mariadb101//bin/mysqld(st_join_table::save_explain_data(Explain_table_access*, unsigned long long, bool, st_join_table*)+0x6d) [0x56441c1278dd]
Oct 28 20:45:25 labsdb1009 mysqld[24005]: /opt/wmf-mariadb101//bin/mysqld(st_join_table::update_explain_data(unsigned int)+0x37b) [0x56441c128c9b]
Oct 28 20:45:25 labsdb1009 mysqld[24005]: /opt/wmf-mariadb101//bin/mysqld(+0x462ea9) [0x56441c128ea9]
Oct 28 20:46:04 labsdb1009 systemd[1]: mariadb.service: Main process exited, code=killed, status=11/SEGV
Oct 28 20:46:04 labsdb1009 systemd[1]: mariadb.service: Unit entered failed state.
Oct 28 20:46:04 labsdb1009 systemd[1]: mariadb.service: Failed with result 'signal'.
Oct 28 20:46:09 labsdb1009 systemd[1]: mariadb.service: Service hold-off time over, scheduling restart.
Oct 28 20:46:09 labsdb1009 systemd[1]: Stopped MariaDB database server.
Oct 28 20:46:09 labsdb1009 systemd[1]: Starting MariaDB database server...
Oct 28 20:46:09 labsdb1009 mysqld[31730]: 2017-10-28 20:46:09 139666438564096 [Note] /opt/wmf-mariadb101//bin/mysqld (mysqld 10.1.28-MariaDB) starting as process 31730 ...
Oct 28 20:46:09 labsdb1009 mysqld[31730]: 2017-10-28 20:46:09 139666438564096 [Warning] No argument was provided to --log-bin and neither --log-basename or --log-bin-index where used;  This may cause repliction to break when this server acts as a master and has its hostname changed! Please use '--log-basename=labsdb1009' or '--log-bin=labsdb1009-bin' to avoid this problem.
Oct 28 20:46:09 labsdb1009 mysqld[31730]: 2017-10-28 20:46:09 139666438564096 [ERROR] Plugin 'unix_socket' already installed

The table wikidatawiki.wb_changes got corrupted and broke replication.

Event Timeline

Marostegui triaged this task as Medium priority. Oct 29 2017, 6:34 AM
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

I have restarted replication on all the threads and am now working on getting that table re-imported.
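
For reference, this is roughly the shape of that operation on a multi-source MariaDB replica; a minimal sketch only, assuming the mysql client can connect locally with the usual defaults file (credentials and connection names are not shown):

# Restart every replication connection, then confirm the threads are running again
mysql -e "STOP ALL SLAVES; START ALL SLAVES;"
mysql -e "SHOW ALL SLAVES STATUS\G" | grep -E 'Connection_name|Slave_(IO|SQL)_Running|Seconds_Behind_Master'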

I have reimported the table on labsdb1009; replication is now up to date and the table is getting new records fine.
Also reloaded the proxy, so labsdb1009 is back to being the analytics host after labsdb1010 took over when labsdb1009 crashed.
I have checked which queries were running at the time, and there was one scanning around 700M rows: https://phabricator.wikimedia.org/P6213
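
A minimal sketch of how a single-table reimport and the query check look in general, assuming direct access from labsdb1009 to labsdb1010 and that the affected replication connection is stopped first (hosts, flags and the sql_log_bin handling are assumptions, not the exact procedure used here):

# Dump wikidatawiki.wb_changes from the healthy replica and load it on labsdb1009,
# keeping the load out of the local binlog
mysqldump -h labsdb1010 --single-transaction wikidatawiki wb_changes > wb_changes.sql
mysql -h labsdb1009 --init-command="SET sql_log_bin=0" wikidatawiki < wb_changes.sql

# Inspect currently running queries, e.g. to spot a scan of ~700M rows as in P6213
mysql -h labsdb1009 -e "SHOW FULL PROCESSLIST;"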

jcrespo added a subscriber: jcrespo.

We have to reimport labsdb1009 from labsdb1010.

For the record, this is the first table that broke replication, probably corrupted because of the crash:

Nov 13 13:45:45 labsdb1009 mysqld[31730]: 2017-11-13 13:45:45 139666025916160 [ERROR] Master 'db1095': Slave SQL: Could not execute Delete_rows_v1 event on table cebwiki.geo_tags; Can't find record in 'geo_tags', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1095-bin.003688, end_log_pos 129839964, Gtid 171966669-171966669-1334350810, Internal MariaDB error code: 1032

This table was reimported from labsdb1010 and replication resumed fine.
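
Only the replication connection that hit the error needs to be resumed after such a reimport; a sketch, assuming the connection is named after its master as the log above suggests ('db1095'):

# Resume the connection that stopped on cebwiki.geo_tags and check it is healthy
mysql -e "START SLAVE 'db1095';"
mysql -e "SHOW SLAVE 'db1095' STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Last_SQL_Error'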

It has happened twice; we cannot trust labsdb1009. Copying labsdb1010 over and failing it over is a day's work, with very little human intervention.

Agreed :-)

Mentioned in SAL (#wikimedia-operations) [2017-11-14T16:23:09Z] <jynus> stop labsdb1010 mariadb to clone it later to labsdb1009 T179244
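
The clone itself amounts to a cold copy of the whole datadir from labsdb1010 to labsdb1009 while traffic is served by the other hosts; a rough sketch of that shape only (the datadir path and the use of rsync instead of WMF's own transfer tooling are assumptions):

# On labsdb1010: stop MariaDB so the datadir is consistent on disk
systemctl stop mariadb

# On labsdb1009: stop MariaDB, wipe the old data and pull the cold copy (path assumed)
systemctl stop mariadb
rsync -a --delete labsdb1010:/srv/sqldata/ /srv/sqldata/

# Start both hosts back up; labsdb1009 then needs its replication connections repointed
systemctl start mariadb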

Change 391551 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] labsdb: point analytics replica to labsdb1010 for labsdb1009 maintenance

https://gerrit.wikimedia.org/r/391551

Change 391551 merged by Jcrespo:
[operations/puppet@production] labsdb: point analytics replica to labsdb1010 for labsdb1009 maintenance

https://gerrit.wikimedia.org/r/391551

Mentioned in SAL (#wikimedia-operations) [2017-11-15T15:43:05Z] <jynus> shutting down labsdb1009 for maintenance T179244

Change 391783 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] wikireplicas: Point all labsdb replica hosts to labsdb1011

https://gerrit.wikimedia.org/r/391783

Change 391783 merged by Jcrespo:
[operations/puppet@production] wikireplicas: Point all labsdb replica hosts to labsdb1011

https://gerrit.wikimedia.org/r/391783

Change 392020 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] wikireplicas: Repoint web and analytics back to the original servers

https://gerrit.wikimedia.org/r/392020

Change 392020 merged by Jcrespo:
[operations/puppet@production] wikireplicas: Repoint web and analytics back to the original servers

https://gerrit.wikimedia.org/r/392020

This should now be fixed.