db1218 crashed
Closed, Resolved, Public

Authored By

	Ladsgroup
	Jul 19 2023, 5:46 PM

Description

Can't ssh into it: s1 candidate master.

Details

	Subject	Repo	Branch	Lines +/-
	mariadb: Switch candidate host of s1	operations/puppet	production	+2 -2


Related Objects

Mentioned In: T346454: Master and candidate master of s5 and s8 in eqiad are in the same row
Mentioned Here: P49963 dbctl commit (dc=all): 'Depool db1184 T342284'
P49814 dbctl commit (dc=all): 'Depool db1196 T342284'
T324858: db1186 power supplies not redundant

Event Timeline

Ladsgroup created this task. Jul 19 2023, 5:46 PM


SEL is clean:

racadm>>getsel
Record:      1
Date/Time:   01/29/2023 17:23:14
Source:      system
Severity:    Ok
Description: Log cleared.
--------------------------------

Maybe network issue like yesterday with db1198?

Mentioned in SAL (#wikimedia-operations) [2023-07-19T17:49:43Z] <Amir1> powercycled db1218 (T342284)

In T342284#9029213, @Marostegui wrote:

Maybe network issue like yesterday with db1198?

It's quite persistent, I haven't been able to ssh into the host for at least ten minutes.

Yeah, yesterday it was the cable, I believe. Is the login prompt available if you try via iDRAC?
If not, let's get DCOps to take a look

Well, even after the powercycle I still can't ssh into it. I guess it's a persistent network issue.

In T342284#9029219, @Marostegui wrote:

Yeah, yesterday it was the cable, I believe. Is the login prompt available if you try via iDRAC?
If not, let's get DCOps to take a look

sure. Thanks!

It's back, looks like?

yeah, I'll go check if it was a network blip or something more nefarious.

I brought MySQL back up so it keeps replicating, but I won't repool it before checking what happened and running a quick data integrity check.

You could check in the syslog whether the network link went down, or whether it was just a full crash.

yeah, let me check

Actually it looks like the whole server froze for ten-ish minutes (until reboot)

syslog:

Jul 19 17:37:06 db1218 systemd[1]: Finished Export confd Prometheus metrics.
Jul 19 17:37:07 db1218 systemd[1]: prometheus_puppet_agent_stats.service: Succeeded.
Jul 19 17:37:07 db1218 systemd[1]: Finished Regular job to collect puppet agent stats.
Jul 19 17:37:07 db1218 systemd[1]: prometheus_puppet_agent_stats.service: Consumed 1.271s CPU time.
Jul 19 17:52:54 db1218 systemd-modules-load[680]: Inserted module 'nf_conntrack'
Jul 19 17:52:54 db1218 systemd-modules-load[680]: Inserted module 'ipmi_devintf'
Jul 19 17:52:54 db1218 lvm[673]:   1 logical volume(s) in volume group "tank" monitored
Jul 19 17:52:54 db1218 systemd[1]: Mounted POSIX Message Queue File System.
Jul 19 17:52:54 db1218 systemd[1]: Mounted Kernel Debug File System.
Jul 19 17:52:54 db1218 systemd[1]: Mounted Kernel Trace File System.

Nothing in kern.log (starts with the reboot)
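
The silence in the syslog excerpt above can be measured with a small script. This is only a sketch: the helper names `parse_ts` and `largest_gap` are made up for illustration, and the year has to be passed in by hand because these syslog timestamps do not carry one.

```python
from datetime import datetime

def parse_ts(line, year=2023):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line.

    The year is an assumption supplied by the caller.
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

def largest_gap(lines, year=2023):
    """Return (before, after) for the longest pause between consecutive lines.

    Raises ValueError if fewer than two non-empty lines are given.
    """
    stamps = [parse_ts(line, year) for line in lines if line.strip()]
    return max(zip(stamps, stamps[1:]), key=lambda pair: pair[1] - pair[0])

# Four lines taken from the excerpt in this task.
sample = [
    "Jul 19 17:37:06 db1218 systemd[1]: Finished Export confd Prometheus metrics.",
    "Jul 19 17:37:07 db1218 systemd[1]: prometheus_puppet_agent_stats.service: Consumed 1.271s CPU time.",
    "Jul 19 17:52:54 db1218 systemd-modules-load[680]: Inserted module 'nf_conntrack'",
    "Jul 19 17:52:54 db1218 systemd[1]: Mounted Kernel Trace File System.",
]

before, after = largest_gap(sample)
gap_seconds = (after - before).total_seconds()
print(f"no log lines between {before:%H:%M:%S} and {after:%H:%M:%S} ({gap_seconds:.0f}s)")
```

On these lines the longest pause runs from the last entry at 17:37:07 to the first boot message at 17:52:54. A pause like this only shows that nothing was written to the log; it does not say whether the kernel, the disk or the hardware stopped first.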

Yeah that's the most likely issue

No network issues it seems.

Why is nothing showing up in the SEL? :/

It could just be OS related. Sometimes even HW stuff isn't logged.

How can we figure it out? Is it fine to leave it as is? I'm scared 😅

We should probably just make another 10.4 host the candidate master, just in case.

What is the plan with this?

I think we should stop using it as candidate master, but I need to figure out which host should become the candidate and do the dance (switching the binlog format to STATEMENT, etc.).

Maybe db1219: it doesn't have a history of crashing. We just need to check that it is in a different rack from the current master.

It's in C6, which also contains the candidate master of x1 (db1220, according to https://fault-tolerance.toolforge.org/map?cluster=db-master-candidates), so we might end up with two masters in the same rack. It's not that bad, though.

s1 replicas that are not in the same rack as any other master or candidate master:

db1186 (A8)
db1132 or db1206 (B8)
db1184 (D6)
db1196 (E2)

Which one sounds good to you?

The only one that we'd need to exclude is db1132 as it is 10.6

db1186 has just one task where it failed (T324858: db1186 power supplies not redundant), but it looks related to a cable not being correctly seated.
db1184 and db1196 have no tasks related to failures, so one of them should be fine.
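
The two filters discussed above (rack not shared with a master or candidate master, and MariaDB version 10.4) can be written down as a short sketch. The host and rack pairs come from this task; the set of occupied racks only contains the one rack named here, and treating every host other than db1132 as 10.4 is an assumption, not something checked against the fleet.

```python
# Rack of each s1 replica, as listed in this task.
RACKS = {
    "db1219": "C6",
    "db1186": "A8",
    "db1132": "B8",
    "db1206": "B8",
    "db1184": "D6",
    "db1196": "E2",
}

# Only db1132 is stated to run 10.6; assuming 10.4 for the rest.
VERSIONS = {host: "10.4" for host in RACKS}
VERSIONS["db1132"] = "10.6"

# Racks that already hold a master or candidate master.
# Only C6 (db1220, x1 candidate) is named in this task; a real list would be longer.
OCCUPIED_RACKS = {"C6"}

def eligible_candidates(racks, versions, occupied, wanted_version="10.4"):
    """Hosts on the wanted version whose rack holds no other (candidate) master."""
    return sorted(
        host
        for host, rack in racks.items()
        if rack not in occupied and versions[host] == wanted_version
    )

candidates = eligible_candidates(RACKS, VERSIONS, OCCUPIED_RACKS)
print(candidates)
```

With this input the script leaves db1184, db1186, db1196 and db1206; the choice among them (failure history, host age) is still a judgment call.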

I'd go with db1196 as the number implies it might be the newest host among them. Does that sound good to you?

Awesome.

I never changed a candidate master before.
Is it:

change binlog to be statement
change tags in orch and dbctl
make patch in puppet to change the comment

Anything else?

In T342284#9047722, @Ladsgroup wrote:

Awesome.

I never changed a candidate master before.
Is it:

change binlog to be statement

change tags in orch and dbctl

make patch in puppet to change the comment

Anything else?

Don't forget to double check the binlog has indeed changed its format on disk.
Other than that, that's all yeah

Mentioned in SAL (#wikimedia-operations) [2023-07-31T12:52:52Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool db1196 T342284', diff saved to https://phabricator.wikimedia.org/P49814 and previous config saved to /var/cache/conftool/dbconfig/20230731-125252-ladsgroup.json

So I ran

root@db1196.eqiad.wmnet[(none)]> stop slave;
Query OK, 0 rows affected (0.006 sec)

root@db1196.eqiad.wmnet[(none)]> SET GLOBAL binlog_format = 'STATEMENT';
Query OK, 0 rows affected (0.001 sec)

root@db1196.eqiad.wmnet[(none)]> start slave;
Query OK, 0 rows affected (0.076 sec)

root@db1196.eqiad.wmnet[(none)]> stop slave;
Query OK, 0 rows affected (0.004 sec)

root@db1196.eqiad.wmnet[(none)]> flush binary logs;
Query OK, 0 rows affected (0.002 sec)

root@db1196.eqiad.wmnet[(none)]> start slave;
Query OK, 0 rows affected (0.004 sec)

And the binlogs look SBR to me. Does that look correct to you, @Marostegui?

Yeah, just checked the logs on the host and they look SBR

Stupid question: what if MariaDB gets restarted? Since I set the binlog_format variable live in MariaDB, would a restart override it and leave us with RBR again?

Candidate masters don't have a dedicated hiera or puppet rule to set it in the MySQL config, which worries me a bit.

In T342284#9055664, @Ladsgroup wrote:

Stupid question: what if MariaDB gets restarted? Since I set the binlog_format variable live in MariaDB, would a restart override it and leave us with RBR again?

Candidate masters don't have a dedicated hiera or puppet rule to set it in the MySQL config, which worries me a bit.

Candidate masters do have a different puppet hiera:

cat db1218.yaml
# db1218
# candidate master for s1
mariadb::shard: 's1'
mariadb::binlog_format: 'STATEMENT'

You'd need to add that to the new candidate master and remove that from the old one.
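
To make the restart concern concrete, here is a rough sketch of the check: read the host's hiera file and see which binlog format a restart would fall back to. The helper names are hypothetical, the parser only handles flat `key: 'value'` lines like the file above (it is not a real YAML parser), and the `ROW` default is an assumption based on the worry in this task about ending up with RBR again.

```python
# Contents of db1218.yaml as quoted in this task.
DB1218_YAML = """\
# db1218
# candidate master for s1
mariadb::shard: 's1'
mariadb::binlog_format: 'STATEMENT'
"""

def hiera_value(text, key):
    """Return the value of a flat `key: 'value'` line, or None if absent."""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first ': ' so keys containing '::' stay intact.
        name, sep, value = line.partition(": ")
        if sep and name == key:
            return value.strip().strip("'\"")
    return None

def format_after_restart(hiera_text, default="ROW"):
    """Binlog format expected after a restart; `default` is an assumption."""
    return hiera_value(hiera_text, "mariadb::binlog_format") or default

# A host file without the override, as a new candidate would have before the patch.
without_override = "mariadb::shard: 's1'\n"

print(format_after_restart(DB1218_YAML))
print(format_after_restart(without_override))
```

The first call reports STATEMENT and the second falls back to the assumed default, which is why the hiera line has to move to the new candidate together with the live `SET GLOBAL`.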

ah, thanks. I missed it.

Mentioned in SAL (#wikimedia-operations) [2023-08-02T12:32:29Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool db1184 T342284', diff saved to https://phabricator.wikimedia.org/P49963 and previous config saved to /var/cache/conftool/dbconfig/20230802-123228-ladsgroup.json