db1218 crashed
Closed, Resolved, Public

Description

Can't ssh into it: s1 candidate master.

Event Timeline

Ladsgroup moved this task from Triage to In progress on the DBA board.

SEL is clean:

racadm>>getsel
Record:      1
Date/Time:   01/29/2023 17:23:14
Source:      system
Severity:    Ok
Description: Log cleared.
--------------------------------

Maybe network issue like yesterday with db1198?

It's quite persistent, I haven't been able to ssh into the host for at least ten minutes.

Yeah, yesterday it was the cable, I believe. Is the login prompt available if you try via iDRAC?
If not, let's get DCOps to take a look.

well, with powercycle, I still can't ssh into it. I guess it's a persistent network issue

sure. Thanks!

yeah, I'll go check if it was a network blip or something more nefarious.

I brought mysql back up so it keeps replicating, but I won't repool it until I've checked what happened and run a quick data integrity check.

You could check in the syslog whether the network link went down, or whether it was just a full crash.

Actually it looks like the whole server froze for ten-ish minutes (until reboot)

syslog:

Jul 19 17:37:06 db1218 systemd[1]: Finished Export confd Prometheus metrics.
Jul 19 17:37:07 db1218 systemd[1]: prometheus_puppet_agent_stats.service: Succeeded.
Jul 19 17:37:07 db1218 systemd[1]: Finished Regular job to collect puppet agent stats.
Jul 19 17:37:07 db1218 systemd[1]: prometheus_puppet_agent_stats.service: Consumed 1.271s CPU time.
Jul 19 17:52:54 db1218 systemd-modules-load[680]: Inserted module 'nf_conntrack'
Jul 19 17:52:54 db1218 systemd-modules-load[680]: Inserted module 'ipmi_devintf'
Jul 19 17:52:54 db1218 lvm[673]:   1 logical volume(s) in volume group "tank" monitored
Jul 19 17:52:54 db1218 systemd[1]: Mounted POSIX Message Queue File System.
Jul 19 17:52:54 db1218 systemd[1]: Mounted Kernel Debug File System.
Jul 19 17:52:54 db1218 systemd[1]: Mounted Kernel Trace File System.

Nothing in kern.log (starts with the reboot)
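
The freeze shows up in the syslog excerpt above as a jump from 17:37:07 straight to the first boot messages at 17:52:54. A minimal sketch for spotting that kind of gap, assuming the classic "Mon DD HH:MM:SS" syslog prefix; the function name largest_gap and the default year are illustrative assumptions, not an existing tool:

```python
from datetime import datetime, timedelta

def largest_gap(lines, year=2023):
    """Return (gap, before, after) for the largest jump between consecutive timestamps."""
    stamps = []
    for line in lines:
        # The classic syslog prefix is 15 characters, e.g. "Jul 19 17:37:06".
        try:
            stamps.append(datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S"))
        except ValueError:
            continue  # skip lines without a parseable timestamp
    best = (timedelta(0), None, None)
    for before, after in zip(stamps, stamps[1:]):
        if after - before > best[0]:
            best = (after - before, before, after)
    return best

# Two lines taken from the syslog excerpt in this task.
sample = [
    "Jul 19 17:37:07 db1218 systemd[1]: prometheus_puppet_agent_stats.service: Consumed 1.271s CPU time.",
    "Jul 19 17:52:54 db1218 systemd-modules-load[680]: Inserted module 'nf_conntrack'",
]
gap, before, after = largest_gap(sample)
print(gap)  # 0:15:47
```

A long silence followed directly by boot messages is consistent with a frozen host rather than a lost link, since a link failure alone would normally still leave the host logging.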

Yeah that's the most likely issue

Why is nothing showing up in the SEL? :/

It could just be OS related. Sometimes even HW stuff isn't logged.

How can we figure it out? Is it fine to leave it as is? I'm scared 😅

We should probably make another 10.4 host a candidate master, just in case.

I think we should change it so it's no longer the candidate master, but I need to figure out which host should become the candidate and do the dance (switching the binlog format to statement, etc.)

Maybe db1219 - it doesn't have a history of crashing. We just need to check that it is in a different rack from the current master.

it's in C6 which also contains candidate master of x1 (db1220: according to https://fault-tolerance.toolforge.org/map?cluster=db-master-candidates) and we might end up with two masters in the same rack. It's not that bad though.

s1 replicas that are not in the same rack as any other master or candidate master:

  • db1186 (A8)
  • db1132 or db1206 (B8)
  • db1184 (D6)
  • db1196 (E2)

Which one sounds good to you?

The only one that we'd need to exclude is db1132 as it is 10.6
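
The selection so far can be sketched as a small filter. The racks come from the list above, and db1132 being 10.6 and C6 hosting the x1 candidate come from this thread; the "10.4" value on the other hosts is an assumption (the thread only says db1132 is the exception), and the names REPLICAS and eligible are made up for illustration:

```python
# Rack placement as listed in this task. Versions other than db1132's are assumed.
REPLICAS = [
    {"host": "db1186", "rack": "A8", "version": "10.4"},
    {"host": "db1132", "rack": "B8", "version": "10.6"},
    {"host": "db1206", "rack": "B8", "version": "10.4"},
    {"host": "db1184", "rack": "D6", "version": "10.4"},
    {"host": "db1196", "rack": "E2", "version": "10.4"},
]

def eligible(replicas, wanted_version="10.4", occupied_racks=frozenset({"C6"})):
    """Keep hosts on the wanted version that do not share a rack with a (candidate) master."""
    return [
        r["host"]
        for r in replicas
        if r["version"] == wanted_version and r["rack"] not in occupied_racks
    ]

print(eligible(REPLICAS))  # ['db1186', 'db1206', 'db1184', 'db1196']
```

Only C6 is listed as occupied here because it is the only such rack named in the thread; a real check would need the full list of racks that hold masters and candidate masters.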

db1186 has just one failure task (T324858: db1186 power supplies not redundant), but it looks related to a cable not being set correctly.
db1184 and db1196 have no tasks related to failures, so either of them should be fine.

I'd go with db1196 as the number implies it might be the newest host among them. Does that sound good to you?

Awesome.

I never changed a candidate master before.
Is it:

  • change binlog format to statement
  • change tags in orch and dbctl
  • make patch in puppet to change the comment

Anything else?

Don't forget to double check the binlog has indeed changed its format on disk.
Other than that, that's all yeah

Mentioned in SAL (#wikimedia-operations) [2023-07-31T12:52:52Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool db1196 T342284', diff saved to https://phabricator.wikimedia.org/P49814 and previous config saved to /var/cache/conftool/dbconfig/20230731-125252-ladsgroup.json

So I ran

root@db1196.eqiad.wmnet[(none)]> stop slave;
Query OK, 0 rows affected (0.006 sec)

root@db1196.eqiad.wmnet[(none)]> SET GLOBAL binlog_format = 'STATEMENT';
Query OK, 0 rows affected (0.001 sec)

root@db1196.eqiad.wmnet[(none)]> start slave;
Query OK, 0 rows affected (0.076 sec)

root@db1196.eqiad.wmnet[(none)]> stop slave;
Query OK, 0 rows affected (0.004 sec)

root@db1196.eqiad.wmnet[(none)]> flush binary logs;
Query OK, 0 rows affected (0.002 sec)

root@db1196.eqiad.wmnet[(none)]> start slave;
Query OK, 0 rows affected (0.004 sec)

And the binlogs look SBR to me. Does it look correct to you, @Marostegui?

Yeah, just checked the logs on the host and they look SBR
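
One way to automate that "does it look SBR" check is a rough sketch like the one below. It assumes the Event_type column of SHOW BINLOG EVENTS for the freshly rotated binlog has already been collected into a list, and that row-based changes appear as Table_map plus *_rows events, which is how MariaDB reports them as far as I know. The helper name has_row_events and both sample lists are hypothetical and not copied from db1196:

```python
def has_row_events(event_types):
    """Return True if any event type looks like a row-based replication event."""
    for name in event_types:
        lowered = name.lower()
        # Row-based changes are logged as a Table_map event followed by *_rows events.
        if lowered == "table_map" or "_rows" in lowered:
            return True
    return False

# Hypothetical Event_type values, for illustration only.
sbr_sample = ["Gtid", "Query", "Xid"]
rbr_sample = ["Gtid", "Table_map", "Write_rows_v1", "Xid"]

print(has_row_events(sbr_sample))  # False
print(has_row_events(rbr_sample))  # True
```

It only makes sense to run this against a binlog file created after the flush, since older files still hold events written in the previous format.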

Stupid question: what if mariadb gets restarted? Since I set the binlog_format variable live in mariadb, would a restart override it so that we end up with RBR again?

Candidate masters don't have a dedicated hiera or puppet rule to set it in the mysql conf, which worries me a bit.

Candidate masters do have a different puppet hiera:

cat db1218.yaml
# db1218
# candidate master for s1
mariadb::shard: 's1'
mariadb::binlog_format: 'STATEMENT'

You'd need to add that to the new candidate master and remove that from the old one.

Mentioned in SAL (#wikimedia-operations) [2023-08-02T12:32:29Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool db1184 T342284', diff saved to https://phabricator.wikimedia.org/P49963 and previous config saved to /var/cache/conftool/dbconfig/20230802-123228-ladsgroup.json

Change 944897 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] mariadb: Switch candidate host of s1

https://gerrit.wikimedia.org/r/944897

Change 944897 merged by Ladsgroup:

[operations/puppet@production] mariadb: Switch candidate host of s1

https://gerrit.wikimedia.org/r/944897

By the power vested in me by @Marostegui I now pronounce db1184 as the new candidate master

Ladsgroup claimed this task.
Ladsgroup moved this task from In progress to Done on the DBA board.