Can't ssh into it: s1 candidate master.
Description
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| mariadb: Switch candidate host of s1 | operations/puppet | production | +2 -2 |
Related Objects
Event Timeline
SEL is clean:
```
racadm>>getsel
Record:      1
Date/Time:   01/29/2023 17:23:14
Source:      system
Severity:    Ok
Description: Log cleared.
--------------------------------
```
Mentioned in SAL (#wikimedia-operations) [2023-07-19T17:49:43Z] <Amir1> powercycled db1218 (T342284)
It's quite persistent, I haven't been able to ssh into the host for at least ten minutes.
Yeah, yesterday it was the cable, I believe. Is the login prompt available if you try via iDRAC?
If not, let's get DCOps to take a look
Well, even after a powercycle I still can't ssh into it. I guess it's a persistent network issue.
I brought MySQL back up so it keeps replicating, but I won't repool it before checking what happened and running a quick data integrity check.
You could check the syslog to see whether the network link went down, or if it was a full crash.
Actually it looks like the whole server froze for ten-ish minutes (until reboot)
syslog:
```
Jul 19 17:37:06 db1218 systemd[1]: Finished Export confd Prometheus metrics.
Jul 19 17:37:07 db1218 systemd[1]: prometheus_puppet_agent_stats.service: Succeeded.
Jul 19 17:37:07 db1218 systemd[1]: Finished Regular job to collect puppet agent stats.
Jul 19 17:37:07 db1218 systemd[1]: prometheus_puppet_agent_stats.service: Consumed 1.271s CPU time.
Jul 19 17:52:54 db1218 systemd-modules-load[680]: Inserted module 'nf_conntrack'
Jul 19 17:52:54 db1218 systemd-modules-load[680]: Inserted module 'ipmi_devintf'
Jul 19 17:52:54 db1218 lvm[673]: 1 logical volume(s) in volume group "tank" monitored
Jul 19 17:52:54 db1218 systemd[1]: Mounted POSIX Message Queue File System.
Jul 19 17:52:54 db1218 systemd[1]: Mounted Kernel Debug File System.
Jul 19 17:52:54 db1218 systemd[1]: Mounted Kernel Trace File System.
```
Nothing in kern.log (starts with the reboot)
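The length of the freeze can be read straight off the gap in the syslog paste above (last entry before the hang vs. first entry after boot); a minimal sketch:

```python
from datetime import datetime

# Timestamps taken from the syslog excerpt above (year omitted in syslog,
# so strptime defaults it; only the difference matters here).
last_before = datetime.strptime("Jul 19 17:37:07", "%b %d %H:%M:%S")
first_after = datetime.strptime("Jul 19 17:52:54", "%b %d %H:%M:%S")

gap = first_after - last_before
print(gap)  # 0:15:47
```

So the host was silent for just under sixteen minutes between the last pre-freeze log line and the first post-reboot one.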
I think we should change it to no longer be the candidate master, but I need to figure out which host should become the candidate and do the dance (switching the binlog format to STATEMENT, etc.)
Maybe db1219 - it doesn't have a history of crashing. We just need to check it is in a different rack from the current master.
It's in C6, which also contains the candidate master of x1 (db1220, according to https://fault-tolerance.toolforge.org/map?cluster=db-master-candidates), so we might end up with two masters in the same rack. It's not that bad though.
s1 replicas that are not in the same rack as any other master or candidate master:
- db1186 (A8)
- db1132 or db1206 (B8)
- db1184 (D6)
- db1196 (E2)
Which one sounds good to you?
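The rack-exclusion reasoning above can be sketched as a small script. The rack assignments below are illustrative, taken from this discussion rather than the live topology map, and the current s1 master's rack is a hypothetical placeholder:

```python
# Find s1 replicas that do not share a rack with any current master or
# candidate master. Data is illustrative, based on the comments in this
# task, not pulled from the live fault-tolerance map.
masters_and_candidates = {
    "s1-master": "D1",   # hypothetical rack for the current s1 master
    "db1220": "C6",      # candidate master of x1, per the comment above
}
s1_replicas = {
    "db1219": "C6",
    "db1186": "A8",
    "db1132": "B8",
    "db1206": "B8",
    "db1184": "D6",
    "db1196": "E2",
}

occupied_racks = set(masters_and_candidates.values())
eligible = sorted(
    host for host, rack in s1_replicas.items() if rack not in occupied_racks
)
print(eligible)
```

With these assumptions, db1219 drops out (it shares C6 with db1220) and the remaining five hosts match the list above.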
The only one we'd need to exclude is db1132, as it is on 10.6.
db1186 has just one failure task, T324858: db1186 power supplies not redundant, but that looks related to a cable not being seated correctly.
db1184 and db1196 have no tasks related to failures, so either of them should be fine.
I'd go with db1196 as the number implies it might be the newest host among them. Does that sound good to you?
Awesome.
I never changed a candidate master before.
Is it:
- change binlog format to STATEMENT
- change tags in orch and dbctl
- make patch in puppet to change the comment
Anything else?
Don't forget to double check the binlog has indeed changed its format on disk.
Other than that, that's all yeah
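One rough way to do the on-disk double check mentioned above is to look at the event types in a `mysqlbinlog` text dump: statement-based logs carry the original SQL in `Query` events, while row-based logs show `Table_map` plus `Write_rows`/`Update_rows`/`Delete_rows` events. A small sketch of that classification (the event sequences are hypothetical samples, not output from a live host):

```python
# Classify a binlog as row-based or statement-based from the event types
# seen in a `mysqlbinlog` text dump. Event names are the standard
# MariaDB/MySQL ones; the samples below are illustrative.
ROW_EVENTS = {"Table_map", "Write_rows", "Update_rows", "Delete_rows"}

def looks_row_based(event_types):
    """Return True if any row-format event appears in the sequence."""
    return any(ev in ROW_EVENTS for ev in event_types)

# Hypothetical event sequences for one transaction:
sbr_sample = ["Format_desc", "Gtid", "Query", "Xid"]
rbr_sample = ["Format_desc", "Gtid", "Table_map", "Write_rows", "Xid"]

print(looks_row_based(sbr_sample))  # False: only Query events, looks SBR
print(looks_row_based(rbr_sample))  # True: row events present, looks RBR
```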
Mentioned in SAL (#wikimedia-operations) [2023-07-31T12:52:52Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool db1196 T342284', diff saved to https://phabricator.wikimedia.org/P49814 and previous config saved to /var/cache/conftool/dbconfig/20230731-125252-ladsgroup.json
So I ran
```
root@db1196.eqiad.wmnet[(none)]> stop slave;
Query OK, 0 rows affected (0.006 sec)
root@db1196.eqiad.wmnet[(none)]> SET GLOBAL binlog_format = 'STATEMENT';
Query OK, 0 rows affected (0.001 sec)
root@db1196.eqiad.wmnet[(none)]> start slave;
Query OK, 0 rows affected (0.076 sec)
root@db1196.eqiad.wmnet[(none)]> stop slave;
Query OK, 0 rows affected (0.004 sec)
root@db1196.eqiad.wmnet[(none)]> flush binary logs;
Query OK, 0 rows affected (0.002 sec)
root@db1196.eqiad.wmnet[(none)]> start slave;
Query OK, 0 rows affected (0.004 sec)
```
And the binlogs look SBR to me. Does it look correct to you, @Marostegui?
Stupid question: what if MariaDB gets restarted? Since I set the binlog_format variable live, would a restart override it so we end up with RBR again?
Candidate masters don't seem to have a dedicated hiera or puppet rule to set it in the mysql config, and that worries me a bit.
Candidate masters do have dedicated puppet hiera:
```
cat db1218.yaml
# db1218
# candidate master for s1
mariadb::shard: 's1'
mariadb::binlog_format: 'STATEMENT'
```
You'd need to add that to the new candidate master and remove it from the old one.
Mentioned in SAL (#wikimedia-operations) [2023-08-02T12:32:29Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool db1184 T342284', diff saved to https://phabricator.wikimedia.org/P49963 and previous config saved to /var/cache/conftool/dbconfig/20230802-123228-ladsgroup.json
Change 944897 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[operations/puppet@production] mariadb: Switch candidate host of s1
Change 944897 merged by Ladsgroup:
[operations/puppet@production] mariadb: Switch candidate host of s1
By the power vested in me by @Marostegui I now pronounce db1184 as the new candidate master