We got a few pages during switch_dc, even though there was no actual outage and no wrong status in the configuration:
You have 5 incidents.

Incident: 409 State: Critical Service: db1103/MariaDB read only x1 #page Message: Notification Type: PROBLEM Service: MariaDB read only x1 #page Host: db1103 Address: 10.64.0.164 State: CRITICAL Date/Time: Tue Sept 1 14:16:13 UTC 2020 Notes URLs: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only Acknowledged by : Additional Info: CRIT: read_only: True, expected False: OK: Version 10.4.13-MariaDB-log, Uptime 7186674s, event_scheduler: True, 58.49 QPS, connection latency: 0.002301s, query latency: 0.000828s

Incident: 408 State: Critical Service: db1123/MariaDB read only s3 #page Message: Notification Type: PROBLEM Service: MariaDB read only s3 #page Host: db1123 Address: 10.64.48.35 State: CRITICAL Date/Time: Tue Sept 1 14:08:27 UTC 2020 Notes URLs: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only Acknowledged by : Additional Info: CRIT: read_only: True, expected False: OK: Version 10.1.43-MariaDB, Uptime 10141557s, event_scheduler: True, 91.40 QPS, connection latency: 0.002181s, query latency: 0.000572s

Incident: 407 State: Critical Service: db1100/MariaDB read only s5 #page Message: Notification Type: PROBLEM Service: MariaDB read only s5 #page Host: db1100 Address: 10.64.32.197 State: CRITICAL Date/Time: Tue Sept 1 14:07:20 UTC 2020 Notes URLs: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only Acknowledged by : Additional Info: CRIT: read_only: True, expected False: OK: Version 10.1.43-MariaDB, Uptime 10313104s, event_scheduler: True, 52.52 QPS, connection latency: 0.002177s, query latency: 0.000697s

Incident: 406 State: Critical Service: db1093/MariaDB read only s6 #page Message: Notification Type: PROBLEM Service: MariaDB read only s6 #page Host: db1093 Address: 10.64.48.152 State: CRITICAL Date/Time: Tue Sept 1 14:07:19 UTC 2020 Notes URLs: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only Acknowledged by : Additional Info: CRIT: read_only: True, expected False: OK: Version 10.1.44-MariaDB, Uptime 4349659s, event_scheduler: True, 98.31 QPS, connection latency: 0.002759s, query latency: 0.000729s

Incident: 405 State: Critical Service: es1021/MariaDB read only es4 #page Message: Notification Type: PROBLEM Service: MariaDB read only es4 #page Host: es1021 Address: 10.64.16.148 State: CRITICAL Date/Time: Tue Sept 1 14:07:18 UTC 2020 Notes URLs: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only Acknowledged by : Additional Info: CRIT: read_only: True, expected False: OK: Version 10.4.13-MariaDB-log, Uptime 4596161s, event_scheduler: True, 28.15 QPS, connection latency: 0.002521s, query latency: 0.000423s

We could try to improve how this is handled dynamically (constrained by the available icinga options), either by downtiming these checks or by handling them differently. However, after a few confusing puppet runs on the hosts and on icinga, it became apparent that the calculated puppet data (mw_primary) was flip-flopping between eqiad and codfw.
This was narrowed down to a stalled confctl node giving outdated information (Riccado will know more about this).
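
To illustrate the failure mode, here is a minimal sketch of how disagreement between nodes could be spotted, assuming we already have the primary datacenter value that each node reports. The node names and the `find_disagreeing_nodes` helper are hypothetical and not part of confctl or puppet; the sketch does not query any real service, and the majority value is only a hint, not proof of which node is stale.

```python
from collections import Counter


def find_disagreeing_nodes(answers):
    """Return (majority_value, outlier_nodes) for a node -> value mapping.

    `answers` maps a (hypothetical) node name to the primary datacenter
    it reported. Outliers are the nodes whose value differs from the most
    common one; they are candidates for being stale, nothing more.
    """
    if not answers:
        return None, []
    counts = Counter(answers.values())
    majority, _ = counts.most_common(1)[0]
    outliers = sorted(node for node, value in answers.items() if value != majority)
    return majority, outliers


# Hypothetical data: two nodes agree, one reports a different value.
answers = {"conf-a": "codfw", "conf-b": "codfw", "conf-c": "eqiad"}
majority, outliers = find_disagreeing_nodes(answers)
print(majority, outliers)
```

If successive puppet runs each read from a different node, a single outlier like this would be enough to make mw_primary alternate between the two values.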
This task is to explain what happened and to identify actionables so that it doesn't happen again.