
Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw
Closed, Resolved (Public)

Description

We got a few pages during switch_dc, despite there being no actual outage and no wrong configuration state:

You have 5 incidents.

Incident: 409
State:    Critical
Service:  db1103/MariaDB read only x1 #page
Message:  Notification Type: PROBLEM

Service: MariaDB read only x1 #page
Host: db1103
Address: 10.64.0.164
State: CRITICAL

Date/Time: Tue Sept 1 14:16:13 UTC 2020

Notes URLs: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only

Acknowledged by :

Additional Info:

CRIT: read_only: True, expected False: OK: Version 10.4.13-MariaDB-log, Uptime 7186674s, event_scheduler: True, 58.49 QPS, connection latency: 0.002301s, query latency: 0.000828s


Incident: 408
State:    Critical
Service:  db1123/MariaDB read only s3 #page
Message:  Notification Type: PROBLEM

Service: MariaDB read only s3 #page
Host: db1123
Address: 10.64.48.35
State: CRITICAL

Date/Time: Tue Sept 1 14:08:27 UTC 2020

Notes URLs: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only

Acknowledged by :

Additional Info:

CRIT: read_only: True, expected False: OK: Version 10.1.43-MariaDB, Uptime 10141557s, event_scheduler: True, 91.40 QPS, connection latency: 0.002181s, query latency: 0.000572s

Incident: 407
State:    Critical
Service:  db1100/MariaDB read only s5 #page
Message:  Notification Type: PROBLEM

Service: MariaDB read only s5 #page
Host: db1100
Address: 10.64.32.197
State: CRITICAL

Date/Time: Tue Sept 1 14:07:20 UTC 2020

Notes URLs: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only

Acknowledged by :

Additional Info:

CRIT: read_only: True, expected False: OK: Version 10.1.43-MariaDB, Uptime 10313104s, event_scheduler: True, 52.52 QPS, connection latency: 0.002177s, query latency: 0.000697s

Incident: 406
State:    Critical
Service:  db1093/MariaDB read only s6 #page
Message:  Notification Type: PROBLEM

Service: MariaDB read only s6 #page
Host: db1093
Address: 10.64.48.152
State: CRITICAL

Date/Time: Tue Sept 1 14:07:19 UTC 2020

Notes URLs: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only

Acknowledged by :

Additional Info:

CRIT: read_only: True, expected False: OK: Version 10.1.44-MariaDB, Uptime 4349659s, event_scheduler: True, 98.31 QPS, connection latency: 0.002759s, query latency: 0.000729s

Incident: 405
State:    Critical
Service:  es1021/MariaDB read only es4 #page
Message:  Notification Type: PROBLEM

Service: MariaDB read only es4 #page
Host: es1021
Address: 10.64.16.148
State: CRITICAL

Date/Time: Tue Sept 1 14:07:18 UTC 2020

Notes URLs: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only

Acknowledged by :

Additional Info:

CRIT: read_only: True, expected False: OK: Version 10.4.13-MariaDB-log, Uptime 4596161s, event_scheduler: True, 28.15 QPS, connection latency: 0.002521s, query latency: 0.000423s

While we could try to improve how this is handled dynamically (constrained by Icinga options), by downtiming these checks or handling them differently, after a few confusing Puppet runs on the hosts and on Icinga it became apparent that the calculated Puppet data (mw_primary) was flip-flopping between eqiad and codfw.
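For context, the page comes from a check that compares MariaDB's live read_only flag against the value expected from the Puppet-derived mw_primary data, so stale mw_primary data makes the check flap even though the databases themselves are fine. Below is a minimal sketch of that comparison; the function name and connection handling are illustrative, not the actual Icinga plugin.

```python
# Illustrative sketch only: compares the live read_only flag against the
# expectation derived from Puppet data (mw_primary). Not the real plugin.
import os
import pymysql


def check_read_only(host: str, expected_read_only: bool) -> str:
    """Return an Icinga-style status line for the read_only comparison."""
    conn = pymysql.connect(host=host,
                           read_default_file=os.path.expanduser("~/.my.cnf"))
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT @@GLOBAL.read_only")
            actual = bool(cur.fetchone()[0])
    finally:
        conn.close()
    if actual != expected_read_only:
        # This is the state that paged: MariaDB was fine, but the expected
        # value had been computed from stale mw_primary data.
        return f"CRIT: read_only: {actual}, expected {expected_read_only}"
    return f"OK: read_only: {actual}"


# Example: a master in the active DC is expected to be read-write.
# print(check_read_only("db1103.eqiad.wmnet", expected_read_only=False))
```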

This was narrowed down to a stalled confctl node serving outdated information (Riccardo will know more about this).

This task is to explain what happened and to identify actionable improvements so this doesn't happen again.

Event Timeline

The context of the outdated info was that confd was stuck on one of the puppetmasters, so when a DB hit that host for catalog compilation it got the wrong DC as primary, because confd had not updated the local file.
The fix was to restart confd on that host. It was stuck due to a certificate expiration, which might be related to the Puppet CA failover we did a while ago.
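A minimal sketch of a guard against that failure mode, assuming psutil is available and using an illustrative certificate path and unit name (the real paths on the puppetmasters differ): if the certificate on disk is newer than the running confd process, restart confd so it stops using the old, expired one.

```python
# Sketch under stated assumptions: CONFD_CERT is an illustrative path and the
# systemd unit name "confd" is assumed. Restart confd if its client
# certificate was renewed after the process started.
import os
import subprocess

import psutil

CONFD_CERT = "/etc/confd/ssl/client.pem"  # assumed path


def restart_confd_if_cert_renewed(cert_path: str = CONFD_CERT) -> None:
    cert_mtime = os.path.getmtime(cert_path)
    for proc in psutil.process_iter(["name", "create_time"]):
        if proc.info["name"] == "confd" and proc.info["create_time"] < cert_mtime:
            print("confd predates its certificate, restarting it")
            subprocess.run(["systemctl", "restart", "confd"], check=True)
            return


if __name__ == "__main__":
    restart_confd_if_cert_renewed()
```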

As for the actionable items, I'd suggest adding to the cleanup phase (08) a cookbook step that forces a Puppet run on all affected DBs (just the masters?) so they converge quickly. (In the doc I'm also about to suggest restarting all confd instances before the switchdc.) A sketch of that step follows below.
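A hedged sketch of what that cleanup step could look like: fan out an immediate Puppet run to the affected masters. The host list and the remote run-puppet-agent wrapper are placeholders; the real cookbooks would drive this through Spicerack/Cumin rather than plain SSH.

```python
# Illustrative only: host list and remote command are placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor

DB_MASTERS = ["db1103.eqiad.wmnet", "db1123.eqiad.wmnet"]  # placeholder list


def run_puppet(host: str) -> int:
    """SSH to the host and trigger an immediate Puppet agent run."""
    return subprocess.run(["ssh", host, "sudo run-puppet-agent"],
                          check=False).returncode


def main() -> None:
    with ThreadPoolExecutor(max_workers=10) as pool:
        for host, rc in zip(DB_MASTERS, pool.map(run_puppet, DB_MASTERS)):
            print(f"{host}: {'OK' if rc == 0 else 'exit %d' % rc}")


if __name__ == "__main__":
    main()
```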

Marostegui moved this task from Triage to Blocked external/Not db team on the DBA board.
Marostegui added subscribers: RLazarus, CDanis.

Restarting all confds before switching DC seems overkill and, frankly, useless. We should instead remember to restart them when we update the Puppet CA, just as we have to do with anything Docker-related.

I don't think there is much left to do for this task besides that (and maybe adding a Puppet run on the databases in phase 8).

@RLazarus what do you want to do with this task? Is this something that needs fixing before the switch back, or should we close it per @Joe's comment?

Change 636471 had a related patch set uploaded (by RLazarus; owner: RLazarus):
[operations/cookbooks@master] switchdc: Run Puppet on DB masters after setting read-write

https://gerrit.wikimedia.org/r/636471

Change 636471 merged by jenkins-bot:
[operations/cookbooks@master] switchdc: Run Puppet on DB masters after setting read-write

https://gerrit.wikimedia.org/r/636471

RLazarus claimed this task.

@RLazarus what do you want to do with this task? Is this something that needs fixing before the switch back, or should we close it per @Joe's comment?

Sorry for the delay -- I think now that we run Puppet on the DB masters, this is okay to close.

I agree with Joe that the right answer is to restart confd whenever we do work that requires it, rather than right before a switchover. I do think it's a real issue if we don't reliably complete that step, and I'm skeptical that "well, next time we should just not forget to do it" is an effective strategy. But I agree that restarting them defensively before a switchover isn't the answer -- and either way that problem is certainly out of scope for DBA.

The immediate switchover issue is solved with the Puppet run, so let's consider this resolved.