
Upgrade and restart m1 master (db1080)
Closed, ResolvedPublic

Description

In order to enable the report_host flag (T266483) on the m1 master, we need to restart its MySQL.
This host currently has the following active databases:

bacula
bacula9
cas
cas_staging
dbbackups
etherpadlite
librenms
pki
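For context, report_host is not a dynamic server variable in MariaDB/MySQL, so it cannot be changed with SET GLOBAL; it has to be set in the server configuration and is only picked up on a restart, which is why this maintenance requires one. A minimal sketch of the relevant config fragment (the file path and section layout are illustrative assumptions, not the production puppetized config):

# /etc/mysql/my.cnf (illustrative path)
[mysqld]
# hostname this server announces about itself (exposed as @@report_host)
report_host = db1080.eqiad.wmnet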

When: Thursday 21st January at 09:00AM UTC

  • Impact: The above databases will be read-only.

This is just a daemon restart, so it shouldn't take too long, maybe a couple of minutes of read-only time once the proxy has failed over to the read-only replica.
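For those unfamiliar with the m1 proxy layer: the failover mentioned above is HAProxy moving traffic to a standby server once the master stops passing health checks. A minimal sketch of such a backend, with illustrative hostnames, ports and check settings rather than the real dbproxy configuration:

listen m1-master
    bind *:3306
    mode tcp
    # requires a 'haproxy' MySQL account that is allowed to connect
    option mysql-check user haproxy
    # all traffic goes to the master while it passes the health check
    server db1080 db1080.eqiad.wmnet:3306 check
    # the read-only replica only receives traffic if the master check fails
    server db1117 db1117.eqiad.wmnet:3306 check backup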

Event Timeline

Marostegui added subscribers: jcrespo, akosiaris.

@akosiaris @jcrespo what about Thursday 21 at 09:00AM UTC for the mysql restart of this host?
@jcrespo, if this doesn't work because it could mess with backup times, let me know so we can look for a more suitable day and time.

Marostegui triaged this task as Medium priority. Jan 8 2021, 1:33 PM
Marostegui moved this task from Triage to Ready on the DBA board.

> @jcrespo, if this doesn't work because it could mess with backup times, let me know so we can look for a more suitable day and time.

That time would be ideal; I expect no issues unless we have an ongoing restore at that time.

Excellent, thanks. I will double check with @akosiaris to see if he can be around in case etherpad requires some action.

> Excellent, thanks. I will double check with @akosiaris to see if he can be around in case etherpad requires some action.

Fine by me.

Thank you both, calendar invite sent.

Pinging the service owners for this upcoming maintenance: @jbond @ayounsi @MoritzMuehlenhoff
@Trizek-WMF just to let you know that Etherpad might be read-only for a brief period of time; not sure if we need to ping the community about it.

> Pinging the service owners for this upcoming maintenance: @jbond @ayounsi @MoritzMuehlenhoff

That's fine, this only impacts adding new U2F tokens during the r/o period.

Not familiar with all the databases – is Etherpad the only user-facing thing being affected here? There'll be no other potential disruptions?

> Not familiar with all the databases – is Etherpad the only user-facing thing being affected here? There'll be no other potential disruptions?

Yeah, the only user-facing database here is Etherpad. The rest are databases used internally by the SRE team.

Fine for LibreNMS, just don't do it during a major network outage :)

OK, thanks for the heads-up. I don't think a couple of minutes of Etherpad read-only is enough of a problem to be included in Tech News, but I appreciate the tag. I'd rather remove it ten times than miss the one thing I should have included.

Is there a way to add a notice on Etherpad like we do on MediaWiki during read-only periods?

Do we have data about the average number of edits/usages of this service?

> Is there a way to add a notice on Etherpad like we do on MediaWiki during read-only periods?

Not as far as I know.

> Do we have data about the average number of edits/usages of this service?

Yeah, we do: https://grafana.wikimedia.org/d/000000193/etherpad?orgId=1&from=now-24h&to=now and it is not massive :)

> Is there a way to add a notice on Etherpad like we do on MediaWiki during read-only periods?

@Trizek-WMF If I may suggest a way: I think in the past a simple email to the wikitech-l list was enough. While the overlap between the technical community and Etherpad's users may not be 1:1, it should be enough to catch a huge blocker, like a large event happening at the same time (which is probably unlikely).

I forgot about the wikitech-l email. Good idea! I will send one tomorrow.

Procedure:

Pre restart

  • Silence m1 hosts
  • buffer pool dump, plus disabling the dump at shutdown, done in advance to make the restart faster (see the SQL sketch below)
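One plausible reading of that pre-restart bullet, as SQL on the master (these are the standard InnoDB variables; whether production also tweaks innodb_buffer_pool_dump_pct is not recorded here):

-- dump the current buffer pool contents to disk now, while the server is up
SET GLOBAL innodb_buffer_pool_dump_now = ON;
-- skip the automatic dump during shutdown, since we just took one
SET GLOBAL innodb_buffer_pool_dump_at_shutdown = OFF;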

Restart

  • !log m1 master restart - T271540
  • db1080: restart mysql
  • verify report_host is enabled
  • verify read_only is OFF
  • Once mysql is back: reload haproxy on dbproxy1014 (active) and dbproxy1012 (passive)
  • check everything is OK (a rough shell sketch of these steps follows this list)
  • close this task
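Taken together, the restart block could look roughly like the following shell sketch (the service name, the plain mysql client invocation and the proxy FQDNs are assumptions; the real procedure goes through our own tooling):

# on db1080
systemctl restart mariadb
mysql -e "SELECT @@report_host, @@read_only"
# expect report_host = db1080.eqiad.wmnet and read_only = 0

# reload the proxies, active one first
ssh dbproxy1014.eqiad.wmnet 'systemctl reload haproxy'
ssh dbproxy1012.eqiad.wmnet 'systemctl reload haproxy'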

> I forgot about the wikitech-l email. Good idea! I will send one tomorrow.

Sent to wikitech-l and wikitech-ambassadors
https://lists.wikimedia.org/pipermail/wikitech-ambassadors/2021-January/002396.html

All hosts but the master have report_host enabled.

./section m1 | while read host port; do echo "=== $host:$port ==="; mysql.py -h$host:$port -e "select @@report_host"; done
=== db2132.codfw.wmnet:3306 ===
@@report_host
db2132.codfw.wmnet
=== db2078.codfw.wmnet:3321 ===
@@report_host
db2078.codfw.wmnet
=== db1117.eqiad.wmnet:3321 ===
@@report_host
db1117.eqiad.wmnet
=== db1080.eqiad.wmnet:3306 ===
@@report_host
NULL

Reminder: This is happening in around 1h

Mentioned in SAL (#wikimedia-operations) [2021-01-21T08:37:52Z] <marostegui> Silence m1 hosts in preparation for the restart T271540

Mentioned in SAL (#wikimedia-operations) [2021-01-21T08:51:41Z] <jynus> stopping puppet and bacula for backup1001 T271540

This was done.
Downtime was from 09:00:19 to 09:00:48, so 29 seconds of downtime.

Closing this as resolved - thanks everyone for the help!

root@db1080.eqiad.wmnet[(none)]> select @@report_host;
+--------------------+
| @@report_host      |
+--------------------+
| db1080.eqiad.wmnet |
+--------------------+
1 row in set (0.001 sec)