
Enable report_host for mariadb
Open, Medium, Public

Description

By default mariadb does not report a hostname when it starts replicating from another server. This means the output of show slave hosts contains only very opaque information:

root@pc2007.codfw.wmnet[(none)]> show slave hosts;
+-----------+------+------+-----------+
| Server_id | Host | Port | Master_id |
+-----------+------+------+-----------+
| 171966644 |      | 3306 | 180355176 |
| 180367374 |      | 3306 | 180355176 |
+-----------+------+------+-----------+
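
The empty Host column above can also be detected programmatically. The following is a minimal sketch, assuming the table is captured as text from the mysql client; the helper names parse_mysql_table and replicas_without_host are illustrative and not part of any tooling mentioned in this task.

```python
def parse_mysql_table(text):
    """Parse the ASCII table printed by the mysql client into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ borders, prompts, and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first "|" row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def replicas_without_host(text):
    """Return the Server_id of every replica whose Host column is empty."""
    return [row["Server_id"] for row in parse_mysql_table(text) if not row["Host"]]


# Output copied from the description above.
SAMPLE = """
+-----------+------+------+-----------+
| Server_id | Host | Port | Master_id |
+-----------+------+------+-----------+
| 171966644 |      | 3306 | 180355176 |
| 180367374 |      | 3306 | 180355176 |
+-----------+------+------+-----------+
"""

print(replicas_without_host(SAMPLE))
```

With the sample above, both replicas are listed, since neither reports a hostname.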

If we use puppet to set report_host to each host's fqdn, the hostname will show up in the show slave hosts output and will also be available for orchestrator to query.

NOTE: This requires the MySQL daemon to be restarted, because report_host is not a dynamic variable.

https://mariadb.com/kb/en/replication-and-binary-log-system-variables/#report_host
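
As a rough illustration of the setting itself, the sketch below renders a config fragment that sets report_host to a host's FQDN. The actual puppet template is not shown in this task, so the [mysqld] placement and the render_report_host helper are assumptions made for the example.

```python
import socket


def render_report_host(fqdn=None):
    """Render a [mysqld] fragment that sets report_host to the given FQDN."""
    if fqdn is None:
        fqdn = socket.getfqdn()  # fall back to the local machine's FQDN
    return "[mysqld]\nreport_host = {}\n".format(fqdn)


# pc2007.codfw.wmnet is the host used in the example output above.
print(render_report_host("pc2007.codfw.wmnet"))
```

The setting only takes effect after the daemon is restarted, which is why the rollout below proceeds host by host.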

Scripts used for rolling this out:

Restarting hosts progress

  • s1
    • eqiad
    • codfw
  • s2
    • eqiad
    • codfw
  • s3
    • eqiad
    • codfw
  • s4
    • eqiad
    • codfw
  • s5
    • eqiad
    • codfw
  • s6
    • eqiad
    • codfw
  • s7
    • eqiad
    • codfw
  • s8
    • eqiad
    • codfw
  • x1
    • eqiad
    • codfw
  • eqiad testing host
  • eqiad backup testing (db1133)
  • db2102 - codfw backup testing host
  • dbstore1003, dbstore1004, dbstore1005 (multi-instance)
  • eqiad backup sources db1095 db1102 db1116 db1139 db1140 db1145 db1150
  • codfw backup sources db2097 db2098 db2099 db2100 db2101 db2139 db2141
    • folded into main section above
  • labsdb1009, labsdb1010, labsdb1011 (no need, will be replaced with the new clouddb hosts)
  • labsdb1012
  • Sanitarium hosts
    • db2094/db2095
      • db1124
        • s1
        • s3
        • s5
        • s8
      • db1125
        • s2
        • s4
        • s6
        • s7
  • es1
  • es2
  • es3
  • es4
    • eqiad
    • codfw
  • es5
  • pc1
  • pc2
    • eqiad
    • codfw
  • pc3
    • eqiad
    • codfw
  • m1
  • m2
  • m3
  • m5
  • tendril/zarcillo/orchestrator
    • db1115
    • db2093

Event Timeline

Kormat added a comment. Nov 3 2020, 3:50 PM

@jcrespo : perfect. Yep, just needs mariadb restarted to pick it up.

Kormat added a comment. Nov 3 2020, 4:53 PM

s1/s2/s3/s5 in codfw are done, excluding the sanitarium masters + sanitariums, and dbstore hosts.

BTW, check if the prometheus exporter daemon needs a shake on restarted hosts; there are quite a few collection failures showing on grafana (unless it is something else).

Most likely the exporters need a restart on hosts running mariadb 10.4, due to the bug we saw when testing 10.4.

Kormat added a comment. Nov 4 2020, 8:27 AM

BTW, check if the prometheus exporter daemon needs a shake on restarted hosts; there are quite a few collection failures showing on grafana (unless it is something else).

Good catch, thanks. Fixed: https://thanos.wikimedia.org/graph?g0.range_input=1h&g0.max_source_resolution=0s&g0.expr=sum_over_time(mysql_exporter_last_scrape_error%5B5m%5D)%20%3E%204&g0.tab=1
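
The query behind that link is URL-encoded, which makes it hard to read at a glance. The sketch below decodes it with the Python standard library; the promql_from_graph_url helper and its panel parameter are illustrative names, and the URL is the one posted above.

```python
from urllib.parse import parse_qs, urlparse

# The graph link from the comment above, split across lines for readability.
THANOS_URL = (
    "https://thanos.wikimedia.org/graph?g0.range_input=1h"
    "&g0.max_source_resolution=0s"
    "&g0.expr=sum_over_time(mysql_exporter_last_scrape_error%5B5m%5D)%20%3E%204"
    "&g0.tab=1"
)


def promql_from_graph_url(url, panel="g0"):
    """Return the decoded expression of one panel from a graph URL."""
    query = parse_qs(urlparse(url).query)  # parse_qs also percent-decodes values
    return query[panel + ".expr"][0]


print(promql_from_graph_url(THANOS_URL))
```

The decoded expression is sum_over_time(mysql_exporter_last_scrape_error[5m]) > 4, which selects exporters whose summed scrape-error value over five minutes exceeds 4.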

Kormat added a comment. Nov 4 2020, 9:50 AM

s6/s7 in codfw are done, excluding the sanitarium masters + sanitariums, and dbstore hosts.

Kormat updated the task description. (Show Details) Nov 4 2020, 9:58 AM

Mentioned in SAL (#wikimedia-operations) [2020-11-04T13:35:03Z] <jynus> restart mysqls at db1095 T266483

Mentioned in SAL (#wikimedia-operations) [2020-11-04T13:40:11Z] <jynus> restart mysqls at db1102 T266483

Mentioned in SAL (#wikimedia-operations) [2020-11-04T13:43:23Z] <jynus> restart mysqls at db1116 T266483

Mentioned in SAL (#wikimedia-operations) [2020-11-04T13:47:19Z] <jynus> restart mysqls at db1139 T266483

Mentioned in SAL (#wikimedia-operations) [2020-11-04T13:51:42Z] <jynus> restart mysqls at db1140 T266483

Mentioned in SAL (#wikimedia-operations) [2020-11-04T13:54:05Z] <jynus> restart mysqls at db1145 T266483

Mentioned in SAL (#wikimedia-operations) [2020-11-04T13:59:18Z] <jynus> restart mysqls at db1150 T266483

Mentioned in SAL (#wikimedia-operations) [2020-11-04T14:14:59Z] <jynus> restart mysqls at db209[789],db210[01], db2139, db2141 T266483
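
The bracket shorthand in that log entry can be expanded mechanically. The following is a small sketch that treats [789] as "one character from this set", similar to a shell character class; expand_hosts is a hypothetical helper, and ranges such as [7-9] are deliberately not handled.

```python
import itertools
import re


def expand_hosts(spec):
    """Expand a comma-separated host list where [abc] means one character from the set."""
    hosts = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        # re.split with a capturing group keeps the bracket contents at odd indexes.
        parts = re.split(r"\[([^\]]+)\]", item)
        choices = [list(part) if i % 2 else [part] for i, part in enumerate(parts)]
        hosts.extend("".join(combo) for combo in itertools.product(*choices))
    return hosts


print(expand_hosts("db209[789],db210[01], db2139, db2141"))
```

The result is db2097, db2098, db2099, db2100, db2101, db2139 and db2141, which matches the codfw backup sources listed in the task description.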

Mentioned in SAL (#wikimedia-operations) [2020-11-04T14:37:55Z] <jynus> restart mysql at db1133 T266483

jcrespo updated the task description. (Show Details) Nov 4 2020, 2:44 PM

I restarted mysql on all backup source instances, as well as the backup testing host. I will now run a script to double-check that the variable is set correctly.
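
The script itself is not included in this task. A check of that kind could look roughly like the sketch below, assuming the value of @@report_host has already been collected from each host; find_mismatches and the sample readings are hypothetical.

```python
def find_mismatches(reported):
    """Return {host: value} for hosts whose report_host differs from their own FQDN.

    `reported` maps each host's FQDN to the value of @@report_host read from it.
    """
    return {host: value for host, value in reported.items() if value != host}


# Hypothetical readings: db1102 stands in for an instance that did not pick up
# the setting (None means no value was reported).
readings = {
    "db1095.eqiad.wmnet": "db1095.eqiad.wmnet",
    "db1102.eqiad.wmnet": None,
    "db2097.codfw.wmnet": "db2097.codfw.wmnet",
}

print(find_mismatches(readings))
```

An empty result would mean every checked host reports its own FQDN.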

Kormat updated the task description. (Show Details) Nov 4 2020, 3:35 PM
Kormat added a comment. Edited Nov 5 2020, 9:35 AM

es1 eqiad:

  • es1012
  • es1016
  • es1018
  • es1027
  • es1029

Kormat updated the task description. (Show Details) Edited Nov 5 2020, 9:35 AM

es2 eqiad:

  • es1011
  • es1013
  • es1015
  • es1026
  • es1030

Kormat updated the task description. (Show Details) Edited Nov 5 2020, 9:36 AM

es3 eqiad:

  • es1014
  • es1017
  • es1019
  • es1028
  • es1031

Kormat updated the task description. (Show Details) Nov 5 2020, 9:37 AM
Kormat moved this task from In progress to Ready on the DBA board. Nov 5 2020, 10:13 AM
Kormat moved this task from Unsorted 💣 to Blocked 🚧 on the User-Kormat board.
Kormat updated the task description. (Show Details)

Mentioned in SAL (#wikimedia-operations) [2020-11-10T06:44:44Z] <marostegui> Restart pc1010 to pick up report_host - T266483

Marostegui added a comment. Edited Tue, Nov 10, 6:46 AM

pc1 eqiad:

  • pc1010
  • pc1007

Mentioned in SAL (#wikimedia-operations) [2020-11-10T06:53:58Z] <marostegui> Restart dbstore* to pick up report_host - T266483

Marostegui moved this task from Ready to In progress on the DBA board. Tue, Nov 10, 7:16 AM

Mentioned in SAL (#wikimedia-operations) [2020-11-10T13:17:23Z] <marostegui> Restart db1117* to pick up report_host - T266483

m1 eqiad:

  • db1117
  • db1080

m2 eqiad:

  • db1117
  • db1107

m3 eqiad:

  • db1117
  • db1132

m5 eqiad:

  • db1117
  • db1128

Mentioned in SAL (#wikimedia-operations) [2020-11-10T13:28:59Z] <marostegui> Restart db2093 to pick up report_host - T266483

Change 641168 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Pool pc1010 instead of pc1007.

https://gerrit.wikimedia.org/r/641168

Change 641168 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Pool pc1010 instead of pc1007.

https://gerrit.wikimedia.org/r/641168

Mentioned in SAL (#wikimedia-operations) [2020-11-16T14:06:38Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool pc1007 and place pc1010 instead of it T266483 (duration: 01m 00s)

Mentioned in SAL (#wikimedia-operations) [2020-11-16T14:06:45Z] <marostegui> Restart pc1007's mysql T266483

pc1 master done:

root@pc1007:~# mysql -e "select @@report_host"
+--------------------+
| @@report_host      |
+--------------------+
| pc1007.eqiad.wmnet |
+--------------------+

Mentioned in SAL (#wikimedia-operations) [2020-11-16T14:12:09Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool pc1007 in pc1 after restarting mysql T266483 (duration: 00m 59s)
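
The log entries above follow a fixed order for a parsercache master: swap in the spare, restart, verify, swap back. The sketch below only spells out that order as text; restart_steps is a hypothetical helper and does not run any of the real deployment commands.

```python
def restart_steps(host, section, spare):
    """List the depool/restart/verify/repool steps for one parsercache host."""
    return [
        "depool {} from {} and pool {} in its place".format(host, section, spare),
        "restart mysql on {}".format(host),
        'verify: mysql -e "select @@report_host" returns the FQDN of {}'.format(host),
        "repool {} in {}".format(host, section),
    ]


# Values taken from the pc1 entries above: pc1010 is the spare used for pc1007.
for step in restart_steps("pc1007", "pc1", "pc1010"):
    print(step)
```

The same sequence, with pc1010 as the spare each time, is repeated for pc1008 (pc2) and pc1009 (pc3) further down.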

es1 eqiad:

  • es1012
  • es1016
  • es1018
  • es1027
  • es1029

es2 eqiad:

  • es1011
  • es1013
  • es1015
  • es1026
  • es1030

es3 eqiad:

  • es1014
  • es1017
  • es1019
  • es1028
  • es1031

@Kormat - es1018, es1015 and es1019 are done, but I cannot edit your comment, would you mark them as done? Thanks!

Change 641330 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool pc1008

https://gerrit.wikimedia.org/r/641330

Change 641330 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool pc1008

https://gerrit.wikimedia.org/r/641330

Mentioned in SAL (#wikimedia-operations) [2020-11-17T10:19:09Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool pc1008 and place pc1010 instead of it T266483 (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2020-11-17T10:19:17Z] <marostegui> Restart mysql on pc1008 T266483

pc2 eqiad done:

root@pc1008:~# mysql -e "select @@report_host"
+--------------------+
| @@report_host      |
+--------------------+
| pc1008.eqiad.wmnet |
+--------------------+

Mentioned in SAL (#wikimedia-operations) [2020-11-17T10:24:59Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool pc1008 in pc2 after restarting mysql T266483 (duration: 00m 56s)

Marostegui updated the task description. (Show Details)

Change 641629 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool pc1009

https://gerrit.wikimedia.org/r/641629

Change 641629 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool pc1009

https://gerrit.wikimedia.org/r/641629

Mentioned in SAL (#wikimedia-operations) [2020-11-18T11:56:10Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool pc1009 and place pc1010 instead of it T266483 (duration: 01m 18s)

Mentioned in SAL (#wikimedia-operations) [2020-11-18T11:56:23Z] <marostegui> Restart mysql on pc1009 T266483

Mentioned in SAL (#wikimedia-operations) [2020-11-18T12:00:44Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool pc1009 in pc3 after restarting mysql T266483 (duration: 01m 06s)

pc1009 (pc3) is done

root@pc1009:~# mysql -e "select @@report_host"
+--------------------+
| @@report_host      |
+--------------------+
| pc1009.eqiad.wmnet |
+--------------------+

es5 eqiad:

  • es1025
  • es1024
  • es1023