
Test MariaDB 10.4 in production
Closed, Resolved (Public)

Description

db1107 has been running 10.4 for a few weeks (also replicating to a Percona 8 slave), and we have not seen any replication issues.

We should replay some production queries into it to make sure nothing shows up. If nothing does, let's reimage db1107 and place it somewhere in production with very low weight to see how it performs.

Details

Related Gerrit Patches:
operations/puppet : production | prometheus-mysqld-exporter: Workaround upstream package regression
operations/puppet : production | prometheus-mysqld-exporter: Fix options for multiinstance hosts
operations/puppet : production | mariadb: Move db1114 to s8
operations/puppet : production | db2086: Disable notifications
operations/puppet : production | db1107: Enable notifications
operations/software : master | control-mariadb-*: Change version
operations/puppet : production | mariadb: Place db1107 in s1

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.

Mentioned in SAL (#wikimedia-operations) [2020-02-10T08:44:47Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Increase weight from 1 to 5 for db1107 - T242702', diff saved to https://phabricator.wikimedia.org/P10364 and previous config saved to /var/cache/conftool/dbconfig/20200210-084446-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-10T15:45:52Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1107 after first day of 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10366 and previous config saved to /var/cache/conftool/dbconfig/20200210-154552-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-11T07:07:21Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db1107 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10371 and previous config saved to /var/cache/conftool/dbconfig/20200211-070720-marostegui.json

Change 571447 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2086: Disable notifications

https://gerrit.wikimedia.org/r/571447

Change 571447 merged by Marostegui:
[operations/puppet@production] db2086: Disable notifications

https://gerrit.wikimedia.org/r/571447

Mentioned in SAL (#wikimedia-operations) [2020-02-11T08:13:20Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Increase weight from 5 to 10 for db1107 - T242702', diff saved to https://phabricator.wikimedia.org/P10377 and previous config saved to /var/cache/conftool/dbconfig/20200211-081319-marostegui.json

Given that yesterday the host responded well with weight 1 (0.06% of traffic) and weight 5 (0.3%), I have pooled it with weight 10 for today (0.6%).

Mentioned in SAL (#wikimedia-operations) [2020-02-11T13:03:44Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1107 after 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10382 and previous config saved to /var/cache/conftool/dbconfig/20200211-130343-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-12T07:02:50Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db1107 with weight 20 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10391 and previous config saved to /var/cache/conftool/dbconfig/20200212-070250-marostegui.json

I have started today with weight 20, instead of the weight 11 it had yesterday.

Mentioned in SAL (#wikimedia-operations) [2020-02-12T13:55:14Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1107 after 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10393 and previous config saved to /var/cache/conftool/dbconfig/20200212-135514-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-13T07:28:44Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db1107 with weight 50 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10402 and previous config saved to /var/cache/conftool/dbconfig/20200213-072839-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-13T08:59:57Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Increase weight for db1107 50 -> 100 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10403 and previous config saved to /var/cache/conftool/dbconfig/20200213-085957-marostegui.json

I have given this host more weight: it is now serving with weight 100, which is 6% of enwiki main traffic.
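The weight-to-percentage figures quoted throughout these updates follow from how weighted load balancing works: a replica receives weight divided by the sum of the section's weights. A minimal sketch (the helper name is mine; the s1 main-traffic weights are copied from the dbctl diff shown later in this task):

```python
def traffic_share(weights: dict, host: str) -> float:
    """Percentage of section queries routed to `host`, by relative weight."""
    return 100 * weights[host] / sum(weights.values())

# s1 main-traffic weights as shown in the dbctl diff in this task (total 1600)
s1_main = {
    "db1080": 200, "db1089": 250, "db1099:3311": 50, "db1105:3311": 50,
    "db1106": 50, "db1107": 100, "db1118": 500, "db1119": 200, "db1134": 200,
}
print(round(traffic_share(s1_main, "db1107"), 2))  # → 6.25
```

With weight 100 out of a total of 1600, db1107 takes about 6% of s1 main traffic, and the earlier weights of 1, 5 and 10 similarly work out to roughly 0.06%, 0.3% and 0.6%.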

Mentioned in SAL (#wikimedia-operations) [2020-02-13T14:27:35Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1107 after 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10405 and previous config saved to /var/cache/conftool/dbconfig/20200213-142735-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-14T08:06:00Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db1107 with weight 100 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10408 and previous config saved to /var/cache/conftool/dbconfig/20200214-080600-marostegui.json

Next week I am going to start combining main traffic and API traffic, to capture some live API queries (even though they have already been replayed out of band with no problems).

Mentioned in SAL (#wikimedia-operations) [2020-02-14T09:34:56Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1107 after 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10409 and previous config saved to /var/cache/conftool/dbconfig/20200214-093456-marostegui.json

Pooled db1107 with weight 100 on main and weight 10 on API for the first time:

--- eqiad/groupLoadsBySection live
+++ eqiad/groupLoadsBySection generated
@@ -18,20 +18,21 @@
         "vslow": {
             "db1078": 100
         },
         "watchlist": {
             "db1112": 100
         }
     },
     "s1": {
         "api": {
             "db1080": 100,
+            "db1107": 10,
             "db1119": 100,
             "db1134": 100
         },
         "contributions": {
             "db1089": 100,
             "db1099:3311": 100,
             "db1105:3311": 100
         },
         "dump": {
             "db1106": 100
--- eqiad/sectionLoads live
+++ eqiad/sectionLoads generated
@@ -12,20 +12,21 @@
     "s1": [
         {
             "db1083": 0
         },
         {
             "db1080": 200,
             "db1089": 250,
             "db1099:3311": 50,
             "db1105:3311": 50,
             "db1106": 50,
+            "db1107": 100,
             "db1118": 500,
             "db1119": 200,
             "db1134": 200
         }
     ],
     "s10": [
         {
             "db1133": 0
         },
         {}
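The pooling change in that diff can be sketched as a pure function over the config: add the host to the section's main loads and to each requested query group. This is an illustrative simplification, not the real dbctl code, and it flattens sectionLoads to a plain dict (the real structure, as the diff shows, is a list holding the master entry and the replica map):

```python
import copy

def pool_host(config: dict, section: str, host: str,
              main_weight: int, group_weights: dict) -> dict:
    """Return a new config with `host` pooled into a section's main and group loads."""
    new = copy.deepcopy(config)  # leave the "live" config untouched
    new["sectionLoads"][section][host] = main_weight
    for group, weight in group_weights.items():
        new["groupLoadsBySection"][section][group][host] = weight
    return new

# Trimmed-down config shaped after the diff above (values shortened for illustration)
cfg = {
    "sectionLoads": {"s1": {"db1080": 200, "db1089": 250}},
    "groupLoadsBySection": {"s1": {"api": {"db1080": 100, "db1119": 100}}},
}
cfg2 = pool_host(cfg, "s1", "db1107", 100, {"api": 10})
print(cfg2["sectionLoads"]["s1"]["db1107"])                # → 100
print(cfg2["groupLoadsBySection"]["s1"]["api"]["db1107"])  # → 10
```

The deep copy mirrors why dbctl can show a "live" vs "generated" diff before committing: the proposed config is built alongside the current one rather than mutated in place.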

Mentioned in SAL (#wikimedia-operations) [2020-02-17T10:22:18Z] <marostegui@cumin1001> dbctl commit (dc=all): ' db1107 increase API weight from 10 to 15 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10420 and previous config saved to /var/cache/conftool/dbconfig/20200217-102218-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-17T14:31:47Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1107 after 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10422 and previous config saved to /var/cache/conftool/dbconfig/20200217-143146-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-18T06:25:00Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db1107 with weight 100 and weight 10 in API for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10438 and previous config saved to /var/cache/conftool/dbconfig/20200218-062459-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-18T06:38:19Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Increase weight for db1107 100 -> 200 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10439 and previous config saved to /var/cache/conftool/dbconfig/20200218-063819-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-18T10:49:58Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Increase API weight for db1107 15 -> 25 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10445 and previous config saved to /var/cache/conftool/dbconfig/20200218-104958-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-18T13:55:26Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Increase API weight for db1107 25 -> 50 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10448 and previous config saved to /var/cache/conftool/dbconfig/20200218-135525-marostegui.json

Hello,

An update on 10.4 testing:
Everything has been running fine for the last week and we haven't found any major regressions, so I am going to leave db1107 serving traffic 24h during the week; it won't serve traffic during weekends.
Right now it has similar weights to other 10.1 hosts serving s1 (enwiki).

Currently it is serving:
14% of API traffic
11% of main traffic.

Next week I will also include it in the special slaves traffic groups (logpager, recentchanges, watchlist, contributions) to capture and analyze those queries.

If for any reason you believe this host is causing issues, depool it from either cumin1001 or cumin2001:
dbctl instance db1107 depool && dbctl config commit -m "Emergency depool db1107 - T242702"

Mentioned in SAL (#wikimedia-operations) [2020-02-19T06:57:27Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Increase API weight for db1107 50 -> 100 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10454 and previous config saved to /var/cache/conftool/dbconfig/20200219-065726-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-21T08:54:06Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1107 after 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10473 and previous config saved to /var/cache/conftool/dbconfig/20200221-085405-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-24T07:03:37Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db1107 for 10.4 testing in main and API - T242702', diff saved to https://phabricator.wikimedia.org/P10487 and previous config saved to /var/cache/conftool/dbconfig/20200224-070337-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-24T07:12:02Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Pool db1107 for 10.4 testing in special slaves group with weight 10 - T242702', diff saved to https://phabricator.wikimedia.org/P10488 and previous config saved to /var/cache/conftool/dbconfig/20200224-071201-marostegui.json

I have pooled db1107 into the special slave groups (recentchanges, contributions, watchlist, logpager) with just 2.4% of the traffic on each group.

Mentioned in SAL (#wikimedia-operations) [2020-02-25T06:57:42Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1107 to analyze recentchanges table - T242702', diff saved to https://phabricator.wikimedia.org/P10508 and previous config saved to /var/cache/conftool/dbconfig/20200225-065741-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-25T07:53:04Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db1107 for 10.4 testing in main API and special groups - T242702', diff saved to https://phabricator.wikimedia.org/P10510 and previous config saved to /var/cache/conftool/dbconfig/20200225-075304-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-02-25T12:32:23Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Increase traffic on db1107 for 10.4 on special groups 10 -> 50 - T242702', diff saved to https://phabricator.wikimedia.org/P10511 and previous config saved to /var/cache/conftool/dbconfig/20200225-123222-marostegui.json

jcrespo updated the task description. Feb 25 2020, 1:26 PM
jcrespo updated the task description. Feb 25 2020, 1:38 PM

I have left a heartbeat running on db1107. It should have no problems, but let's give it a few days to make sure it doesn't die or crash.

root     25652  0.0  0.0  34892 17016 ?        Ss   13:41   0:00 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=s1 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
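The idea behind that heartbeat process (assumed mechanics, sketched here rather than the actual tool): the master REPLACEs a timestamp row every --interval seconds (1 s above), and a replica's lag is its own clock minus the last timestamp that replicated over. The timestamps below are invented for illustration:

```python
from datetime import datetime

def replication_lag(heartbeat_ts: datetime, now: datetime) -> float:
    """Seconds of replication delay implied by the last replicated heartbeat row."""
    return (now - heartbeat_ts).total_seconds()

# Invented example: the replica's clock vs the newest heartbeat row it has applied
now = datetime(2020, 2, 25, 13, 41, 10)
ts = datetime(2020, 2, 25, 13, 41, 7)
print(replication_lag(ts, now))  # → 3.0
```

Measuring lag this way works regardless of binlog position, which is why the heartbeat is written with binlog_format=STATEMENT and tagged with shard and datacenter.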

On the other hand, I have also manually installed wmf-pt-kill 2.2.20-1+wmf5 and changed the service unit to hardcode some values (as they are supposed to be populated from hiera), and it also works fine. This package is only used on labs, and it works out of the box now, so we should be fine if we want to migrate it properly to buster.
We are good on that front too.

jcrespo updated the task description. Feb 27 2020, 10:21 AM
Marostegui updated the task description. Feb 28 2020, 6:50 AM
Marostegui added a subscriber: jcrespo.

@jcrespo what do you believe could be broken within the prometheus exporter?


I don't know exactly what, but I can see that Grafana (Prometheus, really) reports mysql exporter collections as failing: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1107&var-port=9104 (the other failure is the Percona/MySQL 8 node, which is expected and out of scope here). Looking at the graph, several metrics are missing. I assume we either used the same exporter version as before or upgraded blindly to the version available in Buster, so we need to check whether this is a pure orchestration/configuration issue or whether the package needs a change/upgrade.

I added it as a check to try to have it ready before further rollout, without further research (yet).


This might be https://phabricator.wikimedia.org/T244696

While I agree, note that even after the pending Grafana fixes, the reported error comes from prometheus collection, not just from missing data points.

Should be fixed now: PEBKAC :)

jcrespo updated the task description. Feb 28 2020, 8:58 AM

Change 575482 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Move db1114 to s8

https://gerrit.wikimedia.org/r/575482

Change 575482 merged by Marostegui:
[operations/puppet@production] mariadb: Move db1114 to s8

https://gerrit.wikimedia.org/r/575482

db1107 has been performing fine during the whole week, so I am not removing it from production and will leave it serving traffic during the weekend.

Marostegui closed this task as Resolved. Mon, Mar 2, 6:19 AM

10.4 has been tested in production on s1 for 3 weeks. The pending items already have their own tracking tasks, and the task for placing one 10.4 host per section will be created now.
Closing this task.

Change 576368 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] prometheus-mysqld-exporter: Fix options for multiinstance hosts

https://gerrit.wikimedia.org/r/576368

Change 576368 merged by Jcrespo:
[operations/puppet@production] prometheus-mysqld-exporter: Fix options for multiinstance hosts

https://gerrit.wikimedia.org/r/576368

Change 576398 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] prometheus-mysqld-exporter: Workaround upstream package regression

https://gerrit.wikimedia.org/r/576398

Marostegui updated the task description. Thu, Mar 5, 7:35 AM
Marostegui updated the task description.

Change 576398 merged by Jcrespo:
[operations/puppet@production] prometheus-mysqld-exporter: Workaround upstream package regression

https://gerrit.wikimedia.org/r/576398