
Repurpose db1108 as generic Analytics db replica
Closed, ResolvedPublic

Description

In T231208 some issues were highlighted:

  • the analytics dbs (matomo, superset, oozie, etc.) are all running on single db hosts without any replication, and taking backups following SRE best practices leads to issues like excessive lock contention between applications and backup software.
  • the current backups of the Analytics databases diverged a lot from the best practices that SRE follows and might lead to inconsistent snapshots in some cases.

In T231858 some issues were highlighted:

  • the log database on db1107 is quite different from the one on db1108, so they can't really be exchanged without users noticing it in their query results.
  • the log database contains historical data that would be nice to keep available (read-only) for some more time before relying entirely on HDFS data. We already have the past 1.5 years of Eventlogging data on HDFS, and we are sunsetting the MySQL support.
  • moving the log database to one of the dbstore nodes would require a lot of engineering time and would probably not be the best solution in terms of availability and resource usage of the dbstore cluster.

Given the above points, I have a proposal for db1108:

  • after the deprecation of mysql-eventlogging, remove all eventlogging-related replication code.
  • repurpose it as a generic analytics database replica: keep the log database as it is, and replicate the matomo, superset, etc. databases from the Analytics db hosts (an-coord1001, matomo1001)
  • add mariadb-bacula backups configuration for db1108

Important note about the log database: the plan is to take a full snapshot of the db and archive it in HDFS before starting any procedure. It will be made clear that the log database will be available as read-only, with the caveat that any maintenance or host hardware event will require downtime. The Analytics team will maintain the log database on a best-effort basis, and this will be made clear to users.
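A minimal sketch of how that snapshot-and-archive step could look, assuming a logical dump via mysqldump plus the standard hdfs CLI (host names and the HDFS target path below are illustrative, not the agreed procedure):

# Hypothetical sketch: snapshot the log database and archive it to HDFS.
mysqldump --single-transaction --quick -h db1108.eqiad.wmnet log \
  | gzip > /srv/backups/log_db_snapshot_$(date +%F).sql.gz
hdfs dfs -mkdir -p /wmf/data/archive/eventlogging_mysql
hdfs dfs -put /srv/backups/log_db_snapshot_$(date +%F).sql.gz \
  /wmf/data/archive/eventlogging_mysql/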

How does the proposal sound?

Details

Repo                              Branch      Lines +/-
operations/dns                    master      +0 -2
operations/puppet                 production  +4 -0
operations/homer/public           master      +12 -1
operations/puppet                 production  +1 -1
operations/software/wmfmariadbpy  master      +1 -1
operations/puppet                 production  +13 -0
operations/software/wmfmariadbpy  master      +15 -0
operations/puppet                 production  +1 -1
operations/puppet                 production  +2 -2
operations/puppet                 production  +2 -2
operations/puppet                 production  +6 -6
operations/puppet                 production  +6 -0
operations/puppet                 production  +8 -1
operations/dns                    master      +2 -1
operations/puppet                 production  +7 -0
operations/puppet                 production  +3 -3
operations/puppet                 production  +5 -134
operations/puppet                 production  +2 -2
operations/puppet                 production  +2 -2
operations/puppet                 production  +3 -3
operations/puppet                 production  +9 -9
operations/puppet                 production  +79 -170
operations/dns                    master      +0 -1
operations/puppet                 production  +1 -75
operations/puppet                 production  +3 -5
operations/puppet                 production  +6 -1
operations/puppet                 production  +5 -7

Event Timeline

elukey changed the task status from Open to Stalled.Apr 21 2020, 1:30 PM

I had a chat with my team, and after some discussion we created T250709 to figure out how hard it would be to load all the data in the log database into Hive/HDFS. In case of a negative outcome we'll try to come up with an alternative plan for those big tables, like dropping them, etc.

Setting this task as stalled for the moment.

elukey changed the task status from Stalled to Open.Jun 16 2020, 7:15 AM

In T250709 Dan was able to pull all the data from db1108's log database into HDFS, and my team is currently vetting the data to double-check that everything looks good. If so, we'll be able to drop the log database from db1108 and start fresh.

@Marostegui in this case, should we just reimage the host to Buster as a first step (once my team is ok with the data on HDFS), delete all the puppet config and come up with a simple multi-instance setup for the Analytics databases to mirror (currently two, matomo and analytics-meta)?

That sounds good to me.
Reminder: when reimaging, use the echo partman/custom/db.cfg ;; \ recipe so everything is wiped

Change 607219 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] eventlogging: Change basedir depending on the OS

https://gerrit.wikimedia.org/r/607219

Change 607219 merged by Marostegui:
[operations/puppet@production] eventlogging: Change basedir depending on the OS

https://gerrit.wikimedia.org/r/607219

Change 607438 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Reimage db1108 to Debian Buster

https://gerrit.wikimedia.org/r/607438

Change 607438 merged by Elukey:
[operations/puppet@production] Reimage db1108 to Debian Buster

https://gerrit.wikimedia.org/r/607438

I was about to reimage and then I remembered the staging database. I checked and there are ~270 tables on the db, most of them probably not used anymore, but I see some RU config:

## Configuration for grabbing datasets on page creation events
## for all wikis, as well as some events related to ACTRIAL.
## Ref: https://phabricator.wikimedia.org/T170850

databases:
    dbstore:
        host: "analytics-slave.eqiad.wmnet"
        port: 3306
        creds_file: /etc/mysql/conf.d/stats-research-client.cnf
        db: log

The above file is in reportupdater-queries/page-creation/config.yaml. @mforns can we check together what RU jobs are still using the staging db on db1108 and possibly move them elsewhere? (either hive or the staging db on the dbstores).

EDIT: my bad, I just realized that the job uses the log database, so I guess this job has either been inactive or failing for a long time?

Change 607537 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::reportupdater::jobs: absent old RU job

https://gerrit.wikimedia.org/r/607537

Change 607537 merged by Elukey:
[operations/puppet@production] profile::reportupdater::jobs: absent old RU job

https://gerrit.wikimedia.org/r/607537

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['db1108.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006241616_elukey_28643.log.

Mentioned in SAL (#wikimedia-operations) [2020-06-24T16:17:28Z] <elukey> reimage db1108 to debian Buster - T234826

Completed auto-reimage of hosts:

['db1108.eqiad.wmnet']

and were ALL successful.

Change 607569 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Remove the analytics-slave CNAME

https://gerrit.wikimedia.org/r/607569

Change 607569 merged by Elukey:
[operations/dns@master] Remove the analytics-slave CNAME

https://gerrit.wikimedia.org/r/607569

Change 553742 merged by Elukey:
[operations/puppet@production] Introduce profile::mariadb::misc::analytics

https://gerrit.wikimedia.org/r/c/operations/puppet/+/553742

Change 609106 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Rename analytics meta mariadb instance for Backup host

https://gerrit.wikimedia.org/r/c/operations/puppet/+/609106

Change 609106 merged by Elukey:
[operations/puppet@production] Rename analytics meta mariadb instance for Backup host

https://gerrit.wikimedia.org/r/c/operations/puppet/+/609106

Change 609112 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::mariadb::misc::analytics::multiinstance: use underscore

https://gerrit.wikimedia.org/r/c/operations/puppet/+/609112

Change 609112 merged by Elukey:
[operations/puppet@production] profile::mariadb::misc::analytics::multiinstance: use underscore

https://gerrit.wikimedia.org/r/c/operations/puppet/+/609112

Change 609160 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Allow db1108 to replicate data from matomo and meta databases

https://gerrit.wikimedia.org/r/c/operations/puppet/+/609160

Change 609160 merged by Elukey:
[operations/puppet@production] Allow db1108 to replicate data from matomo and meta databases

https://gerrit.wikimedia.org/r/c/operations/puppet/+/609160

Change 609161 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Allow db1108's IPv6 address to replicate matomo/meta databases

https://gerrit.wikimedia.org/r/c/operations/puppet/+/609161

Change 609161 merged by Elukey:
[operations/puppet@production] Allow db1108's IPv6 address to replicate matomo/meta databases

https://gerrit.wikimedia.org/r/c/operations/puppet/+/609161

Status of the databases:

analytics-meta has the binlog enabled, with ROW format but no GTID. MariaDB version 10.1 (Stretch)
matomo doesn't have the binlog enabled, and no GTID. MariaDB version 10.4 (Buster)

Both servers will need server-id setup (that is why we set it up with the ipv4 integer by default on other hosts) and binlog enabled. matomo will need a restart to enable binlog. GTID is nice but not a blocker. We can try anyway (we will need an account and a set of grants for it) and see how it goes.

Change 609421 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] piwik: add binlog and server-id

https://gerrit.wikimedia.org/r/609421

Both servers will need server-id setup (that is why we set it up with the ipv4 integer by default on other hosts) and binlog enabled. matomo will need a restart to enable binlog. GTID is nice but not a blocker. We can try anyway (we will need an account and a set of grants for it) and see how it goes.

TIL, I was completely unaware of what mariadb::config is doing behind the scenes. Both matomo and meta have server-id set with the ipv4 integer, so the only thing missing is binlog for matomo/piwik. My bad, I was reading puppet and the my.cnf config in the wrong way!
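For context on the "ipv4 integer" convention mentioned above, a tiny illustration of how an IPv4 address can be packed into the 32-bit integer used as server-id (this is only a sketch of the idea, not the actual mariadb::config implementation, and the IP is made up):

# Illustrative only: derive a numeric server-id from an IPv4 address
# by packing its four octets into a single 32-bit integer.
ip="10.64.32.17"
IFS=. read -r a b c d <<< "$ip"
server_id=$(( (a << 24) | (b << 16) | (c << 8) | d ))
echo "server-id = ${server_id}"   # -> 171974673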

Change 609421 merged by Elukey:
[operations/puppet@production] piwik: add binlog to database config.

https://gerrit.wikimedia.org/r/609421

Status of the databases:

analytics-meta has the binlog enabled, with ROW format but no GTID. MariaDB version 10.1 (Stretch)
matomo doesn't have the binlog enabled, and no GTID. MariaDB version 10.4 (Buster)

From what I remember we agreed that db1108 will be Buster + MariaDB 10.4, right?
Just checking that those OS and MariaDB versions are the ones currently running on the active hosts, no?

Thank you!

Yes, correct. I was checking and reporting the current status to figure out with you folks whether there was any blocker I didn't know about. I just enabled ROW binlog for matomo, which was the missing piece; if the rest looks good then I think we are ready for the replicas :)
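For the record, double-checking those prerequisites on each master could be done with something like the following (a sketch; the plain mysql invocation and host names are assumptions):

# Hypothetical check: binlog enabled, ROW format, server-id set, GTID position.
for host in an-coord1001.eqiad.wmnet matomo1002.eqiad.wmnet; do
  echo "== ${host} =="
  mysql -h "$host" -e "SHOW GLOBAL VARIABLES LIKE 'log_bin';
                       SHOW GLOBAL VARIABLES LIKE 'binlog_format';
                       SHOW GLOBAL VARIABLES LIKE 'server_id';
                       SELECT @@gtid_current_pos;"
done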

Side note - in T257412 I am investigating a failover plan to write down in case an-coord1001 goes down (currently a big SPOF). The idea would be to be able to failover to db1108 (with the necessary consequences, like setting it read/write temporarily, rebuild the master afterwards, etc..) but if possible I'd like to avoid using dbproxies for the moment. One idea would be to have db1108's meta replica to listen on port 3306 (like an-coord1001) and then create a simple DNS cname like analytics-meta.eqiad.wmnet to either an-coord1001 (regular use case) or db1108 (when failing over). Having both on the same port would require less changes in puppet, and an easier failover process. Would it be possible to use 3306 on db1108 or is there a standard/convention that would be violated? (so better not and find a different solution).

Change 612864 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::mariadb::misc::analytics::multiinstance: move meta to 3306

https://gerrit.wikimedia.org/r/612864

Change 612864 abandoned by Elukey:
[operations/puppet@production] profile::mariadb::misc::analytics::multiinstance: move meta to 3306

Reason:
Not needed anymore.

https://gerrit.wikimedia.org/r/612864

Side note - in T257412 I am investigating a failover plan to write down in case an-coord1001 goes down (currently a big SPOF). The idea would be to be able to failover to db1108 (with the necessary consequences, like setting it read/write temporarily, rebuild the master afterwards, etc..) but if possible I'd like to avoid using dbproxies for the moment. One idea would be to have db1108's meta replica to listen on port 3306 (like an-coord1001) and then create a simple DNS cname like analytics-meta.eqiad.wmnet to either an-coord1001 (regular use case) or db1108 (when failing over). Having both on the same port would require less changes in puppet, and an easier failover process. Would it be possible to use 3306 on db1108 or is there a standard/convention that would be violated? (so better not and find a different solution).

After a chat with Andrew we decided to just use a centralized hostname:port combination in puppet since most of the services will need a restart anyway.

Side note - in T257412 I am investigating a failover plan to write down in case an-coord1001 goes down (currently a big SPOF). The idea would be to be able to failover to db1108 (with the necessary consequences, like setting it read/write temporarily, rebuild the master afterwards, etc..) but if possible I'd like to avoid using dbproxies for the moment. One idea would be to have db1108's meta replica to listen on port 3306 (like an-coord1001) and then create a simple DNS cname like analytics-meta.eqiad.wmnet to either an-coord1001 (regular use case) or db1108 (when failing over). Having both on the same port would require less changes in puppet, and an easier failover process. Would it be possible to use 3306 on db1108 or is there a standard/convention that would be violated? (so better not and find a different solution).

I thought the idea was to use multi-instance on db1108, in that case, you can only have a given instance running on 3306, the rest would need to run on different ports.
For multi-instance we normally do not use 3306 for any of the instances, but this is not a set-in-stone kind of rule, it is more for internal organization.

@Marostegui yep yep, it was only an idea to avoid changing the port in puppet when doing the failover (and just update the CNAME with the correct hostname), but after a chat with Andrew we decided that it would have been more cumbersome than helpful, so the idea ended up in /dev/null :D

Today I added the user repl (with a custom password saved in the private repo) to mariadb on matomo1002 and an-coord1001, the two master dbs. Also granted the following: GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.%';
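For completeness, the account creation plus grant on each master would look roughly like this (a sketch; the password placeholder stands in for the one stored in the private repo):

# Hypothetical sketch of the repl account setup described above,
# run on matomo1002 and an-coord1001.
sudo mysql -e "CREATE USER 'repl'@'10.%' IDENTIFIED BY '********';
               GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.%';"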

Change 613651 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::piwik::database: add TLS config for mariadb

https://gerrit.wikimedia.org/r/613651

Change 613651 merged by Elukey:
[operations/puppet@production] profile::piwik::database: add TLS config for mariadb

https://gerrit.wikimedia.org/r/613651

Ok so on Monday I'll restart mariadb on an-coord1001 and matomo1002 to pick up TLS changes. With the repl user created, in theory we should be able to set up replication.

IIUC the next steps should be something like:

  • dump the databases with mysqldump --single-transaction --master-data=2 -u $USER -p$PASSWORD -h $HOST $DATABASE > file.sql
  • get the binlog coordinates of the last transaction recorded in the dump, and then set up replication on db1108 for both instances

I am pretty sure there is something important I am not aware of; if so, please tell me :)
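A rough sketch of what those two steps could look like for one database of the analytics-meta instance (host names, the replica port and the binlog coordinates below are placeholders/assumptions; the real coordinates come from the --master-data comment in the dump):

# Hypothetical sketch of the dump-and-replicate steps above.
mysqldump --single-transaction --master-data=2 -h an-coord1001.eqiad.wmnet \
  --databases superset_production > superset_production.sql

# --master-data=2 records the coordinates as a comment near the top, e.g.:
# -- CHANGE MASTER TO MASTER_LOG_FILE='an-coord1001-bin.000123', MASTER_LOG_POS=45678;
grep -m1 'CHANGE MASTER TO' superset_production.sql

# Load the dump into the db1108 replica instance, then point it at the master
# using those coordinates and the repl user created earlier.
mysql -h db1108.eqiad.wmnet -P 3352 < superset_production.sql
mysql -h db1108.eqiad.wmnet -P 3352 -e "
  CHANGE MASTER TO MASTER_HOST='an-coord1001.eqiad.wmnet',
    MASTER_USER='repl', MASTER_PASSWORD='********',
    MASTER_LOG_FILE='an-coord1001-bin.000123', MASTER_LOG_POS=45678,
    MASTER_SSL=1;
  START SLAVE;"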

Change 614751 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Add PTR/AAAA records for db1108

https://gerrit.wikimedia.org/r/614751

Tested that the two dbs are capable of using TLS. Two minor things to review/merge (that are not blockers):

Today Manuel and I (well, mostly Manuel :P) set up the replication from matomo1002 and an-coord1001 to db1108, all good! I wrote https://wikitech.wikimedia.org/wiki/Analytics/Systems/DB_Replica to document the whole process. The next step is to set up Bacula backups!

Change 616452 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Monitor for mariadb backups of matomo&analytics_meta

https://gerrit.wikimedia.org/r/616452

Change 616453 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Setup db1108 as the source of backups for analytics dbs

https://gerrit.wikimedia.org/r/616453

Change 614751 merged by Elukey:
[operations/dns@master] Add PTR/AAAA records for db1108

https://gerrit.wikimedia.org/r/614751

Change 616452 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Monitor for mariadb backups of matomo&analytics_meta

https://gerrit.wikimedia.org/r/616452

Change 616453 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Setup db1108 as the source of backups for analytics dbs

https://gerrit.wikimedia.org/r/616453

There are two ongoing issues:

https://github.com/wikimedia/operations-dns/commit/29ff38c263c9f7b4fd366aeb1999ebc4f2d3d8a5 added the ipv6 IP of netflow1001 to db1108, and this is causing quite a lot of connection timeouts.

Also, I need a list of databases to back up on analytics_meta, as there are a few.

So the first test of the backups seems to indicate that they work, but we will need some tuning of the backup options and of the server to generate backups regularly.

Fixed the DNS issue with https://gerrit.wikimedia.org/r/c/operations/dns/+/616864, thanks a lot for spotting it! My bad :(

About the list of databases - if possible all the ones on analytics-meta by default, but if an explicit list is needed: druid druid_public_eqiad hue oozie search_airflow superset_production hive_metastore
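If an explicit list ends up being needed, restricting a logical dump to just those schemas is straightforward (a generic sketch, independent of the actual backup tooling; the host and plain mysqldump invocation are assumptions):

# Hypothetical sketch: dump only the analytics_meta schemas listed above.
mysqldump --single-transaction -h db1108.eqiad.wmnet --databases \
  druid druid_public_eqiad hue oozie search_airflow superset_production \
  hive_metastore > analytics_meta_subset.sql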

Thanks!

Defaulting to ipv6 is becoming a highly complex problem: while mysql has no issue with it, our mysql accounting is based on IP, not DNS (and so far it was assumed that a host only had one IP), and it adds an extra layer of refactoring to our authentication management that we hadn't had to deal with before.

I added the AAAA record since all the Analytics hosts have one, so replication from db1108 would fail/time out multiple times before dropping to IPv4. It works without AAAA records for db1108 as well, so I can remove them: they are not strictly needed, and a big refactoring is not needed either.

Change 617064 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Remove AAAA/PTR records for db1108

https://gerrit.wikimedia.org/r/617064

Change 617077 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Create ugly exception for port assignment for db1108

https://gerrit.wikimedia.org/r/617077

Change 617077 abandoned by Jcrespo:
[operations/puppet@production] mariadb: Create ugly exception for port assignment for db1108

Reason:
moving port instead

https://gerrit.wikimedia.org/r/617077

Change 617479 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::mariadb::misc::analytics::multiinstance: change ports

https://gerrit.wikimedia.org/r/617479

Change 617479 merged by Elukey:
[operations/puppet@production] profile::mariadb::misc::analytics::multiinstance: change ports

https://gerrit.wikimedia.org/r/617479

Change 617649 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/homer/public@master] analytics-in[46]: add new ports for term mysql-replica

https://gerrit.wikimedia.org/r/617649

Change 617077 restored by Jcrespo:
[operations/puppet@production] mariadb: Create ugly exception for port assignment for db1108

https://gerrit.wikimedia.org/r/617077

Change 617650 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Move db1108 (analytics db) backups' ports

https://gerrit.wikimedia.org/r/617650

Change 617077 merged by Jcrespo:
[operations/puppet@production] mariadb: Match port 3351 and 3352 to 2 analytics sections

https://gerrit.wikimedia.org/r/617077

Change 617650 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Move db1108 (analytics db) backups' ports

https://gerrit.wikimedia.org/r/617650

Change 617653 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Increase analytics binlog retention time to 14 days

https://gerrit.wikimedia.org/r/617653

Change 617653 merged by Jcrespo:
[operations/puppet@production] mariadb: Increase analytics binlog retention time to 14 days

https://gerrit.wikimedia.org/r/617653

Change 617656 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Increase misc db binlog retention to 14 days

https://gerrit.wikimedia.org/r/617656

Change 617656 merged by Jcrespo:
[operations/puppet@production] mariadb: Increase misc db binlog retention to 14 days

https://gerrit.wikimedia.org/r/617656

Change 617661 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] mariadb: Add port analytics assignment to wmfmariadbpy and backups

https://gerrit.wikimedia.org/r/617661

Change 617661 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] mariadb: Add port analytics assignment to wmfmariadbpy and backups

https://gerrit.wikimedia.org/r/617661

Change 617662 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Update backup automation to wmfmariadbpy's HEAD

https://gerrit.wikimedia.org/r/617662

Change 617662 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Update backup automation to wmfmariadbpy's HEAD

https://gerrit.wikimedia.org/r/617662

Change 617668 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] mariadb-backups: Add _ to the list of characters alowed for section names

https://gerrit.wikimedia.org/r/617668

Change 617668 merged by jenkins-bot:
[operations/software/wmfmariadbpy@master] mariadb-backups: Add _ to the list of characters alowed for section names

https://gerrit.wikimedia.org/r/617668

Change 617670 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Update backup_mariadb.py to HEAD

https://gerrit.wikimedia.org/r/617670

Change 617670 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Update backup_mariadb.py to HEAD

https://gerrit.wikimedia.org/r/617670

Change 617649 merged by Elukey:
[operations/homer/public@master] analytics-in[46]: add new ports for term mysql-replica

https://gerrit.wikimedia.org/r/617649

Change 617064 abandoned by Elukey:
[operations/dns@master] Remove AAAA/PTR records for db1108

Reason:

https://gerrit.wikimedia.org/r/617064

The db1108 host has been running fine for the past couple of months, correctly replicating analytics-meta and piwik data. There are some follow-ups to do, but this task can be considered done.

I can confirm backups have been flowing weekly as expected:

+------+------------------------------------------+---------------------+------------+----------+
| id   | name                                     | end_date            | total_size | status   |
+------+------------------------------------------+---------------------+------------+----------+
| 8347 | dump.analytics_meta.2020-10-20--02-08-31 | 2020-10-20 02:10:10 | 1902085735 | finished |
| 8346 | dump.matomo.2020-10-20--02-08-16         | 2020-10-20 02:08:31 |  257155353 | finished |
| 8244 | dump.analytics_meta.2020-10-13--02-08-23 | 2020-10-13 02:10:02 | 1923935772 | finished |
| 8243 | dump.matomo.2020-10-13--02-08-08         | 2020-10-13 02:08:23 |  254349238 | finished |
| 8135 | dump.analytics_meta.2020-10-06--02-08-27 | 2020-10-06 02:10:11 | 1955733786 | finished |
| 8134 | dump.matomo.2020-10-06--02-08-12         | 2020-10-06 02:08:27 |  247783361 | finished |
| 8029 | dump.analytics_meta.2020-09-29--02-07-47 | 2020-09-29 02:09:27 | 1933654247 | finished |
| 8028 | dump.matomo.2020-09-29--02-07-32         | 2020-09-29 02:07:47 |  251649181 | finished |
| 7904 | dump.analytics_meta.2020-09-22--02-08-26 | 2020-09-22 02:09:33 | 1172324325 | finished |
| 7903 | dump.matomo.2020-09-22--02-08-12         | 2020-09-22 02:08:26 |  243380912 | finished |
| 7789 | dump.analytics_meta.2020-09-15--01-13-59 | 2020-09-15 01:14:58 | 1147047998 | finished |
| 7788 | dump.matomo.2020-09-15--01-13-45         | 2020-09-15 01:13:59 |  237464288 | finished |
| 7686 | dump.analytics_meta.2020-09-08--02-15-57 | 2020-09-08 02:16:58 | 1145772141 | finished |
| 7685 | dump.matomo.2020-09-08--02-15-43         | 2020-09-08 02:15:57 |  232434960 | finished |
| 7560 | dump.analytics_meta.2020-09-01--02-07-39 | 2020-09-01 02:08:37 | 1103338608 | finished |
| 7559 | dump.matomo.2020-09-01--02-07-25         | 2020-09-01 02:07:39 |  242438080 | finished |
| 7460 | dump.analytics_meta.2020-08-25--15-56-09 | 2020-08-25 15:57:12 | 1102649799 | finished |
| 7459 | dump.matomo.2020-08-25--15-55-55         | 2020-08-25 15:56:09 |  240587129 | finished |
+------+------------------------------------------+---------------------+------------+----------+

We can do a demo recovery any time you want.
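For reference, a freshness check like the one above can be reproduced with a simple query against the backup metadata, assuming a table shaped like the output shown (the table name and where it lives are assumptions):

# Hypothetical query over the backup metadata shown above.
mysql -e "SELECT id, name, end_date, total_size, status
          FROM backups
          WHERE name LIKE 'dump.analytics_meta.%' OR name LIKE 'dump.matomo.%'
          ORDER BY end_date DESC
          LIMIT 20;"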

Everything there looks fine! There may be some procedures I could help you simplify so they can be done more easily; we can talk on a different medium at a later time to avoid spamming other people here.