Page MenuHomePhabricator

Migrate dbstore1002 to a multi instance setup on dbstore100[3-5]
Closed, ResolvedPublic

Description

T209620 shows that dbstore100[3-5] are ready to be used in production, so we should start planning the move from dbstore1002 to the new multi instance setup as soon as possible. Two important things needs to be kept in mind:

  • the new set of hosts will probably be used a lot by the Analytics team so a shared knowledge of procedures/setup/etc.. would be really great, so we'll try to bother the Data Persistence team as few as possible :)
  • Ubuntu Trusty 14.04.5 is going EOL In April 2019 (https://wiki.ubuntu.com/Releases) so we have a hard deadline to decommission dbstore1002.

I would use this task to make a high level plan (and possibly subtasks) about the steps needed to do this work.

A very coarse grained list of things to do:

  • review data in dbstore1002 and come up with a multi-instance setup scheme for dbstore100[3-5] T212487

Distribution:

hostsectionssize
dbstore1003s1, s5, s72.1T
dbstore1004s2, s3, s42,5T
dbstore1005s8, s6,x12,3T

Where to place the existing staging database: dbstore1005:3350 (decided as of 11th Feb)

  • review special settings that are not puppetized, and then come up with a basic setup for dbstore100[3-5] (this is a goal for Analytics in this quarter)
  • review strategies to allow multi-user set up on dbstore100[3-5]. The Analytics team would really love to decommission the research user account and introduce a per-user account scheme, in order to be able to track down query patterns and in the future (hopefully) migrate all of them to Hadoop. This is absolutely a stretch goal and can be done afterwards if too complicated/long. To be followed up at: {T214469}
  • When the new instances are ready, start the data replication.
  • Fully migrate users to the new hosts T215589: Migrate users to dbstore100[3-5]
  • Create a decommission task for dbstore1002 T216491: Decommission dbstore1002
    • Clean up mariadb::dbstore (nothing else should be using it apart from dbstore1002)

Instances to be set up:

  • dbstore1003
    • s1
    • s5
    • s7
  • dbstore1004
    • s2
    • s3
    • s4
  • dbstore1005
    • s6
    • s8
    • x1
    • staging (port 3350)

Details

Related Gerrit Patches:
operations/puppet : productionanalytics-grants.sql: Remove file
operations/puppet : productiondbstore_multiinstance.pp: Specify read-only for staging
operations/puppet : productionmariadb: Make staging not read-only
operations/puppet : productiondbstore1002: Set the host to read-only
operations/puppet : productiondbstore1005: Enable notifications
operations/puppet : productionprofile::mariadb::dbstore_multiinstance: add a note for Analytics
operations/puppet : productiondbstore1003: Enable notifications
operations/puppet : productionmariadb: Change staging DB to 3350
operations/dns : masterwmnet: Change staging db port
operations/dns : masterwmnet: Move staging db from dbstore1003 to dbstore1005
operations/puppet : productiondbstore analytics: Move staging database
operations/puppet : productiondbstore1004: Enable notifications
operations/dns : masterAdd staging-db-analytics.eqiad.wmnet CNAME to dbstore1003
operations/puppet : productiondbstore1003: Increase number mysql of instances
operations/puppet : productioncheck_mariadb.py: Add staging port
operations/dns : masterRename sX-analytics-slave to replicas and update dbstore CNAME targets
operations/puppet : productionmariadb: Rename sdb instance to staging
operations/puppet : productionmariadb: Provision dbstore1005
operations/puppet : productiondbstore_multiinstance: Add staging db
operations/puppet : productionmariadb: Fix prometheus config dbstore1004
operations/puppet : productionmariadb: Provision dbstore1004
operations/mediawiki-config : masterdb-eqiad.php: Depool db1090:3317
operations/mediawiki-config : masterdb-eqiad.php: Depool db1089
operations/mediawiki-config : masterdb-eqiad.php: Depool db1113:3315
operations/puppet : productiondbstore1003: Replace s6 with s5
operations/puppet : productionsite.pp: Convert dbstore1003 to multiinstance
operations/puppet : productionmariadb: dbstore_multi - dbstore1003-dbstore1005

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Marostegui updated the task description. (Show Details)Jan 25 2019, 9:50 AM
Marostegui updated the task description. (Show Details)

Change 487339 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Rename sdb instance to staging

https://gerrit.wikimedia.org/r/487339

Change 487339 merged by Marostegui:
[operations/puppet@production] mariadb: Rename sdb instance to staging

https://gerrit.wikimedia.org/r/487339

Mentioned in SAL (#wikimedia-operations) [2019-02-05T07:13:11Z] <marostegui> Taking mysqldump from dbstore1002.staging - T210478

Change 488004 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Rename sX-analytics-slave to replicas and update dbstore CNAME targets

https://gerrit.wikimedia.org/r/488004

@elukey from what I can see, the research user is used to access wikis, not only for staging. So I guess we need to create it on all the instances and not only where staging lives.
Example: T200801#4570548

The wikiuser isn't really used, so that one can be dropped from the sX and x1 instances.

elukey added a comment.Feb 5 2019, 9:56 AM

Yes exactly, basically people only use research to consult wiki replicas and to store things in staging, sorry if I didn't make clear before!

Change 488004 merged by Elukey:
[operations/dns@master] Rename sX-analytics-slave to replicas and update dbstore CNAME targets

https://gerrit.wikimedia.org/r/488004

Mentioned in SAL (#wikimedia-operations) [2019-02-05T14:05:07Z] <marostegui> Delete non used grants from dbstore1002: log, warehouse,project_illustration, cognate\_wiktionary, datasets - T212487 T210478

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.Feb 5 2019, 5:36 PM
Marostegui updated the task description. (Show Details)Feb 6 2019, 6:43 AM

Mentioned in SAL (#wikimedia-operations) [2019-02-06T09:33:04Z] <marostegui> Remove wikiuser from dbstore1003-dbstore1005 T210478

@elukey after merging: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/487000/ I have done the following:

  • Create the research role on all dbstore hosts on all the instances
  • Created the research user on all dbstore hosts on all the instances (to be deprecated: T214469)
  • Assigned the research role to the research user on all dbstore hosts on all the instances
  • Dropped the wikiuser from all the dbstore hosts on all the instances as that is not to be used T210478#4926775

Could you please test (or get someone) that they can connect fine and do some usual reads on the wikis?
Keep in mind that the staging instance is not yet available, that is why it wasn't included.
Host and ports to test:

s1: dbstore1003:3311
s2: dbstore1003:3315
s3: dbstore1003:3317
s4: dbstore1004:3312
s5: dbstore1004:3313
s6: dbstore1004:3314
s7: dbstore1005:3316
s8: dbstore1005:3318
x1: dbstore1005:3320

I want to make sure, that grants wise we are good to go.

Thank you!

elukey added a comment.Feb 6 2019, 1:49 PM

First of all, I just realized that all these IPs ports need to be whitelisted on the Analytics VLAN's firewall to make the connections happening!

Then also make sure to whitelist dbstore1003:3340 as that is where the staging database will leave.

elukey added a comment.Feb 6 2019, 2:19 PM

analytics-in4 diff:

+      term mysql-dbstore {
+          from {
+              destination-address {
+                  /* dbstore1003 */
+                  10.64.0.137/32;
+                  /* dbstore1004 */
+                  10.64.16.26/32;
+                  /* dbstore1005 */
+                  10.64.32.30/32;
+              }
+              protocol tcp;
+              destination-port [ 3311 3315 3317 3312 3313 3314 3316 3318 3320 3340 ];
+          }
+          then accept;
+      }

analytics-in4 diff:

+      term mysql-dbstore {
+          from {
+              destination-address {
+                  /* dbstore1003 */
+                  10.64.0.137/32;
+                  /* dbstore1004 */
+                  10.64.16.26/32;
+                  /* dbstore1005 */
+                  10.64.32.30/32;
+              }
+              protocol tcp;
+              destination-port [ 3311 3315 3317 3312 3313 3314 3316 3318 3320 3340 ];
+          }
+          then accept;
+      }

Looks good!

elukey added a comment.Feb 6 2019, 2:26 PM

After a first quick test it looks good:

elukey@stat1007:~$ mysql -e 'show databases' -P 3312 -u research -p -h s2-analytics-replica.eqiad.wmnet
Enter password:
+--------------------+
| Database           |
+--------------------+
| bgwiki             |
| bgwiktionary       |
| cswiki             |
| enwikiquote        |
| enwiktionary       |
| eowiki             |
| fiwiki             |
| idwiki             |
| information_schema |
| itwiki             |
| nlwiki             |
| nowiki             |
| plwiki             |
| ptwiki             |
| svwiki             |
| thwiki             |
| trwiki             |
| zhwiki             |
+--------------------+

Mentioned in SAL (#wikimedia-operations) [2019-02-06T14:29:49Z] <elukey> add term mysql-dbstore to analytics-in4/6 on cr1/2-eqiad to allow tcp connections to dbstore100[3-5] - T210478

Change 488451 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] check_mariadb.py: Add staging port

https://gerrit.wikimedia.org/r/488451

Change 488454 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbstore1003: Increase numbre of instances

https://gerrit.wikimedia.org/r/488454

Change 488451 merged by Marostegui:
[operations/puppet@production] check_mariadb.py: Add staging port

https://gerrit.wikimedia.org/r/488451

Did a quick test, it's working for me (one of them) :)
Thanks a milion @Marostegui and !@elukey

Change 488535 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Add staging-db-analytics.eqiad.wmnet CNAME to dbstore1003

https://gerrit.wikimedia.org/r/488535

Change 488454 merged by Marostegui:
[operations/puppet@production] dbstore1003: Increase number mysql of instances

https://gerrit.wikimedia.org/r/488454

Change 488535 merged by Elukey:
[operations/dns@master] Add staging-db-analytics.eqiad.wmnet CNAME to dbstore1003

https://gerrit.wikimedia.org/r/488535

Marostegui updated the task description. (Show Details)Feb 8 2019, 6:35 AM

Mentioned in SAL (#wikimedia-operations) [2019-02-08T06:54:42Z] <marostegui> Take a mysqldump from staging on dbstore1003 from dbstore1002 - T210478

Marostegui updated the task description. (Show Details)Feb 8 2019, 11:03 AM
Marostegui updated the task description. (Show Details)

I have disabled notifications for lag checks (only for lag ones) for dbstore1002, as they are very noisy. They will show up on icinga, but not on IRC.

Mentioned in SAL (#wikimedia-operations) [2019-02-10T09:25:54Z] <marostegui> Disable notifications for lag checks on dbstore1002 - T210478

Now that we know that the biggest and and most painful table (as it was Aria and it was huge - around 180GB) was killed T215450: Sqoop staging.mep_word_persistence to HDFS and drop the table from dbstore1002
I have tested a migration process for the staging database, to see how much we'd need to have it on read only.

Read only required:

  • Taking the backup from dbstore1002: 534m34.090s
  • Importing the data on dbstore1003:3340: 424m11.932s
  • Total read only required for the migration: Around 16 hours.

Additional step (with not read-only neeed)

  • Compress InnoDB on all the staging tables: 09:30h

Change 489622 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbstore1004: Enable notifications

https://gerrit.wikimedia.org/r/489622

Change 489622 merged by Marostegui:
[operations/puppet@production] dbstore1004: Enable notifications

https://gerrit.wikimedia.org/r/489622

After having al the sections ready and compressed on all hosts, there is one thought I had, where to leave staging database, either dbstore1003 or dbstore1005.

This is the current situation:

dbstore1003: s1, s5, s7, staging

root@dbstore1003:/srv# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.7T  1.8T  61% /srv
root@dbstore1003:/srv# du -sh *
926G	sqldata.s1
695G	sqldata.s5
914G	sqldata.s7
144G	sqldata.staging

dbstore1005: s6, s8, x1

root@dbstore1005:/srv# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  1.7T  2.7T  39% /srv
root@dbstore1005:/srv# du -sh *
479G	sqldata.s6
1.1T	sqldata.s8
172G	sqldata.x1

I think it would make sense to move staging to dbstore1005

Change 489633 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbstore analytics: Move staging database

https://gerrit.wikimedia.org/r/489633

Change 489633 merged by Marostegui:
[operations/puppet@production] dbstore analytics: Move staging database

https://gerrit.wikimedia.org/r/489633

Change 489636 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Move staging db from dbstore1003 to dbstore1005

https://gerrit.wikimedia.org/r/489636

Change 489636 merged by Elukey:
[operations/dns@master] wmnet: Move staging db from dbstore1003 to dbstore1005

https://gerrit.wikimedia.org/r/489636

Change 489644 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Change staging DB to 3350

https://gerrit.wikimedia.org/r/489644

Change 489645 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Change staging db port

https://gerrit.wikimedia.org/r/489645

Change 489645 merged by Marostegui:
[operations/dns@master] wmnet: Change staging db port

https://gerrit.wikimedia.org/r/489645

Change 489644 merged by Marostegui:
[operations/puppet@production] mariadb: Change staging DB to 3350

https://gerrit.wikimedia.org/r/489644

Marostegui updated the task description. (Show Details)Feb 11 2019, 10:12 AM
Marostegui updated the task description. (Show Details)
Marostegui updated the task description. (Show Details)

After having al the sections ready and compressed on all hosts, there is one thought I had, where to leave staging database, either dbstore1003 or dbstore1005.
This is the current situation:
dbstore1003: s1, s5, s7, staging

root@dbstore1003:/srv# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.7T  1.8T  61% /srv
root@dbstore1003:/srv# du -sh *
926G	sqldata.s1
695G	sqldata.s5
914G	sqldata.s7
144G	sqldata.staging

dbstore1005: s6, s8, x1

root@dbstore1005:/srv# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  1.7T  2.7T  39% /srv
root@dbstore1005:/srv# du -sh *
479G	sqldata.s6
1.1T	sqldata.s8
172G	sqldata.x1

I think it would make sense to move staging to dbstore1005

Moved to dbstore1005:3350

Mentioned in SAL (#wikimedia-operations) [2019-02-11T10:19:12Z] <marostegui> Add dbstore1005:3350 to tendril and zarcillo - T210478

Mentioned in SAL (#wikimedia-operations) [2019-02-11T14:21:09Z] <marostegui> Remove staging from dbstore1003 - T210478

Marostegui updated the task description. (Show Details)
Marostegui updated the task description. (Show Details)
Marostegui updated the task description. (Show Details)Feb 11 2019, 4:45 PM
Marostegui added a subtask: Restricted Task.
Marostegui updated the task description. (Show Details)Feb 11 2019, 4:56 PM

Change 489972 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbstore1003: Enable notifications

https://gerrit.wikimedia.org/r/489972

Change 489972 merged by Marostegui:
[operations/puppet@production] dbstore1003: Enable notifications

https://gerrit.wikimedia.org/r/489972

elukey added a subscriber: ayounsi.Feb 12 2019, 7:28 AM

@ayounsi modified a bit analytics-in4 term mysql-dbstore:

[edit firewall family inet filter analytics-in4 term mysql-dbstore from]
-       destination-address {
-           10.64.0.137/32;
-           10.64.16.26/32;
-           10.64.32.30/32;
-       }
-       protocol tcp;
-       destination-port [ 3311 3315 3317 3312 3313 3314 3316 3318 3320 3340 ];
+       destination-address {
+           /* dbstore1003 */
+           10.64.0.137/32;
+           /* dbstore1004 */
+           10.64.16.26/32;
+           /* dbstore1005 */
+           10.64.32.30/32;
+       }
+       protocol tcp;
+       destination-port [ 3311 3315 3317 3312 3313 3314 3316 3318 3320 3350 ];

This basically replaces port 3340 with 3350 :)

Change 489983 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::mariadb::dbstore_multiinstance: add a note for Analytics

https://gerrit.wikimedia.org/r/489983

Change 489983 merged by Elukey:
[operations/puppet@production] profile::mariadb::dbstore_multiinstance: add a note for Analytics

https://gerrit.wikimedia.org/r/489983

Change 490068 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbstore1005: Enable notifications

https://gerrit.wikimedia.org/r/490068

Change 490068 merged by Marostegui:
[operations/puppet@production] dbstore1005: Enable notifications

https://gerrit.wikimedia.org/r/490068

Marostegui closed subtask Restricted Task as Resolved.Feb 14 2019, 10:57 AM

Mentioned in SAL (#wikimedia-operations) [2019-02-18T05:52:22Z] <marostegui> Set dbstore1002 on read only to start the migration T210478 T215589

Change 491195 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbstore1002: Set the host to read-only

https://gerrit.wikimedia.org/r/491195

Change 491195 merged by Marostegui:
[operations/puppet@production] dbstore1002: Set the host to read-only

https://gerrit.wikimedia.org/r/491195

The migration finished. These are the times in UTC from 18th Feb 2019:

  • Read only on dbstore1002: 05:53
  • mysqldump start: 05:54:44
  • mysqldump finished: 15:18:15
  • data import in dbstore1005 start: 15:18:15
  • data import dbstore1005 finished: 21:43:54
  • read only OFF on dbstore1005: 21:43:54
Marostegui updated the task description. (Show Details)Feb 19 2019, 6:10 AM

Change 491408 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbstore_multiinstance.pp: Specify read-only for staging

https://gerrit.wikimedia.org/r/491408

Change 491713 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Make staging not read-only

https://gerrit.wikimedia.org/r/491713

Change 491713 merged by Marostegui:
[operations/puppet@production] mariadb: Make staging not read-only

https://gerrit.wikimedia.org/r/491713

Change 491408 abandoned by Marostegui:
dbstore_multiinstance.pp: Specify read-only for staging

Reason:
Done via 491713

https://gerrit.wikimedia.org/r/491408

Marostegui updated the task description. (Show Details)Feb 22 2019, 6:56 AM

Change 492275 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] analytics-grants.sql: Remove file

https://gerrit.wikimedia.org/r/492275

Change 492275 merged by Marostegui:
[operations/puppet@production] analytics-grants.sql: Remove file

https://gerrit.wikimedia.org/r/492275

Marostegui removed a subtask: Restricted Task.Feb 22 2019, 11:54 AM

Mentioned in SAL (#wikimedia-operations) [2019-03-04T06:46:20Z] <marostegui> Stop MySQL on dbstore1002 for decommission T210478 T172410 T216491 T215589

MySQL has been stopped on dbstore1002 and won't be started again, as this host will be decommissioned

Marostegui closed this task as Resolved.Mar 6 2019, 8:38 AM

I am going to close this as resolved. Everything is essentially already done as we have almost decommissioned dbstore1002.
The pending tasks they already have tickets where they can be followed up on.
This one is still pending and quite important {T214469}

Marostegui updated the task description. (Show Details)Mar 6 2019, 8:38 AM