Decommission old dbstore hosts (db1046, db1047)
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Ottomata
	Jan 31 2017, 8:01 PM

Description

2 analytics MySQL databases need to either be removed or replaced. Analytics should plan for this, and decide what should be done.

db1046 - m2-master
db1047 - analytics-slave

Details

Subject	Repo	Branch	Lines +/-
Set role::system::spare to db104[67]	operations/puppet	production	+3 -73
mariadb-eventlogging: Move prometheus monitoring to /run socket	operations/puppet	production	+1 -1
mariadb-eventlogging: db1108 will contain only log database	operations/puppet	production	+1 -1
m4.hosts: Remove db1046 and db1047	operations/software	master	+2 -2
Delete role::mariadb::analytics	operations/puppet	production	+1 -15
Drop the Eventlogging support for dbstore1002	operations/puppet	production	+1 -10
haproxy: Migrate dbproxy1004 to stretch (haproxy 1.7) format	operations/puppet	production	+1 -11
eventlogging: Repoint m4-master CNAME to dbproxy1009	operations/dns	master	+2 -1
eventlogging: Point m4-master to db1107 with db1108 as backup	operations/puppet	production	+8 -8
haproxy: Followup to Update configuration template to haproxy 1.7 syntax	operations/puppet	production	+1 -1
haproxy: Update configuration template to haproxy 1.7 syntax	operations/puppet	production	+26 -2
hieradata::regex: allow notifications for db1107, disable them for db1046	operations/puppet	production	+2 -2
dbproxy: Add proxy 4 and 9 to reimage as stretch for upgrade	operations/puppet	production	+4 -0
db1047 => db1108	wikimedia/discovery/golden	master	+22 -9
db1047 => db1108	wikimedia/discovery/wmf	master	+8 -4
Remove any trace of db1047 from analytics CNAMEs	operations/dns	master	+3 -3

Related Objects
Search...

Status	Assigned	Task
Resolved	None	T134476 Decommission old coredb machines (<=db1050)
Resolved	elukey	T156844 Decommission old dbstore hosts (db1046, db1047)
		Unknown Object (Task)
Resolved	elukey	T177405 rack and setup db1107 and db1108

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Update on this. @luca is working on T156933, and in talking, we realized that if we get rid of the second slave (db1047), we will only have one copy of EL data older than 90 days that we keep on the slave. I think I had thought that smart purging was also done on the master, so the retained data would remain there, but this is not true.

For redundancy purposes, we should keep two slaves. We should refresh both db1046 and db1047. Sorry for the waffling on this.

If redundancy is the main reason, and not load balancing, I would suggest having the redundant server on codfw. But there is now no analytics server on codfw, so up to you.

+1, that sounds like a good idea to me!

jcrespo mentioned this in T166141: db1046 BBU looks faulty.May 26 2017, 10:12 AM

Marostegui mentioned this in T166205: Convert unique keys into primary keys for some wiki tables on s2.Jun 16 2017, 2:32 PM

elukey mentioned this in Unknown Object (Task).Jun 22 2017, 3:53 PM

elukey created subtask Unknown Object (Task).

faidon closed subtask Unknown Object (Task) as Resolved.Oct 17 2017, 5:18 PM

Updating this task in light of the recent discussions. The analytics and DBA teams have been fighting a lot with disk space consumption on dbstore1002 due to too many databases replicated in there (log + wikis). We would like to proceed as following:

Replace db1047 (analytics-slave) with db1108 (currently being configured in https://phabricator.wikimedia.org/T177405), and replicate only the log database.
Replace db1046 with db1107 (currently being configured in https://phabricator.wikimedia.org/T177405)
Stop replication of the log database on dbstore1002 (keeping wikis of course).

jcrespo added a subtask: T177405: rack and setup db1107 and db1108.Oct 19 2017, 3:07 PM

Mentioned in SAL (#wikimedia-operations) [2017-10-26T07:43:37Z] <marostegui> Stop MySQL on db1047 to copy data over db1108 - T177405 T156844

db1047's data has been migrated to db1108.
It is working fine now.

We are going to leave db1108 working for a few days, make sure the event logging sync script works fine and then move the analytics-slave CNAME to db1108. After that we will ping all the people that have data on db1047 to migrate stuff to the new server.

Performance looks really good!

root@db1108[log]> select count(*) from MediaViewer_10867062_15423246;
+------------+
| count(*)   |
+------------+
| 1151169110 |
+------------+
1 row in set (3 min 0.40 sec)

All right we are ready to outline the next steps with some deadlines (tentative):

November 6th: the analytics-slave CNAME moves from db1047 to db1108
November 13th: the log database will be dropped from dbstore1002/analytics-store together with the EL replica scripts
December 4th: shutdown of db1047 (prior backup of non-log database tables)

elukey added a project: User-Elukey.Nov 2 2017, 12:55 PM

In T156844#3729337, @elukey wrote:

All right we are ready to outline the next steps with some deadlines (tentative):

November 6th: the analytics-slave CNAME moves from db1047 to db1108

November 13th: the log database will be dropped from dbstore1002/analytics-store together with the EL replica scripts

December 4th: shutdown of db1047 (prior backup of non-log database tables)

Sounds good to me!! :-)

elukey moved this task from Backlog to In Progress on the User-Elukey board.Nov 7 2017, 10:32 AM

Just sent a summary of what's happening to engineering@ and analytics@, new deadlines:

- November 13th: the analytics-slave CNAME moves from db1047 to db1108
- November 20th: the log database will be dropped from dbstore1002/analytics-store together with the event-logging replication script
- December 4th: shutdown of db1047 (prior backup of non-log database tables)

Addshore mentioned this in T180025: Evaluate the use of analytics-slave.eqiad.wmnet in analytics/wmde/scripts.Nov 8 2017, 1:00 PM

Does that mean it's not going to be possible to JOIN EventLogging tables with MediaWiki tables in the future? I'm not working with user-facing features these days, but when I did, that would have been pretty painful.

@Tgr
What types of selects were you doing?

We think the best place to do this type of joining is hadoop, we are working into refining EL data into hadoop (https://phabricator.wikimedia.org/T162610) and the mediawiki edit reconstructuction already pulls mediawiki tables and edit data for all wikis into the cluster.

Checking how much different user cohorts (bucketed by editcount) click on a button, for example.

If it can be done in Hadoop, that sounds great (I was doing this two years ago, I don't think Hadoop had the MediaWiki tables then). How would that work privacy-wise? Right now Hadoop is the high-security area, and MySQL is the low-security area (in terms of how many people have access), with much less sensitive data. Would there be a separate webrequest hadoop access and EL-only Hadoop access?

Would there be a separate webrequest hadoop access and EL-only Hadoop access?

It doesn't work 100% as it should, but for Hadoop access control, analytics-privatedata-users will giveyou Hadoop access AND allow you to read things like webrequest, whereas analytics-users is just Hadoop access. I think EventLogging analytics and MW data don't need privatedata level restrictions. So, yes! This should be possible now.

@Tgr : i think your use case would work in hadoop as edit data related information is available since the beginning of time, now, it is true that hadoop at this time does not give you the ability to join with "any" mediawiki table.

Change 391020 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Modify CNAME for analytics-slave from db1047 to db1108

https://gerrit.wikimedia.org/r/391020

gerritbot added a project: Patch-For-Review.Nov 13 2017, 3:23 PM

Change 391020 merged by Elukey:
[operations/dns@master] Remove any trace of db1047 from analytics CNAMEs

https://gerrit.wikimedia.org/r/391020

Change 391062 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[wikimedia/discovery/golden@master] db1047 => db1108

https://gerrit.wikimedia.org/r/391062

Change 391063 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[wikimedia/discovery/wmf@master] db1047 => db1108

https://gerrit.wikimedia.org/r/391063

Change 391063 merged by Chelsyx:
[wikimedia/discovery/wmf@master] db1047 => db1108

https://gerrit.wikimedia.org/r/391063

Change 391062 merged by Chelsyx:
[wikimedia/discovery/golden@master] db1047 => db1108

https://gerrit.wikimedia.org/r/391062

mpopov mentioned this in R1821:d0af60b49dd0: db1047 => db1108.Nov 13 2017, 8:09 PM

mpopov mentioned this in R1821:39e5297ded97: db1047 => db1108.

Change 391528 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbproxy: Add proxy 4 and 9 to reimage as stretch for upgrade

https://gerrit.wikimedia.org/r/391528

Change 391528 merged by Jcrespo:
[operations/puppet@production] dbproxy: Add proxy 4 and 9 to reimage as stretch for upgrade

https://gerrit.wikimedia.org/r/391528

Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts:

['dbproxy1009.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201711151013_jynus_6035.log.

Completed auto-reimage of hosts:

['dbproxy1009.eqiad.wmnet']

and were ALL successful.

Change 391536 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] haproxy: Update configuration template to haproxy 1.7 syntax

https://gerrit.wikimedia.org/r/391536

Change 391540 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hieradata::regex: allow notifications for db1107, disable them for db1046

https://gerrit.wikimedia.org/r/391540

Change 391540 merged by Elukey:
[operations/puppet@production] hieradata::regex: allow notifications for db1107, disable them for db1046

https://gerrit.wikimedia.org/r/391540

Change 391536 merged by Jcrespo:
[operations/puppet@production] haproxy: Update configuration template to haproxy 1.7 syntax

https://gerrit.wikimedia.org/r/391536

Change 391542 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] haproxy: Followup to Update configuration template to haproxy 1.7 syntax

https://gerrit.wikimedia.org/r/391542

Change 391542 merged by Jcrespo:
[operations/puppet@production] haproxy: Followup to Update configuration template to haproxy 1.7 syntax

https://gerrit.wikimedia.org/r/391542

Change 391543 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] eventlogging: Point m4-master to db1107 with db1108 as backup

https://gerrit.wikimedia.org/r/391543

Change 391545 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/dns@master] eventlogging: Repoint m4-master CNAME to dbproxy1009

https://gerrit.wikimedia.org/r/391545

Change 391543 merged by Jcrespo:
[operations/puppet@production] eventlogging: Point m4-master to db1107 with db1108 as backup

https://gerrit.wikimedia.org/r/391543

Change 391545 merged by Jcrespo:
[operations/dns@master] eventlogging: Repoint m4-master CNAME to dbproxy1009

https://gerrit.wikimedia.org/r/391545

elukey closed subtask T177405: rack and setup db1107 and db1108 as Resolved.Nov 15 2017, 5:56 PM

I am currently reviewing what tables to drop on db1047 and which ones to copy over to db1108, and this is what I gathered from old entries of this task:

DROP database

halfak, staeiou, giovanni, declerambaul, nimish, rfaulk, whym, erik, jmorgan

DO NOT COPY TO db1108

ops, otto_test_sync_to_delete, test

COPY to db1108

staging

To review

dartar
diederik
shawn
zexley

Re-ping for:

@DarTar (dartar)
@drdee (diederik)

ops are ours, we can handle that- just leave things as you found them.
test is probably a mistake and probably should be deleted unless you find something interesting there.

Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts:

['dbproxy1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201711171416_jynus_31942.log.

Change 392044 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] haproxy: Migrate dbproxy1004 to stretch (haproxy 1.7) format

https://gerrit.wikimedia.org/r/392044

Completed auto-reimage of hosts:

['dbproxy1004.eqiad.wmnet']

and were ALL successful.

Change 392044 merged by Jcrespo:
[operations/puppet@production] haproxy: Migrate dbproxy1004 to stretch (haproxy 1.7) format

https://gerrit.wikimedia.org/r/392044

Re @elukey

Seems to me that the DROP DATABASE list is correct.

To the list of databases to review I would add: akhanna (https://wikimediafoundation.org/wiki/User:Akhanna)
Also, I think zexley corresponds to: https://fr.wikipedia.org/wiki/Utilisateur:ZExley_(WMF)
Could not find who shawn corresponds to...

jcrespo moved this task from Blocked external/Not db team to Done on the DBA board.Nov 17 2017, 6:06 PM

I just spoke to @JAllemandou, we can help review these legacy tables early next week.

Mentioned in SAL (#wikimedia-operations) [2017-11-20T08:46:03Z] <marostegui> Run mydumper for db1047.staging - T156844

Ottomata renamed this task from Prep to decommission old dbstore hosts (db1046, db1047) to Decommission old dbstore hosts (db1046, db1047).Nov 21 2017, 3:33 PM

The data behind Page Creation Dashboard is configured to read data from the log database on dbstore1002. Can I at this point submit a patch to the ReportUpdater configuration that updates it to use db1108.eqiad.wmnet, as that now has the updated log database?

In T156844#3778225, @Nettrom wrote:

The data behind Page Creation Dashboard is configured to read data from the log database on dbstore1002. Can I at this point submit a patch to the ReportUpdater configuration that updates it to use db1108.eqiad.wmnet, as that now has the updated log database?

Yes correct, or you can use the domain analytics-slave.eqiad.wmnet (that is a CNAME to db1108 :)

Nevermind, turns out @mforns has already updated that configuration, should've checked that first. Thanks again for taking care of it!

@Nettrom Right, is not only that dashboard but all the ones that are feed data via reportupdater that needed the new configuration

In T156844#3767146, @elukey wrote:

To review

dartar
diederik
shawn
zexley

This is a list of tables for all the above databases, in case somebody wants to quickly glance over them:

========= dartar

Table	Size in MB
clean_patrol_log	15.52
pages_reviewed	2.52
temp	0.00

========= diederik

Table	Size in MB
archive	4609.74
logging	4532.85
sandbox	54.55
gender	7.86

========= shawn

Table	Size in MB
categorylinks	2624.42
user_meta	212.50
hlp_user_meta	111.10
yr_hlp_user_meta	35.10
wtemplates	21.89
wp_members	4.65
wp_editors	2.81
wp_members_categorylink	2.74
wp_members_section	2.31
wp_members_memberpage	1.50
categorylinks_wp	0.19
wikiproject_editors	0.19
wikiprojects_member_pages	0.11
wp_page_count	0.09
WikiProject_Pages	0.08
w_users	0.00

========= zexley

Table	Size in MB
suspectboteditsamp	3099.15
suspectbotrevs	1823.20
3ers	1196.98
user_meta_qs	443.50
user_qs	316.53
user_meta_qs_archive	187.50
ncc	152.18
no_deleted_edits	130.02
crazy_article_edits	3.80
3ersMeta12	2.40
lass_rev	1.40
lass_catlinks	0.59
boteditsmonthlybybot	0.37
fwjru	0.35
lass_page	0.23
lass_category	0.11
lass_archive	0.09
crazy_editors	0.07
bots	0.05
ncu	0.03
nc	0.01
try	0.01
boteditsmonthly	0.00
test2	0.00
test	0.00
huggle_monthly	0.00

diederik and zexley seem to be previous employee of the WMF, we didn't find info about shawn and dartar should be @DarTar.

If anybody knows anything about the above tables please speak up. I am going to re-check with the research team.

Change 393238 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Drop the Eventlogging support for dbstore1002

https://gerrit.wikimedia.org/r/393238

elukey edited projects, added Analytics-Kanban; removed Analytics.Nov 24 2017, 3:24 PM

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.

elukey claimed this task.Nov 24 2017, 3:27 PM

The shawn table belonged to Shawn Walker, a research intern in 2011. These tables can be safely deleted.

Mentioned in SAL (#wikimedia-operations) [2017-11-27T13:22:33Z] <elukey> remove eventlogging replication support (log database) from dbstore1002 - T156844

Change 393238 merged by Elukey:
[operations/puppet@production] Drop the Eventlogging support for dbstore1002

https://gerrit.wikimedia.org/r/393238

Change 393597 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Delete role::mariadb::analytics

https://gerrit.wikimedia.org/r/393597

Mentioned in SAL (#wikimedia-operations) [2017-11-28T09:04:36Z] <marostegui> Drop database log from dbstore1002 - T156844

root@dbstore1002:~#  mysql --skip-ssl
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 56286751
Server version: 10.0.22-MariaDB MariaDB Server

Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

root@DBSTORE[(none)]> set session sql_log_bin=0;
Query OK, 0 rows affected (0.00 sec)

root@DBSTORE[(none)]> drop database if exists log;
Query OK, 379 rows affected (9.01 sec)

Change 393597 merged by Elukey:
[operations/puppet@production] Delete role::mariadb::analytics

https://gerrit.wikimedia.org/r/393597

In T156844#3786146, @Capt_Swing wrote:

The shawn table belonged to Shawn Walker, a research intern in 2011. These tables can be safely deleted.

Thanks!

I dumped anyway the dbs dartar, diederik, shawn and zexley on my home directory on stat1006:

elukey@stat1006:/srv/home/elukey$ du -hs *
15M     db1047_dartar_backup.sql
11G     db1047_diederik_backup.sql
920M    db1047_shawn_backup.sql
7.3G    db1047_zexley_backup.sql

This should be enough to unblock the decom process for db104[67].

Marostegui awarded a token.Nov 30 2017, 2:24 PM

Change 394318 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] m4.hosts: Remove db1046 and db1047

https://gerrit.wikimedia.org/r/394318

Change 394318 merged by jenkins-bot:
[operations/software@master] m4.hosts: Remove db1046 and db1047

https://gerrit.wikimedia.org/r/394318

Change 394326 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-eventlogging: db1108 will contain only log database

https://gerrit.wikimedia.org/r/394326

Change 394326 merged by Jcrespo:
[operations/puppet@production] mariadb-eventlogging: db1108 will contain only log database

https://gerrit.wikimedia.org/r/394326

Change 394332 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-eventlogging: Move prometheus monitoring to /run socket

https://gerrit.wikimedia.org/r/394332

Change 394332 merged by Jcrespo:
[operations/puppet@production] mariadb-eventlogging: Move prometheus monitoring to /run socket

https://gerrit.wikimedia.org/r/394332

Change 394531 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set role::system::spare to db104[67]

https://gerrit.wikimedia.org/r/394531

Mentioned in SAL (#wikimedia-operations) [2017-12-01T08:41:23Z] <marostegui> Remove db1046 and db1047 from tendril - T156844

Change 394531 merged by Elukey:
[operations/puppet@production] Set role::system::spare to db104[67]

https://gerrit.wikimedia.org/r/394531

elukey moved this task from In Progress to Done on the Analytics-Kanban board.Dec 1 2017, 10:33 AM

Opened https://phabricator.wikimedia.org/T181784 to fully decom db104[67]

elukey moved this task from In Progress to Done on the User-Elukey board.Dec 4 2017, 9:24 AM

As per our chat, closing this.
Thanks for all the hard work you've put to make this happen!

Decommission old dbstore hosts (db1046, db1047)Closed, ResolvedPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

DROP database

DO NOT COPY TO db1108

COPY to db1108

To review

To review

Decommission old dbstore hosts (db1046, db1047)
Closed, ResolvedPublic
Actions

Related Objects
Search...