Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011
Closed, Resolved (Public)

Description

The wikireplicas service (labsdb1009, labsdb1010, labsdb1011) runs behind a pair of proxies: dbproxy1010 and dbproxy1011.

These two proxies were scheduled to be decommissioned in January 2016, as they were purchased in 2011.
We were finally able to purchase and provision dbproxy1018 and dbproxy1019 as replacements (T202367).

  • dbproxy1010 -> dbproxy1018
  • dbproxy1011 -> dbproxy1019

The replacement should be transparent to users: it is a matter of changing the DNS records to point to the new hosts and letting the connections move over gradually. Once the old proxies have no remaining connections, we can proceed to decommission them.

Event Timeline

Marostegui moved this task from Triage to Pending comment on the DBA board.
Marostegui updated the task description.

Change 534573 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Replace dbproxy1010 with dbproxy1018

https://gerrit.wikimedia.org/r/534573

I have been talking to Jaime about this, and we had better wait for https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/531670/ to be pushed, so that we can start with the new hosts fully configured.

Change 534577 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] wikireplica_dns: Replace dbproxy1010 with dbproxy1018

https://gerrit.wikimedia.org/r/534577

Everything seems to be working fine on dbproxy1019 and dbproxy1018 after merging the above changes. Everything is reachable normally and haproxy looks good too:

root@cumin1001:/home/marostegui# telnet dbproxy1019 3306
Trying 10.64.37.28...
Connected to dbproxy1019.eqiad.wmnet.
Escape character is '^]'.
Y
5.5.5-10.1.43-MariaDBGi�fjd9-s!F�??�}e_0o^jw<bWmmysql_native_passwordConnection closed by foreign host.


root@tools-sgebastion-07:~# telnet dbproxy1019.eqiad.wmnet 3306
Trying 10.64.37.28...
Connected to dbproxy1019.eqiad.wmnet.
Escape character is '^]'.
Y
5.5.5-10.1.43-MariaDB&
                      /!XwnA:>�??�hq}Bo*"hjX[Cmysql_native_password

root@dbproxy1019:~# echo "show stat" | socat /run/haproxy/haproxy.sock stdio
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,agent_status,agent_code,agent_duration,check_desc,agent_desc,check_rise,check_fall,check_health,agent_rise,agent_fall,agent_health,addr,cookie,mode,algo,conn_rate,conn_rate_max,conn_tot,intercepted,dcon,dses,
mariadb,FRONTEND,,,0,0,5000,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,2,0,,,,0,0,0,0,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,tcp,,0,0,0,,0,0,
mariadb,labsdb1009,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,17,0,,1,2,1,,0,,2,0,,0,L7OK,0,0,,,,,,,,,,,0,0,,,,,-1,5.5.5-10.1.43-MariaDB,,0,0,0,0,,,,Layer7 check passed,,99999999,20,100000018,,,,,,tcp,,,,,,,,
mariadb,labsdb1010,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,17,0,,1,2,2,,0,,2,0,,0,L7OK,0,0,,,,,,,,,,,0,0,,,,,-1,5.5.5-10.1.43-MariaDB,,0,0,0,0,,,,Layer7 check passed,,99999999,20,100000018,,,,,,tcp,,,,,,,,
mariadb,BACKEND,0,0,0,0,500,0,0,0,0,0,,0,0,0,0,UP,2,2,0,,0,17,0,,1,2,0,,0,,1,0,,0,,,,,,,,,,,,,,0,0,0,0,0,0,-1,,,0,0,0,0,,,,,,,,,,,,,,tcp,,,,,,,,

root@cumin1001:/home/marostegui# telnet dbproxy1018 3306
Trying 10.64.37.27...
Connected to dbproxy1018.eqiad.wmnet.
Escape character is '^]'.
Y
5.5.5-10.1.43-MariaDB˅�
k[gs+zI=�??�53=w`^64>{rJmysql_native_password

root@tools-sgebastion-07:~# telnet dbproxy1018.eqiad.wmnet 3306
Trying 10.64.37.27...
Connected to dbproxy1018.eqiad.wmnet.
Escape character is '^]'.
Y
5.5.5-10.1.43-MariaDB���
V]Ras<qE�??�EgZm6cp)xE/Wmysql_native_password

root@dbproxy1018:~#  echo "show stat" | socat /run/haproxy/haproxy.sock stdio
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,agent_status,agent_code,agent_duration,check_desc,agent_desc,check_rise,check_fall,check_health,agent_rise,agent_fall,agent_health,addr,cookie,mode,algo,conn_rate,conn_rate_max,conn_tot,intercepted,dcon,dses,
mariadb,FRONTEND,,,0,1,5000,5,5,502,0,0,0,,,,,OPEN,,,,,,,,,1,2,0,,,,0,0,0,1,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,tcp,,0,1,5,,0,0,
mariadb,labsdb1011,0,0,0,1,,5,5,502,,0,,0,1,0,0,UP,1,1,0,0,0,85113,0,,1,2,1,,5,,2,0,,1,L7OK,0,0,,,,,,,,,,,0,1,,,,,18,5.5.5-10.1.43-MariaDB,,0,0,0,41,,,,Layer7 check passed,,99999999,20,100000018,,,,,,tcp,,,,,,,,
mariadb,labsdb1010,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,0,1,0,0,85113,0,,1,2,2,,0,,2,0,,0,L7OK,0,0,,,,,,,,,,,0,0,,,,,-1,5.5.5-10.1.43-MariaDB,,0,0,0,0,,,,Layer7 check passed,,2,3,4,,,,,,tcp,,,,,,,,
mariadb,BACKEND,0,0,0,1,500,5,5,502,0,0,,0,1,0,0,UP,1,1,1,,0,85113,0,,1,2,0,,5,,1,0,,1,,,,,,,,,,,,,,0,1,0,0,0,0,18,,,0,0,0,41,,,,,,,,,,,,,,tcp,,,,,,,,
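
The `show stat` output above is easier to audit when it is reduced to the few columns that matter for a failover check. A minimal sketch, assuming the CSV layout shown above (a header line starting with `# `); the function name `haproxy_summary` is hypothetical and not an existing tool:

```shell
# Summarize haproxy "show stat" CSV read from stdin.
# Column positions are looked up by name from the header line, so the
# function does not depend on a fixed column order.
haproxy_summary() {
    awk -F, '
        NR == 1 {
            sub(/^# /, "", $1)                     # header starts with "# pxname"
            for (i = 1; i <= NF; i++) col[$i] = i  # column name -> position
            next
        }
        NF > 1 {
            printf "%s/%s status=%s active=%s backup=%s\n",
                $(col["pxname"]), $(col["svname"]), $(col["status"]),
                $(col["act"]), $(col["bck"])
        }
    '
}

# Possible usage on a proxy host:
#   echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_summary
```

Applied to the dbproxy1018 output above, the labsdb1010 row would show `active=0 backup=1`, which makes it easy to see that labsdb1010 is only the backup server on that proxy.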

Mentioned in SAL (#wikimedia-operations) [2020-03-30T07:40:17Z] <marostegui> Replace dbproxy1010 with dbproxy1011 for wiki replicas, analytics - T231520

Change 534573 merged by Marostegui:
[operations/dns@master] wmnet: Replace dbproxy1010 with dbproxy1018

https://gerrit.wikimedia.org/r/534573

Change 534577 merged by Marostegui:
[operations/puppet@production] wikireplica_dns: Replace dbproxy1010 with dbproxy1018

https://gerrit.wikimedia.org/r/534577

Mentioned in SAL (#wikimedia-operations) [2020-03-30T07:48:16Z] <marostegui> Run cloudcontrol1003:~# wmcs-wikireplica-dns to promote dbproxy1018 to wikireplicas active proxy T231520

@Bstorm @bd808 I have pushed https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/534577/3/modules/openstack/files/util/wikireplica_dns.yaml and run:

$ ssh cloudcontrol1003.wikimedia.org
$ source <(sudo cat ~root/novaenv.sh)
$ wmcs-wikireplica-dns

That is all I had in my notes for changing the DNS to point to a different dbproxy. This change should be transparent for everyone.
I am already seeing:

root@tools-sgebastion-07:~# host s1.analytics.db.svc.eqiad.wmflabs
s1.analytics.db.svc.eqiad.wmflabs has address 10.64.37.27

And I am also seeing connections arriving through the new proxy:

root@cumin1001:/home/marostegui# mysql.py -h labsdb1011 -e "show processlist" | grep "37.27"
217150574	s52835	10.64.37.27:49494	frwiki_p	Query	3	Sending data	select page_namespace, ll_title, pa_importance, pa_class\n    from page, langlinks, page_assessments,	0.000
217150602	s52467	10.64.37.27:49528	tlwiki_p	Sleep	0		NULL	0.000
217150603	s52585	10.64.37.27:49530	enwiki_p	Query	0	Sending data	SELECT COUNT(*) as count FROM `page` JOIN `revision` ON `rev_page` = `page_id` WHERE `page_namespace	0.000
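
A plain `grep "37.27"` matches anywhere in the line, including the query text. A slightly stricter check, sketched here under the assumption that the processlist is tab-separated with the client `host:port` in the third column (as in the output above), counts only rows whose Host column starts with the proxy address. The function name `count_conns_from` is hypothetical:

```shell
# Count processlist rows (read from stdin) whose Host column is "<ip>:<port>".
# $1: IP address of the proxy to look for.
count_conns_from() {
    awk -F'\t' -v ip="$1" '
        index($3, ip ":") == 1 { n++ }   # Host column begins with "ip:"
        END { print n + 0 }              # print 0 when nothing matched
    '
}

# Possible usage:
#   mysql.py -h labsdb1011 -e "show processlist" | count_conns_from 10.64.37.27
```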

I have also run some queries via Quarry and I have seen them arriving correctly at labsdb1011 via the new proxy.

Is this all or should I run something else?

Thanks

Change 584539 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy101: Clarify that it is not an active proxy

https://gerrit.wikimedia.org/r/584539

Change 584539 merged by Marostegui:
[operations/puppet@production] dbproxy101: Clarify that it is not an active proxy

https://gerrit.wikimedia.org/r/584539

I have also run some queries via Quarry and I have seen them arriving correctly at labsdb1011 via the new proxy.

Is this all or should I run something else?

I think that's all that should be needed. The s<number>.analytics... records all seem to have been updated as expected:

$ for n in $(seq 1 8); do host s${n}.analytics.db.svc.eqiad.wmflabs; done
s1.analytics.db.svc.eqiad.wmflabs has address 10.64.37.27
s2.analytics.db.svc.eqiad.wmflabs has address 10.64.37.27
s3.analytics.db.svc.eqiad.wmflabs has address 10.64.37.27
s4.analytics.db.svc.eqiad.wmflabs has address 10.64.37.27
s5.analytics.db.svc.eqiad.wmflabs has address 10.64.37.27
s6.analytics.db.svc.eqiad.wmflabs has address 10.64.37.27
s7.analytics.db.svc.eqiad.wmflabs has address 10.64.37.27
s8.analytics.db.svc.eqiad.wmflabs has address 10.64.37.27
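
Rather than reading the eight lines by eye, the same loop can be piped into a small filter that fails if any record points somewhere unexpected. This is a sketch only, assuming the `host` output format shown above; the function name `check_resolves_to` is hypothetical:

```shell
# Read `host` output from stdin and verify every "has address" line
# ends with the expected IP. Prints mismatches and returns non-zero if any.
# $1: expected IPv4 address.
check_resolves_to() {
    awk -v want="$1" '
        / has address / && $NF != want { print "MISMATCH: " $0; bad = 1 }
        END { exit bad + 0 }
    '
}

# Possible usage:
#   for n in $(seq 1 8); do host s${n}.analytics.db.svc.eqiad.wmflabs; done \
#       | check_resolves_to 10.64.37.27 && echo "all records updated"
```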

Thank you!
Most of the connections are already gone; there is just one pending. Once it has finished, I will stop haproxy on dbproxy1010 and leave it stopped for a few days before proceeding with the decommissioning (T248944).

Change 586206 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] wikireplicas_dns: Replace dbproxy1011 with dbproxy1019

https://gerrit.wikimedia.org/r/586206

Change 586207 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Replace dbproxy1011 with dbproxy1019

https://gerrit.wikimedia.org/r/586207

Change 586206 merged by Marostegui:
[operations/puppet@production] wikireplicas_dns: Replace dbproxy1011 with dbproxy1019

https://gerrit.wikimedia.org/r/586206

Change 586207 merged by Marostegui:
[operations/dns@master] wmnet: Replace dbproxy1011 with dbproxy1019

https://gerrit.wikimedia.org/r/586207

Mentioned in SAL (#wikimedia-operations) [2020-04-07T07:47:36Z] <marostegui> Failover dbproxy1011 to dbproxy1019 - T231520)

I have changed DNS so that dbproxy1019 is now the active proxy for the web service:

root@tools-sgebastion-07:~# host s1.web.db.svc.eqiad.wmflabs
s1.web.db.svc.eqiad.wmflabs has address 10.64.37.28
root@tools-sgebastion-07:~# host 10.64.37.28
28.37.64.10.in-addr.arpa domain name pointer dbproxy1019.eqiad.wmnet.

And I already see connections coming from that proxy:

root@cumin1001:~# mysql.py -hlabsdb1009 -e "show processlist" | grep "10.64.37.28"
39443264	s53278	10.64.37.28:51454	enwiki_p	Sleep	30		NULL	0.000
39443396	s53003	10.64.37.28:51866	enwiki_p	Query	0	Copying to tmp table	SET STATEMENT max_statement_time = 600 FOR\nSELECT c.page_namespace, e.page_title, c.page_is_redirect	0.000
39443403	u2815	10.64.37.28:51886	enwiki_p	Sleep	0		NULL	0.000
39443405	s51187	10.64.37.28:51894	enwiki_p	Query	0	Sending data	SET STATEMENT max_statement_time = 60 FOR\nSELECT *, (\n

Looks like there are no more connections going through dbproxy1011:

root@cumin1001:/home/marostegui# host dbproxy1011
dbproxy1011.eqiad.wmnet has address 10.64.37.15
root@cumin1001:/home/marostegui# for i in labsdb1009 labsdb1010 labsdb1011 labsdb1012; do mysql.py -h$i -e "show processlist" | grep "37.15" ; done
root@cumin1001:/home/marostegui#
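
The empty result above is a one-off check. If the old proxy still had long-running connections, the same test could be polled until it comes back empty. A minimal sketch, assuming tab-separated processlist output with the client `host:port` in the third column; `wait_until_drained` and `all_processlists` are hypothetical names, not existing tools:

```shell
# Poll until no processlist row has a Host column of "<ip>:<port>".
# $1: proxy IP; $2: seconds between polls; remaining args: a command
# that prints the combined processlist of all backends.
wait_until_drained() {
    ip=$1
    interval=$2
    shift 2
    # awk exits 0 while at least one connection from the proxy remains.
    while "$@" | awk -F'\t' -v ip="$ip" '
              index($3, ip ":") == 1 { found = 1 }
              END { exit !found }'; do
        sleep "$interval"
    done
}

# Possible usage (all_processlists is an assumed wrapper):
#   all_processlists() {
#       for i in labsdb1009 labsdb1010 labsdb1011 labsdb1012; do
#           mysql.py -h"$i" -e "show processlist"
#       done
#   }
#   wait_until_drained 10.64.37.15 60 all_processlists
```

Passing the processlist command as arguments keeps the loop independent of how the backends are queried.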

Mentioned in SAL (#wikimedia-operations) [2020-04-08T06:11:24Z] <marostegui> Stop haproxy on dbproxy1011 - T231520

I have stopped haproxy on dbproxy1011

Marostegui updated the task description.

dbproxy1019 has been working fine for a week. Considering this done.
The next step is to decommission dbproxy1011: T249590