Productionize pc2011-pc2014 and pc1011-pc1014
Closed, Resolved, Public

Assigned To

	Kormat

Authored By

	Marostegui
	Jun 11 2021, 2:28 PM

Details

Subject	Repo	Branch	Lines +/-
mariadb: Set mysql_role for primary pc hosts.	operations/puppet	production	+6 -0
wmnet: Update pcX-master	operations/dns	master	+3 -3
ProductionServices: Promote pc1012 to primary of pc2.	operations/mediawiki-config	master	+2 -2
ProductionServices: Promote pc1011 to primary of pc1.	operations/mediawiki-config	master	+2 -2
ProductionServices: Promote pc1013 to primary of pc3.	operations/mediawiki-config	master	+11 -11
ProductionServices: Promote pc2013 to primary of pc3.	operations/mediawiki-config	master	+6 -6
ProductionServices: Promote pc2012 to primary of pc2.	operations/mediawiki-config	master	+2 -2
ProductionServices: Promote pc2011 to primary of pc1.	operations/mediawiki-config	master	+2 -2
ProductionServices: Add new pc hosts.	operations/mediawiki-config	master	+12 -0
pc101[1-4]: Enable notifications.	operations/puppet	production	+0 -4
pc101[1-4]: Add to parsercache role and sections.	operations/puppet	production	+16 -9
pc201[1-4]: Enable notifications.	operations/puppet	production	+0 -4
pc201[1-4]: Add to mariadb::parsercache role	operations/puppet	production	+7 -8
pc201[1-4]: Disable notifications	operations/puppet	production	+8 -0
site.pp: Add new parsercache hosts as insetup	operations/puppet	production	+11 -0

Related Objects

Status	Assigned	Task
Resolved	Kormat	T284825 Productionize pc2011-pc2014 and pc1011-pc1014
Resolved	Jclark-ctr	T282484 (Need By: TBD) rack/setup/install pc1011-pc1014
Resolved	Papaul	T282482 (Need By: TBD) rack/setup/install pc2011-pc2014

Event Timeline

Marostegui created this task. Jun 11 2021, 2:28 PM

Restricted Application added a subscriber: Aklapper. Jun 11 2021, 2:28 PM

For the record, I assigned them a partman recipe so that they get installed with the proper partitioning scheme.
I haven't added them to site.pp or anywhere else.

LSobanski subscribed. Jun 11 2021, 2:31 PM

Change 699424 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] site.pp: Add new parsercache hosts as insetup

https://gerrit.wikimedia.org/r/699424

Change 699424 merged by Marostegui:

[operations/puppet@production] site.pp: Add new parsercache hosts as insetup

https://gerrit.wikimedia.org/r/699424

Per my chat with Papaul, I have just added them to site.pp as insetup.

Maintenance_bot removed a project: Patch-For-Review. Jun 11 2021, 3:10 PM

Papaul closed subtask T282482: (Need By: TBD) rack/setup/install pc2011-pc2014 as Resolved. Jun 11 2021, 9:50 PM

The codfw hosts are now ready to be productionized, as the racking and installation task in codfw is done (T282482).
Reminder: set these hosts to Active in netbox.

Marostegui changed the task status from Stalled to Open. Jun 14 2021, 5:00 AM

Marostegui triaged this task as Medium priority. Jun 14 2021, 5:02 AM

Marostegui moved this task from Blocked to Ready on the DBA board.

Change 699580 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] pc201[1-4]: Disable notifications

https://gerrit.wikimedia.org/r/699580

Change 699580 merged by Marostegui:

[operations/puppet@production] pc201[1-4]: Disable notifications

https://gerrit.wikimedia.org/r/699580

Maintenance_bot removed a project: Patch-For-Review. Jun 14 2021, 5:10 AM

Just ran lvextend -L+1100G /dev/mapper/tank-data && sudo xfs_growfs /srv on all the codfw hosts.

===== NODE GROUP =====
(4) pc[2011-2014].codfw.wmnet
----- OUTPUT of 'df -hT /srv;' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   8.7T  9.3G  8.7T   1% /srv
================
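For reference, the resize above could be sketched per host roughly as follows. This is a minimal sketch, assuming the volume group still has at least 1100G free and that /srv is the XFS mount backed by tank-data. The `DRY_RUN` variable and the `run` helper are hypothetical additions for illustration and were not part of the commands actually run.

```shell
#!/bin/sh
# Sketch: grow the data LV, then grow the XFS filesystem mounted on /srv.
# DRY_RUN and run() are illustrative only. With DRY_RUN=1 (the default here)
# the commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

grow_srv() {
    # Extend the logical volume by 1100G, grow XFS to fill it, then verify.
    run lvextend -L+1100G /dev/mapper/tank-data &&
    run xfs_growfs /srv &&
    run df -hT /srv
}

grow_srv
```

Running it with `DRY_RUN=0` as root would execute the three commands in order and stop at the first failure.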

Marostegui mentioned this in T282761: purgeParserCache.php should not take over 24 hours for its daily run. Jun 23 2021, 5:13 AM

Change 706335 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] pc201[1-4]: Add to mariadb::parsercache role

https://gerrit.wikimedia.org/r/706335

gerritbot added a project: Patch-For-Review. Jul 22 2021, 9:36 AM

Change 706335 merged by Kormat:

[operations/puppet@production] pc201[1-4]: Add to mariadb::parsercache role

https://gerrit.wikimedia.org/r/706335

Maintenance_bot removed a project: Patch-For-Review. Jul 22 2021, 10:10 AM

Change 706436 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] pc201[1-4]: Enable notifications.

https://gerrit.wikimedia.org/r/706436

gerritbot added a project: Patch-For-Review. Jul 22 2021, 12:46 PM

In T284825#7154152, @Marostegui wrote:

The codfw hosts are now ready to be productionized, as the racking and installation task in codfw is done (T282482).
Reminder: set these hosts to Active in netbox.

Done.

Change 706436 merged by Kormat:

[operations/puppet@production] pc201[1-4]: Enable notifications.

https://gerrit.wikimedia.org/r/706436

Kormat claimed this task. Jul 22 2021, 12:48 PM

Kormat updated the task description.

Kormat moved this task from Ready to In progress on the DBA board.

The new pc hosts in codfw are now in service. They're replicating from a blank start, so it will take 3 weeks for them to be populated fully. Once that's done, we can promote one or more of them to primary to see how that affects performance.

/srv resized on all eqiad hosts:

(4) pc[1011-1014].eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   8.7T  9.3G  8.7T   1% /srv

Maintenance_bot removed a project: Patch-For-Review. Jul 22 2021, 1:10 PM

Change 706475 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] pc101[1-4]: Add to parsercache role and sections.

https://gerrit.wikimedia.org/r/706475

gerritbot added a project: Patch-For-Review. Jul 22 2021, 1:24 PM

Change 706475 merged by Kormat:

[operations/puppet@production] pc101[1-4]: Add to parsercache role and sections.

https://gerrit.wikimedia.org/r/706475

RobH closed subtask T282484: (Need By: TBD) rack/setup/install pc1011-pc1014 as Resolved. Jul 22 2021, 2:18 PM

Change 706507 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] pc101[1-4]: Enable notifications.

https://gerrit.wikimedia.org/r/706507

Change 706507 merged by Kormat:

[operations/puppet@production] pc101[1-4]: Enable notifications.

https://gerrit.wikimedia.org/r/706507

All hosts are now in service. This includes:

- sys schema deployed
- set to 'active' in netbox

Kormat closed this task as Resolved. Jul 22 2021, 3:17 PM

Kormat updated the task description.

Change 712115 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/mediawiki-config@master] ProductionServices: Add new pc hosts.

https://gerrit.wikimedia.org/r/712115

Change 712115 merged by jenkins-bot:

[operations/mediawiki-config@master] ProductionServices: Add new pc hosts.

https://gerrit.wikimedia.org/r/712115

Change 712120 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/mediawiki-config@master] ProductionServices: Promote pc2011 to primary of pc1.

https://gerrit.wikimedia.org/r/712120

Going to reopen this, and use it to track making these hosts primaries.

Change 712120 merged by jenkins-bot:

[operations/mediawiki-config@master] ProductionServices: Promote pc2011 to primary of pc1.

https://gerrit.wikimedia.org/r/712120

Mentioned in SAL (#wikimedia-operations) [2021-08-12T09:27:12Z] <kormat@deploy1002> Synchronized wmf-config/ProductionServices.php: Promote pc2011 to primary of pc1 T284825 (duration: 01m 10s)

Mentioned in SAL (#wikimedia-operations) [2021-08-12T09:28:52Z] <kormat> reconfiguring replication tree for pc1 T284825

Mentioned in SAL (#wikimedia-operations) [2021-08-12T09:30:37Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 8 hosts with reason: Reconfiguring replication tree T284825

Mentioned in SAL (#wikimedia-operations) [2021-08-12T09:30:44Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 8 hosts with reason: Reconfiguring replication tree T284825

Steps to update replication tree after making pc2011 primary:

downtime all of pc1, as we have circular replication in place with eqiad
move other pc1/codfw nodes beneath pc2011:
- db-move-replica pc2010 pc2011
- db-move-replica pc2014 pc2011
reset replication for all affected nodes:
- mysql.py -h pc1007 -e 'stop slave; reset slave all'
- mysql.py -h pc2007 -e 'stop slave; reset slave all'
- mysql.py -h pc2011 -e 'stop slave; reset slave all'
re-setup replication for the remaining nodes using binlog coords:
- mysql.py -h pc1007 -e "change master to master_host='pc2011..."
- mysql.py -h pc2007 -e "change master to master_host='pc2011..."
- mysql.py -h pc2011 -e "change master to master_host='pc1007..."
re-enable GTID everywhere
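
The last two steps above could be sketched as SQL roughly like this. It is only a sketch under stated assumptions: the helper names, the host name, and the binlog file and position are hypothetical placeholders, and the credential and TLS options that `CHANGE MASTER TO` needs again after `reset slave all` are left out. The script only prints the statements and runs nothing against a database.

```shell
#!/bin/sh
# Sketch: build the SQL for re-pointing a replica at a new primary using
# binlog coordinates, and for switching it back to GTID afterwards.
# Nothing here connects to a database; it only prints statements.

# change_master_sql HOST BINLOG_FILE BINLOG_POS
change_master_sql() {
    printf "CHANGE MASTER TO MASTER_HOST='%s', MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s; START SLAVE;\n" \
        "$1" "$2" "$3"
}

# enable_gtid_sql: replication has to be stopped while the GTID mode changes.
enable_gtid_sql() {
    printf "STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=slave_pos; START SLAVE;\n"
}

# Placeholder values (hypothetical). Real coordinates would come from
# SHOW MASTER STATUS on the new primary.
change_master_sql "new-primary.example" "binlog.000001" 4
enable_gtid_sql
```

Each printed statement could then be passed to the replica with `-e`, in the same way as the `mysql.py` commands listed above.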

ppelberg mentioned this in T280599: Reduce DiscussionTools' usage of the parser cache. Aug 13 2021, 2:09 AM

Current state: After running for a day, the graphs for the new node (pc2011) are looking very promising. In particular, disk latency is massively improved.

Old primary:
Read latencies: 3.92s to 38.1s. Avg: 14.8s
Write latencies: 1.42s to 36.1s. Avg: 9.4s

New primary:
Read latencies: 172ms to 445ms. Avg: 260ms
Write latencies: 623ms to 3.63s. Avg: 1.39s

This makes sense, but it's still good to see :)
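
As a rough check on the figures above, the quoted averages can be compared directly. This is an illustrative sketch only: the `ratio` helper is hypothetical, and it assumes the old and new averages were measured in comparable ways.

```shell
#!/bin/sh
# ratio OLD NEW: how many times lower NEW is than OLD (same unit), one decimal.
ratio() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", old / new }'
}

# Averages quoted above, converted to seconds.
echo "read:  $(ratio 14.8 0.260)x lower"   # prints "read:  56.9x lower"
echo "write: $(ratio 9.4 1.39)x lower"     # prints "write: 6.8x lower"
```

On these numbers the average read latency drops by a factor of about 57 and the average write latency by a factor of about 7.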

Change 713466 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/mediawiki-config@master] ProductionServices: Promote pc2012 to primary of pc2.

https://gerrit.wikimedia.org/r/713466

Change 713471 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/mediawiki-config@master] ProductionServices: Promote pc2013 to primary of pc3.

https://gerrit.wikimedia.org/r/713471

Change 713466 merged by jenkins-bot:

[operations/mediawiki-config@master] ProductionServices: Promote pc2012 to primary of pc2.

https://gerrit.wikimedia.org/r/713466

Mentioned in SAL (#wikimedia-operations) [2021-08-17T14:20:38Z] <kormat@deploy1002> Synchronized wmf-config/ProductionServices.php: Promote pc2012 to primary of pc2 T284825 (duration: 00m 59s)

Change 713471 merged by jenkins-bot:

[operations/mediawiki-config@master] ProductionServices: Promote pc2013 to primary of pc3.

https://gerrit.wikimedia.org/r/713471

Mentioned in SAL (#wikimedia-operations) [2021-08-17T14:37:58Z] <kormat@deploy1002> Synchronized wmf-config/ProductionServices.php: Promote pc2013 to primary of pc3 T284825 (duration: 00m 58s)

pc201[1-3] are now the primaries for pc1-3 in codfw, respectively.

Change 713845 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/mediawiki-config@master] ProductionServices: Promote pc1011 to primary of pc1.

https://gerrit.wikimedia.org/r/713845

Change 713866 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/mediawiki-config@master] ProductionServices: Promote pc1012 to primary of pc2.

https://gerrit.wikimedia.org/r/713866

Change 713867 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/mediawiki-config@master] ProductionServices: Promote pc1013 to primary of pc3.

https://gerrit.wikimedia.org/r/713867

Change 713845 merged by jenkins-bot:

[operations/mediawiki-config@master] ProductionServices: Promote pc1011 to primary of pc1.

https://gerrit.wikimedia.org/r/713845

Change 713866 merged by jenkins-bot:

[operations/mediawiki-config@master] ProductionServices: Promote pc1012 to primary of pc2.

https://gerrit.wikimedia.org/r/713866

Change 713867 merged by jenkins-bot:

[operations/mediawiki-config@master] ProductionServices: Promote pc1013 to primary of pc3.

https://gerrit.wikimedia.org/r/713867

Mentioned in SAL (#wikimedia-operations) [2021-08-19T13:09:06Z] <kormat@deploy1002> Synchronized wmf-config/ProductionServices.php: Promote new h/w to primary of eqiad pc sections T284825 (duration: 01m 08s)

Mentioned in SAL (#wikimedia-operations) [2021-08-19T13:24:17Z] <kormat> reconfiguring replication tree on pc1 T284825

Mentioned in SAL (#wikimedia-operations) [2021-08-19T13:30:22Z] <kormat> reconfiguring replication tree on pc2 T284825

Mentioned in SAL (#wikimedia-operations) [2021-08-19T13:34:24Z] <kormat> reconfiguring replication tree on pc3 T284825

pc101[1-3] are now the primaries for pc1-3 in eqiad, respectively.

All done!

<3!!

Change 716936 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update pcX-master

https://gerrit.wikimedia.org/r/716936

Change 716936 merged by Marostegui:

[operations/dns@master] wmnet: Update pcX-master

https://gerrit.wikimedia.org/r/716936

Mentioned in SAL (#wikimedia-operations) [2021-09-07T12:51:42Z] <mvernon@deploy1002> Synchronized wmf-config/ProductionServices.php: Remove old decommissioned pc hosts T284825 (duration: 01m 02s)

Change 729935 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] mariadb: Set mysql_role for primary pc hosts.

https://gerrit.wikimedia.org/r/729935

Change 729935 merged by Kormat:

[operations/puppet@production] mariadb: Set mysql_role for primary pc hosts.

https://gerrit.wikimedia.org/r/729935

