
Move XHGui from tungsten to xhgui-001
Closed, ResolvedPublic

Description

(This task was originally only about setting up XHGui in Beta Cluster. It has been repurposed to cover migrating XHGui from the old server to a new server in production and, this time, doing the same in Beta Cluster.)

  • Create "xhgui-001" production instances. – T238098
  • Create Beta Cluster instance: deployment-xhgui01.
  • Create new webperf role for xhgui and apply it to xhgui-001 in prod (via Puppet) and xhgui01 in Beta Cluster (via Horizon).
  • Finish porting the xhgui provisioning in Puppet from old Debian Jessie (non-role class, currently applied to tungsten) so that it actually works on Stretch.
  • Switch performance.wikimedia.org/xhgui to new server in Beta Cluster.
  • Enable new XHGui server in MediaWiki for Beta cluster.
  • Switch performance.wikimedia.org/xhgui to new XHGui server in production.
  • Import MongoDB data from tungsten to xhgui1001/xhgui2001 in production & enable new XHGui server in production MediaWiki.

Details

Project                      Branch      Lines +/-   Subject
operations/puppet            production  +2 -2
operations/mediawiki-config  master      +1 -1
operations/mediawiki-config  master      +8 -4 K
operations/mediawiki-config  master      +3 -4
operations/puppet            production  +0 -13
operations/puppet            production  +1 -1
operations/puppet            production  +7 -0
operations/mediawiki-config  master      +7 -7
operations/puppet            production  +5 -1
operations/puppet            production  +88 -30
operations/puppet            production  +1 -0
operations/puppet            production  +1 -1
operations/dns               master      +5 -4
labs/private                 master      +5 -1
operations/mediawiki-config  master      +23 -9
operations/mediawiki-config  master      +166 -14
operations/puppet            production  +6 -0
operations/mediawiki-config  master      +1 -1
operations/puppet            production  +4 -0
operations/puppet            production  +0 -2
operations/puppet            production  +2 -5
operations/puppet            production  +5 -4
operations/puppet            production  +28 -1
operations/puppet            production  +1 -1
operations/puppet            production  +20 -2
operations/puppet            production  +8 -0
operations/puppet            production  +3 -3

Event Timeline

There are a very large number of changes, so older changes are hidden.

And... now it all works. I guess either Apache or PHP needed a restart for some of this in a way that Puppet didn't do during provisioning.

Change 556857 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/mediawiki-config@master] profiler: Switch production xhgui destination from tungsten to xhgui1001

https://gerrit.wikimedia.org/r/556857

Krinkle reassigned this task from Krinkle to Dzahn.Dec 13 2019, 3:12 AM

Signing over to @Dzahn for the last few steps in prod:

  1. Switch perf.wm.o proxy on webperf1001 from tungsten to xhgui1001. – https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552357/
  2. Clear xhgui1001 database and import from tungsten (one last time).
  3. Switch writing new profiles to xhgui1001. – https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/556857/

Between 1 and 2, people won't see their profiles.
Between 2 and 3, submitted profiles are lost/ignored.
Both are fine, as the gap should be immediately obvious to anyone trying to submit debug profiles and should only take a few minutes.

Change 556854 merged by Alexandros Kosiaris:
[operations/puppet@production] [Beta Cluster] mediawiki: install tideways on beta app servers

https://gerrit.wikimedia.org/r/556854

It looks like the sync might not suffice for mongo. I've not been able to find any indication online that others were able to upgrade smoothly from MongoDB 2.4 (jessie) to MongoDB 3.2 (stretch).

However, e.g. this question suggests that MongoDB 2.6 can read 2.4 data and prepare it for 3.x. Might need a manual process. Alternatively, there may be some kind of "dump" or "export" format we can use as intermediary.

Krinkle reassigned this task from Dzahn to dpifke.Feb 8 2020, 8:31 PM

Signing over within the team, but feel free to reach out to @Dzahn with any questions :)

Dzahn added a comment.Apr 8 2020, 11:40 AM

Hi @dpifke I'd be happy to get this finished. Wanna talk about next steps on IRC?

Dzahn added a comment.Apr 8 2020, 12:16 PM

I asked about the upgrade path for mongo in #mongodb on Freenode and the advice I got was:

  • The needed version steps are: 2.4 -> 2.6 , 2.6 -> 3.0 , 3.0 -> 3.2 (or 3.x ?)
  • "don't recommend going through the dump, and even less export data"
  • recommend to use mongodb apt repos instead of Debian ones ("as much as i like Debian...")
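
To make the step-by-step path concrete, here is a minimal Python sketch that lists the hops between two versions. It assumes the four releases named above form the whole chain (the "or 3.x ?" question stays open), and `upgrade_steps` is a hypothetical helper, not part of any MongoDB tooling.

```python
# Release order taken from the advice above. Treating these four versions
# as the complete chain is an assumption of this sketch.
RELEASES = ["2.4", "2.6", "3.0", "3.2"]


def upgrade_steps(current, target):
    """Return the (from, to) hops needed to get from current to target."""
    start = RELEASES.index(current)
    end = RELEASES.index(target)
    if end < start:
        raise ValueError("downgrades are not covered by this sketch")
    # Each hop moves exactly one release forward; no release is skipped.
    return [(RELEASES[i], RELEASES[i + 1]) for i in range(start, end)]
```

Calling `upgrade_steps("2.4", "3.2")` yields three hops, which is why a direct jessie-to-stretch data sync would not be enough on its own.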
dpifke added a comment.Apr 8 2020, 4:58 PM

Sorry for not documenting the latest progress in this ticket; discussion has been taking place in the performance team meetings. It sounds like at least three of us have now gone through the same process of determining that upgrading MongoDB is infeasible. :(

XHGui has partial support for using PHP's PDO driver with MariaDB, and I've gotten this mostly working in a local install. There are a handful of issues that need to be bundled up into a patch, to be submitted upstream (and possibly patched into a local deb, depending on upstream's responsiveness). Once this is done, and the relevant MariaDB database is created, we can drop MongoDB completely. This is an OKR for me this quarter.

Change 603546 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/mediawiki-config@master] Use PDO for XHGui storage if configured

https://gerrit.wikimedia.org/r/603546

Change 603550 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/puppet@production] [WIP] webperf: Remove XHGui dependency on MongoDB

https://gerrit.wikimedia.org/r/603550

Change 604498 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[labs/private@master] Add passwords for labs XHGui database

https://gerrit.wikimedia.org/r/604498

dpifke added a comment.EditedJun 11 2020, 4:03 AM

The xhgui package (and dependencies) were tested on deployment-xhgui02.deployment-prep.eqiad.wmflabs (new Debian buster instance), by manually uploading the packages to /opt/debs, running dpkg-scanpackages to build the index, and then creating /etc/apt/sources.list.d/local.list as:

deb [trusted=yes] file:///opt/debs ./

This last step got overwritten on the next puppet run, but I was able to apt install the new packages for testing before that happened.
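
For anyone repeating this, the manual steps can be sketched roughly as below. This is an illustration, not the exact commands that were run: the helper names are hypothetical, writing a gzip-compressed `Packages.gz` is an assumption, and `/dev/null` is passed as the override file so the call also works with older `dpkg-scanpackages` versions.

```python
import gzip
import subprocess
from pathlib import Path


def sources_list_entry(repo_dir):
    """Build the apt source line for a flat, unsigned local repository."""
    return f"deb [trusted=yes] file://{repo_dir} ./"


def build_local_repo(repo_dir, list_file):
    """Index the .deb files in repo_dir and register the directory with apt.

    Needs dpkg-dev installed and write access to both paths.
    """
    # dpkg-scanpackages prints a Packages index for the directory to stdout.
    index = subprocess.run(
        ["dpkg-scanpackages", ".", "/dev/null"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
    ).stdout
    # Assumption: store the index compressed, next to the packages.
    Path(repo_dir, "Packages.gz").write_bytes(gzip.compress(index))
    Path(list_file).write_text(sources_list_entry(repo_dir) + "\n")
```

With the paths from this comment, the call would be `build_local_repo("/opt/debs", "/etc/apt/sources.list.d/local.list")`, followed by `apt update`. As noted, Puppet removes the list file again on its next run.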

The puppet changes were then successfully debugged, and I briefly pointed https://performance-beta.wmflabs.org/xhgui to it (by editing hierdata in horizon) to verify things were working. (I switched it back, since I'm going to be away for a week and I'm not sure if anyone else is using the current XHGui setup in beta.)

I did a test deploy of the mediawiki-config patch in beta, and everything looked OK, but the profiles weren't making it into the DB. This might have been a test setup issue; I will debug more as soon as I'm back.

I'm leaving the operations/puppet and labs/private changes cherry-picked in deployment-prep, as they shouldn't have any effect on anything else running there. (The changes have been tagged as such in Gerrit.)

Change 604498 merged by Dzahn:
[labs/private@master] Add passwords for Cloud VPS XHGui database

https://gerrit.wikimedia.org/r/604498

Change 603546 merged by jenkins-bot:
[operations/mediawiki-config@master] profiler: Add PDO driver for XHGui and enable on Beta Cluster

https://gerrit.wikimedia.org/r/603546

Change 605690 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/mediawiki-config@master] profiler: Fix undefined $wmgXhguiDBuser

https://gerrit.wikimedia.org/r/605690

Change 605690 merged by jenkins-bot:
[operations/mediawiki-config@master] profiler: Fix undefined $wmgXhguiDBuser

https://gerrit.wikimedia.org/r/605690

Mentioned in SAL (#wikimedia-operations) [2020-06-23T15:27:57Z] <mutante> removing ganeti VM xhgui1001 from eqiad row_A, will recreate in another row for rebalancing VMs between rows (T180761 T238098)

Change 607304 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] move xhgui1001 from row A to row D to rebalance VMs

https://gerrit.wikimedia.org/r/607304

Change 607304 merged by Dzahn:
[operations/dns@master] move xhgui1001 from row A to row D to rebalance VMs

https://gerrit.wikimedia.org/r/607304

Change 607371 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: update MAC address for xhgui1001

https://gerrit.wikimedia.org/r/607371

Change 607371 merged by Wolfgang Kandek:
[operations/puppet@production] DHCP: update MAC address for xhgui1001

https://gerrit.wikimedia.org/r/607371

Dzahn added a comment.Jun 24 2020, 5:28 PM

Work on getting the needed MySQL database up continued here: T254795#6253646

https://performance.wikimedia.beta.wmflabs.org/xhgui now points at the new instance in beta.

I've temporarily mapped https://performance.wikimedia.beta.wmflabs.org/xhgui-old to the old instance, but it still thinks it's being served from /xhgui, so links require some manual munging in the URL bar. Profiles are being written to both instances, but have different identifiers, so one must match up the timestamps when comparing old vs. new.
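
A rough sketch of the timestamp matching, for anyone doing the comparison by script rather than by eye. The profile fields ("id", "url", "time") and the one-second tolerance are assumptions for illustration, not XHGui's actual schema.

```python
from datetime import datetime, timedelta


def match_by_timestamp(old_profiles, new_profiles, tolerance=timedelta(seconds=1)):
    """Pair profiles from the two instances by URL and near-equal timestamp.

    Each profile is a dict with the hypothetical keys "id", "url" and "time".
    Returns a list of (old_id, new_id) tuples.
    """
    pairs = []
    unmatched_new = list(new_profiles)
    for old in old_profiles:
        for new in unmatched_new:
            same_url = old["url"] == new["url"]
            close = abs(old["time"] - new["time"]) <= tolerance
            if same_url and close:
                pairs.append((old["id"], new["id"]))
                # Each new profile may be paired at most once.
                unmatched_new.remove(new)
                break
    return pairs
```

The tolerance exists because the two backends may not record exactly the same time for one request; it would need tuning against real data.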

As far as Krinkle & I can tell, there are no substantial differences between the two.

We want to migrate the data from the old instance in prod, so I'm going to hack together a quick script for that next.

Dzahn awarded a token.Jun 24 2020, 5:43 PM

Change 603550 merged by Dzahn:
[operations/puppet@production] webperf: Remove XHGui dependency on MongoDB

https://gerrit.wikimedia.org/r/c/operations/puppet/+/603550

Change 608911 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] webperf: set xhgui_old_host parameter to tungsten, xhgui host to xhgui1001

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608911

Dzahn added a comment.EditedJul 1 2020, 4:43 PM

@dpifke I merged the "dependency on MongoDB" change now that we have the database and credentials are in PrivateSettings.

It did not affect tungsten. It did affect webperf1001 though which is now missing a value for "xhgui_old_host".

I made this as a follow-up; good to go?

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608911

edit: I abandoned that and instead amended the existing change https://gerrit.wikimedia.org/r/c/operations/puppet/+/552357

Change 608911 abandoned by Dzahn:
webperf: set xhgui_old_host parameter to tungsten, xhgui host to xhgui1001

Reason:
duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/552357

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608911

Change 608911 restored by Dzahn:
webperf: set xhgui_old_host parameter to tungsten, xhgui host to xhgui1001

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608911

Change 608911 merged by Dzahn:
[operations/puppet@production] webperf: set xhgui_old_host parameter to tungsten

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608911

Dzahn added a comment.Jul 1 2020, 5:49 PM

On webperf1001 and 2001 a "/xhgui-old" has been added to the httpd config, with ProxyPass to tungsten.

Krinkle raised the priority of this task from Medium to High.Jul 16 2020, 11:41 PM
Dzahn added a comment.Jul 30 2020, 1:20 AM

xhgui and the php- packages it needs are now available from apt.wikimedia.org for buster. (T254310#6347259)

xhgui* hosts need to be reinstalled with buster next so that they can be used (T259206)

Change 617546 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] switch xhgui2001 from xhgui::app to webperf::xhgui

https://gerrit.wikimedia.org/r/617546

Change 617546 merged by Dzahn:
[operations/puppet@production] switch xhgui2001 from xhgui::app to webperf::xhgui

https://gerrit.wikimedia.org/r/617546

  • xhgui2001 has been reinstalled with buster
  • xhgui2001 has been switched to the new puppet role without the mongo db dependency
  • some puppet dependency issues have been fixed
  • xhgui has been installed successfully by puppet
  • xhgui1001 is also on buster now

MongoDB data has been migrated to MariaDB in beta.

dpifke added a comment.EditedAug 12 2020, 1:51 AM

The production MongoDB data has been dumped and converted to SQL, using the same process as with beta.

Uploading the results is taking a long time, so I'm going to have to load it into MariaDB in the morning.

dpifke added a comment.EditedAug 12 2020, 11:17 PM

After fixing a couple of minor issues with the migration script, the production xhgui database is now up-to-date as of yesterday's MongoDB dump. We can run the dump/export cycle again after cutting over, to pick up any records written in the meantime.

Dzahn added a comment.Aug 13 2020, 1:34 AM

Great news! Thank you. So we just need to agree on when to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/552357 now?

Yes. Ideally we flip the switch ~simultaneously with pointing MediaWiki to the new instance. I'll prep the mediawiki-config patch.

Are we comfortable doing this first thing tomorrow (Pacific time), heading into a three-day weekend, or do we want to aim for Monday? The MediaWiki change should only affect the mwdebug servers, so I don't think it needs to wait for the backport window.

Assuming things seem stable after the switch, I'd want to do one last dump of MongoDB on tungsten, then keep it around for a few days just in case we need to roll back for some reason.

Dzahn added a comment.Aug 13 2020, 1:51 AM

I wouldn't mind doing it tomorrow morning and merging the puppet change. Would you be able to deploy the MW config change though?

Change 619886 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/mediawiki-config@master] xhgui: enable prod MariaDB, disable labs MongoDB

https://gerrit.wikimedia.org/r/619886

I can do the deploy, assuming someone with +2 rights in mediawiki-config approves.

I forgot that I had set it up to send to both destinations if both are configured; this means we should plan to deploy the config patch to the mwdebug hosts first, then verify that profiles are making it into the MariaDB instance, then flip the front-end in Puppet.
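
The "send to both if both are configured" behaviour looks roughly like the Python sketch below. The real logic lives in the PHP mediawiki-config patch; the names here (`save_profile`, the `"mongodb"` and `"pdo"` config keys) are hypothetical stand-ins.

```python
def save_profile(profile, config, savers):
    """Send a profile to every destination that has configuration present.

    `savers` maps a destination name to a callable taking (profile, settings).
    Returns the names of the destinations the profile was sent to.
    """
    delivered = []
    for name, save in savers.items():
        if not config.get(name):
            continue  # destination not configured, so skip it
        save(profile, config[name])
        delivered.append(name)
    return delivered
```

With both keys configured, every profile reaches both stores. That is why the config patch can go out first and be verified before the front-end is flipped in Puppet.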

Change 619886 merged by jenkins-bot:
[operations/mediawiki-config@master] xhgui: enable prod MariaDB, disable labs MongoDB

https://gerrit.wikimedia.org/r/619886

Mentioned in SAL (#wikimedia-operations) [2020-08-13T18:05:29Z] <dpifke@deploy1001> Synchronized wmf-config/ProductionServices.php: Enabling new XHGui backend (T180761) (duration: 00m 56s)

Change 620126 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/puppet@production] xhgui: increase PHP memory limit to 512MB

https://gerrit.wikimedia.org/r/620126

Change 620126 merged by Dzahn:
[operations/puppet@production] xhgui: increase PHP memory limit to 512MB

https://gerrit.wikimedia.org/r/620126

Change 552357 merged by Dzahn:
[operations/puppet@production] webperf: switch xhgui_host from tungsten to xhgui1001

https://gerrit.wikimedia.org/r/552357

Mentioned in SAL (#wikimedia-operations) [2020-08-13T22:03:41Z] <mutante> switching xhgui from tungsten to xhgui1001 - ran puppet on webperf*001 - T180761 T158837

Dzahn updated the task description. Aug 13 2020, 10:26 PM

Change 620128 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] webperf: remove the xhgui_old_host parameter

https://gerrit.wikimedia.org/r/620128

Change 620142 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/mediawiki-config@master] xhgui: remove MongoDB backend

https://gerrit.wikimedia.org/r/620142

Change 620128 merged by Dzahn:
[operations/puppet@production] webperf: remove the xhgui_old_host parameter

https://gerrit.wikimedia.org/r/620128

Change 620729 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mediawiki: remove mongodb PHP extension from appservers

https://gerrit.wikimedia.org/r/620729

Change 621095 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/mediawiki-config@master] [WIP] profiler: remove MongoDB client

https://gerrit.wikimedia.org/r/621095

Change 620142 merged by jenkins-bot:
[operations/mediawiki-config@master] xhgui: remove MongoDB backend

https://gerrit.wikimedia.org/r/620142

Mentioned in SAL (#wikimedia-operations) [2020-08-19T00:49:21Z] <dpifke@deploy1001> Synchronized wmf-config/ProductionServices.php: Disabling old XHGui backend (T180761) (duration: 05m 13s)

I did a final dump of the MongoDB database and confirmed it contains no records since 2020-08-18. To save the profiles written between when the previous dump was taken and when we switched over, unique records are now being imported into MariaDB.
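
The "unique records only" import amounts to the following. This is a sketch, not the actual migration script: it uses SQLite from the Python standard library as a stand-in for MariaDB, and the `profiles` table and its columns are hypothetical. On MariaDB the equivalent statement would be `INSERT IGNORE`.

```python
import sqlite3


def create_table(conn):
    """Create a hypothetical profiles table keyed on the profile id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS profiles ("
        "id TEXT PRIMARY KEY, url TEXT, request_ts INTEGER)"
    )


def import_unique(conn, records):
    """Insert (id, url, request_ts) rows, skipping ids that already exist.

    Returns the number of rows actually added.
    """
    count_sql = "SELECT COUNT(*) FROM profiles"
    before = conn.execute(count_sql).fetchone()[0]
    # OR IGNORE drops rows that would violate the primary key, so rows
    # loaded from the earlier dump are left untouched.
    conn.executemany(
        "INSERT OR IGNORE INTO profiles (id, url, request_ts) VALUES (?, ?, ?)",
        records,
    )
    conn.commit()
    return conn.execute(count_sql).fetchone()[0] - before
```

Re-running the import over an overlapping dump is then harmless: only the records written since the previous dump are added.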

We are good to go with decommissioning tungsten.

Krinkle closed this task as Resolved.Aug 20 2020, 10:09 PM
Krinkle removed a project: Patch-For-Review.
Krinkle updated the task description.
Krinkle removed a subtask: T260395: decom tungsten.

Mentioned in SAL (#wikimedia-releng) [2020-08-20T22:17:46Z] <dpifke> Deleted deployment-xhgui01.deployment-prep - no longer need MongoDB test instance (T180761)

Mentioned in SAL (#wikimedia-operations) [2020-08-20T22:20:00Z] <mutante> permanently shut down tungsten.eqiad.wmnet T260395 T158837 T180761 T224549

Change 556857 abandoned by Krinkle:
[operations/mediawiki-config@master] profiler: Switch production xhgui destination from tungsten to xhgui1001

Reason:
Superseded.

https://gerrit.wikimedia.org/r/556857