Move XHGui from tungsten to xhgui-001
Open, Medium, Public

Description

(This task was originally only about setting up XHGui in Beta Cluster, but has been repurposed to be about migrating XHGui from an old server to a new server in production, and this time, doing the same in Beta Cluster).

  • Create "xhgui-001" production instances. – T238098
  • Create Beta Cluster instance: deployment-xhgui01.
  • Create new webperf role for xhgui and apply it to xhgui-001 in prod (via Puppet) and xhgui01 in Beta Cluster (via Horizon).
  • Finish porting the xhgui provisioning in Puppet from old Debian Jessie (non-role class, currently applied to tungsten) so that it actually works on Stretch.
  • Switch performance.wikimedia.org/xhgui to new server in Beta Cluster.
  • Enable new XHGui server in MediaWiki for Beta cluster.
  • Switch performance.wikimedia.org/xhgui to new XHGui server in production.
  • Import MongoDB data from tungsten to xhgui1001/xhgui2001 in production & enable new XHGui server in production MediaWiki.

Event Timeline

I think I know what's going on here: https://github.com/wikimedia/puppet/blob/production/modules/role/manifests/webperf/profiling_tools.pp#L17-L21

  • The require package on line 17 causes apache2 to be installed, because it's a dependency of libapache2-mod-php7.0.
  • When apache2 is installed as a dependency, it's configured to use mpm_event, as defined in the control file (see prior comment)
  • The next line causes apache to be configured to use php 7, but the mpm is never changed.

Probably the easiest option is to insert this on line 18:

class { '::httpd::mpm':
    mpm => 'worker'
}

(Or mpm => 'prefork', depending on context)

I'm not sure whether there's another option that's preferred in our environment, though.

On a related note, we might want to change the httpd::init class such that it detects the case where php[5|7] is being enabled, and will automatically call the httpd::mpm class.

Change 449367 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Move require_package for PHP from role to XHGui profile

https://gerrit.wikimedia.org/r/449367

Change 449367 merged by Dzahn:
[operations/puppet@production] webperf: Move require_package for PHP from role to XHGui profile

https://gerrit.wikimedia.org/r/449367

Mentioned in SAL (#wikimedia-releng) [2018-07-30T23:03:41Z] <Krinkle> Delete and recreate deployment-webperf13 (T195312 / T180761)

Mentioned in SAL (#wikimedia-releng) [2018-07-30T23:04:52Z] <Krinkle> Create instance deployment-webperf13 (deployment-webperf13 ) - T195312 / T180761

Mentioned in SAL (#wikimedia-releng) [2018-07-30T23:28:19Z] <Krinkle> Setting up puppet cert for deployment-webperf12; T195312 / T180761

Mentioned in SAL (#wikimedia-releng) [2018-07-31T17:50:49Z] <Krinkle> Apply role::webperf::profiling_tools to deployment-webperf12; T195312 / T180761

Change 449532 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Set mpm=worker explicitly for httpd.

https://gerrit.wikimedia.org/r/449532

Change 449532 merged by Dzahn:
[operations/puppet@production] webperf: Set mpm=prefork explicitly for profiling_tools' httpd

https://gerrit.wikimedia.org/r/449532

Restricted Application added a subscriber: Gilles. · Dec 20 2018, 10:39 PM

I've upgraded XHGui from 0.8.1 to 0.9.0-d5e9bd94 (see the commit).

Subset of changes from https://github.com/perftools/xhgui/compare/0.8.1...d5e9bd94:

Good things:

  • MongoDB 3.5 support (relevant for this task, as webperf1002 is running a newer Mongo than tungsten).
  • Broken flamegraph views have been removed.

Unsure:

  • There are public web-facing APIs for uploading and deleting profiles.

Given our install is public, we probably need to disable these. I'll see what it's like after the upgrade and either configure or revert as needed.

The xhgui role (to become a profile or regular class) currently fails to provision on webperf12 in Beta (and presumably would fail as-is on webperf002 in prod as well), because it's written for Debian Jessie (php5, older Mongo) instead of Debian Stretch (php7, newer Mongo).

There's also at least 1 resource conflict we'll need to resolve:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER:
Server Error: Evaluation Error: Error while evaluating a Resource Statement,
Duplicate declaration: Class[Httpd] is already declared in file /etc/puppet/modules/role/manifests/webperf/profiling_tools.pp:27; 
  cannot redeclare at /etc/puppet/modules/role/manifests/xhgui/app.pp:22 at /etc/puppet/modules/role/manifests/xhgui/app.pp:22:5
  on node deployment-webperf12.deployment-prep.eqiad.wmflabs

Krinkle claimed this task. · Jan 22 2019, 9:03 PM
Krinkle moved this task from Doing to Backlog: Future Goals on the Performance-Team board.
Krinkle added a subscriber: Imarlier.

Change 485990 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Document which class is which regarding xhgui

https://gerrit.wikimedia.org/r/485990

Dzahn added a subscriber: Dzahn. · Jan 23 2019, 6:19 AM

Change 485990 merged by Dzahn:
[operations/puppet@production] webperf: Document which class is which regarding xhgui

https://gerrit.wikimedia.org/r/485990

Krinkle removed Krinkle as the assignee of this task. · Mar 3 2019, 7:20 PM
Dzahn added a comment. · Edited · Mar 4 2019, 11:06 AM

Duplicate declaration: Class[Httpd] is already declared in file /etc/puppet/modules/role/manifests/webperf/profiling_tools.pp:27;
cannot redeclare at /etc/puppet/modules/role/manifests/xhgui/app.pp:22 at /etc/puppet/modules/role/manifests/xhgui/app.pp:22:5

The root problem is that multiple roles are applied on the same instance. The code has to be refactored so that exactly one role is setting up the things needed on one node. This usually means that one of the existing roles needs to be converted to a profile. After that is done we have to move the httpd setup to that single role or into only one of the profiles included by the role.

Dzahn added a comment. · Edited · Mar 4 2019, 12:54 PM

role::webperf::profiling_tools includes ::profile::webperf::xhgui

The latter should be the replacement of role::xhgui::app.

The fix should be to only include role::webperf::profiling_tools on the instance and no other role. @Krinkle

Change 494422 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] xhgui: require php-mongodb package

https://gerrit.wikimedia.org/r/494422

Change 494425 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] xhgui: setup git cloning and apache site

https://gerrit.wikimedia.org/r/494425

Dzahn updated the task description. · Mar 5 2019, 8:41 AM

Change 494422 merged by Dzahn:
[operations/puppet@production] xhgui: require php-mongodb package

https://gerrit.wikimedia.org/r/494422

Change 494425 merged by Dzahn:
[operations/puppet@production] xhgui: setup git cloning and apache site

https://gerrit.wikimedia.org/r/494425

Change 496185 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] webperf::profiling_tools: include ::passwords::ldap::production

https://gerrit.wikimedia.org/r/496185

Change 496185 merged by Dzahn:
[operations/puppet@production] webperf::profiling_tools: include ::passwords::ldap::production

https://gerrit.wikimedia.org/r/496185

This is blocking the removal of tungsten. What are the remaining blockers and work to do?

Krinkle added a comment. · Edited · Aug 7 2019, 12:36 PM

From a high level, four things:

  1. Decide on the multi-dc strategy for XHGui (keep SPOF, active-active with replication, active-active without replication).
  2. Verify and test that XHGui works with the Stretch version of MongoDB (currently tungsten is Jessie). We can do this on the Beta Cluster instance (webperf12).
  3. Migrate existing data from tungsten (Jessie-Mongo) to webperf1002/2002 (Stretch-Mongo) – or to MySQL.
  4. Update write traffic (in wmf-config). Update read proxy (perf.wm.o Apache config).

Regarding multi-dc, we have four options I know of:

  1. Have app server debug requests (when WikimediaDebug-profile mode is on) write to both Mongo instances.
    • no consistency effort in case one is not responding, but we can log an error in that case for monitoring.
    • no need to do any replication or other extra dependencies.
    • performance.wikimedia.org/xhgui will become active-active, like the rest of perf.wm.o is already.
  2. Or; Replicate the Mongo instance in both directions. In terms of schema this should be easy because it is all append-only with decentralised primary keys (e.g. like a UUID). But it's also something we haven't done before.
    • nice eventual consistency.
    • extra maintenance cost for perf-team and SRE.
    • performance.wikimedia.org/xhgui will become active-active, like the rest of perf.wm.o is already.
  3. Or; Push back this problem and migrate from tungsten to webperf1002 first.
    • no standby/failover. no backup.
    • performance.wikimedia.org/xhgui will remain SPOF.
  4. Or; Help upstream XHGui add support for MySQL (they're working on this currently) and use that with bidi-replication (somewhat like how we do with parser caches). Ideally also stored on a proper misc DB server and not webperf VMs.
    • re-use existing knowledge and expertise.
    • no more Mongo.
    • performance.wikimedia.org/xhgui will become active-active, like the rest of perf.wm.o is already.
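
To make option 1 concrete, here is a minimal sketch of the "write to both, log on failure" behaviour. It is written in Python purely for illustration; the real write path lives in wmf-config/profiler.php and is PHP. The function name `save_to_all`, the logger name, and the backend labels are all hypothetical, and each writer stands in for whatever call actually inserts a profile into one Mongo instance.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("xhgui.dualwrite")  # hypothetical logger name


def save_to_all(profile: dict, writers: Dict[str, Callable[[dict], None]]) -> List[str]:
    """Send one profile to every backend independently.

    A failing backend is logged (so it can be monitored) but does not stop
    the write to the other backend. No attempt is made to reconcile the two
    stores afterwards, matching the "no consistency effort" trade-off.
    Returns the names of the backends that accepted the profile.
    """
    succeeded = []
    for name, write in writers.items():
        try:
            write(profile)
        except Exception:
            # Log and carry on; the other data centre should still get the profile.
            logger.exception("XHGui write to %s failed", name)
        else:
            succeeded.append(name)
    return succeeded


if __name__ == "__main__":
    stored = []

    def healthy(profile: dict) -> None:
        stored.append(profile)

    def broken(profile: dict) -> None:
        raise ConnectionError("backend not responding")

    # "eqiad" and "codfw" are only labels for the two hypothetical writers.
    print(save_to_all({"url": "/wiki/Main_Page"}, {"eqiad": healthy, "codfw": broken}))
```

The point of the design is that the profile is lost on one side only when that side is down, which is acceptable for debug profiles as long as the failure shows up in logs.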

Regarding multi-dc, we have four options I know of:

  1. Or; Push back this problem and migrate from tungsten to webperf1002 first.
    • no standby/failover. no backup.
    • performance.wikimedia.org/xhgui will remain SPOF.

I'd prefer that as an interim step, to untangle the multi-dc question from moving away from the old server. tungsten is an extremely old server: it was bought over 8.5 (!) years ago, our warranty for servers is three years, and we normally look to replace hardware after 4-5 years. tungsten could fail any day, and it seems vastly preferable to have a SPOF on modern hardware under warranty than on that server :-)

Dzahn added a comment. · Nov 12 2019, 3:54 PM

xhgui is supposed to move to its own VMs instead of moving to webperf1002.

Creating a new VM request for that.

Krinkle renamed this task from Move XHGui from tungsten to webperf-002 to Move XHGui from tungsten to xhui-001. · Nov 20 2019, 8:54 PM
Krinkle updated the task description.

Change 552135 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Remove xhgui profile from webperf::profiling_tools role

https://gerrit.wikimedia.org/r/552135

Krinkle renamed this task from Move XHGui from tungsten to xhui-001 to Move XHGui from tungsten to xghui-001. · Nov 20 2019, 9:25 PM
Krinkle renamed this task from Move XHGui from tungsten to xghui-001 to Move XHGui from tungsten to xhgui-001.

Mentioned in SAL (#wikimedia-releng) [2019-11-20T21:28:12Z] <Krinkle> Create deployment-xhgui01 in Beta Cluster as m1.small/Debian 10 Buster. – T238098, T180761

Krinkle updated the task description. · Nov 20 2019, 9:28 PM
Krinkle updated the task description. · Nov 20 2019, 9:30 PM
Krinkle updated the task description.

Change 552135 merged by Dzahn:
[operations/puppet@production] webperf: Remove xhgui profile from webperf::profiling_tools role

https://gerrit.wikimedia.org/r/552135

Change 552324 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: downgrade xhgui servers from buster to stretch

https://gerrit.wikimedia.org/r/552324

Change 552324 merged by Dzahn:
[operations/puppet@production] install_server: downgrade xhgui servers from buster to stretch

https://gerrit.wikimedia.org/r/552324

Change 552325 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] xhgui: add support for stretch/PHP7.0

https://gerrit.wikimedia.org/r/552325

Change 552325 merged by Dzahn:
[operations/puppet@production] xhgui: add support for stretch/PHP7.0

https://gerrit.wikimedia.org/r/552325

Mentioned in SAL (#wikimedia-releng) [2019-11-24T03:25:45Z] <Krinkle> Create deployment-xhgui01 as Beta version of xhgui1001/xhgui2001 (Debian 9 Stretch, m1.small). – T238788, T238098, T180761

Change 552357 had a related patch set uploaded (by Krinkle; owner: Dzahn):
[operations/puppet@production] webperf: switch xhgui_host from tungsten to xhgui1001

https://gerrit.wikimedia.org/r/552357

Mentioned in SAL (#wikimedia-releng) [2019-12-13T00:22:49Z] <Krinkle> Apply Puppet role class "xhgui::app" to deployment-xhgui01. T238788, T180761

Krinkle updated the task description. · Dec 13 2019, 12:29 AM

@Dzahn I've set the puppet role, confirmed puppet agent ran without errors, but looks like it's not (yet) working.

I've exposed the host directly via https://xhgui-beta.wmflabs.org/ for easy testing (will later be via https://performance-beta.wmflabs.org/xhgui/), and /var/log/apache2/error.log shows:

PHP Fatal error:  Uncaught Error: Class 'MongoClient' not found in /srv/xhgui/src/Xhgui/ServiceContainer.php:79
Stack trace:
#0 /srv/xhgui/vendor/…

Looks like ext-mongo is missing.

Krinkle assigned this task to Dzahn. · Dec 13 2019, 12:35 AM
Krinkle claimed this task. · Dec 13 2019, 1:03 AM
Krinkle added a comment. · Edited · Dec 13 2019, 1:15 AM

Never mind. It needs ext-mongodb (for PHP 7+) instead of ext-mongo (PHP5-only). And the puppet provisioning already installs this correctly.

However, the checked-in vendor directory for xhgui needs to be updated, because the different extension requires a different wrapper library.
I've updated https://wikitech.wikimedia.org/wiki/Performance/Runbook/XHGui_service#Upgrade_XHGui,
and recommitted the relevant vendor package at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/xhgui/.

🎉 It's now working at https://xhgui-beta.wmflabs.org/

Mentioned in SAL (#wikimedia-releng) [2019-12-13T01:26:35Z] <Krinkle> Set profile::webperf::site::xhgui_host: deployment-xhgui01.deployment-prep.eqiad.wmflabs in Hiera for deployment-webperf11. T180761

Change 556850 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/mediawiki-config@master] [Beta Cluster] profiler: Enable XHGui backend

https://gerrit.wikimedia.org/r/556850

Krinkle updated the task description. · Dec 13 2019, 1:31 AM

Change 556850 merged by jenkins-bot:
[operations/mediawiki-config@master] [Beta Cluster] profiler: Enable XHGui backend

https://gerrit.wikimedia.org/r/556850

Mentioned in SAL (#wikimedia-releng) [2019-12-13T02:09:12Z] <Krinkle> Create 'mongo' security group and apply to deployment-xhgui01 (ingress tcp/27017). T180761

Change 556854 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] [Beta Cluster] mediawiki: install tideways on beta app servers

https://gerrit.wikimedia.org/r/556854

Mentioned in SAL (#wikimedia-releng) [2019-12-13T02:32:20Z] <Krinkle> It appears puppet-agent has been locally disabled on deployment-mediawiki-07 for at least three days with "no reason given". Re-enabling now to unbreak https://gerrit.wikimedia.org/r/556854 for T180761

Krinkle added a comment. · Edited · Dec 13 2019, 2:54 AM

While all parts are up and running and working correctly when probed independently, it seems the overall pipeline is not working.

  1. Run curl 'https://en.wikipedia.beta.wmflabs.org/wiki/Special:RecentChanges' -H 'X-Wikimedia-Debug: backend=1; profile' (or use WikimediaDebug with "XHGui" ticked, and browse any Beta cluster url).
  2. Refresh https://performance-beta.wmflabs.org/xhgui/.

I've confirmed so far:

  • The MongoDB and Apache for XHGui are up and running and have no errors logged.
  • The XHGui server is reachable from other Beta Cluster VMs (can be pinged, and can open telnet from mediawiki-07 to xhgui01:27017).
  • Other XWD features (such as forceprofile and log) are working fine, which confirms the header is making its way to PHP without issue and is parsed correctly.
  • Tideways is installed (php -a, function_exists).
  • The code in wmf-config/profiler.php for capturing the profile and sending to XHGui works from the same host from the command-line (I've opened php -a on mediawiki-07 and executed the individual steps from wmf-config/profiler.php one by one and confirmed each statement succeeds). And when doing it this way, if I run some random statements between start/end of capture, they appear at https://performance-beta.wmflabs.org/xhgui/run/view?id=5df2fa3f88e941732d636515. Which further confirms all the logic around XHGui and Tideways is working fine.
  • No relevant errors at https://logstash-beta.wmflabs.org/app/kibana#/dashboard/mediawiki-errors.

So the header is making its way there.
The header is parsed correctly.
The logic that runs after interpreting the header is working if triggered from that point manually.
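
For readers unfamiliar with the header used in step 1, the value `backend=1; profile` is a semicolon-separated list of options, some with a value and some bare flags. The sketch below shows one way such a value could be split up. It is an illustration in Python only and is not the parser used in wmf-config; the function name `parse_debug_header` is hypothetical.

```python
from typing import Dict, Union


def parse_debug_header(value: str) -> Dict[str, Union[str, bool]]:
    """Split an X-Wikimedia-Debug style value into options.

    "key=value" parts keep their value as a string; bare parts such as
    "profile" are treated as flags and mapped to True.
    """
    options: Dict[str, Union[str, bool]] = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate trailing or doubled semicolons
        key, sep, val = part.partition("=")
        options[key.strip()] = val.strip() if sep else True
    return options


if __name__ == "__main__":
    print(parse_debug_header("backend=1; profile"))
```

With the header from step 1, the `profile` flag is what would be expected to switch on the capture-and-send logic checked above.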

Mentioned in SAL (#wikimedia-releng) [2019-12-13T02:57:39Z] <Krinkle> Restarting deployment-mediawiki-07. - T180761

And... now it all works. I guess either Apache or PHP needed a restart for some of this in a way that Puppet didn't do during provisioning.

Change 556857 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/mediawiki-config@master] profiler: Switch production xhgui destination from tungsten to xhgui1001

https://gerrit.wikimedia.org/r/556857

Krinkle reassigned this task from Krinkle to Dzahn. · Dec 13 2019, 3:12 AM

Signing over to @Dzahn for the last few steps in prod:

  1. Switch perf.wm.o proxy on webperf1001 from tungsten to xhgui1001. – https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552357/
  2. Clear xhgui1001 database and import from tungsten (one last time).
  3. Switch writing new profiles to xhgui1001. – https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/556857/

Between 1 and 2, people won't see their profiles.
Between 2 and 3, submitted profiles are lost/ignored.
Both are fine, as the gap should be immediately obvious to anyone trying to submit debug profiles and should only last a few minutes.

Change 556854 merged by Alexandros Kosiaris:
[operations/puppet@production] [Beta Cluster] mediawiki: install tideways on beta app servers

https://gerrit.wikimedia.org/r/556854

It looks like the sync might not suffice for mongo. I've not been able to find any indication online that others were able to upgrade smoothly from MongoDB 2.4 (jessie) to MongoDB 3.2 (stretch).

However, e.g. this question suggests that MongoDB 2.6 can read 2.4 data and prepare it for 3.x. Might need a manual process. Alternatively, there may be some kind of "dump" or "export" format we can use as intermediary.
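
One possible shape for the "dump" intermediary is MongoDB's own `mongodump` and `mongorestore` tools. The sketch below only builds the two command lines and does not run them. Several things in it are assumptions: the host names are placeholders, `xhprof` is assumed to be the database name XHGui uses, and whether a dump taken with the 2.4 tools restores cleanly into 3.2 would still need to be verified. The `--drop` flag is included to mirror the "clear the database, then import" step.

```python
import shlex
from typing import List


def dump_command(host: str, db: str, out_dir: str, port: int = 27017) -> List[str]:
    """Build a mongodump invocation that writes BSON files under out_dir/<db>."""
    return ["mongodump", "--host", host, "--port", str(port),
            "--db", db, "--out", out_dir]


def restore_command(host: str, db: str, out_dir: str, port: int = 27017) -> List[str]:
    """Build a mongorestore invocation that replaces existing collections.

    --drop removes each collection on the target before restoring it.
    """
    return ["mongorestore", "--host", host, "--port", str(port),
            "--db", db, "--drop", f"{out_dir}/{db}"]


if __name__ == "__main__":
    # Placeholder hosts, database name and path; adjust before any real use.
    for cmd in (dump_command("tungsten", "xhprof", "/tmp/xhgui-dump"),
                restore_command("xhgui1001", "xhprof", "/tmp/xhgui-dump")):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; the dump would be taken on or against the old server and the restore run against the new one.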