Move flame graphs hosting from mwlog1001 to webperf-2 and enable in Beta Cluster
Closed, Resolved · Public

Description

Follow-up to T180766: Make MediaWiki profiler in Beta match production. Potentially depends on parts of T158837.

Specifically, we should:

  • Set up the Redis instance in Beta for stack logs (in prod on mwlog1001).
  • Set up the daemon that writes from Redis to log files on disk (in prod on mwlog1001).
  • Set up the daemon that generates SVG flame graphs from stack log files on disk (in prod on mwlog1001).
  • Set up the Apache that serves the flame graphs and stack logs (in prod on mwlog1001).
  • Set up a public proxy to expose that Apache (either custom for beta, or by deploying the performance-site to Beta Cluster, the latter would be nice).

Doing this would help with T176916: Set up sampling profiler for PHP 7 (alternative to HHVM Xenon).

For T176916, I'd like to set up a copy of the flame graph pipeline for xhprof-sampled data, kept separate from the existing one. This would 1) avoid corruption and duplication from mixing different data sources (Xenon vs xhprof) that also have different sampler intervals, 2) let us compare them side by side, and 3) let us ramp up the new one slowly without it being lost in the noise.

We could do all that directly in production, but I'm proposing we take this opportunity to clean up the flame graph stack in production first. Part of it currently runs on hardware not really intended for this purpose. Once that is done, it should be easy to recreate the setup in Beta Cluster by using the same puppet profiles, probably by adding a webperf VM to the Beta Cluster.

That would also give us the ability to try out puppet-level changes (e.g. changes to Apache configuration or otherwise) in Beta before production.

This should also make it easier to experiment with xhprof-based sampling, which we could leave running in Beta for a while to compare results side by side.

Krinkle created this task. May 22 2018, 4:40 PM
Restricted Application added a subscriber: Aklapper. May 22 2018, 4:40 PM

Mentioned in SAL (#wikimedia-releng) [2018-05-22T16:53:28Z] <Krinkle> Created deployment-webperf instance (m1.small) - ref T195312

Vvjjkkii renamed this task from Set up PHP flame graphs for Beta Cluster to zhcaaaaaaa. Jul 1 2018, 1:08 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot renamed this task from zhcaaaaaaa to Set up PHP flame graphs for Beta Cluster.
CommunityTechBot added a subscriber: Aklapper.
Krinkle renamed this task from Set up PHP flame graphs for Beta Cluster to Move flame graphs hosting from mwlog1001 to webperf-2 and enable in Beta Cluster. Jul 3 2018, 3:48 AM

Mentioned in SAL (#wikimedia-releng) [2018-07-03T03:49:07Z] <Krinkle> Create deployment-webperf12 as equivalent of webperf1002/webperf2002 in prod (T195312, T194390)

Mentioned in SAL (#wikimedia-releng) [2018-07-03T04:31:21Z] <Krinkle> Setting up puppetmaster/cerf for deployment-webperf12 (T195312)

Krinkle triaged this task as High priority.

Change 443757 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Rename role::xenon to profile::webperf::xenon

https://gerrit.wikimedia.org/r/443757

Change 443760 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/mediawiki-config@master] profiler: Enable xenon collection in labs (same as prod)

https://gerrit.wikimedia.org/r/443760

Change 443760 merged by jenkins-bot:
[operations/mediawiki-config@master] profiler: Enable xenon collection in labs (same as prod)

https://gerrit.wikimedia.org/r/443760

Change 443762 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] mediawiki: Change xenon interval for Beta Cluster from 10min to 30s

https://gerrit.wikimedia.org/r/443762

[operations/puppet@production] webperf: Rename role::xenon to profile::webperf::xenon
https://gerrit.wikimedia.org/r/443757

[operations/mediawiki-config@master] profiler: Enable xenon collection in labs (same as prod)
https://gerrit.wikimedia.org/r/443760

[operations/puppet@production] mediawiki: Change xenon interval for Beta Cluster from 10min to 30s
https://gerrit.wikimedia.org/r/443762

Now that these patches are merged and/or puppet-beta-cherry-pick'ed, we have a pubsub going in Beta Cluster on its deployment-fluorine02 host (equivalent of mwlog1001), receiving data from HHVM on Beta's app servers.

krinkle@deployment-fluorine02:~$ redis-cli subscribe xenon
Reading messages... (press Ctrl-C to quit)
1) "subscribe"
2) "xenon"
3) (integer) 1

1) "message"
2) "xenon"
3) "index.php;{GET};/srv/mediawiki/php-master/index.php;/srv/mediawiki/php-master/includes/WebStart.php;MediaWiki\\Ses...\\JCSingleton::init;JsonConfig\\JCSingleton::parseConfiguration 1"

1) "message"
2) "xenon"
3) "load.php;{GET};/srv/mediawiki/php-master/load.php;..;/srv/mediawiki/php-master/vendor/composer/autoload_real.php 1"
1) "message"

[..]
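
Each message on the xenon channel above looks like a semicolon-separated stack followed by a sample count. As a rough illustration only (this is not the Arc Lamp code, and the helper names parse_folded and aggregate are hypothetical), such lines could be parsed and tallied like this, assuming the count is always the last space-separated field:

```python
from collections import Counter


def parse_folded(line):
    """Split 'frame1;frame2;...;frameN count' into (frames, count).

    Assumption: the sample count is the final space-separated field,
    as in the redis-cli output shown in this comment.
    """
    stack, _, count = line.rpartition(" ")
    return stack.split(";"), int(count)


def aggregate(lines):
    """Sum sample counts per identical stack."""
    totals = Counter()
    for line in lines:
        frames, count = parse_folded(line)
        totals[";".join(frames)] += count
    return totals


if __name__ == "__main__":
    sample = "load.php;{GET};/srv/mediawiki/php-master/load.php 1"
    print(parse_folded(sample))
```

Summing counts per identical stack is the kind of input a flame graph renderer typically expects, which is why the sketch keeps the frames joined by semicolons.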

Change 443764 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Enable xenondata_host on perfsite in Beta Cluster

https://gerrit.wikimedia.org/r/443764

Change 444331 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Split Redis from the rest of the arclamp profile

https://gerrit.wikimedia.org/r/444331

Change 445066 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Add arclamp profile to webperf::profiling_tools role

https://gerrit.wikimedia.org/r/445066

Mentioned in SAL (#wikimedia-releng) [2018-07-17T01:36:42Z] <Krinkle> Applying role::webperf::profiling_tools class to webperf12 in Beta Cluster - T195312, T180761.

Mentioned in SAL (#wikimedia-releng) [2018-07-23T15:34:15Z] <Krinkle> Deleting deployment-webperf12 - T195312

Mentioned in SAL (#wikimedia-releng) [2018-07-23T15:34:55Z] <Krinkle> Creating deployment-webperf13 - T195312

Mentioned in SAL (#wikimedia-releng) [2018-07-23T15:57:09Z] <Krinkle> Set 'puppetmaster' Hiera for deployment-webperf13 / T195312

Mentioned in SAL (#wikimedia-releng) [2018-07-23T16:04:21Z] <Krinkle> Set up puppet cert stuff on deployment-webperf13 T195312

Mentioned in SAL (#wikimedia-releng) [2018-07-23T16:25:06Z] <Krinkle> Applying role::webperf::profiling_tools class to deployment-webperf13, T195312

This comment was removed by Krinkle.

Mentioned in SAL (#wikimedia-releng) [2018-07-30T23:03:41Z] <Krinkle> Delete and recreate deployment-webperf13 (T195312 / T180761)

Mentioned in SAL (#wikimedia-releng) [2018-07-30T23:04:52Z] <Krinkle> Create instance deployment-webperf13 (deployment-webperf13 ) - T195312 / T180761

Mentioned in SAL (#wikimedia-releng) [2018-07-30T23:28:19Z] <Krinkle> Setting up puppet cert for deployment-webperf12; T195312 / T180761

Mentioned in SAL (#wikimedia-releng) [2018-07-31T17:50:49Z] <Krinkle> Apply role::webperf::profiling_tools to deployment-webperf12; T195312 / T180761

Krinkle added a comment. Edited Aug 7 2018, 8:11 PM

deployment-webperf12 in Beta is now working as an Arc Lamp host. I've confirmed it has xenon-log running, configured to subscribe to the Redis on deployment-fluorine02, storing data in /srv/xenon, and exposing it over HTTP at http://deployment-webperf12/xenon/.
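
To make the "subscribe, then store on disk" step concrete, here is a minimal sketch of how a received stack line might be appended to per-day log files. It is not the actual xenon-log implementation: the function names, the logs/daily layout under the base directory, and the YYYY-MM-DD.<entry>.log file naming are all assumptions for illustration, and the Redis subscription itself is left out so the example stays self-contained.

```python
from datetime import date
from pathlib import Path


def entry_point(stack_line):
    """Return the first frame without a '.php' suffix, e.g. 'load.php;...' -> 'load'."""
    first = stack_line.split(";", 1)[0]
    return first[:-4] if first.endswith(".php") else first


def append_sample(base_dir, stack_line, day):
    """Append one stack line to an 'all' log and a per-entry-point log.

    Assumed (hypothetical) layout: <base_dir>/logs/daily/<YYYY-MM-DD>.<entry>.log
    """
    daily = Path(base_dir) / "logs" / "daily"
    daily.mkdir(parents=True, exist_ok=True)
    targets = [
        daily / f"{day.isoformat()}.all.log",
        daily / f"{day.isoformat()}.{entry_point(stack_line)}.log",
    ]
    for path in targets:
        with path.open("a", encoding="utf-8") as fh:
            fh.write(stack_line + "\n")
    return targets
```

In a real daemon, append_sample would be called once per message received from the Redis subscription, with base_dir pointing at the storage directory (here /srv/xenon).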

Krinkle updated the task description. Aug 7 2018, 8:28 PM

Change 451107 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Switch arclamp_host in Beta from mwlog host to webperf13

https://gerrit.wikimedia.org/r/451107

[operations/puppet@production] webperf: Switch arclamp_host in Beta from mwlog host to webperf13
https://gerrit.wikimedia.org/r/451107

syslog for puppet-agent
Info: Caching catalog for deployment-webperf11.deployment-prep.eqiad.wmflabs
Info: Applying configuration version '1533674995'
Notice: /Stage[main]/Profile::Webperf::Site/Httpd::Site[performance-wikimedia-org]/Httpd::Conf[performance-wikimedia-org]/File[/etc/apache2/sites-available/50-performance-wikimedia-org.conf]/content: 
--- /etc/apache2/sites-available/50-performance-wikimedia-org.conf	2018-07-04 05:05:42.664611765 +0000
+++ /tmp/puppet-file20180807-1476-3aept6	2018-08-07 20:50:14.419982978 +0000
@@ -22,8 +22,8 @@
         Require all granted
     </Directory>
 
-    ProxyPass /xenon http://deployment-fluorine02.deployment-prep.eqiad.wmflabs/xenon
-    ProxyPassReverse /xenon http://deployment-fluorine02.deployment-prep.eqiad.wmflabs/xenon
+    ProxyPass /xenon http://deployment-webperf12.deployment-prep.eqiad.wmflabs/xenon
+    ProxyPassReverse /xenon http://deployment-webperf12.deployment-prep.eqiad.wmflabs/xenon
 
 
 </VirtualHost>

Info: Computing checksum on file /etc/apache2/sites-available/50-performance-wikimedia-org.conf
Info: /Stage[main]/Profile::Webperf::Site/Httpd::Site[performance-wikimedia-org]/Httpd::Conf[performance-wikimedia-org]/File[/etc/apache2/sites-available/50-performance-wikimedia-org.conf]: Filebucketed /etc/apache2/sites-available/50-performance-wikimedia-org.conf to puppet with sum e129f79e85bc324aa2c11235d49fd679
Notice: /Stage[main]/Profile::Webperf::Site/Httpd::Site[performance-wikimedia-org]/Httpd::Conf[performance-wikimedia-org]/File[/etc/apache2/sites-available/50-performance-wikimedia-org.conf]/content: content changed '{md5}e129f79e85bc324aa2c11235d49fd679' to '{md5}9434e76f7ea8f02e4bc14028537ebe6f'
Info: /Stage[main]/Profile::Webperf::Site/Httpd::Site[performance-wikimedia-org]/Httpd::Conf[performance-wikimedia-org]/File[/etc/apache2/sites-available/50-performance-wikimedia-org.conf]: Scheduling refresh of Service[apache2]
Notice: /Stage[main]/Httpd/Service[apache2]: Triggered 'refresh' from 1 events
Notice: Applied catalog in 9.47 seconds

Krinkle updated the task description. Aug 7 2018, 10:52 PM

Change 443757 merged by Giuseppe Lavagetto:
[operations/puppet@production] webperf: Rename role::xenon to profile::webperf::xenon

https://gerrit.wikimedia.org/r/443757

Change 443764 merged by Giuseppe Lavagetto:
[operations/puppet@production] webperf: Enable xenondata_host on perfsite in Beta Cluster

https://gerrit.wikimedia.org/r/443764

Krinkle claimed this task. Aug 12 2018, 12:50 PM
Krinkle moved this task from Doing to Blocked on the Performance-Team board.

Change 452449 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Switch webperf::site to use arclamp from webperf-2

https://gerrit.wikimedia.org/r/452449

Change 444331 merged by Giuseppe Lavagetto:
[operations/puppet@production] webperf: Split Redis from the rest of the arclamp profile

https://gerrit.wikimedia.org/r/444331

Change 445066 merged by Giuseppe Lavagetto:
[operations/puppet@production] webperf: Add arclamp profile to webperf::profiling_tools role

https://gerrit.wikimedia.org/r/445066

Change 451107 merged by Giuseppe Lavagetto:
[operations/puppet@production] webperf: Switch arclamp_host in Beta from mwlog host to webperf12

https://gerrit.wikimedia.org/r/451107

Change 452449 merged by Giuseppe Lavagetto:
[operations/puppet@production] webperf: Switch webperf::site to use arclamp from webperf-2

https://gerrit.wikimedia.org/r/452449

Krinkle closed this task as Resolved. Aug 14 2018, 3:51 PM
Krinkle removed a project: Patch-For-Review.

Copied xenon/logs/daily/2018-*{all,load,index,api,RunSingleJob}.log from mwlog1001 to webperfX002 hosts.

Then, in batches of a dozen files, copied them from my home directory to /srv/xenon/logs/daily as the xenon user, and ran sudo -u xenon /usr/local/bin/xenon-generate-svgs.
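
The batched import described above could be planned with a small dry-run helper like the following. This is only a sketch of the procedure, not what was actually run: plan_commands and batches are hypothetical names, and using cp under sudo -u xenon is an assumption about how the copy "as the xenon user" was done. The helper only builds the command lists and does not execute anything.

```python
from itertools import islice


def batches(items, size=12):
    """Yield consecutive chunks of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def plan_commands(files, dest_dir="/srv/xenon/logs/daily", size=12):
    """Build, per batch, a copy command followed by an SVG regeneration command."""
    plan = []
    for chunk in batches(files, size):
        # Assumption: files are copied with cp while running as the xenon user.
        plan.append(["sudo", "-u", "xenon", "cp", *chunk, dest_dir])
        plan.append(["sudo", "-u", "xenon", "/usr/local/bin/xenon-generate-svgs"])
    return plan


if __name__ == "__main__":
    for cmd in plan_commands(["a.log", "b.log"]):
        print(" ".join(cmd))
```

Regenerating SVGs after each batch, rather than once at the end, keeps each run short and makes it easier to spot a problematic log file early.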

Krinkle moved this task from Blocked to Doing on the Performance-Team board. Aug 16 2018, 5:55 PM