Set up webperf-1 node in Beta Cluster
Closed, Resolved, Public

Description

The webperf node in Beta Cluster would be similar to the webperf1001 and webperf1002 nodes in production.

It would run coal, navtiming, and the performance site. coal and navtiming would consume from Beta's Kafka brokers and produce to Beta's statsd/graphite hosts. That should happen automatically given the Hiera configuration, but it would be good to confirm.

The main thing that'd be useful to test is the performance site, e.g. changes to its Apache configuration and other things that require Puppet changes, which we can then cherry-pick ourselves to Beta's puppet-master.

The navtiming/coal would mostly be dormant, but that's fine.

Aside from the performance site, it'd also be a useful place to be able to test upgrades to XHGui, as well as T195312 in the future.

  • performance::site - https://performance-beta.wmflabs.org/
  • coal::web
    • Check the web API is up and working.
    • Fix the perfsite JS to use the relative local one instead of hardcoding the prod url.
  • webperf::statsv
    • Check that /statsv beacon requests to Beta varnishes result in webperf/statsv writing to Beta statsd/graphite.
  • webperf::navtiming
    • Check that NavigationTiming data from Beta EventLogging is written to Beta graphite/frontend.navtiming.
  • coal::processor
    • Check that NavigationTiming data from Beta EventLogging is written to Beta graphite/coal.

Issues:

See also:

Krinkle created this task. May 22 2018, 4:58 PM
Restricted Application added a subscriber: Aklapper. May 22 2018, 4:58 PM
Krinkle added subscribers: Stashbot, Peter. Edited May 22 2018, 5:07 PM

Mentioned in SAL (#wikimedia-releng) [2018-05-22T16:53:28Z] <Krinkle> Created deployment-webperf01 instance (m1.small) - ref T195312

  • Created the above instance (using the default debian9.3 image). No roles or hiera config yet, just empty/idle.
  • I've granted Aaron admin access to the deployment-prep project (he was already a member).
  • I've added Ian to the deployment-prep project and made him an admin.
  • @Peter It looks like your LDAP/Wikitech account isn't yet set up for Cloud VPS. We should fix that :)

To all, please confirm that you're able to access the following two hosts with SSH:

deployment-puppetmaster02.deployment-prep.eqiad.wmflabs
deployment-webperf01.deployment-prep.eqiad.wmflabs

See Help:Access on Wikitech to get started. In a nutshell:

Mentioned in SAL (#wikimedia-releng) [2018-05-31T00:40:46Z] <Krinkle> Apply 'role::webperf' to deployment-webperf01, T195314

Krinkle claimed this task. May 31 2018, 12:44 AM
Krinkle triaged this task as Normal priority.

Change 436435 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Fix jumbo-eqiad reference to be compatible with Beta Cluster

https://gerrit.wikimedia.org/r/436435

Mentioned in SAL (#wikimedia-releng) [2018-05-31T01:19:32Z] <Krinkle> Add web proxy in Horizon for 'performance-beta', mapping to deployment-webperf01 port 80 - T195314

Krinkle updated the task description. May 31 2018, 1:28 AM

Change 436439 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] deployment-prep: Remove override for scap::sources

https://gerrit.wikimedia.org/r/436439

Summary from the initial puppet run(s):

webperf01 syslog
Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, undefined method `[]' for nil:NilClass at /etc/puppet/modules/profile/manifests/webperf.pp:26:22 on node deployment-webperf01.deployment-prep.eqiad.wmflabs

Fixed by:

Change 436435 by @Krinkle:

[operations/puppet@production] webperf: Fix jumbo-eqiad reference to be compatible with Beta Cluster
https://gerrit.wikimedia.org/r/436435

webperf01 syslog
(/Stage[main]/Webperf/Group[webperf]/ensure) created
(/Stage[main]/Webperf/User[webperf]/ensure) created
(/Stage[main]/Webperf/File[/srv/webperf]/ensure) created
..
(/Stage[main]/Webperf::Navtiming/Service[navtiming]/ensure) ensure changed 'stopped' to 'running'
..
(/Stage[main]/Coal::Processor/File[/var/log/coal]/ensure) created
(/Stage[main]/Packages::Apache2/Package[apache2]/ensure) created
..
(/Stage[main]/Profile::Performance::Site/File[/srv/org/wikimedia]/ensure) created
..

Then at some point it tries to set up Scap for statsv, but that seems to fail.

webperf01 syslog
(/Stage[main]/Scap/Package[scap]/ensure) created
(/Stage[main]/Scap/File[/etc/scap.cfg]/ensure) defined content as '{md5}9ce71a39ab3508d10b09f40d10fbf1f0'
(/Stage[main]/Webperf::Statsv/Scap::Target[statsv/statsv]/Group[deploy-service]/ensure) created
(/Stage[main]/Webperf::Statsv/Scap::Target[statsv/statsv]/User[deploy-service]/ensure) created
(/Stage[main]/Webperf::Statsv/Scap::Target[statsv/statsv]/File[/var/lib/deploy-service]/ensure) created
Execution of '/usr/bin/scap deploy-local --repo statsv/statsv -D log_json:False' returned 70: 01:16:30 WARNING  - Unhandled error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 329, in run
    app._load_config()
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 114, in _load_config
    overrides = self._get_config_overrides()
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 529, in _get_config_overrides
    config = self._get_remote_overrides()
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 544, in _get_remote_overrides
    raise IOError(errno.ENOENT, 'Config file not found', cfg_url)
IOError: [Errno 2] Config file not found: 'deployment-tin.deployment-prep.eqiad.wmflabs/statsv/statsv/.git/DEPLOY_HEAD'
01:16:30 ERROR    - deploy-local failed: <IOError> [Errno 2] Config file not found: 'deployment-tin.deployment-prep.eqiad.wmflabs/statsv/statsv/.git/DEPLOY_HEAD'
#007
(/Stage[main]/Webperf::Statsv/Scap::Target[statsv/statsv]/Package[statsv/statsv]/ensure) change from absent to present failed: Execution of '/usr/bin/scap deploy-local --repo statsv/statsv -D log_json:False' returned 70: 01:16:30 WARNING  - Unhandled error:

It looks like this is because the scap source for statsv is missing on Beta's deployment host. I've filed T196034 about that.
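The missing file can also be checked outside of a Puppet run. Below is a minimal sketch, assuming scap fetches `<repo>/.git/DEPLOY_HEAD` from the deployment host over plain HTTP (the error above shows the path without a scheme, so `http://` is an assumption). The helper names `deploy_head_url` and `deploy_head_exists` are made up for illustration and are not part of scap.

```python
import urllib.error
import urllib.request


def deploy_head_url(deploy_host: str, repo: str) -> str:
    """Build the DEPLOY_HEAD URL for a repo (hypothetical helper, http:// assumed)."""
    return "http://{}/{}/.git/DEPLOY_HEAD".format(deploy_host, repo.strip("/"))


def deploy_head_exists(url: str, timeout: float = 5.0) -> bool:
    """Return True if the deployment host serves the file, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors such as 404 as well as DNS and connection failures.
        return False
```

Calling `deploy_head_exists(deploy_head_url("deployment-tin.deployment-prep.eqiad.wmflabs", "statsv/statsv"))` from a host inside the project would be expected to return False for as long as the scap source is missing.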

Change 436439 abandoned by Krinkle:
deployment-prep: Remove override for scap::sources

Reason:
Discussing at T161675

https://gerrit.wikimedia.org/r/436439

Change 436581 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] Move scap::sources from role::deployment_server to common

https://gerrit.wikimedia.org/r/436581

Change 436586 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] deployment-prep: add webperf to scap::dsh::groups

https://gerrit.wikimedia.org/r/436586

Krinkle updated the task description. May 31 2018, 5:01 PM
Krinkle updated the task description.

Change 436601 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Add navtiming and coal to scap::sources

https://gerrit.wikimedia.org/r/436601

Krinkle updated the task description. May 31 2018, 5:39 PM
Krinkle moved this task from Current Quarter Goals to Doing on the Performance-Team board.

Mentioned in SAL (#wikimedia-releng) [2018-05-31T18:03:08Z] <Krinkle> removed scap::sources override from Horizon Puppet config for deployment-prep - T195314

@demon @thcipriani I could use some help debugging a scap issue. The following Puppet classes are used in production on the webperf1001 host via role::webperf, and have now been applied to deployment-webperf01 in Beta.

The puppet run after applying this role keeps failing. I've resolved some of the issues, but haven't been able to get a clean pass.

May 31 16:58:03 deployment-webperf01 puppet-agent[9387]: IOError: [Errno 2] Config file not found: 'deployment-tin.deployment-prep.eqiad.wmflabs/performance/navtiming/.git/DEPLOY_HEAD'
May 31 16:58:03 deployment-webperf01 puppet-agent[9387]: 16:58:03 ERROR    - deploy-local failed: <IOError> [Errno 2] Config file not found: 'deployment-tin.deployment-prep.eqiad.wmflabs/performance/navtiming/.git/DEPLOY_HEAD'


May 31 23:59:00 deployment-webperf01 puppet-agent[26964]: 23:59:00 ERROR    - deploy-local failed: <IOError> [Errno 2] Config file not found: 'deployment-tin.deployment-prep.eqiad.wmflabs/performance/coal/.git/DEPLOY_HEAD'

I resolved this with

https://gerrit.wikimedia.org/r/436586 [operations/puppet@production] deployment-prep: add webperf to scap::dsh::groups
https://gerrit.wikimedia.org/r/436601 [operations/puppet@production] webperf: Add navtiming and coal to scap::sources

I've confirmed on deployment-tin that the relevant git clones exist in /srv/deployment and that each has a scap/ subdirectory and a .git/DEPLOY_HEAD file. But it is now failing with this:

webperf01: syslog
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]: Execution of '/usr/bin/scap deploy-local --repo statsv/statsv -D log_json:False' returned 70: #033[32m02:01:54 Fetch from: http://deployment-tin.deployment-prep.eqiad.wmflabs/statsv/statsv/.git#033[0m
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]: #033[32m02:01:54 Checkout rev: c1863407a18b0578bc5a1e802c38ed059c95b236#033[0m
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]: #033[32m02:01:54 config_deploy is not enabled in scap.cfg, skipping.#033[0m
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]: #033[32m02:01:54 Restarting service 'statsv'#033[0m
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]: #033[33m02:01:54 Unhandled error:
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]: Traceback (most recent call last):
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]:   File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 336, in run
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]:     exit_status = app.main(app.extra_arguments)
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]:   File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 147, in main
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]:     getattr(self, stage)()
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]:   File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 431, in restart_service
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]:     service, self.config.get('require_valid_service', False))
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]:   File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 787, in handle_services
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]:     restart_service(service)
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]:   File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 402, in context_wrapper
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]:     return func(*args, **kwargs)
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]:   File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 801, in restart_service
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]:     subprocess.check_call(cmd)
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]:   File "/usr/lib/python2.7/subprocess.py", line 186, in check_call
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]:     raise CalledProcessError(retcode, cmd)
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]: CalledProcessError: Command '['sudo', '-n', '/usr/sbin/service', 'statsv', 'restart']' returned non-zero exit status 5#033[0m
Jun  1 02:01:54 deployment-webperf01 puppet-agent[1362]: #033[31m02:01:54 deploy-local failed: <CalledProcessError> Command '['sudo', '-n', '/usr/sbin/service', 'statsv', 'restart']' returned non-zero exit status 5#033[0m

It seems it wants to execute a /usr/sbin/service statsv ... command. Checking sudo /usr/sbin/service --status-all manually shows that the statsv service is not yet recognized. That kind of makes sense given that the service will only be defined after the Package[statsv/statsv] resource is fulfilled, which is failing due to the above. Looks like a paradox?

I suspect perhaps webperf/manifests/statsv.pp or analytics/statsv.git:scap/scap.cfg might be doing something different from the recommended way. Any ideas?

One idea would be to set the require_valid_service config value to True in your scap.cfg file.

This will prevent scap from restarting the service if systemctl show --property LoadState reports masked or not-found, which is likely what it would be on the first run. So scap (the puppet provider) could then deploy the code and config, puppet could then set up the service file and start the service, and subsequent scap deploys would then restart it.
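As a rough illustration of that check (a sketch of the idea only, not scap's actual code; all function names here are hypothetical, and treating an empty state as "do not restart" is an assumption):

```python
import subprocess


def parse_load_state(systemctl_output: str) -> str:
    """Extract the value of the 'LoadState=...' line; empty string if absent."""
    for line in systemctl_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "LoadState":
            return value.strip()
    return ""


def should_restart(load_state: str) -> bool:
    """Skip the restart when the unit is masked, unknown to systemd, or unreadable."""
    return load_state not in ("masked", "not-found", "")


def service_load_state(service: str) -> str:
    """Ask systemd for the unit's LoadState (needs systemctl on the host)."""
    result = subprocess.run(
        ["systemctl", "show", "--property", "LoadState", service],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=False,
    )
    return parse_load_state(result.stdout)
```

On the first Puppet run, `should_restart(service_load_state("statsv"))` would be expected to come back False, so the deploy can finish and leave starting the service to Puppet.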

Looking at deployment-webperf01 the service seems to be loaded and running, so it seems like you worked something out. Sorry for the late reply, but hopefully it's helpful going forward.

Mentioned in SAL (#wikimedia-releng) [2018-06-01T21:29:34Z] <Krinkle> Re-creating webperf01 in deployment-prep, T195314

Mentioned in SAL (#wikimedia-releng) [2018-06-01T21:37:23Z] <Krinkle> Re-create performance-beta.wmflabs.org webproxy (wired to webperf01) - T195314

Change 436908 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[performance/navtiming@master] scap: Enable require_valid_service

https://gerrit.wikimedia.org/r/436908

Change 436914 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[performance/coal@master] scap: Minor settings clean up

https://gerrit.wikimedia.org/r/436914

Change 436908 merged by jenkins-bot:
[performance/navtiming@master] scap: Enable require_valid_service

https://gerrit.wikimedia.org/r/436908

Change 436914 merged by jenkins-bot:
[performance/coal@master] scap: Minor settings clean up

https://gerrit.wikimedia.org/r/436914

Change 436920 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[analytics/statsv@master] scap: Enable require_valid_service

https://gerrit.wikimedia.org/r/436920

Change 436920 merged by Krinkle:
[analytics/statsv@master] scap: Enable require_valid_service

https://gerrit.wikimedia.org/r/436920

Krinkle updated the task description. Jun 1 2018, 10:21 PM
Krinkle updated the task description. Jun 1 2018, 10:50 PM

Change 436962 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[performance/docroot@master] Make coal-web server address configurable

https://gerrit.wikimedia.org/r/436962

Change 436435 merged by Ottomata:
[operations/puppet@production] webperf: Fix jumbo-eqiad reference to be compatible with Beta Cluster

https://gerrit.wikimedia.org/r/436435

Change 436962 merged by jenkins-bot:
[performance/docroot@master] Make coal-web server address configurable

https://gerrit.wikimedia.org/r/436962

Krinkle updated the task description. Jun 4 2018, 6:05 PM
Krinkle updated the task description.

Change 436586 merged by Dzahn:
[operations/puppet@production] deployment-prep: add webperf to scap::dsh::groups

https://gerrit.wikimedia.org/r/436586

demon removed a subscriber: demon. Jun 4 2018, 8:32 PM

Change 436581 abandoned by Krinkle:
Move scap::sources from role::deployment_server to common

https://gerrit.wikimedia.org/r/436581

Krinkle updated the task description. Jun 25 2018, 5:42 PM
Krinkle added a comment. Edited Jun 25 2018, 5:45 PM

I've confirmed that webperf::statsv works as expected in the Beta Cluster using an end-to-end test: sending a /beacon/statsv request from the browser console on a Beta Cluster wiki page, and confirming that the metric ends up in Beta's Graphite. This indirectly also verified that the following are working well in Beta Cluster:

  • Varnish VCL for /beacon/statsv
  • varnishkafka/statsv
  • Kafka
  • webperf/statsv (NEW, on deployment-webperf)
  • statsd-lb + statsite
  • Graphite
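
The beacon request from that test can also be built in a script. This is a minimal sketch, assuming that statsv reads each metric from a query parameter whose value carries a statsd-style type suffix (such as `c` for a counter). The origin `https://beta-wiki.example` and the metric name are placeholders, not real hosts or metrics.

```python
from urllib.parse import urlencode


def statsv_beacon_url(origin: str, metrics: dict) -> str:
    """Build a /beacon/statsv URL from {metric_name: "<value><suffix>"} pairs."""
    return "{}/beacon/statsv?{}".format(origin.rstrip("/"), urlencode(metrics))


# Placeholder origin and metric name; substitute a Beta Cluster wiki origin.
url = statsv_beacon_url("https://beta-wiki.example", {"test.webperf01.check": "1c"})
print(url)
```

Requesting the resulting URL and then looking up the metric in Beta's Graphite exercises the same chain as the list above.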
Krinkle updated the task description. Jun 26 2018, 8:24 PM
Krinkle updated the task description.

I've confirmed that the webperf::navtiming service is also working as expected in the Beta Cluster.

  • Varnish VCL for /beacon/event.
  • varnishkafka producing to Kafka topic eventlogging-client-side.
  • EventLogging consuming the above, and producing to Kafka topic eventlogging_<schema>.
  • webperf/navtiming consuming the above, and producing to Graphite via statsd.
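
For the last hop, a timing sample travels to statsd as a single plain-text line. The sketch below shows that wire format only; it is not the navtiming service's code. The function names and the example metric are illustrative, and 8125 is the conventional statsd UDP port rather than a value confirmed for Beta.

```python
import socket


def statsd_timing_line(metric: str, value_ms: int) -> str:
    """Format one timing sample in the plain-text statsd wire format."""
    return "{}:{}|ms".format(metric, value_ms)


def send_statsd(line: str, host: str, port: int = 8125) -> None:
    """Send one metric line over UDP (fire-and-forget, no delivery guarantee)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))
```

For example, `statsd_timing_line("frontend.navtiming.responseStart", 123)` yields `frontend.navtiming.responseStart:123|ms`, which should then show up in Graphite under the frontend.navtiming prefix.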

From graphite-labs.wikimedia.org/render/..:

Change 442198 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/mediawiki-config@master] Beta Cluster: Increase wgNavigationTimingSamplingFactor from 0.1% to 10%

https://gerrit.wikimedia.org/r/442198
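
For context on the numbers in that change: assuming wgNavigationTimingSamplingFactor is a "1 in N page views" setting (an assumption about the extension, not something stated in this task), the percentages translate to factors as sketched below. The function names are illustrative.

```python
def sampling_factor_for(percent: float) -> int:
    """Convert a sampling percentage to a '1 in N' factor (assumed semantics)."""
    return round(100 / percent)


def expected_samples(page_views: int, factor: int) -> float:
    """Expected number of sampled page views for a given '1 in N' factor."""
    return page_views / factor


print(sampling_factor_for(0.1), sampling_factor_for(10))
print(expected_samples(5000, 1000), expected_samples(5000, 10))
```

With Beta's low traffic, moving from 1 in 1000 to 1 in 10 is what makes enough NavigationTiming events arrive to see anything in Graphite.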

Change 442232 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] [WIP] webperf: Make performance::site apache config more dynamic

https://gerrit.wikimedia.org/r/442232

Change 442900 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Get graphite_host for coal::processor from Hiera

https://gerrit.wikimedia.org/r/442900

Krinkle updated the task description. Jun 28 2018, 8:30 PM
Krinkle updated the task description.
Krinkle updated the task description. Jun 28 2018, 8:32 PM
Vvjjkkii renamed this task from Set up webperf node in Beta Cluster to xhcaaaaaaa. Jul 1 2018, 1:08 AM
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii removed Krinkle as the assignee of this task.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
CommunityTechBot assigned this task to Krinkle.
CommunityTechBot lowered the priority of this task from High to Normal.
CommunityTechBot renamed this task from xhcaaaaaaa to Set up webperf node in Beta Cluster.
CommunityTechBot added subscribers: gerritbot, Aklapper.
Krinkle renamed this task from Set up webperf node in Beta Cluster to Set up webperf-1 node in Beta Cluster. Jul 3 2018, 3:42 AM
Krinkle updated the task description. Jul 3 2018, 3:46 AM

Change 442232 merged by Muehlenhoff:
[operations/puppet@production] webperf: Make performance::site apache config more dynamic

https://gerrit.wikimedia.org/r/442232

Change 443739 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Move site vars to profile class params (set from Hiera)

https://gerrit.wikimedia.org/r/443739

Krinkle updated the task description. Jul 3 2018, 10:54 PM

Change 443739 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Move site vars to profile class params (set from Hiera)

https://gerrit.wikimedia.org/r/443739

This has been cherry-picked to Beta Cluster and works as expected. https://performance-beta.wmflabs.org/xenon/ now displays 404 Not Found (as it should) instead of timing out.

Krinkle closed this task as Resolved. Jul 3 2018, 10:55 PM

Change 443752 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Rename webperf profiles for clarity

https://gerrit.wikimedia.org/r/443752

Change 436601 merged by Muehlenhoff:
[operations/puppet@production] webperf: Add statsv, navtiming and coal to scap::sources

https://gerrit.wikimedia.org/r/436601

Krinkle updated the task description. Jul 6 2018, 11:31 PM

Change 442900 merged by Muehlenhoff:
[operations/puppet@production] webperf: Get graphite_host for coal::processor from Hiera

https://gerrit.wikimedia.org/r/442900

Change 443739 merged by Muehlenhoff:
[operations/puppet@production] webperf: Move site vars to profile class params (set from Hiera)

https://gerrit.wikimedia.org/r/443739

Change 443752 merged by Giuseppe Lavagetto:
[operations/puppet@production] webperf: Rename webperf profiles for clarity

https://gerrit.wikimedia.org/r/443752