Consolidate performance website and related software
Closed, Resolved, Public

Description

While moving graphite1001 to jessie, I noticed that the performance website is also hosted there, together with coal. Xenon is instead hosted on fluorine (now mwlog1001), and xhgui and its MongoDB are on tungsten. Since capex/opex planning for next year is under way, I propose consolidating all performance-related software onto a VM or bare metal, depending on the required specs. Thoughts?


Sub tasks

  • Write up understanding of the current (scattered) services. (Below: Current services)
  • Figure out overall migration plan in terms of hardware. – (Below: Proposed topology)
  • New stack:
    • Request production VM for webperf (metrics processing). – T179036
    • Migrate webperf from hafnium to webperf#001. – T186774
    • Figure out how to migrate performance.wikimedia.org site.
    • Migrate coal from graphite1001 to webperf#001. – T159354
    • Request production VM for web apps (xhgui, xenon). - T194390
    • make xhgui::app role support stretch/buster and deploy on new xhgui machines - T238788
    • Migrate xhgui from tungsten to new VM. - VMs created in T238098
    • Update wmf-config to write XWD profiles to new XHGUI location.
    • Figure out how to migrate Xenon processing. (See "Moveability" below)
    • Move Xenon's ArcLamp and Apache-for-xenon from mwlog1001 to webperf1002.
  • Old stack:
    • Decom webperf/asset-check service. – T164419
    • Decom webperf/ve service. – T175083
    • Decom old xhprof viewer. – T196406
    • Decom osmium.eqiad host. – T175093
    • Decom hafnium.eqiad host. – T193420
    • Decom tungsten.eqiad host. – T260395

Current services

  • performance.wikimedia.org: Static website. Proxied services: coal (local), xenon (mwlog1001), xhgui/xhprof (tungsten)
  • coal: EventLogging subscriber (Kafka) that processes Navigation Timing events and stores aggregated data in a dedicated Graphite backend.
  • coal-web: Python http service that serves json-formatted data from the coal storage.
    • Currently on webperf1001
    • Uses the graphite HTTP API to pull raw data and to format that correctly. The http service is exposed via a private socket file, proxied from performance.wikimedia.org Apache config (currently on the same server).
    • Code: performance/coal
  • xenon: Receives data from all app servers on a Redis instance (see operations/mediawiki-config.git:/StartProfiler). The xenon-log Python service subscribes to this Redis feed and produces searchable log files. A cron job (xenon-generate-svgs) periodically produces SVGs, which are stored in a local directory and made available on performance.wikimedia.org through a local Apache proxy.
  • xhgui: A webapp for viewing and analyzing PHP profiling data. Wikimedia-Debug requests can generate a profile that is submitted to the XHGui's MongoDB database. (StartProfiler).
  • webperf: EventLogging subscriber daemons that send data to Statsd/Graphite. webperf::statsv, webperf::ve, webperf::navtiming.
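
The xenon flow described above (raw stack samples in, aggregated files out) can be illustrated with a minimal Python sketch. It assumes each sample arrives as one line of semicolon-separated frames, and it collapses identical stacks into "stack count" lines, the folded format that flame graph tools commonly consume. The function name `collapse_stacks` and the sample frames are hypothetical and not taken from the xenon code.

```python
from collections import Counter


def collapse_stacks(samples):
    """Collapse raw stack samples into folded 'stack count' lines.

    Assumption: each sample is one line of semicolon-separated frames,
    outermost frame first. Blank lines are ignored.
    """
    counts = Counter(line.strip() for line in samples if line.strip())
    # Sort for stable output: most frequent stack first, then by name.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [f"{stack} {count}" for stack, count in ordered]


# Hypothetical samples, as a log-writing service might have recorded them.
samples = [
    "index.php;MediaWiki::run;Parser::parse",
    "index.php;MediaWiki::run;Parser::parse",
    "index.php;MediaWiki::run;OutputPage::output",
    "",
]
folded = collapse_stacks(samples)
```

Keeping the aggregation as a pure function over lines means it does not care whether the lines came from Redis or from a log file on disk, which is the property that makes steps 2 to 4 of the xenon setup easy to relocate.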

Moveability

  • performance.wikimedia.org:
    • Currently on graphite1001.
    • Should be trivial to move to another server.
  • coal and coal-web:
    • Currently on graphite1001.
    • Could be moved to a separate server if that separate server becomes a secondary graphite backend (like graphite1003). But I'd rather not make our perf/misc server a production graphite backend. So, unless we allocate two servers, probably makes sense to keep coal-web on graphite1001 for now.
  • xenon:
    • Currently on mwlog1001. 4 things: 1) A Redis server for incoming data from app servers, 2) Process to create text files from Redis data, 3) Process to create SVG files, 4) Apache to serve these files.
    • 2, 3 and 4 are easy to move to another server. I'd prefer to keep the Redis server, and therefore the part that production interacts with, on mwlog1001.
  • xhgui:
    • Currently on tungsten. Requires MongoDB, PHP, Apache.
    • Easily kept or moved, if another server would become the perf server.
  • webperf
    • Currently on hafnium.
    • Easily moved.

Old topology

Based on T158837#3368030:

graphite1001: (Our stuff is minor/secondary)

  • (Our) services: coal, coal-web, perf-site.

mwlog1001: (Our stuff is minor/secondary)

  • (Our) services: Redis (endpoint receiving xenon data), xenon-log (reads Redis, writes TXT), xenon-generate-svgs (reads TXT, writes SVG), Apache (serves TXT and SVG, proxied from perf-site)

tungsten (former db server; old, should be decom)

  • Spec: 16 cores, 64G RAM, 40G and 1.6T HDD
  • Services: XHGui (MongoDB, Apache)

osmium (former app server):

  • Spec: 16 cores, 64G RAM, 2x 500G HDD
  • Services: – (unused, previously: visualeditor, devwiki, jsbench)

hafnium (misc; old; should be decom/replaced):

  • Spec: 24 cores, 32G RAM, 50G HDD
  • Services: webperf (navtiming, statsv)

New topology

Based on T158837#3582514:

webperf x001 (Ganeti VM - multi-dc)

  • Specs: 4 vCPU, 8GB RAM, 50GB HDD
  • Services: webperf/processors_and_site (perf-site, coal::processor, coal::web, statsv, navtiming)

webperf x002 (Ganeti VM - multi-dc)

  • Specs: 4 vCPU, 8GB RAM, 50GB HDD
  • Services: webperf/profiling_tools (xhgui, arc-lamp, Apache for arc-lamp)

Details

Repo                Branch      Lines +/-
operations/puppet   production  +1 -1
operations/puppet   production  +2 -127
operations/puppet   production  +1 -1
operations/puppet   production  +1 -0
operations/puppet   production  +34 -0
operations/puppet   production  +9 -0
operations/puppet   production  +3 -16
operations/puppet   production  +4 -0
operations/puppet   production  +55 -45
performance/docroot master      +5 -5
operations/puppet   production  +1 -0
operations/puppet   production  +29 -24
operations/puppet   production  +2 -4
operations/puppet   production  +7 -0
operations/puppet   production  +5 -1
performance/docroot master      +134 -91
operations/puppet   production  +2 -1
performance/coal    master      +99 -105
operations/puppet   production  +28 -22
performance/coal    master      +113 -60
operations/puppet   production  +8 -7

Related Objects

Status    Assigned
Resolved  dpifke
Invalid   None
Resolved  Imarlier
Resolved  Dzahn
Resolved  Krinkle
Resolved  Krinkle
Resolved  Krinkle
Resolved  Cmjohnson
Resolved  Dzahn
Resolved  Imarlier
Resolved  Cmjohnson
Resolved  Imarlier
Resolved  Krinkle
Resolved  Krinkle
Resolved  Krinkle
Resolved  dpifke
Resolved  Krinkle
Resolved  Dzahn
Resolved  Dzahn
Resolved  Marostegui
Resolved  Dzahn
Resolved  Dzahn
Resolved  Dzahn
Resolved  akosiaris
Resolved  Jclark-ctr

Event Timeline

Dzahn updated the task description.
Dzahn subscribed.

You should now be able to move xhgui over to webperf1002/2002. New VMs have been created and are ready to use. See details in T194390#4212203.

Change 433710 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] webperf: separate permissions from specific apps

https://gerrit.wikimedia.org/r/433710

@Imarlier @Krinkle I forgot about the shell access part. https://gerrit.wikimedia.org/r/#/c/433738/ added the perfteam admin group to the new VMs. You can now SSH to webperf1002 / webperf2002. I added the "perfteam" admin group but not the eventlogging-admin group so far.

@Imarlier I was just thinking about whether the etcd code is live or not for webperf/navtiming. The puppet invocation hasn't changed, but given navtiming.systemd uses the run_navtiming.py entry point (which reads config.ini), that presumably means the etcd_path/etcd_domain are used in prod. Is that right?

Anyhow, it currently uses MediaWiki's WMFMasterDatacenter key, which seems inappropriate. It certainly seems desirable that, in the event of an overall switch-over, webperf/navtiming follows the other services, but we should also be able to switch over independently. This matters especially for navtiming, which would otherwise have downtime during server maintenance.

I was discussing this with @Dzahn on IRC just now (who was asking whether it's fine to reboot the server to try enabling IPv6), which made me think it'd be nice if we could just switch over to codfw for a bit, upgrade the eqiad server, switch back to try it out, and switch to codfw again if it doesn't work, instead of being subject to downtime.

Puppet has various inheritance options we can use to make our etcd config automatically follow MediaWiki by default, but with a way to override. In addition, for a major switchover we use orchestration scripts. In those scripts, MediaWiki is merely one among many, each of which is semi-automatically switched together.

This isn't documented, and at the time I reviewed navtiming's etcd code, it made sense as-is, but in retrospect, we probably shouldn't read mediawiki-config/WMFMasterDatacenter from etcd, given no other service does (besides mediawiki).
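
The "follow the shared setting by default, but allow an independent override" behaviour discussed above could be sketched roughly as follows. This is only an illustration of the fallback logic: a plain dictionary stands in for etcd, and both key names (`shared/active_dc` and `webperf/navtiming/active_dc`) are hypothetical. It says nothing about which real key the default should come from.

```python
def resolve_active_dc(store, service_key, default_key):
    """Return the datacenter a service should treat as active.

    A service-specific key wins when it is set; otherwise the service
    follows the shared default key. `store` is a plain dict standing in
    for a key-value store such as etcd (an assumption of this sketch).
    """
    override = store.get(service_key)
    if override:
        return override
    return store[default_key]


# Hypothetical key names, chosen for illustration only.
DEFAULT_KEY = "shared/active_dc"
SERVICE_KEY = "webperf/navtiming/active_dc"

store = {DEFAULT_KEY: "eqiad"}
follows_default = resolve_active_dc(store, SERVICE_KEY, DEFAULT_KEY)

# Switch navtiming alone to codfw, e.g. during eqiad server maintenance.
store[SERVICE_KEY] = "codfw"
overridden = resolve_active_dc(store, SERVICE_KEY, DEFAULT_KEY)
```

With this shape, a full switch-over only needs to change the shared key, while a maintenance window on one server only needs to set (and later remove) the service-specific key.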

Added mapped IPv6 addresses on the interface for webperf1001/2001:

https://gerrit.wikimedia.org/r/#/c/433299/

After applying that and confirming we got the new IPs from Puppet, I added the records in DNS:

https://gerrit.wikimedia.org/r/#/c/434423/

I confirmed connectivity between 1001/2001, 2002/1001, etc. using host and ping6.

There was no server reboot or service restart. Apache was already listening on tcp6 :::80.

This was to ensure that all webperf machines are treated the same. 1002/2002 came with v6 from the start.

If there are any unexpected issues, reverting means reverting the 2 patches above. I used a DNS TTL of 5 min, as requested, to minimize downtime just in case.

22:50 Krinkle: webperf1001: restart navtiming service, to test with new ipv6 capabilities

The service remained fine after the restart. Depending on the target (Kafka vs. statsd), it will either have started using IPv6 without issue or kept using IPv4 without issue. Kafka hosts have IPv6, but graphite/statsd do not (I think).

The actual network interface and DNS were added about 45min earlier at 22:05, and looking back, right after Puppet enabled the net6 interface, an unexpected outage started for navtiming, which apparently lasted for 10min before it corrected itself.

render.png (400×800 px, 22 KB)

The logs on webperf1001 don't show any messages relating to navtiming in either the journal or syslog. The service had an uptime of 3 weeks before I restarted it at 22:50.

The burst of data around 22:15 suggests that not all the data was lost during the 10min gap, but that some was somehow buffered or clogged somewhere, and then made its way from webperf1001 to statsd at 22:15. It'd be interesting to know what caused it to stay offline for so long (10min), and where (some of the) data was kept during that time, and whether we can do something to improve that within navtiming.py.

Change 433710 abandoned by Imarlier:
webperf: separate permissions from specific apps

https://gerrit.wikimedia.org/r/433710

Change 433710 restored by Imarlier:
webperf: separate permissions from specific apps

https://gerrit.wikimedia.org/r/433710

@Krinkle I'm fine with creating a different etcd key, as long as there's an easy way to change that key if/when a full data center failover happens. Sounds like that just means a modification to an orchestration script, which is good enough for me.

What I do _not_ want to do is to add anything to puppet in order to do this :-)

Change 433710 merged by Dzahn:
[operations/puppet@production] webperf: Make the different webperf roles explicit

https://gerrit.wikimedia.org/r/433710

Change 439648 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] Need to install mongodb on xhgui machines

https://gerrit.wikimedia.org/r/439648

Change 439648 merged by Herron:
[operations/puppet@production] Need to install mongodb on xhgui machines

https://gerrit.wikimedia.org/r/439648

Change 440649 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[performance/docroot@master] Make links from php-profiling.html to to /xenon/ relative

https://gerrit.wikimedia.org/r/440649

Change 440649 merged by jenkins-bot:
[performance/docroot@master] Make links from php-profiling.html to to /xenon/ relative

https://gerrit.wikimedia.org/r/440649

Note to self with regards to the perf-site. The site's Apache configuration (src) currently hardcodes ServerName performance.wikimedia.org as well as the proxy backends for XHGui (tungsten) and Xenon data (mwlog1001).

We should make these parameters of the conf.erb template so that they can point to something else in Beta Cluster. Possibly made optional as well.
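
As a rough illustration of what parameterizing the vhost might look like, the Python sketch below renders a minimal Apache config from a server name and two optional backend hosts. Python stands in for the ERB template here, and the function name `render_vhost`, the `/xhgui/` and `/xenon/` prefixes, and the plain-HTTP backend URLs are assumptions of the sketch rather than details of the real conf.erb.

```python
def render_vhost(server_name, xhgui_host=None, xenon_host=None):
    """Render a minimal Apache vhost with optional proxy backends.

    Assumptions of this sketch: the /xhgui/ and /xenon/ URL prefixes and
    plain-HTTP backends. A backend left as None produces no ProxyPass line.
    """
    lines = ["<VirtualHost *:80>", f"    ServerName {server_name}"]
    for prefix, host in (("/xhgui/", xhgui_host), ("/xenon/", xenon_host)):
        if host is not None:
            lines.append(f"    ProxyPass {prefix} http://{host}{prefix}")
    lines.append("</VirtualHost>")
    return "\n".join(lines)


# Production-like values, using the hosts named in the note above.
prod = render_vhost(
    "performance.wikimedia.org",
    xhgui_host="tungsten",
    xenon_host="mwlog1001",
)

# Beta Cluster style values with both proxies left out.
beta = render_vhost("performance.wikimedia.beta.wmflabs.org")
```

Making the backends optional lets an environment without XHGui or Xenon hosts still serve the static site, instead of proxying to a host that does not exist there.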

Change 443752 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Rename webperf profiles for clarity

https://gerrit.wikimedia.org/r/443752

Krinkle updated the task description.

Change 443752 merged by Giuseppe Lavagetto:
[operations/puppet@production] webperf: Rename webperf profiles for clarity

https://gerrit.wikimedia.org/r/443752

Change 452689 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Add 'Server: <fqdn>' header to performance.wikimedia.org

https://gerrit.wikimedia.org/r/452689

Krinkle updated the task description.

Change 452689 abandoned by Krinkle:
webperf: Add 'Server: <fqdn>' header to performance.wikimedia.org

Reason:
Might reconsider at a later point.

https://gerrit.wikimedia.org/r/452689

(un-assigning the tracking task to reduce clutter on the workboard, see open sub tasks for current assignees)

Change 503675 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Remove arclamp subscriber from mwlog servers

https://gerrit.wikimedia.org/r/503675

Change 503675 merged by Effie Mouzeli:
[operations/puppet@production] webperf: Remove arclamp subscriber from mwlog servers

https://gerrit.wikimedia.org/r/503675

@Krinkle I have stopped and disabled xenon-log, excimer-log, and apache on mwlog* servers, and I have removed the arclamp-generate-svgs cron job. Let me know if there is anything else related to change 503675

Change 530771 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] hieradata: Move beta 'cache::app_directors' from Horizon to Puppet

https://gerrit.wikimedia.org/r/530771

Change 530773 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] hieradata: Add 'performance.wikimedia.beta.wmflabs.org' routing

https://gerrit.wikimedia.org/r/530773

Change 552357 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] webperf: switch xhgui_host from tungsten to xhgui1001

https://gerrit.wikimedia.org/r/552357

xhgui1001/2001 should now be about ready to replace tungsten. see T238788#5683132

Change 552362 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] xhgui: rsync mongodb data from tungsten to xhgui1001

https://gerrit.wikimedia.org/r/552362

Change 552362 merged by Dzahn:
[operations/puppet@production] xhgui: rsync mongodb data from tungsten to xhgui1001

https://gerrit.wikimedia.org/r/552362

Mentioned in SAL (#wikimedia-operations) [2019-11-22T00:46:45Z] <mutante> xhgui1001/xhgui2001 - rsyncing /srv/mongod from tungsten to /srv/tungsten/mongod/ on both new machines (T158837)

Change 530771 abandoned by Krinkle:
hieradata: Move beta 'cache::app_directors' from Horizon to Puppet

Reason:
This has since been done as part of the ATS migration, the backend mapping for those is now on puppet.git both for beta cluster and for prod.

https://gerrit.wikimedia.org/r/530771

Change 530773 merged by RLazarus:
[operations/puppet@production] hieradata: Include cache-text in Beta Cluster 'cache_hosts'

https://gerrit.wikimedia.org/r/530773

Change 552357 merged by Dzahn:
[operations/puppet@production] webperf: switch xhgui_host from tungsten to xhgui1001

https://gerrit.wikimedia.org/r/552357

Mentioned in SAL (#wikimedia-operations) [2020-08-13T22:03:41Z] <mutante> switching xhgui from tungsten to xhgui1001 - ran puppet on webperf*001 - T180761 T158837

Change 620130 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] delete role::xhgui::app

https://gerrit.wikimedia.org/r/620130

Mentioned in SAL (#wikimedia-operations) [2020-08-20T22:20:00Z] <mutante> permanently shut down tungsten.eqiad.wmnet T260395 T158837 T180761 T224549

Dzahn raised the priority of this task from Low to Medium. (Aug 20 2020, 10:22 PM)
Dzahn updated the task description.

Change 620130 merged by Dzahn:
[operations/puppet@production] delete role::xhgui::app

https://gerrit.wikimedia.org/r/620130

Change 621606 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] cumin: update xhgui alias to apply to new role name

https://gerrit.wikimedia.org/r/621606

Change 621606 merged by Dzahn:
[operations/puppet@production] cumin: update xhgui alias to apply to new role name

https://gerrit.wikimedia.org/r/621606

From my side this ticket looks done now.