Consolidate performance website and related software
Open, Low, Public

Description

While moving graphite1001 to jessie, I noticed that the performance website is also hosted there, together with coal. Xenon is instead hosted on fluorine (now mwlog1001), and xhgui and its MongoDB are on tungsten. Since capex/opex planning for next year is under way, I propose consolidating all performance-related software onto a VM or bare metal, depending on the required specs. Thoughts?


Sub tasks

  • Write up understanding of the current (scattered) services. (Below: Current services)
  • Figure out overall migration plan in terms of hardware. – (Below: Proposed topology)
  • New stack:
    • Request production VM for webperf (metrics processing). – T179036
    • Migrate webperf from hafnium to webperf#001. – T186774
    • Figure out how to migrate performance.wikimedia.org site.
    • Migrate coal from graphite1001 to webperf#001. – T159354
    • Request production VM for web apps (xhgui, xenon). - T194390
    • Migrate xhgui from tungsten to new VM.
    • Update wmf-config to write XWD profiles to new XHGUI location.
    • Figure out how to migrate Xenon processing. (See "Moveability" below)
    • Move Xenon's ArcLamp and Apache-for-xenon from mwlog1001 to webperf1002.
  • Old stack:
    • Decom webperf/asset-check service. – T164419
    • Decom webperf/ve service. – T175083
    • Decom old xhprof viewer. – T196406
    • Decom osmium.eqiad host. – T175093
    • Decom hafnium.eqiad host. – T193420
    • Decom tungsten.eqiad host. (Blocked - xhgui runs here)

Current services

  • performance.wikimedia.org: Static website. Proxied services: coal (local), xenon (mwlog1001), xhgui/xhprof (tungsten)
  • coal: EventLogging subscriber (Kafka) that processes Navigation Timing events and stores aggregated data in a dedicated Graphite backend.
  • coal-web: Python http service that serves json-formatted data from the coal storage.
    • Currently on webperf1001
    • Uses the graphite HTTP API to pull raw data and to format that correctly. The http service is exposed via a private socket file, proxied from performance.wikimedia.org Apache config (currently on the same server).
    • Code: performance/coal
  • xenon: Receives data from all app servers on a Redis instance (see operations/mediawiki-config.git:/StartProfiler). The xenon-log Python service subscribes to this Redis feed and produces searchable log files. A cron job (xenon-generate-svgs) periodically produces SVGs, which are stored in a local directory and made available on performance.wikimedia.org through a local Apache proxy.
  • xhgui: A webapp for viewing and analyzing PHP profiling data. Wikimedia-Debug requests can generate a profile that is submitted to the XHGui's MongoDB database. (StartProfiler).
  • webperf: EventLogging subscriber daemons that send data to Statsd/Graphite. webperf::statsv, webperf::ve, webperf::navtiming.
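
As a rough illustration of the coal-web step described above (pull raw data from the Graphite HTTP API, then reshape it), here is a minimal Python sketch. It assumes Graphite's JSON render output, where each series carries a list of [value, timestamp] pairs. The base URL, the metric name coal.responseStart, and the output shape (start/step/points) are hypothetical and not taken from performance/coal.

```python
import json
from urllib.parse import urlencode


def build_render_url(base_url, target, period="-1h"):
    """Build a Graphite render API URL that asks for JSON output."""
    query = urlencode({"target": target, "from": period, "format": "json"})
    return f"{base_url}/render?{query}"


def format_series(render_json):
    """Reshape Graphite's [[value, timestamp], ...] pairs per target.

    The returned layout (start, step, points) is an assumption made for
    this sketch, not the format coal-web actually serves.
    """
    out = {}
    for series in json.loads(render_json):
        datapoints = series["datapoints"]
        timestamps = [ts for _, ts in datapoints]
        step = timestamps[1] - timestamps[0] if len(timestamps) > 1 else None
        out[series["target"]] = {
            "start": timestamps[0] if timestamps else None,
            "step": step,
            "points": [value for value, _ in datapoints],
        }
    return out
```

Keeping the URL building and the reshaping separate from the HTTP call means both can be checked without a Graphite backend.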

Moveability

  • performance.wikimedia.org:
    • Currently on graphite1001.
    • Should be trivial to move to another server.
  • coal and coal-web:
    • Currently on graphite1001.
    • Could be moved to a separate server if that server becomes a secondary graphite backend (like graphite1003). But I'd rather not make our perf/misc server a production graphite backend. So, unless we allocate two servers, it probably makes sense to keep coal-web on graphite1001 for now.
  • xenon:
    • Currently on mwlog1001. 4 things: 1) A Redis server for incoming data from app servers, 2) Process to create text files from Redis data, 3) Process to create SVG files, 4) Apache to serve these files.
    • 2, 3 and 4 are easy to move to another server. I'd prefer to keep the Redis server, and therefore the part that production interacts with, on mwlog1001.
  • xhgui:
    • Currently on tungsten. Requires MongoDB, PHP, Apache.
    • Easily kept or moved, if another server would become the perf server.
  • webperf
    • Currently on hafnium.
    • Easily moved.
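
To make the split between xenon's parts 2 and 3 more concrete, here is a minimal Python sketch of an aggregation step that could sit between the text logs and the SVG generation. It assumes each sample is a semicolon-separated call stack string, which the description above does not state; the function name and the "stack count" output lines are hypothetical and not taken from xenon-log or xenon-generate-svgs.

```python
from collections import Counter


def fold_samples(samples):
    """Count identical stack samples and emit one 'stack count' line each.

    `samples` is any iterable of strings, for example lines read back from
    the text files that the log-writing process produced.
    """
    counts = Counter(s.strip() for s in samples if s.strip())
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]
```

Because the function only needs an iterable of strings, it has no dependency on Redis, which is consistent with parts 2 to 4 being movable while the Redis endpoint stays on mwlog1001.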

Old topology

Based on T158837#3368030:

graphite1001: (Our stuff is minor/secondary)

  • (Our) services: coal, coal-web, perf-site.

mwlog1001: (Our stuff is minor/secondary)

  • (Our) services: Redis (endpoint receiving xenon data), xenon-log (reads Redis, writes TXT), xenon-generate-svgs (reads TXT, writes SVG), Apache (serves TXT and SVG, proxied from perf-site)

tungsten (former db server; old, should be decom)

  • Spec: 16 cores, 64G RAM, 40G and 1.6T HDD
  • Services: XHGui (MongoDB, Apache)

osmium (former app server):

  • Spec: 16 cores, 64G RAM, 2x 500G HDD
  • Services: – (unused, previously: visualeditor, devwiki, jsbench)

hafnium (misc; old; should be decom/replaced):

  • Spec: 24 cores, 32G RAM, 50G HDD
  • Services: webperf (navtiming, statsv)

New topology

Based on T158837#3582514:

webperf x001 (Ganeti VM - multi-dc)

  • Specs: 4 vCPU, 8GB RAM, 50GB HDD
  • Services: webperf/processors_and_site (perf-site, coal::processor, coal::web, statsv, navtiming)

webperf x002 (Ganeti VM - multi-dc)

  • Specs: 4 vCPU, 8GB RAM, 50GB HDD
  • Services: webperf/profiling_tools (xhgui, arc-lamp, Apache for arc-lamp)

Event Timeline

Change 427664 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[performance/coal@master] coal: convert to using graphite instead of writing to whisper

https://gerrit.wikimedia.org/r/427664

Change 427664 merged by jenkins-bot:
[performance/coal@master] coal: convert to using graphite instead of writing to whisper

https://gerrit.wikimedia.org/r/427664

Change 428659 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] coal: Point systemd and uwsgi config to scap-deployed version

https://gerrit.wikimedia.org/r/428659

Krinkle updated the task description. (Show Details)Apr 24 2018, 4:11 PM

Change 428836 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] graphite: allow data requests from performance.wikimedia.org

https://gerrit.wikimedia.org/r/428836

Change 428950 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[performance/docroot@master] docroot: Pull graph data from graphite, not coal-web

https://gerrit.wikimedia.org/r/428950

Change 428659 merged by Herron:
[operations/puppet@production] coal: Point systemd and uwsgi config to scap-deployed version

https://gerrit.wikimedia.org/r/428659

Change 428978 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[performance/coal@master] coal: revert coal-web back to what it was

https://gerrit.wikimedia.org/r/428978

Change 428978 merged by jenkins-bot:
[performance/coal@master] coal: revert coal-web back to what it was

https://gerrit.wikimedia.org/r/428978

Change 428836 merged by Ottomata:
[operations/puppet@production] graphite: allow data requests from performance.wikimedia.org

https://gerrit.wikimedia.org/r/428836

Change 428950 merged by jenkins-bot:
[performance/docroot@master] docroot: Pull graph data from graphite, not coal-web

https://gerrit.wikimedia.org/r/428950

Krinkle updated the task description. (Show Details)May 1 2018, 2:21 AM
Krinkle updated the task description. (Show Details)

Change 431659 had a related patch set uploaded (by Krinkle; owner: Imarlier):
[operations/puppet@production] performance.wikimedia.org: serve from webperfX001

https://gerrit.wikimedia.org/r/431659

Krinkle updated the task description. (Show Details)May 7 2018, 10:23 PM

Change 431659 merged by Dzahn:
[operations/puppet@production] performance.wikimedia.org: serve from webperfX001

https://gerrit.wikimedia.org/r/431659

Mentioned in SAL (#wikimedia-operations) [2018-05-08T15:53:24Z] <mutante> switching performance.wikimedia.org from graphite to webperf backends - running puppet on cache::misc servers (T158837)

Change 431779 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] performance website: allow traffic

https://gerrit.wikimedia.org/r/431779

Imarlier updated the task description. (Show Details)May 8 2018, 4:42 PM

Change 431779 merged by Dzahn:
[operations/puppet@production] performance website: allow traffic

https://gerrit.wikimedia.org/r/431779

Change 431792 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] performance website: remove from graphite hosts

https://gerrit.wikimedia.org/r/431792

Change 431792 merged by Dzahn:
[operations/puppet@production] performance website: remove from graphite hosts

https://gerrit.wikimedia.org/r/431792

Imarlier updated the task description. (Show Details)
Imarlier moved this task from Next In This Quarter to Doing on the Performance-Team board.
Krinkle updated the task description. (Show Details)May 10 2018, 12:02 PM
Imarlier updated the task description. (Show Details)May 10 2018, 12:34 PM
Dzahn added a subscriber: Dzahn.May 17 2018, 4:36 PM

You should now be able to move xhgui over to webperf1002/2002. New VMs have been created and are ready to use. See details in T194390#4212203.

Change 433710 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] webperf: separate permissions from specific apps

https://gerrit.wikimedia.org/r/433710

Dzahn added a comment.May 18 2018, 4:37 PM

@Imarlier @Krinkle I forgot about the shell access part. https://gerrit.wikimedia.org/r/#/c/433738/ added the perfteam admin group to the new VMs. You can now SSH to webperf1002 / webperf2002. I added the "perfteam" admin group but not the eventlogging-admin group so far.

@Imarlier I was just thinking about whether the etcd code is live or not for webperf/navtiming. The puppet invocation hasn't changed, but given navtiming.systemd uses the run_navtiming.py entry point (which reads config.ini), that presumably means the etcd_path/etcd_domain are used in prod. Is that right?

Anyhow, it currently uses MediaWiki's WMFMasterDatacenter key, which seems inappropriate. It certainly seems desirable that webperf/navtiming follows the other services in the event of an overall switch-over, but we should also be able to switch over independently. This matters especially for navtiming, which would otherwise have downtime during server maintenance.

I was discussing this with @Dzahn on IRC just now (who was asking whether it's fine to reboot the server to try enabling IPv6), which made me think it'd be nice if we could just switch over to codfw for a bit, upgrade the eqiad server, then switch back to eqiad to try it out, and switch to codfw again if it doesn't work, instead of being subject to downtime.

Puppet has various inheritance options we can use to make our etcd config automatically follow MediaWiki by default, but with a way to override. In addition, for a major switchover we use orchestration scripts. In those scripts, MediaWiki is merely one among many, each of which is semi-automatically switched together.

This isn't documented, and at the time I reviewed navtiming's etcd code, it made sense as-is, but in retrospect, we probably shouldn't read mediawiki-config/WMFMasterDatacenter from etcd, given no other service does (besides mediawiki).
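
The "follow MediaWiki by default, but with a way to override" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the key/value store is modelled as a plain dict rather than a real etcd client, and the service-specific key name webperf/master_dc is hypothetical. Only the mediawiki-config/WMFMasterDatacenter name comes from the discussion above.

```python
def resolve_master_dc(
    kv,
    service_key="webperf/master_dc",  # hypothetical service-specific key
    default_key="mediawiki-config/WMFMasterDatacenter",
):
    """Return the datacenter a service should treat as active.

    A non-empty service-specific value wins; otherwise the service
    follows the shared default key.
    """
    override = kv.get(service_key)
    if override:
        return override
    return kv[default_key]
```

With this shape, a maintenance window on one eqiad host would only need the service-specific key set to codfw and cleared again afterwards, while a full switch-over changes the shared key.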

Added mapped IPv6 addresses on the interface for webperf1001/2001:

https://gerrit.wikimedia.org/r/#/c/433299/

After applying that and confirming we got the new IPs from Puppet, I added the records in DNS:

https://gerrit.wikimedia.org/r/#/c/434423/

I confirmed connectivity between 1001/2001, 2002/1001, etc. using host and ping6.

There was no server reboot or service restart. Apache was already listening on tcp6 :::80.

This was to ensure that all webperf machines are treated the same. 1002/2002 came with v6 from the start.

If there are any unexpected issues, reverting means reverting the 2 patches above. I used a TTL of 5 min in DNS, just in case, to minimize downtime as requested.

22:50 Krinkle: webperf1001: restart navtiming service, to test with new ipv6 capabilities

The service remained fine after the restart. Depending on the target (Kafka vs statsd), it will either have started using IPv6 without issue, or kept using IPv4 without issue. Kafka hosts have IPv6, but graphite/statsd do not (I think).

The actual network interface and DNS were added about 45min earlier at 22:05, and looking back, right after Puppet enabled the net6 interface, an unexpected outage started for navtiming, which apparently lasted for 10min before it corrected itself.

The logs on webperf1001 don't show any messages relating to navtiming in either the journal or syslog. The service had an uptime of 3 weeks before I restarted it at 22:50.

The burst of data around 22:15 suggests that not all the data was lost during the 10min gap, but that some was somehow buffered or clogged somewhere, and then made its way from webperf1001 to statsd at 22:15. It'd be interesting to know what caused it to stay offline for so long (10min), and where (some of the) data was kept during that time, and whether we can do something to improve that within navtiming.py.

Change 433710 abandoned by Imarlier:
webperf: separate permissions from specific apps

https://gerrit.wikimedia.org/r/433710

Change 433710 restored by Imarlier:
webperf: separate permissions from specific apps

https://gerrit.wikimedia.org/r/433710

@Krinkle I'm fine with creating a different etcd key, as long as there's an easy way to change that key if/when a full data center failover happens. Sounds like that just means a modification to an orchestration script, which is good enough for me.

What I do _not_ want to do is to add anything to puppet in order to do this :-)

Change 433710 merged by Dzahn:
[operations/puppet@production] webperf: Make the different webperf roles explicit

https://gerrit.wikimedia.org/r/433710

Krinkle updated the task description. (Show Details)Jun 4 2018, 9:04 PM

Change 439648 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] Need to install mongodb on xhgui machines

https://gerrit.wikimedia.org/r/439648

Change 439648 merged by Herron:
[operations/puppet@production] Need to install mongodb on xhgui machines

https://gerrit.wikimedia.org/r/439648

Change 440649 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[performance/docroot@master] Make links from php-profiling.html to to /xenon/ relative

https://gerrit.wikimedia.org/r/440649

Change 440649 merged by jenkins-bot:
[performance/docroot@master] Make links from php-profiling.html to to /xenon/ relative

https://gerrit.wikimedia.org/r/440649

Krinkle updated the task description. (Show Details)Jun 26 2018, 10:32 PM

Note to self with regards to the perf-site. The site's Apache configuration (src) currently hardcodes ServerName performance.wikimedia.org as well as the proxy backends for XHGui (tungsten) and Xenon data (mwlog1001).

We should make these parameters of the conf.erb template so that they can point to something else in Beta Cluster. Possibly made optional as well.
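
A small Python sketch of what "parameters, possibly optional" could look like. The real template is conf.erb in Puppet; Python's string.Template stands in here purely for illustration. The /xhgui/ proxy path, the backend hostnames, and the function name are hypothetical, and the actual vhost contains more than is shown.

```python
from string import Template

# Stand-in for the vhost template; the real file is an ERB template.
VHOST = Template("""\
<VirtualHost *:80>
    ServerName $server_name
$proxy_lines
</VirtualHost>
""")


def render_vhost(server_name, xhgui_host=None, xenon_host=None):
    """Render the vhost, emitting a proxy rule only for backends that are set."""
    proxies = []
    if xhgui_host:
        proxies.append(f"    ProxyPass /xhgui/ http://{xhgui_host}/xhgui/")
    if xenon_host:
        proxies.append(f"    ProxyPass /xenon/ http://{xenon_host}/xenon/")
    return VHOST.substitute(
        server_name=server_name, proxy_lines="\n".join(proxies)
    )
```

For example, a Beta Cluster instance could call render_vhost("performance.wikimedia.beta.wmflabs.org", xenon_host="xenon.example") and get a vhost with no XHGui proxy rule at all.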

Change 443752 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Rename webperf profiles for clarity

https://gerrit.wikimedia.org/r/443752

Krinkle updated the task description. (Show Details)Jul 18 2018, 4:15 AM
Krinkle updated the task description. (Show Details)
Krinkle updated the task description. (Show Details)Jul 20 2018, 9:49 PM

Change 443752 merged by Giuseppe Lavagetto:
[operations/puppet@production] webperf: Rename webperf profiles for clarity

https://gerrit.wikimedia.org/r/443752

Change 452689 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Add 'Server: <fqdn>' header to performance.wikimedia.org

https://gerrit.wikimedia.org/r/452689

Krinkle updated the task description. (Show Details)Aug 14 2018, 3:55 PM
Krinkle updated the task description. (Show Details)

Change 452689 abandoned by Krinkle:
webperf: Add 'Server: <fqdn>' header to performance.wikimedia.org

Reason:
Might reconsider at a later point.

https://gerrit.wikimedia.org/r/452689

Krinkle claimed this task.Jan 22 2019, 9:03 PM
Krinkle moved this task from Doing to Backlog: Future Goals on the Performance-Team board.
Krinkle removed Krinkle as the assignee of this task.Jan 24 2019, 6:28 AM

(un-assigning the tracking task to reduce clutter on the workboard, see open sub tasks for current assignees)

Change 503675 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] webperf: Remove arclamp subscriber from mwlog servers

https://gerrit.wikimedia.org/r/503675

Change 503675 merged by Effie Mouzeli:
[operations/puppet@production] webperf: Remove arclamp subscriber from mwlog servers

https://gerrit.wikimedia.org/r/503675

jijiki added a subscriber: jijiki.Apr 23 2019, 1:58 PM

@Krinkle I have stopped and disabled xenon-log, excimer-log, and apache on mwlog* servers, and I have removed the arclamp-generate-svgs cron job. Let me know if there is anything else related to change 503675

Change 530771 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] hieradata: Move beta 'cache::app_directors' from Horizon to Puppet

https://gerrit.wikimedia.org/r/530771

Change 530773 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] hieradata: Add 'performance.wikimedia.beta.wmflabs.org' routing

https://gerrit.wikimedia.org/r/530773