Follow-up to T194390: EQIAD & CODFW: 1 VM in each data center for xhprof/xhgui/other profiling tools.
The VM request for webperf1002 was modelled after webperf1001 (T179036), but we overlooked the difference in their roles, specifically in terms of storage. webperf1001 has virtually no need for persistent storage (besides the OS and installed packages) as it receives all its data from, and sends it to, other services (Statsd, Graphite, Prometheus, Kafka).
However, webperf1002 will be hosting two applications that both need persistent storage.
XHGui stores web request profiles from X-Wikimedia-Debug in MongoDB. XHGui currently runs on tungsten (to be moved to webperf-02, T180761). On tungsten, XHGui's MongoDB has a 1.5TB drive available.
When we still profiled 1:10,000 production requests with XHProf (stored in XHGui) it created 1.4 TB worth of profiles in 13 months (Feb 2016 - March 2017; T161196: tungsten is out of space on /srv). However, we no longer profile production traffic with XHProf, and focus solely on Xenon instead for that purpose.
We do still use XHGui for storing and visualising profiles from X-Wikimedia-Debug requests. These should be stored indefinitely and can be accessed from a permalink. Given we had to do a hard reset following T161196, the current usage on tungsten only reflects space allocated from March 2017 to now (June 2018). In these 14 months, we've created 14 GB worth of data.
krinkle@tungsten:~$ df -h
Filesystem             Size  Used | Mounted on
/dev/sda1               37G  3.5G | /
/dev/mapper/tank-data  1.6T   14G | /srv
If we plan for the next 5 years, and account for 2x usage by developers, we should give it at least 150 GB to work with.
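The 150 GB figure can be sanity-checked with a rough projection. The sketch below assumes linear growth at the rate observed on tungsten (14 GB over 14 months) and adds the data already stored. The function name and the linear model are illustrative assumptions, not part of the original estimate.

```python
def project_xhgui_gb(used_gb, months_observed, years_ahead, usage_factor):
    """Linear projection of XHGui storage in GB (hypothetical helper)."""
    # Observed growth rate, in GB per month.
    monthly_rate = used_gb / months_observed
    # Future growth over the planning period, scaled by the expected usage increase.
    growth = monthly_rate * 12 * years_ahead * usage_factor
    # Keep existing profiles, since they are meant to be stored indefinitely.
    return used_gb + growth


estimate = project_xhgui_gb(used_gb=14, months_observed=14, years_ahead=5, usage_factor=2)
print(f"{estimate:.0f} GB")  # prints "134 GB"
```

Under these assumptions the projection lands at about 134 GB, so 150 GB leaves a small margin.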
ArcLamp creates and stores Xenon log files and flame graphs. It currently runs on mwlog1001 (to be moved to webperf-02, T195312). On mwlog1001 ArcLamp writes to /srv/ which is a 7 TB drive of which 4 TB is used.
Currently we use about 25 GB of that:
krinkle at mwlog1001.eqiad.wmnet in ~
$ du -sh /srv/xenon/
24G   /srv/xenon/
$ du -sh /srv/xenon/*/*
21G   /srv/xenon/logs/daily
3.1G  /srv/xenon/logs/hourly
59M   /srv/xenon/svgs/daily
153M  /srv/xenon/svgs/hourly
Unlike XHGui, Xenon daily data is currently rotated after 90 days, and Xenon hourly data is rotated after 14 days. As such, its disk usage is fairly stable. It only increases when there are more MediaWiki entry points (such as the recently added rpc entry point for RunJobs), and when the run-time sees deeper or more diverse call stacks. See also T166624.
I'd like to increase retention of profiles, and keep hourly data for 90 days, and daily data for 1-2 years. Using the above as basis, that would require:
- daily: 22 GB per 90 days ≈ 179 GB per 2 years.
- hourly: 4 GB per 14 days ≈ 26 GB per 90 days.
That brings the projected need to a total of 355 GB (150 + 179 + 26).
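The retention figures above can be reproduced with a short calculation. This is a minimal sketch that assumes disk usage scales linearly with the retention period, takes 2 years as 730 days, and rounds up to whole GB. The helper name is hypothetical.

```python
import math


def scale_retention(size_gb, current_days, target_days):
    """Scale current usage to a longer retention period, rounded up to whole GB."""
    return math.ceil(size_gb * target_days / current_days)


xhgui_gb = 150                           # XHGui allowance from the estimate above
daily_gb = scale_retention(22, 90, 730)  # daily data: 90 days -> 2 years
hourly_gb = scale_retention(4, 14, 90)   # hourly data: 14 days -> 90 days

total_gb = xhgui_gb + daily_gb + hourly_gb
print(daily_gb, hourly_gb, total_gb)  # prints "179 26 355"
```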
The webperf-02 VMs currently have 50 GB disk space. This request is to expand that to 500 GB.