
Increase webperf1002/webperf2002 space from 50GB to 150GB (Ganeti)
Closed, Resolved. Public.

Description

Follows-up T194390: EQIAD & CODFW: 1 VM in each data center for xhprof/xhgui/other profiling tools.

The VM request for webperf1002 was modelled after webperf1001 (T179036), but we overlooked the difference in their roles, specifically in terms of storage. webperf1001 has virtually no need for persistent storage (besides the OS and installed packages), as it receives all its data from elsewhere and sends it on elsewhere (Statsd, Graphite, Prometheus, Kafka).

However, webperf1002 will be hosting two applications that both need persistent storage.

XHGui

XHGui stores web request profiles from X-Wikimedia-Debug in MongoDB. XHGui currently runs on tungsten (to be moved to webperf-02, T180761). On tungsten, XHGui's MongoDB has a 1.5TB drive available.

When we still profiled 1:10,000 production requests with XHProf (stored in XHGui) it created 1.4 TB worth of profiles in 13 months (Feb 2016 - March 2017; T161196: tungsten is out of space on /srv). However, we no longer profile production traffic with XHProf, and focus solely on Xenon instead for that purpose.

We do still use XHGui for storing and visualising profiles from X-Wikimedia-Debug requests. These should be stored indefinitely and can be accessed from a permalink. Given we had to do a hard reset following T161196, the current usage on tungsten only reflects space allocated from March 2017 to now (June 2018). In these 14 months, we've created 14 GB worth of data.

krinkle@tungsten:~$ df -h
Filesystem             Size  Used  | Mounted on
/dev/sda1               37G  3.5G  | /
/dev/mapper/tank-data  1.6T   14G  | /srv

If we plan for the next 5 years, and account for 2x usage by developers, we should give it at least 150 GB to work with.
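One way to arrive at that figure is sketched below. It assumes storage keeps growing linearly at the rate observed on tungsten, and it reads "2x usage" as a doubling of that rate; both are assumptions, and the variable names are illustrative only.

```python
# Rough capacity estimate for XHGui's MongoDB data.
# Assumption: linear growth at the rate observed on tungsten.
observed_gb = 14          # data created since the March 2017 reset
observed_months = 14      # period quoted above
growth_factor = 2         # assumed reading of "2x usage by developers"
horizon_months = 5 * 12   # plan for the next 5 years

rate_gb_per_month = observed_gb / observed_months
future_gb = rate_gb_per_month * growth_factor * horizon_months
total_gb = observed_gb + future_gb

print(f"{total_gb:.0f} GB projected")  # prints "134 GB projected"
```

Rounding the projected 134 GB up to 150 GB leaves some headroom for MongoDB overhead and estimation error.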

ArcLamp

ArcLamp creates and stores Xenon log files and flame graphs. It currently runs on mwlog1001 (to be moved to webperf-02, T195312). On mwlog1001 ArcLamp writes to /srv/ which is a 7 TB drive of which 4 TB is used.

Currently we use about 25 GB of that:

krinkle at mwlog1001.eqiad.wmnet in ~
$ du -sh /srv/xenon/
24G     /srv/xenon/
$ du -sh /srv/xenon/*/*
21G     /srv/xenon/logs/daily
3.1G    /srv/xenon/logs/hourly
59M     /srv/xenon/svgs/daily
153M    /srv/xenon/svgs/hourly

Unlike XHGui, Xenon daily data is currently rotated after 90 days, and Xenon hourly data is rotated after 14 days. As such, its disk usage is fairly stable. It only increases when there are more MediaWiki entry points (such as the recently added rpc entry point for RunJobs), and when the run-time sees deeper or more diverse call stacks. See also T166624.

I'd like to increase retention of profiles, and keep hourly data for 90 days, and daily data for 1-2 years. Using the above as basis, that would require:

  • daily: 22 GB per 90d = 179 GB per 2 years.
  • hourly: 4 GB per 14d = 26 GB per 90 days.

That brings the projected need to a total of 355 GB (150 + 179 + 26).
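The projection can be reproduced with a small sketch, assuming disk usage scales linearly with the retention period and rounding each estimate up to a whole GB. The helper name `scale_retention` is made up for this illustration.

```python
import math

def scale_retention(size_gb, current_days, target_days):
    """Scale observed disk usage linearly to a longer retention period, rounding up."""
    return math.ceil(size_gb * target_days / current_days)

daily_gb = scale_retention(22, 90, 2 * 365)   # daily data kept for 2 years
hourly_gb = scale_retention(4, 14, 90)        # hourly data kept for 90 days
xhgui_gb = 150                                # XHGui estimate from the section above

total_gb = xhgui_gb + daily_gb + hourly_gb
print(daily_gb, hourly_gb, total_gb)  # prints "179 26 355"
```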

The webperf-02 VMs currently have 50 GB disk space. This request is to expand that to 500 GB.

Event Timeline

herron triaged this task as Medium priority. Jul 18 2018, 6:06 PM
herron added subscribers: akosiaris, herron.

It might be worth considering hardware for this purpose. Most Ganeti hosts have roughly 1T free disk space, and afaik with DRBD replica overhead 500G becomes 1T (2x500G volumes) per VM. Looping in @akosiaris

Dedicated hardware makes me so sad :-(

Is there any sort of shared storage option (Swift or otherwise)? We could also use public cloud storage since there's no PII in these files, but right now those options aren't accessible from these hosts.

Swift could be an option since data size isn't huge. In that case I would recommend:

  • write path through a swift client talking https, possibly uploading to both codfw/eqiad if so desired.
  • read path through reverse proxy on webperf that talks to swift. In the read case we can skip swift auth and make containers public for reads.

Maintenance-wise, you'd need to take care of creating the swift containers as needed and cleaning up the objects inside them as required.
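The write path could look roughly like the sketch below. It is not an existing ArcLamp feature: the function name, site labels, and container name are hypothetical. The connection objects are assumed to behave like `swiftclient.client.Connection` from python-swiftclient, of which only `put_object` is used here.

```python
def upload_everywhere(connections, container, name, data,
                      content_type="application/octet-stream"):
    """Upload one object to every Swift cluster; return the sites that failed.

    `connections` maps a site label (e.g. "eqiad") to an object exposing
    put_object(container, obj, contents=..., content_type=...), which is
    the call offered by swiftclient.client.Connection.
    """
    failed = []
    for site, conn in connections.items():
        try:
            conn.put_object(container, name, contents=data,
                            content_type=content_type)
        except Exception:
            # Keep going so that one unavailable cluster does not block the other.
            failed.append(site)
    return failed
```

A caller would build one connection per data center and retry the sites returned in the failure list. For the read path, public reads on a container are normally granted in Swift by setting its `X-Container-Read` ACL to `.r:*`, after which the reverse proxy can fetch objects without authenticating.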

@fgiunchedi Cool, that sounds good.

Regarding storage use and scaling, I assume it gets distributed among different backend servers as needed without needing to worry about a single "directory" reaching a certain size, right?

About creating containers, is that something I can/should do? Or should I file a request? Looking at https://wikitech.wikimedia.org/wiki/Swift and https://wikitech.wikimedia.org/wiki/Swift/How_To, the documentation looks quite good and comprehensive, although it does claim to be outdated.

For the swift client and proxy, is there a recommended package we should use? Python seems like a natural fit, but I'm open to other preferences.

Regarding storage use and scaling, I assume it gets distributed among different backend servers as needed without needing to worry about a single "directory" reaching a certain size, right?

Correct. For "big" containers like commons we shard the containers to avoid too much contention. We're talking about tens/hundreds of thousands of objects though, so in your case it isn't going to be a problem.

About creating containers, is that something I can/should do? Or should I file a request? Looking at https://wikitech.wikimedia.org/wiki/Swift and https://wikitech.wikimedia.org/wiki/Swift/How_To, the documentation looks quite good and comprehensive, although it does claim to be outdated.

Indeed, the documentation needs updating :( Please file a request for a new swift account under SRE-swift-storage and I'll give that account container creation privileges too.

For the swift client and proxy, is there a recommended package we should use? Python seems like a natural fit, but I'm open to other preferences.

The swift python client is indeed going to be a natural fit. I've also used the swift command-line client, which is based on the python client, and it worked well.

@herron That would reduce this request to needing ~150 GB (for XHGui's Mongo). Is that doable?

I'll slice the Swift support for ArcLamp into a separate task, which in turn will block the task for increasing retention of Xenon files.

Krinkle renamed this task from Increase webperf1002/webperf2002 space from 50GB to 500 GB (Ganeti) to Increase webperf1002/webperf2002 space from 50GB to 150GB (Ganeti). Jul 23 2018, 4:09 PM

@herron That would reduce this request to needing ~150 GB (for XHGui's Mongo). Is that doable?

Yes, that's totally doable. The best path forward is probably to add 1 new disk to each VM (webperf1002/webperf2002) sized at 150GB, then format and mount it under some path (/srv/xhprof or similar?). Would the above work for you? Alternatively we can resize the root disk, but it's a bit more work and carries some minor risk as it is a manual process.

herron added a project: User-herron.
herron moved this task from Backlog to Awaiting Input/Review on the User-herron board.

Mentioned in SAL (#wikimedia-operations) [2018-07-26T14:24:40Z] <akosiaris> sudo gnt-instance modify --disk add:size=150G webperf2002 T199853

Mentioned in SAL (#wikimedia-operations) [2018-07-26T14:24:45Z] <akosiaris> sudo gnt-instance modify --disk add:size=150G webperf1002 T199853

Mentioned in SAL (#wikimedia-operations) [2018-07-27T05:59:20Z] <akosiaris> reboot webperf1002, webperf2002 for new disk to appear T199853

On both webperf1002 and webperf2002 we have

/dev/vdb 147G 331M 139G 1% /srv

I've mounted the new space under /srv and moved the mongodb datadir that was there before onto the newly created and mounted disk.

I'll close this as resolved; feel free to reopen if anything else is required.

For the record, if https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/472032/ is merged, the space occupied by xenon logs will increase (up to 2x).

Xenon occupies a mostly constant amount of space, but that amount may double from 25G up to 50G. This will essentially chip away at the space reserved for XHGui, which I estimated (in the task description) as being able to fit current data plus the next 5 years of data. With this change in the estimate for Xenon, that space would instead accommodate about 4 years (100G at 2G per month).
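The revised figure follows from a quick calculation, sketched here on the assumption that Xenon and XHGui share the 150G disk and that XHGui grows at the 2G per month planning rate quoted above. The variable names are illustrative.

```python
disk_gb = 150                  # size of the disk added to each VM
xenon_gb = 50                  # Xenon usage after doubling from 25G
xhgui_rate_gb_per_month = 2    # planning rate quoted above

remaining_gb = disk_gb - xenon_gb
months = remaining_gb / xhgui_rate_gb_per_month
print(f"{months:.0f} months, about {months / 12:.1f} years")
# prints "50 months, about 4.2 years"
```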

Good point, and thanks for pointing that out. We can easily add more space to that disk when the need arises, so it's not a problem AFAICT.