Maniphest T199853

Increase webperf1002/webperf2002 space from 50GB to 150GB (Ganeti)
Closed, ResolvedPublic
Authored By

	Krinkle
	Jul 18 2018, 2:47 AM

Description

Follow-up to T194390: EQIAD & CODFW: 1 VM in each data center for xhprof/xhgui/other profiling tools.

The VM request for webperf1002 was modelled after webperf1001 (T179036), however we overlooked their difference in role, specifically in terms of storage. webperf1001 has virtually no need for persistent storage (besides OS and installed packages) as it gets and sends all its data elsewhere (Statsd, Graphite, Prometheus, Kafka).

However, webperf1002 will be hosting two applications that both need persistent storage.

XHGui

XHGui stores web request profiles from X-Wikimedia-Debug in MongoDB. XHGui currently runs on tungsten (to be moved to webperf-02, T180761). On tungsten, XHGui's MongoDB has a 1.5TB drive available.

When we still profiled 1:10,000 production requests with XHProf (stored in XHGui) it created 1.4 TB worth of profiles in 13 months (Feb 2016 - March 2017; T161196: tungsten is out of space on /srv). However, we no longer profile production traffic with XHProf, and focus solely on Xenon instead for that purpose.

We do still use XHGui for storing and visualising profiles from X-Wikimedia-Debug requests. These should be stored indefinitely and can be accessed from a permalink. Given we had to do a hard reset following T161196, the current usage on tungsten only reflects space allocated from March 2017 to now (June 2018). In these 14 months, we've created 14 GB worth of data.

krinkle@tungsten:~$ df -h
Filesystem             Size  Used  | Mounted on
/dev/sda1               37G  3.5G  | /
/dev/mapper/tank-data  1.6T   14G  | /srv

If we plan for the next 5 years, and account for 2x usage by developers, we should give it at least 150 GB to work with.
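The 150 GB figure can be reproduced with a rough back-of-the-envelope calculation. The sketch below assumes linear growth at the rate observed on tungsten (14 GB in 14 months) and reads "2x usage" as a doubling of that rate; the function name and parameters are illustrative, not part of the original estimate.

```python
def projected_xhgui_gb(current_gb, observed_gb, observed_months, years, usage_factor):
    """Estimate future disk need, assuming linear growth (an assumption)."""
    monthly_gb = observed_gb / observed_months        # about 1 GB per month on tungsten
    growth_gb = monthly_gb * usage_factor * years * 12
    return current_gb + growth_gb


estimate = projected_xhgui_gb(
    current_gb=14, observed_gb=14, observed_months=14, years=5, usage_factor=2
)
print(f"{estimate:.0f} GB")  # prints "134 GB"
```

Under these assumptions the estimate is 134 GB, which 150 GB covers with some headroom.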

ArcLamp

ArcLamp creates and stores Xenon log files and flame graphs. It currently runs on mwlog1001 (to be moved to webperf-02, T195312). On mwlog1001 ArcLamp writes to /srv/ which is a 7 TB drive of which 4 TB is used.

Currently we use about 25 GB of that:

krinkle at mwlog1001.eqiad.wmnet in ~
$ du -sh /srv/xenon/
24G     /srv/xenon/
$ du -sh /srv/xenon/*/*
21G     /srv/xenon/logs/daily
3.1G    /srv/xenon/logs/hourly
59M     /srv/xenon/svgs/daily
153M    /srv/xenon/svgs/hourly

Unlike XHGui, Xenon daily data is currently rotated after 90 days, and Xenon hourly data is rotated after 14 days. As such, its disk usage is fairly stable. It only increases when there are more MediaWiki entry points (such as the recently added rpc entry point for RunJobs), and when run-time sees deeper or more diverse call stacks. See also T166624.
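As an illustration of this kind of age-based rotation, here is a minimal sketch that selects files past their retention period. It is not how ArcLamp actually prunes its files; the use of file modification time and the `RETENTION_DAYS` mapping are assumptions made for the example.

```python
import os
import time

# Retention periods from the description above (in days).
RETENTION_DAYS = {"daily": 90, "hourly": 14}


def expired_files(directory, max_age_days, now=None):
    """Return sorted paths of regular files whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    expired = []
    for entry in os.scandir(directory):
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            expired.append(entry.path)
    return sorted(expired)


# Hypothetical usage (paths follow the du output above):
# for period, days in RETENTION_DAYS.items():
#     for path in expired_files(f"/srv/xenon/logs/{period}", days):
#         os.remove(path)
```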

I'd like to increase retention of profiles, and keep hourly data for 90 days, and daily data for 1-2 years. Using the above as basis, that would require:

daily: 22 GB per 90d = 179 GB per 2 years.
hourly: 4 GB per 14d = 26 GB per 90 days.

That brings the projected need to a total of 355 GB (150 + 179 + 26).
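The projection above scales current usage linearly with the retention period and rounds up to whole gigabytes. A small sketch of that arithmetic, assuming 2 years means 730 days (the helper name is illustrative):

```python
import math


def scale_retention(size_gb, current_days, target_days):
    """Scale disk usage linearly with retention, rounded up to whole GB."""
    return math.ceil(size_gb * target_days / current_days)


daily_gb = scale_retention(22, current_days=90, target_days=2 * 365)
hourly_gb = scale_retention(4, current_days=14, target_days=90)
xhgui_gb = 150  # XHGui estimate from the section above

total_gb = xhgui_gb + daily_gb + hourly_gb
print(daily_gb, hourly_gb, total_gb)  # prints "179 26 355"
```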

The webperf-02 VMs currently have 50 GB disk space. This request is to expand that to 500 GB.

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• dpifke	T158837 Consolidate performance website and related software
		Resolved		akosiaris	T199853 Increase webperf1002/webperf2002 space from 50GB to 150GB (Ganeti)

Event Timeline

Krinkle created this task. Jul 18 2018, 2:47 AM

Restricted Application added a subscriber: Aklapper. Jul 18 2018, 2:47 AM

Krinkle added a parent task: T158837: Consolidate performance website and related software. Jul 18 2018, 2:47 AM

It might be worth considering hardware for this purpose. Most Ganeti hosts have roughly 1T free disk space, and afaik with DRBD replica overhead 500G becomes 1T (2x500G volumes) per VM. Looping in @akosiaris

Dedicated hardware makes me so sad :-(

Is there any sort of shared storage option (Swift or otherwise)? We could also use public cloud storage since there's no PII in these files, but right now those options aren't accessible from these hosts.

Looping in @fgiunchedi re: swift

Swift could be an option since data size isn't huge. In that case I would recommend:

write path through a swift client talking https, possibly uploading to both codfw/eqiad if so desired.
read path through reverse proxy on webperf that talks to swift. In the read case we can skip swift auth and make containers public for reads.

Maintenance-wise you'd need to take care of creating the swift containers as needed and cleaning up the objects inside as required.

@fgiunchedi Cool, that sounds good.

Regarding storage use and scaling, I assume it gets distributed among different backend servers as needed without needing to worry about a single "directory" reaching a certain size, right?

About creating containers, is that something I can/should do? Or should I file a request? Looking at https://wikitech.wikimedia.org/wiki/Swift and https://wikitech.wikimedia.org/wiki/Swift/How_To, the documentation looks quite good and comprehensive, although it does claim to be outdated.

For the swift client and proxy, is there a recommended package we should use? Python seems like a natural fit, but I'm open to other preferences.

In T199853#4440040, @Krinkle wrote:

@fgiunchedi Cool, that sounds good.

Regarding storage use and scaling, I assume it gets distributed among different backend servers as needed without needing to worry about a single "directory" reaching a certain size, right?

Correct, for "big" containers like commons we shard the containers to avoid too much contention. We're talking about tens/hundreds of thousands of objects though, so in your case it isn't going to be a problem.

About creating containers, is that something I can/should do? Or should I file a request? Looking at https://wikitech.wikimedia.org/wiki/Swift and https://wikitech.wikimedia.org/wiki/Swift/How_To, the documentation looks quite good and comprehensive, although it does claim to be outdated.

Indeed the documentation needs updating :( please file a request for a new swift account under SRE-swift-storage and I'll give that account container creation privileges too.

For the swift client and proxy, is there a recommended package we should use? Python seems like a natural fit, but I'm open to other preferences.

The swift python client is indeed going to be a natural fit. I've also used the swift commandline client, which is based on the python client, and it worked well.
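For illustration, the write path and public-read setup described earlier in this thread could look roughly like the sketch below. It assumes connection objects that expose the same `put_object()` and `post_container()` methods as `swiftclient.client.Connection` from python-swiftclient; the cluster labels, container name, and helper names are hypothetical, and nothing here reflects an actual deployment.

```python
def upload_to_clusters(connections, container, name, data,
                       content_type="image/svg+xml"):
    """Write path: upload one object to every configured Swift cluster.

    `connections` maps a cluster label (e.g. "eqiad", "codfw") to an object
    with the put_object() method of swiftclient.client.Connection.
    Returns the value each put_object() call returned (the ETag in swiftclient).
    """
    etags = {}
    for label, conn in connections.items():
        etags[label] = conn.put_object(
            container, name, contents=data, content_type=content_type
        )
    return etags


def make_public_for_reads(conn, container):
    """Allow anonymous reads so a reverse proxy can skip Swift auth."""
    conn.post_container(container, headers={"X-Container-Read": ".r:*"})


# Hypothetical usage with python-swiftclient (auth URL, user and key elided):
# from swiftclient.client import Connection
# conns = {"eqiad": Connection(authurl=..., user=..., key=...)}
# make_public_for_reads(conns["eqiad"], "arclamp-svgs")
# upload_to_clusters(conns, "arclamp-svgs", "daily/2018-07-20.all.svg", svg_bytes)
```

Passing the connections in as a mapping keeps the dual-datacenter upload optional: a single-entry mapping uploads to one cluster only.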

@herron That would reduce this request to needing ~150 GB (for XHGui's Mongo). Is that doable?

I'll slice the Swift support for ArcLamp into a separate task, which in turn will block the task for increasing retention of Xenon files.

Krinkle mentioned this in T200108: Increase retention of ArcLamp SVGs to 2 years. Jul 20 2018, 9:41 PM

Krinkle renamed this task from Increase webperf1002/webperf2002 space from 50GB to 500 GB (Ganeti) to Increase webperf1002/webperf2002 space from 50GB to 150GB (Ganeti). Jul 23 2018, 4:09 PM

• Imarlier moved this task from Inbox, needs triage to Radar on the Performance-Team board. Jul 23 2018, 8:07 PM

• Imarlier edited projects, added Performance-Team (Radar); removed Performance-Team.

In T199853#4442280, @Krinkle wrote:

@herron That would reduce this request to needing ~150 GB (for XHGui's Mongo). Is that doable?

Yes, that's totally doable. The best path forward is probably to add 1 new disk to each VM (webperf1002/webperf2002) sized at 150GB, then format and mount it under some path (/srv/xhprof or similar?). Would the above work for you? Alternatively we can resize the root disk, but it's a bit more work and carries some minor risk as it is a manual process.

herron claimed this task. Jul 25 2018, 2:36 PM

herron added a project: User-herron.

herron moved this task from Backlog to Awaiting Input/Review on the User-herron board.

@herron yep, that would work fine.

herron moved this task from Awaiting Input/Review to Working on on the User-herron board. Jul 25 2018, 2:40 PM

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board. Jul 25 2018, 4:54 PM

Mentioned in SAL (#wikimedia-operations) [2018-07-26T14:24:40Z] <akosiaris> sudo gnt-instance modify --disk add:size=150G webperf2002 T199853

Mentioned in SAL (#wikimedia-operations) [2018-07-26T14:24:45Z] <akosiaris> sudo gnt-instance modify --disk add:size=150G webperf1002 T199853

herron reassigned this task from herron to akosiaris. Jul 26 2018, 2:29 PM

herron removed a project: User-herron.

Mentioned in SAL (#wikimedia-operations) [2018-07-27T05:59:20Z] <akosiaris> reboot webperf1002, webperf2002 for new disk to appear T199853

On both webperf1002 and webperf2002 we have

/dev/vdb 147G 331M 139G 1% /srv

I've mounted the new space under /srv and moved the mongodb datadir that was there before onto the newly created and mounted disk.

I'll close this as resolved, feel free to reopen if anything else is required.

For the record, if https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/472032/ is merged, the space occupied by xenon logs will increase (up to 2x).

Xenon occupies a mostly constant amount of space, but that amount may double from 25G up to 50G. This will essentially chip away at the space reserved for XHGui, which I estimated (in the task description) as being able to fit the current and next 5 years of data. With this change in estimate for Xenon, it would accommodate about 4 years (100G/2G per month) instead.
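The revised runway can be checked with a short calculation. This sketch assumes the 100G is the 150G disk minus the 50G that Xenon may occupy, and that XHGui grows at 2G per month as stated; the function name is illustrative.

```python
def months_of_runway(disk_gb, reserved_gb, growth_gb_per_month):
    """Months until the space left after the reservation is used up."""
    return (disk_gb - reserved_gb) / growth_gb_per_month


months = months_of_runway(disk_gb=150, reserved_gb=50, growth_gb_per_month=2)
print(months, round(months / 12, 1))  # prints "50.0 4.2"
```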

In T199853#4727463, @Krinkle wrote:

For the record, if https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/472032/ is merged, the space occupied by xenon logs will increase (up to 2x).

Xenon occupies a mostly constant amount of space, but that amount may double from 25G up to 50G. This will essentially chip away at the space reserved for XHGui, which I estimated (in the task description) as being able to fit the current and next 5 years of data. With this change in estimate for Xenon, it would accommodate about 4 years (100G/2G per month) instead.

Good point, and thanks for pointing that out. We can easily add more space to that disk when the need arises, so it's not a problem AFAICT.

Krinkle mentioned this in T227026: Deploy ArcLamp process as stateless/scalable service (Kubernetes). Jul 1 2019, 7:31 PM

Krinkle mentioned this in T235425: webperf*002 running out of disk space (arc lamp, xhgui). Oct 14 2019, 3:47 PM

Krinkle mentioned this in T235455: Resolve arclamp disk exhaustion problem (Oct 2019). Oct 14 2019, 10:30 PM

Krinkle mentioned this in T238098: vm request for xhgui. Nov 12 2019, 4:05 PM

Krinkle mentioned this in T254795: Database for XHGui profiles. Jun 8 2020, 9:06 PM