
determine hardware needs for dumps in eqiad and codfw
Closed, Resolved · Public

Description

Need an overview of everything that's needed in both dcs, with current dumps architecture.

Event Timeline


Initial thoughts: we have dataset1001 with ms1001 as a fallback for our web server and for storage. I think that's fine for the next year; if one of those fails we'll want to invest in a new server with arrays.

We should allocate funds for two new snapshot hosts; I would like them both to look like snapshot1001 (32 cores, 64GB ram). I would then give back two of the smaller snapshots (1002, 1004) to be reclaimed for other use. This would make sure we get all wikis that aren't en wiki completed by the 15th of the month for the first monthly run, and cover us in case snapshot1001 itself dies, so that en wiki dumps could continue to run. The last snapshot host (1003) would be kept around for cron jobs; if it were to die, any spare available would probably suffice to replace it.

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board.
ArielGlenn added a subscriber: mark.

I've claimed and added this to hardware-requests. With the details provided by @ArielGlenn, I'll request a quote for two snapshot hosts from both vendors and link off this task.

Any info from those vendors yet?

Just checking in to see if the vendors have come back to us yet.

Sorry, this fell off my radar, and it shouldn't have.

I'm picking this back up now, and I'll be generating sub-tasks for quote generation from vendors.

RobH mentioned this in Unknown Object (Task). Dec 2 2015, 8:39 PM

Initial thoughts: we have dataset1001 with ms1001 as a fallback for our web server and for storage. I think that's fine for the next year; if one of those fails we'll want to invest in a new server with arrays.

We should allocate funds for two new snapshot hosts; I would like them both to look like snapshot1001 (32 cores, 64GB ram). I would then give back two of the smaller snapshots (1002, 1004) to be reclaimed for other use. This would make sure we get all wikis that aren't en wiki completed by the 15th of the month for the first monthly run, and cover us in case snapshot1001 itself dies, so that en wiki dumps could continue to run. The last snapshot host (1003) would be kept around for cron jobs; if it were to die, any spare available would probably suffice to replace it.

@ArielGlenn: snapshot1001-1004 are all 4+ years old now, well out of warranty, so they sure won't be reused for anything, and should be replaced soon. Why do you only want to replace 2 of the 4?

I will happily replace 3 if we want to do that now. I figured I would wait to replace snapshot1001 til it fell over; the two new replacements plus snapshot1003 will cover our needs if it does fall over. If snapshot1003 falls over on the other hand, one of the new hosts will be able to pick up its load as they have more cores.

@ArielGlenn: 3? or 4?

Could you please outline what you need, in an ideal situation, if we were starting from scratch today and would order all new? Then we will figure out what to do with the (pretty old!) equipment we already have after that. We try to have all hardware replaced by the time it's 5 years old.

If I were starting from scratch I would ask for 3 boxes all alike, as quoted by RobH, and I would give back all the old snapshot hosts. We don't need 4 of them because 3 out of 4 of the old snapshot hosts were wimpy in comparison to these.

(After chat with mark on IRC) Currently we have 4 snapshot hosts: one is dedicated to en wp dumps and has 32 cores; the other three have 8 cores each, with one of them running all the dump-related jobs scheduled out of cron and the other two running dumps for the rest of the wikis.

With the three replacement hosts all having 32 cores, any one of them may run en wp, the other two can run the rest of the wikis, and either would be able to run the cron jobs in addition. If we have hardware issues with one of them, the remaining two will still be enough to get a full dump run done in a month while the third box is being fixed, and a partial run for the second run of the month as long as we don't dump en wp on the second run.

mark removed RobH as the assignee of this task. Dec 16 2015, 12:19 PM
mark added a subscriber: RobH.

Alright, let's get a quote for 3 new boxes then.

RobH raised the priority of this task from Medium to High. Dec 21 2015, 5:58 PM

Please note the quote for 3 new boxes is pending @mark's approval on task T120126

RobH lowered the priority of this task from High to Medium. Jan 6 2016, 6:37 PM

See also T123094 (replacing dataset servers since they are out of warranty).

@RobH: I've been requested to provide a list of all hardware needs in both codfw and eqiad, both snapshot producers and data servers, since this is the only way budgetary decisions can be made.

In one dc we should have a data server for redundancy's sake. It should have at least 55T of storage when configured for raid 10, at least 8 cores, 64GB RAM and a 10G network card. (Need to check these memory and cpu requirements with @mark.) Note that we are currently running with raid 6 but this will be changed for performance reasons.
In the other dc we should have an identical data server, along with 3 snapshot producers. These snapshot producers should have a minimum of 32 cores and 64GB RAM as in the quote above.

EDIT: after talking with Mark, he suggests that we ask for 16 cores and quotes for 64GB and 128GB RAM, looking for the $ figure where we really don't save much by reducing the specs.

This assumes that if the data server in use by the snapshot producers dies, we can live without dump production until it comes back on line. If dumps are deemed critical enough that this is not the case, then we will need 2 data servers in the same dc as the snapshot hosts.

It doesn't matter to me which dc hosts the snapshot producers; they can be in the secondary dc as long as the databases in that dc remain up to date. We've done that before with dumps running out of Tampa while Eqiad was the primary dc. If we run dumps out of Eqiad, one of the MD1200's on the current data server is still in warranty and would be reused.
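As a rough sanity check on the "at least 55T when configured for raid 10" figure above, here's a small arithmetic sketch; the disk count and size are hypothetical examples, not numbers from an actual quote.

```
# Rough usable-capacity arithmetic for the data server spec above.
# Disk count and size are hypothetical examples, not from a quote.

def usable_tb_raid10(disks: int, disk_tb: float) -> float:
    """RAID 10 mirrors every disk, so usable space is half the raw total."""
    return disks * disk_tb / 2

def usable_tb_raid6(disks: int, disk_tb: float) -> float:
    """RAID 6 gives up two disks' worth of capacity to parity."""
    return (disks - 2) * disk_tb

if __name__ == "__main__":
    disks, disk_tb = 24, 6.0  # e.g. two 12-bay shelves of 6TB drives (hypothetical)
    print(f"raw:    {disks * disk_tb:.0f}T")
    print(f"raid10: {usable_tb_raid10(disks, disk_tb):.0f}T")  # 72T, clears the 55T floor
    print(f"raid6:  {usable_tb_raid6(disks, disk_tb):.0f}T")   # 132T, but slower writes
```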

ArielGlenn renamed this task from determine hardware needs for dumps in eqiad (boxes out of warranty, capacity planning) to determine hardware needs for dumps in eqiad and codfw. Feb 1 2016, 9:31 AM
ArielGlenn updated the task description.

When figuring out memory and cpu needs for the dataset servers, we should keep in mind:

the dataset host has the following going on at any time:

  • one or more rsyncs pulling from it by third party mirror sites
  • rsyncing from one or more internal servers to pick up files for download, such as pagecounts
  • rsyncing to replica (ms1001) or copying dumps to labstore
  • a cron job from one of the snapshots, writing output via nfs
  • web service of datasets to the world:
    • files downloaded range from a few kb (small sql tables for small wikis) to 93GB (de wiki content file)
    • on any given day there are XX bytes (to be filled in!) transferred to downloaders
  • stats1002 and 1003 have the data filesystem mounted via nfs, don't know how they are using it
  • snapshot producing hosts have the data filesystem mounted via nfs, they read and write to it:
    • mysql table dumps are sequential writes to disk
    • stub xml dumps are sequential writes to disk
    • page content xml dumps are sequential reads at one spot and sequential writes to another

@paravoid, I'm adding you to this because you had volunteered your help and know-how earlier in making our dumps download setup sufficient for our downloaders without needing to use external mirrors. Could you weigh in on the hardware needs for the dataset servers, since we need to refresh them? If there are stats you are missing about the downloading, let me know. We're running with a severe bandwidth cap and limiting to two connections per IP for the moment.
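One way to fill in the "XX bytes transferred per day" figure in the list above would be to sum response sizes out of the web server's access logs. Here's a minimal sketch, assuming standard combined-format logs are available on the dataset host; the log path and format are assumptions for illustration, not a description of the actual setup.

```
# Sketch: per-day bytes served to downloaders, summed from a combined-format
# access log. Log location and format are assumptions for illustration only.
import re
import sys
from collections import defaultdict

# e.g. ... [02/Feb/2016:10:15:32 +0000] "GET /enwiki/... HTTP/1.1" 200 12345 ...
LINE_RE = re.compile(r'\[(\d{2}/\w{3}/\d{4}):[^\]]*\] "[^"]*" (\d{3}) (\d+|-)')

def daily_totals(path):
    totals = defaultdict(int)
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = LINE_RE.search(line)
            if not m:
                continue
            day, status, size = m.groups()
            if status.startswith("2") and size != "-":
                totals[day] += int(size)
    return totals

if __name__ == "__main__":
    # keys sort lexicographically; good enough for a quick look
    for day, total in sorted(daily_totals(sys.argv[1]).items()):
        print(f"{day}  {total / 1e9:.1f} GB")
```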

A few more comments after discussion with Mark:

We thought about splitting up the dumps between dcs but this is expensive because it requires db slaves in both dcs so we won't do that.

We do want a data server in both dcs, one serves as a backup if the other dc dies a horrible death (and in the worst case scenario we could spin up a bunch of misc spares to run dumps, albeit slower than normal).

We should separate out webservice, rsyncs to mirrors and lab usage to a separate server, leaving the one server just for dumps production and collecting datasets from other parts of the infrastructure. Maybe with labs holding one full copy of the data for labs/web/mirrors? Mark mentioned looking into openstack block storage, e.g. CINDER (https://wiki.openstack.org/wiki/Cinder), does that make sense? This wouldn't happen overnight but could be folded into plans for the coming year.

RobH mentioned this in Unknown Object (Task).
RobH added a subtask: Unknown Object (Task).
RobH closed subtask Unknown Object (Task) as Declined. Feb 19 2016, 5:47 PM

These notes are on task T125422, just summarizing here for ease of reference.

T125422 is for the dataset host specification. We needed that pricing for budget, but its purchase will not be until next year (the current dataset hosts will suffice until then).

RobH claimed this task.

Since all specs have been updated and orders planned or already done, I'm resolving this task.

ArielGlenn claimed this task.
ArielGlenn removed a project: hardware-requests.
mark closed subtask Unknown Object (Task) as Resolved. Aug 8 2016, 3:51 PM

I'm reviving this old ticket since now's the time to talk about hardware and spend some money. I'm also adding labs (soon to be Cloud!) folks @chasemp and @bd808 since labs hosts a copy of some dumps, and @Ottomata since dumps filesystems are remotely mounted on stats100* hosts, and I want to rethink how all of this is handled.

First thoughts:

  • I'd like to split out dumps generation from dumps web service to downloaders, dumps nfs mounts for stats, and dumps rsync to mirrors.
  • I'd also like to have a few small hosts in a cluster providing web/rsync service for various portions of the files, rather than one large box with a pile of arrays. Smaller, cheaper, easier to replace hw.
  • I'd like us to factor in redundancy across datacenters.
  • I'd like a solution for the stats hosts that involves something better than an nfs mount. Maybe putting the files they use directly into hdfs?
  • @fgiunchedi mentioned that the esams swift cluster could be used to hold dumps as a viability test if we want to go that route.

So: what are your thoughts? Proposals? What do you want for labs copies of dumps, of pageviews, anything else?

I'd like a solution for the stats hosts that involves something better than an nfs mount. Maybe putting the files they use directly into hdfs?

cc @mforns as well.

We'd like to do this too, and I think we might already be doing it. But, I believe the NFS mounted dumps are used by @ezachte to generate stats.wikimedia.org every month (correct me if I'm wrong). The active (and long term) Wikistats 2.0 project intends to replace stats.wikimedia.org, but it'll be a while before we can actually stop generating new data on stats.wikimedia.org.

Thanks @ArielGlenn, I've been meaning to email you to followup :)

To recap a bit of our conversation (and really my limited insight here):

  • Labstore1003 is currently serving these dumps via NFS as ro to Labs as a whole (gets its copy from the active dataset node)
  • Labstore1003 is a single point of failure and uses external SAS storage for which we do not have an alternate lined up
  • Labstore1003 could become a two-node cluster and also be the point of consumption for some measure of the analytics load. Possibly? That might mean that dumps nfs mounts for stats cover both the analytics and Labs use cases.

Just a memory dump for syncing up.

  • @fgiunchedi mentioned that the esams swift cluster could be used to hold dumps as a viability test if we want to go that route.

Indeed, there's about 25/30TB available there. Of course access would be through HTTP and not a filesystem, from the discussions so far it seems that most things require a filesystem (?) though.
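For the record, here's a minimal sketch of what consuming a dump file out of swift over HTTP might look like, using python-swiftclient; the auth endpoint, credentials, and container/object names are placeholders, not actual cluster details.

```
# Sketch: read a dump object from swift over HTTP instead of a filesystem.
# Auth URL, credentials, container and object names are all placeholders.
import swiftclient

conn = swiftclient.Connection(
    authurl="https://swift.example.org/auth/v1.0",   # placeholder
    user="dumps:reader",                             # placeholder
    key="not-a-real-key",                            # placeholder
)

# Stream the object in 1 MiB chunks rather than loading it into memory.
headers, body = conn.get_object(
    "dumps-archive",                                  # placeholder container
    "enwiki/enwiki-20080103-pages-articles.xml.bz2",  # placeholder object name
    resp_chunk_size=1 << 20,
)

with open("pages-articles.xml.bz2", "wb") as out:
    for chunk in body:
        out.write(chunk)
```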

Summary:

  • 35T swift or equivalent (before replication)
  • 30T (after raid) labs box for nfs mounts to labs and stats100* hosts. Should be more than one for redundancy.
  • 30T (after raid) box for rsync to mirrors and web service to the public (serving everything not in swift). Should be in two data centers, for redundancy. Can this be the same as the labs box and meet performance needs for labs/analytics users, downloaders, rsyncers?
  • 5T storage (after raid) added to each snapshot box for local dump generation, and to a spare box for redundancy. Note: these are only in one DC.

Questions:

  • 10G nic for boxes serving to public? We get a rush of downloaders as each dump run completes. Will that meet their needs?
  • How viable is service to downloaders/rsyncers directly from swift? Do we need caching (eww, caching 2GB files)?
  • Any chance to move off of nfs for good somehow, and have everything in something swift-like except for currently generated dumps and the data needed for their production?
  • Best hw setup to ensure download service is always available even when a disk or a box fails?

Long version.

OK, let's talk amount of storage and where.

"archive" (dumps from 2008 and earlier) is about 1.5T. This could be stored in swift since it's not requested as often.

"other" datasets (https://dumps.wikimedia.org/other/) are about 20T. I claim there's around 14T that are not extremely high traffic and could perhaps be hosted in swift:

  • pageviews-raw is deprecated (see T142574), about 5.5T
  • cirrussearch 3.1T (@demon, do you know who uses these primarily and what the demand is like?)
  • kiwix 5.2T

The other 6T in "other" grows and is wanted by labs/analytics users via nfs. Say 10T for 3 years. If labs hosts these and serves them to the public and via nfs to labs users _and_ analytics folks, what would that look like? We'd want redundancy in case of hw failure, so are we talking two large boxes each with a large array, both in the same dc? Is it problematic to have web and nfs traffic on the same box, for performance reasons? Is a 10G NIC on it/those good enough? Should those be the box(es) that public mirrors rsync from?

For dumps generation, we need the last set of full dumps, plus space to write the new ones. With the current setup, that's about 9T, and we should allow for expansion so let's say 15T for the next 3 years. This would allow us to stop writing these dumps to nfs-mounted filesystems as they are generated. I would love it.

If we are a bit clever about dumps generation we could have each snapshot host hold space for a given subset of these dumps, say 5T per box, with one spare. The current snapshot hosts have minimal storage so we'd need to look into adding storage to leased boxes. I need to check the numbers for en wp to make sure 5T is enough to cover three runs of that.
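To make "a given subset of these dumps, say 5T per box" concrete, here's a rough sketch of how one might check that an assignment of wikis to snapshot hosts stays under a per-box budget; the host names, wiki list, and sizes are made-up placeholders, and real numbers would come from measuring the previous run.

```
# Sketch: greedily spread wikis across snapshot hosts under a per-box budget.
# Host names, wikis and sizes below are made-up placeholders.

BUDGET_TB = 5.0

# hypothetical (wiki, TB needed for last full run + in-flight run)
wikis = [("enwiki", 3.8), ("wikidatawiki", 1.2), ("commonswiki", 0.9),
         ("dewiki", 0.7), ("frwiki", 0.6), ("jawiki", 0.4)]

hosts = {"snapshotA": 0.0, "snapshotB": 0.0, "snapshotC": 0.0}
assignment = {h: [] for h in hosts}

# Largest wiki first, always onto the least-loaded host.
for wiki, tb in sorted(wikis, key=lambda w: w[1], reverse=True):
    host = min(hosts, key=hosts.get)
    if hosts[host] + tb > BUDGET_TB:
        raise SystemExit(f"{wiki} ({tb}T) does not fit within {BUDGET_TB}T per box")
    hosts[host] += tb
    assignment[host].append(wiki)

for host, used in sorted(hosts.items()):
    print(f"{host}: {used:.1f}T  {assignment[host]}")
```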

Once a dump is generated it could be rsynced elsewhere for rsync to mirrors, nfs mount to analytics/web, download to users. We keep about 25 or 26T of dumps excluding "archive" dumps. We can also look forward to other dumps such as HTML format dumps, sample dumps generated from refinery data, and so on. Let's guess 40T in three years with room to add a lot more storage if we wind up with more formats.

Of these, which need to be nfs accessible? I claim only the last couple rounds of dumps (4.6G and growing), plus the "other" data not moved to swift (about 6T but growing). For the next three years let's guess 20T, including HTML and other new format dumps. Those have to be on a regular filesystem; the rest (also an estimated 20T for 3 years) could be in swift or a similar distributed filesystem.

Comments/Questions please!

Here are mine:

  • What is missing from this picture?
  • What could be simplified?
  • What will let us eliminate bandwidth caps/connection limits for downloaders?
  • What will let us easily update hardware as it dies/gets old without costing an arm and a leg?
  • cirrussearch 3.1T (@demon, do you know who uses these primarily and what the demand is like?)

Nope, not my data anymore! @EBernhardson?

  • cirrussearch 3.1T (@demon, do you know who uses these primarily and what the demand is like?)

Nope, not my data anymore! @EBernhardson?

We use these for loading data into relforge, which is a place where we test changes to indexing/searching/etc before shipping them live. It is additionally used by a few external users, although I don't know how much. There is for example a search engine under development called bitfunnel that has a script for ingesting the cirrus dumps into their engine. It is generally useful even outside search as a json formatted dump of the current state of each wiki with a bunch of extracted metadata, such as links out, templates used, categories, # of incoming links, page popularity, etc.

@EBernhardson, can you get that file via http for processing or do you need it to be on something that pretends to be a local filesystem?

I think we've only ever grabbed it via http, sometimes via the mirrors because the 2MB/s limit on dumps.wikimedia.org is a bit slow.
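For what it's worth, here's a minimal sketch of that kind of fetch: a plain HTTP download with resume support, so a throttled transfer can be picked up where it left off. The URL is only an example of the path layout, and it assumes the server honors Range requests.

```
# Sketch: resumable HTTP download of a cirrus dump. Example URL only;
# assumes the server honors Range requests (falls back to a full fetch).
import os
import requests

URL = ("https://dumps.wikimedia.org/other/cirrussearch/"
       "20170130/enwiki-20170130-cirrussearch-content.json.gz")  # example path
OUT = "enwiki-cirrussearch-content.json.gz"

done = os.path.getsize(OUT) if os.path.exists(OUT) else 0
headers = {"Range": f"bytes={done}-"} if done else {}

with requests.get(URL, headers=headers, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    mode = "ab" if resp.status_code == 206 else "wb"  # 206 = server honored Range
    with open(OUT, mode) as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```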

[this is based on me talking to @ArielGlenn about this task and us trying to work things out -- hopefully I am summarizing well]

Dumps User classes:

  1. Generation (operational)
    • scripts that run needing interim storage and resources for dumps creation
    • run monthly for full set and mid-month for point in time
  2. Web consumers via https://dumps.wikimedia.org
    • currently served via web server on dataset1001
    • can access all data on dataset1001 (6 historical months of project dumps + misc)
    • throttled bw
  3. Mirrors: rsync from us to them
    • typically looking to rsync daily for newest portions of latest dump
    • we throttle bw (or ask them to?)
  4. Analytics
    • NFS mount of dataset1001 archives
    • read-only
    • not currently throttled
  5. Labs users (labstore1003)
    • NFS mount from labstore1003 where dataset1001 files are copied to for access
    • labstore1003 has a few other use cases now including some statistics DaaS
    • all dumps are read-only
    • duplicated historical dumps are around 19T (3 or so runs plus misc)

Current Setup:

Dumps run and generate on dataset1001 which holds the canonical copies. Dataset1001 has mounted an NFS share from labstore1003 and copies the relevant files to that for Labs consumption. Analytics mounts directly a share on dataset1001 for consumption. Web users and rsync mirrors both access the copy of dumps on dataset1001 directly. ms1001 is the backup for dataset1001. dataset1001 has 57T with 49T in use. There are no dumps components in codfw. dataset1001, ms1001, and labstore1003 are all in need of refresh. Labstore1003 is currently a SPoF.

Proposed setup:

Phase 1:

  • introduce dataset1003 and dataset1004 servers with 15-20T of RAID10 using remaining 2016-2017 budget
  • this setup takes on the generation use case only, holding just the latest available run. Possibly rsync mirrors pull from here.
  • dumps run with the latest and in-flight copies stored here

Phase 2:

  • introduce labstore1006 and labstore1007 with 75T redundant storage. (current dataset1001 uses 50T)
  • Multi-host attachable or dual SAS stacks?
  • labstore1006 and labstore1007 serve: ro analytics via NFS, ro labs via NFS, rsync users, and web
  • if the latest copies are in the most demand, we can optionally have rsync mirrors pull from dataset1003 and/or dataset1004

Phase 3 (Not in current fiscal planning cycles):

  • labstore2008 and labstore2009 (75T?) in codfw
  • Unsure if online or offline storage. If offline, we need less storage and can compress. Depends on load of current user profiles.
  • Keep offsite historical copies of dumps and associated data
  • This could fold into other labs offsite storage work and/or capacity and will be after 2017/2018 FY

The above may be too terse but I hope it makes sense, and one of the assumptions therein is that we continue the dumps (ariel) and labs (wmcs) partnership here. We shift the focal point for resourcing to consolidate the use cases so we can be practical about storage allocation and support the user roles as needed on both sides. With @ArielGlenn having historically supported analytics, mirrors, and web users, we (labs) would be involved and share equity but not be primary (I believe?), and the reverse would be true for the Labs NFS use cases.

The only thing this leaves out is future work:

  • possible use of swift or a similar system to store some datasets for which nfs access is not needed
  • eventual phase out of nfs mounts of anything on stat100* hosts (doable for statistics generation eventually; what would researchers use instead though?)

It also commits you to serving files over nfs to lab users forever, basically. That may be what you want, I'm just making it explicit.

Is there enough in the budget to get dataset1003 and labstore1006 right now, so that no one is relying completely on badly out of warranty hardware?

The only thing this leaves out is future work:

  • possible use of swift or a similar system to store some datasets for which nfs access is not needed
  • eventual phase out of nfs mounts of anything on stat100* hosts (doable for statistics generation eventually; what would researchers use instead though?)

It also commits you to serving files over nfs to lab users forever, basically. That may be what you want, I'm just making it explicit.

Is there enough in the budget to get dataset1003 and labstore1006 right now, so that no one is relying completely on badly out of warranty hardware?

Next gen is mostly left out there, yeah. I'm not sure what that will look like, but I do think we can say it's a significant research and proof of concept project (or several).

I am comfortable saying we'll share these data sets and probably others over NFS (or some equivalent mechanism) for the long haul.

We don't currently have any budget for labstore1006/7 type things. My impression was there may be enough budget in dumps in 16/17 to look at doing dataset1003/1004 to set us up for 17/18 fiscal allocation to the new labstore setup.

At Mark's request, here is a precise description of what dataset1003 and 1004 would be doing. (Maybe we want to give them different names?)

Dataset1003

  • nfs mounted to the snapshot hosts
  • new dumps are written out to the nfs share from the snapshot hosts
  • one run of old dumps is kept; these are read and used as input to generate new dumps
  • at start of next run, oldest full dumps are tossed, so that we only have two complete sets of fulls at any time, plus possibly one or two partial runs ("current" pages only)
  • rsyncs current dump files, as they are completed, to dataset1004 in the same dc (a minimal sketch of this push follows after these lists)

Dataset1004

  • has the same hardware as dataset1003
  • resides in the same dc as dataset1003, ready to be deployed if dataset1003 dies
  • receives rsyncs of files from dataset1003 as they are completed
  • rsyncs files to labstore1006 as they are received, or perhaps via a regular cron job to pull from the lab server

They do NOT:

  • nfs mount anywhere else
  • rsync to mirrors
  • serve files to the public
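As referenced above, here's a minimal sketch of the dataset1003-to-dataset1004 push as it might run from cron; the paths, destination, and bandwidth cap are placeholders, and a real job would sync only files whose dump step has completed.

```
# Sketch of the dataset1003 -> dataset1004 push described above.
# Paths, destination, and bandwidth cap are placeholders.
import subprocess

SRC = "/data/xmldatadumps/public/"                  # placeholder source path
DEST = "dataset1004:/data/xmldatadumps/public/"     # placeholder destination

subprocess.run(
    [
        "rsync", "-a", "--partial",
        "--bwlimit=40000",  # KB/s cap (placeholder) so the push doesn't starve NFS
        SRC, DEST,
    ],
    check=True,
)
```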

Is raid10 the right setup for these boxes?

Though earlier comments mention the possibility of adding dataset1003/4 to the server pool for serving most current dumps during the spike of the first few days after new ones become available, it's better if we can make sure the labs boxes have the capacity to handle the spike (per discussion with Mark).

Update: I believe we have worked out a plan to move forward with labstore1006/7 in FY 16/17

So @chasemp, can we get with Rob and get these boxes ordered?

So @chasemp, can we get with Rob and get these boxes ordered?

+1 thank you

**Proposed setup version 2 (revised from https://phabricator.wikimedia.org/T118154#3054146)**

Phase 1:

  • introduce labstore1006 and labstore1007 each with 72T RAID10 storage. (current dataset1001 uses 50T)
  • labstore1006 and labstore1007 each have full independent copies of data
  • labstore1006 and labstore1007 serve: ro analytics via NFS, ro labs via NFS, rsync users, and web

Phase 2:

  • introduce dataset1003 and dataset1004 servers with at least 15-20T of RAID10
  • this setup takes on the generation use case only, holding just the latest available run.
  • dumps run with the latest and in-flight copies stored here

Phase 3 (Not in current fiscal planning cycles):

  • labstore2008 and labstore2009 (75T?) in codfw
  • Unsure if online or offline storage. If offline, we need less storage and can compress. Depends on load of current user profiles.
  • Keep offsite historical copies of dumps and associated data
  • This could fold into other labs offsite storage work and/or capacity and will be after 2017/2018 FY

Time to close this ticket. At this point we have: labstore boxes coming on line soon, dumpsdata hosts deployed months ago, snapshot testbed refresh is at least on the radar with a ticket, and snapshot cron job host is in the request queue. Mid-term plans may look very different, depending on what dumps look like by then.