
Hardware for HTML / zim dumps
Closed, Resolved (Public)

Description

As discussed in T17017 we'd like to start producing HTML (and possibly ZIM) dumps from Parsoid HTML through RESTBase. For this, we'll need at least one, ideally two hosts with the following characteristics:

  • At least 1.5T of storage. 3T preferred if available.
  • Sufficient bandwidth and configuration to sustain several external downloads of large files.

Event Timeline

GWicke raised the priority of this task to Medium.
GWicke updated the task description.
GWicke added subscribers: GWicke, Kelson, ArielGlenn.

@RobH, one of the Dell PowerEdge R420 boxes on the server spares page (dual Intel Xeon E5-2450 v2 2.50GHz, 64GB Memory, (4) 3TB Disks) should be more than enough to get us started. These boxes are overkill in terms of CPU and memory, but they are the only ones on the spares list that have the storage capacity right now. We could move to less powerful hardware with big disks later to optimize resource usage.

I've chatted with Gabriel and Ariel about this particular request on IRC (and reviewed the linked tasks).

This is an actual need, and the discussion established storage space as the primary selection criterion. A system with a single CPU and half as much memory would likely also work. However, we have no spare 3TB disks on site (if we did, we could install them in the old decommissioned lsearch node and use it for this).

Options:

  • Use wmf4543, a Dell PowerEdge R420, dual Intel Xeon E5-2450 v2 2.50GHz, 64GB Memory, (4) 3TB Disks, one of three of these spares already onsite. Warranty doesn't end until 2017-03-19.
  • Use the old, decommissioned, out-of-warranty lsearch host and order/install four 3TB SATA disks.
  • Quote out a new system order with a single CPU and four 3TB disks.

I'd go with the first option and just over-provision for CPU/memory, since we already have the spare systems on site. Every other option means ordering either more hardware or an entirely new system. That being said, this is a budgeting decision as well, so I'd appreciate some additional feedback from @mark.

Also perhaps @ArielGlenn could offer insight as one of the opsen with knowledge of dumps? We chatted some in IRC, but on task is ideal.

My understanding is listed above in my last comment, but feedback is appreciated!

I chatted with Gabriel about this and we agreed that the best approach is to generate the dumps locally on a host and keep one run around to use as input to the next run. Although he's starting out with dumps of just the mainspace, this would soon grow to include dumps of current revisions in all namespaces. So what he needs is a box with a few T that is expandable later; wmf4543 would be nice so he could get started right away.
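To illustrate the "keep one run around as input to the next run" idea, here is a minimal Python sketch. It is not the actual dump script: the dict-based layout (title mapped to a revision ID and rendered HTML) and the names `build_dump` and `render_html` are hypothetical, chosen only for illustration. The point is that pages whose revision has not changed since the previous run can be copied over instead of being fetched again.

```python
from typing import Callable, Dict, Tuple

# Hypothetical layout: title -> (revision id, rendered HTML).
Dump = Dict[str, Tuple[int, str]]


def build_dump(
    current_revs: Dict[str, int],
    previous: Dump,
    render_html: Callable[[str, int], str],
) -> Tuple[Dump, int]:
    """Build a new dump, reusing HTML from the previous run where the
    revision is unchanged. Returns the new dump and the number of pages
    that had to be fetched again."""
    new_dump: Dump = {}
    fetched = 0
    for title, rev in current_revs.items():
        old = previous.get(title)
        if old is not None and old[0] == rev:
            # Same revision as last run: reuse the stored HTML.
            new_dump[title] = old
        else:
            # New or changed page: fetch fresh HTML (assumed callback).
            new_dump[title] = (rev, render_html(title, rev))
            fetched += 1
    # Pages missing from current_revs (e.g. deleted) are simply dropped.
    return new_dump, fetched
```

Under these assumptions, the local storage has to hold at least the previous run plus the run in progress, which is one reason disk space rather than CPU or memory drives the request.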

But the CPU and memory requirements for the intended host are not as high as what wmf4543 offers (Xeon E5-2450 v2, 64GB RAM), right?

@RobH, for plain HTML dumps only compression will use significant CPU, and basically nothing will use significant memory. For other formats there could be some moderate memory / CPU use, but it's not clear yet if that would actually be directly on these nodes. I don't see anything that has CPU / memory requirements as high as these spares.
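A small Python sketch of why compression is CPU-bound but light on memory: the dump can be compressed as a stream, one chunk at a time, so the whole file never has to be held in RAM. The compression format (xz via the standard-library `lzma` module), the chunk size, and the function name `compress_stream` are assumptions made for illustration; the task does not say which compressor the dump script uses.

```python
import io
import lzma
from typing import BinaryIO

CHUNK_SIZE = 1 << 20  # 1 MiB per read; an arbitrary illustrative value


def compress_stream(src: BinaryIO, dst: BinaryIO,
                    chunk_size: int = CHUNK_SIZE) -> int:
    """Compress src into dst chunk by chunk; return the bytes read."""
    compressor = lzma.LZMACompressor()  # default container format is .xz
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        # compress() may return b"" while the compressor buffers input.
        dst.write(compressor.compress(chunk))
    # flush() emits whatever is still buffered and finishes the stream.
    dst.write(compressor.flush())
    return total
```

Memory use here is bounded by the chunk size plus the compressor's own working buffers, regardless of how large the dump file is, while the CPU does the bulk of the work inside `compress()`.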

The first iteration will basically be running a script manually, so it would be easy to start with one of the spares until a more permanent box is found.

@RobH, for the ZIM files, I need more CPU resources than for HTML-only Parsoid dumps; see my previous email for more details about the reasons: https://lists.wikimedia.org/pipermail/labs-l/2015-March/003457.html. But, as Gabriel has already written: "I don't see anything that has CPU / memory requirements as high as these spares." 8 cores at 2.5GHz should not be too limiting.

@RobH, can we go ahead with one of the spares?

RobH raised the priority of this task from Medium to High. Mar 18 2015, 3:43 PM

I chatted with Mark about our spare levels and systems we have allocated. Since the one system with the disks is over-provisioned in terms of memory and processing, we'll instead order disks for a more modestly provisioned system.

I'll be allocating either francium or WMF4575 for this task.

@GWicke: will this be joining a cluster of systems (and thus have a somethingX name) or will it be a misc system? If the latter, I'll assign francium, if the former, WMF4575.

I'll be ordering 4 * 3TB disks for the host.

Disk order = https://rt.wikimedia.org/Ticket/Display.html?id=9268

Once these come in, they'll go in one of the two machines (pending hostname needs, francium or WMF4575).

These are: Dell PowerEdge R420, single Intel Xeon E5-2450 v2 2.50GHz, 16GB Memory

These will be used instead of the initial suggestion (Dell PowerEdge R420, dual Intel Xeon E5-2450 v2 2.50GHz, 64GB Memory, (4) 3TB Disks), since the dual-processor setup is overkill for the requirements of the service.

Chatted with Gabriel at the office.

We're allocating server francium for this task.

I've linked T93113 for the setup and deployment of the system. I'm resolving this hardware-request task.

RobH lowered the priority of this task from High to Medium. Mar 18 2015, 10:52 PM

Setting to normal priority as we have done all we can until the disks come in. Once they arrive, the install can proceed normally.

The hardware has arrived, and the linked T93113 covers the deployment. This hardware-requests task is resolved.