
WDQS testing setup platform sizing
Closed, Resolved · Public

Description

Currently, WDQS has a testing setup on http://wdqs-test.wmflabs.org. However, we are experiencing both storage and performance issues there. Max storage on labs right now is 160G, and WDQS database size is rapidly outgrowing this limitation.

Currently, the growth rate for the database is about 75% a year. The production database is now 193G. That means even if we double the labs instance storage, it would only be enough for another year, or maybe two if we store a reduced data set. So, we need to find a solution for a test setup.
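
As a rough check of the arithmetic above, the sketch below estimates how long a given disk size would last. It assumes the 75% annual growth compounds steadily from the current 193G, which is an assumption: the description only states the present rate. The function name `years_until_full` and the list of candidate sizes are illustrative.

```python
import math


def years_until_full(current_gb: float, capacity_gb: float, annual_growth: float) -> float:
    """Years until a database growing at a compound annual rate reaches capacity.

    Returns 0.0 if the database is already at or above capacity.
    """
    if current_gb >= capacity_gb:
        return 0.0
    # Solve current * (1 + growth) ** t == capacity for t.
    return math.log(capacity_gb / current_gb) / math.log(1.0 + annual_growth)


CURRENT_GB = 193  # production database size quoted in the description
GROWTH = 0.75     # ~75% per year, assumed here to compound steadily

# 160G is the current labs limit, 320G is double that,
# 300G and 500G are the sizes proposed for a large-storage instance.
for capacity in (160, 320, 300, 500):
    years = years_until_full(CURRENT_GB, capacity, GROWTH)
    print(f"{capacity}G lasts about {years:.1f} years")
```

Under that assumption, doubling the storage to 320G lasts a little under a year for the full data set, which matches the estimate above. A reduced data set would start from a smaller size and last correspondingly longer.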

Possible solutions:

  1. Create a huge storage labs instance (300G, 500G) template
  2. Store reduced data set for testing, e.g. only truthy data, only English labels, no sitelinks, etc.
  3. Migrate test setup to real hardware outside of labs (maybe less high-powered than production hosts) - maybe to labs-support like relforge

This task is to explore and evaluate possible solutions for this issue.

Event Timeline

Restricted Application added projects: Wikidata, Discovery. Jun 28 2017, 7:37 PM
Restricted Application added a subscriber: Aklapper.
Smalyshev updated the task description. Jun 28 2017, 7:38 PM
Smalyshev updated the task description. Jun 28 2017, 8:08 PM
  1. Create a huge storage labs instance (300G, 500G) template

It is likely that Cloud-Services can find space to grant a quota in the 300-500G range. To do that we need to have a few specifics about the exact sizing desired so we can look for a labvirt host that can support the local storage needs and provide the IOPS needed as well. The normal process for this is to file a subtask of T140904: Existing Labs project quota increase requests (Tracking). The request really does need to explain what the additional resources will be used for so that the process of allocating a larger than normal component of our shared pool of resources is documented and transparent.

  3. Migrate test setup to real hardware outside of labs (maybe less high-powered than production hosts) - maybe to labs-support like relforge

I was not a part of the negotiations that led to the creation of Relforge, but it is a unique situation within the Cloud Services projects. A very detailed case will need to be made for us to consider adding another special service like this. I would really be more inclined to look at adding dedicated labvirt nodes before I gave the OK to placing more custom boxes into the labs-support VLAN.

I think this is not the last time we're hearing about such a setup. With more and more data being processed and us starting to use algorithms and tools that require significant computing power (ML, graph DBs, etc.), we'll have more use cases that need storage beyond 160G and more processing power, while not being fully production setups.
I do not insist on any particular solution for this one, but I think we need to start thinking about how we can run real-hardware non-production setups, or cover such requirements in some other way.

bd808 added a comment. Jun 29 2017, 1:03 AM

I do not insist on any particular solution for this one, but I think we need to start thinking about how we can run real-hardware non-production setups, or cover such requirements in some other way.

It is certainly a topic worth discussion, but respectfully I have not yet heard an explanation of the resources needed to be able to fairly judge if such resources can be provided by Cloud Services or not. Dedicating a labvirt or two to a key project is well within our scope assuming that the budget can be found for any new equipment that is needed. There are really very few things that can't be run under virtualization. There may be some things that we do not currently have well sized labvirt machines for however. Fixing that would require going through the Foundation's normal procurement process. That's really not different than the process that would be needed to procure new bare metal to place in some other VLAN with someone else helping you manage it.

From my point of view there are a couple of things that virtualization helps with in practice. The first is economies of scale that can be achieved for smaller workloads (lots of small VMs on a large box). The second is ease of re-purposing resources when they are no longer needed or outgrown for a particular project. The only place I have ever run into legitimate constraints on virtualizing workloads is in providing enough IOPS for very high demand services. With modern hardware, however, this is becoming less and less common. It's quite possible to run even a fairly high traffic database server under virtualization with the right local storage on the virt host.

The separate discussion of non-virtualized server access is in my mind primarily a topic to take up with the core techops team. Existing legacy deals excluded, I do not feel that Cloud Services should be in the business of finding places to put "non-production" hardware or new means to manage its allocation, maintenance, and recovery.

I have not yet heard an explanation of the resources needed

Sure, sorry for not specifying it earlier. So, in production we have:

  • Database size is now 193G, with 75% growth per year for now, though I'm not sure it will keep growing this way for years. We have 500 to 700G of disk space on production hosts, so I'd say we need about 500G on the test host too (we could possibly reduce the data set on the test DB, but the growth will still remain). Production uses (and requires) SSD, but I think a spinning disk would be OK for the test host.
  • Memory - directly related to performance, so the more the better, but it can run inside 16G, though something like 64G would probably work better.
  • CPU - here it's a bit hard for me to say. The workload is mostly I/O (disk) bound. Production has dual Xeon E5-2620 v3 @ 2.4GHz, but it's hard to see what we need in labs since we won't be serving any traffic. What we have in production is probably overkill for a test setup, but what we have now in labs with 8 cores can't reliably keep up with the update stream (maybe not a CPU but an I/O issue).

In summary, what is needed is:

  • Much more disk space and I/O throughput
  • Somewhat more memory
  • Maybe more CPU power, not sure - need to see
Smalyshev triaged this task as Medium priority. Jun 29 2017, 5:20 AM
Gehel added a comment. Aug 21 2017, 3:35 PM

Doing some more guess work:

IO performance on a test instance of WDQS is mostly limited by writes, which should be similar to what we have in production. Grafana shows writes between 500 and 1000 IOPS.

Looking at the GC logs on wdqs-beta (for the last 24 days), it looks like heap after GC is almost never above 4GB (we allocate 8GB), so we can most probably save 4GB, probably more. Since we don't expect much traffic, having memory available as disk cache is probably not as critical as on production servers.

So my guess for the specs of a VM:

  • 8 cores
  • 500GB disk, minimum 1k write IOPS (short term 300GB would work)
  • 16GB RAM (possibly 32GB)
chasemp added a comment. Edited Aug 21 2017, 3:45 PM

How many VMs at that spec (count)?

@chasemp two should be enough - one for wdq-beta and one for wdq-deploy.

Gehel added a comment. Aug 28 2017, 2:43 PM

To summarize an IRC discussion with @chasemp:

  • The specs above (8 CPU, 16GB RAM, 500GB disk) are the same as the current wdq-deploy and wdqs-beta, apart from disk. So the only increase is disk
  • Disk is sadly the most contentious resource
  • Expanding our cloud means new hardware, which implies time and money

Side note: in the shorter term, we can get by with only 50GB more disk per VM (so a total of 100GB more disk, with no other resource increase).

Side note 2: @Gehel will see if there is some budget that could be reallocated here.

Yes, the main problem is the disk space. Though the performance is kinda an issue too, since we can't keep up with the full update stream on VMs. But running out of space is the most immediate one.

Gehel added a comment. Sep 1 2017, 8:57 AM

@Smalyshev I'm wondering what the performance bottleneck is. I suppose it is IO, but with the current state of wdqs-beta, it is hard to confirm. We might be able to get better performance from the same resources by tuning things a bit. Reducing the memory allocated to the JVM might help...

@chasemp: Would it be possible to replace the current wdq-deploy and wdqs-beta VMs with VMs with the same specs except more disk? What would be a reasonable increase in disk size? +50GB per VM is the minimum, +300GB per VM would be nice to have.

chasemp added a comment. Edited Sep 1 2017, 1:20 PM

@Gehel

We could bump up the quota (on the project) temporarily to allow a rebuild of those instances with a flavor that has a larger disk. We already have a flavor with 300G of disk that we have granted selectively elsewhere, so that would work out. I believe this is a 20G root partition, with the /srv extension puppet manifest used to mount the remainder at /srv. FYI, we don't have the ability to sanely extend the disk of an existing instance, so changing instance sizes means a rebuild.

Gehel added a comment. Sep 1 2017, 1:26 PM

That would be great! If you grant us access to that flavor with 300GB disk, we don't even need a quota extension. We have one instance (wdq-beta) that needs to be deleted after some checks, so we should be all good.

I'm making a note to discuss it today in our meeting ;)

+1'd to grant access to the 300G flavor via meeting

bd808 assigned this task to Andrew. Sep 5 2017, 3:53 PM

Approved in team meeting for 300GB image flavor.

Andrew added a comment. Sep 5 2017, 5:17 PM

I don't see a project named 'WDQS' -- can you clarify what actual VPS project we're talking about here? I see a few things it could be...

Andrew added a comment. Sep 5 2017, 5:21 PM

ok, flavor added. I'm a bit nervous about how the scheduler will handle this, so please ping me after your instance is created but before you get attached to it so I can make sure it landed someplace reasonable.

I see it but cannot use it because it goes over the RAM quota. Can we have it increased a little for wikidata-query?

bd808 added a comment. Sep 6 2017, 1:01 AM

@Smalyshev please make a separate task for the RAM quota bump so that we can keep track of it and drop it back down after you build the new VM and shut down an old one. There is a handy link on Cloud-VPS (Quota-requests) that will give you a template task to fill in.

Smalyshev closed this task as Resolved. Sep 21 2017, 8:01 PM

I think it's OK for now with the bigdisk template.