
Give misc dump crons their own host
Closed, Resolved (Public)

Description

We're going to need more shards for the wikidata weekly dumps, and the new s8 shard will permit that. We get new cron job requests from time to time, and this need only grows. We should move these jobs off to their own host with far fewer cores but a reasonable amount of memory. Maybe one of the spares will do.

Event Timeline

This host will be snapshot1008, in eqiad.

Specs needed:

16 cores would be nice, for growth.
32GB RAM is plenty, and we might make do with less.
A couple of 300GB internal drives in RAID is fine; we don't need a fancy RAID controller or anything like that.

Looking at the spares, nothing seems to meet our needs because of the core count.

I'm adding @Nikerabbit, @demon and @hoo because they will be the main beneficiaries of this new host. How do you see your capacity needs increasing over the next few years? Do you have plans to add new cron jobs to the mix?

Also adding @RobH to chat about what can be done with the above specs.

> How do you see your capacity needs increasing over the next few years? Do you have plans to add new cron jobs to the mix?

We (= the Wikidata team) plan to add new cron jobs over time. We will probably get more dump types for Wikidata (hopefully derived from the JSON dump, so they will only consume CPU time and block storage).
Also keep in mind the strong-ish growth of Wikidata. Given all of this I'd expect us to use no more than 250% of the CPU time we use now by the end of the year.

Also there will be new dumps for the WikibaseMediaInfo entities on commons at some point.

For Content Translation we are expecting a steady increase in dump size. See https://en.wikipedia.org/wiki/Special:ContentTranslationStats#global-translations-weekly, which seems to indicate that weekly activity is not currently growing; future improvements to the tool (CX2, translation lists) could bring some additional growth.

We are not planning to add new dump types, but we are discussing where to permanently store the source data. This might have some impact on the CPU and memory use of the dump scripts, but we want to avoid making them slower than they already are. The current scripts have gone through a few rounds of optimization, mostly to reduce memory use.
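As a rough illustration of the "reduce memory use" kind of optimization mentioned above (not the actual Content Translation scripts; the query helper, batch size, and output name are hypothetical), a dump can be written in fixed-size batches so memory stays flat no matter how large the dump gets:

```python
import gzip
import json

BATCH_SIZE = 1000  # hypothetical; chosen to keep memory bounded

def fetch_batch(offset, limit):
    """Hypothetical stand-in for a paged database query.
    The real scripts would use a LIMIT/OFFSET query or a key-range scan;
    here we just return an empty page so the sketch is runnable."""
    return []

def write_dump(path):
    """Stream rows to a gzipped JSON-lines file one batch at a time,
    so memory use does not grow with the size of the dump."""
    offset = 0
    with gzip.open(path, "wt", encoding="utf-8") as out:
        while True:
            rows = fetch_batch(offset, BATCH_SIZE)
            if not rows:
                break
            for row in rows:
                out.write(json.dumps(row, ensure_ascii=False) + "\n")
            offset += BATCH_SIZE

if __name__ == "__main__":
    write_dump("contenttranslation-dump.json.gz")
```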

In eqiad spares we have the following system:

wmf4749 - Dual Intel Xeon E5-2640 v3 2.6GHz/8Core per CPU - 64GB RAM - dual 1TB SATA

This spare remains in warranty until 2019-03-24.

We haven't ordered any machines with dual CPUs and less than 64GB of memory recently. I'd advise we assign this spare rather than purchase a new machine, since it has sat in the spares pool for a while.

@apergos: Please advise if this spare system would work for this. If it will, please escalate this to @mark (or @faidon) for allocation approval. If it won't work, please assign back to me for followup.

Once approval is granted, please assign this back to me and I'll get the system set up and deployed for you to take over!

...

> Also keep in mind the strong-ish growth of Wikidata. Given all of this I'd expect us to use no more than 250% of the CPU time we use now by the end of the year.

> Also there will be new dumps for the WikibaseMediaInfo entities on commons at some point.

Am I reading that right: 250% CPU growth in 12 months? What does this look like three years out? Are we talking 18 shards at the end of the year and 36 in three years? What about simultaneous database access from that many parallel processes? Perhaps we should rope a DBA into this discussion.

Pinging you, @hoo :-)

> I'm adding @Nikerabbit, @demon and @hoo because they will be the main beneficiaries of this new host. How do you see your capacity needs increasing over the next few years? Do you have plans to add new cron jobs to the mix?

I have no needs and no plans for new cronjobs :P

@hoo, hopefully you are back and recovered from your various trips; could you please give more detail on the three-year needs for the JSON and other sorts of dumps? We'd like to move on this right away.

Rough answer, without knowing all the details @hoo might have been referring to: we see mostly linear growth on https://grafana.wikimedia.org/dashboard/db/wikidata-datamodel, so I believe the numbers given for one year can be extrapolated linearly. @Lydia_Pintscher might be able to provide a more substantial answer.
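To put rough numbers on that linear extrapolation, here's a back-of-the-envelope sketch. The baseline shard count, the starting CPU figure, and the mapping from CPU use to shards are illustrative assumptions only, not measurements from the cluster:

```python
# Back-of-the-envelope linear extrapolation of wikidata weekly dump needs.
# Assumptions (illustrative only): current usage is 1.0 unit, growing
# linearly to 2.5 units after year 1 (the "no more than 250%" figure above),
# and shard count scales roughly with CPU use from a hypothetical baseline.

current_cpu = 1.0        # arbitrary unit: today's CPU time per weekly run
year1_cpu = 2.5          # 250% of current by the end of year 1
yearly_increase = year1_cpu - current_cpu  # linear growth per year
baseline_shards = 7      # hypothetical current shard count

for year in range(1, 4):
    cpu = current_cpu + yearly_increase * year
    shards = round(baseline_shards * cpu)
    print(f"year {year}: ~{cpu:.1f}x current CPU time, roughly {shards} shards")
```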

While projecting the dumps, we should also account for:

  1. Lexemes, which would probably cause some bump in addition to current growth patterns.
  2. T144103: Create .nt (NTriples) dumps for wikidata data, which may or may not happen; if it does, it will take at least the same size and resources as the TTL dump, and we need to be ready for it.
  3. Structured data on Commons will of course require all the regular types of dumps plus the types of dumps we have for Wikidata data. Since Wikidata and Commons currently have roughly the same number of items, I'd say we should plan for SDC dump requirements to match Wikidata's (not all at once, of course, but eventually).

We have to get this into the budget plan by tomorrow, so I'm going to request:

Let's get a box that looks like snapshot1005: same number of cores, same amount of memory, etc. That's a five-fold increase over what we use now for misc cron dumps. Additionally, we'll be on PHP 7, which should give us some speed gains, and we'll be able to enable Zend caching in some spots. Beyond that, we'll have to look in depth at ways to make these dumps more efficient if capacity looks like it will become a problem.

Hey @RobH, what are next steps on this?

RobH mentioned this in Unknown Object (Task).Mar 20 2018, 12:01 AM
RobH changed the task status from Open to Stalled.May 1 2018, 5:13 PM
RobH removed ArielGlenn as the assignee of this task.

This has been approved for order via T190112. As such, I'm setting this to stalled until it arrives.

In theory the new host arrives today, and if all goes well it should be available to get its puppet role by early next week. We can probably use raid1-lvm-ext4-srv.cfg for it (we can't use the recipe that snapshot1005-7 have, because those are HP boxes with hardware RAID).

Stacking up some commits we'll want for the rollout: turning off the misc dump crons on snapshot1007, setting up the new role for the dedicated host, applying the role and adding hiera settings for the new host, and increasing the number of shards for wikidata weekly runs on that host.
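For what "more shards in parallel" amounts to in practice, here's a minimal sketch of launching one dump process per shard and waiting for all of them. The script name and its flags are hypothetical placeholders for whatever the real wikidata weekly cron invokes; the shard count of 8 matches the patch below, otherwise it too is just an example:

```python
import subprocess

SHARDS = 8  # matches the "8 jobs in parallel" change below

def run_weekly_dump(shards=SHARDS):
    """Start one dump process per shard and wait for all of them.
    './dumpwikidata-shard.sh' and its arguments are placeholders."""
    procs = [
        subprocess.Popen(
            ["./dumpwikidata-shard.sh", "--shard", str(i), "--shard-count", str(shards)]
        )
        for i in range(shards)
    ]
    failures = [p.args for p in procs if p.wait() != 0]
    if failures:
        raise RuntimeError(f"shard jobs failed: {failures}")

if __name__ == "__main__":
    run_weekly_dump()
```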

Change 432365 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] turn off misc dump crons on snapshot1007

https://gerrit.wikimedia.org/r/432365

Change 432366 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] role for new misc dumps cron host

https://gerrit.wikimedia.org/r/432366

Change 432367 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] add snapshot1008 role and hiera settings

https://gerrit.wikimedia.org/r/432367

Change 432368 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] do 8 jobs in parallel for wikidata weeklies

https://gerrit.wikimedia.org/r/432368

Change 432365 merged by ArielGlenn:
[operations/puppet@production] turn off misc dump crons on snapshot1007

https://gerrit.wikimedia.org/r/432365

Change 432366 merged by ArielGlenn:
[operations/puppet@production] role for new misc dumps cron host

https://gerrit.wikimedia.org/r/432366

Change 432367 merged by ArielGlenn:
[operations/puppet@production] add snapshot1008 role and hiera settings

https://gerrit.wikimedia.org/r/432367

To do:

- make sure that cron jobs kick off as we expect (just wait for any one to run; see the sketch after this list)
- increase the number of shards for wikidata weeklies to 8 (requires T147169), patchset is here: https://gerrit.wikimedia.org/r/#/c/432368/
- decide whether to keep the dumps monitor on snapshot1007, as it currently is, or move it to the new host
- increase the number of jobs on snapshot1007 to the same as snapshot1005 and 1006
- maybe remove the cron scripts from snapshot1007 (edit the role) so no one tries to run them there
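For the first item, a quick sanity check could look like the sketch below. The output root is a guess at the layout, not the real path on snapshot1008; it just reports the newest file under each dump directory and flags anything that hasn't been touched within the expected interval:

```python
import os
import time

DUMP_ROOT = "/mnt/dumpsdata/otherdumps"  # hypothetical output root; adjust to the real tree
MAX_AGE_DAYS = 8                         # weekly jobs plus a little slack

def newest_mtime(path):
    """Return the most recent modification time of any file under path."""
    newest = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                newest = max(newest, os.path.getmtime(os.path.join(dirpath, name)))
            except OSError:
                pass  # file vanished mid-scan; ignore
    return newest

def check_crons(root=DUMP_ROOT, max_age_days=MAX_AGE_DAYS):
    """Print ok/STALE per dump directory, based on the newest file's age."""
    cutoff = time.time() - max_age_days * 86400
    for entry in sorted(os.listdir(root)):
        full = os.path.join(root, entry)
        if os.path.isdir(full):
            status = "ok" if newest_mtime(full) >= cutoff else "STALE"
            print(f"{status:5} {entry}")

if __name__ == "__main__":
    check_crons()
```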

@hoo: wikidata weeklies now run on snapshot1008. Do not try to run them on snapshot1007!

ArielGlenn claimed this task.

Everything looks good, let's see if I can close this with the subtask still open, heh.

Change 436276 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] Dumps: Add groups to dumper_misc_crons_only hosts

https://gerrit.wikimedia.org/r/436276

Change 436276 merged by ArielGlenn:
[operations/puppet@production] Dumps: Add groups to dumper_misc_crons_only hosts

https://gerrit.wikimedia.org/r/436276

RobH closed subtask Unknown Object (Task) as Resolved.May 31 2018, 4:56 PM
ArielGlenn mentioned this in Unknown Object (Task).Jun 1 2018, 10:38 AM

Change 432368 merged by ArielGlenn:
[operations/puppet@production] do 8 jobs in parallel for wikidata weeklies

https://gerrit.wikimedia.org/r/432368