
wikireplicas: Define MW sections per host
Closed, ResolvedPublic


After being unable to find anywhere whether we decided which sections will go on each host, I am creating a task to get a proposal.
In the planning document we have this:

Ideal is 8 slices * 2 types (web + analytics) * 2 instances each == 32 database instances. Deployed as multi-instance with 4 instances per physical node == 8 nodes

Each host will have 512GB RAM; let's use 80% of it for the buffer pool, so that's 410GB.
Usable disk space for MySQL will be 8.7TB

Host 1:
s1 (enwiki): 150GB
s3: 70GB
s4 (commons): 140GB
s5: 50GB

Total disk space needed (with InnoDB compression enabled): 5TB

Host 2:

s2: 60GB
s6: 50GB
s7 (some big wikis like arwiki, eswiki, metawiki or viwiki): 100GB
s8 (wikidatawiki): 200GB

Total disk space needed (with InnoDB compression enabled): 4.2TB
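The allocation above can be sanity-checked with a quick sketch (the figures are copied from the proposal; the 80% buffer pool rule and the resulting 410GB budget are as stated earlier in this task):

```python
# Check that each host's per-section buffer pool allocations
# fit the 80%-of-RAM budget from the proposal above.
RAM_GB = 512
BUFFER_POOL_GB = round(RAM_GB * 0.8)  # 410GB budget per host

# Per-section buffer pool allocations (GB), from the proposal
host1 = {"s1": 150, "s3": 70, "s4": 140, "s5": 50}
host2 = {"s2": 60, "s6": 50, "s7": 100, "s8": 200}

for name, host in (("Host 1", host1), ("Host 2", host2)):
    total = sum(host.values())
    print(f"{name}: {total}GB allocated of {BUFFER_POOL_GB}GB budget")
```

Both hosts land exactly on the 410GB budget, so the split uses all the available buffer pool without overcommitting.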

@Bstorm it is not yet clear to me whether you want full redundancy between hosts (per service). Taking the Analytics service as an example, we could have two models within the same service:

Model a)
host1 and host3 having identical data and serving s1, s3, s4 and s5
host2 and host4 having identical data and serving s2, s6, s7 and s8

Or whether you'd like to have more computational power for reads and have for example:

Model b)
Host 1 serving s1
Host 2 serving s8
Host 3 serving s4 and s7
Host 4 serving s2, s3, s5 and s6

Model a gives us more redundancy, as we can lose up to two hosts per array (like a RAID 10!), but less computational power for reads, since each host serves more sections and hence each section has less buffer pool available.
Model b gives us more power for reads, as big wikis like s1 (enwiki), s4 (commons) and s8 (wikidatawiki) have dedicated resources, but if we lose a host, we lose some sections.
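The trade-off can be made concrete with a small sketch (host-to-section mappings copied from the two models above; under model a, host1/host3 and host2/host4 are mirrored pairs):

```python
# Which sections become unavailable when a single host fails,
# under each of the two proposed models?
MODEL_A = {
    "host1": {"s1", "s3", "s4", "s5"},
    "host2": {"s2", "s6", "s7", "s8"},
    "host3": {"s1", "s3", "s4", "s5"},  # mirror of host1
    "host4": {"s2", "s6", "s7", "s8"},  # mirror of host2
}
MODEL_B = {
    "host1": {"s1"},
    "host2": {"s8"},
    "host3": {"s4", "s7"},
    "host4": {"s2", "s3", "s5", "s6"},
}

def lost_sections(model, failed_host):
    """Sections no longer served by any surviving host."""
    surviving = set().union(*(secs for h, secs in model.items() if h != failed_host))
    return model[failed_host] - surviving

for model_name, model in (("model a", MODEL_A), ("model b", MODEL_B)):
    for host in model:
        lost = sorted(lost_sections(model, host))
        print(f"{model_name}: {host} down -> lost {lost or 'nothing'}")
```

Under model a, no single-host failure loses any section (the mirror keeps serving it); under model b, every host failure takes its sections offline, which is exactly the redundancy argument made above.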

Event Timeline

Restricted Application added a subscriber: Aklapper. · Oct 9 2020, 12:50 PM

After discussing this as a group, we came to the conclusion that more redundancy would serve everyone better. However, do you have any statistics on what the performance differences would be between the options? Especially in contrast to our existing setup, presumably both would be more performant, yes?

It is pretty hard to know the performance differences, as right now we just have one big pool, and its contents change depending on the queries. I would assume that the most hit wikis are enwiki, commons and wikidata, so most of the pool is likely used for those.
Though, if there are bots or users scraping other wikis, the pool would get "dirty" for everyone, even if that specific wiki is only queried once.
I am fine with both models, it is really up to WMCS to decide :-)

If we go for option a, does that proposal of sections sound good?

Seems fine, yes. We will learn more about how it affects performance after deployment. I hope it works well 😁

On the section split, WMCS agrees with your logic of the most hit wikis being enwiki, commons and wikidata. We won't be perfect in distributing load, but separating those makes sense.

So yes, option a, with the proposed sections sounds good.

Regarding performance, I understand it's hard to give numbers. We presume the performance will be at least equivalent to today, and likely better. For posterity, however: how much effort would be required if we needed or wanted to change to option b during the rollout? I don't foresee this being an issue, but I wanted to understand the cost if we had to pivot for any reason.

We could do the switch from model a to model b, but that means repopulating the data across the hosts, which means downtime for them. I would estimate around 5-7 days if all goes well.

Closing this as the question has been answered.
We are going for model a, with this data distribution:

clouddb1013 s1+s3
clouddb1014 s2+s7
clouddb1015 s4+s6
clouddb1016 s8+s5

clouddb1017 s1+s3
clouddb1018 s2+s7
clouddb1019 s4+s6
clouddb1020 s8+s5

@Bstorm @nskaggs can you sign this off? Looks good?

By splitting the data like this, the major wikis (s1, s4 and s8) are split across hosts, each sharing resources with other "smaller" sections. If a host in either of the "pools" goes down, we still have the mirrored one.
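The final layout can be checked mechanically (hostnames and section assignments copied from the list above; this is just an illustrative sketch):

```python
from collections import Counter

# Final clouddb layout: clouddb1013-1016 form one pool,
# clouddb1017-1020 the mirrored pool.
LAYOUT = {
    "clouddb1013": ("s1", "s3"), "clouddb1014": ("s2", "s7"),
    "clouddb1015": ("s4", "s6"), "clouddb1016": ("s8", "s5"),
    "clouddb1017": ("s1", "s3"), "clouddb1018": ("s2", "s7"),
    "clouddb1019": ("s4", "s6"), "clouddb1020": ("s8", "s5"),
}

# Every section must be served by exactly two hosts (one per pool).
copies = Counter(sec for secs in LAYOUT.values() for sec in secs)
assert all(n == 2 for n in copies.values()), copies
print(sorted(copies))  # all eight sections present, each mirrored
```

This confirms each of the eight sections appears exactly twice, so losing any single host leaves every section served by its mirror.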

Marostegui triaged this task as High priority.Nov 3 2020, 6:41 AM

Setting this to high, as once we have T265135#6598952 approved, we can unblock T260843 and then T267090

Marostegui closed this task as Resolved.Nov 4 2020, 4:05 PM
Marostegui claimed this task.

Closing! Thanks for confirming!

Marostegui changed the status of subtask T267090: Productionize clouddb10[13-20] from Open to Stalled.Nov 6 2020, 6:57 AM
Marostegui changed the status of subtask T267090: Productionize clouddb10[13-20] from Stalled to Open.Nov 10 2020, 7:15 AM