Consider partitioning local_group_wikipedia even more
Closed, Resolved · Public

Description

According to T169939#3490932, the wikipedia storage group currently accounts for the vast majority of storage needs. We need to consider the pros and cons of splitting it into several more storage groups when migrating to the new storage model.

Event Timeline

Pchelolo created this task. Aug 1 2017, 9:14 PM
Restricted Application added a subscriber: Aklapper. Aug 1 2017, 9:14 PM
mobrovac raised the priority of this task from Medium to High. Sep 12 2017, 10:04 AM
mobrovac added a subscriber: mobrovac.

Raising the priority as we should settle on this before migrating to the new storage scheme.

Based on the importance of the projects, their number of articles and their edit rate, I would propose the following changes:

  • fold all non-WP projects except phase0 and the global domain into one storage group
  • separate out enwiki and commons into their own storage groups
  • put dewiki and frwiki into one storage group
  • put eswiki, itwiki, ruwiki and jawiki into one storage group

I also think we should get rid of the .group.local suffix as its meaning is outdated and no longer provides any benefit.

In configuration terms, this could look something like this:

storage_groups:
  - name: globaldomain
    domains: /^wikimedia\.org$/
  - name: enwiki
    domains: /^en\.wikipedia\.org$/
  - name: commons
    domains: /^commons\.wikimedia\.org$/
  - name: de_fr
    domains: /^(?:de|fr)\.wikipedia\.org$/
  - name: es_it_ru_ja
    domains: /^(?:es|it|ru|ja)\.wikipedia\.org$/
  - name: wikipedia
    domains: /\.wikipedia\.org$/
  - name: phase0
    domains: /^(?:test.*\.wiki.*\.org|www\.mediawiki\.org)$/
  - name: sister_projects
    domains: /\.wik(?:tionary|ibooks|isource|iquote|inews|iversity|ivoyage|imedia)\.org$/
  - name: catch_all
    domains: /./

This reduces the number of storage groups from 12 to 9, which means fewer keyspaces, while also evening out the storage ratio amongst storage groups.
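To see how the proposed patterns would route individual domains, here is a minimal sketch in Python. It assumes that groups are tried in list order and that the first matching pattern wins; the thread does not state this, so treat it as an assumption. The names `STORAGE_GROUPS` and `storage_group_for` are hypothetical and exist only for this illustration.

```python
import re

# Patterns transcribed from the proposed configuration above, without the
# JS-style /.../ delimiters and with the dots in www.mediawiki.org escaped.
# ASSUMPTION: groups are evaluated in list order and the first match wins.
STORAGE_GROUPS = [
    ("globaldomain", r"^wikimedia\.org$"),
    ("enwiki", r"^en\.wikipedia\.org$"),
    ("commons", r"^commons\.wikimedia\.org$"),
    ("de_fr", r"^(?:de|fr)\.wikipedia\.org$"),
    ("es_it_ru_ja", r"^(?:es|it|ru|ja)\.wikipedia\.org$"),
    ("wikipedia", r"\.wikipedia\.org$"),
    ("phase0", r"^(?:test.*\.wiki.*\.org|www\.mediawiki\.org)$"),
    ("sister_projects",
     r"\.wik(?:tionary|ibooks|isource|iquote|inews|iversity|ivoyage|imedia)\.org$"),
    ("catch_all", r"."),
]


def storage_group_for(domain):
    """Return the name of the first group whose pattern matches the domain."""
    for name, pattern in STORAGE_GROUPS:
        if re.search(pattern, domain):
            return name
    return None  # unreachable while catch_all is present


if __name__ == "__main__":
    for d in ("en.wikipedia.org", "pl.wikipedia.org", "test.wikipedia.org",
              "www.mediawiki.org", "www.wikidata.org"):
        print(d, "->", storage_group_for(d))
```

Under the first-match assumption, test.wikipedia.org resolves to the wikipedia group rather than phase0, because the wikipedia entry is listed before phase0. Whether that is intended is not stated in the thread; moving phase0 above wikipedia would change it.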

GWicke added a subscriber: GWicke. Edited Sep 12 2017, 5:20 PM

Considering the scalability limits of Cassandra's schema synchronization we see in production, I think it would be good to reduce the number of storage groups more aggressively. Perhaps something like this?

  • phase0 wikis
  • enwiki
  • all other wikipedias
  • globaldomain (wikimedia.org)
  • commons
  • remaining projects

wikidata is a very specific project as well and we sometimes need to do something different with Wikidata, just like commons, so we might consider separating it as well.

Eevans added a subscriber: Eevans. Sep 12 2017, 6:24 PM

Considering the scalability limits of Cassandra's schema synchronization we see in production, I think it would be good to reduce the number of storage groups more aggressively...

It's worth noting that it's always been a little janky, but has only more recently become sort of miserable; getting the number down by any significant amount ought to help a great deal. However, since the number of tables/use-cases is only ever likely to increase, I'm very much in favor of making the number of groups as small as possible (with the obvious proviso that doing so doesn't create problems that are worse). And this is really our only opportunity to do so (easily).

...Perhaps something like this?

  • phase0 wikis
  • enwiki
  • all other wikipedias
  • globaldomain (wikimedia.org)
  • commons
  • remaining projects

This is half, if I'm not mistaken. That would bring us down to ~125 keyspaces, or about 250 tables.
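The arithmetic behind that estimate can be sketched as follows. This is only a back-of-the-envelope model: it assumes that "half" refers to going from the 12 current groups to the 6 listed, and that the keyspace count scales linearly with the number of storage groups. Neither assumption is stated explicitly in the thread, and the function name `estimate` is hypothetical.

```python
# Figures quoted in the thread.
CURRENT_GROUPS = 12         # "from 12 to 9" in the earlier proposal
PROPOSED_GROUPS = 6         # the six-item list quoted above
PROJECTED_KEYSPACES = 125   # "~125 keyspaces"
PROJECTED_TABLES = 250      # "about 250 tables"

# Implied by the quoted figures: two tables per keyspace.
TABLES_PER_KEYSPACE = PROJECTED_TABLES / PROJECTED_KEYSPACES


def estimate(groups):
    """Estimate (keyspaces, tables) for a given number of storage groups.

    ASSUMPTION: keyspaces scale linearly with the number of groups.
    """
    keyspaces = round(PROJECTED_KEYSPACES * groups / PROPOSED_GROUPS)
    tables = round(keyspaces * TABLES_PER_KEYSPACE)
    return keyspaces, tables


if __name__ == "__main__":
    for n in (CURRENT_GROUPS, PROPOSED_GROUPS):
        print(n, "groups ->", estimate(n))
```

Under these assumptions the current 12 groups correspond to roughly 250 keyspaces and 500 tables, which is consistent with six groups being "half".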

  • phase0 wikis
  • enwiki
  • all other wikipedias
  • globaldomain (wikimedia.org)
  • commons
  • remaining projects

I think we can fold phase0 into the remaining projects group, I don't see why we would keep it separate.

wikidata is a very specific project as well and we sometimes need to do something different with Wikidata, just like commons, so we might consider separating it as well.

That's on the RESTBase/ChangeProp side. On the Cassandra side, though, we rarely want to intervene manually, so I don't think we need to separate it. Moreover, we skip most updates for WD.

As we're about to start getting the new storage into production, we should make a decision on this one.

It seems that we all agree that the following is the way to go:

  • enwiki
  • all other wikipedias
  • globaldomain (wikimedia.org)
  • commons
  • remaining projects

Correct?

Change 378912 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Config: Add the table_ng options and new storage groups

https://gerrit.wikimedia.org/r/378912

At today's team sync we agreed with @Pchelolo's proposal:

  • enwiki
  • all other wikipedias
  • globaldomain (wikimedia.org)
  • commons
  • remaining projects

Change 378912 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Config: Add the table_ng options and new storage groups

https://gerrit.wikimedia.org/r/378912

mobrovac closed this task as Resolved. Sep 20 2017, 11:41 AM
mobrovac claimed this task.

Patchset merged, deploy about to happen. Resolving.