Remove groups from db configs
Open, Medium, Public

Description

Per T239453#5910993 we should look at removing these groups from the MW db lb configs:

  • contributions
  • logpager
  • recentchanges
  • recentchangeslinked
  • watchlist

That would leave us with:

  • api
  • dump
  • vslow

This would greatly simplify automating depooling/repooling db instances.

Outcome

  • Confirmation and understanding that these configs are functionally safe to remove in production without requiring any code changes, even if they are still referenced. To be walked through by @Krinkle together with one or two people from PET as part of knowledge transfer and familiarity with Wikimedia-Rdbms.
  • Understanding and agreement on which of these (if any) we need to keep, and why. To be carried out by Platform Engineering based on the history of the db index and its usage in code.
  • T267077: Document remaining database load groups
  • Remove obsolete db group parameters from MW core and WMF-deployed extensions.

Removal progress

  • s1
    • eqiad
    • codfw
  • s2
    • eqiad
    • codfw
  • s3
    • eqiad
    • codfw
  • s4
    • eqiad
    • codfw
  • s5
    • eqiad
    • codfw
  • s6
    • eqiad
    • codfw
  • s7
    • eqiad
    • codfw
  • s8
    • eqiad
    • codfw

Event Timeline

So in the past, the hosts serving those 5 groups used to have a different schema (different indexes, PKs and even partitioning).
After completing T239453, T233625, T223151 and the whole series of revision table unification (T132416 and friends), all the hosts within s1 sections have the same schema (citation needed).

<ref>P12749</ref> :-)

Kormat renamed this task from Remove sections from db configs to Remove groups from db configs. Sep 23 2020, 12:13 PM
Kormat updated the task description.

Another reason to remove load groups where possible is they make it very difficult to predict what effect depooling a db server will have on the query load over the rest of the section. Say a host has some weight in one or more groups as well as the main traffic group, and is receiving 15K qps. We can't tell from the DBA side what proportions of the incoming traffic correspond to what groups. Depooling it might spread 15K qps over all hosts in the section, or it might dump 15K qps into the other host in the same group as it.

This means we need to look at all groups a db instance is in, what other hosts are in those groups, and make an educated guess about what rebalancing needs to be done before we can depool one instance. There's a lot of room for getting things wrong.
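The manual check described above (which groups a host is in, and who else serves them) can be sketched roughly as follows. This is an illustration only: `group_peers` is a hypothetical helper, the host names are made up, and the `{group: {host: weight}}` shape is assumed to match the per-section group layout published in the dbconfig JSON.

```python
def group_peers(section_groups, host):
    """For each group `host` belongs to, list the other hosts in that group.

    `section_groups` is assumed to look like {group: {host: weight}}.
    """
    peers = {}
    for group, members in section_groups.items():
        if host in members:
            peers[group] = sorted(h for h in members if h != host)
    return peers


# Hypothetical section layout, for illustration only.
example = {
    "api": {"db-a": 100, "db-b": 100},
    "watchlist": {"db-b": 100, "db-c": 50},
    "vslow": {"db-d": 100},
}

print(group_peers(example, "db-b"))
# {'api': ['db-a'], 'watchlist': ['db-c']}
```

An empty peer list for a group would mean the host is the only member of that group, so depooling it leaves the group with no replicas. The sketch says nothing about how much traffic each group carries, which is the part that cannot be seen from the DBA side.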

I've added to the task description:

  • Understanding and agreement on which of these (if any) we need to keep, and why. To be carried out by Platform Engineering based on the history of the db index and its usage in code.
  • Confirmation and understanding that these configs are functionally safe to remove in production without requiring any code changes, even if they are still referenced. To be walked through by @Krinkle together with one or two people from PET as part of knowledge transfer and familiarity with Wikimedia-Rdbms.

@daniel Are you doing the first one?

@Krinkle I think there are two parts to this. In my mind, the groups used in code are basically hints to the DB layer that a given cluster may or may not make use of. We should have an established set of "groups" to use in core, with a defined meaning. That set may be larger than the set we actually configure in prod. I guess the task for Platform Engineering would be to come up with a canonical list of db group hints for use in core, and properly document it. Does that sound right?

A quick inventory of DB groups used in core, based on some ad-hoc grep runs:

includes/actions/InfoAction.php:748: $dbrWatchlist = wfGetDB( DB_REPLICA, 'watchlist' );
includes/specials/pagers/DeletedContribsPager.php:75: $this->mDb = wfGetDB( DB_REPLICA, 'contributions' );
includes/specials/pagers/ContribsPager.php.orig:143: $this->mDb = wfGetDB( DB_REPLICA, 'contributions' );
includes/specials/pagers/ContribsPager.php:156: $this->mDb = wfGetDB( DB_REPLICA, 'contributions' );
includes/specials/SpecialWatchlist.php:463: return wfGetDB( DB_REPLICA, 'watchlist' );
includes/specials/SpecialActiveUsers.php:153: $dbr = wfGetDB( DB_REPLICA, 'recentchanges' );
includes/specials/SpecialRecentChangesLinked.php:82: $dbr = wfGetDB( DB_REPLICA, 'recentchangeslinked' );
includes/specials/SpecialRecentChanges.php:391: return wfGetDB( DB_REPLICA, 'recentchanges' );
includes/api/ApiBase.php:596: $this->mReplicaDB = wfGetDB( DB_REPLICA, 'api' );
includes/api/ApiSetNotificationTimestamp.php:62: $dbw = wfGetDB( DB_MASTER, 'api' );
includes/specialpage/QueryPage.php:416: return wfGetDB( DB_REPLICA, [ $this->getName(), 'QueryPage::recache', 'vslow' ] );
includes/logging/LogPager.php:104: $this->mDb = wfGetDB( DB_REPLICA, 'logpager' );
includes/CategoryViewer.php:292: $dbr = wfGetDB( DB_REPLICA, 'category' );
maintenance/categoryChangesAsRdf.php:134: $dbr = $this->getDB( DB_REPLICA, [ 'vslow' ] );
maintenance/getReplicaServer.php:42: $db = $this->getDB( DB_REPLICA, $this->getOption( 'group' ) );
maintenance/cleanupInvalidDbKeys.php:139: $dbr = $this->getDB( DB_REPLICA, 'vslow' );
maintenance/refreshLinks.php:108: $dbr = $this->getDB( DB_REPLICA, [ 'vslow' ] );
maintenance/refreshLinks.php:299: $dbr = $this->getDB( DB_REPLICA, [ 'vslow' ] );
maintenance/refreshLinks.php:341: $dbr = $this->getDB( DB_REPLICA, [ 'vslow' ] );
maintenance/dumpCategoriesAsRdf.php:152: $dbr = $this->getDB( DB_REPLICA, [ 'vslow' ] );
maintenance/migrateArchiveText.php:67: $dbr = $this->getDB( DB_REPLICA, [ 'vslow' ] );
maintenance/recountCategories.php:116: $dbr = $this->getDB( DB_REPLICA, 'vslow' );
maintenance/includes/MigrateActors.php:86: $dbr = $this->getDB( DB_REPLICA, [ 'vslow' ] );
maintenance/includes/BackupDumper.php:348: $dbr = $this->getDB( DB_REPLICA, [ 'dump' ] );
maintenance/updateArticleCount.php:49: $dbr = $this->getDB( DB_REPLICA, 'vslow' );
maintenance/populateIpChanges.php:73: $dbr = $this->getDB( DB_REPLICA, [ 'vslow' ] );
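For anyone repeating this inventory, a listing like the one above could be tallied with a small script. This is a rough sketch, not a tool used in this task: the regular expressions and the helper names `groups_in_line` and `tally_groups` are assumptions. It only picks up group names written as string literals, and skips dynamic arguments such as `$this->getOption( 'group' )`.

```python
import re
from collections import Counter

# Matches wfGetDB( DB_X, <arg> ); and ->getDB( DB_X, <arg> ); and captures <arg>.
CALL = re.compile(r"(?:wfGetDB|getDB)\(\s*DB_\w+\s*,\s*(.+?)\s*\);")
# Nested calls such as $this->getName() or $this->getOption( 'group' ).
NESTED_CALL = re.compile(r"[\w$>-]+\([^()]*\)")
QUOTED = re.compile(r"'([^']*)'")


def groups_in_line(line):
    """Return the literal group names passed in one grep hit, if any."""
    match = CALL.search(line)
    if match is None:
        return []
    # Drop nested calls first so their string arguments are not counted.
    literal_part = NESTED_CALL.sub("", match.group(1))
    return QUOTED.findall(literal_part)


def tally_groups(lines):
    """Count how often each literal group name appears across grep hits."""
    counts = Counter()
    for line in lines:
        counts.update(groups_in_line(line))
    return counts


sample = [
    "includes/specials/SpecialWatchlist.php:463: return wfGetDB( DB_REPLICA, 'watchlist' );",
    "includes/specialpage/QueryPage.php:416: return wfGetDB( DB_REPLICA, [ $this->getName(), 'QueryPage::recache', 'vslow' ] );",
    "maintenance/getReplicaServer.php:42: $db = $this->getDB( DB_REPLICA, $this->getOption( 'group' ) );",
    "maintenance/refreshLinks.php:108: $dbr = $this->getDB( DB_REPLICA, [ 'vslow' ] );",
]

print(sorted(tally_groups(sample).items()))
# [('QueryPage::recache', 1), ('vslow', 2), ('watchlist', 1)]
```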

Not sure what category is as it doesn't appear on https://noc.wikimedia.org/dbconfig/eqiad.json

From there we have these active groups:

api
contributions
dump
logpager
recentchanges
recentchangeslinked
vslow
watchlist

Krinkle updated the task description.

A reminder that T195578 is waiting for feedback to see if it would be useful to gather query performance statistics.

@Marostegui Should we assume that it has already been determined that there is no significant benefit today to the (other) query groups? Or is this task also asking for that query performance/analysis to happen first? If not, what is the current proposal based on?

In any event, a first step might be to start adjusting the weights so that e.g. other replicas are part of it as well, and then later with heavier weight than the current ones in there. Or perhaps this was already tried? Or if we feel more confident, you could also remove them directly without any transition, maybe one shard at a time and stand by with a revert based on how it affects db health metrics.

daniel added a subscriber: WDoranWMF.

Back to the PET inbox per @WDoranWMF. We need to figure out where this fits in our process/roadmap.

@daniel @WDoranWMF Now that the docs have landed (thanks @nnikkhoui), I believe the next step is removing the obsolete group definitions from wmf-config, and removing the parameters from core/wmf code bases.

Until that is done, developers will face confusion such as in change 606020.

@Kormat @Marostegui I believe this is unblocked now for you to remove groups from the db configuration.

At this time, these groups remain referenced in MediaWiki source code. This is so that you can remove them as safely and gradually as you like for individual db sections and monitor their impact, with the ability to easily reverse it if needed. Whenever a group is not defined, the query goes to the default replica set instead. This fallback is known to work correctly and is already used for groups that we never used, or no longer use, at WMF in the first place. Once they are removed and you are happy with the result, ping back here so that we can remove the remaining references to those removed groups from the source code.
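The fallback described above can be pictured with a small sketch. This is not MediaWiki's load balancer code: `pick_replica_pool`, the argument shapes, and trying the requested groups in the order given are assumptions made for illustration.

```python
def pick_replica_pool(requested_groups, group_config, default_replicas):
    """Return (group_used, hosts) for a query.

    Falls back to the default replica set when none of the requested
    groups is defined (or a defined group has no hosts).
    """
    if isinstance(requested_groups, str):
        requested_groups = [requested_groups]
    for group in requested_groups or []:
        members = group_config.get(group)
        if members:
            return group, members
    return None, default_replicas


# Hypothetical section: only 'api' and 'vslow' remain configured.
config = {
    "api": {"db-a": 100},
    "vslow": {"db-d": 100},
}
default = {"db-a": 200, "db-b": 300, "db-c": 300}

print(pick_replica_pool("watchlist", config, default))
# (None, {'db-a': 200, 'db-b': 300, 'db-c': 300})
print(pick_replica_pool(["QueryPage::recache", "vslow"], config, default))
# ('vslow', {'db-d': 100})
```

Under this model, removing a group from the configuration changes where the query is routed but not whether it succeeds, which is why the references in code can stay in place while the config is cleaned up section by section.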

Thanks @Krinkle - I will probably start first with s6 codfw (frwiki,jawiki,ruwiki), and using wikimediadebug to browse the codfw site to first make sure nothing is broken and then deploy to eqiad.

While inspecting an anon edit on dewiki (s5) I noticed the post request for the edit established two separate replica connections. The second one for a query tagged as contributions. This task's solution would presumably avoid that extra connection.

I want to start working on this next week on s6.

I am going to start with s6 codfw and see what happens. My plan is to leave logpager group without replicas and browse some of the wikis there from codfw (including wikitech) and see if there's something breaking.

Mentioned in SAL (#wikimedia-operations) [2021-10-27T05:31:05Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Remove logpager replicas from s6.codfw T263127', diff saved to https://phabricator.wikimedia.org/P17611 and previous config saved to /var/cache/conftool/dbconfig/20211027-053104-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-10-27T06:06:35Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Remove recentchanges and recentchangeslinked replicas from s6.codfw T263127', diff saved to https://phabricator.wikimedia.org/P17612 and previous config saved to /var/cache/conftool/dbconfig/20211027-060634-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-10-27T07:26:54Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Remove watchlist replicas from s6.codfw T263127', diff saved to https://phabricator.wikimedia.org/P17613 and previous config saved to /var/cache/conftool/dbconfig/20211027-072546-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-10-27T07:49:36Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Contributions replicas from s6.codfw T263127', diff saved to https://phabricator.wikimedia.org/P17614 and previous config saved to /var/cache/conftool/dbconfig/20211027-074935-marostegui.json

I have left s6 codfw without any special groups (see: https://noc.wikimedia.org/dbconfig/codfw.json - apart from API and vslow/dumps) and I have had no issues browsing frwiki and wikitech on codfw, and no errors were logged on logstash.

Given that eqiad is the active DC and we'll see the real scenario and impact there, I am going to start with watchlist as it is a "pretty" easy group to test. I am not going to remove more than one group per day, as the effect on the other replicas needs to be monitored (especially with the heavy queries we have for watchlist and recentchanges).

Mentioned in SAL (#wikimedia-operations) [2021-10-27T09:20:43Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Remove watchlist replicas from s6 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17615 and previous config saved to /var/cache/conftool/dbconfig/20211027-092043-marostegui.json

Nothing weird has shown up after removing the watchlist group, so today I am going to go ahead and remove the contributions group, as it is an easy one to test and it does generate some heavy queries, so we can see how the rest of the replicas behave.
Since I am off Friday and Monday, I will do this now so I have the whole Thursday to monitor.

Mentioned in SAL (#wikimedia-operations) [2021-10-28T05:00:52Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Remove contributions replicas from s6 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17622 and previous config saved to /var/cache/conftool/dbconfig/20211028-050052-marostegui.json

I anticipate that removing the contributions query group from higher traffic sections like s1 (enwiki) and s8 (wikidata) would have a positive impact on Backend Save Timing and appserver/api_appserver latencies due to no longer requiring a separate database connection to be established (per T263127#6947113).

That's good news! So far so good with s6, so I am going to continue slowly removing groups. @Krinkle keep in mind though that API and vslow/dump groups will not be removed.

Mentioned in SAL (#wikimedia-operations) [2021-11-02T07:23:21Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Remove recentchangeslinked replicas from s6 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17650 and previous config saved to /var/cache/conftool/dbconfig/20211102-072320-marostegui.json

recentchangeslinked group removed from eqiad - monitoring now.

Mentioned in SAL (#wikimedia-operations) [2021-11-02T09:03:07Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Remove recentchanges replicas from s6 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17651 and previous config saved to /var/cache/conftool/dbconfig/20211102-090306-marostegui.json

I have now removed recentchanges group from s6, which is a "big" group - so we'll see how it goes in the next few hours.
Pending logpager which is another "big" one.

Mentioned in SAL (#wikimedia-operations) [2021-11-03T07:58:02Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Remove logpager replicas from s6 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17660 and previous config saved to /var/cache/conftool/dbconfig/20211103-075801-marostegui.json

Removed logpager from s6 eqiad.
There are no other groups now in s6.

I will need to tweak weights a bit once everything is stable.

"s6": {
  "api": {
    "db1131": 100,
    "db1165": 100
  },
  "dump": {
    "db1113:3316": 100
  },
  "vslow": {
    "db1113:3316": 100
  }
},
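A quick way to confirm that a section has only the groups this task intends to keep (api, dump, vslow) is to diff its group names against that set. The sketch below is illustrative: `leftover_groups` is a hypothetical helper, and the input is just the s6 snippet above wrapped in braces, not a live fetch of the dbconfig JSON.

```python
import json

# Groups the task description says should remain.
KEEP = {"api", "dump", "vslow"}


def leftover_groups(group_loads):
    """Map each section to the groups it still has beyond KEEP.

    Sections with nothing left over are omitted from the result.
    """
    result = {}
    for section, groups in group_loads.items():
        extra = sorted(set(groups) - KEEP)
        if extra:
            result[section] = extra
    return result


s6_snippet = json.loads("""
{
  "s6": {
    "api": {"db1131": 100, "db1165": 100},
    "dump": {"db1113:3316": 100},
    "vslow": {"db1113:3316": 100}
  }
}
""")

print(leftover_groups(s6_snippet))
# {}
```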

Mentioned in SAL (#wikimedia-operations) [2021-11-04T05:54:19Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Increase weight for the old special replicas T263127', diff saved to https://phabricator.wikimedia.org/P17679 and previous config saved to /var/cache/conftool/dbconfig/20211104-055419-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-11-08T11:59:45Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Remove contributions logpager recentchanges recentchangeslinked watchlist from s5 codfw T263127', diff saved to https://phabricator.wikimedia.org/P17707 and previous config saved to /var/cache/conftool/dbconfig/20211108-115945-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-11-08T12:02:04Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust weights for s5 codfw replicas after removing special groups from them T263127', diff saved to https://phabricator.wikimedia.org/P17708 and previous config saved to /var/cache/conftool/dbconfig/20211108-120203-marostegui.json

I have removed s5 codfw groups and adjusted the weights.
Browsing dewiki in codfw looks fine (slow as usual but fine overall).

Mentioned in SAL (#wikimedia-operations) [2021-11-11T09:25:29Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Remove contributions from s5 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17725 and previous config saved to /var/cache/conftool/dbconfig/20211111-092528-marostegui.json

contributions special group in eqiad is gone. I am not going to remove more groups this week as tomorrow is Friday and I am off Monday and Tuesday next week. Let's see how this goes

Mentioned in SAL (#wikimedia-operations) [2021-11-17T06:04:26Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Remove logpager from s5 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17742 and previous config saved to /var/cache/conftool/dbconfig/20211117-060426-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-11-17T10:51:21Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Remove recentchangeslinked from s5 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17753 and previous config saved to /var/cache/conftool/dbconfig/20211117-105120-marostegui.json

Removed recentchangeslinked from s5. "Only" watchlist and recentchanges pending.

Mentioned in SAL (#wikimedia-operations) [2021-11-17T13:48:35Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Remove recentchanges from s5 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17754 and previous config saved to /var/cache/conftool/dbconfig/20211117-134835-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-11-17T13:49:42Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Change weights on s5 special slaves in eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17755 and previous config saved to /var/cache/conftool/dbconfig/20211117-134942-marostegui.json

Also starting to adjust weight on db1096:3315 and db1144:3315

Mentioned in SAL (#wikimedia-operations) [2021-11-18T07:06:20Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Remove watchlist from s5 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17770 and previous config saved to /var/cache/conftool/dbconfig/20211118-070620-marostegui.json

watchlist removed from s5 eqiad. s5 has no special slaves anymore.