Page MenuHomePhabricator

Add sharding to site_stats table
Closed, ResolvedPublic

Description

NOTE: Documentation for users:
  • In the databases for English Wikipedia, Wikimedia Commons, Wikidata, and Test Wikipedia (enwiki, commonswiki, wikidatawiki and testwiki), the site_stats table now has more than one row, and the correct number for each column is the sum of that column across all rows.
  • Quarry queries that use this table should be updated to use the SUM() of a column; For an example, if you previously had:
      • SELECT ss_total_edits FROM site_stats;
    • change it to either:
      • SELECT SUM(ss_total_edits) FROM site_stats;
      • SELECT SUM(ss_total_edits) AS ss_total_edits FROM site_stats;.
  • (This is also safe to do on all other wikis, where the table still has only one row.)

Original Task description:
site_stats table has quite a write pressure (you can see the numbers in performance_schema tables). This is mostly the case in commons/wikidata/enwiki for updating edit count. We used to have SiteStatsAsyncFactor config saving the values in memcached and then saving them from time to time and it got removed (for good reasons, it was never deployed and did a lot of complex magic for no good reason).

An alternative idea is to simply shard. Instead of one row, keep ten (configurable number) and update one row at random and in reading the table just sum the values up. The idea is inspired by DBSerialProvider done by @tstarling in gerrit:767617

Event Timeline

Are there any metrics indicating the scale of the problem? Are there deadlocks or transaction slow downs? The idea sounds plausible.

It is one of the tables that has time spent in write (=waiting for lock). It basically shows up at anything I look:
You can look at performance metrics by doing sql wikidatawiki --write and the use performance_schema

wikiadmin@10.64.16.85(performance_schema)> select OBJECT_NAME from table_lock_waits_summary_by_table where OBJECT_SCHEMA = 'wikidatawiki' order by SUM_TIMER_WRITE desc limit 10;
+-----------------------+
| OBJECT_NAME           |
+-----------------------+
| revision              |
| page                  |
| wbt_item_terms        |
| cu_changes            |
| site_stats            |
| recentchanges         |
| user                  |
| text                  |
| revision_comment_temp |
| content               |
+-----------------------+
10 rows in set (0.006 sec)

Change 786389 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] [POC] Sharding in site_stats update

https://gerrit.wikimedia.org/r/786389

I made a POC for it.

Some issues that can be easily fixed:

  • Make it configurable
  • Make the reading work, strangely it heavily relies on the table being one row :/

A complicating note:

  • Updates can be negative and if you end up in wrong time, you might make a row value negative and since the table fields all are unsigned, it'll become zero (or fail?). Some might argue it's not a big deal to fix though.
Ladsgroup edited projects, added DBA; removed Data-Persistence (Consultation).
Ladsgroup moved this task from Triage to In progress on the DBA board.

I thought that we updated this table in autocommit mode within POSTSEND deferred updates. Is there a case that is not doing that, it should be fixed. Is there a user impact, then? Nothing stands out in the DBPerformance log either.

I thought that we updated this table in autocommit mode within POSTSEND deferred updates.

That moves the problem from user-side to the db side. Sure, the user doesn't see the slowness but the wait for lock still exists.

Is there a case that is not doing that, it should be fixed. Is there a user impact, then? Nothing stands out in the DBPerformance log either.

Again, this is an issue on the db side, if you look at performance_schema stats, not on user side. It's causing general slowdown of writes being written and committed and nothing you could measure in user-side.

Change 786389 merged by jenkins-bot:

[mediawiki/core@master] Allow sharding in site_stats update

https://gerrit.wikimedia.org/r/786389

Change 801751 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.39.0-wmf.14] Allow sharding in site_stats update

https://gerrit.wikimedia.org/r/801751

Change 801796 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] beta: Enable multi-shard site_stats in enwiki

https://gerrit.wikimedia.org/r/801796

Change 801796 merged by jenkins-bot:

[operations/mediawiki-config@master] beta: Enable multi-shard site_stats in enwiki

https://gerrit.wikimedia.org/r/801796

Change 801801 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Enable MultiShardSiteStats in several large wikis and testwiki

https://gerrit.wikimedia.org/r/801801

Change 801751 merged by jenkins-bot:

[mediawiki/core@wmf/1.39.0-wmf.14] Allow sharding in site_stats update

https://gerrit.wikimedia.org/r/801751

Change 801801 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable MultiShardSiteStats in several large wikis and testwiki

https://gerrit.wikimedia.org/r/801801

Mentioned in SAL (#wikimedia-operations) [2022-05-31T19:07:47Z] <ladsgroup@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:801801|Enable MultiShardSiteStats in several large wikis and testwiki (T306589)]] (duration: 03m 09s)

Change 801752 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.39.0-wmf.13] Allow sharding in site_stats update

https://gerrit.wikimedia.org/r/801752

Mentioned in SAL (#wikimedia-operations) [2022-05-31T19:12:58Z] <ladsgroup@deploy1002> Synchronized php-1.39.0-wmf.14/includes/: Backport: [[gerrit:801751|Allow sharding in site_stats update (T306589)]] (duration: 03m 25s)

Change 801752 merged by jenkins-bot:

[mediawiki/core@wmf/1.39.0-wmf.13] Allow sharding in site_stats update

https://gerrit.wikimedia.org/r/801752

Mentioned in SAL (#wikimedia-operations) [2022-05-31T19:33:03Z] <ladsgroup@deploy1002> Synchronized php-1.39.0-wmf.13/includes/: Backport: [[gerrit:801752|Allow sharding in site_stats update (T306589)]] (duration: 03m 20s)

Contention causes has drastically reduced and now it's been under 0.3 all the time:

image.png (398×1 px, 53 KB)

Would be nice to have this in Tech News, some Quarry queries use this table.

Would be nice to have this in Tech News, some Quarry queries use this table.

Re: Tech News - What wording would you suggest as the content? (Drafts of ~simple English descriptions for an entry are always appreciated, especially for fairly technical entries like this!) Thanks!

Well, Amir asked me to add the tag, but as he’s still on vacation I can try to write some content :D

In the databases for English Wikipedia, Wikimedia Commons, Wikidata, and Test Wikipedia (enwiki, commonswiki, wikidatawiki and testwiki), the site_stats table now has more than one row, and the correct number for each column is the sum of that column across all rows. Quarry queries that use this table should be updated to use the SUM() of a column; for instance, if you previously had SELECT ss_total_edits FROM site_stats;, change it to SELECT SUM(ss_total_edits) FROM site_stats; or SELECT SUM(ss_total_edits) AS ss_total_edits FROM site_stats;. (This is also safe to do on all other wikis, where the table still has only one row.)

I guess mw.org should also be updated (site_stats table, and a new page for this setting). Should that wait until the setting is no longer EXPERIMENTAL?

Thanks Lucas! I think that draft might be a little bit too detailed, given the small audience, so what I'd suggest, is to:

  • Copy that more detailed explanation into the top of this task's "Description" section,
  • Use a simple headline for tech news, pointing here for more info. E.g.

Some recent database changes may affect queries using the Quarry tool. Queries for site_stats at English Wikipedia, Commons, and Wikidata will need to be updated. Read more.

I'm not certain if that is accurate, and this doesn't seem urgent, and it's your weekend, and other docs updates might be useful, so I'll postpone including this entry in Tech News until the following week's edition.

Some recent database changes may affect queries using the Quarry tool. Queries for site_stats at English Wikipedia, Commons, and Wikidata will need to be updated. Read more.

Sounds good to me 👍

  1. I've updated the Description using your text (lightly formatted. Please edit further as desired!).
  2. I've added an entry to Tech News with the proposed text above (please tell me if changes are needed, or edit directly, within ~24 hours)
  3. Reminder to update the mw.org docs as/when needed.

Thanks!

Well, Amir asked me to add the tag, but as he’s still on vacation I can try to write some content :D

In the databases for English Wikipedia, Wikimedia Commons, Wikidata, and Test Wikipedia (enwiki, commonswiki, wikidatawiki and testwiki), the site_stats table now has more than one row, and the correct number for each column is the sum of that column across all rows. Quarry queries that use this table should be updated to use the SUM() of a column; for instance, if you previously had SELECT ss_total_edits FROM site_stats;, change it to SELECT SUM(ss_total_edits) FROM site_stats; or SELECT SUM(ss_total_edits) AS ss_total_edits FROM site_stats;. (This is also safe to do on all other wikis, where the table still has only one row.)

I guess mw.org should also be updated (site_stats table, and a new page for this setting). Should that wait until the setting is no longer EXPERIMENTAL?

Maybe it is easier to change the view to do the SUM instead to require to add the SUM. That would hide the technical detail, but that sounds not for something really needed there.

Maybe it is easier to change the view to do the SUM instead to require to add the SUM. That would hide the technical detail, but that sounds not for something really needed there.

Hm, that’s a good point, I don’t see an obvious reason why that wouldn’t be possible for the cloud replicas.

I respectfully disagree. We are making a lot of changes to mediawiki's schema and this slowly makes the whole system to drift between production and cloud leading to all sorts of maintenance headache in the long run. We will end up with basically two different worlds. The only time it's a good idea to use views is to hide private information (with the exception of temp b/c for heavily used data)