Page MenuHomePhabricator

Enable statement usage tracking on cebwiki
Closed, ResolvedPublic

Description

As agreed in T188730#7461581 and T188730#7525888, we want to remove the wmgWikibaseDisabledUsageAspects setting for Cebuano Wikipedia, and set $wgWBClientSettings['entityUsageModifierLimits']['C'] to 10 (instead of the default 33). This is expected to increase the number of rows in the wbc_entity_usage to no more than 60 million, most likely much less (currently 14.6 million). We should keep track of the size of the table after making the change, as well as the size of the recentchanges table and the ratio of Wikidata edits there, and check on-wiki Lua modules for obvious mistakes if it grows too large.

Acceptance criteria:

  • cebwiki is removed from the wmgWikibaseDisabledUsageAspects setting
  • a new setting for limiting usage tracking for claims is added to initialliseSettings.php
  • the setting value for cebwiki is set to 10
  • $wgWBClientSettings['entityUsageModifierLimits']['C'] is set to the new setting in Wikibase.php
NOTE: once deployed, we should make sure data is being tracked under "Statement Flags" in the cebwiki dashboard https://grafana.wikimedia.org/d/000000176/wikidata-entity-usage-project?orgId=1&var-project=cebwiki

Related Objects

View Standalone Graph
This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here. Use View Standalone Graph to show more of the graph.
StatusSubtypeAssignedTask
OpenNone
Resolvednoarave

Event Timeline

2021-11-24:

lucaswerkmeister-wmde@stat1007:~$ sudo -u analytics-wmde analytics-mysql cebwiki <<< 'SELECT table_name, table_rows FROM information_schema.tables WHERE table_schema = "cebwiki" AND table_name IN ("wbc_entity_usage", "recentchanges");' | expand -t20
table_name          table_rows
recentchanges       841431
wbc_entity_usage    13725390

Change 754933 had a related patch set uploaded (by Noa wmde; author: Noa wmde):

[operations/mediawiki-config@master] Enable usage tracking for statement for cebwiki

https://gerrit.wikimedia.org/r/754933

Change 754937 had a related patch set uploaded (by Noa wmde; author: Noa wmde):

[operations/mediawiki-config@master] Introduce $wmgEntityUsageModifierLimitsStatement

https://gerrit.wikimedia.org/r/754937

Now that hywiki and warwiki are done, I think we can move onto this wiki soon. Here’s an update on the baseline:

MariaDB [cebwiki]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+---------------------+----------+------------+---------+
| asof                | total    | statements | others  |
+---------------------+----------+------------+---------+
| 2022-03-09 16:16:25 | 15229870 |          0 | 4797898 |
+---------------------+----------+------------+---------+
1 row in set (6.703 sec)

Updated baseline on the number of entity usage rows:

MariaDB [cebwiki]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+---------------------+----------+------------+---------+
| asof                | total    | statements | others  |
+---------------------+----------+------------+---------+
| 2022-08-02 10:29:47 | 15038456 |          0 | 4780944 |
+---------------------+----------+------------+---------+
1 row in set (16.876 sec)

And also on the amount of Wikidata recentchanges entries:

MariaDB [cebwiki]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(rc_type = 5 /* RC_EXTERNAL */) AS fromwikidata FROM recentchanges;
+---------------------+--------+--------------+
| asof                | total  | fromwikidata |
+---------------------+--------+--------------+
| 2022-08-02 10:34:10 | 144266 |       136686 |
+---------------------+--------+--------------+
1 row in set (0.070 sec)

Change 754937 merged by jenkins-bot:

[operations/mediawiki-config@master] Introduce $wmgEntityUsageModifierLimitsStatement

https://gerrit.wikimedia.org/r/754937

Mentioned in SAL (#wikimedia-operations) [2022-08-02T13:30:59Z] <lucaswerkmeister-wmde@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:754937|Introduce $wmgEntityUsageModifierLimitsStatement (T296384)]] (1/2) (duration: 03m 16s)

Mentioned in SAL (#wikimedia-operations) [2022-08-02T13:34:34Z] <lucaswerkmeister-wmde@deploy1002> Synchronized wmf-config/Wikibase.php: Config: [[gerrit:754937|Introduce $wmgEntityUsageModifierLimitsStatement (T296384)]] (2/2) (duration: 03m 21s)

Change 754933 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable usage tracking for statement for cebwiki

https://gerrit.wikimedia.org/r/754933

Mentioned in SAL (#wikimedia-operations) [2022-08-02T13:40:37Z] <lucaswerkmeister-wmde@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:754933|Enable usage tracking for statement for cebwiki (T296384)]] – expected to gradually increase number of wbc_entity_usage and probably recentchanges rows on cebwiki, but not too much, see task for details (duration: 03m 06s)

Deployed; we should start to see the impact over the coming days and weeks, as cebwiki pages gradually get reparsed and have their usage tracking updated.

Since we can’t expect that to happen organically for all pages, at some point we’ll want to actively purge them. We could use the script I prepared for T296383; alternatively (or in addition), we could trigger a recursive purge for {{paghimo ni bot}} via the API, which should effectively purge most pages on the wiki (it has 5½ million transclusions).

Oh, and I tested it for an individual page ('Aria, info) with a null edit:
Screenshot 2022-08-02 at 15-54-09 Information for 'Aria - Wikipedia.png (137×594 px, 10 KB)Screenshot 2022-08-02 at 15-54-18 Information for 'Aria - Wikipedia.png (236×594 px, 17 KB)

Updated numbers:

MariaDB [cebwiki]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+---------------------+----------+------------+--------+
| asof                | total    | statements | others |
+---------------------+----------+------------+--------+
| 2022-08-29 10:37:54 | 28565735 |   17755420 | 551913 |
+---------------------+----------+------------+--------+
1 row in set (9.765 sec)

MariaDB [cebwiki]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(rc_type = 5 /* RC_EXTERNAL */) AS fromwikidata FROM recentchanges;
+---------------------+--------+--------------+
| asof                | total  | fromwikidata |
+---------------------+--------+--------------+
| 2022-08-29 10:38:08 | 170182 |       166925 |
+---------------------+--------+--------------+
1 row in set (0.102 sec)

So statement usages have mostly replaced “other” usages, with a total of 28½ million rows, which is indeed “no more than 60 million, most likely much less”.

MariaDB [cebwiki]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+---------------------+----------+------------+--------+
| asof                | total    | statements | others |
+---------------------+----------+------------+--------+
| 2022-08-31 11:01:52 | 28402531 |   17755452 | 388666 |
+---------------------+----------+------------+--------+
1 row in set (9.727 sec)

If the “other” usages continue to evaporate at a comparable rate, they should be gone by the end of the week:

$ units
Currency exchange rates from FloatRates (USD base) on 2020-11-15 
3677 units, 109 prefixes, 114 nonlinear units

You have: 388666 / ((551913 - 388666) / 24.5 hours)
You want: time
        2 day + 10 hr + 19 min + 50.63505 sec

The cleanup of “other” usages slowed down a bit and now seems to have come at a halt:

MariaDB [cebwiki]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+---------------------+----------+------------+--------+
| asof                | total    | statements | others |
+---------------------+----------+------------+--------+
| 2022-09-12 09:12:07 | 28016614 |   17755028 |   4250 |
+---------------------+----------+------------+--------+
1 row in set (9.757 sec)
MariaDB [cebwiki]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+---------------------+----------+------------+--------+
| asof                | total    | statements | others |
+---------------------+----------+------------+--------+
| 2022-09-12 09:43:13 | 28016614 |   17755028 |   4250 |
+---------------------+----------+------------+--------+
1 row in set (9.852 sec)

I’ll run purge-other-usages and see if that gets rid of the rest. Should take about three hours.

So far, usages are only going down very slowly.

(venv) lucaswerkmeister-wmde@tools-sgebastion-10:~/purge-other-usages$ ./purge-other-usages ceb.wikipedia.org '' 0
Processed up to P9395, 10621606.
Processed up to Q1020685, 9869118.
Processed up to Q1020939, 55358.
Processed up to Q103825, 10942920.
Processed up to Q103958, 10942920.
Processed up to Q104071, 10942920.
Processed up to Q106790, 10942920.
Processed up to Q107454, 9869118.
Processed up to Q107559, 9869118.
Processed up to Q111372, 10942920.
Processed up to Q111677, 10942920.
Processed up to Q12692, 10942920.
Processed up to Q12853, 9869118.
Processed up to Q12893, 9869118.
^CTraceback (most recent call last):
  File "./purge-other-usages", line 79, in <module>
    time.sleep(75)
KeyboardInterrupt
MariaDB [cebwiki]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+---------------------+----------+------------+--------+
| asof                | total    | statements | others |
+---------------------+----------+------------+--------+
| 2022-09-12 10:24:58 | 28016613 |   17755028 |   4249 |
+---------------------+----------+------------+--------+
1 row in set (9.862 sec)

I also notice that wikibase-addUsagesForPage jobs are currently backlogged (Grafana):

image.png (343×1 px, 67 KB)

So I’ve stopped the script for now (KeyboardInterrupt above) – I assume the number previously wasn’t going down due to this backlog, so let’s let it recover first.

The job queue has recovered and there are still “other” usages left, so I’ll resume the script:

MariaDB [cebwiki]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+---------------------+----------+------------+--------+
| asof                | total    | statements | others |
+---------------------+----------+------------+--------+
| 2022-09-13 08:59:18 | 28016659 |   17755039 |   4272 |
+---------------------+----------+------------+--------+
1 row in set (9.662 sec)

I forgot that there can still be legitimate “other” usages (I confused it with “X”, which we really expect to not exist at all). And most of the remaining “other” usages come from two pages that really just have a lot of those usages:

MariaDB [cebwiki]> SELECT eu_page_id, COUNT(*) FROM wbc_entity_usage WHERE eu_aspect LIKE 'O%' GROUP BY eu_page_id ORDER BY COUNT(*) DESC LIMIT 3;
+------------+----------+
| eu_page_id | COUNT(*) |
+------------+----------+
|   10942920 |     1737 |
|    9869118 |     1737 |
|      46428 |       17 |

So I hard-coded the script to skip those page IDs and am letting it run for the remaining pages now. And once that’s done, I think we can call this finished, even if a few “other” usages will remain.


Edit: the SQL query is much more efficient if you make it use the (covering) eu_entity_id index:

SELECT eu_page_id, COUNT(*) FROM wbc_entity_usage USE INDEX (`eu_entity_id`) WHERE eu_aspect LIKE 'O%' GROUP BY eu_page_id ORDER BY COUNT(*) DESC LIMIT 25;

Alright, the script finished and the remaining 4256 “other” usages are probably real ones.

Also, it looks like barely any pages actually have enough statement usages to hit the customized, lower entity usage modifier limit:

MariaDB [cebwiki]> SELECT NOW() AS asof, SUM(eu_aspect LIKE 'C.%') AS modifier, SUM(eu_aspect = 'C') AS general FROM wbc_entity_usage;
+---------------------+----------+---------+
| asof                | modifier | general |
+---------------------+----------+---------+
| 2022-09-13 10:18:21 | 17754901 |     138 |
+---------------------+----------+---------+
1 row in set (9.446 sec)

So I suggest we remove that from cebwiki again (now that we know the entity usage table won’t explode from too many statement usages), and leave it at the default for all other wikis (33 instead of 10). @Ladsgroup does that sound okay?

Change 836227 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/mediawiki-config@master] Remove wmgEntityUsageModifierLimitsStatement on cebwiki

https://gerrit.wikimedia.org/r/836227

Change 836227 merged by jenkins-bot:

[operations/mediawiki-config@master] Remove wmgEntityUsageModifierLimitsStatement on cebwiki

https://gerrit.wikimedia.org/r/836227

Mentioned in SAL (#wikimedia-operations) [2022-09-29T13:15:10Z] <lucaswerkmeister-wmde@deploy1002> Started scap: Backport for [[gerrit:836227|Remove wmgEntityUsageModifierLimitsStatement on cebwiki (T296384)]]

Mentioned in SAL (#wikimedia-operations) [2022-09-29T13:15:33Z] <lucaswerkmeister-wmde@deploy1002> lucaswerkmeister-wmde and lucaswerkmeister-wmde: Backport for [[gerrit:836227|Remove wmgEntityUsageModifierLimitsStatement on cebwiki (T296384)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-09-29T13:20:33Z] <lucaswerkmeister-wmde@deploy1002> Finished scap: Backport for [[gerrit:836227|Remove wmgEntityUsageModifierLimitsStatement on cebwiki (T296384)]] (duration: 05m 23s)