Page MenuHomePhabricator

Enable statement usage tracking on warwiki
Closed, ResolvedPublic

Description

As agreed in T188730#7461581 and T188730#7525888, we want to remove the wmgWikibaseDisabledUsageAspects setting for Waray Wikipedia. This is expected to increase the number of rows in the wbc_entity_usage to no more than 40 million, most likely much less (currently 2.6 million). We should keep track of the size of the table after making the change, as well as the size of the recentchanges table and the ratio of Wikidata edits there, and check on-wiki Lua modules for obvious mistakes if it grows too large.

Related Objects

View Standalone Graph
This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here. Use View Standalone Graph to show more of the graph.
StatusSubtypeAssignedTask
OpenNone
Resolvednoarave

Event Timeline

2021-11-24:

lucaswerkmeister-wmde@stat1007:~$ sudo -u analytics-wmde analytics-mysql warwiki <<< 'SELECT table_name, table_rows FROM information_schema.tables WHERE table_schema = "warwiki" AND table_name IN ("wbc_entity_usage", "recentchanges");' | expand -t20
table_name          table_rows
recentchanges       44847
wbc_entity_usage    2558116

Change 755322 had a related patch set uploaded (by Noa wmde; author: Noa wmde):

[operations/mediawiki-config@master] Enable usage tracking for statements in Waray Wikipedia

https://gerrit.wikimedia.org/r/755322

After this is deployed we should make sure data is being tracked under "Statement Flags" in the warwiki dashboard: https://grafana.wikimedia.org/d/000000176/wikidata-entity-usage-project?orgId=1&var-project=warwiki

Change 755322 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable usage tracking for statements in Waray Wikipedia

https://gerrit.wikimedia.org/r/755322

Mentioned in SAL (#wikimedia-operations) [2022-01-20T12:31:14Z] <lucaswerkmeister-wmde@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:755322|Enable usage tracking for statements in Waray Wikipedia (T296383)]] (expecting some gradual increase of wbc_entity_usage rows on warwiki) (duration: 00m 51s)

Quick check on the recentchanges table: so far (i.e. a few days after we deployed the change), it contains some 104k rows, 97k of which come from Wikidata.

MariaDB [warwiki]> SELECT COUNT(*) AS total, SUM(rc_type = 5) FROM recentchanges;
+--------+------------------+
| total  | SUM(rc_type = 5) |
+--------+------------------+
| 103826 |            97438 |
+--------+------------------+
1 row in set (0.078 sec)

At a glance, it looks like most of these are sitelink changes. It should be interesting to see if later there are more statement changes between them. (It will probably take a while for the statement usage tracking to fully propagate, though.)

Screenshot from 2022-01-25 12-53-10.png (702×1 px, 325 KB)

Current status:

MariaDB [warwiki]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+---------------------+---------+------------+---------+
| asof                | total   | statements | others  |
+---------------------+---------+------------+---------+
| 2022-02-01 10:29:38 | 3098324 |     537285 | 1132406 |
+---------------------+---------+------------+---------+
1 row in set (1.298 sec)

Three million total entity usage rows, half a million of them are statement rows and a bit over a million are “other” ones. It’ll probably take a while longer for the “other” ones to dwindle away entirely.

One month later:

MariaDB [warwiki]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+---------------------+---------+------------+---------+
| asof                | total   | statements | others  |
+---------------------+---------+------------+---------+
| 2022-02-28 10:09:32 | 3685767 |    1137914 | 1119163 |
+---------------------+---------+------------+---------+
1 row in set (1.272 sec)

I think this calls for a bot run to purge all the pages with “other” usages, it doesn’t look like natural edits or page views will do the trick on this wiki. We’ll want to do that slowly, though – statement usages have gone up a lot more than other usages have gone down over this past month.

I’ve started putting together a script to do the purge in ~lucaswerkmeister-wmde/purge-other-usages/ on Toolforge. (It’s a Git repository, with some history too.) The script can be run on a Buster bastion using the venv in that directory; it goes through the “other” usages in entity_id, page_id order (both ASC), taking the entity and page IDs to resume from as arguments and printing the last ones after each batch (i.e. after interrupting the script, resume it with the last IDs it printed). It queries the DB in batches of 500 at a time, and (so far) exits after the first batch, so the first run purged 500 pages.

Processed up to Q10261989, 2244976.

Before:

MariaDB [warwiki]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+---------------------+---------+------------+---------+
| asof                | total   | statements | others  |
+---------------------+---------+------------+---------+
| 2022-02-28 15:26:52 | 3686466 |    1138696 | 1119080 |
+---------------------+---------+------------+---------+
1 row in set (1.242 sec)

After:

MariaDB [warwiki]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+---------------------+---------+------------+---------+
| asof                | total   | statements | others  |
+---------------------+---------+------------+---------+
| 2022-02-28 15:51:05 | 3686016 |    1138751 | 1118575 |
+---------------------+---------+------------+---------+
1 row in set (1.257 sec)

So this batch decreased the number of other usages by 505, and increased the number of statement usages by 55. It also took over twenty minutes – purging is rate limited (for good reason!), so the script only processes 30 pages per 75 seconds. (30 pages per 60 seconds is the sharp rate limit, I’m letting the script sleep for a bit longer to leave a bit of “breathing space”.) At this rate, getting rid of the remaining 1119163 other usages would take a bit over 32 days, just over a month.

I’ve started putting together a script to do the purge in ~lucaswerkmeister-wmde/purge-other-usages/ on Toolforge. (It’s a Git repository, with some history too.)

Aaand that’s useful to nobody since user home directories aren’t readable to others by default. I’ve pushed the script to https://github.com/lucaswerkmeister/purge-other-usages now.

Moving back to review for someone™ to take a look at that script; if it looks okay, I think we can start running larger batches of it as Toolforge jobs.

At least on my side, a script running for a month is totally fine, we have maintenance script running for five months.

The most usage is coming from one template https://war.wikipedia.org/wiki/Batakan:Wikidata_image

select count(*)
  from templatelinks
 where tl_namespace = 10
   and tl_title = 'Wikidata_image'
   and tl_from_namespace = 0;
+----------+
| count(*) |
+----------+
|  1140204 |
+----------+
1 row in set (9.82 sec)

action=purge with forcerecursivelinkupdate fills the jobqueue with the necessary pages and process it in the background. Lets wait until the jobqueue has handled my recursive link update (This is similiar when a user edits the template) and recheck the numbers and see if it stills needs individual purges (hopefully less than hundred pages or so).

I guess that corresponds to the spike in refreshLinks jobs at 18:36 UTC, though I don’t know how to check when the recursive link update is done. Right now we’re at 800k other usages:

MariaDB [warwiki]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+---------------------+---------+------------+--------+
| asof                | total   | statements | others |
+---------------------+---------+------------+--------+
| 2022-03-02 14:43:13 | 3374966 |    1146236 | 800042 |
+---------------------+---------+------------+--------+
1 row in set (1.463 sec)

I still welcome review of my script, since we might well need it again for cebwiki and maybe commonswiki too.

Technical done is the jobqueue when the page_links_updated timestamp is higher than the start of the recursive link update for all the pages embedding the template. Pages getting a update from another job queue (on page edit) or from a script doing purges are deduplicated by the job queue.

MariaDB [warwiki_p]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(page_links_updated < '20220301182633') AS backlog FROM page INNER JOIN templatelinks ON page_id = tl_from WHERE tl_namespace = 10 AND tl_title = 'Wikidata_image';
+---------------------+---------+---------+
| asof                | total   | backlog |
+---------------------+---------+---------+
| 2022-03-02 18:43:21 | 1140205 |  778531 |
+---------------------+---------+---------+
1 row in set (1.05 sec)

MariaDB [warwiki_p]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+---------------------+---------+------------+--------+
| asof                | total   | statements | others |
+---------------------+---------+------------+--------+
| 2022-03-02 18:43:25 | 3341964 |    1147314 | 765962 |
+---------------------+---------+------------+--------+
1 row in set (1.12 sec)

It starts 24 h back with 1118575 and is now at 765962 , 352613 within 24 h, means that this needs another two days (and a bit more) to complete.

[Edit] The difference between the "backlog of jobs 778531" and the "other usages 765962" is that all refreshes since the config change and before the recursive update are already done, the number are more equal if using the timestamp from the config change: SELECT NOW() AS asof, COUNT(*) AS total, SUM(page_links_updated <= '20220120123114') AS backlog FROM page INNER JOIN templatelinks ON page_id = tl_from WHERE tl_namespace = 10 AND tl_title = 'Wikidata_image';

It is a bit slower (only 267k in 24h, not 352k in 24h):

MariaDB [warwiki_p]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(page_links_updated < '20220301182633') AS backlog FROM page INNER JOIN templatelinks ON page_id = tl_from WHERE tl_namespace = 10 AND tl_title = 'Wikidata_image';
+---------------------+---------+---------+
| asof                | total   | backlog |
+---------------------+---------+---------+
| 2022-03-03 18:41:42 | 1140205 |  511570 |
+---------------------+---------+---------+
1 row in set (1.04 sec)

MariaDB [warwiki_p]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+---------------------+---------+------------+--------+
| asof                | total   | statements | others |
+---------------------+---------+------------+--------+
| 2022-03-03 18:41:46 | 3079103 |    1150208 | 500208 |
+---------------------+---------+------------+--------+
1 row in set (1.06 sec)
MariaDB [warwiki_p]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(page_links_updated < '20220301182633') AS backlog FROM page INNER JOIN templatelinks ON page_id = tl_from WHERE tl_namespace = 10 AND tl_title = 'Wikidata_image';
+---------------------+---------+---------+
| asof                | total   | backlog |
+---------------------+---------+---------+
| 2022-03-04 18:41:29 | 1140205 |  260970 |
+---------------------+---------+---------+
1 row in set (1.36 sec)

MariaDB [warwiki_p]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+---------------------+---------+------------+--------+
| asof                | total   | statements | others |
+---------------------+---------+------------+--------+
| 2022-03-04 18:41:31 | 2833026 |    1150142 | 254196 |
+---------------------+---------+------------+--------+
1 row in set (1.47 sec)

Its done, the remaining page ids are orphaned page ids (at least on replica) and does not exists in the page table

MariaDB [warwiki_p]> SELECT eu_page_id, eu_row_id FROM wbc_entity_usage WHERE eu_aspect LIKE 'O%' ORDER BY eu_row_id;
+------------+-----------+
| eu_page_id | eu_row_id |
+------------+-----------+
|    2889794 |   5149392 |
|    2889799 |   5149403 |
|    2889822 |   5149426 |
|    2889904 |   5149457 |
+------------+-----------+
4 rows in set (1.19 sec)

MariaDB [warwiki_p]> SELECT * FROM page WHERE page_id IN ( 2889794, 2889799, 2889822, 2889904 );
Empty set (0.00 sec)

Confirmed:

MariaDB [warwiki]> SELECT NOW() AS asof, COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+---------------------+---------+------------+--------+
| asof                | total   | statements | others |
+---------------------+---------+------------+--------+
| 2022-03-09 16:11:04 | 2578839 |    1150143 |      4 |
+---------------------+---------+------------+--------+
1 row in set (1.542 sec)

We somehow ended up with less total entity usage rows than we started with, I’m not sure why that would be the case. 🤷

We somehow ended up with less total entity usage rows than we started with, I’m not sure why that would be the case. 🤷

On the grafana dashboard it is visible that the counter for the statements starts growing right after the config change (which was at 2022-01-20)
https://grafana.wikimedia.org/d/000000176/wikidata-entity-usage-project?orgId=1&var-project=warwiki&from=1642632344937&to=1646535184632

While the Other type does not fail down in the same amout. It seems the statements are inserted by something (job? rc updates?), but only the parse of the page from edit or purge removes the Other type from the database.
The "Total Entity Usage Flags" shows that the graph is now at the same count (around 2.5 M) where it starts.

Its done, the remaining page ids are orphaned page ids (at least on replica) and does not exists in the page table

MariaDB [warwiki_p]> SELECT eu_page_id, eu_row_id FROM wbc_entity_usage WHERE eu_aspect LIKE 'O%' ORDER BY eu_row_id;
+------------+-----------+
| eu_page_id | eu_row_id |
+------------+-----------+
|    2889794 |   5149392 |
|    2889799 |   5149403 |
|    2889822 |   5149426 |
|    2889904 |   5149457 |
+------------+-----------+
4 rows in set (1.19 sec)

MariaDB [warwiki_p]> SELECT * FROM page WHERE page_id IN ( 2889794, 2889799, 2889822, 2889904 );
Empty set (0.00 sec)

orphaned page ids seems to be handled in T190460