
Enable statement usage tracking on Commons and Co
Open, Medium, Public

Description

On Commons (and some other wikis), we're still tracking Statements as part of the "Other" usage aspect. We may want to consider changing that.

wmf-config/InitialiseSettings.php (around line 22343):

'wmgWikibaseDisabledUsageAspects' => [
	'default' => [],
	// disabled statement usage tracking wikis
	'commonswiki' => [ 'C' => 'O' ],
	'cebwiki' => [ 'C' => 'O' ],
	'warwiki' => [ 'C' => 'O' ],
	'hywiki' => [ 'C' => 'O' ],
],
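
For context, the effect of this mapping is directly visible in each client wiki’s wbc_entity_usage table; a quick illustrative query (not part of the original report) to see which usage aspects a wiki records:

SELECT eu_aspect, COUNT(*) AS cnt
FROM wbc_entity_usage
GROUP BY eu_aspect
ORDER BY cnt DESC;
-- on the wikis listed above, statement usages show up as 'O' rows
-- instead of 'C' / 'C.P123'-style rows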

Acceptance criteria:

  • figure out why this wasn't enabled on Commons (and those other wikis) when it was first introduced in T151717, db07845
  • get an idea about what impact this would have in terms of database usage
  • get the ok from the DBAs
  • enable Statement group usage tracking on Commons (and those other wikis)

Notes:


Event Timeline

Ladsgroup created this task.

According to @Lucas_Werkmeister_WMDE it seems all but Commons (and testcommons) are done. Should we close this ticket then and keep the subticket for Commons only?

I think so, unless there’s a difference between “fine-grained usage tracking” and “statement usage tracking”? In T186716 I only looked for config changes related to fine-grained usage tracking. (But looking through Wikibase.git I haven’t found a separate config setting for statement usage tracking so far, so I suspect it’s the same thing.)

(@Michael figured out what the difference between “fine-grained usage tracking” and “statement usage tracking” is, and now this task has an actually useful description.)

Quick comparison of entity usage rows between some other large / Wikidata-using wikis and those four:

lucaswerkmeister-wmde@stat1007:~$ for wiki in {en,sv,de,fr,eu,ca,commons,ceb,war,hy}wiki; do printf '%s\n' "$wiki"; sudo -u analytics-wmde analytics-mysql -N "$wiki" <<< "SELECT table_rows FROM information_schema.tables WHERE table_schema = '$wiki' AND table_name = 'wbc_entity_usage';"; done
enwiki
34642428
svwiki
26239019
dewiki
16629396
frwiki
53829737
euwiki
24686505
cawiki
35653134
commonswiki
730604655
cebwiki
13661965
warwiki
2558111
hywiki
4962673

Overall, for large wikis the order of magnitude is tens of millions of rows. Cebuano is in that ballpark, Commons one order of magnitude above, Waray and Armenian one below.

As an absolute worst case, the number of rows would be multiplied by 34. In production, we set:

$wgWBClientSettings['entityUsageModifierLimits'] = [ 'D' => 10, 'L' => 10, 'C' => 33 ];

So each page is allowed to track at most 33 distinct “C” usages (C.P31, C.P279 etc.) before they all get merged into a general “C”. The worst case is that every row is an “O” row (other entity usage), and when we stop mapping statement (“C”) to other usage, each page will have exactly 33 “C” usages (just enough not to be merged), plus an “O” usage, giving us 34× as many rows as before. Even in that absolute worst case, cebwiki, warwiki and hywiki would still have fewer rows in the table than Commons has now.

I’ll post a follow-up with some more detailed analysis later.

Sharper worst-case estimate for each wiki, based on the number of “O” rows they actually have (but still assuming each “O” row would expand to 34 rows):

lucaswerkmeister-wmde@stat1007:~$ for wiki in {war,hy,ceb,commons}wiki; do printf '%s\n' "$wiki"; sudo -u analytics-wmde analytics-mysql -N "$wiki" <<< "SELECT SUM(IF(eu_aspect = 'O', 34, 1)) FROM wbc_entity_usage;"; done
warwiki
40212655
hywiki
64513019
cebwiki
165007102
commonswiki
4782597269

Waray actually has a worst-case factor of 16 (rather than 34), Armenian of 13, and Cebuano of 12. For Commons the worst-case factor is only 6½ (i.e. there are already many non-“O” rows), but that’s still enough to push it into “billions of rows” territory. The eu_row_id is a BIGINT, so we wouldn’t immediately run into T63111-style overflow problems, but that’s still concerning.

(FYI if anyone wants to run similar queries: the Commons one took 6¼ minutes on stat1007.)

One thing we can do to limit the increase in the number of rows is to set $wgWBClientSettings['entityUsageModifierLimits']['C'] to a lower value for some wikis. If we go all the way down and set it to 1 (i.e. a page is only allowed to have a single “C” usage), then the theoretical worst case is still twice the number of rows (one “C”, one “O”), but in practice “O” usages are fairly uncommon:

lucaswerkmeister-wmde@stat1007:~$ for wiki in {en,sv,de,fr,eu,ca}wiki; do printf '%s\n' "$wiki"; sudo -u analytics-wmde analytics-mysql -N "$wiki" <<< "SELECT SUM(eu_aspect = 'O'), COUNT(*), SUM(eu_aspect = 'O') / COUNT(*) FROM wbc_entity_usage;"; done
enwiki
3496573	35321668	0.0990
svwiki
9668	26492181	0.0004
dewiki
272287	18520655	0.0147
frwiki
1096	55630719	0.0000
euwiki
174484	24230941	0.0072
cawiki
16369	37859576	0.0004
MariaDB [dewiki]> SELECT SUM(eu_aspect = 'O') AS other, SUM(eu_aspect LIKE 'C%') AS statement, COUNT(*) AS total FROM wbc_entity_usage;
+--------+-----------+----------+
| other  | statement | total    |
+--------+-----------+----------+
| 272286 |  11521557 | 18520654 |
+--------+-----------+----------+
1 row in set (8.682 sec)

So in reality, the increase in the number of rows would likely be smaller, maybe somewhere around 1.02× using the dewiki numbers, or 1.1× using the enwiki numbers.

(Minor correction: the entity usage modifier limits control how many e.g. “C” usages a page is allowed to have per entity. But the mapping to “O” usage would also happen per entity and page – a page can have “O” usages for several entities – so I think the factors for the row counts are still correct.)

A closer look at the statement usages on Swedish Wikipedia (probably the closest approximation of Cebuano Wikipedia, with both having many Lsjbot-generated articles?):

MariaDB [svwiki]> SELECT COUNT(*), AVG(count) FROM (SELECT eu_entity_id, eu_page_id, COUNT(*) AS count FROM wbc_entity_usage WHERE eu_aspect LIKE 'C.%' GROUP BY eu_entity_id, eu_page_id) AS counts;
+----------+------------+
| COUNT(*) | AVG(count) |
+----------+------------+
|  4162100 |     3.7754 |
+----------+------------+
1 row in set (4 min 37.571 sec)

MariaDB [svwiki]> SELECT COUNT(*) FROM wbc_entity_usage WHERE eu_aspect = 'C';
+----------+
| COUNT(*) |
+----------+
|   238389 |
+----------+
1 row in set (6.527 sec)

There are only some 240k “merged” statement usages, compared to 4.2M page-entity pairs with fewer than 33 statement usages (i.e. they didn’t get their “C” usages merged); among those, the average is just under 4 statement usages per page-entity pair. If we assume that each “O” row gets turned into four new rows (three point something “C” rows plus a few remaining actual “O” rows), we arrive at a new estimate of 28181823 cebwiki rows, just a 2× increase.

MariaDB [cebwiki]> SELECT SUM(IF(eu_aspect = 'O', 4, 1)) AS estimate, COUNT(*) AS current, SUM(IF(eu_aspect = 'O', 4, 1)) / COUNT(*) AS factor FROM wbc_entity_usage;
+----------+----------+--------+
| estimate | current  | factor |
+----------+----------+--------+
| 28181823 | 14499273 | 1.9437 |
+----------+----------+--------+
1 row in set (6.906 sec)

For Wikimedia Commons, I suspect the average number of statement usages per page-entity pair would be rather higher, but I’m not sure if we can get a decent estimate for it.


One idea I’ve had is that many statement usages probably come from {{Wikidata Infobox}}, which shows statement values next to the label of the property. So maybe the number of property label usages can give us some idea of statement usage.

MariaDB [commonswiki]> SELECT COUNT(*), AVG(count) FROM (SELECT eu_page_id, COUNT(*) AS count FROM wbc_entity_usage WHERE eu_entity_id LIKE 'P%' AND eu_aspect IN ('L', 'L.en') GROUP BY eu_page_id) AS eu;
+----------+------------+
| COUNT(*) | AVG(count) |
+----------+------------+
|  7899871 |     3.5575 |
+----------+------------+
1 row in set (1 min 59.222 sec)

MariaDB [commonswiki]> SELECT COUNT(*) FROM wbc_entity_usage WHERE eu_entity_id LIKE 'P%' AND eu_aspect IN ('L', 'L.en');
+----------+
| COUNT(*) |
+----------+
| 28104685 |
+----------+
1 row in set (13.841 sec)

The first query suggests that pages which use property labels at all use, on average, the labels of some three and a half properties (less than I would’ve expected). The second one gives a lower bound of 28 million statement usages, based on the existing 28 million property label usages.

This doesn’t factor in other sources of statement usage, such as the 6½ million transclusions of {{Institution}}.
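
(If needed, that transclusion count could be cross-checked against the templatelinks table; a sketch, assuming the old-style schema with tl_namespace/tl_title columns:)

SELECT COUNT(*)
FROM templatelinks
WHERE tl_namespace = 10 -- Template namespace
  AND tl_title = 'Institution';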

I think for Commons we don’t want to risk just enabling statement usage tracking; we probably want to start out with a fairly low entity usage modifier limit, say three (based on that 3.6 average above), and then slowly ramp that up and see how the number of rows develops, and how many rows with merged “C” usages (rather than e.g. C.P31) remain after each increase. Assuming a 1% ratio of real “O” usages (i.e. ones that won’t go away once we stop turning “C” into “O”), a limit of 3 would only increase the row count by a factor of 1.3 if I’m not mistaken:

MariaDB [commonswiki]> SELECT SUM(IF(eu_aspect = 'O', 3.01, 1)) AS estimate, COUNT(*) AS current, SUM(IF(eu_aspect = 'O', 3.01, 1)) / COUNT(*) AS factor  FROM wbc_entity_usage;
+--------------+-----------+----------+
| estimate     | current   | factor   |
+--------------+-----------+----------+
| 989746880.21 | 743719019 | 1.330808 |
+--------------+-----------+----------+
1 row in set (8 min 53.370 sec)

To summarize, I suggest:

  • For warwiki and hywiki (both s3), directly remove wmgWikibaseDisabledUsageAspects and treat them like all other wikis. Even in the worst possible case, they’re only expected to reach 40 and 65 million rows in wbc_entity_usage, respectively – which is in the vicinity of frwiki, and well below commonswiki – and most likely it’ll be well below that.
  • For cebwiki (s5), remove wmgWikibaseDisabledUsageAspects, but set $wgWBClientSettings['entityUsageModifierLimits']['C'] to 10 (instead of the default 33). This caps the absolute worst case for cebwiki at 60 million rows (rather than 165 million).
  • For commonswiki (s4), remove wmgWikibaseDisabledUsageAspects, but set $wgWBClientSettings['entityUsageModifierLimits']['C'] to 3 (instead of the default 33 or cebwiki’s 10). This would grow the table to ca. 990 million rows (currently 731 million). Note that this one isn’t an unrealistic worst case: I’d expect the table to actually grow to that size, or something close to it (but not much more).

And then later:

  • Check how many cebwiki rows have eu_aspect = 'C' (rather than C.P31 etc.); a query sketch for this check follows after this list. If it’s not very many, then we can estimate how many additional rows the table might get when we remove the custom $wgWBClientSettings['entityUsageModifierLimits']['C'] and use the default (33) for cebwiki, and hopefully decide to make that change.
  • Gradually increase the $wgWBClientSettings['entityUsageModifierLimits']['C'] for commonswiki, checking how many eu_aspect = 'C' rows remain each time. At some point we may decide to leave the commonswiki limit at a lower value than 33 indefinitely.
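
A query sketch for those checks, following the pattern of the svwiki queries above (illustrative; not run as part of this analysis):

SELECT SUM(eu_aspect = 'C') AS merged,
       SUM(eu_aspect LIKE 'C.%') AS granular,
       COUNT(*) AS total
FROM wbc_entity_usage;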

Note that the table on each wiki would grow fairly gradually: entity usages would only be updated (i.e. “O” rows replaced with one or more “C” rows) as each page gets re-parsed. If we wanted to see the effect of our changes within a shorter timeframe, we might have to actively purge pages.

DBA and SRE people: does this sound acceptable to you?

I will talk to @Ladsgroup about this, as I am missing a lot of the context here and am unsure what implications this could have.

First of all, sorry it took me so long to comment. Vacation, onboarding, etc.

I was involved in the work of collapsing aspects and the related changes. IIRC, the reason we kept hywiki and warwiki disabled was that they were using an inefficient Wikidata-accessing module which made them practically subscribe to every property of each item they used, and I couldn't fix it because the module was protected. That was a long time ago and I think the access pattern has been fixed (and if not, I can change protected modules now). We should probably quickly check whether it's fixed for these three (hywiki, cebwiki, warwiki), and we should also check other client wikis to see whether their access patterns have changed, and possibly change the aspect-collapsing values too (or fix their modules).

Regarding hywiki and warwiki, I think it's fine to go ahead with the change, but keep track of the size of the rc (recentchanges) table on these wikis; if it's growing too fast (and the ratio of rc entries from Wikidata is high), check their Lua modules for obvious mistakes.
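
(A possible way to watch that ratio; a sketch, assuming Wikibase-injected recentchanges rows carry rc_source = 'wb':)

SELECT SUM(rc_source = 'wb') AS from_wikidata,
       COUNT(*) AS total,
       SUM(rc_source = 'wb') / COUNT(*) AS ratio
FROM recentchanges;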

The plan for cebwiki sounds good to me too. Again, keep track of the rc table (and the ratio of rc entries from Wikidata) and make sure it doesn't explode.

But for Commons, I suggest avoiding the change for now (and especially avoid enabling the injection of rc entries). The Commons database is heavily under pressure: there is space on the disk with the new spec, but the InnoDB buffer pool is getting less efficient (to a dangerous degree). Finishing the actor migration will help a bit, but the actual issue is something that will take a while to fix (T222224).

FWIW, these are the sizes of wbc_entity_usage on these wikis:

  • hywiki: 1.8 GB
  • warwiki: 325 MB
  • cebwiki: 1.4 GB
  • commonswiki: 62 GB

HTH, let me know if I can help with anything.

Would it be better on Commons if we set $wgWBClientSettings['entityUsageModifierLimits']['C'] to 1 instead of 3? Then the number of rows shouldn’t increase very much (only for the pages, presumably rare, that would have both a “C” usage and also a legitimate, non-mapped “O” usage).
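
(The analogue of the earlier limit-3 query for a limit of 1, assuming each “O” row becomes one “C” row plus the ~1% genuine “O” remainder; a sketch, not yet run:)

SELECT SUM(IF(eu_aspect = 'O', 1.01, 1)) AS estimate,
       COUNT(*) AS current,
       SUM(IF(eu_aspect = 'O', 1.01, 1)) / COUNT(*) AS factor
FROM wbc_entity_usage;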

Note that, due to T294451, the status quo doesn’t just mean that the usage tracking is less precise than it could be: relevant statement changes actually aren’t being dispatched at all.

Would it be better on Commons if we set $wgWBClientSettings['entityUsageModifierLimits']['C'] to 1 instead of 3?

What you are suggesting has two impacts:

  • It changes the row values from O to C. That's good and safe.
  • It enables dispatching. The RC part is disabled, so that's safe, but the parser cache work might put pressure on the system. We recently removed a lot of useless parsing, so I think we are generally safe there, but it's something to keep an eye on.

Two ideas for improving the current design:

  • Normalize the table based on eu_aspect.
    • This is something I would have done if I were designing the table, but on its own it's clearly too much effort for too little gain; if we merge it with the other idea, it might be easier to justify.
  • Split the table into wbc_property_usage and wbc_item_usage and use numeric ids there (a rough sketch follows below).
    • I don't know what the status of federation is, or whether it affects the feasibility of this idea (e.g. having property usages from two different repos and how to track that).
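
A rough sketch of what such a split might look like (all table and column names here are illustrative, loosely modeled on the existing wbc_entity_usage columns):

-- hypothetical; wbc_property_usage would mirror this with a numeric P-id column
CREATE TABLE wbc_item_usage (
  iu_row_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  iu_item_id BIGINT UNSIGNED NOT NULL, -- numeric part of the Q-id
  iu_aspect VARBINARY(37) NOT NULL,    -- e.g. 'S', 'L.en', or 'C.31' with a numeric property part
  iu_page_id INT UNSIGNED NOT NULL
);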

We can also start looking at moving these tables to x1, but that's not something I can green-light on my own: the DBAs need to look at the feasibility, growth, capacity etc. before saying it's doable. You would lose the ability to join, and I think this just moves the problem from one place to another.

  • Split the table into wbc_property_usage and wbc_item_usage and use numeric ids there.
    • I don't know what the status of federation is, or whether it affects the feasibility of this idea (e.g. having property usages from two different repos and how to track that).

T235901: Implement Lua access to Lexemes, Senses and Forms adds usage for lexeme entities – it’s currently not fine-grained at all, but it works as far as I can tell, and we want to start deploying that to production fairly soon. Commons also has entity usage rows for MediaInfo entities (though those are entirely useless at the moment, see T293354).

Yeah, then possibly split that column into two: a numeric id and a numeric identifier of the entity type (item=0, property=2, etc.). But I don't know how we can make that handle subentities (maybe just give up on granularity for those?)

That sounds like a very cumbersome hack to me, and I also think it’s too early to know whether we’ll want granular usage tracking for subentities or not (i.e. whether Wiktionary pages are likely to only use data from a subset of a lexeme’s senses or forms).

Is there anything short of a table redesign that we can do to allow statement usage tracking to become functional on Commons? Some ideas:

  • lower $wgWBClientSettings['entityUsageModifierLimits']['C'] all the way down to 1, to keep the number of additional rows in the table low, though (as you pointed out) it still introduces additional parser load
  • add tracking for when the entity usage modifier limits are hit, tracking the original number of usages (probably as a statsd “timing”, so we get percentiles etc.), so that we can estimate what a better long-term $wgWBClientSettings['entityUsageModifierLimits']['C'] limit would be
  • randomly react to only a subset of dispatched statement changes, controlled by a new setting (defaults to 1/all but starts at 0/none for Commons), so that we can gradually ramp up the amount of dispatching that happens, separate from the slow migration of entity usage rows from “O” to “C”, to mitigate the additional parser load
    • we could also do something similar for the injectRecentChanges setting, which is currently a boolean (that would be T179010 and could happen separately afterwards)

The problem is that, with the long-term vision in mind, we need to move towards granular tracking so that we can give a better experience to patrollers and users (and avoid Commons dying under the load of rc entries).

…and this has to block a move from incorrect tracking to correct tracking? I’m not sure what you mean.

According to T188730#7461581 the $wgWBClientSettings['entityUsageModifierLimits']['C'] should be set to 10 for cebwiki to control the size of the whole table on that wiki.

Just to note that there is also a wiki which is bigger, so that extra work may not be needed:

[arzwiki_p]> SELECT COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+-----------+------------+--------+
| total     | statements | others |
+-----------+------------+--------+
| 187974968 |   71519292 |  12575 |
+-----------+------------+--------+
1 row in set (1 min 0.69 sec)

[cebwiki_p]> SELECT COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+----------+------------+---------+
| total    | statements | others  |
+----------+------------+---------+
| 15235070 |          0 | 4798965 |
+----------+------------+---------+
1 row in set (13.53 sec)

Also big:

[ukwiki_p]> SELECT COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+-----------+------------+--------+
| total     | statements | others |
+-----------+------------+--------+
| 125081999 |   22233576 |  76657 |
+-----------+------------+--------+
1 row in set (1 min 50.90 sec)

Bad ratio for others/statements:

[enwiki_p]> SELECT COUNT(*) AS total, SUM(eu_aspect LIKE 'C%') AS statements, SUM(eu_aspect LIKE 'O%') AS others FROM wbc_entity_usage;
+----------+------------+---------+
| total    | statements | others  |
+----------+------------+---------+
| 38122593 |   11668302 | 4764302 |
+----------+------------+---------+
1 row in set (39.40 sec)

Hi @ItamarWMDE, this seems to be on the tech board already, right?

Yes, but if we want it to be picked up next sprint, it should go to the tech backlog column on the Lexidata board.