Page MenuHomePhabricator

Usage tracking: record which statement group is used
Closed, ResolvedPublic

Description

When tracking label usage, we record which language is used, so we only purge the page when that language changes.
Similarly, we should record which statement group is used (by property ID), so we only purge the page when that statement group changes.

We will enable this on elwiki initially in order to evaluate the number of new usage tracked.

Implementation issues:

  • We will write a lot (?) more rows to wbc_entity_usage. This will be evaluated based on the data from elwiki
  • We currently pass the entire serialized Entity to Lua. We'd have to use magic accessors to track usage of individual bits of the structure.
    • We could however easily track access via a specialized Lua function for getting or formatting Statement groups.

Patch-For-Review:

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 375544 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable statement usage tracking on elwiki

https://gerrit.wikimedia.org/r/375544

Mentioned in SAL (#wikimedia-operations) [2017-09-20T12:08:38Z] <hoo@tin> Synchronized wmf-config/Wikibase-production.php: Enable statement usage tracking on elwiki (T151717) (duration: 00m 49s)

This is in effect now and the first statement usages are coming in on elwiki:

+--------+-----------+
| No     | eu_aspect |
+--------+-----------+
| 270848 | X         |
| 192094 | S         |
| 150218 | T         |
|  99255 | L.el      |
|  79981 | O         |
|   6461 | L.en      |
|    675 | C.P21     |
|    575 | C.P373    |
|    428 | C.P570    |
|    412 | C.P856    |
|    382 | C.P569    |
|    367 | C.P1362   |
|    367 | C.P245    |
|    367 | C.P428    |
|    367 | C.P1157   |
|    367 | C.P2188   |
|    367 | C.P1015   |
|    367 | C.P434    |
|    367 | C.P886    |
|    367 | C.P1222   |
…
179 rows in set (0.72 sec)

I'll keep monitoring this for the rest of the day and will write a summary.

@jcrespo If there are any problems, c333e7124d6428a46aa8e24398c21f8c20437b3e can be reverted at any time.

After a few hours (w/o any mass purges from my side), the table looks like this:

+----------+
| COUNT(*) |
+----------+
|  1577837 |
+----------+
1 row in set (0.50 sec)

+--------+-----------+
| No     | eu_aspect |
+--------+-----------+
| 270820 | X         |
| 192107 | S         |
| 150242 | T         |
|  99272 | L.el      |
|  79991 | O         |
|  23945 | C.P21     |
|  19367 | C.P373    |
|  17364 | C.P856    |
|  15823 | C.P18     |
|  14846 | C.P570    |
|  14254 | C.P569    |
|  13728 | C.P451    |
|  13717 | C.P27     |
|  13610 | C.P69     |
|  13584 | C.P22     |
|  13581 | C.P25     |
|  13559 | C.P40     |
…
221 rows in set (1.26 sec)

Note: Before the deploy, elwiki had 798858 usages only:

mysql:wikiadmin@db1038 [elwiki]> SELECT COUNT(*) AS No, eu_aspect FROM wbc_entity_usage GROUP BY eu_aspect ORDER BY No DESC;
+--------+-----------+
| No     | eu_aspect |
+--------+-----------+
| 270849 | X         |
| 192094 | S         |
| 150219 | T         |
|  99255 | L.el      |
|  79980 | O         |
|   6461 | L.en      |
+--------+-----------+
6 rows in set (4.57 sec)

Just started refreshLinks.php for all articles on elwiki (https://wikitech.wikimedia.org/w/index.php?diff=1770578&oldid=1770577), this should fully update the entity usage table.

Cool, get if you can some `SHOW TABLE STATUS like <tablename> stats, to get the "before" state in bytes (even if it had started some time ago).

(Shortly) after the refresh links got started:

mysql:wikiadmin@db1038 [elwiki]> SHOW TABLE STATUS like "wbc_entity_usage"\G
*************************** 1. row ***************************
           Name: wbc_entity_usage
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 1611080
 Avg_row_length: 71
    Data_length: 114737152
Max_data_length: 0
   Index_length: 164265984
      Data_free: 5242880
 Auto_increment: 2489116
…
1 row in set (0.00 sec)

mysql:wikiadmin@db1035 [elwiki]> SHOW TABLE STATUS like "wbc_entity_usage"\G
*************************** 1. row ***************************
           Name: wbc_entity_usage
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 1624654
 Avg_row_length: 77
    Data_length: 125124608
Max_data_length: 0
   Index_length: 175751168
      Data_free: 5242880
 Auto_increment: 2493583
…
1 row in set (0.00 sec)

wbc_entity_usage tracking for statement group - PASS
As hoo showed in the previous comments, we get C records, and the table doesn't blowup too much. (currently 2037539 records in that table).

only purge the page when that statement group changes - FAIL
Keep in mind that the original goal of this ticket is:

Similarly, we should record which statement group is used (by property ID), so we only purge the page when that statement group changes.

Here I'm going to use recentchanges table with rc_source=wb as proxy to purges.

While this works in the micro level, it doesn't yet look nice enough in the macro level.

Micro level:

  1. Selecting a page for which we track on specific claim: select * from wbc_entity_usage wbc_a where wbc_a.eu_aspect like 'C%' and not exists(select * from wbc_entity_usage wbc_b where wbc_b.eu_page_id=wbc_a.eu_page_id and wbc_b.eu_aspect='X') limit 5;
  2. Doing some change in wikidata that doesn't affect that property indeed doesn't create rc record: select * from recentchanges where rc_source ='wb' and rc_title = 'Γιάννης_Μπουτάρης' and rc_namespace=0 limit 5;/*empty results */

Macro level - there are still many records from WB

select left(rc_timestamp, 8), count(*) from recentchanges where rc_source ='wb' and rc_timestamp>20170918000000 group by  left(rc_timestamp, 8);
+-----------------------+----------+
| left(rc_timestamp, 8) | count(*) |
+-----------------------+----------+
| 20170918              |    93493 |
| 20170919              |    31978 |
| 20170920              |    39313 |
| 20170921              |     5753 |
+-----------------------+----------+

Or in hour resolution (since not much time passed since deployment):

select left(rc_timestamp, 10), count(*) from recentchanges where rc_source ='wb' and rc_timestamp>20170918000000 group by  left(rc_timestamp, 10); 
| 2017091800             |      120 |
| 2017091801             |      363 |
| 2017091802             |     2098 |
| 2017091803             |      284 |
...
2017092100             |      145
2017092101             |      189
2017092102             |     1615
2017092103             |      248

This is probably because there are still as many as X usages as there were previously. I guess the next step to get it fixed is to either go on with T172914 ? or try to deploy it on rowiki/hewiki where there is extensive use of getBestStatements(Q, P) function and less full entity usage.

I assume the latter also means we are not decreasing the number of irrelevant changes shown in the watchlist and recent changes yet?

I assume the latter also means we are not decreasing the number of irrelevant changes shown in the watchlist and recent changes yet?

Potentially yes. How much impact this actually has is hard to predict. If most pages only use a small fraction of the statements on the items, and a lot of unused statements get added or edited, then the effect is large.

We still track Other ("O") and All usages ("X") on elwiki like we did before. If we conclude the test there ok, we can stop doing this and move forward here and with T172914.

Please note that no longer tracking O or X usages along also requires T176413: Make AffectedPagesFinder take STATEMENT_USAGE into account.

I assume the latter also means we are not decreasing the number of irrelevant changes shown in the watchlist and recent changes yet?

Potentially yes. How much impact this actually has is hard to predict. If most pages only use a small fraction of the statements on the items, and a lot of unused statements get added or edited, then the effect is large.

(Although DBA will probably not like this way of estimation) Here is a very rough estimation:

  • 1827 statement changes in wikidata since 20/09 should appear in recentchanges
  • 33603 is the actual number of statement changes appear

which is 5% relevance...

select substr(rc_comment, locate('[[Property:', rc_comment)+11, locate(']]:', rc_comment)-locate('[[Property:', rc_comment)-11)=substr(wbc_c.eu_aspect,3) AS RC_EU_ATTRIBUTED, count(*) from recentchanges inner join page on rc_namespace=page_namespace and rc_title=page_title inner join wbc_entity_usage wbc_c on eu_page_id=page_id and eu_aspect like 'C%' where rc_source='wb' and rc_comment like '/* wb%claim-create:% */ [[Property:P%' and rc_timestamp>20170920000000 group by substr(rc_comment, locate('[[Property:', rc_comment)+11, locate(']]:', rc_comment)-locate('[[Property:', rc_comment)-11)=substr(wbc_c.eu_aspect,3);
+------------------+----------+
| RC_EU_ATTRIBUTED | count(*) |
+------------------+----------+
|                0 |  1227231 |
|                1 |     1827 |
+------------------+----------+

select count(*) from recentchanges where rc_source='wb' and rc_comment like '/* wb%claim-create:% */ [[Property:P%' and rc_timestamp>20170920000000;
+----------+
| count(*) |
+----------+
|    33603 |
+----------+

Populating elwiki will be done in the next few hours, I'll then post updated stats.

In order to keep moving fast, I suggest targeting another mid-sized wiki… any suggestions? kowiki and trwiki we're suggested earlier and IMO are good next targets. What do you think @jcrespo?

Mentioned in SAL (#wikimedia-operations) [2017-09-26T14:26:30Z] <hoo> mwscript refreshLinks.php --wiki elwiki --namespace 0 on terbium has finished (T151717)

Now that all articles have been refreshed (see T151717#3621993/T151717#3621975 for a comparison):

+----------+
| COUNT(*) |
+----------+
|  2646007 |
+----------+
1 row in set (16.76 sec)

+--------+-----------+
| No     | eu_aspect |
+--------+-----------+
| 269763 | X         |
| 192387 | S         |
| 150830 | T         |
|  99712 | L.el      |
|  80258 | O         |
|  57459 | C.P373    |
|  49197 | C.P21     |
|  36119 | C.P856    |
|  31948 | C.P570    |
|  31616 | C.P18     |
|  30382 | C.P569    |
|  27893 | C.P27     |
|  27515 | C.P69     |
|  27495 | C.P40     |
|  27407 | C.P26     |
|  27379 | C.P451    |
|  27161 | C.P22     |
|  27073 | C.P25     |
|  26649 | C.P106    |
|  25711 | C.P509    |
+--------+-----------+
20 rows in set (27.83 sec)

Table stats (see T151717#3622100 for comparison):

mysql:wikiadmin@db1038 [elwiki]> SHOW TABLE STATUS like "wbc_entity_usage"\G
*************************** 1. row ***************************
           Name: wbc_entity_usage
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 1950418
 Avg_row_length: 74
    Data_length: 145162240
Max_data_length: 0
   Index_length: 218972160
      Data_free: 6291456
 Auto_increment: 3553049
…
1 row in set (0.00 sec)

mysql:wikiadmin@db1035 [elwiki]> SHOW TABLE STATUS like "wbc_entity_usage"\G
*************************** 1. row ***************************
           Name: wbc_entity_usage
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 2670782
 Avg_row_length: 73
    Data_length: 195395584
Max_data_length: 0
   Index_length: 287178752
      Data_free: 4194304
 Auto_increment: 3553050
…
1 row in set (0.00 sec)

On elwiki we saw the number of statement usage to be about 5.3 times the number of all + other usages ((1.867e6)/(272.7e3+80.554e3) = 5.28, data taken from Grafana today).

For enwiki that would mean about 25 million statement usages (compared to a total of 14.8 million usages now). For ruwiki that would mean about 97 million statement usages (compared to a total of 33.9 million usages now).

For kowiki that would mean about 1.4 million statement usages (compared to a total of 959.4 thousand usages now). For trwiki that would mean about 760 thousand statement usages (compared to a total of 656 thousand usages now).

I am a bit lost with the estimation- is that realistic, is the number of usages more or less right with your estimation, or is it going to be a quadratic expression because we do not know accurately the number of statement usages per statement? If the estimations are right, multiplying table size by 3 is not a big deal- 97 million rows/20GB is "ok". Also, some rows/indexes are going away at some point (or may be confusing tables)? I do not see a reason not to keep deploying except if they are too dynamic (writes increase too much), in which case we should then check the idea of separating those.

Offtopic: wb_terms, on the wikidata server, has become a 700GB beast- that worries me more for maintenance reasons.

I am a bit lost with the estimation- is that realistic, is the number of usages more or less right with your estimation, or is it going to be a quadratic expression because we do not know accurately the number of statement usages per statement?

I guess we can't know for sure, but going for more wikis and then re-running the numbers should get us a better idea.

Also, some rows/indexes are going away at some point (or may be confusing tables)?

eu_touched is supposed to be dropped (T144010), but that's it.

I do not see a reason not to keep deploying except if they are too dynamic (writes increase too much), in which case we should then check the idea of separating those.

This should not be the case (the initial deployment might hit hard, though). On big template/ module changes the table might see quite some changes, but I don't think this is much out of the usual.

Offtopic: wb_terms, on the wikidata server, has become a 700GB beast- that worries me more for maintenance reasons.

After moving the search to elastic, we don't need some of the fields and indexes there anymore (probably term_search_key and term_weight and the associated indexes). Also two fields (term_entity_id, term_entity_type) will be dropped and we had a problem with duplicate rows.

Change 382414 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/mediawiki-config@master] Enable Statement usage tracking on kowiki and trwiki

https://gerrit.wikimedia.org/r/382414

Change 382414 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable Statement usage tracking on kowiki and trwiki

https://gerrit.wikimedia.org/r/382414

Table stats on trwiki pre-deploy:

No      eu_aspect
24      L.en
460     L.tr
130725  O
508625  S
3829    T
12367   X
mysql:wikiadmin@db1074 [trwiki]> SHOW TABLE STATUS like "wbc_entity_usage"\G
*************************** 1. row ***************************
           Name: wbc_entity_usage
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 642959
 Avg_row_length: 73
    Data_length: 47300608
Max_data_length: 0
   Index_length: 63291392
      Data_free: 7340032
 Auto_increment: 1345002
…

mysql:wikiadmin@db1090 [trwiki]> SHOW TABLE STATUS like "wbc_entity_usage"\G
*************************** 1. row ***************************
           Name: wbc_entity_usage
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 661714
 Avg_row_length: 73
    Data_length: 48349184
Max_data_length: 0
   Index_length: 64339968
      Data_free: 6291456
 Auto_increment: 1345002
…

Table stats on kowiki pre-deploy:

No      eu_aspect
9       L.en
65      L.ko
178798  O
637352  S
62877   T
80297   X
mysql:wikiadmin@db1094 [kowiki]> SHOW TABLE STATUS like "wbc_entity_usage"\G
*************************** 1. row ***************************
           Name: wbc_entity_usage
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 956740
 Avg_row_length: 77
    Data_length: 74563584
Max_data_length: 0
   Index_length: 86687744
      Data_free: 5242880
 Auto_increment: 1728033
…

mysql:wikiadmin@db1079 [kowiki]> SHOW TABLE STATUS like "wbc_entity_usage"\G
*************************** 1. row ***************************
           Name: wbc_entity_usage
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 893685
 Avg_row_length: 77
    Data_length: 69320704
Max_data_length: 0
   Index_length: 80396288
      Data_free: 5242880
 Auto_increment: 1728033
…

Mentioned in SAL (#wikimedia-operations) [2017-10-05T13:57:24Z] <ladsgroup@tin> Synchronized wmf-config/Wikibase-production.php: Enable Statement usage tracking on kowiki and trwiki (T151717) (duration: 00m 50s)

After about 4 days we have:

mysql:wikiadmin@db1076 [trwiki]> SELECT COUNT(*) FROM wbc_entity_usage WHERE eu_aspect LIKE "C.%";
+----------+
| COUNT(*) |
+----------+
|   230846 |
+----------+

mysql:wikiadmin@db1076 [trwiki]> SELECT COUNT(*) FROM wbc_entity_usage WHERE eu_aspect IN ("X", "O");
+----------+
| COUNT(*) |
+----------+
|   143210 |
+----------+
1 row in set (0.23 sec)
mysql:wikiadmin@db1079 [kowiki]> SELECT COUNT(*) FROM wbc_entity_usage WHERE eu_aspect LIKE "C.%";
+----------+
| COUNT(*) |
+----------+
|  1180263 |
+----------+
1 row in set (0.53 sec)

mysql:wikiadmin@db1079 [kowiki]> SELECT COUNT(*) FROM wbc_entity_usage WHERE eu_aspect IN ("X", "O");
+----------+
| COUNT(*) |
+----------+
|   259588 |
+----------+
1 row in set (0.50 sec)

So we currently have 4.55 times the O+X usages on kowiki and 1.6 times the O+X usages on trwiki. If our estimation from elwiki holds, we should only see a slight increase from here on kowiki… but I highly doubt this after just 4 days with no mass purging (at least from our side).

On kowiki currently about 21.7% of all pages having O or X usages have seen a page link update since the deploy.

mysql:wikiadmin@db1086 [kowiki]> SELECT COUNT(*) FROM (SELECT COUNT(*) FROM page INNER JOIN wbc_entity_usage ON eu_page_id = page_id WHERE eu_aspect IN ('X', 'O') AND page_namespace = 0 AND page_links_updated < '2017100514' GROUP BY page_id) AS sub;
+----------+
| COUNT(*) |
+----------+
|   105681 |
+----------+
1 row in set (6.85 sec)

mysql:wikiadmin@db1086 [kowiki]> SELECT COUNT(*) FROM (SELECT COUNT(*) FROM page INNER JOIN wbc_entity_usage ON eu_page_id = page_id WHERE eu_aspect IN ('X', 'O') AND page_namespace = 0 AND page_links_updated > '2017100514' GROUP BY page_id) AS sub;
+----------+
| COUNT(*) |
+----------+
|    22885 |
+----------+
1 row in set (3.36 sec)

Due to this I expect (assuming all pages use Wikidata equally… less often updated pages use Wikidata less often though, I guess) kowiki to have up to 5.5M statement usages. That would mean over 21 times the number of O or X usages!

What you write is ok, but IF you want our opinion, can you translate that into increase of row storage and inserts/other write activity compared to the full table size/previous state?

What you write is ok, but IF you want our opinion, can you translate that into increase of row storage and inserts/other write activity compared to the full table size/previous state?

mysql:wikiadmin@db1094 [kowiki]> SHOW TABLE STATUS like "wbc_entity_usage"\G
*************************** 1. row ***************************
           Name: wbc_entity_usage
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 2138399
 Avg_row_length: 65
    Data_length: 140689408
Max_data_length: 0
   Index_length: 214941696
      Data_free: 4194304
 Auto_increment: 2928907

mysql:wikiadmin@db1079 [kowiki]> SHOW TABLE STATUS like "wbc_entity_usage"\G
*************************** 1. row ***************************
           Name: wbc_entity_usage
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 2227585
 Avg_row_length: 68
    Data_length: 153288704
Max_data_length: 0
   Index_length: 232833024
      Data_free: 6291456
 Auto_increment: 2928976

While the number of usage rows grew by about a factor 2.3, the data length grew by a factor of 1.9 on db1094 and by 2.2 on db1079. The index length grew by factor of about 2.5 on db1094 and by 2.9 on db1079.

I have the feeling that these numbers could be meaningless on such small wikis, given the issues on recentchanges with only some large wikis such as commons and ruwiki. Could the same issue happen to those a x10 growth. I know the case is different, pages * edits, vs deduplicating items * statements on current state, but could you evaluate those high-wikidata bandwith wikis to see if similar issues could happen?

What I mean is that the number are ok to proceeed (not a big deal), but still worried for the large wikis. I know you do not have all the answers, I just was talking aloud.

I have the feeling that these numbers could be meaningless on such small wikis, given the issues on recentchanges with only some large wikis such as commons and ruwiki. Could the same issue happen to those a x10 growth. I know the case is different, pages * edits, vs deduplicating items * statements on current state, but could you evaluate those high-wikidata bandwith wikis to see if similar issues could happen?

This is not trivial to estimate, as you can guess. I have two ways in mind for doing this, for once we could just shoot for it and enable it on such a large Wikipedia and hope for the best, otherwise revert the config (which we can do at anytime). This could also be done for maybe an hour only to get an idea.

Also we could look at going for cawiki which is known for the intensive Wikidata use (but still has way fewer Wikidata all usages per page than ruwiki). Another very well suited wiki would be cewiki, which has even more Wikidata all usages per page than ruwiki and it seems to broadly use ruwiki's Wikidata module.

@jcrespo Would they seem doable for you right now? The worst case estimation would make for up to 32M new usages on cawiki (3.6 times the current usages) and 75M on cewiki (13.5 times the current usages)… but I expect these numbers to be a gross overestimation (based on the worst case estimate for kowiki), it probably wont be nearly as bad… especially as it seems that elwiki uses many all accesses to just get labels (so these wont turn into any new usages).

I'll enable the tracking on the two wikis (cawiki, cewiki) tomorrow then.

Btw, the estimate for kowiki changed down from 5.5M statement usages to 5.15M already (that's just a increase by a factor of 19.9)… I guess this will go down further as more pages see updates.

51.5M you meant, maybe?

No for kowiki we're indeed talking about "just" a few (~5) million statement usages… after all it has only about 1.5% the number of all+other usages ruwiki has (about 0.5% if we only include all usages).

Change 383439 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/mediawiki-config@master] Move WB client "disabledUsageAspects" setting into $wmgWikibaseDisabledUsageAspects

https://gerrit.wikimedia.org/r/383439

Change 383440 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/mediawiki-config@master] Enable Statement usage tracking on cawiki and cewiki

https://gerrit.wikimedia.org/r/383440

Change 383439 merged by jenkins-bot:
[operations/mediawiki-config@master] Move WB client "disabledUsageAspects" setting into $wmgWikibaseDisabledUsageAspects

https://gerrit.wikimedia.org/r/383439

Change 383440 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable Statement usage tracking on cawiki and cewiki

https://gerrit.wikimedia.org/r/383440

Mentioned in SAL (#wikimedia-operations) [2017-10-11T12:38:06Z] <hoo@tin> Synchronized wmf-config/InitialiseSettings.php: Enable Statement usage tracking on cawiki and cewiki (T151717) (duration: 00m 47s)

Mentioned in SAL (#wikimedia-operations) [2017-10-11T12:44:53Z] <hoo@tin> Synchronized wmf-config/InitialiseSettings.php: (temp) Disable Statement usage tracking on cawiki (T151717) (duration: 00m 48s)

Change 383556 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/mediawiki-config@master] Disable statement usage tracking on cawiki

https://gerrit.wikimedia.org/r/383556

Change 383556 merged by jenkins-bot:
[operations/mediawiki-config@master] Disable statement usage tracking on cawiki

https://gerrit.wikimedia.org/r/383556

Mentioned in SAL (#wikimedia-operations) [2017-10-11T12:54:16Z] <hoo@tin> Synchronized wmf-config/InitialiseSettings.php: Disable Statement usage tracking on cawiki (T151717) (duration: 00m 47s)

Table sizes pre-deploy to cawiki and cewiki:

mysql:wikiadmin@db1079 [cawiki]> SHOW TABLE STATUS like "wbc_entity_usage"\G
*************************** 1. row ***************************
           Name: wbc_entity_usage
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 11676926
 Avg_row_length: 124
    Data_length: 1450704896
Max_data_length: 0
   Index_length: 1864957952
      Data_free: 258998272
 Auto_increment: 27091066
…

mysql:wikiadmin@db1094 [cawiki]> SHOW TABLE STATUS like "wbc_entity_usage"\G
*************************** 1. row ***************************
           Name: wbc_entity_usage
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 11275586
 Avg_row_length: 129
    Data_length: 1454882816
Max_data_length: 0
   Index_length: 1839742976
      Data_free: 279969792
 Auto_increment: 27091066
…
mysql:wikiadmin@db1078 [cewiki]> SHOW TABLE STATUS like "wbc_entity_usage"\G
*************************** 1. row ***************************
           Name: wbc_entity_usage
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 5958796
 Avg_row_length: 78
    Data_length: 465485824
Max_data_length: 0
   Index_length: 564248576
      Data_free: 9437184
 Auto_increment: 10184030
…

mysql:wikiadmin@db1077 [cewiki]> SHOW TABLE STATUS like "wbc_entity_usage"\G
*************************** 1. row ***************************
           Name: wbc_entity_usage
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 5825230
 Avg_row_length: 81
    Data_length: 474906624
Max_data_length: 0
   Index_length: 568295424
      Data_free: 11534336
 Auto_increment: 10184030
…

Usages pre-deploy:
cawiki:

No      eu_aspect
5767425 L.ca
412499  L.en
442257  L.oc
585871  O
649594  S
3046987 T
1174675 X

cewiki:

No      eu_aspect
130585  L.ce
3905    L.en
126406  L.ru
160246  O
172761  S
1798197 T
3645127 X

cawiki had to be reverted after a few minutes as the number of usages added skyrocketed, I saw over 22,000 usages being added in a minute. (Usages: 23832; SLEEP(60); Usages: 46490). A total of 90430 statement usages were added in just 6 minutes on cawiki (~15k per minute or 250/s).

cawiki had (at the time statement usage were disable again) 90,430 statement usages from just 1,118 pages. This means each of these pages had an average of about 81 statement usages (median: 90) and one page had 196 even.

Top 10 pages:

mysql:wikiadmin@db1094 [cawiki]> SELECT COUNT(*) AS No, eu_page_id FROM wbc_entity_usage WHERE eu_aspect LIKE "C.%" GROUP BY eu_page_id ORDER BY No DESC LIMIT 10;
+-----+------------+
| No  | eu_page_id |
+-----+------------+
| 196 |      90242 |
| 192 |      53024 |
| 179 |      68079 |
| 177 |     236868 |
| 166 |     153616 |
| 165 |      40888 |
| 164 |     881994 |
| 162 |     577523 |
| 158 |    1113825 |
| 157 |     114421 |
+-----+------------+

On elwiki there are only 1,853,264 statement usages on 88,106 pages, thus each of these pages has an average of only 21 statement usages (median: 14).

Top 10 pages:

mysql:wikiadmin@db1072 [elwiki]> SELECT COUNT(*) AS No, eu_page_id FROM wbc_entity_usage WHERE eu_aspect LIKE "C.%" GROUP BY eu_page_id ORDER BY No DESC LIMIT 10;
+-----+------------+
| No  | eu_page_id |
+-----+------------+
| 210 |     519024 |
| 100 |     196573 |
|  96 |     189235 |
|  96 |      13235 |
|  96 |     233523 |
|  93 |     171936 |
|  93 |      16531 |
|  92 |     186490 |
|  92 |     435266 |
|  90 |     189233 |
+-----+------------+

On cewiki there are up to now only 99,814 statement usages on 5,865 pages, thus each of these pages has an average of only 17 statement usages (median: 14).

Top 10 pages:

mysql:wikiadmin@db1078 [cewiki]> SELECT COUNT(*) AS No, eu_page_id FROM wbc_entity_usage WHERE eu_aspect LIKE "C.%" GROUP BY eu_page_id ORDER BY No DESC LIMIT 10;
+-----+------------+
| No  | eu_page_id |
+-----+------------+
| 187 |       2237 |
| 154 |       2443 |
| 125 |       9584 |
| 125 |       9516 |
| 122 |       2680 |
| 121 |      62650 |
| 121 |       2278 |
| 119 |       2284 |
| 118 |       2179 |
| 116 |       9609 |
+-----+------------+

1, @hoo so how do we move forward with deployment to larger wikis/wikis with more wikidata usage such as cawiki? Do we have an issue with DB size? (e.g requires more thought on wbc_entity_usage design) or is it disabled in cawiki because it updated too fast? (so we just need to make the buildup slower?) I think most of the usages tracking are going to be static once it finish the first build-up?

  1. Can we also count the recentchanges rows from wikidata on day granularity before/after deployments? I guess whatever it cost increase the wbc_entity_usage in cawiki, it is probably acceptable if recentchanges is getting smaller.

1, @hoo so how do we move forward with deployment to larger wikis/wikis with more wikidata usage such as cawiki? Do we have an issue with DB size? (e.g requires more thought on wbc_entity_usage design) or is it disabled in cawiki because it updated too fast? (so we just need to make the buildup slower?) I think most of the usages tracking are going to be static once it finish the first build-up?

It's going to be stable most of the time, but then (on big template/ module changes) it can very very quickly change a lot.
For cawiki: I've contacted them to fix the problem that caused this spike (https://ca.wikipedia.org/w/index.php?title=M%C3%B2dul_Discussi%C3%B3:Wikidata&oldid=18938979#Critical_performance_improvement), once that is fixed, we should be good to go ahead here.
In general, we will probably need to revisit T176273: Move the wbc_entity_usage table onto a dedicated DB shard here, especially for bigger wikis.

  1. Can we also count the recentchanges rows from wikidata on day granularity before/after deployments? I guess whatever it cost increase the wbc_entity_usage in cawiki, it is probably acceptable if recentchanges is getting smaller.

We can only weigh these up to a certain amount, as a spike on one table does not usually mean an equal reduction of activity on the other. As we can't yet stop recording the X/O usages, we have no idea what the actual effect will be, yet.

The cawiki module has been fixed now, but many other wikis still have the problematic code T178114: Many Wikipedia's Wikidata module iterate over all entity claims if a Statement is searched for by property label.

I suggest to re-enable the usage tracking on cawiki next week with close monitoring and see how it works out.

Change 384592 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/mediawiki-config@master] Re-enable Statement usage tracking on cawiki

https://gerrit.wikimedia.org/r/384592

I've scheduled the deployment for Wednesday (October 18) 18:00–19:00 UTC. I'll closely monitor the situation after, but I don't expect any more havoc.

Moved the deploy to 8 UTC tomorrow (October 18), per @Marostegui. Will be deployed together with T177155: Find test wiki(s) for new description usage and enable there.

thiemowmde triaged this task as Medium priority.
thiemowmde updated the task description. (Show Details)

Change 384592 merged by jenkins-bot:
[operations/mediawiki-config@master] Re-enable Statement usage tracking on cawiki

https://gerrit.wikimedia.org/r/384592

Mentioned in SAL (#wikimedia-operations) [2017-10-18T08:41:48Z] <hoo@tin> Synchronized wmf-config/InitialiseSettings.php: Re-enable Statement usage tracking on cawiki (T151717) (duration: 00m 50s)

We're already at 6,213,948 statement usages on cawiki (this is a lot).

The number of usages is still growing strong there, but it seems to be within sane bounds.

Is there anything left to do here or can it be closed?

Well, enabling it on all wikis is one thing that we need to do and with T185693: Implement a (more liberal) usage aspect deduplicater (days: 3) I think it's fine to move forward in everywhere. Give me a detailed enabling plan and consider it done.

Change 414654 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/mediawiki-config@master] Enable statement usage tracking in several wikis

https://gerrit.wikimedia.org/r/414654

Change 414654 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable statement usage tracking in several wikis

https://gerrit.wikimedia.org/r/414654

Mentioned in SAL (#wikimedia-operations) [2018-02-26T14:48:01Z] <ladsgroup@tin> Synchronized wmf-config/InitialiseSettings.php: [[gerrit:V|Enable statement usage tracking in several wikis (T151717)]] (duration: 00m 57s)

Change 415264 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/mediawiki-config@master] Reduce the batch size of statment usage tracking to 33

https://gerrit.wikimedia.org/r/415264

Change 415264 merged by jenkins-bot:
[operations/mediawiki-config@master] Reduce the batch size of statment usage tracking to 33

https://gerrit.wikimedia.org/r/415264

Mentioned in SAL (#wikimedia-operations) [2018-02-28T18:42:42Z] <niharika29@tin> Synchronized wmf-config/Wikibase.php: Reduce the batch size of statment usage tracking to 33 T151717 (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2018-02-28T18:44:05Z] <niharika29@tin> Synchronized wmf-config/InitialiseSettings.php: Reduce the batch size of statment usage tracking to 33 T151717 (duration: 00m 57s)

Just noting that this is enabled eveyrwhere except commons, hywiki, cebwiki, and warwiki