[ENG] Create soft and hard entity usages to keep tracking and balance recent change updates
Open, Needs Triage, Public

Description

Problem:
Entity Usages are meaningful for two purposes:
1- They enable change propagation, so pages stay up to date when the data in Wikidata changes.
2- They notify people of these updates through the Recent Change table, which increases transparency and clarity.

Currently, many modules use Wikidata data to maintain some organization around the article (the most common example is categorization). This means not all Entity Usages are meant to be displayed directly on the page. Sometimes they are only used to check whether the Wikidata item has an image, with no intention of using that image.

However, these usages are not irrelevant or unnecessary either. It still makes sense to propagate changes to these articles to maintain categorization (or other side functions). On the other hand, not all updates to these entities are relevant for the Recent Change table, as they have no direct effect on the page (and we are trying to reduce the load on the RC table).

Example 1:
Template: https://fr.wikipedia.org/wiki/Mod%C3%A8le:Infobox_Biographie2
Page: https://fr.wikipedia.org/wiki/Ry%C5%8Dichir%C5%8D_Yoshida
Entity Usage:

    Japon
        Déclaration : P571
    créateur ou créatrice
        Déclaration : P279
...
    artiste
        Déclaration : P279
...
    musicien ou musicienne
        Déclaration : P279
...

There is no reason to create a Recent Change entry for every musician article when the item "créateur ou créatrice" is updated, but any change to that item should still be propagated.

Example 2:
Template: https://en.wikipedia.org/wiki/Template:Wikidata image
Page: https://en.wikipedia.org/w/index.php?title=Benson_Jack_Anthony&action=info
P18 is accessed only to add a category such as No_local_image_but_image_on_Wikidata; the image itself is not displayed on the page.

Solution:
To keep the current propagation behavior without creating noise in the RC table, we introduce two types of Entity Usage: soft and hard.

Soft | Changes propagated, page will be reparsed | No Recent Change log created | For more indirect usages, i.e. existence checks in modules | A new type of behavior
Hard | Changes propagated, page will be reparsed | Recent Change log created | For usages directly affecting the page | The current behavior
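
The soft/hard split can be read as a small decision rule. The following is a minimal Python sketch for illustration only: `UsageKind`, `ChangeActions` and `actions_for` are hypothetical names and not part of Wikibase.

```python
from dataclasses import dataclass
from enum import Enum


class UsageKind(Enum):
    SOFT = "soft"  # indirect usage, e.g. an existence check in a module
    HARD = "hard"  # usage that directly affects the rendered page


@dataclass(frozen=True)
class ChangeActions:
    reparse_page: bool
    create_rc_entry: bool


def actions_for(kind: UsageKind) -> ChangeActions:
    """Return what should happen on the client when a used entity changes."""
    # Both kinds propagate the change; only hard usages add a Recent Change entry.
    return ChangeActions(
        reparse_page=True,
        create_rc_entry=(kind is UsageKind.HARD),
    )
```

Under this reading, a soft usage still keeps categorization up to date, because the page is reparsed in both cases.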

Challenges:
It is not always easy to distinguish soft from hard usages, so this work has to go hand in hand with observing modules. From our previous investigations (T410630 and T403008), we already know of two very common patterns that cause indirect Entity Usages: existence checks in if blocks and transitive calls. These will be treated as soft once this ticket is implemented.

Acceptance criteria:

  • New functions for soft usages should be created and documented for users.
  • Wikipedia article categorization based on Wikidata data should be kept. The current behavior on the client side should remain the same.
  • Entity Usage table size should remain the same.
    • If there are both soft and hard EUs of the same statement, the hard one should override the soft one, so there will still be a single entry in the EU table.
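
The override rule in the last criterion could look roughly like the sketch below. It is an assumption-laden illustration: `merge_usages` is a hypothetical helper, and usages are modeled as plain `(entity_id, aspect, kind)` tuples rather than the real Wikibase usage objects.

```python
def merge_usages(usages):
    """Collapse usages of the same statement into one entry, hard winning over soft.

    usages: iterable of (entity_id, aspect, kind) tuples, kind being "soft" or "hard".
    Returns a dict mapping (entity_id, aspect) to the surviving kind.
    """
    merged = {}
    for entity_id, aspect, kind in usages:
        key = (entity_id, aspect)
        # Once a hard usage is recorded for this key, nothing can downgrade it.
        if merged.get(key) != "hard":
            merged[key] = kind
    return merged
```

Because the result is keyed by entity and aspect, the number of entries is the same as if the soft/hard distinction did not exist.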

Next step:
Use new soft and hard entity usage implementation in modules

Event Timeline

Neslihan_Turan_WMDE renamed this task from Create soft and hard entity usages to keep tracking and balance recent change updates to [ENG] Create soft and hard entity usages to keep tracking and balance recent change updates. Mon, Feb 9, 8:31 AM

Will there be parser function access as well (whether new parser functions or new parameters to the existing ones)? c:Template:Wikidata Infobox currently uses {{#property:P373}} to populate Category:Uses of Wikidata Infobox missing Commons Category statement (and AFAIK doesn’t use P373 of the page’s own item for anything else) – a prime example for what shouldn’t populate recent changes. Given how resource-savvy this template already is, routing this check through Lua just to make the usage “soft” would probably be a no-go.

  • Entity Usage table size should remain the same.

By “same”, you mean the same number of bytes or the same number of rows? If the former, then adding a new column doesn’t fulfill the AC, so the only way to determine the “softness” of the entity usage is re-parsing the page and checking the parse result. Which may be okay, since the page needs to be reparsed anyway for both kinds of usages, but this would cause a slight delay in the insertion of the RC entries, since that would have to wait for the parse result.

Even if you mean the latter, collapsing could cause unexpected increases: if a page uses 32 properties “hard” and 2 properties “soft”, then currently the table records a single row with aspect C, but with hard and soft usages counted separately, the 32 “hard” properties would mean 32 rows with aspects (say) C.P31, C.P569 etc., plus the two “soft” usages C.P18 and C.P373. (It’s quite an edge case, but if a template happens to be written in a way that triggers this case, it’s likely to trigger it on a large number of pages.)
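
The arithmetic of this edge case can be sketched in Python. Everything here is an assumption made for illustration: `usage_rows` is a hypothetical function, the collapse limit of 33 is chosen so that the numbers match the example, and the rule "collapse when the count exceeds the limit" is a simplification.

```python
def usage_rows(hard_props, soft_props, limit=33, count_separately=False):
    """Return the (kind, aspect) rows stored for one entity on one page."""

    def collapse(kind, props):
        # More distinct properties than the limit: store one general "C" row.
        if len(props) > limit:
            return [(kind, "C")]
        return [(kind, f"C.{prop}") for prop in sorted(props)]

    if count_separately:
        return collapse("hard", set(hard_props)) + collapse("soft", set(soft_props))
    # Current behavior: no soft/hard distinction, everything is counted together.
    return collapse("any", set(hard_props) | set(soft_props))


hard = {f"P{n}" for n in range(100, 132)}  # 32 placeholder property IDs
soft = {"P18", "P373"}

rows_today = usage_rows(hard, soft)                            # 34 > 33, collapses
rows_separate = usage_rows(hard, soft, count_separately=True)  # neither group collapses
```

With these inputs the combined count collapses to a single row, while separate counting keeps all 34 rows.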

…considering the latter, maybe the best solution is indeed determining whether to insert a RC row based on a re-parse and not adding a new column: that would mean that if usage is collapsed, and a statement is changed that isn’t actually used, the phantom RC entry could be avoided – bringing an improvement even before on-wiki templates start using “soft” usage.

Hey @Tacsipacsi, thanks a lot for the input. I will try to answer some of the points you mentioned. Let's keep challenging this suggestion so that we implement the best one in the end.

Will there be parser function access as well

Yes, there will be both new Lua functions and new parser functions for soft entity usage. Soft usages occur through both templates and modules, and we aim to fix both without introducing an extra Lua requirement.

By “same”, you mean the same number of bytes or the same number of rows?

It is the same number of rows, sorry for the confusion. Not even an extra column, but probably just an extra letter in the aspect, e.g. C.s.

Even if you mean the latter, collapsing could cause unexpected increases: if a page uses 32 properties “hard” and 2 properties “soft”, then currently the table records a single row with aspect C

For the collapse case, we can keep the current behavior. Yes, that wouldn't fix anything for the RC table when usages are collapsed, but it would maintain the current number of rows in the EU table. Alternatively, we can count hard and soft usages separately and come up with new limits for each. This would increase the size of the EU table but reduce the size of the RC table; we can decide based on the DB constraints.

maybe the best solution is indeed determining whether to insert a RC row based on a re-parse

I think this is the most intuitive approach to this problem, but the reparse solution also has its own limitations.

  • Not all pages have a cache, so this fixes nothing for those pages: they will still be noisy in the RC table.
  • We would need to move all client-side RC logic to happen after the parse. This requires a big refactor on the client side.
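
To make the first limitation concrete, here is a minimal Python sketch of a reparse-based check. It is only an illustration under simplifying assumptions: `should_create_rc_entry` is a hypothetical function, and parser output is modeled as a plain string, which is much simpler than the real thing.

```python
from typing import Optional


def should_create_rc_entry(cached_output: Optional[str], reparsed_output: str) -> bool:
    """Decide whether an entity change deserves a Recent Change entry for a page."""
    if cached_output is None:
        # Nothing to compare against, so fall back to the current behavior
        # and create the entry. These pages stay noisy in the RC table.
        return True
    # Create an entry only if the change actually altered the rendered page.
    return cached_output != reparsed_output
```

The fallback branch is where pages without a cache end up, and the comparison can only run once the reparse has finished, which is the ordering change the second point refers to.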

Looking forward to hearing your further opinions on this.