Page MenuHomePhabricator

[Task] Store the badges of sitelinks in a queryable format
Closed, ResolvedPublic

Description

The badges of sitelinks should be stored in another table to make them queryable. This table should contain a badge id and the sitelinks site and wiki. Alternativly, we can add the badges to the wb_items_per_site table but this might become tricky because there is a primary key on the siteid/page columns so we cannot create a new row for each badge but have to put them all into one value which then makes querying them harder and not really efficient.

Details

Reference
bz70229

Event Timeline

bzimport raised the priority of this task from to Normal.Nov 22 2014, 3:40 AM
bzimport set Reference to bz70229.
bzimport added a subscriber: Unknown Object (MLST).
Bene created this task.Aug 31 2014, 9:47 AM
Bene added a comment.Sep 27 2014, 2:52 PM

I think this should get higher priority because it blocks removal of the badge templates from some Wikipedias. I want some discussion in the development team first how the database table should look like before we proceed with implementing this part of the badges.

I'm not sure I agree that we need such a table (but I can be convinced).

In any case, the basic structure should be:

( badge, siteId, page )

with the primary key covering all three fields, in this order. This allows us to efficiently find all pages that have a given badge, on all sites or a given site.

If we also want to be able to find all the badges associated with a given page (on a given site), we'd need another index covering ( siteId, page ). This would be useful if we wanted to show badges in listings of pages (e.g. add icons for special pages in the Special:Allpages listing.

For convenience, we may want to have a column for the associated item id. This would remove the need to join against wb_item_per_site in cases where we need to know the item ID. However, we generally don't care about the associated item in this context, do we?

aude added a comment.Sep 29 2014, 12:53 PM

@daniel if we don't need such table, then how do you suggest we make badges / badge usage queryable?

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).
Lydia_Pintscher removed a subscriber: Unknown Object (MLST).
Stryn added a subscriber: Stryn.Dec 18 2014, 10:02 PM
Superyetkin raised the priority of this task from Normal to High.Dec 19 2014, 4:56 PM
Superyetkin added a subscriber: Superyetkin.

I agree that badges should be stored in another table. By the way, I was told that the current way of accessing the badges is through JSON blob but I am not sure how to do that. It would be a good idea to make these items live on their own, considering the fact that tool developers are mostly using direct SQL queries to obtain what they need.

Lydia_Pintscher lowered the priority of this task from High to Normal.Dec 27 2014, 11:49 AM
Lydia_Pintscher set Security to None.
aude added a comment.Mar 11 2015, 3:22 PM

I don't think there is good solution to this other than have the badges in a database table.

We have two options:

  1. schema change on the wb_items_per_site table to add a badge column (null by default).
  2. add a new table that can be joined to the ips_row_id column of wb_items_per_site, containing ips_row_id + badge id

with option 1, in most cases, the value would be null and maybe we want to go with option 2 to keep wb_items_per_site smaller, or maybe the impact isn't too excessive and option 1 would be ok.

Bene added a comment.Mar 23 2015, 9:56 PM

I don't think there is good solution to this other than have the badges in a database table.
We have two options:

  1. schema change on the wb_items_per_site table to add a badge column (null by default).
  2. add a new table that can be joined to the ips_row_id column of wb_items_per_site, containing ips_row_id + badge id

with option 1, in most cases, the value would be null and maybe we want to go with option 2 to keep wb_items_per_site smaller, or maybe the impact isn't too excessive and option 1 would be ok.

Option 1 will be a bit tricky because there may be several badges for one sitelink and thus one column could contain several badges (?) This obviously cannot work so perhaps we need option two.

I'm not sure I agree that we need such a table (but I can be convinced).
In any case, the basic structure should be:
( badge, siteId, page )
with the primary key covering all three fields, in this order. This allows us to efficiently find all pages that have a given badge, on all sites or a given site.
If we also want to be able to find all the badges associated with a given page (on a given site), we'd need another index covering ( siteId, page ). This would be useful if we wanted to show badges in listings of pages (e.g. add icons for special pages in the Special:Allpages listing.
For convenience, we may want to have a column for the associated item id. This would remove the need to join against wb_item_per_site in cases where we need to know the item ID. However, we generally don't care about the associated item in this context, do we?

I'm not sure if it is better to include ips_row_id or siteId, page pairs. I rather tend to Daniel's suggestion because we will query for badges on a specific site which requires a join otherwise. Including the item id may also be helpful at some points.

Bene added a comment.Apr 8 2015, 3:33 PM

I don't think there is good solution to this other than have the badges in a database table.
We have two options:

  1. schema change on the wb_items_per_site table to add a badge column (null by default).
  2. add a new table that can be joined to the ips_row_id column of wb_items_per_site, containing ips_row_id + badge id

with option 1, in most cases, the value would be null and maybe we want to go with option 2 to keep wb_items_per_site smaller, or maybe the impact isn't too excessive and option 1 would be ok.

Option 1 will be a bit tricky because there may be several badges for one sitelink and thus one column could contain several badges (?) This obviously cannot work so perhaps we need option two.

The wording in this statement is not correct. Of course option 1 can work but we will face some anomalies (I only found the German Wikipedia article on that topic) because we'd have to extend the unique key of site_id and page to site_id, page and badge_id. This has to be handled carefully for insert and delete operations. However, this solution is way more efficient than option 2 because it avoids joins between the two tables. Furthermore, as we only have a very small percentage of good or featured articles the redundancy caused by this solution is negligible.

To sum up, both options are arguable. Option 1 is more efficient but sensitive in terms of insert/delete operations. Option 2 requires joins but is free of anomalies.

What else needs to be done to have this feature implemented?

daniel added a comment.Apr 9 2015, 3:40 PM

I'll try to summarize:

  • adding a badge_id column to wb_items_per_site won't work because there may be more than one badge per page.
  • referencing ips_row_id from another table would work, but would be brittle, since row ids are not stable, and we have no triggers or foreign key constraints to keep this in sync.
  • an extra table with ( badge, siteId, page ), and a unique key covering these three fields, as well as an index on ( siteId, page ) for update, seems the best option.

The SiteLinkTable class should be extended to maintain the new table. Care should be taken to avoid orphan entries in the badge table, which do not have a corresponding entry in the pagelinks (items_per_site) table.

He7d3r added a subscriber: He7d3r.May 5 2015, 6:13 PM
Bene renamed this task from Store the badges of sitelinks in a queryable format to [Task]Store the badges of sitelinks in a queryable format.Aug 30 2015, 12:45 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 30 2015, 12:45 PM
Bene renamed this task from [Task]Store the badges of sitelinks in a queryable format to [Task] Store the badges of sitelinks in a queryable format.Aug 30 2015, 12:45 PM
Bene claimed this task.
Bene added a subscriber: Ladsgroup.

Change 234958 had a related patch set uploaded (by Bene):
Add badges_per_sitelink table

https://gerrit.wikimedia.org/r/234958

Change 234959 had a related patch set uploaded (by Bene):
Add BadgeLookup, BadgeStore and BadgeTable

https://gerrit.wikimedia.org/r/234959

Change 234960 had a related patch set uploaded (by Bene):
Use a BadgeStore in SiteLinkTable to update badges

https://gerrit.wikimedia.org/r/234960

Bene moved this task from Doing to Review on the Wikidata-Sprint-2015-08-18 board.Aug 31 2015, 9:09 AM

Change 234968 had a related patch set uploaded (by Bene):
Use a BadgeStore in SiteLinkTable to update badges

https://gerrit.wikimedia.org/r/234968

Change 234960 abandoned by Bene:
Use a BadgeStore in SiteLinkTable to update badges

Reason:
Stfu git :S

https://gerrit.wikimedia.org/r/234960

daniel added a comment.EditedAug 31 2015, 10:16 AM

Thinking about it, we could use page_props on the client wikis to store badges (one key per badge, the value would be irrelevant). That would not need any new table, or any schema change. But it would not allow cross-wiki queries. Looking at the proposed implementation though, efficient queries across clients would not be supported anyway.

Bene added a comment.Aug 31 2015, 10:38 AM

Is it possible to change the page props when the site doesn't get updated? How would you synchronize the page props with the actual badges on Wikidata. Also, is it possible to store multiple values in the page props? There can be pages with several badges.

We sync the page_props with the badges on wikidata the same way we sync sitelinks: we update the parser cache on change notifications. The relevant code is already in SidebarHookHandlers. It even puts badges into page_props, but currently all badges into one property, instead of using separate property keys for the different badge ids.

Change 235043 had a related patch set uploaded (by Bene):
Add badges to page props

https://gerrit.wikimedia.org/r/235043

Change 235043 merged by jenkins-bot:
Add badges to page props

https://gerrit.wikimedia.org/r/235043

Bene closed this task as Resolved.Aug 31 2015, 6:23 PM
Bene moved this task from Review to Done on the Wikidata-Sprint-2015-08-18 board.

Change 234968 abandoned by Bene:
Use a BadgeStore in SiteLinkTable to update badges

https://gerrit.wikimedia.org/r/234968

Change 234959 abandoned by Bene:
Add BadgeLookup, BadgeStore and BadgeTable

https://gerrit.wikimedia.org/r/234959

Change 234958 abandoned by Bene:
Add badges_per_sitelink table

Reason:
Per Daniel

https://gerrit.wikimedia.org/r/234958

Bene added a comment.Sep 1 2015, 9:57 AM

@daniel do the page props need to be populated in some way? Is there a maintenance script to run?

Katie says we probably need a script.

It should probably do a null edit on all pages on the client that have a badge starting from https://www.wikidata.org/wiki/Special:WhatLinksHere/Q17437796