Factor EntityPerPage::getEntitiesWithoutTerm out into its own service
Closed, Resolved · Public

Description

EntityPerPage::getEntitiesWithoutTerm should move into a separate EntityWithoutTermFinder, and should be re-implemented based on the page table instead of entity_per_page.

NOTE: This probably requires the wb_terms table to be changed to use prefixed entity IDs first, so that such IDs can be joined against the title text in the page table.
NOTE: This function is only needed for Special:EntitiesWithoutLabel and Special:EntitiesWithoutDescription. We should evaluate how much need there is for this functionality before blocking on it or investing a lot of time.
daniel created this task. Jul 20 2016, 12:40 PM
Restricted Application added a subscriber: Zppix. Jul 20 2016, 12:40 PM
daniel updated the task description.
daniel raised the priority of this task from High to Needs Triage.
daniel updated the task description. Jul 20 2016, 12:57 PM
daniel updated the task description. Jul 20 2016, 1:00 PM

I just discussed with Daniel whether we actually need Special:EntitiesWithoutDescription and Special:EntitiesWithoutLabel. They have mostly lost their usefulness as Wikidata grew. If you want to find items without a label in a given language, the result set is in most cases too large to be useful. A tool like https://tools.wmflabs.org/wikidata-terminator/ is needed to help find the items where a label or description is actually important to have.
The one remaining use case for the special pages is doing this for properties. Here the number is manageable and the result is useful. So we can optimize for this case and remove the item case.

hoo added a comment. Jul 20 2016, 1:10 PM

True. I would suggest restricting them by configuration rather than hard-coding the supported entity types. They might still be useful for third parties with only a few items, and it seems we get support for that (almost) for free.

@hoo What drives the restriction is our desire to drop the wb_entity_per_page table. Does terminator need that table? Can it maintain its own copy?

The idea behind restricting to Properties is: Without wb_entity_per_page, joining against the wb_terms table is inefficient. It would still work for Properties though, because there aren't that many of them.

Izno added a subscriber: Izno. Jul 20 2016, 1:21 PM

Generally agreed with the rationale here, but we'll get questions from newbies "where did this page go"/"there's one for properties but not items, why?".

Maybe instead the special page can be extended to enable a more powerful query? You identify that there exist tools that can do this, and there is obviously a use case for identifying these pages, and I think there's enough reason to have support for an out-of-the-box solution for e.g. 3rd parties per hoo.

Change 305496 had a related patch set uploaded (by Hoo man):
Split the EntityPerPage interface

https://gerrit.wikimedia.org/r/305496

Change 305496 merged by jenkins-bot:
Split the EntityPerPage interface

https://gerrit.wikimedia.org/r/305496

thiemowmde triaged this task as Normal priority. Sep 1 2016, 3:02 PM
thiemowmde removed a project: Patch-For-Review.
thiemowmde moved this task from Proposed to Review on the Wikidata-Sprint-2016-07-19 board.

Change 314182 had a related patch set uploaded (by Hoo man):
Move EntitiesWithoutTermFinder::getEntitiesWithoutTerm

https://gerrit.wikimedia.org/r/314182

thiemowmde assigned this task to hoo. Oct 5 2016, 7:08 AM

Change 314182 merged by jenkins-bot:
Move EntitiesWithoutTermFinder::getEntitiesWithoutTerm

https://gerrit.wikimedia.org/r/314182

hoo moved this task from Review to Doing on the Wikidata-Sprint-2016-09-21 board. Oct 5 2016, 2:12 PM

Change 314554 had a related patch set uploaded (by Hoo man):
New SqlEntitiesWithoutTermFinder implementation

https://gerrit.wikimedia.org/r/314554

Change 314555 had a related patch set uploaded (by Hoo man):
Introduce the "entitiesWithoutTermEntityTypes" setting

https://gerrit.wikimedia.org/r/314555

hoo moved this task from Doing to Review on the Wikidata-Sprint-2016-09-21 board. Oct 6 2016, 1:54 PM

Change 315694 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
Remove hard-coded supportedEntityTypesForEntitiesWithoutTermListings default

https://gerrit.wikimedia.org/r/315694

hoo added a comment. Oct 18 2016, 8:54 AM

Just to re-explain this, because I think it might not have been clear enough before:

Currently we join the wb_entity_per_page table against the wb_terms table (term_entity_id = epp_entity_id AND term_entity_type = epp_entity_type).
After this change, we will join the page table against the wb_terms table. Given the wb_terms table only has the entity type and the numeric entity id, we will need to use REPLACE() in SQL for this join (term_entity_type = "known-entity-type" AND term_entity_id = REPLACE(page_title, 'known-entity-prefix', '')). Due to this we can only provide this functionality for entity types where we know how to programmatically construct the entity id serialization from the entity-type and the numeric entity id.
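The join described above can be tried out in isolation. The following is only a minimal sketch, not the SqlEntitiesWithoutTermFinder implementation: it uses Python's sqlite3 module with heavily simplified stand-ins for the page and wb_terms tables, and the function name, the demo rows, and the namespace number 120 for properties are assumptions made for illustration.

```python
import sqlite3


def find_entities_without_term(conn, entity_type, prefix, namespace, language, term_type):
    """Return page titles of entities of one type that lack the given term.

    The numeric ID is recovered from page_title by stripping the known
    prefix with REPLACE(), as outlined in the comment above.
    """
    query = """
        SELECT page_title
        FROM page
        LEFT JOIN wb_terms
          ON term_entity_type = :entity_type
         AND term_entity_id = CAST(REPLACE(page_title, :prefix, '') AS INTEGER)
         AND term_language = :language
         AND term_type = :term_type
        WHERE page_namespace = :namespace
          AND term_entity_id IS NULL   -- no matching term row was found
        ORDER BY page_id
    """
    params = {
        "entity_type": entity_type,
        "prefix": prefix,
        "namespace": namespace,
        "language": language,
        "term_type": term_type,
    }
    return [row[0] for row in conn.execute(query, params)]


def build_demo_db():
    """Create an in-memory database with simplified, made-up demo data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY,
            page_namespace INTEGER,
            page_title TEXT
        );
        CREATE TABLE wb_terms (
            term_entity_id INTEGER,
            term_entity_type TEXT,
            term_language TEXT,
            term_type TEXT,
            term_text TEXT
        );
    """)
    # Assumption: properties live in namespace 120, items in namespace 0.
    conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
        (1, 120, "P1"), (2, 120, "P2"), (3, 120, "P3"), (4, 0, "Q2"),
    ])
    conn.executemany("INSERT INTO wb_terms VALUES (?, ?, ?, ?, ?)", [
        (1, "property", "en", "label", "first property"),
        (2, "property", "de", "label", "zweite Eigenschaft"),
        (3, "property", "en", "description", "third property, described only"),
        (2, "item", "en", "label", "an item sharing numeric ID 2"),
    ])
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    # P1 has an English label; P2 and P3 do not.
    print(find_entities_without_term(demo, "property", "P", 120, "en", "label"))
```

The item row with numeric ID 2 is there to show why the entity type has to be part of the join condition: without it, the label of item 2 would hide the missing label of property P2.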

we can only provide this functionality for entity types where we know how to programmatically construct the entity id serialization from the entity-type and the numeric entity id.

We do know this for all entity types that provide an entity-id-composer, see https://phabricator.wikimedia.org/diffusion/EWBI/browse/master/WikibaseMediaInfo.entitytypes.php;a9a53ac3ffb8d1470cec5813ed641cca76484eac$101. You can get this via WikibaseRepo::getDefaultInstance()->getEntityIdComposer(). For legacy reasons the EntityIdComposer class always supports Items and Properties, even if they do not have an entity-id-composer configured.
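To make the idea of an entity-id-composer concrete, here is a rough sketch in Python. It is not the Wikibase EntityIdComposer API: the registry, the function names, and the error handling are hypothetical, and only the "Q" and "P" prefixes for Items and Properties are taken from how those IDs are written.

```python
from typing import Callable, Dict, List

# Hypothetical registry: entity type -> function that builds the serialized
# entity ID from the numeric part. In this sketch Items and Properties are
# always present, mirroring the "always supported" behaviour described above.
ID_COMPOSERS: Dict[str, Callable[[int], str]] = {
    "item": lambda numeric_id: f"Q{numeric_id}",
    "property": lambda numeric_id: f"P{numeric_id}",
}


def compose_entity_id(entity_type: str, numeric_id: int) -> str:
    """Build the ID serialization for a type that has a registered composer."""
    try:
        composer = ID_COMPOSERS[entity_type]
    except KeyError:
        raise ValueError(f"no ID composer registered for {entity_type!r}") from None
    return composer(numeric_id)


def supported_entity_types() -> List[str]:
    """Entity types for which an ID can be constructed programmatically."""
    return sorted(ID_COMPOSERS)
```

For example, `compose_entity_id("property", 31)` yields `"P31"`. Note that a composer of this kind is an opaque callable in application code, so by itself it does not give a database query anything to join on.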

hoo added a comment. Oct 18 2016, 1:26 PM

@thiemowmde: The essential part is that we need to be able to do this in SQL, not (only) in PHP.

thiemowmde moved this task from Proposed to Review on the Wikidata-Sprint board.
thiemowmde moved this task from incoming to in current sprint on the Wikidata board.

Change 314554 merged by jenkins-bot:
New SqlEntitiesWithoutTermFinder implementation

https://gerrit.wikimedia.org/r/314554

Change 314555 merged by jenkins-bot:
Introduce the "supportedEntityTypesForEntitiesWithoutTermListings" setting

https://gerrit.wikimedia.org/r/314555

Change 315694 merged by jenkins-bot:
Remove hard-coded supportedEntityTypesForEntitiesWithoutTermListings default

https://gerrit.wikimedia.org/r/315694

daniel closed this task as Resolved. Jan 5 2017, 7:00 PM
Ladsgroup moved this task from Review to Done on the Wikidata-Sprint board. Mar 22 2017, 12:36 PM