Find a solution for SpecialEntitiesWithoutPage (EntitiesWithoutTermFinder)
Closed, Resolved · Public

Description

SqlEntitiesWithoutTermFinder currently works by joining the wb_terms table to the page table to find entities that don't have a specific type of term, either in a specific language or in any language.
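
For illustration only, the join described above can be sketched as a LEFT JOIN that keeps the pages with no matching term row. This is a minimal Python/sqlite3 sketch, not the actual SqlEntitiesWithoutTermFinder query: the two tables are cut down to a few columns, the join on page_title is a simplifying assumption, and the function name entities_without_term is hypothetical.

```python
import sqlite3


def entities_without_term(conn, term_type, language=None):
    """Return page titles that have no term of the given type.

    If language is None, the entity must lack that term type in every language.
    """
    # The term conditions go into the ON clause, so pages without a matching
    # term still come out of the LEFT JOIN, with NULLs on the term side.
    join_cond = "t.term_full_entity_id = p.page_title AND t.term_type = ?"
    params = [term_type]
    if language is not None:
        join_cond += " AND t.term_language = ?"
        params.append(language)
    sql = (
        "SELECT p.page_title FROM page AS p "
        f"LEFT JOIN wb_terms AS t ON {join_cond} "
        "WHERE t.term_full_entity_id IS NULL "
        "ORDER BY p.page_id"
    )
    return [row[0] for row in conn.execute(sql, params)]


# Simplified, assumed schema and made-up rows, just enough to run the query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
    CREATE TABLE wb_terms (
        term_full_entity_id TEXT, term_type TEXT,
        term_language TEXT, term_text TEXT
    );
    INSERT INTO page VALUES (1, 'Q1'), (2, 'Q2'), (3, 'Q3');
    INSERT INTO wb_terms VALUES
        ('Q1', 'label', 'en', 'universe'),
        ('Q1', 'description', 'en', 'totality of space and time'),
        ('Q2', 'label', 'de', 'Erde');
""")

print(entities_without_term(conn, "label", "en"))
print(entities_without_term(conn, "label"))
```

The first call corresponds to Special:EntitiesWithoutLabel/en, the second to the "any language" case.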

This is used in Special:EntitiesWithoutDescription and Special:EntitiesWithoutLabel.

Can this be somehow moved to using Elastic?

Do we even need these still? Maybe we should go for the simple solution and sunset them in favor of terminator?

Event Timeline

hoo created this task. May 8 2018, 10:45 AM
Restricted Application added a subscriber: Aklapper. May 8 2018, 10:45 AM
Smalyshev added a comment. (Edited) May 8 2018, 8:38 PM

Yeah elastic should be able to retrieve items with no label/description. It needs a separate query, and we need to define how to sort them (alphabetically?) but otherwise should be no problem I think.

Assuming we also want it to work without CirrusSearch, we probably need some kind of generic API. Didn't look at the code yet, so if there's none we'll have to create it.

Also, given T190022: Separate the CirrusSearch/Elastic-specific code from Wikibase code base, we probably need some hook or something to plug the Elastic part in?

One limitation Elasticsearch has is that it doesn't scroll past 10k. Not sure if this is relevant.

EBernhardson added a subscriber: EBernhardson. (Edited) May 8 2018, 9:36 PM

As another option, this would be very simple to provide as an additional keyword filter in fulltext search. I'm not sure what the use cases are for this tool and if that would help. Something like haswblabel:en|de, which can be negated with a -.

Vvjjkkii renamed this task from Find a solution for SpecialEntitiesWithoutPage (EntitiesWithoutTermFinder) to fedaaaaaaa. Jul 1 2018, 1:11 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from fedaaaaaaa to Find a solution for SpecialEntitiesWithoutPage (EntitiesWithoutTermFinder). Jul 1 2018, 3:38 PM
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.
Addshore added a subscriber: Addshore.

The campsite isn't going to work on this immediately; this should probably be decided by Wikidata-Ugly-Cat-Trailblaze (wb_terms trail blazing).

Smalyshev triaged this task as Normal priority. May 2 2019, 7:00 AM

One limitation Elasticsearch has is that it doesn't scroll past 10k. Not sure if this is relevant.

@Lydia_Pintscher thoughts on this specific point?

For wikidata.org is it okay to limit the paging to 10k?

Currently you can page past this, but I don't know if that is really useful / adding any value.
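
To make the 10k question concrete, here is a minimal sketch of what capping the pager would mean for a request's offset and limit. It is purely illustrative Python, not Wikibase or CirrusSearch code; clamp_page and MAX_RESULT_WINDOW are hypothetical names, and the window is assumed to be a fixed 10,000 results.

```python
MAX_RESULT_WINDOW = 10_000  # assumed fixed cap, the "10k" discussed above


def clamp_page(offset, limit, max_window=MAX_RESULT_WINDOW):
    """Trim a paging request so that offset + limit stays within the window.

    Returns (offset, limit), or None if the page starts at or past the cap.
    """
    if offset < 0 or limit <= 0:
        raise ValueError("offset must be >= 0 and limit must be > 0")
    if offset >= max_window:
        return None  # nothing can be served this deep into the result set
    return offset, min(limit, max_window - offset)


print(clamp_page(0, 50))       # a normal first page
print(clamp_page(9_980, 50))   # last page, shortened to fit the window
print(clamp_page(10_000, 50))  # past the cap
```

With a cap like this, a special page would simply stop offering a "next" link once the window is exhausted.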

As another option, this would be very simple to provide as an additional keyword filter in fulltext search. I'm not sure what the use cases are for this tool and if that would help. Something like haswblabel:en|de, which can be negated with a -.

That is also an interesting thing to think about.
I'm sure @Lydia_Pintscher can also help with the use cases of the pages. Maybe doing this through search would be enough?
If not, the special pages could just wrap search somehow, and expose the results in the same format as they currently are.

I would be in favor of sunsetting. Let me get you some numbers of page views of this special page.

The number of usages in the last week:

uri_path	hitcount
/wiki/Special:EntitiesWithoutLabel/ru	2
/wiki/Special:EntitiesWithoutDescription/de	2
/wiki/Special:EntitiesWithoutDescription/en	3
/wiki/Special:EntitiesWithoutLabel/fr/property	11
/wiki/Special:EntitiesWithoutLabel/uk/property	59
/wiki/Special:EntitiesWithoutDescription	8
/wiki/Special:EntitiesWithoutLabel	50
/wiki/Special:EntitiesWithoutDescription/ru	2
/wiki/Special:EntitiesWithoutLabel/de	7
/wiki/Special:EntitiesWithoutLabel/en	4
/wiki/Special:EntitiesWithoutLabel/zh/property	2
/wiki/Special:EntitiesWithoutLabel/ko/item	3
12 rows selected (772.989 seconds)
jeblad added a subscriber: jeblad. Tue, Oct 29, 12:21 PM

The use case has never gone away, but the pages are hard to find and aren't used. That is a problem they share with a lot of special pages, and it will not go away by linking to some external tool. (In fact the problem grows, but that is another discussion.)

Part of the problem is that the special pages live on the repo, where the problem created by a lack of labels or descriptions is small, while the problem they were meant to solve exists on the clients, where a reference to an entity with a missing label often ends up as a Q-id. The response from some users in the Wikipedia communities is rather tiresome…

Note that the language communities (especially those of any size) mostly exist at the clients, not so much at the repo. That is part of the reason why the special pages aren't used very much. Using special pages at the repo is simply outside the users' mental model.

The WBridge will probably solve most of this, especially for the labels, not so sure about the descriptions though.

The users' mental model is probably to search for items without a label on the client site, while it is probably easier to implement on the repo.

Addshore claimed this task. Tue, Oct 29, 1:49 PM
Restricted Application added a project: User-Addshore. Tue, Oct 29, 1:49 PM

I would be in favor of sunsetting. Let me get you some numbers of page views of this special page.

We should consider both the use cases of wikidata.org and those of other Wikibase users.

I agree with jeblad that the use case doesn't go away. The most important one is "I want to add labels to the Items and Properties that are used most so that other people can use the data in my language as well". The problem with the current page is that it is totally overwhelming and meaningless for 65 million Items. We would need to add some kind of importance ranking, ideally linked to how often the Item is used as a value in other statements, but we can talk about other rankings as well.
Is this something we could address by hooking into Elastic?

EBernhardson added a comment. (Edited) Tue, Oct 29, 8:08 PM

Elastic already has filters for things such as "pages with labels in language x", and these can be negated. I'm not entirely sure, but I think the incoming links count is at least related to how often the item is used. Due to the way Wikidata is structured, the incoming link count isn't nearly as good an indicator of popularity as it is on other wikis; the top linked things look to be things bots have linked heavily (this happens on Wikipedias too, it just looks more pronounced in these results). So basically:

https://www.wikidata.org/w/index.php?sort=incoming_links_desc&search=-haslabel%3Apl

This actually looks pretty perfect! Thank you.
The few quick searches I did did return meaningful results. For German for example the label for the Item for GeneDB was missing. This is an Item that is linked in millions of references and definitely should have a label.

So maybe we can replace the special page with a simple form that lets you select the language and then redirects you to the search page with the correct parameters?
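
As a rough sketch of that redirect idea, the target URL can be assembled from the language code alone. This is illustrative Python rather than the PHP a special page would use; the function name missing_label_search_url is hypothetical, and the base URL, the -haslabel: keyword and the incoming_links_desc sort are taken from the search link above.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://www.wikidata.org/w/index.php"


def missing_label_search_url(language, base=SEARCH_BASE):
    """Build a search URL listing entities without a label in `language`,
    sorted by incoming link count (most linked first)."""
    if not language:
        raise ValueError("a language code is required")
    query = urlencode({
        "sort": "incoming_links_desc",
        # Leading "-" negates the haslabel: keyword filter.
        "search": f"-haslabel:{language}",
    })
    return f"{base}?{query}"


print(missing_label_search_url("pl"))
print(missing_label_search_url("de"))
```

A form on the special page would then only need a language selector and a redirect to the returned URL.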

You can also use * in place of a language code to search for entities that (don’t) have labels in any language. For example, there are 540 items without any labels, descriptions, and statements.

Elastic search is great but we need to have a solution (maybe?) for third party installations that don't have elastic installed.

I would be fine for now with this working only if you have Elastic installed.

Should not be too difficult to do this from the client too…? Please…? :D

Addshore closed this task as Resolved. (Edited) Wed, Oct 30, 1:17 PM

Resolving this task, since it was about figuring this out. I will make a new one for doing the work:

T236901: Use elastic for SpecialEntitiesWithoutPage (EntitiesWithoutTermFinder)