
The fulltext search engine should rank items with human edits higher than bot edits
Open, Needs Triage · Public

Description

Wikicite created many Wikidata items, and that's great, but it sometimes makes searching hard. If I type "heart rhythm" into the newly proposed search engine, it doesn't manage to list the correct item at the top.
There are similar concerns for other bot-created items about geology, songs and films.

I think a good solution would be to store the number of human edits that each item gets as a variable that's meaningful for the search ranking. Items about scientific articles that don't have any human edits would be downranked as a result.
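
To make the idea concrete, here is a minimal sketch of how a stored human-edit count could feed into ranking. Everything in it is an assumption for illustration: the `Item` fields, the `weight` parameter and the logarithmic formula are hypothetical and are not how the actual search engine scores items.

```python
import math
from dataclasses import dataclass


@dataclass
class Item:
    label: str
    text_score: float  # relevance from the text match alone (hypothetical)
    human_edits: int   # stored count of non-bot edits (hypothetical field)


def rank_score(item: Item, weight: float = 0.5) -> float:
    """Scale the text score by a dampened function of the human edit count.

    log1p keeps the factor at 1.0 for items with no human edits and stops
    heavily edited items from overwhelming the text match.
    """
    return item.text_score * (1.0 + weight * math.log1p(item.human_edits))


def rank(items: list[Item], weight: float = 0.5) -> list[Item]:
    """Return items ordered from highest to lowest combined score."""
    return sorted(items, key=lambda it: rank_score(it, weight), reverse=True)


if __name__ == "__main__":
    results = rank([
        Item("bot-imported article mentioning heart rhythm", 9.0, 0),
        Item("heart rhythm (concept)", 8.0, 40),
    ])
    for it in results:
        print(it.label)
```

With these made-up numbers the concept item overtakes the bot-imported article even though its raw text score is lower, because the article has no human edits and keeps a factor of 1.0.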

Event Timeline

Restricted Application added a subscriber: Aklapper. · Dec 19 2017, 1:45 PM

I personally don't think this is a good approach. We should just make scientific articles rank lower for a bit as a start.

It's definitely possible to deboost at least those, the same way we do for disambig pages (maybe not with the same value; we can tweak it). I am not sure what the process should be here - @Lydia_Pintscher maybe you have an idea? Should I put a note in project chat and, if there are no objections, just put it in the config?
It probably won't happen until January anyway - we're in the "no deploys for holidays" zone, but we could roll out the config change in January when everybody's back.
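
For illustration only, a class-based deboost of this kind could look roughly like the sketch below. The class names, the multiplier values and the function are all hypothetical; this is not the real search config format, just the general shape of "multiply the score by a per-class factor".

```python
# Hypothetical per-class multipliers; values below 1.0 deboost an item.
# Both numbers are placeholders that would need tuning.
CLASS_WEIGHTS = {
    "disambiguation page": 0.25,
    "scientific article": 0.5,
}


def apply_class_weight(text_score, classes, weights=CLASS_WEIGHTS):
    """Multiply the text score by the strongest deboost among the item's classes.

    Classes without an entry in the table leave the score unchanged.
    """
    factor = 1.0
    for cls in classes:
        factor = min(factor, weights.get(cls, 1.0))
    return text_score * factor


if __name__ == "__main__":
    print(apply_class_weight(10.0, ["scientific article"]))
    print(apply_class_weight(10.0, ["human"]))
```

Keeping the multipliers in one table means a new class can be deboosted, or a value tweaked, without touching the scoring code.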

Yeah that seems like a good way forward.

Scientific articles are one class of entities where this problem exists, and given that they have more words in the title and we have >10 million of them, they are the most important. However, I remember similar issues in the past with the old search for songs and GeoNames-derived geographic items as well.
If we imported those 2 million German companies, they would likely also produce a lot of hits.

I'm okay with fixing it by deranking scientific articles specifically. But I would expect that even if we derank every class of entries that is problematic at the moment, sooner or later a bot will add a new big dataset whose items show up in search results and outrank human-created items when they shouldn't.