
Explore getting articles ranked by sitelinks before anything else
Open, Medium, Public

Description

We can maybe use WDQS to fetch articles ranked by sitelinks, instead of the current implementation of first getting the mostpopular 500 articles, and subsequently ranking them by sitelinks.
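A minimal sketch of what that could look like, assuming WDQS exposes a precomputed per-item sitelink count via the wikibase:sitelinks predicate (an assumption; without such a count, the server would have to count sitelink triples per item, which is what times out below). The endpoint URL is the public one; rank_items is a hypothetical helper for re-sorting parsed results client-side:

```python
# Hypothetical sketch: ask WDQS for items ranked by sitelink count directly,
# assuming the wikibase:sitelinks predicate (a precomputed per-item count)
# is available, so the server never has to count sitelink triples itself.
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

TOP_BY_SITELINKS = """
SELECT ?item ?itemLabel ?count WHERE {
  ?item wikibase:sitelinks ?count .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY DESC(?count)
LIMIT 500
"""

def fetch_ranked_items(query=TOP_BY_SITELINKS, endpoint=WDQS_ENDPOINT):
    """Run a SPARQL query against WDQS and return the parsed result bindings."""
    req = Request(endpoint + "?query=" + quote(query),
                  headers={"Accept": "application/sparql-results+json",
                           "User-Agent": "recommendation-api-sketch/0.1"})
    with urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]

def rank_items(bindings):
    """Re-sort SPARQL result bindings by sitelink count, descending."""
    return sorted(bindings, key=lambda b: int(b["count"]["value"]), reverse=True)
```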

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Feb 23 2017, 6:45 PM
leila triaged this task as Low priority. · Feb 23 2017, 10:50 PM

The current query I'm using needs some sort of qualifier before getting the sitelinks, otherwise it times out. It's currently using the featured article badge, but that doesn't provide good results. Do you have any suggestions, @leila?

# Rank items by how many wikis carry the featured-article badge (Q17437796)
# for them; the badge is only a seed filter to keep the query from timing out.
SELECT ?item ?itemLabel ?count WHERE {
  {
    # Inner query: per item, count the distinct articles carrying the badge.
    SELECT ?item (COUNT(DISTINCT ?article) AS ?count) WHERE {
      ?article wikibase:badge wd:Q17437796 .
      ?article schema:about ?item .
    }
    GROUP BY ?item
    ORDER BY DESC(?count)
    LIMIT 500
  }

  # Keep only items that have an English Wikipedia article...
  FILTER EXISTS {
    ?sitelink schema:about ?item .
    ?sitelink schema:isPartOf <https://en.wikipedia.org/> .
  }

  # ...but no German Wikipedia article (the translation target).
  FILTER NOT EXISTS {
    ?sitelink schema:about ?item .
    ?sitelink schema:isPartOf <https://de.wikipedia.org/> .
  }

  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
ORDER BY DESC(?count)
leila added a comment. · Mar 7 2017, 1:36 AM

@schana Do you know what causes the timeout? Is it the number of articles (items) the query needs to consider?

@leila That would be my guess. As written, without the qualifier the query would be selecting every article for every item.

leila added a comment. · Mar 7 2017, 5:18 PM

Got you. @schana I have some ideas here, but they will need some more work. My suggestion is that we focus on the stub-expansion needs at the moment, since that's time-sensitive, and come back to this task in a few weeks.

leila moved this task from Next up to Backlog on the Recommendation-API board. · Mar 8 2017, 6:06 PM
leila moved this task from Backlog to Paused on the Recommendation-API board. · Mar 8 2017, 6:18 PM
leila raised the priority of this task from Low to Medium. · Apr 13 2017, 4:14 PM
leila moved this task from Paused to Next up on the Recommendation-API board.
leila added a comment. · Apr 13 2017, 5:14 PM

@schana please reach out to Joseph to see if he can provide some leads for addressing this.

@leila is this a question you'd be interested in getting qualitative comparison data on?

leila added a comment. · Apr 13 2017, 8:45 PM

Not at this point, @Capt_Swing. This task is one step towards T162912, and we know we should have our backend ready to handle this kind of change. Thanks for keeping an eye on us though. :)

Moving the email thread here:

In support of providing translation recommendations, we're trying to improve the algorithm. Currently, the recommendations are initially sourced from the most popular (by pageviews) articles over the past two weeks. We want the ability to instead source them based on the number of languages they exist in.
I've been exploring WDQS, but I can't find a performant way to structure the query (https://phabricator.wikimedia.org/T158889#3061794).

I have a version of Wikidata on Hadoop though, on which we could write a similar query (though I don't really understand the query in the task).
Maybe having it parallelised could help?

Hi @Smalyshev, is the following something you can help us with?

Alternatively, could we periodically build a dataset offline with enough entries to provide recommendations? It seems that as the algorithm grows in complexity, performance could degrade until it is no longer satisfactory.

We may have to do this at some point, and we can certainly start experimenting with it. Let's talk about it tomorrow in our meeting. However, for features that are easy to extract, such as pageviews in the source language or number of languages that contain that article, we should be able to have an algorithm that can provide recommendations on the fly.

The query is counting the number of sitelinks for each Wikidata item and sorting them in descending order. The problem I'm having is that, with the way the data is structured, the query covers too large a dataset to complete in a reasonable amount of time. This led to needing an initial filter to limit the sitelink query to 500 items.
It would be preferable if we could remove the initial filter and have some way to rank all items by the number of sitelinks they have. Can the version of Wikidata on Hadoop support this?
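For reference, a rough sketch of what that offline ranking could look like, written here as plain Python over a Wikidata JSON dump rather than as an actual Hadoop/Spark job; the function names are made up for illustration:

```python
# Sketch of the offline alternative: rank items by sitelink count straight
# from a Wikidata JSON dump, where each line holds one entity record with a
# "sitelinks" object (per the public entity JSON format). A Hadoop/Spark job
# would do the same map-and-rank in parallel; function names are hypothetical.
import heapq
import json

def sitelink_count(entity):
    """Number of sitelinks attached to one Wikidata entity record."""
    return len(entity.get("sitelinks", {}))

def top_items_by_sitelinks(lines, n=500):
    """Return the n (count, item id) pairs with the most sitelinks."""
    pairs = []
    for line in lines:
        line = line.strip().rstrip(",")      # dump lines end with a comma
        if not line.startswith("{"):
            continue                         # skip the wrapping "[" / "]"
        entity = json.loads(line)
        pairs.append((sitelink_count(entity), entity["id"]))
    return heapq.nlargest(n, pairs)
```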

@Smalyshev Is there an easy way to rank all articles by the count of sitelinks?

@JAllemandou Is there existing documentation for querying the version of Wikidata on Hadoop?

Hi!
No doc for the Hadoop version of Wikidata: it was only me playing.
However, there is code here: https://gerrit.wikimedia.org/r/#/c/346726/
And a page with a very simple bootstrapping example here: https://wikitech.wikimedia.org/wiki/User:Joal/Wikidata_Graph

As stated by email, I'll try to build a small spark job soon.

I spoke with @Smalyshev, and the initial thinking is that Elasticsearch may be able to accommodate this if the sitelink data is added as a field. I'm still gathering more info to determine which route we want to take for surfacing this data.
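To make the idea concrete, here is a sketch of what the ranking could look like if a numeric sitelink count were indexed per page document. The field name "sitelink_count" is an assumption for illustration, not CirrusSearch's actual mapping:

```python
# Hypothetical sketch: if each page document carried an indexed numeric
# sitelink count (the field name "sitelink_count" is assumed, not the real
# CirrusSearch mapping), ranking becomes a simple sort in the Elasticsearch
# query DSL instead of a live count over sitelink triples.
def sitelink_ranked_query(size=500):
    """Build an Elasticsearch query body sorted by the sitelink-count field."""
    return {
        "size": size,
        "query": {"match_all": {}},
        "sort": [{"sitelink_count": {"order": "desc", "missing": "_last"}}],
    }
```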

I'm working with the sitelink data (T162912) and could use it for ranking as a side effect of the model building.

bmansurov moved this task from Next up to Backlog on the Recommendation-API board. · Feb 27 2019, 2:39 PM