
Check to see if popularity_score is wrong in a lot of articles
Closed, Resolved, Public

Description

For some articles the popularity_score seems incredibly high.

For History_of_the_Houston_Rockets, popularity_score is 2.1e-4, which is far higher than the main page at 9e-6.

This does not seem to reflect the actual pageview numbers.

Event Timeline

Deskana subscribed.

This particular example is fixed... but we don't know whether there are other examples where this is incredibly wrong.

We can run some queries in hive to figure out if there are other examples where this is wrong.

Deskana renamed this task from "popularity_score seems wrong in some articles" to "Check to see if popularity_score is wrong in a lot of articles". Oct 18 2016, 5:43 PM
Deskana moved this task from needs triage to Current work on the Discovery-Search board.

Main_page:

select * from popularity_score where agg_days=7 and month=10 and day=10 and project = "en.wikipedia" and page_id = 15580374;
en.wikipedia  15580374        9.509535516926464E-6    7       2016    10      10

Pages with score > main page

select count(1) from popularity_score where agg_days=7 and month=10 and day=10 and project = "en.wikipedia" and score > 0.000009509535516926464;
670917

So, 670k pages with a score higher than the main page in en.wikipedia. Certainly seems wrong :S Going to spend some time looking over how we calculate the popularity score to make sure it's sane.
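The same comparison can be scripted against an exported slice of the table. This is a minimal sketch, assuming the rows are available as (page_id, score) pairs. The function name `pages_scoring_above` is hypothetical, and the sample holds only three rows taken from query output in this task.

```python
def pages_scoring_above(rows, reference_page_id):
    """Return the page_ids whose score is strictly greater than the reference page's."""
    scores = dict(rows)
    reference = scores[reference_page_id]
    return sorted(pid for pid, score in scores.items() if score > reference)


MAIN_PAGE_ID = 15580374  # en.wikipedia main page, from the query above

# Tiny sample: the main page row plus two of the highest-scoring rows
# reported in this task. A real check would load the full partition.
sample_rows = [
    (15580374, 9.509535516926464e-6),
    (51111498, 0.9830508474576272),
    (11151489, 0.9442508710801394),
]

offenders = pages_scoring_above(sample_rows, MAIN_PAGE_ID)
print(len(offenders), offenders)
```

If the main page really is the most viewed page, the list of offenders should be empty for any project.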

For comparison, https://tools.wmflabs.org/topviews/?project=en.wikipedia.org&platform=all-access&date=last-month&excludes= shows the most popular pages for the month of September (un-check 'Show only mainspace pages'). Over a full month, the main page has 2 orders of magnitude more views than any other individual page. Our stats are aggregated over a week instead of a month, but should be similar.

Yea, our calculation is very, very wrong somewhere:

select project, page_id, score from popularity_score where agg_days=7 and month=10 and day=10 and project = "en.wikipedia" order by score desc limit 5
en.wikipedia    51111498        0.9830508474576272
en.wikipedia    11151489        0.9442508710801394
en.wikipedia    13915183        0.9270516717325228
en.wikipedia    12330714        0.66
en.wikipedia    36903581        0.33496932515337424

The most popular page has a score of almost 1. That page is: Category:Political office-holders in Mississippi

Change 316931 had a related patch set uploaded (by EBernhardson):
Convert popularity_score into hql script

https://gerrit.wikimedia.org/r/316931

Not sure what's wrong with the spark implementation, but I figured we only used spark because we thought the calculation was going to get more complex, and it never did. Reduced it all to a simpler hive script. New results (in ebernhardson.popularity_score):
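For readers unfamiliar with the metric, here is a rough sketch of the kind of aggregation involved. It assumes the score is a page's share of its project's pageviews over the aggregation window; the task does not spell out the formula, and this is not the HQL from change 316931. The function name, row layout, and view counts are all made up for illustration.

```python
from collections import defaultdict


def popularity_scores(rows):
    """rows: iterable of (project, page_id, view_count) over the aggregation window.

    Returns {(project, page_id): share of that project's total views}.
    """
    page_views = defaultdict(int)
    project_views = defaultdict(int)
    for project, page_id, views in rows:
        page_views[(project, page_id)] += views
        project_views[project] += views
    return {
        (project, page_id): views / project_views[project]
        for (project, page_id), views in page_views.items()
        if project_views[project] > 0
    }


# Hypothetical view counts, several rows per page as with hourly data.
rows = [
    ("en.wikipedia", 1, 60),
    ("en.wikipedia", 1, 30),
    ("en.wikipedia", 2, 10),
    ("de.wikipedia", 7, 5),
]
scores = popularity_scores(rows)
```

Under this assumed definition, the scores within one project sum to 1, so a top score near 1 for an obscure page is a strong hint that the numerator or denominator was computed over the wrong set of rows.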

select * from popularity_score where agg_days=7 and month=10 and day=10 and project = "en.wikipedia" and page_id = 15580374;
en.wikipedia    15580374        0.0939092050094579      7       2016    10      10
select count(1) from popularity_score where agg_days=7 and month=10 and day=10 and project = "en.wikipedia" and score > 0.0939092050094579;
0
select project, page_id, score from popularity_score where agg_days=7 and month=10 and day=10 and project = "en.wikipedia" order by score desc limit 5;
en.wikipedia    15580374        0.0939092050094579
en.wikipedia    18746613        0.0010522045684449714
en.wikipedia    32146723        9.184855789850993E-4
en.wikipedia    36254613        8.842523452389042E-4
en.wikipedia    42652013        7.870274771042674E-4

Those pages are:

[0] => "Main Page"
[1] => "AMGTV"
[2] => "Proyecto 40"
[3] => "Flow control"
[4] => "XHamster"

This data looks a lot more sane, and roughly matches the titles and orders of magnitude shown on the topviews labs project (sourced from the pageviews api). We may have to re-tune the weight used in the completion suggester and full text search though.
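To put numbers on both points, here is a small arithmetic sketch using only scores quoted in this task. The helper name `orders_of_magnitude` is illustrative.

```python
import math

# Scores copied from the query output in this task (en.wikipedia, agg_days=7).
OLD_MAIN_PAGE = 9.509535516926464e-6   # spark implementation
NEW_MAIN_PAGE = 0.0939092050094579     # hive script
NEW_RUNNER_UP = 0.0010522045684449714  # second-highest score from the hive script


def orders_of_magnitude(a, b):
    """Base-10 log of the ratio a / b."""
    return math.log10(a / b)


# How far the main page score moved between the two implementations.
scale_shift = orders_of_magnitude(NEW_MAIN_PAGE, OLD_MAIN_PAGE)

# How far the main page leads the next page in the new data.
lead = orders_of_magnitude(NEW_MAIN_PAGE, NEW_RUNNER_UP)

print(f"main page old vs new: {scale_shift:.2f} orders of magnitude")
print(f"main page vs runner-up: {lead:.2f} orders of magnitude")
```

The main page leads the runner-up by roughly two orders of magnitude, in line with topviews, and its score moved by roughly four orders of magnitude between implementations. That shift is for the main page only; other pages moved by different amounts, so any weight tuned against the old values would need to be checked rather than simply rescaled.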

Change 316931 merged by jenkins-bot:
Convert popularity_score into hql script

https://gerrit.wikimedia.org/r/316931