For some articles the popularity_score seems incredibly high.
For History_of_the_Houston_Rockets the popularity_score is 2.1e-4, which is way higher than the main page at 9e-6.
This does not seem to reflect the actual pageview numbers.
Subject | Repo | Branch | Lines +/- |
---|---|---|---|
Convert popularity_score into hql script | wikimedia/discovery/analytics | master | +94 -327 |
This particular example is fixed... but we don't know whether there are other examples where this is incredibly wrong.
We can run some queries in hive to figure out if there are other examples where this is wrong.
Main_page:
select * from popularity_score where agg_days=7 and month=10 and day=10 and project = "en.wikipedia" and page_id = 15580374;
en.wikipedia 15580374 9.509535516926464E-6 7 2016 10 10
Pages with score > main page:
select count(1) from popularity_score where agg_days=7 and month=10 and day=10 and project = "en.wikipedia" and score > 0.000009509535516926464;
670917
So, 670k pages with a score higher than the main page in en.wikipedia. Certainly seems wrong :S Going to spend some time looking over how we calculate the popularity score to make sure it's sane.
For comparison, for the month of September https://tools.wmflabs.org/topviews/?project=en.wikipedia.org&platform=all-access&date=last-month&excludes= shows the most popular pages (un-check 'Show only mainspace pages'). Over a full month the main page has 2 orders of magnitude more views than any other individual page. Our stats are aggregated over a week instead of a month, but should be similar.
Yea, our calculation is very, very wrong somewhere:
select project, page_id, score from popularity_score where agg_days=7 and month=10 and day=10 and project = "en.wikipedia" order by score desc limit 5;
en.wikipedia 51111498 0.9830508474576272
en.wikipedia 11151489 0.9442508710801394
en.wikipedia 13915183 0.9270516717325228
en.wikipedia 12330714 0.66
en.wikipedia 36903581 0.33496932515337424
The most popular page has a score of almost 1. That page is: Category:Political office-holders in Mississippi
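The two checks above can be sketched outside of hive as well. This is a minimal sketch using only the scores quoted in this thread; the helper name `pages_above` is made up for illustration. The sum check additionally assumes the score is meant to be a page's share of its project's pageviews, in which case the scores for one project should never add up to more than 1.

```python
# Scores copied from the query output in this thread
# (en.wikipedia, agg_days=7, 2016-10-10, old spark implementation).
MAIN_PAGE_ID = 15580374
scores = {
    15580374: 9.509535516926464e-06,  # Main Page
    51111498: 0.9830508474576272,     # Category:Political office-holders in Mississippi
    11151489: 0.9442508710801394,
    13915183: 0.9270516717325228,
    12330714: 0.66,
    36903581: 0.33496932515337424,
}


def pages_above(scores, reference_id):
    """Return the page ids whose score is strictly higher than the reference page's."""
    reference = scores[reference_id]
    return sorted(pid for pid, score in scores.items() if score > reference)


suspicious = pages_above(scores, MAIN_PAGE_ID)
total = sum(scores.values())

print(len(suspicious), "of the sampled pages outrank the main page")
# Assumption: if a score is a share of project pageviews, this should be <= 1.
print("sum of sampled scores:", total)
```

Under that assumption the top three rows alone already rule out the old numbers, without needing to know the real pageview counts.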
Change 316931 had a related patch set uploaded (by EBernhardson):
Convert popularity_score into hql script
Not sure what's wrong with the spark implementation, but I figured we only used spark because we thought the calculation was going to get more complex, and it never did. Reduced it all to a simpler hive script. New results (in ebernhardson.popularity_score):
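For anyone reading along, here is a rough sketch of what such an aggregation might look like. It is not the hql script from the patch. It assumes the score is a page's pageviews divided by the total pageviews of its project over the aggregation window, which the thread does not spell out. The function name, the row layout and the view counts are all hypothetical.

```python
from collections import defaultdict


def popularity_scores(rows):
    """Compute per-page scores from (project, page_id, view_count) rows.

    Assumption: score = page views / total views of the same project,
    summed over whatever window the input rows cover.
    """
    page_views = defaultdict(int)
    project_views = defaultdict(int)
    for project, page_id, views in rows:
        page_views[(project, page_id)] += views
        project_views[project] += views
    return {
        (project, page_id): views / project_views[project]
        for (project, page_id), views in page_views.items()
    }


# Hypothetical view counts, two days of data for three pages.
rows = [
    ("en.wikipedia", 1, 900),
    ("en.wikipedia", 2, 60),
    ("en.wikipedia", 1, 1000),
    ("en.wikipedia", 3, 40),
]
result = popularity_scores(rows)
for key, score in sorted(result.items(), key=lambda item: -item[1]):
    print(key, score)
```

With a definition like this the scores within a project sum to 1, so a heavily viewed page like the main page ends up with the largest share.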
select * from popularity_score where agg_days=7 and month=10 and day=10 and project = "en.wikipedia" and page_id = 15580374;
en.wikipedia 15580374 0.0939092050094579 7 2016 10 10
select count(1) from popularity_score where agg_days=7 and month=10 and day=10 and project = "en.wikipedia" and score > 0.0939092050094579;
0
Top 5 by score:
en.wikipedia 15580374 0.0939092050094579
en.wikipedia 18746613 0.0010522045684449714
en.wikipedia 32146723 9.184855789850993E-4
en.wikipedia 36254613 8.842523452389042E-4
en.wikipedia 42652013 7.870274771042674E-4
Those pages are:
[0] => "Main Page"
[1] => "AMGTV"
[2] => "Proyecto 40"
[3] => "Flow control"
[4] => "XHamster"
This data looks a lot more sane, and roughly matches the titles and orders of magnitude shown on the topviews labs project (sourced from the pageviews api). We may have to re-tune the weight used in the completion suggester and full text search though.
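As a quick check of the orders-of-magnitude claim, the gap between the main page and the runner-up in the new results can be computed directly. Both scores are copied from the output above; the variable names are just for illustration.

```python
import math

# New scores from ebernhardson.popularity_score (en.wikipedia, agg_days=7).
main_page = 0.0939092050094579
runner_up = 0.0010522045684449714  # page_id 18746613

# log10 of the ratio gives the gap in orders of magnitude.
gap = math.log10(main_page / runner_up)
print(f"main page leads by {gap:.2f} orders of magnitude")
```

That comes out a little under two, in line with what topviews shows for a full month.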