Check to see if popularity_score is wrong in a lot of articles
Closed, Resolved (Public)

Assigned To

Authored By

	dcausse
	Oct 14 2016, 11:05 AM

Description

For some articles the popularity_score seems incredibly high.

For History_of_the_Houston_Rockets, popularity_score is 2.1e-4, which is far higher than the main page's 9e-6.

This does not seem to reflect the actual pageview numbers.

Details

	Subject	Repo	Branch	Lines +/-
	Convert popularity_score into hql script	wikimedia/discovery/analytics	master	+94 -327


Event Timeline

dcausse created this task. Oct 14 2016, 11:05 AM

Restricted Application added a subscriber: Aklapper. Oct 14 2016, 11:05 AM

+ Discovery-Search, CirrusSearch

Restricted Application added a project: Discovery-ARCHIVED. Oct 14 2016, 11:11 AM

This particular example is fixed... but we don't know whether there are other examples where this is incredibly wrong.

We can run some queries in hive to figure out if there are other examples where this is wrong.

• Deskana renamed this task from "popularity_score seems wrong in some articles" to "Check to see if popularity_score is wrong in a lot of articles". Oct 18 2016, 5:43 PM

• Deskana moved this task from needs triage to Current work on the Discovery-Search board.

• Deskana edited projects, added Discovery-Search (Current work); removed Discovery-Search.

EBernhardson claimed this task. Oct 19 2016, 6:42 PM

EBernhardson moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

Main_page:

select * from popularity_score where agg_days=7 and month=10 and day=10 and project = "en.wikipedia" and page_id = 15580374;
en.wikipedia  15580374        9.509535516926464E-6    7       2016    10      10

Pages with a score higher than the main page:

select count(1) from popularity_score where agg_days=7 and month=10 and day=10 and project = "en.wikipedia" and score > 0.000009509535516926464;
670917

So, about 670k pages in en.wikipedia have a score higher than the main page. That certainly seems wrong :S I'm going to spend some time looking over how we calculate the popularity score to make sure it's sane.

For comparison, for the month of September, https://tools.wmflabs.org/topviews/?project=en.wikipedia.org&platform=all-access&date=last-month&excludes= shows the most popular pages (un-check 'Show only mainspace pages'). Over a full month the main page has two orders of magnitude more views than any other individual page. Our stats are aggregated over a week instead of a month, but the picture should be similar.

Yeah, our calculation is very, very wrong somewhere:

select project, page_id, score from popularity_score where agg_days=7 and month=10 and day=10 and project = "en.wikipedia" order by score desc limit 5
en.wikipedia    51111498        0.9830508474576272
en.wikipedia    11151489        0.9442508710801394
en.wikipedia    13915183        0.9270516717325228
en.wikipedia    12330714        0.66
en.wikipedia    36903581        0.33496932515337424

The top-scoring page has a score of almost 1. That page is Category:Political office-holders in Mississippi.
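The two Hive checks above (look up the main page's score, then count pages scoring higher) amount to a simple sanity test. Below is a minimal Python sketch of the same logic, using only scores quoted in this thread. The function name `count_above_reference` is made up for illustration, and this is not the code that was actually run.

```python
def count_above_reference(scores, reference_page_id):
    """Count pages whose score is strictly greater than the reference page's score."""
    reference = scores[reference_page_id]
    return sum(1 for score in scores.values() if score > reference)


# Scores copied from the query results above (en.wikipedia, agg_days=7).
old_scores = {
    15580374: 9.509535516926464e-6,  # Main_page
    51111498: 0.9830508474576272,
    11151489: 0.9442508710801394,
    13915183: 0.9270516717325228,
    12330714: 0.66,
    36903581: 0.33496932515337424,
}

# Any non-zero count here is suspicious, since the main page should dominate.
suspicious = count_above_reference(old_scores, 15580374)
print(suspicious)
```

On the full table the equivalent count was 670917, which is what flagged the problem.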

Change 316931 had a related patch set uploaded (by EBernhardson):
Convert popularity_score into hql script

https://gerrit.wikimedia.org/r/316931
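For readers unfamiliar with the metric, here is a rough Python sketch of what a score of this kind could compute. It assumes (the thread does not state this) that popularity_score is a page's share of its project's total views over the aggregation window. The function name and the sample rows are invented for illustration and are not taken from the patch.

```python
from collections import defaultdict


def popularity_scores(rows):
    """rows: iterable of (project, page_id, view_count) tuples.

    Returns {(project, page_id): share of that project's total views}.
    ASSUMPTION: score = page views / project views; not confirmed by the thread.
    """
    page_views = defaultdict(int)
    project_views = defaultdict(int)
    for project, page_id, views in rows:
        page_views[(project, page_id)] += views
        project_views[project] += views
    return {
        key: views / project_views[key[0]]
        for key, views in page_views.items()
    }


# Made-up rows; a page may appear more than once (e.g. one row per hour).
sample_rows = [
    ("en.wikipedia", 1, 900),
    ("en.wikipedia", 1, 100),
    ("en.wikipedia", 2, 600),
    ("en.wikipedia", 3, 400),
    ("de.wikipedia", 1, 50),
]
scores = popularity_scores(sample_rows)
en_total = sum(s for (project, _), s in scores.items() if project == "en.wikipedia")
```

Under that assumption the scores within one project sum to 1, so the old top three en.wikipedia scores (0.98, 0.94 and 0.93) could not all be correct at once.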

gerritbot added a project: Patch-For-Review. Oct 20 2016, 6:44 AM

Not sure what's wrong with the Spark implementation, but I figured we only used Spark because we thought it was going to get more complex, and it never did. I simplified it all down to a Hive script. New results (in ebernhardson.popularity_score):

select * from popularity_score where agg_days=7 and month=10 and day=10 and project = "en.wikipedia" and page_id = 15580374;
en.wikipedia    15580374        0.0939092050094579      7       2016    10      10

select count(1) from popularity_score where agg_days=7 and month=10 and day=10 and project = "en.wikipedia" and score > 0.0939092050094579;
0

Top 5 pages by score:

en.wikipedia    15580374        0.0939092050094579
en.wikipedia    18746613        0.0010522045684449714
en.wikipedia    32146723        9.184855789850993E-4
en.wikipedia    36254613        8.842523452389042E-4
en.wikipedia    42652013        7.870274771042674E-4

Those pages are:

[0] => "Main Page"
[1] => "AMGTV"
[2] => "Proyecto 40"
[3] => "Flow control"
[4] => "XHamster"

This data looks a lot more sane, and roughly matches the titles and orders of magnitude shown on the topviews labs project (sourced from the pageviews API). We may have to re-tune the weight used in the completion suggester and full-text search, though.
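As a quick arithmetic check of the "orders of magnitude" remark, the gap between the main page and the runner-up in the new table can be computed directly from the two scores listed above. This is only a sketch, and the variable names are illustrative.

```python
import math

main_page = 0.0939092050094579     # Main Page, new table
runner_up = 0.0010522045684449714  # AMGTV, second-highest score in the new table

ratio = main_page / runner_up
orders_of_magnitude = math.log10(ratio)
print(f"ratio = {ratio:.1f}, log10 = {orders_of_magnitude:.2f}")
```

The ratio comes out at roughly 89, just under two orders of magnitude, which is in line with the monthly topviews figures mentioned earlier.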

Change 316931 merged by jenkins-bot:
Convert popularity_score into hql script

https://gerrit.wikimedia.org/r/316931

EBernhardson moved this task from not in use - please delete to Needs Reporting on the Discovery-Search (Current work) board. Oct 24 2016, 3:27 PM

debt closed this task as Resolved. Oct 27 2016, 7:56 PM
