Wikipedia Portal Dashboard: filter out requests for 'search-redirect.php'
Closed, Resolved · Public · 2 Story Points

Description

Research done here shows that the vast majority of traffic to the Wikipedia Portal that is referred from the Portal itself consists of requests to search-redirect.php.

We need to adjust the Portal dashboard data collection scripts to filter out these requests to search-redirect.php, as indicated by this data:

date        request              referrer             requests  proportion
2016-06-20  other                other                  197775       0.216
2016-06-20  other                search-redirect.php       224       0.000
2016-06-20  other                Wikipedia Portal       138185       0.151
2016-06-20  search-redirect.php  other                   10457       0.011
2016-06-20  search-redirect.php  search-redirect.php       738       0.001
2016-06-20  search-redirect.php  Wikipedia Portal       569928       0.621

Here's the Hive query that @mpopov used (for future reference):

-- Load the refinery JAR that provides the referrer-classification UDF.
ADD JAR /home/bearloga/Code/analytics-refinery-jars/refinery-hive.jar;
CREATE TEMPORARY FUNCTION classify_referrer AS 'org.wikimedia.analytics.refinery.hive.SmartReferrerClassifierUDF';
USE wmf;
-- Count requests for each (request, referrer) pair.
SELECT request, referrer, COUNT(1) AS requests
FROM (
  SELECT
  -- Referrer bucket: search-redirect.php, any other wikipedia.org or www.wikipedia.org URL (the Portal), or other.
  CASE WHEN referer RLIKE('^(https?://www\.)?wikipedia\.org/+search-redirect\.php\??.*') THEN 'search-redirect.php'
       WHEN referer RLIKE('^(https?://(www\.)?)?wikipedia\.org.*$') THEN 'Wikipedia Portal'
       ELSE 'other' END AS referrer,
  -- Request bucket: whether the request itself is for search-redirect.php.
  CASE WHEN uri_path = '/search-redirect.php' THEN 'search-redirect.php'
       ELSE 'other' END AS request
  FROM webrequest
  WHERE year = 2016 AND month = 06 AND day = 20
    AND webrequest_source = 'text'
    AND content_type RLIKE('^text/html')
    -- Only requests to the Portal host (wikipedia.org or www.wikipedia.org).
    AND uri_host RLIKE('^(www\.)?wikipedia.org/*$')
    AND classify_referrer(referer) IN ('internal', 'external', 'unknown')
    AND NOT (referer RLIKE('^http://localhost'))
) AS refined_webrequests
GROUP BY request, referrer;
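
A minimal sketch of what the requested filter could look like, assuming the dashboard's collection query selects Portal requests with the same WHERE conditions as the query above. The actual change is in Gerrit change 295739 and may differ; the `pageviews` alias and the single-day date range are illustrative only.

```sql
-- Hypothetical sketch, not the merged patch: count Portal requests for one day
-- while excluding requests to search-redirect.php.
USE wmf;
SELECT COUNT(1) AS pageviews  -- "pageviews" is an illustrative alias
FROM webrequest
WHERE year = 2016 AND month = 06 AND day = 20
  -- Conditions copied unchanged from the query above.
  AND webrequest_source = 'text'
  AND content_type RLIKE('^text/html')
  AND uri_host RLIKE('^(www\.)?wikipedia.org/*$')
  -- The added filter: drop requests whose path is the redirect script.
  -- Note: rows with a NULL uri_path are also dropped by this comparison.
  AND uri_path <> '/search-redirect.php';
```

On the 2016-06-20 data above, this condition would remove the three rows whose request is search-redirect.php (10457 + 738 + 569928 requests), leaving the three rows whose request is other.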

debt created this task. Jun 22 2016, 2:34 PM
Restricted Application added subscribers: Zppix, Aklapper. Jun 22 2016, 2:34 PM
mpopov claimed this task. Jun 23 2016, 4:45 PM
mpopov set the point value for this task to 2.
mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.

Change 295739 had a related patch set uploaded (by Bearloga):
Filter out search-redirect.php requests

https://gerrit.wikimedia.org/r/295739

Change 295739 merged by Bearloga:
Filter out search-redirect.php requests

https://gerrit.wikimedia.org/r/295739

Patch up. Erased pageviews and referer data from 1 May 2016. Backfilling now sans search-redirect.php requests; will make a note of the drop(s) on the dashboard.

debt added a comment. Jun 23 2016, 7:09 PM

Cool and thanks! Looking forward to seeing the note on the dashboard and the *new* results!

Change 295750 had a related patch set uploaded (by Bearloga):
Note search-redirect.php filtering

https://gerrit.wikimedia.org/r/295750

Change 295750 merged by Bearloga:
Note search-redirect.php filtering

https://gerrit.wikimedia.org/r/295750

Change 295753 had a related patch set uploaded (by Bearloga):
Deploy

https://gerrit.wikimedia.org/r/295753

Change 295753 merged by Bearloga:
Deploy

https://gerrit.wikimedia.org/r/295753