Determine if wikipedia.org portal is redirecting to itself
Closed, Resolved | Public | 3 Estimated Story Points

Description

Some of the traffic that comes to the wikipedia.org portal has the HTTP referrer 'wikipedia.org'. This implies that the page is somehow redirecting traffic to itself.

@ksmith noted that these lines in the source code of the portal page might be the reason, and it would be interesting to know whether they really are:

<!-- Search form -->
<div class="search-container">
<form class="pure-form" id="search-form" action="//www.wikipedia.org/search-redirect.php" data-el-section="search">
<fieldset>

Event Timeline

debt triaged this task as Low priority. Mar 23 2016, 8:25 PM

Regarding @ksmith's observation:
The URL www.wikipedia.org/search-redirect.php is the end-point for the search form. So when someone
makes a search on the portal page, they visit that URL. The flow looks like this:

*internet* -> www.wikipedia.org -> www.wikipedia.org/search-redirect.php -> *results page or article*
              ^                    ^
              |                    |
          referrer: *internet*    referrer: www.wikipedia.org

So the referrer of www.wikipedia.org/search-redirect.php is www.wikipedia.org.

I don't know how we count visits to the wikipedia.org portal, but if we count users visiting www.wikipedia.org/search-redirect.php
as visitors to www.wikipedia.org, that could explain why it seems that the portal is referring traffic to itself.
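
The hypothesis above can be illustrated with a small Python sketch. The request records and the helper names (`is_portal_host`, `self_referred`) are hypothetical and are not the actual dashboard counting logic; the sketch only shows that counting by host alone treats search-form submissions as portal visits referred by the portal.

```python
from urllib.parse import urlsplit

# Hypothetical request log: (requested URL, referrer). Illustrative data only.
REQUESTS = [
    ("https://www.wikipedia.org/", "https://www.google.com/"),
    ("https://www.wikipedia.org/search-redirect.php?search=cats", "https://www.wikipedia.org/"),
    ("https://www.wikipedia.org/", ""),
    ("https://www.wikipedia.org/search-redirect.php?search=dogs", "https://www.wikipedia.org/"),
]

PORTAL_HOSTS = {"wikipedia.org", "www.wikipedia.org"}


def is_portal_host(url):
    """Return True if the URL's host is the portal host (with or without www)."""
    return urlsplit(url).hostname in PORTAL_HOSTS


def self_referred(requests, exclude_search_redirect):
    """Count portal-host requests whose referrer is also the portal host."""
    count = 0
    for url, referrer in requests:
        if not is_portal_host(url):
            continue
        if exclude_search_redirect and urlsplit(url).path == "/search-redirect.php":
            continue
        if referrer and is_portal_host(referrer):
            count += 1
    return count


# Counting by host only: search submissions look like self-referred portal visits.
print(self_referred(REQUESTS, exclude_search_redirect=False))
# Excluding the search end-point removes them.
print(self_referred(REQUESTS, exclude_search_redirect=True))
```

With this made-up data, the self-referred requests disappear once search-redirect.php is excluded, which is the pattern the hypothesis predicts.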

I'm thinking that this now belongs on the Discovery-Analysis board instead of Discovery-Portal-Sprint. Also, I merged in https://phabricator.wikimedia.org/T130388, which was essentially a duplicate of this task.

mpopov set the point value for this task to 3.

Okay, it looks like the vast majority of the traffic to the Wikipedia Portal (from the Wikipedia Portal) is specifically to search-redirect.php. @debt I suggest creating a ticket to adjust the Portal dashboard data collection scripts to filter out requests to search-redirect.php.

date        request              referrer             requests  proportion
2016-06-20  other                other                  197775       0.216
2016-06-20  other                search-redirect.php       224       0.000
2016-06-20  other                Wikipedia Portal       138185       0.151
2016-06-20  search-redirect.php  other                   10457       0.011
2016-06-20  search-redirect.php  search-redirect.php       738       0.001
2016-06-20  search-redirect.php  Wikipedia Portal       569928       0.621
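
As a quick check of the "vast majority" reading, the counts in the table above can be recombined in a few lines of Python. This is only a sketch: the `COUNTS` dictionary is transcribed by hand from the table, and the variable names are made up for illustration.

```python
# Request counts for 2016-06-20, transcribed from the table above.
# Keys are (request, referrer); values are the number of requests.
COUNTS = {
    ("other", "other"): 197775,
    ("other", "search-redirect.php"): 224,
    ("other", "Wikipedia Portal"): 138185,
    ("search-redirect.php", "other"): 10457,
    ("search-redirect.php", "search-redirect.php"): 738,
    ("search-redirect.php", "Wikipedia Portal"): 569928,
}

total = sum(COUNTS.values())

# All requests whose referrer was classified as the Wikipedia Portal.
from_portal = sum(n for (_, referrer), n in COUNTS.items()
                  if referrer == "Wikipedia Portal")

# Of those, the requests that went to the search end-point.
portal_to_search = COUNTS[("search-redirect.php", "Wikipedia Portal")]

print(f"total requests: {total}")
print(f"share of portal-referred requests going to search-redirect.php: "
      f"{portal_to_search / from_portal:.3f}")
```

By these counts, roughly 0.805 of the portal-referred requests went to search-redirect.php, consistent with the observation above.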

Here's the Hive query I used (for future reference):

ADD JAR /home/bearloga/Code/analytics-refinery-jars/refinery-hive.jar;
CREATE TEMPORARY FUNCTION classify_referrer AS 'org.wikimedia.analytics.refinery.hive.SmartReferrerClassifierUDF';
USE wmf;
SELECT request, referrer, COUNT(1) AS requests
FROM (
  SELECT
  CASE WHEN referer RLIKE('^(https?://www\.)?wikipedia\.org/+search-redirect\.php\??.*') THEN 'search-redirect.php'
       WHEN referer RLIKE('^(https?://(www\.)?)?wikipedia\.org.*$') THEN 'Wikipedia Portal'
       ELSE 'other' END AS referrer,
  CASE WHEN uri_path = '/search-redirect.php' THEN 'search-redirect.php'
       ELSE 'other' END AS request
  FROM webrequest
  WHERE year = 2016 AND month = 06 AND day = 20  
    AND webrequest_source = 'text'
    AND content_type RLIKE('^text/html')
    AND uri_host RLIKE('^(www\.)?wikipedia.org/*$')
    AND classify_referrer(referer) IN ('internal', 'external', 'unknown')
    AND NOT (referer RLIKE('^http://localhost'))
) AS refined_webrequests
GROUP BY request, referrer;

debt moved this task from Done to Resolved on the Discovery-Analysis (Current work) board.