Page MenuHomePhabricator

Collect data to inform the Discovery Department how people are using the portal at www.wikipedia.org
Closed, ResolvedPublic

Description

Use the schema implemented in https://phabricator.wikimedia.org/T98569

Code review

The patch for this task is up for review on GitHub. It can be tested (make sure to start up a dev EL server on localhost:8080) via http://earldouglas.github.io/www.wikipedia.org/

Some questions to consider:

  • Does anything look obviously bad/broken/dangerous?
  • Does the sampling rate seem reasonable for www.wikipedia.org? Note that EventLogging is currently about at about 300% 100% capacity.
    • We'll be going with a 1% sampling rate, which should put us at ~3 events per second
  • Can anyone think of a better way to version control this? e.g. move it from a Meta template into a git repo?
    • This seems to be the established convention
  • Does it support enough browsers (tested so far in FF, Chrome, IE6, IE7, and IE8)

Deployment

  • Edit the temp version of the template on meta
  • Request on the Talk page that the changes be imported
  • Find an admin on meta to import the changes
  • Clear cache, smoke test

TODOs from feedback on email thread:

  • I noticed that the schema version in the code is different from *Schema:WikipediaPortal*
    • Solution: the schema version is up to date
  • document.getElementsByClassName won't work on IE8.
    • Solution: try both the above and document.body.getElementsByTagName
  • a.onclick (form.onsubmit) ... it overrides any other onclick (onsubmit) event handlers
    • Solution: no other handlers exist to be overridden
  • And as I understand, you’re using navigator.sendBeacon, which is not supported in IE
    • Solution: no use of the Beacon API
  • get some numbers on how much/little load times would increase
  • extend this analytics work to other portals such as Wiktionary
    • Solution: nope

Event Timeline

Manybubbles assigned this task to Jdouglas.
Manybubbles raised the priority of this task from to Medium.
Manybubbles updated the task description. (Show Details)

Some things to figure out:

  • How to load the requisite JavaScript
    • i.e. mw.eventLog is undefined on www.wikipedia.org
  • How to deploy the WikipediaPortal schema
    • i.e. do not want "[WikipediaPortal] Missing or empty schema"
    • This guide seems to be missing the actual deployment steps

To enable the schema, add the following to LocalSettings.php:

$wgEventLoggingSchemas[ 'WikipediaPortal' ] = 12144429;

This appears to do the trick. Not sure how to code review it, since the Wikipedia portal is a template, but here it is for funsies. I think JSON.parse and JSON.stringify will need to be replaced with old-browser-compatible replacements.

'use strict';

(function () {

  /////////////////////////////////////////////////////
  // The following should probably be library functions

  var stringify = function (x) {
    return JSON.stringify(x);
  };

  var parse = function (x) {
    return JSON.parse(x);
  };

  var withCountryCode = function (k) {
    var match = /GeoIP=(\w\w)/.exec(document.cookie);
    if (match) {
      k(match[1]);
    } else {
      throw new Error('No country code available');
    }
  };

  var isSampled = Math.random() < 1 / 1000;

  var foreach = function (selector, f) {
    var xs = document.querySelectorAll(selector);
    for (var i = 0; i < xs.length; i++) {
      f(xs[i]);
    }
  };

  //
  /////////////////////////////////////////////////////

  var logEvent = function (section_used, country, destination) {

    var capsule = {
      schema: 'WikipediaPortal',
      revision: 12144429,
      webHost: window.location.hostname,
      wiki: 'www',
      event: {
        section_used: section_used,
        country: country,
      },
    };

    if (document.referrer) {
      capsule.event.referer = document.referrer; // sic
    }
    if (destination) {
      capsule.event.destination = destination;
    }

    console.log('Logging event: ' + stringify(capsule.event));

    // var beaconHost = 'bits.wikimedia.org';
    var beaconHost = 'localhost:8080';
    var beacon = document.createElement('img');
    beacon.src = '//' + beaconHost + '/event.gif?' + encodeURIComponent(stringify(capsule));

  };

  if (isSampled) {
    withCountryCode(function (country) {

      var attachA = function (section_used) {
        return function (a) {
          a.onclick = function () {
            logEvent(section_used, country, a.href);
          };
        };
      };

      var attachForm = function (section_used) {
        return function (form) {
          form.onsubmit = function () {
            logEvent(section_used, country, form.action);
            return true;
          };
        };
      };

      foreach('.link-box', attachA('primary links'));
      foreach('.search-form', attachForm('search'));
      foreach('form[name=searchwiki]', attachForm('language search')); 
      foreach('.langlist:not(.langlist-other) a', attachA('secondary links'));
      foreach('.langlist-other a', attachA('other languages'));
      foreach('.otherprojects a', attachA('other projects'));

    });
  }

})();
var isSampled = Math.random() < 1 / 1000;

This will give you a non-uniform distribution. Do var isSampled = Math.floor( Math.random() * 1000 ) === 0; instead.

var foreach = function (selector, f) {
  var xs = document.querySelectorAll(selector);
  for (var i = 0; i < xs.length; i++) {
    f(xs[i]);
  }
};

document.querySelectorAll is broken in IE8 and unimplemented in IE<8.

// var beaconHost = 'bits.wikimedia.org';
var beaconHost = 'localhost:8080';
var beacon = document.createElement('img');
beacon.src = '//' + beaconHost + '/event.gif?' + encodeURIComponent(stringify(capsule));

The example I showed you had not been updated to use the Beacon API, which we now use on browsers that implement it. This code snippet should do the trick:

var sendBeacon = navigator.sendBeacon
    ? function ( url ) { try { navigator.sendBeacon( url ); } catch ( e ) {} }
    : function ( url ) { document.createElement( 'img' ).src = url; };

(Actually, the Image.src = approach won't work at all; see below.)

var attachForm = function (section_used) {
  return function (form) {
    form.onsubmit = function () {
      logEvent(section_used, country, form.action);
      return true;
    };
  };
};

This won't work. Navigating away from the page aborts any pending network requests. In practice, it means that the image request is never made. In the past, we have worked around this by returning false to prevent the browser from taking the default action of submitting the form (or following the link, in the case of click-thru tracking), submitting the event, and then manually re-triggering the action. But this technique delays navigation in the best case and it breaks it in the worse case, so we no longer condone it. Using the Beacon API exclusively will work around this issue, but be aware that you won't have data from browsers that do not implement it.

Updated per @ori's comments.

  • This uses the Beacon API, so it won't work for browsers that don't support it
  • It also still uses JSON.parse and JSON.stringify, so we'll probably want to swap those for JQuery or something else
  • It also also assumes some new class names that we'll need to update in the site template
'use strict';

(function () {

  /////////////////////////////////////////////////////
  // The following should probably be library functions

  var stringify = function (x) {
    return JSON.stringify(x);
  };

  var parse = function (x) {
    return JSON.parse(x);
  };

  var withCountryCode = function (k) {
    var match = /GeoIP=(\w\w)/.exec(document.cookie);
    if (match) {
      k(match[1]);
    } else {
      throw new Error('No country code available');
    }
  };

  var isSampled = window._DEV || Math.floor(Math.random() * 1000) === 0;

  var foreach = function (className, f) {
    var xs = document.getElementsByClassName(className);
    for (var i = 0; i < xs.length; i++) {
      f(xs[i]);
    }
  };

  //
  /////////////////////////////////////////////////////

  var logEvent = function (section_used, country, destination) {
    try {
      var capsule = {
        schema: 'WikipediaPortal',
        revision: 12144429,
        webHost: window.location.hostname,
        wiki: 'www',
        event: {
          section_used: section_used,
          country: country,
        },
      };

      if (document.referrer) {
        capsule.event.referer = document.referrer; // sic
      }
      if (destination) {
        capsule.event.destination = destination;
      }

      console.log('Logging event: ', stringify(capsule.event));

      var beaconHost = window._DEV ? 'localhost:8080' : 'bits.wikimedia.org';
      var beaconUrl = '//' + beaconHost + '/event.gif?' + encodeURIComponent(stringify(capsule));
      navigator.sendBeacon(beaconUrl);
    } catch (e) {
      console.error('Error logging event:', e);
    }
  };

  if (isSampled) {
    withCountryCode(function (country) {

      var attachA = function (section_used) {
        return function (a) {
          a.onclick = function () {
            logEvent(section_used, country, a.href);
          };
        };
      };

      var attachForm = function (section_used) {
        return function (form) {
          form.onsubmit = function () {
            logEvent(section_used, country, form.action);
            return true;
          };
        };
      };

      foreach('primary-link', attachA('primary links'));
      foreach('search-form', attachForm('search'));
      foreach('language-search-form', attachForm('language search')); 
      foreach('secondary-link', attachA('secondary links'));
      foreach('other-language', attachA('other languages'));
      foreach('other-project', attachA('other projects'));

    });
  }

})();

Perhaps we should make country code optional. This would simplify the code to eliminate the need for the withCountryCode function.

@Ironholds:

what would that mean for the data?

Are you referring to making country code optional? It would mean that we make our best effort to get the country code (i.e. from the cookie), but if it's not present, we still log the event with what we *do* know about it.

I'm not sure if countryless events are valuable, but it seems like it doesn't hurt to log them.

To be determined: the rate at which we want to log events from the Wikipedia Portal -- it might be an enormous amount of traffic.

I can give you a count of how many text/html hits it gets, if that'd be useful.

Ok, I think we're pretty close now. No Beacon API required, yet it should still wait for event.gif to be loaded before proceeding to a new page.

See the changes here: https://github.com/earldouglas/www.wikipedia.org/compare/e91ad41...HEAD

See a demo (which uses "development mode", looking for the EventLogging server on your localhost:8080, e.g. via vagrant) here: http://earldouglas.github.io/www.wikipedia.org/

Houston, we have IE support. That's a "go" for both IE6 and IE7.

See the changes here: https://github.com/earldouglas/www.wikipedia.org/pull/1/files

Check it out (expects the EL server running on localhost:8080, e.g. via vagrant): http://earldouglas.github.io/www.wikipedia.org/

Jdouglas set Security to None.
Jdouglas updated the task description. (Show Details)

I think FF, Chrome, IE6, IE7, and IE8 is good enough for now. Let's go with that and worry about the extras later.

Based on feedback, we need to iron out some acceptance criteria before we can proceed:

  • Where should the JavaScript be maintained?
  • Is it reasonable to maintain both the TypeScript source and the JavaScript source? It's certainly beneficial for maintenance.
  • What are the performance requirements for the expected degredation caused by this change?
  • Should we ignore browsers that don't support the Beacon API?
  • What are the requirements for the quantity of content that should be loaded (i.e. extra JS libraries)?

@Deskana: I think it is on your plate to help James resolve the non-technical questions (performance, old browsers, payload size).

@Tfinc @Manybubbles: Can one of you help with the two technical questions (where to maintain the code, and whether to use TypeScript)?

@Jdouglas: Please be sure to mention this ticket in the Thursday standup.

Time scheduled with @Jdouglas on Tuesday 23rd June to answer product questions.

Here are my thoughts/recommendations on the above, to be discussed next week:

Where should the JavaScript be maintained?

  • Advantages of storing in in Meta
    • It's version controlled, it dogfoods MW
  • Disadvantage of storing it in Meta
    • It's unclear how it gets served
    • It's unclear where to put other files/scripts/stylesheets
  • Recommendations
    • If we can support additional files leave it on Meta, document deployment process
    • If we can't support additional files, move it to git

Is it reasonable to maintain both the TypeScript source and the JavaScript source? It's certainly beneficial for maintenance.

  • It's a small effort to learn type annotations
  • The maintainability benefit is worth the cost of spending the effort
    • JS devs shouldn't have any trouble reading the source
    • It makes the source much easier to maintain, because it communicates more information
    • It adds little overhead do deployment - just a tsc command to generate mostly similar deployable JS

What are the performance requirements for the expected degredation caused by this change?

  • This one is far outside my expertise
  • The current page is already very hefty, and visibly locks up my browser for a second or two
    • I recommend separating this into a new task to investigate ways to optimize it

Should we ignore browsers that don't support the Beacon API?

  • If we have data that shows this to be an insignificant portion of our traffic, yes, else no

What are the requirements for the quantity of content that should be loaded (i.e. extra JS libraries)?

  • See performance improvement investigation task mentioned above
Deskana renamed this task from Implement EventLogging schema for initial data collection around the Wikipedia portal to Collect data to inform the Discovery Department how people are using the portal at www.wikipedia.org.Jun 24 2015, 8:59 PM

I tweaked the title of the task somewhat to more accurately represent what I want from it.

With the revised context of this task, I suspect we don't need EventLogging instrumentation at all, assuming we can get access to the HTTP request logs from Varnish (or wherever) HDFS.

Jdouglas raised the priority of this task from Medium to High.Jul 7 2015, 5:24 PM
Jdouglas lowered the priority of this task from High to Medium.

Booting this back to the backlog as it's dead in the water, and doesn't help us achieve our Q1 goal.

The source code for this has been forked into the Wikimedia organization on GitHub:

https://github.com/wikimedia/www.wikipedia.org

Why is this only targeting the Wikipedia portal?

Because we have other schemas for other projects and the portal is a
particularly unstudied part of the infrastructure with its own very
specific makeup.

Deskana claimed this task.

Plenty of data is being collected and dashboarded on http://discovery.wmflabs.org/portal/, so I think we can consider this resolved.