
[L] Instrument MediaSearch results page
Open, Needs Triage, Public

Description

Once the main Vue MediaSearch patch (T251940) and the QuickView patch (T256158) have been merged, we need to instrument the results page so that we can measure users' behaviour.

Things we want to measure:

  • when a user clicks to go to QuickView
  • when a user clicks through from QuickView to a detailed view of an image (audio and video will be handled in T263154)
  • search session length
  • number of searches in a search session
  • total number of search sessions

We may need to create our own schema for what we're measuring - consult with @nettrom_WMF about this
Update: Here is the schema: Media Search measurement specification

Also will probably need to consult with the Analytics Engineering team

Note: SearchSatisfaction is currently being measured, which afaik is a combination of click-through rate from a search and dwell time. Whoever does this will need to figure out whether we automatically get this for MediaSearch and, if we do, see whether we can remove dwell time for Commons because it's probably not relevant for images. If we don't get it automatically, perhaps we can still adapt the existing SearchSatisfaction code for our purposes.

Event Timeline

Cparle created this task. Jul 16 2020, 2:53 PM
Restricted Application added a subscriber: Aklapper. Jul 16 2020, 2:53 PM
Cparle renamed this task from Instrument click events for QuickView and clickthrough from QuickView on MediaSearch results page to Instrument MediaSearch results page. Jul 16 2020, 3:37 PM
Cparle added a project: Analytics.
Cparle updated the task description.
Cparle updated the task description. Jul 16 2020, 3:41 PM
LGoto moved this task from Triage to Tracking on the Product-Analytics board. Jul 20 2020, 4:09 PM
Milimetric moved this task from Incoming to Event Platform on the Analytics board. Jul 20 2020, 4:45 PM
Milimetric added a subscriber: Milimetric.

Please do create new schemas on the modern event platform (https://wikitech.wikimedia.org/wiki/Event_Platform/Instrumentation_How_To, schemas will show up at https://schema.wikimedia.org/#!/secondary/jsonschema)

And let us know if you have any trouble

CBogen added a subscriber: CBogen. Jul 21 2020, 5:56 PM

Adding a note here that we also want to instrument and measure concept chips once those are ready (T256431). We need some clarity on exactly what we're measuring here - would love some advice from @nettrom_WMF .

CBogen renamed this task from Instrument MediaSearch results page to [L] Instrument MediaSearch results page. Aug 26 2020, 4:36 PM
CBogen updated the task description.
egardner claimed this task. Mon, Sep 21, 7:05 PM

I'm about to start working on adding the instrumentation for this, but before I do I have some high-level questions about what's involved. I've never worked with the new event platform and have only limited experience with EventLogging; I apologize if any of these questions are really obvious.

  1. If I understand things correctly, before we can add code to log events we need to define and publish a schema that is specific to MediaSearch (the SearchSatisfaction schema can be used to measure the old search for comparison). I took a look at the Measurement Specification document mentioned above, but I see that it still has "draft" in the title. Does this need to be finalized? In particular it looks like the list of actions which we’re going to be tracking will need to be explicitly defined.
  2. When it comes to translating the above specifications into a working schema file, I’m a little vague on what the process actually entails. Is this a matter of cloning the appropriate Git repo (this one, maybe?), then using the jsonschema-tools CLI to generate the appropriate files, and then pushing those up to the remote? Do I need to request any special rights or credentials in order to do this? Is there a review process for adding a new schema?
  3. Once the new MediaSearch schema is published and available at schema.wikimedia.org, it looks like some new configuration needs to be added to the InitialiseSettings.php file in the mediawiki-config repo. It looks like we’ll need to add a few lines to the wgEventStreams configuration variable that point to this newly-published schema. Is this correct?
  4. After we have created, published, and configured this new schema, *then* we can start to actually instrument our code to log events. Should this be done primarily client-side (using calls to mw.eventLog.logEvent), or is it preferable to log events in PHP whenever possible?
  5. I’m hoping I’ll be able to test things locally to ensure that everything works properly (and without actually sending any data to the production event log). I assume/hope there is a way to set up EventLogging locally to point to a local schema definition. How would I confirm that the appropriate data is actually being recorded? Some of the documentation seems to be Vagrant-specific, but I'm using the MediaWiki-Docker environment for local dev.
Nuria added a subscriber: Nuria. Mon, Sep 21, 11:45 PM

@egardner probably a quick meeting with @nettrom_WMF or @jlinehan would be of help

Thanks @Nuria for the suggestion; @nettrom_WMF and I just had a productive chat about schemas and measurement strategy.

Here are some thoughts about next steps regarding analytics instrumentation for MediaSearch based on that discussion:

  1. Since there is already an extensive schema for SearchSatisfaction in the new analytics platform, it seems like it would be a good idea to rely on this as much as possible for now rather than creating a new schema for MediaSearch from the get-go. I'd say we should start by adding instrumentation to capture the events that are already defined in the SearchSatisfaction schema. We can target a different stream when we log things from MediaSearch so that we can keep this data separate from what is recorded from regular search pages. This should cover the basic actions the user will do on the page: entering a term, selecting an item from auto-complete, clicking or hovering over a result item, scrolling down the page, etc.
  2. For the relatively small number of MediaSearch-specific actions the user can take (interactions with the Quickview element, the tabs for different file types, concept chips, etc.), we could use the extraParameters property in the SearchSatisfaction schema for now, since it allows us to pass in an open-ended string value as a payload. This would give us a little more flexibility as we figure out just what we need to measure, continue changing the UI around, etc.
  3. Eventually we could either 1) update the SearchSatisfaction schema to include a few additional properties for MediaSearch-specific features (assuming the Search team is okay with that), or 2) create a very minimal schema that inherits from SearchSatisfaction and adds our custom properties on top of it, once we are comfortable with formalizing what we need to capture.

Assuming this sounds okay, I think the path to implementation will become much simpler because we can skip the schema-creation step for now and rely on what already exists. In the interest of time and simplicity I think this would be a good approach, at least for the initial instrumentation.

Sounds good to me @egardner

@dcausse and @EBernhardson, just an FYI about @egardner's comment above.

Hi @egardner!
You've got everything right! I'll respond to a few specific questions here and we can talk more in person tomorrow.

"Should this be done primarily client-side (using calls to mw.eventLog.logEvent), or is it preferable to log events in PHP whenever possible?"

In the new system, the client-side JS call in EventLogging is mw.eventLog.submit. logEvent is for legacy events only.
I think the new EventLogging PHP submit function is not yet merged, but it looks ready to go to me: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventLogging/+/623459 Ping @Mholloway, can this be merged and deployed?

Once that gets deployed, you should be able to log to EventGate in the same way using either JS or PHP, whichever makes the most sense for your use case.
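For illustration, a minimal sketch of what a client-side call through mw.eventLog.submit might look like. The helper names buildEvent and logMediaSearchEvent are hypothetical, and the stream name and $schema URI are taken from the sample events posted later in this task; the submit function is passed in as an argument only so the sketch can run outside MediaWiki.

```javascript
// Stream name and schema URI as they appear in the sample events in this task.
const STREAM = 'analytics.media_search';
const SCHEMA = '/analytics/media_search/1.0.0';

// Hypothetical helper: every event carries $schema and an action name.
function buildEvent( action, data ) {
	return Object.assign( { $schema: SCHEMA, action: action }, data );
}

// Hypothetical helper: `submit` is optional. Inside MediaWiki it would be
// omitted and the real mw.eventLog.submit( streamName, eventData ) used.
function logMediaSearchEvent( action, data, submit ) {
	const event = buildEvent( action, data || {} );
	let send = submit;
	if ( !send && typeof mw !== 'undefined' && mw.eventLog ) {
		send = mw.eventLog.submit.bind( mw.eventLog );
	}
	if ( send ) {
		send( STREAM, event );
	}
	return event;
}
```

Outside MediaWiki (for example in a unit test) a stub can be passed as the third argument to capture what would have been sent.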

"How would I confirm that the appropriate data is actually being recorded? Some of the documentation seems to be Vagrant-specific, but I'm using the MediaWiki-Docker environment for local dev."

I've never used MW-Docker, and IIUC there are multiple of these environments? One from RelEng and one from ...someone else? Anyway, I've got a patch for EventLogging to include an eventgate-devserver that is just waiting for someone to review. @mpopov was able to get it to work, but had some issues with NodeJS on his host machine. I'd be happy to help you get it working. If we can make sure it works then I think we can merge that patch.

"Eventually we could either 1) update the SearchSatisfaction schema to include a few additional properties"
"or 2) create a very minimal schema that inherits from SearchSatisfaction and adds our custom properties on top of it"

I think either of these should be fine. I don't think we have an example of including a non 'fragment' schema in a concrete schema, but there isn't really a difference except for intention.

I wonder though, SearchSatisfaction is a 'legacy' schema, in that it was imported from metawiki and has a bunch of deprecated fields. It isn't going away any time soon (or ever?), so you can certainly re-use it, but if you will only be using a few of its fields, perhaps it would be worthwhile creating a new non-legacy schema for your use case?

"we could use the extraParameters property in the SearchSatisfaction schema for now, since it allows us to pass in an open-ended string value as a payload."

An open-ended string means that you will have to use string-parsing Hive functions to do analysis. Event Platform does support map types, which are almost as good as schema-less, except that every value in the map must be the same type.
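To make the difference concrete, here is a rough sketch of the two payload shapes on the client side. The function names are hypothetical, and coercing every value to a string is only one way to satisfy the "same type" constraint, not something the schema prescribes.

```javascript
// Option 1: open-ended string payload, as extraParameters would carry it.
// Analysis then has to parse this string back apart.
function toExtraParameters( params ) {
	return JSON.stringify( params );
}

// Option 2: map-style payload. Every value is coerced to a string so that
// all values in the map share one type.
function toStringMap( params ) {
	const map = {};
	Object.keys( params ).forEach( function ( key ) {
		map[ key ] = String( params[ key ] );
	} );
	return map;
}
```

With the map form the keys stay addressable as fields, while numbers and booleans arrive as strings and have to be cast during analysis.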

This comment was removed by egardner.

Thanks @Ottomata for your help in getting things working today. For the sake of posterity I'm going to include some notes here about wiring up a local (MW-Docker-based) dev environment with the eventgate-wikimedia dev server. This setup will be useful for confirming that our MediaSearch analytics instrumentation is working as expected before merging any patches.

  1. The EventLogging and EventStreamConfig extensions need to be downloaded and enabled locally.
  2. In the EventLogging extension, check out this patch, and then run npm install in the extension directory.
    • Mac OS users may have issues installing node-rdkafka – if so, follow the OS-specific instructions for how to install this library
  3. Once everything has been installed successfully, you can run npm run eventgate-devserver in a stand-alone terminal process (you can see the output as events get logged, which is useful); output is also written to an events.json file in the EventLogging directory.
  4. In order to properly send events from MediaWiki to this dev server, you'll need the following config in LocalSettings:
wfLoadExtension( 'EventStreamConfig' );
wfLoadExtension( 'EventLogging' );
# EventStreamConfig.  All streams need an entry here.
# This is not an EventLogging config, but a more generic
# stream configuration for all things that use streams.
$wgEventStreams = [
	[ "stream" => "eventlogging_Test" ],
];
# This is the legacy eventlogging-devserver URI
$wgEventLoggingBaseUri = 'http://localhost:8100/event.gif';
# This is the new eventgate-devserver URI
$wgEventLoggingServiceUri = 'http://localhost:8192/v1/events';
# Always enable EventLogging debugMode in MW Vagrant.
$wgUserDefaultOptions['eventlogging-display-web'] = 1;
# By default EventLogging waits 30 seconds before sending
# batches of queued events.  That's annoying in a dev env.
$wgEventLoggingQueueLingerSeconds = 1;
# For 'legacy' schemas that have been migrated to Event Platform
# (like Test and SearchSatisfaction), EventLogging needs to know
# that when mw.eventLog.logEvent is called, it should end up
# calling mw.eventLog.submit instead.  By setting the
# Schema here to the $schema URI, EventLogging will know what to do.
# This config is ONLY needed for 'legacy' schemas.  New
# Event Platform events should call mw.eventLog.submit directly.
$wgEventLoggingSchemas = [
	"Test" => "/analytics/legacy/test/1.1.0",
];
# Register streams by name from wgEventStreams
# to be usable by EventLogging.
$wgEventLoggingStreamNames = [
	"eventlogging_Test",
];
  5. To actually send a test event from the browser, navigate to a wiki page and run the following code from the console: mw.eventLog.logEvent("Test", {"OtherMessage": window.location.href}). Event data matching what was submitted should appear in the dev server console in real time.

Also, since SearchSatisfaction is still a legacy schema, it may be worth writing a new MediaSearch schema after all: this is a simple example that could serve as a starting point.

Hey @Ottomata (and @egardner!), I'm still catching up here but I think https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventLogging/+/623459 is good to merge if it looks good to you. The follow-up patch adding sampling config support probably needs some discussion (because I guess there's no shared concept of a "session" between MediaWiki JS and PHP?).

Mholloway added a comment. Edited Thu, Sep 24, 9:01 PM

Oh, I think @jlinehan wanted to give it another look as well.

Ottomata added a comment. Edited Thu, Sep 24, 9:21 PM

So, I'm disappointed that we need EventStreamConfig set up for dev envs. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventLogging/+/629826 should fix that, but it does change the behavior of how the client works, so we'll def need some good review of that to get it merged.

Hopefully the additions of config descriptions in extension.json and the devserver/README.md file there help explain what is going on.
Ping @mpopov @jlinehan

Nuria added a subscriber: mforns. Fri, Sep 25, 3:40 PM

cc @mforns, who will be working on the dev environment for MEP next quarter

Change 630681 had a related patch set uploaded (by Eric Gardner; owner: Eric Gardner):
[mediawiki/extensions/WikibaseMediaInfo@master] [WIP] Instrument MediaSearch analytics using modern Event Platform

https://gerrit.wikimedia.org/r/630681

I have an in-progress patch that adds some basic analytics to MediaSearch, using the draft schema from the linked sub-task.

Here's an example of the kind of data that is captured on each step of a typical search using what I have currently. This data is all from my local dev environment where I only have a few files, hence the low numbers of results being returned.

  1. User performs a new search on the Special:MediaSearch page using the JavaScript UI; they start out in the "images" tab by default. This is logged as a search_new action; other data includes the search string used, the current media type, and the total number of results for that media type that have been loaded thus far.
{"action":"search_new","query":"Portland","media_type":"bitmap","total_result_count":1,"$schema":"/analytics/media_search/1.0.0","session_id":"7ac5e846f28dec6ba943","meta":{"stream":"analytics.media_search","dt":"2020-09-29T18:15:11.094Z"},"client_dt":"2020-09-29T18:15:10.082Z"}
  2. User changes to a different tab, which causes an API call to be made which attempts to load results for the same term with a media type corresponding to the new tab. This causes two events to be logged: first a tab_change action (includes the new media_type the user has just switched to as well as the original search term); next a search_load_more action is logged to represent the new API request. The MediaSearch UI treats tab changes as a "continuation" of an existing search (even though a new search API request is being dispatched every time), so it seemed logical to log this as a search_load_more action instead of a search_new action. Load more actions would also be fired as the user scrolls down the page to see more results within a tab (in that case there would be no tab_change events in between).
{"action":"tab_change","query":"Portland","media_type":"other","$schema":"/analytics/media_search/1.0.0","session_id":"7ac5e846f28dec6ba943","meta":{"stream":"analytics.media_search","dt":"2020-09-29T18:15:22.863Z"},"client_dt":"2020-09-29T18:15:21.850Z"}
{"action":"search_load_more","query":"Portland","media_type":"other","total_result_count":1,"$schema":"/analytics/media_search/1.0.0","session_id":"7ac5e846f28dec6ba943","meta":{"stream":"analytics.media_search","dt":"2020-09-29T18:15:22.864Z"},"client_dt":"2020-09-29T18:15:22.585Z"}
  3. User goes back to the initial "images" tab that they started on. There are no more results to load here, so only a single tab_change action is recorded.
{"action":"tab_change","query":"Portland","media_type":"bitmap","$schema":"/analytics/media_search/1.0.0","session_id":"7ac5e846f28dec6ba943","meta":{"stream":"analytics.media_search","dt":"2020-09-29T18:15:28.700Z"},"client_dt":"2020-09-29T18:15:27.688Z"}
  4. The user clicks a result in the "images" tab. This logs a result_click action. I decided not to store the search term here, but I did include the skin (this changes the behavior of the QuickView element, and seemed relevant; we could record skin for every event if it made sense to do so). position is also worth mentioning – this is just the index of the clicked result in the total. Different types of results are displayed in different ways, so trying to preserve exact grid coordinates doesn't seem very meaningful (image aspect ratio will factor in, as will screen size, etc.; there are no fixed columns for the image grid, but there are for the video grid). The page_id of the result is also stored. Finally, some results will trigger the QuickView element (images, audio, video), while others will not (pages and "other"). I figure we probably want the same event logged whenever a result is clicked regardless of result type, so I have included a has_quickview property which has a value of true or false.
{"action":"result_click","media_type":"bitmap","skin":"vector","page_id":4,"position":0,"has_quickview":true,"$schema":"/analytics/media_search/1.0.0","session_id":"7ac5e846f28dec6ba943","meta":{"stream":"analytics.media_search","dt":"2020-09-29T18:15:34.983Z"},"client_dt":"2020-09-29T18:15:33.939Z"}
  5. The user interacts with the QuickView element they just expanded, clicking the "more details" button to open the resulting File page in a new tab. As with result clicks, the quickview_more_details_click action logs the page id. We could also log the media type here if desired. If the user had instead closed the QuickView panel by clicking the "X" button, a quickview_hide event would have been logged. Eventually we'll log other QuickView interactions here as well.
{"action":"quickview_more_details_click","page_id":4,"$schema":"/analytics/media_search/1.0.0","session_id":"7ac5e846f28dec6ba943","meta":{"stream":"analytics.media_search","dt":"2020-09-29T18:15:42.358Z"},"client_dt":"2020-09-29T18:15:42.042Z"}
  6. The user clears their search term (thus also clearing all results in all tabs) by clicking the "x" inside the search input at the top of the page. This logs a search_clear event. Not sure if any other information should be included here beyond the standard metadata.
{"action":"search_clear","$schema":"/analytics/media_search/1.0.0","session_id":"7ac5e846f28dec6ba943","meta":{"stream":"analytics.media_search","dt":"2020-09-29T18:15:48.990Z"},"client_dt":"2020-09-29T18:15:47.981Z"}
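As a rough sketch of the tab-change behaviour in steps 2 and 3 above: a tab switch always yields a tab_change event, and a search_load_more event follows only when a new API request is made. The function name and the shape of the request argument are hypothetical; the field names mirror the sample events.

```javascript
// Hypothetical helper: returns the events a tab switch would produce.
// `request` is null when the target tab has nothing more to load, otherwise
// an object with the total number of results loaded so far for that tab.
function eventsForTabChange( query, mediaType, request ) {
	const events = [
		{ action: 'tab_change', query: query, media_type: mediaType }
	];
	if ( request ) {
		// Tab changes are treated as a continuation of the existing search,
		// so the follow-up request is a search_load_more, not a search_new.
		events.push( {
			action: 'search_load_more',
			query: query,
			media_type: mediaType,
			total_result_count: request.totalResultCount
		} );
	}
	return events;
}
```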

At this point the cycle can begin again. The session ID will persist until the user re-loads the page, so it should be clear that additional search_new actions are being done in the same sitting.
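Since the session ID ties these events together, the session-level measures from the task description (session length, searches per session, total sessions) could be derived from the logged events roughly as follows. This is only a sketch: summarizeSessions is a hypothetical name, it treats one session_id as one search session, and it measures length as the gap between the first and last meta.dt, which may not match the definition in the measurement specification.

```javascript
// Hypothetical analysis helper: one summary row per session_id.
function summarizeSessions( events ) {
	const sessions = {};
	events.forEach( function ( event ) {
		const id = event.session_id;
		const time = Date.parse( event.meta.dt );
		if ( !sessions[ id ] ) {
			sessions[ id ] = { searches: 0, firstMs: time, lastMs: time };
		}
		const session = sessions[ id ];
		session.firstMs = Math.min( session.firstMs, time );
		session.lastMs = Math.max( session.lastMs, time );
		// Only search_new counts as a search; search_load_more is a continuation.
		if ( event.action === 'search_new' ) {
			session.searches += 1;
		}
	} );
	return Object.keys( sessions ).map( function ( id ) {
		const session = sessions[ id ];
		return {
			sessionId: id,
			searches: session.searches,
			lengthMs: session.lastMs - session.firstMs
		};
	} );
}
```

The total number of sessions is then just the length of the returned array.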


Does the above look like generally the sort of data we want to capture? Anything important left out (or that should not be included based on what is here)? The current implementation only tracks searches for users with JS enabled – is that ok, or do we need to log initial search actions on the server? The current implementation will log server-rendered search results (say if the user navigates directly to Special:MediaSearch?q=searchterm), but logging won't start until JS initializes.

@egardner I just remembered that @mwilliams created T263172 - does this plan cover the ability to answer the questions he posed there? Should I close that ticket if it's covered by this one?

@CBogen I think my latest updates to this patch and the schema should capture enough data to answer those questions. Every time the user changes a filter an event will be logged with that filter's new value; it should be possible to get a sense of all filters used in a given session by grabbing all filter_change events and looking at filter types, values, whether or not things were unset after being set, etc.
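A sketch of that kind of per-session analysis, under loud assumptions: the field names filter_type and filter_value are hypothetical (the actual filter_change payload is not shown in this task), and an empty or missing value is assumed to mean the filter was unset.

```javascript
// Hypothetical analysis helper: replays filter_change events for one session.
// Returns the filters still active at the end, plus the filter types that
// were unset after having been set.
function summarizeFilters( events, sessionId ) {
	const active = {};
	const everSet = {};
	const unsetAfterSet = [];
	events.filter( function ( event ) {
		return event.session_id === sessionId && event.action === 'filter_change';
	} ).forEach( function ( event ) {
		const type = event.filter_type;
		const value = event.filter_value;
		// Assumption: empty, null or missing value means "filter cleared".
		const isUnset = value === undefined || value === null || value === '';
		if ( isUnset ) {
			delete active[ type ];
			if ( everSet[ type ] && unsetAfterSet.indexOf( type ) === -1 ) {
				unsetAfterSet.push( type );
			}
		} else {
			active[ type ] = value;
			everSet[ type ] = true;
		}
	} );
	return { active: active, unsetAfterSet: unsetAfterSet };
}
```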