
Implement an EventLogging schema for session tracking
Closed, Resolved · Public

Description

As discussed, we should be looking at what users do after they click through from search. Do they go to one page and then leave? Go to multiple pages in a chain? What?

I'll write out a schema and then Engineering can take a look at implementing it :). The schema should also allow us to gather numbers on queries or search actions per session.

Event Timeline

Ironholds claimed this task.
Ironholds raised the priority of this task from to Needs Triage.
Ironholds updated the task description.
Ironholds added a project: Discovery-ARCHIVED.
Ironholds added subscribers: Ironholds, Manybubbles.

@Ironholds will rescope this task per the discussion in the data planning meeting on 2nd June.

Ironholds set Security to None.

@EBernhardson Can you provide a bit of additional information to this task that documents your current progress on it?

Rescoping == bernie is writing the schema! Thanks bernie!

The code and the schema are written and in code review (not sure who is going to review yet; will talk about that at standup today). After standup I'll also document it here.

Actually, the code I wrote so far only goes one page deep; I'll update it to track deeper.

Change 217421 had a related patch set uploaded (by EBernhardson):
Measure bounce rate and dwell time for search results

https://gerrit.wikimedia.org/r/217421

The developed schema is at https://meta.wikimedia.org/wiki/Schema:TestSearchSatisfaction
I've tried to describe things fairly well there, and commented the related patch heavily as well. I suppose one more round of documentation here can't hurt as well :)

First a few caveats:

  • Only collects data from users whose browsers support the sendBeacon functionality. That means we will not be collecting data from older browsers (more than about a year old), nor any data from IE or Safari.
  • Only collects data from logged-in users. This is due to the way Varnish caching works at the Foundation for anonymous users. I talked to Krinkle and he suggested a method that removes this delay in the future (for all anonymous data collection, not just ours), but this first time around we have to wait the 30 days for the Varnish caches to be updated. https://gerrit.wikimedia.org/r/#/c/217534/
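The two caveats above amount to a collection gate. A minimal sketch of that gate, assuming a navigator-like object and a user object with an `isAnon` flag (the function name and shapes here are illustrative, not from the actual patch):

```javascript
// Hypothetical sketch: only collect when the browser supports
// sendBeacon (older browsers, IE, and Safari do not) and, in this
// first iteration, only for logged-in users because Varnish caches
// anonymous page views.
function canCollect(nav, user) {
  var hasBeacon = !!nav && typeof nav.sendBeacon === 'function';
  return hasBeacon && !user.isAnon;
}
```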

What it collects:

When a user completes a search and lands on the search engine results page (SERP), we decide whether they will be part of the test group; 1 in 1000 users are selected. If a user is not selected, they stay unselected for the next twenty minutes. Once a user's search session times out, they will likewise not be selected again for another twenty minutes. The search session itself lasts ten minutes, and every time the user returns to a SERP (either the same search via the back button or a new search) that ten minutes is reset.
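The sampling and timeout rules above can be sketched as a small state machine. This is a hypothetical illustration of the rules as described, not the actual patch; the function and state shape are assumptions:

```javascript
// Illustrative constants matching the description above.
var SAMPLE_RATE = 1000;            // 1 in 1000 users selected
var REJECT_MS   = 20 * 60 * 1000;  // unselected users stay out 20 minutes
var SESSION_MS  = 10 * 60 * 1000;  // a search session lasts 10 minutes

// Decide the user's state on reaching a SERP. `state` is null or
// { selected: bool, until: timestamp }; `now` is a ms timestamp and
// `rand` a number in [0, 1) (passed in so the logic is testable).
function onSearchResultPage(state, now, rand) {
  if (state && now < state.until) {
    // Selected and still in session: every SERP visit resets the
    // ten-minute window. Unselected: stay out until the window ends.
    return state.selected
      ? { selected: true, until: now + SESSION_MS }
      : state;
  }
  if (state && state.selected) {
    // Session just timed out: not selectable again for 20 minutes.
    return { selected: false, until: now + REJECT_MS };
  }
  // No valid state: sample the user.
  return Math.floor(rand * SAMPLE_RATE) === 0
    ? { selected: true, until: now + SESSION_MS }
    : { selected: false, until: now + REJECT_MS };
}
```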

There are three events we fire: searchEngineResultPage, visitPage, and leavePage. Whenever the user reaches a SERP, the searchEngineResultPage event is fired. Clicking any result link to an article page (loosely defined as any link that starts with /wiki/) triggers the visitPage event. This is triggered upon reaching the destination page, so it works for normal clicks, middle clicks into a new tab, etc. Leaving that page triggers a leavePage event. Clicking a link within the content portion of the page repeats the visitPage/leavePage pair on the following page. These events continue firing until the user leaves the site, follows a link outside the content area of the page (sidebar, search, etc.), or their ten-minute search session times out (though a new search while the session is still active resets that ten minutes).
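The "loosely defined" article-link test and the content-area restriction can be sketched as one predicate. The function name and the boolean content-area flag are assumptions for illustration; the real patch presumably checks the DOM:

```javascript
// Hypothetical helper: does a clicked link continue the tracked chain?
// An article link is loosely defined as any href starting with /wiki/,
// and only links inside the content area keep the chain alive.
function continuesChain(href, insideContentArea) {
  return insideContentArea === true &&
         typeof href === 'string' &&
         href.indexOf('/wiki/') === 0;
}
```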

The schema includes a 'depth' parameter which indicates how many clicks away from the SERP the user is: the SERP itself is 0, pages linked directly from the SERP are 1, and so on. All events also include a pageId, generated per visited page, which allows correlating individual visitPage/leavePage events with each other. On top of that, a logId parameter is generated to allow deduplication of events that sendBeacon sends multiple times (as the SendBeaconReliability test indicated it might).
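To make the three identifiers concrete, here is a hypothetical sketch of how a visitPage event could carry depth, pageId, and logId. The id generator and event shape are assumptions, not the schema's actual field list:

```javascript
// Illustrative random hex id; the real patch generates its own ids.
function randomId() {
  return Math.floor(Math.random() * 0x100000000).toString(16);
}

// Build a visitPage event one click deeper than the referring page.
// pageId correlates the visitPage/leavePage pair for the same page;
// logId is unique per send, so duplicate sendBeacon deliveries can be
// deduplicated downstream.
function makeVisitEvent(parentDepth) {
  return {
    action: 'visitPage',
    depth: parentDepth + 1, // SERP is depth 0, its results are depth 1
    pageId: randomId(),
    logId: randomId()
  };
}
```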

Change 217421 merged by jenkins-bot:
Measure bounce rate and dwell time for search results

https://gerrit.wikimedia.org/r/217421

So, when did this go live? Because I'm seeing 1022 events in the database. Feeeels like we've messed up somewhere.

This is currently only tracking logged-in users, and we are taking a 1-in-1000 sample of users (not queries). I'm expecting that as soon as the patch to collect from anons lands (https://gerrit.wikimedia.org/r/#/c/224440/), that number will increase to something a little more expected.

Gotcha! Okay; this was marked as "done" so I assumed everything was deployed. Is it not standard to declare "done" on deployment?

Well, by "done" I'm guessing it's basically the title: "implement an event logging schema". That is done, but there were implementation details that meant a particular patch had to be running in prod for 30 days (which has passed as of last Thursday, I think) before we could turn it on for all users.