
Create/revive Search Platform team metrics dashboard
Closed, Declined · Public

Description

Create or revive a dashboard for the Search Platform team (formerly Discovery) that includes the below metrics, and the question we are trying to answer with them:

  1. Be able to filter by the following dimensions: platform (Desktop, Mobile Web, Android, iOS), bot vs. non-bot, and language
    • What do included metrics look like when we segment users/queries?
  2. Search Engagement (= number and percentage of queries with a dwell time > 10 seconds; see the sketch below). [Note: Search Engagement was formerly known as Search Satisfaction/User Engagement.]
    • Are users finding full text search results relevant/useful?
  3. Number and percentage of full-text, "Go" box, morelike, and autocomplete searches
    • What type of searches are being used?
  4. Average click position of the clicked-on search result (num_click_sessions) & of dwelt-on click-throughs (num_checkin_sessions)
    • How far down the list of results was the result that the searcher clicked/dwelt on? How good is our relevancy ranking?
  5. Number and percentage of abandoned sessions.
    • How many users are so unsatisfied with the search results that they leave?
  6. Number of WDQS timeouts [out of scope for prod analytics team]
    • How often is Wikidata Query Service failing to return results for a user's query?
  7. Number and percentage of queries with "did you mean" suggestions.
    • How well are we accommodating imprecise search queries?
  8. Number and percentage of "did you mean" suggestions clicked on.
    • How relevant are our results for imprecise search queries?
  9. Number of requests to WDQS and Linked Data Fragments, dumps, mediawiki APIs [out of scope for prod analytics team]
    • What services are people using to get data from Wikidata?
  10. Number and percentage of zero results
    • What happened (recently) that may have drastically affected how many queries are returning zero results?
  11. Top queries and top keywords
    • Are there any common query patterns worth doing anything about?
  12. Top returned documents (articles) and top clicked through documents
    • Are there patterns in specific search results that are worth doing anything about?

We have excluded any metrics that would require building something new, e.g. a "smiley face" search satisfaction survey.
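To make item 2 concrete, here is a minimal sketch in Python/pandas of how the engagement metric could be computed and segmented by the dimensions from item 1. The per-session table and its column names (platform, wiki, language, agent_type, dwell_time_sec) are assumptions for illustration, not an existing dataset.

```python
# Minimal sketch only: the per-session table and its column names are hypothetical.
import pandas as pd

sessions = pd.DataFrame({
    "platform":       ["desktop", "desktop", "mobile web", "desktop"],
    "wiki":           ["enwiki", "dewiki", "enwiki", "enwiki"],
    "language":       ["en", "de", "en", "en"],
    "agent_type":     ["user", "user", "user", "bot"],
    "dwell_time_sec": [0, 35, 12, 90],
})

def search_engagement(df, **filters):
    """Number and percentage of sessions with dwell time > 10 seconds,
    after applying optional dimension filters (platform, wiki, language, agent_type)."""
    for column, value in filters.items():
        df = df[df[column] == value]
    engaged = int((df["dwell_time_sec"] > 10).sum())
    total = len(df)
    return engaged, (engaged / total if total else float("nan"))

# Example: engagement for desktop, non-bot sessions.
print(search_engagement(sessions, platform="desktop", agent_type="user"))
# (1, 0.5)
```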

Meta-requests
Search Platform currently lacks analyst support in two major ways:

  1. Technical. Legacy bespoke dashboards were built in Shiny and R, and the team currently lacks the resources and technical expertise to maintain this code ourselves.
  2. Data expertise. The team can do rudimentary data analysis, but lacks the expertise to properly validate statistical significance, confirm whether we are capturing the right signals to test our hypotheses, etc.

Even if a dashboard were built for us, it would risk growing stale and obsolete without active maintenance and tuning, which we are unable to do on our own. To ensure long-term value and avoid the wasted effort of constantly (re)building dashboards that grow stale, it would be ideal to have access to resources that help us maintain our metrics in the long term.


Include relevant timelines/deadlines, OKRs
We would like to have this in time for the next quarterly planning cycle.

How will you use this data product?
To understand what Search performance/features/functionality looks like in production, and discover and prioritize future Search Platform team work.

Is this request urgent or time sensitive?
no

Details

Other Assignee
nettrom_WMF

Event Timeline

Hi @MPhamWMF! My team will triage and prioritize this on Tuesday, April 6th. To help with that, can you please provide some additional details to the prompts I added in the description? Thanks!

Thanks @mpopov! I filled out the additional requested details. Please let me know if you need more description!

kzimmerman added a subscriber: kzimmerman.

Most of this will need to wait until we can hire an analyst to support Search, but we'll look into what we can provide in the near term

MPhamWMF updated the task description. (Show Details)

Oops, didn't mean to remove you @kzimmerman -- I had an edit sitting around that preceded your triaging of the ticket.
Thanks for the update regarding the timing and priority of this.

@MPhamWMF - @JKatzWMF and I spoke about your request, and Carol flagged that having basic search metrics is a high priority.

Can you pare this down to your highest priority needs (or questions you're currently addressing), so we can work on those in the near term?

I still think it's critical that we get budget next year for analytics support for Search.

Thanks, Kate. I'll reorder the requests in this ticket by (descending) order of priority by end of day Mon April 19.

@kzimmerman, items are now ordered in descending order of priority for us, so you should be able to draw the line wherever you need to based on what you're able to handle.
I know some of these items will also probably need more discussion as we dig in, so just keep me posted when that is necessary. I also know some of these items may already exist somewhere, so best case scenario, some of them might be freebies.

kzimmerman added a subscriber: nettrom_WMF.

Assigning to @nettrom_WMF; @mpopov will support

@MPhamWMF : I've been working on putting together a notebook to aggregate user engagement metrics (full text query sessions with a dwell time of > 10 seconds), so we have a working example of using data from SearchSatisfaction. From digging into data from that schema, it appears to only be instrumented on desktop. Could you check in with the engineers on the team about whether that's the case? I want to make sure I'm not looking for non-existent data.

I'll be picking this up again next week and aim to have a prototype dashboard based on that metric as well as one metric using CirrusSearch as the dataset (e.g. number of searches) ready as soon as possible.
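For context, this is roughly the shape of the aggregation involved, sketched in Python/pandas; the event and field names (session_id, action, checkin) are illustrative stand-ins modelled loosely on SearchSatisfaction events, not the actual notebook code.

```python
# Simplified sketch, not the actual notebook: event and field names are illustrative.
import pandas as pd

events = pd.DataFrame({
    "session_id": ["s1", "s1", "s1", "s2", "s2", "s3"],
    "action":     ["searchResultPage", "click", "checkin",
                   "searchResultPage", "checkin", "searchResultPage"],
    # Seconds dwelt on the page when the ping fired (NaN for non-checkin events).
    "checkin":    [float("nan"), float("nan"), 30.0, float("nan"), 10.0, float("nan")],
})

# Dwell time per session = largest check-in ping observed (0 if no ping fired).
dwell = (events.loc[events["action"] == "checkin"]
               .groupby("session_id")["checkin"].max()
               .rename("dwell_time_sec"))

sessions = (events.groupby("session_id").size().rename("num_events").to_frame()
                  .join(dwell).fillna({"dwell_time_sec": 0}))

engaged = int((sessions["dwell_time_sec"] > 10).sum())
print(f"{engaged} of {len(sessions)} sessions engaged")  # 1 of 3 sessions engaged
```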

> From digging into data from that schema, it appears to only be instrumented on desktop. Could you check in with the engineers on the team about whether that's the case?

This is the expected case; search is only instrumented on the desktop skin.

I've added @cchen as the second assignee to this task. She'll be picking this up as we transition my work away from Search and Structured Data. So far, that means she'll start working on item number 3.

I've completed data gathering for item number 2 and have drafted a dashboard in Superset. I'll be back tomorrow with a link, as the dashboard needs a little more documentation on it to make it clear how it works and why it works in particular ways.

@MPhamWMF : We now have a test dashboard for user satisfaction (item number 2); it's available here: https://superset.wikimedia.org/r/628

The dashboard is based on data from June 1 to July 12. It's limited to desktop sessions per the previous comments on this task.

I've got it set up to show number of sessions, sessions w/a click on a result, and sessions w/a checkin on a daily basis. That chart can be filtered by platform, wiki, language code of the wiki (e.g. all English-language wikis), agent type (user, automated, bot), and time range.

There's also a chart showing the proportion of sessions with a dwell time of > 10s. In this test setup we can only calculate proportions on a per-wiki basis, so that chart works best when looking at one or a handful of specific wikis. This chart can also be limited by time range.

Lastly, I added a chart showing the decay in counts of sessions based on a checkin time greater than or equal to a specific value. This is also limited to filtering by specific wikis.

I wanted to get this to you so you can start exploring it, to see whether it helps answer questions and spurs ideas. The filtering limitation is something that can be fixed by importing the data into Druid; for example, the Session length dashboard has proportions, percentiles, and bucketing because of that.
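For reference, a rough sketch of the kind of daily per-wiki aggregate the proportion chart relies on; the column names and counts here are made up, not the actual dataset behind the Superset charts.

```python
# Rough sketch: columns and counts are made up, not the real dataset behind the charts.
import pandas as pd

daily = pd.DataFrame({
    "date":             ["2021-07-01"] * 3 + ["2021-07-02"] * 3,
    "wiki":             ["enwiki", "dewiki", "frwiki"] * 2,
    "num_sessions":     [1000, 300, 200, 1100, 320, 190],
    "num_dwell_gt_10s": [420, 150, 80, 470, 160, 70],
})

# Proportion of sessions with a dwell time > 10s, per wiki and day.
daily["engagement"] = daily["num_dwell_gt_10s"] / daily["num_sessions"]

# As noted above, the proportion is only meaningful per wiki in this setup,
# so the chart works best when filtered to one or a few wikis.
print(daily.loc[daily["wiki"] == "enwiki", ["date", "engagement"]])
```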

@nettrom_WMF , thanks! I'm currently getting a Presto error that says access denied and that I should contact chart owners for assistance. Have tried reloading a few times but didn't seem to help. Any advice?

> I'm currently getting a Presto error that says access denied and that I should contact chart owners for assistance. Any advice?

That's probably because I hadn't updated the permissions to the underlying data, sorry! Could you try again?

If it still doesn't work, could you let me know what the specific error message you get is? You should have access to the data based on the permissions discussed in the comments in T270438, and hopefully it was just me missing the permissions that was needed.

Thanks, the charts load now! I'll take a look and get back to you in the next week with feedback. thanks!

@nettrom_WMF , I met with @cchen today to look over the dashboard. I'm sure y'all will talk in more detail later. It looks mostly pretty good so far! A quick overview of things we went over:

  1. What is the definition of checkin time/sessions? (It was hard to evaluate the third graph without a better understanding of what this is.)
  2. I'd like to eventually have some help setting good/reasonable baselines based on the data.
  3. Labeled axes would be helpful.
  4. Add a search dwell time graph showing raw numbers in addition to the existing percentage of sessions.
  5. "Search satisfaction" is a bit of a loaded term since it apparently has different meanings. I've renamed the working metric based on our existing definition (dwell time > 10s) to "Search Engagement", and have been trying to update any relevant mentions in the documentation -- it would be nice for the dashboards to stay consistent with this terminology as well.

Might have been a few things I missed, but I think Connie took better notes than I did. Search's main KR is tied to improving Search Engagement in a TBD number of official languages of emerging communities. It looks like the second graph will be able to show us this information. We have not yet determined whether this will be done at the project level or at the query level -- I recognize the latter is much more difficult in terms of knowing which language a query is supposed to be in, but it might be more relevant for projects like Commons, which are coded as en despite serving many different languages.

@MPhamWMF : I've updated the dashboard based on the feedback from you and @cchen. Here are the changes that have been made:

  1. I've added a description of what "search sessions" means, as well as describing visiting a page and dwelling on a page. When updating the dashboard I noticed that I used both "dwell time" and "checkin" as phrases for the same thing, and thought that having both would be very confusing. The dashboard should now be using "dwell time" consistently ("checkin" is the name of the event in the SearchSatisfaction schema, and doesn't need to be used outside of code).
  2. I've labelled all axes in all charts.
  3. I've added a chart showing the number of sessions together with the percentage of sessions with dwell time > 10 seconds.
  4. I went through to check if "search engagement" was used consistently and couldn't find any places where it wasn't. Let me know if you come across an inconsistency as I'd like the phrasing to be correct.
  5. I grabbed more data so it's up to date as of yesterday, because that was trivial.

A couple of additional things I noticed when working on this:

  1. In the task description, "Search Engagement" is defined as a dwell time of over 10 seconds. The first ping event on a page happens at 10 seconds. I've taken this to mean that at least two pings are needed, meaning 20 seconds, for a session to count towards search engagement (see the sketch after this list). I'm wondering if I've misinterpreted that and we should change it? The way the data is stored makes changing that easy, so it's not a significant change.
  2. When it comes to baselines, we'd probably want to wait until activity picks up again as summer vacation time in Europe and the US comes to a close. Search engagement for large wikis appears to be fairly stable across time, but has increased during the period of lower activity. I'm expecting it to revert back down as activity picks up. It's also clearly dependent on the wiki. I'm sure @cchen and I can help figure out some good ways to measure this.
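As an aside on the first point above, a tiny sketch (Python, made-up values) of how the two readings differ, given that the first ping fires at 10 seconds:

```python
# Illustrative only: how the two readings of "dwell time > 10 seconds" differ,
# given that the first check-in ping fires at 10 seconds.
last_checkin = [0, 10, 20, 50]  # largest check-in ping per session, in seconds (made up)

engaged_one_ping  = sum(1 for c in last_checkin if c >= 10)  # first ping is enough
engaged_two_pings = sum(1 for c in last_checkin if c >= 20)  # at least two pings needed

print(engaged_one_ping, engaged_two_pings)
# 3 2 -- sessions whose only ping was the 10s one are counted differently
```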

Thanks, @nettrom_WMF! I'll take a look and get back to you soon

This is looking great! Thanks for making the changes listed.

> I went through to check if "search engagement" was used consistently and couldn't find any places where it wasn't. Let me know if you come across an inconsistency as I'd like the phrasing to be correct.

I don't think it should be used inconsistently anywhere, since I started using it as a new term to differentiate it from "search satisfaction" and "user engagement".

> In the task description, "Search Engagement" is defined as a dwell time of over 10 seconds. The first ping event on a page happens at 10 seconds. I've taken this to mean that at least two pings are needed, meaning 20 seconds, for a session to count towards search engagement. I'm wondering if I've misinterpreted that and we should change it?

As per the IRC chat with Erik, I think it makes sense to use >=10s

> When it comes to baselines, we'd probably want to wait until activity picks up again as summer vacation time in Europe and the US comes to a close. Search engagement for large wikis appears to be fairly stable across time, but has increased during the period of lower activity. I'm expecting it to revert back down as activity picks up. It's also clearly dependent on the wiki. I'm sure @cchen and I can help figure out some good ways to measure this.

This sounds good to me. Let me know when is a good time to start talking about this.

Question:

  1. For the last chart with number of sessions with dwell time >= k, if I'm reading the description correctly, these buckets are not exclusive -- i.e. the bar/number in the 0s bucket includes all the other buckets, and so on moving rightwards. So if I wanted only the number of sessions with <10s dwell time, I'd subtract the 10s bucket from the 0s bucket.
nettrom_WMF updated Other Assignee, added: nettrom_WMF; removed: cchen.

> As per the IRC chat with Erik, I think it makes sense to use >=10s

I've updated the dashboard to use the new definition. This means that num_checkin_sessions in the "Full text search sessions" chart should match the number of sessions in the "Dwell time num. sessions by wiki" chart for a given date, provided agent_type in the filters is set to "user". I confirmed that this is the case for English Wikipedia, and I expect it to hold for other wikis as well.

> Question:
>
>   1. For the last chart with number of sessions with dwell time >= k, if I'm reading the description correctly, these buckets are not exclusive -- i.e. the bar/number in the 0s bucket includes all the other buckets, and so on moving rightwards. So if I wanted only the number of sessions with <10s dwell time, I'd subtract the 10s bucket from the 0s bucket.

This is correct. I did it this way because the difference between buckets changes as the dwell time increases: it's 10s up to 60s, then 30s to 240s, and 60s until the limit of 420s. When I experimented with the visualization, this would lead to varying numbers of sessions in those buckets and it was quite confusing. Hence I chose to make it look like a Kaplan-Meier plot, even though we're working with counts and not proportions (the latter will be easier when the data is moved to Druid and pre-processed).
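For reference, a small sketch (Python, made-up counts) of how the cumulative "at least k seconds" buckets in the decay chart relate to exclusive per-interval counts:

```python
# Illustrative only: made-up counts showing how the cumulative buckets behave.
bucket_edges = [0, 10, 20, 30, 40, 50, 60, 90, 120, 150, 180, 210, 240, 300, 360, 420]

# Sessions whose dwell time reached at least each value (non-exclusive, as in the chart).
at_least = {0: 1000, 10: 640, 20: 510, 30: 430, 40: 380, 50: 350, 60: 330,
            90: 280, 120: 250, 150: 230, 180: 215, 210: 205, 240: 200,
            300: 185, 360: 175, 420: 170}

# Sessions with a dwell time under 10s: subtract the 10s bucket from the 0s bucket.
print(at_least[0] - at_least[10])  # 360

# Exclusive per-interval counts, if ever needed, are successive differences
# (the 420s bucket stays open-ended).
exclusive = {low: at_least[low] - at_least.get(high, 0)
             for low, high in zip(bucket_edges, bucket_edges[1:] + [None])}
```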

I'm reassigning this task to @cchen and moving myself to the secondary assignee slot. I've set up automatic data gathering for this dashboard using a daily cron job on my account on stat1006. Once Airflow is available, we can move the daily aggregation to that platform and set up Druid ingestion as well. Currently our options would be to set up a shared cron job and use Oozie for the transfer to Druid, which would create some tech debt we'd have to pay back later.

Oh, I forgot! I've pulled in the remaining data for August so the dashboard is now up-to-date as of today, and the cron job will keep it up to date.

Gehel added a subscriber: Gehel.

No progress has been made; when we have a more direct need for metrics/dashboards, we can re-open this task or re-describe the needs based on our future understanding.