Include fulltext search results Page Previews of sufficient dwell time in Search Metrics dashboard
Closed, Resolved | Public | 5 Estimated Story Points

Description

When a user visits fulltext search results and is using a mouse, the user can hover over a link and get a Page Preview.

As a viewer of the Search Metrics dashboard, when I view the dashboard, in addition to knowing when fulltext search sessions resulted in a satisfied session by virtue of a conventional pageview, I would like to ensure that I'm also aware of satisfied searches when a fulltext search session would have been considered as a virtual pageview.

Acceptance Criteria:

  • Search metrics collection is arranged / verified for being able to stitch virtual pageviews into the Search Metrics, Web dashboard pipeline
  • Search Metrics, Web dashboard is updated in some fashion to visualize the percentage of virtual pageviews
  • Verbiage in the Search Metrics, Web dashboard is adjusted to provide context about virtual pageviews and how they fit into the scheme of what constitutes satisfied versus unsatisfied fulltext searches. As "content interactions" are classified as meaningful in both a pageview and Page Preview context, it follows that it's okay to consider a "content interaction" a satisfied search. Verbiage may need to be removed, added, or altered, or all of these, in order to make the narrative text clear.

Event Timeline

Gehel set the point value for this task to 5. (Sep 23 2024, 3:29 PM)

Change #1076827 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/Popups@master] Record canonical special page names in virtual pageviews

https://gerrit.wikimedia.org/r/1076827

AFAICT virtual page views are only registered by the Popups extension. Reviewing the data surfaces one difficulty: everything is logged against the translated name of the special page. The patch above changes this so that special pages log against the canonical name, which should make our job much easier. For the initial evaluation I'm looking at popups on pages where the source URL has something that looks like a search=<query> URI parameter and that were submitted to ns -1 (special pages).
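A minimal sketch of the filter described above, assuming each virtual pageview record is a dict. The field names `source_namespace` and `source_url` are hypothetical placeholders for illustration and may not match the real table columns.

```python
from urllib.parse import urlsplit, parse_qs

SPECIAL_NS = -1  # MediaWiki namespace id for special pages


def is_search_result_preview(record: dict) -> bool:
    """Return True when a preview was shown on a special page whose
    source URL carries a non-empty `search` query parameter.

    `source_namespace` and `source_url` are assumed field names.
    """
    if record.get("source_namespace") != SPECIAL_NS:
        return False
    query = parse_qs(urlsplit(record.get("source_url", "")).query)
    # parse_qs drops blank values by default, so `search=` does not count.
    return bool(query.get("search"))
```

Filtering on the namespace first keeps the check independent of the (translated or canonical) special page title.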

There is a different but similar feature provided by SearchVue. It is deployed to even fewer wikis (pt, ru, id, ca, no, hu, nl and uk, Wikipedias only). It provides a search preview that takes screen space from the side of the page, rather than using a popup that sits over the page. Likely not worth considering, but I collected some information about its usage to put it into context. This simply counts the number of open events, without consideration of dwell times (separately, I'm not sure what constitutes a virtualpageview and am instead trusting the data to be accurate).

source              est daily pv
virtualpageviews    90k-130k / day
searchvue           less than 9-12k / day

In terms of where to put the data, it looks like we have most of what we would need to join an aggregation of wmf.virtualpageview_hourly with our regular search_satisfaction_metrics and store it in the same table. Perhaps it would be misleading to inject it into that table though, since it wouldn't follow the same rules? In particular, when calculating search satisfaction metrics we groupBy the pageViewId, which is only available from our schema. We then group those per-pageview rows by the searchSessionId to get our per-session metrics; again, this is only available from our schema.
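The two-level grouping described above can be sketched roughly as follows. This is an illustration in plain Python, not the actual discolytics code; the event shape and the action names other than `virtualPageView` are assumptions.

```python
from collections import defaultdict


def per_pageview(events):
    """Collapse raw events into one row per pageViewId."""
    rows = {}
    for event in events:
        row = rows.setdefault(
            event["pageViewId"],
            {"searchSessionId": event["searchSessionId"], "actions": set()},
        )
        row["actions"].add(event["action"])
    return rows


def per_session(pageview_rows):
    """Group the per-pageview rows by searchSessionId."""
    sessions = defaultdict(set)
    for row in pageview_rows.values():
        sessions[row["searchSessionId"]] |= row["actions"]
    return dict(sessions)
```

Because both keys come only from the search satisfaction schema, rows from another source cannot pass through these steps without carrying the same identifiers.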

So we can possibly estimate the number of search result pages that registered a virtualpageview, but I'm not 100% certain yet on how that will integrate into the rest of the numbers or the dashboard.

After pondering this a bit and looking over how we would join the two datasets, I wondered if there isn't an easier way. We could subscribe in the browser to the events that virtual page view is emitting, and then re-emit them to the search satisfaction schema. This would unify them into the existing data collection and avoid having to do awkward joins in the backend data analysis. I've tested this out locally and it looks like it should do what we need.

To this end a few patches will be provided:

  • To schemas-event-secondary, to add the new action type to the allowable list of actions
  • To WikimediaEvents, to start producing the new events.
  • To discolytics, to consider the virtual page view action as a potential source of satisfaction
  • To airflow-dags, to add the new columns to the hive table and update the released discolytics artifact

Change #1077110 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikimediaEvents@master] search satisfaction: track virtual page views

https://gerrit.wikimedia.org/r/1077110

Change #1076827 merged by jenkins-bot:

[mediawiki/extensions/Popups@master] Record canonical special page names in virtual pageviews

https://gerrit.wikimedia.org/r/1076827

Change #1077731 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[schemas/event/secondary@master] SearchSatisfaction: Add virtualPageView action

https://gerrit.wikimedia.org/r/1077731

Change #1077731 merged by jenkins-bot:

[schemas/event/secondary@master] SearchSatisfaction: Add virtualPageView action

https://gerrit.wikimedia.org/r/1077731

Jdlrobson subscribed.

Hey @dr0ptp4kt how does this impact the web team and what assistance do you need from us here (if any)? Some guidelines for what would be useful to us are documented here: https://www.mediawiki.org/wiki/Readers/Web/Working_with_us

@Jdlrobson just ACK'ing. I'm back from some travel and hope to circle back to this within the next business week.

Okay, I'm caught up on the patches. I see from the activity in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Popups/+/1076827 that it looked like it was okay for the change to be done from search engineering.

That said, @Jdlrobson mind taking a quick look at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Popups/+/1076827/1/src/changeListeners/pageviews.js to confirm that there aren't downstream hardcoded expectations regarding the source_title field?

The change looks to be an improvement for specificity of what special page was actually used without having to cross-reference translated special page names (which are themselves probably slightly more prone to drift, plus just a bear to join on). Looking at https://codesearch.wmcloud.org/things/?q=source_title and https://codesearch.wmcloud.org/analytics/?q=source_title&files=&excludeFiles=&repos= (I didn't check elsewhere; Codesearch seems to be reindexing something in some other repos) and doing a little bit of searching around Superset for "virtual" and "interaction", plus searching https://datahub.wikimedia.org/ for similar sorts of things, I can't quickly identify a downstream expectation about that particular field that might be broken (e.g., if there's special casing in an SQL query or other data pipeline), but I don't know for certain. Nevertheless, if you happen to know of anything off the top of your head, that would be worth knowing in case we need to revert / amend.

I think Erik already looked around based on the interactions, I just figure if you have a couple minutes to jog your memory it would be welcome!

Thanks for sharing the Readers/Web/Working_with_us link as a reminder - always good to coordinate up front and apologies that I didn't get you added to this ticket at the outset. If y'all already got a chance to connect on a side thread about the patch, great; if not, it's something we can plan for in the future, scheduling as appropriate for the maintaining team (in this case, Web).

Change #1077110 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] search satisfaction: track virtual page views

https://gerrit.wikimedia.org/r/1077110

Mentioned in SAL (#wikimedia-operations) [2024-10-28T22:09:54Z] <ebernhardson@deploy2002> Started deploy [airflow-dags/search@99eb6f3]: T375387: update discolytics to 0.27.0

Mentioned in SAL (#wikimedia-operations) [2024-10-28T22:10:44Z] <ebernhardson@deploy2002> Finished deploy [airflow-dags/search@99eb6f3]: T375387: update discolytics to 0.27.0 (duration: 00m 50s)

All the background bits have now been deployed and the hive table is now being augmented with virtual page view metrics. I've re-run the aggregations since last Wednesday, which should capture most of the time the events might have been shipping. Still need to update the dashboards to incorporate this data.

Updated https://superset.wikimedia.org/superset/dashboard/530 with minimal info in the Fulltext Abandonment chart. If I understood the metric properly, only ~1% of sessions get a virtual PV without a click. I don't know if this is enough to consider it a meaningful interaction, nor if these could be considered successful sessions.
It is not entirely obvious how to integrate this data into the current charts of the dashboard; it might make sense to have a "Fulltext Engagement" section where we could graph the various ways users interact with the SERP.
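For reference, the "virtual PV without a click" share can be computed along these lines. This is a sketch under assumptions: sessions are represented as a mapping from session id to the set of observed actions, and the `click` action name is illustrative rather than taken from the dashboard query.

```python
def virtual_pv_only_share(sessions: dict) -> float:
    """Fraction of sessions that saw a virtual pageview but no click."""
    if not sessions:
        return 0.0
    hits = sum(
        1
        for actions in sessions.values()
        if "virtualPageView" in actions and "click" not in actions
    )
    return hits / len(sessions)
```

Sessions with both a click and a virtual pageview are excluded on purpose, since they already count as satisfied by the click.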

Gehel subscribed.

The virtual PV rate is low enough to not be significant in the context of Search Abandonment. If we start tracking engagement in a more detailed way, then it would make sense to include virtual PV in that context. We've done enough work on this to be comfortable closing it.