
[Epic] Search metrics 2024
Open, Needs Triage, Public. 13 Estimated Story Points

Description

To better understand the impact of Search on our users, we want to have a few high level metrics that give us visibility on how Search is used.

See https://docs.google.com/document/d/1p8CTKnYnuGKxeNcE0rNXAOAXiu5xdagaxVFVwdJXs9s/edit#heading=h.o9e996ds23up for more details.

This is a tracking task, specific metrics will be tracked in sub-tasks.

Scope

  • General purpose metrics
  • Audiences
    • People external to the Search Platform team who want some understanding of how well Search is performing and what impact it has. This includes upper management, product management, and to a lesser extent our communities
    • Search Engineers who want to have a high level understanding of Search

Dimensions

For all metrics proposed below, we should be able to slice the data by multiple dimensions (a rough aggregation sketch follows this list):

  • By time (obviously); in particular, we want monthly numbers (X million searches per month is easier to relate to the overall impact) as well as more granular numbers (days or minutes)
  • By wiki
  • By editor status (for logged-in users, we should be able to find the number of edits, even if we can't associate a specific search with an edit or read workflow)
  • By bot / non-bot
  • By media (desktop web, mobile web, Android app, iOS app, portal, direct API calls) and type of search (full-text search, autocomplete, morelike): this might get a bit tricky, since there isn't complete alignment. For example, mobile apps use autocomplete search to present a search result page. We might need to refine what exactly we mean here, or which dimensions make the most sense.
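As a rough illustration of what slicing by these dimensions could look like, here is a minimal sketch assuming a Spark-accessible events table; the table and column names below are placeholders, not the actual CirrusSearch or webrequest schemas.

```
# Minimal sketch of slicing search volume by a few of the dimensions above.
# "event.placeholder_search_events" and the column names are placeholders;
# they would need to be mapped to the real CirrusSearch / webrequest schemas.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

searches = spark.table("event.placeholder_search_events")

monthly = (
    searches
    .withColumn("month", F.date_format("dt", "yyyy-MM"))  # "dt" assumed to be the event timestamp
    .groupBy("month", "wiki", "is_bot", "search_type")    # placeholder dimension columns
    .agg(F.count("*").alias("search_requests"))
)
monthly.show()
```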

Proposed implementation

To limit dependencies on other teams, we propose a first implementation based on Jupyter notebooks, which allows fast iteration cycles with a technical stack that is well understood by the Search Platform team.

Should we find those metrics worth exposing more widely, we might need support from Product Analytics to put Superset dashboards, or other more user-friendly ways of accessing them, in place.

Event Timeline

Gehel renamed this task from Search metrics 2024 to [Epic] Search metrics 2024. Feb 23 2024, 2:48 PM
Gehel added a project: Epic.
Gehel updated the task description. (Show Details)

In terms of the requested dimensions:

  • Time: easy (obviously)
  • Wiki: typically easy
  • Editor status: harder. For the existing CirrusSearch logging we intentionally don't know who the user is. We would need to augment the events with a bucketed edit count (to preserve some anonymity); see the sketch after this list. For the metrics that would be calculated from web requests I'm unsure; I will need to review what other analysts have done to see whether this is something we have available.
  • By bot / not bot: it depends on how we want to define bot. For metrics based on webrequest we should probably use its definition. For metrics based on the search logs we will either need to roll our own (based on prior experience and implementations), or review what is being done for webrequest and try to align with it as much as possible so we have comparable numbers.
  • By media: both webrequest and the cirrus logs include the user agent map, which attempts to bucket the agent into various groups, such as the device family, which distinguishes between phones, laptops, etc. Optimistically we can reuse that dimension.
  • By type of search: historically this has been hard when starting from logs. We know that the requests are coming in, but we generally don't know which user interface the request came from. We can use heuristics that detect the shape of our programmatic requests, but that will be fragile and we will almost certainly forget to update it when the implementation changes.
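To make the editor-status idea above concrete, here is a minimal sketch of the bucketing step, assuming the exact edit count is available when the event is built; the bucket boundaries are illustrative, not an agreed scheme.

```
# Collapse an exact edit count into a coarse bucket before it is attached to
# an event, so the event stream never carries the precise number.
# Bucket boundaries are illustrative only.
def edit_count_bucket(edit_count: int) -> str:
    if edit_count <= 0:
        return "0 edits"
    if edit_count < 10:
        return "1-9 edits"
    if edit_count < 100:
        return "10-99 edits"
    if edit_count < 1000:
        return "100-999 edits"
    return "1000+ edits"


assert edit_count_bucket(42) == "10-99 edits"
```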

For the short run, we determined that the following are the initial focus:

Gehel set the point value for this task to 13. Apr 15 2024, 3:38 PM

For those following along, have a look at the comment in T358349#9727873 to find the notebook that fills a table in @EBernhardson's namespace, and an example in Superset.

Erik, nice work so far!

I'm interested to see the coarse-grained session ratios from the subtasks, which are expressed in the previous notebooks such as T358352-user-sessions-using-search.ipynb, migrated into the Superset dashboard (the Python-derived number of actors, as well as the unique_devices_per_domain_daily divisors, are particularly helpful for the AC).
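For reference, a sketch of how that ratio could be assembled, assuming a daily actors table in @EBernhardson's namespace (placeholder name here) joined against wmf.unique_devices_per_domain_daily; the column names for the uniques table are from memory and should be double-checked.

```
# Coarse daily ratio: searching actors per domain divided by the
# unique-devices estimate. "placeholder_namespace.daily_search_actors" and its
# columns are placeholders; uniques_estimate is believed to be the column in
# wmf.unique_devices_per_domain_daily but should be verified.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

ratio = spark.sql("""
    SELECT s.`date`,
           s.domain,
           s.searching_actors / u.uniques_estimate AS searching_actor_ratio
    FROM   placeholder_namespace.daily_search_actors s
    JOIN   wmf.unique_devices_per_domain_daily u
      ON   u.domain = s.domain
     AND   make_date(u.year, u.month, u.day) = s.`date`
""").toPandas()
```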

The other piece is quantification of external search engines (Google, DuckDuckGo, etc.) (a bullet point in T358351), so it's shown alongside this first-party search material. Of course we know their relative share is large; this is more about being able to see these things at a glance together in the context of search metrics.

I understand T358350 is still in the works for more reliably seeing the number of search sessions and doing the ratios as well. My impression is this gets into SearchSatisfaction and searchToken (possibly within the context of a time window, depending on how well that's connected to Search's notion of a search session in its backend instrumentation). Does that look feasible, at least across mobile web and desktop web?

Other than the notes above, which pertain more to some additional fields and Superset niceties, here were the other observations...

As for things like the top-N wikis and top-N countries pieces, I think the previous notebooks, which are really cool, may be the way for an interested reader with access to have their thoughts provoked. I'm inclined to also have this expressed in Superset somehow. Would a separate chart be feasible for this without too much extra effort? I think the other items above in this comment take precedence, but I hope we can express some of this other thought-provoking material too; I'm curious to hear the level of effort. If it's too much of a hassle, I think folks could just look at the notebooks (provided you're happy with their counting methods being close to what's in the now-consolidated notebook that pipes data to your table).

Finally, I took notes while closely reading T358345.ipynb for conformance to what I believe to be the case in webrequest and real browser behavior. Nice work there. Here are the notes, hoping you can have a think on them:

"# Not sure if the desktop/mobile switcher can cause an actor to be in both. Maybe?"

This could result in a different actor due to the uri_host being different. But this is more of an edge case in what I've seen of real search-based behavior. It may be worth having an additional hasher that uses normalized_host.project and normalized_host.project_family (not the entire normalized_host structure, because the normalized_host.qualifiers field will, for example, include "m" as a value) if one wanted to see whether there's any cross-over on the same device (e.g., on a NAT'd home IP). It gets a little too complicated in practice, I'm afraid, so I don't think it's worth spending much time looking for that kind of behavior and adjusting action counts, at least not yet.
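Purely as an illustration of that "additional hasher" idea, here is a sketch keyed on the project rather than the full host, so en.wikipedia.org and en.m.wikipedia.org would collapse to one identifier for the same device; the inputs and salt are assumptions, not the notebook's actual hashing scheme.

```
# Hypothetical project-level actor hash: the same device on the desktop and
# mobile web domains of one project hashes to the same value. Not the
# notebook's real scheme; inputs and salt are assumptions for illustration.
import hashlib

def project_level_actor_hash(ip: str, user_agent: str,
                             project: str, project_family: str,
                             salt: str) -> str:
    key = "|".join([ip, user_agent, project, project_family, salt])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```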

To be clear, I think normalized_host as a field in the DDL for your namespaced table is still helpful; this is more about classifying a given UA instantiation across both the desktop and mobile web domains for the main content projects. As you noted, normalized_host basically stands in for uri_host without loss of signal.

select uri_host, normalized_host.project_family
from webrequest
where year = 2024 and month = 4 and day = 4 and hour = 4
  and normalized_host.project_family not like 'wik%'
  and normalized_host.project_family != 'mediawiki'
limit 1000;

Some other thoughts. It's probably a good idea to narrow in on project_family values in preprocessing if possible, or to adjust Superset. Best would probably be to specifically group together the top-level content projects that have MediaWiki installations, from the projects listed at www.wikimedia.org. The pageview traffic to the wikis that aren't listed on www.wikimedia.org is very likely a small percentage, but this may help denoise things a little. I suppose it could be broken out into things that are top-level content projects and things that are not, if one really wanted to analyze that (there are non-content projects running MediaWiki, and there are things that are just not MediaWiki, and we really don't want those inflating divisors). Or was the thought to just let the Superset user handle this in aggregations? Is it possible to update the Superset example?
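One possible shape for that preprocessing filter, sketched below; the allow-list is illustrative only and would need to be checked against the actual normalized_host.project_family values (commons, meta, and similar hosts may surface under a shared family value).

```
# Keep only top-level content project families so non-content / non-MediaWiki
# hosts don't inflate divisors. The set below is illustrative, not verified
# against the real project_family values.
CONTENT_PROJECT_FAMILIES = {
    "wikipedia", "wiktionary", "wikibooks", "wikinews", "wikiquote",
    "wikisource", "wikiversity", "wikivoyage", "wikidata",
}

def is_content_project(project_family: str) -> bool:
    return project_family in CONTENT_PROJECT_FAMILIES
```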

Note that there are a couple interesting browser bar / browser search bar behaviors.

In Firefox for desktop, there's a dropdown of search providers, depending on the user's default or updated configuration. If the user types a search expression and then clicks the Wikipedia icon they are sent to, for example, https://en.wikipedia.org/wiki/Special:Search?search=fish&sourceid=Mozilla-search . I don't think this is super common behavior. Wanted to note it, though.
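If it ever seems worth quantifying, a rough check like the following could estimate how often that entry point shows up in webrequest; the partition values are just an example hour.

```
# Rough count of Special:Search hits that carry the Firefox search-bar
# sourceid. Partition values are only an example hour.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    SELECT uri_host, COUNT(*) AS hits
    FROM   wmf.webrequest
    WHERE  webrequest_source = 'text'
      AND  year = 2024 AND month = 4 AND day = 4 AND hour = 4
      AND  uri_path = '/wiki/Special:Search'
      AND  uri_query LIKE '%sourceid=Mozilla-search%'
    GROUP BY uri_host
    ORDER BY hits DESC
""").show()
```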

Similarly, Safari for iOS / iPadOS lets users type the query "wiki <something>" in the address bar. We know that searching for "wiki <something>" is a real user behavior; it's a thing people do in external search and, obviously, URL bars (whether or not they employ an external engine), etc., but this is also an interesting browser-specific behavior. I'm not sure how well this can be ascertained.

Thanks again! See you Tuesday.


> I'm interested to see the coarse-grained session ratios from the subtasks, which are expressed in the previous notebooks such as T358352-user-sessions-using-search.ipynb, migrated into the Superset dashboard (the Python-derived number of actors, as well as the unique_devices_per_domain_daily divisors, are particularly helpful for the AC).

It's certainly possible in Superset, although I guess I had forgotten how inflexible Superset is. I was hoping to be able to repeat a templated chart with different aggregations, but it seems like in Superset we have to save a chart for each combination of aggregate columns and metrics. Additionally, I did some custom sorting in pandas that made the particular aggregation easier to review, but I can't seem to replicate a decent multi-index table in Superset. The best I can do is a custom datasource with some CTE magic, but that isn't scalable to each individual chart. Otherwise we have to interact with the raw data, which isn't terrible, but also isn't that easy or pleasant in Superset (even just formatting percentages...). For exploration of data I think I entirely agree that Python + pandas was more approachable.

I've added a few defined metrics to the table in Superset, which makes additional charts easier to make. It's still a bit awkward, perhaps simply due to my limited familiarity with Superset. I created a dashboard that mimics what was in the T358352 notebook, see https://superset.wikimedia.org/superset/dashboard/516/
But I'm not convinced it's better than the notebook for exploration, perhaps only for having something to look at down the line and share with others. In the notebook I could easily ask more questions of the data, while this is quite static.

> The other piece is quantification of external search engines (Google, DuckDuckGo, etc.) (a bullet point in T358351), so it's shown alongside this first-party search material. Of course we know their relative share is large; this is more about being able to see these things at a glance together in the context of search metrics.

Yes, I added the referrer classes to the aggregation, but while thinking about it today and looking at the data I realized it's not correct in the current incarnation. I got distracted today trying to make Superset do what I want (it only half did... and it's ugly), so I didn't end up resolving that yet. Will get it figured out soon enough.


> As for things like the top-N wikis and top-N countries pieces, I think the previous notebooks, which are really cool, may be the way for an interested reader with access to have their thoughts provoked. I'm inclined to also have this expressed in Superset somehow. Would a separate chart be feasible for this without too much extra effort? I think the other items above in this comment take precedence, but I hope we can express some of this other thought-provoking material too; I'm curious to hear the level of effort. If it's too much of a hassle, I think folks could just look at the notebooks (provided you're happy with their counting methods being close to what's in the now-consolidated notebook that pipes data to your table).

What's going to be easiest is probably a dedicated notebook that reads the same data Superset is reading into pandas, and then does all the magic there. That leverages all the pre-calculation while keeping the flexibility of a full programming language.
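A minimal sketch of that shape, assuming the intermediate table and column names below (they are placeholders for whatever the Superset dataset actually points at):

```
# Read the same intermediate data Superset uses, then do the multi-index
# aggregation and percentage formatting in pandas. Table and column names are
# placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.table("placeholder_namespace.daily_search_actors").toPandas()

summary = (
    df.groupby(["date", "country", "referer_class"])   # placeholder columns
      .agg(searching_actors=("searching_actors", "sum"),
           all_actors=("all_actors", "sum"))
)
summary["search_share"] = (summary["searching_actors"]
                           / summary["all_actors"]).map("{:.1%}".format)
print(summary.head())
```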


> Some other thoughts. It's probably a good idea to narrow in on project_family values in preprocessing if possible, or to adjust Superset. Best would probably be to specifically group together the top-level content projects that have MediaWiki installations, from the projects listed at www.wikimedia.org. The pageview traffic to the wikis that aren't listed on www.wikimedia.org is very likely a small percentage, but this may help denoise things a little. I suppose it could be broken out into things that are top-level content projects and things that are not, if one really wanted to analyze that (there are non-content projects running MediaWiki, and there are things that are just not MediaWiki, and we really don't want those inflating divisors). Or was the thought to just let the Superset user handle this in aggregations? Is it possible to update the Superset example?

I suppose I was considering that, as long as the daily data is reasonably small, I could punt decisions on how to aggregate the projects until display time. The intermediate data expects the consumer to group over it anyway, so it should be able to concatenate a few pieces of the normalized URL into a form that matches its aggregation goals. For example, the dashboard above doesn't consider anything about the wikis, but to select the countries it uses a CTE with a limit to pick the top N countries and then calculates all the metrics for those countries from the data. I imagine other bits could do similarly, selecting the top N wikis by pageviews or some such.
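For what it's worth, the same CTE-with-limit pattern could presumably cover the top-N wikis case; a sketch, with all table and column names standing in for the real intermediate table:

```
# Top-N wikis by (placeholder) pageviews, then metrics restricted to those
# wikis. Table and column names are placeholders for the intermediate table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

top_wikis = spark.sql("""
    WITH top_n AS (
        SELECT wiki
        FROM   placeholder_namespace.daily_search_actors
        GROUP BY wiki
        ORDER BY SUM(pageviews) DESC
        LIMIT 10
    )
    SELECT d.`date`,
           d.wiki,
           SUM(d.searching_actors) / SUM(d.all_actors) AS search_share
    FROM   placeholder_namespace.daily_search_actors d
    JOIN   top_n USING (wiki)
    GROUP BY d.`date`, d.wiki
""")
top_wikis.show()
```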


The four sub-tickets were combined into a single GitLab MR with two calculations, found at https://gitlab.wikimedia.org/repos/search-platform/notebooks/-/merge_requests/3. These currently populate two daily partitioned Hive tables, and I've filled them with data for the months of March and April. I expect that going forward we will want to move the metrics calculation to Airflow, and decide which metrics are worth dashboarding.
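For the eventual Airflow move, the daily fill could presumably stay idempotent by overwriting one partition per run; a hypothetical sketch (table names, columns, and partition layout are placeholders, not the MR's actual DDL):

```
# Recompute a single day and overwrite only that partition, so reruns are
# idempotent. Everything named here is a placeholder.
from pyspark.sql import SparkSession

def fill_day(spark: SparkSession, year: int, month: int, day: int) -> None:
    spark.sql(f"""
        INSERT OVERWRITE TABLE placeholder_namespace.daily_search_metrics
        PARTITION (year={year}, month={month}, day={day})
        SELECT wiki, country, searching_actors, all_actors
        FROM   placeholder_namespace.daily_search_metrics_staging
        WHERE  year = {year} AND month = {month} AND day = {day}
    """)

fill_day(SparkSession.builder.getOrCreate(), 2024, 4, 4)
```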