
[REQUEST] Get pageview data for MediaWiki core JS docs
Closed, ResolvedPublic

Description

Name for main point of contact and contact preference
@apaskulin (Feel free to ping me here on Phabricator.)

What teams or departments is this for?
JSDuck to JSDoc migration project, collaboration between the Technical Documentation, Web, and Design Systems teams

What are your goals? How will you use this data or analysis?
We're looking to learn more about how people use the HTML JavaScript docs for MediaWiki core: how many visits there have been to the docs, which pages have been visited, and any other data available. We will use this data to make decisions about which content we include on the site and which features we prioritize.

What are the details of your request? Include relevant timelines or deadlines
Query the webrequest tables in Hive for data about visits to https://doc.wikimedia.org/mediawiki-core/master/js/ and subpages, including visits to other versions with the URL format https://doc.wikimedia.org/mediawiki-core/REL{version_number}/js/ (example: https://doc.wikimedia.org/mediawiki-core/REL1_41/js/). We're particularly interested in data before November 28, 2023, so considering that the webrequest tables are purged after 90 days, it would be great to get this data by mid-to-late January 2024. Whatever format is easiest is fine. Thanks for your help!
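For clarity, the URL path pattern described above could be sketched as a regular expression (a hypothetical illustration, not part of the actual request):

```python
import re

# Hypothetical sketch of the doc.wikimedia.org paths described above:
# /mediawiki-core/master/js/... and /mediawiki-core/REL{version_number}/js/...
JS_DOCS = re.compile(r"^/mediawiki-core/(master|REL\d+_\d+)/js/")

assert JS_DOCS.match("/mediawiki-core/master/js/")
assert JS_DOCS.match("/mediawiki-core/REL1_41/js/")
assert not JS_DOCS.match("/mediawiki-core/REL1_41/php/")
```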

Is this request urgent or time sensitive?
See note above about getting data before it is purged.

Event Timeline

mpopov subscribed.

Removing Product Analytics for now. It may be that someone from PA ends up taking this on but not necessarily.

I have submitted this as a request for Research and Decision Science: https://app.asana.com/0/1205562963911611/1206268926202590/f

Please use the RDS request process in the future. See https://office.wikimedia.org/wiki/Research_and_Decision_Science/Requests for more details.

I checked in with the RDS group about this and we do not have the bandwidth to take this on as a separate request.

However, you may be able to get help with querying the data through our teams' consultation hours.

Thanks for the response! I'll try that.

@apaskulin: Here's the data aggregated across the time period we discussed: https://docs.google.com/spreadsheets/d/1iSISeYRdW2ZMf0eZUL4Rp5SwJpPrmx6ohbQDuOc-GI0/edit#gid=0

I removed the /mediawiki-core from the paths to make them more readable / easier to scan.

Here's the query & Python code I wrote during the consultation appointment:

query = """
select
  year, month, day,
  regexp_replace(uri_path, '/mediawiki-core', '') as uri_path,
  count(1) as n_requests
from wmf.webrequest
where webrequest_source = 'text'
  and year = {year} and month = {month} and day = {day}
  and uri_host = 'doc.wikimedia.org'
  and starts_with(uri_path, '/mediawiki-core')
  and regexp_like(uri_path, '/js/')
  and content_type = 'text/html'
  and not regexp_like(uri_path, '/source/')
group by 1, 2, 3, 4
order by n_requests desc
"""

import wmfdata as wmf
import pandas as pd
import datetime as dt

start_date = dt.date(2023, 10, 20)
end_date = dt.date(2023, 12, 10)

results = list()

# Query one day's partition at a time to keep each Presto query small.
for this_day in pd.date_range(start_date, end_date):
    results.append(wmf.presto.run(query.format(year = this_day.year, month = this_day.month, day = this_day.day)))
    print('Retrieved data for {year} - {month} - {day}'.format(year = this_day.year, month = this_day.month, day = this_day.day))

jsdocs_reqs = pd.concat(results)

# Sum requests per path across all days, most-requested first.
jsdocs_agg = jsdocs_reqs[['uri_path', 'n_requests']] \
    .groupby('uri_path') \
    .aggregate('sum') \
    .sort_values(by='n_requests', ascending=False)

jsdocs_agg.to_csv('jsdocs_requests.csv')

There is no sensitive data or user data in the output, so it poses no risk.

mpopov changed the task status from Declined to Resolved. Jan 18 2024, 12:45 AM

In the 50 days between October 20 and December 10, 2023, there was a significant number of pageviews to the homepage of the MediaWiki core JS docs: 1,683,311 including all versions of MediaWiki. However, there were limited pageviews to other pages of the docs: fewer than 12 pageviews for each of ~30 other pages and no pageviews for most pages of the site. This leads me to conclude that use of the MediaWiki core JS HTML docs is minimal, while interest in the docs may be high.

Sidenote: This data also showed between 25 and 160 pageviews for each of a few URLs that are invalid links: mw.module_util, jQuery.fn.updateTooltipAccessKeys, util.js, and mediawiki.module_util. I thought maybe these were originating from broken links on-wiki, but I wasn't able to find any reference to them on Meta-Wiki or mediawiki.org.

Any thoughts about this data or these conclusions are welcome!

This leads me to conclude that use of the MediaWiki core JS HTML docs is minimal, while interest in the docs may be high.

That's a reasonable conclusion.

I suspect developers primarily rely on the documentation capabilities of the IDEs they're using. From my own experience, whenever I've needed to look up functions (e.g. https://doc.wikimedia.org/mediawiki-core/master/js/mw.user-method-generateRandomSessionId.html) it was outside of a development context (I don't have an IDE set up for a MW dev environment) and specifically when I've needed to refer to a function.

Okay, so there are a few things at play here:

In the 50 days between October 20 and December 10, 2023, there was a significant number of pageviews to the homepage of the MediaWiki core JS docs: 1,683,311 including all versions of MediaWiki. However, there were limited pageviews to other pages […], and no pageviews for most pages of the site. This leads me to conclude that use of the MediaWiki core JS HTML docs is minimal, while interest in the docs may be high. […]

That's a reasonable conclusion.

If these numbers are based on server-side web request logs, then I believe this contains a slight misunderstanding. Consider two URLs like https://doc.wikimedia.org/mediawiki-core/REL1_41/js/#!/api/mw.user and https://doc.wikimedia.org/mediawiki-core/REL1_41/js/#!/api/mw.loader, which differ only in their hash fragment.

To the webserver, these are identical requests for /mediawiki-core/REL1_41/js/. The hash fragment is private to the browser and kept client-side only, where JavaScript then swaps out different virtual "pages" as you browse. JSDuck is thus a so-called "single-page app" (SPA), which means that all pages carry the same URL. As such, these 1.6M requests aren't bounces on the homepage, but navigations to any number of possible entry pages within the JSDuck site.

It also means that it is an undercount because once you open such a URL, any subsequent page view does not produce a new webrequest.
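A quick way to see why all these virtual pages look the same in webrequest logs: the URL fragment is stripped before the request is sent, so only the path reaches the server. (The `#!/api/...` fragments below are illustrative of JSDuck-style routing, not actual logged URLs.)

```python
from urllib.parse import urlsplit, urlunsplit

a = urlsplit("https://doc.wikimedia.org/mediawiki-core/REL1_41/js/#!/api/mw.user")
b = urlsplit("https://doc.wikimedia.org/mediawiki-core/REL1_41/js/#!/api/mw.loader")

# Both "pages" share the same server-visible path; the fragment stays client-side.
assert a.path == b.path == "/mediawiki-core/REL1_41/js/"

# What the browser actually requests, with the fragment dropped:
requested = urlunsplit((a.scheme, a.netloc, a.path, a.query, ""))
print(requested)  # https://doc.wikimedia.org/mediawiki-core/REL1_41/js/
```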

There are a handful of separate standalone files in the JSDuck output, such as "view source" embeds like https://doc.wikimedia.org/mediawiki-core/REL1_41/js/source/mediawiki.String.html. Those do have their own URL, but are rarely viewed. By and large, all "pages" share the same webrequest URL, and thus would appear as such in statistics.

Thanks for this clarification, @Krinkle! It seems like we wouldn't be able to get any insight into which pages were most viewed, but it seems safe to say use of the docs was very high, especially considering the undercount. Out of curiosity, due to the undercount issue, could we consider this 1.6M unique sessions?

Thanks for this clarification, @Krinkle! […] Out of curiosity, due to the undercount issue, could we consider this 1.6M unique sessions?

That depends on what we mean by "session" and by "unique". It represents how many times the MediaWiki JSDuck docs for a given version (e.g. master, or REL1_41) were "opened". Any subsequent clicks within the same tab in those docs, given that JSDuck was an SPA, would not make new navigation requests to the server. However, opening the docs in a separate tab would make a request that we count. Browsing a different version (e.g. master and REL1_41) would make a request we count. Closing and re-opening the docs, even if only a second later, and even if in the same tab, will naturally have to make a request and thus be counted. However, those are likely edge cases. So the 1.6M is likely a close approximation of how traditional trackers count "sessions", but not uniques.

For example, if we pretend we had some tracking code that creates a cookie like "mysession=1" with an expiry of 5 minutes and prolongs that cookie for as long as you keep browsing within the 5 minutes, and then on the server we counted HTML requests where the cookie was absent (i.e. the start of a new pageview session), that would probably give you a similar number. The difference would be that our 1.6M would over-count in some scenarios:

  • "hard refresh"
  • "open in new tab"
  • "go to different MW version"

Browsers naturally make a web request in these cases, so if we misuse the request count as a session count, we'd be over-counting them. On the other hand, traditional session counts are by no means perfect either and may over-count in the other direction in other scenarios: being time-based, they count things as a new session even if the user didn't do anything that feels to them like a new session (why would a click after 6 minutes be different from a click after 5 minutes, if it's within the same tab you had open?). I'm not in any way suggesting that we use the raw request count as a session counter, but my guess is it'd be fairly close, maybe within 10%. You'd likely want to exclude bots/spiders. I hope this helps understand how and when browsers make requests to a server for a web page that serves an SPA.

I hope this helps understand how and when browsers make requests to a server for a web page that serves an SPA.

Thanks for this extra context, @Krinkle!