
Spike : How will we Measure % of edits coming from users without JS
Closed, Resolved | Public | Spike

Description

This task represents the work required to figure out what data we already have, and what if any instrumentation is required, to measure the % of edits coming from users without JS.

Event Timeline

Mayakp.wiki changed the subtype of this task from "Task" to "Spike". Apr 29 2020, 9:40 PM

Proof of concept posted at https://meta.wikimedia.org/wiki/User:MPopov_(WMF)/Notes/JS_support

The methodology isn't perfect, and I'd love for someone like @phuedx or @DLynch to review it. Specifically I'm wondering about cases where certain files which we would always expect to be requested together with the page's HTML…aren't.

Data referenced in the notes:

I don't think that method can be relied upon, mostly because of caching.

The request for /w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector is served with a not-terribly-aggressive cache header, so the browser should only be keeping it around for 5 minutes... but that's still enough to mean that anyone who navigates *back* to the main page after viewing any other article is going to not-request the JS/CSS. If you could perhaps filter to "requests from IPs which have not accessed the site in the last 30 minutes" or similar, that might avoid this issue... but I don't know how plausible that query is.
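To make the "IPs which have not accessed the site in the last 30 minutes" filter concrete, here is a minimal sketch. It assumes the request log has already been reduced to `(ip, timestamp)` pairs; the function name `cold_requests` and that input shape are hypothetical, not part of any existing pipeline. Note that the first request per IP in the log window is treated as "cold" even though its earlier history is unknown.

```python
from datetime import datetime, timedelta


def cold_requests(requests, quiet=timedelta(minutes=30)):
    """Keep only requests whose IP made no request in the preceding `quiet` period.

    `requests` is an iterable of (ip, timestamp) pairs (hypothetical shape).
    """
    last_seen = {}  # ip -> timestamp of that IP's most recent request
    cold = []
    for ip, ts in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(ip)
        # Cold if the IP is new to the log, or has been quiet long enough
        if previous is None or ts - previous >= quiet:
            cold.append((ip, ts))
        last_seen[ip] = ts
    return cold


t0 = datetime(2020, 5, 1, 12, 0)
log = [
    ("a", t0),
    ("a", t0 + timedelta(minutes=5)),   # warm: 5 minutes after the previous hit
    ("b", t0 + timedelta(minutes=10)),
    ("a", t0 + timedelta(minutes=40)),  # cold again: 35 minutes of silence
]
print(cold_requests(log))
```

Whether this is feasible at production log volume is a separate question; the sketch only pins down what the filter would mean.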

(There's also the issue of scrapers / bots / browser-prefetching loading just the HTML of the page and not loading the resources, but I don't know the size of that effect.)

When we move along to actually checking for the rates on edits rather than generic front-page hits, this is going to get even less reliable, because "visiting the edit page without having viewed literally anything else on the site and so having clean caches" seems like an unlikely scenario.

For edits, I really do think the EditAttemptStep analysis designed in the parent task is going to get us a much more accurate view of it.

I agree with @DLynch. Especially when it comes to counting edits, using existing EventLogging data should be more reliable than browser resource requests (so long as we only need sampled percentages and not absolute numbers). It's also not clear to me how resource request data would actually be used for counting edits, as that isn't mentioned in the proof of concept document. I can imagine how it could be used for looking at visits to the editing interface, but not actual edits. For that, I think T240697#6094827 would work better.

LGoto triaged this task as Medium priority. May 4 2020, 6:15 PM
LGoto edited projects, added Product-Analytics (Kanban); removed Product-Analytics.
LGoto moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

One thing I keep thinking is relevant here is that ad-blockers can block client-side EventLogging from getting through (from what I know, uBlock Origin does this by default). This can skew the measurement, so having a sense of the size of that skew would be useful.

EditAttemptStep is a combination of server-side and client-side EventLogging, and we have a similar combination in the Growth Team's HomepageVisit (server-side) and HomepageModule (client-side) schemas. I did an analysis of those schemas back in February to understand to what extent client-side EventLogging was available on a per-visit and per-user basis for our four target wikis (the Czech, Korean, Vietnamese, and Arabic Wikipedias) when users visit the Newcomer Homepage. The results can be found in T243632#5898720.

To summarize the findings: on a per-user basis (Table 3 in the comment linked above), we find a fairly consistent rate of about 10% of users having no client-side EventLogging data. The proportion of users who appear to always have some client-side data varies in the 70–80% range. The rest had various amounts of visits with it either enabled or disabled. When measuring this on a per-visit basis (Table 2 in the comment), we see higher proportions with no data available. I kind of expected that, as I'd think that more prolific users are also more tech savvy and/or privacy conscious.
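The difference between the per-user and per-visit views can be illustrated with a small sketch. The input shape, a list of `(user, has_client_data)` pairs with one entry per visit, and the function name `classify_users` are assumptions for illustration; the data below is made up and is not taken from T243632.

```python
from collections import Counter, defaultdict


def classify_users(visits):
    """Bucket users by whether their visits always, never, or sometimes
    have client-side data. `visits` holds (user, has_client_data) pairs."""
    flags_by_user = defaultdict(set)
    for user, has_client_data in visits:
        flags_by_user[user].add(bool(has_client_data))

    counts = Counter()
    for flags in flags_by_user.values():
        if flags == {True}:
            counts["always"] += 1
        elif flags == {False}:
            counts["never"] += 1
        else:
            counts["mixed"] += 1
    return counts


# Made-up visits: u2 is a prolific user with no client-side data at all
visits = [
    ("u1", True), ("u1", True),
    ("u2", False), ("u2", False), ("u2", False),
    ("u3", True), ("u3", False),
]
per_user = classify_users(visits)
per_visit_missing = sum(1 for _, ok in visits if not ok) / len(visits)
print(per_user, per_visit_missing)
```

In this toy data one user in three has no client-side data, but four visits in seven lack it, which is the same direction of effect described above: prolific users without client-side data weigh more heavily in the per-visit view.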

Translating this to EditAttemptStep, if the measurement is on a per-edit basis, I expect to see relatively high proportions of "non-JS" as prolific users are more likely to have ad-blockers or other privacy measures in place, or have turned JS off.

Client-side EventLogging respects browser do-not-track settings, and server-side doesn't. We could actually improve server-side in this regard, I think, by checking for the DNT HTTP header and treating it as equivalent to the client-side setting. That might help our tracking here.
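As a rough illustration of the proposal only (this is not EventLogging's code, and the extension itself is PHP), a server-side check could look something like the sketch below. It assumes the request headers are available as a plain mapping; `should_log_event` is a hypothetical name. The `DNT` request header carries the value `1` when the user has asked not to be tracked.

```python
def should_log_event(headers):
    """Return False when the request carries a `DNT: 1` header.

    `headers` is a mapping of header name to value (hypothetical input shape).
    HTTP header names are case-insensitive, so compare in lower case.
    """
    for name, value in headers.items():
        if name.lower() == "dnt":
            return value.strip() != "1"
    # No DNT header at all: no preference expressed, so log as usual
    return True


print(should_log_event({"User-Agent": "example", "DNT": "1"}))  # opted out
print(should_log_event({"User-Agent": "example"}))               # no preference
```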

@nettrom_WMF - Ugg, I imagine that means that everyone with an ad blocker will show up as a no JavaScript editor, thus terribly skewing the results (as there are likely to be far more editors using ad blockers than editors that actually have JS turned off).

Perhaps we could work out some sort of baseline adblocker-but-JS rate by looking at what percentage of edits are made with VE and EventLogging and seeing how that compares to our anticipated rate based on our sampling-rate? Then apply that to the WikiEditor data to compensate?
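One way to read this proposal as arithmetic is sketched below. It rests on two assumptions that are mine rather than anything established in this thread: that beacon blocking is independent of which editor is used, and that the sampling rate applies uniformly. VE requires JavaScript, so any shortfall in VE events is attributed to blocking. If `b` is that blocked rate, `m` is the share of WikiEditor sessions missing client-side data, and `n` is the true no-JS share, then `1 - m = (1 - n)(1 - b)`, which gives `n = 1 - (1 - m) / (1 - b)`. All function names and numbers are hypothetical.

```python
def blocked_rate(ve_tagged_edits, ve_logged_edits, sampling_rate):
    """Estimate the share of JS-enabled sessions whose events never arrive.

    Compares VE edits seen in EventLogging with the number expected from
    the count of VE-tagged edits and the sampling rate.
    """
    expected = ve_tagged_edits * sampling_rate
    if expected <= 0:
        raise ValueError("expected number of logged edits must be positive")
    return max(0.0, 1.0 - ve_logged_edits / expected)


def no_js_share(missing_share, blocked):
    """Back out the true no-JS share from the share of sessions with no
    client-side data, assuming blocking is independent of the editor used."""
    if not 0.0 <= blocked < 1.0:
        raise ValueError("blocked rate must be in [0, 1)")
    return max(0.0, 1.0 - (1.0 - missing_share) / (1.0 - blocked))


# Made-up numbers: 10,000 VE-tagged edits at 1/16 sampling, 500 events seen
b = blocked_rate(10_000, 500, 1 / 16)
# Made-up: 25% of WikiEditor sessions show no client-side data
print(b, no_js_share(0.25, b))
```

With these invented inputs the blocked rate comes out near 20%, and the compensated no-JS share near 6%, well below the raw 25%.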

> Perhaps we could work out some sort of baseline adblocker-but-JS rate by looking at what percentage of edits are made with VE and EventLogging and seeing how that compares to our anticipated rate based on our sampling-rate? Then apply that to the WikiEditor data to compensate?

I was talking to @Mayakp.wiki about the DNT problem the other day and I keep wondering if we could add a server-side EventLogging component to VE the way that WikiEditor has. (P.S. I agree that there's an argument to be made about adding a DNT check to server-side component of EL extension.)

Still, the biggest problem with using any numbers from VE is that it's opt-in by registered users so it's not representative of the general population of editors.

> I keep wondering if we could add a server-side EventLogging component to VE the way that WikiEditor has.

It'd be sort of a pain -- there's not an equivalent to WikiEditor being on a specific pageload. We could maybe add some logging to one of the API calls that loads the page... but we'd have to provide some new session information to the call, and we can't strictly guarantee that it'd be VE that's using it.

> Still, the biggest problem with using any numbers from VE is that it's opt-in by registered users so it's not representative of the general population of editors.

That's a very enwiki-centric viewpoint! I.e. that depends on the wiki you're looking at, and on (I think) most wikis VE is the default, including for anons. (I mention this just because if you believe that about VE, you might be making bad assumptions about the representative nature of the WikiEditor numbers...)

> That's a very enwiki-centric viewpoint! I.e. that depends on the wiki you're looking at, and on (I think) most wikis VE is the default, including for anons. (I mention this just because if you believe that about VE, you might be making bad assumptions about the representative nature of the WikiEditor numbers...)

Thanks for pointing that out! That's a really great point. Okay, I did some parsing of the current mw config so here's a snapshot for reference:

| setting | wiki |
| --- | --- |
| VE is secondary editor | enwiki |
| VE is secondary editor | eswiki |
| VE is secondary editor | frwiktionary |
| VE is secondary editor | hewiki |
| VE disabled for anon users | enwiki |
| VE disabled for anon users | eswiki |

@mpopov there's also the visualeditor-nondefault group in config, which mostly consists of the various wiktionaries / wikisources / wikiquotes and similar... but has a few mainline wikipedias on it -- ganwiki, iuwiki, kkwiki, kuwiki, srwiki, tenwiki, tgwiki, uzwiki, and (probably the biggest?) zhwiki.

@mpopov here's a quick implementation of respecting the DNT header server-side, for discussion: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventLogging/+/595005

...and that got merged. So, adblockers may still be cutting off the beacon requests, but DoNotTrack should be applied to client-side and server-side equally now.

EDIT: Okay, it bounced off a jenkins failure, so it didn't get merged. Alas.

@Nuria - Actually, it looks like there's a bigger question to resolve first: T252438 ([Spike] Should EventLogging support DNT?)

In T252438 we concluded that EL should not support DNT (patch Remove DoNotTrack support added by @DLynch), which brings us back to the question of ad-blockers that will show up as a no JavaScript editor, thus terribly skewing the results (as there are likely to be far more editors using ad blockers than editors that actually have JS turned off).

> which brings us back to the question of ad-blockers that will show up as a no JavaScript editor

A thought about this, because I am not sure this premise is correct: an editor with an ad blocker who uses VE will show up as a complete absence of data on the client side, but data will appear on the server side tagged with "VE". An editor without JavaScript can only use the wikitext editor, as far as I can see, and thus is not counted when looking at "VE" tags, correct?

Thus, if the goal is to know the number of VE edits affected by ad blockers, whose data you do not have, you can subtract: "total VE tagged edits" - "VE edits whose data we have harvested".

If the goal is to look at no-JavaScript edit cases, the only data useful for that is wikitext edits, I think. The non-JS case can be measured if specific instrumentation is added for the wikitext case with JS enabled. I can explain in more detail if you want to explore that possibility.

@Nuria - We were hoping that we could handle this with the existing instrumentation. Right now, in both editors, the ready action is recorded on the client-side by EventLogging (as far as I understand). We were hoping to use that as an indication of whether someone has JavaScript turned on or off. Do you know if ad blockers interfere with EventLogging recording the ready event on the client-side? Do ad blockers prevent all client-side EventLogging? If so, what's the best way to work around that? Would we have to build some type of custom tracking system?

> Do you know if ad blockers interfere with EventLogging recording the ready event on the client-side?

Adblockers do not interfere with JavaScript running on the page but rather with "sending" the information. They prevent HTTP requests to certain URLs, as most beacons just harvest data to be used for ad display.

The URL that EventLogging sends data to is "beacon/event", which is present on popular adblocker URL lists:
https://easylist.to/easylist/easyprivacy.txt

Is there anything in the edit URL parameters or in the form POST body for legitimate forms-based and VE editing (i.e., non-bot...I guess unless headless browsing bots do things that evade detection) that could be relied upon?

For example, the mw-twocolconflict-js POST body key-value seems to be present when the <textarea> editor is running in the context of a JS-enabled browser, but it isn't present for a true no-JS user.

To be a bit more robust, I guess NORLQ code might need to run on the edit page in order to ensure that Grade C and X unconditionally would get some sort of nonce like that added in (I haven't looked at mw-twocolconflict-js but the principle holds for anything we want to measure more reliably - of course both an RLQ and a NORLQ nonce could be used to differentiate).
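The idea of relying on a marker in the POST body could be sketched as follows. This assumes each edit submission has already been parsed into a mapping of form field names to values (the parsing step is out of scope here), and that the presence of `mw-twocolconflict-js` is a dependable JS signal, which the comment above only suggests. `no_js_share` is a hypothetical name, and the sample submissions are invented.

```python
JS_MARKER = "mw-twocolconflict-js"  # field named in the comment above


def no_js_share(submissions, marker=JS_MARKER):
    """Share of edit submissions whose form fields lack the JS-only marker.

    `submissions` is a list of mappings from form field name to value
    (hypothetical input shape).
    """
    if not submissions:
        raise ValueError("no submissions to classify")
    without_marker = sum(1 for fields in submissions if marker not in fields)
    return without_marker / len(submissions)


# Invented sample: three submissions carry the marker, one does not
sample = [
    {"wpTextbox1": "text", JS_MARKER: "1"},
    {"wpTextbox1": "text", JS_MARKER: "1"},
    {"wpTextbox1": "text"},
    {"wpTextbox1": "text", JS_MARKER: "1"},
]
print(no_js_share(sample))
```

Because the marker travels with the edit itself rather than through the beacon endpoint, a count like this would not be affected by ad blockers in the way the EventLogging-based measurement is.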