Sat, Oct 13
Fri, Oct 12
Thanks again for the analysis and the recommendations!
From this analysis, I'd strongly recommend ignoring ReadingDepth data coming from the Android native browser, iOS Chrome prior to 11.3, and Chrome <=38.
I guess that this was meant to read "iOS prior to 11.3", correct? (cf. above)
FWIW, I encountered the same kind of error in beeline around the same time last week (October 2, I believe). Below is the query and the log (reproduced today). The same query works fine in Hive, and SET hive.auto.convert.join=false; fixes it in beeline as well.
By "information reduction" (in both of these fields), I meant that several possible values will be mapped to the same value.
E.g. the arrays [0,1,2] and  in the EL data will both result in the integer 0 in the Druid data. In the task description and T201873 we had IIRC understood the term "flattened into a string" as mapping e.g. the array [0,1,2] into the string '[0,1,2]'.
We all know that.
Thu, Oct 11
Wed, Oct 10
This is a term from the instrumentation DACI, it's perhaps useful to get familiar with that first (and then work on any necessary clarifications there).
My understanding has been that this task is largely separate from the question which of the resulting data can be kept beyond 90 days. I would expect we will receive guidance from the Legal team (or in the future, Privacy) regarding this question, and that this guidance would depend on the specific data being logged (e.g., whether it contains page names or not). @dr0ptp4kt , can you clarify the scope?
Tue, Oct 9
Inspired by a suggestion by @Jdlrobson, here is a version of the above query by iOS version, showing a clear change at iOS 11.3, but also some oddities at earlier versions like 9.1:
Be that as it may - we do actually have data in the webrequest table for Wikitech. Using a somewhat simplistic pageview definition, here are the 100 most viewed pages for September 2018 (without spiders) according to that data. Looks quite plausible.
Back to the view in Turnilo: This looks very exciting indeed!
BTW, I understand we are focusing on use in Turnilo for now, but out of curiosity (and considering the task description) I checked Superset too and didn't see this data there yet. I clicked "scan new datasources", which appears to have imported it, alongside data from some other schemas:
Another question: It seems that the dimensions lack e.g. Ua Browser Major and other user agent derived fields (that we have and use in e.g. https://turnilo.wikimedia.org/#pageviews_daily/ ). In the web team we often need these when evaluating EL data, see e.g. this example from earlier today: T204143#4650771 . Could they be added, analogously to the pageviews data?
@mforns Great to hear that Druid already allows ingestion of array types! But just to clarify, it seems that this involves information reduction of some kind? At least I'm only seeing scalar values in the selection dropdown (below).
If that's the case, could we document how that works - does it always pick the first element of the array? (i.e. [0,2,5] --> 0, etc.)
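If it does pick the first element, the reduction would behave roughly like the minimal sketch below. This is an assumption to be confirmed and documented, not observed behavior; in particular, the handling of empty arrays is a guess:

```python
def first_element_or_none(value):
    """Hypothetical Druid-ingestion reduction of an array field to a scalar."""
    if isinstance(value, list):
        return value[0] if value else None  # empty-array handling is a guess
    return value  # scalars would pass through unchanged

print(first_element_or_none([0, 2, 5]))  # 0
```

If the actual rule differs (e.g. last element, or concatenation), the documentation should spell that out instead.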
Mon, Oct 8
With regard to the first two, I'd need more detailed information on which versions data is missing for Chrome Mobile iOS, Android (stock browser), and (desktop) Chrome. Chrome for iOS is very different from Chrome for Android (one uses WebKit and the other Blink for rendering). For desktop, at least Chrome 39 is needed. As for the Android stock browser, I still don't really understand why it is still around and I suspect it's in maintenance mode; I wouldn't be surprised if it doesn't support sendBeacon or performance.
So about the rest of the result I had mentioned above that looked fine:
Sun, Oct 7
By the way, we have some data on how often links are being opened in a new tab (or window), i.e. how frequently a new mw.user.sessionId() is generated in the course of a browser session (in the usual sense that aligns with session cookie storage).
Sat, Oct 6
Yes, until the end of January it looks like (see also our timeline document).
Fri, Oct 5
Sounds like a good idea! In the meantime, I have submitted a patch to at least add a caveat to the existing documentation.
Wed, Oct 3
Thanks for the technical background, @elukey! I think it would be useful to add some guidance to the documentation. Developers might find concrete rate limits particularly useful (like the one we stated earlier about the old MySQL system). Especially since there was a sense earlier that the new Hadoop infrastructure would basically relieve us of worrying about throughput limitations.
There was a lack of clarity about the expected event increase from https://gerrit.wikimedia.org/r/463875 , causing some misunderstanding with Analytics Engineering and the postponing of the deployment earlier today:
PPS: I added a section to the documentation.
Tue, Oct 2
PS: This solves my own use case and I think that of some other Python users too. Personally I wouldn't mind closing this task, although the problem as stated hasn't been solved yet, and users of other languages might not yet have a way to circumvent it.
Cool! This works great for me. I tweaked it a bit to make the from_email and to_email parameters optional, autogenerating them based on the server name and user name.
In:
# cf. https://phabricator.wikimedia.org/T168103#4635031 :
notebookservername = !hostname
notebookserverdomain = notebookservername[0] + '.eqiad.wmnet'
username = !whoami
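For reference, the same autogeneration can be written in plain Python without IPython's `!` shell capture. This is only a sketch; the exact address format (and the example host/user names) are my assumptions, not the actual tweak:

```python
import getpass
import socket

def default_addresses(hostname=None, username=None, domain='.eqiad.wmnet'):
    """Autogenerate default from/to addresses from the server and user names."""
    hostname = hostname or socket.gethostname()
    username = username or getpass.getuser()
    address = f'{username}@{hostname}{domain}'
    # Use the same address for both when from_email/to_email are omitted.
    return address, address

# Hypothetical host and user names, for illustration only:
from_email, to_email = default_addresses('notebook1003', 'jdoe')
print(from_email)  # jdoe@notebook1003.eqiad.wmnet
```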
The event rate before and after deploy looks plausible from a glance at Grafana - closing this task now.
Mon, Oct 1
And to follow up on T204609#4630216, the newly added wikis appear to exhibit an issues clickthrough rate similarly low to lvwiki's (it's a bit higher on fawiki, at 0.69% so far). This looks like a good reason to increase the sampling ratio to 100% on the smaller (non-enwiki) wikis, in order to have a better chance of detecting changes (if any) with statistical significance. E.g. jawiki will have about 5 million mobile views during the two weeks of the test; even if perhaps 1 million of those views are of pages with issues, that would not be enough to detect a 5% increase at a 0.3% clickthrough rate.
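To make the back-of-the-envelope power argument explicit, here is my own rough sample-size calculation with the usual normal approximation (not an official analysis; alpha and power values are conventional choices):

```python
from math import ceil

# Two-proportion sample-size approximation, using the numbers quoted above:
p1 = 0.003            # baseline clickthrough rate (0.3%)
p2 = p1 * 1.05        # a 5% relative increase
z_alpha = 1.96        # two-sided alpha = 0.05
z_beta = 0.84         # power = 0.80

variance = p1 * (1 - p1) + p2 * (1 - p2)
n_per_arm = ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

print(n_per_arm)  # roughly 2.1 million views per arm
```

With only about 1 million issue-bearing views split across control and test, we would fall well short of the ~2 million needed per arm, hence the case for 100% sampling.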
We now have two full hours of data. As a first check, here is the ratio of pageloaded events to all mobile web pageviews for enwiki (analogously to T204609#4607546 for lvwiki). 2.8% at a sampling ratio of 20% would extrapolate to 16% of pageviews, which is still consistent with 19% of enwiki mainspace pages having Ambox issues (T201123#4494446).
After some other checks that looked fine (will post the detailed results here), I happened to look at the frequency of action types again.
Fri, Sep 28
@Nuria I figured that percentiles including the median might be more demanding, but I didn't expect that the mean would be a problem too. Considering that Druid's aggregators include sums and counts, is there a way to calculate their quotient (i.e. the mean) later in Turnilo or Superset?
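If Druid stores a sum and a count aggregator for such a field, the mean could in principle be derived downstream as their quotient, along these lines (a sketch of the idea only, not a claim about what Turnilo or Superset currently support):

```python
def mean_from_aggregates(total_sum, event_count):
    """Derive the mean from pre-aggregated sum and count measures."""
    return total_sum / event_count if event_count else None

# e.g. a time bucket with sum(visibleLength) = 48000 over 4000 events
print(mean_from_aggregates(48_000, 4_000))  # 12.0
```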
On the Persian Wikipedia, the ratio of pages with (Ambox) issues is around 5%: https://phabricator.wikimedia.org/T201123
Two bugs involving overlapping text on the Catalan Wikipedia:
For the record: @Nuria and I discussed this task earlier this week, and I understand that the AE team feels that due to the workload from other projects it might not be possible to implement this (specifically, the subtask T201873 ) before the end of Q2. The team has suggested tackling T205562: Ingest data into druid for readingDepth schema first instead, as it might be an easier case.
As indicated over at T205562#4626349 , in many or most cases we will want to treat such integer fields as measures, rather than as dimensions. It seems bucketing only makes sense for the latter.
Oh, those time fields (visibleLength and totalLength are particularly relevant) are to be understood as measures, not as dimensions, to use the terms referred to in https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines and the Druid documentation. It's actually very similar to the examples from the draft guidelines about "time spent" and "time since last action".
Re-checking the ratio of pageloaded events per pageview after the fix for T205355 has been deployed:
This looks much more plausible now than earlier (T204609#4607546), with rates around the estimated 10%.
Thanks! Does this mean we can consider the AC "Analyse any errors that are introduced in the EventLogging pipeline relating to this change" fulfilled?
Thu, Sep 27
Wed, Sep 26
Remind me, did we do QA for this schema on Mobile Safari? If you and/or @Ryasmeen saw valid events on that browser, I would agree that it's reasonable to assume for now that we can use its data.
Tue, Sep 25
They seemed to be closely connected to me, also because of the "correct and amend them later as needed" part. But we can split them if you prefer.
I left out the product manager user in the one about editing/drafting schemas, only because the few I've talked to haven't had the need to edit them. We can ask around and see if I missed that and there is actually a use case for them here.
I was also curious what the impact of the sitemap rollout would look like for the desktop domain it.wikipedia.org itself:
For the record: @Jdlrobson has found the likely reason for the initially low event rate ("Minerva A/B tests are not subject to HTML caching time. Config added inside SkinMinerva is subject to the rules of HTML caching and can take several days ..."). The fix is being worked on at T205355: A/B config flag should be subject to ResourceLoader caching rules not HTML caching rules
Mon, Sep 24
Sep 22 2018
Sep 21 2018
Here is a look at the ratio of pageloaded events from the PageIssues schema to all applicable views on lvwiki (more precisely, mobile web (-domain) pageviews to mainspace pages, excluding spider views).
On Latvian Wikipedia, the ratio of pages with issues is around 10%: https://quarry.wmflabs.org/query/29838 (using the above approach to count Ambox-using pages, adapting @TheDJ's queries and combining them into one)
Thanks @TheDJ! (Also for taking care to only count distinct pages; the query used by the templatecounts tool that the task description proposed for this question actually counts multiple template occurrences on the same page separately, cf. T201123#4476734.)
Thanks for confirming!
The only caveat is that fields (dimensions) with high cardinality, like pageToken, sessionToken, pageTitle, and pageIdSource, perform very badly in Druid, so I would blacklist them from Druid ingestion if possible.
Yes, that can be ruled out. Compare the PageIssues event rate from T204609#4601701 (or below) with e.g. the print button event rate of the Print schema (lvwiki, Minerva, sampled at 10%).
- events triggered errors due to uri length and were not processed
Can be ruled out. That was a very rare occurrence even in T196904 (where the event query string contained a page title / URL twice, and we only have one page title field here). Besides, it wouldn't explain the inconsistent logging for the same page in the https://lv.m.wikipedia.org/wiki/Filozofija example.
- events havent made it to hive yet
Super unlikely. (Other schemas, e.g. Print, don't seem to be seeing such a delay. And repeating the query from T204609#4601701 >13h later doesn't show any retroactive increases in the events logged.)
I'll think about other reasons in the meantime...
No red flags in issuesVersion, isAnon, and namespaceId either.
SELECT
  event.issuesVersion AS issuesVersion,
  COUNT(*) AS events
FROM event.pageissues
WHERE year > 0
GROUP BY event.issuesVersion;
And the distribution of values of the sectionNumbers and issuesSeverity fields looks plausible too - at least there are a lot of different kinds of combinations represented.
Here is the distribution of actions so far. This does not look impossible a priori (although it would mean quite a low issue clickthrough ratio of <=2% in both control and test). So the missing events are not caused by an entire category of actions missing.
Sep 20 2018
On the other hand, here is a list of the 100 pages that have generated the most events so far.
We have over half a day's worth of data in the table now, including from daytime hours in Latvia, and
the caches should have caught up. But the event rate remains surprisingly low (see Grafana and query below) - about 1-2 events per minute, whereas lv.m.wikipedia.org receives 70-80k views/day currently (https://stats.wikimedia.org/v2/#/lv.wikipedia.org/reading/total-page-views/normal|bar|1-Month|access~mobile-web ) or around 40-60 views/minute. Maybe content quality is very high on this wiki... (Will do further checks.)
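The per-minute figure follows directly from the daily view counts quoted above, i.e. simple arithmetic:

```python
# ~70-80k mobile-web views/day on lv.m.wikipedia.org
for views_per_day in (70_000, 80_000):
    print(round(views_per_day / (24 * 60)))  # 49 and 56 views/minute
```

So even at 100% sampling and perfect delivery, 1-2 events/minute against ~50 views/minute is a factor of 25-50 below the pageview rate.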
By the way, can we find out / keep track of which other schemas are now using this newly standardized pageview token?
We haven't really decided in which direction to take this task from here. Which of the above combinations would be most useful to extend to the other four languages? (logged-in vs. all views, mobile domain vs. mobile + desktop)
Here is the (or an) answer to the second question from the task, about the top 50 pages outside the article namespace (enwiki, July, logged-in views, desktop+mobile).
OK, the pageissues table just materialized in Hadoop with the data from the first hour - 19 events, 14 of which seem to be your test views. Let's wait a bit for the caches...
Sep 19 2018
And here is the analogous list of special pages with the most logged-in views on the mobile site (en.m.wikipedia.org in this case), also for July 2018. Comparing with the above result, one finds e.g. that 6% of views to Special:Watchlist are on the mobile domain.
To add a clarification from kickoff, for the record: This will not switch on the instrumentation, so we would still need to resort to other means for checking that events are being sent correctly on a particular wiki.
Seconding this request:
@Groceryheist will be working with @ovasileva and myself on this project under a WMF contract, doing research on understanding reader behavior, focused on ReadingDepth EventLogging data (in combination with data from webrequest and pageview_hourly).
The three access groups listed in the task are what we have determined as necessary for this work (per https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups and https://wikitech.wikimedia.org/wiki/SWAP#Access ).
Sep 18 2018
- On English Wikipedia, there were a number of sudden drops on desktop between May and July 2017, where the avg return time within 31 days decreased from around 5.5 to 1.0 or 2.0 days. Similar drops during this timeframe were also seen for Wikimedia and Wikisource projects, and for the US, Japan, France, and Russia. I’ll investigate further by looking through the raw dataset and using daily histograms of return time around those dates.
Interesting! So it seems that the average may have been integer-valued on these drop days in F26025897? That would point to a data artifact.
This would be great. Just to double-check (apologies if that's a naive question): Would that query parameter survive across several issue clicks and modal clicks? e.g. https://en.m.wikipedia.org/wiki/Pharmacovigilance?pageissues=new2018 --> https://en.m.wikipedia.org/wiki/Pharmacovigilance?pageissues=new2018#/issues/all