Leaving this here since we were talking about it earlier today in this context:
- Queries
- All Stories
- Search
- Advanced Search
- Transactions
- Transaction Logs
Mar 12 2019
Mar 11 2019
In T217851#5014982, @ovasileva wrote:
[...]
@Jdlrobson - a question here. Are we able to add anything outside the menu to the schema (for example the history button, things that will go into the overlay menu, etc)?
I guess we should have been more specific in the task description - "functional" is not quite synonymous with "sending data" ;) (I had already run basically the same query beforehand...) But presumably the above also means that the underlying instrumentation code looks solid at first glance?
Mar 9 2019
I have updated T207280 with more concrete proposals on instrumentation and metrics based on the above discussion.
I started to draft a schema at https://meta.wikimedia.org/wiki/Schema:MobileWebShareButton , per the discussion at T181195#4982077 ff. It is partly modeled after the Print schema, e.g. includes an event for when the button is shown (triggered during a normal page load) so we can calculate the button's clickthrough rate directly, which is a more useful success metric than the absolute number of clicks.
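As a rough illustration of the metric this enables (a sketch with made-up action names, not the schema's actual field values), the clickthrough rate is simply clicks over "shown" impressions:

```python
# Hypothetical sketch: clickthrough rate from impression ("shown") and click
# event counts, as enabled by logging an event when the share button is shown.
# The action names here are illustrative, not the schema's actual values.

def clickthrough_rate(events):
    """events: iterable of dicts with an 'action' key ('shown' or 'click')."""
    shown = sum(1 for e in events if e["action"] == "shown")
    clicks = sum(1 for e in events if e["action"] == "click")
    if shown == 0:
        return 0.0
    return clicks / shown

sample = [
    {"action": "shown"}, {"action": "shown"},
    {"action": "shown"}, {"action": "click"},
]
print(clickthrough_rate(sample))  # 1 click over 3 impressions
```

Without the "shown" event we would only have the numerator, i.e. absolute click counts.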
In T207280#4701474, @phuedx wrote:
...
Relatedly, we could query the wmf.webrequest table for requests with a specific query parameter rather than include the code that I wrote in T207280#4676470.
Pro: We wouldn't have to write any server-side code like in my comment above.
Con: We wouldn't be able to plot a graph in Grafana without any server-side code.
Both this solution and the solution in my comment above would be impossible if we were to use hash fragments, as they aren't sent to the server.
BTW, one can actually get a graph in Turnilo when using a wprov query parameter, using the (albeit heavily sampled) webrequest dataset there. See e.g. this example for the existing "Share a link to an article (from lead image toolbar, or link preview)" Android app feature. With the usual benefits ;)
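As a side note on what such filtering involves, here is a minimal Python sketch (not the production pipeline; the wprov value below is made up for illustration) of pulling the parameter out of request URLs, the way one might filter webrequest rows by provenance tag:

```python
# Illustrative sketch only: extract a wprov provenance value from a request
# URL's query string. The value "xyz1" is a made-up example, not a
# registered wprov code.
from urllib.parse import urlparse, parse_qs

def extract_wprov(uri):
    """Return the wprov value from a URL's query string, or None."""
    qs = parse_qs(urlparse(uri).query)
    values = qs.get("wprov")
    return values[0] if values else None

url = "https://en.m.wikipedia.org/wiki/Example?wprov=xyz1"
print(extract_wprov(url))  # -> xyz1
```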
Mar 8 2019
Mar 7 2019
In T214180#5009483, @pmiazga wrote:
@Tbayer is there an official way to reserve a wprov value? What do we have to do to introduce a wprov value for AMC shares? I hope it's just editing the wiki page and reserving a new value.
In T214180#4934790, @pmiazga wrote:
@nray - yes, that is correct, and for now let's keep it that way. I know that there are some ideas to remove the mobile domain (I have no idea if this is going to happen)
@alexhollender - we can add some param to the URL, like https://www.mediawiki.org/wiki/Special:MyLanguage/Talk:Reading/Web/Advanced_mobile_contributions?amc_source=optin (I wouldn't worry about caching as this is served to logged-in users only).
The official, cache-kosher way to do this is to use a wprov parameter, which also makes querying the data slightly easier. But as you note, cache fragmentation shouldn't be a big issue here, also because it's just about two pages ;)
Then we would have to query Hive to get the number of pageviews with that amc_source GET param. Difficulty - easy to medium. /cc @Tbayer
And once we have confirmed that the schema is working for this purpose, we need to extend it with a field indicating whether the user has enabled AMC (should probably be a separate task).
We should also document the sampling method and ratio on the talk page. (Apparently InitialiseSettings.php sets a default value of 0.5, but that's obviously not the actual rate.)
In T214998#4929968, @tstarling wrote:
It complicates SEO in the sense that, when I wrote this task, I was looking at Google Search Console for a few of our domains with an eye towards SEO for sister projects, and found that mobile traffic was split 50/50 between the dashboards for the m and non-m subdomains. So it was hard to draw any conclusions without manually aggregating the data.
In T214998#5005391, @Krinkle wrote:
In T214998#4929700, @Jdlrobson wrote:
[..] This feels like an RFC to me. [..]
I've put it in our inbox to discuss this (or next) week, to figure out if it needs an RFC, and if not, we'll suggest an alternate facilitator to help solve the use cases and problems described in this task.
Mar 6 2019
In T201339#4999964, @Niedzielski wrote:
...
- We won't be implementing this server side as part of this ticket.
- The reasons for frontend changes only (care of @Jdlrobson):
- Context is important. If a user is sharing a URL in IRC or elsewhere it's probably meant to contain action=edit.
(I understand you mean "it's probably meant for the 'edit page' use case, not for the 'access contributions and other user-centered tools' use case".)
Mar 5 2019
In T203498#5001047, @Neil_P._Quinn_WMF wrote:
It looks like CDH 6.1, which includes Hive 2.1.1, was released in December.
@elukey, what's the current thinking about deploying this? I'm sure there are many complexities: going from CDH 5 to 6 sounds like a tricky upgrade, I know there's been discussion of switching from CDH to Hortonworks or BigTop, and I've heard the larger plan is to move away from Hive and towards Presto anyway.
Is that indeed the plan? (The linked page doesn't mention whether we intend to actually abandon Hive altogether, or just to add Presto as an alternative for certain use cases like the Public Data Lake.)
If yes, what is the anticipated timeframe for this? Depending on how long we are still going to use Hive, an upgrade would still seem worthwhile.
Mar 4 2019
In T217438#4997575, @elukey wrote:
@Tbayer what kind of access is needed? I guess analytics-privatedata-users but just want to be sure :)
Per https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Host_access_granted that looks correct, but @Nuria / Analytics Engineering should be able to confirm for sure.
Mar 3 2019
Mar 2 2019
(Update: we discussed this earlier this week and decided that a simple EventLogging schema would be useful. I'm going to create one, but there are still some relevant questions open in the other task, see T181195#4982077 ff.)
It seems that the Web Share API also supports a "text" field to accompany the shared link. Do we want to make use of this or just pass the naked link (plus maybe page title) for now?
See also T56829 and various other tasks linked there.
In T181195#4982077, @alexhollender wrote:
@pmiazga here is a first stab at filling out the Beta feature "template". We can discuss and iterate when we meet.
- Feature rationale: we believe that by adding a share button to the article page on mobile web people will share Wikipedia articles more often, ultimately leading to more knowledge being spread and consumed.
- What we want to learn: we want to learn if people will discover and use this feature. We know that there is a share functionality built into most browsers, however our hypothesis is that since our share button will be more front-and-center, it will help make sharing top of mind for users, as well as facilitate an easier sharing process.
- Instrumentation: [TBD] T207280
- Success/failure criteria: we will be tracking the number of taps to the share button. We will control for experimental/exploratory taps.
This seems a good idea but might be a bit of a challenge regarding the instrumentation (see above).
If the number of taps is above __ we will consider the feature successful.
I guess this means the number of taps per pageview? And what might be a good way to get a baseline/benchmark (to fill in the __)?
- Duration: we think that four months worth of data will give us a sufficient understanding of how the feature is being used.
How did we arrive at that number?
In T181195#4804640, @pmiazga wrote:
can we instrument it in a way so that we can find out if they followed through with sharing the link?
Yes, that's possible - we need to add some argument to the shared URL, something like ....?share_btn; then we will be able to verify how many users are visiting Wikipedia via share links.
But sharing the link and clicking on the shared link are two different actions, taken by (usually) two different users.
Feb 27 2019
In T216096#4976316, @mforns wrote:
@Tbayer event_sanitized.readingdepth is backfilled using the new whitelist.
I have vetted the resulting data and it looks good to me, but please do a quick check.
I spot-checked by comparing the daily number of events with each sample field set between the sanitized and unsanitized version, and they matched. Thanks again!
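The spot-check described above can be sketched roughly as follows (the count dicts stand in for per-day query results from the sanitized and unsanitized tables; all numbers are made up):

```python
# Minimal sketch of the vetting step: compare daily event counts between an
# unsanitized dataset and its sanitized/backfilled copy, and surface any
# days where the counts diverge. The data here is invented for illustration.

def mismatched_days(unsanitized_counts, sanitized_counts):
    """Return the days whose event counts differ between the two datasets."""
    days = set(unsanitized_counts) | set(sanitized_counts)
    return sorted(
        d for d in days
        if unsanitized_counts.get(d) != sanitized_counts.get(d)
    )

raw = {"2019-02-01": 1000, "2019-02-02": 980}
clean = {"2019-02-01": 1000, "2019-02-02": 975}
print(mismatched_days(raw, clean))  # day(s) where the backfill lost events
```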
The target in the annual plan was:
"Number of edits on mobile increases by 20% year-over-year in target languages. "
We should be aware that this is from the org-wide annual plan, not from the more specific Audiences annual plan, which gives 10% as the target instead - e.g. "Mobile web edit rate on target wikis: 10% increase", and similarly for the other three related team goals. Also, "target wikis" are interpreted very differently there; see e.g. T210660 for the web team's version. I believe the apps teams may not be using the mobile-heavy segment for this either.

Where did we document the decision to interpret "target wikis" in the org-wide annual plan as mobile-heavy wikis? This would also be useful input for T215976. (Back in July, there had been a sense that these are two different things - I will also follow up on an email thread from back then.)
In T213488#4985201, @elukey wrote:
...
Also @HaeB, do you have an example of dashboard that I can use to trigger this issue? I tried today and failed to reproduce :(
Actually it works for me too now, on the same view where I previously encountered this error (result now; compare to the unsmoothed chart I ended up using instead in our presentation last month). Good news! Did anything change since the time this bug was filed? (I'm seeing "0.26.3-wikimedia2" in https://superset.wikimedia.org/static/assets/version_info.json .)
Feb 26 2019
In T216628#4986235, @alexhollender wrote:
@ovasileva @Tbayer proposed text:
We are actively developing new features for advanced editors (contributors?). By turning this mode on you will automatically get new features as they are released. Any feedback or collaboration would be appreciated.
Shouldn't this still somehow indicate it's a mobile-only feature that won't affect people's desktop editing experience? (The "Over the next few months..." text that follows doesn't clarify this either. And btw, it seems oddly time-bound - are we going to update it after the rollout starts?)
Feb 25 2019
In T214524#4968923, @elukey wrote:
@Tbayer should be fixed now! Thanks for the report!
Marked the list as final in the task description, and added some more detail from the recent investigations in e.g. T215597. I also clarified for pending changes that we should not count automatic approvals of pending edits, only manual ones.
@pmiazga Let's talk about this and avoid rolling our own here - e.g. there is already an existing mechanism that avoids cache fragmentation: https://wikitech.wikimedia.org/wiki/Provenance
Feb 22 2019
Per discussion with @ovasileva , I'm closing this task now, and splitting off the investigation of the timing anomalies into T216852. (To recap, while we found out a lot of things about how the underlying timers work and reached a level of confidence that these anomalies should not materially affect the main takeaways for this A/B test, we have not yet found a satisfying explanation for them.)
Closing this now (I trust @mpopov can handle any followup questions, or may already have done so in other contexts).
For the record, in case they might be of use for future reference, below are some of the results I had shared in form of a SWAP notebook back in October 2017:
I'm going to set aside some time again on Monday (Feb 25) to wrap this up, including documenting what we now know about the data issue with the referrers (and whether/how it might affect the validity of the results for this request). Let me know in case the needs here have changed in the meantime, or also if anything else occurred to you that should be considered in the analysis.
PS: There is a detailed discussion of how various time-related fields in the Popups schema are generated at T182314#3956099 .
Following up here:
Yup, thank you DBA crew! Your work is very much appreciated.
Feb 21 2019
In T216658#4973662, @Ottomata wrote:
I haven't looked into it, but the naming of PrefUpdate_5563398_15423246 is unusual. IIRC, tables with an extra suffix are some kind of backup or archive table that exist because of either some migration or bug. I'm sure it has real data in it, but possibly the collation mismatch has something to do with some old data issue or migration?
I guess this comment crossed with T216658#4973658 - see the explanation there. One could for example double-check if any of the changes implemented in T160454 could have affected collation.
In T216658#4973620, @Tbayer wrote:
In T216658#4972706, @Milimetric wrote:
This sucks but we're not likely to work on it, as we're moving away from mysql. We don't want to be mean though, so we can help sqoop this stuff into Hadoop if you need to use your painful workaround too much.
Understood, but do we know the reason for this discrepancy? Does it have to do with the general changes to the event capsule (T179625, see also T179540) that happened in between PrefUpdate_5563398_15423246 and PrefUpdate_5563398 ?
PS: actually I think that happened later; rather, the relevant capsule change was T160454 (I also just added that to the documentation).
If it's an issue that affects more than one EL schema, it would be worth documenting it on e.g. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging , with a link to @nettrom_WMF's workaround.
In T211197#4951773, @phuedx wrote:
@Edtadros and I worked through testing the 5th acceptance criterion together during our 1:1 today.
Here are the server-side EventLogging events that I captured while opting in and out of AMC mode on http://reading-web-staging.wmflabs.org/wiki/Special:MobileOptions:
[...]
To the "and compatible" part of the AC: note well the "clientValidated": true in the events above.
Feb 20 2019
Thanks! Out of an abundance of caution, I double-checked that the "value" values are consistent with the "isDefault" values, which seems to be the case:
In T214093#4966772, @Ottomata wrote:
Technically there could be 2 different headers that differ only by _ vs -.
Good thing we are doing this as map type with string keys then. Specific (object/struct) field names should never have '-' in them. :)
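The collision concern above can be illustrated with a toy sketch (the underscore-variant header name is hypothetical; real HTTP headers conventionally use dashes):

```python
# Sketch of the concern above: two distinct HTTP headers can collide if
# '-' and '_' are folded together, which is why storing headers as a map
# with string keys (rather than as struct fields) preserves the distinction.

def to_struct_field(header_name):
    """Fold a header name into a struct-style field name (lossy)."""
    return header_name.lower().replace("-", "_")

headers = ["WMF-Last-Access", "WMF_Last_Access"]  # second name is hypothetical
fields = {to_struct_field(h) for h in headers}
# Both headers collapse to the same struct field name...
print(len(fields))        # 1
# ...whereas a map keyed by the raw header string keeps them apart.
print(len(set(headers)))  # 2
```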
In T214093#4970109, @EBernhardson wrote:
In T214093#4970086, @Tbayer wrote:
In T214093#4969454, @EBernhardson wrote:
I don't think the actual date inside the WMF-Last-Access header makes any difference.
I assume the discussion here is confined to the particular CirrusSearch use case, correct?
(The actual date is definitely being used elsewhere currently and could be relevant to various other use cases of MEP.)
Sure, but the context was important. All pages visited will set your WMF-Last-Access data to today. Unless you can guarantee that your event stream will fire on the first possible web request that a user makes in a day, the WMF-Last-Access data will be useless outside of the webrequest table. It will simply say "today" (but with a date).
In T214093#4969454, @EBernhardson wrote:I don't think the actual date inside the WMF-Last-Access header makes any difference.
I assume the discussion here is confined to the particular CirrusSearch use case, correct?
(The actual date is definitely being used elsewhere currently and could be relevant to various other use cases of MEP.)
If we are trying to track revert rates closer to real time, our current best strategy is querying the API and using the mw-reverts package. However, this isn't very performant.
Indeed, but mwreverts also offers the option to use the (MySQL replica) database instead of the API, which should be much faster.
(The db option did not work on PAWS last time I tried to use it there. I filed https://github.com/mediawiki-utilities/python-mwreverts/issues/8 about this, @Halfak looked a bit into it and said it should work there too in principle, but would need some work fixing.)
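For illustration, here is a from-scratch sketch of the identity-revert detection that tools like mwreverts implement - a revert is an edit whose content checksum matches some earlier revision's checksum within a sliding window. This is not the library's actual API; the windowing and edge cases are simplified:

```python
# Hedged sketch of identity-revert detection (the technique mwreverts
# implements), written from scratch for illustration - not the library's API.

def detect_reverts(checksums, radius=15):
    """Return (reverting_index, reverted_to_index) pairs for identity reverts.

    checksums: revision content checksums in chronological order.
    radius: how far back to look for a matching earlier revision.
    """
    reverts = []
    for i, sha in enumerate(checksums):
        window_start = max(0, i - radius)
        for offset, earlier in enumerate(checksums[window_start:i]):
            # A match means revision i restored an earlier revision's content;
            # skip no-op saves identical to the immediately preceding revision.
            if earlier == sha and checksums[i - 1] != sha:
                reverts.append((i, window_start + offset))
                break
    return reverts

# Revision 3 restores revision 1's content, reverting revision 2.
history = ["aaa", "bbb", "ccc", "bbb"]
print(detect_reverts(history))  # [(3, 1)]
```

Against the replica database, one would feed this the rev_sha1 column instead of fetching content via the API.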
Feb 19 2019
In T214444#4949168, @Tbayer wrote:
I understand this shouldn't have any impact on the logged events and their data (except maybe as a consequence of the performance improvement in general), but please flag it in case that assumption turns out to be wrong.
Feb 18 2019
To wrap this up, I extended the above queries for all Wikipedias (using a PAWS notebook).
In T193051#4947170, @Lea_WMDE wrote:
Thanks @Tbayer! I had a look now and I am wondering if the data for page previews is still being tracked
- The Popups schema is no longer collecting data, although you should be able to reactivate it with a simple configuration change, as we haven't yet removed the instrumentation code (this task).
- The less detailed VirtualPageView schema is still sending data. Its main purpose is to provide aggregated content consumption numbers (how often a given page has been previewed for >1sec) that are stored in the Virtualpageview_hourly table - rather than answering product questions about how the previews feature is being used per se; we used the Popups schema for that.
/ if we still have data from previous tracking endeavours that I could look at concerning the questions outlined in T214493?
Yes, there is still data in the usual places where EventLogging data is stored, e.g. the event.popups Hive table. (I guess you may already have looked at the results published at https://www.mediawiki.org/wiki/Page_Previews/2017-18_A/B_Tests and perhaps the further details in the Phab task(s) linked from there.)
Feb 17 2019
Feb 16 2019
Yes, that should be a separate task (and may require involvement from other teams).
Feb 15 2019
Here is a first result: the top 15 by pageviews for January 2019, with known bots/spiders excluded. (To get the domain, combine project and access method - e.g. "it.wikipedia" "mobile web" means it.m.wikipedia.org, "en.wikipedia" "desktop" means en.wikipedia.org.)
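The project/access-method combination described above can be sketched as follows (a simplification that assumes a language-prefixed project name like "it.wikipedia"; projects without such a prefix would need extra handling):

```python
# Sketch of reconstructing a hostname from the pageview data's "project"
# and access-method columns, per the examples above. Assumes the project
# value has a language prefix; other shapes are not handled here.

def to_domain(project, access_method):
    """Combine a pageview 'project' and access method into a hostname."""
    if access_method == "mobile web":
        prefix, rest = project.split(".", 1)
        return "{}.m.{}.org".format(prefix, rest)
    return "{}.org".format(project)

print(to_domain("it.wikipedia", "mobile web"))  # it.m.wikipedia.org
print(to_domain("en.wikipedia", "desktop"))     # en.wikipedia.org
```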
However, it seems to be missing a few domains, like when I query for blog.wikimedia.org (or phabricator.wikimedia.org)
As mentioned (admittedly somewhat obliquely) on the documentation page linked in my email, the pageview data is limited to "production sites", which currently does not include blog.wikimedia.org and phabricator.wikimedia.org. There is some traffic data for both domains in other places, but we can be pretty certain already that neither of them are in the top 15 domains by pageviews, so it's probably not worth retrieving numbers for these for this purpose.
Feb 14 2019
I guess this task can be closed now?
Great, thanks a lot! The sample fields were introduced in September, so no need to go further back. (CC @Groceryheist )
Blocked on code review and an answer to T216096#4953210 from someone familiar with the whole EL pipeline and the purging mechanism (@mforns?).
@Jdrewniak points out that in https://github.com/wikimedia/mediawiki-skins-MinervaNeue/blob/f07985c6dee5106da8f381a47214e7349fcd147e/resources/skins.minerva.scripts/pageIssuesLogger.js#L65 the spelling is still page-issues-b_sample/ page-issues-a_sample (i.e. like on the schema page, not like in Hive).
NB: These sample field names are spelled with underscores in Hive (e.g. page_issues_b_sample, see below) but with dashes on the schema page (e.g. page-issues-b_sample ). Which version does the whitelist require?
Will it be possible later to backfill/update/extend either virtualpageview_hourly or pageview_hourly with data derived from this table? (cf. T212414#4864672)
PS: patch is at https://gerrit.wikimedia.org/r/490514 (seems @gerritbot is lagging a bit currently)
It looks like we had forgotten to whitelist the actual pageID field in addition to the page title, probably because it was only introduced shortly after this task was created (it's in the current version of the schema page but not yet deployed). I should have caught that before +2ing Nuria's patch. I submitted a fix as part of 209051, also for the related revision ID field.
Found (and fixed) an oversight regarding ReadingDepth: T216096
Feb 13 2019
In case it's useful, keep in mind that it's possible to query the webrequest table for more detail on the event in question: