Tue, Nov 21
If the Popups experiment is over and the volume of events remains low, we can re-enable MySQL imports for it.
I see, but that will be a problem for the Popups schema (and possibly others that are no longer stored in MySQL), as the advice in the documentation doesn't work for them: "If you want to access EL historical data (that has been kept for longer than 90 days), you'll find it in the MariaDB hosts".
So we should exempt that table until the proper purging strategies are implemented on Hive too. Is there already a task for that BTW?
So, not very often. Only about 6 invalid events in the last ~24 hours.
Interesting, but the Popups schema has a very low event rate in general, because the experiment was deactivated last week (T178500). It would be good to know the ratio of errors to correctly logged events during the time of the experiment.
There was quite a bit of testing, see e.g. the link in my previous comment. But only on desktop, as far as I'm aware. Hence the question.
From https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention_and_auto-purging#Work_in_progress I understand that this will implement the existing purging whitelist. I'll clarify the task description accordingly.
Totally agree about the usability issues described. But I'm not quite sure I understand the additional data argument - what exactly is meant by "Number in file namespace (6) seems high"? That users are more likely to tap the button on file pages than on others? Such a conclusion would need a comparison with the pageview numbers in general.
Mon, Nov 20
Sat, Nov 18
As already alluded to in the comments to the doc, this has tradeoffs (saves some bytes of transferred data, but makes many queries more complicated and slower, and could also be an impediment to ingesting data into Druid/Pivot).
Fri, Nov 17
Thu, Nov 16
Thanks! This has stopped now (T178500), so feel free to go ahead.
Not sure what happened between 9 and 12:40 (but a similar gap without events, followed by a spike, happened on Popups: https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&from=now-24h&to=now&var-schema=Popups)
BTW, as noted by @Jdlrobson over at T179914#3764603 , there was a weird gap followed by a spike earlier on Nov 15, a few hours before the stop of the test, which similarly happened for the Print schema.
@Jdlrobson and I discussed this a bit more on IRC right after T178500#3764662 . Apparently the "blocked until Wednesday" note in the task description had caused some confusion, although I'm not seeing anything unclear about the subsequent, more specific wording "It should be stopped on Thursday, Nov 16th, after we have collected four full weeks of data". (For those unfamiliar with the rationale: it is much preferable to do analysis on timespans of entire weeks, because the strong weekly (and daily) seasonality of reader behavior might otherwise distort results. And after launch, the experiment took at least a day to reach the full event rate, clearly an effect of caching, which we had similarly observed in previous iterations.) Jon and I briefly discussed re-enabling it at the next opportunity a few hours afterwards, but that would not have served this purpose of addressing seasonality.
This is a bit late in the game, but did we ever test the Schema:Print instrumentation for mobile/Minerva?
Recall that there had been a bit of confusion at T169730: Define and implement instrumentation for printing on desktop web, which (as the task name still says) was initially intended for desktop only, but came to be extended to mobile later. However, that was only after @bmansurov and I had done our testing rounds.
Wed, Nov 15
For context, this came out of this discussion on Facebook.
The goal here is to first confirm the conclusion that the bot/spider by that particular organization that came up in my quick spot check is indeed causing these anomalies across the board, and then to enable @Effeietsanders to contact them so that they can update their user agent to be in line with the request at https://meta.wikimedia.org/wiki/User-Agent_policy of including the string "bot", which would prevent this from distorting the pageview data for such articles.
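As a rough illustration only (the pattern and function name below are hypothetical, and the actual pageview definition uses a more elaborate spider regex), a self-identification check along the lines requested by the User-Agent policy could look like:

```python
import re

# Simplified, hypothetical sketch of spider detection based on the
# User-Agent policy's request that automated clients include the
# string "bot" in their user agent.
SPIDER_PATTERN = re.compile(r"bot|spider|crawler", re.IGNORECASE)

def is_self_identified_bot(user_agent: str) -> bool:
    """Return True if the user agent self-identifies as an automated client."""
    return bool(SPIDER_PATTERN.search(user_agent))

print(is_self_identified_bot("ExampleOrgBot/1.0 (+http://example.org)"))        # True
print(is_self_identified_bot("Mozilla/5.0 (Windows NT 10.0) Firefox/57.0"))     # False
```

Once the organization adds "bot" to their user agent, requests like these would be bucketed as spider traffic rather than user pageviews.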
Wait, this was meant to be deployed tomorrow, not today. See task description.
(To record some more information here while other conversations are ongoing:)
To obtain some examples, one could start from Z591#12542 (@Jdx' most recent report on files that had already been deleted but needed purging) and search Special:Log on the corresponding wiki for the file names derived from each URL, arriving at these entries.
Unless I'm overlooking something, there are no encoding discrepancies in these four examples.
Tue, Nov 14
Clarified the intention of the last checkbox item per today's standup.
@mforns It's not necessary for analysis purposes, but can't hurt much either.
BTW I will follow up on some other loose ends here soon and then close this task.
Mon, Nov 13
Note that we may still want to reactivate it, probably with a lower rate, to measure a new thing after T180036: Instrument time to first user link interaction is implemented.
Could this be held off for two more days, until the data collection for this one ends (T178500)? Having to join two tables with incompatible formats is likely to add a lot of unnecessary complexity to the analysis.
Fri, Nov 10
Considering that the Popups schema is restricted to sendBeacon capable user agents anyway,
and comparing https://caniuse.com/#feat=beacon with the "Browser compatibility" section at https://developer.mozilla.org/en-US/docs/Web/API/Performance/now , I guess that the theoretical answer is yes, in this context.
If not: what, for completeness, should we send in its absence? I'm guessing undefined should be fine?
Something that ends up as NULL in the EventLogging MySQL table might be best.
Wed, Nov 8
In the longer run, analytics is removing the mysql server that hosts event logging next quarter, so things are going to have to move off mysql anyways sooner than later.
I was curious about the provenance of this statement, so Erik and I talked a bit about this yesterday. It turned out to be based on a remark by @Milimetric on IRC last week, but @Milimetric has since clarified that while the Analytics Engineering team is starting to recommend moving schemas/analyses to Hive, there are no set plans to switch off MySQL access at this point. (In fact Ops spun up a new MySQL server for EventLogging just last week, which I understand alleviates some of the immediate infrastructure concerns.) The Analytics Engineering team has previously stated that they don't want to take decisions about the future setup of EL unilaterally (T159170#3064701).
Tue, Nov 7
Do we already have experience creating histograms such as the above in Grafana? (keeping in mind T179426#3737738 )
Also, would we aim to send this metric only for the first (earliest) link interaction, or select the minimum (per pageview) server-side using a page token?
Thanks @phuedx! So I think this discrepancy shows it's better to rely on a longer timespan. Below, I have extended it to four weeks (Oct 1-28), which is actually OK to do provided one is content to relegate one's query to the new "nice" queue on Hive (and prepared to wait a bit longer in case there are more timely queries running in the normal, non-nice queue; even so, this query took less than two hours for these four weeks' worth of data).
And here is the same data in form of a cumulative histogram, to make it easier to read out percentiles (e.g. the median is around 5 seconds, the tenth percentile is <0.5 seconds - again, subject to rounding errors):
Here is a histogram:
As indicated above, the restriction to integer timestamps introduces some rounding errors, basically smearing out the graph a bit horizontally.
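For illustration, here is a hedged Python sketch of building such a cumulative histogram from second-resolution values and reading out percentiles; the exponential sample data is synthetic and merely stands in for the real measurements:

```python
import random
from collections import Counter

# Synthetic example data: time values rounded to whole seconds,
# mimicking the one-second resolution of the server-side timestamps.
random.seed(0)
times = [round(random.expovariate(1 / 7.0)) for _ in range(10_000)]

# Cumulative histogram: fraction of events at or below each value.
counts = Counter(times)
total = len(times)
cumulative = []
running = 0
for value in sorted(counts):
    running += counts[value]
    cumulative.append((value, running / total))

def percentile(p: float) -> int:
    """Smallest value whose cumulative share reaches p (subject to the
    same rounding error discussed above)."""
    return next(v for v, share in cumulative if share >= p)

print(f"median = {percentile(0.5)}s, 10th percentile = {percentile(0.1)}s")
```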
Mon, Nov 6
I don't quite understand the "cannot evolve on its own" argument; isn't that the case for any and all schema pages on Meta? (They are all tied to code, whether generic or instrumentation-specific.)
Thanks anyway, @Ladsgroup! Just to confirm and for the benefit of others who may pick up this task one day: Did your exploration include any of the options mentioned in T166752#3336898 (adding the image information to the beginning of the description, or adapting a watermarking plugin)?
Fri, Nov 3
No, these are not in the utc-millisec format, at least according to the documentation I can find: https://www.npmjs.com/package/json-gate : "A number or an integer containing the number of milliseconds that have elapsed since midnight UTC, 1 January 1970." There are no milliseconds in this data; these are integer Epoch seconds only.
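A quick sanity check along these lines: interpreting one of these integer values as utc-millisec would place the event in January 1970, whereas interpreting it as Epoch seconds gives a plausible 2017 date. A sketch (the sample value is hypothetical):

```python
from datetime import datetime, timezone

# Hypothetical sample value: an integer Epoch value with no sub-second part.
ts = 1510876800

# Read as utc-millisec (milliseconds since 1970-01-01 UTC), the value
# lands shortly after the Epoch -- a telltale sign that the data is
# actually in Epoch seconds.
as_millis = datetime.fromtimestamp(ts / 1000, tz=timezone.utc)
as_seconds = datetime.fromtimestamp(ts, tz=timezone.utc)
print("as milliseconds:", as_millis.isoformat())  # 1970-01-18...
print("as seconds:     ", as_seconds.isoformat())  # 2017-11-17...
```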
Thu, Nov 2
Closing this now - the aforementioned daily query is still running, but now automated (T175227). I have also been sharing some insights from observing the data it has been generating (e.g. here and here).
Interesting findings! Food for thought... we should probably reach out to other users of this data to get more input on the best choice going forward; how about posting to Analytics-l?
@phuedx Thanks for documenting the query used (T177969#3687130 )! Can you also specify the timespan for which it was run? (I.e. the concrete values of M, N, O, P and Q.) I re-ran it for a different timespan - October 31 - and got quite different results. E.g. 38.15% for Chrome 61 on Windows 10 instead of 8.83%, but only 6.55% for "Other" instead of 14.38%, etc. This may be because the total number of downloads in the timespan used was too low, and hence the statistical error (random variation) too large.
Wed, Nov 1
Regarding bonus question 1, I took a quick look at that curve in Pivot (restricted to North America because of timezones), but this doesn't seem to be something that can be eyeballed easily. Can discuss more in person.
Regarding the second question: day-7 retention does not seem to have changed notably with the rollout of 5.6.0.
Yes, the Popups schema has both a pageLoaded event and events for every link interaction, so this is doable (assuming pageLoaded is a good starting point to count this time from). Be aware though that it will need to be based on the server-side timestamp field, which only has a resolution of one second (combined with the client-side totalInteractionTime field, which has a millisecond resolution).
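As a sketch of what combining the two fields could look like (assuming EventLogging's 14-digit server-side timestamp format; the sample values and variable names are hypothetical):

```python
from datetime import datetime, timedelta

# Hypothetical timestamps in the 14-digit server-side format
# (one-second resolution), plus the client-side totalInteractionTime
# field, which has millisecond resolution.
FMT = "%Y%m%d%H%M%S"

page_loaded = datetime.strptime("20171106120003", FMT)
link_event = datetime.strptime("20171106120008", FMT)
total_interaction_ms = 4730

coarse = link_event - page_loaded                       # second resolution
refined = timedelta(milliseconds=total_interaction_ms)  # ms resolution
print("server-side estimate:", coarse)   # 0:00:05
print("client-side duration:", refined)  # 0:00:04.730000
```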
[15:47:59] <ottomata> HaeB: no storage issues in hadoop. we are maintaining a temporary custom import/refine job for this schema, while we work on more generically supporting eventlogging data in hadoop
[15:48:23] <ottomata> i think we can keep running the custom job for yall a while longer, seems fine with me
From my understanding, there is no space issue on Hadoop, so it would be no problem to continue the test for say a week or two. But that's the area of expertise of Analytics Engineering and/or Ops - I'll ask in #wikimedia-analytics to confirm.
Tue, Oct 31
Mon, Oct 30
Yes, it's an interesting theory, but please note that the reports in that channel are not listing all new files or deleted files in general, but those likely to be WP0 piracy uploads. And the uploaders of these problematic files have adopted a naming scheme that includes parentheses.
Oct 23 2017
I guess the state of the art here may be ua-parser, which indeed seems to rely on the presence of the string "Android" (with this exception to exclude mobile IE on Windows Phone, and two other tests to catch Kindle and UCWEB on Android).
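A heavily simplified sketch of that kind of rule (not ua-parser's actual regexes, which are more involved):

```python
import re

def is_android(user_agent: str) -> bool:
    """Very rough approximation of the kind of test ua-parser applies:
    the string "Android" marks the OS, except when the UA is actually
    mobile IE on Windows Phone, which includes "Android" for
    compatibility reasons."""
    if "Windows Phone" in user_agent:
        return False
    return bool(re.search(r"\bAndroid\b", user_agent))

print(is_android("Mozilla/5.0 (Linux; Android 8.0; Pixel) Chrome/62.0"))   # True
print(is_android("Mozilla/5.0 (Windows Phone 10.0; Android 6.0.1)"))       # False
```

(The real regex set also handles the Kindle and UCWEB-on-Android cases mentioned above.)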
Oct 17 2017
Thanks for the update on the timing! We can chat about this in person at the offsite now, but to record a result about the first part already (populating the map): This seems to be getting a bit less than 400k requests per day on average, with 94-95% of them for a single query string (all parameters identical - is this for some kind of initial coordinate?):
Oct 16 2017
Yes, I think I understand the difference. (I would be fine with calling the process of making the data available in Hive "import" too, but I can see why you prefer to call it "refining". BTW, for cross-reference, I understand that T177783 refers to this process.)
For the record, the equivalent of T158172#3625184 for the "seen" limit we later settled on (1000ms instead of 1500ms):
Oct 13 2017
Folks, we spent quite a bit of time just a few months ago on a comprehensive review of the purging settings for all apps schemas, which included discussion of this field (cf. e.g. T164125 ). Which schemas are affected exactly? Please do not change the settings without an opportunity for the apps teams to review the tradeoffs involved (CC @JMinor).