Right, that's true of course.
It doesn't solve the case described in T92457#4173454 (the same video used with two different thumbtimes...)
I think this task was always a bit unhelpfully vague, so I've opened a more specific task at T197839. I'm not sure if it's feasible for us to fix it: page images only captures the title of the associated page image, and supporting this would likely need quite a large re-architecture. But we'll see.
Wed, Jun 20
First, it would create a lot of extra work for the editors who would have to transfer locally chosen thumbtimes to Commons as the default thumbtime, and update them there in case they are changed locally.
I'm not sure it's true that it creates a lot of work.
We've only seen 1 case in the wild so far where the page image has been the initial screen and the initial screen of that video was a black screen.
Well, but that's a very narrow interpretation of the issue that this task should have resolved. It is more appropriately described as "the initial screen is not a suitable page image".
To record another takeaway from today's meeting: We did consider sampling/bucketing by page instead of by session ID, which (cf. T191532#4162147 ) could potentially avoid the FOUC, and which one might be able to implement based on page IDs alone. However, @Jdlrobson pointed out the pragmatic argument for using session IDs, which is that by now we have a lot of experience with that approach, and can use existing code.
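For readers unfamiliar with the session-based approach mentioned above, a minimal sketch of deterministic sampling follows (this is an illustration of the general technique, not the actual MediaWiki/EventLogging code; the function name and bucket granularity are assumptions). The same idea works whether the token is a session ID or a page ID:

```python
import hashlib

# Illustrative sketch: hash an ID deterministically into [0, 1) and
# compare against the sampling rate. Every client computes the same
# answer for the same token, so a user stays in or out of the sample
# for the whole session (or page, if a page ID is used instead).
def in_sample(token: str, rate: float) -> bool:
    h = int(hashlib.sha256(token.encode()).hexdigest(), 16)
    return (h % 10_000) / 10_000 < rate

print(in_sample("session-abc123", 0.01))
```

One practical point in favor of session IDs, as noted above, is simply that existing, battle-tested code already implements this.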
Mon, Jun 18
Fri, Jun 15
No, that sounds like a reasonable suggestion, and @mforns makes a great point about the importance of keeping project and language_variant consistent with pageview_hourly.
If there is interest, we could estimate how often the truncation would happen, by looking at the distribution of page name length (encoded and unencoded) weighted by pageviews.
Wed, Jun 13
Consistent in what sense? Recall that the main purpose of this schema is as an intermediate step for the virtualpageview_hourly table, where we only need the source's page title, page ID and namespace (consistent with the pageview_hourly table) - not the full URL.
Tue, Jun 12
Mon, Jun 11
Fri, Jun 8
I should have noted that editCountBucket was pulled from https://meta.wikimedia.org/wiki/Schema:Popups in the hope that we might have some easily reusable code there. (It wasn't mentioned in the task description, but @ovasileva and I agree that having this data might be useful for understanding how the effect of the new design might differ by editor experience level.)
I agree it could be preferable to use numbers (0,1,5,100, 1000) instead of strings ("0 edits", "1-4 edits", "5-99 edits", ...), but if we just use the existing setup, that's totally fine too.
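A sketch of the numeric-threshold variant suggested above (the thresholds follow the numbers in this comment; the function name is made up for illustration and is not from the Popups schema code):

```python
import bisect

# Bucket an edit count by returning the lower bound of its bucket,
# instead of a label string like "5-99 edits".
THRESHOLDS = [0, 1, 5, 100, 1000]

def edit_count_bucket(edit_count: int) -> int:
    # Largest threshold that is <= edit_count.
    i = bisect.bisect_right(THRESHOLDS, edit_count) - 1
    return THRESHOLDS[max(i, 0)]

assert edit_count_bucket(0) == 0
assert edit_count_bucket(3) == 1
assert edit_count_bucket(250) == 100
```

Numeric buckets sort naturally in queries, which is the main practical advantage over strings; but as said, reusing the existing string setup is fine too.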
Here is a chart of PDF downloads per day (we already used this data last month in the quarterly check-in):
Thu, Jun 7
Wed, Jun 6
Is this going to be carried forward into the 2018-19 annual plan? Improvements in this area would be very valuable for reader analytics in Audiences, too.
Thanks for the explanation, @fdans ! It seems like the best option for now regarding T189307 is to convert that kind of pageview_hourly query into an equivalent (if much slower) webrequest query that uses the old ua-parser regex. (Or do you happen to see a better solution?)
Since Turnilo has now officially replaced Pivot, it would be great to update the documentation on Wikitech (the main page at https://wikitech.wikimedia.org/wiki/Analytics/Systems/Pivot , and others where Pivot is mentioned).
Tue, Jun 5
I threw up a first draft of the schema at https://meta.wikimedia.org/wiki/Schema:PageIssues , based on the current task description. We still need to fully decide on the sampling/bucketing strategy.
Sat, Jun 2
Sounds good, thanks!
Sorry, but the pageviews-daily dataset as it was linked in the task description is still broken: While the quoted error message is gone, it still only offers the "Count" measure (which is rather meaningless and likely to mislead users into believing we get only between 3 and 4 million pageviews per day overall) and not the "View Count" measure that we need.
Fri, Jun 1
What does the modal action "external links" mean? (does the modal contain any links pointing outside Wikipedia, or is it still possible to click on external links in the article while the modal shows?)
Yes - I discussed briefly with @Dbrant before filing this task, and it could turn out there's additional effort needed on the app's side, but for now it seems that the work done in the context of T110702 (probably in connection with the addition of referrer data T192779 ) could suffice.
This task has a somewhat convoluted history, but after spending some time reviewing the various parts I think we can check all the boxes and "sign off". The only part that is not entirely clear to me from the comments above is whether anyone tested the standard print path ("Share" --> "Print") for Chrome mobile (per T179915). Clarifying that might still be useful, but it doesn't block the main analysis at T179915 regarding the download button usage.
- We did change the referer_class code, but we deployed it at the beginning of May, not April (the 5th, to be precise).
Does that refer to T191714: Add Ecosia and Startpage to list of search engines (which according to this log was deployed on May 2), or to some other change that could have affected the data as well?
Thu, May 31
Related: https://www.itu.int/en/ITU-D/Statistics/Documents/publications/wsisreview2014/WSIS2014_review.pdf (a UN report from a few years back that made use of quite a bit of - mostly or even entirely - publicly available Wikipedia data)
Tue, May 29
I'm actually not 100% certain when I last saw it working (I do recall checking some pageview date in recent days since the switchover, but probably only used the hourly version).
For the record, below is an example of the queries I have been using for this. This was based on the detailed analysis in https://phabricator.wikimedia.org/T157404 (for Pakistan - task set to private because the examination involved looking at some IP information), while including two other countries - Iran and Afghanistan - that showed a similarly anomalous pattern of IE7 views widely surpassing those from newer IE versions.
Related to question #3 in the task description, I noticed that the number of IE7 pageviews has dropped a lot from May 21 to May 22; it seems these are now counted as (mainly) IE11 (and some as IE8 and IE9):
Japan is on the list, but the Performance team had found larger changes in some of the other countries listed in the task description.
You could calculate "daily-user-pageviews-for-jp.wikipedia.org-in-Japan-in-desktop" divided by "daily-unique-devices-in-Japan-in-jp.wikipedia.org" and get a timeseries that would have one point per day. If the effect of the datacenter is significant, I would expect to see a hiccup in that timeseries after the datacenter launch, meaning that there are "longer sessions".
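The suggested ratio could be computed along these lines (a minimal sketch with made-up numbers; in practice both series would come from the pageview and unique-devices datasets rather than inline dicts):

```python
# Per-day ratio: pageviews divided by unique devices, one point per day.
# The values below are illustrative only.
daily_views = {"2018-05-01": 1_000_000, "2018-05-02": 1_050_000}
daily_devices = {"2018-05-01": 400_000, "2018-05-02": 400_000}

views_per_device = {
    day: daily_views[day] / daily_devices[day]
    for day in sorted(daily_views.keys() & daily_devices.keys())
}
# A sudden, persistent jump in this series after the switchover date
# would suggest longer sessions (more views per device).
print(views_per_device)
```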
Thanks for the suggestion! But per the task description, we are already examining the numerator and denominator separately. I don't expect that this quotient (views / devices) would yield much additional insight. Or to put it differently: Unless the switchover caused a decrease in the number of unique devices for some reason, these "longer sessions" would already be reflected in the pageview metric.
@JAllemandou Did the "about 10% of the offset" estimate in the task description refer to the daily metric?
For the monthly unique devices, the impact may have been much larger (looking at the total uniques_estimate - haven't examined the offset part separately yet):
Fri, May 25
Incident report (in progress): https://wikitech.wikimedia.org/wiki/Incident_documentation/20180524-wikidata
May 19 2018
I am not sure what privacy and bandwidth limitations had to do with this regrettable communications issue. Sure, those are necessary and sometimes difficult topics to discuss. But any questions related to these two had already been resolved at that point - we had agreed what data would be stored in the aggregate table and who would implement the aggregation (again, thanks for your work on this!). Rather, the frustration on my side was about things like people making strong but erroneous statements about what our team's data needs supposedly were, and what this task supposedly consisted of, instead of simply acknowledging and fixing the clear oversight that had been pointed out. And on your side, I understand that much of the frustration was about seeing yourself accused of deliberately diverging from the task as written when you implemented the aggregation. Again, that was not my intention and I had thought that this had been clear in T186728#4170881 , but perhaps there is something I missed, and I'm interested in what could have been done to avoid that misunderstanding.
Please understand, though, that we in Analytics are trying our best on our side as well. And in view of this situation, where conflicts already exist between some of us, aggressive comments usually do not help reach results, but rather make the conflicts bigger.
Thanks (belatedly) for solving this mystery by following up with App Annie and doing further research on this! BTW this is also consistent with the earlier observation that these spikes have always been confined to a single country and a single day, and appear to increase the baseline by a round number like 10,000.
May 18 2018
May 16 2018
Thanks! Already looked at it in Superset with the web team yesterday.
May 11 2018
Thanks @mforns, also for keeping the existing data up earlier while the fix was implemented (I was able to use it for our quarterly check-in deck this week). Will take a look at the new version soon.
(Since March, this conversation has been continuing elsewhere, mainly with @JMinor on the Readers team's side. Since it appeared not all participants were seeing benefits of being able to whitelist this field in appropriate cases, I'm posting here a sketch of use cases that Josh drafted earlier with my support:)
That may be a good idea, although the limiting factor here is not database load or computations in Python but the large number of web API accesses. (I have been using the API version of mwreverts because last year I couldn't get the database version to work on PAWS, even with Aaron's support - filed at https://github.com/mediawiki-utilities/python-mwreverts/issues/8 .)
May 10 2018
@Fjalapeno Do you need any additional data from us here?
Clarified the scope of this task per @Charlotte, and split off the question about the volume of general edits into T194424.
I also updated my analysis of reverts for description edits for the last few months, addressing question 2 in this task. (This consisted just of re-running the existing PAWS notebook and thus wasn't much work at all, but it still took a while because the calculation kept failing - I think due to time limits on PAWS; the revert analysis is a bit computationally expensive and took 8 hours when it finally completed successfully. Next time we run this, one probably needs to invest some time to split it into several shorter timespans.)
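Splitting the analysis window could look something like this (a hypothetical helper; the chunk size and the assumed per-run time limit are illustrative, not measured against PAWS):

```python
from datetime import date, timedelta

# Split a long analysis window into shorter chunks so that each
# notebook run stays well under the (assumed) execution time limit.
def split_timespan(start: date, end: date, days_per_chunk: int = 30):
    chunks, cur = [], start
    while cur < end:
        nxt = min(cur + timedelta(days=days_per_chunk), end)
        chunks.append((cur, nxt))
        cur = nxt
    return chunks

chunks = split_timespan(date(2018, 1, 1), date(2018, 5, 1), days_per_chunk=60)
print(chunks)
```

Each chunk can then be processed in its own run, with the per-chunk revert counts summed at the end.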
May 9 2018
@Tbayer do you have an idea of the amount of work that would be required to add a banner like WMDE's?
I imagine this is a fairly harmless change to the theme, but @Volker_E should be able to provide more of an expert answer. And IMHO we should do this anyway (i.e. even if the existing blog is moved to a newly created "archive" domain, which, as mentioned above, I would recommend against).
Thanks for your work on this! (speaking as one of the victims, as user HaeB != haeb)
It seems that your upstream pull request has since gotten merged?
mmm.. the research for this metric about bots was plentiful, and much of the "user"-marked traffic is excluded from counting. The bulk of the work we did on how we excluded bot traffic wrongly labeled as user traffic is explained here:
Yes, I'm familiar with that page. As far as I can see, it doesn't report any results about how many non-nocookie requests might be coming from undetected bots. And in any case, this was 2015 and it's 2018 now.
Apropos, I see that under "worklog" there you mentioned "[then-] recent updates to bot regex that affect this data, our regex catches more bots via user agent (quite a bit more)", underlining the importance of the present task in general.
May 8 2018
Great work, @MNeisler! Some additional remarks inline.
To add, for the record: The Breitbart-related coverage and social media attention which likely is a confounding factor here is summarized e.g. in https://www.haaretz.com/us-news/.premium-breitbart-declares-war-on-wikipedia-in-facebook-s-fight-against-fake-news-1.5991915 (paywalled, but may be accessible via Google).
Potential next steps, if needed, could include trying to expand the set of new article pages - either pending a complete list from Facebook, or by using Wikipedia categories.
For the record: we have been thinking about using https://en.wikipedia.org/wiki/Category:Media_in_the_United_States
but overall it appears that the article context feature had a very small effect on Facebook-referred pageviews. The excerpt displayed by Facebook in the article context feature is large, so it's possible that many Facebook users who access this feature do not click through to the Wikipedia articles.
To clarify: by now we can safely assert that the effect on overall Facebook-referred pageviews is very small. That said, we still have options for exploring question 2 ("Estimate the number of additional daily pageviews resulting from the feature") further. From F17344233, it appears that we could state a lower bound of about 400 daily pageviews for this just based on this fairly small sample of 9 articles, which we might be able to increase a lot when including the long tail of all articles in the aforementioned category. Focusing on the top-referred news media articles first was a great initial approach, but (besides the shape of the chart) the fact that the article about a comparatively small website like the Daily Wire surpassed those for e.g. the NYT or Fox News is another indicator that this was indeed dominated by the controversies/attention generated by Facebook's announcement itself, rather than the feature per se.
May 7 2018
For communicating the dormancy, I think one might want to add a little note to the theme in either case (a bit like what WMDE does on their old website, see the orange note on top of https://wikimedia.de/wiki/Hauptseite ); and on the other hand, all blog posts carry their publication date in the URL anyway.
Congratulations, Sherlock! ;)
To add: From our conversation on Friday, I also understand that the new behavior is now a bit closer to how a web browser would handle it, i.e. the app views are now more comparable to the pageviews we are registering on the web.
For our core metrics reporting for Q3, I think the takeaway is that the year-over-year comparison is still not valid yet until the next quarter, but perhaps we can limit it to March 2017 vs. March 2018 - CC @mpopov.
This has been live in (I understand) all the planned countries for several weeks now, so we should have enough traffic data for a before vs. after comparison; also, the Performance team has published their data on the immediate speed changes they have been measuring. @MNeisler is going to take on this task; we should meet soon and discuss the approach in detail.
May 6 2018
Why not simply leave the existing post URLs (like https://blog.wikimedia.org/2018/05/03/why-i-women-wikipedia/ ) intact and just redirect the blog's main page https://blog.wikimedia.org/ to https://wikimediafoundation.org/news/ ? That might save quite a bit of work and help avoid unforeseen technical complications. What is the rationale for creating and maintaining a new domain like blogarchives.wikimedia.org ?
May 4 2018
May 3 2018
The reason to hash app_install_id is that these events would end up somewhere where we would be able to join them with behavioral data sent by mobile apps, which we DON'T want.
To clarify, just in case: it's fine to log app_install_id in connection with user actions; it has been done in many different schemas for years. And "behavioral data" would seem to describe this data here too.
So I guess the "don't want" here refers to connecting user IDs with those other schemas via the app install ID, right? (In which case, fully agreed, although it seems we had been trying to prevent that with Method 1 or Method 2 anyway.)
Yes, it is considered impossible for practical purposes to come up with a source value when given the hash value alone (assuming that we choose a well-established hash function whose security has been widely vetted).
But in situations where one has the additional information that the hash can only come from a fairly limited set of source values, this is no longer true. That is well known, and is for example the reason why password hashes are always stored with a salt. The situation here is even worse - the list of existing users is public and fairly small (<200 million accounts across all WMF wikis, much less when applying some easy heuristics, e.g. limiting to recently active users).
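To make the attack concrete, here is a minimal sketch of the dictionary attack described above (toy usernames and SHA-256 are assumptions for illustration; the point holds for any unsalted hash function):

```python
import hashlib

def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode()).hexdigest()

# Because the candidate set is small and public, the attacker can just
# precompute the hash of every candidate and invert by table lookup.
def build_lookup(candidates):
    return {sha256_hex(c): c for c in candidates}

known_users = ["Alice", "Bob", "HaeB"]  # stand-in for the public user list
lookup = build_lookup(known_users)

# Given only a "one-way" hash, the source value falls right out:
leaked_hash = sha256_hex("HaeB")
print(lookup.get(leaked_hash))
```

This is exactly why a salt (or better, a keyed hash such as HMAC with a secret key) is needed when the input space is enumerable.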
even if they knew exactly which hashing function was used.
Which they would, considering that our code is open source ;)
May 2 2018
Thanks for working on this! As @Nuria points out in the task description, it looks important for data quality to keep this updated. Back in 2015 (T106134), @dr0ptp4kt suggested to automate these updates. Is that possible?
I agree that option b) is obviously preferable (because it simplifies things in the future).