Current pageview definition leads to unwanted statistics lifecycle
Closed, Resolved | Public | 5 Estimated Story Points

Description

Issue discovered in French Wikipedia's Bistro: https://fr.wikipedia.org/w/index.php?title=Wikip%C3%A9dia:Le_Bistro/30_janvier_2017#bizarreries

The French Wikipedia page named "Désiré Dihau" has pageviews from before it was created!

A first quick look at one hour of data shows that, in this specific case, the requests counted as pageviews before page creation are POSTs to fr.wikipedia.org/w/index.php?title=D%C3%A9sir%C3%A9_Dihau&action=submit, and have the exact same URL as referer.

Chatting with @Trizek-WMF (who showed me the case), we came to the conclusion that this is probably caused by hitting preview or show changes while editing a page (both send action=submit).
This should be confirmed, and we should discuss whether or not it should be counted as a pageview.
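To make the suspected pattern concrete, here is a minimal Python sketch that flags requests shaped like the one above: a POST to /w/index.php with action=submit in the query string. The function name is_submit_preview is hypothetical and this is not the actual pageview definition code; it only illustrates the check being discussed.

```python
from urllib.parse import urlsplit, parse_qs


def is_submit_preview(http_method: str, url: str) -> bool:
    """Return True for requests that look like an edit preview / show changes.

    Hypothetical helper: it matches a POST to /w/index.php whose query
    string carries action=submit, as in the case reported in this task.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    return (
        http_method.upper() == "POST"
        and parts.path == "/w/index.php"
        and "submit" in query.get("action", [])
    )


example = (
    "https://fr.wikipedia.org/w/index.php"
    "?title=D%C3%A9sir%C3%A9_Dihau&action=submit"
)
print(is_submit_preview("POST", example))
```

A regular article read (a GET on a /wiki/ URL) would not match, so a filter along these lines would only affect the preview-style requests.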

Event Timeline

Milimetric shifted this object from the S1 Public space to the Restricted Space space. Jan 30 2017, 5:14 PM
Milimetric changed the visibility from "Public (No Login Required)" to "All Users".
Milimetric shifted this object from the Restricted Space space to the S1 Public space. Jan 30 2017, 6:50 PM
Milimetric changed the visibility from "All Users" to "Custom Policy".
Milimetric changed the edit policy from "All Users" to "Custom Policy".

@Milimetric, thanks for bringing this to our attention. Just to clarify, no personally identifiable information is exposed directly about editors, correct? The information that is exposed is essentially the fact that a new page is being worked on. We're not, for example, exposing the IP address or username associated with the new page, correct?

@APalmer_WMF, FYI.

Yes, no PII is directly being exposed, except if someone accidentally pasted PII into the title of the article that they're working on. Then that would be exposed. So, for example, I start a draft:

User:Milimetric/Sandbox/World_Wetlands_Day<<PII Pasted here>>_2017

and I notice halfway through working on it that I pasted some PII, but I've already hit preview a few times. At that point, pageviews including the PII would have been recorded in our system and those would eventually make their way out into the pageview API, where they become public.

Unfortunately, it would be really hard to distinguish these kinds of "Preview button" pageviews from regular pageviews. So we have no way of cleaning out data that's already there without wiping the whole set of data or looking through it manually.

matmarex changed the edit policy from "Custom Policy" to "Custom Policy".
matmarex added a subscriber: matmarex.

(Changed the policies to allow all task subscribers to view/edit it – in particular, Trizek couldn't see it.)

Thanks :)

I have more information about what happened, and my guess was correct: user mandariine created the article on her computer, then put it on the wiki to get a preview and polished it on her computer. action=submit was triggered and counted as a view.

Thank you, sorry for missing Trizek in the first policy. And thank you, Trizek, for the additional detail.

Based on https://meta.wikimedia.org/wiki/Research:Page_view - these sorts of page views should not be counted (It is in /w/ not /wiki/)

@Milimetric, @JAllemandou, we discussed this in the weekly Security team meeting and we don't see there being major security implications here. It seems to be more of an Analytics-centric functional issue.

I definitely take your point about the possibility of PII being in the page title. Can you scope how large the problem may be? How many requests of this type are currently tracked in Pageview data?

I'm waiting for more details, but apparently in the example given, the user was not logged in. Mandariine has since reported that she did the same thing with another article while logged in, and no stats were recorded prior to publication.

It's basically impossible to estimate, @dpatrick. Thanks for discussing this in your weekly meeting; we will use this task to change the code going forward then (agreed with @Bawolff that it's weird it was included, maybe it depends on being logged out vs. logged in, as @Trizek-WMF says). If anyone ever thinks of this again and goes, OH NO, we have to delete it ALL, just ping back.

Milimetric triaged this task as High priority.
Milimetric edited projects, added Analytics-Kanban; removed Analytics.
Milimetric set the point value for this task to 5.

For the record, the associated patch is https://gerrit.wikimedia.org/r/#/c/335639/
(Filling in for gerritbot, who I guess can't post on restricted tasks?)

Thanks for the link, @Tbayer. As I mentioned in IRC, your point about this being public makes sense. If you want to make it public, you can edit the task and remove the custom policy.

Done ;) (To recap for later reference: we considered that on the one hand this task is referred to from the pageview definition change log at https://meta.wikimedia.org/wiki/Research:Page_view , and on the other hand both the pageview discrepancies that gave rise to it and the code change that resulted from it are public anyway. What's more, the former are likely to be rediscovered in other contexts for the foreseeable future.)

To document the effect on total pageviews for later reference, here is a plot of the daily percentage (for the timespan where data was still available). It appears to have been consistently below 0.05%, and (predictably) concentrated on desktop.

action submit pageviews.png (512×702 px, 44 KB)

Data via

SELECT year, month, day, CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')) AS date,
ROUND(100* SUM(IF(uri_query LIKE '%action\=submit%', 1, 0))/SUM(1),3) AS submit_percentage,
ROUND(100* SUM(IF(uri_query LIKE '%action\=submit%' AND access_method = 'mobile web', 1, 0))/SUM(1),3) AS submit_percentage_mobile_web
FROM wmf.webrequest
WHERE year = 2017 
AND agent_type = 'user'
AND is_pageview = TRUE
GROUP BY year, month, day
ORDER BY year, month, day ASC
LIMIT 10000;
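
For readers without access to the webrequest table, the aggregation in the query above can be mimicked on a small in-memory sample. This is only an illustrative Python sketch: the function name submit_percentages and the sample rows are made up, and the rows are assumed to be already filtered to is_pageview = TRUE and agent_type = 'user'.

```python
from collections import defaultdict


def submit_percentages(rows):
    """Per-day share of pageviews with action=submit, mirroring the query.

    rows: iterable of dicts with 'date', 'uri_query' and 'access_method'
    keys (hypothetical shape, pre-filtered to user pageviews).
    Returns {date: (submit_percentage, submit_percentage_mobile_web)}.
    """
    totals = defaultdict(int)
    submits = defaultdict(int)
    mobile_submits = defaultdict(int)
    for row in rows:
        date = row["date"]
        totals[date] += 1
        if "action=submit" in row["uri_query"]:
            submits[date] += 1
            if row["access_method"] == "mobile web":
                mobile_submits[date] += 1
    # Both percentages are relative to all pageviews of the day,
    # as in the SQL (the denominator is SUM(1) in both columns).
    return {
        date: (
            round(100 * submits[date] / totals[date], 3),
            round(100 * mobile_submits[date] / totals[date], 3),
        )
        for date in sorted(totals)
    }


# Made-up sample rows, for illustration only.
sample = [
    {"date": "2017-02-01", "uri_query": "?title=Foo&action=submit", "access_method": "desktop"},
    {"date": "2017-02-01", "uri_query": "?title=Bar&action=submit", "access_method": "mobile web"},
    {"date": "2017-02-01", "uri_query": "", "access_method": "desktop"},
    {"date": "2017-02-01", "uri_query": "?title=Baz", "access_method": "mobile web"},
]
print(submit_percentages(sample))
```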

Hm, actually, it looks like removing the "Security" project is not enough to make the task publicly visible, and I don't seem to have the permissions necessary for the other actions described at https://www.mediawiki.org/wiki/Phabricator/Security#How_to_Lower_the_Security_of_a_Task . @Aklapper , could you help out?

T153742

So I understand from @Aklapper that this is currently not possible due to T153742 :(

mmodell changed the visibility from "Custom Policy" to "Public (No Login Required)". Jun 18 2017, 5:22 PM
mmodell changed the edit policy from "Custom Policy" to "All Users".
mmodell added a subscriber: mmodell.

I have changed the policy to public.