
Please review: new WMDE public data set on stat1005
Closed, Resolved (Public)

Description

Hello,

please review the README.txt file describing the future public data set that we need for WMDE Analytics on stat1005. The file is found in:

/srv/published-datasets/wmde-analytics-engineering/TechnicalWishes/AdvancedSearchExtension

The README.txt should be quite informative with respect to whether the data set contains any private information (it doesn't).

Thanks a lot.

Event Timeline

@GoranSMilovanovic could you provide me with a direct link to that file? Thanks!

@Lea_WMDE No. The file is now found in the designated directory on the stat1005 server. I will send you a copy via e-mail.

Analytics-Data-Quality: could anyone please do this five-minute review for us so that we can sync our Labs-based Dashboard with the public data set? Thanks a lot.

Hi @GoranSMilovanovic
Sorry for the delay in taking care of this task.

I see that your data set has the parsed user agent in it, which contains, e.g.:

{"wmf_app_version": "blah", "os_minor": "blah", "os_major": "blah", "is_bot": false, "device_family": "Other", "os_family": "Other", "browser_minor": "Blah", "is_mediawiki": false, "browser_major": "123", "browser_family": "SomeFamily"}

WMF's privacy policy[1] considers "user-agent information" to be "personal information". Note that even if the UserAgentString is parsed, as is the case here, it is still potentially identifying and falls under the "user-agent information" label. As such, we cannot store it for more than 90 days and we cannot make it public (even if the rest of the data set does not contain information that indicates user preference). BTW, I saw that you very carefully created non-sensitive boolean fields to store properties of the search string used, without storing the full search string. We appreciate that :]

Question: The README explains that the userAgent field is used to filter bots. Is there any way that field can be sanitized so that it still contains the information needed to filter bots without being potentially identifying? Would its "is_bot" sub-field be enough for that? Is it possible to precompute whether the user is a bot or not in the R script and store just a boolean isBot?

[1] https://wikimediafoundation.org/wiki/Privacy_policy
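
The precomputation asked about above could look roughly like the following. This is a minimal Python sketch for illustration only, not the actual R script: it assumes each row carries the parsed user agent as a JSON string in a userAgent column, and the function names derive_is_bot and sanitize_rows are hypothetical. Treating a missing or unparseable user agent as "not a bot" is also an assumption that the data owners would need to confirm.

```python
import json


def derive_is_bot(parsed_user_agent):
    """Return True only if the parsed user-agent JSON has "is_bot": true.

    Assumption: missing or unparseable values count as "not a bot".
    """
    try:
        parsed = json.loads(parsed_user_agent)
    except (TypeError, ValueError):
        # None, empty strings and malformed JSON all end up here.
        return False
    return isinstance(parsed, dict) and parsed.get("is_bot") is True


def sanitize_rows(rows):
    """Yield copies of the rows with userAgent replaced by a boolean isBot."""
    for row in rows:
        # Drop the potentially identifying field entirely ...
        clean = {key: value for key, value in row.items() if key != "userAgent"}
        # ... and keep only the single bit needed for bot filtering.
        clean["isBot"] = derive_is_bot(row.get("userAgent"))
        yield clean
```

With this shape, only the boolean reaches the published file; the other sub-fields (browser_family, os_family, and so on) never leave the private environment.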

@mforns Thank you for the review!

  • The data set will be sanitized as per your request; it's a trivial operation and will not affect the rest of the process.
  • The non-sensitive boolean fields used to store the properties of the search string were generated by @Addshore, who wrote the event logging for these data, designed the SQL schema, and is in general very careful in these things, so the applause goes to him - not me.
  • I think that the conclusion of one recent Phab discussion was that the full search string is actually stored somewhere in Hadoop, and I will have to make use of it sooner or later - but I don't remember exactly where it is stored. Maybe @Lea_WMDE knows more.

All in all, I will redesign the data set to comply with the privacy constraints and ping here again for a (hopefully) final review. Thanks a lot @mforns !

So what we want to do is T187039: Measure change in keyword usage with AdvancedSearch. I am not interested at all in anything specific about single users, except knowing that this was a user and not a bot, and which keywords were used, so we can see if our extension increases the overall use of them.

The discussion @GoranSMilovanovic mentioned is in this ticket: T187907: Count keyword usage over time

Hi @mforns,

the asExtensionUpdate.csv file - the one that needs to be made public to support our Dashboard that runs from the CloudVPS service - is now sanitized as per your request:

  • the removal of the userAgent (i.e. the parsed UserAgentString) field - done;
  • the respective change taken into account in the corresponding README.txt file - done.
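
A quick pre-publication check for the sanitization listed above can be sketched as follows. This is an illustrative Python snippet under stated assumptions, not part of the actual pipeline: it assumes the CSV has a header row, and the helper name forbidden_columns_present and the FORBIDDEN_COLUMNS set are hypothetical.

```python
import csv

# Assumption: userAgent is the only column that must not be published.
FORBIDDEN_COLUMNS = {"userAgent"}


def forbidden_columns_present(csv_file, forbidden=FORBIDDEN_COLUMNS):
    """Return the sorted list of forbidden column names found in the header."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    return sorted(set(header) & set(forbidden))
```

The function takes any open text file object, for example the result of open("asExtensionUpdate.csv", newline=""), and an empty list means that no forbidden column is present in the header.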

The file and the respective README.txt are found in:

/home/goransm/RScripts/TechnicalWishes/AdvancedSearchExtension

Please let me know when (and if) the file can go public. Thanks!

Hi @GoranSMilovanovic :]
The data set is good for publishing now.
Thanks for applying the changes!