Code Review Needed: WMDE Summer Banner Campaign Analytics - stat1002
Closed, Resolved · Public

Description

This is rather trivial, but again, better safe than sorry, because publicly available data are being produced.

Two scripts are being run several times during the day from stat1002: https://github.com/wmde/wmde_campaignAnalytics

The scripts are run from: /home/goransm/RScripts/sbc2017 and deliver .tsv files to /a/published-datasets/wmdecampaigns/sbc2017/

Both scripts do the work needed for the WMDE Summer Banner Campaign Analytics; they both access the Hadoop webrequest table, and neither collect nor expose any private data.

Thanks.

Event Timeline

@Jan_Dittrich are you familiar with R scripts, and could you have a quick look to check whether they're fine?

@Tobias_Schumann_WMDE: I had a look and it looks fine to me; however, I did not debug through it.

Could you please describe the data that is being made public, so we can look at the data rather than having to look at the code that produces it?

@Nuria The data are found in /a/published-datasets/wmdecampaigns/sbc2017/ on stat1002

sbc2017_PROD_BannerImpressions.R extracts only the uri_query field from webrequest constrained by particular values of uri_host, uri_query, and uri_path

sbc2017_PROD_BannerClicks.R provides only counts constrained by some particular values of uri_host, uri_query, and uri_path
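For readers who would rather not open the R code, the two descriptions above can be sketched roughly as follows. This is only an illustration in Python, not the actual scripts: the sample rows, the host, path, and query values, and the function names `banner_impressions` and `count_by_banner` are all hypothetical, and the real scripts query the Hadoop webrequest table rather than an in-memory list.

```python
from collections import Counter
from urllib.parse import parse_qs

# Hypothetical rows standing in for the webrequest table; only the three
# fields named in this thread are modelled, and all values are made up.
SAMPLE_ROWS = [
    {"uri_host": "de.wikipedia.org", "uri_path": "/beacon/impression",
     "uri_query": "?banner=example_banner_a&result=show"},
    {"uri_host": "de.wikipedia.org", "uri_path": "/beacon/impression",
     "uri_query": "?banner=example_banner_a&result=show"},
    {"uri_host": "de.wikipedia.org", "uri_path": "/beacon/impression",
     "uri_query": "?banner=example_banner_b&result=show"},
    {"uri_host": "en.wikipedia.org", "uri_path": "/beacon/impression",
     "uri_query": "?banner=example_banner_a&result=show"},
    {"uri_host": "de.wikipedia.org", "uri_path": "/wiki/Main_Page",
     "uri_query": ""},
]


def banner_impressions(rows, host, path, query_substring):
    """Return only the uri_query field of rows matching all three constraints."""
    return [
        row["uri_query"]
        for row in rows
        if row["uri_host"] == host
        and row["uri_path"] == path
        and query_substring in row["uri_query"]
    ]


def count_by_banner(queries):
    """Aggregate query strings into per-banner counts (no per-request data kept)."""
    counts = Counter()
    for query in queries:
        params = parse_qs(query.lstrip("?"))
        banner = params.get("banner", ["unknown"])[0]
        counts[banner] += 1
    return counts


if __name__ == "__main__":
    queries = banner_impressions(
        SAMPLE_ROWS, "de.wikipedia.org", "/beacon/impression",
        "banner=example_banner",
    )
    print(count_by_banner(queries))
```

The point of the sketch is that the first output still contains raw query strings, one per request, while the second contains only aggregated counts, which is the distinction that matters when deciding what can be published.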

@GoranSMilovanovic

Would you be so kind as to check with us whether data can be public before it is public?

Let's please stop these jobs from producing data until we have had time to determine whether this data can be of public nature.

Please move files to a private location and we can take a look.

@Nuria

Unfortunately, the code had to be developed rapidly for a campaign that we started running on 07/11/2017, so there was no time to ask you to review the data before they went public. Also, I relied on my best judgment, based on your recent assessment of a very similar dataset from one of our previous campaigns.

However: the data are now being moved from /a/published-datasets/wmdecampaigns/sbc2017/ to a private location on stat1002, and will remain there.

Thank you.

@GoranSMilovanovic Normally, when we make data public, it is for the world to consume: we document datasets, announce them, and make them available (after having established that the data can, in fact, be made public). In this case it seems that these files are intermediate products of your analysis and are not for the world to consume (even if the data is of a public nature). If that is the case, please be so kind as not to put those files in published datasets going forward.