Fri, Sep 22
Just FYI: the MySQL backend cannot really deal with this volume of data, so any subsequent test data will be available through the Hadoop backend.
Thu, Sep 21
We are talking about one Popup table, right? (Well, probably two: Popups_16364296 and Popups_16364296_15423246.) Can we do away with the rest of the tables, or do you need those too?
@Tbayer: we can put all the data in Hadoop and remove it entirely from MySQL. I should be able to refine all the data in Hadoop in one go and put it on my db for you to take a look. Once you let me know it is good, we can drop the table in MySQL and move the Popups table to the archive database. Let us know if this sounds good.
Wed, Sep 20
Super, thanks for following up.
Tue, Sep 19
Ok, we are ready to drop PageCreation_7481635_1542324 and PageCreation_7481635 from MySQL. Ping @elukey
Working now on removing events from MediaWiki
@Jan_Dittrich: bucketing is available as part of WikimediaEvents; see an example of its usage in the search code: https://github.com/wikimedia/mediawiki-extensions-WikimediaEvents/blob/master/modules/ext.wikimediaEvents.searchSatisfaction.js#L100-L158
Mon, Sep 18
Fri, Sep 15
And community members and the public do not have that option and are thus left with the faulty data.
This is most certainly not true. Any dataset comes with caveats and issues, and this is just one of several that you should be aware of: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly#Changes_and_known_problems_since_2015-06-16
Thu, Sep 14
ping @Jseddon Can we filter the incorrect rows?
Will set up meeting
We cannot update Pivot because it is closed source, and thus far it cannot deal with monthly granularity, so we will not be able to add the dataset to Pivot. It can, however, be added to Druid and accessed with Superset when available.
Ping @Tbayer what is the status of this vetting?
Cassandra, zookeeper, druid, hadoop, kafka
Wed, Sep 13
We already discussed this issue on this ticket: https://phabricator.wikimedia.org/T141506#2575088 and I second @BBlack's opinion. In a gist: I do not think this traffic should be removed. It is real (if unintentional); we count real requests coming to our servers, and these are very real requests. I understand that the magnitude of this event is large, but it is really not the only one of this type we have had (in 2015 there was a similar one that at some point measured 5% of overall traffic). As mentioned on the ticket, I am of the opinion that we should keep what we count as close to reality as possible, as that is the best way to make sense of the data.
Ahh, my mistake!
Let's make sure we update the documentation and announce the release of the new data once this task is done. Probably blogpost-worthy.
@JAllemandou Let's document the dataset on https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream (cadence, availability) and announce it to analytics@ and the research list before we close this task.
Confirming that the zero carrier is visible; pinged the Zero folks about it.