For a quick stab at this data I will submit counts for visibleLength and similar measures to Graphite and visualize those in a dashboard (you will have percentiles but not other dimensions).
getting it approved, and then worrying about whether we want to change it later.
Schemas are changeable; changes just have to be backwards compatible, just as they would be for a public API.
You can also test your whole workflow in beta: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/TestingOnBetaCluster
@Niharika +1 to @Milimetric response
We prefer to have as little data on MySQL as possible and recommend Hive as the best system to look at data. MySQL would not scale for our needs and we cannot guarantee it will keep up with more schemas going forward.
Traffic to the site is very small, about 300 people per day max.
@elukey: confirming that we have set up deletion after 90 days for files like /wmf/data/raw/eventlogging_client_side/eventlogging-client-side/hourly/2018/10/23/11/eventlogging-client-side.1006.6.855145.1676061642.1540292400000 (readable with hdfs dfs -text)?
Mon, Nov 19
@Tbayer: do you have some more comments related to vetting of this metric or is this the only one?
@Ladsgroup I tested this and several other variations, none of which worked.
Sat, Nov 17
I can see data climbing up but your events widget is empty, please take a look. The bulk of traffic comes from fb mobile.
Fri, Nov 16
It is worth looking at already existing event data. If we want to reuse the logic that reads events and persists them to Hive, partitions cannot be schema dependent; at this time partitions are:
@awight FYI that events need to abide by a schema that can be persisted to SQL (https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines) and that schema changes should be backwards compatible. Once those requirements are taken care of you can take advantage of the current systems and dump data from Kafka into Hadoop.
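As a rough sketch of what "backwards compatible" means on the Hive side (table and column names below are hypothetical, not from this ticket): adding a new optional field is fine, while renaming or removing an existing one breaks consumers of old data.

  -- Hypothetical Hive table, illustrative names only.
  -- Backwards compatible: add a new optional column; old rows simply read it as NULL.
  ALTER TABLE event.my_schema ADD COLUMNS (new_optional_field STRING COMMENT 'added later');

  -- Not backwards compatible: renaming or dropping a column breaks readers of old data.
  -- ALTER TABLE event.my_schema CHANGE new_optional_field renamed_field STRING;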
I think we are mixing things here: sqooping mediawiki tables to the data lake on Hadoop (private cluster) and having that data available in labs (public, sanitized data). @Tbollinger's request has to do with having data in Hadoop such that it is "joinable" with other data that already exists there. This does not necessarily imply that the data will be available on labs, as the sqooped data might be of a private nature.
Piwik data is updated once a day.
Thu, Nov 15
I think you need to flesh out a bit more what questions you want answered and evaluate whether Druid is the best tool to answer those.
FYI that mediawiki_ipblocks is sqooped monthly.
Wed, Nov 14
Please see the changes to XML dumps being discussed: https://phabricator.wikimedia.org/T174031
@atgo: piwik provides by default stats about the website visitors are coming from, if their browser is sending the referrer.
Tue, Nov 13
Once data comes in you can see how your instrumentation is working @Prtksxna
Mon, Nov 12
This worked great and bogus output is no longer there.
Sat, Nov 10
ah, missing an option. good cmd for reference:
Fri, Nov 9
Tested the job but it failed, looking into it: https://yarn.wikimedia.org/cluster/app/application_1540803787856_42410
Reassigning to @bearloga, who is working with the Android team.
All data is available in the Hive table mediacounts; you can hit as many files as needed with a Hive SQL query. See:
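For example, something along these lines aggregates across all the underlying hourly files in one query (column and partition names here are an assumption; verify them with DESCRIBE wmf.mediacounts first):

  -- Assumed columns/partitions; verify with: DESCRIBE wmf.mediacounts;
  SELECT base_name, SUM(total_response_bytes) AS bytes_served
  FROM wmf.mediacounts
  WHERE year = 2018 AND month = 10
  GROUP BY base_name
  ORDER BY bytes_served DESC
  LIMIT 20;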
Just pointing out that the main question that the ticket lists is: "to measure how often blocked users attempt (and fail) to edit pages."
Anything we have in the stack that is rate limiting by IP
Varnish comes to mind.
Thu, Nov 8
@MusikAnimal understood, you just need your credentials to query the cluster. To see how "approximate" this approximation is, rather than looking at files with raw data you can query the tables that already host the same data from which the files are derived.
Sounds fine, traffic is just real small, < 10 users per day.
@Krenair: we are looking at how to best import the public dataset from labs. We have already looked into sqooping data from the non-public data hosts, and the sanitization is a lot harder than you might think (by no means as simple as "running your own views"), so we need to come up with a strategy to sqoop data from labs efficiently. Let's keep this ticket about the ways we can make the sqooping task easier.
Wed, Nov 7
Data for events appears on "events" widget which is currently empty.
@atgo: piwik requires the same user/password as Turnilo. You should see a pop up with the usual LDAP user/password box.
Tue, Nov 6
Any guidance here on how this is done most optimally in practice, if the sampling rates are actually different for event type 1 and event type 2 and we want to be able to stitch them together?
I might not have understood the question, as this is already a solved problem for eventlogging data (data is retained and cross-linked across schemas for 90 days, not forever).
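If it helps, a minimal sketch of the stitching idea (schema and field names below are hypothetical; this only works if both schemas are sampled on the same unit, e.g. the same session token):

  -- Hypothetical tables and fields, purely illustrative.
  -- Join the two event streams on a shared session token and scale counts
  -- by the inverse of each stream's sampling rate.
  SELECT a.session_token,
         COUNT(DISTINCT a.uuid) * 10  AS est_type1_events,  -- assuming 1:10 sampling
         COUNT(DISTINCT b.uuid) * 100 AS est_type2_events   -- assuming 1:100 sampling
  FROM event.schema_one a
  JOIN event.schema_two b
    ON a.session_token = b.session_token
  GROUP BY a.session_token;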
Mon, Nov 5
We tried to sqoop the change_tag table in the 2018-10 sqoop run but it is not working; will consult with the team and revert the changes if needed.
Code change looks good. So I understand: given that Google Translate is proxying requests, it seems it would make requests with a user agent that identifies them as coming from Google Translate, so we would not need the header. Is that not the case?
We can easily turn the feature off if you guys have some fixing to do, and turn it back on later.
Thank you. We are going to see whether our fix (which ignores some requests) resolves the issue; we just deployed it and will monitor the fleet for the next hour. If it does not work we will ping you.
Rather than all columns being prefixed by the table name, have them be just the column name when returning query results.
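If this refers to Hive query output, a hedged sketch of two options (the setting and the table/column names below are assumptions for illustration, not a confirmed fix for this ticket):

  -- Drop the table-name prefix from result headers for the session.
  SET hive.resultset.use.unique.column.names=false;

  -- Or alias columns explicitly so only the bare column name is returned.
  SELECT p.page_id AS page_id, p.page_title AS page_title
  FROM wmf_raw.mediawiki_page p
  LIMIT 10;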
This is preventing us from refining pageviews (due to data-loss alarms, which seem to be false positives but alert us that something upstream in the pipeline is not going as expected).