User Details
- User Since: Feb 11 2015, 6:02 PM (251 w, 3 d)
- Availability: Available
- IRC Nick: joal
- LDAP User: Unknown
- MediaWiki User: JAllemandou (WMF) [ Global Accounts ]
Thu, Dec 5
I looked at data from https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics/wmcs/ and I have 2 comments:
- I find it misleading that wikis_by_wmcs_edits.tsv contains percentages without saying so in the filename, while the other two files contain absolute edit counts and percentages consistent with their names.
- The first of the month is used to label the whole month's data: 2019-11-01 means data from 2019-11-01 to 2019-11-30.
Wed, Dec 4
New dataset available @GoranSMilovanovic. Pinging @Groceryheist as I also generated the items per page.
hdfs dfs -ls /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet | tail -1
drwxr-xr-x - analytics joal 0 2019-12-04 18:31 /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20191202
Tue, Dec 3
Mon, Dec 2
Hi @srishakatux - New data is available in table wmf.editors_daily for month '2019-11'. I think that, except for the table name, your queries should work as-is. Let me know!
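For reference, a minimal sketch of the kind of query I'd expect to work against the renamed table, run from spark2-shell. The month partition value comes from the comment above; the column names (wiki_db, network_origin, edit_count) and the 'wikimedia_labs' value for WMCS-originated edits are assumptions about the schema, so adjust as needed:
spark.sql("""
  SELECT wiki_db, SUM(edit_count) AS wmcs_edits   -- edit_count column name is assumed
  FROM wmf.editors_daily
  WHERE month = '2019-11'
    AND network_origin = 'wikimedia_labs'         -- assumed flag for WMCS-originated edits
  GROUP BY wiki_db
  ORDER BY wmcs_edits DESC
""").show(50, false)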
Fri, Nov 29
Thu, Nov 28
Hi @srishakatux :)
Code is ready (see https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/552510/), but it won't get merged before next week.
I'll backfill the data next week (in December), then let you know so that you can test and update the queries.
Something to note: I have renamed the table you're using, as it's not only geoeditors-related, but more broadly stores information on editors. It's now editors_daily (instead of geoeditors_daily).
I'll ping you when the change has been deployed and the data is available :)
Wed, Nov 27
Does this being closed mean we can access the data on Kafka?
Tue, Nov 26
@elukey: We should apply the same treatment for stat1007 :)
Adding action-type to the base table makes the results extremely similar. The very small remaining difference is due to the imperfect user data we join to in order to gather bot information. I hope this is ok for you @srishakatux and @bd808.
- Data from Bryan's script:
| wiki_db      | wmcs_edits | total_edits | wmcs_percent |
| commonswiki  |     573263 |     4157574 |       13.79% |
| enwiki       |     414401 |     5851360 |        7.08% |
| wikidatawiki |   12216608 |    19628856 |       62.24% |
Thanks for the command @bd808. We're adding the change_type dimension to the underlying dataset, allowing for more use-cases (wmcs included).
About bot-filtering: since we're dealing with edits and not pageviews, we use a different approach, based on MediaWiki user groups. A user in the bot group is flagged in the user_is_bot_by dimension with the value group, and we also use a regexp (https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/user/UserEventBuilder.scala#L24): a user whose username matches it is flagged with the value name in the same column (the column is an array, so a user can have both values). Indeed, detection is fragile :)
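To illustrate, a hedged sketch of how bot filtering could be applied with that array once the change_type/action-type change lands; the table and the other column names (wiki_db, edit_count, month) are assumptions about the schema:
spark.sql("""
  SELECT
    wiki_db,
    SUM(edit_count) AS all_edits,
    -- size() returns -1 for NULL arrays and 0 for empty ones, so <= 0 keeps non-bot rows only
    SUM(IF(size(user_is_bot_by) <= 0, edit_count, 0)) AS non_bot_edits
  FROM wmf.editors_daily
  WHERE month = '2019-11'
  GROUP BY wiki_db
""").show(50, false)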
Fri, Nov 22
Analysis of non-edit cuc_type for 3 big wikis:
spark.sql(""" select cuc_type, wiki_db, count(1) as c from wmf_raw.mediawiki_private_cu_changes where month = '2019-10' and wiki_db in ('enwiki', 'commonswiki', 'wikidatawiki') group by cuc_type, wiki_db order by cuc_type, wiki_db """).show(100, false)
New task: T238858
Thu, Nov 21
Task created: https://phabricator.wikimedia.org/T238855
Hi @srishakatux, thanks a lot for double-checking the data. The discrepancy is due to bot edits being removed from the geoeditors data. I'm very sorry not to have pinpointed that earlier :(
I'm assuming bot edits are important for your metric (a lot of bots are run from wmcs IIUC).
I think the best idea would be to add bot edits to the geoeditors dataset, as it would be valuable for other analyses as well. We currently flag bots in 2 ways: by group (when the user is in the BOT group), and by name, when their name matches this regexp.
We (analytics) need to make a decision on how we want to add that info to the table; then your queries should just work as-is, except for one other thing: I think Bryan's script is counting more than edits, as it doesn't filter on cuc_type (see https://www.mediawiki.org/wiki/Manual:Recentchanges_table#rc_type).
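As a hedged sketch (not Bryan's actual script), the cuc_type filter would look something like the following, reusing the partition and column names from the cuc_type breakdown query earlier in this feed; 0 and 1 are the rc_type constants for edits and page creations from the linked manual page:
spark.sql("""
  SELECT wiki_db, COUNT(1) AS edit_rows
  FROM wmf_raw.mediawiki_private_cu_changes
  WHERE month = '2019-10'
    AND wiki_db IN ('enwiki', 'commonswiki', 'wikidatawiki')
    AND cuc_type IN (0, 1)  -- keep edits and page creations, drop log entries etc.
  GROUP BY wiki_db
""").show(10, false)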
Let's confirm the above plan works for you @srishakatux.
Tue, Nov 19
I'm glad I double-checked. Unfortunately the above 2 patches will probably not be enough. I've done a quick analysis on the logs from a mediarequest backfilling job (~100GB of logs):
val df = spark.read.text("/var/log/hadoop-yarn/apps/analytics/logs/application_1573208467349_68436")
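A hedged continuation of that exploration, just to show the kind of check done on the loaded log lines; the "ERROR" pattern is an illustrative assumption, not the actual analysis:
val errorLines = df.filter($"value".contains("ERROR"))  // assumed pattern, for illustration only
println(s"Error lines: ${errorLines.count} out of ${df.count} total")
errorLines.show(10, false)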
Mon, Nov 18
Another brand-new tool we could have a look at: https://github.com/uber/marmaray
Fri, Nov 15
Thu, Nov 14
Nov 7 2019
Hi @Iflorez,
Data presented in Wikistats comes from AQS endpoints (see here for a list of pageview-related endpoints and their descriptions).
The wikistats URL you have pasted above calls https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/mr.wikipedia.org/mobile-app/all-agents/monthly/2017100100/2019110700 to retrieve data.
Done! Example:
spark2-shell --master yarn --driver-memory 4G --executor-memory 8G --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=32 --jars /srv/deployment/analytics/refinery/artifacts/refinery-job.jar
Done! Hiding the awful Turnilo link under this.
Nov 6 2019
My main concern about using the analytics infra for serving page counts is the delay in updates. We update the data once every month, so for many pages the count will drift. Depending on the purpose of the count, maybe that's not critical, but I wanted to flag it. If we decide to try Druid, let's discuss this in a meeting :)
Nov 5 2019
Nov 4 2019
Makes sense - Thanks for the explanation @mforns :)
I checked message length over one day of webrequest, and we top out at 4916 bytes.
I think Kafka will be fine in terms of message size, and I assume it'll also be fine regarding overall data growth.
Same for Hadoop.
Let's make this happen.
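For the record, a rough sketch of how such a check can be approximated from the refined table (this is not necessarily how the 4916-byte figure was computed); serialized JSON length is used as a proxy for message size, the partition values are placeholders, and it assumes to_json can serialize all of webrequest's column types on the Spark version at hand:
spark.sql("""
  SELECT MAX(length(to_json(struct(*)))) AS max_serialized_length
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'   -- placeholder partition values
    AND year = 2019 AND month = 11 AND day = 1
""").show()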
Thanks for the fast back-and-forth on the format @Nuria and @BBlack.
Indeed, having a single field named TLS formatted as described above is best (we could even go for smaller keys, but that makes it less human-readable).
Hadoop will then process that into an explicit map, and we'll load the values into Turnilo as separate fields.
Something to keep in mind: Turnilo data is sampled (1 over 128), so for values with a very small probability of appearing, full data processing in Spark will be needed.
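As an illustration of the Hadoop-side processing, a hedged sketch of turning the single TLS string into an explicit map with str_to_map; the key names, delimiters and sample value are assumptions about the agreed format, not the final spec:
spark.sql("""
  -- sample value and key names are assumptions, for illustration only
  SELECT str_to_map('vers=TLSv1.3;keyx=X25519;auth=ECDSA;ciph=AES256-GCM-SHA384', ';', '=') AS tls_map
""").show(1, false)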
Nov 1 2019
Great plan @Milimetric! One clarification: we only have geoeditors_daily up to 2019-09, so we can backfill from there but no further, and we need to nullify up to 2019-08.
hdfs dfs -ls /wmf/data/wmf/mediawiki_private/geoeditors_daily
Found 1 items
drwxr-x--- - analytics analytics-privatedata-users 0 2019-10-02 08:44 /wmf/data/wmf/mediawiki_private/geoeditors_daily/month=2019-09
Side note: This retention period feels small!! Is the deletion scheme deleting more than expected?
After reviewing the data-deletion scripts: wikitext_history snapshots are deleted, but 6 of them are kept.
See https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/analytics/refinery/job/data_purge.pp#L128 and https://github.com/wikimedia/analytics-refinery/blob/master/bin/refinery-drop-mediawiki-snapshots.
The table is filled with manually converted data, and the new job seems configured correctly. We'll know for sure when the dumps are fully imported.
Checked the logs from this morning; they look good (nothing to import yet, but no errors).
Oct 31 2019
Some math about the data-size increase, using approximate ratios for the TLS-Version, Key-Exchange, Auth and Cipher values from https://grafana.wikimedia.org/d/000000458/tls-ciphersuite-explorer?orgId=1. I assumed we need the full cipher, and took values from Cipher as an approximation.
+1. Let's fix the missing days and start backfilling backward.
Got nerd-snipped here, but I found it: https://issues.apache.org/jira/browse/HIVE-22008
Created T236985 about the Spark/Hive 'like' difference
Oct 30 2019
I think this problem could be related to T226730 (preventing most Special:XXX pages from being flagged as pageviews).
Thanks @Vgutierrez - I think representing those values in a map (or an array) is probably the easiest and most flexible way.
@elukey: Will you teach me varnishkafka so that I can get those values from VCL_Log into a new field?
Oct 29 2019
I have generated avro files for the 2019-09 dumps and ran quite a few queries on them, with a limited amount of RAM needed per Spark executor and without issues in Hive. Parquet read-optimizations were definitely what was causing the issues here.
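For reference, a hedged sketch of reading those avro files back in spark2-shell; the path is a placeholder, and the avro data source needs the spark-avro package on the classpath (e.g. via --packages, matching the cluster's Spark version):
// path below is hypothetical, for illustration only
val dumps = spark.read.format("avro").load("/user/joal/wikitext_avro/snapshot=2019-09")
dumps.printSchema()
println(s"rows: ${dumps.count}")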
However, I hit an interesting finding: