The analytics Hadoop cluster could also be of use here: the task can easily take advantage of parallelization.
Apr 25 2019
Apr 23 2019
In T220507#5129134, @Neil_P._Quinn_WMF wrote: I agree with the overall philosophy of being very explicit and precise in this dataset, but I do still wonder if it's necessary to provide registration timestamps from both user and logging. It's a big deal that those timestamps differ for 37% of users, but how big are those differences? If it's usually just a matter of seconds, then it doesn't seem necessary to pay the complexity cost.
@Nuria: The caused_by_user_text field contains the event-performer user_text, so additional_info is not accurate enough IMO. We could use a complex structure for caused_by given that we have user_id, user_text and event_type, but I'm not sure if it makes things easier.
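To illustrate what such a complex structure might look like (purely hypothetical - the field names are the three mentioned above, not an agreed schema):
import org.apache.spark.sql.types._

// Hypothetical caused_by struct grouping the three existing caused_by fields
val causedByStruct = StructType(Seq(
  StructField("user_id", LongType, nullable = true),
  StructField("user_text", StringType, nullable = true),
  StructField("event_type", StringType, nullable = true)
))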
In T221460#5128400, @fdans wrote: We have to remove mediawiki_history_druid
Community has spoken, we'll find workarounds - Thanks a lot @ArielGlenn for helping drive this :)
Super good idea and good presentation of the difficulty :)
Maybe one day ;)
Regarding the quick-lookups, I suggest using spark in shell mode (whether in python or in scala):
- Extract the subset of data you're after and register it as a temporary table: spark.sql("SELECT * FROM wmf.mediawiki_history WHERE snapshot = '2019-03' AND wiki_db = 'mywiki' AND page_title = 'a title'").createOrReplaceTempView("myview")
- Cache the view for fast access: spark.table("myview").cache()
- Access the data as needed: spark.sql("SELECT count(1) from myview").show()
With the above solution, the first access (reading the data and caching the table) takes some time (a few minutes at most, I'd say); subsequent requests to myview are sub-second. A consolidated sketch of the sequence is below.
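Putting the three steps together, a minimal spark2-shell (Scala) sketch - the wiki_db and page_title values are placeholders:
// 1. Extract the subset and register it as a temp view
spark.sql("""
  SELECT *
  FROM wmf.mediawiki_history
  WHERE snapshot = '2019-03'
    AND wiki_db = 'mywiki'
    AND page_title = 'a title'
""").createOrReplaceTempView("myview")

// 2. Cache the view so later queries hit memory instead of re-reading the table
spark.table("myview").cache()

// 3. Query the cached view as needed - sub-second after the first access
spark.sql("SELECT count(1) FROM myview").show()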
I'm not working on this (yet?) - Seems event related.
Apr 22 2019
Apr 19 2019
The user part of this task is in testing with the datasource located at hdfs:///user/joal/wmf/data/wmf/mediawiki/history/snapshot=2019-03 and hdfs:///user/joal/wmf/data/wmf/mediawiki/user_history/snapshot=2019-03 (along with a bunch of other changes).
Checking improvements in new datasource.
// Current datasource - normally the problem is present in here
spark.read.parquet("/wmf/data/wmf/mediawiki/history/snapshot=2019-03").createOrReplaceTempView("mwh_old")
spark.sql("select event_user_is_bot_by, count(1) as c from mwh_old group by event_user_is_bot_by").show(20, false)
+--------------------+----------+
|event_user_is_bot_by|c         |
+--------------------+----------+
|[name]              |306912801 |
|[]                  |2597239764|
|null                |491818452 |
|[group]             |169490265 |
|[name, group]       |1289385512|
+--------------------+----------+
spark.sql("select caused_by_user_text, count(1) as c from mwph group by caused_by_user_text order by c desc limit 20").show(20, false)
+--------------------------+--------+
|caused_by_user_text       |c       |
+--------------------------+--------+
|null                      |54186484|
|Lsjbot                    |17517304|
|Research Bot              |16512521|
|TuanminhBot               |11023605|
|Meta-Wiki Welcome         |8611228 |
|Sk!dbot                   |7342837 |
|Wikimedia Commons Welcome |7243842 |
|GZWDer (flood)            |6897283 |
|Dcirovicbot               |6656209 |
|Bot-Jagwar                |5984240 |
|Fæ                        |4883049 |
|QuickStatementsBot        |4423838 |
|Maintenance script        |2967554 |
|New user message          |2678820 |
|Wikinews Welcome          |2599072 |
|Welcoming Bot             |2527432 |
|Loveless                  |2404348 |
|Panoramio upload bot      |2312489 |
|MediaWiki message delivery|1889428 |
|Liangent-bot              |1743545 |
+--------------------------+--------+
Data is available in new test-datasource located at /user/joal/wmf/data/wmf/mediawiki/user_history:
Hi @Neil_P._Quinn_WMF, sorry for the big comment above - Do you mind having a look and confirming this looks OK to you? Many thanks :)
Confirmation of problem resolution in new test-datasource located at /user/joal/wmf/data/wmf/mediawiki/user_history:
// Current datasource - normally the problem is present in here
val odf = spark.read.parquet("/wmf/data/wmf/mediawiki/history/snapshot=2019-03")
val oudf = spark.read.parquet("/wmf/data/wmf/mediawiki/user_history/snapshot=2019-03")
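And a sketch of the same check against the test datasource (the paths are the /user/joal test locations mentioned above; the query simply mirrors the bot-by query used on the current datasource):
// Test datasource - the bot-by fields should now be populated as expected
val ndf = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/history/snapshot=2019-03")
val nudf = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/user_history/snapshot=2019-03")
ndf.createOrReplaceTempView("mwh_new")
spark.sql("select event_user_is_bot_by, count(1) as c from mwh_new group by event_user_is_bot_by").show(20, false)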
Apr 17 2019
Hi @Neil_P._Quinn_WMF - Test data is available :)
Here is an example of accessing it in scala-spark2:
val history = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/history/snapshot=2019-03")
history.where("event_entity = 'revision' and wiki_db = 'enwiki' and revision_tags is not null and size(revision_tags) > 0").select("event_timestamp", "revision_id", "revision_tags").show(100, false)
Hi @nettrom_WMF - I have a test dataset for you that includes this data (example in scala-spark2):
val user_history = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/user_history/snapshot=2019-03")
user_history.where("caused_by_event_type = 'alterblocks' and wiki_db = 'itwiki' and start_timestamp like '2019-01%' and source_log_params['sitewide'] = 'false' and source_log_params['7::restrictions'] is not null").select("source_log_params").show(100, false)
Actually it won't finish - we just killed it as we need to restart the cluster (planned maintenance - see https://lists.wikimedia.org/pipermail/engineering/2019-April/000695.html). Sorry for that :( Hopefully you'll still have enough logs.
Hi @bmansurov,
I've been monitoring the current run of the recommender (https://yarn.wikimedia.org/proxy/application_1553764233554_69057/), and I think the approach is not sustainable in terms of cluster usage. In particular, generating the pageview datasets for the top 50 languages should be done in a single pass and written to partitioned folders (using partitionBy). Currently there are 50 runs, each reading the whole needed pageview data multiple times - that represents reading 50 * 2 * 1 TB of data. Doing it in one pass will save a lot of IO as well as a lot of time :) A sketch of the single-pass idea follows.
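Here's what the single-pass approach could look like (paths, the language column name, and the language list are placeholders, not the recommender's actual code):
import org.apache.spark.sql.functions.col

// Read the pageview data once, keep the top-50 languages, and write one
// output partition per language instead of launching 50 separate jobs.
val topLanguages = Seq("en", "de", "fr") // ... up to the top 50

spark.read.parquet("/path/to/pageview/data") // placeholder input path
  .where(col("language").isin(topLanguages: _*))
  .write
  .partitionBy("language")
  .parquet("/path/to/output/pageviews_by_language") // placeholder output path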
@Neil_P._Quinn_WMF : Indeed the validation job failed for expected reasons (a higher-than-usual amount of group-bot removals, a dimension against which our validation is not very stable). We restarted the job manually with a higher threshold: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0171576-181112144035577-oozie-oozi-C
In this job, the email was configured to be sent to multiple addresses, but it seems that failed (I didn't receive it either): https://hue.wikimedia.org/oozie/list_oozie_workflow_action/0171577-181112144035577-oozie-oozi-W%40send_success_email/?coordinator_job_id=0171576-181112144035577-oozie-oozi-C.
Let's double check again next month if the email gets sent.
Apr 16 2019
Apr 15 2019
Here is the definition we agreed on with @Milimetric: removal of the user_is_bot_by_name boolean field and addition of 2 new fields: user_is_bot_by: Array[String] and user_is_bot_by_historical: Array[String]. They can contain 2 different values as of now: group (when the user is in the bot group) and name_regex (if the username contains "bot"). Having an array also allows for possible new methods (machine learning?).
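For illustration, a spark-shell (Scala) sketch of querying the new array field - the view name user_history_view is an assumption, registered over whichever snapshot of the user-history dataset is being tested:
// Sketch: how many users are flagged as bot by group membership vs. by name regex
spark.sql("""
  SELECT
    array_contains(user_is_bot_by, 'group') AS is_bot_by_group,
    array_contains(user_is_bot_by, 'name_regex') AS is_bot_by_name,
    count(1) AS c
  FROM user_history_view
  GROUP BY array_contains(user_is_bot_by, 'group'), array_contains(user_is_bot_by, 'name_regex')
""").show(10, false)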
I did a quick analysis using Spark on user data after @nettrom_WMF's comment:
import com.databricks.spark.avro._

// user table data
val u = spark.read.avro("/wmf/data/raw/mediawiki/tables/user/snapshot=2019-03").
  select("wiki_db", "user_id", "user_registration")

// logging table data
val l = spark.read.avro("/wmf/data/raw/mediawiki/tables/logging/snapshot=2019-03").
  where("log_type = 'newusers' and log_user is not null and log_user > 0").
  select("wiki_db", "log_user", "log_timestamp")

// joined data on wiki_db and user_id
val j = u.join(l, u("wiki_db") === l("wiki_db") && u("user_id") === l("log_user")).cache()
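A possible follow-up on the joined data, to quantify how far apart the two registration timestamps actually are (just a sketch: it assumes both columns use the MediaWiki yyyyMMddHHmmss string format, and the 1-day cut-off is arbitrary):
import org.apache.spark.sql.functions._

// Absolute difference in seconds between the two registration timestamps
val withDiff = j.withColumn(
  "diff_seconds",
  abs(
    unix_timestamp(col("user_registration"), "yyyyMMddHHmmss") -
    unix_timestamp(col("log_timestamp"), "yyyyMMddHHmmss")
  )
)

// How many of the joined users differ by more than a day
withDiff.select(
  count(when(col("diff_seconds") > 86400, true)).as("diff_gt_1_day"),
  count(lit(1)).as("total")
).show()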
Apr 12 2019
Thanks for your comment @nettrom_WMF - I should have explained the plan more thoroughly.
In the next changes for mediawiki-history, we will add fields for pages and users, ending up with pageCreationTimestamp and pageFirstEditTimestamp being coherent by page_id for each page event, and similarly for users.
The precise definition of those values is as follows (a small spark-sql sketch of the user case is given after the list):
- pageCreationTimestamp - Timestamp of the page-create event in logging table if it exists, null otherwise.
- pageFirstEditTimestamp - Timestamp of the oldest revision associated to the page (by page_id), whether in revision or archive table.
- userCreationTimestamp - oldest of user_registration (in the user table) and the user-create event in the logging table (if both exist; otherwise the existing one if only one exists; otherwise null).
- userFirstEditTimestamp - Timestamp of the oldest revision associated to the user (by user_id), whether in revision or archive table.
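To make the user-side definition concrete, here is a rough spark-sql sketch of how userCreationTimestamp could be computed - the view names user_snapshot and logging_newusers are assumptions (a user-table snapshot and the newusers events from the logging table), and LEAST works here because both timestamps are yyyyMMddHHmmss strings:
// Earliest of user_registration and the newusers logging event, handling nulls explicitly
spark.sql("""
  SELECT
    u.wiki_db,
    u.user_id,
    CASE
      WHEN u.user_registration IS NULL THEN l.log_timestamp
      WHEN l.log_timestamp IS NULL THEN u.user_registration
      ELSE LEAST(u.user_registration, l.log_timestamp)
    END AS user_creation_timestamp
  FROM user_snapshot u
  LEFT JOIN logging_newusers l
    ON u.wiki_db = l.wiki_db AND u.user_id = l.log_user
""")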
Apr 11 2019
Ping @Neil_P._Quinn_WMF and @nettrom_WMF - I'll move forward with the suggested implementation at the end of this week so it can be tested next week :)
Apr 9 2019
Apr 8 2019
Some queries are computed using Hadoop for wikidata (see https://github.com/wikimedia/analytics-refinery/tree/master/oozie/wikidata). If SQL over recent-changes works for you, that's great :)
I did a quick analysis of request patterns: 94% of edits-per-page requests made on April 4th were for a timespan of more than 1 year, mostly with daily granularity. I have made the following patch to restrict the timespan of per-page requests to 1 year: https://gerrit.wikimedia.org/r/c/analytics/aqs/+/502198
Apr 5 2019
Apr 4 2019
Reading about this - would delayed data be interesting? This information is accessible in Hadoop :)
In T219910#5083908, @Nuria wrote: @JAllemandou the title of this graph looks like it needs changing?
Apr 3 2019
Shall I send a PR updating rate-limiting in restbase for edits-per-page requests to 10?
https://github.com/wikimedia/restbase/blob/master/v1/metrics.yaml#L1551
I think the problem experienced yesterday could have been prevented by T189623.
Data backup:
- Actual number of queries - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&panelId=42&fullscreen&orgId=1&from=now-1d%2Fd&to=now-1d%2Fd
- Per-segment number of queries - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&panelId=43&fullscreen&orgId=1&from=now-1d%2Fd&to=now-1d%2Fd
We prefer ISO-8601 strings for serialization everywhere.
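As a small illustration of that format (a sketch in Scala using java.time; nothing project-specific assumed):
import java.time.Instant
import java.time.format.DateTimeFormatter

// ISO-8601 UTC string, e.g. "2019-04-25T12:34:56Z"
val iso: String = DateTimeFormatter.ISO_INSTANT.format(Instant.now())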